POC: make mxidoff 64 bits

Started by Maxim Orlov over 1 year ago, 104 messages
#1 Maxim Orlov
orlovmg@gmail.com
1 attachment(s)

Hi!

I've been trying to introduce 64-bit transaction identifiers to
Postgres for quite a while [0]. All this implies, of course, an
enormous amount of change that will have to take place in various
modules. As a consequence, the patch set has become too big to be
committed “at once”.

The obvious solution was to try to split the patch set into smaller ones.
But this brings a new challenge: not every one of these parts makes
Postgres better on its own. Actually, even switching to a
FullTransactionId in PGPROC has the disadvantage of increasing WAL
size [1].

In fact, I believe we're dealing with a chicken-and-egg problem here.
We are not able to commit the full patch set, since it is too big to
handle, and not able to commit parts of it, since they only make sense
all together and do not help improve Postgres on their own.

But it's not that bad. Since commit 4ed8f0913bfdb5f, added in [3], we
are able to use 64 bits for indexing SLRUs.

PROPOSAL
Make multixact offsets 64 bit.

RATIONALE
It is not very difficult to overflow the 32-bit mxidoff, since an offset
is created for every unique combination of transactions for each tuple,
including XIDs and respective flags. And when a transaction is added to
a specific multixact, it is rewritten with a new offset. In other words,
it is possible to overflow the offsets of multixacts without overflowing
the multixacts themselves and/or ordinary transactions. I believe there
was something about this on the hackers mailing list, but I could not
find the links now.
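
To make the arithmetic concrete, here is a minimal sketch (not PostgreSQL code) of how member space is consumed when lockers join the same tuple one at a time. It assumes, as described above, that each additional locker produces a fresh multixact whose whole member list is written at a new offset; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical model, not PostgreSQL code: k transactions lock one tuple one
 * after another.  The second locker creates a multixact with 2 members, the
 * third a new multixact with 3 members, and so on up to k members.
 */

/* Number of multixact IDs consumed: one per locker after the first. */
static uint64_t
multixacts_consumed(uint64_t k)
{
	assert(k >= 1);
	return k - 1;
}

/* Number of member slots (offsets) consumed: 2 + 3 + ... + k. */
static uint64_t
member_slots_consumed(uint64_t k)
{
	assert(k >= 1);
	return k * (k + 1) / 2 - 1;
}
```

With 100 lockers, this model consumes 99 multixact IDs but 5049 member slots, so the offset counter advances much faster than the multixact counter.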

PFA the patch. This is a WIP version. Upgrade machinery should be added later.

As always, any opinions on the subject are very welcome!

[0]: /messages/by-id/CACG=ezZe1NQSCnfHOr78AtAZxJZeCvxrts0ygrxYwe=pyyjVWA@mail.gmail.com
[1]: /messages/by-id/CACG=ezY7msw+jip=rtfvnfz051dRqz4s-diuO46v3rAoAE0T0g@mail.gmail.com
[3]: /messages/by-id/CAJ7c6TPDOYBYrnCAeyndkBktO0WG2xSdYduTF0nxq+vfkmTF5Q@mail.gmail.com

--
Best regards,
Maxim Orlov.

Attachments:

0001-WIP-mxidoff-to-64bit.patch (application/octet-stream)
From 622a47c2fe7b10de398390358cee97151f10fbb9 Mon Sep 17 00:00:00 2001
From: Maxim Orlov <m.orlov@postgrespro.ru>
Date: Wed, 6 Mar 2024 11:11:33 +0300
Subject: [PATCH] WIP : mxidoff to 64bit

---
 src/backend/access/rmgrdesc/mxactdesc.c   |   9 +-
 src/backend/access/rmgrdesc/xlogdesc.c    |   4 +-
 src/backend/access/transam/multixact.c    | 230 ++++------------------
 src/backend/access/transam/xlogrecovery.c |   5 +-
 src/bin/pg_controldata/pg_controldata.c   |   4 +-
 src/bin/pg_resetwal/pg_resetwal.c         |  10 +-
 src/bin/pg_resetwal/t/001_basic.pl        |   2 +-
 src/include/access/multixact.h            |   2 +-
 src/include/c.h                           |   2 +-
 9 files changed, 58 insertions(+), 210 deletions(-)

diff --git a/src/backend/access/rmgrdesc/mxactdesc.c b/src/backend/access/rmgrdesc/mxactdesc.c
index 3e8ad4d5ef..1b486de38c 100644
--- a/src/backend/access/rmgrdesc/mxactdesc.c
+++ b/src/backend/access/rmgrdesc/mxactdesc.c
@@ -65,8 +65,8 @@ multixact_desc(StringInfo buf, XLogReaderState *record)
 		xl_multixact_create *xlrec = (xl_multixact_create *) rec;
 		int			i;
 
-		appendStringInfo(buf, "%u offset %u nmembers %d: ", xlrec->mid,
-						 xlrec->moff, xlrec->nmembers);
+		appendStringInfo(buf, "%u offset %llu nmembers %d: ", xlrec->mid,
+						 (unsigned long long) xlrec->moff, xlrec->nmembers);
 		for (i = 0; i < xlrec->nmembers; i++)
 			out_member(buf, &xlrec->members[i]);
 	}
@@ -74,9 +74,10 @@ multixact_desc(StringInfo buf, XLogReaderState *record)
 	{
 		xl_multixact_truncate *xlrec = (xl_multixact_truncate *) rec;
 
-		appendStringInfo(buf, "offsets [%u, %u), members [%u, %u)",
+		appendStringInfo(buf, "offsets [%u, %u), members [%llu, %llu)",
 						 xlrec->startTruncOff, xlrec->endTruncOff,
-						 xlrec->startTruncMemb, xlrec->endTruncMemb);
+						 (unsigned long long) xlrec->startTruncMemb,
+						 (unsigned long long) xlrec->endTruncMemb);
 	}
 }
 
diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c
index e455400716..1e4abc7bcf 100644
--- a/src/backend/access/rmgrdesc/xlogdesc.c
+++ b/src/backend/access/rmgrdesc/xlogdesc.c
@@ -45,7 +45,7 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 		CheckPoint *checkpoint = (CheckPoint *) rec;
 
 		appendStringInfo(buf, "redo %X/%X; "
-						 "tli %u; prev tli %u; fpw %s; xid %u:%u; oid %u; multi %u; offset %u; "
+						 "tli %u; prev tli %u; fpw %s; xid %u:%u; oid %u; multi %u; offset %llu; "
 						 "oldest xid %u in DB %u; oldest multi %u in DB %u; "
 						 "oldest/newest commit timestamp xid: %u/%u; "
 						 "oldest running xid %u; %s",
@@ -57,7 +57,7 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 						 XidFromFullTransactionId(checkpoint->nextXid),
 						 checkpoint->nextOid,
 						 checkpoint->nextMulti,
-						 checkpoint->nextMultiOffset,
+						 (unsigned long long) checkpoint->nextMultiOffset,
 						 checkpoint->oldestXid,
 						 checkpoint->oldestXidDB,
 						 checkpoint->oldestMulti,
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 380c866d71..a547d6598f 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -95,14 +95,6 @@
 /*
  * Defines for MultiXactOffset page sizes.  A page is the same BLCKSZ as is
  * used everywhere else in Postgres.
- *
- * Note: because MultiXactOffsets are 32 bits and wrap around at 0xFFFFFFFF,
- * MultiXact page numbering also wraps around at
- * 0xFFFFFFFF/MULTIXACT_OFFSETS_PER_PAGE, and segment numbering at
- * 0xFFFFFFFF/MULTIXACT_OFFSETS_PER_PAGE/SLRU_PAGES_PER_SEGMENT.  We need
- * take no explicit notice of that fact in this module, except when comparing
- * segment and page numbers in TruncateMultiXact (see
- * MultiXactOffsetPagePrecedes).
  */
 
 /* We need four bytes per offset */
@@ -156,7 +148,7 @@
 		((uint32) ((0xFFFFFFFF % MULTIXACT_MEMBERS_PER_PAGE) + 1))
 
 /* page in which a member is to be found */
-#define MXOffsetToMemberPage(xid) ((xid) / (TransactionId) MULTIXACT_MEMBERS_PER_PAGE)
+#define MXOffsetToMemberPage(xid) ((xid) / (MultiXactOffset) MULTIXACT_MEMBERS_PER_PAGE)
 #define MXOffsetToMemberSegment(xid) (MXOffsetToMemberPage(xid) / SLRU_PAGES_PER_SEGMENT)
 
 /* Location (byte offset within page) of flag word for a given member */
@@ -230,9 +222,6 @@ typedef struct MultiXactStateData
 	MultiXactId multiStopLimit;
 	MultiXactId multiWrapLimit;
 
-	/* support for members anti-wraparound measures */
-	MultiXactOffset offsetStopLimit;	/* known if oldestOffsetKnown */
-
 	/*
 	 * This is used to sleep until a multixact offset is written when we want
 	 * to create the next one.
@@ -367,8 +356,6 @@ static bool MultiXactOffsetPrecedes(MultiXactOffset offset1,
 									MultiXactOffset offset2);
 static void ExtendMultiXactOffset(MultiXactId multi);
 static void ExtendMultiXactMember(MultiXactOffset offset, int nmembers);
-static bool MultiXactOffsetWouldWrap(MultiXactOffset boundary,
-									 MultiXactOffset start, uint32 distance);
 static bool SetOffsetVacuumLimit(bool is_startup);
 static bool find_multixact_start(MultiXactId multi, MultiXactOffset *result);
 static void WriteMZeroPageXlogRec(int64 pageno, uint8 info);
@@ -1117,78 +1104,6 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 	else
 		*offset = nextOffset;
 
-	/*----------
-	 * Protect against overrun of the members space as well, with the
-	 * following rules:
-	 *
-	 * If we're past offsetStopLimit, refuse to generate more multis.
-	 * If we're close to offsetStopLimit, emit a warning.
-	 *
-	 * Arbitrarily, we start emitting warnings when we're 20 segments or less
-	 * from offsetStopLimit.
-	 *
-	 * Note we haven't updated the shared state yet, so if we fail at this
-	 * point, the multixact ID we grabbed can still be used by the next guy.
-	 *
-	 * Note that there is no point in forcing autovacuum runs here: the
-	 * multixact freeze settings would have to be reduced for that to have any
-	 * effect.
-	 *----------
-	 */
-#define OFFSET_WARN_SEGMENTS	20
-	if (MultiXactState->oldestOffsetKnown &&
-		MultiXactOffsetWouldWrap(MultiXactState->offsetStopLimit, nextOffset,
-								 nmembers))
-	{
-		/* see comment in the corresponding offsets wraparound case */
-		SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER);
-
-		ereport(ERROR,
-				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg("multixact \"members\" limit exceeded"),
-				 errdetail_plural("This command would create a multixact with %u members, but the remaining space is only enough for %u member.",
-								  "This command would create a multixact with %u members, but the remaining space is only enough for %u members.",
-								  MultiXactState->offsetStopLimit - nextOffset - 1,
-								  nmembers,
-								  MultiXactState->offsetStopLimit - nextOffset - 1),
-				 errhint("Execute a database-wide VACUUM in database with OID %u with reduced vacuum_multixact_freeze_min_age and vacuum_multixact_freeze_table_age settings.",
-						 MultiXactState->oldestMultiXactDB)));
-	}
-
-	/*
-	 * Check whether we should kick autovacuum into action, to prevent members
-	 * wraparound. NB we use a much larger window to trigger autovacuum than
-	 * just the warning limit. The warning is just a measure of last resort -
-	 * this is in line with GetNewTransactionId's behaviour.
-	 */
-	if (!MultiXactState->oldestOffsetKnown ||
-		(MultiXactState->nextOffset - MultiXactState->oldestOffset
-		 > MULTIXACT_MEMBER_SAFE_THRESHOLD))
-	{
-		/*
-		 * To avoid swamping the postmaster with signals, we issue the autovac
-		 * request only when crossing a segment boundary. With default
-		 * compilation settings that's roughly after 50k members.  This still
-		 * gives plenty of chances before we get into real trouble.
-		 */
-		if ((MXOffsetToMemberPage(nextOffset) / SLRU_PAGES_PER_SEGMENT) !=
-			(MXOffsetToMemberPage(nextOffset + nmembers) / SLRU_PAGES_PER_SEGMENT))
-			SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER);
-	}
-
-	if (MultiXactState->oldestOffsetKnown &&
-		MultiXactOffsetWouldWrap(MultiXactState->offsetStopLimit,
-								 nextOffset,
-								 nmembers + MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT * OFFSET_WARN_SEGMENTS))
-		ereport(WARNING,
-				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg_plural("database with OID %u must be vacuumed before %d more multixact member is used",
-							   "database with OID %u must be vacuumed before %d more multixact members are used",
-							   MultiXactState->offsetStopLimit - nextOffset + nmembers,
-							   MultiXactState->oldestMultiXactDB,
-							   MultiXactState->offsetStopLimit - nextOffset + nmembers),
-				 errhint("Execute a database-wide VACUUM in that database with reduced vacuum_multixact_freeze_min_age and vacuum_multixact_freeze_table_age settings.")));
-
 	ExtendMultiXactMember(nextOffset, nmembers);
 
 	/*
@@ -1217,7 +1132,8 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 
 	LWLockRelease(MultiXactGenLock);
 
-	debug_elog4(DEBUG2, "GetNew: returning %u offset %u", result, *offset);
+	debug_elog4(DEBUG2, "GetNew: returning %u offset %llu", result,
+				(unsigned long long) *offset);
 	return result;
 }
 
@@ -1926,7 +1842,7 @@ MultiXactShmemInit(void)
 				  "pg_multixact/offsets", LWTRANCHE_MULTIXACTOFFSET_BUFFER,
 				  LWTRANCHE_MULTIXACTOFFSET_SLRU,
 				  SYNC_HANDLER_MULTIXACT_OFFSET,
-				  false);
+				  true);
 	SlruPagePrecedesUnitTests(MultiXactOffsetCtl, MULTIXACT_OFFSETS_PER_PAGE);
 	SimpleLruInit(MultiXactMemberCtl,
 				  "multixact_member", multixact_member_buffers, 0,
@@ -2244,8 +2160,9 @@ MultiXactGetCheckptMulti(bool is_shutdown,
 	LWLockRelease(MultiXactGenLock);
 
 	debug_elog6(DEBUG2,
-				"MultiXact: checkpoint is nextMulti %u, nextOffset %u, oldestMulti %u in DB %u",
-				*nextMulti, *nextMultiOffset, *oldestMulti, *oldestMultiDB);
+				"MultiXact: checkpoint is nextMulti %u, nextOffset %llu, oldestMulti %u in DB %u",
+				*nextMulti, (unsigned long long) *nextMultiOffset, *oldestMulti,
+				*oldestMultiDB);
 }
 
 /*
@@ -2279,8 +2196,8 @@ void
 MultiXactSetNextMXact(MultiXactId nextMulti,
 					  MultiXactOffset nextMultiOffset)
 {
-	debug_elog4(DEBUG2, "MultiXact: setting next multi to %u offset %u",
-				nextMulti, nextMultiOffset);
+	debug_elog4(DEBUG2, "MultiXact: setting next multi to %u offset %llu",
+				nextMulti, (unsigned long long) nextMultiOffset);
 	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
 	MultiXactState->nextMXact = nextMulti;
 	MultiXactState->nextOffset = nextMultiOffset;
@@ -2470,8 +2387,8 @@ MultiXactAdvanceNextMXact(MultiXactId minMulti,
 	}
 	if (MultiXactOffsetPrecedes(MultiXactState->nextOffset, minMultiOffset))
 	{
-		debug_elog3(DEBUG2, "MultiXact: setting next offset to %u",
-					minMultiOffset);
+		debug_elog3(DEBUG2, "MultiXact: setting next offset to %llu",
+					(unsigned long long) minMultiOffset);
 		MultiXactState->nextOffset = minMultiOffset;
 	}
 	LWLockRelease(MultiXactGenLock);
@@ -2670,8 +2587,6 @@ SetOffsetVacuumLimit(bool is_startup)
 	MultiXactOffset nextOffset;
 	bool		oldestOffsetKnown = false;
 	bool		prevOldestOffsetKnown;
-	MultiXactOffset offsetStopLimit = 0;
-	MultiXactOffset prevOffsetStopLimit;
 
 	/*
 	 * NB: Have to prevent concurrent truncation, we might otherwise try to
@@ -2686,7 +2601,6 @@ SetOffsetVacuumLimit(bool is_startup)
 	nextOffset = MultiXactState->nextOffset;
 	prevOldestOffsetKnown = MultiXactState->oldestOffsetKnown;
 	prevOldestOffset = MultiXactState->oldestOffset;
-	prevOffsetStopLimit = MultiXactState->offsetStopLimit;
 	Assert(MultiXactState->finishedStartup);
 	LWLockRelease(MultiXactGenLock);
 
@@ -2717,11 +2631,7 @@ SetOffsetVacuumLimit(bool is_startup)
 		oldestOffsetKnown =
 			find_multixact_start(oldestMultiXactId, &oldestOffset);
 
-		if (oldestOffsetKnown)
-			ereport(DEBUG1,
-					(errmsg_internal("oldest MultiXactId member is at offset %u",
-									 oldestOffset)));
-		else
+		if (!oldestOffsetKnown)
 			ereport(LOG,
 					(errmsg("MultiXact member wraparound protections are disabled because oldest checkpointed MultiXact %u does not exist on disk",
 							oldestMultiXactId)));
@@ -2734,24 +2644,7 @@ SetOffsetVacuumLimit(bool is_startup)
 	 * overrun of old data in the members SLRU area. We can only do so if the
 	 * oldest offset is known though.
 	 */
-	if (oldestOffsetKnown)
-	{
-		/* move back to start of the corresponding segment */
-		offsetStopLimit = oldestOffset - (oldestOffset %
-										  (MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT));
-
-		/* always leave one segment before the wraparound point */
-		offsetStopLimit -= (MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT);
-
-		if (!prevOldestOffsetKnown && !is_startup)
-			ereport(LOG,
-					(errmsg("MultiXact member wraparound protections are now enabled")));
-
-		ereport(DEBUG1,
-				(errmsg_internal("MultiXact member stop limit is now %u based on MultiXact %u",
-								 offsetStopLimit, oldestMultiXactId)));
-	}
-	else if (prevOldestOffsetKnown)
+	if (prevOldestOffsetKnown)
 	{
 		/*
 		 * If we failed to get the oldest offset this time, but we have a
@@ -2761,14 +2654,12 @@ SetOffsetVacuumLimit(bool is_startup)
 		 */
 		oldestOffset = prevOldestOffset;
 		oldestOffsetKnown = true;
-		offsetStopLimit = prevOffsetStopLimit;
 	}
 
 	/* Install the computed values */
 	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
 	MultiXactState->oldestOffset = oldestOffset;
 	MultiXactState->oldestOffsetKnown = oldestOffsetKnown;
-	MultiXactState->offsetStopLimit = offsetStopLimit;
 	LWLockRelease(MultiXactGenLock);
 
 	/*
@@ -2778,54 +2669,6 @@ SetOffsetVacuumLimit(bool is_startup)
 		(nextOffset - oldestOffset > MULTIXACT_MEMBER_SAFE_THRESHOLD);
 }
 
-/*
- * Return whether adding "distance" to "start" would move past "boundary".
- *
- * We use this to determine whether the addition is "wrapping around" the
- * boundary point, hence the name.  The reason we don't want to use the regular
- * 2^31-modulo arithmetic here is that we want to be able to use the whole of
- * the 2^32-1 space here, allowing for more multixacts than would fit
- * otherwise.
- */
-static bool
-MultiXactOffsetWouldWrap(MultiXactOffset boundary, MultiXactOffset start,
-						 uint32 distance)
-{
-	MultiXactOffset finish;
-
-	/*
-	 * Note that offset number 0 is not used (see GetMultiXactIdMembers), so
-	 * if the addition wraps around the UINT_MAX boundary, skip that value.
-	 */
-	finish = start + distance;
-	if (finish < start)
-		finish++;
-
-	/*-----------------------------------------------------------------------
-	 * When the boundary is numerically greater than the starting point, any
-	 * value numerically between the two is not wrapped:
-	 *
-	 *	<----S----B---->
-	 *	[---)			 = F wrapped past B (and UINT_MAX)
-	 *		 [---)		 = F not wrapped
-	 *			  [----] = F wrapped past B
-	 *
-	 * When the boundary is numerically less than the starting point (i.e. the
-	 * UINT_MAX wraparound occurs somewhere in between) then all values in
-	 * between are wrapped:
-	 *
-	 *	<----B----S---->
-	 *	[---)			 = F not wrapped past B (but wrapped past UINT_MAX)
-	 *		 [---)		 = F wrapped past B (and UINT_MAX)
-	 *			  [----] = F not wrapped
-	 *-----------------------------------------------------------------------
-	 */
-	if (start < boundary)
-		return finish >= boundary || finish < start;
-	else
-		return finish >= boundary && finish < start;
-}
-
 /*
  * Find the starting offset of the given MultiXactId.
  *
@@ -2946,8 +2789,9 @@ MultiXactMemberFreezeThreshold(void)
 	 * we try to eliminate from the system is based on how far we are past
 	 * MULTIXACT_MEMBER_SAFE_THRESHOLD.
 	 */
-	fraction = (double) (members - MULTIXACT_MEMBER_SAFE_THRESHOLD) /
-		(MULTIXACT_MEMBER_DANGER_THRESHOLD - MULTIXACT_MEMBER_SAFE_THRESHOLD);
+	fraction = (double) (members - MULTIXACT_MEMBER_SAFE_THRESHOLD);
+	fraction /= (double) (MULTIXACT_MEMBER_DANGER_THRESHOLD - MULTIXACT_MEMBER_SAFE_THRESHOLD);
+
 	victim_multixacts = multixacts * fraction;
 
 	/* fraction could be > 1.0, but lowest possible freeze age is zero */
@@ -2991,10 +2835,10 @@ SlruScanDirCbFindEarliest(SlruCtl ctl, char *filename, int64 segpage, void *data
 static void
 PerformMembersTruncation(MultiXactOffset oldestOffset, MultiXactOffset newOldestOffset)
 {
-	const int	maxsegment = MXOffsetToMemberSegment(MaxMultiXactOffset);
-	int			startsegment = MXOffsetToMemberSegment(oldestOffset);
-	int			endsegment = MXOffsetToMemberSegment(newOldestOffset);
-	int			segment = startsegment;
+	const int64	maxsegment = MXOffsetToMemberSegment(MaxMultiXactOffset);
+	int64		startsegment = MXOffsetToMemberSegment(oldestOffset);
+	int64		endsegment = MXOffsetToMemberSegment(newOldestOffset);
+	int64		segment = startsegment;
 
 	/*
 	 * Delete all the segments but the last one. The last segment can still
@@ -3002,7 +2846,8 @@ PerformMembersTruncation(MultiXactOffset oldestOffset, MultiXactOffset newOldest
 	 */
 	while (segment != endsegment)
 	{
-		elog(DEBUG2, "truncating multixact members segment %x", segment);
+		elog(DEBUG2, "truncating multixact members segment %llx",
+			 (unsigned long long) segment);
 		SlruDeleteSegment(MultiXactMemberCtl, segment);
 
 		/* move to next segment, handling wraparound correctly */
@@ -3153,14 +2998,14 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 	}
 
 	elog(DEBUG1, "performing multixact truncation: "
-		 "offsets [%u, %u), offsets segments [%x, %x), "
-		 "members [%u, %u), members segments [%x, %x)",
+		 "offsets [%u, %u), offsets segments [%llx, %llx), "
+		 "members [%llu, %llu), members segments [%llx, %llx)",
 		 oldestMulti, newOldestMulti,
-		 MultiXactIdToOffsetSegment(oldestMulti),
-		 MultiXactIdToOffsetSegment(newOldestMulti),
-		 oldestOffset, newOldestOffset,
-		 MXOffsetToMemberSegment(oldestOffset),
-		 MXOffsetToMemberSegment(newOldestOffset));
+		 (unsigned long long) MultiXactIdToOffsetSegment(oldestMulti),
+		 (unsigned long long) MultiXactIdToOffsetSegment(newOldestMulti),
+		 (unsigned long long) oldestOffset, (unsigned long long) newOldestOffset,
+		 (unsigned long long) MXOffsetToMemberSegment(oldestOffset),
+		 (unsigned long long) MXOffsetToMemberSegment(newOldestOffset));
 
 	/*
 	 * Do truncation, and the WAL logging of the truncation, in a critical
@@ -3285,7 +3130,7 @@ MultiXactIdPrecedesOrEquals(MultiXactId multi1, MultiXactId multi2)
 static bool
 MultiXactOffsetPrecedes(MultiXactOffset offset1, MultiXactOffset offset2)
 {
-	int32		diff = (int32) (offset1 - offset2);
+	int64		diff = (int64) (offset1 - offset2);
 
 	return (diff < 0);
 }
@@ -3413,14 +3258,15 @@ multixact_redo(XLogReaderState *record)
 			   SizeOfMultiXactTruncate);
 
 		elog(DEBUG1, "replaying multixact truncation: "
-			 "offsets [%u, %u), offsets segments [%x, %x), "
-			 "members [%u, %u), members segments [%x, %x)",
+			 "offsets [%u, %u), offsets segments [%llx, %llx), "
+			 "members [%llu, %llu), members segments [%llx, %llx)",
 			 xlrec.startTruncOff, xlrec.endTruncOff,
-			 MultiXactIdToOffsetSegment(xlrec.startTruncOff),
-			 MultiXactIdToOffsetSegment(xlrec.endTruncOff),
-			 xlrec.startTruncMemb, xlrec.endTruncMemb,
-			 MXOffsetToMemberSegment(xlrec.startTruncMemb),
-			 MXOffsetToMemberSegment(xlrec.endTruncMemb));
+			 (unsigned long long) MultiXactIdToOffsetSegment(xlrec.startTruncOff),
+			 (unsigned long long) MultiXactIdToOffsetSegment(xlrec.endTruncOff),
+			 (unsigned long long) xlrec.startTruncMemb,
+			 (unsigned long long) xlrec.endTruncMemb,
+			 (unsigned long long) MXOffsetToMemberSegment(xlrec.startTruncMemb),
+			 (unsigned long long) MXOffsetToMemberSegment(xlrec.endTruncMemb));
 
 		/* should not be required, but more than cheap enough */
 		LWLockAcquire(MultiXactTruncationLock, LW_EXCLUSIVE);
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 29c5bec084..aa9566b01c 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -876,8 +876,9 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 							 U64FromFullTransactionId(checkPoint.nextXid),
 							 checkPoint.nextOid)));
 	ereport(DEBUG1,
-			(errmsg_internal("next MultiXactId: %u; next MultiXactOffset: %u",
-							 checkPoint.nextMulti, checkPoint.nextMultiOffset)));
+			(errmsg_internal("next MultiXactId: %u; next MultiXactOffset: %llu",
+							 checkPoint.nextMulti,
+							 (unsigned long long) checkPoint.nextMultiOffset)));
 	ereport(DEBUG1,
 			(errmsg_internal("oldest unfrozen transaction ID: %u, in database %u",
 							 checkPoint.oldestXid, checkPoint.oldestXidDB)));
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index 93e0837947..d38c752860 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -253,8 +253,8 @@ main(int argc, char *argv[])
 		   ControlFile->checkPointCopy.nextOid);
 	printf(_("Latest checkpoint's NextMultiXactId:  %u\n"),
 		   ControlFile->checkPointCopy.nextMulti);
-	printf(_("Latest checkpoint's NextMultiOffset:  %u\n"),
-		   ControlFile->checkPointCopy.nextMultiOffset);
+	printf(_("Latest checkpoint's NextMultiOffset:  %llu\n"),
+		   (unsigned long long) ControlFile->checkPointCopy.nextMultiOffset);
 	printf(_("Latest checkpoint's oldestXID:        %u\n"),
 		   ControlFile->checkPointCopy.oldestXid);
 	printf(_("Latest checkpoint's oldestXID's DB:   %u\n"),
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index e9dcb5a6d8..1af2ce4b93 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -264,7 +264,7 @@ main(int argc, char *argv[])
 
 			case 'O':
 				errno = 0;
-				set_mxoff = strtoul(optarg, &endptr, 0);
+				set_mxoff = strtou64(optarg, &endptr, 0);
 				if (endptr == optarg || *endptr != '\0' || errno != 0)
 				{
 					pg_log_error("invalid argument for option %s", "-O");
@@ -737,8 +737,8 @@ PrintControlValues(bool guessed)
 		   ControlFile.checkPointCopy.nextOid);
 	printf(_("Latest checkpoint's NextMultiXactId:  %u\n"),
 		   ControlFile.checkPointCopy.nextMulti);
-	printf(_("Latest checkpoint's NextMultiOffset:  %u\n"),
-		   ControlFile.checkPointCopy.nextMultiOffset);
+	printf(_("Latest checkpoint's NextMultiOffset:  %llu\n"),
+		   (unsigned long long) ControlFile.checkPointCopy.nextMultiOffset);
 	printf(_("Latest checkpoint's oldestXID:        %u\n"),
 		   ControlFile.checkPointCopy.oldestXid);
 	printf(_("Latest checkpoint's oldestXID's DB:   %u\n"),
@@ -809,8 +809,8 @@ PrintNewControlValues(void)
 
 	if (set_mxoff != -1)
 	{
-		printf(_("NextMultiOffset:                      %u\n"),
-			   ControlFile.checkPointCopy.nextMultiOffset);
+		printf(_("NextMultiOffset:                      %llu\n"),
+			   (unsigned long long) ControlFile.checkPointCopy.nextMultiOffset);
 	}
 
 	if (set_oid != 0)
diff --git a/src/bin/pg_resetwal/t/001_basic.pl b/src/bin/pg_resetwal/t/001_basic.pl
index 9829e48106..f8a8eef44d 100644
--- a/src/bin/pg_resetwal/t/001_basic.pl
+++ b/src/bin/pg_resetwal/t/001_basic.pl
@@ -206,7 +206,7 @@ push @cmd,
   sprintf("%d,%d", hex($files[0]) == 0 ? 3 : hex($files[0]), hex($files[-1]));
 
 @files = get_slru_files('pg_multixact/offsets');
-$mult = 32 * $blcksz / 4;
+$mult = 32 * $blcksz / 8;
 # -m argument is "new,old"
 push @cmd, '-m',
   sprintf("%d,%d",
diff --git a/src/include/access/multixact.h b/src/include/access/multixact.h
index 7ffd256c74..90583634ec 100644
--- a/src/include/access/multixact.h
+++ b/src/include/access/multixact.h
@@ -27,7 +27,7 @@
 
 #define MultiXactIdIsValid(multi) ((multi) != InvalidMultiXactId)
 
-#define MaxMultiXactOffset	((MultiXactOffset) 0xFFFFFFFF)
+#define MaxMultiXactOffset	UINT64CONST(0xFFFFFFFFFFFFFFFF)
 
 /*
  * Possible multixact lock modes ("status").  The first four modes are for
diff --git a/src/include/c.h b/src/include/c.h
index dc1841346c..ccfb82b478 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -661,7 +661,7 @@ typedef uint32 SubTransactionId;
 /* MultiXactId must be equivalent to TransactionId, to fit in t_xmax */
 typedef TransactionId MultiXactId;
 
-typedef uint32 MultiXactOffset;
+typedef uint64 MultiXactOffset;
 
 typedef uint32 CommandId;
 
-- 
2.34.1
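
The change to MultiXactOffsetPrecedes in the patch above keeps the usual modular comparison but widens it to 64 bits. A standalone sketch of that idiom, using stdint types instead of the PostgreSQL typedefs (the function name is hypothetical):

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Sketch of a wraparound-aware "precedes" test on 64-bit counters.
 * The unsigned subtraction wraps modulo 2^64; reinterpreting the result as
 * signed makes it negative exactly when a is behind b by no more than
 * 2^63 steps.  Assumes two's-complement conversion, which holds on the
 * platforms PostgreSQL supports.
 */
static bool
offset_precedes(uint64_t a, uint64_t b)
{
	int64_t		diff = (int64_t) (a - b);

	return diff < 0;
}
```

Under this rule the largest 64-bit value still "precedes" zero, which is the same wraparound semantics the 32-bit version had, only with a vastly larger window.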

#2 Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Maxim Orlov (#1)
Re: POC: make mxidoff 64 bits

On 23/04/2024 11:23, Maxim Orlov wrote:

PROPOSAL
Make multixact offsets 64 bit.

+1, this is a good next step and useful regardless of 64-bit XIDs.

@@ -156,7 +148,7 @@
((uint32) ((0xFFFFFFFF % MULTIXACT_MEMBERS_PER_PAGE) + 1))

/* page in which a member is to be found */
-#define MXOffsetToMemberPage(xid) ((xid) / (TransactionId) MULTIXACT_MEMBERS_PER_PAGE)
+#define MXOffsetToMemberPage(xid) ((xid) / (MultiXactOffset) MULTIXACT_MEMBERS_PER_PAGE)
#define MXOffsetToMemberSegment(xid) (MXOffsetToMemberPage(xid) / SLRU_PAGES_PER_SEGMENT)

/* Location (byte offset within page) of flag word for a given member */

This is really a bug fix. It didn't matter when TransactionId and
MultiXactOffset were both typedefs of uint32, but it was always wrong.
The argument name 'xid' is also misleading.

I think there are some more like that, MXOffsetToFlagsBitShift for example.
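
For illustration, the page and segment mapping from the quoted macros can be written as functions over an explicit 64-bit offset. This is a sketch, not the PostgreSQL macros: the constants assume a default build (8192-byte pages, 1636 members per page, 32 pages per SLRU segment) and the names are hypothetical.

```c
#include <stdint.h>

/* Assumed default-build values; the real ones are derived from BLCKSZ. */
#define MEMBERS_PER_PAGE	UINT64_C(1636)
#define PAGES_PER_SEGMENT	UINT64_C(32)

/* Page of the members SLRU on which a given member offset lives. */
static uint64_t
member_page(uint64_t offset)
{
	return offset / MEMBERS_PER_PAGE;
}

/* Segment file that contains that page. */
static uint64_t
member_segment(uint64_t offset)
{
	return member_page(offset) / PAGES_PER_SEGMENT;
}
```

Under these assumptions, an offset of 2^32, which a 32-bit MultiXactOffset cannot represent, maps to page 2625285 in segment 82040.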

--
Heikki Linnakangas
Neon (https://neon.tech)

#3 Andrey M. Borodin
x4mmm@yandex-team.ru
In reply to: Maxim Orlov (#1)
Re: POC: make mxidoff 64 bits

On 23 Apr 2024, at 11:23, Maxim Orlov <orlovmg@gmail.com> wrote:

Make multixact offsets 64 bit.

- ereport(ERROR,
- (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
- errmsg("multixact \"members\" limit exceeded"),

Personally, I'd be happy with this! We had some incidents where the only mitigation was tweaking vacuum settings.

BTW as a side note... I see lots of casts to (unsigned long long), can't we just cast to MultiXactOffset?

Best regards, Andrey Borodin.

#4 wenhui qiu
qiuwenhuifx@gmail.com
In reply to: Heikki Linnakangas (#2)
Re: POC: make mxidoff 64 bits

Hi Maxim Orlov,
Thank you so much for your tireless work on this. Increasing the WAL
size by a few bytes should have very little impact with today's disk
performance. (Logical replication also increased the WAL volume a lot;
it was a milestone new feature, and the community has kept improving
it.) I believe removing the troublesome PostgreSQL transaction ID
wraparound would also be a milestone new feature, so adding a few bytes
is worth it!

Best regards

#5 Maxim Orlov
orlovmg@gmail.com
In reply to: wenhui qiu (#4)
Re: POC: make mxidoff 64 bits

On Tue, 23 Apr 2024 at 12:37, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

This is really a bug fix. It didn't matter when TransactionId and
MultiXactOffset were both typedefs of uint32, but it was always wrong.
The argument name 'xid' is also misleading.

I think there are some more like that, MXOffsetToFlagsBitShift for example.

Yeah, I always thought so too. I believe this is just a copy-paste. Do
you mean it is worth creating a separate CF entry for these fixes?

On Tue, 23 Apr 2024 at 16:03, Andrey M. Borodin <x4mmm@yandex-team.ru>
wrote:

BTW as a side note... I see lots of casts to (unsigned long long), can't
we just cast to MultiXactOffset?

Actually, the first versions of the 64xid patch set had such casts to
types TransactionId, MultiXact and so on. But after some discussion, we
switched to the unsigned long long cast. Unfortunately, I could not find
an exact link to that discussion. On the other hand, such casting is
already used throughout the code. So, just for the sake of consistency,
I would like to stay with these casts.
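
A small sketch of the convention being discussed: a 64-bit value is cast to unsigned long long and printed with %llu, so the format string does not depend on how the typedef is defined on a given platform. The typedef and helper names here are stand-ins, not PostgreSQL code.

```c
#include <stdint.h>
#include <stdio.h>

/* Stand-in for the widened typedef proposed in the patch. */
typedef uint64_t MultiXactOffset;

/*
 * Format an offset into buf.  The explicit cast matches the %llu conversion
 * regardless of whether uint64_t is "unsigned long" or "unsigned long long".
 * Returns the value of snprintf().
 */
static int
format_offset(char *buf, size_t buflen, MultiXactOffset off)
{
	return snprintf(buf, buflen, "offset %llu", (unsigned long long) off);
}
```

Casting to the typedef itself would not help here, because printf still needs a conversion specifier that matches the underlying integer type.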

On Tue, 23 Apr 2024 at 16:03, wenhui qiu <qiuwenhuifx@gmail.com> wrote:

Hi Maxim Orlov,
Thank you so much for your tireless work on this. Increasing the WAL
size by a few bytes should have very little impact with today's disk
performance. (Logical replication also increased the WAL volume a lot;
it was a milestone new feature, and the community has kept improving
it.) I believe removing the troublesome PostgreSQL transaction ID
wraparound would also be a milestone new feature, so adding a few bytes
is worth it!

I 100% agree. Maybe I should return to this approach and find some
benefits of having FXIDs in WAL.

#6 Maxim Orlov
orlovmg@gmail.com
In reply to: Maxim Orlov (#5)
3 attachment(s)
Re: POC: make mxidoff 64 bits

Hi!

Sorry for the delay. I was a bit busy last month. Anyway, here is my
proposal for making multixact offsets 64-bit.
The patch set consists of three parts:
0001 - make user-visible output of offsets 64-bit ready;
0002 - make offsets 64-bit;
0003 - provide 32-bit to 64-bit conversion in pg_upgrade.

I'm pretty sure this is just the beginning of the conversation, so any
opinions and reviews are, as always, very welcome!
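One visible consequence of part 0002 is that each entry in pg_multixact/offsets grows from 4 to 8 bytes, so a page and a segment hold half as many offsets (the pg_resetwal test changes its multiplier from 32 * blcksz / 4 to 32 * blcksz / 8 for the same reason). A minimal sketch of that arithmetic, assuming the default 8192-byte block size in the usage below; the helper names are made up for illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SLRU_PAGES_PER_SEGMENT 32	/* same value as in slru.h */

/* Hypothetical helper: number of offset entries that fit on one page. */
uint64_t
offsets_per_page(uint64_t blcksz, size_t entry_size)
{
	assert(entry_size > 0 && blcksz % entry_size == 0);
	return blcksz / entry_size;
}

/* Hypothetical helper: number of offset entries in one SLRU segment. */
uint64_t
offsets_per_segment(uint64_t blcksz, size_t entry_size)
{
	return SLRU_PAGES_PER_SEGMENT * offsets_per_page(blcksz, entry_size);
}
```

With 8192-byte pages this gives 2048 entries per page (65536 per segment) for 32-bit offsets and 1024 per page (32768 per segment) for 64-bit ones, which is why the offsets SLRU files have to be rewritten rather than copied during upgrade.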

--
Best regards,
Maxim Orlov.

Attachments:

v1-0002-Use-64-bit-multixact-offsets.patch (application/x-patch)
From 2e1f05b3b0504153e57188e968bb19cb6741c087 Mon Sep 17 00:00:00 2001
From: Maxim Orlov <m.orlov@postgrespro.ru>
Date: Wed, 6 Mar 2024 11:11:33 +0300
Subject: [PATCH v1 2/3] Use 64-bit multixact offsets.

Author: Maxim Orlov <orlovmg@gmail.com>
---
 src/backend/access/transam/multixact.c | 182 ++-----------------------
 src/bin/pg_resetwal/pg_resetwal.c      |   2 +-
 src/bin/pg_resetwal/t/001_basic.pl     |   2 +-
 src/include/access/multixact.h         |   2 +-
 src/include/c.h                        |   2 +-
 5 files changed, 16 insertions(+), 174 deletions(-)

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 57c5148933..f2a2aa9547 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -95,14 +95,6 @@
 /*
  * Defines for MultiXactOffset page sizes.  A page is the same BLCKSZ as is
  * used everywhere else in Postgres.
- *
- * Note: because MultiXactOffsets are 32 bits and wrap around at 0xFFFFFFFF,
- * MultiXact page numbering also wraps around at
- * 0xFFFFFFFF/MULTIXACT_OFFSETS_PER_PAGE, and segment numbering at
- * 0xFFFFFFFF/MULTIXACT_OFFSETS_PER_PAGE/SLRU_PAGES_PER_SEGMENT.  We need
- * take no explicit notice of that fact in this module, except when comparing
- * segment and page numbers in TruncateMultiXact (see
- * MultiXactOffsetPagePrecedes).
  */
 
 /* We need four bytes per offset */
@@ -174,7 +166,7 @@ MXOffsetToMemberPage(MultiXactOffset offset)
 	return offset / MULTIXACT_MEMBERS_PER_PAGE;
 }
 
-static inline int
+static inline int64
 MXOffsetToMemberSegment(MultiXactOffset offset)
 {
 	return MXOffsetToMemberPage(offset) / SLRU_PAGES_PER_SEGMENT;
@@ -271,9 +263,6 @@ typedef struct MultiXactStateData
 	MultiXactId multiStopLimit;
 	MultiXactId multiWrapLimit;
 
-	/* support for members anti-wraparound measures */
-	MultiXactOffset offsetStopLimit;	/* known if oldestOffsetKnown */
-
 	/*
 	 * This is used to sleep until a multixact offset is written when we want
 	 * to create the next one.
@@ -408,8 +397,6 @@ static bool MultiXactOffsetPrecedes(MultiXactOffset offset1,
 									MultiXactOffset offset2);
 static void ExtendMultiXactOffset(MultiXactId multi);
 static void ExtendMultiXactMember(MultiXactOffset offset, int nmembers);
-static bool MultiXactOffsetWouldWrap(MultiXactOffset boundary,
-									 MultiXactOffset start, uint32 distance);
 static bool SetOffsetVacuumLimit(bool is_startup);
 static bool find_multixact_start(MultiXactId multi, MultiXactOffset *result);
 static void WriteMZeroPageXlogRec(int64 pageno, uint8 info);
@@ -1158,78 +1145,6 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 	else
 		*offset = nextOffset;
 
-	/*----------
-	 * Protect against overrun of the members space as well, with the
-	 * following rules:
-	 *
-	 * If we're past offsetStopLimit, refuse to generate more multis.
-	 * If we're close to offsetStopLimit, emit a warning.
-	 *
-	 * Arbitrarily, we start emitting warnings when we're 20 segments or less
-	 * from offsetStopLimit.
-	 *
-	 * Note we haven't updated the shared state yet, so if we fail at this
-	 * point, the multixact ID we grabbed can still be used by the next guy.
-	 *
-	 * Note that there is no point in forcing autovacuum runs here: the
-	 * multixact freeze settings would have to be reduced for that to have any
-	 * effect.
-	 *----------
-	 */
-#define OFFSET_WARN_SEGMENTS	20
-	if (MultiXactState->oldestOffsetKnown &&
-		MultiXactOffsetWouldWrap(MultiXactState->offsetStopLimit, nextOffset,
-								 nmembers))
-	{
-		/* see comment in the corresponding offsets wraparound case */
-		SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER);
-
-		ereport(ERROR,
-				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg("multixact \"members\" limit exceeded"),
-				 errdetail_plural("This command would create a multixact with %u members, but the remaining space is only enough for %u member.",
-								  "This command would create a multixact with %u members, but the remaining space is only enough for %u members.",
-								  MultiXactState->offsetStopLimit - nextOffset - 1,
-								  nmembers,
-								  MultiXactState->offsetStopLimit - nextOffset - 1),
-				 errhint("Execute a database-wide VACUUM in database with OID %u with reduced \"vacuum_multixact_freeze_min_age\" and \"vacuum_multixact_freeze_table_age\" settings.",
-						 MultiXactState->oldestMultiXactDB)));
-	}
-
-	/*
-	 * Check whether we should kick autovacuum into action, to prevent members
-	 * wraparound. NB we use a much larger window to trigger autovacuum than
-	 * just the warning limit. The warning is just a measure of last resort -
-	 * this is in line with GetNewTransactionId's behaviour.
-	 */
-	if (!MultiXactState->oldestOffsetKnown ||
-		(MultiXactState->nextOffset - MultiXactState->oldestOffset
-		 > MULTIXACT_MEMBER_SAFE_THRESHOLD))
-	{
-		/*
-		 * To avoid swamping the postmaster with signals, we issue the autovac
-		 * request only when crossing a segment boundary. With default
-		 * compilation settings that's roughly after 50k members.  This still
-		 * gives plenty of chances before we get into real trouble.
-		 */
-		if ((MXOffsetToMemberPage(nextOffset) / SLRU_PAGES_PER_SEGMENT) !=
-			(MXOffsetToMemberPage(nextOffset + nmembers) / SLRU_PAGES_PER_SEGMENT))
-			SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER);
-	}
-
-	if (MultiXactState->oldestOffsetKnown &&
-		MultiXactOffsetWouldWrap(MultiXactState->offsetStopLimit,
-								 nextOffset,
-								 nmembers + MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT * OFFSET_WARN_SEGMENTS))
-		ereport(WARNING,
-				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg_plural("database with OID %u must be vacuumed before %d more multixact member is used",
-							   "database with OID %u must be vacuumed before %d more multixact members are used",
-							   MultiXactState->offsetStopLimit - nextOffset + nmembers,
-							   MultiXactState->oldestMultiXactDB,
-							   MultiXactState->offsetStopLimit - nextOffset + nmembers),
-				 errhint("Execute a database-wide VACUUM in that database with reduced \"vacuum_multixact_freeze_min_age\" and \"vacuum_multixact_freeze_table_age\" settings.")));
-
 	ExtendMultiXactMember(nextOffset, nmembers);
 
 	/*
@@ -1968,7 +1883,7 @@ MultiXactShmemInit(void)
 				  "pg_multixact/offsets", LWTRANCHE_MULTIXACTOFFSET_BUFFER,
 				  LWTRANCHE_MULTIXACTOFFSET_SLRU,
 				  SYNC_HANDLER_MULTIXACT_OFFSET,
-				  false);
+				  true);
 	SlruPagePrecedesUnitTests(MultiXactOffsetCtl, MULTIXACT_OFFSETS_PER_PAGE);
 	SimpleLruInit(MultiXactMemberCtl,
 				  "multixact_member", multixact_member_buffers, 0,
@@ -2713,8 +2628,6 @@ SetOffsetVacuumLimit(bool is_startup)
 	MultiXactOffset nextOffset;
 	bool		oldestOffsetKnown = false;
 	bool		prevOldestOffsetKnown;
-	MultiXactOffset offsetStopLimit = 0;
-	MultiXactOffset prevOffsetStopLimit;
 
 	/*
 	 * NB: Have to prevent concurrent truncation, we might otherwise try to
@@ -2729,7 +2642,6 @@ SetOffsetVacuumLimit(bool is_startup)
 	nextOffset = MultiXactState->nextOffset;
 	prevOldestOffsetKnown = MultiXactState->oldestOffsetKnown;
 	prevOldestOffset = MultiXactState->oldestOffset;
-	prevOffsetStopLimit = MultiXactState->offsetStopLimit;
 	Assert(MultiXactState->finishedStartup);
 	LWLockRelease(MultiXactGenLock);
 
@@ -2760,11 +2672,7 @@ SetOffsetVacuumLimit(bool is_startup)
 		oldestOffsetKnown =
 			find_multixact_start(oldestMultiXactId, &oldestOffset);
 
-		if (oldestOffsetKnown)
-			ereport(DEBUG1,
-					(errmsg_internal("oldest MultiXactId member is at offset %u",
-									 oldestOffset)));
-		else
+		if (!oldestOffsetKnown)
 			ereport(LOG,
 					(errmsg("MultiXact member wraparound protections are disabled because oldest checkpointed MultiXact %u does not exist on disk",
 							oldestMultiXactId)));
@@ -2777,24 +2685,7 @@ SetOffsetVacuumLimit(bool is_startup)
 	 * overrun of old data in the members SLRU area. We can only do so if the
 	 * oldest offset is known though.
 	 */
-	if (oldestOffsetKnown)
-	{
-		/* move back to start of the corresponding segment */
-		offsetStopLimit = oldestOffset - (oldestOffset %
-										  (MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT));
-
-		/* always leave one segment before the wraparound point */
-		offsetStopLimit -= (MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT);
-
-		if (!prevOldestOffsetKnown && !is_startup)
-			ereport(LOG,
-					(errmsg("MultiXact member wraparound protections are now enabled")));
-
-		ereport(DEBUG1,
-				(errmsg_internal("MultiXact member stop limit is now %u based on MultiXact %u",
-								 offsetStopLimit, oldestMultiXactId)));
-	}
-	else if (prevOldestOffsetKnown)
+	if (prevOldestOffsetKnown)
 	{
 		/*
 		 * If we failed to get the oldest offset this time, but we have a
@@ -2804,14 +2695,12 @@ SetOffsetVacuumLimit(bool is_startup)
 		 */
 		oldestOffset = prevOldestOffset;
 		oldestOffsetKnown = true;
-		offsetStopLimit = prevOffsetStopLimit;
 	}
 
 	/* Install the computed values */
 	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
 	MultiXactState->oldestOffset = oldestOffset;
 	MultiXactState->oldestOffsetKnown = oldestOffsetKnown;
-	MultiXactState->offsetStopLimit = offsetStopLimit;
 	LWLockRelease(MultiXactGenLock);
 
 	/*
@@ -2821,54 +2710,6 @@ SetOffsetVacuumLimit(bool is_startup)
 		(nextOffset - oldestOffset > MULTIXACT_MEMBER_SAFE_THRESHOLD);
 }
 
-/*
- * Return whether adding "distance" to "start" would move past "boundary".
- *
- * We use this to determine whether the addition is "wrapping around" the
- * boundary point, hence the name.  The reason we don't want to use the regular
- * 2^31-modulo arithmetic here is that we want to be able to use the whole of
- * the 2^32-1 space here, allowing for more multixacts than would fit
- * otherwise.
- */
-static bool
-MultiXactOffsetWouldWrap(MultiXactOffset boundary, MultiXactOffset start,
-						 uint32 distance)
-{
-	MultiXactOffset finish;
-
-	/*
-	 * Note that offset number 0 is not used (see GetMultiXactIdMembers), so
-	 * if the addition wraps around the UINT_MAX boundary, skip that value.
-	 */
-	finish = start + distance;
-	if (finish < start)
-		finish++;
-
-	/*-----------------------------------------------------------------------
-	 * When the boundary is numerically greater than the starting point, any
-	 * value numerically between the two is not wrapped:
-	 *
-	 *	<----S----B---->
-	 *	[---)			 = F wrapped past B (and UINT_MAX)
-	 *		 [---)		 = F not wrapped
-	 *			  [----] = F wrapped past B
-	 *
-	 * When the boundary is numerically less than the starting point (i.e. the
-	 * UINT_MAX wraparound occurs somewhere in between) then all values in
-	 * between are wrapped:
-	 *
-	 *	<----B----S---->
-	 *	[---)			 = F not wrapped past B (but wrapped past UINT_MAX)
-	 *		 [---)		 = F wrapped past B (and UINT_MAX)
-	 *			  [----] = F not wrapped
-	 *-----------------------------------------------------------------------
-	 */
-	if (start < boundary)
-		return finish >= boundary || finish < start;
-	else
-		return finish >= boundary && finish < start;
-}
-
 /*
  * Find the starting offset of the given MultiXactId.
  *
@@ -2990,8 +2831,9 @@ MultiXactMemberFreezeThreshold(void)
 	 * we try to eliminate from the system is based on how far we are past
 	 * MULTIXACT_MEMBER_SAFE_THRESHOLD.
 	 */
-	fraction = (double) (members - MULTIXACT_MEMBER_SAFE_THRESHOLD) /
-		(MULTIXACT_MEMBER_DANGER_THRESHOLD - MULTIXACT_MEMBER_SAFE_THRESHOLD);
+	fraction = (double) (members - MULTIXACT_MEMBER_SAFE_THRESHOLD);
+	fraction /= (double) (MULTIXACT_MEMBER_DANGER_THRESHOLD - MULTIXACT_MEMBER_SAFE_THRESHOLD);
+
 	victim_multixacts = multixacts * fraction;
 
 	/* fraction could be > 1.0, but lowest possible freeze age is zero */
@@ -3041,10 +2883,10 @@ SlruScanDirCbFindEarliest(SlruCtl ctl, char *filename, int64 segpage, void *data
 static void
 PerformMembersTruncation(MultiXactOffset oldestOffset, MultiXactOffset newOldestOffset)
 {
-	const int	maxsegment = MXOffsetToMemberSegment(MaxMultiXactOffset);
-	int			startsegment = MXOffsetToMemberSegment(oldestOffset);
-	int			endsegment = MXOffsetToMemberSegment(newOldestOffset);
-	int			segment = startsegment;
+	const int64	maxsegment = MXOffsetToMemberSegment(MaxMultiXactOffset);
+	int64		startsegment = MXOffsetToMemberSegment(oldestOffset);
+	int64		endsegment = MXOffsetToMemberSegment(newOldestOffset);
+	int64		segment = startsegment;
 
 	/*
 	 * Delete all the segments but the last one. The last segment can still
@@ -3337,7 +3179,7 @@ MultiXactIdPrecedesOrEquals(MultiXactId multi1, MultiXactId multi2)
 static bool
 MultiXactOffsetPrecedes(MultiXactOffset offset1, MultiXactOffset offset2)
 {
-	int32		diff = (int32) (offset1 - offset2);
+	int64		diff = (int64) (offset1 - offset2);
 
 	return (diff < 0);
 }
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index 985cd06802..1af2ce4b93 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -264,7 +264,7 @@ main(int argc, char *argv[])
 
 			case 'O':
 				errno = 0;
-				set_mxoff = strtoul(optarg, &endptr, 0);
+				set_mxoff = strtou64(optarg, &endptr, 0);
 				if (endptr == optarg || *endptr != '\0' || errno != 0)
 				{
 					pg_log_error("invalid argument for option %s", "-O");
diff --git a/src/bin/pg_resetwal/t/001_basic.pl b/src/bin/pg_resetwal/t/001_basic.pl
index 9829e48106..f8a8eef44d 100644
--- a/src/bin/pg_resetwal/t/001_basic.pl
+++ b/src/bin/pg_resetwal/t/001_basic.pl
@@ -206,7 +206,7 @@ push @cmd,
   sprintf("%d,%d", hex($files[0]) == 0 ? 3 : hex($files[0]), hex($files[-1]));
 
 @files = get_slru_files('pg_multixact/offsets');
-$mult = 32 * $blcksz / 4;
+$mult = 32 * $blcksz / 8;
 # -m argument is "new,old"
 push @cmd, '-m',
   sprintf("%d,%d",
diff --git a/src/include/access/multixact.h b/src/include/access/multixact.h
index 7ffd256c74..90583634ec 100644
--- a/src/include/access/multixact.h
+++ b/src/include/access/multixact.h
@@ -27,7 +27,7 @@
 
 #define MultiXactIdIsValid(multi) ((multi) != InvalidMultiXactId)
 
-#define MaxMultiXactOffset	((MultiXactOffset) 0xFFFFFFFF)
+#define MaxMultiXactOffset	UINT64CONST(0xFFFFFFFFFFFFFFFF)
 
 /*
  * Possible multixact lock modes ("status").  The first four modes are for
diff --git a/src/include/c.h b/src/include/c.h
index dc1841346c..ccfb82b478 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -661,7 +661,7 @@ typedef uint32 SubTransactionId;
 /* MultiXactId must be equivalent to TransactionId, to fit in t_xmax */
 typedef TransactionId MultiXactId;
 
-typedef uint32 MultiXactOffset;
+typedef uint64 MultiXactOffset;
 
 typedef uint32 CommandId;
 
-- 
2.45.2
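The last multixact.c hunk of the patch above widens the signed-difference comparison in MultiXactOffsetPrecedes from int32 to int64. A standalone sketch of that comparison follows; the function names are illustrative, and it assumes the usual two's-complement conversion from uint64_t to int64_t.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Illustrative stand-in for MultiXactOffsetPrecedes with 64-bit offsets:
 * offset1 precedes offset2 when their modular difference, reinterpreted as
 * a signed value, is negative.
 */
bool
offset_precedes(uint64_t offset1, uint64_t offset2)
{
	int64_t		diff = (int64_t) (offset1 - offset2);

	return diff < 0;
}

/* Usage example: a few orderings, including one across the 2^64 boundary. */
void
offset_precedes_demo(void)
{
	assert(offset_precedes(1, 2));
	assert(!offset_precedes(2, 1));
	assert(offset_precedes(UINT64_MAX, 0));
}
```

The comparison is still modular, so it only gives a meaningful order for offsets that are less than 2^63 apart; with 64-bit offsets that distance is not expected to be reached in practice.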

v1-0001-Use-64-bit-format-output-for-multixact-offsets.patch (application/x-patch)
From 95226756a225ca6b95e2baafff502034c355310d Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Wed, 7 Aug 2024 16:35:22 +0300
Subject: [PATCH v1 1/3] Use 64-bit format output for multixact offsets

Author: Maxim Orlov <orlovmg@gmail.com>
---
 src/backend/access/rmgrdesc/mxactdesc.c   |  9 ++++----
 src/backend/access/rmgrdesc/xlogdesc.c    |  4 ++--
 src/backend/access/transam/multixact.c    | 26 +++++++++++++----------
 src/backend/access/transam/xlogrecovery.c |  5 +++--
 src/bin/pg_controldata/pg_controldata.c   |  4 ++--
 src/bin/pg_resetwal/pg_resetwal.c         |  8 +++----
 6 files changed, 31 insertions(+), 25 deletions(-)

diff --git a/src/backend/access/rmgrdesc/mxactdesc.c b/src/backend/access/rmgrdesc/mxactdesc.c
index 3e8ad4d5ef..1b486de38c 100644
--- a/src/backend/access/rmgrdesc/mxactdesc.c
+++ b/src/backend/access/rmgrdesc/mxactdesc.c
@@ -65,8 +65,8 @@ multixact_desc(StringInfo buf, XLogReaderState *record)
 		xl_multixact_create *xlrec = (xl_multixact_create *) rec;
 		int			i;
 
-		appendStringInfo(buf, "%u offset %u nmembers %d: ", xlrec->mid,
-						 xlrec->moff, xlrec->nmembers);
+		appendStringInfo(buf, "%u offset %llu nmembers %d: ", xlrec->mid,
+						 (unsigned long long) xlrec->moff, xlrec->nmembers);
 		for (i = 0; i < xlrec->nmembers; i++)
 			out_member(buf, &xlrec->members[i]);
 	}
@@ -74,9 +74,10 @@ multixact_desc(StringInfo buf, XLogReaderState *record)
 	{
 		xl_multixact_truncate *xlrec = (xl_multixact_truncate *) rec;
 
-		appendStringInfo(buf, "offsets [%u, %u), members [%u, %u)",
+		appendStringInfo(buf, "offsets [%u, %u), members [%llu, %llu)",
 						 xlrec->startTruncOff, xlrec->endTruncOff,
-						 xlrec->startTruncMemb, xlrec->endTruncMemb);
+						 (unsigned long long) xlrec->startTruncMemb,
+						 (unsigned long long) xlrec->endTruncMemb);
 	}
 }
 
diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c
index 363294d623..aaa19c81c8 100644
--- a/src/backend/access/rmgrdesc/xlogdesc.c
+++ b/src/backend/access/rmgrdesc/xlogdesc.c
@@ -66,7 +66,7 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 		CheckPoint *checkpoint = (CheckPoint *) rec;
 
 		appendStringInfo(buf, "redo %X/%X; "
-						 "tli %u; prev tli %u; fpw %s; wal_level %s; xid %u:%u; oid %u; multi %u; offset %u; "
+						 "tli %u; prev tli %u; fpw %s; wal_level %s; xid %u:%u; oid %u; multi %u; offset %llu; "
 						 "oldest xid %u in DB %u; oldest multi %u in DB %u; "
 						 "oldest/newest commit timestamp xid: %u/%u; "
 						 "oldest running xid %u; %s",
@@ -79,7 +79,7 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 						 XidFromFullTransactionId(checkpoint->nextXid),
 						 checkpoint->nextOid,
 						 checkpoint->nextMulti,
-						 checkpoint->nextMultiOffset,
+						 (unsigned long long) checkpoint->nextMultiOffset,
 						 checkpoint->oldestXid,
 						 checkpoint->oldestXidDB,
 						 checkpoint->oldestMulti,
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index c601ff98a1..57c5148933 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1258,7 +1258,8 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 
 	LWLockRelease(MultiXactGenLock);
 
-	debug_elog4(DEBUG2, "GetNew: returning %u offset %u", result, *offset);
+	debug_elog4(DEBUG2, "GetNew: returning %u offset %llu", result,
+				(unsigned long long) *offset);
 	return result;
 }
 
@@ -2285,8 +2286,9 @@ MultiXactGetCheckptMulti(bool is_shutdown,
 	LWLockRelease(MultiXactGenLock);
 
 	debug_elog6(DEBUG2,
-				"MultiXact: checkpoint is nextMulti %u, nextOffset %u, oldestMulti %u in DB %u",
-				*nextMulti, *nextMultiOffset, *oldestMulti, *oldestMultiDB);
+				"MultiXact: checkpoint is nextMulti %u, nextOffset %llu, oldestMulti %u in DB %u",
+				*nextMulti, (unsigned long long) *nextMultiOffset, *oldestMulti,
+				*oldestMultiDB);
 }
 
 /*
@@ -2320,8 +2322,8 @@ void
 MultiXactSetNextMXact(MultiXactId nextMulti,
 					  MultiXactOffset nextMultiOffset)
 {
-	debug_elog4(DEBUG2, "MultiXact: setting next multi to %u offset %u",
-				nextMulti, nextMultiOffset);
+	debug_elog4(DEBUG2, "MultiXact: setting next multi to %u offset %llu",
+				nextMulti, (unsigned long long) nextMultiOffset);
 	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
 	MultiXactState->nextMXact = nextMulti;
 	MultiXactState->nextOffset = nextMultiOffset;
@@ -2511,8 +2513,8 @@ MultiXactAdvanceNextMXact(MultiXactId minMulti,
 	}
 	if (MultiXactOffsetPrecedes(MultiXactState->nextOffset, minMultiOffset))
 	{
-		debug_elog3(DEBUG2, "MultiXact: setting next offset to %u",
-					minMultiOffset);
+		debug_elog3(DEBUG2, "MultiXact: setting next offset to %llu",
+					(unsigned long long) minMultiOffset);
 		MultiXactState->nextOffset = minMultiOffset;
 	}
 	LWLockRelease(MultiXactGenLock);
@@ -3203,11 +3205,12 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 
 	elog(DEBUG1, "performing multixact truncation: "
 		 "offsets [%u, %u), offsets segments [%llx, %llx), "
-		 "members [%u, %u), members segments [%llx, %llx)",
+		 "members [%llu, %llu), members segments [%llx, %llx)",
 		 oldestMulti, newOldestMulti,
 		 (unsigned long long) MultiXactIdToOffsetSegment(oldestMulti),
 		 (unsigned long long) MultiXactIdToOffsetSegment(newOldestMulti),
-		 oldestOffset, newOldestOffset,
+		 (unsigned long long) oldestOffset,
+		 (unsigned long long) newOldestOffset,
 		 (unsigned long long) MXOffsetToMemberSegment(oldestOffset),
 		 (unsigned long long) MXOffsetToMemberSegment(newOldestOffset));
 
@@ -3463,11 +3466,12 @@ multixact_redo(XLogReaderState *record)
 
 		elog(DEBUG1, "replaying multixact truncation: "
 			 "offsets [%u, %u), offsets segments [%llx, %llx), "
-			 "members [%u, %u), members segments [%llx, %llx)",
+			 "members [%llu, %llu), members segments [%llx, %llx)",
 			 xlrec.startTruncOff, xlrec.endTruncOff,
 			 (unsigned long long) MultiXactIdToOffsetSegment(xlrec.startTruncOff),
 			 (unsigned long long) MultiXactIdToOffsetSegment(xlrec.endTruncOff),
-			 xlrec.startTruncMemb, xlrec.endTruncMemb,
+			 (unsigned long long) xlrec.startTruncMemb,
+			 (unsigned long long) xlrec.endTruncMemb,
 			 (unsigned long long) MXOffsetToMemberSegment(xlrec.startTruncMemb),
 			 (unsigned long long) MXOffsetToMemberSegment(xlrec.endTruncMemb));
 
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index ad817fbca6..388037a94b 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -877,8 +877,9 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 							 U64FromFullTransactionId(checkPoint.nextXid),
 							 checkPoint.nextOid)));
 	ereport(DEBUG1,
-			(errmsg_internal("next MultiXactId: %u; next MultiXactOffset: %u",
-							 checkPoint.nextMulti, checkPoint.nextMultiOffset)));
+			(errmsg_internal("next MultiXactId: %u; next MultiXactOffset: %llu",
+							 checkPoint.nextMulti,
+							 (unsigned long long) checkPoint.nextMultiOffset)));
 	ereport(DEBUG1,
 			(errmsg_internal("oldest unfrozen transaction ID: %u, in database %u",
 							 checkPoint.oldestXid, checkPoint.oldestXidDB)));
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index 93a05d80ca..43b6727570 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -253,8 +253,8 @@ main(int argc, char *argv[])
 		   ControlFile->checkPointCopy.nextOid);
 	printf(_("Latest checkpoint's NextMultiXactId:  %u\n"),
 		   ControlFile->checkPointCopy.nextMulti);
-	printf(_("Latest checkpoint's NextMultiOffset:  %u\n"),
-		   ControlFile->checkPointCopy.nextMultiOffset);
+	printf(_("Latest checkpoint's NextMultiOffset:  %llu\n"),
+		   (unsigned long long) ControlFile->checkPointCopy.nextMultiOffset);
 	printf(_("Latest checkpoint's oldestXID:        %u\n"),
 		   ControlFile->checkPointCopy.oldestXid);
 	printf(_("Latest checkpoint's oldestXID's DB:   %u\n"),
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index e9dcb5a6d8..985cd06802 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -737,8 +737,8 @@ PrintControlValues(bool guessed)
 		   ControlFile.checkPointCopy.nextOid);
 	printf(_("Latest checkpoint's NextMultiXactId:  %u\n"),
 		   ControlFile.checkPointCopy.nextMulti);
-	printf(_("Latest checkpoint's NextMultiOffset:  %u\n"),
-		   ControlFile.checkPointCopy.nextMultiOffset);
+	printf(_("Latest checkpoint's NextMultiOffset:  %llu\n"),
+		   (unsigned long long) ControlFile.checkPointCopy.nextMultiOffset);
 	printf(_("Latest checkpoint's oldestXID:        %u\n"),
 		   ControlFile.checkPointCopy.oldestXid);
 	printf(_("Latest checkpoint's oldestXID's DB:   %u\n"),
@@ -809,8 +809,8 @@ PrintNewControlValues(void)
 
 	if (set_mxoff != -1)
 	{
-		printf(_("NextMultiOffset:                      %u\n"),
-			   ControlFile.checkPointCopy.nextMultiOffset);
+		printf(_("NextMultiOffset:                      %llu\n"),
+			   (unsigned long long) ControlFile.checkPointCopy.nextMultiOffset);
 	}
 
 	if (set_oid != 0)
-- 
2.45.2
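Taken together, the two patches above touch both directions of pg_resetwal's handling of the offset: 0001 prints NextMultiOffset with %llu, and 0002 parses the -O argument with strtou64. The input side can be sketched in portable C as below. This assumes strtou64 behaves like the standard strtoull; parse_mxoff is a hypothetical name, and the validity check mirrors the one in the hunk.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse a 64-bit multixact offset from a command-line
 * argument.  Base 0 accepts decimal, octal (leading 0) and hex (leading 0x).
 * Returns false on empty input, trailing garbage, or overflow.
 */
bool
parse_mxoff(const char *arg, uint64_t *result)
{
	char	   *endptr;
	unsigned long long val;

	assert(arg != NULL && result != NULL);

	errno = 0;
	val = strtoull(arg, &endptr, 0);
	if (endptr == arg || *endptr != '\0' || errno != 0)
		return false;

	*result = (uint64_t) val;
	return true;
}
```

With the 32-bit strtoul on platforms where long is 32 bits wide, any value above 0xFFFFFFFF would be rejected with ERANGE, so the wider parser is needed before such offsets can be set by hand.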

v1-0003-Make-pg_upgrade-convert-multixact-offsets.patch (application/x-patch)
From 063ec2662d94f7a72e3162702c4051f34cd67000 Mon Sep 17 00:00:00 2001
From: Maxim Orlov <m.orlov@postgrespro.ru>
Date: Tue, 13 Aug 2024 14:44:50 +0300
Subject: [PATCH v1 3/3] Make pg_upgrade convert multixact offsets.

Author: Maxim Orlov <orlovmg@gmail.com>
---
 src/bin/pg_upgrade/Makefile      |   1 +
 src/bin/pg_upgrade/meson.build   |   1 +
 src/bin/pg_upgrade/pg_upgrade.c  |  29 ++-
 src/bin/pg_upgrade/pg_upgrade.h  |  13 +-
 src/bin/pg_upgrade/segresize.c   | 350 +++++++++++++++++++++++++++++++
 src/include/catalog/catversion.h |   2 +-
 6 files changed, 391 insertions(+), 5 deletions(-)
 create mode 100644 src/bin/pg_upgrade/segresize.c

diff --git a/src/bin/pg_upgrade/Makefile b/src/bin/pg_upgrade/Makefile
index bde91e2beb..030816596f 100644
--- a/src/bin/pg_upgrade/Makefile
+++ b/src/bin/pg_upgrade/Makefile
@@ -21,6 +21,7 @@ OBJS = \
 	info.o \
 	option.o \
 	parallel.o \
+	segresize.o \
 	pg_upgrade.o \
 	relfilenumber.o \
 	server.o \
diff --git a/src/bin/pg_upgrade/meson.build b/src/bin/pg_upgrade/meson.build
index 9825fa3305..2d9f7e6b65 100644
--- a/src/bin/pg_upgrade/meson.build
+++ b/src/bin/pg_upgrade/meson.build
@@ -10,6 +10,7 @@ pg_upgrade_sources = files(
   'info.c',
   'option.c',
   'parallel.c',
+  'segresize.c',
   'pg_upgrade.c',
   'relfilenumber.c',
   'server.c',
diff --git a/src/bin/pg_upgrade/pg_upgrade.c b/src/bin/pg_upgrade/pg_upgrade.c
index 663235816f..d9d8d0ea78 100644
--- a/src/bin/pg_upgrade/pg_upgrade.c
+++ b/src/bin/pg_upgrade/pg_upgrade.c
@@ -750,7 +750,30 @@ copy_xact_xlog_xid(void)
 	if (old_cluster.controldata.cat_ver >= MULTIXACT_FORMATCHANGE_CAT_VER &&
 		new_cluster.controldata.cat_ver >= MULTIXACT_FORMATCHANGE_CAT_VER)
 	{
-		copy_subdir_files("pg_multixact/offsets", "pg_multixact/offsets");
+		/*
+		 * If the old server is before the MULTIXACTOFFSET_FORMATCHANGE_CAT_VER
+		 * it must have 32-bit multixid offsets, thus it should be converted.
+		 */
+		if (old_cluster.controldata.cat_ver < MULTIXACTOFFSET_FORMATCHANGE_CAT_VER &&
+			new_cluster.controldata.cat_ver >= MULTIXACTOFFSET_FORMATCHANGE_CAT_VER)
+		{
+			uint64	oldest_offset = convert_multixact_offsets();
+
+			if (oldest_offset)
+			{
+				uint64	next_offset = old_cluster.controldata.chkpnt_nxtmxoff;
+
+				/* Handle possible wraparound. */
+				if (next_offset < oldest_offset)
+					next_offset += ((uint64) 1 << 32) - 1;
+
+				next_offset -= oldest_offset - 1;
+				old_cluster.controldata.chkpnt_nxtmxoff = next_offset;
+			}
+		}
+		else
+			copy_subdir_files("pg_multixact/offsets", "pg_multixact/offsets");
+
 		copy_subdir_files("pg_multixact/members", "pg_multixact/members");
 
 		prep_status("Setting next multixact ID and offset for new cluster");
@@ -760,9 +783,9 @@ copy_xact_xlog_xid(void)
 		 * counters here and the oldest multi present on system.
 		 */
 		exec_prog(UTILITY_LOG_FILE, NULL, true, true,
-				  "\"%s/pg_resetwal\" -O %u -m %u,%u \"%s\"",
+				  "\"%s/pg_resetwal\" -O %llu -m %u,%u \"%s\"",
 				  new_cluster.bindir,
-				  old_cluster.controldata.chkpnt_nxtmxoff,
+				  (unsigned long long) old_cluster.controldata.chkpnt_nxtmxoff,
 				  old_cluster.controldata.chkpnt_nxtmulti,
 				  old_cluster.controldata.chkpnt_oldstMulti,
 				  new_cluster.pgdata);
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index cdb6e2b759..37d173cb86 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -114,6 +114,13 @@ extern char *output_files[];
  */
 #define MULTIXACT_FORMATCHANGE_CAT_VER 201301231
 
+/*
+ * Switch from 32-bit to 64-bit for multixid offsets.
+ *
+ * XXX: should be changed to the actual CATALOG_VERSION_NO on commit.
+ */
+#define MULTIXACTOFFSET_FORMATCHANGE_CAT_VER 202408123
+
 /*
  * large object chunk size added to pg_controldata,
  * commit 5f93c37805e7485488480916b4585e098d3cc883
@@ -230,7 +237,7 @@ typedef struct
 	uint32		chkpnt_nxtepoch;
 	uint32		chkpnt_nxtoid;
 	uint32		chkpnt_nxtmulti;
-	uint32		chkpnt_nxtmxoff;
+	uint64		chkpnt_nxtmxoff;
 	uint32		chkpnt_oldstMulti;
 	uint32		chkpnt_oldstxid;
 	uint32		align;
@@ -494,3 +501,7 @@ void		parallel_transfer_all_new_dbs(DbInfoArr *old_db_arr, DbInfoArr *new_db_arr
 										  char *old_pgdata, char *new_pgdata,
 										  char *old_tablespace);
 bool		reap_child(bool wait_for_child);
+
+/* segresize.c */
+
+uint64		convert_multixact_offsets(void);
diff --git a/src/bin/pg_upgrade/segresize.c b/src/bin/pg_upgrade/segresize.c
new file mode 100644
index 0000000000..e47c0a2407
--- /dev/null
+++ b/src/bin/pg_upgrade/segresize.c
@@ -0,0 +1,350 @@
+/*
+ *	segresize.c
+ *
+ *	SLRU segment resize utility
+ *
+ *	Copyright (c) 2024, PostgreSQL Global Development Group
+ *	src/bin/pg_upgrade/segresize.c
+ */
+
+#include "postgres_fe.h"
+
+#include "pg_upgrade.h"
+#include "access/multixact.h"
+
+/* See slru.h */
+#define SLRU_PAGES_PER_SEGMENT		32
+
+/*
+ * Some kind of iterator associated with a particular SLRU segment.  The idea is
+ * to specify the segment and page number and then move through the pages.
+ */
+typedef struct SlruSegState
+{
+	char	   *dir;
+	char	   *fn;
+	FILE	   *file;
+	int64		segno;
+	uint64		pageno;
+	bool		leading_gap;
+	bool		long_segment_names;
+} SlruSegState;
+
+/*
+ * Get SLRU segment file name from state.
+ *
+ * NOTE: this function should mirror SlruFileName call.
+ */
+static inline char *
+SlruFileName(SlruSegState *state)
+{
+	if (state->long_segment_names)
+	{
+		Assert(state->segno >= 0 &&
+			   state->segno <= INT64CONST(0xFFFFFFFFFFFFFFF));
+		return psprintf("%s/%015llX", state->dir, (long long) state->segno);
+	}
+	else
+	{
+		Assert(state->segno >= 0 &&
+			   state->segno <= INT64CONST(0xFFFFFF));
+		return psprintf("%s/%04X", state->dir, (unsigned int) state->segno);
+	}
+}
+
+/*
+ * Create SLRU segment file.
+ */
+static void
+create_segment(SlruSegState *state)
+{
+	Assert(state->fn == NULL);
+	Assert(state->file == NULL);
+
+	state->fn = SlruFileName(state);
+	state->file = fopen(state->fn, "wb");
+	if (!state->file)
+		pg_fatal("could not create file \"%s\": %m", state->fn);
+}
+
+/*
+ * Open existing SLRU segment file.
+ */
+static void
+open_segment(SlruSegState *state)
+{
+	Assert(state->fn == NULL);
+	Assert(state->file == NULL);
+
+	state->fn = SlruFileName(state);
+	state->file = fopen(state->fn, "rb");
+	if (!state->file)
+		pg_fatal("could not open file \"%s\": %m", state->fn);
+}
+
+/*
+ * Close SLRU segment file.
+ */
+static void
+close_segment(SlruSegState *state)
+{
+	if (state->file)
+	{
+		fclose(state->file);
+		state->file = NULL;
+	}
+
+	if (state->fn)
+	{
+		pfree(state->fn);
+		state->fn = NULL;
+	}
+}
+
+/*
+ * Read next page from the old 32-bit offset segment file.
+ */
+static int
+read_old_segment_page(SlruSegState *state, void *buf, bool *empty)
+{
+	int		len;
+
+	/* Open next segment file, if needed. */
+	if (!state->fn)
+	{
+		if (!state->segno)
+			state->leading_gap = true;
+
+		open_segment(state);
+
+		/* Set position to the needed page. */
+		if (state->pageno > 0 &&
+			fseek(state->file, state->pageno * BLCKSZ, SEEK_SET))
+		{
+			close_segment(state);
+		}
+	}
+
+	if (state->file)
+	{
+		/* Segment file does exist, read page from it. */
+		state->leading_gap = false;
+
+		len = fread(buf, sizeof(char), BLCKSZ, state->file);
+
+		/* Are we done or was there an error? */
+		if (len <= 0)
+		{
+			if (ferror(state->file))
+				pg_fatal("error reading file \"%s\": %m", state->fn);
+
+			if (feof(state->file))
+			{
+				*empty = true;
+				len = -1;
+
+				close_segment(state);
+			}
+		}
+		else
+			*empty = false;
+	}
+	else if (!state->leading_gap)
+	{
+		/* We reached the last segment. */
+		len = -1;
+		*empty = true;
+	}
+	else
+	{
+		/* Skip the first few segments if they were frozen and removed. */
+		len = BLCKSZ;
+		*empty = true;
+	}
+
+	if (++state->pageno >= SLRU_PAGES_PER_SEGMENT)
+	{
+		/* Start a new segment. */
+		state->segno++;
+		state->pageno = 0;
+
+		close_segment(state);
+	}
+
+	return len;
+}
+
+/*
+ * Write next page to the new 64-bit offset segment file.
+ */
+static void
+write_new_segment_page(SlruSegState *state, void *buf)
+{
+	/*
+	 * Create a new segment file if we have not done so yet.  Creation is
+	 * postponed until the first non-empty page is found.  This avoids
+	 * creating completely empty segments.
+	 */
+	if (!state->file)
+	{
+		create_segment(state);
+
+		/* Write zeroes to the previously skipped prefix. */
+		if (state->pageno > 0)
+		{
+			char		zerobuf[BLCKSZ] = {0};
+
+			for (int64 i = 0; i < state->pageno; i++)
+			{
+				if (fwrite(zerobuf, sizeof(char), BLCKSZ, state->file) != BLCKSZ)
+					pg_fatal("could not write file \"%s\": %m", state->fn);
+			}
+		}
+	}
+
+	/* Write page to the new segment (if it was created). */
+	if (state->file)
+	{
+		if (fwrite(buf, sizeof(char), BLCKSZ, state->file) != BLCKSZ)
+			pg_fatal("could not write file \"%s\": %m", state->fn);
+	}
+
+	state->pageno++;
+
+	/*
+	 * Did we reach the maximum page number?  Then close segment file
+	 * and create a new one on the next iteration.
+	 */
+	if (state->pageno >= SLRU_PAGES_PER_SEGMENT)
+	{
+		state->segno++;
+		state->pageno = 0;
+		close_segment(state);
+	}
+}
+
+/*
+ * Convert pg_multixact/offsets segments and return oldest multi offset.
+ */
+uint64
+convert_multixact_offsets(void)
+{
+	/* See multixact.c */
+#define MULTIXACT_OFFSETS_PER_PAGE_OLD	(BLCKSZ / sizeof(uint32))
+#define MULTIXACT_OFFSETS_PER_PAGE		(BLCKSZ / sizeof(MultiXactOffset))
+
+	SlruSegState	oldseg = {0},
+					newseg = {0};
+	uint32			oldbuf[MULTIXACT_OFFSETS_PER_PAGE_OLD] = {0};
+	MultiXactOffset	newbuf[MULTIXACT_OFFSETS_PER_PAGE] = {0};
+	/*
+	 * It is much easier to deal with multi wraparound in 64-bit arithmetic.
+	 * Thus we keep multis in 64-bit variables, although they remain 32 bits.
+	 */
+	uint64			oldest_multi = old_cluster.controldata.chkpnt_oldstMulti,
+					next_multi = old_cluster.controldata.chkpnt_nxtmulti,
+					multi,
+					old_entry,
+					new_entry;
+	bool			found = false;
+	uint64			oldest_offset = 0;
+
+	prep_status("Converting pg_multixact/offsets to 64-bit");
+
+	oldseg.pageno = oldest_multi / MULTIXACT_OFFSETS_PER_PAGE_OLD;
+	oldseg.segno = oldseg.pageno / SLRU_PAGES_PER_SEGMENT;
+	oldseg.pageno %= SLRU_PAGES_PER_SEGMENT;
+	oldseg.dir = psprintf("%s/pg_multixact/offsets", old_cluster.pgdata);
+	oldseg.long_segment_names = false;		/* old format XXXX */
+
+	newseg.pageno = oldest_multi / MULTIXACT_OFFSETS_PER_PAGE;
+	newseg.segno = newseg.pageno / SLRU_PAGES_PER_SEGMENT;
+	newseg.pageno %= SLRU_PAGES_PER_SEGMENT;
+	newseg.dir = psprintf("%s/pg_multixact/offsets", new_cluster.pgdata);
+	newseg.long_segment_names = true;
+
+	old_entry = oldest_multi % MULTIXACT_OFFSETS_PER_PAGE_OLD;
+	new_entry = oldest_multi % MULTIXACT_OFFSETS_PER_PAGE;
+
+	if (next_multi < oldest_multi)
+		next_multi += (uint64) 1 << 32;		/* wraparound */
+
+	for (multi = oldest_multi; multi < next_multi; old_entry = 0)
+	{
+		int			oldlen;
+		bool		empty;
+
+		/* Handle possible segment wraparound. */
+		if (oldseg.segno > MaxMultiXactId /
+								MULTIXACT_OFFSETS_PER_PAGE_OLD /
+								SLRU_PAGES_PER_SEGMENT)
+			oldseg.segno = 0;
+
+		/* Read old offset segment. */
+		oldlen = read_old_segment_page(&oldseg, oldbuf, &empty);
+		if (oldlen <= 0 || empty)
+			pg_fatal("cannot read page %llu from file \"%s\": %m",
+					 (unsigned long long) oldseg.pageno, oldseg.fn);
+
+		/* Fill possible gap. */
+		if (oldlen < BLCKSZ)
+			memset((char *) oldbuf + oldlen, 0, BLCKSZ - oldlen);
+
+		/* Save oldest multi offset */
+		if (!found)
+		{
+			oldest_offset = oldbuf[old_entry];
+			found = true;
+		}
+
+		/* ... skip wrapped-around invalid multi */
+		if (multi == (uint64) 1 << 32)
+		{
+			Assert(oldseg.segno == 0);
+			Assert(oldseg.pageno == 1);
+			Assert(old_entry == 0);
+
+			multi += FirstMultiXactId;
+			old_entry = FirstMultiXactId;
+		}
+
+		/* Copy entries to the new page. */
+		for (; multi < next_multi && old_entry < MULTIXACT_OFFSETS_PER_PAGE_OLD;
+			 multi++, old_entry++)
+		{
+			MultiXactOffset offset = oldbuf[old_entry];
+
+			/* Handle possible offset wraparound. */
+			if (offset < oldest_offset)
+				offset += ((uint64) 1 << 32) - 1;
+
+			/* Subtract oldest_offset, so new offsets will start from 1. */
+			newbuf[new_entry++] = offset - oldest_offset + 1;
+			if (new_entry >= MULTIXACT_OFFSETS_PER_PAGE)
+			{
+				/* Write a new page. */
+				write_new_segment_page(&newseg, newbuf);
+				new_entry = 0;
+			}
+		}
+	}
+
+	/* Write the last incomplete page. */
+	if (new_entry > 0 || oldest_multi == next_multi)
+	{
+		memset(&newbuf[new_entry], 0,
+			   sizeof(newbuf[0]) * (MULTIXACT_OFFSETS_PER_PAGE - new_entry));
+		write_new_segment_page(&newseg, newbuf);
+	}
+
+	/* Release resources. */
+	close_segment(&oldseg);
+	close_segment(&newseg);
+
+	pfree(oldseg.dir);
+	pfree(newseg.dir);
+
+	check_ok();
+
+	return oldest_offset;
+}
diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h
index 9a0ae27823..f29dc9fc92 100644
--- a/src/include/catalog/catversion.h
+++ b/src/include/catalog/catversion.h
@@ -57,6 +57,6 @@
  */
 
 /*							yyyymmddN */
-#define CATALOG_VERSION_NO	202408122
+#define CATALOG_VERSION_NO	202408123
 
 #endif
-- 
2.45.2
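
For readers following the copy loop in convert_multixact_offsets() above, the per-entry arithmetic can be isolated into a small function. This is only a sketch: rebase_offset() is a hypothetical name that does not appear in the patch, and the comment about offset 0 being skipped on wraparound is my reading of why the patch adds 2^32 - 1 rather than 2^32.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of the patch): map one old 32-bit multixact
 * offset into the new 64-bit space, mirroring the copy loop in
 * convert_multixact_offsets().
 */
static uint64_t
rebase_offset(uint32_t old_offset, uint32_t oldest_offset)
{
	uint64_t	offset = old_offset;

	/*
	 * An offset below the oldest one has wrapped around.  The patch adds
	 * 2^32 - 1 here; assumption: offset 0 is skipped when the old 32-bit
	 * counter wraps, so the wrapped span is one short of 2^32.
	 */
	if (offset < oldest_offset)
		offset += ((uint64_t) 1 << 32) - 1;

	/* After the adjustment the offset can no longer precede the oldest. */
	assert(offset >= oldest_offset);

	/* Shift everything so that the oldest offset becomes 1. */
	return offset - oldest_offset + 1;
}
```

With an oldest offset of 4294967290, the old offset 5 (reached after a wrap) comes out as 11, and the oldest offset itself always maps to 1. The same shift is applied to the next-offset value in copy_xact_xlog_xid() before it is handed to pg_resetwal.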

#7Maxim Orlov
orlovmg@gmail.com
In reply to: Maxim Orlov (#6)
3 attachment(s)
Re: POC: make mxidoff 64 bits

Here is a rebase. Apparently I'll have to do this often, since the patch
changes CATALOG_VERSION_NO.

--
Best regards,
Maxim Orlov.

Attachments:

v2-0001-Use-64-bit-format-output-for-multixact-offsets.patch (application/octet-stream)
From 228a21532bb441fe582a66b7404962ce5bf4b18b Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Wed, 7 Aug 2024 16:35:22 +0300
Subject: [PATCH v2 1/3] Use 64-bit format output for multixact offsets

Author: Maxim Orlov <orlovmg@gmail.com>
---
 src/backend/access/rmgrdesc/mxactdesc.c   |  9 ++++----
 src/backend/access/rmgrdesc/xlogdesc.c    |  4 ++--
 src/backend/access/transam/multixact.c    | 26 +++++++++++++----------
 src/backend/access/transam/xlogrecovery.c |  5 +++--
 src/bin/pg_controldata/pg_controldata.c   |  4 ++--
 src/bin/pg_resetwal/pg_resetwal.c         |  8 +++----
 6 files changed, 31 insertions(+), 25 deletions(-)

diff --git a/src/backend/access/rmgrdesc/mxactdesc.c b/src/backend/access/rmgrdesc/mxactdesc.c
index 3e8ad4d5ef..1b486de38c 100644
--- a/src/backend/access/rmgrdesc/mxactdesc.c
+++ b/src/backend/access/rmgrdesc/mxactdesc.c
@@ -65,8 +65,8 @@ multixact_desc(StringInfo buf, XLogReaderState *record)
 		xl_multixact_create *xlrec = (xl_multixact_create *) rec;
 		int			i;
 
-		appendStringInfo(buf, "%u offset %u nmembers %d: ", xlrec->mid,
-						 xlrec->moff, xlrec->nmembers);
+		appendStringInfo(buf, "%u offset %llu nmembers %d: ", xlrec->mid,
+						 (unsigned long long) xlrec->moff, xlrec->nmembers);
 		for (i = 0; i < xlrec->nmembers; i++)
 			out_member(buf, &xlrec->members[i]);
 	}
@@ -74,9 +74,10 @@ multixact_desc(StringInfo buf, XLogReaderState *record)
 	{
 		xl_multixact_truncate *xlrec = (xl_multixact_truncate *) rec;
 
-		appendStringInfo(buf, "offsets [%u, %u), members [%u, %u)",
+		appendStringInfo(buf, "offsets [%u, %u), members [%llu, %llu)",
 						 xlrec->startTruncOff, xlrec->endTruncOff,
-						 xlrec->startTruncMemb, xlrec->endTruncMemb);
+						 (unsigned long long) xlrec->startTruncMemb,
+						 (unsigned long long) xlrec->endTruncMemb);
 	}
 }
 
diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c
index 363294d623..aaa19c81c8 100644
--- a/src/backend/access/rmgrdesc/xlogdesc.c
+++ b/src/backend/access/rmgrdesc/xlogdesc.c
@@ -66,7 +66,7 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 		CheckPoint *checkpoint = (CheckPoint *) rec;
 
 		appendStringInfo(buf, "redo %X/%X; "
-						 "tli %u; prev tli %u; fpw %s; wal_level %s; xid %u:%u; oid %u; multi %u; offset %u; "
+						 "tli %u; prev tli %u; fpw %s; wal_level %s; xid %u:%u; oid %u; multi %u; offset %llu; "
 						 "oldest xid %u in DB %u; oldest multi %u in DB %u; "
 						 "oldest/newest commit timestamp xid: %u/%u; "
 						 "oldest running xid %u; %s",
@@ -79,7 +79,7 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 						 XidFromFullTransactionId(checkpoint->nextXid),
 						 checkpoint->nextOid,
 						 checkpoint->nextMulti,
-						 checkpoint->nextMultiOffset,
+						 (unsigned long long) checkpoint->nextMultiOffset,
 						 checkpoint->oldestXid,
 						 checkpoint->oldestXidDB,
 						 checkpoint->oldestMulti,
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 8c37d7eba7..ab90912ed3 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1264,7 +1264,8 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 
 	LWLockRelease(MultiXactGenLock);
 
-	debug_elog4(DEBUG2, "GetNew: returning %u offset %u", result, *offset);
+	debug_elog4(DEBUG2, "GetNew: returning %u offset %llu", result,
+				(unsigned long long) *offset);
 	return result;
 }
 
@@ -2293,8 +2294,9 @@ MultiXactGetCheckptMulti(bool is_shutdown,
 	LWLockRelease(MultiXactGenLock);
 
 	debug_elog6(DEBUG2,
-				"MultiXact: checkpoint is nextMulti %u, nextOffset %u, oldestMulti %u in DB %u",
-				*nextMulti, *nextMultiOffset, *oldestMulti, *oldestMultiDB);
+				"MultiXact: checkpoint is nextMulti %u, nextOffset %llu, oldestMulti %u in DB %u",
+				*nextMulti, (unsigned long long) *nextMultiOffset, *oldestMulti,
+				*oldestMultiDB);
 }
 
 /*
@@ -2328,8 +2330,8 @@ void
 MultiXactSetNextMXact(MultiXactId nextMulti,
 					  MultiXactOffset nextMultiOffset)
 {
-	debug_elog4(DEBUG2, "MultiXact: setting next multi to %u offset %u",
-				nextMulti, nextMultiOffset);
+	debug_elog4(DEBUG2, "MultiXact: setting next multi to %u offset %llu",
+				nextMulti, (unsigned long long) nextMultiOffset);
 	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
 	MultiXactState->nextMXact = nextMulti;
 	MultiXactState->nextOffset = nextMultiOffset;
@@ -2519,8 +2521,8 @@ MultiXactAdvanceNextMXact(MultiXactId minMulti,
 	}
 	if (MultiXactOffsetPrecedes(MultiXactState->nextOffset, minMultiOffset))
 	{
-		debug_elog3(DEBUG2, "MultiXact: setting next offset to %u",
-					minMultiOffset);
+		debug_elog3(DEBUG2, "MultiXact: setting next offset to %llu",
+					(unsigned long long) minMultiOffset);
 		MultiXactState->nextOffset = minMultiOffset;
 	}
 	LWLockRelease(MultiXactGenLock);
@@ -3211,11 +3213,12 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 
 	elog(DEBUG1, "performing multixact truncation: "
 		 "offsets [%u, %u), offsets segments [%llx, %llx), "
-		 "members [%u, %u), members segments [%llx, %llx)",
+		 "members [%llu, %llu), members segments [%llx, %llx)",
 		 oldestMulti, newOldestMulti,
 		 (unsigned long long) MultiXactIdToOffsetSegment(oldestMulti),
 		 (unsigned long long) MultiXactIdToOffsetSegment(newOldestMulti),
-		 oldestOffset, newOldestOffset,
+		 (unsigned long long) oldestOffset,
+		 (unsigned long long) newOldestOffset,
 		 (unsigned long long) MXOffsetToMemberSegment(oldestOffset),
 		 (unsigned long long) MXOffsetToMemberSegment(newOldestOffset));
 
@@ -3471,11 +3474,12 @@ multixact_redo(XLogReaderState *record)
 
 		elog(DEBUG1, "replaying multixact truncation: "
 			 "offsets [%u, %u), offsets segments [%llx, %llx), "
-			 "members [%u, %u), members segments [%llx, %llx)",
+			 "members [%llu, %llu), members segments [%llx, %llx)",
 			 xlrec.startTruncOff, xlrec.endTruncOff,
 			 (unsigned long long) MultiXactIdToOffsetSegment(xlrec.startTruncOff),
 			 (unsigned long long) MultiXactIdToOffsetSegment(xlrec.endTruncOff),
-			 xlrec.startTruncMemb, xlrec.endTruncMemb,
+			 (unsigned long long) xlrec.startTruncMemb,
+			 (unsigned long long) xlrec.endTruncMemb,
 			 (unsigned long long) MXOffsetToMemberSegment(xlrec.startTruncMemb),
 			 (unsigned long long) MXOffsetToMemberSegment(xlrec.endTruncMemb));
 
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 178491f6f5..0c5980a436 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -877,8 +877,9 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 							 U64FromFullTransactionId(checkPoint.nextXid),
 							 checkPoint.nextOid)));
 	ereport(DEBUG1,
-			(errmsg_internal("next MultiXactId: %u; next MultiXactOffset: %u",
-							 checkPoint.nextMulti, checkPoint.nextMultiOffset)));
+			(errmsg_internal("next MultiXactId: %u; next MultiXactOffset: %llu",
+							 checkPoint.nextMulti,
+							 (unsigned long long) checkPoint.nextMultiOffset)));
 	ereport(DEBUG1,
 			(errmsg_internal("oldest unfrozen transaction ID: %u, in database %u",
 							 checkPoint.oldestXid, checkPoint.oldestXidDB)));
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index 93a05d80ca..43b6727570 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -253,8 +253,8 @@ main(int argc, char *argv[])
 		   ControlFile->checkPointCopy.nextOid);
 	printf(_("Latest checkpoint's NextMultiXactId:  %u\n"),
 		   ControlFile->checkPointCopy.nextMulti);
-	printf(_("Latest checkpoint's NextMultiOffset:  %u\n"),
-		   ControlFile->checkPointCopy.nextMultiOffset);
+	printf(_("Latest checkpoint's NextMultiOffset:  %llu\n"),
+		   (unsigned long long) ControlFile->checkPointCopy.nextMultiOffset);
 	printf(_("Latest checkpoint's oldestXID:        %u\n"),
 		   ControlFile->checkPointCopy.oldestXid);
 	printf(_("Latest checkpoint's oldestXID's DB:   %u\n"),
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index e9dcb5a6d8..985cd06802 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -737,8 +737,8 @@ PrintControlValues(bool guessed)
 		   ControlFile.checkPointCopy.nextOid);
 	printf(_("Latest checkpoint's NextMultiXactId:  %u\n"),
 		   ControlFile.checkPointCopy.nextMulti);
-	printf(_("Latest checkpoint's NextMultiOffset:  %u\n"),
-		   ControlFile.checkPointCopy.nextMultiOffset);
+	printf(_("Latest checkpoint's NextMultiOffset:  %llu\n"),
+		   (unsigned long long) ControlFile.checkPointCopy.nextMultiOffset);
 	printf(_("Latest checkpoint's oldestXID:        %u\n"),
 		   ControlFile.checkPointCopy.oldestXid);
 	printf(_("Latest checkpoint's oldestXID's DB:   %u\n"),
@@ -809,8 +809,8 @@ PrintNewControlValues(void)
 
 	if (set_mxoff != -1)
 	{
-		printf(_("NextMultiOffset:                      %u\n"),
-			   ControlFile.checkPointCopy.nextMultiOffset);
+		printf(_("NextMultiOffset:                      %llu\n"),
+			   (unsigned long long) ControlFile.checkPointCopy.nextMultiOffset);
 	}
 
 	if (set_oid != 0)
-- 
2.45.2
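
The pattern repeated throughout this patch is to print the widened offset with %llu and an explicit cast to unsigned long long, so the format matches regardless of how the platform defines its 64-bit integer type. A minimal standalone sketch of that pattern follows; the MultiXactOffset typedef here is a stand-in for the widened type, and format_next_offset() is a hypothetical helper, not something in the tree.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Stand-in for the widened type; an assumption made for this sketch only. */
typedef uint64_t MultiXactOffset;

/*
 * Hypothetical helper: format a multixact offset the way the patched
 * messages do.  uint64_t may be "unsigned long" or "unsigned long long"
 * depending on the platform, so the value is cast to match %llu exactly.
 */
static int
format_next_offset(char *buf, size_t buflen, MultiXactOffset off)
{
	assert(buf != NULL && buflen > 0);

	return snprintf(buf, buflen, "NextMultiOffset: %llu",
					(unsigned long long) off);
}
```

Passing 4294967296 (2^32), a value that the old %u format could not represent, yields the string "NextMultiOffset: 4294967296".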

v2-0003-Make-pg_upgrade-convert-multixact-offsets.patch (application/octet-stream)
From a06557435597868cc654de2899b6cd618fed641c Mon Sep 17 00:00:00 2001
From: Maxim Orlov <m.orlov@postgrespro.ru>
Date: Tue, 13 Aug 2024 14:44:50 +0300
Subject: [PATCH v2 3/3] Make pg_upgrade convert multixact offsets.

Author: Maxim Orlov <orlovmg@gmail.com>
---
 src/bin/pg_upgrade/Makefile      |   1 +
 src/bin/pg_upgrade/meson.build   |   1 +
 src/bin/pg_upgrade/pg_upgrade.c  |  29 ++-
 src/bin/pg_upgrade/pg_upgrade.h  |  13 +-
 src/bin/pg_upgrade/segresize.c   | 350 +++++++++++++++++++++++++++++++
 src/include/catalog/catversion.h |   2 +-
 6 files changed, 391 insertions(+), 5 deletions(-)
 create mode 100644 src/bin/pg_upgrade/segresize.c

diff --git a/src/bin/pg_upgrade/Makefile b/src/bin/pg_upgrade/Makefile
index bde91e2beb..030816596f 100644
--- a/src/bin/pg_upgrade/Makefile
+++ b/src/bin/pg_upgrade/Makefile
@@ -21,6 +21,7 @@ OBJS = \
 	info.o \
 	option.o \
 	parallel.o \
+	segresize.o \
 	pg_upgrade.o \
 	relfilenumber.o \
 	server.o \
diff --git a/src/bin/pg_upgrade/meson.build b/src/bin/pg_upgrade/meson.build
index 9825fa3305..2d9f7e6b65 100644
--- a/src/bin/pg_upgrade/meson.build
+++ b/src/bin/pg_upgrade/meson.build
@@ -10,6 +10,7 @@ pg_upgrade_sources = files(
   'info.c',
   'option.c',
   'parallel.c',
+  'segresize.c',
   'pg_upgrade.c',
   'relfilenumber.c',
   'server.c',
diff --git a/src/bin/pg_upgrade/pg_upgrade.c b/src/bin/pg_upgrade/pg_upgrade.c
index 663235816f..d9d8d0ea78 100644
--- a/src/bin/pg_upgrade/pg_upgrade.c
+++ b/src/bin/pg_upgrade/pg_upgrade.c
@@ -750,7 +750,30 @@ copy_xact_xlog_xid(void)
 	if (old_cluster.controldata.cat_ver >= MULTIXACT_FORMATCHANGE_CAT_VER &&
 		new_cluster.controldata.cat_ver >= MULTIXACT_FORMATCHANGE_CAT_VER)
 	{
-		copy_subdir_files("pg_multixact/offsets", "pg_multixact/offsets");
+		/*
+		 * If the old server predates MULTIXACTOFFSET_FORMATCHANGE_CAT_VER,
+		 * it has 32-bit multixact offsets, which must be converted.
+		 */
+		if (old_cluster.controldata.cat_ver < MULTIXACTOFFSET_FORMATCHANGE_CAT_VER &&
+			new_cluster.controldata.cat_ver >= MULTIXACTOFFSET_FORMATCHANGE_CAT_VER)
+		{
+			uint64	oldest_offset = convert_multixact_offsets();
+
+			if (oldest_offset)
+			{
+				uint64	next_offset = old_cluster.controldata.chkpnt_nxtmxoff;
+
+				/* Handle possible wraparound. */
+				if (next_offset < oldest_offset)
+					next_offset += ((uint64) 1 << 32) - 1;
+
+				next_offset -= oldest_offset - 1;
+				old_cluster.controldata.chkpnt_nxtmxoff = next_offset;
+			}
+		}
+		else
+			copy_subdir_files("pg_multixact/offsets", "pg_multixact/offsets");
+
 		copy_subdir_files("pg_multixact/members", "pg_multixact/members");
 
 		prep_status("Setting next multixact ID and offset for new cluster");
@@ -760,9 +783,9 @@ copy_xact_xlog_xid(void)
 		 * counters here and the oldest multi present on system.
 		 */
 		exec_prog(UTILITY_LOG_FILE, NULL, true, true,
-				  "\"%s/pg_resetwal\" -O %u -m %u,%u \"%s\"",
+				  "\"%s/pg_resetwal\" -O %llu -m %u,%u \"%s\"",
 				  new_cluster.bindir,
-				  old_cluster.controldata.chkpnt_nxtmxoff,
+				  (unsigned long long) old_cluster.controldata.chkpnt_nxtmxoff,
 				  old_cluster.controldata.chkpnt_nxtmulti,
 				  old_cluster.controldata.chkpnt_oldstMulti,
 				  new_cluster.pgdata);
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index cdb6e2b759..445b46e5bd 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -114,6 +114,13 @@ extern char *output_files[];
  */
 #define MULTIXACT_FORMATCHANGE_CAT_VER 201301231
 
+/*
+ * Switch from 32-bit to 64-bit for multixid offsets.
+ *
+ * XXX: should be changed to the actual CATALOG_VERSION_NO on commit.
+ */
+#define MULTIXACTOFFSET_FORMATCHANGE_CAT_VER 202408302
+
 /*
  * large object chunk size added to pg_controldata,
  * commit 5f93c37805e7485488480916b4585e098d3cc883
@@ -230,7 +237,7 @@ typedef struct
 	uint32		chkpnt_nxtepoch;
 	uint32		chkpnt_nxtoid;
 	uint32		chkpnt_nxtmulti;
-	uint32		chkpnt_nxtmxoff;
+	uint64		chkpnt_nxtmxoff;
 	uint32		chkpnt_oldstMulti;
 	uint32		chkpnt_oldstxid;
 	uint32		align;
@@ -494,3 +501,7 @@ void		parallel_transfer_all_new_dbs(DbInfoArr *old_db_arr, DbInfoArr *new_db_arr
 										  char *old_pgdata, char *new_pgdata,
 										  char *old_tablespace);
 bool		reap_child(bool wait_for_child);
+
+/* segresize.c */
+
+uint64		convert_multixact_offsets(void);
diff --git a/src/bin/pg_upgrade/segresize.c b/src/bin/pg_upgrade/segresize.c
new file mode 100644
index 0000000000..e47c0a2407
--- /dev/null
+++ b/src/bin/pg_upgrade/segresize.c
@@ -0,0 +1,350 @@
+/*
+ *	segresize.c
+ *
+ *	SLRU segment resize utility
+ *
+ *	Copyright (c) 2024, PostgreSQL Global Development Group
+ *	src/bin/pg_upgrade/segresize.c
+ */
+
+#include "postgres_fe.h"
+
+#include "pg_upgrade.h"
+#include "access/multixact.h"
+
+/* See slru.h */
+#define SLRU_PAGES_PER_SEGMENT		32
+
+/*
+ * Some kind of iterator associated with a particular SLRU segment.  The idea is
+ * to specify the segment and page number and then move through the pages.
+ */
+typedef struct SlruSegState
+{
+	char	   *dir;
+	char	   *fn;
+	FILE	   *file;
+	int64		segno;
+	uint64		pageno;
+	bool		leading_gap;
+	bool		long_segment_names;
+} SlruSegState;
+
+/*
+ * Get SLRU segment file name from state.
+ *
+ * NOTE: this function should mirror SlruFileName call.
+ */
+static inline char *
+SlruFileName(SlruSegState *state)
+{
+	if (state->long_segment_names)
+	{
+		Assert(state->segno >= 0 &&
+			   state->segno <= INT64CONST(0xFFFFFFFFFFFFFFF));
+		return psprintf("%s/%015llX", state->dir, (long long) state->segno);
+	}
+	else
+	{
+		Assert(state->segno >= 0 &&
+			   state->segno <= INT64CONST(0xFFFFFF));
+		return psprintf("%s/%04X", state->dir, (unsigned int) state->segno);
+	}
+}
+
+/*
+ * Create SLRU segment file.
+ */
+static void
+create_segment(SlruSegState *state)
+{
+	Assert(state->fn == NULL);
+	Assert(state->file == NULL);
+
+	state->fn = SlruFileName(state);
+	state->file = fopen(state->fn, "wb");
+	if (!state->file)
+		pg_fatal("could not create file \"%s\": %m", state->fn);
+}
+
+/*
+ * Open existing SLRU segment file.
+ */
+static void
+open_segment(SlruSegState *state)
+{
+	Assert(state->fn == NULL);
+	Assert(state->file == NULL);
+
+	state->fn = SlruFileName(state);
+	state->file = fopen(state->fn, "rb");
+	if (!state->file)
+		pg_fatal("could not open file \"%s\": %m", state->fn);
+}
+
+/*
+ * Close SLRU segment file.
+ */
+static void
+close_segment(SlruSegState *state)
+{
+	if (state->file)
+	{
+		fclose(state->file);
+		state->file = NULL;
+	}
+
+	if (state->fn)
+	{
+		pfree(state->fn);
+		state->fn = NULL;
+	}
+}
+
+/*
+ * Read next page from the old 32-bit offset segment file.
+ */
+static int
+read_old_segment_page(SlruSegState *state, void *buf, bool *empty)
+{
+	int		len;
+
+	/* Open next segment file, if needed. */
+	if (!state->fn)
+	{
+		if (!state->segno)
+			state->leading_gap = true;
+
+		open_segment(state);
+
+		/* Set position to the needed page. */
+		if (state->pageno > 0 &&
+			fseek(state->file, state->pageno * BLCKSZ, SEEK_SET))
+		{
+			close_segment(state);
+		}
+	}
+
+	if (state->file)
+	{
+		/* Segment file does exist; read a page from it. */
+		state->leading_gap = false;
+
+		len = fread(buf, sizeof(char), BLCKSZ, state->file);
+
+		/* Are we done or was there an error? */
+		if (len <= 0)
+		{
+			if (ferror(state->file))
+				pg_fatal("error reading file \"%s\": %m", state->fn);
+
+			if (feof(state->file))
+			{
+				*empty = true;
+				len = -1;
+
+				close_segment(state);
+			}
+		}
+		else
+			*empty = false;
+	}
+	else if (!state->leading_gap)
+	{
+		/* We reached the last segment. */
+		len = -1;
+		*empty = true;
+	}
+	else
+	{
+		/* Skip the first few segments if they were frozen and removed. */
+		len = BLCKSZ;
+		*empty = true;
+	}
+
+	if (++state->pageno >= SLRU_PAGES_PER_SEGMENT)
+	{
+		/* Start a new segment. */
+		state->segno++;
+		state->pageno = 0;
+
+		close_segment(state);
+	}
+
+	return len;
+}
+
+/*
+ * Write next page to the new 64-bit offset segment file.
+ */
+static void
+write_new_segment_page(SlruSegState *state, void *buf)
+{
+	/*
+	 * Create a new segment file if we have not done so yet.  Creation is
+	 * postponed until the first non-empty page is found.  This avoids
+	 * creating completely empty segments.
+	 */
+	if (!state->file)
+	{
+		create_segment(state);
+
+		/* Write zeroes to the previously skipped prefix. */
+		if (state->pageno > 0)
+		{
+			char		zerobuf[BLCKSZ] = {0};
+
+			for (int64 i = 0; i < state->pageno; i++)
+			{
+				if (fwrite(zerobuf, sizeof(char), BLCKSZ, state->file) != BLCKSZ)
+					pg_fatal("could not write file \"%s\": %m", state->fn);
+			}
+		}
+	}
+
+	/* Write page to the new segment (if it was created). */
+	if (state->file)
+	{
+		if (fwrite(buf, sizeof(char), BLCKSZ, state->file) != BLCKSZ)
+			pg_fatal("could not write file \"%s\": %m", state->fn);
+	}
+
+	state->pageno++;
+
+	/*
+	 * Did we reach the maximum page number?  Then close segment file
+	 * and create a new one on the next iteration.
+	 */
+	if (state->pageno >= SLRU_PAGES_PER_SEGMENT)
+	{
+		state->segno++;
+		state->pageno = 0;
+		close_segment(state);
+	}
+}
+
+/*
+ * Convert pg_multixact/offsets segments and return oldest multi offset.
+ */
+uint64
+convert_multixact_offsets(void)
+{
+	/* See multixact.c */
+#define MULTIXACT_OFFSETS_PER_PAGE_OLD	(BLCKSZ / sizeof(uint32))
+#define MULTIXACT_OFFSETS_PER_PAGE		(BLCKSZ / sizeof(MultiXactOffset))
+
+	SlruSegState	oldseg = {0},
+					newseg = {0};
+	uint32			oldbuf[MULTIXACT_OFFSETS_PER_PAGE_OLD] = {0};
+	MultiXactOffset	newbuf[MULTIXACT_OFFSETS_PER_PAGE] = {0};
+	/*
+	 * It is much easier to deal with multi wraparound in 64-bit arithmetic.
+	 * Thus we keep multis in 64-bit variables, although they remain 32 bits.
+	 */
+	uint64			oldest_multi = old_cluster.controldata.chkpnt_oldstMulti,
+					next_multi = old_cluster.controldata.chkpnt_nxtmulti,
+					multi,
+					old_entry,
+					new_entry;
+	bool			found = false;
+	uint64			oldest_offset = 0;
+
+	prep_status("Converting pg_multixact/offsets to 64-bit");
+
+	oldseg.pageno = oldest_multi / MULTIXACT_OFFSETS_PER_PAGE_OLD;
+	oldseg.segno = oldseg.pageno / SLRU_PAGES_PER_SEGMENT;
+	oldseg.pageno %= SLRU_PAGES_PER_SEGMENT;
+	oldseg.dir = psprintf("%s/pg_multixact/offsets", old_cluster.pgdata);
+	oldseg.long_segment_names = false;		/* old format XXXX */
+
+	newseg.pageno = oldest_multi / MULTIXACT_OFFSETS_PER_PAGE;
+	newseg.segno = newseg.pageno / SLRU_PAGES_PER_SEGMENT;
+	newseg.pageno %= SLRU_PAGES_PER_SEGMENT;
+	newseg.dir = psprintf("%s/pg_multixact/offsets", new_cluster.pgdata);
+	newseg.long_segment_names = true;
+
+	old_entry = oldest_multi % MULTIXACT_OFFSETS_PER_PAGE_OLD;
+	new_entry = oldest_multi % MULTIXACT_OFFSETS_PER_PAGE;
+
+	if (next_multi < oldest_multi)
+		next_multi += (uint64) 1 << 32;		/* wraparound */
+
+	for (multi = oldest_multi; multi < next_multi; old_entry = 0)
+	{
+		int			oldlen;
+		bool		empty;
+
+		/* Handle possible segment wraparound. */
+		if (oldseg.segno > MaxMultiXactId /
+								MULTIXACT_OFFSETS_PER_PAGE_OLD /
+								SLRU_PAGES_PER_SEGMENT)
+			oldseg.segno = 0;
+
+		/* Read old offset segment. */
+		oldlen = read_old_segment_page(&oldseg, oldbuf, &empty);
+		if (oldlen <= 0 || empty)
+			pg_fatal("cannot read page %llu from file \"%s\": %m",
+					 (unsigned long long) oldseg.pageno, oldseg.fn);
+
+		/* Fill possible gap. */
+		if (oldlen < BLCKSZ)
+			memset((char *) oldbuf + oldlen, 0, BLCKSZ - oldlen);
+
+		/* Save oldest multi offset */
+		if (!found)
+		{
+			oldest_offset = oldbuf[old_entry];
+			found = true;
+		}
+
+		/* ... skip wrapped-around invalid multi */
+		if (multi == (uint64) 1 << 32)
+		{
+			Assert(oldseg.segno == 0);
+			Assert(oldseg.pageno == 1);
+			Assert(old_entry == 0);
+
+			multi += FirstMultiXactId;
+			old_entry = FirstMultiXactId;
+		}
+
+		/* Copy entries to the new page. */
+		for (; multi < next_multi && old_entry < MULTIXACT_OFFSETS_PER_PAGE_OLD;
+			 multi++, old_entry++)
+		{
+			MultiXactOffset offset = oldbuf[old_entry];
+
+			/* Handle possible offset wraparound. */
+			if (offset < oldest_offset)
+				offset += ((uint64) 1 << 32) - 1;
+
+			/* Subtract oldest_offset, so new offsets will start from 1. */
+			newbuf[new_entry++] = offset - oldest_offset + 1;
+			if (new_entry >= MULTIXACT_OFFSETS_PER_PAGE)
+			{
+				/* Write a new page. */
+				write_new_segment_page(&newseg, newbuf);
+				new_entry = 0;
+			}
+		}
+	}
+
+	/* Write the last incomplete page. */
+	if (new_entry > 0 || oldest_multi == next_multi)
+	{
+		memset(&newbuf[new_entry], 0,
+			   sizeof(newbuf[0]) * (MULTIXACT_OFFSETS_PER_PAGE - new_entry));
+		write_new_segment_page(&newseg, newbuf);
+	}
+
+	/* Release resources. */
+	close_segment(&oldseg);
+	close_segment(&newseg);
+
+	pfree(oldseg.dir);
+	pfree(newseg.dir);
+
+	check_ok();
+
+	return oldest_offset;
+}
diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h
index 1980d492c3..7b1cd22d1a 100644
--- a/src/include/catalog/catversion.h
+++ b/src/include/catalog/catversion.h
@@ -57,6 +57,6 @@
  */
 
 /*							yyyymmddN */
-#define CATALOG_VERSION_NO	202408301
+#define CATALOG_VERSION_NO	202408302
 
 #endif
-- 
2.45.2
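
Because each offset entry doubles in size, the same multixact ID lands on a different page and in a different segment file after conversion, which is why segresize.c tracks separate old and new positions. The sketch below reproduces that position arithmetic and the two file name formats. It assumes the default BLCKSZ of 8192; OffsetLocation, locate_offset() and segment_file_name() are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define BLCKSZ					8192	/* assumed default block size */
#define SLRU_PAGES_PER_SEGMENT	32		/* as in the patch, see slru.h */

/* Hypothetical struct: where one multixact's offset entry is stored. */
typedef struct OffsetLocation
{
	int64_t		segno;			/* segment file number */
	uint64_t	pageno;			/* page within the segment */
	uint64_t	entry;			/* slot within the page */
} OffsetLocation;

/* Locate the entry for "multi" given the on-disk entry size in bytes. */
static OffsetLocation
locate_offset(uint32_t multi, size_t entry_size)
{
	OffsetLocation loc;
	uint64_t	per_page;
	uint64_t	page;

	/* 4 bytes for the old format, 8 bytes for the new one. */
	assert(entry_size == 4 || entry_size == 8);

	per_page = BLCKSZ / entry_size;
	page = multi / per_page;

	loc.segno = (int64_t) (page / SLRU_PAGES_PER_SEGMENT);
	loc.pageno = page % SLRU_PAGES_PER_SEGMENT;
	loc.entry = multi % per_page;
	return loc;
}

/* Build the bare segment file name: "XXXX" (old) or 15 hex digits (new). */
static int
segment_file_name(char *buf, size_t buflen, int64_t segno, int long_names)
{
	if (long_names)
		return snprintf(buf, buflen, "%015llX", (unsigned long long) segno);
	return snprintf(buf, buflen, "%04X", (unsigned int) segno);
}
```

Under these assumptions, multixact 100000 sits in old segment 0001, page 16, slot 1696, and in new segment 000000000000003, page 1, slot 672.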

v2-0002-Use-64-bit-multixact-offsets.patch (application/octet-stream)
From 3afd483c1a2a505e14603da759adcefd7130fff9 Mon Sep 17 00:00:00 2001
From: Maxim Orlov <m.orlov@postgrespro.ru>
Date: Wed, 6 Mar 2024 11:11:33 +0300
Subject: [PATCH v2 2/3] Use 64-bit multixact offsets.

Author: Maxim Orlov <orlovmg@gmail.com>
---
 src/backend/access/transam/multixact.c | 172 +------------------------
 src/bin/pg_resetwal/pg_resetwal.c      |   2 +-
 src/bin/pg_resetwal/t/001_basic.pl     |   2 +-
 src/include/access/multixact.h         |   2 +-
 src/include/c.h                        |   2 +-
 5 files changed, 11 insertions(+), 169 deletions(-)

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index ab90912ed3..c51e03e832 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -96,14 +96,6 @@
 /*
  * Defines for MultiXactOffset page sizes.  A page is the same BLCKSZ as is
  * used everywhere else in Postgres.
- *
- * Note: because MultiXactOffsets are 32 bits and wrap around at 0xFFFFFFFF,
- * MultiXact page numbering also wraps around at
- * 0xFFFFFFFF/MULTIXACT_OFFSETS_PER_PAGE, and segment numbering at
- * 0xFFFFFFFF/MULTIXACT_OFFSETS_PER_PAGE/SLRU_PAGES_PER_SEGMENT.  We need
- * take no explicit notice of that fact in this module, except when comparing
- * segment and page numbers in TruncateMultiXact (see
- * MultiXactOffsetPagePrecedes).
  */
 
 /* We need four bytes per offset */
@@ -272,9 +264,6 @@ typedef struct MultiXactStateData
 	MultiXactId multiStopLimit;
 	MultiXactId multiWrapLimit;
 
-	/* support for members anti-wraparound measures */
-	MultiXactOffset offsetStopLimit;	/* known if oldestOffsetKnown */
-
 	/*
 	 * This is used to sleep until a multixact offset is written when we want
 	 * to create the next one.
@@ -409,8 +398,6 @@ static bool MultiXactOffsetPrecedes(MultiXactOffset offset1,
 									MultiXactOffset offset2);
 static void ExtendMultiXactOffset(MultiXactId multi);
 static void ExtendMultiXactMember(MultiXactOffset offset, int nmembers);
-static bool MultiXactOffsetWouldWrap(MultiXactOffset boundary,
-									 MultiXactOffset start, uint32 distance);
 static bool SetOffsetVacuumLimit(bool is_startup);
 static bool find_multixact_start(MultiXactId multi, MultiXactOffset *result);
 static void WriteMZeroPageXlogRec(int64 pageno, uint8 info);
@@ -1164,78 +1151,6 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 	else
 		*offset = nextOffset;
 
-	/*----------
-	 * Protect against overrun of the members space as well, with the
-	 * following rules:
-	 *
-	 * If we're past offsetStopLimit, refuse to generate more multis.
-	 * If we're close to offsetStopLimit, emit a warning.
-	 *
-	 * Arbitrarily, we start emitting warnings when we're 20 segments or less
-	 * from offsetStopLimit.
-	 *
-	 * Note we haven't updated the shared state yet, so if we fail at this
-	 * point, the multixact ID we grabbed can still be used by the next guy.
-	 *
-	 * Note that there is no point in forcing autovacuum runs here: the
-	 * multixact freeze settings would have to be reduced for that to have any
-	 * effect.
-	 *----------
-	 */
-#define OFFSET_WARN_SEGMENTS	20
-	if (MultiXactState->oldestOffsetKnown &&
-		MultiXactOffsetWouldWrap(MultiXactState->offsetStopLimit, nextOffset,
-								 nmembers))
-	{
-		/* see comment in the corresponding offsets wraparound case */
-		SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER);
-
-		ereport(ERROR,
-				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg("multixact \"members\" limit exceeded"),
-				 errdetail_plural("This command would create a multixact with %u members, but the remaining space is only enough for %u member.",
-								  "This command would create a multixact with %u members, but the remaining space is only enough for %u members.",
-								  MultiXactState->offsetStopLimit - nextOffset - 1,
-								  nmembers,
-								  MultiXactState->offsetStopLimit - nextOffset - 1),
-				 errhint("Execute a database-wide VACUUM in database with OID %u with reduced \"vacuum_multixact_freeze_min_age\" and \"vacuum_multixact_freeze_table_age\" settings.",
-						 MultiXactState->oldestMultiXactDB)));
-	}
-
-	/*
-	 * Check whether we should kick autovacuum into action, to prevent members
-	 * wraparound. NB we use a much larger window to trigger autovacuum than
-	 * just the warning limit. The warning is just a measure of last resort -
-	 * this is in line with GetNewTransactionId's behaviour.
-	 */
-	if (!MultiXactState->oldestOffsetKnown ||
-		(MultiXactState->nextOffset - MultiXactState->oldestOffset
-		 > MULTIXACT_MEMBER_SAFE_THRESHOLD))
-	{
-		/*
-		 * To avoid swamping the postmaster with signals, we issue the autovac
-		 * request only when crossing a segment boundary. With default
-		 * compilation settings that's roughly after 50k members.  This still
-		 * gives plenty of chances before we get into real trouble.
-		 */
-		if ((MXOffsetToMemberPage(nextOffset) / SLRU_PAGES_PER_SEGMENT) !=
-			(MXOffsetToMemberPage(nextOffset + nmembers) / SLRU_PAGES_PER_SEGMENT))
-			SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER);
-	}
-
-	if (MultiXactState->oldestOffsetKnown &&
-		MultiXactOffsetWouldWrap(MultiXactState->offsetStopLimit,
-								 nextOffset,
-								 nmembers + MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT * OFFSET_WARN_SEGMENTS))
-		ereport(WARNING,
-				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg_plural("database with OID %u must be vacuumed before %d more multixact member is used",
-							   "database with OID %u must be vacuumed before %d more multixact members are used",
-							   MultiXactState->offsetStopLimit - nextOffset + nmembers,
-							   MultiXactState->oldestMultiXactDB,
-							   MultiXactState->offsetStopLimit - nextOffset + nmembers),
-				 errhint("Execute a database-wide VACUUM in that database with reduced \"vacuum_multixact_freeze_min_age\" and \"vacuum_multixact_freeze_table_age\" settings.")));
-
 	ExtendMultiXactMember(nextOffset, nmembers);
 
 	/*
@@ -1976,7 +1891,7 @@ MultiXactShmemInit(void)
 				  "pg_multixact/offsets", LWTRANCHE_MULTIXACTOFFSET_BUFFER,
 				  LWTRANCHE_MULTIXACTOFFSET_SLRU,
 				  SYNC_HANDLER_MULTIXACT_OFFSET,
-				  false);
+				  true);
 	SlruPagePrecedesUnitTests(MultiXactOffsetCtl, MULTIXACT_OFFSETS_PER_PAGE);
 	SimpleLruInit(MultiXactMemberCtl,
 				  "multixact_member", multixact_member_buffers, 0,
@@ -2721,8 +2636,6 @@ SetOffsetVacuumLimit(bool is_startup)
 	MultiXactOffset nextOffset;
 	bool		oldestOffsetKnown = false;
 	bool		prevOldestOffsetKnown;
-	MultiXactOffset offsetStopLimit = 0;
-	MultiXactOffset prevOffsetStopLimit;
 
 	/*
 	 * NB: Have to prevent concurrent truncation, we might otherwise try to
@@ -2737,7 +2650,6 @@ SetOffsetVacuumLimit(bool is_startup)
 	nextOffset = MultiXactState->nextOffset;
 	prevOldestOffsetKnown = MultiXactState->oldestOffsetKnown;
 	prevOldestOffset = MultiXactState->oldestOffset;
-	prevOffsetStopLimit = MultiXactState->offsetStopLimit;
 	Assert(MultiXactState->finishedStartup);
 	LWLockRelease(MultiXactGenLock);
 
@@ -2768,11 +2680,7 @@ SetOffsetVacuumLimit(bool is_startup)
 		oldestOffsetKnown =
 			find_multixact_start(oldestMultiXactId, &oldestOffset);
 
-		if (oldestOffsetKnown)
-			ereport(DEBUG1,
-					(errmsg_internal("oldest MultiXactId member is at offset %u",
-									 oldestOffset)));
-		else
+		if (!oldestOffsetKnown)
 			ereport(LOG,
 					(errmsg("MultiXact member wraparound protections are disabled because oldest checkpointed MultiXact %u does not exist on disk",
 							oldestMultiXactId)));
@@ -2785,24 +2693,7 @@ SetOffsetVacuumLimit(bool is_startup)
 	 * overrun of old data in the members SLRU area. We can only do so if the
 	 * oldest offset is known though.
 	 */
-	if (oldestOffsetKnown)
-	{
-		/* move back to start of the corresponding segment */
-		offsetStopLimit = oldestOffset - (oldestOffset %
-										  (MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT));
-
-		/* always leave one segment before the wraparound point */
-		offsetStopLimit -= (MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT);
-
-		if (!prevOldestOffsetKnown && !is_startup)
-			ereport(LOG,
-					(errmsg("MultiXact member wraparound protections are now enabled")));
-
-		ereport(DEBUG1,
-				(errmsg_internal("MultiXact member stop limit is now %u based on MultiXact %u",
-								 offsetStopLimit, oldestMultiXactId)));
-	}
-	else if (prevOldestOffsetKnown)
+	if (prevOldestOffsetKnown)
 	{
 		/*
 		 * If we failed to get the oldest offset this time, but we have a
@@ -2812,14 +2703,12 @@ SetOffsetVacuumLimit(bool is_startup)
 		 */
 		oldestOffset = prevOldestOffset;
 		oldestOffsetKnown = true;
-		offsetStopLimit = prevOffsetStopLimit;
 	}
 
 	/* Install the computed values */
 	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
 	MultiXactState->oldestOffset = oldestOffset;
 	MultiXactState->oldestOffsetKnown = oldestOffsetKnown;
-	MultiXactState->offsetStopLimit = offsetStopLimit;
 	LWLockRelease(MultiXactGenLock);
 
 	/*
@@ -2829,54 +2718,6 @@ SetOffsetVacuumLimit(bool is_startup)
 		(nextOffset - oldestOffset > MULTIXACT_MEMBER_SAFE_THRESHOLD);
 }
 
-/*
- * Return whether adding "distance" to "start" would move past "boundary".
- *
- * We use this to determine whether the addition is "wrapping around" the
- * boundary point, hence the name.  The reason we don't want to use the regular
- * 2^31-modulo arithmetic here is that we want to be able to use the whole of
- * the 2^32-1 space here, allowing for more multixacts than would fit
- * otherwise.
- */
-static bool
-MultiXactOffsetWouldWrap(MultiXactOffset boundary, MultiXactOffset start,
-						 uint32 distance)
-{
-	MultiXactOffset finish;
-
-	/*
-	 * Note that offset number 0 is not used (see GetMultiXactIdMembers), so
-	 * if the addition wraps around the UINT_MAX boundary, skip that value.
-	 */
-	finish = start + distance;
-	if (finish < start)
-		finish++;
-
-	/*-----------------------------------------------------------------------
-	 * When the boundary is numerically greater than the starting point, any
-	 * value numerically between the two is not wrapped:
-	 *
-	 *	<----S----B---->
-	 *	[---)			 = F wrapped past B (and UINT_MAX)
-	 *		 [---)		 = F not wrapped
-	 *			  [----] = F wrapped past B
-	 *
-	 * When the boundary is numerically less than the starting point (i.e. the
-	 * UINT_MAX wraparound occurs somewhere in between) then all values in
-	 * between are wrapped:
-	 *
-	 *	<----B----S---->
-	 *	[---)			 = F not wrapped past B (but wrapped past UINT_MAX)
-	 *		 [---)		 = F wrapped past B (and UINT_MAX)
-	 *			  [----] = F not wrapped
-	 *-----------------------------------------------------------------------
-	 */
-	if (start < boundary)
-		return finish >= boundary || finish < start;
-	else
-		return finish >= boundary && finish < start;
-}
-
 /*
  * Find the starting offset of the given MultiXactId.
  *
@@ -2998,8 +2839,9 @@ MultiXactMemberFreezeThreshold(void)
 	 * we try to eliminate from the system is based on how far we are past
 	 * MULTIXACT_MEMBER_SAFE_THRESHOLD.
 	 */
-	fraction = (double) (members - MULTIXACT_MEMBER_SAFE_THRESHOLD) /
-		(MULTIXACT_MEMBER_DANGER_THRESHOLD - MULTIXACT_MEMBER_SAFE_THRESHOLD);
+	fraction = (double) (members - MULTIXACT_MEMBER_SAFE_THRESHOLD);
+	fraction /= (double) (MULTIXACT_MEMBER_DANGER_THRESHOLD - MULTIXACT_MEMBER_SAFE_THRESHOLD);
+
 	victim_multixacts = multixacts * fraction;
 
 	/* fraction could be > 1.0, but lowest possible freeze age is zero */
@@ -3345,7 +3187,7 @@ MultiXactIdPrecedesOrEquals(MultiXactId multi1, MultiXactId multi2)
 static bool
 MultiXactOffsetPrecedes(MultiXactOffset offset1, MultiXactOffset offset2)
 {
-	int32		diff = (int32) (offset1 - offset2);
+	int64		diff = (int64) (offset1 - offset2);
 
 	return (diff < 0);
 }
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index 985cd06802..1af2ce4b93 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -264,7 +264,7 @@ main(int argc, char *argv[])
 
 			case 'O':
 				errno = 0;
-				set_mxoff = strtoul(optarg, &endptr, 0);
+				set_mxoff = strtou64(optarg, &endptr, 0);
 				if (endptr == optarg || *endptr != '\0' || errno != 0)
 				{
 					pg_log_error("invalid argument for option %s", "-O");
diff --git a/src/bin/pg_resetwal/t/001_basic.pl b/src/bin/pg_resetwal/t/001_basic.pl
index 9829e48106..f8a8eef44d 100644
--- a/src/bin/pg_resetwal/t/001_basic.pl
+++ b/src/bin/pg_resetwal/t/001_basic.pl
@@ -206,7 +206,7 @@ push @cmd,
   sprintf("%d,%d", hex($files[0]) == 0 ? 3 : hex($files[0]), hex($files[-1]));
 
 @files = get_slru_files('pg_multixact/offsets');
-$mult = 32 * $blcksz / 4;
+$mult = 32 * $blcksz / 8;
 # -m argument is "new,old"
 push @cmd, '-m',
   sprintf("%d,%d",
diff --git a/src/include/access/multixact.h b/src/include/access/multixact.h
index 7ffd256c74..90583634ec 100644
--- a/src/include/access/multixact.h
+++ b/src/include/access/multixact.h
@@ -27,7 +27,7 @@
 
 #define MultiXactIdIsValid(multi) ((multi) != InvalidMultiXactId)
 
-#define MaxMultiXactOffset	((MultiXactOffset) 0xFFFFFFFF)
+#define MaxMultiXactOffset	UINT64CONST(0xFFFFFFFFFFFFFFFF)
 
 /*
  * Possible multixact lock modes ("status").  The first four modes are for
diff --git a/src/include/c.h b/src/include/c.h
index dc1841346c..ccfb82b478 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -661,7 +661,7 @@ typedef uint32 SubTransactionId;
 /* MultiXactId must be equivalent to TransactionId, to fit in t_xmax */
 typedef TransactionId MultiXactId;
 
-typedef uint32 MultiXactOffset;
+typedef uint64 MultiXactOffset;
 
 typedef uint32 CommandId;
 
-- 
2.45.2
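As an aside on the MultiXactOffsetPrecedes hunk above: the patch widens the signed difference from int32 to int64. The sketch below compares the two variants side by side. The function names are hypothetical stand-ins, not PostgreSQL symbols, and the unsigned-to-signed cast assumes two's complement behaviour, as the original code does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the pre-patch comparison (32-bit offsets). */
bool
offset_precedes32(uint32_t offset1, uint32_t offset2)
{
	/* Modular comparison: meaningful only while offsets are < 2^31 apart. */
	int32_t		diff = (int32_t) (offset1 - offset2);

	return diff < 0;
}

/* Hypothetical stand-in for the patched comparison (64-bit offsets). */
bool
offset_precedes64(uint64_t offset1, uint64_t offset2)
{
	/* Same idea, but the usable window grows to 2^63. */
	int64_t		diff = (int64_t) (offset1 - offset2);

	return diff < 0;
}

/* Small self-check showing where the two variants disagree. */
void
offset_precedes_selfcheck(void)
{
	/* Nearby values: both variants agree. */
	assert(offset_precedes32(1, 2));
	assert(offset_precedes64(1, 2));

	/* More than 2^31 apart: the 32-bit variant treats 0 as "newer". */
	assert(!offset_precedes32(0, UINT32_C(0x80000001)));
	assert(offset_precedes64(0, UINT64_C(0x80000001)));
}
```

With 64-bit offsets the comparison is still modular, but reaching a distance of 2^63 members is not a practical concern, which seems to be why the patch can drop the offsetStopLimit machinery.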

#8Alexander Korotkov
aekorotkov@gmail.com
In reply to: Maxim Orlov (#7)
Re: POC: make mxidoff 64 bits

On Tue, Sep 3, 2024 at 4:30 PM Maxim Orlov <orlovmg@gmail.com> wrote:

Here is a rebase. Apparently I'll have to do this often, since the patch changes CATALOG_VERSION_NO.

I don't think you need to maintain the CATALOG_VERSION_NO change in your
patch, for the exact reason you mentioned: the patch will conflict
each time CATALOG_VERSION_NO is advanced. It's the committer's
responsibility to advance CATALOG_VERSION_NO when needed.

------
Regards,
Alexander Korotkov
Supabase

#9Maxim Orlov
orlovmg@gmail.com
In reply to: Alexander Korotkov (#8)
Re: POC: make mxidoff 64 bits

On Tue, 3 Sept 2024 at 16:32, Alexander Korotkov <aekorotkov@gmail.com> wrote:

I don't think you need to maintain the CATALOG_VERSION_NO change in your
patch, for the exact reason you mentioned: the patch will conflict
each time CATALOG_VERSION_NO is advanced. It's the committer's
responsibility to advance CATALOG_VERSION_NO when needed.

OK, I got it. My intention here was to make the patch easier to test: anyone
who wants to have a look at it would not need to change the code first. In the
next iteration, I'll remove the CATALOG_VERSION_NO change.

--
Best regards,
Maxim Orlov.

#10Maxim Orlov
orlovmg@gmail.com
In reply to: Maxim Orlov (#9)
3 attachment(s)
Re: POC: make mxidoff 64 bits

Here is v3. I removed the CATALOG_VERSION_NO change, so the bump should be done
by the actual committer.

--
Best regards,
Maxim Orlov.

Attachments:

v3-0002-Use-64-bit-multixact-offsets.patch (application/octet-stream)
From 231886c2fafe9eb2d8535c4b590e387085d7aec7 Mon Sep 17 00:00:00 2001
From: Maxim Orlov <m.orlov@postgrespro.ru>
Date: Wed, 6 Mar 2024 11:11:33 +0300
Subject: [PATCH v3 2/3] Use 64-bit multixact offsets.

Author: Maxim Orlov <orlovmg@gmail.com>
---
 src/backend/access/transam/multixact.c | 172 +------------------------
 src/bin/pg_resetwal/pg_resetwal.c      |   2 +-
 src/bin/pg_resetwal/t/001_basic.pl     |   2 +-
 src/include/access/multixact.h         |   2 +-
 src/include/c.h                        |   2 +-
 5 files changed, 11 insertions(+), 169 deletions(-)
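
A note on the pg_resetwal.c hunk in this patch: the -O argument is now parsed with strtou64 instead of strtoul. On platforms where unsigned long is 32 bits wide, strtoul cannot return a value above 0xFFFFFFFF, so a 64-bit parser is needed once offsets are widened. The following is a minimal sketch of the same validation using the standard strtoull. The helper name parse_mxoff_arg is hypothetical, and treating strtoull as equivalent to PostgreSQL's strtou64 is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse a 64-bit offset argument with the same
 * validity check the patch keeps (no digits, trailing junk, or errno set).
 */
bool
parse_mxoff_arg(const char *arg, uint64_t *result)
{
	char	   *endptr;
	unsigned long long val;

	assert(arg != NULL && result != NULL);

	errno = 0;
	val = strtoull(arg, &endptr, 0);	/* base 0: decimal, 0x hex, 0 octal */
	if (endptr == arg || *endptr != '\0' || errno != 0)
		return false;

	*result = (uint64_t) val;
	return true;
}
```

Called with "0x100000000", this helper stores 4294967296, a value that a 32-bit strtoul would have rejected with ERANGE.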

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index ab90912ed3..c51e03e832 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -96,14 +96,6 @@
 /*
  * Defines for MultiXactOffset page sizes.  A page is the same BLCKSZ as is
  * used everywhere else in Postgres.
- *
- * Note: because MultiXactOffsets are 32 bits and wrap around at 0xFFFFFFFF,
- * MultiXact page numbering also wraps around at
- * 0xFFFFFFFF/MULTIXACT_OFFSETS_PER_PAGE, and segment numbering at
- * 0xFFFFFFFF/MULTIXACT_OFFSETS_PER_PAGE/SLRU_PAGES_PER_SEGMENT.  We need
- * take no explicit notice of that fact in this module, except when comparing
- * segment and page numbers in TruncateMultiXact (see
- * MultiXactOffsetPagePrecedes).
  */
 
 /* We need four bytes per offset */
@@ -272,9 +264,6 @@ typedef struct MultiXactStateData
 	MultiXactId multiStopLimit;
 	MultiXactId multiWrapLimit;
 
-	/* support for members anti-wraparound measures */
-	MultiXactOffset offsetStopLimit;	/* known if oldestOffsetKnown */
-
 	/*
 	 * This is used to sleep until a multixact offset is written when we want
 	 * to create the next one.
@@ -409,8 +398,6 @@ static bool MultiXactOffsetPrecedes(MultiXactOffset offset1,
 									MultiXactOffset offset2);
 static void ExtendMultiXactOffset(MultiXactId multi);
 static void ExtendMultiXactMember(MultiXactOffset offset, int nmembers);
-static bool MultiXactOffsetWouldWrap(MultiXactOffset boundary,
-									 MultiXactOffset start, uint32 distance);
 static bool SetOffsetVacuumLimit(bool is_startup);
 static bool find_multixact_start(MultiXactId multi, MultiXactOffset *result);
 static void WriteMZeroPageXlogRec(int64 pageno, uint8 info);
@@ -1164,78 +1151,6 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 	else
 		*offset = nextOffset;
 
-	/*----------
-	 * Protect against overrun of the members space as well, with the
-	 * following rules:
-	 *
-	 * If we're past offsetStopLimit, refuse to generate more multis.
-	 * If we're close to offsetStopLimit, emit a warning.
-	 *
-	 * Arbitrarily, we start emitting warnings when we're 20 segments or less
-	 * from offsetStopLimit.
-	 *
-	 * Note we haven't updated the shared state yet, so if we fail at this
-	 * point, the multixact ID we grabbed can still be used by the next guy.
-	 *
-	 * Note that there is no point in forcing autovacuum runs here: the
-	 * multixact freeze settings would have to be reduced for that to have any
-	 * effect.
-	 *----------
-	 */
-#define OFFSET_WARN_SEGMENTS	20
-	if (MultiXactState->oldestOffsetKnown &&
-		MultiXactOffsetWouldWrap(MultiXactState->offsetStopLimit, nextOffset,
-								 nmembers))
-	{
-		/* see comment in the corresponding offsets wraparound case */
-		SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER);
-
-		ereport(ERROR,
-				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg("multixact \"members\" limit exceeded"),
-				 errdetail_plural("This command would create a multixact with %u members, but the remaining space is only enough for %u member.",
-								  "This command would create a multixact with %u members, but the remaining space is only enough for %u members.",
-								  MultiXactState->offsetStopLimit - nextOffset - 1,
-								  nmembers,
-								  MultiXactState->offsetStopLimit - nextOffset - 1),
-				 errhint("Execute a database-wide VACUUM in database with OID %u with reduced \"vacuum_multixact_freeze_min_age\" and \"vacuum_multixact_freeze_table_age\" settings.",
-						 MultiXactState->oldestMultiXactDB)));
-	}
-
-	/*
-	 * Check whether we should kick autovacuum into action, to prevent members
-	 * wraparound. NB we use a much larger window to trigger autovacuum than
-	 * just the warning limit. The warning is just a measure of last resort -
-	 * this is in line with GetNewTransactionId's behaviour.
-	 */
-	if (!MultiXactState->oldestOffsetKnown ||
-		(MultiXactState->nextOffset - MultiXactState->oldestOffset
-		 > MULTIXACT_MEMBER_SAFE_THRESHOLD))
-	{
-		/*
-		 * To avoid swamping the postmaster with signals, we issue the autovac
-		 * request only when crossing a segment boundary. With default
-		 * compilation settings that's roughly after 50k members.  This still
-		 * gives plenty of chances before we get into real trouble.
-		 */
-		if ((MXOffsetToMemberPage(nextOffset) / SLRU_PAGES_PER_SEGMENT) !=
-			(MXOffsetToMemberPage(nextOffset + nmembers) / SLRU_PAGES_PER_SEGMENT))
-			SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER);
-	}
-
-	if (MultiXactState->oldestOffsetKnown &&
-		MultiXactOffsetWouldWrap(MultiXactState->offsetStopLimit,
-								 nextOffset,
-								 nmembers + MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT * OFFSET_WARN_SEGMENTS))
-		ereport(WARNING,
-				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg_plural("database with OID %u must be vacuumed before %d more multixact member is used",
-							   "database with OID %u must be vacuumed before %d more multixact members are used",
-							   MultiXactState->offsetStopLimit - nextOffset + nmembers,
-							   MultiXactState->oldestMultiXactDB,
-							   MultiXactState->offsetStopLimit - nextOffset + nmembers),
-				 errhint("Execute a database-wide VACUUM in that database with reduced \"vacuum_multixact_freeze_min_age\" and \"vacuum_multixact_freeze_table_age\" settings.")));
-
 	ExtendMultiXactMember(nextOffset, nmembers);
 
 	/*
@@ -1976,7 +1891,7 @@ MultiXactShmemInit(void)
 				  "pg_multixact/offsets", LWTRANCHE_MULTIXACTOFFSET_BUFFER,
 				  LWTRANCHE_MULTIXACTOFFSET_SLRU,
 				  SYNC_HANDLER_MULTIXACT_OFFSET,
-				  false);
+				  true);
 	SlruPagePrecedesUnitTests(MultiXactOffsetCtl, MULTIXACT_OFFSETS_PER_PAGE);
 	SimpleLruInit(MultiXactMemberCtl,
 				  "multixact_member", multixact_member_buffers, 0,
@@ -2721,8 +2636,6 @@ SetOffsetVacuumLimit(bool is_startup)
 	MultiXactOffset nextOffset;
 	bool		oldestOffsetKnown = false;
 	bool		prevOldestOffsetKnown;
-	MultiXactOffset offsetStopLimit = 0;
-	MultiXactOffset prevOffsetStopLimit;
 
 	/*
 	 * NB: Have to prevent concurrent truncation, we might otherwise try to
@@ -2737,7 +2650,6 @@ SetOffsetVacuumLimit(bool is_startup)
 	nextOffset = MultiXactState->nextOffset;
 	prevOldestOffsetKnown = MultiXactState->oldestOffsetKnown;
 	prevOldestOffset = MultiXactState->oldestOffset;
-	prevOffsetStopLimit = MultiXactState->offsetStopLimit;
 	Assert(MultiXactState->finishedStartup);
 	LWLockRelease(MultiXactGenLock);
 
@@ -2768,11 +2680,7 @@ SetOffsetVacuumLimit(bool is_startup)
 		oldestOffsetKnown =
 			find_multixact_start(oldestMultiXactId, &oldestOffset);
 
-		if (oldestOffsetKnown)
-			ereport(DEBUG1,
-					(errmsg_internal("oldest MultiXactId member is at offset %u",
-									 oldestOffset)));
-		else
+		if (!oldestOffsetKnown)
 			ereport(LOG,
 					(errmsg("MultiXact member wraparound protections are disabled because oldest checkpointed MultiXact %u does not exist on disk",
 							oldestMultiXactId)));
@@ -2785,24 +2693,7 @@ SetOffsetVacuumLimit(bool is_startup)
 	 * overrun of old data in the members SLRU area. We can only do so if the
 	 * oldest offset is known though.
 	 */
-	if (oldestOffsetKnown)
-	{
-		/* move back to start of the corresponding segment */
-		offsetStopLimit = oldestOffset - (oldestOffset %
-										  (MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT));
-
-		/* always leave one segment before the wraparound point */
-		offsetStopLimit -= (MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT);
-
-		if (!prevOldestOffsetKnown && !is_startup)
-			ereport(LOG,
-					(errmsg("MultiXact member wraparound protections are now enabled")));
-
-		ereport(DEBUG1,
-				(errmsg_internal("MultiXact member stop limit is now %u based on MultiXact %u",
-								 offsetStopLimit, oldestMultiXactId)));
-	}
-	else if (prevOldestOffsetKnown)
+	if (prevOldestOffsetKnown)
 	{
 		/*
 		 * If we failed to get the oldest offset this time, but we have a
@@ -2812,14 +2703,12 @@ SetOffsetVacuumLimit(bool is_startup)
 		 */
 		oldestOffset = prevOldestOffset;
 		oldestOffsetKnown = true;
-		offsetStopLimit = prevOffsetStopLimit;
 	}
 
 	/* Install the computed values */
 	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
 	MultiXactState->oldestOffset = oldestOffset;
 	MultiXactState->oldestOffsetKnown = oldestOffsetKnown;
-	MultiXactState->offsetStopLimit = offsetStopLimit;
 	LWLockRelease(MultiXactGenLock);
 
 	/*
@@ -2829,54 +2718,6 @@ SetOffsetVacuumLimit(bool is_startup)
 		(nextOffset - oldestOffset > MULTIXACT_MEMBER_SAFE_THRESHOLD);
 }
 
-/*
- * Return whether adding "distance" to "start" would move past "boundary".
- *
- * We use this to determine whether the addition is "wrapping around" the
- * boundary point, hence the name.  The reason we don't want to use the regular
- * 2^31-modulo arithmetic here is that we want to be able to use the whole of
- * the 2^32-1 space here, allowing for more multixacts than would fit
- * otherwise.
- */
-static bool
-MultiXactOffsetWouldWrap(MultiXactOffset boundary, MultiXactOffset start,
-						 uint32 distance)
-{
-	MultiXactOffset finish;
-
-	/*
-	 * Note that offset number 0 is not used (see GetMultiXactIdMembers), so
-	 * if the addition wraps around the UINT_MAX boundary, skip that value.
-	 */
-	finish = start + distance;
-	if (finish < start)
-		finish++;
-
-	/*-----------------------------------------------------------------------
-	 * When the boundary is numerically greater than the starting point, any
-	 * value numerically between the two is not wrapped:
-	 *
-	 *	<----S----B---->
-	 *	[---)			 = F wrapped past B (and UINT_MAX)
-	 *		 [---)		 = F not wrapped
-	 *			  [----] = F wrapped past B
-	 *
-	 * When the boundary is numerically less than the starting point (i.e. the
-	 * UINT_MAX wraparound occurs somewhere in between) then all values in
-	 * between are wrapped:
-	 *
-	 *	<----B----S---->
-	 *	[---)			 = F not wrapped past B (but wrapped past UINT_MAX)
-	 *		 [---)		 = F wrapped past B (and UINT_MAX)
-	 *			  [----] = F not wrapped
-	 *-----------------------------------------------------------------------
-	 */
-	if (start < boundary)
-		return finish >= boundary || finish < start;
-	else
-		return finish >= boundary && finish < start;
-}
-
 /*
  * Find the starting offset of the given MultiXactId.
  *
@@ -2998,8 +2839,9 @@ MultiXactMemberFreezeThreshold(void)
 	 * we try to eliminate from the system is based on how far we are past
 	 * MULTIXACT_MEMBER_SAFE_THRESHOLD.
 	 */
-	fraction = (double) (members - MULTIXACT_MEMBER_SAFE_THRESHOLD) /
-		(MULTIXACT_MEMBER_DANGER_THRESHOLD - MULTIXACT_MEMBER_SAFE_THRESHOLD);
+	fraction = (double) (members - MULTIXACT_MEMBER_SAFE_THRESHOLD);
+	fraction /= (double) (MULTIXACT_MEMBER_DANGER_THRESHOLD - MULTIXACT_MEMBER_SAFE_THRESHOLD);
+
 	victim_multixacts = multixacts * fraction;
 
 	/* fraction could be > 1.0, but lowest possible freeze age is zero */
@@ -3345,7 +3187,7 @@ MultiXactIdPrecedesOrEquals(MultiXactId multi1, MultiXactId multi2)
 static bool
 MultiXactOffsetPrecedes(MultiXactOffset offset1, MultiXactOffset offset2)
 {
-	int32		diff = (int32) (offset1 - offset2);
+	int64		diff = (int64) (offset1 - offset2);
 
 	return (diff < 0);
 }
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index 985cd06802..1af2ce4b93 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -264,7 +264,7 @@ main(int argc, char *argv[])
 
 			case 'O':
 				errno = 0;
-				set_mxoff = strtoul(optarg, &endptr, 0);
+				set_mxoff = strtou64(optarg, &endptr, 0);
 				if (endptr == optarg || *endptr != '\0' || errno != 0)
 				{
 					pg_log_error("invalid argument for option %s", "-O");
diff --git a/src/bin/pg_resetwal/t/001_basic.pl b/src/bin/pg_resetwal/t/001_basic.pl
index 9829e48106..f8a8eef44d 100644
--- a/src/bin/pg_resetwal/t/001_basic.pl
+++ b/src/bin/pg_resetwal/t/001_basic.pl
@@ -206,7 +206,7 @@ push @cmd,
   sprintf("%d,%d", hex($files[0]) == 0 ? 3 : hex($files[0]), hex($files[-1]));
 
 @files = get_slru_files('pg_multixact/offsets');
-$mult = 32 * $blcksz / 4;
+$mult = 32 * $blcksz / 8;
 # -m argument is "new,old"
 push @cmd, '-m',
   sprintf("%d,%d",
diff --git a/src/include/access/multixact.h b/src/include/access/multixact.h
index 7ffd256c74..90583634ec 100644
--- a/src/include/access/multixact.h
+++ b/src/include/access/multixact.h
@@ -27,7 +27,7 @@
 
 #define MultiXactIdIsValid(multi) ((multi) != InvalidMultiXactId)
 
-#define MaxMultiXactOffset	((MultiXactOffset) 0xFFFFFFFF)
+#define MaxMultiXactOffset	UINT64CONST(0xFFFFFFFFFFFFFFFF)
 
 /*
  * Possible multixact lock modes ("status").  The first four modes are for
diff --git a/src/include/c.h b/src/include/c.h
index dc1841346c..ccfb82b478 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -661,7 +661,7 @@ typedef uint32 SubTransactionId;
 /* MultiXactId must be equivalent to TransactionId, to fit in t_xmax */
 typedef TransactionId MultiXactId;
 
-typedef uint32 MultiXactOffset;
+typedef uint64 MultiXactOffset;
 
 typedef uint32 CommandId;
 
-- 
2.45.2
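The 001_basic.pl hunk above changes the multiplier from 32 * blcksz / 4 to 32 * blcksz / 8. The sketch below spells out that arithmetic. It assumes that 32 is the number of SLRU pages per segment and that the divisor is the size in bytes of one offset entry. The function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed to match SLRU_PAGES_PER_SEGMENT; the test script hard-codes 32. */
#define PAGES_PER_SEGMENT 32

/*
 * Hypothetical helper: number of offset entries that fit in one segment
 * file of pg_multixact/offsets for a given block size and entry size.
 */
uint64_t
offsets_per_segment(uint32_t blcksz, uint32_t entry_size)
{
	assert(entry_size > 0 && blcksz % entry_size == 0);

	return (uint64_t) PAGES_PER_SEGMENT * (blcksz / entry_size);
}
```

With the default 8192-byte block size this gives 65536 entries per segment for 4-byte offsets and 32768 for 8-byte offsets, so the same number of multixacts needs twice as many offsets segments after the change.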

v3-0001-Use-64-bit-format-output-for-multixact-offsets.patch (application/octet-stream)
From cc588f091a2c1970849a6e341ca1a8a79fc1a935 Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Wed, 7 Aug 2024 16:35:22 +0300
Subject: [PATCH v3 1/3] Use 64-bit format output for multixact offsets

Author: Maxim Orlov <orlovmg@gmail.com>
---
 src/backend/access/rmgrdesc/mxactdesc.c   |  9 ++++----
 src/backend/access/rmgrdesc/xlogdesc.c    |  4 ++--
 src/backend/access/transam/multixact.c    | 26 +++++++++++++----------
 src/backend/access/transam/xlogrecovery.c |  5 +++--
 src/bin/pg_controldata/pg_controldata.c   |  4 ++--
 src/bin/pg_resetwal/pg_resetwal.c         |  8 +++----
 6 files changed, 31 insertions(+), 25 deletions(-)
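
The hunks in this patch all follow one pattern: every %u that prints a MultiXactOffset becomes %llu, and the argument is cast to unsigned long long. A minimal standalone sketch of that pattern follows. The function name is hypothetical and plain snprintf stands in for appendStringInfo, elog, and ereport.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper: format a 64-bit offset portably.  The explicit cast
 * makes the argument type match %llu regardless of how uint64_t is defined
 * on the platform (unsigned long or unsigned long long).
 */
int
format_next_offset(char *buf, size_t buflen, uint64_t next_offset)
{
	assert(buf != NULL && buflen > 0);

	return snprintf(buf, buflen, "next MultiXactOffset: %llu",
					(unsigned long long) next_offset);
}
```

Because the cast also works while MultiXactOffset is still 32 bits wide, this patch can be applied on its own ahead of the typedef change in 0002.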

diff --git a/src/backend/access/rmgrdesc/mxactdesc.c b/src/backend/access/rmgrdesc/mxactdesc.c
index 3e8ad4d5ef..1b486de38c 100644
--- a/src/backend/access/rmgrdesc/mxactdesc.c
+++ b/src/backend/access/rmgrdesc/mxactdesc.c
@@ -65,8 +65,8 @@ multixact_desc(StringInfo buf, XLogReaderState *record)
 		xl_multixact_create *xlrec = (xl_multixact_create *) rec;
 		int			i;
 
-		appendStringInfo(buf, "%u offset %u nmembers %d: ", xlrec->mid,
-						 xlrec->moff, xlrec->nmembers);
+		appendStringInfo(buf, "%u offset %llu nmembers %d: ", xlrec->mid,
+						 (unsigned long long) xlrec->moff, xlrec->nmembers);
 		for (i = 0; i < xlrec->nmembers; i++)
 			out_member(buf, &xlrec->members[i]);
 	}
@@ -74,9 +74,10 @@ multixact_desc(StringInfo buf, XLogReaderState *record)
 	{
 		xl_multixact_truncate *xlrec = (xl_multixact_truncate *) rec;
 
-		appendStringInfo(buf, "offsets [%u, %u), members [%u, %u)",
+		appendStringInfo(buf, "offsets [%u, %u), members [%llu, %llu)",
 						 xlrec->startTruncOff, xlrec->endTruncOff,
-						 xlrec->startTruncMemb, xlrec->endTruncMemb);
+						 (unsigned long long) xlrec->startTruncMemb,
+						 (unsigned long long) xlrec->endTruncMemb);
 	}
 }
 
diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c
index 363294d623..aaa19c81c8 100644
--- a/src/backend/access/rmgrdesc/xlogdesc.c
+++ b/src/backend/access/rmgrdesc/xlogdesc.c
@@ -66,7 +66,7 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 		CheckPoint *checkpoint = (CheckPoint *) rec;
 
 		appendStringInfo(buf, "redo %X/%X; "
-						 "tli %u; prev tli %u; fpw %s; wal_level %s; xid %u:%u; oid %u; multi %u; offset %u; "
+						 "tli %u; prev tli %u; fpw %s; wal_level %s; xid %u:%u; oid %u; multi %u; offset %llu; "
 						 "oldest xid %u in DB %u; oldest multi %u in DB %u; "
 						 "oldest/newest commit timestamp xid: %u/%u; "
 						 "oldest running xid %u; %s",
@@ -79,7 +79,7 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 						 XidFromFullTransactionId(checkpoint->nextXid),
 						 checkpoint->nextOid,
 						 checkpoint->nextMulti,
-						 checkpoint->nextMultiOffset,
+						 (unsigned long long) checkpoint->nextMultiOffset,
 						 checkpoint->oldestXid,
 						 checkpoint->oldestXidDB,
 						 checkpoint->oldestMulti,
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 8c37d7eba7..ab90912ed3 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1264,7 +1264,8 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 
 	LWLockRelease(MultiXactGenLock);
 
-	debug_elog4(DEBUG2, "GetNew: returning %u offset %u", result, *offset);
+	debug_elog4(DEBUG2, "GetNew: returning %u offset %llu", result,
+				(unsigned long long) *offset);
 	return result;
 }
 
@@ -2293,8 +2294,9 @@ MultiXactGetCheckptMulti(bool is_shutdown,
 	LWLockRelease(MultiXactGenLock);
 
 	debug_elog6(DEBUG2,
-				"MultiXact: checkpoint is nextMulti %u, nextOffset %u, oldestMulti %u in DB %u",
-				*nextMulti, *nextMultiOffset, *oldestMulti, *oldestMultiDB);
+				"MultiXact: checkpoint is nextMulti %u, nextOffset %llu, oldestMulti %u in DB %u",
+				*nextMulti, (unsigned long long) *nextMultiOffset, *oldestMulti,
+				*oldestMultiDB);
 }
 
 /*
@@ -2328,8 +2330,8 @@ void
 MultiXactSetNextMXact(MultiXactId nextMulti,
 					  MultiXactOffset nextMultiOffset)
 {
-	debug_elog4(DEBUG2, "MultiXact: setting next multi to %u offset %u",
-				nextMulti, nextMultiOffset);
+	debug_elog4(DEBUG2, "MultiXact: setting next multi to %u offset %llu",
+				nextMulti, (unsigned long long) nextMultiOffset);
 	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
 	MultiXactState->nextMXact = nextMulti;
 	MultiXactState->nextOffset = nextMultiOffset;
@@ -2519,8 +2521,8 @@ MultiXactAdvanceNextMXact(MultiXactId minMulti,
 	}
 	if (MultiXactOffsetPrecedes(MultiXactState->nextOffset, minMultiOffset))
 	{
-		debug_elog3(DEBUG2, "MultiXact: setting next offset to %u",
-					minMultiOffset);
+		debug_elog3(DEBUG2, "MultiXact: setting next offset to %llu",
+					(unsigned long long) minMultiOffset);
 		MultiXactState->nextOffset = minMultiOffset;
 	}
 	LWLockRelease(MultiXactGenLock);
@@ -3211,11 +3213,12 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 
 	elog(DEBUG1, "performing multixact truncation: "
 		 "offsets [%u, %u), offsets segments [%llx, %llx), "
-		 "members [%u, %u), members segments [%llx, %llx)",
+		 "members [%llu, %llu), members segments [%llx, %llx)",
 		 oldestMulti, newOldestMulti,
 		 (unsigned long long) MultiXactIdToOffsetSegment(oldestMulti),
 		 (unsigned long long) MultiXactIdToOffsetSegment(newOldestMulti),
-		 oldestOffset, newOldestOffset,
+		 (unsigned long long) oldestOffset,
+		 (unsigned long long) newOldestOffset,
 		 (unsigned long long) MXOffsetToMemberSegment(oldestOffset),
 		 (unsigned long long) MXOffsetToMemberSegment(newOldestOffset));
 
@@ -3471,11 +3474,12 @@ multixact_redo(XLogReaderState *record)
 
 		elog(DEBUG1, "replaying multixact truncation: "
 			 "offsets [%u, %u), offsets segments [%llx, %llx), "
-			 "members [%u, %u), members segments [%llx, %llx)",
+			 "members [%llu, %llu), members segments [%llx, %llx)",
 			 xlrec.startTruncOff, xlrec.endTruncOff,
 			 (unsigned long long) MultiXactIdToOffsetSegment(xlrec.startTruncOff),
 			 (unsigned long long) MultiXactIdToOffsetSegment(xlrec.endTruncOff),
-			 xlrec.startTruncMemb, xlrec.endTruncMemb,
+			 (unsigned long long) xlrec.startTruncMemb,
+			 (unsigned long long) xlrec.endTruncMemb,
 			 (unsigned long long) MXOffsetToMemberSegment(xlrec.startTruncMemb),
 			 (unsigned long long) MXOffsetToMemberSegment(xlrec.endTruncMemb));
 
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 178491f6f5..0c5980a436 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -877,8 +877,9 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 							 U64FromFullTransactionId(checkPoint.nextXid),
 							 checkPoint.nextOid)));
 	ereport(DEBUG1,
-			(errmsg_internal("next MultiXactId: %u; next MultiXactOffset: %u",
-							 checkPoint.nextMulti, checkPoint.nextMultiOffset)));
+			(errmsg_internal("next MultiXactId: %u; next MultiXactOffset: %llu",
+							 checkPoint.nextMulti,
+							 (unsigned long long) checkPoint.nextMultiOffset)));
 	ereport(DEBUG1,
 			(errmsg_internal("oldest unfrozen transaction ID: %u, in database %u",
 							 checkPoint.oldestXid, checkPoint.oldestXidDB)));
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index 93a05d80ca..43b6727570 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -253,8 +253,8 @@ main(int argc, char *argv[])
 		   ControlFile->checkPointCopy.nextOid);
 	printf(_("Latest checkpoint's NextMultiXactId:  %u\n"),
 		   ControlFile->checkPointCopy.nextMulti);
-	printf(_("Latest checkpoint's NextMultiOffset:  %u\n"),
-		   ControlFile->checkPointCopy.nextMultiOffset);
+	printf(_("Latest checkpoint's NextMultiOffset:  %llu\n"),
+		   (unsigned long long) ControlFile->checkPointCopy.nextMultiOffset);
 	printf(_("Latest checkpoint's oldestXID:        %u\n"),
 		   ControlFile->checkPointCopy.oldestXid);
 	printf(_("Latest checkpoint's oldestXID's DB:   %u\n"),
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index e9dcb5a6d8..985cd06802 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -737,8 +737,8 @@ PrintControlValues(bool guessed)
 		   ControlFile.checkPointCopy.nextOid);
 	printf(_("Latest checkpoint's NextMultiXactId:  %u\n"),
 		   ControlFile.checkPointCopy.nextMulti);
-	printf(_("Latest checkpoint's NextMultiOffset:  %u\n"),
-		   ControlFile.checkPointCopy.nextMultiOffset);
+	printf(_("Latest checkpoint's NextMultiOffset:  %llu\n"),
+		   (unsigned long long) ControlFile.checkPointCopy.nextMultiOffset);
 	printf(_("Latest checkpoint's oldestXID:        %u\n"),
 		   ControlFile.checkPointCopy.oldestXid);
 	printf(_("Latest checkpoint's oldestXID's DB:   %u\n"),
@@ -809,8 +809,8 @@ PrintNewControlValues(void)
 
 	if (set_mxoff != -1)
 	{
-		printf(_("NextMultiOffset:                      %u\n"),
-			   ControlFile.checkPointCopy.nextMultiOffset);
+		printf(_("NextMultiOffset:                      %llu\n"),
+			   (unsigned long long) ControlFile.checkPointCopy.nextMultiOffset);
 	}
 
 	if (set_oid != 0)
-- 
2.45.2

v3-0003-Make-pg_upgrade-convert-multixact-offsets.patch
From 78cba2fcfbe11451ec6b8cd6e4c48b315571ab0d Mon Sep 17 00:00:00 2001
From: Maxim Orlov <m.orlov@postgrespro.ru>
Date: Tue, 13 Aug 2024 14:44:50 +0300
Subject: [PATCH v3 3/3] Make pg_upgrade convert multixact offsets.

Author: Maxim Orlov <orlovmg@gmail.com>
---
 src/bin/pg_upgrade/Makefile     |   1 +
 src/bin/pg_upgrade/meson.build  |   1 +
 src/bin/pg_upgrade/pg_upgrade.c |  29 ++-
 src/bin/pg_upgrade/pg_upgrade.h |  13 +-
 src/bin/pg_upgrade/segresize.c  | 350 ++++++++++++++++++++++++++++++++
 5 files changed, 390 insertions(+), 4 deletions(-)
 create mode 100644 src/bin/pg_upgrade/segresize.c

diff --git a/src/bin/pg_upgrade/Makefile b/src/bin/pg_upgrade/Makefile
index bde91e2beb..030816596f 100644
--- a/src/bin/pg_upgrade/Makefile
+++ b/src/bin/pg_upgrade/Makefile
@@ -21,6 +21,7 @@ OBJS = \
 	info.o \
 	option.o \
 	parallel.o \
+	segresize.o \
 	pg_upgrade.o \
 	relfilenumber.o \
 	server.o \
diff --git a/src/bin/pg_upgrade/meson.build b/src/bin/pg_upgrade/meson.build
index 9825fa3305..2d9f7e6b65 100644
--- a/src/bin/pg_upgrade/meson.build
+++ b/src/bin/pg_upgrade/meson.build
@@ -10,6 +10,7 @@ pg_upgrade_sources = files(
   'info.c',
   'option.c',
   'parallel.c',
+  'segresize.c',
   'pg_upgrade.c',
   'relfilenumber.c',
   'server.c',
diff --git a/src/bin/pg_upgrade/pg_upgrade.c b/src/bin/pg_upgrade/pg_upgrade.c
index 663235816f..d9d8d0ea78 100644
--- a/src/bin/pg_upgrade/pg_upgrade.c
+++ b/src/bin/pg_upgrade/pg_upgrade.c
@@ -750,7 +750,30 @@ copy_xact_xlog_xid(void)
 	if (old_cluster.controldata.cat_ver >= MULTIXACT_FORMATCHANGE_CAT_VER &&
 		new_cluster.controldata.cat_ver >= MULTIXACT_FORMATCHANGE_CAT_VER)
 	{
-		copy_subdir_files("pg_multixact/offsets", "pg_multixact/offsets");
+		/*
+		 * If the old server is before the MULTIXACTOFFSET_FORMATCHANGE_CAT_VER
+		 * it must have 32-bit multixid offsets, thus it should be converted.
+		 */
+		if (old_cluster.controldata.cat_ver < MULTIXACTOFFSET_FORMATCHANGE_CAT_VER &&
+			new_cluster.controldata.cat_ver >= MULTIXACTOFFSET_FORMATCHANGE_CAT_VER)
+		{
+			uint64	oldest_offset = convert_multixact_offsets();
+
+			if (oldest_offset)
+			{
+				uint64	next_offset = old_cluster.controldata.chkpnt_nxtmxoff;
+
+				/* Handle possible wraparound. */
+				if (next_offset < oldest_offset)
+					next_offset += ((uint64) 1 << 32) - 1;
+
+				next_offset -= oldest_offset - 1;
+				old_cluster.controldata.chkpnt_nxtmxoff = next_offset;
+			}
+		}
+		else
+			copy_subdir_files("pg_multixact/offsets", "pg_multixact/offsets");
+
 		copy_subdir_files("pg_multixact/members", "pg_multixact/members");
 
 		prep_status("Setting next multixact ID and offset for new cluster");
@@ -760,9 +783,9 @@ copy_xact_xlog_xid(void)
 		 * counters here and the oldest multi present on system.
 		 */
 		exec_prog(UTILITY_LOG_FILE, NULL, true, true,
-				  "\"%s/pg_resetwal\" -O %u -m %u,%u \"%s\"",
+				  "\"%s/pg_resetwal\" -O %llu -m %u,%u \"%s\"",
 				  new_cluster.bindir,
-				  old_cluster.controldata.chkpnt_nxtmxoff,
+				  (unsigned long long) old_cluster.controldata.chkpnt_nxtmxoff,
 				  old_cluster.controldata.chkpnt_nxtmulti,
 				  old_cluster.controldata.chkpnt_oldstMulti,
 				  new_cluster.pgdata);
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index cdb6e2b759..157e59e38f 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -114,6 +114,13 @@ extern char *output_files[];
  */
 #define MULTIXACT_FORMATCHANGE_CAT_VER 201301231
 
+/*
+ * Switch from 32-bit to 64-bit for multixid offsets.
+ *
+ * XXX: should be changed to the actual CATALOG_VERSION_NO on commit.
+ */
+#define MULTIXACTOFFSET_FORMATCHANGE_CAT_VER 202409041
+
 /*
  * large object chunk size added to pg_controldata,
  * commit 5f93c37805e7485488480916b4585e098d3cc883
@@ -230,7 +237,7 @@ typedef struct
 	uint32		chkpnt_nxtepoch;
 	uint32		chkpnt_nxtoid;
 	uint32		chkpnt_nxtmulti;
-	uint32		chkpnt_nxtmxoff;
+	uint64		chkpnt_nxtmxoff;
 	uint32		chkpnt_oldstMulti;
 	uint32		chkpnt_oldstxid;
 	uint32		align;
@@ -494,3 +501,7 @@ void		parallel_transfer_all_new_dbs(DbInfoArr *old_db_arr, DbInfoArr *new_db_arr
 										  char *old_pgdata, char *new_pgdata,
 										  char *old_tablespace);
 bool		reap_child(bool wait_for_child);
+
+/* segresize.c */
+
+uint64		convert_multixact_offsets(void);
diff --git a/src/bin/pg_upgrade/segresize.c b/src/bin/pg_upgrade/segresize.c
new file mode 100644
index 0000000000..e47c0a2407
--- /dev/null
+++ b/src/bin/pg_upgrade/segresize.c
@@ -0,0 +1,350 @@
+/*
+ *	segresize.c
+ *
+ *	SLRU segment resize utility
+ *
+ *	Copyright (c) 2024, PostgreSQL Global Development Group
+ *	src/bin/pg_upgrade/segresize.c
+ */
+
+#include "postgres_fe.h"
+
+#include "pg_upgrade.h"
+#include "access/multixact.h"
+
+/* See slru.h */
+#define SLRU_PAGES_PER_SEGMENT		32
+
+/*
+ * Some kind of iterator associated with a particular SLRU segment.  The idea is
+ * to specify the segment and page number and then move through the pages.
+ */
+typedef struct SlruSegState
+{
+	char	   *dir;
+	char	   *fn;
+	FILE	   *file;
+	int64		segno;
+	uint64		pageno;
+	bool		leading_gap;
+	bool		long_segment_names;
+} SlruSegState;
+
+/*
+ * Get SLRU segment file name from state.
+ *
+ * NOTE: this function should mirror SlruFileName call.
+ */
+static inline char *
+SlruFileName(SlruSegState *state)
+{
+	if (state->long_segment_names)
+	{
+		Assert(state->segno >= 0 &&
+			   state->segno <= INT64CONST(0xFFFFFFFFFFFFFFF));
+		return psprintf("%s/%015llX", state->dir, (long long) state->segno);
+	}
+	else
+	{
+		Assert(state->segno >= 0 &&
+			   state->segno <= INT64CONST(0xFFFFFF));
+		return psprintf("%s/%04X", state->dir, (unsigned int) state->segno);
+	}
+}
+
+/*
+ * Create SLRU segment file.
+ */
+static void
+create_segment(SlruSegState *state)
+{
+	Assert(state->fn == NULL);
+	Assert(state->file == NULL);
+
+	state->fn = SlruFileName(state);
+	state->file = fopen(state->fn, "wb");
+	if (!state->file)
+		pg_fatal("could not create file \"%s\": %m", state->fn);
+}
+
+/*
+ * Open existing SLRU segment file.
+ */
+static void
+open_segment(SlruSegState *state)
+{
+	Assert(state->fn == NULL);
+	Assert(state->file == NULL);
+
+	state->fn = SlruFileName(state);
+	state->file = fopen(state->fn, "rb");
+	if (!state->file)
+		pg_fatal("could not open file \"%s\": %m", state->fn);
+}
+
+/*
+ * Close SLRU segment file.
+ */
+static void
+close_segment(SlruSegState *state)
+{
+	if (state->file)
+	{
+		fclose(state->file);
+		state->file = NULL;
+	}
+
+	if (state->fn)
+	{
+		pfree(state->fn);
+		state->fn = NULL;
+	}
+}
+
+/*
+ * Read next page from the old 32-bit offset segment file.
+ */
+static int
+read_old_segment_page(SlruSegState *state, void *buf, bool *empty)
+{
+	int		len;
+
+	/* Open next segment file, if needed. */
+	if (!state->fn)
+	{
+		if (!state->segno)
+			state->leading_gap = true;
+
+		open_segment(state);
+
+		/* Set position to the needed page. */
+		if (state->pageno > 0 &&
+			fseek(state->file, state->pageno * BLCKSZ, SEEK_SET))
+		{
+			close_segment(state);
+		}
+	}
+
+	if (state->file)
+	{
+		/* Segment file does exist, read page from it. */
+		state->leading_gap = false;
+
+		len = fread(buf, sizeof(char), BLCKSZ, state->file);
+
+		/* Are we done or was there an error? */
+		if (len <= 0)
+		{
+			if (ferror(state->file))
+				pg_fatal("error reading file \"%s\": %m", state->fn);
+
+			if (feof(state->file))
+			{
+				*empty = true;
+				len = -1;
+
+				close_segment(state);
+			}
+		}
+		else
+			*empty = false;
+	}
+	else if (!state->leading_gap)
+	{
+		/* We reached the last segment. */
+		len = -1;
+		*empty = true;
+	}
+	else
+	{
+		/* Skip few first segments if they were frozen and removed. */
+		len = BLCKSZ;
+		*empty = true;
+	}
+
+	if (++state->pageno >= SLRU_PAGES_PER_SEGMENT)
+	{
+		/* Start a new segment. */
+		state->segno++;
+		state->pageno = 0;
+
+		close_segment(state);
+	}
+
+	return len;
+}
+
+/*
+ * Write next page to the new 64-bit offset segment file.
+ */
+static void
+write_new_segment_page(SlruSegState *state, void *buf)
+{
+	/*
+	 * Create a new segment file if we still didn't.  Creation is
+	 * postponed until the first non-empty page is found.  This helps
+	 * not to create completely empty segments.
+	 */
+	if (!state->file)
+	{
+		create_segment(state);
+
+		/* Write zeroes to the previously skipped prefix. */
+		if (state->pageno > 0)
+		{
+			char		zerobuf[BLCKSZ] = {0};
+
+			for (int64 i = 0; i < state->pageno; i++)
+			{
+				if (fwrite(zerobuf, sizeof(char), BLCKSZ, state->file) != BLCKSZ)
+					pg_fatal("could not write file \"%s\": %m", state->fn);
+			}
+		}
+	}
+
+	/* Write page to the new segment (if it was created). */
+	if (state->file)
+	{
+		if (fwrite(buf, sizeof(char), BLCKSZ, state->file) != BLCKSZ)
+			pg_fatal("could not write file \"%s\": %m", state->fn);
+	}
+
+	state->pageno++;
+
+	/*
+	 * Did we reach the maximum page number?  Then close segment file
+	 * and create a new one on the next iteration.
+	 */
+	if (state->pageno >= SLRU_PAGES_PER_SEGMENT)
+	{
+		state->segno++;
+		state->pageno = 0;
+		close_segment(state);
+	}
+}
+
+/*
+ * Convert pg_multixact/offsets segments and return oldest multi offset.
+ */
+uint64
+convert_multixact_offsets(void)
+{
+	/* See multixact.c */
+#define MULTIXACT_OFFSETS_PER_PAGE_OLD	(BLCKSZ / sizeof(uint32))
+#define MULTIXACT_OFFSETS_PER_PAGE		(BLCKSZ / sizeof(MultiXactOffset))
+
+	SlruSegState	oldseg = {0},
+					newseg = {0};
+	uint32			oldbuf[MULTIXACT_OFFSETS_PER_PAGE_OLD] = {0};
+	MultiXactOffset	newbuf[MULTIXACT_OFFSETS_PER_PAGE] = {0};
+	/*
+	 * It is much easier to deal with multi wraparound in 64-bit format.  Thus
+	 * we use 64 bits for multi-transactions, although they remain 32 bits.
+	 */
+	uint64			oldest_multi = old_cluster.controldata.chkpnt_oldstMulti,
+					next_multi = old_cluster.controldata.chkpnt_nxtmulti,
+					multi,
+					old_entry,
+					new_entry;
+	bool			found = false;
+	uint64			oldest_offset = 0;
+
+	prep_status("Converting pg_multixact/offsets to 64-bit");
+
+	oldseg.pageno = oldest_multi / MULTIXACT_OFFSETS_PER_PAGE_OLD;
+	oldseg.segno = oldseg.pageno / SLRU_PAGES_PER_SEGMENT;
+	oldseg.pageno %= SLRU_PAGES_PER_SEGMENT;
+	oldseg.dir = psprintf("%s/pg_multixact/offsets", old_cluster.pgdata);
+	oldseg.long_segment_names = false;		/* old format XXXX */
+
+	newseg.pageno = oldest_multi / MULTIXACT_OFFSETS_PER_PAGE;
+	newseg.segno = newseg.pageno / SLRU_PAGES_PER_SEGMENT;
+	newseg.pageno %= SLRU_PAGES_PER_SEGMENT;
+	newseg.dir = psprintf("%s/pg_multixact/offsets", new_cluster.pgdata);
+	newseg.long_segment_names = true;
+
+	old_entry = oldest_multi % MULTIXACT_OFFSETS_PER_PAGE_OLD;
+	new_entry = oldest_multi % MULTIXACT_OFFSETS_PER_PAGE;
+
+	if (next_multi < oldest_multi)
+		next_multi += (uint64) 1 << 32;		/* wraparound */
+
+	for (multi = oldest_multi; multi < next_multi; old_entry = 0)
+	{
+		int			oldlen;
+		bool		empty;
+
+		/* Handle possible segment wraparound. */
+		if (oldseg.segno > MaxMultiXactId /
+								MULTIXACT_OFFSETS_PER_PAGE_OLD /
+								SLRU_PAGES_PER_SEGMENT)
+			oldseg.segno = 0;
+
+		/* Read old offset segment. */
+		oldlen = read_old_segment_page(&oldseg, oldbuf, &empty);
+		if (oldlen <= 0 || empty)
+			pg_fatal("cannot read page %llu from file \"%s\": %m",
+					 (unsigned long long) oldseg.pageno, oldseg.fn);
+
+		/* Fill possible gap. */
+		if (oldlen < BLCKSZ)
+			memset((char *) oldbuf + oldlen, 0, BLCKSZ - oldlen);
+
+		/* Save oldest multi offset */
+		if (!found)
+		{
+			oldest_offset = oldbuf[old_entry];
+			found = true;
+		}
+
+		/* ... skip wrapped-around invalid multi */
+		if (multi == (uint64) 1 << 32)
+		{
+			Assert(oldseg.segno == 0);
+			Assert(oldseg.pageno == 1);
+			Assert(old_entry == 0);
+
+			multi += FirstMultiXactId;
+			old_entry = FirstMultiXactId;
+		}
+
+		/* Copy entries to the new page. */
+		for (; multi < next_multi && old_entry < MULTIXACT_OFFSETS_PER_PAGE_OLD;
+			 multi++, old_entry++)
+		{
+			MultiXactOffset offset = oldbuf[old_entry];
+
+			/* Handle possible offset wraparound. */
+			if (offset < oldest_offset)
+				offset += ((uint64) 1 << 32) - 1;
+
+			/* Subtract oldest_offset, so new offsets will start from 1. */
+			newbuf[new_entry++] = offset - oldest_offset + 1;
+			if (new_entry >= MULTIXACT_OFFSETS_PER_PAGE)
+			{
+				/* Write a new page. */
+				write_new_segment_page(&newseg, newbuf);
+				new_entry = 0;
+			}
+		}
+	}
+
+	/* Write the last incomplete page. */
+	if (new_entry > 0 || oldest_multi == next_multi)
+	{
+		memset(&newbuf[new_entry], 0,
+			   sizeof(newbuf[0]) * (MULTIXACT_OFFSETS_PER_PAGE - new_entry));
+		write_new_segment_page(&newseg, newbuf);
+	}
+
+	/* Release resources. */
+	close_segment(&oldseg);
+	close_segment(&newseg);
+
+	pfree(oldseg.dir);
+	pfree(newseg.dir);
+
+	check_ok();
+
+	return oldest_offset;
+}
-- 
2.45.2
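
The offset rebasing done in convert_multixact_offsets() above can be looked at in isolation. The following is only a minimal sketch of that arithmetic, not code from the patch: the names demo_rebase_offset and DEMO_OLD_OFFSET_SPAN are hypothetical, and it assumes, as the patch's wraparound handling does, that old 32-bit offsets skip the value 0 when they wrap.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Number of distinct valid old offsets, assuming 0 is skipped on wraparound
 * (this mirrors the "((uint64) 1 << 32) - 1" term used in the patch).
 */
#define DEMO_OLD_OFFSET_SPAN ((UINT64_C(1) << 32) - 1)

/*
 * Rebase an old 32-bit member offset so that the oldest offset maps to 1.
 * Hypothetical helper for illustration only.
 */
static uint64_t
demo_rebase_offset(uint32_t old_offset, uint32_t oldest_offset)
{
	uint64_t	offset = old_offset;

	/* Assumption of this sketch: the oldest offset is a valid, nonzero one. */
	assert(oldest_offset != 0);

	/* An offset numerically below the oldest one must have wrapped around. */
	if (offset < oldest_offset)
		offset += DEMO_OLD_OFFSET_SPAN;

	/* Shift everything so that numbering starts from 1 again. */
	return offset - oldest_offset + 1;
}
```

With this mapping, the new offsets file starts counting from 1 regardless of where the old cluster's oldest offset was.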

#11Pavel Borisov
pashkin.elfe@gmail.com
In reply to: Maxim Orlov (#10)
Re: POC: make mxidoff 64 bits

Hi, Maxim!

Previously we accessed offsets in the shared MultiXactState without locks, since a
32-bit read is always atomic. But I'm not sure that still holds once the offset
becomes 64-bit.
E.g. GetNewMultiXactId():

nextOffset = MultiXactState->nextOffset;
is outside the lock.

There might be other places where we do the same as well.

Regards,
Pavel Borisov
Supabase

#12Pavel Borisov
pashkin.elfe@gmail.com
In reply to: Pavel Borisov (#11)
Re: POC: make mxidoff 64 bits

On Thu, 12 Sept 2024 at 16:09, Pavel Borisov <pashkin.elfe@gmail.com> wrote:

Hi, Maxim!

Previously we accessed offsets in the shared MultiXactState without locks, since a
32-bit read is always atomic. But I'm not sure that still holds once the offset
becomes 64-bit.
E.g. GetNewMultiXactId():

nextOffset = MultiXactState->nextOffset;
is outside the lock.

There might be other places where we do the same as well.

I think replacing the plain assignments with
pg_atomic_read_u64/pg_atomic_write_u64 would be sufficient.
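
To illustrate the idea outside of PostgreSQL, here is a minimal sketch using standard C11 atomics as a stand-in for the pg_atomic_* wrappers. The struct and function names (DemoMultiXactState, demo_read_next_offset, demo_write_next_offset) are hypothetical and exist only for this example; the real shared state and its locking rules are different.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the shared state; not the real MultiXactStateData. */
typedef struct DemoMultiXactState
{
	_Atomic uint64_t nextOffset;
} DemoMultiXactState;

/* Read the 64-bit offset as one indivisible load, without taking a lock. */
static uint64_t
demo_read_next_offset(DemoMultiXactState *state)
{
	assert(state != NULL);
	return atomic_load(&state->nextOffset);
}

/* Store a new offset so that readers never observe a half-written value. */
static void
demo_write_next_offset(DemoMultiXactState *state, uint64_t value)
{
	assert(state != NULL);
	atomic_store(&state->nextOffset, value);
}
```

On platforms without native 64-bit atomic loads and stores, a plain assignment may be split into two 32-bit accesses, which is the torn read this kind of wrapper is meant to rule out.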

(I think the same is needed for the patchset [1].)

[1]: /messages/by-id/CAJ7c6TMvPz8q+nC=JoKniy7yxPzQYcCTnNFYmsDP-nnWsAOJ2g@mail.gmail.com

Regards,
Pavel Borisov

#13Alvaro Herrera
alvherre@alvh.no-ip.org
In reply to: Pavel Borisov (#11)
Re: POC: make mxidoff 64 bits

On 2024-Sep-12, Pavel Borisov wrote:

Hi, Maxim!

Previously we accessed offsets in the shared MultiXactState without locks, since a
32-bit read is always atomic. But I'm not sure that still holds once the offset
becomes 64-bit.
E.g. GetNewMultiXactId():

nextOffset = MultiXactState->nextOffset;
is outside the lock.

Good thought. But fortunately I think it's not a problem. The one you
mention is done with MultiXactGenLock held in shared mode -- and that works OK,
as the assignment (in line 1263 at the bottom of the same routine) is
done with the exclusive lock held.

--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/

#14Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Maxim Orlov (#10)
Re: POC: make mxidoff 64 bits

On 07/09/2024 07:36, Maxim Orlov wrote:

Here is v3.

MultiXactMemberFreezeThreshold looks quite bogus now. Now that
MaxMultiXactOffset==2^64-1, you cannot get anywhere near the
MULTIXACT_MEMBER_SAFE_THRESHOLD and MULTIXACT_MEMBER_DANGER_THRESHOLD
values anymore. Can we just get rid of MultiXactMemberFreezeThreshold? I
guess it would still be useful to trigger autovacuum if multixact
members grow large though, to release the disk space, even if you can't
run out of members as such anymore. What should the logic for that look
like?

I'd love to see some tests for the pg_upgrade code. Something like a
little perl script to generate test clusters with different wraparound
scenarios etc. using the old version, and a TAP test to run pg_upgrade
on them and verify that queries on the upgraded cluster work correctly.
We don't have tests like that in the repository today, and I don't know
if we'd want to commit these permanently either, but it would be highly
useful now as a one-off thing, to show that the code works.

On upgrade, are there really no changes required to
pg_multixact/members? I imagined that the segment files would need to be
renamed around wraparound, so that if you previously had files like this:

pg_multixact/members/FFFE
pg_multixact/members/FFFF
pg_multixact/members/0000
pg_multixact/members/0001

after upgrade you would need to have:

pg_multixact/members/00000000FFFE
pg_multixact/members/00000000FFFF
pg_multixact/members/000000010000
pg_multixact/members/000000010001
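
The renaming in the listing above can be sketched as follows. This is only an illustration under stated assumptions: the helper names are hypothetical, the old segment numbers are assumed to wrap after 0xFFFF as in the example, and the 12-digit width is taken from the example file names rather than from the patch.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Old member segment numbers wrap after 0xFFFF, as in the listing above. */
#define DEMO_OLD_SEGNO_SPAN UINT64_C(0x10000)

/*
 * Map an old, possibly wrapped segment number to a monotonically increasing
 * one, given the oldest segment that still exists.  Hypothetical helper.
 */
static uint64_t
demo_unwrap_segno(uint64_t old_segno, uint64_t oldest_segno)
{
	assert(old_segno < DEMO_OLD_SEGNO_SPAN);
	assert(oldest_segno < DEMO_OLD_SEGNO_SPAN);

	/* A segment numbered below the oldest one lies past the wraparound. */
	if (old_segno < oldest_segno)
		old_segno += DEMO_OLD_SEGNO_SPAN;
	return old_segno;
}

/* Format the name with 12 hex digits, the width used in the example above. */
static void
demo_segment_name(uint64_t segno, char *buf, size_t buflen)
{
	snprintf(buf, buflen, "%012" PRIX64, segno);
}
```

For instance, with FFFE as the oldest segment, old segment 0001 would map to 000000010001 under these assumptions.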

Thanks for working on this!

--
Heikki Linnakangas
Neon (https://neon.tech)

#15Maxim Orlov
orlovmg@gmail.com
In reply to: Heikki Linnakangas (#14)
4 attachment(s)
Re: POC: make mxidoff 64 bits

On Tue, 22 Oct 2024 at 12:43, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

MultiXactMemberFreezeThreshold looks quite bogus now. Now that
MaxMultiXactOffset==2^64-1, you cannot get anywhere near the
MULTIXACT_MEMBER_SAFE_THRESHOLD and MULTIXACT_MEMBER_DANGER_THRESHOLD
values anymore. Can we just get rid of MultiXactMemberFreezeThreshold? I
guess it would still be useful to trigger autovacuum if multixact
members grow large though, to release the disk space, even if you can't
run out of members as such anymore. What should the logic for that look
like?

Yep, you're totally correct. The MultiXactMemberFreezeThreshold call is no
longer necessary and can be safely removed.
I made this a separate commit in v4. But, as you rightly say, it would still be
useful to trigger autovacuum in some cases. The obvious
place for this machinery is GetNewMultiXactId. I imagine something like
"if nextOff - oldestOff > threshold, kick autovac". So, the
question is: what kind of threshold do we want here? Should it be a hard-coded define
or a GUC? If it is a GUC (32-bit), what values should it take?
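
The check described above could be sketched roughly like this. It is not part of the patch set: the function name and the way the threshold is passed in are assumptions made only for illustration, and the open question of whether the threshold is a define or a GUC is left untouched.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Decide whether the amount of live multixact member space is large enough
 * to ask autovacuum for help.  Hypothetical helper; the threshold is simply
 * a parameter here.
 */
static bool
demo_should_kick_autovacuum(uint64_t next_off, uint64_t oldest_off,
							uint64_t threshold)
{
	/* With 64-bit offsets there is no wraparound to account for. */
	assert(next_off >= oldest_off);

	return next_off - oldest_off > threshold;
}
```

Because the offsets are 64-bit and never wrap in practice, the plain subtraction is enough; the old 32-bit code had to reason about wraparound at this point.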

There is another issue that I feel a little regretful about: we must still
hold MultiXactGenLock in order to track oldestOffset for the
"nextOff - oldestOff" calculation.

I'd love to see some tests for the pg_upgrade code. Something like a
little perl script to generate test clusters with different wraparound
scenarios etc.

Agreed. I'll address this as soon as I can.

--
Best regards,
Maxim Orlov.

Attachments:

v4-0004-Make-pg_upgrade-convert-multixact-offsets.patch
From 7432d8bd1fb2343bd873a21ba757c115d8a2dd59 Mon Sep 17 00:00:00 2001
From: Maxim Orlov <m.orlov@postgrespro.ru>
Date: Tue, 13 Aug 2024 14:44:50 +0300
Subject: [PATCH v4 4/4] Make pg_upgrade convert multixact offsets.

Author: Maxim Orlov <orlovmg@gmail.com>
---
 src/bin/pg_upgrade/Makefile     |   1 +
 src/bin/pg_upgrade/meson.build  |   1 +
 src/bin/pg_upgrade/pg_upgrade.c |  29 ++-
 src/bin/pg_upgrade/pg_upgrade.h |  13 +-
 src/bin/pg_upgrade/segresize.c  | 350 ++++++++++++++++++++++++++++++++
 5 files changed, 390 insertions(+), 4 deletions(-)
 create mode 100644 src/bin/pg_upgrade/segresize.c

diff --git a/src/bin/pg_upgrade/Makefile b/src/bin/pg_upgrade/Makefile
index f83d2b5d30..70908d63a3 100644
--- a/src/bin/pg_upgrade/Makefile
+++ b/src/bin/pg_upgrade/Makefile
@@ -21,6 +21,7 @@ OBJS = \
 	info.o \
 	option.o \
 	parallel.o \
+	segresize.o \
 	pg_upgrade.o \
 	relfilenumber.o \
 	server.o \
diff --git a/src/bin/pg_upgrade/meson.build b/src/bin/pg_upgrade/meson.build
index 3d88419674..16f898ba14 100644
--- a/src/bin/pg_upgrade/meson.build
+++ b/src/bin/pg_upgrade/meson.build
@@ -10,6 +10,7 @@ pg_upgrade_sources = files(
   'info.c',
   'option.c',
   'parallel.c',
+  'segresize.c',
   'pg_upgrade.c',
   'relfilenumber.c',
   'server.c',
diff --git a/src/bin/pg_upgrade/pg_upgrade.c b/src/bin/pg_upgrade/pg_upgrade.c
index 663235816f..d9d8d0ea78 100644
--- a/src/bin/pg_upgrade/pg_upgrade.c
+++ b/src/bin/pg_upgrade/pg_upgrade.c
@@ -750,7 +750,30 @@ copy_xact_xlog_xid(void)
 	if (old_cluster.controldata.cat_ver >= MULTIXACT_FORMATCHANGE_CAT_VER &&
 		new_cluster.controldata.cat_ver >= MULTIXACT_FORMATCHANGE_CAT_VER)
 	{
-		copy_subdir_files("pg_multixact/offsets", "pg_multixact/offsets");
+		/*
+		 * If the old server is before the MULTIXACTOFFSET_FORMATCHANGE_CAT_VER
+		 * it must have 32-bit multixid offsets, thus it should be converted.
+		 */
+		if (old_cluster.controldata.cat_ver < MULTIXACTOFFSET_FORMATCHANGE_CAT_VER &&
+			new_cluster.controldata.cat_ver >= MULTIXACTOFFSET_FORMATCHANGE_CAT_VER)
+		{
+			uint64	oldest_offset = convert_multixact_offsets();
+
+			if (oldest_offset)
+			{
+				uint64	next_offset = old_cluster.controldata.chkpnt_nxtmxoff;
+
+				/* Handle possible wraparound. */
+				if (next_offset < oldest_offset)
+					next_offset += ((uint64) 1 << 32) - 1;
+
+				next_offset -= oldest_offset - 1;
+				old_cluster.controldata.chkpnt_nxtmxoff = next_offset;
+			}
+		}
+		else
+			copy_subdir_files("pg_multixact/offsets", "pg_multixact/offsets");
+
 		copy_subdir_files("pg_multixact/members", "pg_multixact/members");
 
 		prep_status("Setting next multixact ID and offset for new cluster");
@@ -760,9 +783,9 @@ copy_xact_xlog_xid(void)
 		 * counters here and the oldest multi present on system.
 		 */
 		exec_prog(UTILITY_LOG_FILE, NULL, true, true,
-				  "\"%s/pg_resetwal\" -O %u -m %u,%u \"%s\"",
+				  "\"%s/pg_resetwal\" -O %llu -m %u,%u \"%s\"",
 				  new_cluster.bindir,
-				  old_cluster.controldata.chkpnt_nxtmxoff,
+				  (unsigned long long) old_cluster.controldata.chkpnt_nxtmxoff,
 				  old_cluster.controldata.chkpnt_nxtmulti,
 				  old_cluster.controldata.chkpnt_oldstMulti,
 				  new_cluster.pgdata);
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index 53f693c2d4..4d65e4125e 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -114,6 +114,13 @@ extern char *output_files[];
  */
 #define MULTIXACT_FORMATCHANGE_CAT_VER 201301231
 
+/*
+ * Switch from 32-bit to 64-bit for multixid offsets.
+ *
+ * XXX: should be changed to the actual CATALOG_VERSION_NO on commit.
+ */
+#define MULTIXACTOFFSET_FORMATCHANGE_CAT_VER 202409041
+
 /*
  * large object chunk size added to pg_controldata,
  * commit 5f93c37805e7485488480916b4585e098d3cc883
@@ -230,7 +237,7 @@ typedef struct
 	uint32		chkpnt_nxtepoch;
 	uint32		chkpnt_nxtoid;
 	uint32		chkpnt_nxtmulti;
-	uint32		chkpnt_nxtmxoff;
+	uint64		chkpnt_nxtmxoff;
 	uint32		chkpnt_oldstMulti;
 	uint32		chkpnt_oldstxid;
 	uint32		align;
@@ -515,3 +522,7 @@ typedef struct
 	FILE	   *file;
 	char		path[MAXPGPATH];
 } UpgradeTaskReport;
+
+/* segresize.c */
+
+uint64		convert_multixact_offsets(void);
diff --git a/src/bin/pg_upgrade/segresize.c b/src/bin/pg_upgrade/segresize.c
new file mode 100644
index 0000000000..e47c0a2407
--- /dev/null
+++ b/src/bin/pg_upgrade/segresize.c
@@ -0,0 +1,350 @@
+/*
+ *	segresize.c
+ *
+ *	SLRU segment resize utility
+ *
+ *	Copyright (c) 2024, PostgreSQL Global Development Group
+ *	src/bin/pg_upgrade/segresize.c
+ */
+
+#include "postgres_fe.h"
+
+#include "pg_upgrade.h"
+#include "access/multixact.h"
+
+/* See slru.h */
+#define SLRU_PAGES_PER_SEGMENT		32
+
+/*
+ * Some kind of iterator associated with a particular SLRU segment.  The idea is
+ * to specify the segment and page number and then move through the pages.
+ */
+typedef struct SlruSegState
+{
+	char	   *dir;
+	char	   *fn;
+	FILE	   *file;
+	int64		segno;
+	uint64		pageno;
+	bool		leading_gap;
+	bool		long_segment_names;
+} SlruSegState;
+
+/*
+ * Get SLRU segment file name from state.
+ *
+ * NOTE: this function should mirror SlruFileName call.
+ */
+static inline char *
+SlruFileName(SlruSegState *state)
+{
+	if (state->long_segment_names)
+	{
+		Assert(state->segno >= 0 &&
+			   state->segno <= INT64CONST(0xFFFFFFFFFFFFFFF));
+		return psprintf("%s/%015llX", state->dir, (long long) state->segno);
+	}
+	else
+	{
+		Assert(state->segno >= 0 &&
+			   state->segno <= INT64CONST(0xFFFFFF));
+		return psprintf("%s/%04X", state->dir, (unsigned int) state->segno);
+	}
+}
+
+/*
+ * Create SLRU segment file.
+ */
+static void
+create_segment(SlruSegState *state)
+{
+	Assert(state->fn == NULL);
+	Assert(state->file == NULL);
+
+	state->fn = SlruFileName(state);
+	state->file = fopen(state->fn, "wb");
+	if (!state->file)
+		pg_fatal("could not create file \"%s\": %m", state->fn);
+}
+
+/*
+ * Open existing SLRU segment file.
+ */
+static void
+open_segment(SlruSegState *state)
+{
+	Assert(state->fn == NULL);
+	Assert(state->file == NULL);
+
+	state->fn = SlruFileName(state);
+	state->file = fopen(state->fn, "rb");
+	if (!state->file)
+		pg_fatal("could not open file \"%s\": %m", state->fn);
+}
+
+/*
+ * Close SLRU segment file.
+ */
+static void
+close_segment(SlruSegState *state)
+{
+	if (state->file)
+	{
+		fclose(state->file);
+		state->file = NULL;
+	}
+
+	if (state->fn)
+	{
+		pfree(state->fn);
+		state->fn = NULL;
+	}
+}
+
+/*
+ * Read next page from the old 32-bit offset segment file.
+ */
+static int
+read_old_segment_page(SlruSegState *state, void *buf, bool *empty)
+{
+	int		len;
+
+	/* Open next segment file, if needed. */
+	if (!state->fn)
+	{
+		if (!state->segno)
+			state->leading_gap = true;
+
+		open_segment(state);
+
+		/* Set position to the needed page. */
+		if (state->pageno > 0 &&
+			fseek(state->file, state->pageno * BLCKSZ, SEEK_SET))
+		{
+			close_segment(state);
+		}
+	}
+
+	if (state->file)
+	{
+		/* Segment file does exist, read page from it. */
+		state->leading_gap = false;
+
+		len = fread(buf, sizeof(char), BLCKSZ, state->file);
+
+		/* Are we done or was there an error? */
+		if (len <= 0)
+		{
+			if (ferror(state->file))
+				pg_fatal("error reading file \"%s\": %m", state->fn);
+
+			if (feof(state->file))
+			{
+				*empty = true;
+				len = -1;
+
+				close_segment(state);
+			}
+		}
+		else
+			*empty = false;
+	}
+	else if (!state->leading_gap)
+	{
+		/* We reached the last segment. */
+		len = -1;
+		*empty = true;
+	}
+	else
+	{
+		/* Skip few first segments if they were frozen and removed. */
+		len = BLCKSZ;
+		*empty = true;
+	}
+
+	if (++state->pageno >= SLRU_PAGES_PER_SEGMENT)
+	{
+		/* Start a new segment. */
+		state->segno++;
+		state->pageno = 0;
+
+		close_segment(state);
+	}
+
+	return len;
+}
+
+/*
+ * Write next page to the new 64-bit offset segment file.
+ */
+static void
+write_new_segment_page(SlruSegState *state, void *buf)
+{
+	/*
+	 * Create a new segment file if we still didn't.  Creation is
+	 * postponed until the first non-empty page is found.  This helps
+	 * not to create completely empty segments.
+	 */
+	if (!state->file)
+	{
+		create_segment(state);
+
+		/* Write zeroes to the previously skipped prefix. */
+		if (state->pageno > 0)
+		{
+			char		zerobuf[BLCKSZ] = {0};
+
+			for (int64 i = 0; i < state->pageno; i++)
+			{
+				if (fwrite(zerobuf, sizeof(char), BLCKSZ, state->file) != BLCKSZ)
+					pg_fatal("could not write file \"%s\": %m", state->fn);
+			}
+		}
+	}
+
+	/* Write page to the new segment (if it was created). */
+	if (state->file)
+	{
+		if (fwrite(buf, sizeof(char), BLCKSZ, state->file) != BLCKSZ)
+			pg_fatal("could not write file \"%s\": %m", state->fn);
+	}
+
+	state->pageno++;
+
+	/*
+	 * Did we reach the maximum page number?  Then close segment file
+	 * and create a new one on the next iteration.
+	 */
+	if (state->pageno >= SLRU_PAGES_PER_SEGMENT)
+	{
+		state->segno++;
+		state->pageno = 0;
+		close_segment(state);
+	}
+}
+
+/*
+ * Convert pg_multixact/offsets segments and return oldest multi offset.
+ */
+uint64
+convert_multixact_offsets(void)
+{
+	/* See multixact.c */
+#define MULTIXACT_OFFSETS_PER_PAGE_OLD	(BLCKSZ / sizeof(uint32))
+#define MULTIXACT_OFFSETS_PER_PAGE		(BLCKSZ / sizeof(MultiXactOffset))
+
+	SlruSegState	oldseg = {0},
+					newseg = {0};
+	uint32			oldbuf[MULTIXACT_OFFSETS_PER_PAGE_OLD] = {0};
+	MultiXactOffset	newbuf[MULTIXACT_OFFSETS_PER_PAGE] = {0};
+	/*
+	 * It is much easier to deal with multi wraparound in 64-bit format.  Thus
+	 * we use 64 bits for multi-transactions, although they remain 32 bits.
+	 */
+	uint64			oldest_multi = old_cluster.controldata.chkpnt_oldstMulti,
+					next_multi = old_cluster.controldata.chkpnt_nxtmulti,
+					multi,
+					old_entry,
+					new_entry;
+	bool			found = false;
+	uint64			oldest_offset = 0;
+
+	prep_status("Converting pg_multixact/offsets to 64-bit");
+
+	oldseg.pageno = oldest_multi / MULTIXACT_OFFSETS_PER_PAGE_OLD;
+	oldseg.segno = oldseg.pageno / SLRU_PAGES_PER_SEGMENT;
+	oldseg.pageno %= SLRU_PAGES_PER_SEGMENT;
+	oldseg.dir = psprintf("%s/pg_multixact/offsets", old_cluster.pgdata);
+	oldseg.long_segment_names = false;		/* old format XXXX */
+
+	newseg.pageno = oldest_multi / MULTIXACT_OFFSETS_PER_PAGE;
+	newseg.segno = newseg.pageno / SLRU_PAGES_PER_SEGMENT;
+	newseg.pageno %= SLRU_PAGES_PER_SEGMENT;
+	newseg.dir = psprintf("%s/pg_multixact/offsets", new_cluster.pgdata);
+	newseg.long_segment_names = true;
+
+	old_entry = oldest_multi % MULTIXACT_OFFSETS_PER_PAGE_OLD;
+	new_entry = oldest_multi % MULTIXACT_OFFSETS_PER_PAGE;
+
+	if (next_multi < oldest_multi)
+		next_multi += (uint64) 1 << 32;		/* wraparound */
+
+	for (multi = oldest_multi; multi < next_multi; old_entry = 0)
+	{
+		int			oldlen;
+		bool		empty;
+
+		/* Handle possible segment wraparound. */
+		if (oldseg.segno > MaxMultiXactId /
+								MULTIXACT_OFFSETS_PER_PAGE_OLD /
+								SLRU_PAGES_PER_SEGMENT)
+			oldseg.segno = 0;
+
+		/* Read old offset segment. */
+		oldlen = read_old_segment_page(&oldseg, oldbuf, &empty);
+		if (oldlen <= 0 || empty)
+			pg_fatal("cannot read page %llu from file \"%s\": %m",
+					 (unsigned long long) oldseg.pageno, oldseg.fn);
+
+		/* Fill possible gap. */
+		if (oldlen < BLCKSZ)
+			memset((char *) oldbuf + oldlen, 0, BLCKSZ - oldlen);
+
+		/* Save oldest multi offset */
+		if (!found)
+		{
+			oldest_offset = oldbuf[old_entry];
+			found = true;
+		}
+
+		/* Skip the invalid multixact ID that follows wraparound. */
+		if (multi == (uint64) 1 << 32)
+		{
+			Assert(oldseg.segno == 0);
+			Assert(oldseg.pageno == 1);
+			Assert(old_entry == 0);
+
+			multi += FirstMultiXactId;
+			old_entry = FirstMultiXactId;
+		}
+
+		/* Copy entries to the new page. */
+		for (; multi < next_multi && old_entry < MULTIXACT_OFFSETS_PER_PAGE_OLD;
+			 multi++, old_entry++)
+		{
+			MultiXactOffset offset = oldbuf[old_entry];
+
+			/* Handle possible offset wraparound. */
+			if (offset < oldest_offset)
+				offset += ((uint64) 1 << 32) - 1;
+
+			/* Subtract oldest_offset, so new offsets will start from 1. */
+			newbuf[new_entry++] = offset - oldest_offset + 1;
+			if (new_entry >= MULTIXACT_OFFSETS_PER_PAGE)
+			{
+				/* Write a new page. */
+				write_new_segment_page(&newseg, newbuf);
+				new_entry = 0;
+			}
+		}
+	}
+
+	/* Write the last incomplete page. */
+	if (new_entry > 0 || oldest_multi == next_multi)
+	{
+		memset(&newbuf[new_entry], 0,
+			   sizeof(newbuf[0]) * (MULTIXACT_OFFSETS_PER_PAGE - new_entry));
+		write_new_segment_page(&newseg, newbuf);
+	}
+
+	/* Release resources. */
+	close_segment(&oldseg);
+	close_segment(&newseg);
+
+	pfree(oldseg.dir);
+	pfree(newseg.dir);
+
+	check_ok();
+
+	return oldest_offset;
+}
-- 
2.43.0
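
The offset conversion in convert_multixact_offsets() above can be looked at in isolation. What follows is a minimal sketch, not part of the patch: the helper name widen_offset is hypothetical. It mirrors the loop above in treating an old offset smaller than the oldest one as wrapped, adding 2^32 - 1 to it, and then rebasing so that the oldest offset maps to 1.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not in the patch): widen a 32-bit multixact offset
 * to 64 bits and rebase it so that the oldest offset maps to 1.
 */
static uint64_t
widen_offset(uint32_t old_offset, uint32_t oldest_offset)
{
	uint64_t	offset = old_offset;

	/* An offset below the oldest one is treated as wrapped around. */
	if (offset < oldest_offset)
		offset += ((uint64_t) 1 << 32) - 1;

	/* After the adjustment the offset can never precede the oldest one. */
	assert(offset >= oldest_offset);

	return offset - oldest_offset + 1;
}
```

With oldest_offset = 100, an old offset of 105 becomes 6. With oldest_offset = 0xFFFFFFF0, the wrapped old offset 1 becomes 17, directly after 0xFFFFFFFF, which becomes 16.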

v4-0003-Get-rid-of-MultiXactMemberFreezeThreshold-call.patch (application/octet-stream)
From 87978b0164a785d5758c99d892ff0c20e216769c Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Tue, 22 Oct 2024 18:53:18 +0300
Subject: [PATCH v4 3/4] Get rid of MultiXactMemberFreezeThreshold call.

Since MaxMultiXactOffset is now UINT64_MAX, MULTIXACT_MEMBER_SAFE_THRESHOLD and
MULTIXACT_MEMBER_DANGER_THRESHOLD are no longer meaningful, so
MultiXactMemberFreezeThreshold is no longer needed either.

Author: Maxim Orlov <orlovmg@gmail.com>
---
 src/backend/access/transam/multixact.c | 218 +------------------------
 src/backend/commands/vacuum.c          |   5 +-
 src/backend/postmaster/autovacuum.c    |   7 +-
 src/include/access/multixact.h         |   1 -
 4 files changed, 7 insertions(+), 224 deletions(-)

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index c51e03e832..fc7d2cef70 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -250,14 +250,6 @@ typedef struct MultiXactStateData
 	MultiXactId oldestMultiXactId;
 	Oid			oldestMultiXactDB;
 
-	/*
-	 * Oldest multixact offset that is potentially referenced by a multixact
-	 * referenced by a relation.  We don't always know this value, so there's
-	 * a flag here to indicate whether or not we currently do.
-	 */
-	MultiXactOffset oldestOffset;
-	bool		oldestOffsetKnown;
-
 	/* support for anti-wraparound measures */
 	MultiXactId multiVacLimit;
 	MultiXactId multiWarnLimit;
@@ -398,7 +390,6 @@ static bool MultiXactOffsetPrecedes(MultiXactOffset offset1,
 									MultiXactOffset offset2);
 static void ExtendMultiXactOffset(MultiXactId multi);
 static void ExtendMultiXactMember(MultiXactOffset offset, int nmembers);
-static bool SetOffsetVacuumLimit(bool is_startup);
 static bool find_multixact_start(MultiXactId multi, MultiXactOffset *result);
 static void WriteMZeroPageXlogRec(int64 pageno, uint8 info);
 static void WriteMTruncateXlogRec(Oid oldestMultiDB,
@@ -2284,16 +2275,13 @@ SetMultiXactIdLimit(MultiXactId oldest_datminmxid, Oid oldest_datoid,
 	MultiXactId multiStopLimit;
 	MultiXactId multiWrapLimit;
 	MultiXactId curMulti;
-	bool		needs_offset_vacuum;
 
 	Assert(MultiXactIdIsValid(oldest_datminmxid));
 
 	/*
 	 * We pretend that a wrap will happen halfway through the multixact ID
 	 * space, but that's not really true, because multixacts wrap differently
-	 * from transaction IDs.  Note that, separately from any concern about
-	 * multixact IDs wrapping, we must ensure that multixact members do not
-	 * wrap.  Limits for that are set in SetOffsetVacuumLimit, not here.
+	 * from transaction IDs.
 	 */
 	multiWrapLimit = oldest_datminmxid + (MaxMultiXactId >> 1);
 	if (multiWrapLimit < FirstMultiXactId)
@@ -2361,9 +2349,6 @@ SetMultiXactIdLimit(MultiXactId oldest_datminmxid, Oid oldest_datoid,
 
 	Assert(!InRecovery);
 
-	/* Set limits for offset vacuum. */
-	needs_offset_vacuum = SetOffsetVacuumLimit(is_startup);
-
 	/*
 	 * If past the autovacuum force point, immediately signal an autovac
 	 * request.  The reason for this is that autovac only processes one
@@ -2371,8 +2356,7 @@ SetMultiXactIdLimit(MultiXactId oldest_datminmxid, Oid oldest_datoid,
 	 * database, it'll call here, and we'll signal the postmaster to start
 	 * another iteration immediately if there are still any old databases.
 	 */
-	if ((MultiXactIdPrecedes(multiVacLimit, curMulti) ||
-		 needs_offset_vacuum) && IsUnderPostmaster)
+	if (MultiXactIdPrecedes(multiVacLimit, curMulti) && IsUnderPostmaster)
 		SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER);
 
 	/* Give an immediate warning if past the wrap warn point */
@@ -2615,109 +2599,6 @@ GetOldestMultiXactId(void)
 	return oldestMXact;
 }
 
-/*
- * Determine how aggressively we need to vacuum in order to prevent member
- * wraparound.
- *
- * To do so determine what's the oldest member offset and install the limit
- * info in MultiXactState, where it can be used to prevent overrun of old data
- * in the members SLRU area.
- *
- * The return value is true if emergency autovacuum is required and false
- * otherwise.
- */
-static bool
-SetOffsetVacuumLimit(bool is_startup)
-{
-	MultiXactId oldestMultiXactId;
-	MultiXactId nextMXact;
-	MultiXactOffset oldestOffset = 0;	/* placate compiler */
-	MultiXactOffset prevOldestOffset;
-	MultiXactOffset nextOffset;
-	bool		oldestOffsetKnown = false;
-	bool		prevOldestOffsetKnown;
-
-	/*
-	 * NB: Have to prevent concurrent truncation, we might otherwise try to
-	 * lookup an oldestMulti that's concurrently getting truncated away.
-	 */
-	LWLockAcquire(MultiXactTruncationLock, LW_SHARED);
-
-	/* Read relevant fields from shared memory. */
-	LWLockAcquire(MultiXactGenLock, LW_SHARED);
-	oldestMultiXactId = MultiXactState->oldestMultiXactId;
-	nextMXact = MultiXactState->nextMXact;
-	nextOffset = MultiXactState->nextOffset;
-	prevOldestOffsetKnown = MultiXactState->oldestOffsetKnown;
-	prevOldestOffset = MultiXactState->oldestOffset;
-	Assert(MultiXactState->finishedStartup);
-	LWLockRelease(MultiXactGenLock);
-
-	/*
-	 * Determine the offset of the oldest multixact.  Normally, we can read
-	 * the offset from the multixact itself, but there's an important special
-	 * case: if there are no multixacts in existence at all, oldestMXact
-	 * obviously can't point to one.  It will instead point to the multixact
-	 * ID that will be assigned the next time one is needed.
-	 */
-	if (oldestMultiXactId == nextMXact)
-	{
-		/*
-		 * When the next multixact gets created, it will be stored at the next
-		 * offset.
-		 */
-		oldestOffset = nextOffset;
-		oldestOffsetKnown = true;
-	}
-	else
-	{
-		/*
-		 * Figure out where the oldest existing multixact's offsets are
-		 * stored. Due to bugs in early release of PostgreSQL 9.3.X and 9.4.X,
-		 * the supposedly-earliest multixact might not really exist.  We are
-		 * careful not to fail in that case.
-		 */
-		oldestOffsetKnown =
-			find_multixact_start(oldestMultiXactId, &oldestOffset);
-
-		if (!oldestOffsetKnown)
-			ereport(LOG,
-					(errmsg("MultiXact member wraparound protections are disabled because oldest checkpointed MultiXact %u does not exist on disk",
-							oldestMultiXactId)));
-	}
-
-	LWLockRelease(MultiXactTruncationLock);
-
-	/*
-	 * If we can, compute limits (and install them MultiXactState) to prevent
-	 * overrun of old data in the members SLRU area. We can only do so if the
-	 * oldest offset is known though.
-	 */
-	if (prevOldestOffsetKnown)
-	{
-		/*
-		 * If we failed to get the oldest offset this time, but we have a
-		 * value from a previous pass through this function, use the old
-		 * values rather than automatically forcing an emergency autovacuum
-		 * cycle again.
-		 */
-		oldestOffset = prevOldestOffset;
-		oldestOffsetKnown = true;
-	}
-
-	/* Install the computed values */
-	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
-	MultiXactState->oldestOffset = oldestOffset;
-	MultiXactState->oldestOffsetKnown = oldestOffsetKnown;
-	LWLockRelease(MultiXactGenLock);
-
-	/*
-	 * Do we need an emergency autovacuum?	If we're not sure, assume yes.
-	 */
-	return !oldestOffsetKnown ||
-		(nextOffset - oldestOffset > MULTIXACT_MEMBER_SAFE_THRESHOLD);
-}
-
 /*
  * Find the starting offset of the given MultiXactId.
  *
@@ -2761,101 +2642,6 @@ find_multixact_start(MultiXactId multi, MultiXactOffset *result)
 	return true;
 }
 
-/*
- * Determine how many multixacts, and how many multixact members, currently
- * exist.  Return false if unable to determine.
- */
-static bool
-ReadMultiXactCounts(uint32 *multixacts, MultiXactOffset *members)
-{
-	MultiXactOffset nextOffset;
-	MultiXactOffset oldestOffset;
-	MultiXactId oldestMultiXactId;
-	MultiXactId nextMultiXactId;
-	bool		oldestOffsetKnown;
-
-	LWLockAcquire(MultiXactGenLock, LW_SHARED);
-	nextOffset = MultiXactState->nextOffset;
-	oldestMultiXactId = MultiXactState->oldestMultiXactId;
-	nextMultiXactId = MultiXactState->nextMXact;
-	oldestOffset = MultiXactState->oldestOffset;
-	oldestOffsetKnown = MultiXactState->oldestOffsetKnown;
-	LWLockRelease(MultiXactGenLock);
-
-	if (!oldestOffsetKnown)
-		return false;
-
-	*members = nextOffset - oldestOffset;
-	*multixacts = nextMultiXactId - oldestMultiXactId;
-	return true;
-}
-
-/*
- * Multixact members can be removed once the multixacts that refer to them
- * are older than every datminmxid.  autovacuum_multixact_freeze_max_age and
- * vacuum_multixact_freeze_table_age work together to make sure we never have
- * too many multixacts; we hope that, at least under normal circumstances,
- * this will also be sufficient to keep us from using too many offsets.
- * However, if the average multixact has many members, we might exhaust the
- * members space while still using few enough members that these limits fail
- * to trigger relminmxid advancement by VACUUM.  At that point, we'd have no
- * choice but to start failing multixact-creating operations with an error.
- *
- * To prevent that, if more than a threshold portion of the members space is
- * used, we effectively reduce autovacuum_multixact_freeze_max_age and
- * to a value just less than the number of multixacts in use.  We hope that
- * this will quickly trigger autovacuuming on the table or tables with the
- * oldest relminmxid, thus allowing datminmxid values to advance and removing
- * some members.
- *
- * As the fraction of the member space currently in use grows, we become
- * more aggressive in clamping this value.  That not only causes autovacuum
- * to ramp up, but also makes any manual vacuums the user issues more
- * aggressive.  This happens because vacuum_get_cutoffs() will clamp the
- * freeze table and the minimum freeze age cutoffs based on the effective
- * autovacuum_multixact_freeze_max_age this function returns.  In the worst
- * case, we'll claim the freeze_max_age to zero, and every vacuum of any
- * table will freeze every multixact.
- */
-int
-MultiXactMemberFreezeThreshold(void)
-{
-	MultiXactOffset members;
-	uint32		multixacts;
-	uint32		victim_multixacts;
-	double		fraction;
-	int			result;
-
-	/* If we can't determine member space utilization, assume the worst. */
-	if (!ReadMultiXactCounts(&multixacts, &members))
-		return 0;
-
-	/* If member space utilization is low, no special action is required. */
-	if (members <= MULTIXACT_MEMBER_SAFE_THRESHOLD)
-		return autovacuum_multixact_freeze_max_age;
-
-	/*
-	 * Compute a target for relminmxid advancement.  The number of multixacts
-	 * we try to eliminate from the system is based on how far we are past
-	 * MULTIXACT_MEMBER_SAFE_THRESHOLD.
-	 */
-	fraction = (double) (members - MULTIXACT_MEMBER_SAFE_THRESHOLD);
-	fraction /= (double) (MULTIXACT_MEMBER_DANGER_THRESHOLD - MULTIXACT_MEMBER_SAFE_THRESHOLD);
-
-	victim_multixacts = multixacts * fraction;
-
-	/* fraction could be > 1.0, but lowest possible freeze age is zero */
-	if (victim_multixacts > multixacts)
-		return 0;
-	result = multixacts - victim_multixacts;
-
-	/*
-	 * Clamp to autovacuum_multixact_freeze_max_age, so that we never make
-	 * autovacuum less aggressive than it would otherwise be.
-	 */
-	return Min(result, autovacuum_multixact_freeze_max_age);
-}
-
 typedef struct mxtruncinfo
 {
 	int64		earliestExistingPage;
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index ac8f5d9c25..97dd6bc8e2 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -1131,10 +1131,9 @@ vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
 
 	/*
 	 * Also compute the multixact age for which freezing is urgent.  This is
-	 * normally autovacuum_multixact_freeze_max_age, but may be less if we are
-	 * short of multixact member space.
+	 * autovacuum_multixact_freeze_max_age.
 	 */
-	effective_multixact_freeze_max_age = MultiXactMemberFreezeThreshold();
+	effective_multixact_freeze_max_age = autovacuum_multixact_freeze_max_age;
 
 	/*
 	 * Almost ready to set freeze output parameters; check if OldestXmin or
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index dc3cf87aba..e9285ba44c 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -1122,7 +1122,7 @@ do_start_worker(void)
 
 	/* Also determine the oldest datminmxid we will consider. */
 	recentMulti = ReadNextMultiXactId();
-	multiForceLimit = recentMulti - MultiXactMemberFreezeThreshold();
+	multiForceLimit = recentMulti - autovacuum_multixact_freeze_max_age;
 	if (multiForceLimit < FirstMultiXactId)
 		multiForceLimit -= FirstMultiXactId;
 
@@ -1912,10 +1912,9 @@ do_autovacuum(void)
 
 	/*
 	 * Compute the multixact age for which freezing is urgent.  This is
-	 * normally autovacuum_multixact_freeze_max_age, but may be less if we are
-	 * short of multixact member space.
+	 * autovacuum_multixact_freeze_max_age.
 	 */
-	effective_multixact_freeze_max_age = MultiXactMemberFreezeThreshold();
+	effective_multixact_freeze_max_age = autovacuum_multixact_freeze_max_age;
 
 	/*
 	 * Find the pg_database entry and select the default freeze ages. We use
diff --git a/src/include/access/multixact.h b/src/include/access/multixact.h
index 90583634ec..5aefbddce3 100644
--- a/src/include/access/multixact.h
+++ b/src/include/access/multixact.h
@@ -143,7 +143,6 @@ extern void MultiXactSetNextMXact(MultiXactId nextMulti,
 extern void MultiXactAdvanceNextMXact(MultiXactId minMulti,
 									  MultiXactOffset minMultiOffset);
 extern void MultiXactAdvanceOldest(MultiXactId oldestMulti, Oid oldestMultiDB);
-extern int	MultiXactMemberFreezeThreshold(void);
 
 extern void multixact_twophase_recover(TransactionId xid, uint16 info,
 									   void *recdata, uint32 len);
-- 
2.43.0
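
To see what the patch above takes away, here is a rough standalone sketch of the arithmetic in the removed MultiXactMemberFreezeThreshold(). It is an illustration only: the function name and the explicit parameters are hypothetical stand-ins for the shared-memory counters, the two threshold macros, and the autovacuum_multixact_freeze_max_age setting. It also omits the "counts unknown, return 0" path of the original.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical standalone version of the removed calculation.  All inputs
 * are plain parameters here; the real function read them from shared
 * memory and from GUCs.
 */
static int
member_freeze_threshold(uint64_t members, uint32_t multixacts,
						uint64_t safe_threshold, uint64_t danger_threshold,
						int freeze_max_age)
{
	double		fraction;
	uint32_t	victim_multixacts;
	uint32_t	remaining;

	assert(danger_threshold > safe_threshold);
	assert(freeze_max_age >= 0);

	/* Low member space usage: no special action is required. */
	if (members <= safe_threshold)
		return freeze_max_age;

	/* How far past the safe threshold are we, relative to the danger one? */
	fraction = (double) (members - safe_threshold) /
		(double) (danger_threshold - safe_threshold);

	/* At or past the danger threshold: freeze everything. */
	if (fraction >= 1.0)
		return 0;

	victim_multixacts = (uint32_t) (multixacts * fraction);
	remaining = multixacts - victim_multixacts;

	/* Never make autovacuum less aggressive than the configured setting. */
	return remaining < (uint32_t) freeze_max_age ? (int) remaining : freeze_max_age;
}
```

With 64-bit offsets the member space is not expected to fill up in practice, so the callers in vacuum.c and autovacuum.c simply use autovacuum_multixact_freeze_max_age directly, as the hunks above show.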

v4-0001-Use-64-bit-format-output-for-multixact-offsets.patch (application/octet-stream)
From 5de021e3b012dbf71bb6b2893cd77864236bffcb Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Wed, 7 Aug 2024 16:35:22 +0300
Subject: [PATCH v4 1/4] Use 64-bit format output for multixact offsets

Author: Maxim Orlov <orlovmg@gmail.com>
---
 src/backend/access/rmgrdesc/mxactdesc.c   |  9 ++++----
 src/backend/access/rmgrdesc/xlogdesc.c    |  4 ++--
 src/backend/access/transam/multixact.c    | 26 +++++++++++++----------
 src/backend/access/transam/xlogrecovery.c |  5 +++--
 src/bin/pg_controldata/pg_controldata.c   |  4 ++--
 src/bin/pg_resetwal/pg_resetwal.c         |  8 +++----
 6 files changed, 31 insertions(+), 25 deletions(-)

diff --git a/src/backend/access/rmgrdesc/mxactdesc.c b/src/backend/access/rmgrdesc/mxactdesc.c
index 3e8ad4d5ef..1b486de38c 100644
--- a/src/backend/access/rmgrdesc/mxactdesc.c
+++ b/src/backend/access/rmgrdesc/mxactdesc.c
@@ -65,8 +65,8 @@ multixact_desc(StringInfo buf, XLogReaderState *record)
 		xl_multixact_create *xlrec = (xl_multixact_create *) rec;
 		int			i;
 
-		appendStringInfo(buf, "%u offset %u nmembers %d: ", xlrec->mid,
-						 xlrec->moff, xlrec->nmembers);
+		appendStringInfo(buf, "%u offset %llu nmembers %d: ", xlrec->mid,
+						 (unsigned long long) xlrec->moff, xlrec->nmembers);
 		for (i = 0; i < xlrec->nmembers; i++)
 			out_member(buf, &xlrec->members[i]);
 	}
@@ -74,9 +74,10 @@ multixact_desc(StringInfo buf, XLogReaderState *record)
 	{
 		xl_multixact_truncate *xlrec = (xl_multixact_truncate *) rec;
 
-		appendStringInfo(buf, "offsets [%u, %u), members [%u, %u)",
+		appendStringInfo(buf, "offsets [%u, %u), members [%llu, %llu)",
 						 xlrec->startTruncOff, xlrec->endTruncOff,
-						 xlrec->startTruncMemb, xlrec->endTruncMemb);
+						 (unsigned long long) xlrec->startTruncMemb,
+						 (unsigned long long) xlrec->endTruncMemb);
 	}
 }
 
diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c
index 363294d623..aaa19c81c8 100644
--- a/src/backend/access/rmgrdesc/xlogdesc.c
+++ b/src/backend/access/rmgrdesc/xlogdesc.c
@@ -66,7 +66,7 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 		CheckPoint *checkpoint = (CheckPoint *) rec;
 
 		appendStringInfo(buf, "redo %X/%X; "
-						 "tli %u; prev tli %u; fpw %s; wal_level %s; xid %u:%u; oid %u; multi %u; offset %u; "
+						 "tli %u; prev tli %u; fpw %s; wal_level %s; xid %u:%u; oid %u; multi %u; offset %llu; "
 						 "oldest xid %u in DB %u; oldest multi %u in DB %u; "
 						 "oldest/newest commit timestamp xid: %u/%u; "
 						 "oldest running xid %u; %s",
@@ -79,7 +79,7 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 						 XidFromFullTransactionId(checkpoint->nextXid),
 						 checkpoint->nextOid,
 						 checkpoint->nextMulti,
-						 checkpoint->nextMultiOffset,
+						 (unsigned long long) checkpoint->nextMultiOffset,
 						 checkpoint->oldestXid,
 						 checkpoint->oldestXidDB,
 						 checkpoint->oldestMulti,
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 8c37d7eba7..ab90912ed3 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1264,7 +1264,8 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 
 	LWLockRelease(MultiXactGenLock);
 
-	debug_elog4(DEBUG2, "GetNew: returning %u offset %u", result, *offset);
+	debug_elog4(DEBUG2, "GetNew: returning %u offset %llu", result,
+				(unsigned long long) *offset);
 	return result;
 }
 
@@ -2293,8 +2294,9 @@ MultiXactGetCheckptMulti(bool is_shutdown,
 	LWLockRelease(MultiXactGenLock);
 
 	debug_elog6(DEBUG2,
-				"MultiXact: checkpoint is nextMulti %u, nextOffset %u, oldestMulti %u in DB %u",
-				*nextMulti, *nextMultiOffset, *oldestMulti, *oldestMultiDB);
+				"MultiXact: checkpoint is nextMulti %u, nextOffset %llu, oldestMulti %u in DB %u",
+				*nextMulti, (unsigned long long) *nextMultiOffset, *oldestMulti,
+				*oldestMultiDB);
 }
 
 /*
@@ -2328,8 +2330,8 @@ void
 MultiXactSetNextMXact(MultiXactId nextMulti,
 					  MultiXactOffset nextMultiOffset)
 {
-	debug_elog4(DEBUG2, "MultiXact: setting next multi to %u offset %u",
-				nextMulti, nextMultiOffset);
+	debug_elog4(DEBUG2, "MultiXact: setting next multi to %u offset %llu",
+				nextMulti, (unsigned long long) nextMultiOffset);
 	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
 	MultiXactState->nextMXact = nextMulti;
 	MultiXactState->nextOffset = nextMultiOffset;
@@ -2519,8 +2521,8 @@ MultiXactAdvanceNextMXact(MultiXactId minMulti,
 	}
 	if (MultiXactOffsetPrecedes(MultiXactState->nextOffset, minMultiOffset))
 	{
-		debug_elog3(DEBUG2, "MultiXact: setting next offset to %u",
-					minMultiOffset);
+		debug_elog3(DEBUG2, "MultiXact: setting next offset to %llu",
+					(unsigned long long) minMultiOffset);
 		MultiXactState->nextOffset = minMultiOffset;
 	}
 	LWLockRelease(MultiXactGenLock);
@@ -3211,11 +3213,12 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 
 	elog(DEBUG1, "performing multixact truncation: "
 		 "offsets [%u, %u), offsets segments [%llx, %llx), "
-		 "members [%u, %u), members segments [%llx, %llx)",
+		 "members [%llu, %llu), members segments [%llx, %llx)",
 		 oldestMulti, newOldestMulti,
 		 (unsigned long long) MultiXactIdToOffsetSegment(oldestMulti),
 		 (unsigned long long) MultiXactIdToOffsetSegment(newOldestMulti),
-		 oldestOffset, newOldestOffset,
+		 (unsigned long long) oldestOffset,
+		 (unsigned long long) newOldestOffset,
 		 (unsigned long long) MXOffsetToMemberSegment(oldestOffset),
 		 (unsigned long long) MXOffsetToMemberSegment(newOldestOffset));
 
@@ -3471,11 +3474,12 @@ multixact_redo(XLogReaderState *record)
 
 		elog(DEBUG1, "replaying multixact truncation: "
 			 "offsets [%u, %u), offsets segments [%llx, %llx), "
-			 "members [%u, %u), members segments [%llx, %llx)",
+			 "members [%llu, %llu), members segments [%llx, %llx)",
 			 xlrec.startTruncOff, xlrec.endTruncOff,
 			 (unsigned long long) MultiXactIdToOffsetSegment(xlrec.startTruncOff),
 			 (unsigned long long) MultiXactIdToOffsetSegment(xlrec.endTruncOff),
-			 xlrec.startTruncMemb, xlrec.endTruncMemb,
+			 (unsigned long long) xlrec.startTruncMemb,
+			 (unsigned long long) xlrec.endTruncMemb,
 			 (unsigned long long) MXOffsetToMemberSegment(xlrec.startTruncMemb),
 			 (unsigned long long) MXOffsetToMemberSegment(xlrec.endTruncMemb));
 
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 320b14add1..4846126ef9 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -877,8 +877,9 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 							 U64FromFullTransactionId(checkPoint.nextXid),
 							 checkPoint.nextOid)));
 	ereport(DEBUG1,
-			(errmsg_internal("next MultiXactId: %u; next MultiXactOffset: %u",
-							 checkPoint.nextMulti, checkPoint.nextMultiOffset)));
+			(errmsg_internal("next MultiXactId: %u; next MultiXactOffset: %llu",
+							 checkPoint.nextMulti,
+							 (unsigned long long) checkPoint.nextMultiOffset)));
 	ereport(DEBUG1,
 			(errmsg_internal("oldest unfrozen transaction ID: %u, in database %u",
 							 checkPoint.oldestXid, checkPoint.oldestXidDB)));
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index 93a05d80ca..43b6727570 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -253,8 +253,8 @@ main(int argc, char *argv[])
 		   ControlFile->checkPointCopy.nextOid);
 	printf(_("Latest checkpoint's NextMultiXactId:  %u\n"),
 		   ControlFile->checkPointCopy.nextMulti);
-	printf(_("Latest checkpoint's NextMultiOffset:  %u\n"),
-		   ControlFile->checkPointCopy.nextMultiOffset);
+	printf(_("Latest checkpoint's NextMultiOffset:  %llu\n"),
+		   (unsigned long long) ControlFile->checkPointCopy.nextMultiOffset);
 	printf(_("Latest checkpoint's oldestXID:        %u\n"),
 		   ControlFile->checkPointCopy.oldestXid);
 	printf(_("Latest checkpoint's oldestXID's DB:   %u\n"),
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index e9dcb5a6d8..985cd06802 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -737,8 +737,8 @@ PrintControlValues(bool guessed)
 		   ControlFile.checkPointCopy.nextOid);
 	printf(_("Latest checkpoint's NextMultiXactId:  %u\n"),
 		   ControlFile.checkPointCopy.nextMulti);
-	printf(_("Latest checkpoint's NextMultiOffset:  %u\n"),
-		   ControlFile.checkPointCopy.nextMultiOffset);
+	printf(_("Latest checkpoint's NextMultiOffset:  %llu\n"),
+		   (unsigned long long) ControlFile.checkPointCopy.nextMultiOffset);
 	printf(_("Latest checkpoint's oldestXID:        %u\n"),
 		   ControlFile.checkPointCopy.oldestXid);
 	printf(_("Latest checkpoint's oldestXID's DB:   %u\n"),
@@ -809,8 +809,8 @@ PrintNewControlValues(void)
 
 	if (set_mxoff != -1)
 	{
-		printf(_("NextMultiOffset:                      %u\n"),
-			   ControlFile.checkPointCopy.nextMultiOffset);
+		printf(_("NextMultiOffset:                      %llu\n"),
+			   (unsigned long long) ControlFile.checkPointCopy.nextMultiOffset);
 	}
 
 	if (set_oid != 0)
-- 
2.43.0
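
Every hunk in the patch above applies the same pattern: the format specifier changes from %u to %llu, and the argument is cast to unsigned long long so that it matches the specifier regardless of how the 64-bit type is defined on the platform. A minimal sketch of that pattern outside PostgreSQL, with a hypothetical helper name and plain snprintf in place of appendStringInfo or elog:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper: format a 64-bit offset the way the patch does,
 * using %llu together with an explicit cast.  Returns the number of
 * characters snprintf would have written.
 */
static int
format_next_offset(char *buf, size_t buflen, uint64_t next_offset)
{
	assert(buf != NULL);

	return snprintf(buf, buflen, "NextMultiOffset: %llu",
					(unsigned long long) next_offset);
}
```

Without the cast, passing a uint64_t to %llu is only correct on platforms where the two types happen to coincide, which is why the cast appears in every changed call.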

v4-0002-Use-64-bit-multixact-offsets.patch (application/octet-stream)
From cca60a5e487090252dddd515c716272786841c5e Mon Sep 17 00:00:00 2001
From: Maxim Orlov <m.orlov@postgrespro.ru>
Date: Wed, 6 Mar 2024 11:11:33 +0300
Subject: [PATCH v4 2/4] Use 64-bit multixact offsets.

Author: Maxim Orlov <orlovmg@gmail.com>
---
 src/backend/access/transam/multixact.c | 172 +------------------------
 src/bin/pg_resetwal/pg_resetwal.c      |   2 +-
 src/bin/pg_resetwal/t/001_basic.pl     |   2 +-
 src/include/access/multixact.h         |   2 +-
 src/include/c.h                        |   2 +-
 5 files changed, 11 insertions(+), 169 deletions(-)

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index ab90912ed3..c51e03e832 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -96,14 +96,6 @@
 /*
  * Defines for MultiXactOffset page sizes.  A page is the same BLCKSZ as is
  * used everywhere else in Postgres.
- *
- * Note: because MultiXactOffsets are 32 bits and wrap around at 0xFFFFFFFF,
- * MultiXact page numbering also wraps around at
- * 0xFFFFFFFF/MULTIXACT_OFFSETS_PER_PAGE, and segment numbering at
- * 0xFFFFFFFF/MULTIXACT_OFFSETS_PER_PAGE/SLRU_PAGES_PER_SEGMENT.  We need
- * take no explicit notice of that fact in this module, except when comparing
- * segment and page numbers in TruncateMultiXact (see
- * MultiXactOffsetPagePrecedes).
  */
 
 /* We need four bytes per offset */
@@ -272,9 +264,6 @@ typedef struct MultiXactStateData
 	MultiXactId multiStopLimit;
 	MultiXactId multiWrapLimit;
 
-	/* support for members anti-wraparound measures */
-	MultiXactOffset offsetStopLimit;	/* known if oldestOffsetKnown */
-
 	/*
 	 * This is used to sleep until a multixact offset is written when we want
 	 * to create the next one.
@@ -409,8 +398,6 @@ static bool MultiXactOffsetPrecedes(MultiXactOffset offset1,
 									MultiXactOffset offset2);
 static void ExtendMultiXactOffset(MultiXactId multi);
 static void ExtendMultiXactMember(MultiXactOffset offset, int nmembers);
-static bool MultiXactOffsetWouldWrap(MultiXactOffset boundary,
-									 MultiXactOffset start, uint32 distance);
 static bool SetOffsetVacuumLimit(bool is_startup);
 static bool find_multixact_start(MultiXactId multi, MultiXactOffset *result);
 static void WriteMZeroPageXlogRec(int64 pageno, uint8 info);
@@ -1164,78 +1151,6 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 	else
 		*offset = nextOffset;
 
-	/*----------
-	 * Protect against overrun of the members space as well, with the
-	 * following rules:
-	 *
-	 * If we're past offsetStopLimit, refuse to generate more multis.
-	 * If we're close to offsetStopLimit, emit a warning.
-	 *
-	 * Arbitrarily, we start emitting warnings when we're 20 segments or less
-	 * from offsetStopLimit.
-	 *
-	 * Note we haven't updated the shared state yet, so if we fail at this
-	 * point, the multixact ID we grabbed can still be used by the next guy.
-	 *
-	 * Note that there is no point in forcing autovacuum runs here: the
-	 * multixact freeze settings would have to be reduced for that to have any
-	 * effect.
-	 *----------
-	 */
-#define OFFSET_WARN_SEGMENTS	20
-	if (MultiXactState->oldestOffsetKnown &&
-		MultiXactOffsetWouldWrap(MultiXactState->offsetStopLimit, nextOffset,
-								 nmembers))
-	{
-		/* see comment in the corresponding offsets wraparound case */
-		SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER);
-
-		ereport(ERROR,
-				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg("multixact \"members\" limit exceeded"),
-				 errdetail_plural("This command would create a multixact with %u members, but the remaining space is only enough for %u member.",
-								  "This command would create a multixact with %u members, but the remaining space is only enough for %u members.",
-								  MultiXactState->offsetStopLimit - nextOffset - 1,
-								  nmembers,
-								  MultiXactState->offsetStopLimit - nextOffset - 1),
-				 errhint("Execute a database-wide VACUUM in database with OID %u with reduced \"vacuum_multixact_freeze_min_age\" and \"vacuum_multixact_freeze_table_age\" settings.",
-						 MultiXactState->oldestMultiXactDB)));
-	}
-
-	/*
-	 * Check whether we should kick autovacuum into action, to prevent members
-	 * wraparound. NB we use a much larger window to trigger autovacuum than
-	 * just the warning limit. The warning is just a measure of last resort -
-	 * this is in line with GetNewTransactionId's behaviour.
-	 */
-	if (!MultiXactState->oldestOffsetKnown ||
-		(MultiXactState->nextOffset - MultiXactState->oldestOffset
-		 > MULTIXACT_MEMBER_SAFE_THRESHOLD))
-	{
-		/*
-		 * To avoid swamping the postmaster with signals, we issue the autovac
-		 * request only when crossing a segment boundary. With default
-		 * compilation settings that's roughly after 50k members.  This still
-		 * gives plenty of chances before we get into real trouble.
-		 */
-		if ((MXOffsetToMemberPage(nextOffset) / SLRU_PAGES_PER_SEGMENT) !=
-			(MXOffsetToMemberPage(nextOffset + nmembers) / SLRU_PAGES_PER_SEGMENT))
-			SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER);
-	}
-
-	if (MultiXactState->oldestOffsetKnown &&
-		MultiXactOffsetWouldWrap(MultiXactState->offsetStopLimit,
-								 nextOffset,
-								 nmembers + MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT * OFFSET_WARN_SEGMENTS))
-		ereport(WARNING,
-				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg_plural("database with OID %u must be vacuumed before %d more multixact member is used",
-							   "database with OID %u must be vacuumed before %d more multixact members are used",
-							   MultiXactState->offsetStopLimit - nextOffset + nmembers,
-							   MultiXactState->oldestMultiXactDB,
-							   MultiXactState->offsetStopLimit - nextOffset + nmembers),
-				 errhint("Execute a database-wide VACUUM in that database with reduced \"vacuum_multixact_freeze_min_age\" and \"vacuum_multixact_freeze_table_age\" settings.")));
-
 	ExtendMultiXactMember(nextOffset, nmembers);
 
 	/*
@@ -1976,7 +1891,7 @@ MultiXactShmemInit(void)
 				  "pg_multixact/offsets", LWTRANCHE_MULTIXACTOFFSET_BUFFER,
 				  LWTRANCHE_MULTIXACTOFFSET_SLRU,
 				  SYNC_HANDLER_MULTIXACT_OFFSET,
-				  false);
+				  true);
 	SlruPagePrecedesUnitTests(MultiXactOffsetCtl, MULTIXACT_OFFSETS_PER_PAGE);
 	SimpleLruInit(MultiXactMemberCtl,
 				  "multixact_member", multixact_member_buffers, 0,
@@ -2721,8 +2636,6 @@ SetOffsetVacuumLimit(bool is_startup)
 	MultiXactOffset nextOffset;
 	bool		oldestOffsetKnown = false;
 	bool		prevOldestOffsetKnown;
-	MultiXactOffset offsetStopLimit = 0;
-	MultiXactOffset prevOffsetStopLimit;
 
 	/*
 	 * NB: Have to prevent concurrent truncation, we might otherwise try to
@@ -2737,7 +2650,6 @@ SetOffsetVacuumLimit(bool is_startup)
 	nextOffset = MultiXactState->nextOffset;
 	prevOldestOffsetKnown = MultiXactState->oldestOffsetKnown;
 	prevOldestOffset = MultiXactState->oldestOffset;
-	prevOffsetStopLimit = MultiXactState->offsetStopLimit;
 	Assert(MultiXactState->finishedStartup);
 	LWLockRelease(MultiXactGenLock);
 
@@ -2768,11 +2680,7 @@ SetOffsetVacuumLimit(bool is_startup)
 		oldestOffsetKnown =
 			find_multixact_start(oldestMultiXactId, &oldestOffset);
 
-		if (oldestOffsetKnown)
-			ereport(DEBUG1,
-					(errmsg_internal("oldest MultiXactId member is at offset %u",
-									 oldestOffset)));
-		else
+		if (!oldestOffsetKnown)
 			ereport(LOG,
 					(errmsg("MultiXact member wraparound protections are disabled because oldest checkpointed MultiXact %u does not exist on disk",
 							oldestMultiXactId)));
@@ -2785,24 +2693,7 @@ SetOffsetVacuumLimit(bool is_startup)
 	 * overrun of old data in the members SLRU area. We can only do so if the
 	 * oldest offset is known though.
 	 */
-	if (oldestOffsetKnown)
-	{
-		/* move back to start of the corresponding segment */
-		offsetStopLimit = oldestOffset - (oldestOffset %
-										  (MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT));
-
-		/* always leave one segment before the wraparound point */
-		offsetStopLimit -= (MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT);
-
-		if (!prevOldestOffsetKnown && !is_startup)
-			ereport(LOG,
-					(errmsg("MultiXact member wraparound protections are now enabled")));
-
-		ereport(DEBUG1,
-				(errmsg_internal("MultiXact member stop limit is now %u based on MultiXact %u",
-								 offsetStopLimit, oldestMultiXactId)));
-	}
-	else if (prevOldestOffsetKnown)
+	if (prevOldestOffsetKnown)
 	{
 		/*
 		 * If we failed to get the oldest offset this time, but we have a
@@ -2812,14 +2703,12 @@ SetOffsetVacuumLimit(bool is_startup)
 		 */
 		oldestOffset = prevOldestOffset;
 		oldestOffsetKnown = true;
-		offsetStopLimit = prevOffsetStopLimit;
 	}
 
 	/* Install the computed values */
 	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
 	MultiXactState->oldestOffset = oldestOffset;
 	MultiXactState->oldestOffsetKnown = oldestOffsetKnown;
-	MultiXactState->offsetStopLimit = offsetStopLimit;
 	LWLockRelease(MultiXactGenLock);
 
 	/*
@@ -2829,54 +2718,6 @@ SetOffsetVacuumLimit(bool is_startup)
 		(nextOffset - oldestOffset > MULTIXACT_MEMBER_SAFE_THRESHOLD);
 }
 
-/*
- * Return whether adding "distance" to "start" would move past "boundary".
- *
- * We use this to determine whether the addition is "wrapping around" the
- * boundary point, hence the name.  The reason we don't want to use the regular
- * 2^31-modulo arithmetic here is that we want to be able to use the whole of
- * the 2^32-1 space here, allowing for more multixacts than would fit
- * otherwise.
- */
-static bool
-MultiXactOffsetWouldWrap(MultiXactOffset boundary, MultiXactOffset start,
-						 uint32 distance)
-{
-	MultiXactOffset finish;
-
-	/*
-	 * Note that offset number 0 is not used (see GetMultiXactIdMembers), so
-	 * if the addition wraps around the UINT_MAX boundary, skip that value.
-	 */
-	finish = start + distance;
-	if (finish < start)
-		finish++;
-
-	/*-----------------------------------------------------------------------
-	 * When the boundary is numerically greater than the starting point, any
-	 * value numerically between the two is not wrapped:
-	 *
-	 *	<----S----B---->
-	 *	[---)			 = F wrapped past B (and UINT_MAX)
-	 *		 [---)		 = F not wrapped
-	 *			  [----] = F wrapped past B
-	 *
-	 * When the boundary is numerically less than the starting point (i.e. the
-	 * UINT_MAX wraparound occurs somewhere in between) then all values in
-	 * between are wrapped:
-	 *
-	 *	<----B----S---->
-	 *	[---)			 = F not wrapped past B (but wrapped past UINT_MAX)
-	 *		 [---)		 = F wrapped past B (and UINT_MAX)
-	 *			  [----] = F not wrapped
-	 *-----------------------------------------------------------------------
-	 */
-	if (start < boundary)
-		return finish >= boundary || finish < start;
-	else
-		return finish >= boundary && finish < start;
-}
-
 /*
  * Find the starting offset of the given MultiXactId.
  *
@@ -2998,8 +2839,9 @@ MultiXactMemberFreezeThreshold(void)
 	 * we try to eliminate from the system is based on how far we are past
 	 * MULTIXACT_MEMBER_SAFE_THRESHOLD.
 	 */
-	fraction = (double) (members - MULTIXACT_MEMBER_SAFE_THRESHOLD) /
-		(MULTIXACT_MEMBER_DANGER_THRESHOLD - MULTIXACT_MEMBER_SAFE_THRESHOLD);
+	fraction = (double) (members - MULTIXACT_MEMBER_SAFE_THRESHOLD);
+	fraction /= (double) (MULTIXACT_MEMBER_DANGER_THRESHOLD - MULTIXACT_MEMBER_SAFE_THRESHOLD);
+
 	victim_multixacts = multixacts * fraction;
 
 	/* fraction could be > 1.0, but lowest possible freeze age is zero */
@@ -3345,7 +3187,7 @@ MultiXactIdPrecedesOrEquals(MultiXactId multi1, MultiXactId multi2)
 static bool
 MultiXactOffsetPrecedes(MultiXactOffset offset1, MultiXactOffset offset2)
 {
-	int32		diff = (int32) (offset1 - offset2);
+	int64		diff = (int64) (offset1 - offset2);
 
 	return (diff < 0);
 }
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index 985cd06802..1af2ce4b93 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -264,7 +264,7 @@ main(int argc, char *argv[])
 
 			case 'O':
 				errno = 0;
-				set_mxoff = strtoul(optarg, &endptr, 0);
+				set_mxoff = strtou64(optarg, &endptr, 0);
 				if (endptr == optarg || *endptr != '\0' || errno != 0)
 				{
 					pg_log_error("invalid argument for option %s", "-O");
diff --git a/src/bin/pg_resetwal/t/001_basic.pl b/src/bin/pg_resetwal/t/001_basic.pl
index 9829e48106..f8a8eef44d 100644
--- a/src/bin/pg_resetwal/t/001_basic.pl
+++ b/src/bin/pg_resetwal/t/001_basic.pl
@@ -206,7 +206,7 @@ push @cmd,
   sprintf("%d,%d", hex($files[0]) == 0 ? 3 : hex($files[0]), hex($files[-1]));
 
 @files = get_slru_files('pg_multixact/offsets');
-$mult = 32 * $blcksz / 4;
+$mult = 32 * $blcksz / 8;
 # -m argument is "new,old"
 push @cmd, '-m',
   sprintf("%d,%d",
diff --git a/src/include/access/multixact.h b/src/include/access/multixact.h
index 7ffd256c74..90583634ec 100644
--- a/src/include/access/multixact.h
+++ b/src/include/access/multixact.h
@@ -27,7 +27,7 @@
 
 #define MultiXactIdIsValid(multi) ((multi) != InvalidMultiXactId)
 
-#define MaxMultiXactOffset	((MultiXactOffset) 0xFFFFFFFF)
+#define MaxMultiXactOffset	UINT64CONST(0xFFFFFFFFFFFFFFFF)
 
 /*
  * Possible multixact lock modes ("status").  The first four modes are for
diff --git a/src/include/c.h b/src/include/c.h
index 55dec71a6d..556fffa333 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -652,7 +652,7 @@ typedef uint32 SubTransactionId;
 /* MultiXactId must be equivalent to TransactionId, to fit in t_xmax */
 typedef TransactionId MultiXactId;
 
-typedef uint32 MultiXactOffset;
+typedef uint64 MultiXactOffset;
 
 typedef uint32 CommandId;
 
-- 
2.43.0

#16Maxim Orlov
orlovmg@gmail.com
In reply to: Maxim Orlov (#15)
4 attachment(s)
Re: POC: make mxidoff 64 bits

After a bit of thought, I've realized that being conservative here is the
way to go.

We can reuse most of the existing logic. I mean, we can remove the offset
wraparound "error logic" and keep the "warning logic", but set the threshold
for the "warning logic" to a much higher value. For now, I chose 2^32-1. In
other words, the right logic here, in my view, is to trigger autovacuum
if the number of offsets (i.e. the difference nextOffset - oldestOffset)
exceeds 2^32-1. PFA patch set.
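
To make the rule above concrete, here is a minimal standalone sketch. It is
not code from the patch set: the names MXOFF_AUTOVAC_THRESHOLD and
offsets_need_autovacuum are hypothetical and used only for illustration, and
the only thing taken from the message is the 2^32-1 threshold applied to the
difference nextOffset - oldestOffset.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical name: the threshold proposed in the message, 2^32 - 1. */
#define MXOFF_AUTOVAC_THRESHOLD UINT64_C(0xFFFFFFFF)

/*
 * Hypothetical helper: return true when the number of member offsets in
 * use (next_offset - oldest_offset) exceeds the threshold, i.e. when an
 * autovacuum run should be requested.
 *
 * Assumption: offsets are 64-bit and next_offset >= oldest_offset, so a
 * plain unsigned subtraction gives the count without any modulo tricks.
 */
static bool
offsets_need_autovacuum(uint64_t next_offset, uint64_t oldest_offset)
{
	return (next_offset - oldest_offset) > MXOFF_AUTOVAC_THRESHOLD;
}
```

The comparison is strict, so a difference of exactly 2^32-1 does not yet
trigger anything; only a larger one does.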

--
Best regards,
Maxim Orlov.

Attachments:

v5-0002-Use-64-bit-multixact-offsets.patch (application/octet-stream)
From cca60a5e487090252dddd515c716272786841c5e Mon Sep 17 00:00:00 2001
From: Maxim Orlov <m.orlov@postgrespro.ru>
Date: Wed, 6 Mar 2024 11:11:33 +0300
Subject: [PATCH v5 2/4] Use 64-bit multixact offsets.

Author: Maxim Orlov <orlovmg@gmail.com>
---
 src/backend/access/transam/multixact.c | 172 +------------------------
 src/bin/pg_resetwal/pg_resetwal.c      |   2 +-
 src/bin/pg_resetwal/t/001_basic.pl     |   2 +-
 src/include/access/multixact.h         |   2 +-
 src/include/c.h                        |   2 +-
 5 files changed, 11 insertions(+), 169 deletions(-)

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index ab90912ed3..c51e03e832 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -96,14 +96,6 @@
 /*
  * Defines for MultiXactOffset page sizes.  A page is the same BLCKSZ as is
  * used everywhere else in Postgres.
- *
- * Note: because MultiXactOffsets are 32 bits and wrap around at 0xFFFFFFFF,
- * MultiXact page numbering also wraps around at
- * 0xFFFFFFFF/MULTIXACT_OFFSETS_PER_PAGE, and segment numbering at
- * 0xFFFFFFFF/MULTIXACT_OFFSETS_PER_PAGE/SLRU_PAGES_PER_SEGMENT.  We need
- * take no explicit notice of that fact in this module, except when comparing
- * segment and page numbers in TruncateMultiXact (see
- * MultiXactOffsetPagePrecedes).
  */
 
 /* We need four bytes per offset */
@@ -272,9 +264,6 @@ typedef struct MultiXactStateData
 	MultiXactId multiStopLimit;
 	MultiXactId multiWrapLimit;
 
-	/* support for members anti-wraparound measures */
-	MultiXactOffset offsetStopLimit;	/* known if oldestOffsetKnown */
-
 	/*
 	 * This is used to sleep until a multixact offset is written when we want
 	 * to create the next one.
@@ -409,8 +398,6 @@ static bool MultiXactOffsetPrecedes(MultiXactOffset offset1,
 									MultiXactOffset offset2);
 static void ExtendMultiXactOffset(MultiXactId multi);
 static void ExtendMultiXactMember(MultiXactOffset offset, int nmembers);
-static bool MultiXactOffsetWouldWrap(MultiXactOffset boundary,
-									 MultiXactOffset start, uint32 distance);
 static bool SetOffsetVacuumLimit(bool is_startup);
 static bool find_multixact_start(MultiXactId multi, MultiXactOffset *result);
 static void WriteMZeroPageXlogRec(int64 pageno, uint8 info);
@@ -1164,78 +1151,6 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 	else
 		*offset = nextOffset;
 
-	/*----------
-	 * Protect against overrun of the members space as well, with the
-	 * following rules:
-	 *
-	 * If we're past offsetStopLimit, refuse to generate more multis.
-	 * If we're close to offsetStopLimit, emit a warning.
-	 *
-	 * Arbitrarily, we start emitting warnings when we're 20 segments or less
-	 * from offsetStopLimit.
-	 *
-	 * Note we haven't updated the shared state yet, so if we fail at this
-	 * point, the multixact ID we grabbed can still be used by the next guy.
-	 *
-	 * Note that there is no point in forcing autovacuum runs here: the
-	 * multixact freeze settings would have to be reduced for that to have any
-	 * effect.
-	 *----------
-	 */
-#define OFFSET_WARN_SEGMENTS	20
-	if (MultiXactState->oldestOffsetKnown &&
-		MultiXactOffsetWouldWrap(MultiXactState->offsetStopLimit, nextOffset,
-								 nmembers))
-	{
-		/* see comment in the corresponding offsets wraparound case */
-		SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER);
-
-		ereport(ERROR,
-				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg("multixact \"members\" limit exceeded"),
-				 errdetail_plural("This command would create a multixact with %u members, but the remaining space is only enough for %u member.",
-								  "This command would create a multixact with %u members, but the remaining space is only enough for %u members.",
-								  MultiXactState->offsetStopLimit - nextOffset - 1,
-								  nmembers,
-								  MultiXactState->offsetStopLimit - nextOffset - 1),
-				 errhint("Execute a database-wide VACUUM in database with OID %u with reduced \"vacuum_multixact_freeze_min_age\" and \"vacuum_multixact_freeze_table_age\" settings.",
-						 MultiXactState->oldestMultiXactDB)));
-	}
-
-	/*
-	 * Check whether we should kick autovacuum into action, to prevent members
-	 * wraparound. NB we use a much larger window to trigger autovacuum than
-	 * just the warning limit. The warning is just a measure of last resort -
-	 * this is in line with GetNewTransactionId's behaviour.
-	 */
-	if (!MultiXactState->oldestOffsetKnown ||
-		(MultiXactState->nextOffset - MultiXactState->oldestOffset
-		 > MULTIXACT_MEMBER_SAFE_THRESHOLD))
-	{
-		/*
-		 * To avoid swamping the postmaster with signals, we issue the autovac
-		 * request only when crossing a segment boundary. With default
-		 * compilation settings that's roughly after 50k members.  This still
-		 * gives plenty of chances before we get into real trouble.
-		 */
-		if ((MXOffsetToMemberPage(nextOffset) / SLRU_PAGES_PER_SEGMENT) !=
-			(MXOffsetToMemberPage(nextOffset + nmembers) / SLRU_PAGES_PER_SEGMENT))
-			SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER);
-	}
-
-	if (MultiXactState->oldestOffsetKnown &&
-		MultiXactOffsetWouldWrap(MultiXactState->offsetStopLimit,
-								 nextOffset,
-								 nmembers + MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT * OFFSET_WARN_SEGMENTS))
-		ereport(WARNING,
-				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg_plural("database with OID %u must be vacuumed before %d more multixact member is used",
-							   "database with OID %u must be vacuumed before %d more multixact members are used",
-							   MultiXactState->offsetStopLimit - nextOffset + nmembers,
-							   MultiXactState->oldestMultiXactDB,
-							   MultiXactState->offsetStopLimit - nextOffset + nmembers),
-				 errhint("Execute a database-wide VACUUM in that database with reduced \"vacuum_multixact_freeze_min_age\" and \"vacuum_multixact_freeze_table_age\" settings.")));
-
 	ExtendMultiXactMember(nextOffset, nmembers);
 
 	/*
@@ -1976,7 +1891,7 @@ MultiXactShmemInit(void)
 				  "pg_multixact/offsets", LWTRANCHE_MULTIXACTOFFSET_BUFFER,
 				  LWTRANCHE_MULTIXACTOFFSET_SLRU,
 				  SYNC_HANDLER_MULTIXACT_OFFSET,
-				  false);
+				  true);
 	SlruPagePrecedesUnitTests(MultiXactOffsetCtl, MULTIXACT_OFFSETS_PER_PAGE);
 	SimpleLruInit(MultiXactMemberCtl,
 				  "multixact_member", multixact_member_buffers, 0,
@@ -2721,8 +2636,6 @@ SetOffsetVacuumLimit(bool is_startup)
 	MultiXactOffset nextOffset;
 	bool		oldestOffsetKnown = false;
 	bool		prevOldestOffsetKnown;
-	MultiXactOffset offsetStopLimit = 0;
-	MultiXactOffset prevOffsetStopLimit;
 
 	/*
 	 * NB: Have to prevent concurrent truncation, we might otherwise try to
@@ -2737,7 +2650,6 @@ SetOffsetVacuumLimit(bool is_startup)
 	nextOffset = MultiXactState->nextOffset;
 	prevOldestOffsetKnown = MultiXactState->oldestOffsetKnown;
 	prevOldestOffset = MultiXactState->oldestOffset;
-	prevOffsetStopLimit = MultiXactState->offsetStopLimit;
 	Assert(MultiXactState->finishedStartup);
 	LWLockRelease(MultiXactGenLock);
 
@@ -2768,11 +2680,7 @@ SetOffsetVacuumLimit(bool is_startup)
 		oldestOffsetKnown =
 			find_multixact_start(oldestMultiXactId, &oldestOffset);
 
-		if (oldestOffsetKnown)
-			ereport(DEBUG1,
-					(errmsg_internal("oldest MultiXactId member is at offset %u",
-									 oldestOffset)));
-		else
+		if (!oldestOffsetKnown)
 			ereport(LOG,
 					(errmsg("MultiXact member wraparound protections are disabled because oldest checkpointed MultiXact %u does not exist on disk",
 							oldestMultiXactId)));
@@ -2785,24 +2693,7 @@ SetOffsetVacuumLimit(bool is_startup)
 	 * overrun of old data in the members SLRU area. We can only do so if the
 	 * oldest offset is known though.
 	 */
-	if (oldestOffsetKnown)
-	{
-		/* move back to start of the corresponding segment */
-		offsetStopLimit = oldestOffset - (oldestOffset %
-										  (MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT));
-
-		/* always leave one segment before the wraparound point */
-		offsetStopLimit -= (MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT);
-
-		if (!prevOldestOffsetKnown && !is_startup)
-			ereport(LOG,
-					(errmsg("MultiXact member wraparound protections are now enabled")));
-
-		ereport(DEBUG1,
-				(errmsg_internal("MultiXact member stop limit is now %u based on MultiXact %u",
-								 offsetStopLimit, oldestMultiXactId)));
-	}
-	else if (prevOldestOffsetKnown)
+	if (prevOldestOffsetKnown)
 	{
 		/*
 		 * If we failed to get the oldest offset this time, but we have a
@@ -2812,14 +2703,12 @@ SetOffsetVacuumLimit(bool is_startup)
 		 */
 		oldestOffset = prevOldestOffset;
 		oldestOffsetKnown = true;
-		offsetStopLimit = prevOffsetStopLimit;
 	}
 
 	/* Install the computed values */
 	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
 	MultiXactState->oldestOffset = oldestOffset;
 	MultiXactState->oldestOffsetKnown = oldestOffsetKnown;
-	MultiXactState->offsetStopLimit = offsetStopLimit;
 	LWLockRelease(MultiXactGenLock);
 
 	/*
@@ -2829,54 +2718,6 @@ SetOffsetVacuumLimit(bool is_startup)
 		(nextOffset - oldestOffset > MULTIXACT_MEMBER_SAFE_THRESHOLD);
 }
 
-/*
- * Return whether adding "distance" to "start" would move past "boundary".
- *
- * We use this to determine whether the addition is "wrapping around" the
- * boundary point, hence the name.  The reason we don't want to use the regular
- * 2^31-modulo arithmetic here is that we want to be able to use the whole of
- * the 2^32-1 space here, allowing for more multixacts than would fit
- * otherwise.
- */
-static bool
-MultiXactOffsetWouldWrap(MultiXactOffset boundary, MultiXactOffset start,
-						 uint32 distance)
-{
-	MultiXactOffset finish;
-
-	/*
-	 * Note that offset number 0 is not used (see GetMultiXactIdMembers), so
-	 * if the addition wraps around the UINT_MAX boundary, skip that value.
-	 */
-	finish = start + distance;
-	if (finish < start)
-		finish++;
-
-	/*-----------------------------------------------------------------------
-	 * When the boundary is numerically greater than the starting point, any
-	 * value numerically between the two is not wrapped:
-	 *
-	 *	<----S----B---->
-	 *	[---)			 = F wrapped past B (and UINT_MAX)
-	 *		 [---)		 = F not wrapped
-	 *			  [----] = F wrapped past B
-	 *
-	 * When the boundary is numerically less than the starting point (i.e. the
-	 * UINT_MAX wraparound occurs somewhere in between) then all values in
-	 * between are wrapped:
-	 *
-	 *	<----B----S---->
-	 *	[---)			 = F not wrapped past B (but wrapped past UINT_MAX)
-	 *		 [---)		 = F wrapped past B (and UINT_MAX)
-	 *			  [----] = F not wrapped
-	 *-----------------------------------------------------------------------
-	 */
-	if (start < boundary)
-		return finish >= boundary || finish < start;
-	else
-		return finish >= boundary && finish < start;
-}
-
 /*
  * Find the starting offset of the given MultiXactId.
  *
@@ -2998,8 +2839,9 @@ MultiXactMemberFreezeThreshold(void)
 	 * we try to eliminate from the system is based on how far we are past
 	 * MULTIXACT_MEMBER_SAFE_THRESHOLD.
 	 */
-	fraction = (double) (members - MULTIXACT_MEMBER_SAFE_THRESHOLD) /
-		(MULTIXACT_MEMBER_DANGER_THRESHOLD - MULTIXACT_MEMBER_SAFE_THRESHOLD);
+	fraction = (double) (members - MULTIXACT_MEMBER_SAFE_THRESHOLD);
+	fraction /= (double) (MULTIXACT_MEMBER_DANGER_THRESHOLD - MULTIXACT_MEMBER_SAFE_THRESHOLD);
+
 	victim_multixacts = multixacts * fraction;
 
 	/* fraction could be > 1.0, but lowest possible freeze age is zero */
@@ -3345,7 +3187,7 @@ MultiXactIdPrecedesOrEquals(MultiXactId multi1, MultiXactId multi2)
 static bool
 MultiXactOffsetPrecedes(MultiXactOffset offset1, MultiXactOffset offset2)
 {
-	int32		diff = (int32) (offset1 - offset2);
+	int64		diff = (int64) (offset1 - offset2);
 
 	return (diff < 0);
 }
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index 985cd06802..1af2ce4b93 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -264,7 +264,7 @@ main(int argc, char *argv[])
 
 			case 'O':
 				errno = 0;
-				set_mxoff = strtoul(optarg, &endptr, 0);
+				set_mxoff = strtou64(optarg, &endptr, 0);
 				if (endptr == optarg || *endptr != '\0' || errno != 0)
 				{
 					pg_log_error("invalid argument for option %s", "-O");
diff --git a/src/bin/pg_resetwal/t/001_basic.pl b/src/bin/pg_resetwal/t/001_basic.pl
index 9829e48106..f8a8eef44d 100644
--- a/src/bin/pg_resetwal/t/001_basic.pl
+++ b/src/bin/pg_resetwal/t/001_basic.pl
@@ -206,7 +206,7 @@ push @cmd,
   sprintf("%d,%d", hex($files[0]) == 0 ? 3 : hex($files[0]), hex($files[-1]));
 
 @files = get_slru_files('pg_multixact/offsets');
-$mult = 32 * $blcksz / 4;
+$mult = 32 * $blcksz / 8;
 # -m argument is "new,old"
 push @cmd, '-m',
   sprintf("%d,%d",
diff --git a/src/include/access/multixact.h b/src/include/access/multixact.h
index 7ffd256c74..90583634ec 100644
--- a/src/include/access/multixact.h
+++ b/src/include/access/multixact.h
@@ -27,7 +27,7 @@
 
 #define MultiXactIdIsValid(multi) ((multi) != InvalidMultiXactId)
 
-#define MaxMultiXactOffset	((MultiXactOffset) 0xFFFFFFFF)
+#define MaxMultiXactOffset	UINT64CONST(0xFFFFFFFFFFFFFFFF)
 
 /*
  * Possible multixact lock modes ("status").  The first four modes are for
diff --git a/src/include/c.h b/src/include/c.h
index 55dec71a6d..556fffa333 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -652,7 +652,7 @@ typedef uint32 SubTransactionId;
 /* MultiXactId must be equivalent to TransactionId, to fit in t_xmax */
 typedef TransactionId MultiXactId;
 
-typedef uint32 MultiXactOffset;
+typedef uint64 MultiXactOffset;
 
 typedef uint32 CommandId;
 
-- 
2.43.0

v5-0003-Make-pg_upgrade-convert-multixact-offsets.patch (application/octet-stream)
From 6da93c5db43d5f8c340cc45e47bc73752f16c72c Mon Sep 17 00:00:00 2001
From: Maxim Orlov <m.orlov@postgrespro.ru>
Date: Tue, 13 Aug 2024 14:44:50 +0300
Subject: [PATCH v5 3/4] Make pg_upgrade convert multixact offsets.

Author: Maxim Orlov <orlovmg@gmail.com>
---
 src/bin/pg_upgrade/Makefile     |   1 +
 src/bin/pg_upgrade/meson.build  |   1 +
 src/bin/pg_upgrade/pg_upgrade.c |  29 ++-
 src/bin/pg_upgrade/pg_upgrade.h |  13 +-
 src/bin/pg_upgrade/segresize.c  | 350 ++++++++++++++++++++++++++++++++
 5 files changed, 390 insertions(+), 4 deletions(-)
 create mode 100644 src/bin/pg_upgrade/segresize.c

diff --git a/src/bin/pg_upgrade/Makefile b/src/bin/pg_upgrade/Makefile
index f83d2b5d30..70908d63a3 100644
--- a/src/bin/pg_upgrade/Makefile
+++ b/src/bin/pg_upgrade/Makefile
@@ -21,6 +21,7 @@ OBJS = \
 	info.o \
 	option.o \
 	parallel.o \
+	segresize.o \
 	pg_upgrade.o \
 	relfilenumber.o \
 	server.o \
diff --git a/src/bin/pg_upgrade/meson.build b/src/bin/pg_upgrade/meson.build
index 3d88419674..16f898ba14 100644
--- a/src/bin/pg_upgrade/meson.build
+++ b/src/bin/pg_upgrade/meson.build
@@ -10,6 +10,7 @@ pg_upgrade_sources = files(
   'info.c',
   'option.c',
   'parallel.c',
+  'segresize.c',
   'pg_upgrade.c',
   'relfilenumber.c',
   'server.c',
diff --git a/src/bin/pg_upgrade/pg_upgrade.c b/src/bin/pg_upgrade/pg_upgrade.c
index 663235816f..d9d8d0ea78 100644
--- a/src/bin/pg_upgrade/pg_upgrade.c
+++ b/src/bin/pg_upgrade/pg_upgrade.c
@@ -750,7 +750,30 @@ copy_xact_xlog_xid(void)
 	if (old_cluster.controldata.cat_ver >= MULTIXACT_FORMATCHANGE_CAT_VER &&
 		new_cluster.controldata.cat_ver >= MULTIXACT_FORMATCHANGE_CAT_VER)
 	{
-		copy_subdir_files("pg_multixact/offsets", "pg_multixact/offsets");
+		/*
+		 * If the old server is before the MULTIXACTOFFSET_FORMATCHANGE_CAT_VER
+		 * it must have 32-bit multixid offsets, thus it should be converted.
+		 */
+		if (old_cluster.controldata.cat_ver < MULTIXACTOFFSET_FORMATCHANGE_CAT_VER &&
+			new_cluster.controldata.cat_ver >= MULTIXACTOFFSET_FORMATCHANGE_CAT_VER)
+		{
+			uint64	oldest_offset = convert_multixact_offsets();
+
+			if (oldest_offset)
+			{
+				uint64	next_offset = old_cluster.controldata.chkpnt_nxtmxoff;
+
+				/* Handle possible wraparound. */
+				if (next_offset < oldest_offset)
+					next_offset += ((uint64) 1 << 32) - 1;
+
+				next_offset -= oldest_offset - 1;
+				old_cluster.controldata.chkpnt_nxtmxoff = next_offset;
+			}
+		}
+		else
+			copy_subdir_files("pg_multixact/offsets", "pg_multixact/offsets");
+
 		copy_subdir_files("pg_multixact/members", "pg_multixact/members");
 
 		prep_status("Setting next multixact ID and offset for new cluster");
@@ -760,9 +783,9 @@ copy_xact_xlog_xid(void)
 		 * counters here and the oldest multi present on system.
 		 */
 		exec_prog(UTILITY_LOG_FILE, NULL, true, true,
-				  "\"%s/pg_resetwal\" -O %u -m %u,%u \"%s\"",
+				  "\"%s/pg_resetwal\" -O %llu -m %u,%u \"%s\"",
 				  new_cluster.bindir,
-				  old_cluster.controldata.chkpnt_nxtmxoff,
+				  (unsigned long long) old_cluster.controldata.chkpnt_nxtmxoff,
 				  old_cluster.controldata.chkpnt_nxtmulti,
 				  old_cluster.controldata.chkpnt_oldstMulti,
 				  new_cluster.pgdata);
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index 53f693c2d4..4d65e4125e 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -114,6 +114,13 @@ extern char *output_files[];
  */
 #define MULTIXACT_FORMATCHANGE_CAT_VER 201301231
 
+/*
+ * Switch from 32-bit to 64-bit for multixid offsets.
+ *
+ * XXX: should be changed to the actual CATALOG_VERSION_NO on commit.
+ */
+#define MULTIXACTOFFSET_FORMATCHANGE_CAT_VER 202409041
+
 /*
  * large object chunk size added to pg_controldata,
  * commit 5f93c37805e7485488480916b4585e098d3cc883
@@ -230,7 +237,7 @@ typedef struct
 	uint32		chkpnt_nxtepoch;
 	uint32		chkpnt_nxtoid;
 	uint32		chkpnt_nxtmulti;
-	uint32		chkpnt_nxtmxoff;
+	uint64		chkpnt_nxtmxoff;
 	uint32		chkpnt_oldstMulti;
 	uint32		chkpnt_oldstxid;
 	uint32		align;
@@ -515,3 +522,7 @@ typedef struct
 	FILE	   *file;
 	char		path[MAXPGPATH];
 } UpgradeTaskReport;
+
+/* segresize.c */
+
+uint64		convert_multixact_offsets(void);
diff --git a/src/bin/pg_upgrade/segresize.c b/src/bin/pg_upgrade/segresize.c
new file mode 100644
index 0000000000..e47c0a2407
--- /dev/null
+++ b/src/bin/pg_upgrade/segresize.c
@@ -0,0 +1,350 @@
+/*
+ *	segresize.c
+ *
+ *	SLRU segment resize utility
+ *
+ *	Copyright (c) 2024, PostgreSQL Global Development Group
+ *	src/bin/pg_upgrade/segresize.c
+ */
+
+#include "postgres_fe.h"
+
+#include "pg_upgrade.h"
+#include "access/multixact.h"
+
+/* See slru.h */
+#define SLRU_PAGES_PER_SEGMENT		32
+
+/*
+ * Some kind of iterator associated with a particular SLRU segment.  The idea is
+ * to specify the segment and page number and then move through the pages.
+ */
+typedef struct SlruSegState
+{
+	char	   *dir;
+	char	   *fn;
+	FILE	   *file;
+	int64		segno;
+	uint64		pageno;
+	bool		leading_gap;
+	bool		long_segment_names;
+} SlruSegState;
+
+/*
+ * Get SLRU segment file name from state.
+ *
+ * NOTE: this function should mirror SlruFileName call.
+ */
+static inline char *
+SlruFileName(SlruSegState *state)
+{
+	if (state->long_segment_names)
+	{
+		Assert(state->segno >= 0 &&
+			   state->segno <= INT64CONST(0xFFFFFFFFFFFFFFF));
+		return psprintf("%s/%015llX", state->dir, (long long) state->segno);
+	}
+	else
+	{
+		Assert(state->segno >= 0 &&
+			   state->segno <= INT64CONST(0xFFFFFF));
+		return psprintf("%s/%04X", state->dir, (unsigned int) state->segno);
+	}
+}
+
+/*
+ * Create SLRU segment file.
+ */
+static void
+create_segment(SlruSegState *state)
+{
+	Assert(state->fn == NULL);
+	Assert(state->file == NULL);
+
+	state->fn = SlruFileName(state);
+	state->file = fopen(state->fn, "wb");
+	if (!state->file)
+		pg_fatal("could not create file \"%s\": %m", state->fn);
+}
+
+/*
+ * Open existing SLRU segment file.
+ */
+static void
+open_segment(SlruSegState *state)
+{
+	Assert(state->fn == NULL);
+	Assert(state->file == NULL);
+
+	state->fn = SlruFileName(state);
+	state->file = fopen(state->fn, "rb");
+	if (!state->file)
+		pg_fatal("could not open file \"%s\": %m", state->fn);
+}
+
+/*
+ * Close SLRU segment file.
+ */
+static void
+close_segment(SlruSegState *state)
+{
+	if (state->file)
+	{
+		fclose(state->file);
+		state->file = NULL;
+	}
+
+	if (state->fn)
+	{
+		pfree(state->fn);
+		state->fn = NULL;
+	}
+}
+
+/*
+ * Read next page from the old 32-bit offset segment file.
+ */
+static int
+read_old_segment_page(SlruSegState *state, void *buf, bool *empty)
+{
+	int		len;
+
+	/* Open next segment file, if needed. */
+	if (!state->fn)
+	{
+		if (!state->segno)
+			state->leading_gap = true;
+
+		open_segment(state);
+
+		/* Set position to the needed page. */
+		if (state->pageno > 0 &&
+			fseek(state->file, state->pageno * BLCKSZ, SEEK_SET))
+		{
+			close_segment(state);
+		}
+	}
+
+	if (state->file)
+	{
+		/* Segment file does exist, read a page from it. */
+		state->leading_gap = false;
+
+		len = fread(buf, sizeof(char), BLCKSZ, state->file);
+
+		/* Are we done or was there an error? */
+		if (len <= 0)
+		{
+			if (ferror(state->file))
+				pg_fatal("error reading file \"%s\": %m", state->fn);
+
+			if (feof(state->file))
+			{
+				*empty = true;
+				len = -1;
+
+				close_segment(state);
+			}
+		}
+		else
+			*empty = false;
+	}
+	else if (!state->leading_gap)
+	{
+		/* We reached the last segment. */
+		len = -1;
+		*empty = true;
+	}
+	else
+	{
+		/* Skip the first few segments if they were frozen and removed. */
+		len = BLCKSZ;
+		*empty = true;
+	}
+
+	if (++state->pageno >= SLRU_PAGES_PER_SEGMENT)
+	{
+		/* Start a new segment. */
+		state->segno++;
+		state->pageno = 0;
+
+		close_segment(state);
+	}
+
+	return len;
+}
+
+/*
+ * Write next page to the new 64-bit offset segment file.
+ */
+static void
+write_new_segment_page(SlruSegState *state, void *buf)
+{
+	/*
+	 * Create a new segment file if we have not yet.  Creation is
+	 * postponed until the first non-empty page is found.  This helps
+	 * not to create completely empty segments.
+	 */
+	if (!state->file)
+	{
+		create_segment(state);
+
+		/* Write zeroes to the previously skipped prefix. */
+		if (state->pageno > 0)
+		{
+			char		zerobuf[BLCKSZ] = {0};
+
+			for (int64 i = 0; i < state->pageno; i++)
+			{
+				if (fwrite(zerobuf, sizeof(char), BLCKSZ, state->file) != BLCKSZ)
+					pg_fatal("could not write file \"%s\": %m", state->fn);
+			}
+		}
+	}
+
+	/* Write page to the new segment (if it was created). */
+	if (state->file)
+	{
+		if (fwrite(buf, sizeof(char), BLCKSZ, state->file) != BLCKSZ)
+			pg_fatal("could not write file \"%s\": %m", state->fn);
+	}
+
+	state->pageno++;
+
+	/*
+	 * Did we reach the maximum page number?  Then close segment file
+	 * and create a new one on the next iteration.
+	 */
+	if (state->pageno >= SLRU_PAGES_PER_SEGMENT)
+	{
+		state->segno++;
+		state->pageno = 0;
+		close_segment(state);
+	}
+}
+
+/*
+ * Convert pg_multixact/offsets segments and return oldest multi offset.
+ */
+uint64
+convert_multixact_offsets(void)
+{
+	/* See multixact.c */
+#define MULTIXACT_OFFSETS_PER_PAGE_OLD	(BLCKSZ / sizeof(uint32))
+#define MULTIXACT_OFFSETS_PER_PAGE		(BLCKSZ / sizeof(MultiXactOffset))
+
+	SlruSegState	oldseg = {0},
+					newseg = {0};
+	uint32			oldbuf[MULTIXACT_OFFSETS_PER_PAGE_OLD] = {0};
+	MultiXactOffset	newbuf[MULTIXACT_OFFSETS_PER_PAGE] = {0};
+	/*
+	 * It is much easier to deal with multi wraparound in 64-bit format.  Thus
+	 * we use 64 bits for multi-transactions, although they remain 32 bits.
+	 */
+	uint64			oldest_multi = old_cluster.controldata.chkpnt_oldstMulti,
+					next_multi = old_cluster.controldata.chkpnt_nxtmulti,
+					multi,
+					old_entry,
+					new_entry;
+	bool			found = false;
+	uint64			oldest_offset = 0;
+
+	prep_status("Converting pg_multixact/offsets to 64-bit");
+
+	oldseg.pageno = oldest_multi / MULTIXACT_OFFSETS_PER_PAGE_OLD;
+	oldseg.segno = oldseg.pageno / SLRU_PAGES_PER_SEGMENT;
+	oldseg.pageno %= SLRU_PAGES_PER_SEGMENT;
+	oldseg.dir = psprintf("%s/pg_multixact/offsets", old_cluster.pgdata);
+	oldseg.long_segment_names = false;		/* old format XXXX */
+
+	newseg.pageno = oldest_multi / MULTIXACT_OFFSETS_PER_PAGE;
+	newseg.segno = newseg.pageno / SLRU_PAGES_PER_SEGMENT;
+	newseg.pageno %= SLRU_PAGES_PER_SEGMENT;
+	newseg.dir = psprintf("%s/pg_multixact/offsets", new_cluster.pgdata);
+	newseg.long_segment_names = true;
+
+	old_entry = oldest_multi % MULTIXACT_OFFSETS_PER_PAGE_OLD;
+	new_entry = oldest_multi % MULTIXACT_OFFSETS_PER_PAGE;
+
+	if (next_multi < oldest_multi)
+		next_multi += (uint64) 1 << 32;		/* wraparound */
+
+	for (multi = oldest_multi; multi < next_multi; old_entry = 0)
+	{
+		int			oldlen;
+		bool		empty;
+
+		/* Handle possible segment wraparound. */
+		if (oldseg.segno > MaxMultiXactId /
+								MULTIXACT_OFFSETS_PER_PAGE_OLD /
+								SLRU_PAGES_PER_SEGMENT)
+			oldseg.segno = 0;
+
+		/* Read old offset segment. */
+		oldlen = read_old_segment_page(&oldseg, oldbuf, &empty);
+		if (oldlen <= 0 || empty)
+			pg_fatal("cannot read page %llu from file \"%s\": %m",
+					 (unsigned long long) oldseg.pageno, oldseg.fn);
+
+		/* Fill possible gap. */
+		if (oldlen < BLCKSZ)
+			memset((char *) oldbuf + oldlen, 0, BLCKSZ - oldlen);
+
+		/* Save oldest multi offset */
+		if (!found)
+		{
+			oldest_offset = oldbuf[old_entry];
+			found = true;
+		}
+
+		/* ... skip wrapped-around invalid multi */
+		if (multi == (uint64) 1 << 32)
+		{
+			Assert(oldseg.segno == 0);
+			Assert(oldseg.pageno == 1);
+			Assert(old_entry == 0);
+
+			multi += FirstMultiXactId;
+			old_entry = FirstMultiXactId;
+		}
+
+		/* Copy entries to the new page. */
+		for (; multi < next_multi && old_entry < MULTIXACT_OFFSETS_PER_PAGE_OLD;
+			 multi++, old_entry++)
+		{
+			MultiXactOffset offset = oldbuf[old_entry];
+
+			/* Handle possible offset wraparound. */
+			if (offset < oldest_offset)
+				offset += ((uint64) 1 << 32) - 1;
+
+			/* Subtract oldest_offset, so new offsets will start from 1. */
+			newbuf[new_entry++] = offset - oldest_offset + 1;
+			if (new_entry >= MULTIXACT_OFFSETS_PER_PAGE)
+			{
+				/* Write a new page. */
+				write_new_segment_page(&newseg, newbuf);
+				new_entry = 0;
+			}
+		}
+	}
+
+	/* Write the last incomplete page. */
+	if (new_entry > 0 || oldest_multi == next_multi)
+	{
+		memset(&newbuf[new_entry], 0,
+			   sizeof(newbuf[0]) * (MULTIXACT_OFFSETS_PER_PAGE - new_entry));
+		write_new_segment_page(&newseg, newbuf);
+	}
+
+	/* Release resources. */
+	close_segment(&oldseg);
+	close_segment(&newseg);
+
+	pfree(oldseg.dir);
+	pfree(newseg.dir);
+
+	check_ok();
+
+	return oldest_offset;
+}
-- 
2.43.0

v5-0004-Get-rid-of-MultiXactMemberFreezeThreshold-call.patch (application/octet-stream)
From 95c4613092e4884fb2162624c4fb1dcf5f94c6f6 Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Wed, 23 Oct 2024 18:23:39 +0300
Subject: [PATCH v5 4/4] Get rid of MultiXactMemberFreezeThreshold call.

Since MaxMultiXactOffset is now UINT64_MAX, the MULTIXACT_MEMBER_SAFE_THRESHOLD
and MULTIXACT_MEMBER_DANGER_THRESHOLD values are no longer meaningful. Thus,
MultiXactMemberFreezeThreshold is no longer needed either.

Instead, switch to the MULTIXACT_MEMBER_AUTOVAC_THRESHOLD (2^32 - 1) members
threshold. It is used to determine whether we need to force autovacuum.

Author: Maxim Orlov <orlovmg@gmail.com>
---
 src/backend/access/transam/multixact.c | 116 +++----------------------
 src/backend/commands/vacuum.c          |   2 +-
 src/backend/postmaster/autovacuum.c    |   4 +-
 src/include/access/multixact.h         |   1 -
 4 files changed, 14 insertions(+), 109 deletions(-)

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index c51e03e832..7f12217309 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -204,10 +204,13 @@ MXOffsetToMemberOffset(MultiXactOffset offset)
 		member_in_group * sizeof(TransactionId);
 }
 
-/* Multixact members wraparound thresholds. */
-#define MULTIXACT_MEMBER_SAFE_THRESHOLD		(MaxMultiXactOffset / 2)
-#define MULTIXACT_MEMBER_DANGER_THRESHOLD	\
-	(MaxMultiXactOffset - MaxMultiXactOffset / 4)
+/*
+ * Multixact members warning threshold.
+ *
+ * If the difference between nextOffset and oldestOffset exceeds this value,
+ * we trigger autovacuum in order to release the disk space if possible.
+ */
+#define MULTIXACT_MEMBER_AUTOVAC_THRESHOLD		UINT64CONST(0xFFFFFFFF)
 
 static inline MultiXactId
 PreviousMultiXactId(MultiXactId multi)
@@ -2616,15 +2619,13 @@ GetOldestMultiXactId(void)
 }
 
 /*
- * Determine how aggressively we need to vacuum in order to prevent member
- * wraparound.
+ * Determine whether a vacuum is needed to free multixact member space.
  *
  * To do so determine what's the oldest member offset and install the limit
  * info in MultiXactState, where it can be used to prevent overrun of old data
  * in the members SLRU area.
  *
- * The return value is true if emergency autovacuum is required and false
- * otherwise.
+ * The return value is true if autovacuum is required and false otherwise.
  */
 static bool
 SetOffsetVacuumLimit(bool is_startup)
@@ -2712,10 +2713,10 @@ SetOffsetVacuumLimit(bool is_startup)
 	LWLockRelease(MultiXactGenLock);
 
 	/*
-	 * Do we need an emergency autovacuum?	If we're not sure, assume yes.
+	 * Do we need autovacuum?	If we're not sure, assume yes.
 	 */
 	return !oldestOffsetKnown ||
-		(nextOffset - oldestOffset > MULTIXACT_MEMBER_SAFE_THRESHOLD);
+		(nextOffset - oldestOffset > MULTIXACT_MEMBER_AUTOVAC_THRESHOLD);
 }
 
 /*
@@ -2761,101 +2762,6 @@ find_multixact_start(MultiXactId multi, MultiXactOffset *result)
 	return true;
 }
 
-/*
- * Determine how many multixacts, and how many multixact members, currently
- * exist.  Return false if unable to determine.
- */
-static bool
-ReadMultiXactCounts(uint32 *multixacts, MultiXactOffset *members)
-{
-	MultiXactOffset nextOffset;
-	MultiXactOffset oldestOffset;
-	MultiXactId oldestMultiXactId;
-	MultiXactId nextMultiXactId;
-	bool		oldestOffsetKnown;
-
-	LWLockAcquire(MultiXactGenLock, LW_SHARED);
-	nextOffset = MultiXactState->nextOffset;
-	oldestMultiXactId = MultiXactState->oldestMultiXactId;
-	nextMultiXactId = MultiXactState->nextMXact;
-	oldestOffset = MultiXactState->oldestOffset;
-	oldestOffsetKnown = MultiXactState->oldestOffsetKnown;
-	LWLockRelease(MultiXactGenLock);
-
-	if (!oldestOffsetKnown)
-		return false;
-
-	*members = nextOffset - oldestOffset;
-	*multixacts = nextMultiXactId - oldestMultiXactId;
-	return true;
-}
-
-/*
- * Multixact members can be removed once the multixacts that refer to them
- * are older than every datminmxid.  autovacuum_multixact_freeze_max_age and
- * vacuum_multixact_freeze_table_age work together to make sure we never have
- * too many multixacts; we hope that, at least under normal circumstances,
- * this will also be sufficient to keep us from using too many offsets.
- * However, if the average multixact has many members, we might exhaust the
- * members space while still using few enough members that these limits fail
- * to trigger relminmxid advancement by VACUUM.  At that point, we'd have no
- * choice but to start failing multixact-creating operations with an error.
- *
- * To prevent that, if more than a threshold portion of the members space is
- * used, we effectively reduce autovacuum_multixact_freeze_max_age and
- * to a value just less than the number of multixacts in use.  We hope that
- * this will quickly trigger autovacuuming on the table or tables with the
- * oldest relminmxid, thus allowing datminmxid values to advance and removing
- * some members.
- *
- * As the fraction of the member space currently in use grows, we become
- * more aggressive in clamping this value.  That not only causes autovacuum
- * to ramp up, but also makes any manual vacuums the user issues more
- * aggressive.  This happens because vacuum_get_cutoffs() will clamp the
- * freeze table and the minimum freeze age cutoffs based on the effective
- * autovacuum_multixact_freeze_max_age this function returns.  In the worst
- * case, we'll claim the freeze_max_age to zero, and every vacuum of any
- * table will freeze every multixact.
- */
-int
-MultiXactMemberFreezeThreshold(void)
-{
-	MultiXactOffset members;
-	uint32		multixacts;
-	uint32		victim_multixacts;
-	double		fraction;
-	int			result;
-
-	/* If we can't determine member space utilization, assume the worst. */
-	if (!ReadMultiXactCounts(&multixacts, &members))
-		return 0;
-
-	/* If member space utilization is low, no special action is required. */
-	if (members <= MULTIXACT_MEMBER_SAFE_THRESHOLD)
-		return autovacuum_multixact_freeze_max_age;
-
-	/*
-	 * Compute a target for relminmxid advancement.  The number of multixacts
-	 * we try to eliminate from the system is based on how far we are past
-	 * MULTIXACT_MEMBER_SAFE_THRESHOLD.
-	 */
-	fraction = (double) (members - MULTIXACT_MEMBER_SAFE_THRESHOLD);
-	fraction /= (double) (MULTIXACT_MEMBER_DANGER_THRESHOLD - MULTIXACT_MEMBER_SAFE_THRESHOLD);
-
-	victim_multixacts = multixacts * fraction;
-
-	/* fraction could be > 1.0, but lowest possible freeze age is zero */
-	if (victim_multixacts > multixacts)
-		return 0;
-	result = multixacts - victim_multixacts;
-
-	/*
-	 * Clamp to autovacuum_multixact_freeze_max_age, so that we never make
-	 * autovacuum less aggressive than it would otherwise be.
-	 */
-	return Min(result, autovacuum_multixact_freeze_max_age);
-}
-
 typedef struct mxtruncinfo
 {
 	int64		earliestExistingPage;
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index ac8f5d9c25..b04d864095 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -1134,7 +1134,7 @@ vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
 	 * normally autovacuum_multixact_freeze_max_age, but may be less if we are
 	 * short of multixact member space.
 	 */
-	effective_multixact_freeze_max_age = MultiXactMemberFreezeThreshold();
+	effective_multixact_freeze_max_age = autovacuum_multixact_freeze_max_age;
 
 	/*
 	 * Almost ready to set freeze output parameters; check if OldestXmin or
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index dc3cf87aba..180bb7e96e 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -1122,7 +1122,7 @@ do_start_worker(void)
 
 	/* Also determine the oldest datminmxid we will consider. */
 	recentMulti = ReadNextMultiXactId();
-	multiForceLimit = recentMulti - MultiXactMemberFreezeThreshold();
+	multiForceLimit = recentMulti - autovacuum_multixact_freeze_max_age;
 	if (multiForceLimit < FirstMultiXactId)
 		multiForceLimit -= FirstMultiXactId;
 
@@ -1915,7 +1915,7 @@ do_autovacuum(void)
 	 * normally autovacuum_multixact_freeze_max_age, but may be less if we are
 	 * short of multixact member space.
 	 */
-	effective_multixact_freeze_max_age = MultiXactMemberFreezeThreshold();
+	effective_multixact_freeze_max_age = autovacuum_multixact_freeze_max_age;
 
 	/*
 	 * Find the pg_database entry and select the default freeze ages. We use
diff --git a/src/include/access/multixact.h b/src/include/access/multixact.h
index 90583634ec..5aefbddce3 100644
--- a/src/include/access/multixact.h
+++ b/src/include/access/multixact.h
@@ -143,7 +143,6 @@ extern void MultiXactSetNextMXact(MultiXactId nextMulti,
 extern void MultiXactAdvanceNextMXact(MultiXactId minMulti,
 									  MultiXactOffset minMultiOffset);
 extern void MultiXactAdvanceOldest(MultiXactId oldestMulti, Oid oldestMultiDB);
-extern int	MultiXactMemberFreezeThreshold(void);
 
 extern void multixact_twophase_recover(TransactionId xid, uint16 info,
 									   void *recdata, uint32 len);
-- 
2.43.0

v5-0001-Use-64-bit-format-output-for-multixact-offsets.patch (application/octet-stream)
From 5de021e3b012dbf71bb6b2893cd77864236bffcb Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Wed, 7 Aug 2024 16:35:22 +0300
Subject: [PATCH v5 1/4] Use 64-bit format output for multixact offsets

Author: Maxim Orlov <orlovmg@gmail.com>
---
 src/backend/access/rmgrdesc/mxactdesc.c   |  9 ++++----
 src/backend/access/rmgrdesc/xlogdesc.c    |  4 ++--
 src/backend/access/transam/multixact.c    | 26 +++++++++++++----------
 src/backend/access/transam/xlogrecovery.c |  5 +++--
 src/bin/pg_controldata/pg_controldata.c   |  4 ++--
 src/bin/pg_resetwal/pg_resetwal.c         |  8 +++----
 6 files changed, 31 insertions(+), 25 deletions(-)

diff --git a/src/backend/access/rmgrdesc/mxactdesc.c b/src/backend/access/rmgrdesc/mxactdesc.c
index 3e8ad4d5ef..1b486de38c 100644
--- a/src/backend/access/rmgrdesc/mxactdesc.c
+++ b/src/backend/access/rmgrdesc/mxactdesc.c
@@ -65,8 +65,8 @@ multixact_desc(StringInfo buf, XLogReaderState *record)
 		xl_multixact_create *xlrec = (xl_multixact_create *) rec;
 		int			i;
 
-		appendStringInfo(buf, "%u offset %u nmembers %d: ", xlrec->mid,
-						 xlrec->moff, xlrec->nmembers);
+		appendStringInfo(buf, "%u offset %llu nmembers %d: ", xlrec->mid,
+						 (unsigned long long) xlrec->moff, xlrec->nmembers);
 		for (i = 0; i < xlrec->nmembers; i++)
 			out_member(buf, &xlrec->members[i]);
 	}
@@ -74,9 +74,10 @@ multixact_desc(StringInfo buf, XLogReaderState *record)
 	{
 		xl_multixact_truncate *xlrec = (xl_multixact_truncate *) rec;
 
-		appendStringInfo(buf, "offsets [%u, %u), members [%u, %u)",
+		appendStringInfo(buf, "offsets [%u, %u), members [%llu, %llu)",
 						 xlrec->startTruncOff, xlrec->endTruncOff,
-						 xlrec->startTruncMemb, xlrec->endTruncMemb);
+						 (unsigned long long) xlrec->startTruncMemb,
+						 (unsigned long long) xlrec->endTruncMemb);
 	}
 }
 
diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c
index 363294d623..aaa19c81c8 100644
--- a/src/backend/access/rmgrdesc/xlogdesc.c
+++ b/src/backend/access/rmgrdesc/xlogdesc.c
@@ -66,7 +66,7 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 		CheckPoint *checkpoint = (CheckPoint *) rec;
 
 		appendStringInfo(buf, "redo %X/%X; "
-						 "tli %u; prev tli %u; fpw %s; wal_level %s; xid %u:%u; oid %u; multi %u; offset %u; "
+						 "tli %u; prev tli %u; fpw %s; wal_level %s; xid %u:%u; oid %u; multi %u; offset %llu; "
 						 "oldest xid %u in DB %u; oldest multi %u in DB %u; "
 						 "oldest/newest commit timestamp xid: %u/%u; "
 						 "oldest running xid %u; %s",
@@ -79,7 +79,7 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 						 XidFromFullTransactionId(checkpoint->nextXid),
 						 checkpoint->nextOid,
 						 checkpoint->nextMulti,
-						 checkpoint->nextMultiOffset,
+						 (unsigned long long) checkpoint->nextMultiOffset,
 						 checkpoint->oldestXid,
 						 checkpoint->oldestXidDB,
 						 checkpoint->oldestMulti,
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 8c37d7eba7..ab90912ed3 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1264,7 +1264,8 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 
 	LWLockRelease(MultiXactGenLock);
 
-	debug_elog4(DEBUG2, "GetNew: returning %u offset %u", result, *offset);
+	debug_elog4(DEBUG2, "GetNew: returning %u offset %llu", result,
+				(unsigned long long) *offset);
 	return result;
 }
 
@@ -2293,8 +2294,9 @@ MultiXactGetCheckptMulti(bool is_shutdown,
 	LWLockRelease(MultiXactGenLock);
 
 	debug_elog6(DEBUG2,
-				"MultiXact: checkpoint is nextMulti %u, nextOffset %u, oldestMulti %u in DB %u",
-				*nextMulti, *nextMultiOffset, *oldestMulti, *oldestMultiDB);
+				"MultiXact: checkpoint is nextMulti %u, nextOffset %llu, oldestMulti %u in DB %u",
+				*nextMulti, (unsigned long long) *nextMultiOffset, *oldestMulti,
+				*oldestMultiDB);
 }
 
 /*
@@ -2328,8 +2330,8 @@ void
 MultiXactSetNextMXact(MultiXactId nextMulti,
 					  MultiXactOffset nextMultiOffset)
 {
-	debug_elog4(DEBUG2, "MultiXact: setting next multi to %u offset %u",
-				nextMulti, nextMultiOffset);
+	debug_elog4(DEBUG2, "MultiXact: setting next multi to %u offset %llu",
+				nextMulti, (unsigned long long) nextMultiOffset);
 	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
 	MultiXactState->nextMXact = nextMulti;
 	MultiXactState->nextOffset = nextMultiOffset;
@@ -2519,8 +2521,8 @@ MultiXactAdvanceNextMXact(MultiXactId minMulti,
 	}
 	if (MultiXactOffsetPrecedes(MultiXactState->nextOffset, minMultiOffset))
 	{
-		debug_elog3(DEBUG2, "MultiXact: setting next offset to %u",
-					minMultiOffset);
+		debug_elog3(DEBUG2, "MultiXact: setting next offset to %llu",
+					(unsigned long long) minMultiOffset);
 		MultiXactState->nextOffset = minMultiOffset;
 	}
 	LWLockRelease(MultiXactGenLock);
@@ -3211,11 +3213,12 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 
 	elog(DEBUG1, "performing multixact truncation: "
 		 "offsets [%u, %u), offsets segments [%llx, %llx), "
-		 "members [%u, %u), members segments [%llx, %llx)",
+		 "members [%llu, %llu), members segments [%llx, %llx)",
 		 oldestMulti, newOldestMulti,
 		 (unsigned long long) MultiXactIdToOffsetSegment(oldestMulti),
 		 (unsigned long long) MultiXactIdToOffsetSegment(newOldestMulti),
-		 oldestOffset, newOldestOffset,
+		 (unsigned long long) oldestOffset,
+		 (unsigned long long) newOldestOffset,
 		 (unsigned long long) MXOffsetToMemberSegment(oldestOffset),
 		 (unsigned long long) MXOffsetToMemberSegment(newOldestOffset));
 
@@ -3471,11 +3474,12 @@ multixact_redo(XLogReaderState *record)
 
 		elog(DEBUG1, "replaying multixact truncation: "
 			 "offsets [%u, %u), offsets segments [%llx, %llx), "
-			 "members [%u, %u), members segments [%llx, %llx)",
+			 "members [%llu, %llu), members segments [%llx, %llx)",
 			 xlrec.startTruncOff, xlrec.endTruncOff,
 			 (unsigned long long) MultiXactIdToOffsetSegment(xlrec.startTruncOff),
 			 (unsigned long long) MultiXactIdToOffsetSegment(xlrec.endTruncOff),
-			 xlrec.startTruncMemb, xlrec.endTruncMemb,
+			 (unsigned long long) xlrec.startTruncMemb,
+			 (unsigned long long) xlrec.endTruncMemb,
 			 (unsigned long long) MXOffsetToMemberSegment(xlrec.startTruncMemb),
 			 (unsigned long long) MXOffsetToMemberSegment(xlrec.endTruncMemb));
 
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 320b14add1..4846126ef9 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -877,8 +877,9 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 							 U64FromFullTransactionId(checkPoint.nextXid),
 							 checkPoint.nextOid)));
 	ereport(DEBUG1,
-			(errmsg_internal("next MultiXactId: %u; next MultiXactOffset: %u",
-							 checkPoint.nextMulti, checkPoint.nextMultiOffset)));
+			(errmsg_internal("next MultiXactId: %u; next MultiXactOffset: %llu",
+							 checkPoint.nextMulti,
+							 (unsigned long long) checkPoint.nextMultiOffset)));
 	ereport(DEBUG1,
 			(errmsg_internal("oldest unfrozen transaction ID: %u, in database %u",
 							 checkPoint.oldestXid, checkPoint.oldestXidDB)));
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index 93a05d80ca..43b6727570 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -253,8 +253,8 @@ main(int argc, char *argv[])
 		   ControlFile->checkPointCopy.nextOid);
 	printf(_("Latest checkpoint's NextMultiXactId:  %u\n"),
 		   ControlFile->checkPointCopy.nextMulti);
-	printf(_("Latest checkpoint's NextMultiOffset:  %u\n"),
-		   ControlFile->checkPointCopy.nextMultiOffset);
+	printf(_("Latest checkpoint's NextMultiOffset:  %llu\n"),
+		   (unsigned long long) ControlFile->checkPointCopy.nextMultiOffset);
 	printf(_("Latest checkpoint's oldestXID:        %u\n"),
 		   ControlFile->checkPointCopy.oldestXid);
 	printf(_("Latest checkpoint's oldestXID's DB:   %u\n"),
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index e9dcb5a6d8..985cd06802 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -737,8 +737,8 @@ PrintControlValues(bool guessed)
 		   ControlFile.checkPointCopy.nextOid);
 	printf(_("Latest checkpoint's NextMultiXactId:  %u\n"),
 		   ControlFile.checkPointCopy.nextMulti);
-	printf(_("Latest checkpoint's NextMultiOffset:  %u\n"),
-		   ControlFile.checkPointCopy.nextMultiOffset);
+	printf(_("Latest checkpoint's NextMultiOffset:  %llu\n"),
+		   (unsigned long long) ControlFile.checkPointCopy.nextMultiOffset);
 	printf(_("Latest checkpoint's oldestXID:        %u\n"),
 		   ControlFile.checkPointCopy.oldestXid);
 	printf(_("Latest checkpoint's oldestXID's DB:   %u\n"),
@@ -809,8 +809,8 @@ PrintNewControlValues(void)
 
 	if (set_mxoff != -1)
 	{
-		printf(_("NextMultiOffset:                      %u\n"),
-			   ControlFile.checkPointCopy.nextMultiOffset);
+		printf(_("NextMultiOffset:                      %llu\n"),
+			   (unsigned long long) ControlFile.checkPointCopy.nextMultiOffset);
 	}
 
 	if (set_oid != 0)
-- 
2.43.0

#17wenhui qiu
qiuwenhuifx@gmail.com
In reply to: Maxim Orlov (#16)
Re: POC: make mxidoff 64 bits

Hi Maxim Orlov,

After a bit of thought, I've realized that being conservative is the way
to go here.

We can reuse most of the existing logic. I mean, we can remove the offset
wraparound "error logic" and reuse the "warning logic", but set the
threshold for the "warning logic" to a much higher value. For now, I chose
2^32-1. In other words, the legitimate logic here, in my view, would be to
trigger autovacuum if the number of offsets (i.e. the difference
nextOffset - oldestOffset) exceeds 2^32-1. PFA patch set.

Good point, I couldn't agree more. xid64 is the solution to the
wraparound problem, so the previous error log is no longer meaningful. But
we might want to refine the warning log output a little (for example, to
help check the underlying reasons why the age keeps increasing), even
though we no longer have to worry about xid wraparound.
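
For reference, the rule quoted above (request autovacuum once nextOffset - oldestOffset exceeds 2^32-1, or when the oldest offset is unknown) can be sketched as a small standalone C function. This is only an illustrative sketch under stated assumptions: `need_member_autovacuum` and `MEMBER_AUTOVAC_THRESHOLD` are hypothetical names standing in for the check at the end of SetOffsetVacuumLimit() and for MULTIXACT_MEMBER_AUTOVAC_THRESHOLD in the patch, and plain `uint64_t` replaces MultiXactOffset.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for MULTIXACT_MEMBER_AUTOVAC_THRESHOLD (2^32 - 1). */
#define MEMBER_AUTOVAC_THRESHOLD UINT64_C(0xFFFFFFFF)

/*
 * Return true when autovacuum should be requested for member space.
 * If the oldest offset is unknown, conservatively answer yes.
 */
bool
need_member_autovacuum(bool oldest_offset_known,
                       uint64_t next_offset, uint64_t oldest_offset)
{
    if (!oldest_offset_known)
        return true;

    /* With 64-bit offsets this subtraction does not wrap in practice. */
    return next_offset - oldest_offset > MEMBER_AUTOVAC_THRESHOLD;
}
```

With a known oldest offset of 5, a next offset of 0x100000004 stays at the limit (the difference is exactly 2^32-1) and does not trigger, while 0x100000005 crosses it.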

+/*
+ * Multixact members warning threshold.
+ *
+ * If the difference between nextOffset and oldestOffset exceeds this value,
+ * we trigger autovacuum in order to release the disk space if possible.
+ */
+#define MULTIXACT_MEMBER_AUTOVAC_THRESHOLD UINT64CONST(0xFFFFFFFF)

Can we refine this comment a bit? For example:

+/*
+ * Multixact members warning threshold.
+ *
+ * If the difference between nextOffset and oldestOffset exceeds this value,
+ * we trigger autovacuum in order to release the disk space and reduce table
+ * bloat if possible.
+ */
+#define MULTIXACT_MEMBER_AUTOVAC_THRESHOLD UINT64CONST(0xFFFFFFFF)

Thanks

On Wed, 23 Oct 2024 at 23:55, Maxim Orlov <orlovmg@gmail.com> wrote:

After a bit of thought, I've realized that being conservative is the way
to go here.

We can reuse most of the existing logic. I mean, we can remove the offset
wraparound "error logic" and reuse the "warning logic", but set the
threshold for the "warning logic" to a much higher value. For now, I chose
2^32-1. In other words, the legitimate logic here, in my view, would be to
trigger autovacuum if the number of offsets (i.e. the difference
nextOffset - oldestOffset) exceeds 2^32-1. PFA patch set.

--
Best regards,
Maxim Orlov.

#18Maxim Orlov
orlovmg@gmail.com
In reply to: wenhui qiu (#17)
8 attachment(s)
Re: POC: make mxidoff 64 bits

On Fri, 25 Oct 2024 at 06:39, wenhui qiu <qiuwenhuifx@gmail.com> wrote:

+ * Multixact members warning threshold.
+ *
+ * If the difference between nextOffset and oldestOffset exceeds this value,
+ * we trigger autovacuum in order to release the disk space if possible.
+ */
+#define MULTIXACT_MEMBER_AUTOVAC_THRESHOLD UINT64CONST(0xFFFFFFFF)
Can we refine this comment a bit? For example

Thank you, fixed.

Sorry for the late reply. There was a problem in the upgrade with offset
wraparound. Here is a fixed version. A test is also added. I decided to
use my old patch to set non-standard multixact values for the old cluster,
fill it with data and run pg_upgrade.
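
To make the offset wraparound case concrete, here is a minimal sketch of how a 32-bit offset might be remapped into the 64-bit space during conversion. It follows the inner loop of convert_multixact_offsets() from the v5 patch earlier in this thread, so the v6 fix may differ in detail. `remap_old_offset` is a hypothetical helper name, not a function from the patch.

```c
#include <stdint.h>

/*
 * Map an old 32-bit member offset to the new 64-bit numbering, in which
 * the oldest surviving offset becomes 1.  Mirrors the v5 conversion loop;
 * the function name is hypothetical.
 */
uint64_t
remap_old_offset(uint32_t old_offset, uint32_t oldest_offset)
{
    uint64_t offset = old_offset;

    /* An offset below the oldest one is treated as having wrapped around. */
    if (offset < oldest_offset)
        offset += (UINT64_C(1) << 32) - 1;

    /* Subtract oldest_offset, so new offsets start from 1. */
    return offset - oldest_offset + 1;
}
```

For example, with an oldest offset of 0xFFFFFFF0, an old offset of 5 is treated as wrapped and lands at 21 in the new numbering, while the oldest offset itself maps to 1.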

Here is how to test. All the patches are for the 14e87ffa5c543b5f3 master
branch.
1) Get the 14e87ffa5c543b5f3 master branch and apply patches
0001-Add-initdb-option-to-initialize-cluster-with-non-sta.patch and
0002-TEST-lower-SLRU_PAGES_PER_SEGMENT.patch.
2) Get the 14e87ffa5c543b5f3 master branch in a separate directory and
apply the v6 patch set.
3) Build both branches.
4) Use the oldinstall environment variable to run the test:
PROVE_TESTS=t/005_mxidoff.pl
oldinstall=/home/orlov/proj/pgsql-new PG_TEST_NOCLEAN=1 make check -C
src/bin/pg_upgrade/

Maybe I'll make a shell script to automate these steps if required.

--
Best regards,
Maxim Orlov.

Attachments:

v6-0001-Use-64-bit-multixact-offsets.patch (application/octet-stream)
From 2a8708fa5d31c6523c7d2654ee1215beda6f1ff0 Mon Sep 17 00:00:00 2001
From: Maxim Orlov <m.orlov@postgrespro.ru>
Date: Wed, 6 Mar 2024 11:11:33 +0300
Subject: [PATCH v6 1/6] Use 64-bit multixact offsets.

Author: Maxim Orlov <orlovmg@gmail.com>
---
 src/backend/access/transam/multixact.c | 172 +------------------------
 src/bin/pg_resetwal/pg_resetwal.c      |   2 +-
 src/bin/pg_resetwal/t/001_basic.pl     |   2 +-
 src/include/access/multixact.h         |   2 +-
 src/include/c.h                        |   2 +-
 5 files changed, 11 insertions(+), 169 deletions(-)

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index ab90912ed3..c51e03e832 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -96,14 +96,6 @@
 /*
  * Defines for MultiXactOffset page sizes.  A page is the same BLCKSZ as is
  * used everywhere else in Postgres.
- *
- * Note: because MultiXactOffsets are 32 bits and wrap around at 0xFFFFFFFF,
- * MultiXact page numbering also wraps around at
- * 0xFFFFFFFF/MULTIXACT_OFFSETS_PER_PAGE, and segment numbering at
- * 0xFFFFFFFF/MULTIXACT_OFFSETS_PER_PAGE/SLRU_PAGES_PER_SEGMENT.  We need
- * take no explicit notice of that fact in this module, except when comparing
- * segment and page numbers in TruncateMultiXact (see
- * MultiXactOffsetPagePrecedes).
  */
 
 /* We need four bytes per offset */
@@ -272,9 +264,6 @@ typedef struct MultiXactStateData
 	MultiXactId multiStopLimit;
 	MultiXactId multiWrapLimit;
 
-	/* support for members anti-wraparound measures */
-	MultiXactOffset offsetStopLimit;	/* known if oldestOffsetKnown */
-
 	/*
 	 * This is used to sleep until a multixact offset is written when we want
 	 * to create the next one.
@@ -409,8 +398,6 @@ static bool MultiXactOffsetPrecedes(MultiXactOffset offset1,
 									MultiXactOffset offset2);
 static void ExtendMultiXactOffset(MultiXactId multi);
 static void ExtendMultiXactMember(MultiXactOffset offset, int nmembers);
-static bool MultiXactOffsetWouldWrap(MultiXactOffset boundary,
-									 MultiXactOffset start, uint32 distance);
 static bool SetOffsetVacuumLimit(bool is_startup);
 static bool find_multixact_start(MultiXactId multi, MultiXactOffset *result);
 static void WriteMZeroPageXlogRec(int64 pageno, uint8 info);
@@ -1164,78 +1151,6 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 	else
 		*offset = nextOffset;
 
-	/*----------
-	 * Protect against overrun of the members space as well, with the
-	 * following rules:
-	 *
-	 * If we're past offsetStopLimit, refuse to generate more multis.
-	 * If we're close to offsetStopLimit, emit a warning.
-	 *
-	 * Arbitrarily, we start emitting warnings when we're 20 segments or less
-	 * from offsetStopLimit.
-	 *
-	 * Note we haven't updated the shared state yet, so if we fail at this
-	 * point, the multixact ID we grabbed can still be used by the next guy.
-	 *
-	 * Note that there is no point in forcing autovacuum runs here: the
-	 * multixact freeze settings would have to be reduced for that to have any
-	 * effect.
-	 *----------
-	 */
-#define OFFSET_WARN_SEGMENTS	20
-	if (MultiXactState->oldestOffsetKnown &&
-		MultiXactOffsetWouldWrap(MultiXactState->offsetStopLimit, nextOffset,
-								 nmembers))
-	{
-		/* see comment in the corresponding offsets wraparound case */
-		SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER);
-
-		ereport(ERROR,
-				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg("multixact \"members\" limit exceeded"),
-				 errdetail_plural("This command would create a multixact with %u members, but the remaining space is only enough for %u member.",
-								  "This command would create a multixact with %u members, but the remaining space is only enough for %u members.",
-								  MultiXactState->offsetStopLimit - nextOffset - 1,
-								  nmembers,
-								  MultiXactState->offsetStopLimit - nextOffset - 1),
-				 errhint("Execute a database-wide VACUUM in database with OID %u with reduced \"vacuum_multixact_freeze_min_age\" and \"vacuum_multixact_freeze_table_age\" settings.",
-						 MultiXactState->oldestMultiXactDB)));
-	}
-
-	/*
-	 * Check whether we should kick autovacuum into action, to prevent members
-	 * wraparound. NB we use a much larger window to trigger autovacuum than
-	 * just the warning limit. The warning is just a measure of last resort -
-	 * this is in line with GetNewTransactionId's behaviour.
-	 */
-	if (!MultiXactState->oldestOffsetKnown ||
-		(MultiXactState->nextOffset - MultiXactState->oldestOffset
-		 > MULTIXACT_MEMBER_SAFE_THRESHOLD))
-	{
-		/*
-		 * To avoid swamping the postmaster with signals, we issue the autovac
-		 * request only when crossing a segment boundary. With default
-		 * compilation settings that's roughly after 50k members.  This still
-		 * gives plenty of chances before we get into real trouble.
-		 */
-		if ((MXOffsetToMemberPage(nextOffset) / SLRU_PAGES_PER_SEGMENT) !=
-			(MXOffsetToMemberPage(nextOffset + nmembers) / SLRU_PAGES_PER_SEGMENT))
-			SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER);
-	}
-
-	if (MultiXactState->oldestOffsetKnown &&
-		MultiXactOffsetWouldWrap(MultiXactState->offsetStopLimit,
-								 nextOffset,
-								 nmembers + MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT * OFFSET_WARN_SEGMENTS))
-		ereport(WARNING,
-				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg_plural("database with OID %u must be vacuumed before %d more multixact member is used",
-							   "database with OID %u must be vacuumed before %d more multixact members are used",
-							   MultiXactState->offsetStopLimit - nextOffset + nmembers,
-							   MultiXactState->oldestMultiXactDB,
-							   MultiXactState->offsetStopLimit - nextOffset + nmembers),
-				 errhint("Execute a database-wide VACUUM in that database with reduced \"vacuum_multixact_freeze_min_age\" and \"vacuum_multixact_freeze_table_age\" settings.")));
-
 	ExtendMultiXactMember(nextOffset, nmembers);
 
 	/*
@@ -1976,7 +1891,7 @@ MultiXactShmemInit(void)
 				  "pg_multixact/offsets", LWTRANCHE_MULTIXACTOFFSET_BUFFER,
 				  LWTRANCHE_MULTIXACTOFFSET_SLRU,
 				  SYNC_HANDLER_MULTIXACT_OFFSET,
-				  false);
+				  true);
 	SlruPagePrecedesUnitTests(MultiXactOffsetCtl, MULTIXACT_OFFSETS_PER_PAGE);
 	SimpleLruInit(MultiXactMemberCtl,
 				  "multixact_member", multixact_member_buffers, 0,
@@ -2721,8 +2636,6 @@ SetOffsetVacuumLimit(bool is_startup)
 	MultiXactOffset nextOffset;
 	bool		oldestOffsetKnown = false;
 	bool		prevOldestOffsetKnown;
-	MultiXactOffset offsetStopLimit = 0;
-	MultiXactOffset prevOffsetStopLimit;
 
 	/*
 	 * NB: Have to prevent concurrent truncation, we might otherwise try to
@@ -2737,7 +2650,6 @@ SetOffsetVacuumLimit(bool is_startup)
 	nextOffset = MultiXactState->nextOffset;
 	prevOldestOffsetKnown = MultiXactState->oldestOffsetKnown;
 	prevOldestOffset = MultiXactState->oldestOffset;
-	prevOffsetStopLimit = MultiXactState->offsetStopLimit;
 	Assert(MultiXactState->finishedStartup);
 	LWLockRelease(MultiXactGenLock);
 
@@ -2768,11 +2680,7 @@ SetOffsetVacuumLimit(bool is_startup)
 		oldestOffsetKnown =
 			find_multixact_start(oldestMultiXactId, &oldestOffset);
 
-		if (oldestOffsetKnown)
-			ereport(DEBUG1,
-					(errmsg_internal("oldest MultiXactId member is at offset %u",
-									 oldestOffset)));
-		else
+		if (!oldestOffsetKnown)
 			ereport(LOG,
 					(errmsg("MultiXact member wraparound protections are disabled because oldest checkpointed MultiXact %u does not exist on disk",
 							oldestMultiXactId)));
@@ -2785,24 +2693,7 @@ SetOffsetVacuumLimit(bool is_startup)
 	 * overrun of old data in the members SLRU area. We can only do so if the
 	 * oldest offset is known though.
 	 */
-	if (oldestOffsetKnown)
-	{
-		/* move back to start of the corresponding segment */
-		offsetStopLimit = oldestOffset - (oldestOffset %
-										  (MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT));
-
-		/* always leave one segment before the wraparound point */
-		offsetStopLimit -= (MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT);
-
-		if (!prevOldestOffsetKnown && !is_startup)
-			ereport(LOG,
-					(errmsg("MultiXact member wraparound protections are now enabled")));
-
-		ereport(DEBUG1,
-				(errmsg_internal("MultiXact member stop limit is now %u based on MultiXact %u",
-								 offsetStopLimit, oldestMultiXactId)));
-	}
-	else if (prevOldestOffsetKnown)
+	if (prevOldestOffsetKnown)
 	{
 		/*
 		 * If we failed to get the oldest offset this time, but we have a
@@ -2812,14 +2703,12 @@ SetOffsetVacuumLimit(bool is_startup)
 		 */
 		oldestOffset = prevOldestOffset;
 		oldestOffsetKnown = true;
-		offsetStopLimit = prevOffsetStopLimit;
 	}
 
 	/* Install the computed values */
 	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
 	MultiXactState->oldestOffset = oldestOffset;
 	MultiXactState->oldestOffsetKnown = oldestOffsetKnown;
-	MultiXactState->offsetStopLimit = offsetStopLimit;
 	LWLockRelease(MultiXactGenLock);
 
 	/*
@@ -2829,54 +2718,6 @@ SetOffsetVacuumLimit(bool is_startup)
 		(nextOffset - oldestOffset > MULTIXACT_MEMBER_SAFE_THRESHOLD);
 }
 
-/*
- * Return whether adding "distance" to "start" would move past "boundary".
- *
- * We use this to determine whether the addition is "wrapping around" the
- * boundary point, hence the name.  The reason we don't want to use the regular
- * 2^31-modulo arithmetic here is that we want to be able to use the whole of
- * the 2^32-1 space here, allowing for more multixacts than would fit
- * otherwise.
- */
-static bool
-MultiXactOffsetWouldWrap(MultiXactOffset boundary, MultiXactOffset start,
-						 uint32 distance)
-{
-	MultiXactOffset finish;
-
-	/*
-	 * Note that offset number 0 is not used (see GetMultiXactIdMembers), so
-	 * if the addition wraps around the UINT_MAX boundary, skip that value.
-	 */
-	finish = start + distance;
-	if (finish < start)
-		finish++;
-
-	/*-----------------------------------------------------------------------
-	 * When the boundary is numerically greater than the starting point, any
-	 * value numerically between the two is not wrapped:
-	 *
-	 *	<----S----B---->
-	 *	[---)			 = F wrapped past B (and UINT_MAX)
-	 *		 [---)		 = F not wrapped
-	 *			  [----] = F wrapped past B
-	 *
-	 * When the boundary is numerically less than the starting point (i.e. the
-	 * UINT_MAX wraparound occurs somewhere in between) then all values in
-	 * between are wrapped:
-	 *
-	 *	<----B----S---->
-	 *	[---)			 = F not wrapped past B (but wrapped past UINT_MAX)
-	 *		 [---)		 = F wrapped past B (and UINT_MAX)
-	 *			  [----] = F not wrapped
-	 *-----------------------------------------------------------------------
-	 */
-	if (start < boundary)
-		return finish >= boundary || finish < start;
-	else
-		return finish >= boundary && finish < start;
-}
-
 /*
  * Find the starting offset of the given MultiXactId.
  *
@@ -2998,8 +2839,9 @@ MultiXactMemberFreezeThreshold(void)
 	 * we try to eliminate from the system is based on how far we are past
 	 * MULTIXACT_MEMBER_SAFE_THRESHOLD.
 	 */
-	fraction = (double) (members - MULTIXACT_MEMBER_SAFE_THRESHOLD) /
-		(MULTIXACT_MEMBER_DANGER_THRESHOLD - MULTIXACT_MEMBER_SAFE_THRESHOLD);
+	fraction = (double) (members - MULTIXACT_MEMBER_SAFE_THRESHOLD);
+	fraction /= (double) (MULTIXACT_MEMBER_DANGER_THRESHOLD - MULTIXACT_MEMBER_SAFE_THRESHOLD);
+
 	victim_multixacts = multixacts * fraction;
 
 	/* fraction could be > 1.0, but lowest possible freeze age is zero */
@@ -3345,7 +3187,7 @@ MultiXactIdPrecedesOrEquals(MultiXactId multi1, MultiXactId multi2)
 static bool
 MultiXactOffsetPrecedes(MultiXactOffset offset1, MultiXactOffset offset2)
 {
-	int32		diff = (int32) (offset1 - offset2);
+	int64		diff = (int64) (offset1 - offset2);
 
 	return (diff < 0);
 }
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index 985cd06802..1af2ce4b93 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -264,7 +264,7 @@ main(int argc, char *argv[])
 
 			case 'O':
 				errno = 0;
-				set_mxoff = strtoul(optarg, &endptr, 0);
+				set_mxoff = strtou64(optarg, &endptr, 0);
 				if (endptr == optarg || *endptr != '\0' || errno != 0)
 				{
 					pg_log_error("invalid argument for option %s", "-O");
diff --git a/src/bin/pg_resetwal/t/001_basic.pl b/src/bin/pg_resetwal/t/001_basic.pl
index 9829e48106..f8a8eef44d 100644
--- a/src/bin/pg_resetwal/t/001_basic.pl
+++ b/src/bin/pg_resetwal/t/001_basic.pl
@@ -206,7 +206,7 @@ push @cmd,
   sprintf("%d,%d", hex($files[0]) == 0 ? 3 : hex($files[0]), hex($files[-1]));
 
 @files = get_slru_files('pg_multixact/offsets');
-$mult = 32 * $blcksz / 4;
+$mult = 32 * $blcksz / 8;
 # -m argument is "new,old"
 push @cmd, '-m',
   sprintf("%d,%d",
diff --git a/src/include/access/multixact.h b/src/include/access/multixact.h
index 7ffd256c74..90583634ec 100644
--- a/src/include/access/multixact.h
+++ b/src/include/access/multixact.h
@@ -27,7 +27,7 @@
 
 #define MultiXactIdIsValid(multi) ((multi) != InvalidMultiXactId)
 
-#define MaxMultiXactOffset	((MultiXactOffset) 0xFFFFFFFF)
+#define MaxMultiXactOffset	UINT64CONST(0xFFFFFFFFFFFFFFFF)
 
 /*
  * Possible multixact lock modes ("status").  The first four modes are for
diff --git a/src/include/c.h b/src/include/c.h
index 0a548d69d7..e1b3187d0b 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -664,7 +664,7 @@ typedef uint32 SubTransactionId;
 /* MultiXactId must be equivalent to TransactionId, to fit in t_xmax */
 typedef TransactionId MultiXactId;
 
-typedef uint32 MultiXactOffset;
+typedef uint64 MultiXactOffset;
 
 typedef uint32 CommandId;
 
-- 
2.43.0
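
A note on the arithmetic in the patch above: widening MultiXactOffset from 4 to 8 bytes halves the number of offsets that fit in one SLRU page, which is why the pg_resetwal test multiplier changes from 32 * $blcksz / 4 to 32 * $blcksz / 8, and why MultiXactOffsetPrecedes now goes through a signed 64-bit difference. The following is only a minimal sketch of both points. The helper names offsets_per_segment and offset_precedes are hypothetical and do not exist in the tree.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: number of offset entries held by one SLRU segment. */
static uint64_t
offsets_per_segment(uint32_t pages_per_segment, uint32_t blcksz,
					uint32_t entry_size)
{
	assert(entry_size > 0);
	return (uint64_t) pages_per_segment * (blcksz / entry_size);
}

/*
 * Hypothetical helper: modular "precedes" test on 64-bit offsets, with the
 * same shape as the patched MultiXactOffsetPrecedes().  Converting an
 * out-of-range value to int64_t is implementation-defined before C23; this
 * sketch assumes the usual two's complement wrap.
 */
static bool
offset_precedes(uint64_t offset1, uint64_t offset2)
{
	int64_t		diff = (int64_t) (offset1 - offset2);

	return diff < 0;
}
```

With the default 8192-byte block and 32 pages per segment, a segment holds 65536 entries of 4 bytes but only 32768 entries of 8 bytes.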

Attachment: v6-0002-Make-pg_upgrade-convert-multixact-offsets.patch (application/octet-stream)
From a48ec9aaf3de859050dd0ad484dc1fb5f174cf8a Mon Sep 17 00:00:00 2001
From: Maxim Orlov <m.orlov@postgrespro.ru>
Date: Tue, 13 Aug 2024 14:44:50 +0300
Subject: [PATCH v6 2/6] Make pg_upgrade convert multixact offsets.

Author: Maxim Orlov <orlovmg@gmail.com>
Author: Yura Sokolov <y.sokolov@postgrespro.ru>
---
 src/backend/access/transam/multixact.c |   2 +-
 src/bin/pg_upgrade/Makefile            |   1 +
 src/bin/pg_upgrade/meson.build         |   1 +
 src/bin/pg_upgrade/pg_upgrade.c        |  42 +-
 src/bin/pg_upgrade/pg_upgrade.h        |  14 +-
 src/bin/pg_upgrade/segresize.c         | 518 +++++++++++++++++++++++++
 6 files changed, 572 insertions(+), 6 deletions(-)
 create mode 100644 src/bin/pg_upgrade/segresize.c

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index c51e03e832..48e1c0160a 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1891,7 +1891,7 @@ MultiXactShmemInit(void)
 				  "pg_multixact/offsets", LWTRANCHE_MULTIXACTOFFSET_BUFFER,
 				  LWTRANCHE_MULTIXACTOFFSET_SLRU,
 				  SYNC_HANDLER_MULTIXACT_OFFSET,
-				  true);
+				  false);
 	SlruPagePrecedesUnitTests(MultiXactOffsetCtl, MULTIXACT_OFFSETS_PER_PAGE);
 	SimpleLruInit(MultiXactMemberCtl,
 				  "multixact_member", multixact_member_buffers, 0,
diff --git a/src/bin/pg_upgrade/Makefile b/src/bin/pg_upgrade/Makefile
index f83d2b5d30..70908d63a3 100644
--- a/src/bin/pg_upgrade/Makefile
+++ b/src/bin/pg_upgrade/Makefile
@@ -21,6 +21,7 @@ OBJS = \
 	info.o \
 	option.o \
 	parallel.o \
+	segresize.o \
 	pg_upgrade.o \
 	relfilenumber.o \
 	server.o \
diff --git a/src/bin/pg_upgrade/meson.build b/src/bin/pg_upgrade/meson.build
index 3d88419674..16f898ba14 100644
--- a/src/bin/pg_upgrade/meson.build
+++ b/src/bin/pg_upgrade/meson.build
@@ -10,6 +10,7 @@ pg_upgrade_sources = files(
   'info.c',
   'option.c',
   'parallel.c',
+  'segresize.c',
   'pg_upgrade.c',
   'relfilenumber.c',
   'server.c',
diff --git a/src/bin/pg_upgrade/pg_upgrade.c b/src/bin/pg_upgrade/pg_upgrade.c
index 663235816f..1654e877c0 100644
--- a/src/bin/pg_upgrade/pg_upgrade.c
+++ b/src/bin/pg_upgrade/pg_upgrade.c
@@ -750,8 +750,42 @@ copy_xact_xlog_xid(void)
 	if (old_cluster.controldata.cat_ver >= MULTIXACT_FORMATCHANGE_CAT_VER &&
 		new_cluster.controldata.cat_ver >= MULTIXACT_FORMATCHANGE_CAT_VER)
 	{
-		copy_subdir_files("pg_multixact/offsets", "pg_multixact/offsets");
-		copy_subdir_files("pg_multixact/members", "pg_multixact/members");
+		/*
+		 * If the old cluster predates MULTIXACTOFFSET_FORMATCHANGE_CAT_VER, it
+		 * has 32-bit multixact offsets, which must be converted to 64 bits.
+		 */
+		if (old_cluster.controldata.cat_ver < MULTIXACTOFFSET_FORMATCHANGE_CAT_VER &&
+			new_cluster.controldata.cat_ver >= MULTIXACTOFFSET_FORMATCHANGE_CAT_VER)
+		{
+			MultiXactOffset		oldest_offset,
+								next_offset;
+
+			remove_new_subdir("pg_multixact/offsets", false);
+			prep_status("Converting pg_multixact/offsets to 64-bit");
+			oldest_offset = convert_multixact_offsets();
+			check_ok();
+
+			remove_new_subdir("pg_multixact/members", false);
+			prep_status("Converting pg_multixact/members");
+			convert_multixact_members(oldest_offset);
+			check_ok();
+
+			next_offset = old_cluster.controldata.chkpnt_nxtmxoff;
+			if (oldest_offset)
+			{
+				if (next_offset < oldest_offset)
+					next_offset += ((MultiXactOffset) 1 << 32) - 1;
+
+				next_offset -= oldest_offset - 1;
+
+				old_cluster.controldata.chkpnt_nxtmxoff = next_offset;
+			}
+		}
+		else
+		{
+			copy_subdir_files("pg_multixact/offsets", "pg_multixact/offsets");
+			copy_subdir_files("pg_multixact/members", "pg_multixact/members");
+		}
 
 		prep_status("Setting next multixact ID and offset for new cluster");
 
@@ -760,9 +794,9 @@ copy_xact_xlog_xid(void)
 		 * counters here and the oldest multi present on system.
 		 */
 		exec_prog(UTILITY_LOG_FILE, NULL, true, true,
-				  "\"%s/pg_resetwal\" -O %u -m %u,%u \"%s\"",
+				  "\"%s/pg_resetwal\" -O %llu -m %u,%u \"%s\"",
 				  new_cluster.bindir,
-				  old_cluster.controldata.chkpnt_nxtmxoff,
+				  (unsigned long long) old_cluster.controldata.chkpnt_nxtmxoff,
 				  old_cluster.controldata.chkpnt_nxtmulti,
 				  old_cluster.controldata.chkpnt_oldstMulti,
 				  new_cluster.pgdata);
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index 53f693c2d4..2c85ec1e94 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -114,6 +114,13 @@ extern char *output_files[];
  */
 #define MULTIXACT_FORMATCHANGE_CAT_VER 201301231
 
+/*
+ * Switch from 32-bit to 64-bit for multixid offsets.
+ *
+ * XXX: should be changed to the actual CATALOG_VERSION_NO on commit.
+ */
+#define MULTIXACTOFFSET_FORMATCHANGE_CAT_VER 202409041
+
 /*
  * large object chunk size added to pg_controldata,
  * commit 5f93c37805e7485488480916b4585e098d3cc883
@@ -230,7 +237,7 @@ typedef struct
 	uint32		chkpnt_nxtepoch;
 	uint32		chkpnt_nxtoid;
 	uint32		chkpnt_nxtmulti;
-	uint32		chkpnt_nxtmxoff;
+	uint64		chkpnt_nxtmxoff;
 	uint32		chkpnt_oldstMulti;
 	uint32		chkpnt_oldstxid;
 	uint32		align;
@@ -515,3 +522,8 @@ typedef struct
 	FILE	   *file;
 	char		path[MAXPGPATH];
 } UpgradeTaskReport;
+
+/* segresize.c */
+
+MultiXactOffset		convert_multixact_offsets(void);
+void				convert_multixact_members(MultiXactOffset oldest_offset);
diff --git a/src/bin/pg_upgrade/segresize.c b/src/bin/pg_upgrade/segresize.c
new file mode 100644
index 0000000000..ff7ff65758
--- /dev/null
+++ b/src/bin/pg_upgrade/segresize.c
@@ -0,0 +1,518 @@
+/*
+ *	segresize.c
+ *
+ *	SLRU segment resize utility
+ *
+ *	Copyright (c) 2024, PostgreSQL Global Development Group
+ *	src/bin/pg_upgrade/segresize.c
+ */
+
+#include "postgres_fe.h"
+
+#include "pg_upgrade.h"
+#include "access/multixact.h"
+
+/* See slru.h */
+#define SLRU_PAGES_PER_SEGMENT		32
+
+/*
+ * Some kind of iterator associated with a particular SLRU segment.  The idea is
+ * to specify the segment and page number and then move through the pages.
+ */
+typedef struct SlruSegState
+{
+	char	   *dir;
+	char	   *fn;
+	FILE	   *file;
+	int64		segno;
+	uint64		pageno;
+	bool		leading_gap;
+} SlruSegState;
+
+/*
+ * Get the SLRU segment file name from the state.
+ *
+ * NOTE: this function should mirror SlruFileName call.
+ */
+static inline char *
+SlruFileName(SlruSegState *state)
+{
+	Assert(state->segno >= 0 &&
+		   state->segno <= INT64CONST(0xFFFFFF));
+
+	return psprintf("%s/%04X", state->dir, (unsigned int) (state->segno));
+}
+
+/*
+ * Create SLRU segment file.
+ */
+static void
+create_segment(SlruSegState *state)
+{
+	Assert(state->fn == NULL);
+	Assert(state->file == NULL);
+
+	state->fn = SlruFileName(state);
+	state->file = fopen(state->fn, "wb");
+	if (!state->file)
+		pg_fatal("could not create file \"%s\": %m", state->fn);
+}
+
+/*
+ * Open existing SLRU segment file.
+ */
+static void
+open_segment(SlruSegState *state)
+{
+	Assert(state->fn == NULL);
+	Assert(state->file == NULL);
+
+	state->fn = SlruFileName(state);
+	state->file = fopen(state->fn, "rb");
+	if (!state->file)
+		pg_fatal("could not open file \"%s\": %m", state->fn);
+}
+
+/*
+ * Close SLRU segment file.
+ */
+static void
+close_segment(SlruSegState *state)
+{
+	if (state->file)
+	{
+		fclose(state->file);
+		state->file = NULL;
+	}
+
+	if (state->fn)
+	{
+		pfree(state->fn);
+		state->fn = NULL;
+	}
+}
+
+/*
+ * Read next page from the old 32-bit offset segment file.
+ */
+static int
+read_old_segment_page(SlruSegState *state, void *buf, bool *empty)
+{
+	int		len;
+
+	/* Open next segment file, if needed. */
+	if (!state->fn)
+	{
+		if (!state->segno)
+			state->leading_gap = true;
+
+		open_segment(state);
+
+		/* Set position to the needed page. */
+		if (state->pageno > 0 &&
+			fseek(state->file, state->pageno * BLCKSZ, SEEK_SET))
+		{
+			close_segment(state);
+		}
+	}
+
+	if (state->file)
+	{
+		/* The segment file exists; read a page from it. */
+		state->leading_gap = false;
+
+		len = fread(buf, sizeof(char), BLCKSZ, state->file);
+
+		/* Are we done or was there an error? */
+		if (len <= 0)
+		{
+			if (ferror(state->file))
+				pg_fatal("error reading file \"%s\": %m", state->fn);
+
+			if (feof(state->file))
+			{
+				*empty = true;
+				len = -1;
+
+				close_segment(state);
+			}
+		}
+		else
+			*empty = false;
+	}
+	else if (!state->leading_gap)
+	{
+		/* We reached the last segment. */
+		len = -1;
+		*empty = true;
+	}
+	else
+	{
+		/* Skip the first few segments if they were frozen and removed. */
+		len = BLCKSZ;
+		*empty = true;
+	}
+
+	if (++state->pageno >= SLRU_PAGES_PER_SEGMENT)
+	{
+		/* Start a new segment. */
+		state->segno++;
+		state->pageno = 0;
+
+		close_segment(state);
+	}
+
+	return len;
+}
+
+/*
+ * Write next page to the new 64-bit offset segment file.
+ */
+static void
+write_new_segment_page(SlruSegState *state, void *buf)
+{
+	/*
+	 * Create a new segment file if we have not done so yet.  Creation is
+	 * postponed until the first non-empty page is found, which avoids
+	 * creating completely empty segments.
+	 */
+	if (!state->file)
+	{
+		create_segment(state);
+
+		/* Write zeroes to the previously skipped prefix. */
+		if (state->pageno > 0)
+		{
+			char		zerobuf[BLCKSZ] = {0};
+
+			for (int64 i = 0; i < state->pageno; i++)
+			{
+				if (fwrite(zerobuf, sizeof(char), BLCKSZ, state->file) != BLCKSZ)
+					pg_fatal("could not write file \"%s\": %m", state->fn);
+			}
+		}
+	}
+
+	/* Write page to the new segment (if it was created). */
+	if (state->file)
+	{
+		if (fwrite(buf, sizeof(char), BLCKSZ, state->file) != BLCKSZ)
+			pg_fatal("could not write file \"%s\": %m", state->fn);
+	}
+
+	/*
+	 * Did we reach the maximum page number?  Then close segment file
+	 * and create a new one on the next iteration.
+	 */
+	if (++state->pageno >= SLRU_PAGES_PER_SEGMENT)
+	{
+		/* Start a new segment. */
+		state->segno++;
+		state->pageno = 0;
+
+		close_segment(state);
+	}
+}
+
+typedef uint32 MultiXactOffsetOld;
+
+#define MaxMultiXactOffsetOld	((MultiXactOffsetOld) 0xFFFFFFFF)
+
+#define MULTIXACT_OFFSETS_PER_PAGE_OLD (BLCKSZ / sizeof(MultiXactOffsetOld))
+#define MULTIXACT_OFFSETS_PER_PAGE_NEW (BLCKSZ / sizeof(MultiXactOffset))
+
+/*
+ * Convert pg_multixact/offsets segments and return oldest multi offset.
+ */
+MultiXactOffset
+convert_multixact_offsets(void)
+{
+	SlruSegState		oldseg = {0},
+						newseg = {0};
+	MultiXactOffsetOld	oldbuf[MULTIXACT_OFFSETS_PER_PAGE_OLD] = {0};
+	MultiXactOffset		newbuf[MULTIXACT_OFFSETS_PER_PAGE_NEW] = {0},
+						oldest_offset = 0;
+	uint64				oldest_multi = old_cluster.controldata.chkpnt_oldstMulti,
+						next_multi = old_cluster.controldata.chkpnt_nxtmulti,
+						multi;
+	uint64				old_entry;
+	uint64				new_entry;
+	bool				oldest_offset_known = false;
+
+	oldseg.dir = psprintf("%s/pg_multixact/offsets", old_cluster.pgdata);
+	newseg.dir = psprintf("%s/pg_multixact/offsets", new_cluster.pgdata);
+
+	old_entry = oldest_multi % MULTIXACT_OFFSETS_PER_PAGE_OLD;
+	oldseg.pageno = oldest_multi / MULTIXACT_OFFSETS_PER_PAGE_OLD;
+	oldseg.segno = oldseg.pageno / SLRU_PAGES_PER_SEGMENT;
+	oldseg.pageno %= SLRU_PAGES_PER_SEGMENT;
+
+	new_entry = oldest_multi % MULTIXACT_OFFSETS_PER_PAGE_NEW;
+	newseg.pageno = oldest_multi / MULTIXACT_OFFSETS_PER_PAGE_NEW;
+	newseg.segno = newseg.pageno / SLRU_PAGES_PER_SEGMENT;
+	newseg.pageno %= SLRU_PAGES_PER_SEGMENT;
+
+	if (next_multi < oldest_multi)
+		next_multi += (uint64) 1 << 32;	/* wraparound */
+
+	/* Copy multi offsets reading only needed segment pages */
+	for (multi = oldest_multi; multi < next_multi; old_entry = 0)
+	{
+		int			oldlen;
+		bool		is_empty;
+
+		/* Handle possible segment wraparound */
+		if (oldseg.segno > MaxMultiXactId / MULTIXACT_OFFSETS_PER_PAGE_OLD / SLRU_PAGES_PER_SEGMENT)
+			oldseg.segno = 0;
+
+		oldlen = read_old_segment_page(&oldseg, oldbuf, &is_empty);
+
+		if (oldlen <= 0 || is_empty)
+			pg_fatal("cannot read page %llu from file \"%s\": %m",
+					 (unsigned long long) oldseg.pageno, oldseg.fn);
+
+		if (oldlen < BLCKSZ)
+			memset((char *) oldbuf + oldlen, 0, BLCKSZ - oldlen);
+
+		/* Save oldest multi offset */
+		if (!oldest_offset_known)
+		{
+			oldest_offset = oldbuf[old_entry];
+			oldest_offset_known = true;
+		}
+
+		/* Skip wrapped-around invalid MultiXactIds */
+		if (multi == (uint64) 1 << 32)
+		{
+			Assert(oldseg.segno == 0);
+			Assert(oldseg.pageno == 1);
+			Assert(old_entry == 0);
+
+			multi += FirstMultiXactId;
+			old_entry = FirstMultiXactId;
+		}
+
+		/* Copy entries to the new page */
+		for (; multi < next_multi && old_entry < MULTIXACT_OFFSETS_PER_PAGE_OLD;
+			 multi++, old_entry++)
+		{
+			MultiXactOffset offset = oldbuf[old_entry];
+
+			/* Handle possible offset wraparound (1 becomes 2^32) */
+			if (offset < oldest_offset)
+				offset += ((uint64) 1 << 32) - 1;
+
+			/* Subtract oldest_offset, so new offsets will start from 1 */
+			newbuf[new_entry++] = offset - oldest_offset + 1;
+
+			if (new_entry >= MULTIXACT_OFFSETS_PER_PAGE_NEW)
+			{
+				/* Write new page */
+				write_new_segment_page(&newseg, newbuf);
+				new_entry = 0;
+			}
+		}
+	}
+
+	/* Write the last incomplete page */
+	if (new_entry > 0 || oldest_multi == next_multi)
+	{
+		memset(&newbuf[new_entry], 0,
+			   sizeof(newbuf[0]) * (MULTIXACT_OFFSETS_PER_PAGE_NEW - new_entry));
+		write_new_segment_page(&newseg, newbuf);
+	}
+
+	/* Use next_offset as oldest_offset, if oldest_multi == next_multi */
+	if (!oldest_offset_known)
+	{
+		Assert(oldest_multi == next_multi);
+		oldest_offset = (MultiXactOffset) old_cluster.controldata.chkpnt_nxtmxoff;
+	}
+
+	/* Release resources */
+	close_segment(&oldseg);
+	close_segment(&newseg);
+
+	pfree(oldseg.dir);
+	pfree(newseg.dir);
+
+	return oldest_offset;
+}
+
+#define MXACT_MEMBERS_FLAG_BYTES			1
+
+#define MULTIXACT_MEMBERS_PER_GROUP			4
+#define MULTIXACT_MEMBERGROUP_SIZE			\
+	(MULTIXACT_MEMBERS_PER_GROUP * (sizeof(TransactionId) + MXACT_MEMBERS_FLAG_BYTES))
+#define MULTIXACT_MEMBERGROUPS_PER_PAGE		\
+	(BLCKSZ / MULTIXACT_MEMBERGROUP_SIZE)
+
+#define MULTIXACT_MEMBERS_PER_PAGE				\
+	(MULTIXACT_MEMBERS_PER_GROUP * MULTIXACT_MEMBERGROUPS_PER_PAGE)
+#define MULTIXACT_MEMBER_FLAG_BYTES_PER_GROUP	\
+	(MXACT_MEMBERS_FLAG_BYTES * MULTIXACT_MEMBERS_PER_GROUP)
+
+typedef struct MultiXactMembersCtx
+{
+	SlruSegState	seg;
+	char			buf[BLCKSZ];
+	int				group;
+	int				member;
+	char		   *flag;
+	TransactionId  *xid;
+} MultiXactMembersCtx;
+
+static void
+MultiXactMembersCtxInit(MultiXactMembersCtx *ctx)
+{
+	ctx->seg.dir = psprintf("%s/pg_multixact/members", new_cluster.pgdata);
+
+	ctx->group = 0;
+	ctx->member = 1;		/* skip invalid zero offset */
+
+	ctx->flag = (char *) ctx->buf + ctx->group * MULTIXACT_MEMBERGROUP_SIZE;
+	ctx->xid = (TransactionId *)(ctx->flag + MXACT_MEMBERS_FLAG_BYTES * MULTIXACT_MEMBERS_PER_GROUP);
+
+	ctx->flag += ctx->member;
+	ctx->xid += ctx->member;
+}
+
+static void
+MultiXactMembersCtxAdd(MultiXactMembersCtx *ctx, char flag, TransactionId xid)
+{
+	/* Copy member's xid and flags to the new page */
+	*ctx->flag++ = flag;
+	*ctx->xid++ = xid;
+
+	if (++ctx->member < MULTIXACT_MEMBERS_PER_GROUP)
+		return;
+
+	/* Start next member group */
+	ctx->member = 0;
+
+	if (++ctx->group >= MULTIXACT_MEMBERGROUPS_PER_PAGE)
+	{
+		/* Write current page and start new */
+		write_new_segment_page(&ctx->seg, ctx->buf);
+
+		ctx->group = 0;
+		memset(ctx->buf, 0, BLCKSZ);
+	}
+
+	ctx->flag = (char *) ctx->buf + ctx->group * MULTIXACT_MEMBERGROUP_SIZE;
+	ctx->xid = (TransactionId *)(ctx->flag + MXACT_MEMBERS_FLAG_BYTES * MULTIXACT_MEMBERS_PER_GROUP);
+}
+
+static void
+MultiXactMembersCtxFinit(MultiXactMembersCtx *ctx)
+{
+	if (ctx->flag > (char *) ctx->buf)
+		write_new_segment_page(&ctx->seg, ctx->buf);
+
+	close_segment(&ctx->seg);
+
+	pfree(ctx->seg.dir);
+}
+
+/*
+ * Convert pg_multixact/members segments, offsets will start from 1.
+ */
+void
+convert_multixact_members(MultiXactOffset oldest_offset)
+{
+	MultiXactOffset			next_offset;
+	MultiXactOffset			offset;
+	SlruSegState			oldseg = {0};
+	char					oldbuf[BLCKSZ] = {0};
+	int						oldidx;
+	MultiXactMembersCtx		newctx = {0};
+
+	oldseg.dir = psprintf("%s/pg_multixact/members", old_cluster.pgdata);
+
+	next_offset = (MultiXactOffset) old_cluster.controldata.chkpnt_nxtmxoff;
+	if (next_offset < oldest_offset)
+		next_offset += ((uint64) 1 << 32) - 1;
+
+	/* Initialize the old starting position */
+	oldseg.pageno = oldest_offset / MULTIXACT_MEMBERS_PER_PAGE;
+	oldseg.segno = oldseg.pageno / SLRU_PAGES_PER_SEGMENT;
+	oldseg.pageno %= SLRU_PAGES_PER_SEGMENT;
+
+	/* Initialize new starting position */
+	MultiXactMembersCtxInit(&newctx);
+
+	/* Iterate through the original directory */
+	oldidx = oldest_offset % MULTIXACT_MEMBERS_PER_PAGE;
+	for (offset = oldest_offset; offset < next_offset;)
+	{
+		bool	empty;
+		int		oldlen;
+		int		ngroups;
+		int		oldgroup;
+		int		oldmember;
+
+		oldlen = read_old_segment_page(&oldseg, oldbuf, &empty);
+		if (empty || oldlen != BLCKSZ)
+			pg_fatal("cannot read page %llu from file \"%s\": %m",
+					 (unsigned long long) oldseg.pageno, oldseg.fn);
+
+		/* Iterate through the old member groups */
+		ngroups = oldlen / MULTIXACT_MEMBERGROUP_SIZE;
+		oldmember = oldidx % MULTIXACT_MEMBERS_PER_GROUP;
+		oldgroup = oldidx / MULTIXACT_MEMBERS_PER_GROUP;
+		while (oldgroup < ngroups && offset < next_offset)
+		{
+			char		   *oldflag;
+			TransactionId  *oldxid;
+			int				i;
+
+			oldflag = (char *) oldbuf + oldgroup * MULTIXACT_MEMBERGROUP_SIZE;
+			oldxid = (TransactionId *)(oldflag + MULTIXACT_MEMBER_FLAG_BYTES_PER_GROUP);
+
+			oldxid += oldmember;
+			oldflag += oldmember;
+
+			/* Iterate through the old members */
+			for (i = oldmember;
+				 i < MULTIXACT_MEMBERS_PER_GROUP && offset < next_offset;
+				 i++)
+			{
+				MultiXactMembersCtxAdd(&newctx, *oldflag++, *oldxid++);
+
+				if (++offset == (uint64) 1 << 32)
+				{
+					Assert(i == MaxMultiXactOffsetOld % MULTIXACT_MEMBERS_PER_GROUP);
+					goto wraparound;
+				}
+			}
+
+			oldgroup++;
+			oldmember = 0;
+		}
+
+		oldidx = 0;
+
+		continue;
+
+wraparound:
+#define SEGNO_MAX	(MaxMultiXactOffsetOld / MULTIXACT_MEMBERS_PER_PAGE / SLRU_PAGES_PER_SEGMENT)
+#define PAGENO_MAX	(MaxMultiXactOffsetOld / MULTIXACT_MEMBERS_PER_PAGE % SLRU_PAGES_PER_SEGMENT)
+		Assert((oldseg.segno == SEGNO_MAX && oldseg.pageno == PAGENO_MAX + 1) ||
+			   (oldseg.segno == SEGNO_MAX + 1 && oldseg.pageno == 0));
+
+		/* Switch to segment 0000 */
+		close_segment(&oldseg);
+		oldseg.segno = 0;
+		oldseg.pageno = 0;
+
+		/* skip invalid zero multi offset */
+		oldidx = 1;
+	}
+
+	MultiXactMembersCtxFinit(&newctx);
+
+	/* Release resources */
+	close_segment(&oldseg);
+
+	pfree(oldseg.dir);
+}
-- 
2.43.0
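
The rebasing rule used by convert_multixact_offsets() and by the next_offset adjustment in pg_upgrade.c can be read as one small function: offsets are shifted so that the oldest one becomes 1, and an old offset that is numerically below the oldest is treated as having wrapped. Because old offset 0 is never used, a wrapped sequence spans 2^32 - 1 values rather than 2^32. The sketch below is only an illustration of that rule. The name rebase_offset is hypothetical, and the nonzero precondition mirrors the `if (oldest_offset)` guard in the patch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of the patch: map an old 32-bit member
 * offset into the rebased 64-bit space in which the oldest offset is 1.
 */
static uint64_t
rebase_offset(uint32_t old_offset, uint32_t oldest_offset)
{
	uint64_t	offset = old_offset;

	/* Assumption of this sketch: the oldest offset is a valid, nonzero one. */
	assert(oldest_offset != 0);

	/* Old offset 0 is skipped, so wraparound adds 2^32 - 1, not 2^32. */
	if (offset < oldest_offset)
		offset += (UINT64_C(1) << 32) - 1;

	return offset - oldest_offset + 1;
}
```

For example, with an oldest offset of 0xFFFFFFF0 the sixteen values up to 0xFFFFFFFF map to 1..16, and the wrapped old offset 1 maps to 17.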

Attachment: v6-0004-TEST-lower-SLRU_PAGES_PER_SEGMENT-set-bump-catver.patch (application/octet-stream)
From 970940711a6a4eab4e30f05412dba90fe2570433 Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Tue, 29 Oct 2024 18:28:40 +0300
Subject: [PATCH v6 4/6] TEST: lower SLRU_PAGES_PER_SEGMENT + set bump catver

---
 src/bin/pg_upgrade/pg_upgrade.h  | 2 +-
 src/bin/pg_upgrade/segresize.c   | 2 +-
 src/include/access/slru.h        | 2 +-
 src/include/catalog/catversion.h | 2 +-
 4 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index 2c85ec1e94..01252a7ed5 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -119,7 +119,7 @@ extern char *output_files[];
  *
  * XXX: should be changed to the actual CATALOG_VERSION_NO on commit.
  */
-#define MULTIXACTOFFSET_FORMATCHANGE_CAT_VER 202409041
+#define MULTIXACTOFFSET_FORMATCHANGE_CAT_VER 202411082
 
 /*
  * large object chunk size added to pg_controldata,
diff --git a/src/bin/pg_upgrade/segresize.c b/src/bin/pg_upgrade/segresize.c
index ff7ff65758..0547b51741 100644
--- a/src/bin/pg_upgrade/segresize.c
+++ b/src/bin/pg_upgrade/segresize.c
@@ -13,7 +13,7 @@
 #include "access/multixact.h"
 
 /* See slru.h */
-#define SLRU_PAGES_PER_SEGMENT		32
+#define SLRU_PAGES_PER_SEGMENT		2
 
 /*
  * Some kind of iterator associated with a particular SLRU segment.  The idea is
diff --git a/src/include/access/slru.h b/src/include/access/slru.h
index 97e612cd10..74dd54819d 100644
--- a/src/include/access/slru.h
+++ b/src/include/access/slru.h
@@ -36,7 +36,7 @@
  * take no explicit notice of that fact in slru.c, except when comparing
  * segment and page numbers in SimpleLruTruncate (see PagePrecedes()).
  */
-#define SLRU_PAGES_PER_SEGMENT	32
+#define SLRU_PAGES_PER_SEGMENT	2
 
 /*
  * Page status codes.  Note that these do not include the "dirty" bit.
diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h
index 86436e0356..05048a512b 100644
--- a/src/include/catalog/catversion.h
+++ b/src/include/catalog/catversion.h
@@ -57,6 +57,6 @@
  */
 
 /*							yyyymmddN */
-#define CATALOG_VERSION_NO	202411081
+#define CATALOG_VERSION_NO	202411082
 
 #endif
-- 
2.43.0
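
Lowering SLRU_PAGES_PER_SEGMENT to 2, as the test patch above does, makes the conversion code cross segment boundaries much sooner. The position arithmetic at the start of convert_multixact_offsets() shows why: the segment number is the page number divided by the pages-per-segment constant. Below is a minimal sketch of that arithmetic. The SlruPos struct and locate_multi function are hypothetical names introduced here for illustration only.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical type: position of one offsets entry inside the SLRU. */
typedef struct SlruPos
{
	int64_t		segno;			/* segment file number */
	uint64_t	pageno;			/* page within the segment */
	uint64_t	entry;			/* entry within the page */
} SlruPos;

/*
 * Hypothetical helper: locate the offsets entry for a multixact ID, given
 * the block size, the entry size (4 for old, 8 for new offsets) and the
 * number of pages per segment.
 */
static SlruPos
locate_multi(uint64_t multi, uint32_t blcksz, uint32_t entry_size,
			 uint32_t pages_per_segment)
{
	SlruPos		pos;
	uint64_t	per_page;
	uint64_t	page;

	assert(entry_size > 0 && pages_per_segment > 0);

	per_page = blcksz / entry_size;
	page = multi / per_page;

	pos.entry = multi % per_page;
	pos.segno = (int64_t) (page / pages_per_segment);
	pos.pageno = page % pages_per_segment;
	return pos;
}
```

With 8192-byte blocks, multixact 5000 sits on page 2 of segment 0 in the old 4-byte layout, on page 4 of segment 0 in the new 8-byte layout, and on page 0 of segment 2 once pages per segment is lowered to 2.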

Attachment: v6-0005-TEST-initdb-option-to-initialize-cluster-with-non.patch (application/octet-stream)
From 6e959f89e37614b94d3c4dd5695355095e8c38fd Mon Sep 17 00:00:00 2001
From: Maxim Orlov <m.orlov@postgrespro.ru>
Date: Wed, 4 May 2022 15:53:36 +0300
Subject: [PATCH v6 5/6] TEST: initdb option to initialize cluster with
 non-standard xid/mxid/mxoff

Until now, testing database cluster wraparound has not been easy, because initdb
always initializes the cluster with the default xid/mxid/mxoff. An option to
specify any valid xid/mxid/mxoff at cluster startup makes such testing easier.

Author: Maxim Orlov <orlovmg@gmail.com>
Author: Pavel Borisov <pashkin.elfe@gmail.com>
Author: Svetlana Derevyanko <s.derevyanko@postgrespro.ru>
Discussion: https://www.postgresql.org/message-id/flat/CACG%3Dezaa4vqYjJ16yoxgrpa-%3DgXnf0Vv3Ey9bjGrRRFN2YyWFQ%40mail.gmail.com
---
 src/backend/access/transam/clog.c      |  21 +++++
 src/backend/access/transam/multixact.c |  53 ++++++++++++
 src/backend/access/transam/subtrans.c  |   8 +-
 src/backend/access/transam/xlog.c      |  15 ++--
 src/backend/bootstrap/bootstrap.c      |  50 +++++++++++-
 src/backend/main/main.c                |   6 ++
 src/backend/postmaster/postmaster.c    |  14 +++-
 src/backend/tcop/postgres.c            |  53 +++++++++++-
 src/bin/initdb/initdb.c                | 107 ++++++++++++++++++++++++-
 src/bin/initdb/t/001_initdb.pl         |  60 ++++++++++++++
 src/include/access/xlog.h              |   3 +
 src/include/c.h                        |   4 +
 src/include/catalog/pg_class.h         |   2 +-
 13 files changed, 382 insertions(+), 14 deletions(-)

diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index e6f79320e9..17e29f4497 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -834,6 +834,7 @@ BootStrapCLOG(void)
 {
 	int			slotno;
 	LWLock	   *lock = SimpleLruGetBankLock(XactCtl, 0);
+	int64		pageno;
 
 	LWLockAcquire(lock, LW_EXCLUSIVE);
 
@@ -844,6 +845,26 @@ BootStrapCLOG(void)
 	SimpleLruWritePage(XactCtl, slotno);
 	Assert(!XactCtl->shared->page_dirty[slotno]);
 
+	pageno = TransactionIdToPage(XidFromFullTransactionId(TransamVariables->nextXid));
+	if (pageno != 0)
+	{
+		LWLock *nextlock = SimpleLruGetBankLock(XactCtl, pageno);
+
+		if (nextlock != lock)
+		{
+			LWLockRelease(lock);
+			LWLockAcquire(nextlock, LW_EXCLUSIVE);
+			lock = nextlock;
+		}
+
+		/* Create and zero the commit log page that holds nextXid */
+		slotno = ZeroCLOGPage(pageno, false);
+
+		/* Make sure it's written out */
+		SimpleLruWritePage(XactCtl, slotno);
+		Assert(!XactCtl->shared->page_dirty[slotno]);
+	}
+
 	LWLockRelease(lock);
 }
 
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index a817f539ee..095c39dd93 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1955,6 +1955,7 @@ BootStrapMultiXact(void)
 {
 	int			slotno;
 	LWLock	   *lock;
+	int64		pageno;
 
 	lock = SimpleLruGetBankLock(MultiXactOffsetCtl, 0);
 	LWLockAcquire(lock, LW_EXCLUSIVE);
@@ -1966,6 +1967,26 @@ BootStrapMultiXact(void)
 	SimpleLruWritePage(MultiXactOffsetCtl, slotno);
 	Assert(!MultiXactOffsetCtl->shared->page_dirty[slotno]);
 
+	pageno = MultiXactIdToOffsetPage(MultiXactState->nextMXact);
+	if (pageno != 0)
+	{
+		LWLock *nextlock = SimpleLruGetBankLock(MultiXactOffsetCtl, pageno);
+
+		if (nextlock != lock)
+		{
+			LWLockRelease(lock);
+			LWLockAcquire(nextlock, LW_EXCLUSIVE);
+			lock = nextlock;
+		}
+
+		/* Create and zero the offsets log page that holds nextMXact */
+		slotno = ZeroMultiXactOffsetPage(pageno, false);
+
+		/* Make sure it's written out */
+		SimpleLruWritePage(MultiXactOffsetCtl, slotno);
+		Assert(!MultiXactOffsetCtl->shared->page_dirty[slotno]);
+	}
+
 	LWLockRelease(lock);
 
 	lock = SimpleLruGetBankLock(MultiXactMemberCtl, 0);
@@ -1978,7 +1999,39 @@ BootStrapMultiXact(void)
 	SimpleLruWritePage(MultiXactMemberCtl, slotno);
 	Assert(!MultiXactMemberCtl->shared->page_dirty[slotno]);
 
+	pageno = MXOffsetToMemberPage(MultiXactState->nextOffset);
+	if (pageno != 0)
+	{
+		LWLock *nextlock = SimpleLruGetBankLock(MultiXactMemberCtl, pageno);
+
+		if (nextlock != lock)
+		{
+			LWLockRelease(lock);
+			LWLockAcquire(nextlock, LW_EXCLUSIVE);
+			lock = nextlock;
+		}
+
+		/* Create and zero the members log page that contains nextOffset */
+		slotno = ZeroMultiXactMemberPage(pageno, false);
+
+		/* Make sure it's written out */
+		SimpleLruWritePage(MultiXactMemberCtl, slotno);
+		Assert(!MultiXactMemberCtl->shared->page_dirty[slotno]);
+	}
+
 	LWLockRelease(lock);
+
+	/*
+	 * If we're not starting from offset zero, initialize a dummy multixact to
+	 * avoid an overly long loop in PerformMembersTruncation().
+	 */
+	if (MultiXactState->nextOffset > 0 && MultiXactState->nextMXact > 0)
+	{
+		RecordNewMultiXact(FirstMultiXactId,
+						   MultiXactState->nextOffset, 0, NULL);
+		RecordNewMultiXact(MultiXactState->nextMXact,
+						   MultiXactState->nextOffset, 0, NULL);
+	}
 }
 
 /*
diff --git a/src/backend/access/transam/subtrans.c b/src/backend/access/transam/subtrans.c
index 50bb1d8cfc..a5e6e8f090 100644
--- a/src/backend/access/transam/subtrans.c
+++ b/src/backend/access/transam/subtrans.c
@@ -270,12 +270,16 @@ void
 BootStrapSUBTRANS(void)
 {
 	int			slotno;
-	LWLock	   *lock = SimpleLruGetBankLock(SubTransCtl, 0);
+	LWLock	   *lock;
+	int64		pageno;
+
+	pageno = TransactionIdToPage(XidFromFullTransactionId(TransamVariables->nextXid));
+	lock = SimpleLruGetBankLock(SubTransCtl, pageno);
 
 	LWLockAcquire(lock, LW_EXCLUSIVE);
 
 	/* Create and zero the first page of the subtrans log */
-	slotno = ZeroSUBTRANSPage(0);
+	slotno = ZeroSUBTRANSPage(pageno);
 
 	/* Make sure it's written out */
 	SimpleLruWritePage(SubTransCtl, slotno);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6f58412bca..c61d7d967c 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -136,6 +136,10 @@ int			max_slot_wal_keep_size_mb = -1;
 int			wal_decode_buffer_size = 512 * 1024;
 bool		track_wal_io_timing = false;
 
+TransactionId		start_xid = FirstNormalTransactionId;
+MultiXactId			start_mxid = FirstMultiXactId;
+MultiXactOffset		start_mxoff = 0;
+
 #ifdef WAL_DEBUG
 bool		XLOG_DEBUG = false;
 #endif
@@ -5080,13 +5084,14 @@ BootStrapXLOG(uint32 data_checksum_version)
 	checkPoint.fullPageWrites = fullPageWrites;
 	checkPoint.wal_level = wal_level;
 	checkPoint.nextXid =
-		FullTransactionIdFromEpochAndXid(0, FirstNormalTransactionId);
+		FullTransactionIdFromEpochAndXid(0, Max(FirstNormalTransactionId,
+												start_xid));
 	checkPoint.nextOid = FirstGenbkiObjectId;
-	checkPoint.nextMulti = FirstMultiXactId;
-	checkPoint.nextMultiOffset = 0;
-	checkPoint.oldestXid = FirstNormalTransactionId;
+	checkPoint.nextMulti = Max(FirstMultiXactId, start_mxid);
+	checkPoint.nextMultiOffset = start_mxoff;
+	checkPoint.oldestXid = XidFromFullTransactionId(checkPoint.nextXid);
 	checkPoint.oldestXidDB = Template1DbOid;
-	checkPoint.oldestMulti = FirstMultiXactId;
+	checkPoint.oldestMulti = checkPoint.nextMulti;
 	checkPoint.oldestMultiDB = Template1DbOid;
 	checkPoint.oldestCommitTsXid = InvalidTransactionId;
 	checkPoint.newestCommitTsXid = InvalidTransactionId;
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index ed59dfce89..05ce03a3a3 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -216,7 +216,7 @@ BootstrapModeMain(int argc, char *argv[], bool check_only)
 	argv++;
 	argc--;
 
-	while ((flag = getopt(argc, argv, "B:c:d:D:Fkr:X:-:")) != -1)
+	while ((flag = getopt(argc, argv, "B:c:d:D:Fkm:o:r:X:x:-:")) != -1)
 	{
 		switch (flag)
 		{
@@ -271,12 +271,60 @@ BootstrapModeMain(int argc, char *argv[], bool check_only)
 			case 'k':
 				bootstrap_data_checksum_version = PG_DATA_CHECKSUM_VERSION;
 				break;
+			case 'm':
+				{
+					char	   *endptr;
+
+					errno = 0;
+					start_mxid = strtou64(optarg, &endptr, 0);
+
+					if (endptr == optarg || *endptr != '\0' || errno != 0 ||
+						!StartMultiXactIdIsValid(start_mxid))
+					{
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("invalid initial database cluster multixact id")));
+					}
+				}
+				break;
+			case 'o':
+				{
+					char	   *endptr;
+
+					errno = 0;
+					start_mxoff = strtou64(optarg, &endptr, 0);
+
+					if (endptr == optarg || *endptr != '\0' || errno != 0 ||
+						!StartMultiXactOffsetIsValid(start_mxoff))
+					{
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("invalid initial database cluster multixact offset")));
+					}
+				}
+				break;
 			case 'r':
 				strlcpy(OutputFileName, optarg, MAXPGPATH);
 				break;
 			case 'X':
 				SetConfigOption("wal_segment_size", optarg, PGC_INTERNAL, PGC_S_DYNAMIC_DEFAULT);
 				break;
+			case 'x':
+				{
+					char	   *endptr;
+
+					errno = 0;
+					start_xid = strtou64(optarg, &endptr, 0);
+
+					if (endptr == optarg || *endptr != '\0' || errno != 0 ||
+						!StartTransactionIdIsValid(start_xid))
+					{
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("invalid initial database cluster xid value")));
+					}
+				}
+				break;
 			default:
 				write_stderr("Try \"%s --help\" for more information.\n",
 							 progname);
diff --git a/src/backend/main/main.c b/src/backend/main/main.c
index aea93a0229..6a3224bb82 100644
--- a/src/backend/main/main.c
+++ b/src/backend/main/main.c
@@ -358,12 +358,18 @@ help(const char *progname)
 	printf(_("  -E                 echo statement before execution\n"));
 	printf(_("  -j                 do not use newline as interactive query delimiter\n"));
 	printf(_("  -r FILENAME        send stdout and stderr to given file\n"));
+	printf(_("  -m START_MXID      set initial database cluster multixact id\n"));
+	printf(_("  -o START_MXOFF     set initial database cluster multixact offset\n"));
+	printf(_("  -x START_XID       set initial database cluster xid\n"));
 
 	printf(_("\nOptions for bootstrapping mode:\n"));
 	printf(_("  --boot             selects bootstrapping mode (must be first argument)\n"));
 	printf(_("  --check            selects check mode (must be first argument)\n"));
 	printf(_("  DBNAME             database name (mandatory argument in bootstrapping mode)\n"));
 	printf(_("  -r FILENAME        send stdout and stderr to given file\n"));
+	printf(_("  -m START_MXID      set initial database cluster multixact id\n"));
+	printf(_("  -o START_MXOFF     set initial database cluster multixact offset\n"));
+	printf(_("  -x START_XID       set initial database cluster xid\n"));
 
 	printf(_("\nPlease read the documentation for the complete list of run-time\n"
 			 "configuration settings and how to set them on the command line or in\n"
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 8bee1fb664..af4b004e04 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -562,7 +562,7 @@ PostmasterMain(int argc, char *argv[])
 	 * tcop/postgres.c (the option sets should not conflict) and with the
 	 * common help() function in main/main.c.
 	 */
-	while ((opt = getopt(argc, argv, "B:bC:c:D:d:EeFf:h:ijk:lN:OPp:r:S:sTt:W:-:")) != -1)
+	while ((opt = getopt(argc, argv, "B:bC:c:D:d:EeFf:h:ijk:lm:N:Oo:Pp:r:S:sTt:W:x:-:")) != -1)
 	{
 		switch (opt)
 		{
@@ -659,10 +659,18 @@ PostmasterMain(int argc, char *argv[])
 				SetConfigOption("max_connections", optarg, PGC_POSTMASTER, PGC_S_ARGV);
 				break;
 
+			case 'm':
+				/* only used by single-user backend */
+				break;
+
 			case 'O':
 				SetConfigOption("allow_system_table_mods", "true", PGC_POSTMASTER, PGC_S_ARGV);
 				break;
 
+			case 'o':
+				/* only used by single-user backend */
+				break;
+
 			case 'P':
 				SetConfigOption("ignore_system_indexes", "true", PGC_POSTMASTER, PGC_S_ARGV);
 				break;
@@ -713,6 +721,10 @@ PostmasterMain(int argc, char *argv[])
 				SetConfigOption("post_auth_delay", optarg, PGC_POSTMASTER, PGC_S_ARGV);
 				break;
 
+			case 'x':
+				/* only used by single-user backend */
+				break;
+
 			default:
 				write_stderr("Try \"%s --help\" for more information.\n",
 							 progname);
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index aac0b96bbc..1f0e27b9bf 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3918,7 +3918,7 @@ process_postgres_switches(int argc, char *argv[], GucContext ctx,
 	 * postmaster/postmaster.c (the option sets should not conflict) and with
 	 * the common help() function in main/main.c.
 	 */
-	while ((flag = getopt(argc, argv, "B:bC:c:D:d:EeFf:h:ijk:lN:nOPp:r:S:sTt:v:W:-:")) != -1)
+	while ((flag = getopt(argc, argv, "B:bC:c:D:d:EeFf:h:ijk:lm:N:nOo:Pp:r:S:sTt:v:W:x:-:")) != -1)
 	{
 		switch (flag)
 		{
@@ -4010,6 +4010,23 @@ process_postgres_switches(int argc, char *argv[], GucContext ctx,
 				SetConfigOption("ssl", "true", ctx, gucsource);
 				break;
 
+			case 'm':
+				{
+					char	   *endptr;
+
+					errno = 0;
+					start_mxid = strtou64(optarg, &endptr, 0);
+
+					if (endptr == optarg || *endptr != '\0' || errno != 0 ||
+						!StartMultiXactIdIsValid(start_mxid))
+					{
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("invalid initial database cluster multixact id")));
+					}
+				}
+				break;
+
 			case 'N':
 				SetConfigOption("max_connections", optarg, ctx, gucsource);
 				break;
@@ -4022,6 +4039,23 @@ process_postgres_switches(int argc, char *argv[], GucContext ctx,
 				SetConfigOption("allow_system_table_mods", "true", ctx, gucsource);
 				break;
 
+			case 'o':
+				{
+					char	   *endptr;
+
+					errno = 0;
+					start_mxoff = strtou64(optarg, &endptr, 0);
+
+					if (endptr == optarg || *endptr != '\0' || errno != 0 ||
+						!StartMultiXactOffsetIsValid(start_mxoff))
+					{
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("invalid initial database cluster multixact offset")));
+					}
+				}
+				break;
+
 			case 'P':
 				SetConfigOption("ignore_system_indexes", "true", ctx, gucsource);
 				break;
@@ -4076,6 +4110,23 @@ process_postgres_switches(int argc, char *argv[], GucContext ctx,
 				SetConfigOption("post_auth_delay", optarg, ctx, gucsource);
 				break;
 
+			case 'x':
+				{
+					char	   *endptr;
+
+					errno = 0;
+					start_xid = strtou64(optarg, &endptr, 0);
+
+					if (endptr == optarg || *endptr != '\0' || errno != 0 ||
+						!StartTransactionIdIsValid(start_xid))
+					{
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("invalid initial database cluster xid")));
+					}
+				}
+				break;
+
 			default:
 				errs++;
 				break;
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 9a91830783..410868dddf 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -168,6 +168,9 @@ static bool data_checksums = true;
 static char *xlog_dir = NULL;
 static int	wal_segment_size_mb = (DEFAULT_XLOG_SEG_SIZE) / (1024 * 1024);
 static DataDirSyncMethod sync_method = DATA_DIR_SYNC_METHOD_FSYNC;
+static TransactionId start_xid = 0;
+static MultiXactId start_mxid = 0;
+static MultiXactOffset start_mxoff = 0;
 
 
 /* internal vars */
@@ -1568,6 +1571,11 @@ bootstrap_template1(void)
 	bki_lines = replace_token(bki_lines, "POSTGRES",
 							  escape_quotes_bki(username));
 
+	/* relfrozenxid must not be less than FirstNormalTransactionId */
+	sprintf(buf, "%llu", (unsigned long long) Max(start_xid, 3));
+	bki_lines = replace_token(bki_lines, "RECENTXMIN",
+							  buf);
+
 	bki_lines = replace_token(bki_lines, "ENCODING",
 							  encodingid_to_string(encodingid));
 
@@ -1593,6 +1601,9 @@ bootstrap_template1(void)
 
 	printfPQExpBuffer(&cmd, "\"%s\" --boot %s %s", backend_exec, boot_options, extra_options);
 	appendPQExpBuffer(&cmd, " -X %d", wal_segment_size_mb * (1024 * 1024));
+	appendPQExpBuffer(&cmd, " -m %llu", (unsigned long long) start_mxid);
+	appendPQExpBuffer(&cmd, " -o %llu", (unsigned long long) start_mxoff);
+	appendPQExpBuffer(&cmd, " -x %llu", (unsigned long long) start_xid);
 	if (data_checksums)
 		appendPQExpBuffer(&cmd, " -k");
 	if (debug)
@@ -2532,12 +2543,20 @@ usage(const char *progname)
 	printf(_("  -d, --debug               generate lots of debugging output\n"));
 	printf(_("      --discard-caches      set debug_discard_caches=1\n"));
 	printf(_("  -L DIRECTORY              where to find the input files\n"));
+	printf(_("  -m, --multixact-id=START_MXID\n"
+			 "                            set initial database cluster multixact id\n"
+			 "                            max value is 2^32-1\n"));
 	printf(_("  -n, --no-clean            do not clean up after errors\n"));
 	printf(_("  -N, --no-sync             do not wait for changes to be written safely to disk\n"));
 	printf(_("      --no-instructions     do not print instructions for next steps\n"));
+	printf(_("  -o, --multixact-offset=START_MXOFF\n"
+			 "                            set initial database cluster multixact offset\n"
+			 "                            max value is 2^32-1\n"));
 	printf(_("  -s, --show                show internal settings, then exit\n"));
 	printf(_("      --sync-method=METHOD  set method for syncing files to disk\n"));
 	printf(_("  -S, --sync-only           only sync database files to disk, then exit\n"));
+	printf(_("  -x, --xid=START_XID       set initial database cluster xid\n"
+			 "                            max value is 2^32-1\n"));
 	printf(_("\nOther options:\n"));
 	printf(_("  -V, --version             output version information, then exit\n"));
 	printf(_("  -?, --help                show this help, then exit\n"));
@@ -3079,6 +3098,18 @@ initialize_data_directory(void)
 	/* Now create all the text config files */
 	setup_config();
 
+	if (start_mxid != 0)
+		printf(_("selecting initial multixact id ... %llu\n"),
+				 (unsigned long long) start_mxid);
+
+	if (start_mxoff != 0)
+		printf(_("selecting initial multixact offset ... %llu\n"),
+				 (unsigned long long) start_mxoff);
+
+	if (start_xid != 0)
+		printf(_("selecting initial xid ... %llu\n"),
+				 (unsigned long long) start_xid);
+
 	/* Bootstrap template1 */
 	bootstrap_template1();
 
@@ -3095,8 +3126,12 @@ initialize_data_directory(void)
 	fflush(stdout);
 
 	initPQExpBuffer(&cmd);
-	printfPQExpBuffer(&cmd, "\"%s\" %s %s template1 >%s",
-					  backend_exec, backend_options, extra_options, DEVNULL);
+	printfPQExpBuffer(&cmd, "\"%s\" %s %s",
+					  backend_exec, backend_options, extra_options);
+	appendPQExpBuffer(&cmd, " -m %llu", (unsigned long long) start_mxid);
+	appendPQExpBuffer(&cmd, " -o %llu", (unsigned long long) start_mxoff);
+	appendPQExpBuffer(&cmd, " -x %llu", (unsigned long long) start_xid);
+	appendPQExpBuffer(&cmd, " template1 >%s", DEVNULL);
 
 	PG_CMD_OPEN(cmd.data);
 
@@ -3183,6 +3218,9 @@ main(int argc, char *argv[])
 		{"icu-rules", required_argument, NULL, 18},
 		{"sync-method", required_argument, NULL, 19},
 		{"no-data-checksums", no_argument, NULL, 20},
+		{"xid", required_argument, NULL, 'x'},
+		{"multixact-id", required_argument, NULL, 'm'},
+		{"multixact-offset", required_argument, NULL, 'o'},
 		{NULL, 0, NULL, 0}
 	};
 
@@ -3224,7 +3262,7 @@ main(int argc, char *argv[])
 
 	/* process command-line options */
 
-	while ((c = getopt_long(argc, argv, "A:c:dD:E:gkL:nNsST:U:WX:",
+	while ((c = getopt_long(argc, argv, "A:c:dD:E:gkL:m:nNo:sST:U:Wx:X:",
 							long_options, &option_index)) != -1)
 	{
 		switch (c)
@@ -3282,6 +3320,30 @@ main(int argc, char *argv[])
 				debug = true;
 				printf(_("Running in debug mode.\n"));
 				break;
+			case 'm':
+				{
+					char	   *endptr;
+
+					errno = 0;
+					start_mxid = strtou64(optarg, &endptr, 0);
+
+					if (endptr == optarg || *endptr != '\0' || errno != 0 ||
+						!StartMultiXactIdIsValid(start_mxid))
+					{
+						pg_log_error("invalid initial database cluster multixact id");
+						exit(1);
+					}
+					else if (start_mxid < 1) /* FirstMultiXactId */
+					{
+						/*
+						 * Avoid silently setting mxid to
+						 * FirstMultiXactId, although that is harmless.
+						 */
+						pg_log_error("multixact id should be greater than 0");
+						exit(1);
+					}
+				}
+				break;
 			case 'n':
 				noclean = true;
 				printf(_("Running in no-clean mode.  Mistakes will not be cleaned up.\n"));
@@ -3289,6 +3351,21 @@ main(int argc, char *argv[])
 			case 'N':
 				do_sync = false;
 				break;
+			case 'o':
+				{
+					char	   *endptr;
+
+					errno = 0;
+					start_mxoff = strtou64(optarg, &endptr, 0);
+
+					if (endptr == optarg || *endptr != '\0' || errno != 0 ||
+						!StartMultiXactOffsetIsValid(start_mxoff))
+					{
+						pg_log_error("invalid initial database cluster multixact offset");
+						exit(1);
+					}
+				}
+				break;
 			case 'S':
 				sync_only = true;
 				break;
@@ -3377,6 +3454,30 @@ main(int argc, char *argv[])
 			case 20:
 				data_checksums = false;
 				break;
+			case 'x':
+				{
+					char	   *endptr;
+
+					errno = 0;
+					start_xid = strtou64(optarg, &endptr, 0);
+
+					if (endptr == optarg || *endptr != '\0' || errno != 0 ||
+						!StartTransactionIdIsValid(start_xid))
+					{
+						pg_log_error("invalid value for initial database cluster xid");
+						exit(1);
+					}
+					else if (start_xid < 3) /* FirstNormalTransactionId */
+					{
+						/*
+						 * Avoid silently setting xid to
+						 * FirstNormalTransactionId, although that is harmless.
+						 */
+						pg_log_error("xid should be greater than 2");
+						exit(1);
+					}
+				}
+				break;
 			default:
 				/* getopt_long already emitted a complaint */
 				pg_log_error_hint("Try \"%s --help\" for more information.", progname);
diff --git a/src/bin/initdb/t/001_initdb.pl b/src/bin/initdb/t/001_initdb.pl
index 7520d3d0dd..91a85d9f4d 100644
--- a/src/bin/initdb/t/001_initdb.pl
+++ b/src/bin/initdb/t/001_initdb.pl
@@ -282,4 +282,64 @@ command_fails(
 	[ 'pg_checksums', '-D', $datadir_nochecksums ],
 	"pg_checksums fails with data checksum disabled");
 
+# Set non-standard initial mxid/mxoff/xid.
+command_fails_like(
+	[ 'initdb', '-m', 'seven', $datadir ],
+	qr/initdb: error: invalid initial database cluster multixact id/,
+	'fails for invalid initial database cluster multixact id');
+command_fails_like(
+	[ 'initdb', '-o', 'seven', $datadir ],
+	qr/initdb: error: invalid initial database cluster multixact offset/,
+	'fails for invalid initial database cluster multixact offset');
+command_fails_like(
+	[ 'initdb', '-x', 'seven', $datadir ],
+	qr/initdb: error: invalid value for initial database cluster xid/,
+	'fails for invalid initial database cluster xid');
+
+command_checks_all(
+	[ 'initdb', '-m', '65535', "$tempdir/data-m65535" ],
+	0,
+	[qr/selecting initial multixact id ... 65535/],
+	[],
+	'selecting initial multixact id');
+command_checks_all(
+	[ 'initdb', '-o', '65535', "$tempdir/data-o65535" ],
+	0,
+	[qr/selecting initial multixact offset ... 65535/],
+	[],
+	'selecting initial multixact offset');
+command_checks_all(
+	[ 'initdb', '-x', '65535', "$tempdir/data-x65535" ],
+	0,
+	[qr/selecting initial xid ... 65535/],
+	[],
+	'selecting initial xid');
+
+# Setup new cluster with given mxid/mxoff/xid.
+my $node;
+my $result;
+
+$node = PostgreSQL::Test::Cluster->new('test-mxid');
+$node->init(extra => ['-m', '16777215']); # 0xFFFFFF
+$node->start;
+$result = $node->safe_psql('postgres', "SELECT next_multixact_id FROM pg_control_checkpoint();");
+ok($result >= 16777215, 'setup cluster with given mxid');
+$node->stop;
+
+$node = PostgreSQL::Test::Cluster->new('test-mxoff');
+$node->init(extra => ['-o', '16777215']); # 0xFFFFFF
+$node->start;
+$result = $node->safe_psql('postgres', "SELECT next_multi_offset FROM pg_control_checkpoint();");
+ok($result >= 16777215, 'setup cluster with given mxoff');
+$node->stop;
+
+$node = PostgreSQL::Test::Cluster->new('test-xid');
+$node->init(extra => ['-x', '16777215']); # 0xFFFFFF
+$node->start;
+$result = $node->safe_psql('postgres', "SELECT txid_current();");
+ok($result >= 16777215, 'setup cluster with given xid - check 1');
+$result = $node->safe_psql('postgres', "SELECT oldest_xid FROM pg_control_checkpoint();");
+ok($result >= 16777215, 'setup cluster with given xid - check 2');
+$node->stop;
+
 done_testing();
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 34ad46c067..4ce79b12e3 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -94,6 +94,9 @@ typedef enum RecoveryState
 } RecoveryState;
 
 extern PGDLLIMPORT int wal_level;
+extern PGDLLIMPORT TransactionId start_xid;
+extern PGDLLIMPORT MultiXactId start_mxid;
+extern PGDLLIMPORT MultiXactOffset start_mxoff;
 
 /* Is WAL archiving enabled (always or only while server is running normally)? */
 #define XLogArchivingActive() \
diff --git a/src/include/c.h b/src/include/c.h
index e1b3187d0b..f770e9a140 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -668,6 +668,10 @@ typedef uint64 MultiXactOffset;
 
 typedef uint32 CommandId;
 
+#define StartTransactionIdIsValid(xid)			((xid) <= 0xFFFFFFFF)
+#define StartMultiXactIdIsValid(mxid)			((mxid) <= 0xFFFFFFFF)
+#define StartMultiXactOffsetIsValid(offset)		((offset) <= 0xFFFFFFFF)
+
 #define FirstCommandId	((CommandId) 0)
 #define InvalidCommandId	(~(CommandId)0)
 
diff --git a/src/include/catalog/pg_class.h b/src/include/catalog/pg_class.h
index 0fc2c093b0..0a7518df0d 100644
--- a/src/include/catalog/pg_class.h
+++ b/src/include/catalog/pg_class.h
@@ -123,7 +123,7 @@ CATALOG(pg_class,1259,RelationRelationId) BKI_BOOTSTRAP BKI_ROWTYPE_OID(83,Relat
 	Oid			relrewrite BKI_DEFAULT(0) BKI_LOOKUP_OPT(pg_class);
 
 	/* all Xids < this are frozen in this rel */
-	TransactionId relfrozenxid BKI_DEFAULT(3);	/* FirstNormalTransactionId */
+	TransactionId relfrozenxid BKI_DEFAULT(RECENTXMIN);	/* FirstNormalTransactionId */
 
 	/* all multixacts in this rel are >= this; it is really a MultiXactId */
 	TransactionId relminmxid BKI_DEFAULT(1);	/* FirstMultiXactId */
-- 
2.43.0
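
For readers following the -m/-o/-x handling in the patch above, here is a minimal standalone sketch of the same validation pattern. It is not the patch's code: it uses the standard strtoull() instead of the tree's strtou64(), and parse_start_value() and START_VALUE_MAX are hypothetical names made up for this illustration. The upper bound mirrors the Start*IsValid() macros added to c.h (value <= 0xFFFFFFFF).

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical constant mirroring the bound in the Start*IsValid() macros. */
#define START_VALUE_MAX UINT64_C(0xFFFFFFFF)

/*
 * Hypothetical helper: parse a start value given on the command line.
 * Returns the parsed value, or -1 if the string is empty, has trailing
 * garbage, overflows, or exceeds START_VALUE_MAX.
 */
int64_t
parse_start_value(const char *arg)
{
	char	   *endptr;
	unsigned long long val;

	assert(arg != NULL);

	errno = 0;
	val = strtoull(arg, &endptr, 0);	/* base 0: decimal, 0x hex, 0 octal */

	if (endptr == arg ||		/* no digits consumed */
		*endptr != '\0' ||		/* trailing characters */
		errno != 0 ||			/* out of range for unsigned long long */
		val > START_VALUE_MAX)	/* beyond the 32-bit bound */
		return -1;

	return (int64_t) val;
}
```

Note that a negative input such as "-1" is accepted by strtoull() and wraps to a very large value, so in this sketch it is only the range check that rejects it.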

v6-0003-Get-rid-of-MultiXactMemberFreezeThreshold-call.patch (application/octet-stream)
From d5f1e8880a5f072c389274954b21f982797af47e Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Wed, 23 Oct 2024 18:23:39 +0300
Subject: [PATCH v6 3/6] Get rid of MultiXactMemberFreezeThreshold call.

Since MaxMultiXactOffset is now UINT64_MAX, the MULTIXACT_MEMBER_SAFE_THRESHOLD
and MULTIXACT_MEMBER_DANGER_THRESHOLD values are no longer meaningful. Thus,
MultiXactMemberFreezeThreshold is no longer needed either.

Instead, switch to a MULTIXACT_MEMBER_AUTOVAC_THRESHOLD (2^32 - 1) members
threshold. It is used to determine whether autovacuum must be forced.

Author: Maxim Orlov <orlovmg@gmail.com>
---
 src/backend/access/transam/multixact.c | 117 +++----------------------
 src/backend/commands/vacuum.c          |   2 +-
 src/backend/postmaster/autovacuum.c    |   4 +-
 src/include/access/multixact.h         |   1 -
 4 files changed, 15 insertions(+), 109 deletions(-)

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 48e1c0160a..a817f539ee 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -204,10 +204,14 @@ MXOffsetToMemberOffset(MultiXactOffset offset)
 		member_in_group * sizeof(TransactionId);
 }
 
-/* Multixact members wraparound thresholds. */
-#define MULTIXACT_MEMBER_SAFE_THRESHOLD		(MaxMultiXactOffset / 2)
-#define MULTIXACT_MEMBER_DANGER_THRESHOLD	\
-	(MaxMultiXactOffset - MaxMultiXactOffset / 4)
+/*
+ * Multixact members warning threshold.
+ *
+ * If the difference between nextOffset and oldestOffset exceeds this value, we
+ * trigger autovacuum in order to release disk space and, if possible, reduce
+ * table bloat.
+ */
+#define MULTIXACT_MEMBER_AUTOVAC_THRESHOLD		UINT64CONST(0xFFFFFFFF)
 
 static inline MultiXactId
 PreviousMultiXactId(MultiXactId multi)
@@ -2616,15 +2620,13 @@ GetOldestMultiXactId(void)
 }
 
 /*
- * Determine how aggressively we need to vacuum in order to prevent member
- * wraparound.
+ * Determine whether we need to vacuum to free multixact member space.
  *
  * To do so determine what's the oldest member offset and install the limit
  * info in MultiXactState, where it can be used to prevent overrun of old data
  * in the members SLRU area.
  *
- * The return value is true if emergency autovacuum is required and false
- * otherwise.
+ * The return value is true if autovacuum is required and false otherwise.
  */
 static bool
 SetOffsetVacuumLimit(bool is_startup)
@@ -2712,10 +2714,10 @@ SetOffsetVacuumLimit(bool is_startup)
 	LWLockRelease(MultiXactGenLock);
 
 	/*
-	 * Do we need an emergency autovacuum?	If we're not sure, assume yes.
+	 * Do we need autovacuum?	If we're not sure, assume yes.
 	 */
 	return !oldestOffsetKnown ||
-		(nextOffset - oldestOffset > MULTIXACT_MEMBER_SAFE_THRESHOLD);
+		(nextOffset - oldestOffset > MULTIXACT_MEMBER_AUTOVAC_THRESHOLD);
 }
 
 /*
@@ -2761,101 +2763,6 @@ find_multixact_start(MultiXactId multi, MultiXactOffset *result)
 	return true;
 }
 
-/*
- * Determine how many multixacts, and how many multixact members, currently
- * exist.  Return false if unable to determine.
- */
-static bool
-ReadMultiXactCounts(uint32 *multixacts, MultiXactOffset *members)
-{
-	MultiXactOffset nextOffset;
-	MultiXactOffset oldestOffset;
-	MultiXactId oldestMultiXactId;
-	MultiXactId nextMultiXactId;
-	bool		oldestOffsetKnown;
-
-	LWLockAcquire(MultiXactGenLock, LW_SHARED);
-	nextOffset = MultiXactState->nextOffset;
-	oldestMultiXactId = MultiXactState->oldestMultiXactId;
-	nextMultiXactId = MultiXactState->nextMXact;
-	oldestOffset = MultiXactState->oldestOffset;
-	oldestOffsetKnown = MultiXactState->oldestOffsetKnown;
-	LWLockRelease(MultiXactGenLock);
-
-	if (!oldestOffsetKnown)
-		return false;
-
-	*members = nextOffset - oldestOffset;
-	*multixacts = nextMultiXactId - oldestMultiXactId;
-	return true;
-}
-
-/*
- * Multixact members can be removed once the multixacts that refer to them
- * are older than every datminmxid.  autovacuum_multixact_freeze_max_age and
- * vacuum_multixact_freeze_table_age work together to make sure we never have
- * too many multixacts; we hope that, at least under normal circumstances,
- * this will also be sufficient to keep us from using too many offsets.
- * However, if the average multixact has many members, we might exhaust the
- * members space while still using few enough members that these limits fail
- * to trigger relminmxid advancement by VACUUM.  At that point, we'd have no
- * choice but to start failing multixact-creating operations with an error.
- *
- * To prevent that, if more than a threshold portion of the members space is
- * used, we effectively reduce autovacuum_multixact_freeze_max_age and
- * to a value just less than the number of multixacts in use.  We hope that
- * this will quickly trigger autovacuuming on the table or tables with the
- * oldest relminmxid, thus allowing datminmxid values to advance and removing
- * some members.
- *
- * As the fraction of the member space currently in use grows, we become
- * more aggressive in clamping this value.  That not only causes autovacuum
- * to ramp up, but also makes any manual vacuums the user issues more
- * aggressive.  This happens because vacuum_get_cutoffs() will clamp the
- * freeze table and the minimum freeze age cutoffs based on the effective
- * autovacuum_multixact_freeze_max_age this function returns.  In the worst
- * case, we'll claim the freeze_max_age to zero, and every vacuum of any
- * table will freeze every multixact.
- */
-int
-MultiXactMemberFreezeThreshold(void)
-{
-	MultiXactOffset members;
-	uint32		multixacts;
-	uint32		victim_multixacts;
-	double		fraction;
-	int			result;
-
-	/* If we can't determine member space utilization, assume the worst. */
-	if (!ReadMultiXactCounts(&multixacts, &members))
-		return 0;
-
-	/* If member space utilization is low, no special action is required. */
-	if (members <= MULTIXACT_MEMBER_SAFE_THRESHOLD)
-		return autovacuum_multixact_freeze_max_age;
-
-	/*
-	 * Compute a target for relminmxid advancement.  The number of multixacts
-	 * we try to eliminate from the system is based on how far we are past
-	 * MULTIXACT_MEMBER_SAFE_THRESHOLD.
-	 */
-	fraction = (double) (members - MULTIXACT_MEMBER_SAFE_THRESHOLD);
-	fraction /= (double) (MULTIXACT_MEMBER_DANGER_THRESHOLD - MULTIXACT_MEMBER_SAFE_THRESHOLD);
-
-	victim_multixacts = multixacts * fraction;
-
-	/* fraction could be > 1.0, but lowest possible freeze age is zero */
-	if (victim_multixacts > multixacts)
-		return 0;
-	result = multixacts - victim_multixacts;
-
-	/*
-	 * Clamp to autovacuum_multixact_freeze_max_age, so that we never make
-	 * autovacuum less aggressive than it would otherwise be.
-	 */
-	return Min(result, autovacuum_multixact_freeze_max_age);
-}
-
 typedef struct mxtruncinfo
 {
 	int64		earliestExistingPage;
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 86f36b3695..e7506e268a 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -1133,7 +1133,7 @@ vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
 	 * normally autovacuum_multixact_freeze_max_age, but may be less if we are
 	 * short of multixact member space.
 	 */
-	effective_multixact_freeze_max_age = MultiXactMemberFreezeThreshold();
+	effective_multixact_freeze_max_age = autovacuum_multixact_freeze_max_age;
 
 	/*
 	 * Almost ready to set freeze output parameters; check if OldestXmin or
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index dc3cf87aba..180bb7e96e 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -1122,7 +1122,7 @@ do_start_worker(void)
 
 	/* Also determine the oldest datminmxid we will consider. */
 	recentMulti = ReadNextMultiXactId();
-	multiForceLimit = recentMulti - MultiXactMemberFreezeThreshold();
+	multiForceLimit = recentMulti - autovacuum_multixact_freeze_max_age;
 	if (multiForceLimit < FirstMultiXactId)
 		multiForceLimit -= FirstMultiXactId;
 
@@ -1915,7 +1915,7 @@ do_autovacuum(void)
 	 * normally autovacuum_multixact_freeze_max_age, but may be less if we are
 	 * short of multixact member space.
 	 */
-	effective_multixact_freeze_max_age = MultiXactMemberFreezeThreshold();
+	effective_multixact_freeze_max_age = autovacuum_multixact_freeze_max_age;
 
 	/*
 	 * Find the pg_database entry and select the default freeze ages. We use
diff --git a/src/include/access/multixact.h b/src/include/access/multixact.h
index 90583634ec..5aefbddce3 100644
--- a/src/include/access/multixact.h
+++ b/src/include/access/multixact.h
@@ -143,7 +143,6 @@ extern void MultiXactSetNextMXact(MultiXactId nextMulti,
 extern void MultiXactAdvanceNextMXact(MultiXactId minMulti,
 									  MultiXactOffset minMultiOffset);
 extern void MultiXactAdvanceOldest(MultiXactId oldestMulti, Oid oldestMultiDB);
-extern int	MultiXactMemberFreezeThreshold(void);
 
 extern void multixact_twophase_recover(TransactionId xid, uint16 info,
 									   void *recdata, uint32 len);
-- 
2.43.0
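
To make the new return condition in SetOffsetVacuumLimit() easier to reason about, here is a small standalone sketch of the decision as it reads in the patch above: force autovacuum when the oldest offset is unknown, or when more than 0xFFFFFFFF members are in use. The function name members_need_autovacuum() and the constant MEMBER_AUTOVAC_THRESHOLD are hypothetical stand-ins, and the assertion that offsets only grow is an assumption of this sketch, not something the patch checks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for MULTIXACT_MEMBER_AUTOVAC_THRESHOLD. */
#define MEMBER_AUTOVAC_THRESHOLD UINT64_C(0xFFFFFFFF)

/*
 * Hypothetical helper: decide whether autovacuum should be forced, given
 * the next and oldest multixact member offsets.
 */
bool
members_need_autovacuum(bool oldest_offset_known,
						uint64_t next_offset,
						uint64_t oldest_offset)
{
	/* If we are not sure how much member space is used, assume yes. */
	if (!oldest_offset_known)
		return true;

	/* Sketch assumption: 64-bit offsets never wrap, so next >= oldest. */
	assert(next_offset >= oldest_offset);

	return next_offset - oldest_offset > MEMBER_AUTOVAC_THRESHOLD;
}
```

With this shape the check is a plain yes/no; the gradual clamping of the freeze age that MultiXactMemberFreezeThreshold() used to perform is gone.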

v6-0006-TEST-add-basic-mxidoff64-tests-005_mxidoff.pl.patch (application/octet-stream)
From 386cfe747bc4ccd867f3e27f5f7669c8eb7692f3 Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Sat, 2 Nov 2024 10:46:16 +0300
Subject: [PATCH v6 6/6] TEST: add basic mxidoff64 tests 005_mxidoff.pl

---
 src/bin/pg_upgrade/t/005_mxidoff.pl | 389 ++++++++++++++++++++++++++++
 1 file changed, 389 insertions(+)
 create mode 100644 src/bin/pg_upgrade/t/005_mxidoff.pl

diff --git a/src/bin/pg_upgrade/t/005_mxidoff.pl b/src/bin/pg_upgrade/t/005_mxidoff.pl
new file mode 100644
index 0000000000..e595870543
--- /dev/null
+++ b/src/bin/pg_upgrade/t/005_mxidoff.pl
@@ -0,0 +1,389 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+use strict;
+use warnings FATAL => 'all';
+
+use File::Find qw(find);
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+if (!defined($ENV{oldinstall}))
+{
+	die "oldinstall is not defined";
+}
+
+sub mxid_prepare
+{
+	my ($node) = @_;
+
+	$node->safe_psql('postgres',
+	q(
+	CREATE TABLE FOO(BAR INT PRIMARY KEY, BAZ INT);
+	CREATE OR REPLACE PROCEDURE MXIDFILLER(N_STEPS INT DEFAULT 1000)
+	LANGUAGE PLPGSQL
+	AS $$
+	BEGIN
+		FOR I IN 1..N_STEPS LOOP
+			UPDATE FOO SET BAZ = RANDOM(1, 1000)
+			WHERE BAR IN (SELECT BAR FROM FOO TABLESAMPLE BERNOULLI(80));
+			COMMIT;
+		END LOOP;
+	END;$$;
+	INSERT INTO FOO (BAR, BAZ) SELECT ID, ID FROM GENERATE_SERIES(1, 512) ID;
+	));
+}
+
+sub mxid_fill
+{
+	my ($node) = @_;
+
+	$node->safe_psql('postgres',
+	q(
+	BEGIN;
+	SELECT * FROM FOO FOR KEY SHARE;
+	PREPARE TRANSACTION 'A';
+	CALL MXIDFILLER(365);
+	COMMIT PREPARED 'A';
+	),
+	timeout => 3600);
+}
+
+# Fetch latest multixact checkpoint values.
+sub multi_bounds
+{
+	my ($node) = @_;
+	my ($stdout, $stderr) = run_command([ 'pg_controldata', $node->data_dir ]);
+	my @control_data = split("\n", $stdout);
+	my $next = undef;
+	my $oldest = undef;
+
+	foreach (@control_data)
+	{
+		if ($_ =~ /^Latest checkpoint's NextMultiXactId:\s*(.*)$/mg)
+		{
+			$next = $1;
+		}
+
+		if ($_ =~ /^Latest checkpoint's oldestMultiXid:\s*(.*)$/mg)
+		{
+			$oldest = $1;
+		}
+
+		if (defined($oldest) && defined($next))
+		{
+			last;
+		}
+	}
+
+	die "Latest checkpoint's NextMultiXactId not found in control file!\n"
+	unless defined($next);
+
+	die "Latest checkpoint's oldestMultiXid not found in control file!\n"
+	unless defined($oldest);
+
+	return ($oldest, $next);
+}
+
+# List pg_multixact/offsets segment file names.
+sub list_actual_multixact_offsets
+{
+	my ($node) = @_;
+	my $dir;
+
+	opendir($dir, $node->data_dir . '/pg_multixact/offsets') or die $!;
+	my @list = sort grep { /^[0-9A-F]+$/ } readdir $dir;
+	closedir $dir;
+
+	return @list;
+}
+
+use constant SIZEOF_MULTI_XACT_OFFSET   => 8;
+use constant BLCKSZ                     => 8192;
+use constant MULTIXACT_OFFSETS_PER_PAGE => BLCKSZ / SIZEOF_MULTI_XACT_OFFSET;
+use constant SLRU_PAGES_PER_SEGMENT     => 2;
+
+# See src/backend/access/transam/multixact.c
+sub MultiXactIdToOffsetSegment
+{
+	my ($multi) = @_;
+
+	return int($multi / MULTIXACT_OFFSETS_PER_PAGE / SLRU_PAGES_PER_SEGMENT);
+}
+
+# Validate pg_multixact/offsets segments conversion.
+sub validate_multixact_offsets
+{
+	my ($old, $new, $oldnode) = @_;
+	my ($oldest, $next) = multi_bounds($oldnode);
+	my $maxsegno = MultiXactIdToOffsetSegment($next);
+	my $maxsegname = sprintf("%04X", $maxsegno);
+
+	print(">>>>>>>>>\n");
+	foreach my $segname ( @$old )
+	{
+		my $segno = hex($segname) * 2;
+		my $converted1 = sprintf("%04X", $segno);
+		my $converted2 = sprintf("%04X", $segno + 1);
+
+		print "[${segname}] -> [${converted1}, ${converted2}] \n";
+		# Skip the last segment as it may be incomplete.
+		if ($converted1 ne $maxsegname)
+		{
+			# Both halves of the converted segment must exist.
+			die "Segment ${segname} is not properly converted"
+			unless (grep { $converted1 eq $_ } @$new) and
+				   (grep { $converted2 eq $_ } @$new);
+		}
+	}
+	print(">>>>>>>>>\n");
+
+	return 1;
+}
+
+#
+# Select tests to run.
+#
+my @tests = (0, 1, 2, 3);
+
+# =============================================================================
+# CASE 0
+#
+# Freshly initialized cluster: no multixacts are created before the upgrade.
+# =============================================================================
+SKIP:
+{
+	skip "case 0", 0
+		unless ( grep( /^0$/, @tests ) );
+
+	my $oldnode = PostgreSQL::Test::Cluster->new('old_node0',
+											  install_path => $ENV{oldinstall});
+	$oldnode->init(force_initdb => 1);
+	$oldnode->append_conf('postgresql.conf', 'max_prepared_transactions = 2');
+	$oldnode->append_conf('postgresql.conf', 'fsync = off');
+
+	my $newnode = PostgreSQL::Test::Cluster->new('new_node0');
+	$newnode->init();
+
+	command_ok(
+		[
+			'pg_upgrade', '--no-sync',
+			'-d', $oldnode->data_dir,
+			'-D', $newnode->data_dir,
+			'-b', $oldnode->config_data('--bindir'),
+			'-B', $newnode->config_data('--bindir'),
+			'-s', $newnode->host,
+			'-p', $oldnode->port,
+			'-P', $newnode->port,
+			'--copy'
+		],
+		'run of pg_upgrade');
+
+	my @o = list_actual_multixact_offsets($oldnode);
+	my @n = list_actual_multixact_offsets($newnode);
+	ok(validate_multixact_offsets(\@o, \@n, $oldnode),
+		"case0: offsets segments matched");
+
+	$oldnode->start();
+	$newnode->start();
+
+	# just in case...
+	my $oldval = $oldnode->safe_psql('postgres', q(SELECT 1));
+	my $newval = $newnode->safe_psql('postgres', q(SELECT 1));
+	is($oldval, $newval, "case0: select eq");
+
+	$oldnode->stop();
+	$newnode->stop();
+}
+
+# =============================================================================
+# CASE 1
+#
+# There must be several segments starting from zero.
+# =============================================================================
+SKIP:
+{
+	skip "case 1", 1
+		unless ( grep( /^1$/, @tests ) );
+
+	my $oldnode = PostgreSQL::Test::Cluster->new('old_node1',
+											  install_path => $ENV{oldinstall});
+	$oldnode->init(force_initdb => 1);
+	$oldnode->append_conf('postgresql.conf', 'max_prepared_transactions = 2');
+	$oldnode->append_conf('postgresql.conf', 'fsync = off');
+	$oldnode->start();
+
+	mxid_prepare($oldnode);
+	mxid_fill($oldnode);
+
+	$oldnode->safe_psql('postgres', q(CHECKPOINT));
+	$oldnode->stop();
+
+	my $newnode = PostgreSQL::Test::Cluster->new('new_node1');
+	$newnode->init();
+
+	command_ok(
+		[
+			'pg_upgrade', '--no-sync',
+			'-d', $oldnode->data_dir,
+			'-D', $newnode->data_dir,
+			'-b', $oldnode->config_data('--bindir'),
+			'-B', $newnode->config_data('--bindir'),
+			'-s', $newnode->host,
+			'-p', $oldnode->port,
+			'-P', $newnode->port,
+			'--copy'
+		],
+		'run of pg_upgrade');
+
+	my @o = list_actual_multixact_offsets($oldnode);
+	my @n = list_actual_multixact_offsets($newnode);
+	ok(validate_multixact_offsets(\@o, \@n, $oldnode),
+		"case1: offsets segments matched");
+
+	$oldnode->start();
+	$newnode->start();
+
+	# just in case...
+	my $oldval = $oldnode->safe_psql('postgres', q(SELECT * FROM FOO));
+	my $newval = $newnode->safe_psql('postgres', q(SELECT * FROM FOO));
+	is($oldval, $newval, "case1: select eq");
+
+	$oldnode->stop();
+	$newnode->stop();
+}
+
+# =============================================================================
+# CASE 2
+#
+# Non-standard oldestMultiXid and NextMultiXactId.
+# There must be several segments starting from some value.
+# =============================================================================
+SKIP:
+{
+	skip "case 2", 2
+		unless ( grep( /^2$/, @tests ) );
+
+	my $oldnode = PostgreSQL::Test::Cluster->new('old_node2',
+											  install_path => $ENV{oldinstall});
+	$oldnode->init(force_initdb => 1,
+				extra => [
+					'-m', '0x123000', '-o', '0x123000'
+				]);
+
+	# Fixup MOX patch quirk
+	unlink $oldnode->data_dir . '/pg_multixact/members/0000';
+	unlink $oldnode->data_dir . '/pg_multixact/offsets/0000';
+
+	$oldnode->append_conf('postgresql.conf', 'max_prepared_transactions = 2');
+	$oldnode->append_conf('postgresql.conf', 'fsync = off');
+	$oldnode->start();
+
+	mxid_prepare($oldnode);
+	mxid_fill($oldnode);
+
+	$oldnode->safe_psql('postgres', q(CHECKPOINT));
+	$oldnode->stop();
+
+	my $newnode = PostgreSQL::Test::Cluster->new('new_node2');
+	$newnode->init();
+
+	command_ok(
+		[
+			'pg_upgrade', '--no-sync',
+			'-d', $oldnode->data_dir,
+			'-D', $newnode->data_dir,
+			'-b', $oldnode->config_data('--bindir'),
+			'-B', $newnode->config_data('--bindir'),
+			'-s', $newnode->host,
+			'-p', $oldnode->port,
+			'-P', $newnode->port,
+			'--copy'
+		],
+		'run of pg_upgrade');
+
+	my @o = list_actual_multixact_offsets($oldnode);
+	my @n = list_actual_multixact_offsets($newnode);
+	ok(validate_multixact_offsets(\@o, \@n, $oldnode),
+		"case2: non-standard offsets segments matched");
+
+	$oldnode->start();
+	$newnode->start();
+
+	# just in case...
+	my $oldval = $oldnode->safe_psql('postgres', q(SELECT * FROM FOO));
+	my $newval = $newnode->safe_psql('postgres', q(SELECT * FROM FOO));
+	is($oldval, $newval, "case2: select eq");
+
+	$oldnode->stop();
+	$newnode->stop();
+}
+
+# =============================================================================
+# CASE 3
+#
+# Non-standard oldestMultiXid and NextMultiXactId.
+# =============================================================================
+SKIP:
+{
+	skip "case 3", 3
+		unless ( grep( /^3$/, @tests ) );
+	chdir ${PostgreSQL::Test::Utils::tmp_check};
+	my $oldnode = PostgreSQL::Test::Cluster->new('old_node3',
+											  install_path => $ENV{oldinstall});
+	$oldnode->init(force_initdb => 1,
+				extra => [
+					'-m', '0xFFFF0000', '-o', '0xFFFF0000'
+				]);
+
+	# Fixup MOX patch quirk
+	unlink $oldnode->data_dir . '/pg_multixact/members/0000';
+	unlink $oldnode->data_dir . '/pg_multixact/offsets/0000';
+
+	$oldnode->append_conf('postgresql.conf', 'max_prepared_transactions = 2');
+	$oldnode->append_conf('postgresql.conf', 'fsync = off');
+	$oldnode->start();
+
+	mxid_prepare($oldnode);
+	mxid_fill($oldnode);
+	mxid_fill($oldnode);
+
+	$oldnode->safe_psql('postgres', q(CHECKPOINT));
+	$oldnode->stop();
+
+	my $newnode = PostgreSQL::Test::Cluster->new('new_node3');
+	$newnode->init();
+
+	command_ok(
+		[
+			'pg_upgrade', '--no-sync',
+			'-d', $oldnode->data_dir,
+			'-D', $newnode->data_dir,
+			'-b', $oldnode->config_data('--bindir'),
+			'-B', $newnode->config_data('--bindir'),
+			'-s', $newnode->host,
+			'-p', $oldnode->port,
+			'-P', $newnode->port,
+			'--copy'
+		],
+		'run of pg_upgrade');
+
+	my @o = list_actual_multixact_offsets($oldnode);
+	my @n = list_actual_multixact_offsets($newnode);
+	ok(validate_multixact_offsets(\@o, \@n, $oldnode),
+		"case3: multi wrap, non-standard offsets segments matched");
+
+	$oldnode->start();
+	$newnode->start();
+
+	# just in case...
+	my $oldval = $oldnode->safe_psql('postgres', q(SELECT * FROM FOO));
+	my $newval = $newnode->safe_psql('postgres', q(SELECT * FROM FOO));
+	is($oldval, $newval, "case3: select eq");
+
+	$oldnode->stop();
+	$newnode->stop();
+}
+
+done_testing();
-- 
2.43.0
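
The check in validate_multixact_offsets() relies on each old offsets segment turning into two consecutive new ones: with the same block size and pages per segment, doubling the offset width from 4 to 8 bytes halves the number of entries per segment. A minimal sketch of that name mapping follows, assuming segment file names are plain hexadecimal numbers as in the test; parse_segment_name and first_converted_segno are hypothetical helpers, not functions from the patch.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Parse an SLRU segment file name such as "00FF" as a hexadecimal
 * number.  Returns -1 if the name is not a valid hexadecimal string.
 */
static long
parse_segment_name(const char *name)
{
	char	   *end;
	long		segno;

	assert(name != NULL);

	segno = strtol(name, &end, 16);
	if (end == name || *end != '\0' || segno < 0)
		return -1;

	return segno;
}

/*
 * An old segment N is expected to be converted into new segments 2N and
 * 2N + 1; return the first of the two.
 */
static long
first_converted_segno(long old_segno)
{
	return old_segno * 2;
}
```

For example, old segment "00FF" (255) would be expected to show up as new segments 01FE and 01FF.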

Attachment: 0002-TEST-lower-SLRU_PAGES_PER_SEGMENT.patch (application/octet-stream)
From 57f96bdfe7b78794e7abe8802550e4a31e6c9370 Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Fri, 8 Nov 2024 20:56:27 +0300
Subject: [PATCH 2/2] TEST: lower SLRU_PAGES_PER_SEGMENT

---
 src/include/access/slru.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/include/access/slru.h b/src/include/access/slru.h
index 97e612cd10..74dd54819d 100644
--- a/src/include/access/slru.h
+++ b/src/include/access/slru.h
@@ -36,7 +36,7 @@
  * take no explicit notice of that fact in slru.c, except when comparing
  * segment and page numbers in SimpleLruTruncate (see PagePrecedes()).
  */
-#define SLRU_PAGES_PER_SEGMENT	32
+#define SLRU_PAGES_PER_SEGMENT	2
 
 /*
  * Page status codes.  Note that these do not include the "dirty" bit.
-- 
2.43.0
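
To see why lowering SLRU_PAGES_PER_SEGMENT helps the tests, here is a small sketch of the segment arithmetic used by MultiXactIdToOffsetSegment in 005_mxidoff.pl. The function multixact_offset_segment and its parameters are illustrative only; the 8192-byte block size and 8-byte offset size are the values hard-coded in the test script.

```c
#include <assert.h>
#include <stdint.h>

#define BLCKSZ 8192

/*
 * Hypothetical helper: number of the offsets segment that holds the
 * entry for a given multixact id, for a given offset width in bytes
 * and a given number of pages per SLRU segment.
 */
static uint64_t
multixact_offset_segment(uint32_t multi, unsigned offset_size,
						 unsigned pages_per_segment)
{
	uint64_t	entries_per_page;

	assert(offset_size > 0 && pages_per_segment > 0);

	entries_per_page = BLCKSZ / offset_size;

	/* Integer division, matching the C macros rather than Perl's "/". */
	return multi / entries_per_page / pages_per_segment;
}
```

With 8-byte offsets a segment covers 2048 multixacts at 2 pages per segment versus 32768 at 32 pages, so the test workload spans several segments much sooner. For the starting value 0x123000 used in case 2, this gives segment 0x246 with 2 pages per segment and 0x24 with 32.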

Attachment: 0001-Add-initdb-option-to-initialize-cluster-with-non-sta.patch (application/octet-stream)
From 34623803146a152796b611421dd9684e4fefa785 Mon Sep 17 00:00:00 2001
From: Maxim Orlov <m.orlov@postgrespro.ru>
Date: Wed, 4 May 2022 15:53:36 +0300
Subject: [PATCH 1/2] Add initdb option to initialize cluster with non-standard
 xid/mxid/mxoff.

Until now, testing database cluster wraparound has not been easy, because
initdb always initializes the cluster with the default xid/mxid/mxoff. An option
to specify any valid xid/mxid/mxoff at cluster initialization makes this easier.

Author: Maxim Orlov <orlovmg@gmail.com>
Author: Pavel Borisov <pashkin.elfe@gmail.com>
Author: Svetlana Derevyanko <s.derevyanko@postgrespro.ru>
Discussion: https://www.postgresql.org/message-id/flat/CACG%3Dezaa4vqYjJ16yoxgrpa-%3DgXnf0Vv3Ey9bjGrRRFN2YyWFQ%40mail.gmail.com
---
 src/backend/access/transam/clog.c         |  21 +++++
 src/backend/access/transam/multixact.c    |  53 +++++++++++
 src/backend/access/transam/subtrans.c     |   8 +-
 src/backend/access/transam/xlog.c         |  15 ++-
 src/backend/bootstrap/bootstrap.c         |  50 +++++++++-
 src/backend/main/main.c                   |   6 ++
 src/backend/postmaster/postmaster.c       |  14 ++-
 src/backend/tcop/postgres.c               |  53 ++++++++++-
 src/bin/initdb/initdb.c                   | 107 +++++++++++++++++++++-
 src/bin/initdb/t/001_initdb.pl            |  60 ++++++++++++
 src/bin/pg_amcheck/t/004_verify_heapam.pl |  35 +++----
 src/include/access/xlog.h                 |   3 +
 src/include/c.h                           |   4 +
 src/include/catalog/pg_class.h            |   2 +-
 src/test/perl/PostgreSQL/Test/Cluster.pm  |   4 +-
 src/test/regress/pg_regress.c             |   3 +-
 src/test/xid-64/t/001_test_large_xids.pl  |  54 +++++++++++
 17 files changed, 460 insertions(+), 32 deletions(-)
 create mode 100644 src/test/xid-64/t/001_test_large_xids.pl

diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index e6f79320e9..17e29f4497 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -834,6 +834,7 @@ BootStrapCLOG(void)
 {
 	int			slotno;
 	LWLock	   *lock = SimpleLruGetBankLock(XactCtl, 0);
+	int64		pageno;
 
 	LWLockAcquire(lock, LW_EXCLUSIVE);
 
@@ -844,6 +845,26 @@ BootStrapCLOG(void)
 	SimpleLruWritePage(XactCtl, slotno);
 	Assert(!XactCtl->shared->page_dirty[slotno]);
 
+	pageno = TransactionIdToPage(XidFromFullTransactionId(TransamVariables->nextXid));
+	if (pageno != 0)
+	{
+		LWLock *nextlock = SimpleLruGetBankLock(XactCtl, pageno);
+
+		if (nextlock != lock)
+		{
+			LWLockRelease(lock);
+			LWLockAcquire(nextlock, LW_EXCLUSIVE);
+			lock = nextlock;
+		}
+
+		/* Create and zero the commit log page containing nextXid */
+		slotno = ZeroCLOGPage(pageno, false);
+
+		/* Make sure it's written out */
+		SimpleLruWritePage(XactCtl, slotno);
+		Assert(!XactCtl->shared->page_dirty[slotno]);
+	}
+
 	LWLockRelease(lock);
 }
 
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 8c37d7eba7..017eff07bd 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -2035,6 +2035,7 @@ BootStrapMultiXact(void)
 {
 	int			slotno;
 	LWLock	   *lock;
+	int64		pageno;
 
 	lock = SimpleLruGetBankLock(MultiXactOffsetCtl, 0);
 	LWLockAcquire(lock, LW_EXCLUSIVE);
@@ -2046,6 +2047,26 @@ BootStrapMultiXact(void)
 	SimpleLruWritePage(MultiXactOffsetCtl, slotno);
 	Assert(!MultiXactOffsetCtl->shared->page_dirty[slotno]);
 
+	pageno = MultiXactIdToOffsetPage(MultiXactState->nextMXact);
+	if (pageno != 0)
+	{
+		LWLock *nextlock = SimpleLruGetBankLock(MultiXactOffsetCtl, pageno);
+
+		if (nextlock != lock)
+		{
+			LWLockRelease(lock);
+			LWLockAcquire(nextlock, LW_EXCLUSIVE);
+			lock = nextlock;
+		}
+
+		/* Create and zero the offsets page containing nextMXact */
+		slotno = ZeroMultiXactOffsetPage(pageno, false);
+
+		/* Make sure it's written out */
+		SimpleLruWritePage(MultiXactOffsetCtl, slotno);
+		Assert(!MultiXactOffsetCtl->shared->page_dirty[slotno]);
+	}
+
 	LWLockRelease(lock);
 
 	lock = SimpleLruGetBankLock(MultiXactMemberCtl, 0);
@@ -2058,7 +2079,39 @@ BootStrapMultiXact(void)
 	SimpleLruWritePage(MultiXactMemberCtl, slotno);
 	Assert(!MultiXactMemberCtl->shared->page_dirty[slotno]);
 
+	pageno = MXOffsetToMemberPage(MultiXactState->nextOffset);
+	if (pageno != 0)
+	{
+		LWLock *nextlock = SimpleLruGetBankLock(MultiXactMemberCtl, pageno);
+
+		if (nextlock != lock)
+		{
+			LWLockRelease(lock);
+			LWLockAcquire(nextlock, LW_EXCLUSIVE);
+			lock = nextlock;
+		}
+
+		/* Create and zero the members page containing nextOffset */
+		slotno = ZeroMultiXactMemberPage(pageno, false);
+
+		/* Make sure it's written out */
+		SimpleLruWritePage(MultiXactMemberCtl, slotno);
+		Assert(!MultiXactMemberCtl->shared->page_dirty[slotno]);
+	}
+
 	LWLockRelease(lock);
+
+	/*
+	 * If we're not starting from a zero offset, initialize dummy multixacts
+	 * to avoid an overly long loop in PerformMembersTruncation().
+	 */
+	if (MultiXactState->nextOffset > 0 && MultiXactState->nextMXact > 0)
+	{
+		RecordNewMultiXact(FirstMultiXactId,
+						   MultiXactState->nextOffset, 0, NULL);
+		RecordNewMultiXact(MultiXactState->nextMXact,
+						   MultiXactState->nextOffset, 0, NULL);
+	}
 }
 
 /*
diff --git a/src/backend/access/transam/subtrans.c b/src/backend/access/transam/subtrans.c
index 50bb1d8cfc..a5e6e8f090 100644
--- a/src/backend/access/transam/subtrans.c
+++ b/src/backend/access/transam/subtrans.c
@@ -270,12 +270,16 @@ void
 BootStrapSUBTRANS(void)
 {
 	int			slotno;
-	LWLock	   *lock = SimpleLruGetBankLock(SubTransCtl, 0);
+	LWLock	   *lock;
+	int64		pageno;
+
+	pageno = TransactionIdToPage(XidFromFullTransactionId(TransamVariables->nextXid));
+	lock = SimpleLruGetBankLock(SubTransCtl, pageno);
 
 	LWLockAcquire(lock, LW_EXCLUSIVE);
 
 	/* Create and zero the first page of the subtrans log */
-	slotno = ZeroSUBTRANSPage(0);
+	slotno = ZeroSUBTRANSPage(pageno);
 
 	/* Make sure it's written out */
 	SimpleLruWritePage(SubTransCtl, slotno);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6f58412bca..c61d7d967c 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -136,6 +136,10 @@ int			max_slot_wal_keep_size_mb = -1;
 int			wal_decode_buffer_size = 512 * 1024;
 bool		track_wal_io_timing = false;
 
+TransactionId		start_xid = FirstNormalTransactionId;
+MultiXactId			start_mxid = FirstMultiXactId;
+MultiXactOffset		start_mxoff = 0;
+
 #ifdef WAL_DEBUG
 bool		XLOG_DEBUG = false;
 #endif
@@ -5080,13 +5084,14 @@ BootStrapXLOG(uint32 data_checksum_version)
 	checkPoint.fullPageWrites = fullPageWrites;
 	checkPoint.wal_level = wal_level;
 	checkPoint.nextXid =
-		FullTransactionIdFromEpochAndXid(0, FirstNormalTransactionId);
+		FullTransactionIdFromEpochAndXid(0, Max(FirstNormalTransactionId,
+												start_xid));
 	checkPoint.nextOid = FirstGenbkiObjectId;
-	checkPoint.nextMulti = FirstMultiXactId;
-	checkPoint.nextMultiOffset = 0;
-	checkPoint.oldestXid = FirstNormalTransactionId;
+	checkPoint.nextMulti = Max(FirstMultiXactId, start_mxid);
+	checkPoint.nextMultiOffset = start_mxoff;
+	checkPoint.oldestXid = XidFromFullTransactionId(checkPoint.nextXid);
 	checkPoint.oldestXidDB = Template1DbOid;
-	checkPoint.oldestMulti = FirstMultiXactId;
+	checkPoint.oldestMulti = checkPoint.nextMulti;
 	checkPoint.oldestMultiDB = Template1DbOid;
 	checkPoint.oldestCommitTsXid = InvalidTransactionId;
 	checkPoint.newestCommitTsXid = InvalidTransactionId;
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index ed59dfce89..38165eb796 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -216,7 +216,7 @@ BootstrapModeMain(int argc, char *argv[], bool check_only)
 	argv++;
 	argc--;
 
-	while ((flag = getopt(argc, argv, "B:c:d:D:Fkr:X:-:")) != -1)
+	while ((flag = getopt(argc, argv, "B:c:d:D:Fkm:o:r:X:x:-:")) != -1)
 	{
 		switch (flag)
 		{
@@ -271,12 +271,60 @@ BootstrapModeMain(int argc, char *argv[], bool check_only)
 			case 'k':
 				bootstrap_data_checksum_version = PG_DATA_CHECKSUM_VERSION;
 				break;
+			case 'm':
+				{
+					char	   *endptr;
+
+					errno = 0;
+					start_mxid = strtoull(optarg, &endptr, 0);
+
+					if (endptr == optarg || *endptr != '\0' || errno != 0 ||
+						!StartMultiXactIdIsValid(start_mxid))
+					{
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("invalid initial database cluster multixact id")));
+					}
+				}
+				break;
+			case 'o':
+				{
+					char	   *endptr;
+
+					errno = 0;
+					start_mxoff = strtoull(optarg, &endptr, 0);
+
+					if (endptr == optarg || *endptr != '\0' || errno != 0 ||
+						!StartMultiXactOffsetIsValid(start_mxoff))
+					{
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("invalid initial database cluster multixact offset")));
+					}
+				}
+				break;
 			case 'r':
 				strlcpy(OutputFileName, optarg, MAXPGPATH);
 				break;
 			case 'X':
 				SetConfigOption("wal_segment_size", optarg, PGC_INTERNAL, PGC_S_DYNAMIC_DEFAULT);
 				break;
+			case 'x':
+				{
+					char	   *endptr;
+
+					errno = 0;
+					start_xid = strtoull(optarg, &endptr, 0);
+
+					if (endptr == optarg || *endptr != '\0' || errno != 0 ||
+						!StartTransactionIdIsValid(start_xid))
+					{
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("invalid initial database cluster xid value")));
+					}
+				}
+				break;
 			default:
 				write_stderr("Try \"%s --help\" for more information.\n",
 							 progname);
diff --git a/src/backend/main/main.c b/src/backend/main/main.c
index aea93a0229..6a3224bb82 100644
--- a/src/backend/main/main.c
+++ b/src/backend/main/main.c
@@ -358,12 +358,18 @@ help(const char *progname)
 	printf(_("  -E                 echo statement before execution\n"));
 	printf(_("  -j                 do not use newline as interactive query delimiter\n"));
 	printf(_("  -r FILENAME        send stdout and stderr to given file\n"));
+	printf(_("  -m START_MXID      set initial database cluster multixact id\n"));
+	printf(_("  -o START_MXOFF     set initial database cluster multixact offset\n"));
+	printf(_("  -x START_XID       set initial database cluster xid\n"));
 
 	printf(_("\nOptions for bootstrapping mode:\n"));
 	printf(_("  --boot             selects bootstrapping mode (must be first argument)\n"));
 	printf(_("  --check            selects check mode (must be first argument)\n"));
 	printf(_("  DBNAME             database name (mandatory argument in bootstrapping mode)\n"));
 	printf(_("  -r FILENAME        send stdout and stderr to given file\n"));
+	printf(_("  -m START_MXID      set initial database cluster multixact id\n"));
+	printf(_("  -o START_MXOFF     set initial database cluster multixact offset\n"));
+	printf(_("  -x START_XID       set initial database cluster xid\n"));
 
 	printf(_("\nPlease read the documentation for the complete list of run-time\n"
 			 "configuration settings and how to set them on the command line or in\n"
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 8bee1fb664..af4b004e04 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -562,7 +562,7 @@ PostmasterMain(int argc, char *argv[])
 	 * tcop/postgres.c (the option sets should not conflict) and with the
 	 * common help() function in main/main.c.
 	 */
-	while ((opt = getopt(argc, argv, "B:bC:c:D:d:EeFf:h:ijk:lN:OPp:r:S:sTt:W:-:")) != -1)
+	while ((opt = getopt(argc, argv, "B:bC:c:D:d:EeFf:h:ijk:lm:N:Oo:Pp:r:S:sTt:W:x:-:")) != -1)
 	{
 		switch (opt)
 		{
@@ -659,10 +659,18 @@ PostmasterMain(int argc, char *argv[])
 				SetConfigOption("max_connections", optarg, PGC_POSTMASTER, PGC_S_ARGV);
 				break;
 
+			case 'm':
+				/* only used by single-user backend */
+				break;
+
 			case 'O':
 				SetConfigOption("allow_system_table_mods", "true", PGC_POSTMASTER, PGC_S_ARGV);
 				break;
 
+			case 'o':
+				/* only used by single-user backend */
+				break;
+
 			case 'P':
 				SetConfigOption("ignore_system_indexes", "true", PGC_POSTMASTER, PGC_S_ARGV);
 				break;
@@ -713,6 +721,10 @@ PostmasterMain(int argc, char *argv[])
 				SetConfigOption("post_auth_delay", optarg, PGC_POSTMASTER, PGC_S_ARGV);
 				break;
 
+			case 'x':
+				/* only used by single-user backend */
+				break;
+
 			default:
 				write_stderr("Try \"%s --help\" for more information.\n",
 							 progname);
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index aac0b96bbc..4636d99b2f 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3918,7 +3918,7 @@ process_postgres_switches(int argc, char *argv[], GucContext ctx,
 	 * postmaster/postmaster.c (the option sets should not conflict) and with
 	 * the common help() function in main/main.c.
 	 */
-	while ((flag = getopt(argc, argv, "B:bC:c:D:d:EeFf:h:ijk:lN:nOPp:r:S:sTt:v:W:-:")) != -1)
+	while ((flag = getopt(argc, argv, "B:bC:c:D:d:EeFf:h:ijk:lm:N:nOo:Pp:r:S:sTt:v:W:x:-:")) != -1)
 	{
 		switch (flag)
 		{
@@ -4010,6 +4010,23 @@ process_postgres_switches(int argc, char *argv[], GucContext ctx,
 				SetConfigOption("ssl", "true", ctx, gucsource);
 				break;
 
+			case 'm':
+				{
+					char	   *endptr;
+
+					errno = 0;
+					start_mxid = strtoull(optarg, &endptr, 0);
+
+					if (endptr == optarg || *endptr != '\0' || errno != 0 ||
+						!StartMultiXactIdIsValid(start_mxid))
+					{
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("invalid initial database cluster multixact id")));
+					}
+				}
+				break;
+
 			case 'N':
 				SetConfigOption("max_connections", optarg, ctx, gucsource);
 				break;
@@ -4022,6 +4039,23 @@ process_postgres_switches(int argc, char *argv[], GucContext ctx,
 				SetConfigOption("allow_system_table_mods", "true", ctx, gucsource);
 				break;
 
+			case 'o':
+				{
+					char	   *endptr;
+
+					errno = 0;
+					start_mxoff = strtoull(optarg, &endptr, 0);
+
+					if (endptr == optarg || *endptr != '\0' || errno != 0 ||
+						!StartMultiXactOffsetIsValid(start_mxoff))
+					{
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("invalid initial database cluster multixact offset")));
+					}
+				}
+				break;
+
 			case 'P':
 				SetConfigOption("ignore_system_indexes", "true", ctx, gucsource);
 				break;
@@ -4076,6 +4110,23 @@ process_postgres_switches(int argc, char *argv[], GucContext ctx,
 				SetConfigOption("post_auth_delay", optarg, ctx, gucsource);
 				break;
 
+			case 'x':
+				{
+					char	   *endptr;
+
+					errno = 0;
+					start_xid = strtoull(optarg, &endptr, 0);
+
+					if (endptr == optarg || *endptr != '\0' || errno != 0 ||
+						!StartTransactionIdIsValid(start_xid))
+					{
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("invalid initial database cluster xid")));
+					}
+				}
+				break;
+
 			default:
 				errs++;
 				break;
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 9a91830783..1cc54392e5 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -168,6 +168,9 @@ static bool data_checksums = true;
 static char *xlog_dir = NULL;
 static int	wal_segment_size_mb = (DEFAULT_XLOG_SEG_SIZE) / (1024 * 1024);
 static DataDirSyncMethod sync_method = DATA_DIR_SYNC_METHOD_FSYNC;
+static TransactionId start_xid = 0;
+static MultiXactId start_mxid = 0;
+static MultiXactOffset start_mxoff = 0;
 
 
 /* internal vars */
@@ -1568,6 +1571,11 @@ bootstrap_template1(void)
 	bki_lines = replace_token(bki_lines, "POSTGRES",
 							  escape_quotes_bki(username));
 
+	/* relfrozenxid must not be less than FirstNormalTransactionId */
+	sprintf(buf, "%llu", (unsigned long long) Max(start_xid, 3));
+	bki_lines = replace_token(bki_lines, "RECENTXMIN",
+							  buf);
+
 	bki_lines = replace_token(bki_lines, "ENCODING",
 							  encodingid_to_string(encodingid));
 
@@ -1593,6 +1601,9 @@ bootstrap_template1(void)
 
 	printfPQExpBuffer(&cmd, "\"%s\" --boot %s %s", backend_exec, boot_options, extra_options);
 	appendPQExpBuffer(&cmd, " -X %d", wal_segment_size_mb * (1024 * 1024));
+	appendPQExpBuffer(&cmd, " -m %llu", (unsigned long long) start_mxid);
+	appendPQExpBuffer(&cmd, " -o %llu", (unsigned long long) start_mxoff);
+	appendPQExpBuffer(&cmd, " -x %llu", (unsigned long long) start_xid);
 	if (data_checksums)
 		appendPQExpBuffer(&cmd, " -k");
 	if (debug)
@@ -2532,12 +2543,20 @@ usage(const char *progname)
 	printf(_("  -d, --debug               generate lots of debugging output\n"));
 	printf(_("      --discard-caches      set debug_discard_caches=1\n"));
 	printf(_("  -L DIRECTORY              where to find the input files\n"));
+	printf(_("  -m, --multixact-id=START_MXID\n"
+			 "                            set initial database cluster multixact id\n"
+			 "                            max value is 2^62-1\n"));
 	printf(_("  -n, --no-clean            do not clean up after errors\n"));
 	printf(_("  -N, --no-sync             do not wait for changes to be written safely to disk\n"));
 	printf(_("      --no-instructions     do not print instructions for next steps\n"));
+	printf(_("  -o, --multixact-offset=START_MXOFF\n"
+			 "                            set initial database cluster multixact offset\n"
+			 "                            max value is 2^62-1\n"));
 	printf(_("  -s, --show                show internal settings, then exit\n"));
 	printf(_("      --sync-method=METHOD  set method for syncing files to disk\n"));
 	printf(_("  -S, --sync-only           only sync database files to disk, then exit\n"));
+	printf(_("  -x, --xid=START_XID       set initial database cluster xid\n"
+			 "                            max value is 2^62-1\n"));
 	printf(_("\nOther options:\n"));
 	printf(_("  -V, --version             output version information, then exit\n"));
 	printf(_("  -?, --help                show this help, then exit\n"));
@@ -3079,6 +3098,18 @@ initialize_data_directory(void)
 	/* Now create all the text config files */
 	setup_config();
 
+	if (start_mxid != 0)
+		printf(_("selecting initial multixact id ... %llu\n"),
+				 (unsigned long long) start_mxid);
+
+	if (start_mxoff != 0)
+		printf(_("selecting initial multixact offset ... %llu\n"),
+				 (unsigned long long) start_mxoff);
+
+	if (start_xid != 0)
+		printf(_("selecting initial xid ... %llu\n"),
+				 (unsigned long long) start_xid);
+
 	/* Bootstrap template1 */
 	bootstrap_template1();
 
@@ -3095,8 +3126,12 @@ initialize_data_directory(void)
 	fflush(stdout);
 
 	initPQExpBuffer(&cmd);
-	printfPQExpBuffer(&cmd, "\"%s\" %s %s template1 >%s",
-					  backend_exec, backend_options, extra_options, DEVNULL);
+	printfPQExpBuffer(&cmd, "\"%s\" %s %s",
+					  backend_exec, backend_options, extra_options);
+	appendPQExpBuffer(&cmd, " -m %llu", (unsigned long long) start_mxid);
+	appendPQExpBuffer(&cmd, " -o %llu", (unsigned long long) start_mxoff);
+	appendPQExpBuffer(&cmd, " -x %llu", (unsigned long long) start_xid);
+	appendPQExpBuffer(&cmd, " template1 >%s", DEVNULL);
 
 	PG_CMD_OPEN(cmd.data);
 
@@ -3183,6 +3218,9 @@ main(int argc, char *argv[])
 		{"icu-rules", required_argument, NULL, 18},
 		{"sync-method", required_argument, NULL, 19},
 		{"no-data-checksums", no_argument, NULL, 20},
+		{"xid", required_argument, NULL, 'x'},
+		{"multixact-id", required_argument, NULL, 'm'},
+		{"multixact-offset", required_argument, NULL, 'o'},
 		{NULL, 0, NULL, 0}
 	};
 
@@ -3224,7 +3262,7 @@ main(int argc, char *argv[])
 
 	/* process command-line options */
 
-	while ((c = getopt_long(argc, argv, "A:c:dD:E:gkL:nNsST:U:WX:",
+	while ((c = getopt_long(argc, argv, "A:c:dD:E:gkL:m:nNo:sST:U:Wx:X:",
 							long_options, &option_index)) != -1)
 	{
 		switch (c)
@@ -3282,6 +3320,30 @@ main(int argc, char *argv[])
 				debug = true;
 				printf(_("Running in debug mode.\n"));
 				break;
+			case 'm':
+				{
+					char	   *endptr;
+
+					errno = 0;
+					start_mxid = strtoull(optarg, &endptr, 0);
+
+					if (endptr == optarg || *endptr != '\0' || errno != 0 ||
+						!StartMultiXactIdIsValid(start_mxid))
+					{
+						pg_log_error("invalid initial database cluster multixact id");
+						exit(1);
+					}
+					else if (start_mxid < 1) /* FirstMultiXactId */
+					{
+						/*
+						 * Reject the value rather than silently raising it
+						 * to FirstMultiXactId, although that would be harmless.
+						 */
+						pg_log_error("multixact id should be greater than 0");
+						exit(1);
+					}
+				}
+				break;
 			case 'n':
 				noclean = true;
 				printf(_("Running in no-clean mode.  Mistakes will not be cleaned up.\n"));
@@ -3289,6 +3351,21 @@ main(int argc, char *argv[])
 			case 'N':
 				do_sync = false;
 				break;
+			case 'o':
+				{
+					char	   *endptr;
+
+					errno = 0;
+					start_mxoff = strtoull(optarg, &endptr, 0);
+
+					if (endptr == optarg || *endptr != '\0' || errno != 0 ||
+						!StartMultiXactOffsetIsValid(start_mxoff))
+					{
+						pg_log_error("invalid initial database cluster multixact offset");
+						exit(1);
+					}
+				}
+				break;
 			case 'S':
 				sync_only = true;
 				break;
@@ -3377,6 +3454,30 @@ main(int argc, char *argv[])
 			case 20:
 				data_checksums = false;
 				break;
+			case 'x':
+				{
+					char	   *endptr;
+
+					errno = 0;
+					start_xid = strtoull(optarg, &endptr, 0);
+
+					if (endptr == optarg || *endptr != '\0' || errno != 0 ||
+						!StartTransactionIdIsValid(start_xid))
+					{
+						pg_log_error("invalid value for initial database cluster xid");
+						exit(1);
+					}
+					else if (start_xid < 3) /* FirstNormalTransactionId */
+					{
+						/*
+						 * Reject the value instead of silently raising xid to
+						 * FirstNormalTransactionId, although that would do no harm.
+						 */
+						pg_log_error("xid should be greater than 2");
+						exit(1);
+					}
+				}
+				break;
 			default:
 				/* getopt_long already emitted a complaint */
 				pg_log_error_hint("Try \"%s --help\" for more information.", progname);
diff --git a/src/bin/initdb/t/001_initdb.pl b/src/bin/initdb/t/001_initdb.pl
index 7520d3d0dd..91a85d9f4d 100644
--- a/src/bin/initdb/t/001_initdb.pl
+++ b/src/bin/initdb/t/001_initdb.pl
@@ -282,4 +282,64 @@ command_fails(
 	[ 'pg_checksums', '-D', $datadir_nochecksums ],
 	"pg_checksums fails with data checksum disabled");
 
+# Set non-standard initial mxid/mxoff/xid.
+command_fails_like(
+	[ 'initdb', '-m', 'seven', $datadir ],
+	qr/initdb: error: invalid initial database cluster multixact id/,
+	'fails for invalid initial database cluster multixact id');
+command_fails_like(
+	[ 'initdb', '-o', 'seven', $datadir ],
+	qr/initdb: error: invalid initial database cluster multixact offset/,
+	'fails for invalid initial database cluster multixact offset');
+command_fails_like(
+	[ 'initdb', '-x', 'seven', $datadir ],
+	qr/initdb: error: invalid value for initial database cluster xid/,
+	'fails for invalid initial database cluster xid');
+
+command_checks_all(
+	[ 'initdb', '-m', '65535', "$tempdir/data-m65535" ],
+	0,
+	[qr/selecting initial multixact id ... 65535/],
+	[],
+	'selecting initial multixact id');
+command_checks_all(
+	[ 'initdb', '-o', '65535', "$tempdir/data-o65535" ],
+	0,
+	[qr/selecting initial multixact offset ... 65535/],
+	[],
+	'selecting initial multixact offset');
+command_checks_all(
+	[ 'initdb', '-x', '65535', "$tempdir/data-x65535" ],
+	0,
+	[qr/selecting initial xid ... 65535/],
+	[],
+	'selecting initial xid');
+
+# Setup new cluster with given mxid/mxoff/xid.
+my $node;
+my $result;
+
+$node = PostgreSQL::Test::Cluster->new('test-mxid');
+$node->init(extra => ['-m', '16777215']); # 0xFFFFFF
+$node->start;
+$result = $node->safe_psql('postgres', "SELECT next_multixact_id FROM pg_control_checkpoint();");
+ok($result >= 16777215, 'setup cluster with given mxid');
+$node->stop;
+
+$node = PostgreSQL::Test::Cluster->new('test-mxoff');
+$node->init(extra => ['-o', '16777215']); # 0xFFFFFF
+$node->start;
+$result = $node->safe_psql('postgres', "SELECT next_multi_offset FROM pg_control_checkpoint();");
+ok($result >= 16777215, 'setup cluster with given mxoff');
+$node->stop;
+
+$node = PostgreSQL::Test::Cluster->new('test-xid');
+$node->init(extra => ['-x', '16777215']); # 0xFFFFFF
+$node->start;
+$result = $node->safe_psql('postgres', "SELECT txid_current();");
+ok($result >= 16777215, 'setup cluster with given xid - check 1');
+$result = $node->safe_psql('postgres', "SELECT oldest_xid FROM pg_control_checkpoint();");
+ok($result >= 16777215, 'setup cluster with given xid - check 2');
+$node->stop;
+
 done_testing();
diff --git a/src/bin/pg_amcheck/t/004_verify_heapam.pl b/src/bin/pg_amcheck/t/004_verify_heapam.pl
index 95fe6e6d3b..93eefd0479 100644
--- a/src/bin/pg_amcheck/t/004_verify_heapam.pl
+++ b/src/bin/pg_amcheck/t/004_verify_heapam.pl
@@ -320,6 +320,8 @@ my $relfrozenxid = $node->safe_psql('postgres',
 	q(select relfrozenxid from pg_class where relname = 'test'));
 my $datfrozenxid = $node->safe_psql('postgres',
 	q(select datfrozenxid from pg_database where datname = 'postgres'));
+my $datminmxid = $node->safe_psql('postgres',
+	q(select datminmxid from pg_database where datname = 'postgres'));
 
 # Sanity check that our 'test' table has a relfrozenxid newer than the
 # datfrozenxid for the database, and that the datfrozenxid is greater than the
@@ -454,40 +456,39 @@ for (my $tupidx = 0; $tupidx < $ROWCOUNT; $tupidx++)
 
 		# Expected corruption report
 		push @expected,
-		  qr/${header}xmin $xmin precedes relation freeze threshold 0:\d+/;
+		  qr/${header}xmin $xmin precedes relation freeze threshold \d+/;
 	}
 	elsif ($offnum == 2)
 	{
 		# Corruptly set xmin < datfrozenxid
-		my $xmin = 3;
+		my $xmin = $datfrozenxid - 12;
 		$tup->{t_xmin} = $xmin;
 		$tup->{t_infomask} &= ~HEAP_XMIN_COMMITTED;
 		$tup->{t_infomask} &= ~HEAP_XMIN_INVALID;
 
 		push @expected,
-		  qr/${$header}xmin $xmin precedes oldest valid transaction ID 0:\d+/;
+		  qr/${$header}xmin $xmin precedes oldest valid transaction ID \d+/;
 	}
 	elsif ($offnum == 3)
 	{
-		# Corruptly set xmin < datfrozenxid, further back, noting circularity
-		# of xid comparison.
-		my $xmin = 4026531839;
+		# Corruptly set xmin > next transaction id.
+		my $xmin = $relfrozenxid + 1000000;
 		$tup->{t_xmin} = $xmin;
 		$tup->{t_infomask} &= ~HEAP_XMIN_COMMITTED;
 		$tup->{t_infomask} &= ~HEAP_XMIN_INVALID;
 
 		push @expected,
-		  qr/${$header}xmin ${xmin} precedes oldest valid transaction ID 0:\d+/;
+		  qr/${$header}xmin $xmin equals or exceeds next valid transaction ID \d+/;
 	}
 	elsif ($offnum == 4)
 	{
-		# Corruptly set xmax < relminmxid;
-		my $xmax = 4026531839;
+		# Corruptly set xmax > next transaction id.
+		my $xmax = $relfrozenxid + 1000000;
 		$tup->{t_xmax} = $xmax;
 		$tup->{t_infomask} &= ~HEAP_XMAX_INVALID;
 
 		push @expected,
-		  qr/${$header}xmax ${xmax} precedes oldest valid transaction ID 0:\d+/;
+		  qr/${$header}xmax $xmax equals or exceeds next valid transaction ID \d+/;
 	}
 	elsif ($offnum == 5)
 	{
@@ -590,31 +591,33 @@ for (my $tupidx = 0; $tupidx < $ROWCOUNT; $tupidx++)
 		# Set both HEAP_XMAX_COMMITTED and HEAP_XMAX_IS_MULTI
 		$tup->{t_infomask} |= HEAP_XMAX_COMMITTED;
 		$tup->{t_infomask} |= HEAP_XMAX_IS_MULTI;
-		$tup->{t_xmax} = 4;
+		my $xmax = $datminmxid + 1000000;
+		$tup->{t_xmax} = $xmax;
 
 		push @expected,
-		  qr/${header}multitransaction ID 4 equals or exceeds next valid multitransaction ID 1/;
+		  qr/${header}multitransaction ID $xmax equals or exceeds next valid multitransaction ID \d+/;
 	}
 	elsif ($offnum == 15)
 	{
 		# Set both HEAP_XMAX_COMMITTED and HEAP_XMAX_IS_MULTI
 		$tup->{t_infomask} |= HEAP_XMAX_COMMITTED;
 		$tup->{t_infomask} |= HEAP_XMAX_IS_MULTI;
-		$tup->{t_xmax} = 4000000000;
+		my $xmax = $datminmxid - 10;
+		$tup->{t_xmax} = $xmax;
 
 		push @expected,
-		  qr/${header}multitransaction ID 4000000000 precedes relation minimum multitransaction ID threshold 1/;
+		  qr/${header}multitransaction ID $xmax precedes relation minimum multitransaction ID threshold \d+/;
 	}
 	elsif ($offnum == 16)    # Last offnum must equal ROWCOUNT
 	{
 		# Corruptly set xmin > next_xid to be in the future.
-		my $xmin = 123456;
+		my $xmin = $relfrozenxid + 1000000;
 		$tup->{t_xmin} = $xmin;
 		$tup->{t_infomask} &= ~HEAP_XMIN_COMMITTED;
 		$tup->{t_infomask} &= ~HEAP_XMIN_INVALID;
 
 		push @expected,
-		  qr/${$header}xmin ${xmin} equals or exceeds next valid transaction ID 0:\d+/;
+		  qr/${$header}xmin ${xmin} equals or exceeds next valid transaction ID \d+/;
 	}
 	elsif ($offnum == 17)
 	{
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 34ad46c067..4ce79b12e3 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -94,6 +94,9 @@ typedef enum RecoveryState
 } RecoveryState;
 
 extern PGDLLIMPORT int wal_level;
+extern PGDLLIMPORT TransactionId start_xid;
+extern PGDLLIMPORT MultiXactId start_mxid;
+extern PGDLLIMPORT MultiXactOffset start_mxoff;
 
 /* Is WAL archiving enabled (always or only while server is running normally)? */
 #define XLogArchivingActive() \
diff --git a/src/include/c.h b/src/include/c.h
index 0a548d69d7..218afeeb3b 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -668,6 +668,10 @@ typedef uint32 MultiXactOffset;
 
 typedef uint32 CommandId;
 
+#define StartTransactionIdIsValid(xid)			((xid) <= 0xFFFFFFFF)
+#define StartMultiXactIdIsValid(mxid)			((mxid) <= 0xFFFFFFFF)
+#define StartMultiXactOffsetIsValid(offset)		((offset) <= 0xFFFFFFFF)
+
 #define FirstCommandId	((CommandId) 0)
 #define InvalidCommandId	(~(CommandId)0)
 
diff --git a/src/include/catalog/pg_class.h b/src/include/catalog/pg_class.h
index 0fc2c093b0..0a7518df0d 100644
--- a/src/include/catalog/pg_class.h
+++ b/src/include/catalog/pg_class.h
@@ -123,7 +123,7 @@ CATALOG(pg_class,1259,RelationRelationId) BKI_BOOTSTRAP BKI_ROWTYPE_OID(83,Relat
 	Oid			relrewrite BKI_DEFAULT(0) BKI_LOOKUP_OPT(pg_class);
 
 	/* all Xids < this are frozen in this rel */
-	TransactionId relfrozenxid BKI_DEFAULT(3);	/* FirstNormalTransactionId */
+	TransactionId relfrozenxid BKI_DEFAULT(RECENTXMIN);	/* FirstNormalTransactionId */
 
 	/* all multixacts in this rel are >= this; it is really a MultiXactId */
 	TransactionId relminmxid BKI_DEFAULT(1);	/* FirstMultiXactId */
diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index e5526c7565..79df6faeb9 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -643,7 +643,9 @@ sub init
 	{
 		note("initializing database system by running initdb");
 		PostgreSQL::Test::Utils::system_or_bail('initdb', '-D', $pgdata, '-A',
-			'trust', '-N', @{ $params{extra} });
+			'trust', '-N',
+			'-x', '124983', '-m', '242236', '-o', '359488',
+			@{ $params{extra} });
 	}
 	else
 	{
diff --git a/src/test/regress/pg_regress.c b/src/test/regress/pg_regress.c
index 0e40ed32a2..3511c4b500 100644
--- a/src/test/regress/pg_regress.c
+++ b/src/test/regress/pg_regress.c
@@ -2333,7 +2333,8 @@ regression_main(int argc, char *argv[],
 			note("initializing database system by running initdb");
 
 			appendStringInfo(&cmd,
-							 "\"%s%sinitdb\" -D \"%s/data\" --no-clean --no-sync",
+							 "\"%s%sinitdb\" -D \"%s/data\" --no-clean --no-sync"
+							 " -x 124983 -m 242236 -o 359488",
 							 bindir ? bindir : "",
 							 bindir ? "/" : "",
 							 temp_instance);
diff --git a/src/test/xid-64/t/001_test_large_xids.pl b/src/test/xid-64/t/001_test_large_xids.pl
new file mode 100644
index 0000000000..4c7dbc6cb1
--- /dev/null
+++ b/src/test/xid-64/t/001_test_large_xids.pl
@@ -0,0 +1,54 @@
+# Tests for large xid values
+use strict;
+use warnings;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+use bigint;
+
+sub command_output
+{
+	my ($cmd) = @_;
+	my ($stdout, $stderr);
+	print("# Running: " . join(" ", @{$cmd}) . "\n");
+	my $result = IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+	ok($result, "@$cmd exit code 0");
+	is($stderr, '', "@$cmd no stderr");
+	return $stdout;
+}
+
+my $START_VAL = 2**32;
+my $MAX_VAL = 2**62;
+
+my $ixid = $START_VAL + int(rand($MAX_VAL - $START_VAL));
+my $imxid = $START_VAL + int(rand($MAX_VAL - $START_VAL));
+my $imoff = $START_VAL + int(rand($MAX_VAL - $START_VAL));
+
+# Initialize master node with the random xid-related parameters
+my $node = PostgreSQL::Test::Cluster->new('master');
+$node->init(extra => [ "--xid=$ixid", "--multixact-id=$imxid", "--multixact-offset=$imoff" ]);
+$node->start;
+
+# Check the xid-related parameters reported by pg_controldata
+my $pgcd_output = command_output(
+	[ 'pg_controldata', '-D', $node->data_dir ] );
+print($pgcd_output); print("\n");
+ok($pgcd_output =~ qr/Latest checkpoint's NextXID:\s*(\d+)/, "XID found");
+my ($nextxid) = ($1);
+ok($nextxid >= $ixid && $nextxid < $ixid + 1000,
+	"Latest checkpoint's NextXID ($nextxid) is close to the initial xid ($ixid).");
+ok($pgcd_output =~ qr/Latest checkpoint's NextMultiXactId:\s*(\d+)/, "MultiXactId found");
+my ($nextmxid) = ($1);
+ok($nextmxid >= $imxid && $nextmxid < $imxid + 1000,
+	"Latest checkpoint's NextMultiXactId ($nextmxid) is close to the initial multiXactId ($imxid).");
+ok($pgcd_output =~ qr/Latest checkpoint's NextMultiOffset:\s*(\d+)/, "MultiOffset found");
+my ($nextmoff) = ($1);
+ok($nextmoff >= $imoff && $nextmoff < $imoff + 1000,
+	"Latest checkpoint's NextMultiOffset ($nextmoff) is close to the initial multiOffset ($imoff).");
+
+# Run pgbench to check whether the database is working properly
+$node->command_ok(
+	[ qw(pgbench --initialize --no-vacuum --scale=10) ],
+	  'pgbench finished without errors');
+
+done_testing();
\ No newline at end of file
-- 
2.43.0

#19Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Maxim Orlov (#18)
1 attachment(s)
Re: POC: make mxidoff 64 bits

On 08/11/2024 20:10, Maxim Orlov wrote:

Sorry for the late reply. There was a problem in the upgrade with offset
wraparound. Here is a fixed version, with a test added. I decided to use my
old patch to set non-standard multixacts for the old cluster, fill it
with data, and run pg_upgrade.

The wraparound logic is still not correct. To test, I created a cluster
where multixids have wrapped around, so that:

$ ls -l data-old/pg_multixact/offsets/
total 720
-rw------- 1 heikki heikki 212992 Nov 12 01:11 0000
-rw-r--r-- 1 heikki heikki 262144 Nov 12 00:55 FFFE
-rw------- 1 heikki heikki 262144 Nov 12 00:56 FFFF

After running pg_upgrade:

$ ls -l data-new/pg_multixact/offsets/
total 1184
-rw------- 1 heikki heikki 155648 Nov 12 01:12 0001
-rw------- 1 heikki heikki 262144 Nov 12 01:11 1FFFD
-rw------- 1 heikki heikki 262144 Nov 12 01:11 1FFFE
-rw------- 1 heikki heikki 262144 Nov 12 01:11 1FFFF
-rw------- 1 heikki heikki 262144 Nov 12 01:11 20000
-rw------- 1 heikki heikki 155648 Nov 12 01:11 20001

That's not right. The segments 20000 and 20001 were created by the new
pg_upgrade conversion code from old segment '0000'. But multixids are
still 32-bit values, so after segment 1FFFF, you should still wrap
around to 0000. The new segments should be '0000' and '0001'. The
segment '0001' is created when postgres is started after upgrade, but
it's created from scratch and doesn't contain the upgraded values.
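
The expected numbering can be sketched as follows. This is a minimal illustration, not the pg_upgrade code from the patch: it assumes the default 8 kB block size and 32 pages per SLRU segment, so that widening an offsets entry from 4 to 8 bytes splits each old segment into two new ones. All function names here are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed defaults: BLCKSZ = 8192 and SLRU_PAGES_PER_SEGMENT = 32. */
#define BLOCK_SIZE			8192u
#define PAGES_PER_SEGMENT	32u

/* Number of multixact IDs covered by one offsets segment. */
uint32_t
mxids_per_segment(uint32_t entry_size)
{
	assert(entry_size == 4 || entry_size == 8);
	return (BLOCK_SIZE / entry_size) * PAGES_PER_SEGMENT;
}

/* Segments needed to cover all 2^32 multixact IDs with 8-byte entries. */
uint32_t
new_segment_count(void)
{
	return (uint32_t) ((UINT64_C(1) << 32) / mxids_per_segment(8));
}

/* First new segment that receives entries from the given old segment. */
uint32_t
first_new_segment(uint32_t old_segno)
{
	return old_segno * (mxids_per_segment(4) / mxids_per_segment(8));
}

/* Advance the output segment, wrapping because multixids are still 32-bit. */
uint32_t
next_new_segment(uint32_t segno)
{
	return (segno + 1) % new_segment_count();
}
```

Under these assumptions, old segment FFFE starts at new segment 1FFFC, and the segment that follows 1FFFF is 0000 rather than 20000.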

When I try to select from a table after upgrade that contains
post-wraparound multixids:

TRAP: failed Assert("offset != 0"), File:
"../src/backend/access/transam/multixact.c", Line: 1353, PID: 63386

On a different note, I'm surprised you're rewriting member segments from
scratch, parsing all the individual member groups and writing them out
again. There's no change to the members file format, except for the
numbering of the files, so you could just copy the files under the new
names without paying attention to the contents. It's not wrong to parse
them in detail, but I'd assume that it would be simpler not to.

Here is how to test. All the patches are for the 14e87ffa5c543b5f3 master
branch.
1) Get the 14e87ffa5c543b5f3 master branch and apply patches
0001-Add-initdb-option-to-initialize-cluster-with-non-sta.patch and
0002-TEST-lower-SLRU_PAGES_PER_SEGMENT.patch.
2) Get the 14e87ffa5c543b5f3 master branch in a separate directory and
apply the v6 patch set.
3) Build both branches.
4) Use the oldinstall environment variable to run the test:
PROVE_TESTS=t/005_mxidoff.pl oldinstall=/home/orlov/proj/pgsql-new
PG_TEST_NOCLEAN=1 make check -C src/bin/pg_upgrade/

Maybe I'll make a shell script to automate these steps if required.

Yeah, I think we need something to automate this. I did the testing
manually. I used the attached python script to consume multixids faster,
but it's still tedious.

I used pg_resetwal to quickly create a cluster that's close to multixid
wraparound:

initdb -D data
pg_resetwal -D data -m 4294900001,4294900000
dd if=/dev/zero of=data/pg_multixact/offsets/FFFE bs=8192 count=32
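
To see why the zero-filled file has to be named FFFE, the segment name can be derived from the multixact ID. The sketch below assumes the pre-patch layout (4-byte offsets, 8 kB pages, 32 pages per segment) and four-digit uppercase hexadecimal SLRU file names; `offsets_segment_name` is a hypothetical helper, not a PostgreSQL function.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed defaults: BLCKSZ = 8192 and SLRU_PAGES_PER_SEGMENT = 32. */
#define BLOCK_SIZE			8192u
#define PAGES_PER_SEGMENT	32u
#define OLD_ENTRY_SIZE		4u		/* sizeof a 32-bit MultiXactOffset */

/* Write the name of the offsets segment that holds mxid into buf. */
void
offsets_segment_name(uint32_t mxid, char *buf, size_t buflen)
{
	uint32_t	per_page = BLOCK_SIZE / OLD_ENTRY_SIZE;			/* 2048 */
	uint32_t	per_segment = per_page * PAGES_PER_SEGMENT;		/* 65536 */

	assert(buflen >= 5);		/* four hex digits plus terminator */
	snprintf(buf, buflen, "%04X", (unsigned int) (mxid / per_segment));
}
```

For the oldest multixact ID 4294900000 passed to pg_resetwal above, this gives FFFE, and the dd command writes 32 blocks of 8192 bytes, which is exactly one full segment of 262144 bytes.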

--
Heikki Linnakangas
Neon (https://neon.tech)

Attachments:

multixids.py (text/x-python; charset=UTF-8)
#20Maxim Orlov
orlovmg@gmail.com
In reply to: Heikki Linnakangas (#19)
5 attachment(s)
Re: POC: make mxidoff 64 bits

On Tue, 12 Nov 2024 at 02:31, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

The wraparound logic is still not correct.

Yep, my fault. I forgot to reset the segment counter when wraparound happens.
Fixed.

When I try to select from a table after upgrade that contains
post-wraparound multixids:

TRAP: failed Assert("offset != 0"), File:
"../src/backend/access/transam/multixact.c", Line: 1353, PID: 63386

The problem was in converting offset segments. The new_entry index should
also bypass the invalid offset (0) value. Fixed.

On a different note, I'm surprised you're rewriting member segments from
scratch, parsing all the individual member groups and writing them out
again. There's no change to the members file format, except for the
numbering of the files, so you could just copy the files under the new
names without paying attention to the contents. It's not wrong to parse
them in detail, but I'd assume that it would be simpler not to.

Yes, at the beginning I also thought that it would be possible to get by
with simple copying. But in the case of wraparound, we must "bypass" the
invalid zero offset value. The old 32-bit offsets wrap at 2^32, so the value 0
shows up in multixact.c and must be handled, that is, skipped. Once we
switch to 64-bit offsets, we have two options:
1) Bypass every ((uint32) offset == 0) value in multixact.c;
2) Convert the members and bypass the invalid value once.

The first option seems too weird to me. So we have to repack the members and
bypass the invalid value.
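
The zero value discussed above comes from how 32-bit offsets are handed out: offset 0 is reserved as invalid, so an allocation that would start at 0 after wraparound starts at 1 instead. The following is a simplified model of that rule, written for illustration only. `alloc_offset32` is a hypothetical name, and locking, SLRU page handling, and error checks are left out.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Reserve nmembers member slots from a wrapping 32-bit counter and return
 * the starting offset.  A starting offset of 0 is never returned; slot 0 is
 * consumed as padding instead.
 */
uint32_t
alloc_offset32(uint32_t *next_offset, uint32_t nmembers)
{
	uint32_t	start;

	assert(nmembers > 0);

	if (*next_offset == 0)
	{
		start = 1;
		nmembers++;				/* account for the skipped slot 0 */
	}
	else
		start = *next_offset;

	*next_offset += nmembers;	/* unsigned arithmetic wraps modulo 2^32 */
	return start;
}
```

Starting from 0xFFFFFFFE, a two-member allocation leaves the counter at 0, and the next allocation then starts at 1. A members area written this way contains one unused slot per wraparound, which is the gap a conversion to 64-bit offsets has to account for.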

All patches are for master@38c18710b37a2d

--
Best regards,
Maxim Orlov.

Attachments:

v7-0001-Use-64-bit-format-output-for-multixact-offsets.patch (application/octet-stream)
From fdb7e2eee33dfb5df714d8d16112d4c907475d78 Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Wed, 7 Aug 2024 16:35:22 +0300
Subject: [PATCH v7 1/5] Use 64-bit format output for multixact offsets

Author: Maxim Orlov <orlovmg@gmail.com>
---
 src/backend/access/rmgrdesc/mxactdesc.c   |  9 ++++----
 src/backend/access/rmgrdesc/xlogdesc.c    |  4 ++--
 src/backend/access/transam/multixact.c    | 26 +++++++++++++----------
 src/backend/access/transam/xlogrecovery.c |  5 +++--
 src/bin/pg_controldata/pg_controldata.c   |  4 ++--
 src/bin/pg_resetwal/pg_resetwal.c         |  8 +++----
 6 files changed, 31 insertions(+), 25 deletions(-)

diff --git a/src/backend/access/rmgrdesc/mxactdesc.c b/src/backend/access/rmgrdesc/mxactdesc.c
index 3e8ad4d5ef..1b486de38c 100644
--- a/src/backend/access/rmgrdesc/mxactdesc.c
+++ b/src/backend/access/rmgrdesc/mxactdesc.c
@@ -65,8 +65,8 @@ multixact_desc(StringInfo buf, XLogReaderState *record)
 		xl_multixact_create *xlrec = (xl_multixact_create *) rec;
 		int			i;
 
-		appendStringInfo(buf, "%u offset %u nmembers %d: ", xlrec->mid,
-						 xlrec->moff, xlrec->nmembers);
+		appendStringInfo(buf, "%u offset %llu nmembers %d: ", xlrec->mid,
+						 (unsigned long long) xlrec->moff, xlrec->nmembers);
 		for (i = 0; i < xlrec->nmembers; i++)
 			out_member(buf, &xlrec->members[i]);
 	}
@@ -74,9 +74,10 @@ multixact_desc(StringInfo buf, XLogReaderState *record)
 	{
 		xl_multixact_truncate *xlrec = (xl_multixact_truncate *) rec;
 
-		appendStringInfo(buf, "offsets [%u, %u), members [%u, %u)",
+		appendStringInfo(buf, "offsets [%u, %u), members [%llu, %llu)",
 						 xlrec->startTruncOff, xlrec->endTruncOff,
-						 xlrec->startTruncMemb, xlrec->endTruncMemb);
+						 (unsigned long long) xlrec->startTruncMemb,
+						 (unsigned long long) xlrec->endTruncMemb);
 	}
 }
 
diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c
index 363294d623..aaa19c81c8 100644
--- a/src/backend/access/rmgrdesc/xlogdesc.c
+++ b/src/backend/access/rmgrdesc/xlogdesc.c
@@ -66,7 +66,7 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 		CheckPoint *checkpoint = (CheckPoint *) rec;
 
 		appendStringInfo(buf, "redo %X/%X; "
-						 "tli %u; prev tli %u; fpw %s; wal_level %s; xid %u:%u; oid %u; multi %u; offset %u; "
+						 "tli %u; prev tli %u; fpw %s; wal_level %s; xid %u:%u; oid %u; multi %u; offset %llu; "
 						 "oldest xid %u in DB %u; oldest multi %u in DB %u; "
 						 "oldest/newest commit timestamp xid: %u/%u; "
 						 "oldest running xid %u; %s",
@@ -79,7 +79,7 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 						 XidFromFullTransactionId(checkpoint->nextXid),
 						 checkpoint->nextOid,
 						 checkpoint->nextMulti,
-						 checkpoint->nextMultiOffset,
+						 (unsigned long long) checkpoint->nextMultiOffset,
 						 checkpoint->oldestXid,
 						 checkpoint->oldestXidDB,
 						 checkpoint->oldestMulti,
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 8c37d7eba7..ab90912ed3 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1264,7 +1264,8 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 
 	LWLockRelease(MultiXactGenLock);
 
-	debug_elog4(DEBUG2, "GetNew: returning %u offset %u", result, *offset);
+	debug_elog4(DEBUG2, "GetNew: returning %u offset %llu", result,
+				(unsigned long long) *offset);
 	return result;
 }
 
@@ -2293,8 +2294,9 @@ MultiXactGetCheckptMulti(bool is_shutdown,
 	LWLockRelease(MultiXactGenLock);
 
 	debug_elog6(DEBUG2,
-				"MultiXact: checkpoint is nextMulti %u, nextOffset %u, oldestMulti %u in DB %u",
-				*nextMulti, *nextMultiOffset, *oldestMulti, *oldestMultiDB);
+				"MultiXact: checkpoint is nextMulti %u, nextOffset %llu, oldestMulti %u in DB %u",
+				*nextMulti, (unsigned long long) *nextMultiOffset, *oldestMulti,
+				*oldestMultiDB);
 }
 
 /*
@@ -2328,8 +2330,8 @@ void
 MultiXactSetNextMXact(MultiXactId nextMulti,
 					  MultiXactOffset nextMultiOffset)
 {
-	debug_elog4(DEBUG2, "MultiXact: setting next multi to %u offset %u",
-				nextMulti, nextMultiOffset);
+	debug_elog4(DEBUG2, "MultiXact: setting next multi to %u offset %llu",
+				nextMulti, (unsigned long long) nextMultiOffset);
 	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
 	MultiXactState->nextMXact = nextMulti;
 	MultiXactState->nextOffset = nextMultiOffset;
@@ -2519,8 +2521,8 @@ MultiXactAdvanceNextMXact(MultiXactId minMulti,
 	}
 	if (MultiXactOffsetPrecedes(MultiXactState->nextOffset, minMultiOffset))
 	{
-		debug_elog3(DEBUG2, "MultiXact: setting next offset to %u",
-					minMultiOffset);
+		debug_elog3(DEBUG2, "MultiXact: setting next offset to %llu",
+					(unsigned long long) minMultiOffset);
 		MultiXactState->nextOffset = minMultiOffset;
 	}
 	LWLockRelease(MultiXactGenLock);
@@ -3211,11 +3213,12 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 
 	elog(DEBUG1, "performing multixact truncation: "
 		 "offsets [%u, %u), offsets segments [%llx, %llx), "
-		 "members [%u, %u), members segments [%llx, %llx)",
+		 "members [%llu, %llu), members segments [%llx, %llx)",
 		 oldestMulti, newOldestMulti,
 		 (unsigned long long) MultiXactIdToOffsetSegment(oldestMulti),
 		 (unsigned long long) MultiXactIdToOffsetSegment(newOldestMulti),
-		 oldestOffset, newOldestOffset,
+		 (unsigned long long) oldestOffset,
+		 (unsigned long long) newOldestOffset,
 		 (unsigned long long) MXOffsetToMemberSegment(oldestOffset),
 		 (unsigned long long) MXOffsetToMemberSegment(newOldestOffset));
 
@@ -3471,11 +3474,12 @@ multixact_redo(XLogReaderState *record)
 
 		elog(DEBUG1, "replaying multixact truncation: "
 			 "offsets [%u, %u), offsets segments [%llx, %llx), "
-			 "members [%u, %u), members segments [%llx, %llx)",
+			 "members [%llu, %llu), members segments [%llx, %llx)",
 			 xlrec.startTruncOff, xlrec.endTruncOff,
 			 (unsigned long long) MultiXactIdToOffsetSegment(xlrec.startTruncOff),
 			 (unsigned long long) MultiXactIdToOffsetSegment(xlrec.endTruncOff),
-			 xlrec.startTruncMemb, xlrec.endTruncMemb,
+			 (unsigned long long) xlrec.startTruncMemb,
+			 (unsigned long long) xlrec.endTruncMemb,
 			 (unsigned long long) MXOffsetToMemberSegment(xlrec.startTruncMemb),
 			 (unsigned long long) MXOffsetToMemberSegment(xlrec.endTruncMemb));
 
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 05c738d661..727b6e744f 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -876,8 +876,9 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 							 U64FromFullTransactionId(checkPoint.nextXid),
 							 checkPoint.nextOid)));
 	ereport(DEBUG1,
-			(errmsg_internal("next MultiXactId: %u; next MultiXactOffset: %u",
-							 checkPoint.nextMulti, checkPoint.nextMultiOffset)));
+			(errmsg_internal("next MultiXactId: %u; next MultiXactOffset: %llu",
+							 checkPoint.nextMulti,
+							 (unsigned long long) checkPoint.nextMultiOffset)));
 	ereport(DEBUG1,
 			(errmsg_internal("oldest unfrozen transaction ID: %u, in database %u",
 							 checkPoint.oldestXid, checkPoint.oldestXidDB)));
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index 93a05d80ca..43b6727570 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -253,8 +253,8 @@ main(int argc, char *argv[])
 		   ControlFile->checkPointCopy.nextOid);
 	printf(_("Latest checkpoint's NextMultiXactId:  %u\n"),
 		   ControlFile->checkPointCopy.nextMulti);
-	printf(_("Latest checkpoint's NextMultiOffset:  %u\n"),
-		   ControlFile->checkPointCopy.nextMultiOffset);
+	printf(_("Latest checkpoint's NextMultiOffset:  %llu\n"),
+		   (unsigned long long) ControlFile->checkPointCopy.nextMultiOffset);
 	printf(_("Latest checkpoint's oldestXID:        %u\n"),
 		   ControlFile->checkPointCopy.oldestXid);
 	printf(_("Latest checkpoint's oldestXID's DB:   %u\n"),
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index e9dcb5a6d8..985cd06802 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -737,8 +737,8 @@ PrintControlValues(bool guessed)
 		   ControlFile.checkPointCopy.nextOid);
 	printf(_("Latest checkpoint's NextMultiXactId:  %u\n"),
 		   ControlFile.checkPointCopy.nextMulti);
-	printf(_("Latest checkpoint's NextMultiOffset:  %u\n"),
-		   ControlFile.checkPointCopy.nextMultiOffset);
+	printf(_("Latest checkpoint's NextMultiOffset:  %llu\n"),
+		   (unsigned long long) ControlFile.checkPointCopy.nextMultiOffset);
 	printf(_("Latest checkpoint's oldestXID:        %u\n"),
 		   ControlFile.checkPointCopy.oldestXid);
 	printf(_("Latest checkpoint's oldestXID's DB:   %u\n"),
@@ -809,8 +809,8 @@ PrintNewControlValues(void)
 
 	if (set_mxoff != -1)
 	{
-		printf(_("NextMultiOffset:                      %u\n"),
-			   ControlFile.checkPointCopy.nextMultiOffset);
+		printf(_("NextMultiOffset:                      %llu\n"),
+			   (unsigned long long) ControlFile.checkPointCopy.nextMultiOffset);
 	}
 
 	if (set_oid != 0)
-- 
2.43.0

v7-0004-Get-rid-of-MultiXactMemberFreezeThreshold-call.patch (application/octet-stream)
From 1b630d2f82ce69cd8479aaaec7dfe266a77fb718 Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Wed, 23 Oct 2024 18:23:39 +0300
Subject: [PATCH v7 4/5] Get rid of MultiXactMemberFreezeThreshold call.

Since MaxMultiXactOffset is UINT64_MAX now, the MULTIXACT_MEMBER_SAFE_THRESHOLD
and MULTIXACT_MEMBER_DANGER_THRESHOLD values are no longer meaningful. Thus,
MultiXactMemberFreezeThreshold is not needed either.

Instead, switch to the MULTIXACT_MEMBER_AUTOVAC_THRESHOLD (equal to 2^32 - 1)
members threshold. It is used to determine whether we need to force autovacuum.

Author: Maxim Orlov <orlovmg@gmail.com>
---
 src/backend/access/transam/multixact.c | 117 +++----------------------
 src/backend/commands/vacuum.c          |   2 +-
 src/backend/postmaster/autovacuum.c    |   4 +-
 src/include/access/multixact.h         |   1 -
 4 files changed, 15 insertions(+), 109 deletions(-)

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 48e1c0160a..a817f539ee 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -204,10 +204,14 @@ MXOffsetToMemberOffset(MultiXactOffset offset)
 		member_in_group * sizeof(TransactionId);
 }
 
-/* Multixact members wraparound thresholds. */
-#define MULTIXACT_MEMBER_SAFE_THRESHOLD		(MaxMultiXactOffset / 2)
-#define MULTIXACT_MEMBER_DANGER_THRESHOLD	\
-	(MaxMultiXactOffset - MaxMultiXactOffset / 4)
+/*
+ * Multixact members warning threshold.
+ *
+ * If the difference between nextOffset and oldestOffset exceeds this value, we
+ * trigger autovacuum in order to release the disk space and reduce table bloat
+ * if possible.
+ */
+#define MULTIXACT_MEMBER_AUTOVAC_THRESHOLD		UINT64CONST(0xFFFFFFFF)
 
 static inline MultiXactId
 PreviousMultiXactId(MultiXactId multi)
@@ -2616,15 +2620,13 @@ GetOldestMultiXactId(void)
 }
 
 /*
- * Determine how aggressively we need to vacuum in order to prevent member
- * wraparound.
+ * Determine whether a vacuum is needed for multixact members.
  *
  * To do so determine what's the oldest member offset and install the limit
  * info in MultiXactState, where it can be used to prevent overrun of old data
  * in the members SLRU area.
  *
- * The return value is true if emergency autovacuum is required and false
- * otherwise.
+ * The return value is true if autovacuum is required and false otherwise.
  */
 static bool
 SetOffsetVacuumLimit(bool is_startup)
@@ -2712,10 +2714,10 @@ SetOffsetVacuumLimit(bool is_startup)
 	LWLockRelease(MultiXactGenLock);
 
 	/*
-	 * Do we need an emergency autovacuum?	If we're not sure, assume yes.
+	 * Do we need autovacuum?	If we're not sure, assume yes.
 	 */
 	return !oldestOffsetKnown ||
-		(nextOffset - oldestOffset > MULTIXACT_MEMBER_SAFE_THRESHOLD);
+		(nextOffset - oldestOffset > MULTIXACT_MEMBER_AUTOVAC_THRESHOLD);
 }
 
 /*
@@ -2761,101 +2763,6 @@ find_multixact_start(MultiXactId multi, MultiXactOffset *result)
 	return true;
 }
 
-/*
- * Determine how many multixacts, and how many multixact members, currently
- * exist.  Return false if unable to determine.
- */
-static bool
-ReadMultiXactCounts(uint32 *multixacts, MultiXactOffset *members)
-{
-	MultiXactOffset nextOffset;
-	MultiXactOffset oldestOffset;
-	MultiXactId oldestMultiXactId;
-	MultiXactId nextMultiXactId;
-	bool		oldestOffsetKnown;
-
-	LWLockAcquire(MultiXactGenLock, LW_SHARED);
-	nextOffset = MultiXactState->nextOffset;
-	oldestMultiXactId = MultiXactState->oldestMultiXactId;
-	nextMultiXactId = MultiXactState->nextMXact;
-	oldestOffset = MultiXactState->oldestOffset;
-	oldestOffsetKnown = MultiXactState->oldestOffsetKnown;
-	LWLockRelease(MultiXactGenLock);
-
-	if (!oldestOffsetKnown)
-		return false;
-
-	*members = nextOffset - oldestOffset;
-	*multixacts = nextMultiXactId - oldestMultiXactId;
-	return true;
-}
-
-/*
- * Multixact members can be removed once the multixacts that refer to them
- * are older than every datminmxid.  autovacuum_multixact_freeze_max_age and
- * vacuum_multixact_freeze_table_age work together to make sure we never have
- * too many multixacts; we hope that, at least under normal circumstances,
- * this will also be sufficient to keep us from using too many offsets.
- * However, if the average multixact has many members, we might exhaust the
- * members space while still using few enough members that these limits fail
- * to trigger relminmxid advancement by VACUUM.  At that point, we'd have no
- * choice but to start failing multixact-creating operations with an error.
- *
- * To prevent that, if more than a threshold portion of the members space is
- * used, we effectively reduce autovacuum_multixact_freeze_max_age and
- * to a value just less than the number of multixacts in use.  We hope that
- * this will quickly trigger autovacuuming on the table or tables with the
- * oldest relminmxid, thus allowing datminmxid values to advance and removing
- * some members.
- *
- * As the fraction of the member space currently in use grows, we become
- * more aggressive in clamping this value.  That not only causes autovacuum
- * to ramp up, but also makes any manual vacuums the user issues more
- * aggressive.  This happens because vacuum_get_cutoffs() will clamp the
- * freeze table and the minimum freeze age cutoffs based on the effective
- * autovacuum_multixact_freeze_max_age this function returns.  In the worst
- * case, we'll claim the freeze_max_age to zero, and every vacuum of any
- * table will freeze every multixact.
- */
-int
-MultiXactMemberFreezeThreshold(void)
-{
-	MultiXactOffset members;
-	uint32		multixacts;
-	uint32		victim_multixacts;
-	double		fraction;
-	int			result;
-
-	/* If we can't determine member space utilization, assume the worst. */
-	if (!ReadMultiXactCounts(&multixacts, &members))
-		return 0;
-
-	/* If member space utilization is low, no special action is required. */
-	if (members <= MULTIXACT_MEMBER_SAFE_THRESHOLD)
-		return autovacuum_multixact_freeze_max_age;
-
-	/*
-	 * Compute a target for relminmxid advancement.  The number of multixacts
-	 * we try to eliminate from the system is based on how far we are past
-	 * MULTIXACT_MEMBER_SAFE_THRESHOLD.
-	 */
-	fraction = (double) (members - MULTIXACT_MEMBER_SAFE_THRESHOLD);
-	fraction /= (double) (MULTIXACT_MEMBER_DANGER_THRESHOLD - MULTIXACT_MEMBER_SAFE_THRESHOLD);
-
-	victim_multixacts = multixacts * fraction;
-
-	/* fraction could be > 1.0, but lowest possible freeze age is zero */
-	if (victim_multixacts > multixacts)
-		return 0;
-	result = multixacts - victim_multixacts;
-
-	/*
-	 * Clamp to autovacuum_multixact_freeze_max_age, so that we never make
-	 * autovacuum less aggressive than it would otherwise be.
-	 */
-	return Min(result, autovacuum_multixact_freeze_max_age);
-}
-
 typedef struct mxtruncinfo
 {
 	int64		earliestExistingPage;
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 86f36b3695..e7506e268a 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -1133,7 +1133,7 @@ vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
 	 * normally autovacuum_multixact_freeze_max_age, but may be less if we are
 	 * short of multixact member space.
 	 */
-	effective_multixact_freeze_max_age = MultiXactMemberFreezeThreshold();
+	effective_multixact_freeze_max_age = autovacuum_multixact_freeze_max_age;
 
 	/*
 	 * Almost ready to set freeze output parameters; check if OldestXmin or
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index dc3cf87aba..180bb7e96e 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -1122,7 +1122,7 @@ do_start_worker(void)
 
 	/* Also determine the oldest datminmxid we will consider. */
 	recentMulti = ReadNextMultiXactId();
-	multiForceLimit = recentMulti - MultiXactMemberFreezeThreshold();
+	multiForceLimit = recentMulti - autovacuum_multixact_freeze_max_age;
 	if (multiForceLimit < FirstMultiXactId)
 		multiForceLimit -= FirstMultiXactId;
 
@@ -1915,7 +1915,7 @@ do_autovacuum(void)
 	 * normally autovacuum_multixact_freeze_max_age, but may be less if we are
 	 * short of multixact member space.
 	 */
-	effective_multixact_freeze_max_age = MultiXactMemberFreezeThreshold();
+	effective_multixact_freeze_max_age = autovacuum_multixact_freeze_max_age;
 
 	/*
 	 * Find the pg_database entry and select the default freeze ages. We use
diff --git a/src/include/access/multixact.h b/src/include/access/multixact.h
index 90583634ec..5aefbddce3 100644
--- a/src/include/access/multixact.h
+++ b/src/include/access/multixact.h
@@ -143,7 +143,6 @@ extern void MultiXactSetNextMXact(MultiXactId nextMulti,
 extern void MultiXactAdvanceNextMXact(MultiXactId minMulti,
 									  MultiXactOffset minMultiOffset);
 extern void MultiXactAdvanceOldest(MultiXactId oldestMulti, Oid oldestMultiDB);
-extern int	MultiXactMemberFreezeThreshold(void);
 
 extern void multixact_twophase_recover(TransactionId xid, uint16 info,
 									   void *recdata, uint32 len);
-- 
2.43.0

v7-0005-TEST-bump-catver.patch (application/octet-stream)
From f229177951e7c233e4f827cd1996f9ae9eac8f88 Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Wed, 13 Nov 2024 16:34:34 +0300
Subject: [PATCH v7 5/5] TEST: bump catver

---
 src/bin/pg_upgrade/pg_upgrade.h  | 2 +-
 src/include/catalog/catversion.h | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index 2c85ec1e94..18faedc963 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -119,7 +119,7 @@ extern char *output_files[];
  *
  * XXX: should be changed to the actual CATALOG_VERSION_NO on commit.
  */
-#define MULTIXACTOFFSET_FORMATCHANGE_CAT_VER 202409041
+#define MULTIXACTOFFSET_FORMATCHANGE_CAT_VER 202411112
 
 /*
  * large object chunk size added to pg_controldata,
diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h
index 5dd91e190a..3d09caf5ae 100644
--- a/src/include/catalog/catversion.h
+++ b/src/include/catalog/catversion.h
@@ -57,6 +57,6 @@
  */
 
 /*							yyyymmddN */
-#define CATALOG_VERSION_NO	202411111
+#define CATALOG_VERSION_NO	202411112
 
 #endif
-- 
2.43.0

v7-0003-Make-pg_upgrade-convert-multixact-offsets.patch (application/octet-stream)
From 3f26d4d4d8aeb61729da3faf7506c7df2aa4347d Mon Sep 17 00:00:00 2001
From: Maxim Orlov <m.orlov@postgrespro.ru>
Date: Tue, 13 Aug 2024 14:44:50 +0300
Subject: [PATCH v7 3/5] Make pg_upgrade convert multixact offsets.

Author: Maxim Orlov <orlovmg@gmail.com>
Author: Yura Sokolov <y.sokolov@postgrespro.ru>
---
 src/backend/access/transam/multixact.c |   2 +-
 src/bin/pg_upgrade/Makefile            |   1 +
 src/bin/pg_upgrade/meson.build         |   1 +
 src/bin/pg_upgrade/pg_upgrade.c        |  42 +-
 src/bin/pg_upgrade/pg_upgrade.h        |  14 +-
 src/bin/pg_upgrade/segresize.c         | 529 +++++++++++++++++++++++++
 6 files changed, 583 insertions(+), 6 deletions(-)
 create mode 100644 src/bin/pg_upgrade/segresize.c

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index c51e03e832..48e1c0160a 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1891,7 +1891,7 @@ MultiXactShmemInit(void)
 				  "pg_multixact/offsets", LWTRANCHE_MULTIXACTOFFSET_BUFFER,
 				  LWTRANCHE_MULTIXACTOFFSET_SLRU,
 				  SYNC_HANDLER_MULTIXACT_OFFSET,
-				  true);
+				  false);
 	SlruPagePrecedesUnitTests(MultiXactOffsetCtl, MULTIXACT_OFFSETS_PER_PAGE);
 	SimpleLruInit(MultiXactMemberCtl,
 				  "multixact_member", multixact_member_buffers, 0,
diff --git a/src/bin/pg_upgrade/Makefile b/src/bin/pg_upgrade/Makefile
index f83d2b5d30..70908d63a3 100644
--- a/src/bin/pg_upgrade/Makefile
+++ b/src/bin/pg_upgrade/Makefile
@@ -21,6 +21,7 @@ OBJS = \
 	info.o \
 	option.o \
 	parallel.o \
+	segresize.o \
 	pg_upgrade.o \
 	relfilenumber.o \
 	server.o \
diff --git a/src/bin/pg_upgrade/meson.build b/src/bin/pg_upgrade/meson.build
index 3d88419674..16f898ba14 100644
--- a/src/bin/pg_upgrade/meson.build
+++ b/src/bin/pg_upgrade/meson.build
@@ -10,6 +10,7 @@ pg_upgrade_sources = files(
   'info.c',
   'option.c',
   'parallel.c',
+  'segresize.c',
   'pg_upgrade.c',
   'relfilenumber.c',
   'server.c',
diff --git a/src/bin/pg_upgrade/pg_upgrade.c b/src/bin/pg_upgrade/pg_upgrade.c
index 663235816f..1654e877c0 100644
--- a/src/bin/pg_upgrade/pg_upgrade.c
+++ b/src/bin/pg_upgrade/pg_upgrade.c
@@ -750,8 +750,42 @@ copy_xact_xlog_xid(void)
 	if (old_cluster.controldata.cat_ver >= MULTIXACT_FORMATCHANGE_CAT_VER &&
 		new_cluster.controldata.cat_ver >= MULTIXACT_FORMATCHANGE_CAT_VER)
 	{
-		copy_subdir_files("pg_multixact/offsets", "pg_multixact/offsets");
-		copy_subdir_files("pg_multixact/members", "pg_multixact/members");
+		/*
+		 * If the old cluster predates MULTIXACTOFFSET_FORMATCHANGE_CAT_VER,
+		 * it has 32-bit multixact offsets, which must be converted.
+		 */
+		if (old_cluster.controldata.cat_ver < MULTIXACTOFFSET_FORMATCHANGE_CAT_VER &&
+			new_cluster.controldata.cat_ver >= MULTIXACTOFFSET_FORMATCHANGE_CAT_VER)
+		{
+			MultiXactOffset		oldest_offset,
+								next_offset;
+
+			remove_new_subdir("pg_multixact/offsets", false);
+			prep_status("Converting pg_multixact/offsets to 64-bit");
+			oldest_offset = convert_multixact_offsets();
+			check_ok();
+
+			remove_new_subdir("pg_multixact/members", false);
+			prep_status("Converting pg_multixact/members");
+			convert_multixact_members(oldest_offset);
+			check_ok();
+
+			next_offset = old_cluster.controldata.chkpnt_nxtmxoff;
+			if (oldest_offset)
+			{
+				if (next_offset < oldest_offset)
+					next_offset += ((MultiXactOffset) 1 << 32) - 1;
+
+				next_offset -= oldest_offset - 1;
+
+				old_cluster.controldata.chkpnt_nxtmxoff = next_offset;
+			}
+		}
+		else
+		{
+			copy_subdir_files("pg_multixact/offsets", "pg_multixact/offsets");
+			copy_subdir_files("pg_multixact/members", "pg_multixact/members");
+		}
 
 		prep_status("Setting next multixact ID and offset for new cluster");
 
@@ -760,9 +794,9 @@ copy_xact_xlog_xid(void)
 		 * counters here and the oldest multi present on system.
 		 */
 		exec_prog(UTILITY_LOG_FILE, NULL, true, true,
-				  "\"%s/pg_resetwal\" -O %u -m %u,%u \"%s\"",
+				  "\"%s/pg_resetwal\" -O %llu -m %u,%u \"%s\"",
 				  new_cluster.bindir,
-				  old_cluster.controldata.chkpnt_nxtmxoff,
+				  (unsigned long long) old_cluster.controldata.chkpnt_nxtmxoff,
 				  old_cluster.controldata.chkpnt_nxtmulti,
 				  old_cluster.controldata.chkpnt_oldstMulti,
 				  new_cluster.pgdata);
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index 53f693c2d4..2c85ec1e94 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -114,6 +114,13 @@ extern char *output_files[];
  */
 #define MULTIXACT_FORMATCHANGE_CAT_VER 201301231
 
+/*
+ * Switch from 32-bit to 64-bit for multixid offsets.
+ *
+ * XXX: should be changed to the actual CATALOG_VERSION_NO on commit.
+ */
+#define MULTIXACTOFFSET_FORMATCHANGE_CAT_VER 202409041
+
 /*
  * large object chunk size added to pg_controldata,
  * commit 5f93c37805e7485488480916b4585e098d3cc883
@@ -230,7 +237,7 @@ typedef struct
 	uint32		chkpnt_nxtepoch;
 	uint32		chkpnt_nxtoid;
 	uint32		chkpnt_nxtmulti;
-	uint32		chkpnt_nxtmxoff;
+	uint64		chkpnt_nxtmxoff;
 	uint32		chkpnt_oldstMulti;
 	uint32		chkpnt_oldstxid;
 	uint32		align;
@@ -515,3 +522,8 @@ typedef struct
 	FILE	   *file;
 	char		path[MAXPGPATH];
 } UpgradeTaskReport;
+
+/* segresize.c */
+
+MultiXactOffset		convert_multixact_offsets(void);
+void				convert_multixact_members(MultiXactOffset oldest_offset);
diff --git a/src/bin/pg_upgrade/segresize.c b/src/bin/pg_upgrade/segresize.c
new file mode 100644
index 0000000000..1f02bb8aea
--- /dev/null
+++ b/src/bin/pg_upgrade/segresize.c
@@ -0,0 +1,529 @@
+/*
+ *	segresize.c
+ *
+ *	SLRU segment resize utility
+ *
+ *	Copyright (c) 2024, PostgreSQL Global Development Group
+ *	src/bin/pg_upgrade/segresize.c
+ */
+
+#include "postgres_fe.h"
+
+#include "pg_upgrade.h"
+#include "access/multixact.h"
+
+/* See slru.h */
+#define SLRU_PAGES_PER_SEGMENT		32
+
+/*
+ * A simple iterator over the pages of an SLRU segment.  The caller sets the
+ * starting segment and page number and then steps through the pages.
+ */
+typedef struct SlruSegState
+{
+	char	   *dir;
+	char	   *fn;
+	FILE	   *file;
+	int64		segno;
+	uint64		pageno;
+	bool		leading_gap;
+} SlruSegState;
+
+/*
+ * Get SLRU segment file name from state.
+ */
+static inline char *
+SlruFileName(SlruSegState *state)
+{
+	Assert(state->segno >= 0 &&
+		   state->segno <= INT64CONST(0xFFFFFF));
+
+	return psprintf("%s/%04X", state->dir, (unsigned int) (state->segno));
+}
+
+/*
+ * Create SLRU segment file.
+ */
+static void
+create_segment(SlruSegState *state)
+{
+	Assert(state->fn == NULL);
+	Assert(state->file == NULL);
+
+	state->fn = SlruFileName(state);
+	state->file = fopen(state->fn, "wb");
+	if (!state->file)
+		pg_fatal("could not create file \"%s\": %m", state->fn);
+}
+
+/*
+ * Open existing SLRU segment file.
+ */
+static void
+open_segment(SlruSegState *state)
+{
+	Assert(state->fn == NULL);
+	Assert(state->file == NULL);
+
+	state->fn = SlruFileName(state);
+	state->file = fopen(state->fn, "rb");
+	if (!state->file)
+		pg_fatal("could not open file \"%s\": %m", state->fn);
+}
+
+/*
+ * Close SLRU segment file.
+ */
+static void
+close_segment(SlruSegState *state)
+{
+	if (state->file)
+	{
+		fclose(state->file);
+		state->file = NULL;
+	}
+
+	if (state->fn)
+	{
+		pfree(state->fn);
+		state->fn = NULL;
+	}
+}
+
+/*
+ * Read next page from the old 32-bit offset segment file.
+ */
+static int
+read_old_segment_page(SlruSegState *state, void *buf, bool *empty)
+{
+	int		len;
+
+	/* Open next segment file, if needed. */
+	if (!state->fn)
+	{
+		if (!state->segno)
+			state->leading_gap = true;
+
+		open_segment(state);
+
+		/* Set position to the needed page. */
+		if (state->pageno > 0 &&
+			fseek(state->file, state->pageno * BLCKSZ, SEEK_SET))
+		{
+			close_segment(state);
+		}
+	}
+
+	if (state->file)
+	{
+		/* Segment file does exist, read a page from it. */
+		state->leading_gap = false;
+
+		len = fread(buf, sizeof(char), BLCKSZ, state->file);
+
+		/* Are we done or was there an error? */
+		if (len <= 0)
+		{
+			if (ferror(state->file))
+				pg_fatal("error reading file \"%s\": %m", state->fn);
+
+			if (feof(state->file))
+			{
+				*empty = true;
+				len = -1;
+
+				close_segment(state);
+			}
+		}
+		else
+			*empty = false;
+	}
+	else if (!state->leading_gap)
+	{
+		/* We reached the last segment. */
+		len = -1;
+		*empty = true;
+	}
+	else
+	{
+		/* Skip the first few segments if they were frozen and removed. */
+		len = BLCKSZ;
+		*empty = true;
+	}
+
+	if (++state->pageno >= SLRU_PAGES_PER_SEGMENT)
+	{
+		/* Start a new segment. */
+		state->segno++;
+		state->pageno = 0;
+
+		close_segment(state);
+	}
+
+	return len;
+}
+
+/*
+ * Write next page to the new 64-bit offset segment file.
+ */
+static void
+write_new_segment_page(SlruSegState *state, void *buf)
+{
+	/*
+	 * Create a new segment file if we haven't yet.  Creation is
+	 * postponed until the first non-empty page is found, which avoids
+	 * creating completely empty segments.
+	 */
+	if (!state->file)
+	{
+		create_segment(state);
+
+		/* Write zeroes to the previously skipped prefix. */
+		if (state->pageno > 0)
+		{
+			char		zerobuf[BLCKSZ] = {0};
+
+			for (int64 i = 0; i < state->pageno; i++)
+			{
+				if (fwrite(zerobuf, sizeof(char), BLCKSZ, state->file) != BLCKSZ)
+					pg_fatal("could not write file \"%s\": %m", state->fn);
+			}
+		}
+	}
+
+	/* Write page to the new segment (if it was created). */
+	if (state->file)
+	{
+		if (fwrite(buf, sizeof(char), BLCKSZ, state->file) != BLCKSZ)
+			pg_fatal("could not write file \"%s\": %m", state->fn);
+	}
+
+	/*
+	 * Did we reach the maximum page number?  Then close segment file
+	 * and create a new one on the next iteration.
+	 */
+	if (++state->pageno >= SLRU_PAGES_PER_SEGMENT)
+	{
+		/* Start a new segment. */
+		state->segno++;
+		state->pageno = 0;
+
+		close_segment(state);
+	}
+}
+
+typedef uint32 MultiXactOffsetOld;
+
+#define MaxMultiXactOffsetOld	((MultiXactOffsetOld) 0xFFFFFFFF)
+
+#define MULTIXACT_OFFSETS_PER_PAGE_OLD (BLCKSZ / sizeof(MultiXactOffsetOld))
+#define MULTIXACT_OFFSETS_PER_PAGE_NEW (BLCKSZ / sizeof(MultiXactOffset))
+
+/*
+ * Convert pg_multixact/offsets segments and return oldest multi offset.
+ */
+MultiXactOffset
+convert_multixact_offsets(void)
+{
+	SlruSegState		oldseg = {0},
+						newseg = {0};
+	MultiXactOffsetOld	oldbuf[MULTIXACT_OFFSETS_PER_PAGE_OLD] = {0};
+	MultiXactOffset		newbuf[MULTIXACT_OFFSETS_PER_PAGE_NEW] = {0},
+						oldest_offset = 0;
+	uint64				oldest_multi = old_cluster.controldata.chkpnt_oldstMulti,
+						next_multi = old_cluster.controldata.chkpnt_nxtmulti,
+						multi,
+						old_entry,
+						new_entry;
+	bool				oldest_offset_known = false;
+
+	oldseg.dir = psprintf("%s/pg_multixact/offsets", old_cluster.pgdata);
+	newseg.dir = psprintf("%s/pg_multixact/offsets", new_cluster.pgdata);
+
+	old_entry = oldest_multi % MULTIXACT_OFFSETS_PER_PAGE_OLD;
+	oldseg.pageno = oldest_multi / MULTIXACT_OFFSETS_PER_PAGE_OLD;
+	oldseg.segno = oldseg.pageno / SLRU_PAGES_PER_SEGMENT;
+	oldseg.pageno %= SLRU_PAGES_PER_SEGMENT;
+
+	new_entry = oldest_multi % MULTIXACT_OFFSETS_PER_PAGE_NEW;
+	newseg.pageno = oldest_multi / MULTIXACT_OFFSETS_PER_PAGE_NEW;
+	newseg.segno = newseg.pageno / SLRU_PAGES_PER_SEGMENT;
+	newseg.pageno %= SLRU_PAGES_PER_SEGMENT;
+
+	if (next_multi < oldest_multi)
+		next_multi += (uint64) 1 << 32;	/* wraparound */
+
+	/* Copy multi offsets reading only needed segment pages */
+	for (multi = oldest_multi; multi < next_multi; old_entry = 0)
+	{
+		int		oldlen;
+		bool	empty;
+
+		/* Handle possible segment wraparound */
+#define OLD_OFFSET_SEGNO_MAX	\
+	(MaxMultiXactId / MULTIXACT_OFFSETS_PER_PAGE_OLD / SLRU_PAGES_PER_SEGMENT)
+		if (oldseg.segno > OLD_OFFSET_SEGNO_MAX)
+		{
+			oldseg.segno = 0;
+			oldseg.pageno = 0;
+		}
+
+		oldlen = read_old_segment_page(&oldseg, oldbuf, &empty);
+		if (empty || oldlen != BLCKSZ)
+			pg_fatal("cannot read page %llu from file \"%s\": %m",
+					 (unsigned long long) oldseg.pageno, oldseg.fn);
+
+		/* Save oldest multi offset */
+		if (!oldest_offset_known)
+		{
+			oldest_offset = oldbuf[old_entry];
+			oldest_offset_known = true;
+		}
+
+		/* Skip wrapped-around invalid MultiXactIds */
+		if (multi == (uint64) 1 << 32)
+		{
+			Assert(oldseg.segno == 0);
+			Assert(oldseg.pageno == 1);
+			Assert(old_entry == 0);
+			Assert(new_entry == 0);
+
+			multi += FirstMultiXactId;
+			old_entry = FirstMultiXactId;
+			new_entry = FirstMultiXactId;
+		}
+
+		/* Copy entries to the new page */
+		for (; multi < next_multi && old_entry < MULTIXACT_OFFSETS_PER_PAGE_OLD;
+			 multi++, old_entry++)
+		{
+			MultiXactOffset offset = oldbuf[old_entry];
+
+			/* Handle possible offset wraparound (1 becomes 2^32) */
+			if (offset < oldest_offset)
+				offset += ((uint64) 1 << 32) - 1;
+
+			/* Subtract oldest_offset, so new offsets will start from 1 */
+			newbuf[new_entry++] = offset - oldest_offset + 1;
+
+			if (new_entry >= MULTIXACT_OFFSETS_PER_PAGE_NEW)
+			{
+				/* Handle possible segment wraparound */
+#define NEW_OFFSET_SEGNO_MAX	\
+	(MaxMultiXactId / MULTIXACT_OFFSETS_PER_PAGE_NEW / SLRU_PAGES_PER_SEGMENT)
+				if (newseg.segno > NEW_OFFSET_SEGNO_MAX)
+				{
+					newseg.segno = 0;
+					newseg.pageno = 0;
+				}
+
+				/* Write new page */
+				write_new_segment_page(&newseg, newbuf);
+				new_entry = 0;
+			}
+		}
+	}
+
+	/* Write the last incomplete page */
+	if (new_entry > 0 || oldest_multi == next_multi)
+	{
+		memset(&newbuf[new_entry], 0,
+			   sizeof(newbuf[0]) * (MULTIXACT_OFFSETS_PER_PAGE_NEW - new_entry));
+		write_new_segment_page(&newseg, newbuf);
+	}
+
+	/* Use next_offset as oldest_offset, if oldest_multi == next_multi */
+	if (!oldest_offset_known)
+	{
+		Assert(oldest_multi == next_multi);
+		oldest_offset = (MultiXactOffset) old_cluster.controldata.chkpnt_nxtmxoff;
+	}
+
+	/* Release resources */
+	close_segment(&oldseg);
+	close_segment(&newseg);
+
+	pfree(oldseg.dir);
+	pfree(newseg.dir);
+
+	return oldest_offset;
+}
+
+#define MXACT_MEMBERS_FLAG_BYTES			1
+
+#define MULTIXACT_MEMBERS_PER_GROUP			4
+#define MULTIXACT_MEMBERGROUP_SIZE			\
+	(MULTIXACT_MEMBERS_PER_GROUP * (sizeof(TransactionId) + MXACT_MEMBERS_FLAG_BYTES))
+#define MULTIXACT_MEMBERGROUPS_PER_PAGE		\
+	(BLCKSZ / MULTIXACT_MEMBERGROUP_SIZE)
+
+#define MULTIXACT_MEMBERS_PER_PAGE				\
+	(MULTIXACT_MEMBERS_PER_GROUP * MULTIXACT_MEMBERGROUPS_PER_PAGE)
+#define MULTIXACT_MEMBER_FLAG_BYTES_PER_GROUP	\
+	(MXACT_MEMBERS_FLAG_BYTES * MULTIXACT_MEMBERS_PER_GROUP)
+
+typedef struct MultiXactMembersCtx
+{
+	SlruSegState	seg;
+	char			buf[BLCKSZ];
+	int				group;
+	int				member;
+	char		   *flag;
+	TransactionId  *xid;
+} MultiXactMembersCtx;
+
+static void
+MultiXactMembersCtxInit(MultiXactMembersCtx *ctx)
+{
+	ctx->seg.dir = psprintf("%s/pg_multixact/members", new_cluster.pgdata);
+
+	ctx->group = 0;
+	ctx->member = 1;		/* skip invalid zero offset */
+
+	ctx->flag = (char *) ctx->buf + ctx->group * MULTIXACT_MEMBERGROUP_SIZE;
+	ctx->xid = (TransactionId *)(ctx->flag + MXACT_MEMBERS_FLAG_BYTES * MULTIXACT_MEMBERS_PER_GROUP);
+
+	ctx->flag += ctx->member;
+	ctx->xid += ctx->member;
+}
+
+static void
+MultiXactMembersCtxAdd(MultiXactMembersCtx *ctx, char flag, TransactionId xid)
+{
+	/* Copy member's xid and flags to the new page */
+	*ctx->flag++ = flag;
+	*ctx->xid++ = xid;
+
+	if (++ctx->member < MULTIXACT_MEMBERS_PER_GROUP)
+		return;
+
+	/* Start next member group */
+	ctx->member = 0;
+
+	if (++ctx->group >= MULTIXACT_MEMBERGROUPS_PER_PAGE)
+	{
+		/* Write current page and start new */
+		write_new_segment_page(&ctx->seg, ctx->buf);
+
+		ctx->group = 0;
+		memset(ctx->buf, 0, BLCKSZ);
+	}
+
+	ctx->flag = (char *) ctx->buf + ctx->group * MULTIXACT_MEMBERGROUP_SIZE;
+	ctx->xid = (TransactionId *)(ctx->flag + MXACT_MEMBERS_FLAG_BYTES * MULTIXACT_MEMBERS_PER_GROUP);
+}
+
+static void
+MultiXactMembersCtxFinit(MultiXactMembersCtx *ctx)
+{
+	if (ctx->flag > (char *) ctx->buf)
+		write_new_segment_page(&ctx->seg, ctx->buf);
+
+	close_segment(&ctx->seg);
+
+	pfree(ctx->seg.dir);
+}
+
+/*
+ * Convert pg_multixact/members segments, offsets will start from 1.
+ *
+ */
+void
+convert_multixact_members(MultiXactOffset oldest_offset)
+{
+	MultiXactOffset			next_offset,
+							offset;
+	SlruSegState			oldseg = {0};
+	char					oldbuf[BLCKSZ] = {0};
+	int						oldidx;
+	MultiXactMembersCtx		newctx = {0};
+
+	oldseg.dir = psprintf("%s/pg_multixact/members", old_cluster.pgdata);
+
+	next_offset = (MultiXactOffset) old_cluster.controldata.chkpnt_nxtmxoff;
+	if (next_offset < oldest_offset)
+		next_offset += ((uint64) 1 << 32) - 1;
+
+	/* Initialize the old starting position */
+	oldseg.pageno = oldest_offset / MULTIXACT_MEMBERS_PER_PAGE;
+	oldseg.segno = oldseg.pageno / SLRU_PAGES_PER_SEGMENT;
+	oldseg.pageno %= SLRU_PAGES_PER_SEGMENT;
+
+	/* Initialize new starting position */
+	MultiXactMembersCtxInit(&newctx);
+
+	/* Iterate through the original directory */
+	oldidx = oldest_offset % MULTIXACT_MEMBERS_PER_PAGE;
+	for (offset = oldest_offset; offset < next_offset;)
+	{
+		bool	empty;
+		int		oldlen;
+		int		ngroups;
+		int		oldgroup;
+		int		oldmember;
+
+		oldlen = read_old_segment_page(&oldseg, oldbuf, &empty);
+		if (empty || oldlen != BLCKSZ)
+			pg_fatal("cannot read page %llu from file \"%s\": %m",
+					 (unsigned long long) oldseg.pageno, oldseg.fn);
+
+		/* Iterate through the old member groups */
+		ngroups = oldlen / MULTIXACT_MEMBERGROUP_SIZE;
+		oldmember = oldidx % MULTIXACT_MEMBERS_PER_GROUP;
+		oldgroup = oldidx / MULTIXACT_MEMBERS_PER_GROUP;
+		while (oldgroup < ngroups && offset < next_offset)
+		{
+			char		   *oldflag;
+			TransactionId  *oldxid;
+			int				i;
+
+			oldflag = (char *) oldbuf + oldgroup * MULTIXACT_MEMBERGROUP_SIZE;
+			oldxid = (TransactionId *)(oldflag + MULTIXACT_MEMBER_FLAG_BYTES_PER_GROUP);
+
+			oldxid += oldmember;
+			oldflag += oldmember;
+
+			/* Iterate through the old members */
+			for (i = oldmember;
+				 i < MULTIXACT_MEMBERS_PER_GROUP && offset < next_offset;
+				 i++)
+			{
+				MultiXactMembersCtxAdd(&newctx, *oldflag++, *oldxid++);
+
+				if (++offset == (uint64) 1 << 32)
+				{
+					Assert(i == MaxMultiXactOffsetOld % MULTIXACT_MEMBERS_PER_GROUP);
+					goto wraparound;
+				}
+			}
+
+			oldgroup++;
+			oldmember = 0;
+		}
+
+		oldidx = 0;
+
+		continue;
+
+wraparound:
+#define SEGNO_MAX	(MaxMultiXactOffsetOld / MULTIXACT_MEMBERS_PER_PAGE / SLRU_PAGES_PER_SEGMENT)
+#define PAGENO_MAX	(MaxMultiXactOffsetOld / MULTIXACT_MEMBERS_PER_PAGE % SLRU_PAGES_PER_SEGMENT)
+		Assert((oldseg.segno == SEGNO_MAX && oldseg.pageno == PAGENO_MAX + 1) ||
+			   (oldseg.segno == SEGNO_MAX + 1 && oldseg.pageno == 0));
+
+		/* Switch to segment 0000 */
+		close_segment(&oldseg);
+		oldseg.segno = 0;
+		oldseg.pageno = 0;
+
+		/* skip invalid zero multi offset */
+		oldidx = 1;
+	}
+
+	MultiXactMembersCtxFinit(&newctx);
+
+	/* Release resources */
+	close_segment(&oldseg);
+
+	pfree(oldseg.dir);
+}
-- 
2.43.0
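
For readers following the conversion arithmetic in segresize.c, here is a minimal standalone sketch of two steps: rebasing a 32-bit offset so that the oldest one becomes 1, and locating a multixact's slot (segment, page, entry) for a given number of entries per page. It is an illustration only, not code from the patch. The names rebase_offset, locate_multi and SlruPos are made up for this example, and BLCKSZ is assumed to be the default 8192.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Assumption: default block size; the patch uses the build-time BLCKSZ. */
#define BLCKSZ					8192
#define SLRU_PAGES_PER_SEGMENT	32

/* Entries per offsets page before (uint32) and after (uint64) conversion. */
#define OFFSETS_PER_PAGE_OLD	(BLCKSZ / sizeof(uint32_t))
#define OFFSETS_PER_PAGE_NEW	(BLCKSZ / sizeof(uint64_t))

/*
 * Hypothetical helper: rebase an old 32-bit member offset so that
 * oldest_offset maps to 1.  Old offsets wrap at 2^32 and skip the invalid
 * value 0, hence the (2^32 - 1) correction for wrapped values.
 */
static uint64_t
rebase_offset(uint32_t old_offset, uint32_t oldest_offset)
{
	uint64_t	offset = old_offset;

	if (offset < oldest_offset)
		offset += ((uint64_t) 1 << 32) - 1;

	return offset - oldest_offset + 1;
}

/* Hypothetical position of one entry within an SLRU directory. */
typedef struct SlruPos
{
	int64_t		segno;		/* segment file number */
	uint64_t	pageno;		/* page within the segment */
	uint64_t	entry;		/* entry within the page */
} SlruPos;

/*
 * Hypothetical helper: split a multixact id into segment, page and entry,
 * given how many entries fit on one page.
 */
static SlruPos
locate_multi(uint64_t multi, uint64_t entries_per_page)
{
	SlruPos		pos;
	uint64_t	page = multi / entries_per_page;

	pos.entry = multi % entries_per_page;
	pos.segno = (int64_t) (page / SLRU_PAGES_PER_SEGMENT);
	pos.pageno = page % SLRU_PAGES_PER_SEGMENT;

	return pos;
}
```

Under these assumptions the same multixact id lands in a later segment after conversion, because each page holds half as many 64-bit entries, while the rebased offsets always start from 1 regardless of where the old cluster's oldest offset was.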

v7-0002-Use-64-bit-multixact-offsets.patch (application/octet-stream)
From 8d0b3a64804ba3b0c4104cd37907e2959934937b Mon Sep 17 00:00:00 2001
From: Maxim Orlov <m.orlov@postgrespro.ru>
Date: Wed, 6 Mar 2024 11:11:33 +0300
Subject: [PATCH v7 2/5] Use 64-bit multixact offsets.

Author: Maxim Orlov <orlovmg@gmail.com>
---
 src/backend/access/transam/multixact.c | 172 +------------------------
 src/bin/pg_resetwal/pg_resetwal.c      |   2 +-
 src/bin/pg_resetwal/t/001_basic.pl     |   2 +-
 src/include/access/multixact.h         |   2 +-
 src/include/c.h                        |   2 +-
 5 files changed, 11 insertions(+), 169 deletions(-)

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index ab90912ed3..c51e03e832 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -96,14 +96,6 @@
 /*
  * Defines for MultiXactOffset page sizes.  A page is the same BLCKSZ as is
  * used everywhere else in Postgres.
- *
- * Note: because MultiXactOffsets are 32 bits and wrap around at 0xFFFFFFFF,
- * MultiXact page numbering also wraps around at
- * 0xFFFFFFFF/MULTIXACT_OFFSETS_PER_PAGE, and segment numbering at
- * 0xFFFFFFFF/MULTIXACT_OFFSETS_PER_PAGE/SLRU_PAGES_PER_SEGMENT.  We need
- * take no explicit notice of that fact in this module, except when comparing
- * segment and page numbers in TruncateMultiXact (see
- * MultiXactOffsetPagePrecedes).
  */
 
 /* We need four bytes per offset */
@@ -272,9 +264,6 @@ typedef struct MultiXactStateData
 	MultiXactId multiStopLimit;
 	MultiXactId multiWrapLimit;
 
-	/* support for members anti-wraparound measures */
-	MultiXactOffset offsetStopLimit;	/* known if oldestOffsetKnown */
-
 	/*
 	 * This is used to sleep until a multixact offset is written when we want
 	 * to create the next one.
@@ -409,8 +398,6 @@ static bool MultiXactOffsetPrecedes(MultiXactOffset offset1,
 									MultiXactOffset offset2);
 static void ExtendMultiXactOffset(MultiXactId multi);
 static void ExtendMultiXactMember(MultiXactOffset offset, int nmembers);
-static bool MultiXactOffsetWouldWrap(MultiXactOffset boundary,
-									 MultiXactOffset start, uint32 distance);
 static bool SetOffsetVacuumLimit(bool is_startup);
 static bool find_multixact_start(MultiXactId multi, MultiXactOffset *result);
 static void WriteMZeroPageXlogRec(int64 pageno, uint8 info);
@@ -1164,78 +1151,6 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 	else
 		*offset = nextOffset;
 
-	/*----------
-	 * Protect against overrun of the members space as well, with the
-	 * following rules:
-	 *
-	 * If we're past offsetStopLimit, refuse to generate more multis.
-	 * If we're close to offsetStopLimit, emit a warning.
-	 *
-	 * Arbitrarily, we start emitting warnings when we're 20 segments or less
-	 * from offsetStopLimit.
-	 *
-	 * Note we haven't updated the shared state yet, so if we fail at this
-	 * point, the multixact ID we grabbed can still be used by the next guy.
-	 *
-	 * Note that there is no point in forcing autovacuum runs here: the
-	 * multixact freeze settings would have to be reduced for that to have any
-	 * effect.
-	 *----------
-	 */
-#define OFFSET_WARN_SEGMENTS	20
-	if (MultiXactState->oldestOffsetKnown &&
-		MultiXactOffsetWouldWrap(MultiXactState->offsetStopLimit, nextOffset,
-								 nmembers))
-	{
-		/* see comment in the corresponding offsets wraparound case */
-		SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER);
-
-		ereport(ERROR,
-				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg("multixact \"members\" limit exceeded"),
-				 errdetail_plural("This command would create a multixact with %u members, but the remaining space is only enough for %u member.",
-								  "This command would create a multixact with %u members, but the remaining space is only enough for %u members.",
-								  MultiXactState->offsetStopLimit - nextOffset - 1,
-								  nmembers,
-								  MultiXactState->offsetStopLimit - nextOffset - 1),
-				 errhint("Execute a database-wide VACUUM in database with OID %u with reduced \"vacuum_multixact_freeze_min_age\" and \"vacuum_multixact_freeze_table_age\" settings.",
-						 MultiXactState->oldestMultiXactDB)));
-	}
-
-	/*
-	 * Check whether we should kick autovacuum into action, to prevent members
-	 * wraparound. NB we use a much larger window to trigger autovacuum than
-	 * just the warning limit. The warning is just a measure of last resort -
-	 * this is in line with GetNewTransactionId's behaviour.
-	 */
-	if (!MultiXactState->oldestOffsetKnown ||
-		(MultiXactState->nextOffset - MultiXactState->oldestOffset
-		 > MULTIXACT_MEMBER_SAFE_THRESHOLD))
-	{
-		/*
-		 * To avoid swamping the postmaster with signals, we issue the autovac
-		 * request only when crossing a segment boundary. With default
-		 * compilation settings that's roughly after 50k members.  This still
-		 * gives plenty of chances before we get into real trouble.
-		 */
-		if ((MXOffsetToMemberPage(nextOffset) / SLRU_PAGES_PER_SEGMENT) !=
-			(MXOffsetToMemberPage(nextOffset + nmembers) / SLRU_PAGES_PER_SEGMENT))
-			SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER);
-	}
-
-	if (MultiXactState->oldestOffsetKnown &&
-		MultiXactOffsetWouldWrap(MultiXactState->offsetStopLimit,
-								 nextOffset,
-								 nmembers + MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT * OFFSET_WARN_SEGMENTS))
-		ereport(WARNING,
-				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg_plural("database with OID %u must be vacuumed before %d more multixact member is used",
-							   "database with OID %u must be vacuumed before %d more multixact members are used",
-							   MultiXactState->offsetStopLimit - nextOffset + nmembers,
-							   MultiXactState->oldestMultiXactDB,
-							   MultiXactState->offsetStopLimit - nextOffset + nmembers),
-				 errhint("Execute a database-wide VACUUM in that database with reduced \"vacuum_multixact_freeze_min_age\" and \"vacuum_multixact_freeze_table_age\" settings.")));
-
 	ExtendMultiXactMember(nextOffset, nmembers);
 
 	/*
@@ -1976,7 +1891,7 @@ MultiXactShmemInit(void)
 				  "pg_multixact/offsets", LWTRANCHE_MULTIXACTOFFSET_BUFFER,
 				  LWTRANCHE_MULTIXACTOFFSET_SLRU,
 				  SYNC_HANDLER_MULTIXACT_OFFSET,
-				  false);
+				  true);
 	SlruPagePrecedesUnitTests(MultiXactOffsetCtl, MULTIXACT_OFFSETS_PER_PAGE);
 	SimpleLruInit(MultiXactMemberCtl,
 				  "multixact_member", multixact_member_buffers, 0,
@@ -2721,8 +2636,6 @@ SetOffsetVacuumLimit(bool is_startup)
 	MultiXactOffset nextOffset;
 	bool		oldestOffsetKnown = false;
 	bool		prevOldestOffsetKnown;
-	MultiXactOffset offsetStopLimit = 0;
-	MultiXactOffset prevOffsetStopLimit;
 
 	/*
 	 * NB: Have to prevent concurrent truncation, we might otherwise try to
@@ -2737,7 +2650,6 @@ SetOffsetVacuumLimit(bool is_startup)
 	nextOffset = MultiXactState->nextOffset;
 	prevOldestOffsetKnown = MultiXactState->oldestOffsetKnown;
 	prevOldestOffset = MultiXactState->oldestOffset;
-	prevOffsetStopLimit = MultiXactState->offsetStopLimit;
 	Assert(MultiXactState->finishedStartup);
 	LWLockRelease(MultiXactGenLock);
 
@@ -2768,11 +2680,7 @@ SetOffsetVacuumLimit(bool is_startup)
 		oldestOffsetKnown =
 			find_multixact_start(oldestMultiXactId, &oldestOffset);
 
-		if (oldestOffsetKnown)
-			ereport(DEBUG1,
-					(errmsg_internal("oldest MultiXactId member is at offset %u",
-									 oldestOffset)));
-		else
+		if (!oldestOffsetKnown)
 			ereport(LOG,
 					(errmsg("MultiXact member wraparound protections are disabled because oldest checkpointed MultiXact %u does not exist on disk",
 							oldestMultiXactId)));
@@ -2785,24 +2693,7 @@ SetOffsetVacuumLimit(bool is_startup)
 	 * overrun of old data in the members SLRU area. We can only do so if the
 	 * oldest offset is known though.
 	 */
-	if (oldestOffsetKnown)
-	{
-		/* move back to start of the corresponding segment */
-		offsetStopLimit = oldestOffset - (oldestOffset %
-										  (MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT));
-
-		/* always leave one segment before the wraparound point */
-		offsetStopLimit -= (MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT);
-
-		if (!prevOldestOffsetKnown && !is_startup)
-			ereport(LOG,
-					(errmsg("MultiXact member wraparound protections are now enabled")));
-
-		ereport(DEBUG1,
-				(errmsg_internal("MultiXact member stop limit is now %u based on MultiXact %u",
-								 offsetStopLimit, oldestMultiXactId)));
-	}
-	else if (prevOldestOffsetKnown)
+	if (prevOldestOffsetKnown)
 	{
 		/*
 		 * If we failed to get the oldest offset this time, but we have a
@@ -2812,14 +2703,12 @@ SetOffsetVacuumLimit(bool is_startup)
 		 */
 		oldestOffset = prevOldestOffset;
 		oldestOffsetKnown = true;
-		offsetStopLimit = prevOffsetStopLimit;
 	}
 
 	/* Install the computed values */
 	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
 	MultiXactState->oldestOffset = oldestOffset;
 	MultiXactState->oldestOffsetKnown = oldestOffsetKnown;
-	MultiXactState->offsetStopLimit = offsetStopLimit;
 	LWLockRelease(MultiXactGenLock);
 
 	/*
@@ -2829,54 +2718,6 @@ SetOffsetVacuumLimit(bool is_startup)
 		(nextOffset - oldestOffset > MULTIXACT_MEMBER_SAFE_THRESHOLD);
 }
 
-/*
- * Return whether adding "distance" to "start" would move past "boundary".
- *
- * We use this to determine whether the addition is "wrapping around" the
- * boundary point, hence the name.  The reason we don't want to use the regular
- * 2^31-modulo arithmetic here is that we want to be able to use the whole of
- * the 2^32-1 space here, allowing for more multixacts than would fit
- * otherwise.
- */
-static bool
-MultiXactOffsetWouldWrap(MultiXactOffset boundary, MultiXactOffset start,
-						 uint32 distance)
-{
-	MultiXactOffset finish;
-
-	/*
-	 * Note that offset number 0 is not used (see GetMultiXactIdMembers), so
-	 * if the addition wraps around the UINT_MAX boundary, skip that value.
-	 */
-	finish = start + distance;
-	if (finish < start)
-		finish++;
-
-	/*-----------------------------------------------------------------------
-	 * When the boundary is numerically greater than the starting point, any
-	 * value numerically between the two is not wrapped:
-	 *
-	 *	<----S----B---->
-	 *	[---)			 = F wrapped past B (and UINT_MAX)
-	 *		 [---)		 = F not wrapped
-	 *			  [----] = F wrapped past B
-	 *
-	 * When the boundary is numerically less than the starting point (i.e. the
-	 * UINT_MAX wraparound occurs somewhere in between) then all values in
-	 * between are wrapped:
-	 *
-	 *	<----B----S---->
-	 *	[---)			 = F not wrapped past B (but wrapped past UINT_MAX)
-	 *		 [---)		 = F wrapped past B (and UINT_MAX)
-	 *			  [----] = F not wrapped
-	 *-----------------------------------------------------------------------
-	 */
-	if (start < boundary)
-		return finish >= boundary || finish < start;
-	else
-		return finish >= boundary && finish < start;
-}
-
 /*
  * Find the starting offset of the given MultiXactId.
  *
@@ -2998,8 +2839,9 @@ MultiXactMemberFreezeThreshold(void)
 	 * we try to eliminate from the system is based on how far we are past
 	 * MULTIXACT_MEMBER_SAFE_THRESHOLD.
 	 */
-	fraction = (double) (members - MULTIXACT_MEMBER_SAFE_THRESHOLD) /
-		(MULTIXACT_MEMBER_DANGER_THRESHOLD - MULTIXACT_MEMBER_SAFE_THRESHOLD);
+	fraction = (double) (members - MULTIXACT_MEMBER_SAFE_THRESHOLD);
+	fraction /= (double) (MULTIXACT_MEMBER_DANGER_THRESHOLD - MULTIXACT_MEMBER_SAFE_THRESHOLD);
+
 	victim_multixacts = multixacts * fraction;
 
 	/* fraction could be > 1.0, but lowest possible freeze age is zero */
@@ -3345,7 +3187,7 @@ MultiXactIdPrecedesOrEquals(MultiXactId multi1, MultiXactId multi2)
 static bool
 MultiXactOffsetPrecedes(MultiXactOffset offset1, MultiXactOffset offset2)
 {
-	int32		diff = (int32) (offset1 - offset2);
+	int64		diff = (int64) (offset1 - offset2);
 
 	return (diff < 0);
 }
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index 985cd06802..1af2ce4b93 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -264,7 +264,7 @@ main(int argc, char *argv[])
 
 			case 'O':
 				errno = 0;
-				set_mxoff = strtoul(optarg, &endptr, 0);
+				set_mxoff = strtou64(optarg, &endptr, 0);
 				if (endptr == optarg || *endptr != '\0' || errno != 0)
 				{
 					pg_log_error("invalid argument for option %s", "-O");
diff --git a/src/bin/pg_resetwal/t/001_basic.pl b/src/bin/pg_resetwal/t/001_basic.pl
index 9829e48106..f8a8eef44d 100644
--- a/src/bin/pg_resetwal/t/001_basic.pl
+++ b/src/bin/pg_resetwal/t/001_basic.pl
@@ -206,7 +206,7 @@ push @cmd,
   sprintf("%d,%d", hex($files[0]) == 0 ? 3 : hex($files[0]), hex($files[-1]));
 
 @files = get_slru_files('pg_multixact/offsets');
-$mult = 32 * $blcksz / 4;
+$mult = 32 * $blcksz / 8;
 # -m argument is "new,old"
 push @cmd, '-m',
   sprintf("%d,%d",
diff --git a/src/include/access/multixact.h b/src/include/access/multixact.h
index 7ffd256c74..90583634ec 100644
--- a/src/include/access/multixact.h
+++ b/src/include/access/multixact.h
@@ -27,7 +27,7 @@
 
 #define MultiXactIdIsValid(multi) ((multi) != InvalidMultiXactId)
 
-#define MaxMultiXactOffset	((MultiXactOffset) 0xFFFFFFFF)
+#define MaxMultiXactOffset	UINT64CONST(0xFFFFFFFFFFFFFFFF)
 
 /*
  * Possible multixact lock modes ("status").  The first four modes are for
diff --git a/src/include/c.h b/src/include/c.h
index 0a548d69d7..e1b3187d0b 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -664,7 +664,7 @@ typedef uint32 SubTransactionId;
 /* MultiXactId must be equivalent to TransactionId, to fit in t_xmax */
 typedef TransactionId MultiXactId;
 
-typedef uint32 MultiXactOffset;
+typedef uint64 MultiXactOffset;
 
 typedef uint32 CommandId;
 
-- 
2.43.0
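
The `/ 4` to `/ 8` change in 001_basic.pl follows from the wider offset entries: each entry in pg_multixact/offsets grows from 4 to 8 bytes, so one segment file covers half as many multixact IDs. A minimal sketch of that arithmetic in C, assuming the default 8 kB block size and 32 pages per SLRU segment (the `sketch_` names are illustrative and not part of the patch):

```c
#include <stdint.h>

/* Assumed build defaults; a real build takes these from its configuration. */
enum
{
	SKETCH_BLCKSZ = 8192,
	SKETCH_PAGES_PER_SEGMENT = 32
};

/* Number of offset entries that fit on one SLRU page for a given entry width. */
uint32_t
sketch_offsets_per_page(uint32_t entry_bytes)
{
	return SKETCH_BLCKSZ / entry_bytes;
}

/* Number of offset entries (one per multixact ID) covered by one segment file. */
uint32_t
sketch_offsets_per_segment(uint32_t entry_bytes)
{
	return SKETCH_PAGES_PER_SEGMENT * sketch_offsets_per_page(entry_bytes);
}
```

With these assumptions, `sketch_offsets_per_segment(4)` gives 65536 and `sketch_offsets_per_segment(8)` gives 32768, which is the multiplier the test script needs to turn a segment file name into a multixact ID.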

#21Maxim Orlov
orlovmg@gmail.com
In reply to: Maxim Orlov (#20)
3 attachment(s)
Re: POC: make mxidoff 64 bits

Here are the test scripts.
The generate.sh script generates a data directory with multiple clusters
in it. It calls multixids.py to generate the data. If you are not using
the system psql, consider setting the LD_LIBRARY_PATH environment variable
to point to the lib directory.
OLDBIN=/.../pgsql-new ./generate.sh

Then test.sh is used to run the various upgrades.
OLDBIN=/.../pgsql-old NEWBIN=/.../pgsql-new ./test.sh

I hope that helps!

--
Best regards,
Maxim Orlov.

Attachments:

multixids.py (text/x-python-script; charset=US-ASCII)
generate.sh (application/x-sh)
test.sh (application/x-sh)
#22Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Maxim Orlov (#20)
Re: POC: make mxidoff 64 bits

On 13/11/2024 17:44, Maxim Orlov wrote:

On Tue, 12 Nov 2024 at 02:31, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On a different note, I'm surprised you're rewriting member segments from
scratch, parsing all the individual member groups and writing them out
again. There's no change to the members file format, except for the
numbering of the files, so you could just copy the files under the new
names without paying attention to the contents. It's not wrong to parse
them in detail, but I'd assume that it would be simpler not to.

Yes, at the beginning I also thought that it would be possible to get by
with simple copying. But in case of wraparound, we must "bypass" the invalid
zero offset value. See, old 32-bit offsets wrap at 2^32, thus 0 values
appear in multixact.c. So, they must be handled. Bypassed, in fact.
When we switch to the 64-bit offsets, we have two options:
1). Bypass every ((uint32) offset == 0) value in multixact.c;
2). Convert members and bypass the invalid value once.

The first option seems too weird to me. So, we have to repack members
and bypass the invalid value.

Hmm, so if I understand correctly, this is related to how we determine
the length of the members array, by looking at the next multixid's
offset. This is explained in GetMultiXactIdMembers:

/*
* Find out the offset at which we need to start reading MultiXactMembers
* and the number of members in the multixact. We determine the latter as
* the difference between this multixact's starting offset and the next
* one's. However, there are some corner cases to worry about:
*
* 1. This multixact may be the latest one created, in which case there is
* no next one to look at. In this case the nextOffset value we just
* saved is the correct endpoint.
*
* 2. The next multixact may still be in process of being filled in: that
* is, another process may have done GetNewMultiXactId but not yet written
* the offset entry for that ID. In that scenario, it is guaranteed that
* the offset entry for that multixact exists (because GetNewMultiXactId
* won't release MultiXactGenLock until it does) but contains zero
* (because we are careful to pre-zero offset pages). Because
* GetNewMultiXactId will never return zero as the starting offset for a
* multixact, when we read zero as the next multixact's offset, we know we
* have this case. We handle this by sleeping on the condition variable
* we have just for this; the process in charge will signal the CV as soon
* as it has finished writing the multixact offset.
*
* 3. Because GetNewMultiXactId increments offset zero to offset one to
* handle case #2, there is an ambiguity near the point of offset
* wraparound. If we see next multixact's offset is one, is that our
* multixact's actual endpoint, or did it end at zero with a subsequent
* increment? We handle this using the knowledge that if the zero'th
* member slot wasn't filled, it'll contain zero, and zero isn't a valid
* transaction ID so it can't be a multixact member. Therefore, if we
* read a zero from the members array, just ignore it.
*
* This is all pretty messy, but the mess occurs only in infrequent corner
* cases, so it seems better than holding the MultiXactGenLock for a long
* time on every multixact creation.
*/
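
The length rule and corner case 3 from that comment can be sketched in a few lines of C. This is a simplified model, not the multixact.c code: it assumes the members are a plain array of 32-bit XIDs (the real members SLRU also stores flag bits), and the `sketch_` names are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Raw number of member slots between this multixact's offset and the next
 * one's.  Unsigned subtraction wraps modulo 2^32, so this also works when
 * the span crosses the 32-bit wraparound point.
 */
uint32_t
sketch_member_span32(uint32_t offset, uint32_t next_offset)
{
	return (uint32_t) (next_offset - offset);
}

/*
 * Count the real members in a span.  A zero XID is never a valid member, so
 * a zero slot is taken to be the unused slot 0 and is ignored (corner case 3).
 */
size_t
sketch_count_members(const uint32_t *xids, size_t span)
{
	size_t		n = 0;

	for (size_t i = 0; i < span; i++)
	{
		if (xids[i] != 0)
			n++;
	}
	return n;
}
```

For example, a multixact starting at offset 0xFFFFFFFE whose successor starts at offset 1 has a raw span of 3 slots (0xFFFFFFFE, 0xFFFFFFFF and 0), but only two of them hold members, because slot 0 was skipped.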

With 64-bit offsets, can we assume that it never wraps around? We often
treat 2^64 as "large enough that we'll never run out", e.g. LSNs are
also assumed to never wrap around. I think that would be a safe
assumption here too.
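
As a rough back-of-the-envelope check of that assumption, the sketch below estimates how long it takes to exhaust an offset counter of a given width. The sustained rate of one million members per second is an arbitrary illustrative figure, not a measurement, and the function name is hypothetical.

```c
/*
 * Years until an offset counter of the given width is used up, at a constant
 * rate of newly created multixact members per second.
 */
double
sketch_years_to_exhaust(int offset_bits, double members_per_second)
{
	const double seconds_per_year = 365.25 * 24.0 * 3600.0;
	double		space = 1.0;

	/* 2^offset_bits; powers of two are represented exactly in a double */
	for (int i = 0; i < offset_bits; i++)
		space *= 2.0;

	return space / members_per_second / seconds_per_year;
}
```

At that assumed rate a 32-bit counter lasts a little over an hour, while a 64-bit counter lasts roughly 584,000 years, which is the same kind of margin that makes LSN wraparound a non-issue.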

If we accept that, we don't need to worry about case 3 anymore. But if
we upgrade wrapped-around members files by just renaming them, there
could still be a members array where we had skipped offset 0, and
reading that after the upgrade might get confused. We could continue to
ignore a 0 XID in the members array like the comment says; I think that
would be enough. But yeah, maybe it's better to bite the bullet in
pg_upgrade and squeeze those out.
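
"Squeezing those out" could look roughly like the following. This is only a sketch under simplifying assumptions: members are modeled as a bare array of 32-bit XIDs without their flag bits, and `sketch_squeeze_members` is a hypothetical helper, not a function from the patch or from pg_upgrade.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Copy members from src to dst, dropping zero XIDs left behind by the skipped
 * slot 0 at 32-bit wraparound.  dst must have room for n entries.  Returns
 * the number of members actually written.
 */
size_t
sketch_squeeze_members(const uint32_t *src, size_t n, uint32_t *dst)
{
	size_t		out = 0;

	for (size_t i = 0; i < n; i++)
	{
		if (src[i] != 0)		/* zero is never a valid member XID */
			dst[out++] = src[i];
	}
	return out;
}
```

After such a pass the new members array is dense, so the reader no longer needs the "ignore a zero member" special case for upgraded data.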

Does your upgrade test suite include case 3, where the next multixact's
offset is 1?

Can we remove MaybeExtendOffsetSlru() now? There are a bunch of other
comments and checks that talk about binary-upgraded values too that we
can hopefully clean up now.

If we are to parse the member segments in detail in upgrade anyway, I'd
be tempted to make some further changes / optimizations:

- You could leave out all locking XID members in upgrade, because
they're not relevant after upgrade any more (all the XIDs will be
committed or aborted and have released the locks; we require prepared
transactions to be completed before upgrading too). It'd be enough to
include actual UPDATE/DELETE XIDs.

- The way we determine the length of the members array by looking at the
next multixid's offset is a bit complicated. We could have one extra
flag per XID in the members to indicate "this is the last member of this
multixid". That could either replace the current mechanism of looking
at the next offset, or be just an additional cross-check.

- Do we still like the "group" representation, with 4 bytes of flags
followed by 4 XIDs? I wonder if it'd be better to just store 5 bytes per
XID unaligned.

- A more radical idea: There can be only one updating XID in one
multixid. We could store that directly in the offsets SLRU, and keep
only the locking XIDs in members. That way, the members SLRU would
become less critical; it could be safely reset on crash for example
(except for prepared transactions, which could still be holding locks,
but it'd still be less serious). Separating correctness-critical data
from more ephemeral state is generally a good idea.

I'm not insisting on any of these changes, just some things that might
be worth considering if we're rewriting the SLRUs on upgrade anyway.
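
To put numbers on the "group" versus "5 bytes per XID" question, here is a small sketch of the per-page capacity of both layouts. It assumes the default 8 kB block size and 4-byte XIDs; the function names are illustrative only.

```c
#include <stdint.h>

enum
{
	SKETCH_BLCKSZ = 8192,		/* assumed default block size */
	SKETCH_XID_BYTES = 4,
	SKETCH_FLAG_BYTES = 1,
	SKETCH_MEMBERS_PER_GROUP = 4
};

/* Current layout: 4 flag bytes followed by 4 XIDs, i.e. 20 bytes per group. */
uint32_t
sketch_members_per_page_grouped(void)
{
	uint32_t	group_bytes = SKETCH_MEMBERS_PER_GROUP *
		(SKETCH_FLAG_BYTES + SKETCH_XID_BYTES);

	return (SKETCH_BLCKSZ / group_bytes) * SKETCH_MEMBERS_PER_GROUP;
}

/* Alternative layout: one flag byte plus one XID, stored unaligned. */
uint32_t
sketch_members_per_page_packed(void)
{
	return SKETCH_BLCKSZ / (SKETCH_FLAG_BYTES + SKETCH_XID_BYTES);
}
```

Under these assumptions the grouped layout holds 1636 members per page and the packed layout 1638, so the difference is in access simplicity and alignment rather than in space.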

--
Heikki Linnakangas
Neon (https://neon.tech)

#23Maxim Orlov
orlovmg@gmail.com
In reply to: Heikki Linnakangas (#22)
Re: POC: make mxidoff 64 bits

On Fri, 15 Nov 2024 at 14:06, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Hmm, so if I understand correctly, this is related to how we determine
the length of the members array, by looking at the next multixid's
offset. This is explained in GetMultiXactIdMembers:

Correct.

If we accept that, we don't need to worry about case 3 anymore. But if
we upgrade wrapped-around members files by just renaming them, there
could still be a members array where we had skipped offset 0, and
reading that after the upgrade might get confused. We could continue to
ignore a 0 XID in the members array like the comment says; I think that
would be enough. But yeah, maybe it's better to bite the bullet in
pg_upgrade and squeeze those out.

Correct. I couldn't explain this better. I'm more in favor of squeezing
those out. Otherwise, we end up adding another hack to multixact, while
one of the benefits of switching to 64 bits is that it should make the XID
logic more straightforward. After all, mxact juggling in pg_upgrade is a
one-time inconvenience.

Does your upgrade test suite include case 3, where the next multixact's
offset is 1?

Not exactly.

simple
Latest checkpoint's NextMultiXactId: 119441
Latest checkpoint's NextMultiOffset: 5927049

offset-wrap
Latest checkpoint's NextMultiXactId: 119441
Latest checkpoint's NextMultiOffset: 5591183

multi-wrap
Latest checkpoint's NextMultiXactId: 82006
Latest checkpoint's NextMultiOffset: 7408811

offset-multi-wrap
Latest checkpoint's NextMultiXactId: 52146
Latest checkpoint's NextMultiOffset: 5591183

Do you want a test case where NextMultiOffset will be 1?

Can we remove MaybeExtendOffsetSlru() now? There are a bunch of other
comments and checks that talk about binary-upgraded values too that we
can hopefully clean up now.

Yes, technically we can. But this is kind of unrelated to the offsets and
would make the patch set significantly more complicated, and thus harder to
review and less likely to be committed. Again, I'm not opposing the idea;
I'm just not sure it is worth doing right now.

If we are to parse the member segments in detail in upgrade anyway, I'd
be tempted to make some further changes / optimizations:

- You could leave out all locking XID members in upgrade, because
they're not relevant after upgrade any more (all the XIDs will be
committed or aborted and have released the locks; we require prepared
transactions to be completed before upgrading too). It'd be enough to
include actual UPDATE/DELETE XIDs.

- The way we determine the length of the members array by looking at the
next multixid's offset is a bit complicated. We could have one extra
flag per XID in the members to indicate "this is the last member of this
multixid". That could either replace the current mechanism of looking
at the next offset, or be just an additional cross-check.

- Do we still like the "group" representation, with 4 bytes of flags
followed by 4 XIDs? I wonder if it'd be better to just store 5 bytes per
XID unaligned.

Not really. But I would leave it for the next iteration: switching multi to
64 bits. I already have some drafts for this. In any case, we'll have to
make adjustments in pg_upgrade again. My goal is to move towards 64-bit
XIDs, but in small steps, and I plan changes to the "group" representation
in combination with switching multi to 64 bits. This seems a bit more
appropriate in my view.

As for your optimization suggestions, I like them. I'm not against them,
but I'm afraid of disrupting the clarity of thought, especially since the
algorithm is not the simplest.

--
Best regards,
Maxim Orlov.

#24Maxim Orlov
orlovmg@gmail.com
In reply to: Maxim Orlov (#23)
8 attachment(s)
Re: POC: make mxidoff 64 bits

Shame on me! I've sent an erroneous patch set. Version 7 is defective. Here
is the proper version v8 with minor refactoring in segresize.c.

Also, I renamed the catversion bump patch to .txt in order not to break cfbot.

--
Best regards,
Maxim Orlov.

Attachments:

v8-0005-TEST-bump-catver.patch.txt (text/plain; charset=US-ASCII)
From 73b8663093ff1c58def9a80abab142a12c993bf6 Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Wed, 13 Nov 2024 16:34:34 +0300
Subject: [PATCH v8 5/5] TEST: bump catver

---
 src/bin/pg_upgrade/pg_upgrade.h  | 2 +-
 src/include/catalog/catversion.h | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index 2c85ec1e94..18faedc963 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -119,7 +119,7 @@ extern char *output_files[];
  *
  * XXX: should be changed to the actual CATALOG_VERSION_NO on commit.
  */
-#define MULTIXACTOFFSET_FORMATCHANGE_CAT_VER 202409041
+#define MULTIXACTOFFSET_FORMATCHANGE_CAT_VER 202411112
 
 /*
  * large object chunk size added to pg_controldata,
diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h
index 5dd91e190a..3d09caf5ae 100644
--- a/src/include/catalog/catversion.h
+++ b/src/include/catalog/catversion.h
@@ -57,6 +57,6 @@
  */
 
 /*							yyyymmddN */
-#define CATALOG_VERSION_NO	202411111
+#define CATALOG_VERSION_NO	202411112
 
 #endif
-- 
2.43.0
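
The hunk above only changes the threshold constant; it does not show where the constant is consulted. Presumably pg_upgrade compares the old cluster's catalog version against it to decide whether the offsets SLRU needs converting. A minimal sketch of such a check, where both the helper name and the comparison itself are assumptions rather than code from the patch:

```c
#include <stdbool.h>
#include <stdint.h>

/* Threshold value taken from the test patch above. */
#define SKETCH_FORMATCHANGE_CAT_VER 202411112u

/*
 * Hypothetical check: clusters created before the format change still have
 * 32-bit offsets on disk and would need their offsets SLRU rewritten.
 */
bool
sketch_needs_offset_conversion(uint32_t old_cat_ver)
{
	return old_cat_ver < SKETCH_FORMATCHANGE_CAT_VER;
}
```

Bumping both the threshold and CATALOG_VERSION_NO to the same value, as this test patch does, makes any cluster initialized before the patch (for example at 202411111) take the conversion path.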

v8-0002-Use-64-bit-multixact-offsets.patch (application/octet-stream)
From ad9a1509fd5cd68838169b3465ab4c5f9827a4e3 Mon Sep 17 00:00:00 2001
From: Maxim Orlov <m.orlov@postgrespro.ru>
Date: Wed, 6 Mar 2024 11:11:33 +0300
Subject: [PATCH v8 2/5] Use 64-bit multixact offsets.

Author: Maxim Orlov <orlovmg@gmail.com>
---
 src/backend/access/transam/multixact.c | 172 +------------------------
 src/bin/pg_resetwal/pg_resetwal.c      |   2 +-
 src/bin/pg_resetwal/t/001_basic.pl     |   2 +-
 src/include/access/multixact.h         |   2 +-
 src/include/c.h                        |   2 +-
 5 files changed, 11 insertions(+), 169 deletions(-)

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index ab90912ed3..c51e03e832 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -96,14 +96,6 @@
 /*
  * Defines for MultiXactOffset page sizes.  A page is the same BLCKSZ as is
  * used everywhere else in Postgres.
- *
- * Note: because MultiXactOffsets are 32 bits and wrap around at 0xFFFFFFFF,
- * MultiXact page numbering also wraps around at
- * 0xFFFFFFFF/MULTIXACT_OFFSETS_PER_PAGE, and segment numbering at
- * 0xFFFFFFFF/MULTIXACT_OFFSETS_PER_PAGE/SLRU_PAGES_PER_SEGMENT.  We need
- * take no explicit notice of that fact in this module, except when comparing
- * segment and page numbers in TruncateMultiXact (see
- * MultiXactOffsetPagePrecedes).
  */
 
 /* We need four bytes per offset */
@@ -272,9 +264,6 @@ typedef struct MultiXactStateData
 	MultiXactId multiStopLimit;
 	MultiXactId multiWrapLimit;
 
-	/* support for members anti-wraparound measures */
-	MultiXactOffset offsetStopLimit;	/* known if oldestOffsetKnown */
-
 	/*
 	 * This is used to sleep until a multixact offset is written when we want
 	 * to create the next one.
@@ -409,8 +398,6 @@ static bool MultiXactOffsetPrecedes(MultiXactOffset offset1,
 									MultiXactOffset offset2);
 static void ExtendMultiXactOffset(MultiXactId multi);
 static void ExtendMultiXactMember(MultiXactOffset offset, int nmembers);
-static bool MultiXactOffsetWouldWrap(MultiXactOffset boundary,
-									 MultiXactOffset start, uint32 distance);
 static bool SetOffsetVacuumLimit(bool is_startup);
 static bool find_multixact_start(MultiXactId multi, MultiXactOffset *result);
 static void WriteMZeroPageXlogRec(int64 pageno, uint8 info);
@@ -1164,78 +1151,6 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 	else
 		*offset = nextOffset;
 
-	/*----------
-	 * Protect against overrun of the members space as well, with the
-	 * following rules:
-	 *
-	 * If we're past offsetStopLimit, refuse to generate more multis.
-	 * If we're close to offsetStopLimit, emit a warning.
-	 *
-	 * Arbitrarily, we start emitting warnings when we're 20 segments or less
-	 * from offsetStopLimit.
-	 *
-	 * Note we haven't updated the shared state yet, so if we fail at this
-	 * point, the multixact ID we grabbed can still be used by the next guy.
-	 *
-	 * Note that there is no point in forcing autovacuum runs here: the
-	 * multixact freeze settings would have to be reduced for that to have any
-	 * effect.
-	 *----------
-	 */
-#define OFFSET_WARN_SEGMENTS	20
-	if (MultiXactState->oldestOffsetKnown &&
-		MultiXactOffsetWouldWrap(MultiXactState->offsetStopLimit, nextOffset,
-								 nmembers))
-	{
-		/* see comment in the corresponding offsets wraparound case */
-		SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER);
-
-		ereport(ERROR,
-				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg("multixact \"members\" limit exceeded"),
-				 errdetail_plural("This command would create a multixact with %u members, but the remaining space is only enough for %u member.",
-								  "This command would create a multixact with %u members, but the remaining space is only enough for %u members.",
-								  MultiXactState->offsetStopLimit - nextOffset - 1,
-								  nmembers,
-								  MultiXactState->offsetStopLimit - nextOffset - 1),
-				 errhint("Execute a database-wide VACUUM in database with OID %u with reduced \"vacuum_multixact_freeze_min_age\" and \"vacuum_multixact_freeze_table_age\" settings.",
-						 MultiXactState->oldestMultiXactDB)));
-	}
-
-	/*
-	 * Check whether we should kick autovacuum into action, to prevent members
-	 * wraparound. NB we use a much larger window to trigger autovacuum than
-	 * just the warning limit. The warning is just a measure of last resort -
-	 * this is in line with GetNewTransactionId's behaviour.
-	 */
-	if (!MultiXactState->oldestOffsetKnown ||
-		(MultiXactState->nextOffset - MultiXactState->oldestOffset
-		 > MULTIXACT_MEMBER_SAFE_THRESHOLD))
-	{
-		/*
-		 * To avoid swamping the postmaster with signals, we issue the autovac
-		 * request only when crossing a segment boundary. With default
-		 * compilation settings that's roughly after 50k members.  This still
-		 * gives plenty of chances before we get into real trouble.
-		 */
-		if ((MXOffsetToMemberPage(nextOffset) / SLRU_PAGES_PER_SEGMENT) !=
-			(MXOffsetToMemberPage(nextOffset + nmembers) / SLRU_PAGES_PER_SEGMENT))
-			SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER);
-	}
-
-	if (MultiXactState->oldestOffsetKnown &&
-		MultiXactOffsetWouldWrap(MultiXactState->offsetStopLimit,
-								 nextOffset,
-								 nmembers + MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT * OFFSET_WARN_SEGMENTS))
-		ereport(WARNING,
-				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg_plural("database with OID %u must be vacuumed before %d more multixact member is used",
-							   "database with OID %u must be vacuumed before %d more multixact members are used",
-							   MultiXactState->offsetStopLimit - nextOffset + nmembers,
-							   MultiXactState->oldestMultiXactDB,
-							   MultiXactState->offsetStopLimit - nextOffset + nmembers),
-				 errhint("Execute a database-wide VACUUM in that database with reduced \"vacuum_multixact_freeze_min_age\" and \"vacuum_multixact_freeze_table_age\" settings.")));
-
 	ExtendMultiXactMember(nextOffset, nmembers);
 
 	/*
@@ -1976,7 +1891,7 @@ MultiXactShmemInit(void)
 				  "pg_multixact/offsets", LWTRANCHE_MULTIXACTOFFSET_BUFFER,
 				  LWTRANCHE_MULTIXACTOFFSET_SLRU,
 				  SYNC_HANDLER_MULTIXACT_OFFSET,
-				  false);
+				  true);
 	SlruPagePrecedesUnitTests(MultiXactOffsetCtl, MULTIXACT_OFFSETS_PER_PAGE);
 	SimpleLruInit(MultiXactMemberCtl,
 				  "multixact_member", multixact_member_buffers, 0,
@@ -2721,8 +2636,6 @@ SetOffsetVacuumLimit(bool is_startup)
 	MultiXactOffset nextOffset;
 	bool		oldestOffsetKnown = false;
 	bool		prevOldestOffsetKnown;
-	MultiXactOffset offsetStopLimit = 0;
-	MultiXactOffset prevOffsetStopLimit;
 
 	/*
 	 * NB: Have to prevent concurrent truncation, we might otherwise try to
@@ -2737,7 +2650,6 @@ SetOffsetVacuumLimit(bool is_startup)
 	nextOffset = MultiXactState->nextOffset;
 	prevOldestOffsetKnown = MultiXactState->oldestOffsetKnown;
 	prevOldestOffset = MultiXactState->oldestOffset;
-	prevOffsetStopLimit = MultiXactState->offsetStopLimit;
 	Assert(MultiXactState->finishedStartup);
 	LWLockRelease(MultiXactGenLock);
 
@@ -2768,11 +2680,7 @@ SetOffsetVacuumLimit(bool is_startup)
 		oldestOffsetKnown =
 			find_multixact_start(oldestMultiXactId, &oldestOffset);
 
-		if (oldestOffsetKnown)
-			ereport(DEBUG1,
-					(errmsg_internal("oldest MultiXactId member is at offset %u",
-									 oldestOffset)));
-		else
+		if (!oldestOffsetKnown)
 			ereport(LOG,
 					(errmsg("MultiXact member wraparound protections are disabled because oldest checkpointed MultiXact %u does not exist on disk",
 							oldestMultiXactId)));
@@ -2785,24 +2693,7 @@ SetOffsetVacuumLimit(bool is_startup)
 	 * overrun of old data in the members SLRU area. We can only do so if the
 	 * oldest offset is known though.
 	 */
-	if (oldestOffsetKnown)
-	{
-		/* move back to start of the corresponding segment */
-		offsetStopLimit = oldestOffset - (oldestOffset %
-										  (MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT));
-
-		/* always leave one segment before the wraparound point */
-		offsetStopLimit -= (MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT);
-
-		if (!prevOldestOffsetKnown && !is_startup)
-			ereport(LOG,
-					(errmsg("MultiXact member wraparound protections are now enabled")));
-
-		ereport(DEBUG1,
-				(errmsg_internal("MultiXact member stop limit is now %u based on MultiXact %u",
-								 offsetStopLimit, oldestMultiXactId)));
-	}
-	else if (prevOldestOffsetKnown)
+	if (prevOldestOffsetKnown)
 	{
 		/*
 		 * If we failed to get the oldest offset this time, but we have a
@@ -2812,14 +2703,12 @@ SetOffsetVacuumLimit(bool is_startup)
 		 */
 		oldestOffset = prevOldestOffset;
 		oldestOffsetKnown = true;
-		offsetStopLimit = prevOffsetStopLimit;
 	}
 
 	/* Install the computed values */
 	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
 	MultiXactState->oldestOffset = oldestOffset;
 	MultiXactState->oldestOffsetKnown = oldestOffsetKnown;
-	MultiXactState->offsetStopLimit = offsetStopLimit;
 	LWLockRelease(MultiXactGenLock);
 
 	/*
@@ -2829,54 +2718,6 @@ SetOffsetVacuumLimit(bool is_startup)
 		(nextOffset - oldestOffset > MULTIXACT_MEMBER_SAFE_THRESHOLD);
 }
 
-/*
- * Return whether adding "distance" to "start" would move past "boundary".
- *
- * We use this to determine whether the addition is "wrapping around" the
- * boundary point, hence the name.  The reason we don't want to use the regular
- * 2^31-modulo arithmetic here is that we want to be able to use the whole of
- * the 2^32-1 space here, allowing for more multixacts than would fit
- * otherwise.
- */
-static bool
-MultiXactOffsetWouldWrap(MultiXactOffset boundary, MultiXactOffset start,
-						 uint32 distance)
-{
-	MultiXactOffset finish;
-
-	/*
-	 * Note that offset number 0 is not used (see GetMultiXactIdMembers), so
-	 * if the addition wraps around the UINT_MAX boundary, skip that value.
-	 */
-	finish = start + distance;
-	if (finish < start)
-		finish++;
-
-	/*-----------------------------------------------------------------------
-	 * When the boundary is numerically greater than the starting point, any
-	 * value numerically between the two is not wrapped:
-	 *
-	 *	<----S----B---->
-	 *	[---)			 = F wrapped past B (and UINT_MAX)
-	 *		 [---)		 = F not wrapped
-	 *			  [----] = F wrapped past B
-	 *
-	 * When the boundary is numerically less than the starting point (i.e. the
-	 * UINT_MAX wraparound occurs somewhere in between) then all values in
-	 * between are wrapped:
-	 *
-	 *	<----B----S---->
-	 *	[---)			 = F not wrapped past B (but wrapped past UINT_MAX)
-	 *		 [---)		 = F wrapped past B (and UINT_MAX)
-	 *			  [----] = F not wrapped
-	 *-----------------------------------------------------------------------
-	 */
-	if (start < boundary)
-		return finish >= boundary || finish < start;
-	else
-		return finish >= boundary && finish < start;
-}
-
 /*
  * Find the starting offset of the given MultiXactId.
  *
@@ -2998,8 +2839,9 @@ MultiXactMemberFreezeThreshold(void)
 	 * we try to eliminate from the system is based on how far we are past
 	 * MULTIXACT_MEMBER_SAFE_THRESHOLD.
 	 */
-	fraction = (double) (members - MULTIXACT_MEMBER_SAFE_THRESHOLD) /
-		(MULTIXACT_MEMBER_DANGER_THRESHOLD - MULTIXACT_MEMBER_SAFE_THRESHOLD);
+	fraction = (double) (members - MULTIXACT_MEMBER_SAFE_THRESHOLD);
+	fraction /= (double) (MULTIXACT_MEMBER_DANGER_THRESHOLD - MULTIXACT_MEMBER_SAFE_THRESHOLD);
+
 	victim_multixacts = multixacts * fraction;
 
 	/* fraction could be > 1.0, but lowest possible freeze age is zero */
@@ -3345,7 +3187,7 @@ MultiXactIdPrecedesOrEquals(MultiXactId multi1, MultiXactId multi2)
 static bool
 MultiXactOffsetPrecedes(MultiXactOffset offset1, MultiXactOffset offset2)
 {
-	int32		diff = (int32) (offset1 - offset2);
+	int64		diff = (int64) (offset1 - offset2);
 
 	return (diff < 0);
 }
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index 985cd06802..1af2ce4b93 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -264,7 +264,7 @@ main(int argc, char *argv[])
 
 			case 'O':
 				errno = 0;
-				set_mxoff = strtoul(optarg, &endptr, 0);
+				set_mxoff = strtou64(optarg, &endptr, 0);
 				if (endptr == optarg || *endptr != '\0' || errno != 0)
 				{
 					pg_log_error("invalid argument for option %s", "-O");
diff --git a/src/bin/pg_resetwal/t/001_basic.pl b/src/bin/pg_resetwal/t/001_basic.pl
index 9829e48106..f8a8eef44d 100644
--- a/src/bin/pg_resetwal/t/001_basic.pl
+++ b/src/bin/pg_resetwal/t/001_basic.pl
@@ -206,7 +206,7 @@ push @cmd,
   sprintf("%d,%d", hex($files[0]) == 0 ? 3 : hex($files[0]), hex($files[-1]));
 
 @files = get_slru_files('pg_multixact/offsets');
-$mult = 32 * $blcksz / 4;
+$mult = 32 * $blcksz / 8;
 # -m argument is "new,old"
 push @cmd, '-m',
   sprintf("%d,%d",
diff --git a/src/include/access/multixact.h b/src/include/access/multixact.h
index 7ffd256c74..90583634ec 100644
--- a/src/include/access/multixact.h
+++ b/src/include/access/multixact.h
@@ -27,7 +27,7 @@
 
 #define MultiXactIdIsValid(multi) ((multi) != InvalidMultiXactId)
 
-#define MaxMultiXactOffset	((MultiXactOffset) 0xFFFFFFFF)
+#define MaxMultiXactOffset	UINT64CONST(0xFFFFFFFFFFFFFFFF)
 
 /*
  * Possible multixact lock modes ("status").  The first four modes are for
diff --git a/src/include/c.h b/src/include/c.h
index 0a548d69d7..e1b3187d0b 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -664,7 +664,7 @@ typedef uint32 SubTransactionId;
 /* MultiXactId must be equivalent to TransactionId, to fit in t_xmax */
 typedef TransactionId MultiXactId;
 
-typedef uint32 MultiXactOffset;
+typedef uint64 MultiXactOffset;
 
 typedef uint32 CommandId;
 
-- 
2.43.0
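
A note for readers of the patch above: widening MultiXactOffset from 4 to 8 bytes halves the number of entries that fit in one offsets page, which is why the pg_resetwal test multiplier changes from 32 * $blcksz / 4 to 32 * $blcksz / 8. The wraparound comparison likewise takes the sign of a 64-bit difference instead of a 32-bit one. Below is a minimal standalone sketch in C, not the PostgreSQL code itself. The names PAGES_PER_SEGMENT, offsets_per_segment and offset_precedes64 are made up for illustration, and the unsigned-to-signed cast assumes the usual two's-complement behaviour.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed to match SLRU_PAGES_PER_SEGMENT from slru.h. */
#define PAGES_PER_SEGMENT 32

/*
 * Number of offset entries held by one SLRU segment, given the block size
 * and the width of one entry in bytes (4 before the patch, 8 after).
 */
static uint64_t
offsets_per_segment(uint64_t blcksz, uint64_t entry_size)
{
	assert(entry_size > 0 && blcksz % entry_size == 0);
	return PAGES_PER_SEGMENT * (blcksz / entry_size);
}

/*
 * Wraparound-aware "a precedes b" for 64-bit offsets: compute the modular
 * difference and look at its sign, as the patched comparison does.
 */
static bool
offset_precedes64(uint64_t a, uint64_t b)
{
	int64_t		diff = (int64_t) (a - b);

	return diff < 0;
}
```

With the default 8192-byte block size this gives 65536 entries per segment for 4-byte offsets and 32768 for 8-byte offsets.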

Attachment: generate.sh (application/x-sh)
Attachment: test.sh (application/x-sh)
Attachment: multixids.py (text/x-python-script; charset=US-ASCII)
Attachment: v8-0001-Use-64-bit-format-output-for-multixact-offsets.patch (application/octet-stream)
From fdb7e2eee33dfb5df714d8d16112d4c907475d78 Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Wed, 7 Aug 2024 16:35:22 +0300
Subject: [PATCH v8 1/5] Use 64-bit format output for multixact offsets

Author: Maxim Orlov <orlovmg@gmail.com>
---
 src/backend/access/rmgrdesc/mxactdesc.c   |  9 ++++----
 src/backend/access/rmgrdesc/xlogdesc.c    |  4 ++--
 src/backend/access/transam/multixact.c    | 26 +++++++++++++----------
 src/backend/access/transam/xlogrecovery.c |  5 +++--
 src/bin/pg_controldata/pg_controldata.c   |  4 ++--
 src/bin/pg_resetwal/pg_resetwal.c         |  8 +++----
 6 files changed, 31 insertions(+), 25 deletions(-)

diff --git a/src/backend/access/rmgrdesc/mxactdesc.c b/src/backend/access/rmgrdesc/mxactdesc.c
index 3e8ad4d5ef..1b486de38c 100644
--- a/src/backend/access/rmgrdesc/mxactdesc.c
+++ b/src/backend/access/rmgrdesc/mxactdesc.c
@@ -65,8 +65,8 @@ multixact_desc(StringInfo buf, XLogReaderState *record)
 		xl_multixact_create *xlrec = (xl_multixact_create *) rec;
 		int			i;
 
-		appendStringInfo(buf, "%u offset %u nmembers %d: ", xlrec->mid,
-						 xlrec->moff, xlrec->nmembers);
+		appendStringInfo(buf, "%u offset %llu nmembers %d: ", xlrec->mid,
+						 (unsigned long long) xlrec->moff, xlrec->nmembers);
 		for (i = 0; i < xlrec->nmembers; i++)
 			out_member(buf, &xlrec->members[i]);
 	}
@@ -74,9 +74,10 @@ multixact_desc(StringInfo buf, XLogReaderState *record)
 	{
 		xl_multixact_truncate *xlrec = (xl_multixact_truncate *) rec;
 
-		appendStringInfo(buf, "offsets [%u, %u), members [%u, %u)",
+		appendStringInfo(buf, "offsets [%u, %u), members [%llu, %llu)",
 						 xlrec->startTruncOff, xlrec->endTruncOff,
-						 xlrec->startTruncMemb, xlrec->endTruncMemb);
+						 (unsigned long long) xlrec->startTruncMemb,
+						 (unsigned long long) xlrec->endTruncMemb);
 	}
 }
 
diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c
index 363294d623..aaa19c81c8 100644
--- a/src/backend/access/rmgrdesc/xlogdesc.c
+++ b/src/backend/access/rmgrdesc/xlogdesc.c
@@ -66,7 +66,7 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 		CheckPoint *checkpoint = (CheckPoint *) rec;
 
 		appendStringInfo(buf, "redo %X/%X; "
-						 "tli %u; prev tli %u; fpw %s; wal_level %s; xid %u:%u; oid %u; multi %u; offset %u; "
+						 "tli %u; prev tli %u; fpw %s; wal_level %s; xid %u:%u; oid %u; multi %u; offset %llu; "
 						 "oldest xid %u in DB %u; oldest multi %u in DB %u; "
 						 "oldest/newest commit timestamp xid: %u/%u; "
 						 "oldest running xid %u; %s",
@@ -79,7 +79,7 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 						 XidFromFullTransactionId(checkpoint->nextXid),
 						 checkpoint->nextOid,
 						 checkpoint->nextMulti,
-						 checkpoint->nextMultiOffset,
+						 (unsigned long long) checkpoint->nextMultiOffset,
 						 checkpoint->oldestXid,
 						 checkpoint->oldestXidDB,
 						 checkpoint->oldestMulti,
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 8c37d7eba7..ab90912ed3 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1264,7 +1264,8 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 
 	LWLockRelease(MultiXactGenLock);
 
-	debug_elog4(DEBUG2, "GetNew: returning %u offset %u", result, *offset);
+	debug_elog4(DEBUG2, "GetNew: returning %u offset %llu", result,
+				(unsigned long long) *offset);
 	return result;
 }
 
@@ -2293,8 +2294,9 @@ MultiXactGetCheckptMulti(bool is_shutdown,
 	LWLockRelease(MultiXactGenLock);
 
 	debug_elog6(DEBUG2,
-				"MultiXact: checkpoint is nextMulti %u, nextOffset %u, oldestMulti %u in DB %u",
-				*nextMulti, *nextMultiOffset, *oldestMulti, *oldestMultiDB);
+				"MultiXact: checkpoint is nextMulti %u, nextOffset %llu, oldestMulti %u in DB %u",
+				*nextMulti, (unsigned long long) *nextMultiOffset, *oldestMulti,
+				*oldestMultiDB);
 }
 
 /*
@@ -2328,8 +2330,8 @@ void
 MultiXactSetNextMXact(MultiXactId nextMulti,
 					  MultiXactOffset nextMultiOffset)
 {
-	debug_elog4(DEBUG2, "MultiXact: setting next multi to %u offset %u",
-				nextMulti, nextMultiOffset);
+	debug_elog4(DEBUG2, "MultiXact: setting next multi to %u offset %llu",
+				nextMulti, (unsigned long long) nextMultiOffset);
 	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
 	MultiXactState->nextMXact = nextMulti;
 	MultiXactState->nextOffset = nextMultiOffset;
@@ -2519,8 +2521,8 @@ MultiXactAdvanceNextMXact(MultiXactId minMulti,
 	}
 	if (MultiXactOffsetPrecedes(MultiXactState->nextOffset, minMultiOffset))
 	{
-		debug_elog3(DEBUG2, "MultiXact: setting next offset to %u",
-					minMultiOffset);
+		debug_elog3(DEBUG2, "MultiXact: setting next offset to %llu",
+					(unsigned long long) minMultiOffset);
 		MultiXactState->nextOffset = minMultiOffset;
 	}
 	LWLockRelease(MultiXactGenLock);
@@ -3211,11 +3213,12 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 
 	elog(DEBUG1, "performing multixact truncation: "
 		 "offsets [%u, %u), offsets segments [%llx, %llx), "
-		 "members [%u, %u), members segments [%llx, %llx)",
+		 "members [%llu, %llu), members segments [%llx, %llx)",
 		 oldestMulti, newOldestMulti,
 		 (unsigned long long) MultiXactIdToOffsetSegment(oldestMulti),
 		 (unsigned long long) MultiXactIdToOffsetSegment(newOldestMulti),
-		 oldestOffset, newOldestOffset,
+		 (unsigned long long) oldestOffset,
+		 (unsigned long long) newOldestOffset,
 		 (unsigned long long) MXOffsetToMemberSegment(oldestOffset),
 		 (unsigned long long) MXOffsetToMemberSegment(newOldestOffset));
 
@@ -3471,11 +3474,12 @@ multixact_redo(XLogReaderState *record)
 
 		elog(DEBUG1, "replaying multixact truncation: "
 			 "offsets [%u, %u), offsets segments [%llx, %llx), "
-			 "members [%u, %u), members segments [%llx, %llx)",
+			 "members [%llu, %llu), members segments [%llx, %llx)",
 			 xlrec.startTruncOff, xlrec.endTruncOff,
 			 (unsigned long long) MultiXactIdToOffsetSegment(xlrec.startTruncOff),
 			 (unsigned long long) MultiXactIdToOffsetSegment(xlrec.endTruncOff),
-			 xlrec.startTruncMemb, xlrec.endTruncMemb,
+			 (unsigned long long) xlrec.startTruncMemb,
+			 (unsigned long long) xlrec.endTruncMemb,
 			 (unsigned long long) MXOffsetToMemberSegment(xlrec.startTruncMemb),
 			 (unsigned long long) MXOffsetToMemberSegment(xlrec.endTruncMemb));
 
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 05c738d661..727b6e744f 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -876,8 +876,9 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 							 U64FromFullTransactionId(checkPoint.nextXid),
 							 checkPoint.nextOid)));
 	ereport(DEBUG1,
-			(errmsg_internal("next MultiXactId: %u; next MultiXactOffset: %u",
-							 checkPoint.nextMulti, checkPoint.nextMultiOffset)));
+			(errmsg_internal("next MultiXactId: %u; next MultiXactOffset: %llu",
+							 checkPoint.nextMulti,
+							 (unsigned long long) checkPoint.nextMultiOffset)));
 	ereport(DEBUG1,
 			(errmsg_internal("oldest unfrozen transaction ID: %u, in database %u",
 							 checkPoint.oldestXid, checkPoint.oldestXidDB)));
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index 93a05d80ca..43b6727570 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -253,8 +253,8 @@ main(int argc, char *argv[])
 		   ControlFile->checkPointCopy.nextOid);
 	printf(_("Latest checkpoint's NextMultiXactId:  %u\n"),
 		   ControlFile->checkPointCopy.nextMulti);
-	printf(_("Latest checkpoint's NextMultiOffset:  %u\n"),
-		   ControlFile->checkPointCopy.nextMultiOffset);
+	printf(_("Latest checkpoint's NextMultiOffset:  %llu\n"),
+		   (unsigned long long) ControlFile->checkPointCopy.nextMultiOffset);
 	printf(_("Latest checkpoint's oldestXID:        %u\n"),
 		   ControlFile->checkPointCopy.oldestXid);
 	printf(_("Latest checkpoint's oldestXID's DB:   %u\n"),
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index e9dcb5a6d8..985cd06802 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -737,8 +737,8 @@ PrintControlValues(bool guessed)
 		   ControlFile.checkPointCopy.nextOid);
 	printf(_("Latest checkpoint's NextMultiXactId:  %u\n"),
 		   ControlFile.checkPointCopy.nextMulti);
-	printf(_("Latest checkpoint's NextMultiOffset:  %u\n"),
-		   ControlFile.checkPointCopy.nextMultiOffset);
+	printf(_("Latest checkpoint's NextMultiOffset:  %llu\n"),
+		   (unsigned long long) ControlFile.checkPointCopy.nextMultiOffset);
 	printf(_("Latest checkpoint's oldestXID:        %u\n"),
 		   ControlFile.checkPointCopy.oldestXid);
 	printf(_("Latest checkpoint's oldestXID's DB:   %u\n"),
@@ -809,8 +809,8 @@ PrintNewControlValues(void)
 
 	if (set_mxoff != -1)
 	{
-		printf(_("NextMultiOffset:                      %u\n"),
-			   ControlFile.checkPointCopy.nextMultiOffset);
+		printf(_("NextMultiOffset:                      %llu\n"),
+			   (unsigned long long) ControlFile.checkPointCopy.nextMultiOffset);
 	}
 
 	if (set_oid != 0)
-- 
2.43.0
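
The pattern this patch applies throughout is to print the 64-bit offset with %llu and an explicit cast to unsigned long long, so that the format string stays the same regardless of how the platform defines its 64-bit integer type. Below is a minimal standalone sketch in C. The typedef MultiXactOffsetSketch and the function format_next_offset are hypothetical stand-ins, not PostgreSQL identifiers.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for the widened MultiXactOffset type. */
typedef uint64_t MultiXactOffsetSketch;

/*
 * Format an offset into buf.  The cast makes the argument type match %llu
 * exactly, whatever the underlying definition of uint64_t is.
 * Returns the number of characters snprintf would have written.
 */
static int
format_next_offset(char *buf, size_t buflen, MultiXactOffsetSketch off)
{
	assert(buf != NULL && buflen > 0);
	return snprintf(buf, buflen, "NextMultiOffset: %llu",
					(unsigned long long) off);
}
```

A value above 2^32 - 1, which the old %u format could not represent, is then printed in full.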

Attachment: v8-0004-Get-rid-of-MultiXactMemberFreezeThreshold-call.patch (application/octet-stream)
From edff3857e1cb6c67e75be1b00fd5da1cd4bde343 Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Wed, 23 Oct 2024 18:23:39 +0300
Subject: [PATCH v8 4/5] Get rid of MultiXactMemberFreezeThreshold call.

Since MaxMultiXactOffset is UINT64_MAX now, MULTIXACT_MEMBER_SAFE_THRESHOLD and
MULTIXACT_MEMBER_DANGER_THRESHOLD are no longer meaningful. Thus,
MultiXactMemberFreezeThreshold is not needed either.

Instead, switch to a MULTIXACT_MEMBER_AUTOVAC_THRESHOLD (2^32 - 1) members
threshold. It is used to determine whether we need to force autovacuum.

Author: Maxim Orlov <orlovmg@gmail.com>
---
 src/backend/access/transam/multixact.c | 117 +++----------------------
 src/backend/commands/vacuum.c          |   2 +-
 src/backend/postmaster/autovacuum.c    |   4 +-
 src/include/access/multixact.h         |   1 -
 4 files changed, 15 insertions(+), 109 deletions(-)

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index c51e03e832..c1f228c5fb 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -204,10 +204,14 @@ MXOffsetToMemberOffset(MultiXactOffset offset)
 		member_in_group * sizeof(TransactionId);
 }
 
-/* Multixact members wraparound thresholds. */
-#define MULTIXACT_MEMBER_SAFE_THRESHOLD		(MaxMultiXactOffset / 2)
-#define MULTIXACT_MEMBER_DANGER_THRESHOLD	\
-	(MaxMultiXactOffset - MaxMultiXactOffset / 4)
+/*
+ * Multixact members warning threshold.
+ *
+ * If the difference between nextOffset and oldestOffset exceeds this value, we
+ * trigger autovacuum in order to release disk space and reduce table bloat if
+ * possible.
+ */
+#define MULTIXACT_MEMBER_AUTOVAC_THRESHOLD		UINT64CONST(0xFFFFFFFF)
 
 static inline MultiXactId
 PreviousMultiXactId(MultiXactId multi)
@@ -2616,15 +2620,13 @@ GetOldestMultiXactId(void)
 }
 
 /*
- * Determine how aggressively we need to vacuum in order to prevent member
- * wraparound.
+ * Determine whether autovacuum is needed to reclaim multixact member space.
  *
  * To do so determine what's the oldest member offset and install the limit
  * info in MultiXactState, where it can be used to prevent overrun of old data
  * in the members SLRU area.
  *
- * The return value is true if emergency autovacuum is required and false
- * otherwise.
+ * The return value is true if autovacuum is required and false otherwise.
  */
 static bool
 SetOffsetVacuumLimit(bool is_startup)
@@ -2712,10 +2714,10 @@ SetOffsetVacuumLimit(bool is_startup)
 	LWLockRelease(MultiXactGenLock);
 
 	/*
-	 * Do we need an emergency autovacuum?	If we're not sure, assume yes.
+	 * Do we need autovacuum?	If we're not sure, assume yes.
 	 */
 	return !oldestOffsetKnown ||
-		(nextOffset - oldestOffset > MULTIXACT_MEMBER_SAFE_THRESHOLD);
+		(nextOffset - oldestOffset > MULTIXACT_MEMBER_AUTOVAC_THRESHOLD);
 }
 
 /*
@@ -2761,101 +2763,6 @@ find_multixact_start(MultiXactId multi, MultiXactOffset *result)
 	return true;
 }
 
-/*
- * Determine how many multixacts, and how many multixact members, currently
- * exist.  Return false if unable to determine.
- */
-static bool
-ReadMultiXactCounts(uint32 *multixacts, MultiXactOffset *members)
-{
-	MultiXactOffset nextOffset;
-	MultiXactOffset oldestOffset;
-	MultiXactId oldestMultiXactId;
-	MultiXactId nextMultiXactId;
-	bool		oldestOffsetKnown;
-
-	LWLockAcquire(MultiXactGenLock, LW_SHARED);
-	nextOffset = MultiXactState->nextOffset;
-	oldestMultiXactId = MultiXactState->oldestMultiXactId;
-	nextMultiXactId = MultiXactState->nextMXact;
-	oldestOffset = MultiXactState->oldestOffset;
-	oldestOffsetKnown = MultiXactState->oldestOffsetKnown;
-	LWLockRelease(MultiXactGenLock);
-
-	if (!oldestOffsetKnown)
-		return false;
-
-	*members = nextOffset - oldestOffset;
-	*multixacts = nextMultiXactId - oldestMultiXactId;
-	return true;
-}
-
-/*
- * Multixact members can be removed once the multixacts that refer to them
- * are older than every datminmxid.  autovacuum_multixact_freeze_max_age and
- * vacuum_multixact_freeze_table_age work together to make sure we never have
- * too many multixacts; we hope that, at least under normal circumstances,
- * this will also be sufficient to keep us from using too many offsets.
- * However, if the average multixact has many members, we might exhaust the
- * members space while still using few enough members that these limits fail
- * to trigger relminmxid advancement by VACUUM.  At that point, we'd have no
- * choice but to start failing multixact-creating operations with an error.
- *
- * To prevent that, if more than a threshold portion of the members space is
- * used, we effectively reduce autovacuum_multixact_freeze_max_age and
- * to a value just less than the number of multixacts in use.  We hope that
- * this will quickly trigger autovacuuming on the table or tables with the
- * oldest relminmxid, thus allowing datminmxid values to advance and removing
- * some members.
- *
- * As the fraction of the member space currently in use grows, we become
- * more aggressive in clamping this value.  That not only causes autovacuum
- * to ramp up, but also makes any manual vacuums the user issues more
- * aggressive.  This happens because vacuum_get_cutoffs() will clamp the
- * freeze table and the minimum freeze age cutoffs based on the effective
- * autovacuum_multixact_freeze_max_age this function returns.  In the worst
- * case, we'll claim the freeze_max_age to zero, and every vacuum of any
- * table will freeze every multixact.
- */
-int
-MultiXactMemberFreezeThreshold(void)
-{
-	MultiXactOffset members;
-	uint32		multixacts;
-	uint32		victim_multixacts;
-	double		fraction;
-	int			result;
-
-	/* If we can't determine member space utilization, assume the worst. */
-	if (!ReadMultiXactCounts(&multixacts, &members))
-		return 0;
-
-	/* If member space utilization is low, no special action is required. */
-	if (members <= MULTIXACT_MEMBER_SAFE_THRESHOLD)
-		return autovacuum_multixact_freeze_max_age;
-
-	/*
-	 * Compute a target for relminmxid advancement.  The number of multixacts
-	 * we try to eliminate from the system is based on how far we are past
-	 * MULTIXACT_MEMBER_SAFE_THRESHOLD.
-	 */
-	fraction = (double) (members - MULTIXACT_MEMBER_SAFE_THRESHOLD);
-	fraction /= (double) (MULTIXACT_MEMBER_DANGER_THRESHOLD - MULTIXACT_MEMBER_SAFE_THRESHOLD);
-
-	victim_multixacts = multixacts * fraction;
-
-	/* fraction could be > 1.0, but lowest possible freeze age is zero */
-	if (victim_multixacts > multixacts)
-		return 0;
-	result = multixacts - victim_multixacts;
-
-	/*
-	 * Clamp to autovacuum_multixact_freeze_max_age, so that we never make
-	 * autovacuum less aggressive than it would otherwise be.
-	 */
-	return Min(result, autovacuum_multixact_freeze_max_age);
-}
-
 typedef struct mxtruncinfo
 {
 	int64		earliestExistingPage;
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 86f36b3695..e7506e268a 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -1133,7 +1133,7 @@ vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
 	 * normally autovacuum_multixact_freeze_max_age, but may be less if we are
 	 * short of multixact member space.
 	 */
-	effective_multixact_freeze_max_age = MultiXactMemberFreezeThreshold();
+	effective_multixact_freeze_max_age = autovacuum_multixact_freeze_max_age;
 
 	/*
 	 * Almost ready to set freeze output parameters; check if OldestXmin or
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index dc3cf87aba..180bb7e96e 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -1122,7 +1122,7 @@ do_start_worker(void)
 
 	/* Also determine the oldest datminmxid we will consider. */
 	recentMulti = ReadNextMultiXactId();
-	multiForceLimit = recentMulti - MultiXactMemberFreezeThreshold();
+	multiForceLimit = recentMulti - autovacuum_multixact_freeze_max_age;
 	if (multiForceLimit < FirstMultiXactId)
 		multiForceLimit -= FirstMultiXactId;
 
@@ -1915,7 +1915,7 @@ do_autovacuum(void)
 	 * normally autovacuum_multixact_freeze_max_age, but may be less if we are
 	 * short of multixact member space.
 	 */
-	effective_multixact_freeze_max_age = MultiXactMemberFreezeThreshold();
+	effective_multixact_freeze_max_age = autovacuum_multixact_freeze_max_age;
 
 	/*
 	 * Find the pg_database entry and select the default freeze ages. We use
diff --git a/src/include/access/multixact.h b/src/include/access/multixact.h
index 90583634ec..5aefbddce3 100644
--- a/src/include/access/multixact.h
+++ b/src/include/access/multixact.h
@@ -143,7 +143,6 @@ extern void MultiXactSetNextMXact(MultiXactId nextMulti,
 extern void MultiXactAdvanceNextMXact(MultiXactId minMulti,
 									  MultiXactOffset minMultiOffset);
 extern void MultiXactAdvanceOldest(MultiXactId oldestMulti, Oid oldestMultiDB);
-extern int	MultiXactMemberFreezeThreshold(void);
 
 extern void multixact_twophase_recover(TransactionId xid, uint16 info,
 									   void *recdata, uint32 len);
-- 
2.43.0
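
To illustrate the check that remains after this patch: autovacuum is requested when the oldest offset is unknown, or when the number of members in use (nextOffset - oldestOffset) exceeds the new threshold. Below is a minimal standalone sketch in C under those assumptions. MEMBER_AUTOVAC_THRESHOLD and need_member_autovacuum are illustrative names that mirror, but are not, the patched code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mirrors the value the patch gives MULTIXACT_MEMBER_AUTOVAC_THRESHOLD. */
#define MEMBER_AUTOVAC_THRESHOLD UINT64_C(0xFFFFFFFF)

/*
 * Return true if autovacuum should be forced.  If the oldest offset is not
 * known we cannot measure member usage, so we assume the answer is yes.
 */
static bool
need_member_autovacuum(bool oldest_known,
					   uint64_t next_offset,
					   uint64_t oldest_offset)
{
	if (!oldest_known)
		return true;

	assert(next_offset >= oldest_offset);
	return (next_offset - oldest_offset) > MEMBER_AUTOVAC_THRESHOLD;
}
```

Because the comparison is strict, exactly 2^32 - 1 members in use does not trigger autovacuum, while 2^32 does.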

Attachment: v8-0003-Make-pg_upgrade-convert-multixact-offsets.patch (application/octet-stream)
From c324315152346d7f2090aaf79b142726aa2486ae Mon Sep 17 00:00:00 2001
From: Maxim Orlov <m.orlov@postgrespro.ru>
Date: Tue, 13 Aug 2024 14:44:50 +0300
Subject: [PATCH v8 3/5] Make pg_upgrade convert multixact offsets.

Author: Maxim Orlov <orlovmg@gmail.com>
Author: Yura Sokolov <y.sokolov@postgrespro.ru>
---
 src/bin/pg_upgrade/Makefile     |   1 +
 src/bin/pg_upgrade/meson.build  |   1 +
 src/bin/pg_upgrade/pg_upgrade.c |  42 ++-
 src/bin/pg_upgrade/pg_upgrade.h |  14 +-
 src/bin/pg_upgrade/segresize.c  | 541 ++++++++++++++++++++++++++++++++
 5 files changed, 594 insertions(+), 5 deletions(-)
 create mode 100644 src/bin/pg_upgrade/segresize.c
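
Note on the offset rebasing performed by this patch: each old 32-bit offset is shifted so that the oldest surviving offset becomes 1 in the new 64-bit space, and offsets that had wrapped around are first moved up by 2^32 - 1. Below is a minimal standalone sketch in C. It assumes, as the patch's own comment "(1 becomes 2^32)" indicates, that wrapped old offsets skip zero. The function name rebase_offset is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Rebase an old 32-bit member offset into the new 64-bit space so that
 * oldest_offset maps to 1.
 */
static uint64_t
rebase_offset(uint32_t old_offset, uint32_t oldest_offset)
{
	uint64_t	off = old_offset;

	/* An offset below the oldest one must have wrapped around. */
	if (off < oldest_offset)
		off += (UINT64_C(1) << 32) - 1;

	/* After unwrapping, the offset can no longer be below the oldest. */
	assert(off >= oldest_offset);

	return off - oldest_offset + 1;
}
```

For example, with an oldest offset of 0xFFFFFFF0, the sixteen old offsets up to 0xFFFFFFFF map to 1 through 16, and the wrapped offset 1 maps to 17, so the new sequence is contiguous.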

diff --git a/src/bin/pg_upgrade/Makefile b/src/bin/pg_upgrade/Makefile
index f83d2b5d30..70908d63a3 100644
--- a/src/bin/pg_upgrade/Makefile
+++ b/src/bin/pg_upgrade/Makefile
@@ -21,6 +21,7 @@ OBJS = \
 	info.o \
 	option.o \
 	parallel.o \
+	segresize.o \
 	pg_upgrade.o \
 	relfilenumber.o \
 	server.o \
diff --git a/src/bin/pg_upgrade/meson.build b/src/bin/pg_upgrade/meson.build
index 3d88419674..16f898ba14 100644
--- a/src/bin/pg_upgrade/meson.build
+++ b/src/bin/pg_upgrade/meson.build
@@ -10,6 +10,7 @@ pg_upgrade_sources = files(
   'info.c',
   'option.c',
   'parallel.c',
+  'segresize.c',
   'pg_upgrade.c',
   'relfilenumber.c',
   'server.c',
diff --git a/src/bin/pg_upgrade/pg_upgrade.c b/src/bin/pg_upgrade/pg_upgrade.c
index 663235816f..1654e877c0 100644
--- a/src/bin/pg_upgrade/pg_upgrade.c
+++ b/src/bin/pg_upgrade/pg_upgrade.c
@@ -750,8 +750,42 @@ copy_xact_xlog_xid(void)
 	if (old_cluster.controldata.cat_ver >= MULTIXACT_FORMATCHANGE_CAT_VER &&
 		new_cluster.controldata.cat_ver >= MULTIXACT_FORMATCHANGE_CAT_VER)
 	{
-		copy_subdir_files("pg_multixact/offsets", "pg_multixact/offsets");
-		copy_subdir_files("pg_multixact/members", "pg_multixact/members");
+		/*
+		 * If the old cluster predates MULTIXACTOFFSET_FORMATCHANGE_CAT_VER,
+		 * it has 32-bit multixact offsets, which must be converted.
+		 */
+		if (old_cluster.controldata.cat_ver < MULTIXACTOFFSET_FORMATCHANGE_CAT_VER &&
+			new_cluster.controldata.cat_ver >= MULTIXACTOFFSET_FORMATCHANGE_CAT_VER)
+		{
+			MultiXactOffset		oldest_offset,
+								next_offset;
+
+			remove_new_subdir("pg_multixact/offsets", false);
+			prep_status("Converting pg_multixact/offsets to 64-bit");
+			oldest_offset = convert_multixact_offsets();
+			check_ok();
+
+			remove_new_subdir("pg_multixact/members", false);
+			prep_status("Converting pg_multixact/members");
+			convert_multixact_members(oldest_offset);
+			check_ok();
+
+			next_offset = old_cluster.controldata.chkpnt_nxtmxoff;
+			if (oldest_offset)
+			{
+				if (next_offset < oldest_offset)
+					next_offset += ((MultiXactOffset) 1 << 32) - 1;
+
+				next_offset -= oldest_offset - 1;
+
+				old_cluster.controldata.chkpnt_nxtmxoff = next_offset;
+			}
+		}
+		else
+		{
+			copy_subdir_files("pg_multixact/offsets", "pg_multixact/offsets");
+			copy_subdir_files("pg_multixact/members", "pg_multixact/members");
+		}
 
 		prep_status("Setting next multixact ID and offset for new cluster");
 
@@ -760,9 +794,9 @@ copy_xact_xlog_xid(void)
 		 * counters here and the oldest multi present on system.
 		 */
 		exec_prog(UTILITY_LOG_FILE, NULL, true, true,
-				  "\"%s/pg_resetwal\" -O %u -m %u,%u \"%s\"",
+				  "\"%s/pg_resetwal\" -O %llu -m %u,%u \"%s\"",
 				  new_cluster.bindir,
-				  old_cluster.controldata.chkpnt_nxtmxoff,
+				  (unsigned long long) old_cluster.controldata.chkpnt_nxtmxoff,
 				  old_cluster.controldata.chkpnt_nxtmulti,
 				  old_cluster.controldata.chkpnt_oldstMulti,
 				  new_cluster.pgdata);
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index 53f693c2d4..2c85ec1e94 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -114,6 +114,13 @@ extern char *output_files[];
  */
 #define MULTIXACT_FORMATCHANGE_CAT_VER 201301231
 
+/*
+ * Switch from 32-bit to 64-bit for multixid offsets.
+ *
+ * XXX: should be changed to the actual CATALOG_VERSION_NO on commit.
+ */
+#define MULTIXACTOFFSET_FORMATCHANGE_CAT_VER 202409041
+
 /*
  * large object chunk size added to pg_controldata,
  * commit 5f93c37805e7485488480916b4585e098d3cc883
@@ -230,7 +237,7 @@ typedef struct
 	uint32		chkpnt_nxtepoch;
 	uint32		chkpnt_nxtoid;
 	uint32		chkpnt_nxtmulti;
-	uint32		chkpnt_nxtmxoff;
+	uint64		chkpnt_nxtmxoff;
 	uint32		chkpnt_oldstMulti;
 	uint32		chkpnt_oldstxid;
 	uint32		align;
@@ -515,3 +522,8 @@ typedef struct
 	FILE	   *file;
 	char		path[MAXPGPATH];
 } UpgradeTaskReport;
+
+/* segresize.c */
+
+MultiXactOffset		convert_multixact_offsets(void);
+void				convert_multixact_members(MultiXactOffset oldest_offset);
diff --git a/src/bin/pg_upgrade/segresize.c b/src/bin/pg_upgrade/segresize.c
new file mode 100644
index 0000000000..2f6f3b3288
--- /dev/null
+++ b/src/bin/pg_upgrade/segresize.c
@@ -0,0 +1,541 @@
+/*
+ *	segresize.c
+ *
+ *	SLRU segment resize utility
+ *
+ *	Copyright (c) 2024, PostgreSQL Global Development Group
+ *	src/bin/pg_upgrade/segresize.c
+ */
+
+#include "postgres_fe.h"
+
+#include "pg_upgrade.h"
+#include "access/multixact.h"
+
+/* See slru.h */
+#define SLRU_PAGES_PER_SEGMENT		32
+
+/*
+ * Iterator state for walking SLRU segment files.  The idea is to specify the
+ * segment and page number and then move through the pages sequentially.
+ */
+typedef struct SlruSegState
+{
+	char	   *dir;
+	char	   *fn;
+	FILE	   *file;
+	int64		segno;
+	uint64		pageno;
+	bool		leading_gap;
+	bool		long_segment_names;
+} SlruSegState;
+
+/*
+ * Mirrors the SlruFileName from slru.c
+ */
+static inline char *
+SlruFileName(SlruSegState *state)
+{
+	if (state->long_segment_names)
+	{
+		Assert(state->segno >= 0 &&
+			   state->segno <= INT64CONST(0xFFFFFFFFFFFFFFF));
+		return psprintf("%s/%015llX", state->dir, (long long) state->segno);
+	}
+	else
+	{
+		Assert(state->segno >= 0 && state->segno <= INT64CONST(0xFFFFFF));
+		return psprintf("%s/%04X", state->dir, (unsigned int) state->segno);
+	}
+}
+
+/*
+ * Create new SLRU segment file.
+ */
+static void
+create_segment(SlruSegState *state)
+{
+	Assert(state->fn == NULL);
+	Assert(state->file == NULL);
+
+	state->fn = SlruFileName(state);
+	state->file = fopen(state->fn, "wb");
+	if (!state->file)
+		pg_fatal("could not create file \"%s\": %m", state->fn);
+}
+
+/*
+ * Open existing SLRU segment file.
+ */
+static void
+open_segment(SlruSegState *state)
+{
+	Assert(state->fn == NULL);
+	Assert(state->file == NULL);
+
+	state->fn = SlruFileName(state);
+	state->file = fopen(state->fn, "rb");
+	if (!state->file)
+		pg_fatal("could not open file \"%s\": %m", state->fn);
+}
+
+/*
+ * Close SLRU segment file.
+ */
+static void
+close_segment(SlruSegState *state)
+{
+	if (state->file)
+	{
+		fclose(state->file);
+		state->file = NULL;
+	}
+
+	if (state->fn)
+	{
+		pfree(state->fn);
+		state->fn = NULL;
+	}
+}
+
+/*
+ * Read next page from the old 32-bit offset segment file.
+ */
+static int
+read_old_segment_page(SlruSegState *state, void *buf, bool *empty)
+{
+	int		len;
+
+	/* Open next segment file, if needed. */
+	if (!state->fn)
+	{
+		if (!state->segno)
+			state->leading_gap = true;
+
+		open_segment(state);
+
+		/* Set position to the needed page. */
+		if (state->pageno > 0 &&
+			fseek(state->file, state->pageno * BLCKSZ, SEEK_SET))
+		{
+			close_segment(state);
+		}
+	}
+
+	if (state->file)
+	{
+		/* Segment file exists; read a page from it. */
+		state->leading_gap = false;
+
+		len = fread(buf, sizeof(char), BLCKSZ, state->file);
+
+		/* Are we done or was there an error? */
+		if (len <= 0)
+		{
+			if (ferror(state->file))
+				pg_fatal("error reading file \"%s\": %m", state->fn);
+
+			if (feof(state->file))
+			{
+				*empty = true;
+				len = -1;
+
+				close_segment(state);
+			}
+		}
+		else
+			*empty = false;
+	}
+	else if (!state->leading_gap)
+	{
+		/* We reached the last segment. */
+		len = -1;
+		*empty = true;
+	}
+	else
+	{
+		/* Skip the first few segments if they were frozen and removed. */
+		len = BLCKSZ;
+		*empty = true;
+	}
+
+	if (++state->pageno >= SLRU_PAGES_PER_SEGMENT)
+	{
+		/* Start a new segment. */
+		state->segno++;
+		state->pageno = 0;
+
+		close_segment(state);
+	}
+
+	return len;
+}
+
+/*
+ * Write next page to the new 64-bit offset segment file.
+ */
+static void
+write_new_segment_page(SlruSegState *state, void *buf)
+{
+	/*
+	 * Create a new segment file if we haven't yet.  Creation is postponed
+	 * until the first non-empty page is found, which avoids creating
+	 * completely empty segments.
+	 */
+	if (!state->file)
+	{
+		create_segment(state);
+
+		/* Write zeroes to the previously skipped prefix. */
+		if (state->pageno > 0)
+		{
+			char		zerobuf[BLCKSZ] = {0};
+
+			for (int64 i = 0; i < state->pageno; i++)
+			{
+				if (fwrite(zerobuf, sizeof(char), BLCKSZ, state->file) != BLCKSZ)
+					pg_fatal("could not write file \"%s\": %m", state->fn);
+			}
+		}
+	}
+
+	/* Write page to the new segment (if it was created). */
+	if (state->file)
+	{
+		if (fwrite(buf, sizeof(char), BLCKSZ, state->file) != BLCKSZ)
+			pg_fatal("could not write file \"%s\": %m", state->fn);
+	}
+
+	/*
+	 * Did we reach the maximum page number?  Then close segment file
+	 * and create a new one on the next iteration.
+	 */
+	if (++state->pageno >= SLRU_PAGES_PER_SEGMENT)
+	{
+		/* Start a new segment. */
+		state->segno++;
+		state->pageno = 0;
+
+		close_segment(state);
+	}
+}
+
+typedef uint32 MultiXactOffsetOld;
+
+#define MaxMultiXactOffsetOld	((MultiXactOffsetOld) 0xFFFFFFFF)
+
+#define MULTIXACT_OFFSETS_PER_PAGE_OLD (BLCKSZ / sizeof(MultiXactOffsetOld))
+#define MULTIXACT_OFFSETS_PER_PAGE_NEW (BLCKSZ / sizeof(MultiXactOffset))
+
+/*
+ * Convert pg_multixact/offsets segments and return oldest multi offset.
+ */
+MultiXactOffset
+convert_multixact_offsets(void)
+{
+	SlruSegState		oldseg = {0},
+						newseg = {0};
+	MultiXactOffsetOld	oldbuf[MULTIXACT_OFFSETS_PER_PAGE_OLD] = {0};
+	MultiXactOffset		newbuf[MULTIXACT_OFFSETS_PER_PAGE_NEW] = {0},
+						oldest_offset = 0;
+	uint64				oldest_multi = old_cluster.controldata.chkpnt_oldstMulti,
+						next_multi = old_cluster.controldata.chkpnt_nxtmulti,
+						multi,
+						old_entry,
+						new_entry;
+	bool				oldest_offset_known = false;
+
+	oldseg.dir = psprintf("%s/pg_multixact/offsets", old_cluster.pgdata);
+	newseg.dir = psprintf("%s/pg_multixact/offsets", new_cluster.pgdata);
+
+	old_entry = oldest_multi % MULTIXACT_OFFSETS_PER_PAGE_OLD;
+	oldseg.pageno = oldest_multi / MULTIXACT_OFFSETS_PER_PAGE_OLD;
+	oldseg.segno = oldseg.pageno / SLRU_PAGES_PER_SEGMENT;
+	oldseg.pageno %= SLRU_PAGES_PER_SEGMENT;
+	oldseg.long_segment_names = false;
+
+	new_entry = oldest_multi % MULTIXACT_OFFSETS_PER_PAGE_NEW;
+	newseg.pageno = oldest_multi / MULTIXACT_OFFSETS_PER_PAGE_NEW;
+	newseg.segno = newseg.pageno / SLRU_PAGES_PER_SEGMENT;
+	newseg.pageno %= SLRU_PAGES_PER_SEGMENT;
+	newseg.long_segment_names = true;
+
+	if (next_multi < oldest_multi)
+		next_multi += (uint64) 1 << 32;	/* wraparound */
+
+	/* Copy multi offsets reading only needed segment pages */
+	for (multi = oldest_multi; multi < next_multi; old_entry = 0)
+	{
+		int		oldlen;
+		bool	empty;
+
+		/* Handle possible segment wraparound */
+#define OLD_OFFSET_SEGNO_MAX	\
+	(MaxMultiXactId / MULTIXACT_OFFSETS_PER_PAGE_OLD / SLRU_PAGES_PER_SEGMENT)
+		if (oldseg.segno > OLD_OFFSET_SEGNO_MAX)
+		{
+			oldseg.segno = 0;
+			oldseg.pageno = 0;
+		}
+
+		oldlen = read_old_segment_page(&oldseg, oldbuf, &empty);
+		if (empty || oldlen != BLCKSZ)
+			pg_fatal("cannot read page %llu from file \"%s\": %m",
+					 (unsigned long long) oldseg.pageno, oldseg.fn);
+
+		/* Save oldest multi offset */
+		if (!oldest_offset_known)
+		{
+			oldest_offset = oldbuf[old_entry];
+			oldest_offset_known = true;
+		}
+
+		/* Skip wrapped-around invalid MultiXactIds */
+		if (multi == (uint64) 1 << 32)
+		{
+			Assert(oldseg.segno == 0);
+			Assert(oldseg.pageno == 1);
+			Assert(old_entry == 0);
+			Assert(new_entry == 0);
+
+			multi += FirstMultiXactId;
+			old_entry = FirstMultiXactId;
+			new_entry = FirstMultiXactId;
+		}
+
+		/* Copy entries to the new page */
+		for (; multi < next_multi && old_entry < MULTIXACT_OFFSETS_PER_PAGE_OLD;
+			 multi++, old_entry++)
+		{
+			MultiXactOffset offset = oldbuf[old_entry];
+
+			/* Handle possible offset wraparound (1 becomes 2^32) */
+			if (offset < oldest_offset)
+				offset += ((uint64) 1 << 32) - 1;
+
+			/* Subtract oldest_offset, so new offsets will start from 1 */
+			newbuf[new_entry++] = offset - oldest_offset + 1;
+
+			if (new_entry >= MULTIXACT_OFFSETS_PER_PAGE_NEW)
+			{
+				/* Handle possible segment wraparound */
+#define NEW_OFFSET_SEGNO_MAX	\
+	(MaxMultiXactId / MULTIXACT_OFFSETS_PER_PAGE_NEW / SLRU_PAGES_PER_SEGMENT)
+				if (newseg.segno > NEW_OFFSET_SEGNO_MAX)
+				{
+					newseg.segno = 0;
+					newseg.pageno = 0;
+				}
+
+				/* Write new page */
+				write_new_segment_page(&newseg, newbuf);
+				new_entry = 0;
+			}
+		}
+	}
+
+	/* Write the last incomplete page */
+	if (new_entry > 0 || oldest_multi == next_multi)
+	{
+		memset(&newbuf[new_entry], 0,
+			   sizeof(newbuf[0]) * (MULTIXACT_OFFSETS_PER_PAGE_NEW - new_entry));
+		write_new_segment_page(&newseg, newbuf);
+	}
+
+	/* Use next_offset as oldest_offset, if oldest_multi == next_multi */
+	if (!oldest_offset_known)
+	{
+		Assert(oldest_multi == next_multi);
+		oldest_offset = (MultiXactOffset) old_cluster.controldata.chkpnt_nxtmxoff;
+	}
+
+	/* Release resources */
+	close_segment(&oldseg);
+	close_segment(&newseg);
+
+	pfree(oldseg.dir);
+	pfree(newseg.dir);
+
+	return oldest_offset;
+}
+
+#define MXACT_MEMBERS_FLAG_BYTES			1
+
+#define MULTIXACT_MEMBERS_PER_GROUP			4
+#define MULTIXACT_MEMBERGROUP_SIZE			\
+	(MULTIXACT_MEMBERS_PER_GROUP * (sizeof(TransactionId) + MXACT_MEMBERS_FLAG_BYTES))
+#define MULTIXACT_MEMBERGROUPS_PER_PAGE		\
+	(BLCKSZ / MULTIXACT_MEMBERGROUP_SIZE)
+
+#define MULTIXACT_MEMBERS_PER_PAGE				\
+	(MULTIXACT_MEMBERS_PER_GROUP * MULTIXACT_MEMBERGROUPS_PER_PAGE)
+#define MULTIXACT_MEMBER_FLAG_BYTES_PER_GROUP	\
+	(MXACT_MEMBERS_FLAG_BYTES * MULTIXACT_MEMBERS_PER_GROUP)
+
+typedef struct MultiXactMembersCtx
+{
+	SlruSegState	seg;
+	char			buf[BLCKSZ];
+	int				group;
+	int				member;
+	char		   *flag;
+	TransactionId  *xid;
+} MultiXactMembersCtx;
+
+static void
+MultiXactMembersCtxInit(MultiXactMembersCtx *ctx)
+{
+	ctx->seg.dir = psprintf("%s/pg_multixact/members", new_cluster.pgdata);
+	ctx->seg.long_segment_names = false;
+
+	ctx->group = 0;
+	ctx->member = 1;		/* skip invalid zero offset */
+
+	ctx->flag = (char *) ctx->buf + ctx->group * MULTIXACT_MEMBERGROUP_SIZE;
+	ctx->xid = (TransactionId *)(ctx->flag + MXACT_MEMBERS_FLAG_BYTES * MULTIXACT_MEMBERS_PER_GROUP);
+
+	ctx->flag += ctx->member;
+	ctx->xid += ctx->member;
+}
+
+static void
+MultiXactMembersCtxAdd(MultiXactMembersCtx *ctx, char flag, TransactionId xid)
+{
+	/* Copy member's xid and flags to the new page */
+	*ctx->flag++ = flag;
+	*ctx->xid++ = xid;
+
+	if (++ctx->member < MULTIXACT_MEMBERS_PER_GROUP)
+		return;
+
+	/* Start next member group */
+	ctx->member = 0;
+
+	if (++ctx->group >= MULTIXACT_MEMBERGROUPS_PER_PAGE)
+	{
+		/* Write current page and start new */
+		write_new_segment_page(&ctx->seg, ctx->buf);
+
+		ctx->group = 0;
+		memset(ctx->buf, 0, BLCKSZ);
+	}
+
+	ctx->flag = (char *) ctx->buf + ctx->group * MULTIXACT_MEMBERGROUP_SIZE;
+	ctx->xid = (TransactionId *)(ctx->flag + MXACT_MEMBERS_FLAG_BYTES * MULTIXACT_MEMBERS_PER_GROUP);
+}
+
+static void
+MultiXactMembersCtxFinit(MultiXactMembersCtx *ctx)
+{
+	if (ctx->flag > (char *) ctx->buf)
+		write_new_segment_page(&ctx->seg, ctx->buf);
+
+	close_segment(&ctx->seg);
+
+	pfree(ctx->seg.dir);
+}
+
+/*
+ * Convert pg_multixact/members segments, offsets will start from 1.
+ *
+ */
+void
+convert_multixact_members(MultiXactOffset oldest_offset)
+{
+	MultiXactOffset			next_offset,
+							offset;
+	SlruSegState			oldseg = {0};
+	char					oldbuf[BLCKSZ] = {0};
+	int						oldidx;
+	MultiXactMembersCtx		newctx = {0};
+
+	oldseg.dir = psprintf("%s/pg_multixact/members", old_cluster.pgdata);
+
+	next_offset = (MultiXactOffset) old_cluster.controldata.chkpnt_nxtmxoff;
+	if (next_offset < oldest_offset)
+		next_offset += ((uint64) 1 << 32) - 1;
+
+	/* Initialize the old starting position */
+	oldseg.pageno = oldest_offset / MULTIXACT_MEMBERS_PER_PAGE;
+	oldseg.segno = oldseg.pageno / SLRU_PAGES_PER_SEGMENT;
+	oldseg.pageno %= SLRU_PAGES_PER_SEGMENT;
+	oldseg.long_segment_names = false;
+
+	/* Initialize new starting position */
+	MultiXactMembersCtxInit(&newctx);
+
+	/* Iterate through the original directory */
+	oldidx = oldest_offset % MULTIXACT_MEMBERS_PER_PAGE;
+	for (offset = oldest_offset; offset < next_offset;)
+	{
+		bool	empty;
+		int		oldlen;
+		int		ngroups;
+		int		oldgroup;
+		int		oldmember;
+
+		oldlen = read_old_segment_page(&oldseg, oldbuf, &empty);
+		if (empty || oldlen != BLCKSZ)
+			pg_fatal("cannot read page %llu from file \"%s\": %m",
+					 (unsigned long long) oldseg.pageno, oldseg.fn);
+
+		/* Iterate through the old member groups */
+		ngroups = oldlen / MULTIXACT_MEMBERGROUP_SIZE;
+		oldmember = oldidx % MULTIXACT_MEMBERS_PER_GROUP;
+		oldgroup = oldidx / MULTIXACT_MEMBERS_PER_GROUP;
+		while (oldgroup < ngroups && offset < next_offset)
+		{
+			char		   *oldflag;
+			TransactionId  *oldxid;
+			int				i;
+
+			oldflag = (char *) oldbuf + oldgroup * MULTIXACT_MEMBERGROUP_SIZE;
+			oldxid = (TransactionId *)(oldflag + MULTIXACT_MEMBER_FLAG_BYTES_PER_GROUP);
+
+			oldxid += oldmember;
+			oldflag += oldmember;
+
+			/* Iterate through the old members */
+			for (i = oldmember;
+				 i < MULTIXACT_MEMBERS_PER_GROUP && offset < next_offset;
+				 i++)
+			{
+				MultiXactMembersCtxAdd(&newctx, *oldflag++, *oldxid++);
+
+				if (++offset == (uint64) 1 << 32)
+				{
+					Assert(i == MaxMultiXactOffsetOld % MULTIXACT_MEMBERS_PER_GROUP);
+					goto wraparound;
+				}
+			}
+
+			oldgroup++;
+			oldmember = 0;
+		}
+
+		oldidx = 0;
+
+		continue;
+
+wraparound:
+#define SEGNO_MAX	(MaxMultiXactOffsetOld / MULTIXACT_MEMBERS_PER_PAGE / SLRU_PAGES_PER_SEGMENT)
+#define PAGENO_MAX	(MaxMultiXactOffsetOld / MULTIXACT_MEMBERS_PER_PAGE % SLRU_PAGES_PER_SEGMENT)
+		Assert((oldseg.segno == SEGNO_MAX && oldseg.pageno == PAGENO_MAX + 1) ||
+			   (oldseg.segno == SEGNO_MAX + 1 && oldseg.pageno == 0));
+
+		/* Switch to segment 0000 */
+		close_segment(&oldseg);
+		oldseg.segno = 0;
+		oldseg.pageno = 0;
+
+		/* skip invalid zero multi offset */
+		oldidx = 1;
+	}
+
+	MultiXactMembersCtxFinit(&newctx);
+
+	/* Release resources */
+	close_segment(&oldseg);
+
+	pfree(oldseg.dir);
+}
-- 
2.43.0

#25Maxim Orlov
orlovmg@gmail.com
In reply to: Maxim Orlov (#24)
7 attachment(s)
Re: POC: make mxidoff 64 bits

Oops! Sorry for the noise. I must have been overworked yesterday and
messed up the working branches. v7 was a correct set and v8 was not. Here
is the corrected set (v9) with an extended Perl test.

The test itself is in src/bin/pg_upgrade/t/005_offset.pl. It is rather
heavy: it took about 45 minutes on my i5 and generated 2.7 GB of data.
Basically, each test here creates a cluster and fills it with multixacts.
Thus, dozens of segments are created using two methods. One uses prepared
transactions and creates roughly the same number of segments for members
and for offsets. The other is based on Heikki's multixids.py and creates
more members than offsets. I've used both methods to generate data that is
as diverse as possible.
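For a rough feel of how much data it takes to get dozens of segments, here
is a minimal sketch of the per-segment capacity arithmetic. It assumes the
default BLCKSZ of 8192 and 32 pages per SLRU segment; the members layout
mirrors the #defines in the conversion code of the patch set, and the helper
names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumptions: default build constants, not stated in this mail. */
#define BLCKSZ                  8192
#define SLRU_PAGES_PER_SEGMENT  32

/* Members layout, mirroring the #defines in the conversion code. */
#define MEMBERS_PER_GROUP       4
#define MEMBER_FLAG_BYTES       1
#define MEMBERGROUP_SIZE \
	(MEMBERS_PER_GROUP * (sizeof(uint32_t) + MEMBER_FLAG_BYTES))
#define MEMBERGROUPS_PER_PAGE   (BLCKSZ / MEMBERGROUP_SIZE)
#define MEMBERS_PER_PAGE        (MEMBERS_PER_GROUP * MEMBERGROUPS_PER_PAGE)

/*
 * Hypothetical helper: number of multixacts covered by one
 * pg_multixact/offsets segment, given the size of one offset entry
 * (4 bytes before the patch, 8 bytes after it).
 */
static uint64_t
offsets_per_segment(size_t entry_size)
{
	assert(entry_size == sizeof(uint32_t) || entry_size == sizeof(uint64_t));
	return (uint64_t) (BLCKSZ / entry_size) * SLRU_PAGES_PER_SEGMENT;
}

/*
 * Hypothetical helper: number of members stored in one
 * pg_multixact/members segment.
 */
static uint64_t
members_per_segment(void)
{
	return (uint64_t) MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT;
}
```

Under these assumptions one offsets segment covers 65536 multixacts with
32-bit offsets and 32768 with 64-bit offsets, while one members segment
holds 52352 members.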

Here is how I test this patch set:

1. You need two pg clusters: the "old" one, i.e. without the patch set, and
the "new" one with patch set v9 applied.
2. Apply v9-0005-TEST-initdb-option-to-initialize-cluster-with-non.patch.txt
to the "old" and "new" clusters. Note, this is the only patch required for
the "old" cluster. It allows you to create a cluster with a non-standard
initial multixact and multixact offset. Unfortunately, this patch did not
arouse public interest, since it is assumed that the pg_resetwal utility
provides similar functionality. But similar does not mean equal.
See, pg_resetwal must be used after cluster init; thus, we run into some
problems with vacuum, and some SLRU segments must be filled with zeroes.
Also, template0 datminmxid must be updated manually. So, in my view,
using this patch is justified and very handy here.
3. Also, apply all the "TEST" (0006 and 0007) patches to the "new"
cluster.
4. Build the "old" and "new" pg clusters.
5. Run the test with: PROVE_TESTS=t/005_offset.pl PG_TEST_NOCLEAN=1
oldinstall=/home/orlov/proj/OFFSET3/pgsql-old make check -s -C
src/bin/pg_upgrade/
6. In my case, it took around 45 minutes and generated roughly 2.7 GB of
data.

The "TEST" patches, of course, are for testing purposes only and are not to
be committed.
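
For reference, the -m/-o/-x options added by the 0005 TEST patch all check
their argument in the same way. A minimal standalone sketch of that check,
using plain strtoull() in place of the strtou64() used in the tree; the
helper name is hypothetical:

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper mirroring the checks applied to the -m/-o/-x
 * arguments: the whole string must parse as a number and the value must
 * fit into 32 bits (the Start*IsValid() macros in c.h).
 */
static bool
parse_start_value(const char *arg, uint64_t *result)
{
	char	   *endptr;
	unsigned long long val;

	errno = 0;
	/* Base 0 accepts decimal, octal (0...) and hex (0x...) input. */
	val = strtoull(arg, &endptr, 0);

	/* No digits consumed, trailing garbage, or out of range for strtoull. */
	if (endptr == arg || *endptr != '\0' || errno != 0)
		return false;

	/* Same upper bound as the Start*IsValid() macros. */
	if (val > UINT64_C(0xFFFFFFFF))
		return false;

	*result = (uint64_t) val;
	return true;
}
```

Note that initdb itself is a bit stricter than this sketch: it also rejects
a multixact id below 1 and an xid below 3.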

In src/bin/pg_upgrade/t/005_offset.pl I try to cover the following cases:

- Basic sanity checks.
Here I test various initial multi and offset values (including
wraparound) and check that the appropriate segments are generated.
- pg_upgrade tests.
This is what the oldinstall ENV variable is for. Run pg_upgrade for an old
cluster with multi and offset values just like in the previous step, i.e.
with various combinations.
- Self pg_upgrade.
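
To illustrate the wraparound case the tests are after: a minimal sketch of
how an old 32-bit members offset is rebased during conversion so that the
oldest offset becomes 1. It follows the arithmetic in the conversion code
of the patch set; the function name is hypothetical.

```c
#include <stdint.h>

/*
 * Hypothetical helper: map an old 32-bit members offset into the new
 * 64-bit space, so that oldest_offset itself maps to 1.
 */
static uint64_t
rebase_offset(uint32_t old_offset, uint32_t oldest_offset)
{
	uint64_t	offset = old_offset;

	/*
	 * An offset below the oldest one has wrapped around.  The old counter
	 * skips the invalid value 0, so 1 is the successor of 2^32 - 1 and
	 * only 2^32 - 1 has to be added.
	 */
	if (offset < oldest_offset)
		offset += (UINT64_C(1) << 32) - 1;

	/* Subtract the oldest offset, so new offsets start from 1. */
	return offset - oldest_offset + 1;
}
```

For example, with oldest_offset = 0xFFFFFFFE, the old offsets 0xFFFFFFFE,
0xFFFFFFFF and 1 map to 1, 2 and 3, i.e. they stay consecutive across the
wraparound point.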

--
Best regards,
Maxim Orlov.

Attachments:

v9-0005-TEST-initdb-option-to-initialize-cluster-with-non.patch.txt (text/plain; charset=US-ASCII)
From 2642f597832cbed0ebc54202de4e0f5770ac5f50 Mon Sep 17 00:00:00 2001
From: Maxim Orlov <m.orlov@postgrespro.ru>
Date: Wed, 4 May 2022 15:53:36 +0300
Subject: [PATCH v9 5/7] TEST: initdb option to initialize cluster with
 non-standard xid/mxid/mxoff

To date, testing database cluster wraparound was not easy, as initdb has
always initialized the cluster with the default xid/mxid/mxoff. The option
to specify any valid xid/mxid/mxoff at cluster startup will make this easier.

Author: Maxim Orlov <orlovmg@gmail.com>
Author: Pavel Borisov <pashkin.elfe@gmail.com>
Author: Svetlana Derevyanko <s.derevyanko@postgrespro.ru>
Discussion: https://www.postgresql.org/message-id/flat/CACG%3Dezaa4vqYjJ16yoxgrpa-%3DgXnf0Vv3Ey9bjGrRRFN2YyWFQ%40mail.gmail.com
---
 src/backend/access/transam/clog.c      |  21 +++++
 src/backend/access/transam/multixact.c |  53 ++++++++++++
 src/backend/access/transam/subtrans.c  |   8 +-
 src/backend/access/transam/xlog.c      |  15 ++--
 src/backend/bootstrap/bootstrap.c      |  50 +++++++++++-
 src/backend/main/main.c                |   6 ++
 src/backend/postmaster/postmaster.c    |  14 +++-
 src/backend/tcop/postgres.c            |  53 +++++++++++-
 src/bin/initdb/initdb.c                | 107 ++++++++++++++++++++++++-
 src/bin/initdb/t/001_initdb.pl         |  60 ++++++++++++++
 src/include/access/xlog.h              |   3 +
 src/include/c.h                        |   4 +
 src/include/catalog/pg_class.h         |   2 +-
 13 files changed, 382 insertions(+), 14 deletions(-)

diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index e6f79320e9..17e29f4497 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -834,6 +834,7 @@ BootStrapCLOG(void)
 {
 	int			slotno;
 	LWLock	   *lock = SimpleLruGetBankLock(XactCtl, 0);
+	int64		pageno;
 
 	LWLockAcquire(lock, LW_EXCLUSIVE);
 
@@ -844,6 +845,26 @@ BootStrapCLOG(void)
 	SimpleLruWritePage(XactCtl, slotno);
 	Assert(!XactCtl->shared->page_dirty[slotno]);
 
+	pageno = TransactionIdToPage(XidFromFullTransactionId(TransamVariables->nextXid));
+	if (pageno != 0)
+	{
+		LWLock *nextlock = SimpleLruGetBankLock(XactCtl, pageno);
+
+		if (nextlock != lock)
+		{
+			LWLockRelease(lock);
+			LWLockAcquire(nextlock, LW_EXCLUSIVE);
+			lock = nextlock;
+		}
+
+		/* Create and zero the commit log page containing nextXid */
+		slotno = ZeroCLOGPage(pageno, false);
+
+		/* Make sure it's written out */
+		SimpleLruWritePage(XactCtl, slotno);
+		Assert(!XactCtl->shared->page_dirty[slotno]);
+	}
+
 	LWLockRelease(lock);
 }
 
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index a817f539ee..095c39dd93 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1955,6 +1955,7 @@ BootStrapMultiXact(void)
 {
 	int			slotno;
 	LWLock	   *lock;
+	int64		pageno;
 
 	lock = SimpleLruGetBankLock(MultiXactOffsetCtl, 0);
 	LWLockAcquire(lock, LW_EXCLUSIVE);
@@ -1966,6 +1967,26 @@ BootStrapMultiXact(void)
 	SimpleLruWritePage(MultiXactOffsetCtl, slotno);
 	Assert(!MultiXactOffsetCtl->shared->page_dirty[slotno]);
 
+	pageno = MultiXactIdToOffsetPage(MultiXactState->nextMXact);
+	if (pageno != 0)
+	{
+		LWLock *nextlock = SimpleLruGetBankLock(MultiXactOffsetCtl, pageno);
+
+		if (nextlock != lock)
+		{
+			LWLockRelease(lock);
+			LWLockAcquire(nextlock, LW_EXCLUSIVE);
+			lock = nextlock;
+		}
+
+		/* Create and zero the offsets page containing nextMXact */
+		slotno = ZeroMultiXactOffsetPage(pageno, false);
+
+		/* Make sure it's written out */
+		SimpleLruWritePage(MultiXactOffsetCtl, slotno);
+		Assert(!MultiXactOffsetCtl->shared->page_dirty[slotno]);
+	}
+
 	LWLockRelease(lock);
 
 	lock = SimpleLruGetBankLock(MultiXactMemberCtl, 0);
@@ -1978,7 +1999,39 @@ BootStrapMultiXact(void)
 	SimpleLruWritePage(MultiXactMemberCtl, slotno);
 	Assert(!MultiXactMemberCtl->shared->page_dirty[slotno]);
 
+	pageno = MXOffsetToMemberPage(MultiXactState->nextOffset);
+	if (pageno != 0)
+	{
+		LWLock *nextlock = SimpleLruGetBankLock(MultiXactMemberCtl, pageno);
+
+		if (nextlock != lock)
+		{
+			LWLockRelease(lock);
+			LWLockAcquire(nextlock, LW_EXCLUSIVE);
+			lock = nextlock;
+		}
+
+		/* Create and zero the members page containing nextOffset */
+		slotno = ZeroMultiXactMemberPage(pageno, false);
+
+		/* Make sure it's written out */
+		SimpleLruWritePage(MultiXactMemberCtl, slotno);
+		Assert(!MultiXactMemberCtl->shared->page_dirty[slotno]);
+	}
+
 	LWLockRelease(lock);
+
+	/*
+	 * If we're not starting from a zero offset, initialize dummy multixacts
+	 * to avoid an overly long loop in PerformMembersTruncation().
+	 */
+	if (MultiXactState->nextOffset > 0 && MultiXactState->nextMXact > 0)
+	{
+		RecordNewMultiXact(FirstMultiXactId,
+						   MultiXactState->nextOffset, 0, NULL);
+		RecordNewMultiXact(MultiXactState->nextMXact,
+						   MultiXactState->nextOffset, 0, NULL);
+	}
 }
 
 /*
diff --git a/src/backend/access/transam/subtrans.c b/src/backend/access/transam/subtrans.c
index 50bb1d8cfc..a5e6e8f090 100644
--- a/src/backend/access/transam/subtrans.c
+++ b/src/backend/access/transam/subtrans.c
@@ -270,12 +270,16 @@ void
 BootStrapSUBTRANS(void)
 {
 	int			slotno;
-	LWLock	   *lock = SimpleLruGetBankLock(SubTransCtl, 0);
+	LWLock	   *lock;
+	int64		pageno;
+
+	pageno = TransactionIdToPage(XidFromFullTransactionId(TransamVariables->nextXid));
+	lock = SimpleLruGetBankLock(SubTransCtl, pageno);
 
 	LWLockAcquire(lock, LW_EXCLUSIVE);
 
 	/* Create and zero the first page of the subtrans log */
-	slotno = ZeroSUBTRANSPage(0);
+	slotno = ZeroSUBTRANSPage(pageno);
 
 	/* Make sure it's written out */
 	SimpleLruWritePage(SubTransCtl, slotno);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6f58412bca..c61d7d967c 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -136,6 +136,10 @@ int			max_slot_wal_keep_size_mb = -1;
 int			wal_decode_buffer_size = 512 * 1024;
 bool		track_wal_io_timing = false;
 
+TransactionId		start_xid = FirstNormalTransactionId;
+MultiXactId			start_mxid = FirstMultiXactId;
+MultiXactOffset		start_mxoff = 0;
+
 #ifdef WAL_DEBUG
 bool		XLOG_DEBUG = false;
 #endif
@@ -5080,13 +5084,14 @@ BootStrapXLOG(uint32 data_checksum_version)
 	checkPoint.fullPageWrites = fullPageWrites;
 	checkPoint.wal_level = wal_level;
 	checkPoint.nextXid =
-		FullTransactionIdFromEpochAndXid(0, FirstNormalTransactionId);
+		FullTransactionIdFromEpochAndXid(0, Max(FirstNormalTransactionId,
+												start_xid));
 	checkPoint.nextOid = FirstGenbkiObjectId;
-	checkPoint.nextMulti = FirstMultiXactId;
-	checkPoint.nextMultiOffset = 0;
-	checkPoint.oldestXid = FirstNormalTransactionId;
+	checkPoint.nextMulti = Max(FirstMultiXactId, start_mxid);
+	checkPoint.nextMultiOffset = start_mxoff;
+	checkPoint.oldestXid = XidFromFullTransactionId(checkPoint.nextXid);
 	checkPoint.oldestXidDB = Template1DbOid;
-	checkPoint.oldestMulti = FirstMultiXactId;
+	checkPoint.oldestMulti = checkPoint.nextMulti;
 	checkPoint.oldestMultiDB = Template1DbOid;
 	checkPoint.oldestCommitTsXid = InvalidTransactionId;
 	checkPoint.newestCommitTsXid = InvalidTransactionId;
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index d31a67599c..8c33b8ba9d 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -217,7 +217,7 @@ BootstrapModeMain(int argc, char *argv[], bool check_only)
 	argv++;
 	argc--;
 
-	while ((flag = getopt(argc, argv, "B:c:d:D:Fkr:X:-:")) != -1)
+	while ((flag = getopt(argc, argv, "B:c:d:D:Fkm:o:r:X:x:-:")) != -1)
 	{
 		switch (flag)
 		{
@@ -272,12 +272,60 @@ BootstrapModeMain(int argc, char *argv[], bool check_only)
 			case 'k':
 				bootstrap_data_checksum_version = PG_DATA_CHECKSUM_VERSION;
 				break;
+			case 'm':
+				{
+					char	   *endptr;
+
+					errno = 0;
+					start_mxid = strtou64(optarg, &endptr, 0);
+
+					if (endptr == optarg || *endptr != '\0' || errno != 0 ||
+						!StartMultiXactIdIsValid(start_mxid))
+					{
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("invalid initial database cluster multixact id")));
+					}
+				}
+				break;
+			case 'o':
+				{
+					char	   *endptr;
+
+					errno = 0;
+					start_mxoff = strtou64(optarg, &endptr, 0);
+
+					if (endptr == optarg || *endptr != '\0' || errno != 0 ||
+						!StartMultiXactOffsetIsValid(start_mxoff))
+					{
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("invalid initial database cluster multixact offset")));
+					}
+				}
+				break;
 			case 'r':
 				strlcpy(OutputFileName, optarg, MAXPGPATH);
 				break;
 			case 'X':
 				SetConfigOption("wal_segment_size", optarg, PGC_INTERNAL, PGC_S_DYNAMIC_DEFAULT);
 				break;
+			case 'x':
+				{
+					char	   *endptr;
+
+					errno = 0;
+					start_xid = strtou64(optarg, &endptr, 0);
+
+					if (endptr == optarg || *endptr != '\0' || errno != 0 ||
+						!StartTransactionIdIsValid(start_xid))
+					{
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("invalid initial database cluster xid value")));
+					}
+				}
+				break;
 			default:
 				write_stderr("Try \"%s --help\" for more information.\n",
 							 progname);
diff --git a/src/backend/main/main.c b/src/backend/main/main.c
index aea93a0229..6a3224bb82 100644
--- a/src/backend/main/main.c
+++ b/src/backend/main/main.c
@@ -358,12 +358,18 @@ help(const char *progname)
 	printf(_("  -E                 echo statement before execution\n"));
 	printf(_("  -j                 do not use newline as interactive query delimiter\n"));
 	printf(_("  -r FILENAME        send stdout and stderr to given file\n"));
+	printf(_("  -m START_MXID      set initial database cluster multixact id\n"));
+	printf(_("  -o START_MXOFF     set initial database cluster multixact offset\n"));
+	printf(_("  -x START_XID       set initial database cluster xid\n"));
 
 	printf(_("\nOptions for bootstrapping mode:\n"));
 	printf(_("  --boot             selects bootstrapping mode (must be first argument)\n"));
 	printf(_("  --check            selects check mode (must be first argument)\n"));
 	printf(_("  DBNAME             database name (mandatory argument in bootstrapping mode)\n"));
 	printf(_("  -r FILENAME        send stdout and stderr to given file\n"));
+	printf(_("  -m START_MXID      set initial database cluster multixact id\n"));
+	printf(_("  -o START_MXOFF     set initial database cluster multixact offset\n"));
+	printf(_("  -x START_XID       set initial database cluster xid\n"));
 
 	printf(_("\nPlease read the documentation for the complete list of run-time\n"
 			 "configuration settings and how to set them on the command line or in\n"
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 78e66a06ac..483307279f 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -572,7 +572,7 @@ PostmasterMain(int argc, char *argv[])
 	 * tcop/postgres.c (the option sets should not conflict) and with the
 	 * common help() function in main/main.c.
 	 */
-	while ((opt = getopt(argc, argv, "B:bC:c:D:d:EeFf:h:ijk:lN:OPp:r:S:sTt:W:-:")) != -1)
+	while ((opt = getopt(argc, argv, "B:bC:c:D:d:EeFf:h:ijk:lm:N:Oo:Pp:r:S:sTt:W:x:-:")) != -1)
 	{
 		switch (opt)
 		{
@@ -669,10 +669,18 @@ PostmasterMain(int argc, char *argv[])
 				SetConfigOption("max_connections", optarg, PGC_POSTMASTER, PGC_S_ARGV);
 				break;
 
+			case 'm':
+				/* only used by single-user backend */
+				break;
+
 			case 'O':
 				SetConfigOption("allow_system_table_mods", "true", PGC_POSTMASTER, PGC_S_ARGV);
 				break;
 
+			case 'o':
+				/* only used by single-user backend */
+				break;
+
 			case 'P':
 				SetConfigOption("ignore_system_indexes", "true", PGC_POSTMASTER, PGC_S_ARGV);
 				break;
@@ -723,6 +731,10 @@ PostmasterMain(int argc, char *argv[])
 				SetConfigOption("post_auth_delay", optarg, PGC_POSTMASTER, PGC_S_ARGV);
 				break;
 
+			case 'x':
+				/* only used by single-user backend */
+				break;
+
 			default:
 				write_stderr("Try \"%s --help\" for more information.\n",
 							 progname);
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 184b830168..4fd594cfe5 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3918,7 +3918,7 @@ process_postgres_switches(int argc, char *argv[], GucContext ctx,
 	 * postmaster/postmaster.c (the option sets should not conflict) and with
 	 * the common help() function in main/main.c.
 	 */
-	while ((flag = getopt(argc, argv, "B:bC:c:D:d:EeFf:h:ijk:lN:nOPp:r:S:sTt:v:W:-:")) != -1)
+	while ((flag = getopt(argc, argv, "B:bC:c:D:d:EeFf:h:ijk:lm:N:nOo:Pp:r:S:sTt:v:W:x:-:")) != -1)
 	{
 		switch (flag)
 		{
@@ -4010,6 +4010,23 @@ process_postgres_switches(int argc, char *argv[], GucContext ctx,
 				SetConfigOption("ssl", "true", ctx, gucsource);
 				break;
 
+			case 'm':
+				{
+					char	   *endptr;
+
+					errno = 0;
+					start_mxid = strtou64(optarg, &endptr, 0);
+
+					if (endptr == optarg || *endptr != '\0' || errno != 0 ||
+						!StartMultiXactIdIsValid(start_mxid))
+					{
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("invalid initial database cluster multixact id")));
+					}
+				}
+				break;
+
 			case 'N':
 				SetConfigOption("max_connections", optarg, ctx, gucsource);
 				break;
@@ -4022,6 +4039,23 @@ process_postgres_switches(int argc, char *argv[], GucContext ctx,
 				SetConfigOption("allow_system_table_mods", "true", ctx, gucsource);
 				break;
 
+			case 'o':
+				{
+					char	   *endptr;
+
+					errno = 0;
+					start_mxoff = strtou64(optarg, &endptr, 0);
+
+					if (endptr == optarg || *endptr != '\0' || errno != 0 ||
+						!StartMultiXactOffsetIsValid(start_mxoff))
+					{
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("invalid initial database cluster multixact offset")));
+					}
+				}
+				break;
+
 			case 'P':
 				SetConfigOption("ignore_system_indexes", "true", ctx, gucsource);
 				break;
@@ -4076,6 +4110,23 @@ process_postgres_switches(int argc, char *argv[], GucContext ctx,
 				SetConfigOption("post_auth_delay", optarg, ctx, gucsource);
 				break;
 
+			case 'x':
+				{
+					char	   *endptr;
+
+					errno = 0;
+					start_xid = strtou64(optarg, &endptr, 0);
+
+					if (endptr == optarg || *endptr != '\0' || errno != 0 ||
+						!StartTransactionIdIsValid(start_xid))
+					{
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("invalid initial database cluster xid")));
+					}
+				}
+				break;
+
 			default:
 				errs++;
 				break;
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 9a91830783..410868dddf 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -168,6 +168,9 @@ static bool data_checksums = true;
 static char *xlog_dir = NULL;
 static int	wal_segment_size_mb = (DEFAULT_XLOG_SEG_SIZE) / (1024 * 1024);
 static DataDirSyncMethod sync_method = DATA_DIR_SYNC_METHOD_FSYNC;
+static TransactionId start_xid = 0;
+static MultiXactId start_mxid = 0;
+static MultiXactOffset start_mxoff = 0;
 
 
 /* internal vars */
@@ -1568,6 +1571,11 @@ bootstrap_template1(void)
 	bki_lines = replace_token(bki_lines, "POSTGRES",
 							  escape_quotes_bki(username));
 
+	/* relfrozenxid must not be less than FirstNormalTransactionId */
+	sprintf(buf, "%llu", (unsigned long long) Max(start_xid, 3));
+	bki_lines = replace_token(bki_lines, "RECENTXMIN",
+							  buf);
+
 	bki_lines = replace_token(bki_lines, "ENCODING",
 							  encodingid_to_string(encodingid));
 
@@ -1593,6 +1601,9 @@ bootstrap_template1(void)
 
 	printfPQExpBuffer(&cmd, "\"%s\" --boot %s %s", backend_exec, boot_options, extra_options);
 	appendPQExpBuffer(&cmd, " -X %d", wal_segment_size_mb * (1024 * 1024));
+	appendPQExpBuffer(&cmd, " -m %llu", (unsigned long long) start_mxid);
+	appendPQExpBuffer(&cmd, " -o %llu", (unsigned long long) start_mxoff);
+	appendPQExpBuffer(&cmd, " -x %llu", (unsigned long long) start_xid);
 	if (data_checksums)
 		appendPQExpBuffer(&cmd, " -k");
 	if (debug)
@@ -2532,12 +2543,20 @@ usage(const char *progname)
 	printf(_("  -d, --debug               generate lots of debugging output\n"));
 	printf(_("      --discard-caches      set debug_discard_caches=1\n"));
 	printf(_("  -L DIRECTORY              where to find the input files\n"));
+	printf(_("  -m, --multixact-id=START_MXID\n"
+			 "                            set initial database cluster multixact id\n"
+			 "                            max value is 2^32-1\n"));
 	printf(_("  -n, --no-clean            do not clean up after errors\n"));
 	printf(_("  -N, --no-sync             do not wait for changes to be written safely to disk\n"));
 	printf(_("      --no-instructions     do not print instructions for next steps\n"));
+	printf(_("  -o, --multixact-offset=START_MXOFF\n"
+			 "                            set initial database cluster multixact offset\n"
+			 "                            max value is 2^32-1\n"));
 	printf(_("  -s, --show                show internal settings, then exit\n"));
 	printf(_("      --sync-method=METHOD  set method for syncing files to disk\n"));
 	printf(_("  -S, --sync-only           only sync database files to disk, then exit\n"));
+	printf(_("  -x, --xid=START_XID       set initial database cluster xid\n"
+			 "                            max value is 2^32-1\n"));
 	printf(_("\nOther options:\n"));
 	printf(_("  -V, --version             output version information, then exit\n"));
 	printf(_("  -?, --help                show this help, then exit\n"));
@@ -3079,6 +3098,18 @@ initialize_data_directory(void)
 	/* Now create all the text config files */
 	setup_config();
 
+	if (start_mxid != 0)
+		printf(_("selecting initial multixact id ... %llu\n"),
+				 (unsigned long long) start_mxid);
+
+	if (start_mxoff != 0)
+		printf(_("selecting initial multixact offset ... %llu\n"),
+				 (unsigned long long) start_mxoff);
+
+	if (start_xid != 0)
+		printf(_("selecting initial xid ... %llu\n"),
+				 (unsigned long long) start_xid);
+
 	/* Bootstrap template1 */
 	bootstrap_template1();
 
@@ -3095,8 +3126,12 @@ initialize_data_directory(void)
 	fflush(stdout);
 
 	initPQExpBuffer(&cmd);
-	printfPQExpBuffer(&cmd, "\"%s\" %s %s template1 >%s",
-					  backend_exec, backend_options, extra_options, DEVNULL);
+	printfPQExpBuffer(&cmd, "\"%s\" %s %s",
+					  backend_exec, backend_options, extra_options);
+	appendPQExpBuffer(&cmd, " -m %llu", (unsigned long long) start_mxid);
+	appendPQExpBuffer(&cmd, " -o %llu", (unsigned long long) start_mxoff);
+	appendPQExpBuffer(&cmd, " -x %llu", (unsigned long long) start_xid);
+	appendPQExpBuffer(&cmd, " template1 >%s", DEVNULL);
 
 	PG_CMD_OPEN(cmd.data);
 
@@ -3183,6 +3218,9 @@ main(int argc, char *argv[])
 		{"icu-rules", required_argument, NULL, 18},
 		{"sync-method", required_argument, NULL, 19},
 		{"no-data-checksums", no_argument, NULL, 20},
+		{"xid", required_argument, NULL, 'x'},
+		{"multixact-id", required_argument, NULL, 'm'},
+		{"multixact-offset", required_argument, NULL, 'o'},
 		{NULL, 0, NULL, 0}
 	};
 
@@ -3224,7 +3262,7 @@ main(int argc, char *argv[])
 
 	/* process command-line options */
 
-	while ((c = getopt_long(argc, argv, "A:c:dD:E:gkL:nNsST:U:WX:",
+	while ((c = getopt_long(argc, argv, "A:c:dD:E:gkL:m:nNo:sST:U:Wx:X:",
 							long_options, &option_index)) != -1)
 	{
 		switch (c)
@@ -3282,6 +3320,30 @@ main(int argc, char *argv[])
 				debug = true;
 				printf(_("Running in debug mode.\n"));
 				break;
+			case 'm':
+				{
+					char	   *endptr;
+
+					errno = 0;
+					start_mxid = strtou64(optarg, &endptr, 0);
+
+					if (endptr == optarg || *endptr != '\0' || errno != 0 ||
+						!StartMultiXactIdIsValid(start_mxid))
+					{
+						pg_log_error("invalid initial database cluster multixact id");
+						exit(1);
+					}
+					else if (start_mxid < 1) /* FirstMultiXactId */
+					{
+						/*
+						 * Reject zero explicitly instead of silently bumping it
+						 * to FirstMultiXactId, although that would do no harm.
+						 */
+						pg_log_error("multixact id should be greater than 0");
+						exit(1);
+					}
+				}
+				break;
 			case 'n':
 				noclean = true;
 				printf(_("Running in no-clean mode.  Mistakes will not be cleaned up.\n"));
@@ -3289,6 +3351,21 @@ main(int argc, char *argv[])
 			case 'N':
 				do_sync = false;
 				break;
+			case 'o':
+				{
+					char	   *endptr;
+
+					errno = 0;
+					start_mxoff = strtou64(optarg, &endptr, 0);
+
+					if (endptr == optarg || *endptr != '\0' || errno != 0 ||
+						!StartMultiXactOffsetIsValid(start_mxoff))
+					{
+						pg_log_error("invalid initial database cluster multixact offset");
+						exit(1);
+					}
+				}
+				break;
 			case 'S':
 				sync_only = true;
 				break;
@@ -3377,6 +3454,30 @@ main(int argc, char *argv[])
 			case 20:
 				data_checksums = false;
 				break;
+			case 'x':
+				{
+					char	   *endptr;
+
+					errno = 0;
+					start_xid = strtou64(optarg, &endptr, 0);
+
+					if (endptr == optarg || *endptr != '\0' || errno != 0 ||
+						!StartTransactionIdIsValid(start_xid))
+					{
+						pg_log_error("invalid value for initial database cluster xid");
+						exit(1);
+					}
+					else if (start_xid < 3) /* FirstNormalTransactionId */
+					{
+						/*
+						 * Reject such values explicitly instead of silently bumping
+						 * them to FirstNormalTransactionId, although that would do no harm.
+						 */
+						pg_log_error("xid should be greater than 2");
+						exit(1);
+					}
+				}
+				break;
 			default:
 				/* getopt_long already emitted a complaint */
 				pg_log_error_hint("Try \"%s --help\" for more information.", progname);
diff --git a/src/bin/initdb/t/001_initdb.pl b/src/bin/initdb/t/001_initdb.pl
index 7520d3d0dd..91a85d9f4d 100644
--- a/src/bin/initdb/t/001_initdb.pl
+++ b/src/bin/initdb/t/001_initdb.pl
@@ -282,4 +282,64 @@ command_fails(
 	[ 'pg_checksums', '-D', $datadir_nochecksums ],
 	"pg_checksums fails with data checksum disabled");
 
+# Set non-standard initial mxid/mxoff/xid.
+command_fails_like(
+	[ 'initdb', '-m', 'seven', $datadir ],
+	qr/initdb: error: invalid initial database cluster multixact id/,
+	'fails for invalid initial database cluster multixact id');
+command_fails_like(
+	[ 'initdb', '-o', 'seven', $datadir ],
+	qr/initdb: error: invalid initial database cluster multixact offset/,
+	'fails for invalid initial database cluster multixact offset');
+command_fails_like(
+	[ 'initdb', '-x', 'seven', $datadir ],
+	qr/initdb: error: invalid value for initial database cluster xid/,
+	'fails for invalid initial database cluster xid');
+
+command_checks_all(
+	[ 'initdb', '-m', '65535', "$tempdir/data-m65535" ],
+	0,
+	[qr/selecting initial multixact id ... 65535/],
+	[],
+	'selecting initial multixact id');
+command_checks_all(
+	[ 'initdb', '-o', '65535', "$tempdir/data-o65535" ],
+	0,
+	[qr/selecting initial multixact offset ... 65535/],
+	[],
+	'selecting initial multixact offset');
+command_checks_all(
+	[ 'initdb', '-x', '65535', "$tempdir/data-x65535" ],
+	0,
+	[qr/selecting initial xid ... 65535/],
+	[],
+	'selecting initial xid');
+
+# Setup new cluster with given mxid/mxoff/xid.
+my $node;
+my $result;
+
+$node = PostgreSQL::Test::Cluster->new('test-mxid');
+$node->init(extra => ['-m', '16777215']); # 0xFFFFFF
+$node->start;
+$result = $node->safe_psql('postgres', "SELECT next_multixact_id FROM pg_control_checkpoint();");
+ok($result >= 16777215, 'setup cluster with given mxid');
+$node->stop;
+
+$node = PostgreSQL::Test::Cluster->new('test-mxoff');
+$node->init(extra => ['-o', '16777215']); # 0xFFFFFF
+$node->start;
+$result = $node->safe_psql('postgres', "SELECT next_multi_offset FROM pg_control_checkpoint();");
+ok($result >= 16777215, 'setup cluster with given mxoff');
+$node->stop;
+
+$node = PostgreSQL::Test::Cluster->new('test-xid');
+$node->init(extra => ['-x', '16777215']); # 0xFFFFFF
+$node->start;
+$result = $node->safe_psql('postgres', "SELECT txid_current();");
+ok($result >= 16777215, 'setup cluster with given xid - check 1');
+$result = $node->safe_psql('postgres', "SELECT oldest_xid FROM pg_control_checkpoint();");
+ok($result >= 16777215, 'setup cluster with given xid - check 2');
+$node->stop;
+
 done_testing();
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 34ad46c067..4ce79b12e3 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -94,6 +94,9 @@ typedef enum RecoveryState
 } RecoveryState;
 
 extern PGDLLIMPORT int wal_level;
+extern PGDLLIMPORT TransactionId start_xid;
+extern PGDLLIMPORT MultiXactId start_mxid;
+extern PGDLLIMPORT MultiXactOffset start_mxoff;
 
 /* Is WAL archiving enabled (always or only while server is running normally)? */
 #define XLogArchivingActive() \
diff --git a/src/include/c.h b/src/include/c.h
index e1b3187d0b..f770e9a140 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -668,6 +668,10 @@ typedef uint64 MultiXactOffset;
 
 typedef uint32 CommandId;
 
+#define StartTransactionIdIsValid(xid)			((xid) <= 0xFFFFFFFF)
+#define StartMultiXactIdIsValid(mxid)			((mxid) <= 0xFFFFFFFF)
+#define StartMultiXactOffsetIsValid(offset)		((offset) <= 0xFFFFFFFF)
+
 #define FirstCommandId	((CommandId) 0)
 #define InvalidCommandId	(~(CommandId)0)
 
diff --git a/src/include/catalog/pg_class.h b/src/include/catalog/pg_class.h
index 0fc2c093b0..0a7518df0d 100644
--- a/src/include/catalog/pg_class.h
+++ b/src/include/catalog/pg_class.h
@@ -123,7 +123,7 @@ CATALOG(pg_class,1259,RelationRelationId) BKI_BOOTSTRAP BKI_ROWTYPE_OID(83,Relat
 	Oid			relrewrite BKI_DEFAULT(0) BKI_LOOKUP_OPT(pg_class);
 
 	/* all Xids < this are frozen in this rel */
-	TransactionId relfrozenxid BKI_DEFAULT(3);	/* FirstNormalTransactionId */
+	TransactionId relfrozenxid BKI_DEFAULT(RECENTXMIN);	/* FirstNormalTransactionId */
 
 	/* all multixacts in this rel are >= this; it is really a MultiXactId */
 	TransactionId relminmxid BKI_DEFAULT(1);	/* FirstMultiXactId */
-- 
2.43.0
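
The range checks this patch adds for initdb's initial xid can be sketched in isolation. This is a minimal sketch, not the patch's actual option-parsing code: parse_start_xid is a hypothetical helper, FIRST_NORMAL_XID stands in for FirstNormalTransactionId, and it assumes initdb applies StartTransactionIdIsValid to the parsed value before the "greater than 2" check.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Same shape as the macro the patch adds to c.h. */
#define StartTransactionIdIsValid(xid)	((xid) <= 0xFFFFFFFF)

/* Stand-in for FirstNormalTransactionId (the diff checks start_xid < 3). */
#define FIRST_NORMAL_XID	3

/* strtoull must be able to represent every 64-bit input value. */
static_assert(sizeof(unsigned long long) >= sizeof(uint64_t),
			  "unsigned long long is narrower than 64 bits");

/*
 * Hypothetical helper: parse an initial xid given on the command line.
 * Returns true and stores the value in *xid only if the argument is a
 * complete number, fits in 32 bits, and is a normal transaction id.
 */
static bool
parse_start_xid(const char *arg, uint64_t *xid)
{
	char	   *endptr;
	unsigned long long value;

	errno = 0;
	value = strtoull(arg, &endptr, 0);

	/* not a number, trailing garbage, or out of range for strtoull */
	if (endptr == arg || *endptr != '\0' || errno != 0)
		return false;

	/* must still fit into a 32-bit TransactionId */
	if (!StartTransactionIdIsValid(value))
		return false;

	/* values below FirstNormalTransactionId are reserved */
	if (value < FIRST_NORMAL_XID)
		return false;

	*xid = (uint64_t) value;
	return true;
}
```

With this sketch, "seven", "2" and "4294967296" are all rejected, while "65535" and "0xFFFFFF" are accepted; the first and the last two correspond to the cases exercised in 001_initdb.pl.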

v9-0004-Get-rid-of-MultiXactMemberFreezeThreshold-call.patch (application/octet-stream)
From d703fe4538754534817596a0d4f51e06a8c3293f Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Wed, 23 Oct 2024 18:23:39 +0300
Subject: [PATCH v9 4/7] Get rid of MultiXactMemberFreezeThreshold call.

Since MaxMultiXactOffset is now UINT64_MAX, the MULTIXACT_MEMBER_SAFE_THRESHOLD
and MULTIXACT_MEMBER_DANGER_THRESHOLD values are no longer meaningful. Thus,
MultiXactMemberFreezeThreshold is no longer needed either.

Instead, switch to a MULTIXACT_MEMBER_AUTOVAC_THRESHOLD of 2^32 - 1 members.
It is used to determine whether we need to force autovacuum.

Author: Maxim Orlov <orlovmg@gmail.com>
---
 src/backend/access/transam/multixact.c | 117 +++----------------------
 src/backend/commands/vacuum.c          |   2 +-
 src/backend/postmaster/autovacuum.c    |   4 +-
 src/include/access/multixact.h         |   1 -
 4 files changed, 15 insertions(+), 109 deletions(-)

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 48e1c0160a..a817f539ee 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -204,10 +204,14 @@ MXOffsetToMemberOffset(MultiXactOffset offset)
 		member_in_group * sizeof(TransactionId);
 }
 
-/* Multixact members wraparound thresholds. */
-#define MULTIXACT_MEMBER_SAFE_THRESHOLD		(MaxMultiXactOffset / 2)
-#define MULTIXACT_MEMBER_DANGER_THRESHOLD	\
-	(MaxMultiXactOffset - MaxMultiXactOffset / 4)
+/*
+ * Multixact members warning threshold.
+ *
+ * If the difference between nextOffset and oldestOffset exceeds this value,
+ * we trigger autovacuum in order to release the disk space and reduce table
+ * bloat if possible.
+ */
+#define MULTIXACT_MEMBER_AUTOVAC_THRESHOLD		UINT64CONST(0xFFFFFFFF)
 
 static inline MultiXactId
 PreviousMultiXactId(MultiXactId multi)
@@ -2616,15 +2620,13 @@ GetOldestMultiXactId(void)
 }
 
 /*
- * Determine how aggressively we need to vacuum in order to prevent member
- * wraparound.
+ * Determine whether autovacuum is needed to reclaim multixact member space.
  *
  * To do so determine what's the oldest member offset and install the limit
  * info in MultiXactState, where it can be used to prevent overrun of old data
  * in the members SLRU area.
  *
- * The return value is true if emergency autovacuum is required and false
- * otherwise.
+ * The return value is true if autovacuum is required and false otherwise.
  */
 static bool
 SetOffsetVacuumLimit(bool is_startup)
@@ -2712,10 +2714,10 @@ SetOffsetVacuumLimit(bool is_startup)
 	LWLockRelease(MultiXactGenLock);
 
 	/*
-	 * Do we need an emergency autovacuum?	If we're not sure, assume yes.
+	 * Do we need autovacuum?	If we're not sure, assume yes.
 	 */
 	return !oldestOffsetKnown ||
-		(nextOffset - oldestOffset > MULTIXACT_MEMBER_SAFE_THRESHOLD);
+		(nextOffset - oldestOffset > MULTIXACT_MEMBER_AUTOVAC_THRESHOLD);
 }
 
 /*
@@ -2761,101 +2763,6 @@ find_multixact_start(MultiXactId multi, MultiXactOffset *result)
 	return true;
 }
 
-/*
- * Determine how many multixacts, and how many multixact members, currently
- * exist.  Return false if unable to determine.
- */
-static bool
-ReadMultiXactCounts(uint32 *multixacts, MultiXactOffset *members)
-{
-	MultiXactOffset nextOffset;
-	MultiXactOffset oldestOffset;
-	MultiXactId oldestMultiXactId;
-	MultiXactId nextMultiXactId;
-	bool		oldestOffsetKnown;
-
-	LWLockAcquire(MultiXactGenLock, LW_SHARED);
-	nextOffset = MultiXactState->nextOffset;
-	oldestMultiXactId = MultiXactState->oldestMultiXactId;
-	nextMultiXactId = MultiXactState->nextMXact;
-	oldestOffset = MultiXactState->oldestOffset;
-	oldestOffsetKnown = MultiXactState->oldestOffsetKnown;
-	LWLockRelease(MultiXactGenLock);
-
-	if (!oldestOffsetKnown)
-		return false;
-
-	*members = nextOffset - oldestOffset;
-	*multixacts = nextMultiXactId - oldestMultiXactId;
-	return true;
-}
-
-/*
- * Multixact members can be removed once the multixacts that refer to them
- * are older than every datminmxid.  autovacuum_multixact_freeze_max_age and
- * vacuum_multixact_freeze_table_age work together to make sure we never have
- * too many multixacts; we hope that, at least under normal circumstances,
- * this will also be sufficient to keep us from using too many offsets.
- * However, if the average multixact has many members, we might exhaust the
- * members space while still using few enough members that these limits fail
- * to trigger relminmxid advancement by VACUUM.  At that point, we'd have no
- * choice but to start failing multixact-creating operations with an error.
- *
- * To prevent that, if more than a threshold portion of the members space is
- * used, we effectively reduce autovacuum_multixact_freeze_max_age and
- * to a value just less than the number of multixacts in use.  We hope that
- * this will quickly trigger autovacuuming on the table or tables with the
- * oldest relminmxid, thus allowing datminmxid values to advance and removing
- * some members.
- *
- * As the fraction of the member space currently in use grows, we become
- * more aggressive in clamping this value.  That not only causes autovacuum
- * to ramp up, but also makes any manual vacuums the user issues more
- * aggressive.  This happens because vacuum_get_cutoffs() will clamp the
- * freeze table and the minimum freeze age cutoffs based on the effective
- * autovacuum_multixact_freeze_max_age this function returns.  In the worst
- * case, we'll claim the freeze_max_age to zero, and every vacuum of any
- * table will freeze every multixact.
- */
-int
-MultiXactMemberFreezeThreshold(void)
-{
-	MultiXactOffset members;
-	uint32		multixacts;
-	uint32		victim_multixacts;
-	double		fraction;
-	int			result;
-
-	/* If we can't determine member space utilization, assume the worst. */
-	if (!ReadMultiXactCounts(&multixacts, &members))
-		return 0;
-
-	/* If member space utilization is low, no special action is required. */
-	if (members <= MULTIXACT_MEMBER_SAFE_THRESHOLD)
-		return autovacuum_multixact_freeze_max_age;
-
-	/*
-	 * Compute a target for relminmxid advancement.  The number of multixacts
-	 * we try to eliminate from the system is based on how far we are past
-	 * MULTIXACT_MEMBER_SAFE_THRESHOLD.
-	 */
-	fraction = (double) (members - MULTIXACT_MEMBER_SAFE_THRESHOLD);
-	fraction /= (double) (MULTIXACT_MEMBER_DANGER_THRESHOLD - MULTIXACT_MEMBER_SAFE_THRESHOLD);
-
-	victim_multixacts = multixacts * fraction;
-
-	/* fraction could be > 1.0, but lowest possible freeze age is zero */
-	if (victim_multixacts > multixacts)
-		return 0;
-	result = multixacts - victim_multixacts;
-
-	/*
-	 * Clamp to autovacuum_multixact_freeze_max_age, so that we never make
-	 * autovacuum less aggressive than it would otherwise be.
-	 */
-	return Min(result, autovacuum_multixact_freeze_max_age);
-}
-
 typedef struct mxtruncinfo
 {
 	int64		earliestExistingPage;
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 86f36b3695..e7506e268a 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -1133,7 +1133,7 @@ vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
 	 * normally autovacuum_multixact_freeze_max_age, but may be less if we are
 	 * short of multixact member space.
 	 */
-	effective_multixact_freeze_max_age = MultiXactMemberFreezeThreshold();
+	effective_multixact_freeze_max_age = autovacuum_multixact_freeze_max_age;
 
 	/*
 	 * Almost ready to set freeze output parameters; check if OldestXmin or
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index dc3cf87aba..180bb7e96e 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -1122,7 +1122,7 @@ do_start_worker(void)
 
 	/* Also determine the oldest datminmxid we will consider. */
 	recentMulti = ReadNextMultiXactId();
-	multiForceLimit = recentMulti - MultiXactMemberFreezeThreshold();
+	multiForceLimit = recentMulti - autovacuum_multixact_freeze_max_age;
 	if (multiForceLimit < FirstMultiXactId)
 		multiForceLimit -= FirstMultiXactId;
 
@@ -1915,7 +1915,7 @@ do_autovacuum(void)
 	 * normally autovacuum_multixact_freeze_max_age, but may be less if we are
 	 * short of multixact member space.
 	 */
-	effective_multixact_freeze_max_age = MultiXactMemberFreezeThreshold();
+	effective_multixact_freeze_max_age = autovacuum_multixact_freeze_max_age;
 
 	/*
 	 * Find the pg_database entry and select the default freeze ages. We use
diff --git a/src/include/access/multixact.h b/src/include/access/multixact.h
index 90583634ec..5aefbddce3 100644
--- a/src/include/access/multixact.h
+++ b/src/include/access/multixact.h
@@ -143,7 +143,6 @@ extern void MultiXactSetNextMXact(MultiXactId nextMulti,
 extern void MultiXactAdvanceNextMXact(MultiXactId minMulti,
 									  MultiXactOffset minMultiOffset);
 extern void MultiXactAdvanceOldest(MultiXactId oldestMulti, Oid oldestMultiDB);
-extern int	MultiXactMemberFreezeThreshold(void);
 
 extern void multixact_twophase_recover(TransactionId xid, uint16 info,
 									   void *recdata, uint32 len);
-- 
2.43.0
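
The decision that SetOffsetVacuumLimit() is left with after this patch can be sketched as a stand-alone function. This is only an illustration under stated assumptions: member_autovacuum_needed is a hypothetical name, the typedef stands in for the 64-bit MultiXactOffset from patch 2/7, and the locking and SLRU lookups of the real function are omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed stand-in for MultiXactOffset after the switch to 64 bits. */
typedef uint64_t MultiXactOffset;

static_assert(sizeof(MultiXactOffset) == 8,
			  "this sketch assumes 64-bit multixact offsets");

/* Same value as MULTIXACT_MEMBER_AUTOVAC_THRESHOLD in the patch: 2^32 - 1. */
#define MEMBER_AUTOVAC_THRESHOLD	UINT64_C(0xFFFFFFFF)

/*
 * Hypothetical helper mirroring the return expression of
 * SetOffsetVacuumLimit(): request autovacuum if the oldest offset is
 * unknown, or if more than the threshold of members are in use.
 */
static bool
member_autovacuum_needed(MultiXactOffset next_offset,
						 MultiXactOffset oldest_offset,
						 bool oldest_offset_known)
{
	/* if we're not sure, assume yes */
	if (!oldest_offset_known)
		return true;

	return (next_offset - oldest_offset) > MEMBER_AUTOVAC_THRESHOLD;
}
```

Unlike the removed MultiXactMemberFreezeThreshold(), nothing here scales the freeze age: the result is a plain yes/no, and the comparison is strict, so exactly 2^32 - 1 members in use does not yet trigger autovacuum.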

v9-0002-Use-64-bit-multixact-offsets.patch (application/octet-stream)
From 8cc5477a23b383132fddd4386492c0ffe6b63fb7 Mon Sep 17 00:00:00 2001
From: Maxim Orlov <m.orlov@postgrespro.ru>
Date: Wed, 6 Mar 2024 11:11:33 +0300
Subject: [PATCH v9 2/7] Use 64-bit multixact offsets.

Author: Maxim Orlov <orlovmg@gmail.com>
---
 src/backend/access/transam/multixact.c | 170 +------------------------
 src/bin/pg_resetwal/pg_resetwal.c      |   2 +-
 src/bin/pg_resetwal/t/001_basic.pl     |   2 +-
 src/include/access/multixact.h         |   2 +-
 src/include/c.h                        |   2 +-
 5 files changed, 10 insertions(+), 168 deletions(-)

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index ab90912ed3..48e1c0160a 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -96,14 +96,6 @@
 /*
  * Defines for MultiXactOffset page sizes.  A page is the same BLCKSZ as is
  * used everywhere else in Postgres.
- *
- * Note: because MultiXactOffsets are 32 bits and wrap around at 0xFFFFFFFF,
- * MultiXact page numbering also wraps around at
- * 0xFFFFFFFF/MULTIXACT_OFFSETS_PER_PAGE, and segment numbering at
- * 0xFFFFFFFF/MULTIXACT_OFFSETS_PER_PAGE/SLRU_PAGES_PER_SEGMENT.  We need
- * take no explicit notice of that fact in this module, except when comparing
- * segment and page numbers in TruncateMultiXact (see
- * MultiXactOffsetPagePrecedes).
  */
 
 /* We need four bytes per offset */
@@ -272,9 +264,6 @@ typedef struct MultiXactStateData
 	MultiXactId multiStopLimit;
 	MultiXactId multiWrapLimit;
 
-	/* support for members anti-wraparound measures */
-	MultiXactOffset offsetStopLimit;	/* known if oldestOffsetKnown */
-
 	/*
 	 * This is used to sleep until a multixact offset is written when we want
 	 * to create the next one.
@@ -409,8 +398,6 @@ static bool MultiXactOffsetPrecedes(MultiXactOffset offset1,
 									MultiXactOffset offset2);
 static void ExtendMultiXactOffset(MultiXactId multi);
 static void ExtendMultiXactMember(MultiXactOffset offset, int nmembers);
-static bool MultiXactOffsetWouldWrap(MultiXactOffset boundary,
-									 MultiXactOffset start, uint32 distance);
 static bool SetOffsetVacuumLimit(bool is_startup);
 static bool find_multixact_start(MultiXactId multi, MultiXactOffset *result);
 static void WriteMZeroPageXlogRec(int64 pageno, uint8 info);
@@ -1164,78 +1151,6 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 	else
 		*offset = nextOffset;
 
-	/*----------
-	 * Protect against overrun of the members space as well, with the
-	 * following rules:
-	 *
-	 * If we're past offsetStopLimit, refuse to generate more multis.
-	 * If we're close to offsetStopLimit, emit a warning.
-	 *
-	 * Arbitrarily, we start emitting warnings when we're 20 segments or less
-	 * from offsetStopLimit.
-	 *
-	 * Note we haven't updated the shared state yet, so if we fail at this
-	 * point, the multixact ID we grabbed can still be used by the next guy.
-	 *
-	 * Note that there is no point in forcing autovacuum runs here: the
-	 * multixact freeze settings would have to be reduced for that to have any
-	 * effect.
-	 *----------
-	 */
-#define OFFSET_WARN_SEGMENTS	20
-	if (MultiXactState->oldestOffsetKnown &&
-		MultiXactOffsetWouldWrap(MultiXactState->offsetStopLimit, nextOffset,
-								 nmembers))
-	{
-		/* see comment in the corresponding offsets wraparound case */
-		SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER);
-
-		ereport(ERROR,
-				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg("multixact \"members\" limit exceeded"),
-				 errdetail_plural("This command would create a multixact with %u members, but the remaining space is only enough for %u member.",
-								  "This command would create a multixact with %u members, but the remaining space is only enough for %u members.",
-								  MultiXactState->offsetStopLimit - nextOffset - 1,
-								  nmembers,
-								  MultiXactState->offsetStopLimit - nextOffset - 1),
-				 errhint("Execute a database-wide VACUUM in database with OID %u with reduced \"vacuum_multixact_freeze_min_age\" and \"vacuum_multixact_freeze_table_age\" settings.",
-						 MultiXactState->oldestMultiXactDB)));
-	}
-
-	/*
-	 * Check whether we should kick autovacuum into action, to prevent members
-	 * wraparound. NB we use a much larger window to trigger autovacuum than
-	 * just the warning limit. The warning is just a measure of last resort -
-	 * this is in line with GetNewTransactionId's behaviour.
-	 */
-	if (!MultiXactState->oldestOffsetKnown ||
-		(MultiXactState->nextOffset - MultiXactState->oldestOffset
-		 > MULTIXACT_MEMBER_SAFE_THRESHOLD))
-	{
-		/*
-		 * To avoid swamping the postmaster with signals, we issue the autovac
-		 * request only when crossing a segment boundary. With default
-		 * compilation settings that's roughly after 50k members.  This still
-		 * gives plenty of chances before we get into real trouble.
-		 */
-		if ((MXOffsetToMemberPage(nextOffset) / SLRU_PAGES_PER_SEGMENT) !=
-			(MXOffsetToMemberPage(nextOffset + nmembers) / SLRU_PAGES_PER_SEGMENT))
-			SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER);
-	}
-
-	if (MultiXactState->oldestOffsetKnown &&
-		MultiXactOffsetWouldWrap(MultiXactState->offsetStopLimit,
-								 nextOffset,
-								 nmembers + MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT * OFFSET_WARN_SEGMENTS))
-		ereport(WARNING,
-				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg_plural("database with OID %u must be vacuumed before %d more multixact member is used",
-							   "database with OID %u must be vacuumed before %d more multixact members are used",
-							   MultiXactState->offsetStopLimit - nextOffset + nmembers,
-							   MultiXactState->oldestMultiXactDB,
-							   MultiXactState->offsetStopLimit - nextOffset + nmembers),
-				 errhint("Execute a database-wide VACUUM in that database with reduced \"vacuum_multixact_freeze_min_age\" and \"vacuum_multixact_freeze_table_age\" settings.")));
-
 	ExtendMultiXactMember(nextOffset, nmembers);
 
 	/*
@@ -2721,8 +2636,6 @@ SetOffsetVacuumLimit(bool is_startup)
 	MultiXactOffset nextOffset;
 	bool		oldestOffsetKnown = false;
 	bool		prevOldestOffsetKnown;
-	MultiXactOffset offsetStopLimit = 0;
-	MultiXactOffset prevOffsetStopLimit;
 
 	/*
 	 * NB: Have to prevent concurrent truncation, we might otherwise try to
@@ -2737,7 +2650,6 @@ SetOffsetVacuumLimit(bool is_startup)
 	nextOffset = MultiXactState->nextOffset;
 	prevOldestOffsetKnown = MultiXactState->oldestOffsetKnown;
 	prevOldestOffset = MultiXactState->oldestOffset;
-	prevOffsetStopLimit = MultiXactState->offsetStopLimit;
 	Assert(MultiXactState->finishedStartup);
 	LWLockRelease(MultiXactGenLock);
 
@@ -2768,11 +2680,7 @@ SetOffsetVacuumLimit(bool is_startup)
 		oldestOffsetKnown =
 			find_multixact_start(oldestMultiXactId, &oldestOffset);
 
-		if (oldestOffsetKnown)
-			ereport(DEBUG1,
-					(errmsg_internal("oldest MultiXactId member is at offset %u",
-									 oldestOffset)));
-		else
+		if (!oldestOffsetKnown)
 			ereport(LOG,
 					(errmsg("MultiXact member wraparound protections are disabled because oldest checkpointed MultiXact %u does not exist on disk",
 							oldestMultiXactId)));
@@ -2785,24 +2693,7 @@ SetOffsetVacuumLimit(bool is_startup)
 	 * overrun of old data in the members SLRU area. We can only do so if the
 	 * oldest offset is known though.
 	 */
-	if (oldestOffsetKnown)
-	{
-		/* move back to start of the corresponding segment */
-		offsetStopLimit = oldestOffset - (oldestOffset %
-										  (MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT));
-
-		/* always leave one segment before the wraparound point */
-		offsetStopLimit -= (MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT);
-
-		if (!prevOldestOffsetKnown && !is_startup)
-			ereport(LOG,
-					(errmsg("MultiXact member wraparound protections are now enabled")));
-
-		ereport(DEBUG1,
-				(errmsg_internal("MultiXact member stop limit is now %u based on MultiXact %u",
-								 offsetStopLimit, oldestMultiXactId)));
-	}
-	else if (prevOldestOffsetKnown)
+	if (!oldestOffsetKnown && prevOldestOffsetKnown)
 	{
 		/*
 		 * If we failed to get the oldest offset this time, but we have a
@@ -2812,14 +2703,12 @@ SetOffsetVacuumLimit(bool is_startup)
 		 */
 		oldestOffset = prevOldestOffset;
 		oldestOffsetKnown = true;
-		offsetStopLimit = prevOffsetStopLimit;
 	}
 
 	/* Install the computed values */
 	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
 	MultiXactState->oldestOffset = oldestOffset;
 	MultiXactState->oldestOffsetKnown = oldestOffsetKnown;
-	MultiXactState->offsetStopLimit = offsetStopLimit;
 	LWLockRelease(MultiXactGenLock);
 
 	/*
@@ -2829,54 +2718,6 @@ SetOffsetVacuumLimit(bool is_startup)
 		(nextOffset - oldestOffset > MULTIXACT_MEMBER_SAFE_THRESHOLD);
 }
 
-/*
- * Return whether adding "distance" to "start" would move past "boundary".
- *
- * We use this to determine whether the addition is "wrapping around" the
- * boundary point, hence the name.  The reason we don't want to use the regular
- * 2^31-modulo arithmetic here is that we want to be able to use the whole of
- * the 2^32-1 space here, allowing for more multixacts than would fit
- * otherwise.
- */
-static bool
-MultiXactOffsetWouldWrap(MultiXactOffset boundary, MultiXactOffset start,
-						 uint32 distance)
-{
-	MultiXactOffset finish;
-
-	/*
-	 * Note that offset number 0 is not used (see GetMultiXactIdMembers), so
-	 * if the addition wraps around the UINT_MAX boundary, skip that value.
-	 */
-	finish = start + distance;
-	if (finish < start)
-		finish++;
-
-	/*-----------------------------------------------------------------------
-	 * When the boundary is numerically greater than the starting point, any
-	 * value numerically between the two is not wrapped:
-	 *
-	 *	<----S----B---->
-	 *	[---)			 = F wrapped past B (and UINT_MAX)
-	 *		 [---)		 = F not wrapped
-	 *			  [----] = F wrapped past B
-	 *
-	 * When the boundary is numerically less than the starting point (i.e. the
-	 * UINT_MAX wraparound occurs somewhere in between) then all values in
-	 * between are wrapped:
-	 *
-	 *	<----B----S---->
-	 *	[---)			 = F not wrapped past B (but wrapped past UINT_MAX)
-	 *		 [---)		 = F wrapped past B (and UINT_MAX)
-	 *			  [----] = F not wrapped
-	 *-----------------------------------------------------------------------
-	 */
-	if (start < boundary)
-		return finish >= boundary || finish < start;
-	else
-		return finish >= boundary && finish < start;
-}
-
 /*
  * Find the starting offset of the given MultiXactId.
  *
@@ -2998,8 +2839,9 @@ MultiXactMemberFreezeThreshold(void)
 	 * we try to eliminate from the system is based on how far we are past
 	 * MULTIXACT_MEMBER_SAFE_THRESHOLD.
 	 */
-	fraction = (double) (members - MULTIXACT_MEMBER_SAFE_THRESHOLD) /
-		(MULTIXACT_MEMBER_DANGER_THRESHOLD - MULTIXACT_MEMBER_SAFE_THRESHOLD);
+	fraction = (double) (members - MULTIXACT_MEMBER_SAFE_THRESHOLD);
+	fraction /= (double) (MULTIXACT_MEMBER_DANGER_THRESHOLD - MULTIXACT_MEMBER_SAFE_THRESHOLD);
+
 	victim_multixacts = multixacts * fraction;
 
 	/* fraction could be > 1.0, but lowest possible freeze age is zero */
@@ -3345,7 +3187,7 @@ MultiXactIdPrecedesOrEquals(MultiXactId multi1, MultiXactId multi2)
 static bool
 MultiXactOffsetPrecedes(MultiXactOffset offset1, MultiXactOffset offset2)
 {
-	int32		diff = (int32) (offset1 - offset2);
+	int64		diff = (int64) (offset1 - offset2);
 
 	return (diff < 0);
 }
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index 985cd06802..1af2ce4b93 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -264,7 +264,7 @@ main(int argc, char *argv[])
 
 			case 'O':
 				errno = 0;
-				set_mxoff = strtoul(optarg, &endptr, 0);
+				set_mxoff = strtou64(optarg, &endptr, 0);
 				if (endptr == optarg || *endptr != '\0' || errno != 0)
 				{
 					pg_log_error("invalid argument for option %s", "-O");
diff --git a/src/bin/pg_resetwal/t/001_basic.pl b/src/bin/pg_resetwal/t/001_basic.pl
index 9829e48106..f8a8eef44d 100644
--- a/src/bin/pg_resetwal/t/001_basic.pl
+++ b/src/bin/pg_resetwal/t/001_basic.pl
@@ -206,7 +206,7 @@ push @cmd,
   sprintf("%d,%d", hex($files[0]) == 0 ? 3 : hex($files[0]), hex($files[-1]));
 
 @files = get_slru_files('pg_multixact/offsets');
-$mult = 32 * $blcksz / 4;
+$mult = 32 * $blcksz / 8;
 # -m argument is "new,old"
 push @cmd, '-m',
   sprintf("%d,%d",
diff --git a/src/include/access/multixact.h b/src/include/access/multixact.h
index 7ffd256c74..90583634ec 100644
--- a/src/include/access/multixact.h
+++ b/src/include/access/multixact.h
@@ -27,7 +27,7 @@
 
 #define MultiXactIdIsValid(multi) ((multi) != InvalidMultiXactId)
 
-#define MaxMultiXactOffset	((MultiXactOffset) 0xFFFFFFFF)
+#define MaxMultiXactOffset	UINT64CONST(0xFFFFFFFFFFFFFFFF)
 
 /*
  * Possible multixact lock modes ("status").  The first four modes are for
diff --git a/src/include/c.h b/src/include/c.h
index 0a548d69d7..e1b3187d0b 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -664,7 +664,7 @@ typedef uint32 SubTransactionId;
 /* MultiXactId must be equivalent to TransactionId, to fit in t_xmax */
 typedef TransactionId MultiXactId;
 
-typedef uint32 MultiXactOffset;
+typedef uint64 MultiXactOffset;
 
 typedef uint32 CommandId;
 
-- 
2.43.0
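
Two consequences of widening MultiXactOffset in this patch can be checked with a few lines of stand-alone C. This is a sketch under assumptions, not code from the patch: offset_precedes copies the shape of MultiXactOffsetPrecedes(), offsets_per_segment is a hypothetical helper reproducing the `32 * $blcksz / 8` arithmetic from 001_basic.pl, and the typedef stands in for the definition in c.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed stand-in for the new definition in c.h. */
typedef uint64_t MultiXactOffset;

static_assert(sizeof(MultiXactOffset) == 8,
			  "this sketch assumes 64-bit multixact offsets");

/* Pages per SLRU segment, as used by the pg_resetwal test (32). */
#define SLRU_PAGES_PER_SEGMENT	32

/*
 * Same shape as MultiXactOffsetPrecedes() after the patch: the unsigned
 * difference is reinterpreted as signed, so the comparison is modular.
 * The conversion relies on two's-complement behaviour, which is what
 * PostgreSQL's supported compilers provide.
 */
static bool
offset_precedes(MultiXactOffset offset1, MultiXactOffset offset2)
{
	int64_t		diff = (int64_t) (offset1 - offset2);

	return diff < 0;
}

/*
 * Hypothetical helper: number of offset entries stored in one segment
 * of pg_multixact/offsets for a given block size.
 */
static uint64_t
offsets_per_segment(uint32_t blcksz)
{
	return (uint64_t) SLRU_PAGES_PER_SEGMENT * blcksz / sizeof(MultiXactOffset);
}
```

With the default 8 kB block size, a segment holds 32768 offsets instead of the 65536 that fit when each entry was four bytes, which is why the multiplier in the pg_resetwal test changes.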

v9-0001-Use-64-bit-format-output-for-multixact-offsets.patch (application/octet-stream)
From bc77e08c2afae2d0e4ae9222dfff1a77ef2b3f18 Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Wed, 7 Aug 2024 16:35:22 +0300
Subject: [PATCH v9 1/7] Use 64-bit format output for multixact offsets

Author: Maxim Orlov <orlovmg@gmail.com>
---
 src/backend/access/rmgrdesc/mxactdesc.c   |  9 ++++----
 src/backend/access/rmgrdesc/xlogdesc.c    |  4 ++--
 src/backend/access/transam/multixact.c    | 26 +++++++++++++----------
 src/backend/access/transam/xlogrecovery.c |  5 +++--
 src/bin/pg_controldata/pg_controldata.c   |  4 ++--
 src/bin/pg_resetwal/pg_resetwal.c         |  8 +++----
 6 files changed, 31 insertions(+), 25 deletions(-)

diff --git a/src/backend/access/rmgrdesc/mxactdesc.c b/src/backend/access/rmgrdesc/mxactdesc.c
index 3e8ad4d5ef..1b486de38c 100644
--- a/src/backend/access/rmgrdesc/mxactdesc.c
+++ b/src/backend/access/rmgrdesc/mxactdesc.c
@@ -65,8 +65,8 @@ multixact_desc(StringInfo buf, XLogReaderState *record)
 		xl_multixact_create *xlrec = (xl_multixact_create *) rec;
 		int			i;
 
-		appendStringInfo(buf, "%u offset %u nmembers %d: ", xlrec->mid,
-						 xlrec->moff, xlrec->nmembers);
+		appendStringInfo(buf, "%u offset %llu nmembers %d: ", xlrec->mid,
+						 (unsigned long long) xlrec->moff, xlrec->nmembers);
 		for (i = 0; i < xlrec->nmembers; i++)
 			out_member(buf, &xlrec->members[i]);
 	}
@@ -74,9 +74,10 @@ multixact_desc(StringInfo buf, XLogReaderState *record)
 	{
 		xl_multixact_truncate *xlrec = (xl_multixact_truncate *) rec;
 
-		appendStringInfo(buf, "offsets [%u, %u), members [%u, %u)",
+		appendStringInfo(buf, "offsets [%u, %u), members [%llu, %llu)",
 						 xlrec->startTruncOff, xlrec->endTruncOff,
-						 xlrec->startTruncMemb, xlrec->endTruncMemb);
+						 (unsigned long long) xlrec->startTruncMemb,
+						 (unsigned long long) xlrec->endTruncMemb);
 	}
 }
 
diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c
index 363294d623..aaa19c81c8 100644
--- a/src/backend/access/rmgrdesc/xlogdesc.c
+++ b/src/backend/access/rmgrdesc/xlogdesc.c
@@ -66,7 +66,7 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 		CheckPoint *checkpoint = (CheckPoint *) rec;
 
 		appendStringInfo(buf, "redo %X/%X; "
-						 "tli %u; prev tli %u; fpw %s; wal_level %s; xid %u:%u; oid %u; multi %u; offset %u; "
+						 "tli %u; prev tli %u; fpw %s; wal_level %s; xid %u:%u; oid %u; multi %u; offset %llu; "
 						 "oldest xid %u in DB %u; oldest multi %u in DB %u; "
 						 "oldest/newest commit timestamp xid: %u/%u; "
 						 "oldest running xid %u; %s",
@@ -79,7 +79,7 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 						 XidFromFullTransactionId(checkpoint->nextXid),
 						 checkpoint->nextOid,
 						 checkpoint->nextMulti,
-						 checkpoint->nextMultiOffset,
+						 (unsigned long long) checkpoint->nextMultiOffset,
 						 checkpoint->oldestXid,
 						 checkpoint->oldestXidDB,
 						 checkpoint->oldestMulti,
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 8c37d7eba7..ab90912ed3 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1264,7 +1264,8 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 
 	LWLockRelease(MultiXactGenLock);
 
-	debug_elog4(DEBUG2, "GetNew: returning %u offset %u", result, *offset);
+	debug_elog4(DEBUG2, "GetNew: returning %u offset %llu", result,
+				(unsigned long long) *offset);
 	return result;
 }
 
@@ -2293,8 +2294,9 @@ MultiXactGetCheckptMulti(bool is_shutdown,
 	LWLockRelease(MultiXactGenLock);
 
 	debug_elog6(DEBUG2,
-				"MultiXact: checkpoint is nextMulti %u, nextOffset %u, oldestMulti %u in DB %u",
-				*nextMulti, *nextMultiOffset, *oldestMulti, *oldestMultiDB);
+				"MultiXact: checkpoint is nextMulti %u, nextOffset %llu, oldestMulti %u in DB %u",
+				*nextMulti, (unsigned long long) *nextMultiOffset, *oldestMulti,
+				*oldestMultiDB);
 }
 
 /*
@@ -2328,8 +2330,8 @@ void
 MultiXactSetNextMXact(MultiXactId nextMulti,
 					  MultiXactOffset nextMultiOffset)
 {
-	debug_elog4(DEBUG2, "MultiXact: setting next multi to %u offset %u",
-				nextMulti, nextMultiOffset);
+	debug_elog4(DEBUG2, "MultiXact: setting next multi to %u offset %llu",
+				nextMulti, (unsigned long long) nextMultiOffset);
 	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
 	MultiXactState->nextMXact = nextMulti;
 	MultiXactState->nextOffset = nextMultiOffset;
@@ -2519,8 +2521,8 @@ MultiXactAdvanceNextMXact(MultiXactId minMulti,
 	}
 	if (MultiXactOffsetPrecedes(MultiXactState->nextOffset, minMultiOffset))
 	{
-		debug_elog3(DEBUG2, "MultiXact: setting next offset to %u",
-					minMultiOffset);
+		debug_elog3(DEBUG2, "MultiXact: setting next offset to %llu",
+					(unsigned long long) minMultiOffset);
 		MultiXactState->nextOffset = minMultiOffset;
 	}
 	LWLockRelease(MultiXactGenLock);
@@ -3211,11 +3213,12 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 
 	elog(DEBUG1, "performing multixact truncation: "
 		 "offsets [%u, %u), offsets segments [%llx, %llx), "
-		 "members [%u, %u), members segments [%llx, %llx)",
+		 "members [%llu, %llu), members segments [%llx, %llx)",
 		 oldestMulti, newOldestMulti,
 		 (unsigned long long) MultiXactIdToOffsetSegment(oldestMulti),
 		 (unsigned long long) MultiXactIdToOffsetSegment(newOldestMulti),
-		 oldestOffset, newOldestOffset,
+		 (unsigned long long) oldestOffset,
+		 (unsigned long long) newOldestOffset,
 		 (unsigned long long) MXOffsetToMemberSegment(oldestOffset),
 		 (unsigned long long) MXOffsetToMemberSegment(newOldestOffset));
 
@@ -3471,11 +3474,12 @@ multixact_redo(XLogReaderState *record)
 
 		elog(DEBUG1, "replaying multixact truncation: "
 			 "offsets [%u, %u), offsets segments [%llx, %llx), "
-			 "members [%u, %u), members segments [%llx, %llx)",
+			 "members [%llu, %llu), members segments [%llx, %llx)",
 			 xlrec.startTruncOff, xlrec.endTruncOff,
 			 (unsigned long long) MultiXactIdToOffsetSegment(xlrec.startTruncOff),
 			 (unsigned long long) MultiXactIdToOffsetSegment(xlrec.endTruncOff),
-			 xlrec.startTruncMemb, xlrec.endTruncMemb,
+			 (unsigned long long) xlrec.startTruncMemb,
+			 (unsigned long long) xlrec.endTruncMemb,
 			 (unsigned long long) MXOffsetToMemberSegment(xlrec.startTruncMemb),
 			 (unsigned long long) MXOffsetToMemberSegment(xlrec.endTruncMemb));
 
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 05c738d661..727b6e744f 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -876,8 +876,9 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 							 U64FromFullTransactionId(checkPoint.nextXid),
 							 checkPoint.nextOid)));
 	ereport(DEBUG1,
-			(errmsg_internal("next MultiXactId: %u; next MultiXactOffset: %u",
-							 checkPoint.nextMulti, checkPoint.nextMultiOffset)));
+			(errmsg_internal("next MultiXactId: %u; next MultiXactOffset: %llu",
+							 checkPoint.nextMulti,
+							 (unsigned long long) checkPoint.nextMultiOffset)));
 	ereport(DEBUG1,
 			(errmsg_internal("oldest unfrozen transaction ID: %u, in database %u",
 							 checkPoint.oldestXid, checkPoint.oldestXidDB)));
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index 93a05d80ca..43b6727570 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -253,8 +253,8 @@ main(int argc, char *argv[])
 		   ControlFile->checkPointCopy.nextOid);
 	printf(_("Latest checkpoint's NextMultiXactId:  %u\n"),
 		   ControlFile->checkPointCopy.nextMulti);
-	printf(_("Latest checkpoint's NextMultiOffset:  %u\n"),
-		   ControlFile->checkPointCopy.nextMultiOffset);
+	printf(_("Latest checkpoint's NextMultiOffset:  %llu\n"),
+		   (unsigned long long) ControlFile->checkPointCopy.nextMultiOffset);
 	printf(_("Latest checkpoint's oldestXID:        %u\n"),
 		   ControlFile->checkPointCopy.oldestXid);
 	printf(_("Latest checkpoint's oldestXID's DB:   %u\n"),
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index e9dcb5a6d8..985cd06802 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -737,8 +737,8 @@ PrintControlValues(bool guessed)
 		   ControlFile.checkPointCopy.nextOid);
 	printf(_("Latest checkpoint's NextMultiXactId:  %u\n"),
 		   ControlFile.checkPointCopy.nextMulti);
-	printf(_("Latest checkpoint's NextMultiOffset:  %u\n"),
-		   ControlFile.checkPointCopy.nextMultiOffset);
+	printf(_("Latest checkpoint's NextMultiOffset:  %llu\n"),
+		   (unsigned long long) ControlFile.checkPointCopy.nextMultiOffset);
 	printf(_("Latest checkpoint's oldestXID:        %u\n"),
 		   ControlFile.checkPointCopy.oldestXid);
 	printf(_("Latest checkpoint's oldestXID's DB:   %u\n"),
@@ -809,8 +809,8 @@ PrintNewControlValues(void)
 
 	if (set_mxoff != -1)
 	{
-		printf(_("NextMultiOffset:                      %u\n"),
-			   ControlFile.checkPointCopy.nextMultiOffset);
+		printf(_("NextMultiOffset:                      %llu\n"),
+			   (unsigned long long) ControlFile.checkPointCopy.nextMultiOffset);
 	}
 
 	if (set_oid != 0)
-- 
2.43.0
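
Most hunks in the patch above apply one pattern: once MultiXactOffset is 64 bits wide, every %u that printed it becomes %llu with an explicit (unsigned long long) cast. A minimal standalone sketch of that pattern follows. The typedef and the helper name format_next_offset are assumptions made for illustration and are not part of the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Stand-in for the widened MultiXactOffset (assumed to be a 64-bit unsigned). */
typedef uint64_t MultiXactOffset;

/*
 * Hypothetical helper: format an offset the way the patched messages do.
 * The cast keeps the format string valid no matter how uint64_t is defined
 * on the platform.  Returns what snprintf returns.
 */
static inline int
format_next_offset(char *buf, size_t buflen, MultiXactOffset off)
{
	assert(buf != NULL);
	return snprintf(buf, buflen, "next MultiXactOffset: %llu",
					(unsigned long long) off);
}
```

Called with (MultiXactOffset) 1 << 32, the helper writes "next MultiXactOffset: 4294967296", a value that no longer fits in 32 bits.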

v9-0003-Make-pg_upgrade-convert-multixact-offsets.patch (application/octet-stream)
From d731c49b8c51d57ee4ae0160a4668f9f99d4a2bc Mon Sep 17 00:00:00 2001
From: Maxim Orlov <m.orlov@postgrespro.ru>
Date: Tue, 13 Aug 2024 14:44:50 +0300
Subject: [PATCH v9 3/7] Make pg_upgrade convert multixact offsets.

Author: Maxim Orlov <orlovmg@gmail.com>
Author: Yura Sokolov <y.sokolov@postgrespro.ru>
---
 src/bin/pg_upgrade/Makefile     |   1 +
 src/bin/pg_upgrade/meson.build  |   1 +
 src/bin/pg_upgrade/pg_upgrade.c |  42 ++-
 src/bin/pg_upgrade/pg_upgrade.h |  14 +-
 src/bin/pg_upgrade/segresize.c  | 527 ++++++++++++++++++++++++++++++++
 5 files changed, 580 insertions(+), 5 deletions(-)
 create mode 100644 src/bin/pg_upgrade/segresize.c

diff --git a/src/bin/pg_upgrade/Makefile b/src/bin/pg_upgrade/Makefile
index f83d2b5d30..70908d63a3 100644
--- a/src/bin/pg_upgrade/Makefile
+++ b/src/bin/pg_upgrade/Makefile
@@ -21,6 +21,7 @@ OBJS = \
 	info.o \
 	option.o \
 	parallel.o \
+	segresize.o \
 	pg_upgrade.o \
 	relfilenumber.o \
 	server.o \
diff --git a/src/bin/pg_upgrade/meson.build b/src/bin/pg_upgrade/meson.build
index 3d88419674..16f898ba14 100644
--- a/src/bin/pg_upgrade/meson.build
+++ b/src/bin/pg_upgrade/meson.build
@@ -10,6 +10,7 @@ pg_upgrade_sources = files(
   'info.c',
   'option.c',
   'parallel.c',
+  'segresize.c',
   'pg_upgrade.c',
   'relfilenumber.c',
   'server.c',
diff --git a/src/bin/pg_upgrade/pg_upgrade.c b/src/bin/pg_upgrade/pg_upgrade.c
index 663235816f..1654e877c0 100644
--- a/src/bin/pg_upgrade/pg_upgrade.c
+++ b/src/bin/pg_upgrade/pg_upgrade.c
@@ -750,8 +750,42 @@ copy_xact_xlog_xid(void)
 	if (old_cluster.controldata.cat_ver >= MULTIXACT_FORMATCHANGE_CAT_VER &&
 		new_cluster.controldata.cat_ver >= MULTIXACT_FORMATCHANGE_CAT_VER)
 	{
-		copy_subdir_files("pg_multixact/offsets", "pg_multixact/offsets");
-		copy_subdir_files("pg_multixact/members", "pg_multixact/members");
+		/*
+		 * If the old server is before the MULTIXACTOFFSET_FORMATCHANGE_CAT_VER
+		 * it must have 32-bit multixid offsets, thus it should be converted.
+		 */
+		if (old_cluster.controldata.cat_ver < MULTIXACTOFFSET_FORMATCHANGE_CAT_VER &&
+			new_cluster.controldata.cat_ver >= MULTIXACTOFFSET_FORMATCHANGE_CAT_VER)
+		{
+			MultiXactOffset		oldest_offset,
+								next_offset;
+
+			remove_new_subdir("pg_multixact/offsets", false);
+			prep_status("Converting pg_multixact/offsets to 64-bit");
+			oldest_offset = convert_multixact_offsets();
+			check_ok();
+
+			remove_new_subdir("pg_multixact/members", false);
+			prep_status("Converting pg_multixact/members");
+			convert_multixact_members(oldest_offset);
+			check_ok();
+
+			next_offset = old_cluster.controldata.chkpnt_nxtmxoff;
+			if (oldest_offset)
+			{
+				if (next_offset < oldest_offset)
+					next_offset += ((MultiXactOffset) 1 << 32) - 1;
+
+				next_offset -= oldest_offset - 1;
+
+				old_cluster.controldata.chkpnt_nxtmxoff = next_offset;
+			}
+		}
+		else
+		{
+			copy_subdir_files("pg_multixact/offsets", "pg_multixact/offsets");
+			copy_subdir_files("pg_multixact/members", "pg_multixact/members");
+		}
 
 		prep_status("Setting next multixact ID and offset for new cluster");
 
@@ -760,9 +794,9 @@ copy_xact_xlog_xid(void)
 		 * counters here and the oldest multi present on system.
 		 */
 		exec_prog(UTILITY_LOG_FILE, NULL, true, true,
-				  "\"%s/pg_resetwal\" -O %u -m %u,%u \"%s\"",
+				  "\"%s/pg_resetwal\" -O %llu -m %u,%u \"%s\"",
 				  new_cluster.bindir,
-				  old_cluster.controldata.chkpnt_nxtmxoff,
+				  (unsigned long long) old_cluster.controldata.chkpnt_nxtmxoff,
 				  old_cluster.controldata.chkpnt_nxtmulti,
 				  old_cluster.controldata.chkpnt_oldstMulti,
 				  new_cluster.pgdata);
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index 53f693c2d4..2c85ec1e94 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -114,6 +114,13 @@ extern char *output_files[];
  */
 #define MULTIXACT_FORMATCHANGE_CAT_VER 201301231
 
+/*
+ * Switch from 32-bit to 64-bit for multixid offsets.
+ *
+ * XXX: should be changed to the actual CATALOG_VERSION_NO on commit.
+ */
+#define MULTIXACTOFFSET_FORMATCHANGE_CAT_VER 202409041
+
 /*
  * large object chunk size added to pg_controldata,
  * commit 5f93c37805e7485488480916b4585e098d3cc883
@@ -230,7 +237,7 @@ typedef struct
 	uint32		chkpnt_nxtepoch;
 	uint32		chkpnt_nxtoid;
 	uint32		chkpnt_nxtmulti;
-	uint32		chkpnt_nxtmxoff;
+	uint64		chkpnt_nxtmxoff;
 	uint32		chkpnt_oldstMulti;
 	uint32		chkpnt_oldstxid;
 	uint32		align;
@@ -515,3 +522,8 @@ typedef struct
 	FILE	   *file;
 	char		path[MAXPGPATH];
 } UpgradeTaskReport;
+
+/* segresize.c */
+
+MultiXactOffset		convert_multixact_offsets(void);
+void				convert_multixact_members(MultiXactOffset oldest_offset);
diff --git a/src/bin/pg_upgrade/segresize.c b/src/bin/pg_upgrade/segresize.c
new file mode 100644
index 0000000000..73064c77de
--- /dev/null
+++ b/src/bin/pg_upgrade/segresize.c
@@ -0,0 +1,527 @@
+/*
+ *	segresize.c
+ *
+ *	SLRU segment resize utility
+ *
+ *	Copyright (c) 2024, PostgreSQL Global Development Group
+ *	src/bin/pg_upgrade/segresize.c
+ */
+
+#include "postgres_fe.h"
+
+#include "pg_upgrade.h"
+#include "access/multixact.h"
+
+/* See slru.h */
+#define SLRU_PAGES_PER_SEGMENT		32
+
+/*
+ * Some kind of iterator associated with a particular SLRU segment.  The idea is
+ * to specify the segment and page number and then move through the pages.
+ */
+typedef struct SlruSegState
+{
+	char	   *dir;
+	char	   *fn;
+	FILE	   *file;
+	int64		segno;
+	uint64		pageno;
+	bool		leading_gap;
+} SlruSegState;
+
+/*
+ * Mirrors the SlruFileName from slru.c
+ */
+static inline char *
+SlruFileName(SlruSegState *state)
+{
+	Assert(state->segno >= 0 && state->segno <= INT64CONST(0xFFFFFF));
+	return psprintf("%s/%04X", state->dir, (unsigned int) state->segno);
+}
+
+/*
+ * Create new SLRU segment file.
+ */
+static void
+create_segment(SlruSegState *state)
+{
+	Assert(state->fn == NULL);
+	Assert(state->file == NULL);
+
+	state->fn = SlruFileName(state);
+	state->file = fopen(state->fn, "wb");
+	if (!state->file)
+		pg_fatal("could not create file \"%s\": %m", state->fn);
+}
+
+/*
+ * Open existing SLRU segment file.
+ */
+static void
+open_segment(SlruSegState *state)
+{
+	Assert(state->fn == NULL);
+	Assert(state->file == NULL);
+
+	state->fn = SlruFileName(state);
+	state->file = fopen(state->fn, "rb");
+	if (!state->file)
+		pg_fatal("could not open file \"%s\": %m", state->fn);
+}
+
+/*
+ * Close SLRU segment file.
+ */
+static void
+close_segment(SlruSegState *state)
+{
+	if (state->file)
+	{
+		fclose(state->file);
+		state->file = NULL;
+	}
+
+	if (state->fn)
+	{
+		pfree(state->fn);
+		state->fn = NULL;
+	}
+}
+
+/*
+ * Read next page from the old 32-bit offset segment file.
+ */
+static int
+read_old_segment_page(SlruSegState *state, void *buf, bool *empty)
+{
+	int		len;
+
+	/* Open next segment file, if needed. */
+	if (!state->fn)
+	{
+		if (!state->segno)
+			state->leading_gap = true;
+
+		open_segment(state);
+
+		/* Set position to the needed page. */
+		if (state->pageno > 0 &&
+			fseek(state->file, state->pageno * BLCKSZ, SEEK_SET))
+		{
+			close_segment(state);
+		}
+	}
+
+	if (state->file)
+	{
+		/* Segment file does exist; read a page from it. */
+		state->leading_gap = false;
+
+		len = fread(buf, sizeof(char), BLCKSZ, state->file);
+
+		/* Are we done or was there an error? */
+		if (len <= 0)
+		{
+			if (ferror(state->file))
+				pg_fatal("error reading file \"%s\": %m", state->fn);
+
+			if (feof(state->file))
+			{
+				*empty = true;
+				len = -1;
+
+				close_segment(state);
+			}
+		}
+		else
+			*empty = false;
+	}
+	else if (!state->leading_gap)
+	{
+		/* We reached the last segment. */
+		len = -1;
+		*empty = true;
+	}
+	else
+	{
+		/* Skip the first few segments if they were frozen and removed. */
+		len = BLCKSZ;
+		*empty = true;
+	}
+
+	if (++state->pageno >= SLRU_PAGES_PER_SEGMENT)
+	{
+		/* Start a new segment. */
+		state->segno++;
+		state->pageno = 0;
+
+		close_segment(state);
+	}
+
+	return len;
+}
+
+/*
+ * Write next page to the new 64-bit offset segment file.
+ */
+static void
+write_new_segment_page(SlruSegState *state, void *buf)
+{
+	/*
+	 * Create a new segment file if we haven't yet.  Creation is
+	 * postponed until the first non-empty page is found, which avoids
+	 * creating completely empty segments.
+	 */
+	if (!state->file)
+	{
+		create_segment(state);
+
+		/* Write zeroes to the previously skipped prefix. */
+		if (state->pageno > 0)
+		{
+			char		zerobuf[BLCKSZ] = {0};
+
+			for (int64 i = 0; i < state->pageno; i++)
+			{
+				if (fwrite(zerobuf, sizeof(char), BLCKSZ, state->file) != BLCKSZ)
+					pg_fatal("could not write file \"%s\": %m", state->fn);
+			}
+		}
+	}
+
+	/* Write page to the new segment (if it was created). */
+	if (state->file)
+	{
+		if (fwrite(buf, sizeof(char), BLCKSZ, state->file) != BLCKSZ)
+			pg_fatal("could not write file \"%s\": %m", state->fn);
+	}
+
+	/*
+	 * Did we reach the maximum page number?  Then close segment file
+	 * and create a new one on the next iteration.
+	 */
+	if (++state->pageno >= SLRU_PAGES_PER_SEGMENT)
+	{
+		/* Start a new segment. */
+		state->segno++;
+		state->pageno = 0;
+
+		close_segment(state);
+	}
+}
+
+typedef uint32 MultiXactOffsetOld;
+
+#define MaxMultiXactOffsetOld	((MultiXactOffsetOld) 0xFFFFFFFF)
+
+#define MULTIXACT_OFFSETS_PER_PAGE_OLD (BLCKSZ / sizeof(MultiXactOffsetOld))
+#define MULTIXACT_OFFSETS_PER_PAGE_NEW (BLCKSZ / sizeof(MultiXactOffset))
+
+/*
+ * Convert pg_multixact/offsets segments and return oldest multi offset.
+ */
+MultiXactOffset
+convert_multixact_offsets(void)
+{
+	SlruSegState		oldseg = {0},
+						newseg = {0};
+	MultiXactOffsetOld	oldbuf[MULTIXACT_OFFSETS_PER_PAGE_OLD] = {0};
+	MultiXactOffset		newbuf[MULTIXACT_OFFSETS_PER_PAGE_NEW] = {0},
+						oldest_offset = 0;
+	uint64				oldest_multi = old_cluster.controldata.chkpnt_oldstMulti,
+						next_multi = old_cluster.controldata.chkpnt_nxtmulti,
+						multi,
+						old_entry,
+						new_entry;
+	bool				oldest_offset_known = false;
+
+	oldseg.dir = psprintf("%s/pg_multixact/offsets", old_cluster.pgdata);
+	newseg.dir = psprintf("%s/pg_multixact/offsets", new_cluster.pgdata);
+
+	old_entry = oldest_multi % MULTIXACT_OFFSETS_PER_PAGE_OLD;
+	oldseg.pageno = oldest_multi / MULTIXACT_OFFSETS_PER_PAGE_OLD;
+	oldseg.segno = oldseg.pageno / SLRU_PAGES_PER_SEGMENT;
+	oldseg.pageno %= SLRU_PAGES_PER_SEGMENT;
+
+	new_entry = oldest_multi % MULTIXACT_OFFSETS_PER_PAGE_NEW;
+	newseg.pageno = oldest_multi / MULTIXACT_OFFSETS_PER_PAGE_NEW;
+	newseg.segno = newseg.pageno / SLRU_PAGES_PER_SEGMENT;
+	newseg.pageno %= SLRU_PAGES_PER_SEGMENT;
+
+	if (next_multi < oldest_multi)
+		next_multi += (uint64) 1 << 32;	/* wraparound */
+
+	/* Copy multi offsets reading only needed segment pages */
+	for (multi = oldest_multi; multi < next_multi; old_entry = 0)
+	{
+		int		oldlen;
+		bool	empty;
+
+		/* Handle possible segment wraparound */
+#define OLD_OFFSET_SEGNO_MAX	\
+	(MaxMultiXactId / MULTIXACT_OFFSETS_PER_PAGE_OLD / SLRU_PAGES_PER_SEGMENT)
+		if (oldseg.segno > OLD_OFFSET_SEGNO_MAX)
+		{
+			oldseg.segno = 0;
+			oldseg.pageno = 0;
+		}
+
+		oldlen = read_old_segment_page(&oldseg, oldbuf, &empty);
+		if (empty || oldlen != BLCKSZ)
+			pg_fatal("cannot read page %llu from file \"%s\": %m",
+					 (unsigned long long) oldseg.pageno, oldseg.fn);
+
+		/* Save oldest multi offset */
+		if (!oldest_offset_known)
+		{
+			oldest_offset = oldbuf[old_entry];
+			oldest_offset_known = true;
+		}
+
+		/* Skip wrapped-around invalid MultiXactIds */
+		if (multi == (uint64) 1 << 32)
+		{
+			Assert(oldseg.segno == 0);
+			Assert(oldseg.pageno == 1);
+			Assert(old_entry == 0);
+			Assert(new_entry == 0);
+
+			multi += FirstMultiXactId;
+			old_entry = FirstMultiXactId;
+			new_entry = FirstMultiXactId;
+		}
+
+		/* Copy entries to the new page */
+		for (; multi < next_multi && old_entry < MULTIXACT_OFFSETS_PER_PAGE_OLD;
+			 multi++, old_entry++)
+		{
+			MultiXactOffset offset = oldbuf[old_entry];
+
+			/* Handle possible offset wraparound (1 becomes 2^32) */
+			if (offset < oldest_offset)
+				offset += ((uint64) 1 << 32) - 1;
+
+			/* Subtract oldest_offset, so new offsets will start from 1 */
+			newbuf[new_entry++] = offset - oldest_offset + 1;
+
+			if (new_entry >= MULTIXACT_OFFSETS_PER_PAGE_NEW)
+			{
+				/* Handle possible segment wraparound */
+#define NEW_OFFSET_SEGNO_MAX	\
+	(MaxMultiXactId / MULTIXACT_OFFSETS_PER_PAGE_NEW / SLRU_PAGES_PER_SEGMENT)
+				if (newseg.segno > NEW_OFFSET_SEGNO_MAX)
+				{
+					newseg.segno = 0;
+					newseg.pageno = 0;
+				}
+
+				/* Write new page */
+				write_new_segment_page(&newseg, newbuf);
+				new_entry = 0;
+			}
+		}
+	}
+
+	/* Write the last incomplete page */
+	if (new_entry > 0 || oldest_multi == next_multi)
+	{
+		memset(&newbuf[new_entry], 0,
+			   sizeof(newbuf[0]) * (MULTIXACT_OFFSETS_PER_PAGE_NEW - new_entry));
+		write_new_segment_page(&newseg, newbuf);
+	}
+
+	/* Use next_offset as oldest_offset, if oldest_multi == next_multi */
+	if (!oldest_offset_known)
+	{
+		Assert(oldest_multi == next_multi);
+		oldest_offset = (MultiXactOffset) old_cluster.controldata.chkpnt_nxtmxoff;
+	}
+
+	/* Release resources */
+	close_segment(&oldseg);
+	close_segment(&newseg);
+
+	pfree(oldseg.dir);
+	pfree(newseg.dir);
+
+	return oldest_offset;
+}
+
+#define MXACT_MEMBERS_FLAG_BYTES			1
+
+#define MULTIXACT_MEMBERS_PER_GROUP			4
+#define MULTIXACT_MEMBERGROUP_SIZE			\
+	(MULTIXACT_MEMBERS_PER_GROUP * (sizeof(TransactionId) + MXACT_MEMBERS_FLAG_BYTES))
+#define MULTIXACT_MEMBERGROUPS_PER_PAGE		\
+	(BLCKSZ / MULTIXACT_MEMBERGROUP_SIZE)
+
+#define MULTIXACT_MEMBERS_PER_PAGE				\
+	(MULTIXACT_MEMBERS_PER_GROUP * MULTIXACT_MEMBERGROUPS_PER_PAGE)
+#define MULTIXACT_MEMBER_FLAG_BYTES_PER_GROUP	\
+	(MXACT_MEMBERS_FLAG_BYTES * MULTIXACT_MEMBERS_PER_GROUP)
+
+typedef struct MultiXactMembersCtx
+{
+	SlruSegState	seg;
+	char			buf[BLCKSZ];
+	int				group;
+	int				member;
+	char		   *flag;
+	TransactionId  *xid;
+} MultiXactMembersCtx;
+
+static void
+MultiXactMembersCtxInit(MultiXactMembersCtx *ctx)
+{
+	ctx->seg.dir = psprintf("%s/pg_multixact/members", new_cluster.pgdata);
+
+	ctx->group = 0;
+	ctx->member = 1;		/* skip invalid zero offset */
+
+	ctx->flag = (char *) ctx->buf + ctx->group * MULTIXACT_MEMBERGROUP_SIZE;
+	ctx->xid = (TransactionId *)(ctx->flag + MXACT_MEMBERS_FLAG_BYTES * MULTIXACT_MEMBERS_PER_GROUP);
+
+	ctx->flag += ctx->member;
+	ctx->xid += ctx->member;
+}
+
+static void
+MultiXactMembersCtxAdd(MultiXactMembersCtx *ctx, char flag, TransactionId xid)
+{
+	/* Copy member's xid and flags to the new page */
+	*ctx->flag++ = flag;
+	*ctx->xid++ = xid;
+
+	if (++ctx->member < MULTIXACT_MEMBERS_PER_GROUP)
+		return;
+
+	/* Start next member group */
+	ctx->member = 0;
+
+	if (++ctx->group >= MULTIXACT_MEMBERGROUPS_PER_PAGE)
+	{
+		/* Write current page and start new */
+		write_new_segment_page(&ctx->seg, ctx->buf);
+
+		ctx->group = 0;
+		memset(ctx->buf, 0, BLCKSZ);
+	}
+
+	ctx->flag = (char *) ctx->buf + ctx->group * MULTIXACT_MEMBERGROUP_SIZE;
+	ctx->xid = (TransactionId *)(ctx->flag + MXACT_MEMBERS_FLAG_BYTES * MULTIXACT_MEMBERS_PER_GROUP);
+}
+
+static void
+MultiXactMembersCtxFinit(MultiXactMembersCtx *ctx)
+{
+	if (ctx->flag > (char *) ctx->buf)
+		write_new_segment_page(&ctx->seg, ctx->buf);
+
+	close_segment(&ctx->seg);
+
+	pfree(ctx->seg.dir);
+}
+
+/*
+ * Convert pg_multixact/members segments, offsets will start from 1.
+ *
+ */
+void
+convert_multixact_members(MultiXactOffset oldest_offset)
+{
+	MultiXactOffset			next_offset,
+							offset;
+	SlruSegState			oldseg = {0};
+	char					oldbuf[BLCKSZ] = {0};
+	int						oldidx;
+	MultiXactMembersCtx		newctx = {0};
+
+	oldseg.dir = psprintf("%s/pg_multixact/members", old_cluster.pgdata);
+
+	next_offset = (MultiXactOffset) old_cluster.controldata.chkpnt_nxtmxoff;
+	if (next_offset < oldest_offset)
+		next_offset += ((uint64) 1 << 32) - 1;
+
+	/* Initialize the old starting position */
+	oldseg.pageno = oldest_offset / MULTIXACT_MEMBERS_PER_PAGE;
+	oldseg.segno = oldseg.pageno / SLRU_PAGES_PER_SEGMENT;
+	oldseg.pageno %= SLRU_PAGES_PER_SEGMENT;
+
+	/* Initialize new starting position */
+	MultiXactMembersCtxInit(&newctx);
+
+	/* Iterate through the original directory */
+	oldidx = oldest_offset % MULTIXACT_MEMBERS_PER_PAGE;
+	for (offset = oldest_offset; offset < next_offset;)
+	{
+		bool	empty;
+		int		oldlen;
+		int		ngroups;
+		int		oldgroup;
+		int		oldmember;
+
+		oldlen = read_old_segment_page(&oldseg, oldbuf, &empty);
+		if (empty || oldlen != BLCKSZ)
+			pg_fatal("cannot read page %llu from file \"%s\": %m",
+					 (unsigned long long) oldseg.pageno, oldseg.fn);
+
+		/* Iterate through the old member groups */
+		ngroups = oldlen / MULTIXACT_MEMBERGROUP_SIZE;
+		oldmember = oldidx % MULTIXACT_MEMBERS_PER_GROUP;
+		oldgroup = oldidx / MULTIXACT_MEMBERS_PER_GROUP;
+		while (oldgroup < ngroups && offset < next_offset)
+		{
+			char		   *oldflag;
+			TransactionId  *oldxid;
+			int				i;
+
+			oldflag = (char *) oldbuf + oldgroup * MULTIXACT_MEMBERGROUP_SIZE;
+			oldxid = (TransactionId *)(oldflag + MULTIXACT_MEMBER_FLAG_BYTES_PER_GROUP);
+
+			oldxid += oldmember;
+			oldflag += oldmember;
+
+			/* Iterate through the old members */
+			for (i = oldmember;
+				 i < MULTIXACT_MEMBERS_PER_GROUP && offset < next_offset;
+				 i++)
+			{
+				MultiXactMembersCtxAdd(&newctx, *oldflag++, *oldxid++);
+
+				if (++offset == (uint64) 1 << 32)
+				{
+					Assert(i == MaxMultiXactOffsetOld % MULTIXACT_MEMBERS_PER_GROUP);
+					goto wraparound;
+				}
+			}
+
+			oldgroup++;
+			oldmember = 0;
+		}
+
+		oldidx = 0;
+
+		continue;
+
+wraparound:
+#define SEGNO_MAX	(MaxMultiXactOffsetOld / MULTIXACT_MEMBERS_PER_PAGE / SLRU_PAGES_PER_SEGMENT)
+#define PAGENO_MAX	(MaxMultiXactOffsetOld / MULTIXACT_MEMBERS_PER_PAGE % SLRU_PAGES_PER_SEGMENT)
+		Assert((oldseg.segno == SEGNO_MAX && oldseg.pageno == PAGENO_MAX + 1) ||
+			   (oldseg.segno == SEGNO_MAX + 1 && oldseg.pageno == 0));
+
+		/* Switch to segment 0000 */
+		close_segment(&oldseg);
+		oldseg.segno = 0;
+		oldseg.pageno = 0;
+
+		/* skip invalid zero multi offset */
+		oldidx = 1;
+	}
+
+	MultiXactMembersCtxFinit(&newctx);
+
+	/* Release resources */
+	close_segment(&oldseg);
+
+	pfree(oldseg.dir);
+}
-- 
2.43.0
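
The core arithmetic of convert_multixact_offsets() in the patch above can be sketched in isolation. The sketch covers two steps: locating a multixact's entry (segment, page within the segment, entry within the page) for a given entry width, and rebasing an old 32-bit offset so that new offsets start from 1. The struct SlotPos, both helper names, and the BLCKSZ value of 8192 are assumptions for illustration; only the formulas are taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BLCKSZ					8192	/* assumed default block size */
#define SLRU_PAGES_PER_SEGMENT	32

/* Hypothetical: position of one offsets entry inside the SLRU. */
typedef struct SlotPos
{
	uint64_t	segno;		/* segment file number */
	uint64_t	pageno;		/* page within the segment */
	uint64_t	entry;		/* entry within the page */
} SlotPos;

/* entry_size is 4 for the old format and 8 for the new one. */
static inline SlotPos
locate_offset_slot(uint64_t multi, size_t entry_size)
{
	uint64_t	per_page;
	uint64_t	page;
	SlotPos		pos;

	assert(entry_size > 0);
	per_page = BLCKSZ / entry_size;
	page = multi / per_page;

	pos.entry = multi % per_page;
	pos.segno = page / SLRU_PAGES_PER_SEGMENT;
	pos.pageno = page % SLRU_PAGES_PER_SEGMENT;
	return pos;
}

/* Rebase an old 32-bit offset so that the oldest one maps to 1. */
static inline uint64_t
rebase_offset(uint32_t old_offset, uint32_t oldest_offset)
{
	uint64_t	offset = old_offset;

	/*
	 * Old offsets skip the invalid value 0 when they wrap, so a wrapped
	 * offset is shifted by 2^32 - 1 (old offset 1 becomes 2^32).
	 */
	if (offset < oldest_offset)
		offset += ((uint64_t) 1 << 32) - 1;

	return offset - oldest_offset + 1;
}
```

With oldest_offset = 0xFFFFFF00, the old offsets 0xFFFFFFFF and 1 map to 256 and 257, so the sequence stays contiguous across the 32-bit wraparound. The same shift is what copy_xact_xlog_xid() applies to chkpnt_nxtmxoff before passing it to pg_resetwal.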

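The members conversion in the same patch relies on the page layout spelled out by its macros: each group holds four flag bytes followed by four 32-bit XIDs. A rough sketch of how a member offset maps to byte positions inside its page is shown here. The struct MemberPos, the helper locate_member, and the BLCKSZ value of 8192 are assumptions for illustration; the macro definitions mirror those in segresize.c, with uint32_t standing in for TransactionId.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BLCKSZ							8192	/* assumed default block size */
#define MXACT_MEMBERS_FLAG_BYTES		1
#define MULTIXACT_MEMBERS_PER_GROUP		4
#define MULTIXACT_MEMBERGROUP_SIZE \
	(MULTIXACT_MEMBERS_PER_GROUP * (sizeof(uint32_t) + MXACT_MEMBERS_FLAG_BYTES))
#define MULTIXACT_MEMBERGROUPS_PER_PAGE	(BLCKSZ / MULTIXACT_MEMBERGROUP_SIZE)
#define MULTIXACT_MEMBERS_PER_PAGE \
	(MULTIXACT_MEMBERS_PER_GROUP * MULTIXACT_MEMBERGROUPS_PER_PAGE)

/* Hypothetical: byte positions of one member within its page. */
typedef struct MemberPos
{
	size_t		flag_byte;	/* where the member's flag byte lives */
	size_t		xid_byte;	/* where the member's XID starts */
} MemberPos;

static inline MemberPos
locate_member(uint64_t offset)
{
	size_t		idx = (size_t) (offset % MULTIXACT_MEMBERS_PER_PAGE);
	size_t		group = idx / MULTIXACT_MEMBERS_PER_GROUP;
	size_t		member = idx % MULTIXACT_MEMBERS_PER_GROUP;
	size_t		group_start = group * MULTIXACT_MEMBERGROUP_SIZE;
	MemberPos	pos;

	assert(group < MULTIXACT_MEMBERGROUPS_PER_PAGE);

	/* Flags come first in the group, one byte per member. */
	pos.flag_byte = group_start + member * MXACT_MEMBERS_FLAG_BYTES;

	/* XIDs follow the block of flag bytes. */
	pos.xid_byte = group_start
		+ MXACT_MEMBERS_FLAG_BYTES * MULTIXACT_MEMBERS_PER_GROUP
		+ member * sizeof(uint32_t);
	return pos;
}
```

Under these assumptions a group is 20 bytes, a page holds 409 groups (1636 members), and offset 1 lands on flag byte 1 and XID byte 8, matching the "skip invalid zero offset" start position used by MultiXactMembersCtxInit().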
v9-0007-TEST-bump-catver.patch.txt (text/plain; charset=US-ASCII)
From 33e21cf86b1813a67c699d703ab1f75bcf28a7b1 Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Wed, 13 Nov 2024 16:34:34 +0300
Subject: [PATCH v9 7/7] TEST: bump catver

---
 src/bin/pg_upgrade/pg_upgrade.h  | 2 +-
 src/include/catalog/catversion.h | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index 2c85ec1e94..18faedc963 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -119,7 +119,7 @@ extern char *output_files[];
  *
  * XXX: should be changed to the actual CATALOG_VERSION_NO on commit.
  */
-#define MULTIXACTOFFSET_FORMATCHANGE_CAT_VER 202409041
+#define MULTIXACTOFFSET_FORMATCHANGE_CAT_VER 202411112
 
 /*
  * large object chunk size added to pg_controldata,
diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h
index 5dd91e190a..3d09caf5ae 100644
--- a/src/include/catalog/catversion.h
+++ b/src/include/catalog/catversion.h
@@ -57,6 +57,6 @@
  */
 
 /*							yyyymmddN */
-#define CATALOG_VERSION_NO	202411111
+#define CATALOG_VERSION_NO	202411112
 
 #endif
-- 
2.43.0
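
The catalog version bump above matters because pg_upgrade chooses between converting and plainly copying pg_multixact based on the two clusters' catalog versions. A small sketch of that decision, assuming the test value 202411112 from this patch; the helper name needs_offset_conversion is hypothetical, while the comparison itself follows copy_xact_xlog_xid() in the 0003 patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Test value from the patch above; a placeholder until the real commit. */
#define MULTIXACTOFFSET_FORMATCHANGE_CAT_VER 202411112

/*
 * Hypothetical helper: true when the old cluster still has 32-bit offsets
 * and the new cluster expects 64-bit ones, i.e. when the SLRU files must be
 * rewritten rather than copied.
 */
static inline bool
needs_offset_conversion(uint32_t old_cat_ver, uint32_t new_cat_ver)
{
	assert(old_cat_ver <= new_cat_ver);	/* downgrades are not considered */
	return old_cat_ver < MULTIXACTOFFSET_FORMATCHANGE_CAT_VER &&
		new_cat_ver >= MULTIXACTOFFSET_FORMATCHANGE_CAT_VER;
}
```

An upgrade from catalog version 202411111 to 202411112 takes the conversion path, whereas a self-upgrade at 202411112 falls through to copy_subdir_files().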

v9-0006-TEST-add-src-bin-pg_upgrade-t-005_offset.pl.patch.txt (text/plain; charset=US-ASCII)
From 3558ccb4712d50bcda877474db5c9fd124b6e919 Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Tue, 19 Nov 2024 17:08:10 +0300
Subject: [PATCH v9 6/7] TEST: add src/bin/pg_upgrade/t/005_offset.pl

---
 src/bin/pg_upgrade/t/005_offset.pl | 562 +++++++++++++++++++++++++++++
 1 file changed, 562 insertions(+)
 create mode 100644 src/bin/pg_upgrade/t/005_offset.pl

diff --git a/src/bin/pg_upgrade/t/005_offset.pl b/src/bin/pg_upgrade/t/005_offset.pl
new file mode 100644
index 0000000000..1cfd8b364a
--- /dev/null
+++ b/src/bin/pg_upgrade/t/005_offset.pl
@@ -0,0 +1,562 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+use strict;
+use warnings FATAL => 'all';
+
+use File::Find qw(find);
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# This pair of calls will create significantly more member segments than offset
+# segments.
+sub prep
+{
+	my $node = shift;
+	my $tbl = shift;
+
+	$node->safe_psql('postgres',
+		"CREATE TABLE ${tbl} (I INT PRIMARY KEY, N_UPDATED INT) " .
+		"       WITH (AUTOVACUUM_ENABLED=FALSE);" .
+		"INSERT INTO ${tbl} SELECT G, 0 FROM GENERATE_SERIES(1, 50) G;");
+}
+
+sub fill
+{
+	my $node = shift;
+	my $tbl = shift;
+
+	my $nclients = 50;
+	my $update_every = 90;
+	my @connections = ();
+
+	for (0 .. $nclients - 1)
+	{
+		my $conn = $node->background_psql('postgres');
+		$conn->query_safe("BEGIN");
+
+		push(@connections, $conn);
+	}
+
+	for (my $i = 0; $i < 20000; $i++)
+	{
+		my $conn = $connections[$i % $nclients];
+
+		$conn->query_safe("COMMIT;");
+		$conn->query_safe("BEGIN");
+
+		if ($i % $update_every == 0)
+		{
+			$conn->query_safe(
+				"UPDATE ${tbl} SET " .
+				"N_UPDATED = N_UPDATED + 1 " .
+				"WHERE I = ${i} % 50");
+		}
+		else
+		{
+			$conn->query_safe(
+				"SELECT * FROM ${tbl} FOR KEY SHARE");
+		}
+	}
+
+	for my $conn (@connections)
+	{
+		$conn->quit();
+	}
+}
+
+# This pair of calls will create more or less the same number of members and
+# offsets segments.
+sub prep2
+{
+	my $node = shift;
+	my $tbl = shift;
+
+	$node->safe_psql('postgres',
+		"CREATE TABLE ${tbl}(BAR INT PRIMARY KEY, BAZ INT); " .
+		"CREATE OR REPLACE PROCEDURE MXIDFILLER(N_STEPS INT DEFAULT 1000) " .
+		"LANGUAGE PLPGSQL " .
+		"AS \$\$ " .
+		"BEGIN " .
+		"	FOR I IN 1..N_STEPS LOOP " .
+		"		UPDATE ${tbl} SET BAZ = RANDOM(1, 1000) " .
+		"		WHERE BAR IN (SELECT BAR FROM ${tbl} " .
+		"						TABLESAMPLE BERNOULLI(80)); " .
+		"		COMMIT; " .
+		"	END LOOP; " .
+		"END; \$\$; " .
+		"INSERT INTO ${tbl} (BAR, BAZ) " .
+		"SELECT ID, ID FROM GENERATE_SERIES(1, 1024) ID;");
+}
+
+sub fill2
+{
+	my $node = shift;
+	my $tbl = shift;
+	my $scale = shift // 1;
+
+	$node->safe_psql('postgres',
+		"BEGIN; " .
+		"SELECT * FROM ${tbl} FOR KEY SHARE; " .
+		"PREPARE TRANSACTION 'A'; " .
+		"CALL MXIDFILLER((365 * ${scale})::int); " .
+		"COMMIT PREPARED 'A';");
+}
+
+
+# generate around 2 offset segments and 55 member segments
+sub mxid_gen1
+{
+	my $node = shift;
+	my $tbl = shift;
+
+	prep($node, $tbl);
+	fill($node, $tbl);
+
+	$node->safe_psql('postgres', q(CHECKPOINT));
+}
+
+# generate around 10 offset segments and 12 member segments
+sub mxid_gen2
+{
+	my $node = shift;
+	my $tbl = shift;
+	my $scale = shift // 1;
+
+	prep2($node, $tbl);
+	fill2($node, $tbl, $scale);
+
+	$node->safe_psql('postgres', q(CHECKPOINT));
+}
+
+# Fetch latest multixact checkpoint values.
+sub multi_bounds
+{
+	my ($node) = @_;
+	my $path = $node->config_data('--bindir');
+	my ($stdout, $stderr) = run_command([
+									$path . '/pg_controldata',
+									$node->data_dir
+								]);
+	my @control_data = split("\n", $stdout);
+	my $next = undef;
+	my $oldest = undef;
+	my $next_offset = undef;
+
+	foreach (@control_data)
+	{
+		if ($_ =~ /^Latest checkpoint's NextMultiXactId:\s*(.*)$/mg)
+		{
+			$next = $1;
+			print ">>> @ node ". $node->name . ", " . $_ . "\n";
+		}
+
+		if ($_ =~ /^Latest checkpoint's oldestMultiXid:\s*(.*)$/mg)
+		{
+			$oldest = $1;
+			print ">>> @ node ". $node->name . ", " . $_ . "\n";
+		}
+
+		if ($_ =~ /^Latest checkpoint's NextMultiOffset:\s*(.*)$/mg)
+		{
+			$next_offset = $1;
+			print ">>> @ node ". $node->name . ", " . $_ . "\n";
+		}
+
+		if (defined($oldest) && defined($next) && defined($next_offset))
+		{
+			last;
+		}
+	}
+
+	die "Latest checkpoint's NextMultiXactId not found in control file!\n"
+	unless defined($next);
+
+	die "Latest checkpoint's oldestMultiXid not found in control file!\n"
+	unless defined($oldest);
+
+	die "Latest checkpoint's NextMultiOffset not found in control file!\n"
+	unless defined($next_offset);
+
+	return ($oldest, $next, $next_offset);
+}
+
+# Create node from existing bins.
+sub create_new_node
+{
+	my ($name, %params) = @_;
+
+	create_node(0, @_);
+}
+
+# Create node from ENV oldinstall
+sub create_old_node
+{
+	my ($name, %params) = @_;
+
+	if (!defined($ENV{oldinstall}))
+	{
+		die "oldinstall is not defined";
+	}
+
+	create_node(1, @_);
+}
+
+sub create_node
+{
+	my ($install_path_from_env, $name, %params) = @_;
+	my $scale = defined $params{scale} ? $params{scale} : 1;
+	my $multi = defined $params{multi} ? $params{multi} : undef;
+	my $offset = defined $params{offset} ? $params{offset} : undef;
+
+	my $node =
+		$install_path_from_env ?
+			PostgreSQL::Test::Cluster->new($name,
+					install_path => $ENV{oldinstall}) :
+			PostgreSQL::Test::Cluster->new($name);
+
+	$node->init(force_initdb => 1, 
+		extra => [
+			$multi ? ('-m', $multi) : (),
+			$offset ? ('-o', $offset) : (),
+		]);
+
+	# Fixup MOX patch quirk
+	if ($multi)
+	{
+		unlink $node->data_dir . '/pg_multixact/offsets/0000';
+	}
+	if ($offset)
+	{
+		unlink $node->data_dir . '/pg_multixact/members/0000';
+	}
+
+	$node->append_conf('postgresql.conf', 'fsync = off');
+	$node->append_conf('postgresql.conf', 'max_prepared_transactions = 2');
+
+	$node->start();
+	mxid_gen2($node, 'FOO', $scale);
+	mxid_gen1($node, 'BAR', $scale);
+	$node->restart();
+	$node->safe_psql('postgres', q(SELECT * FROM FOO));		# just in case...
+	$node->safe_psql('postgres', q(SELECT * FROM BAR));
+	$node->safe_psql('postgres', q(CHECKPOINT));
+	$node->stop();
+
+	return $node;
+}
+
+sub do_upgrade
+{
+	my ($oldnode, $newnode) = @_;
+
+	command_ok(
+		[
+			'pg_upgrade', '--no-sync',
+			'-d', $oldnode->data_dir,
+			'-D', $newnode->data_dir,
+			'-b', $oldnode->config_data('--bindir'),
+			'-B', $newnode->config_data('--bindir'),
+			'-s', $newnode->host,
+			'-p', $oldnode->port,
+			'-P', $newnode->port,
+			'--check'
+		],
+		'run of pg_upgrade --check');
+
+	command_ok(
+		[
+			'pg_upgrade', '--no-sync',
+			'-d', $oldnode->data_dir,
+			'-D', $newnode->data_dir,
+			'-b', $oldnode->config_data('--bindir'),
+			'-B', $newnode->config_data('--bindir'),
+			'-s', $newnode->host,
+			'-p', $oldnode->port,
+			'-P', $newnode->port,
+			'--copy'
+		],
+		'run of pg_upgrade');
+
+	$oldnode->start();
+	$newnode->start();
+
+	my $oldfoo = $oldnode->safe_psql('postgres', q(SELECT * FROM FOO));
+	my $newfoo = $newnode->safe_psql('postgres', q(SELECT * FROM FOO));
+	is($oldfoo, $newfoo, "select foo eq");
+
+	my $oldbar = $oldnode->safe_psql('postgres', q(SELECT * FROM BAR));
+	my $newbar = $newnode->safe_psql('postgres', q(SELECT * FROM BAR));
+	is($oldbar, $newbar, "select bar eq");
+
+	$oldnode->stop();
+	$newnode->stop();
+
+	multi_bounds($oldnode);
+	multi_bounds($newnode);
+}
+
+my @TESTS = (
+	# tests without ENV oldinstall
+	0, 1, 2, 3, 4, 5, 6,
+	# tests with "real" pg_upgrade
+	100, 101, 102, 103, 104, 105, 106,
+	# self upgrade
+	1000,
+);
+
+# =============================================================================
+# Basic sanity tests on a NEW bin
+# =============================================================================
+
+# starts from zero
+SKIP:
+{
+	my $TEST_NO = 0;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $node = create_new_node('simple_mo',
+						scale => 1);
+	multi_bounds($node);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# multi starts from the value
+SKIP:
+{
+	my $TEST_NO = 1;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $node = create_new_node('simple_Mo',
+						scale => 1.15,
+						multi => '0x123400');
+	multi_bounds($node);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# offsets starts from the value
+SKIP:
+{
+	my $TEST_NO = 2;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $node = create_new_node('simple_mO',
+						scale => 1.15,
+						offset => '0x432100');
+	multi_bounds($node);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# multi and offsets starts from the value
+SKIP:
+{
+	my $TEST_NO = 3;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $node = create_new_node('simple_MO',
+						scale => 1.15,
+						multi => '0xDEAD00', offset => '0xBEEF00');
+	multi_bounds($node);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# multi starts from the value, multi wrap
+SKIP:
+{
+	my $TEST_NO = 4;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $node = create_new_node('simple_Mo_wrap',
+						scale => 1.15,
+						multi => '0xFFFF7000');
+	multi_bounds($node);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# offsets starts from the value, offsets wrap
+SKIP:
+{
+	my $TEST_NO = 5;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $node = create_new_node('simple_mO_wrap',
+						scale => 1.15,
+						offset => '0xFFFFFC00');
+	multi_bounds($node);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# multi starts from the value, offsets starts from the value,
+# multi wrap, offsets wrap
+SKIP:
+{
+	my $TEST_NO = 6;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $node = create_new_node('simple_MO_wrap',
+						scale => 1.15,
+						multi => '0xFFFF7000', offset => '0xFFFFFC00');
+	multi_bounds($node);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# =============================================================================
+# pg_upgrade tests
+# =============================================================================
+
+# starts from the zero
+SKIP:
+{
+	my $TEST_NO = 100;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $dbname = 'mo';
+	my $oldnode = create_old_node("old_$dbname",
+						scale => 1);
+	my $newnode = PostgreSQL::Test::Cluster->new("new_$dbname");
+	$newnode->init();
+
+	do_upgrade($oldnode, $newnode);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# multi starts from the value
+SKIP:
+{
+	my $TEST_NO = 101;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $dbname = 'Mo';
+	my $oldnode = create_old_node("old_$dbname",
+						scale => 1.2,
+						multi => '0x123400');
+	my $newnode = PostgreSQL::Test::Cluster->new("new_$dbname");
+	$newnode->init();
+
+	do_upgrade($oldnode, $newnode);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# offsets starts from the value
+SKIP:
+{
+	my $TEST_NO = 102;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $dbname = 'mO';
+	my $oldnode = create_old_node("old_$dbname",
+						scale => 1.2,
+						offset => '0x432100');
+	my $newnode = PostgreSQL::Test::Cluster->new("new_$dbname");
+	$newnode->init();
+
+	do_upgrade($oldnode, $newnode);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# multi and offsets starts from the value
+SKIP:
+{
+	my $TEST_NO = 103;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $dbname = 'MO';
+	my $oldnode = create_old_node("old_$dbname",
+						scale => 1.2,
+						multi => '0xDEAD00', offset => '0xBEEF00');
+	my $newnode = PostgreSQL::Test::Cluster->new("new_$dbname");
+	$newnode->init();
+
+	do_upgrade($oldnode, $newnode);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# multi starts from the value, multi wrap
+SKIP:
+{
+	my $TEST_NO = 104;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $dbname = 'Mo_wrap';
+	my $oldnode = create_old_node("old_$dbname",
+						scale => 1.2,
+						multi => '0xFFFF7000');
+	my $newnode = PostgreSQL::Test::Cluster->new("new_$dbname");
+	$newnode->init();
+
+	do_upgrade($oldnode, $newnode);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# offsets starts from the value, offsets wrap
+SKIP:
+{
+	my $TEST_NO = 105;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $dbname = 'mO_wrap';
+	my $oldnode = create_old_node("old_$dbname",
+						scale => 1.2,
+						offset => '0xFFFFFC00');
+	my $newnode = PostgreSQL::Test::Cluster->new("new_$dbname");
+	$newnode->init();
+
+	do_upgrade($oldnode, $newnode);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# multi starts from the value, offsets starts from the value,
+# multi wrap, offsets wrap
+SKIP:
+{
+	my $TEST_NO = 106;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $dbname = 'MO_wrap';
+	my $oldnode = create_old_node("old_$dbname",
+						scale => 1.2,
+						multi => '0xFFFF7000', offset => '0xFFFFFC00');
+	my $newnode = PostgreSQL::Test::Cluster->new("new_$dbname");
+	$newnode->init();
+
+	do_upgrade($oldnode, $newnode);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# =============================================================================
+# Self upgrade
+# =============================================================================
+
+# starts from the zero
+SKIP:
+{
+	my $TEST_NO = 1000;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $dbname = 'self_upgrade';
+	my $oldnode = create_new_node("old_$dbname",
+						scale => 1);
+	my $newnode = PostgreSQL::Test::Cluster->new("new_$dbname");
+	$newnode->init();
+
+	do_upgrade($oldnode, $newnode);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+done_testing();
-- 
2.43.0

#26Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Maxim Orlov (#25)
14 attachment(s)
Re: POC: make mxidoff 64 bits

Thanks for working on this!

On 19/11/2024 19:53, Maxim Orlov wrote:

The test itself is in src/bin/pg_upgrade/t/005_offset.pl. It is rather
heavy and took about 45 minutes on my i5, with 2.7 GB of data generated.
Basically, each test here creates a cluster and fills it with multixacts.
Thus, dozens of segments are created using two methods. One uses prepared
transactions and creates roughly the same number of segments for members
and for offsets. The other one is based on Heikki's multixids.py and
creates more members than offsets. I've used both of these methods to
generate data that is as diverse as possible.

Here is how I test this patch set:

1. You need two pg clusters: the "old" one, i.e. without the patch set, and
the "new" one with patch set v9 applied.
2. Apply v9-0005-TEST-initdb-option-to-initialize-cluster-with-non.patch.txt
to the "old" and "new" clusters. Note that this is the only patch required
for the "old" cluster. It allows you to create a cluster with a
non-standard initial multixact and multixact offset. Unfortunately, this
patch did not arouse public interest, since pg_resetwal is assumed to
provide similar functionality. But similar does not mean equal.
pg_resetwal must be used after cluster init, so we run into problems with
vacuum, and some SLRU segments must be filled with zeroes. Also,
template0's datminmxid must be updated manually. So, in my view, using
this patch is justified and very handy here.
3. Also, apply all the "TEST" patches (0006 and 0007) to the "new" cluster.
4. Build the "old" and "new" pg clusters.
5. Run the test with:
PROVE_TESTS=t/005_offset.pl PG_TEST_NOCLEAN=1 oldinstall=/home/orlov/proj/OFFSET3/pgsql-old make check -s -C src/bin/pg_upgrade/
6. In my case, it took around 45 minutes and generated roughly 2.7 GB of
data.

The "TEST" patches are, of course, for testing purposes only and are not
meant to be committed.

In src/bin/pg_upgrade/t/005_offset.pl I try to cover the following cases:

* Basic sanity checks.
Here I test various initial multi and offset values (including
wraparound) and see how the appropriate segments are generated.
* pg_upgrade tests.
This is what the oldinstall ENV is for. Run pg_upgrade for an old cluster
with multi and offset values just like in the previous step, i.e. with
various combinations.
* Self pg_upgrade.

Attached is some more cleanup on top of patch set v9, removing more dead
stuff related to wraparound. I also removed the oldestOffsetKnown
variable and related code. It was needed to deal with clusters upgraded
from buggy 9.3 and 9.4 era versions, but now that pg_upgrade will
rewrite the SLRUs, it's no longer needed.

Does the pg_upgrade code work, though, if you have that buggy situation
where oldestOffsetKnown == false?

	if (!TransactionIdIsValid(*xactptr))
	{
		/* Corner case 3: we must be looking at unused slot zero */
		Assert(offset == 0);
		continue;
	}

After upgrade, this corner case 3 would *not* happen on offset == 0. So
it looks like we're still missing test coverage for this upgrade corner case.

I still felt pretty uneasy about the pg_upgrade code. It's
complicated, and the way it rewrites offsets and members separately and
a page at a time is quite different from the normal codepaths in
multixact.c, so it's a bit hard to see if it's handling all those corner
cases the same way. I rewrote that so that it's easier to understand,
IMHO anyway. The code for reading the old format and writing the new
format is now more decoupled. The code for reading the old format is in
a separate source file, multixact_old.c. The interface to that is
GetOldMultiXactIdSingleMember(), which returns the updating member of a
given multixid, much like the GetMultiXactIdMembers() backend function.
The conversion routine calls that for each multixid, and writes it back
out in the new format, one multixid at a time.
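
To make the shape of that conversion loop concrete, here is a rough, standalone sketch. It is not the code from the patch set: the callback type old_reader_fn merely stands in for GetOldMultiXactIdSingleMember() (whose real signature is not shown in this message), the output arrays stand in for the new-format offsets and members SLRUs, and toy_reader, convert_multixacts and the choice to record zero members for a multixid without an updater are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t TransactionId;
typedef uint32_t MultiXactId;
typedef uint64_t MultiXactOffset;	/* 64 bits wide, as in the patch set */

/*
 * Hypothetical stand-in for the old-format reader: returns true and sets
 * *xid if "multi" has an updating member.
 */
typedef bool (*old_reader_fn) (MultiXactId multi, TransactionId *xid,
							   void *arg);

/* Toy reader for the example: even multixids have an updater. */
bool
toy_reader(MultiXactId multi, TransactionId *xid, void *arg)
{
	(void) arg;
	if (multi % 2 != 0)
		return false;
	*xid = multi + 1000;
	return true;
}

/*
 * Convert the multixids in [oldest, next), one at a time.  For each one,
 * record its starting offset in new_offsets[], then append its updating
 * member (if any) to new_members[].  Returns the next free offset.
 */
MultiXactOffset
convert_multixacts(MultiXactId oldest, MultiXactId next,
				   old_reader_fn read_old, void *arg,
				   MultiXactOffset *new_offsets,
				   TransactionId *new_members,
				   MultiXactOffset start_offset)
{
	MultiXactOffset next_offset = start_offset;
	size_t		slot = 0;

	assert(read_old != NULL);

	/* "!=" rather than "<", so a range that wraps past 2^32 also works */
	for (MultiXactId multi = oldest; multi != next; multi++)
	{
		TransactionId xid;

		if (multi == 0)
			continue;			/* 0 is InvalidMultiXactId */

		new_offsets[slot++] = next_offset;
		if (read_old(multi, &xid, arg))
		{
			new_members[next_offset - start_offset] = xid;
			next_offset++;
		}
	}
	return next_offset;
}
```

Because each multixid is read and written as a unit, the member count of a converted multixid is simply the difference between its offset and the next one; with the toy reader, converting multixids 1 to 4 starting at offset 1 yields offsets 1, 1, 2, 2 and a next free offset of 3.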

The new code now "squeezes out" locking-only XIDs, keeping only the
updating ones. Not that important, but with this new code structure, it
was trivial and even easier to do than retaining all the XIDs.
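
As an illustration of what "keeping only the updating ones" means, the sketch below picks the updating member out of a member list. The status values and the ISUPDATE_from_mxstatus() test mirror the definitions in src/include/access/multixact.h; the helper find_updating_member() and the plain-array representation are assumptions made for this example and are not taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* Same ordering as MultiXactStatus in access/multixact.h */
typedef enum
{
	MultiXactStatusForKeyShare = 0x00,
	MultiXactStatusForShare = 0x01,
	MultiXactStatusForNoKeyUpdate = 0x02,
	MultiXactStatusForUpdate = 0x03,
	/* an update that doesn't touch "key" columns */
	MultiXactStatusNoKeyUpdate = 0x04,
	/* other updates, and delete */
	MultiXactStatusUpdate = 0x05
} MultiXactStatus;

typedef struct MultiXactMember
{
	TransactionId xid;
	MultiXactStatus status;
} MultiXactMember;

/* Statuses above ForUpdate are real updates; the rest are only locks */
#define ISUPDATE_from_mxstatus(status) \
	((status) > MultiXactStatusForUpdate)

/*
 * Hypothetical helper: copy the updating member into *result and return
 * true, or return false if every member is a locker.  A multixact has at
 * most one updating member, so the first match is the only one.
 */
bool
find_updating_member(const MultiXactMember *members, size_t nmembers,
					 MultiXactMember *result)
{
	assert(result != NULL);

	for (size_t i = 0; i < nmembers; i++)
	{
		if (ISUPDATE_from_mxstatus(members[i].status))
		{
			*result = members[i];
			return true;
		}
	}
	return false;
}
```

Note that FOR UPDATE and FOR NO KEY UPDATE are row locks, not updates, so members with those statuses are among the ones squeezed out.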

Now that the offsets are rewritten one by one, we don't need the
"special case 3" in GetMultiXactIdMembers. The upgrade process removes
that special case.

I tried to keep my changes separate from your patches in the attached
patch series. This needs some more cleanup and squashing before
committing, but I think we're getting close.

--
Heikki Linnakangas
Neon (https://neon.tech)

Attachments:

v10-0013-Initialize-old-test-clusters-with-checksums.patch (text/x-patch; charset=UTF-8)
From 4db0fdc198382e6942a46e163eb076b7e97bc384 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Wed, 18 Dec 2024 11:41:48 +0200
Subject: [PATCH v10 13/14] Initialize old test clusters with checksums

because the default changed to checksums-on in v18.
---
 src/bin/pg_upgrade/t/005_offset.pl | 1 +
 1 file changed, 1 insertion(+)

diff --git a/src/bin/pg_upgrade/t/005_offset.pl b/src/bin/pg_upgrade/t/005_offset.pl
index 1cfd8b364a9..98a6be3abd8 100644
--- a/src/bin/pg_upgrade/t/005_offset.pl
+++ b/src/bin/pg_upgrade/t/005_offset.pl
@@ -220,6 +220,7 @@ sub create_node
 		extra => [
 			$multi ? ('-m', $multi) : (),
 			$offset ? ('-o', $offset) : (),
+			('-k'),
 		]);
 
 	# Fixup MOX patch quirk
-- 
2.39.5

v10-0001-Use-64-bit-format-output-for-multixact-offsets.patch (text/x-patch; charset=UTF-8)
From c831cb41ae033899ceab0225f96a9dadf4a4db4e Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Wed, 7 Aug 2024 16:35:22 +0300
Subject: [PATCH v10 01/14] Use 64-bit format output for multixact offsets

Author: Maxim Orlov <orlovmg@gmail.com>
---
 src/backend/access/rmgrdesc/mxactdesc.c   |  9 ++++----
 src/backend/access/rmgrdesc/xlogdesc.c    |  4 ++--
 src/backend/access/transam/multixact.c    | 26 +++++++++++++----------
 src/backend/access/transam/xlogrecovery.c |  5 +++--
 src/bin/pg_controldata/pg_controldata.c   |  4 ++--
 src/bin/pg_resetwal/pg_resetwal.c         |  8 +++----
 6 files changed, 31 insertions(+), 25 deletions(-)

diff --git a/src/backend/access/rmgrdesc/mxactdesc.c b/src/backend/access/rmgrdesc/mxactdesc.c
index 3e8ad4d5ef8..1b486de38cf 100644
--- a/src/backend/access/rmgrdesc/mxactdesc.c
+++ b/src/backend/access/rmgrdesc/mxactdesc.c
@@ -65,8 +65,8 @@ multixact_desc(StringInfo buf, XLogReaderState *record)
 		xl_multixact_create *xlrec = (xl_multixact_create *) rec;
 		int			i;
 
-		appendStringInfo(buf, "%u offset %u nmembers %d: ", xlrec->mid,
-						 xlrec->moff, xlrec->nmembers);
+		appendStringInfo(buf, "%u offset %llu nmembers %d: ", xlrec->mid,
+						 (unsigned long long) xlrec->moff, xlrec->nmembers);
 		for (i = 0; i < xlrec->nmembers; i++)
 			out_member(buf, &xlrec->members[i]);
 	}
@@ -74,9 +74,10 @@ multixact_desc(StringInfo buf, XLogReaderState *record)
 	{
 		xl_multixact_truncate *xlrec = (xl_multixact_truncate *) rec;
 
-		appendStringInfo(buf, "offsets [%u, %u), members [%u, %u)",
+		appendStringInfo(buf, "offsets [%u, %u), members [%llu, %llu)",
 						 xlrec->startTruncOff, xlrec->endTruncOff,
-						 xlrec->startTruncMemb, xlrec->endTruncMemb);
+						 (unsigned long long) xlrec->startTruncMemb,
+						 (unsigned long long) xlrec->endTruncMemb);
 	}
 }
 
diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c
index 363294d6234..aaa19c81c8a 100644
--- a/src/backend/access/rmgrdesc/xlogdesc.c
+++ b/src/backend/access/rmgrdesc/xlogdesc.c
@@ -66,7 +66,7 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 		CheckPoint *checkpoint = (CheckPoint *) rec;
 
 		appendStringInfo(buf, "redo %X/%X; "
-						 "tli %u; prev tli %u; fpw %s; wal_level %s; xid %u:%u; oid %u; multi %u; offset %u; "
+						 "tli %u; prev tli %u; fpw %s; wal_level %s; xid %u:%u; oid %u; multi %u; offset %llu; "
 						 "oldest xid %u in DB %u; oldest multi %u in DB %u; "
 						 "oldest/newest commit timestamp xid: %u/%u; "
 						 "oldest running xid %u; %s",
@@ -79,7 +79,7 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 						 XidFromFullTransactionId(checkpoint->nextXid),
 						 checkpoint->nextOid,
 						 checkpoint->nextMulti,
-						 checkpoint->nextMultiOffset,
+						 (unsigned long long) checkpoint->nextMultiOffset,
 						 checkpoint->oldestXid,
 						 checkpoint->oldestXidDB,
 						 checkpoint->oldestMulti,
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 8c37d7eba76..ab90912ed3a 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1264,7 +1264,8 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 
 	LWLockRelease(MultiXactGenLock);
 
-	debug_elog4(DEBUG2, "GetNew: returning %u offset %u", result, *offset);
+	debug_elog4(DEBUG2, "GetNew: returning %u offset %llu", result,
+				(unsigned long long) *offset);
 	return result;
 }
 
@@ -2293,8 +2294,9 @@ MultiXactGetCheckptMulti(bool is_shutdown,
 	LWLockRelease(MultiXactGenLock);
 
 	debug_elog6(DEBUG2,
-				"MultiXact: checkpoint is nextMulti %u, nextOffset %u, oldestMulti %u in DB %u",
-				*nextMulti, *nextMultiOffset, *oldestMulti, *oldestMultiDB);
+				"MultiXact: checkpoint is nextMulti %u, nextOffset %llu, oldestMulti %u in DB %u",
+				*nextMulti, (unsigned long long) *nextMultiOffset, *oldestMulti,
+				*oldestMultiDB);
 }
 
 /*
@@ -2328,8 +2330,8 @@ void
 MultiXactSetNextMXact(MultiXactId nextMulti,
 					  MultiXactOffset nextMultiOffset)
 {
-	debug_elog4(DEBUG2, "MultiXact: setting next multi to %u offset %u",
-				nextMulti, nextMultiOffset);
+	debug_elog4(DEBUG2, "MultiXact: setting next multi to %u offset %llu",
+				nextMulti, (unsigned long long) nextMultiOffset);
 	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
 	MultiXactState->nextMXact = nextMulti;
 	MultiXactState->nextOffset = nextMultiOffset;
@@ -2519,8 +2521,8 @@ MultiXactAdvanceNextMXact(MultiXactId minMulti,
 	}
 	if (MultiXactOffsetPrecedes(MultiXactState->nextOffset, minMultiOffset))
 	{
-		debug_elog3(DEBUG2, "MultiXact: setting next offset to %u",
-					minMultiOffset);
+		debug_elog3(DEBUG2, "MultiXact: setting next offset to %llu",
+					(unsigned long long) minMultiOffset);
 		MultiXactState->nextOffset = minMultiOffset;
 	}
 	LWLockRelease(MultiXactGenLock);
@@ -3211,11 +3213,12 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 
 	elog(DEBUG1, "performing multixact truncation: "
 		 "offsets [%u, %u), offsets segments [%llx, %llx), "
-		 "members [%u, %u), members segments [%llx, %llx)",
+		 "members [%llu, %llu), members segments [%llx, %llx)",
 		 oldestMulti, newOldestMulti,
 		 (unsigned long long) MultiXactIdToOffsetSegment(oldestMulti),
 		 (unsigned long long) MultiXactIdToOffsetSegment(newOldestMulti),
-		 oldestOffset, newOldestOffset,
+		 (unsigned long long) oldestOffset,
+		 (unsigned long long) newOldestOffset,
 		 (unsigned long long) MXOffsetToMemberSegment(oldestOffset),
 		 (unsigned long long) MXOffsetToMemberSegment(newOldestOffset));
 
@@ -3471,11 +3474,12 @@ multixact_redo(XLogReaderState *record)
 
 		elog(DEBUG1, "replaying multixact truncation: "
 			 "offsets [%u, %u), offsets segments [%llx, %llx), "
-			 "members [%u, %u), members segments [%llx, %llx)",
+			 "members [%llu, %llu), members segments [%llx, %llx)",
 			 xlrec.startTruncOff, xlrec.endTruncOff,
 			 (unsigned long long) MultiXactIdToOffsetSegment(xlrec.startTruncOff),
 			 (unsigned long long) MultiXactIdToOffsetSegment(xlrec.endTruncOff),
-			 xlrec.startTruncMemb, xlrec.endTruncMemb,
+			 (unsigned long long) xlrec.startTruncMemb,
+			 (unsigned long long) xlrec.endTruncMemb,
 			 (unsigned long long) MXOffsetToMemberSegment(xlrec.startTruncMemb),
 			 (unsigned long long) MXOffsetToMemberSegment(xlrec.endTruncMemb));
 
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index c6994b78282..d3af989c598 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -876,8 +876,9 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 							 U64FromFullTransactionId(checkPoint.nextXid),
 							 checkPoint.nextOid)));
 	ereport(DEBUG1,
-			(errmsg_internal("next MultiXactId: %u; next MultiXactOffset: %u",
-							 checkPoint.nextMulti, checkPoint.nextMultiOffset)));
+			(errmsg_internal("next MultiXactId: %u; next MultiXactOffset: %llu",
+							 checkPoint.nextMulti,
+							 (unsigned long long) checkPoint.nextMultiOffset)));
 	ereport(DEBUG1,
 			(errmsg_internal("oldest unfrozen transaction ID: %u, in database %u",
 							 checkPoint.oldestXid, checkPoint.oldestXidDB)));
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index 93a05d80ca7..43b6727570a 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -253,8 +253,8 @@ main(int argc, char *argv[])
 		   ControlFile->checkPointCopy.nextOid);
 	printf(_("Latest checkpoint's NextMultiXactId:  %u\n"),
 		   ControlFile->checkPointCopy.nextMulti);
-	printf(_("Latest checkpoint's NextMultiOffset:  %u\n"),
-		   ControlFile->checkPointCopy.nextMultiOffset);
+	printf(_("Latest checkpoint's NextMultiOffset:  %llu\n"),
+		   (unsigned long long) ControlFile->checkPointCopy.nextMultiOffset);
 	printf(_("Latest checkpoint's oldestXID:        %u\n"),
 		   ControlFile->checkPointCopy.oldestXid);
 	printf(_("Latest checkpoint's oldestXID's DB:   %u\n"),
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index e9dcb5a6d89..985cd068029 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -737,8 +737,8 @@ PrintControlValues(bool guessed)
 		   ControlFile.checkPointCopy.nextOid);
 	printf(_("Latest checkpoint's NextMultiXactId:  %u\n"),
 		   ControlFile.checkPointCopy.nextMulti);
-	printf(_("Latest checkpoint's NextMultiOffset:  %u\n"),
-		   ControlFile.checkPointCopy.nextMultiOffset);
+	printf(_("Latest checkpoint's NextMultiOffset:  %llu\n"),
+		   (unsigned long long) ControlFile.checkPointCopy.nextMultiOffset);
 	printf(_("Latest checkpoint's oldestXID:        %u\n"),
 		   ControlFile.checkPointCopy.oldestXid);
 	printf(_("Latest checkpoint's oldestXID's DB:   %u\n"),
@@ -809,8 +809,8 @@ PrintNewControlValues(void)
 
 	if (set_mxoff != -1)
 	{
-		printf(_("NextMultiOffset:                      %u\n"),
-			   ControlFile.checkPointCopy.nextMultiOffset);
+		printf(_("NextMultiOffset:                      %llu\n"),
+			   (unsigned long long) ControlFile.checkPointCopy.nextMultiOffset);
 	}
 
 	if (set_oid != 0)
-- 
2.39.5

v10-0002-Use-64-bit-multixact-offsets.patch (text/x-patch; charset=UTF-8)
From 164e5ee26b7d03968a8eb371a894d35b47d18e61 Mon Sep 17 00:00:00 2001
From: Maxim Orlov <m.orlov@postgrespro.ru>
Date: Wed, 6 Mar 2024 11:11:33 +0300
Subject: [PATCH v10 02/14] Use 64-bit multixact offsets.

Author: Maxim Orlov <orlovmg@gmail.com>
---
 src/backend/access/transam/multixact.c | 170 +------------------------
 src/bin/pg_resetwal/pg_resetwal.c      |   2 +-
 src/bin/pg_resetwal/t/001_basic.pl     |   2 +-
 src/include/access/multixact.h         |   2 +-
 src/include/c.h                        |   2 +-
 5 files changed, 10 insertions(+), 168 deletions(-)

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index ab90912ed3a..48e1c0160a8 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -96,14 +96,6 @@
 /*
  * Defines for MultiXactOffset page sizes.  A page is the same BLCKSZ as is
  * used everywhere else in Postgres.
- *
- * Note: because MultiXactOffsets are 32 bits and wrap around at 0xFFFFFFFF,
- * MultiXact page numbering also wraps around at
- * 0xFFFFFFFF/MULTIXACT_OFFSETS_PER_PAGE, and segment numbering at
- * 0xFFFFFFFF/MULTIXACT_OFFSETS_PER_PAGE/SLRU_PAGES_PER_SEGMENT.  We need
- * take no explicit notice of that fact in this module, except when comparing
- * segment and page numbers in TruncateMultiXact (see
- * MultiXactOffsetPagePrecedes).
  */
 
 /* We need four bytes per offset */
@@ -272,9 +264,6 @@ typedef struct MultiXactStateData
 	MultiXactId multiStopLimit;
 	MultiXactId multiWrapLimit;
 
-	/* support for members anti-wraparound measures */
-	MultiXactOffset offsetStopLimit;	/* known if oldestOffsetKnown */
-
 	/*
 	 * This is used to sleep until a multixact offset is written when we want
 	 * to create the next one.
@@ -409,8 +398,6 @@ static bool MultiXactOffsetPrecedes(MultiXactOffset offset1,
 									MultiXactOffset offset2);
 static void ExtendMultiXactOffset(MultiXactId multi);
 static void ExtendMultiXactMember(MultiXactOffset offset, int nmembers);
-static bool MultiXactOffsetWouldWrap(MultiXactOffset boundary,
-									 MultiXactOffset start, uint32 distance);
 static bool SetOffsetVacuumLimit(bool is_startup);
 static bool find_multixact_start(MultiXactId multi, MultiXactOffset *result);
 static void WriteMZeroPageXlogRec(int64 pageno, uint8 info);
@@ -1164,78 +1151,6 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 	else
 		*offset = nextOffset;
 
-	/*----------
-	 * Protect against overrun of the members space as well, with the
-	 * following rules:
-	 *
-	 * If we're past offsetStopLimit, refuse to generate more multis.
-	 * If we're close to offsetStopLimit, emit a warning.
-	 *
-	 * Arbitrarily, we start emitting warnings when we're 20 segments or less
-	 * from offsetStopLimit.
-	 *
-	 * Note we haven't updated the shared state yet, so if we fail at this
-	 * point, the multixact ID we grabbed can still be used by the next guy.
-	 *
-	 * Note that there is no point in forcing autovacuum runs here: the
-	 * multixact freeze settings would have to be reduced for that to have any
-	 * effect.
-	 *----------
-	 */
-#define OFFSET_WARN_SEGMENTS	20
-	if (MultiXactState->oldestOffsetKnown &&
-		MultiXactOffsetWouldWrap(MultiXactState->offsetStopLimit, nextOffset,
-								 nmembers))
-	{
-		/* see comment in the corresponding offsets wraparound case */
-		SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER);
-
-		ereport(ERROR,
-				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg("multixact \"members\" limit exceeded"),
-				 errdetail_plural("This command would create a multixact with %u members, but the remaining space is only enough for %u member.",
-								  "This command would create a multixact with %u members, but the remaining space is only enough for %u members.",
-								  MultiXactState->offsetStopLimit - nextOffset - 1,
-								  nmembers,
-								  MultiXactState->offsetStopLimit - nextOffset - 1),
-				 errhint("Execute a database-wide VACUUM in database with OID %u with reduced \"vacuum_multixact_freeze_min_age\" and \"vacuum_multixact_freeze_table_age\" settings.",
-						 MultiXactState->oldestMultiXactDB)));
-	}
-
-	/*
-	 * Check whether we should kick autovacuum into action, to prevent members
-	 * wraparound. NB we use a much larger window to trigger autovacuum than
-	 * just the warning limit. The warning is just a measure of last resort -
-	 * this is in line with GetNewTransactionId's behaviour.
-	 */
-	if (!MultiXactState->oldestOffsetKnown ||
-		(MultiXactState->nextOffset - MultiXactState->oldestOffset
-		 > MULTIXACT_MEMBER_SAFE_THRESHOLD))
-	{
-		/*
-		 * To avoid swamping the postmaster with signals, we issue the autovac
-		 * request only when crossing a segment boundary. With default
-		 * compilation settings that's roughly after 50k members.  This still
-		 * gives plenty of chances before we get into real trouble.
-		 */
-		if ((MXOffsetToMemberPage(nextOffset) / SLRU_PAGES_PER_SEGMENT) !=
-			(MXOffsetToMemberPage(nextOffset + nmembers) / SLRU_PAGES_PER_SEGMENT))
-			SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER);
-	}
-
-	if (MultiXactState->oldestOffsetKnown &&
-		MultiXactOffsetWouldWrap(MultiXactState->offsetStopLimit,
-								 nextOffset,
-								 nmembers + MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT * OFFSET_WARN_SEGMENTS))
-		ereport(WARNING,
-				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg_plural("database with OID %u must be vacuumed before %d more multixact member is used",
-							   "database with OID %u must be vacuumed before %d more multixact members are used",
-							   MultiXactState->offsetStopLimit - nextOffset + nmembers,
-							   MultiXactState->oldestMultiXactDB,
-							   MultiXactState->offsetStopLimit - nextOffset + nmembers),
-				 errhint("Execute a database-wide VACUUM in that database with reduced \"vacuum_multixact_freeze_min_age\" and \"vacuum_multixact_freeze_table_age\" settings.")));
-
 	ExtendMultiXactMember(nextOffset, nmembers);
 
 	/*
@@ -2721,8 +2636,6 @@ SetOffsetVacuumLimit(bool is_startup)
 	MultiXactOffset nextOffset;
 	bool		oldestOffsetKnown = false;
 	bool		prevOldestOffsetKnown;
-	MultiXactOffset offsetStopLimit = 0;
-	MultiXactOffset prevOffsetStopLimit;
 
 	/*
 	 * NB: Have to prevent concurrent truncation, we might otherwise try to
@@ -2737,7 +2650,6 @@ SetOffsetVacuumLimit(bool is_startup)
 	nextOffset = MultiXactState->nextOffset;
 	prevOldestOffsetKnown = MultiXactState->oldestOffsetKnown;
 	prevOldestOffset = MultiXactState->oldestOffset;
-	prevOffsetStopLimit = MultiXactState->offsetStopLimit;
 	Assert(MultiXactState->finishedStartup);
 	LWLockRelease(MultiXactGenLock);
 
@@ -2768,11 +2680,7 @@ SetOffsetVacuumLimit(bool is_startup)
 		oldestOffsetKnown =
 			find_multixact_start(oldestMultiXactId, &oldestOffset);
 
-		if (oldestOffsetKnown)
-			ereport(DEBUG1,
-					(errmsg_internal("oldest MultiXactId member is at offset %u",
-									 oldestOffset)));
-		else
+		if (!oldestOffsetKnown)
 			ereport(LOG,
 					(errmsg("MultiXact member wraparound protections are disabled because oldest checkpointed MultiXact %u does not exist on disk",
 							oldestMultiXactId)));
@@ -2785,24 +2693,7 @@ SetOffsetVacuumLimit(bool is_startup)
 	 * overrun of old data in the members SLRU area. We can only do so if the
 	 * oldest offset is known though.
 	 */
-	if (oldestOffsetKnown)
-	{
-		/* move back to start of the corresponding segment */
-		offsetStopLimit = oldestOffset - (oldestOffset %
-										  (MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT));
-
-		/* always leave one segment before the wraparound point */
-		offsetStopLimit -= (MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT);
-
-		if (!prevOldestOffsetKnown && !is_startup)
-			ereport(LOG,
-					(errmsg("MultiXact member wraparound protections are now enabled")));
-
-		ereport(DEBUG1,
-				(errmsg_internal("MultiXact member stop limit is now %u based on MultiXact %u",
-								 offsetStopLimit, oldestMultiXactId)));
-	}
-	else if (prevOldestOffsetKnown)
+	if (prevOldestOffsetKnown)
 	{
 		/*
 		 * If we failed to get the oldest offset this time, but we have a
@@ -2812,14 +2703,12 @@ SetOffsetVacuumLimit(bool is_startup)
 		 */
 		oldestOffset = prevOldestOffset;
 		oldestOffsetKnown = true;
-		offsetStopLimit = prevOffsetStopLimit;
 	}
 
 	/* Install the computed values */
 	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
 	MultiXactState->oldestOffset = oldestOffset;
 	MultiXactState->oldestOffsetKnown = oldestOffsetKnown;
-	MultiXactState->offsetStopLimit = offsetStopLimit;
 	LWLockRelease(MultiXactGenLock);
 
 	/*
@@ -2829,54 +2718,6 @@ SetOffsetVacuumLimit(bool is_startup)
 		(nextOffset - oldestOffset > MULTIXACT_MEMBER_SAFE_THRESHOLD);
 }
 
-/*
- * Return whether adding "distance" to "start" would move past "boundary".
- *
- * We use this to determine whether the addition is "wrapping around" the
- * boundary point, hence the name.  The reason we don't want to use the regular
- * 2^31-modulo arithmetic here is that we want to be able to use the whole of
- * the 2^32-1 space here, allowing for more multixacts than would fit
- * otherwise.
- */
-static bool
-MultiXactOffsetWouldWrap(MultiXactOffset boundary, MultiXactOffset start,
-						 uint32 distance)
-{
-	MultiXactOffset finish;
-
-	/*
-	 * Note that offset number 0 is not used (see GetMultiXactIdMembers), so
-	 * if the addition wraps around the UINT_MAX boundary, skip that value.
-	 */
-	finish = start + distance;
-	if (finish < start)
-		finish++;
-
-	/*-----------------------------------------------------------------------
-	 * When the boundary is numerically greater than the starting point, any
-	 * value numerically between the two is not wrapped:
-	 *
-	 *	<----S----B---->
-	 *	[---)			 = F wrapped past B (and UINT_MAX)
-	 *		 [---)		 = F not wrapped
-	 *			  [----] = F wrapped past B
-	 *
-	 * When the boundary is numerically less than the starting point (i.e. the
-	 * UINT_MAX wraparound occurs somewhere in between) then all values in
-	 * between are wrapped:
-	 *
-	 *	<----B----S---->
-	 *	[---)			 = F not wrapped past B (but wrapped past UINT_MAX)
-	 *		 [---)		 = F wrapped past B (and UINT_MAX)
-	 *			  [----] = F not wrapped
-	 *-----------------------------------------------------------------------
-	 */
-	if (start < boundary)
-		return finish >= boundary || finish < start;
-	else
-		return finish >= boundary && finish < start;
-}
-
 /*
  * Find the starting offset of the given MultiXactId.
  *
@@ -2998,8 +2839,9 @@ MultiXactMemberFreezeThreshold(void)
 	 * we try to eliminate from the system is based on how far we are past
 	 * MULTIXACT_MEMBER_SAFE_THRESHOLD.
 	 */
-	fraction = (double) (members - MULTIXACT_MEMBER_SAFE_THRESHOLD) /
-		(MULTIXACT_MEMBER_DANGER_THRESHOLD - MULTIXACT_MEMBER_SAFE_THRESHOLD);
+	fraction = (double) (members - MULTIXACT_MEMBER_SAFE_THRESHOLD);
+	fraction /= (double) (MULTIXACT_MEMBER_DANGER_THRESHOLD - MULTIXACT_MEMBER_SAFE_THRESHOLD);
+
 	victim_multixacts = multixacts * fraction;
 
 	/* fraction could be > 1.0, but lowest possible freeze age is zero */
@@ -3345,7 +3187,7 @@ MultiXactIdPrecedesOrEquals(MultiXactId multi1, MultiXactId multi2)
 static bool
 MultiXactOffsetPrecedes(MultiXactOffset offset1, MultiXactOffset offset2)
 {
-	int32		diff = (int32) (offset1 - offset2);
+	int64		diff = (int64) (offset1 - offset2);
 
 	return (diff < 0);
 }
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index 985cd068029..1af2ce4b93b 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -264,7 +264,7 @@ main(int argc, char *argv[])
 
 			case 'O':
 				errno = 0;
-				set_mxoff = strtoul(optarg, &endptr, 0);
+				set_mxoff = strtou64(optarg, &endptr, 0);
 				if (endptr == optarg || *endptr != '\0' || errno != 0)
 				{
 					pg_log_error("invalid argument for option %s", "-O");
diff --git a/src/bin/pg_resetwal/t/001_basic.pl b/src/bin/pg_resetwal/t/001_basic.pl
index 9829e48106e..f8a8eef44d1 100644
--- a/src/bin/pg_resetwal/t/001_basic.pl
+++ b/src/bin/pg_resetwal/t/001_basic.pl
@@ -206,7 +206,7 @@ push @cmd,
   sprintf("%d,%d", hex($files[0]) == 0 ? 3 : hex($files[0]), hex($files[-1]));
 
 @files = get_slru_files('pg_multixact/offsets');
-$mult = 32 * $blcksz / 4;
+$mult = 32 * $blcksz / 8;
 # -m argument is "new,old"
 push @cmd, '-m',
   sprintf("%d,%d",
diff --git a/src/include/access/multixact.h b/src/include/access/multixact.h
index 7ffd256c744..90583634ec9 100644
--- a/src/include/access/multixact.h
+++ b/src/include/access/multixact.h
@@ -27,7 +27,7 @@
 
 #define MultiXactIdIsValid(multi) ((multi) != InvalidMultiXactId)
 
-#define MaxMultiXactOffset	((MultiXactOffset) 0xFFFFFFFF)
+#define MaxMultiXactOffset	UINT64CONST(0xFFFFFFFFFFFFFFFF)
 
 /*
  * Possible multixact lock modes ("status").  The first four modes are for
diff --git a/src/include/c.h b/src/include/c.h
index 13bb39fdef3..b5b6b9261b0 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -615,7 +615,7 @@ typedef uint32 SubTransactionId;
 /* MultiXactId must be equivalent to TransactionId, to fit in t_xmax */
 typedef TransactionId MultiXactId;
 
-typedef uint32 MultiXactOffset;
+typedef uint64 MultiXactOffset;
 
 typedef uint32 CommandId;
 
-- 
2.39.5
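
For readers following along, here is a minimal standalone sketch (not part of the patch) of two things 0002 changes. The first is the wraparound-aware comparison in MultiXactOffsetPrecedes, now done on a signed 64-bit difference. The second is the number of offset entries per SLRU segment, which halves because each entry grows from 4 to 8 bytes (the "32 * $blcksz / 8" in 001_basic.pl). The names offset_precedes and offsets_per_segment are hypothetical, and BLCKSZ is assumed to be the default 8192.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define BLCKSZ 8192					/* assumption: default block size */
#define SLRU_PAGES_PER_SEGMENT 32	/* same value as in slru.h */

/* Hypothetical stand-in for MultiXactOffsetPrecedes() after the patch. */
static bool
offset_precedes(uint64_t offset1, uint64_t offset2)
{
	/*
	 * The unsigned subtraction wraps modulo 2^64; the sign of the result,
	 * reinterpreted as a signed value, tells which offset is "behind".
	 */
	int64_t		diff = (int64_t) (offset1 - offset2);

	return diff < 0;
}

/* Hypothetical helper: how many offset entries fit in one SLRU segment. */
static uint64_t
offsets_per_segment(size_t entry_size)
{
	/* 4 bytes is the old uint32 entry, 8 bytes the new uint64 entry */
	assert(entry_size == 4 || entry_size == 8);

	return (uint64_t) SLRU_PAGES_PER_SEGMENT * BLCKSZ / entry_size;
}
```

Under these assumptions a segment holds 65536 old entries but only 32768 new ones, which is why the test has to change the multiplier it uses to turn segment file names into multixact IDs.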

v10-0003-Make-pg_upgrade-convert-multixact-offsets.patch (text/x-patch; charset=UTF-8)
From a80e03edba88eb8375443616921b59ea0a432326 Mon Sep 17 00:00:00 2001
From: Maxim Orlov <m.orlov@postgrespro.ru>
Date: Tue, 13 Aug 2024 14:44:50 +0300
Subject: [PATCH v10 03/14] Make pg_upgrade convert multixact offsets.

Author: Maxim Orlov <orlovmg@gmail.com>
Author: Yura Sokolov <y.sokolov@postgrespro.ru>
---
 src/bin/pg_upgrade/Makefile     |   1 +
 src/bin/pg_upgrade/meson.build  |   1 +
 src/bin/pg_upgrade/pg_upgrade.c |  42 ++-
 src/bin/pg_upgrade/pg_upgrade.h |  14 +-
 src/bin/pg_upgrade/segresize.c  | 527 ++++++++++++++++++++++++++++++++
 5 files changed, 580 insertions(+), 5 deletions(-)
 create mode 100644 src/bin/pg_upgrade/segresize.c

diff --git a/src/bin/pg_upgrade/Makefile b/src/bin/pg_upgrade/Makefile
index f83d2b5d309..70908d63a31 100644
--- a/src/bin/pg_upgrade/Makefile
+++ b/src/bin/pg_upgrade/Makefile
@@ -21,6 +21,7 @@ OBJS = \
 	info.o \
 	option.o \
 	parallel.o \
+	segresize.o \
 	pg_upgrade.o \
 	relfilenumber.o \
 	server.o \
diff --git a/src/bin/pg_upgrade/meson.build b/src/bin/pg_upgrade/meson.build
index 3d884196746..16f898ba148 100644
--- a/src/bin/pg_upgrade/meson.build
+++ b/src/bin/pg_upgrade/meson.build
@@ -10,6 +10,7 @@ pg_upgrade_sources = files(
   'info.c',
   'option.c',
   'parallel.c',
+  'segresize.c',
   'pg_upgrade.c',
   'relfilenumber.c',
   'server.c',
diff --git a/src/bin/pg_upgrade/pg_upgrade.c b/src/bin/pg_upgrade/pg_upgrade.c
index 663235816f8..1654e877c07 100644
--- a/src/bin/pg_upgrade/pg_upgrade.c
+++ b/src/bin/pg_upgrade/pg_upgrade.c
@@ -750,8 +750,42 @@ copy_xact_xlog_xid(void)
 	if (old_cluster.controldata.cat_ver >= MULTIXACT_FORMATCHANGE_CAT_VER &&
 		new_cluster.controldata.cat_ver >= MULTIXACT_FORMATCHANGE_CAT_VER)
 	{
-		copy_subdir_files("pg_multixact/offsets", "pg_multixact/offsets");
-		copy_subdir_files("pg_multixact/members", "pg_multixact/members");
+		/*
+		 * If the old cluster predates MULTIXACTOFFSET_FORMATCHANGE_CAT_VER, it
+		 * has 32-bit multixact offsets, which must be converted to 64-bit.
+		 */
+		if (old_cluster.controldata.cat_ver < MULTIXACTOFFSET_FORMATCHANGE_CAT_VER &&
+			new_cluster.controldata.cat_ver >= MULTIXACTOFFSET_FORMATCHANGE_CAT_VER)
+		{
+			MultiXactOffset		oldest_offset,
+								next_offset;
+
+			remove_new_subdir("pg_multixact/offsets", false);
+			prep_status("Converting pg_multixact/offsets to 64-bit");
+			oldest_offset = convert_multixact_offsets();
+			check_ok();
+
+			remove_new_subdir("pg_multixact/members", false);
+			prep_status("Converting pg_multixact/members");
+			convert_multixact_members(oldest_offset);
+			check_ok();
+
+			next_offset = old_cluster.controldata.chkpnt_nxtmxoff;
+			if (oldest_offset)
+			{
+				if (next_offset < oldest_offset)
+					next_offset += ((MultiXactOffset) 1 << 32) - 1;
+
+				next_offset -= oldest_offset - 1;
+
+				old_cluster.controldata.chkpnt_nxtmxoff = next_offset;
+			}
+		}
+		else
+		{
+			copy_subdir_files("pg_multixact/offsets", "pg_multixact/offsets");
+			copy_subdir_files("pg_multixact/members", "pg_multixact/members");
+		}
 
 		prep_status("Setting next multixact ID and offset for new cluster");
 
@@ -760,9 +794,9 @@ copy_xact_xlog_xid(void)
 		 * counters here and the oldest multi present on system.
 		 */
 		exec_prog(UTILITY_LOG_FILE, NULL, true, true,
-				  "\"%s/pg_resetwal\" -O %u -m %u,%u \"%s\"",
+				  "\"%s/pg_resetwal\" -O %llu -m %u,%u \"%s\"",
 				  new_cluster.bindir,
-				  old_cluster.controldata.chkpnt_nxtmxoff,
+				  (unsigned long long) old_cluster.controldata.chkpnt_nxtmxoff,
 				  old_cluster.controldata.chkpnt_nxtmulti,
 				  old_cluster.controldata.chkpnt_oldstMulti,
 				  new_cluster.pgdata);
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index 53f693c2d4b..2c85ec1e949 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -114,6 +114,13 @@ extern char *output_files[];
  */
 #define MULTIXACT_FORMATCHANGE_CAT_VER 201301231
 
+/*
+ * Switch from 32-bit to 64-bit for multixact offsets.
+ *
+ * XXX: should be changed to the actual CATALOG_VERSION_NO on commit.
+ */
+#define MULTIXACTOFFSET_FORMATCHANGE_CAT_VER 202409041
+
 /*
  * large object chunk size added to pg_controldata,
  * commit 5f93c37805e7485488480916b4585e098d3cc883
@@ -230,7 +237,7 @@ typedef struct
 	uint32		chkpnt_nxtepoch;
 	uint32		chkpnt_nxtoid;
 	uint32		chkpnt_nxtmulti;
-	uint32		chkpnt_nxtmxoff;
+	uint64		chkpnt_nxtmxoff;
 	uint32		chkpnt_oldstMulti;
 	uint32		chkpnt_oldstxid;
 	uint32		align;
@@ -515,3 +522,8 @@ typedef struct
 	FILE	   *file;
 	char		path[MAXPGPATH];
 } UpgradeTaskReport;
+
+/* segresize.c */
+
+MultiXactOffset		convert_multixact_offsets(void);
+void				convert_multixact_members(MultiXactOffset oldest_offset);
diff --git a/src/bin/pg_upgrade/segresize.c b/src/bin/pg_upgrade/segresize.c
new file mode 100644
index 00000000000..73064c77deb
--- /dev/null
+++ b/src/bin/pg_upgrade/segresize.c
@@ -0,0 +1,527 @@
+/*
+ *	segresize.c
+ *
+ *	SLRU segment resize utility
+ *
+ *	Copyright (c) 2024, PostgreSQL Global Development Group
+ *	src/bin/pg_upgrade/segresize.c
+ */
+
+#include "postgres_fe.h"
+
+#include "pg_upgrade.h"
+#include "access/multixact.h"
+
+/* See slru.h */
+#define SLRU_PAGES_PER_SEGMENT		32
+
+/*
+ * Some kind of iterator associated with a particular SLRU segment.  The idea is
+ * to specify the segment and page number and then move through the pages.
+ */
+typedef struct SlruSegState
+{
+	char	   *dir;
+	char	   *fn;
+	FILE	   *file;
+	int64		segno;
+	uint64		pageno;
+	bool		leading_gap;
+} SlruSegState;
+
+/*
+ * Mirrors the SlruFileName from slru.c
+ */
+static inline char *
+SlruFileName(SlruSegState *state)
+{
+	Assert(state->segno >= 0 && state->segno <= INT64CONST(0xFFFFFF));
+	return psprintf("%s/%04X", state->dir, (unsigned int) state->segno);
+}
+
+/*
+ * Create new SLRU segment file.
+ */
+static void
+create_segment(SlruSegState *state)
+{
+	Assert(state->fn == NULL);
+	Assert(state->file == NULL);
+
+	state->fn = SlruFileName(state);
+	state->file = fopen(state->fn, "wb");
+	if (!state->file)
+		pg_fatal("could not create file \"%s\": %m", state->fn);
+}
+
+/*
+ * Open existing SLRU segment file.
+ */
+static void
+open_segment(SlruSegState *state)
+{
+	Assert(state->fn == NULL);
+	Assert(state->file == NULL);
+
+	state->fn = SlruFileName(state);
+	state->file = fopen(state->fn, "rb");
+	if (!state->file)
+		pg_fatal("could not open file \"%s\": %m", state->fn);
+}
+
+/*
+ * Close SLRU segment file.
+ */
+static void
+close_segment(SlruSegState *state)
+{
+	if (state->file)
+	{
+		fclose(state->file);
+		state->file = NULL;
+	}
+
+	if (state->fn)
+	{
+		pfree(state->fn);
+		state->fn = NULL;
+	}
+}
+
+/*
+ * Read next page from the old 32-bit offset segment file.
+ */
+static int
+read_old_segment_page(SlruSegState *state, void *buf, bool *empty)
+{
+	int		len;
+
+	/* Open next segment file, if needed. */
+	if (!state->fn)
+	{
+		if (!state->segno)
+			state->leading_gap = true;
+
+		open_segment(state);
+
+		/* Set position to the needed page. */
+		if (state->pageno > 0 &&
+			fseek(state->file, state->pageno * BLCKSZ, SEEK_SET))
+		{
+			close_segment(state);
+		}
+	}
+
+	if (state->file)
+	{
+		/* Segment file exists, read a page from it. */
+		state->leading_gap = false;
+
+		len = fread(buf, sizeof(char), BLCKSZ, state->file);
+
+		/* Are we done or was there an error? */
+		if (len <= 0)
+		{
+			if (ferror(state->file))
+				pg_fatal("error reading file \"%s\": %m", state->fn);
+
+			if (feof(state->file))
+			{
+				*empty = true;
+				len = -1;
+
+				close_segment(state);
+			}
+		}
+		else
+			*empty = false;
+	}
+	else if (!state->leading_gap)
+	{
+		/* We reached the last segment. */
+		len = -1;
+		*empty = true;
+	}
+	else
+	{
+		/* Skip the first few segments if they were frozen and removed. */
+		len = BLCKSZ;
+		*empty = true;
+	}
+
+	if (++state->pageno >= SLRU_PAGES_PER_SEGMENT)
+	{
+		/* Start a new segment. */
+		state->segno++;
+		state->pageno = 0;
+
+		close_segment(state);
+	}
+
+	return len;
+}
+
+/*
+ * Write next page to the new 64-bit offset segment file.
+ */
+static void
+write_new_segment_page(SlruSegState *state, void *buf)
+{
+	/*
+	 * Create a new segment file if we have not done so yet.  Creation is
+	 * postponed until the first non-empty page is found, which avoids
+	 * creating completely empty segments.
+	 */
+	if (!state->file)
+	{
+		create_segment(state);
+
+		/* Write zeroes to the previously skipped prefix. */
+		if (state->pageno > 0)
+		{
+			char		zerobuf[BLCKSZ] = {0};
+
+			for (int64 i = 0; i < state->pageno; i++)
+			{
+				if (fwrite(zerobuf, sizeof(char), BLCKSZ, state->file) != BLCKSZ)
+					pg_fatal("could not write file \"%s\": %m", state->fn);
+			}
+		}
+	}
+
+	/* Write page to the new segment (if it was created). */
+	if (state->file)
+	{
+		if (fwrite(buf, sizeof(char), BLCKSZ, state->file) != BLCKSZ)
+			pg_fatal("could not write file \"%s\": %m", state->fn);
+	}
+
+	/*
+	 * Did we reach the maximum page number?  Then close segment file
+	 * and create a new one on the next iteration.
+	 */
+	if (++state->pageno >= SLRU_PAGES_PER_SEGMENT)
+	{
+		/* Start a new segment. */
+		state->segno++;
+		state->pageno = 0;
+
+		close_segment(state);
+	}
+}
+
+typedef uint32 MultiXactOffsetOld;
+
+#define MaxMultiXactOffsetOld	((MultiXactOffsetOld) 0xFFFFFFFF)
+
+#define MULTIXACT_OFFSETS_PER_PAGE_OLD (BLCKSZ / sizeof(MultiXactOffsetOld))
+#define MULTIXACT_OFFSETS_PER_PAGE_NEW (BLCKSZ / sizeof(MultiXactOffset))
+
+/*
+ * Convert pg_multixact/offsets segments and return oldest multi offset.
+ */
+MultiXactOffset
+convert_multixact_offsets(void)
+{
+	SlruSegState		oldseg = {0},
+						newseg = {0};
+	MultiXactOffsetOld	oldbuf[MULTIXACT_OFFSETS_PER_PAGE_OLD] = {0};
+	MultiXactOffset		newbuf[MULTIXACT_OFFSETS_PER_PAGE_NEW] = {0},
+						oldest_offset = 0;
+	uint64				oldest_multi = old_cluster.controldata.chkpnt_oldstMulti,
+						next_multi = old_cluster.controldata.chkpnt_nxtmulti,
+						multi,
+						old_entry,
+						new_entry;
+	bool				oldest_offset_known = false;
+
+	oldseg.dir = psprintf("%s/pg_multixact/offsets", old_cluster.pgdata);
+	newseg.dir = psprintf("%s/pg_multixact/offsets", new_cluster.pgdata);
+
+	old_entry = oldest_multi % MULTIXACT_OFFSETS_PER_PAGE_OLD;
+	oldseg.pageno = oldest_multi / MULTIXACT_OFFSETS_PER_PAGE_OLD;
+	oldseg.segno = oldseg.pageno / SLRU_PAGES_PER_SEGMENT;
+	oldseg.pageno %= SLRU_PAGES_PER_SEGMENT;
+
+	new_entry = oldest_multi % MULTIXACT_OFFSETS_PER_PAGE_NEW;
+	newseg.pageno = oldest_multi / MULTIXACT_OFFSETS_PER_PAGE_NEW;
+	newseg.segno = newseg.pageno / SLRU_PAGES_PER_SEGMENT;
+	newseg.pageno %= SLRU_PAGES_PER_SEGMENT;
+
+	if (next_multi < oldest_multi)
+		next_multi += (uint64) 1 << 32;	/* wraparound */
+
+	/* Copy multi offsets reading only needed segment pages */
+	for (multi = oldest_multi; multi < next_multi; old_entry = 0)
+	{
+		int		oldlen;
+		bool	empty;
+
+		/* Handle possible segment wraparound */
+#define OLD_OFFSET_SEGNO_MAX	\
+	(MaxMultiXactId / MULTIXACT_OFFSETS_PER_PAGE_OLD / SLRU_PAGES_PER_SEGMENT)
+		if (oldseg.segno > OLD_OFFSET_SEGNO_MAX)
+		{
+			oldseg.segno = 0;
+			oldseg.pageno = 0;
+		}
+
+		oldlen = read_old_segment_page(&oldseg, oldbuf, &empty);
+		if (empty || oldlen != BLCKSZ)
+			pg_fatal("cannot read page %llu from file \"%s\": %m",
+					 (unsigned long long) oldseg.pageno, oldseg.fn);
+
+		/* Save oldest multi offset */
+		if (!oldest_offset_known)
+		{
+			oldest_offset = oldbuf[old_entry];
+			oldest_offset_known = true;
+		}
+
+		/* Skip wrapped-around invalid MultiXactIds */
+		if (multi == (uint64) 1 << 32)
+		{
+			Assert(oldseg.segno == 0);
+			Assert(oldseg.pageno == 1);
+			Assert(old_entry == 0);
+			Assert(new_entry == 0);
+
+			multi += FirstMultiXactId;
+			old_entry = FirstMultiXactId;
+			new_entry = FirstMultiXactId;
+		}
+
+		/* Copy entries to the new page */
+		for (; multi < next_multi && old_entry < MULTIXACT_OFFSETS_PER_PAGE_OLD;
+			 multi++, old_entry++)
+		{
+			MultiXactOffset offset = oldbuf[old_entry];
+
+			/* Handle possible offset wraparound (1 becomes 2^32) */
+			if (offset < oldest_offset)
+				offset += ((uint64) 1 << 32) - 1;
+
+			/* Subtract oldest_offset, so new offsets will start from 1 */
+			newbuf[new_entry++] = offset - oldest_offset + 1;
+
+			if (new_entry >= MULTIXACT_OFFSETS_PER_PAGE_NEW)
+			{
+				/* Handle possible segment wraparound */
+#define NEW_OFFSET_SEGNO_MAX	\
+	(MaxMultiXactId / MULTIXACT_OFFSETS_PER_PAGE_NEW / SLRU_PAGES_PER_SEGMENT)
+				if (newseg.segno > NEW_OFFSET_SEGNO_MAX)
+				{
+					newseg.segno = 0;
+					newseg.pageno = 0;
+				}
+
+				/* Write new page */
+				write_new_segment_page(&newseg, newbuf);
+				new_entry = 0;
+			}
+		}
+	}
+
+	/* Write the last incomplete page */
+	if (new_entry > 0 || oldest_multi == next_multi)
+	{
+		memset(&newbuf[new_entry], 0,
+			   sizeof(newbuf[0]) * (MULTIXACT_OFFSETS_PER_PAGE_NEW - new_entry));
+		write_new_segment_page(&newseg, newbuf);
+	}
+
+	/* Use next_offset as oldest_offset, if oldest_multi == next_multi */
+	if (!oldest_offset_known)
+	{
+		Assert(oldest_multi == next_multi);
+		oldest_offset = (MultiXactOffset) old_cluster.controldata.chkpnt_nxtmxoff;
+	}
+
+	/* Release resources */
+	close_segment(&oldseg);
+	close_segment(&newseg);
+
+	pfree(oldseg.dir);
+	pfree(newseg.dir);
+
+	return oldest_offset;
+}
+
+#define MXACT_MEMBERS_FLAG_BYTES			1
+
+#define MULTIXACT_MEMBERS_PER_GROUP			4
+#define MULTIXACT_MEMBERGROUP_SIZE			\
+	(MULTIXACT_MEMBERS_PER_GROUP * (sizeof(TransactionId) + MXACT_MEMBERS_FLAG_BYTES))
+#define MULTIXACT_MEMBERGROUPS_PER_PAGE		\
+	(BLCKSZ / MULTIXACT_MEMBERGROUP_SIZE)
+
+#define MULTIXACT_MEMBERS_PER_PAGE				\
+	(MULTIXACT_MEMBERS_PER_GROUP * MULTIXACT_MEMBERGROUPS_PER_PAGE)
+#define MULTIXACT_MEMBER_FLAG_BYTES_PER_GROUP	\
+	(MXACT_MEMBERS_FLAG_BYTES * MULTIXACT_MEMBERS_PER_GROUP)
+
+typedef struct MultiXactMembersCtx
+{
+	SlruSegState	seg;
+	char			buf[BLCKSZ];
+	int				group;
+	int				member;
+	char		   *flag;
+	TransactionId  *xid;
+} MultiXactMembersCtx;
+
+static void
+MultiXactMembersCtxInit(MultiXactMembersCtx *ctx)
+{
+	ctx->seg.dir = psprintf("%s/pg_multixact/members", new_cluster.pgdata);
+
+	ctx->group = 0;
+	ctx->member = 1;		/* skip invalid zero offset */
+
+	ctx->flag = (char *) ctx->buf + ctx->group * MULTIXACT_MEMBERGROUP_SIZE;
+	ctx->xid = (TransactionId *)(ctx->flag + MXACT_MEMBERS_FLAG_BYTES * MULTIXACT_MEMBERS_PER_GROUP);
+
+	ctx->flag += ctx->member;
+	ctx->xid += ctx->member;
+}
+
+static void
+MultiXactMembersCtxAdd(MultiXactMembersCtx *ctx, char flag, TransactionId xid)
+{
+	/* Copy member's xid and flags to the new page */
+	*ctx->flag++ = flag;
+	*ctx->xid++ = xid;
+
+	if (++ctx->member < MULTIXACT_MEMBERS_PER_GROUP)
+		return;
+
+	/* Start next member group */
+	ctx->member = 0;
+
+	if (++ctx->group >= MULTIXACT_MEMBERGROUPS_PER_PAGE)
+	{
+		/* Write current page and start new */
+		write_new_segment_page(&ctx->seg, ctx->buf);
+
+		ctx->group = 0;
+		memset(ctx->buf, 0, BLCKSZ);
+	}
+
+	ctx->flag = (char *) ctx->buf + ctx->group * MULTIXACT_MEMBERGROUP_SIZE;
+	ctx->xid = (TransactionId *)(ctx->flag + MXACT_MEMBERS_FLAG_BYTES * MULTIXACT_MEMBERS_PER_GROUP);
+}
+
+static void
+MultiXactMembersCtxFinit(MultiXactMembersCtx *ctx)
+{
+	if (ctx->flag > (char *) ctx->buf)
+		write_new_segment_page(&ctx->seg, ctx->buf);
+
+	close_segment(&ctx->seg);
+
+	pfree(ctx->seg.dir);
+}
+
+/*
+ * Convert pg_multixact/members segments, offsets will start from 1.
+ *
+ */
+void
+convert_multixact_members(MultiXactOffset oldest_offset)
+{
+	MultiXactOffset			next_offset,
+							offset;
+	SlruSegState			oldseg = {0};
+	char					oldbuf[BLCKSZ] = {0};
+	int						oldidx;
+	MultiXactMembersCtx		newctx = {0};
+
+	oldseg.dir = psprintf("%s/pg_multixact/members", old_cluster.pgdata);
+
+	next_offset = (MultiXactOffset) old_cluster.controldata.chkpnt_nxtmxoff;
+	if (next_offset < oldest_offset)
+		next_offset += ((uint64) 1 << 32) - 1;
+
+	/* Initialize the old starting position */
+	oldseg.pageno = oldest_offset / MULTIXACT_MEMBERS_PER_PAGE;
+	oldseg.segno = oldseg.pageno / SLRU_PAGES_PER_SEGMENT;
+	oldseg.pageno %= SLRU_PAGES_PER_SEGMENT;
+
+	/* Initialize new starting position */
+	MultiXactMembersCtxInit(&newctx);
+
+	/* Iterate through the original directory */
+	oldidx = oldest_offset % MULTIXACT_MEMBERS_PER_PAGE;
+	for (offset = oldest_offset; offset < next_offset;)
+	{
+		bool	empty;
+		int		oldlen;
+		int		ngroups;
+		int		oldgroup;
+		int		oldmember;
+
+		oldlen = read_old_segment_page(&oldseg, oldbuf, &empty);
+		if (empty || oldlen != BLCKSZ)
+			pg_fatal("cannot read page %llu from file \"%s\": %m",
+					 (unsigned long long) oldseg.pageno, oldseg.fn);
+
+		/* Iterate through the old member groups */
+		ngroups = oldlen / MULTIXACT_MEMBERGROUP_SIZE;
+		oldmember = oldidx % MULTIXACT_MEMBERS_PER_GROUP;
+		oldgroup = oldidx / MULTIXACT_MEMBERS_PER_GROUP;
+		while (oldgroup < ngroups && offset < next_offset)
+		{
+			char		   *oldflag;
+			TransactionId  *oldxid;
+			int				i;
+
+			oldflag = (char *) oldbuf + oldgroup * MULTIXACT_MEMBERGROUP_SIZE;
+			oldxid = (TransactionId *)(oldflag + MULTIXACT_MEMBER_FLAG_BYTES_PER_GROUP);
+
+			oldxid += oldmember;
+			oldflag += oldmember;
+
+			/* Iterate through the old members */
+			for (i = oldmember;
+				 i < MULTIXACT_MEMBERS_PER_GROUP && offset < next_offset;
+				 i++)
+			{
+				MultiXactMembersCtxAdd(&newctx, *oldflag++, *oldxid++);
+
+				if (++offset == (uint64) 1 << 32)
+				{
+					Assert(i == MaxMultiXactOffsetOld % MULTIXACT_MEMBERS_PER_GROUP);
+					goto wraparound;
+				}
+			}
+
+			oldgroup++;
+			oldmember = 0;
+		}
+
+		oldidx = 0;
+
+		continue;
+
+wraparound:
+#define SEGNO_MAX	(MaxMultiXactOffsetOld / MULTIXACT_MEMBERS_PER_PAGE / SLRU_PAGES_PER_SEGMENT)
+#define PAGENO_MAX	(MaxMultiXactOffsetOld / MULTIXACT_MEMBERS_PER_PAGE % SLRU_PAGES_PER_SEGMENT)
+		Assert((oldseg.segno == SEGNO_MAX && oldseg.pageno == PAGENO_MAX + 1) ||
+			   (oldseg.segno == SEGNO_MAX + 1 && oldseg.pageno == 0));
+
+		/* Switch to segment 0000 */
+		close_segment(&oldseg);
+		oldseg.segno = 0;
+		oldseg.pageno = 0;
+
+		/* skip invalid zero multi offset */
+		oldidx = 1;
+	}
+
+	MultiXactMembersCtxFinit(&newctx);
+
+	/* Release resources */
+	close_segment(&oldseg);
+
+	pfree(oldseg.dir);
+}
-- 
2.39.5
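
The offset arithmetic shared by convert_multixact_offsets() and the next_offset fixup in copy_xact_xlog_xid() can be looked at in isolation. The sketch below is an illustration only, not code from the patch. It assumes, as the patch comments state, that the old 32-bit numbering skips offset 0, so an offset that wrapped around is linearized by adding 2^32 - 1. The name rebase_offset is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: map an old 32-bit member offset to the new 64-bit
 * numbering, in which the oldest surviving offset becomes 1.
 */
static uint64_t
rebase_offset(uint32_t old_offset, uint32_t oldest_offset)
{
	uint64_t	offset = old_offset;

	/*
	 * An offset below the oldest one has wrapped around.  Offset 0 is
	 * skipped in the old numbering, so one full lap is 2^32 - 1 values.
	 */
	if (offset < oldest_offset)
		offset += (UINT64_C(1) << 32) - 1;

	/* After linearization the offset can never be behind the oldest one. */
	assert(offset >= oldest_offset);

	/* Subtract oldest_offset so that the new offsets start from 1. */
	return offset - oldest_offset + 1;
}
```

For example, with oldest_offset = 0xFFFFFFFE the old offsets 0xFFFFFFFE, 0xFFFFFFFF and 1 map to 1, 2 and 3, so the new members file is dense and starts at offset 1 regardless of where the old cluster was in its 32-bit cycle.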

v10-0004-Get-rid-of-MultiXactMemberFreezeThreshold-call.patch (text/x-patch; charset=UTF-8)
From 734c61892076e037ab7ea22273e61590977c56ee Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Wed, 23 Oct 2024 18:23:39 +0300
Subject: [PATCH v10 04/14] Get rid of MultiXactMemberFreezeThreshold call.

Since MaxMultiXactOffset is now UINT64_MAX, the MULTIXACT_MEMBER_SAFE_THRESHOLD
and MULTIXACT_MEMBER_DANGER_THRESHOLD values are no longer meaningful, so
MultiXactMemberFreezeThreshold is not needed either.

Instead, use a single MULTIXACT_MEMBER_AUTOVAC_THRESHOLD (0xFFFFFFFF, i.e.
2^32 - 1 members) to decide whether autovacuum must be forced.

Author: Maxim Orlov <orlovmg@gmail.com>
---
 src/backend/access/transam/multixact.c | 117 +++----------------------
 src/backend/commands/vacuum.c          |   2 +-
 src/backend/postmaster/autovacuum.c    |   4 +-
 src/include/access/multixact.h         |   1 -
 4 files changed, 15 insertions(+), 109 deletions(-)

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 48e1c0160a8..a817f539ee9 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -204,10 +204,14 @@ MXOffsetToMemberOffset(MultiXactOffset offset)
 		member_in_group * sizeof(TransactionId);
 }
 
-/* Multixact members wraparound thresholds. */
-#define MULTIXACT_MEMBER_SAFE_THRESHOLD		(MaxMultiXactOffset / 2)
-#define MULTIXACT_MEMBER_DANGER_THRESHOLD	\
-	(MaxMultiXactOffset - MaxMultiXactOffset / 4)
+/*
+ * Multixact members warning threshold.
+ *
+ * If the difference between nextOffset and oldestOffset exceeds this value,
+ * we trigger autovacuum in order to release disk space and, if possible,
+ * reduce table bloat.
+ */
+#define MULTIXACT_MEMBER_AUTOVAC_THRESHOLD		UINT64CONST(0xFFFFFFFF)
 
 static inline MultiXactId
 PreviousMultiXactId(MultiXactId multi)
@@ -2616,15 +2620,13 @@ GetOldestMultiXactId(void)
 }
 
 /*
- * Determine how aggressively we need to vacuum in order to prevent member
- * wraparound.
+ * Determine whether a vacuum is needed to reclaim multixact member space.
  *
  * To do so determine what's the oldest member offset and install the limit
  * info in MultiXactState, where it can be used to prevent overrun of old data
  * in the members SLRU area.
  *
- * The return value is true if emergency autovacuum is required and false
- * otherwise.
+ * The return value is true if autovacuum is required and false otherwise.
  */
 static bool
 SetOffsetVacuumLimit(bool is_startup)
@@ -2712,10 +2714,10 @@ SetOffsetVacuumLimit(bool is_startup)
 	LWLockRelease(MultiXactGenLock);
 
 	/*
-	 * Do we need an emergency autovacuum?	If we're not sure, assume yes.
+	 * Do we need autovacuum?	If we're not sure, assume yes.
 	 */
 	return !oldestOffsetKnown ||
-		(nextOffset - oldestOffset > MULTIXACT_MEMBER_SAFE_THRESHOLD);
+		(nextOffset - oldestOffset > MULTIXACT_MEMBER_AUTOVAC_THRESHOLD);
 }
 
 /*
@@ -2761,101 +2763,6 @@ find_multixact_start(MultiXactId multi, MultiXactOffset *result)
 	return true;
 }
 
-/*
- * Determine how many multixacts, and how many multixact members, currently
- * exist.  Return false if unable to determine.
- */
-static bool
-ReadMultiXactCounts(uint32 *multixacts, MultiXactOffset *members)
-{
-	MultiXactOffset nextOffset;
-	MultiXactOffset oldestOffset;
-	MultiXactId oldestMultiXactId;
-	MultiXactId nextMultiXactId;
-	bool		oldestOffsetKnown;
-
-	LWLockAcquire(MultiXactGenLock, LW_SHARED);
-	nextOffset = MultiXactState->nextOffset;
-	oldestMultiXactId = MultiXactState->oldestMultiXactId;
-	nextMultiXactId = MultiXactState->nextMXact;
-	oldestOffset = MultiXactState->oldestOffset;
-	oldestOffsetKnown = MultiXactState->oldestOffsetKnown;
-	LWLockRelease(MultiXactGenLock);
-
-	if (!oldestOffsetKnown)
-		return false;
-
-	*members = nextOffset - oldestOffset;
-	*multixacts = nextMultiXactId - oldestMultiXactId;
-	return true;
-}
-
-/*
- * Multixact members can be removed once the multixacts that refer to them
- * are older than every datminmxid.  autovacuum_multixact_freeze_max_age and
- * vacuum_multixact_freeze_table_age work together to make sure we never have
- * too many multixacts; we hope that, at least under normal circumstances,
- * this will also be sufficient to keep us from using too many offsets.
- * However, if the average multixact has many members, we might exhaust the
- * members space while still using few enough members that these limits fail
- * to trigger relminmxid advancement by VACUUM.  At that point, we'd have no
- * choice but to start failing multixact-creating operations with an error.
- *
- * To prevent that, if more than a threshold portion of the members space is
- * used, we effectively reduce autovacuum_multixact_freeze_max_age and
- * to a value just less than the number of multixacts in use.  We hope that
- * this will quickly trigger autovacuuming on the table or tables with the
- * oldest relminmxid, thus allowing datminmxid values to advance and removing
- * some members.
- *
- * As the fraction of the member space currently in use grows, we become
- * more aggressive in clamping this value.  That not only causes autovacuum
- * to ramp up, but also makes any manual vacuums the user issues more
- * aggressive.  This happens because vacuum_get_cutoffs() will clamp the
- * freeze table and the minimum freeze age cutoffs based on the effective
- * autovacuum_multixact_freeze_max_age this function returns.  In the worst
- * case, we'll claim the freeze_max_age to zero, and every vacuum of any
- * table will freeze every multixact.
- */
-int
-MultiXactMemberFreezeThreshold(void)
-{
-	MultiXactOffset members;
-	uint32		multixacts;
-	uint32		victim_multixacts;
-	double		fraction;
-	int			result;
-
-	/* If we can't determine member space utilization, assume the worst. */
-	if (!ReadMultiXactCounts(&multixacts, &members))
-		return 0;
-
-	/* If member space utilization is low, no special action is required. */
-	if (members <= MULTIXACT_MEMBER_SAFE_THRESHOLD)
-		return autovacuum_multixact_freeze_max_age;
-
-	/*
-	 * Compute a target for relminmxid advancement.  The number of multixacts
-	 * we try to eliminate from the system is based on how far we are past
-	 * MULTIXACT_MEMBER_SAFE_THRESHOLD.
-	 */
-	fraction = (double) (members - MULTIXACT_MEMBER_SAFE_THRESHOLD);
-	fraction /= (double) (MULTIXACT_MEMBER_DANGER_THRESHOLD - MULTIXACT_MEMBER_SAFE_THRESHOLD);
-
-	victim_multixacts = multixacts * fraction;
-
-	/* fraction could be > 1.0, but lowest possible freeze age is zero */
-	if (victim_multixacts > multixacts)
-		return 0;
-	result = multixacts - victim_multixacts;
-
-	/*
-	 * Clamp to autovacuum_multixact_freeze_max_age, so that we never make
-	 * autovacuum less aggressive than it would otherwise be.
-	 */
-	return Min(result, autovacuum_multixact_freeze_max_age);
-}
-
 typedef struct mxtruncinfo
 {
 	int64		earliestExistingPage;
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index bb639ef51fb..3e1942a6a2b 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -1134,7 +1134,7 @@ vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
 	 * normally autovacuum_multixact_freeze_max_age, but may be less if we are
 	 * short of multixact member space.
 	 */
-	effective_multixact_freeze_max_age = MultiXactMemberFreezeThreshold();
+	effective_multixact_freeze_max_age = autovacuum_multixact_freeze_max_age;
 
 	/*
 	 * Almost ready to set freeze output parameters; check if OldestXmin or
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index dc3cf87abab..180bb7e96ed 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -1122,7 +1122,7 @@ do_start_worker(void)
 
 	/* Also determine the oldest datminmxid we will consider. */
 	recentMulti = ReadNextMultiXactId();
-	multiForceLimit = recentMulti - MultiXactMemberFreezeThreshold();
+	multiForceLimit = recentMulti - autovacuum_multixact_freeze_max_age;
 	if (multiForceLimit < FirstMultiXactId)
 		multiForceLimit -= FirstMultiXactId;
 
@@ -1915,7 +1915,7 @@ do_autovacuum(void)
 	 * normally autovacuum_multixact_freeze_max_age, but may be less if we are
 	 * short of multixact member space.
 	 */
-	effective_multixact_freeze_max_age = MultiXactMemberFreezeThreshold();
+	effective_multixact_freeze_max_age = autovacuum_multixact_freeze_max_age;
 
 	/*
 	 * Find the pg_database entry and select the default freeze ages. We use
diff --git a/src/include/access/multixact.h b/src/include/access/multixact.h
index 90583634ec9..5aefbddce3e 100644
--- a/src/include/access/multixact.h
+++ b/src/include/access/multixact.h
@@ -143,7 +143,6 @@ extern void MultiXactSetNextMXact(MultiXactId nextMulti,
 extern void MultiXactAdvanceNextMXact(MultiXactId minMulti,
 									  MultiXactOffset minMultiOffset);
 extern void MultiXactAdvanceOldest(MultiXactId oldestMulti, Oid oldestMultiDB);
-extern int	MultiXactMemberFreezeThreshold(void);
 
 extern void multixact_twophase_recover(TransactionId xid, uint16 info,
 									   void *recdata, uint32 len);
-- 
2.39.5
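
The decision that remains in SetOffsetVacuumLimit() after 0004 reduces to a single comparison. Here is a minimal sketch of it, assuming offsets never wrap around once they are 64-bit (so nextOffset >= oldestOffset always holds); the function name need_member_autovacuum and the macro name are hypothetical, while the threshold value is the one the patch defines.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Same value as MULTIXACT_MEMBER_AUTOVAC_THRESHOLD in the patch. */
#define MEMBER_AUTOVAC_THRESHOLD	UINT64_C(0xFFFFFFFF)

/* Hypothetical stand-in for the return expression of SetOffsetVacuumLimit(). */
static bool
need_member_autovacuum(bool oldest_offset_known,
					   uint64_t next_offset, uint64_t oldest_offset)
{
	/* If we are not sure how much member space is in use, assume yes. */
	if (!oldest_offset_known)
		return true;

	/* Assumption: 64-bit offsets do not wrap around. */
	assert(next_offset >= oldest_offset);

	return next_offset - oldest_offset > MEMBER_AUTOVAC_THRESHOLD;
}
```

Unlike the removed MultiXactMemberFreezeThreshold(), this yields only a yes/no answer: there is no longer a gradual clamping of autovacuum_multixact_freeze_max_age as member space fills up.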

v10-0005-Fixup-setting-oldestOffset.patch (text/x-patch; charset=UTF-8)
From aab57d260ca5268f86d7591ce7957f3b0d68a517 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Wed, 18 Dec 2024 11:47:32 +0200
Subject: [PATCH v10 05/14] Fixup setting oldestOffset

---
 src/backend/access/transam/multixact.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index a817f539ee9..3d27995a299 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1045,9 +1045,7 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 	 * against catastrophic data loss due to multixact wraparound.  The basic
 	 * rules are:
 	 *
-	 * If we're past multiVacLimit or the safe threshold for member storage
-	 * space, or we don't know what the safe threshold for member storage is,
-	 * start trying to force autovacuum cycles.
+	 * If we're past multiVacLimit, start trying to force autovacuum cycles.
 	 * If we're past multiWarnLimit, start issuing warnings.
 	 * If we're past multiStopLimit, refuse to create new MultiXactIds.
 	 *
@@ -2695,7 +2693,7 @@ SetOffsetVacuumLimit(bool is_startup)
 	 * overrun of old data in the members SLRU area. We can only do so if the
 	 * oldest offset is known though.
 	 */
-	if (prevOldestOffsetKnown)
+	if (!oldestOffsetKnown && prevOldestOffsetKnown)
 	{
 		/*
 		 * If we failed to get the oldest offset this time, but we have a
-- 
2.39.5

v10-0006-Remove-some-dead-code-related-to-handling-member.patch (text/x-patch; charset=UTF-8)
From b35b0aa849a16a9f71ba615c8134fe74db37d088 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Wed, 18 Dec 2024 11:57:06 +0200
Subject: [PATCH v10 06/14] Remove some dead code related to handling members
 wraparound

Now that offsets are 64-bit, we assume members never wrap around.

Start the offset counter from 1 so that we don't need the special case
for starting from 0.
---
 src/backend/access/transam/multixact.c | 42 +++-----------------------
 src/backend/access/transam/xlog.c      |  2 +-
 2 files changed, 5 insertions(+), 39 deletions(-)

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 3d27995a299..737154814a8 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -147,19 +147,6 @@ MultiXactIdToOffsetSegment(MultiXactId multi)
 #define MULTIXACT_MEMBERS_PER_PAGE	\
 	(MULTIXACT_MEMBERGROUPS_PER_PAGE * MULTIXACT_MEMBERS_PER_MEMBERGROUP)
 
-/*
- * Because the number of items per page is not a divisor of the last item
- * number (member 0xFFFFFFFF), the last segment does not use the maximum number
- * of pages, and moreover the last used page therein does not use the same
- * number of items as previous pages.  (Another way to say it is that the
- * 0xFFFFFFFF member is somewhere in the middle of the last page, so the page
- * has some empty space after that item.)
- *
- * This constant is the number of members in the last page of the last segment.
- */
-#define MAX_MEMBERS_IN_LAST_MEMBERS_PAGE \
-		((uint32) ((0xFFFFFFFF % MULTIXACT_MEMBERS_PER_PAGE) + 1))
-
 /* page in which a member is to be found */
 static inline int64
 MXOffsetToMemberPage(MultiXactOffset offset)
@@ -1140,18 +1127,10 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 	ExtendMultiXactOffset(result);
 
 	/*
-	 * Reserve the members space, similarly to above.  Also, be careful not to
-	 * return zero as the starting offset for any multixact. See
-	 * GetMultiXactIdMembers() for motivation.
+	 * Reserve the members space, similarly to above.
 	 */
 	nextOffset = MultiXactState->nextOffset;
-	if (nextOffset == 0)
-	{
-		*offset = 1;
-		nmembers++;				/* allocate member slot 0 too */
-	}
-	else
-		*offset = nextOffset;
+	*offset = nextOffset;
 
 	ExtendMultiXactMember(nextOffset, nmembers);
 
@@ -2537,22 +2516,9 @@ ExtendMultiXactMember(MultiXactOffset offset, int nmembers)
 		}
 
 		/*
-		 * Compute the number of items till end of current page.  Careful: if
-		 * addition of unsigned ints wraps around, we're at the last page of
-		 * the last segment; since that page holds a different number of items
-		 * than other pages, we need to do it differently.
+		 * Compute the number of items till end of current page.
 		 */
-		if (offset + MAX_MEMBERS_IN_LAST_MEMBERS_PAGE < offset)
-		{
-			/*
-			 * This is the last page of the last segment; we can compute the
-			 * number of items left to allocate in it without modulo
-			 * arithmetic.
-			 */
-			difference = MaxMultiXactOffset - offset + 1;
-		}
-		else
-			difference = MULTIXACT_MEMBERS_PER_PAGE - offset % MULTIXACT_MEMBERS_PER_PAGE;
+		difference = MULTIXACT_MEMBERS_PER_PAGE - offset % MULTIXACT_MEMBERS_PER_PAGE;
 
 		/*
 		 * Advance to next page, taking care to properly handle the wraparound
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6f58412bcab..067cb70938a 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5083,7 +5083,7 @@ BootStrapXLOG(uint32 data_checksum_version)
 		FullTransactionIdFromEpochAndXid(0, FirstNormalTransactionId);
 	checkPoint.nextOid = FirstGenbkiObjectId;
 	checkPoint.nextMulti = FirstMultiXactId;
-	checkPoint.nextMultiOffset = 0;
+	checkPoint.nextMultiOffset = 1;
 	checkPoint.oldestXid = FirstNormalTransactionId;
 	checkPoint.oldestXidDB = Template1DbOid;
 	checkPoint.oldestMulti = FirstMultiXactId;
-- 
2.39.5
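
To make the removed special case concrete, here is a small standalone sketch (not part of the patch) that compares the old 32-bit "items left on this page" rule with the simplified 64-bit one. It assumes the default 8 kB BLCKSZ; the function names items_left_on_page_32 and items_left_on_page_64 are made up for illustration and do not exist in multixact.c.

```c
#include <assert.h>
#include <stdint.h>

/* Layout constants mirroring multixact.c, assuming the default 8 kB BLCKSZ. */
#define BLCKSZ 8192
#define MEMBERS_PER_GROUP 4
#define GROUP_SIZE (4 * MEMBERS_PER_GROUP + 4)	/* four 4-byte xids + 4 flag bytes */
#define GROUPS_PER_PAGE (BLCKSZ / GROUP_SIZE)	/* 409 */
#define MEMBERS_PER_PAGE (GROUPS_PER_PAGE * MEMBERS_PER_GROUP)	/* 1636 */

/*
 * Old rule (hypothetical helper): with 32-bit offsets the last page holds
 * fewer members than the others, so it needs its own branch.
 */
static uint32_t
items_left_on_page_32(uint32_t offset)
{
	/* same value as the removed MAX_MEMBERS_IN_LAST_MEMBERS_PAGE */
	const uint32_t last_page_items = (UINT32_MAX % MEMBERS_PER_PAGE) + 1;

	/* if the addition wraps, offset lies on the last page of the last segment */
	if ((uint32_t) (offset + last_page_items) < offset)
		return UINT32_MAX - offset + 1;
	return MEMBERS_PER_PAGE - offset % MEMBERS_PER_PAGE;
}

/*
 * New rule (hypothetical helper): with 64-bit offsets we assume no
 * wraparound, so every page is handled by the same modulo arithmetic.
 */
static uint64_t
items_left_on_page_64(uint64_t offset)
{
	return MEMBERS_PER_PAGE - offset % MEMBERS_PER_PAGE;
}
```

With these constants the old last page holds 1036 members instead of 1636, which is the asymmetry the deleted branch in ExtendMultiXactMember() had to handle.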

v10-0007-add-fixme-comment-about-pg_upgrade.patch (text/x-patch; charset=UTF-8)
From 6c97887738164fc6810781211fba4774a63e3167 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Wed, 18 Dec 2024 12:00:30 +0200
Subject: [PATCH v10 07/14] add fixme comment about pg_upgrade

---
 src/backend/access/transam/multixact.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 737154814a8..651766a4935 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1302,6 +1302,7 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
 	 * we have just for this; the process in charge will signal the CV as soon
 	 * as it has finished writing the multixact offset.
 	 *
+	 * FIXME: case 3 is now only needed for pg_upgraded clusters
 	 * 3. Because GetNewMultiXactId increments offset zero to offset one to
 	 * handle case #2, there is an ambiguity near the point of offset
 	 * wraparound.  If we see next multixact's offset is one, is that our
-- 
2.39.5

v10-0008-Remove-code-to-deal-with-old-9.3-and-9.3-era-bro.patch (text/x-patch; charset=UTF-8)
From 2c0d3b38d31f65edd7c50910770ea5e502c08ea6 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Wed, 18 Dec 2024 12:03:49 +0200
Subject: [PATCH v10 08/14] Remove code to deal with old 9.3 and 9.4 era broken
 clusters

Now that pg_upgrade will rewrite the SLRUs, we know the correct oldest
member offset.

XXX: is that correct?
---
 src/backend/access/transam/multixact.c | 99 ++++----------------------
 1 file changed, 15 insertions(+), 84 deletions(-)

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 651766a4935..25fca431937 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -243,11 +243,9 @@ typedef struct MultiXactStateData
 
 	/*
 	 * Oldest multixact offset that is potentially referenced by a multixact
-	 * referenced by a relation.  We don't always know this value, so there's
-	 * a flag here to indicate whether or not we currently do.
+	 * referenced by a relation.
 	 */
 	MultiXactOffset oldestOffset;
-	bool		oldestOffsetKnown;
 
 	/* support for anti-wraparound measures */
 	MultiXactId multiVacLimit;
@@ -390,7 +388,7 @@ static bool MultiXactOffsetPrecedes(MultiXactOffset offset1,
 static void ExtendMultiXactOffset(MultiXactId multi);
 static void ExtendMultiXactMember(MultiXactOffset offset, int nmembers);
 static bool SetOffsetVacuumLimit(bool is_startup);
-static bool find_multixact_start(MultiXactId multi, MultiXactOffset *result);
+static MultiXactOffset find_multixact_start(MultiXactId multi);
 static void WriteMZeroPageXlogRec(int64 pageno, uint8 info);
 static void WriteMTruncateXlogRec(Oid oldestMultiDB,
 								  MultiXactId startTruncOff,
@@ -2599,10 +2597,7 @@ SetOffsetVacuumLimit(bool is_startup)
 	MultiXactId oldestMultiXactId;
 	MultiXactId nextMXact;
 	MultiXactOffset oldestOffset = 0;	/* placate compiler */
-	MultiXactOffset prevOldestOffset;
 	MultiXactOffset nextOffset;
-	bool		oldestOffsetKnown = false;
-	bool		prevOldestOffsetKnown;
 
 	/*
 	 * NB: Have to prevent concurrent truncation, we might otherwise try to
@@ -2615,8 +2610,6 @@ SetOffsetVacuumLimit(bool is_startup)
 	oldestMultiXactId = MultiXactState->oldestMultiXactId;
 	nextMXact = MultiXactState->nextMXact;
 	nextOffset = MultiXactState->nextOffset;
-	prevOldestOffsetKnown = MultiXactState->oldestOffsetKnown;
-	prevOldestOffset = MultiXactState->oldestOffset;
 	Assert(MultiXactState->finishedStartup);
 	LWLockRelease(MultiXactGenLock);
 
@@ -2634,68 +2627,31 @@ SetOffsetVacuumLimit(bool is_startup)
 		 * offset.
 		 */
 		oldestOffset = nextOffset;
-		oldestOffsetKnown = true;
 	}
 	else
-	{
-		/*
-		 * Figure out where the oldest existing multixact's offsets are
-		 * stored. Due to bugs in early release of PostgreSQL 9.3.X and 9.4.X,
-		 * the supposedly-earliest multixact might not really exist.  We are
-		 * careful not to fail in that case.
-		 */
-		oldestOffsetKnown =
-			find_multixact_start(oldestMultiXactId, &oldestOffset);
-
-		if (!oldestOffsetKnown)
-			ereport(LOG,
-					(errmsg("MultiXact member wraparound protections are disabled because oldest checkpointed MultiXact %u does not exist on disk",
-							oldestMultiXactId)));
-	}
+		oldestOffset = find_multixact_start(oldestMultiXactId);
 
 	LWLockRelease(MultiXactTruncationLock);
 
-	/*
-	 * If we can, compute limits (and install them MultiXactState) to prevent
-	 * overrun of old data in the members SLRU area. We can only do so if the
-	 * oldest offset is known though.
-	 */
-	if (!oldestOffsetKnown && prevOldestOffsetKnown)
-	{
-		/*
-		 * If we failed to get the oldest offset this time, but we have a
-		 * value from a previous pass through this function, use the old
-		 * values rather than automatically forcing an emergency autovacuum
-		 * cycle again.
-		 */
-		oldestOffset = prevOldestOffset;
-		oldestOffsetKnown = true;
-	}
-
 	/* Install the computed values */
 	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
 	MultiXactState->oldestOffset = oldestOffset;
-	MultiXactState->oldestOffsetKnown = oldestOffsetKnown;
 	LWLockRelease(MultiXactGenLock);
 
 	/*
-	 * Do we need autovacuum?	If we're not sure, assume yes.
+	 * Do we need autovacuum?
 	 */
-	return !oldestOffsetKnown ||
-		(nextOffset - oldestOffset > MULTIXACT_MEMBER_AUTOVAC_THRESHOLD);
+	return (nextOffset - oldestOffset > MULTIXACT_MEMBER_AUTOVAC_THRESHOLD);
 }
 
 /*
  * Find the starting offset of the given MultiXactId.
  *
- * Returns false if the file containing the multi does not exist on disk.
- * Otherwise, returns true and sets *result to the starting member offset.
- *
  * This function does not prevent concurrent truncation, so if that's
  * required, the caller has to protect against that.
  */
-static bool
-find_multixact_start(MultiXactId multi, MultiXactOffset *result)
+static MultiXactOffset
+find_multixact_start(MultiXactId multi)
 {
 	MultiXactOffset offset;
 	int64		pageno;
@@ -2708,15 +2664,6 @@ find_multixact_start(MultiXactId multi, MultiXactOffset *result)
 	pageno = MultiXactIdToOffsetPage(multi);
 	entryno = MultiXactIdToOffsetEntry(multi);
 
-	/*
-	 * Write out dirty data, so PhysicalPageExists can work correctly.
-	 */
-	SimpleLruWriteAll(MultiXactOffsetCtl, true);
-	SimpleLruWriteAll(MultiXactMemberCtl, true);
-
-	if (!SimpleLruDoesPhysicalPageExist(MultiXactOffsetCtl, pageno))
-		return false;
-
 	/* lock is acquired by SimpleLruReadPage_ReadOnly */
 	slotno = SimpleLruReadPage_ReadOnly(MultiXactOffsetCtl, pageno, multi);
 	offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
@@ -2724,8 +2671,7 @@ find_multixact_start(MultiXactId multi, MultiXactOffset *result)
 	offset = *offptr;
 	LWLockRelease(SimpleLruGetBankLock(MultiXactOffsetCtl, pageno));
 
-	*result = offset;
-	return true;
+	return offset;
 }
 
 typedef struct mxtruncinfo
@@ -2759,11 +2705,12 @@ SlruScanDirCbFindEarliest(SlruCtl ctl, char *filename, int64 segpage, void *data
  * the full range at once. This means SimpleLruTruncate() can't trivially be
  * used - instead the to-be-deleted range is computed using the offsets
  * SLRU. C.f. TruncateMultiXact().
+ *
+ * XXX could use SimpleLruTruncate() now
  */
 static void
 PerformMembersTruncation(MultiXactOffset oldestOffset, MultiXactOffset newOldestOffset)
 {
-	const int64 maxsegment = MXOffsetToMemberSegment(MaxMultiXactOffset);
 	int64		startsegment = MXOffsetToMemberSegment(oldestOffset);
 	int64		endsegment = MXOffsetToMemberSegment(newOldestOffset);
 	int64		segment = startsegment;
@@ -2778,11 +2725,7 @@ PerformMembersTruncation(MultiXactOffset oldestOffset, MultiXactOffset newOldest
 			 (unsigned long long) segment);
 		SlruDeleteSegment(MultiXactMemberCtl, segment);
 
-		/* move to next segment, handling wraparound correctly */
-		if (segment == maxsegment)
-			segment = 0;
-		else
-			segment += 1;
+		segment += 1;
 	}
 }
 
@@ -2888,23 +2831,15 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 	/*
 	 * First, compute the safe truncation point for MultiXactMember. This is
 	 * the starting offset of the oldest multixact.
-	 *
-	 * Hopefully, find_multixact_start will always work here, because we've
-	 * already checked that it doesn't precede the earliest MultiXact on disk.
-	 * But if it fails, don't truncate anything, and log a message.
 	 */
 	if (oldestMulti == nextMulti)
 	{
 		/* there are NO MultiXacts */
 		oldestOffset = nextOffset;
 	}
-	else if (!find_multixact_start(oldestMulti, &oldestOffset))
+	else
 	{
-		ereport(LOG,
-				(errmsg("oldest MultiXact %u not found, earliest MultiXact %u, skipping truncation",
-						oldestMulti, earliest)));
-		LWLockRelease(MultiXactTruncationLock);
-		return;
+		oldestOffset = find_multixact_start(oldestMulti);
 	}
 
 	/*
@@ -2916,13 +2851,9 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 		/* there are NO MultiXacts */
 		newOldestOffset = nextOffset;
 	}
-	else if (!find_multixact_start(newOldestMulti, &newOldestOffset))
+	else
 	{
-		ereport(LOG,
-				(errmsg("cannot truncate up to MultiXact %u because it does not exist on disk, skipping truncation",
-						newOldestMulti)));
-		LWLockRelease(MultiXactTruncationLock);
-		return;
+		newOldestOffset = find_multixact_start(newOldestMulti);
 	}
 
 	elog(DEBUG1, "performing multixact truncation: "
-- 
2.39.5
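
As a rough illustration of the simplified truncation loop (a sketch, not code from the patch), the following counts the member segments that would be deleted between two offsets once the wraparound branch is gone. The loop condition is not visible in the hunk above and is assumed to be "segment != endsegment"; count_truncated_segments and offset_to_member_segment are hypothetical names, and the constants assume 8 kB pages.

```c
#include <assert.h>
#include <stdint.h>

#define MEMBERS_PER_PAGE 1636	/* 409 groups * 4 members, for 8 kB pages */
#define PAGES_PER_SEGMENT 32	/* same value as SLRU_PAGES_PER_SEGMENT */

/* Segment that holds the given member offset (mirrors MXOffsetToMemberSegment). */
static int64_t
offset_to_member_segment(uint64_t offset)
{
	return (int64_t) (offset / MEMBERS_PER_PAGE / PAGES_PER_SEGMENT);
}

/*
 * Count the segments a wraparound-free truncation loop would remove.
 * The segment containing new_oldest_offset is kept, since it may still
 * hold live data.
 */
static int64_t
count_truncated_segments(uint64_t oldest_offset, uint64_t new_oldest_offset)
{
	int64_t		segment = offset_to_member_segment(oldest_offset);
	int64_t		endsegment = offset_to_member_segment(new_oldest_offset);
	int64_t		deleted = 0;

	/* without wraparound, the old boundary can never be past the new one */
	assert(oldest_offset <= new_oldest_offset);

	while (segment != endsegment)
	{
		deleted++;				/* stands in for SlruDeleteSegment() */
		segment += 1;			/* no "segment == maxsegment" reset needed */
	}
	return deleted;
}
```

Because offsets only grow, the start segment is never greater than the end segment, so a plain increment always terminates.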

v10-0009-Move-some-macros-to-deal-with-pg_multixact-on-di.patch (text/x-patch; charset=UTF-8)
From 00798a9dccf442ca880f98fd682c7829bab28683 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Wed, 18 Dec 2024 12:08:21 +0200
Subject: [PATCH v10 09/14] Move some macros to deal with pg_multixact on-disk
 format to header file

So that they can be used from pg_upgrade in the next commit.
---
 src/backend/access/transam/multixact.c  | 100 +--------------------
 src/include/access/multixact_internal.h | 115 ++++++++++++++++++++++++
 2 files changed, 116 insertions(+), 99 deletions(-)
 create mode 100644 src/include/access/multixact_internal.h

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 25fca431937..b786ee23563 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -69,6 +69,7 @@
 #include "postgres.h"
 
 #include "access/multixact.h"
+#include "access/multixact_internal.h"
 #include "access/slru.h"
 #include "access/transam.h"
 #include "access/twophase.h"
@@ -92,105 +93,6 @@
 #include "utils/injection_point.h"
 #include "utils/memutils.h"
 
-
-/*
- * Defines for MultiXactOffset page sizes.  A page is the same BLCKSZ as is
- * used everywhere else in Postgres.
- */
-
-/* We need four bytes per offset */
-#define MULTIXACT_OFFSETS_PER_PAGE (BLCKSZ / sizeof(MultiXactOffset))
-
-static inline int64
-MultiXactIdToOffsetPage(MultiXactId multi)
-{
-	return multi / MULTIXACT_OFFSETS_PER_PAGE;
-}
-
-static inline int
-MultiXactIdToOffsetEntry(MultiXactId multi)
-{
-	return multi % MULTIXACT_OFFSETS_PER_PAGE;
-}
-
-static inline int64
-MultiXactIdToOffsetSegment(MultiXactId multi)
-{
-	return MultiXactIdToOffsetPage(multi) / SLRU_PAGES_PER_SEGMENT;
-}
-
-/*
- * The situation for members is a bit more complex: we store one byte of
- * additional flag bits for each TransactionId.  To do this without getting
- * into alignment issues, we store four bytes of flags, and then the
- * corresponding 4 Xids.  Each such 5-word (20-byte) set we call a "group", and
- * are stored as a whole in pages.  Thus, with 8kB BLCKSZ, we keep 409 groups
- * per page.  This wastes 12 bytes per page, but that's OK -- simplicity (and
- * performance) trumps space efficiency here.
- *
- * Note that the "offset" macros work with byte offset, not array indexes, so
- * arithmetic must be done using "char *" pointers.
- */
-/* We need eight bits per xact, so one xact fits in a byte */
-#define MXACT_MEMBER_BITS_PER_XACT			8
-#define MXACT_MEMBER_FLAGS_PER_BYTE			1
-#define MXACT_MEMBER_XACT_BITMASK	((1 << MXACT_MEMBER_BITS_PER_XACT) - 1)
-
-/* how many full bytes of flags are there in a group? */
-#define MULTIXACT_FLAGBYTES_PER_GROUP		4
-#define MULTIXACT_MEMBERS_PER_MEMBERGROUP	\
-	(MULTIXACT_FLAGBYTES_PER_GROUP * MXACT_MEMBER_FLAGS_PER_BYTE)
-/* size in bytes of a complete group */
-#define MULTIXACT_MEMBERGROUP_SIZE \
-	(sizeof(TransactionId) * MULTIXACT_MEMBERS_PER_MEMBERGROUP + MULTIXACT_FLAGBYTES_PER_GROUP)
-#define MULTIXACT_MEMBERGROUPS_PER_PAGE (BLCKSZ / MULTIXACT_MEMBERGROUP_SIZE)
-#define MULTIXACT_MEMBERS_PER_PAGE	\
-	(MULTIXACT_MEMBERGROUPS_PER_PAGE * MULTIXACT_MEMBERS_PER_MEMBERGROUP)
-
-/* page in which a member is to be found */
-static inline int64
-MXOffsetToMemberPage(MultiXactOffset offset)
-{
-	return offset / MULTIXACT_MEMBERS_PER_PAGE;
-}
-
-static inline int64
-MXOffsetToMemberSegment(MultiXactOffset offset)
-{
-	return MXOffsetToMemberPage(offset) / SLRU_PAGES_PER_SEGMENT;
-}
-
-/* Location (byte offset within page) of flag word for a given member */
-static inline int
-MXOffsetToFlagsOffset(MultiXactOffset offset)
-{
-	MultiXactOffset group = offset / MULTIXACT_MEMBERS_PER_MEMBERGROUP;
-	int			grouponpg = group % MULTIXACT_MEMBERGROUPS_PER_PAGE;
-	int			byteoff = grouponpg * MULTIXACT_MEMBERGROUP_SIZE;
-
-	return byteoff;
-}
-
-static inline int
-MXOffsetToFlagsBitShift(MultiXactOffset offset)
-{
-	int			member_in_group = offset % MULTIXACT_MEMBERS_PER_MEMBERGROUP;
-	int			bshift = member_in_group * MXACT_MEMBER_BITS_PER_XACT;
-
-	return bshift;
-}
-
-/* Location (byte offset within page) of TransactionId of given member */
-static inline int
-MXOffsetToMemberOffset(MultiXactOffset offset)
-{
-	int			member_in_group = offset % MULTIXACT_MEMBERS_PER_MEMBERGROUP;
-
-	return MXOffsetToFlagsOffset(offset) +
-		MULTIXACT_FLAGBYTES_PER_GROUP +
-		member_in_group * sizeof(TransactionId);
-}
-
 /*
  * Multixact members warning threshold.
  *
diff --git a/src/include/access/multixact_internal.h b/src/include/access/multixact_internal.h
new file mode 100644
index 00000000000..39e74a21c74
--- /dev/null
+++ b/src/include/access/multixact_internal.h
@@ -0,0 +1,115 @@
+/*
+ * multixact_internal.h
+ *
+ * Internal definitions for the on-disk format of multixact manager.
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/multixact_internal.h
+ */
+#ifndef MULTIXACT_INTERNAL_H
+#define MULTIXACT_INTERNAL_H
+
+/* FIXME: had to duplicate this */
+#define SLRU_PAGES_PER_SEGMENT	32
+
+/*
+ * Defines for MultiXactOffset page sizes.  A page is the same BLCKSZ as is
+ * used everywhere else in Postgres.
+ */
+
+/* We need eight bytes per offset */
+#define MULTIXACT_OFFSETS_PER_PAGE (BLCKSZ / sizeof(MultiXactOffset))
+
+static inline int64
+MultiXactIdToOffsetPage(MultiXactId multi)
+{
+	return multi / MULTIXACT_OFFSETS_PER_PAGE;
+}
+
+static inline int
+MultiXactIdToOffsetEntry(MultiXactId multi)
+{
+	return multi % MULTIXACT_OFFSETS_PER_PAGE;
+}
+
+static inline int64
+MultiXactIdToOffsetSegment(MultiXactId multi)
+{
+	return MultiXactIdToOffsetPage(multi) / SLRU_PAGES_PER_SEGMENT;
+}
+
+/*
+ * The situation for members is a bit more complex: we store one byte of
+ * additional flag bits for each TransactionId.  To do this without getting
+ * into alignment issues, we store four bytes of flags, and then the
+ * corresponding 4 Xids.  Each such 5-word (20-byte) set we call a "group", and
+ * are stored as a whole in pages.  Thus, with 8kB BLCKSZ, we keep 409 groups
+ * per page.  This wastes 12 bytes per page, but that's OK -- simplicity (and
+ * performance) trumps space efficiency here.
+ *
+ * Note that the "offset" macros work with byte offset, not array indexes, so
+ * arithmetic must be done using "char *" pointers.
+ */
+/* We need eight bits per xact, so one xact fits in a byte */
+#define MXACT_MEMBER_BITS_PER_XACT			8
+#define MXACT_MEMBER_FLAGS_PER_BYTE			1
+#define MXACT_MEMBER_XACT_BITMASK	((1 << MXACT_MEMBER_BITS_PER_XACT) - 1)
+
+/* how many full bytes of flags are there in a group? */
+#define MULTIXACT_FLAGBYTES_PER_GROUP		4
+#define MULTIXACT_MEMBERS_PER_MEMBERGROUP	\
+	(MULTIXACT_FLAGBYTES_PER_GROUP * MXACT_MEMBER_FLAGS_PER_BYTE)
+/* size in bytes of a complete group */
+#define MULTIXACT_MEMBERGROUP_SIZE \
+	(sizeof(TransactionId) * MULTIXACT_MEMBERS_PER_MEMBERGROUP + MULTIXACT_FLAGBYTES_PER_GROUP)
+#define MULTIXACT_MEMBERGROUPS_PER_PAGE (BLCKSZ / MULTIXACT_MEMBERGROUP_SIZE)
+#define MULTIXACT_MEMBERS_PER_PAGE	\
+	(MULTIXACT_MEMBERGROUPS_PER_PAGE * MULTIXACT_MEMBERS_PER_MEMBERGROUP)
+
+/* page in which a member is to be found */
+static inline int64
+MXOffsetToMemberPage(MultiXactOffset offset)
+{
+	return offset / MULTIXACT_MEMBERS_PER_PAGE;
+}
+
+static inline int64
+MXOffsetToMemberSegment(MultiXactOffset offset)
+{
+	return MXOffsetToMemberPage(offset) / SLRU_PAGES_PER_SEGMENT;
+}
+
+/* Location (byte offset within page) of flag word for a given member */
+static inline int
+MXOffsetToFlagsOffset(MultiXactOffset offset)
+{
+	MultiXactOffset group = offset / MULTIXACT_MEMBERS_PER_MEMBERGROUP;
+	int			grouponpg = group % MULTIXACT_MEMBERGROUPS_PER_PAGE;
+	int			byteoff = grouponpg * MULTIXACT_MEMBERGROUP_SIZE;
+
+	return byteoff;
+}
+
+static inline int
+MXOffsetToFlagsBitShift(MultiXactOffset offset)
+{
+	int			member_in_group = offset % MULTIXACT_MEMBERS_PER_MEMBERGROUP;
+	int			bshift = member_in_group * MXACT_MEMBER_BITS_PER_XACT;
+
+	return bshift;
+}
+
+/* Location (byte offset within page) of TransactionId of given member */
+static inline int
+MXOffsetToMemberOffset(MultiXactOffset offset)
+{
+	int			member_in_group = offset % MULTIXACT_MEMBERS_PER_MEMBERGROUP;
+
+	return MXOffsetToFlagsOffset(offset) +
+		MULTIXACT_FLAGBYTES_PER_GROUP +
+		member_in_group * sizeof(TransactionId);
+}
+
+#endif							/* MULTIXACT_INTERNAL_H */
-- 
2.39.5
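
For readers following the member-page layout described in the header comment, here is a standalone sketch (not part of the patch) that applies the same arithmetic to locate one member. It assumes the default 8 kB BLCKSZ and a 64-bit MultiXactOffset; the MemberLocation struct and locate_member() are hypothetical names that simply bundle what MXOffsetToMemberPage, MXOffsetToFlagsOffset, MXOffsetToFlagsBitShift and MXOffsetToMemberOffset compute.

```c
#include <assert.h>
#include <stdint.h>

#define BLCKSZ 8192					/* assumed default block size */
typedef uint32_t TransactionId;
typedef uint64_t MultiXactOffset;	/* 64-bit, as in this patch series */

#define FLAGBYTES_PER_GROUP 4
#define MEMBERS_PER_GROUP 4
#define GROUP_SIZE (sizeof(TransactionId) * MEMBERS_PER_GROUP + FLAGBYTES_PER_GROUP)
#define GROUPS_PER_PAGE (BLCKSZ / GROUP_SIZE)
#define MEMBERS_PER_PAGE (GROUPS_PER_PAGE * MEMBERS_PER_GROUP)
#define BITS_PER_XACT 8

/* Hypothetical container for the four values the header's helpers return. */
typedef struct MemberLocation
{
	int64_t		page;			/* members page number */
	int			flags_offset;	/* byte offset of the group's flag word */
	int			bit_shift;		/* shift of this member's flag byte */
	int			xid_offset;		/* byte offset of this member's xid */
} MemberLocation;

static MemberLocation
locate_member(MultiXactOffset offset)
{
	MemberLocation loc;
	MultiXactOffset group = offset / MEMBERS_PER_GROUP;
	int			member_in_group = (int) (offset % MEMBERS_PER_GROUP);

	loc.page = (int64_t) (offset / MEMBERS_PER_PAGE);
	loc.flags_offset = (int) ((group % GROUPS_PER_PAGE) * GROUP_SIZE);
	loc.bit_shift = member_in_group * BITS_PER_XACT;
	/* xids follow the four flag bytes that open each 20-byte group */
	loc.xid_offset = loc.flags_offset + FLAGBYTES_PER_GROUP +
		member_in_group * (int) sizeof(TransactionId);
	return loc;
}
```

For example, member offset 5 is the second member of the second group on page 0: its flag word starts at byte 20, its flag byte is shifted by 8 bits, and its xid starts at byte 28. The last member on a page (offset 1635) has its xid at byte 8176, which leaves the 12 unused bytes the comment mentions.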

v10-0010-rewrite-pg_upgrade-code.patch (text/x-patch; charset=UTF-8)
From 6cc69b50f677a08c72e7a10fd043f2c0af7072bc Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Wed, 18 Dec 2024 01:07:03 +0200
Subject: [PATCH v10 10/14] rewrite pg_upgrade code

---
 src/backend/access/transam/multixact.c |  36 +-
 src/bin/pg_upgrade/Makefile            |   4 +-
 src/bin/pg_upgrade/meson.build         |   4 +-
 src/bin/pg_upgrade/multixact_old.c     | 340 ++++++++++++++++
 src/bin/pg_upgrade/multixact_old.h     |  12 +
 src/bin/pg_upgrade/multixact_rewrite.c | 238 +++++++++++
 src/bin/pg_upgrade/pg_upgrade.c        |  29 +-
 src/bin/pg_upgrade/pg_upgrade.h        |   5 +-
 src/bin/pg_upgrade/segresize.c         | 527 -------------------------
 src/bin/pg_upgrade/slru_io.c           | 214 ++++++++++
 src/bin/pg_upgrade/slru_io.h           |  23 ++
 11 files changed, 851 insertions(+), 581 deletions(-)
 create mode 100644 src/bin/pg_upgrade/multixact_old.c
 create mode 100644 src/bin/pg_upgrade/multixact_old.h
 create mode 100644 src/bin/pg_upgrade/multixact_rewrite.c
 delete mode 100644 src/bin/pg_upgrade/segresize.c
 create mode 100644 src/bin/pg_upgrade/slru_io.c
 create mode 100644 src/bin/pg_upgrade/slru_io.h

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index b786ee23563..ea09f8606cf 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1103,7 +1103,6 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
 	MultiXactOffset *offptr;
 	MultiXactOffset offset;
 	int			length;
-	int			truelength;
 	MultiXactId oldestMXact;
 	MultiXactId nextMXact;
 	MultiXactId tmpMXact;
@@ -1202,16 +1201,6 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
 	 * we have just for this; the process in charge will signal the CV as soon
 	 * as it has finished writing the multixact offset.
 	 *
-	 * FIXME: case 3 is now only needed for pg_upgraded clusters
-	 * 3. Because GetNewMultiXactId increments offset zero to offset one to
-	 * handle case #2, there is an ambiguity near the point of offset
-	 * wraparound.  If we see next multixact's offset is one, is that our
-	 * multixact's actual endpoint, or did it end at zero with a subsequent
-	 * increment?  We handle this using the knowledge that if the zero'th
-	 * member slot wasn't filled, it'll contain zero, and zero isn't a valid
-	 * transaction ID so it can't be a multixact member.  Therefore, if we
-	 * read a zero from the members array, just ignore it.
-	 *
 	 * This is all pretty messy, but the mess occurs only in infrequent corner
 	 * cases, so it seems better than holding the MultiXactGenLock for a long
 	 * time on every multixact creation.
@@ -1298,6 +1287,9 @@ retry:
 	LWLockRelease(lock);
 	lock = NULL;
 
+	/* A multixid with zero members should not happen */
+	Assert(length > 0);
+
 	/*
 	 * If we slept above, clean up state; it's no longer needed.
 	 */
@@ -1306,7 +1298,6 @@ retry:
 
 	ptr = (MultiXactMember *) palloc(length * sizeof(MultiXactMember));
 
-	truelength = 0;
 	prev_pageno = -1;
 	for (int i = 0; i < length; i++, offset++)
 	{
@@ -1344,36 +1335,27 @@ retry:
 		xactptr = (TransactionId *)
 			(MultiXactMemberCtl->shared->page_buffer[slotno] + memberoff);
 
-		if (!TransactionIdIsValid(*xactptr))
-		{
-			/* Corner case 3: we must be looking at unused slot zero */
-			Assert(offset == 0);
-			continue;
-		}
+		Assert(TransactionIdIsValid(*xactptr));
 
 		flagsoff = MXOffsetToFlagsOffset(offset);
 		bshift = MXOffsetToFlagsBitShift(offset);
 		flagsptr = (uint32 *) (MultiXactMemberCtl->shared->page_buffer[slotno] + flagsoff);
 
-		ptr[truelength].xid = *xactptr;
-		ptr[truelength].status = (*flagsptr >> bshift) & MXACT_MEMBER_XACT_BITMASK;
-		truelength++;
+		ptr[i].xid = *xactptr;
+		ptr[i].status = (*flagsptr >> bshift) & MXACT_MEMBER_XACT_BITMASK;
 	}
 
 	LWLockRelease(lock);
 
-	/* A multixid with zero members should not happen */
-	Assert(truelength > 0);
-
 	/*
 	 * Copy the result into the local cache.
 	 */
-	mXactCachePut(multi, truelength, ptr);
+	mXactCachePut(multi, length, ptr);
 
 	debug_elog3(DEBUG2, "GetMembers: no cache for %s",
-				mxid_to_string(multi, truelength, ptr));
+				mxid_to_string(multi, length, ptr));
 	*members = ptr;
-	return truelength;
+	return length;
 }
 
 /*
diff --git a/src/bin/pg_upgrade/Makefile b/src/bin/pg_upgrade/Makefile
index 70908d63a31..b4ad01c00b2 100644
--- a/src/bin/pg_upgrade/Makefile
+++ b/src/bin/pg_upgrade/Makefile
@@ -19,12 +19,14 @@ OBJS = \
 	file.o \
 	function.o \
 	info.o \
+	multixact_old.o \
+	multixact_rewrite.o \
 	option.o \
 	parallel.o \
-	segresize.o \
 	pg_upgrade.o \
 	relfilenumber.o \
 	server.o \
+	slru_io.o \
 	tablespace.o \
 	task.o \
 	util.o \
diff --git a/src/bin/pg_upgrade/meson.build b/src/bin/pg_upgrade/meson.build
index 16f898ba148..2dffc48b3d2 100644
--- a/src/bin/pg_upgrade/meson.build
+++ b/src/bin/pg_upgrade/meson.build
@@ -8,12 +8,14 @@ pg_upgrade_sources = files(
   'file.c',
   'function.c',
   'info.c',
+  'multixact_old.c',
+  'multixact_rewrite.c',
   'option.c',
   'parallel.c',
-  'segresize.c',
   'pg_upgrade.c',
   'relfilenumber.c',
   'server.c',
+  'slru_io.c',
   'tablespace.c',
   'task.c',
   'util.c',
diff --git a/src/bin/pg_upgrade/multixact_old.c b/src/bin/pg_upgrade/multixact_old.c
new file mode 100644
index 00000000000..14988c105ce
--- /dev/null
+++ b/src/bin/pg_upgrade/multixact_old.c
@@ -0,0 +1,340 @@
+/*
+ *	multixact_old.c
+ *
+ *	Support for reading pre-v18 format pg_multixact files
+ *
+ *	Copyright (c) 2024, PostgreSQL Global Development Group
+ *	src/bin/pg_upgrade/multixact_old.c
+ */
+
+#include "postgres_fe.h"
+
+#include "access/transam.h"
+#include "pg_upgrade.h"
+#include "multixact_old.h"
+#include "slru_io.h"
+
+/*
+ * Below are a bunch of definitions that are copy-pasted from multixact.c from
+ * version 17. They shadow the new definitions in access/multixact.h, so it's
+ * important that we *don't* include that here. That is a big reason this
+ * code has to be in a separate source file.
+ *
+ * All references to MultiXactOffset have been replaced with OldMultiXactOffset.
+ */
+typedef uint32 OldMultiXactOffset;
+
+#define FirstMultiXactId	((MultiXactId) 1)
+
+/*
+ * Possible multixact lock modes ("status").  The first four modes are for
+ * tuple locks (FOR KEY SHARE, FOR SHARE, FOR NO KEY UPDATE, FOR UPDATE); the
+ * next two are used for update and delete modes.
+ */
+typedef enum
+{
+	MultiXactStatusForKeyShare = 0x00,
+	MultiXactStatusForShare = 0x01,
+	MultiXactStatusForNoKeyUpdate = 0x02,
+	MultiXactStatusForUpdate = 0x03,
+	/* an update that doesn't touch "key" columns */
+	MultiXactStatusNoKeyUpdate = 0x04,
+	/* other updates, and delete */
+	MultiXactStatusUpdate = 0x05,
+} MultiXactStatus;
+
+/* does a status value correspond to a tuple update? */
+#define ISUPDATE_from_mxstatus(status) \
+			((status) > MultiXactStatusForUpdate)
+
+/*
+ * Defines for OldMultiXactOffset page sizes.  A page is the same BLCKSZ as is
+ * used everywhere else in Postgres.
+ *
+ * Note: because OldMultiXactOffsets are 32 bits and wrap around at 0xFFFFFFFF,
+ * MultiXact page numbering also wraps around at
+ * 0xFFFFFFFF/MULTIXACT_OFFSETS_PER_PAGE, and segment numbering at
+ * 0xFFFFFFFF/MULTIXACT_OFFSETS_PER_PAGE/SLRU_PAGES_PER_SEGMENT.  We need
+ * take no explicit notice of that fact in this module, except when comparing
+ * segment and page numbers in TruncateMultiXact (see
+ * OldMultiXactOffsetPagePrecedes).
+ */
+
+/* We need four bytes per offset */
+#define MULTIXACT_OFFSETS_PER_PAGE (BLCKSZ / sizeof(uint32))
+
+static inline int64
+MultiXactIdToOffsetPage(MultiXactId multi)
+{
+	return multi / MULTIXACT_OFFSETS_PER_PAGE;
+}
+
+static inline int
+MultiXactIdToOffsetEntry(MultiXactId multi)
+{
+	return multi % MULTIXACT_OFFSETS_PER_PAGE;
+}
+
+static inline int64
+MultiXactIdToOffsetSegment(MultiXactId multi)
+{
+	return MultiXactIdToOffsetPage(multi) / SLRU_PAGES_PER_SEGMENT;
+}
+
+/*
+ * The situation for members is a bit more complex: we store one byte of
+ * additional flag bits for each TransactionId.  To do this without getting
+ * into alignment issues, we store four bytes of flags, and then the
+ * corresponding 4 Xids.  Each such 5-word (20-byte) set we call a "group", and
+ * are stored as a whole in pages.  Thus, with 8kB BLCKSZ, we keep 409 groups
+ * per page.  This wastes 12 bytes per page, but that's OK -- simplicity (and
+ * performance) trumps space efficiency here.
+ *
+ * Note that the "offset" macros work with byte offset, not array indexes, so
+ * arithmetic must be done using "char *" pointers.
+ */
+/* We need eight bits per xact, so one xact fits in a byte */
+#define MXACT_MEMBER_BITS_PER_XACT			8
+#define MXACT_MEMBER_FLAGS_PER_BYTE			1
+#define MXACT_MEMBER_XACT_BITMASK	((1 << MXACT_MEMBER_BITS_PER_XACT) - 1)
+
+/* how many full bytes of flags are there in a group? */
+#define MULTIXACT_FLAGBYTES_PER_GROUP		4
+#define MULTIXACT_MEMBERS_PER_MEMBERGROUP	\
+	(MULTIXACT_FLAGBYTES_PER_GROUP * MXACT_MEMBER_FLAGS_PER_BYTE)
+/* size in bytes of a complete group */
+#define MULTIXACT_MEMBERGROUP_SIZE \
+	(sizeof(TransactionId) * MULTIXACT_MEMBERS_PER_MEMBERGROUP + MULTIXACT_FLAGBYTES_PER_GROUP)
+#define MULTIXACT_MEMBERGROUPS_PER_PAGE (BLCKSZ / MULTIXACT_MEMBERGROUP_SIZE)
+#define MULTIXACT_MEMBERS_PER_PAGE	\
+	(MULTIXACT_MEMBERGROUPS_PER_PAGE * MULTIXACT_MEMBERS_PER_MEMBERGROUP)
+
+/* page in which a member is to be found */
+static inline int64
+MXOffsetToMemberPage(OldMultiXactOffset offset)
+{
+	return offset / MULTIXACT_MEMBERS_PER_PAGE;
+}
+
+/* Location (byte offset within page) of flag word for a given member */
+static inline int
+MXOffsetToFlagsOffset(OldMultiXactOffset offset)
+{
+	OldMultiXactOffset group = offset / MULTIXACT_MEMBERS_PER_MEMBERGROUP;
+	int			grouponpg = group % MULTIXACT_MEMBERGROUPS_PER_PAGE;
+	int			byteoff = grouponpg * MULTIXACT_MEMBERGROUP_SIZE;
+
+	return byteoff;
+}
+
+static inline int
+MXOffsetToFlagsBitShift(OldMultiXactOffset offset)
+{
+	int			member_in_group = offset % MULTIXACT_MEMBERS_PER_MEMBERGROUP;
+	int			bshift = member_in_group * MXACT_MEMBER_BITS_PER_XACT;
+
+	return bshift;
+}
+
+/* Location (byte offset within page) of TransactionId of given member */
+static inline int
+MXOffsetToMemberOffset(OldMultiXactOffset offset)
+{
+	int			member_in_group = offset % MULTIXACT_MEMBERS_PER_MEMBERGROUP;
+
+	return MXOffsetToFlagsOffset(offset) +
+		MULTIXACT_FLAGBYTES_PER_GROUP +
+		member_in_group * sizeof(TransactionId);
+}
+
+typedef struct OldMultiXactReader
+{
+	MultiXactId nextMXact;
+	uint32	nextOffset;
+
+	SlruSegState *offset;
+	SlruSegState *members;
+} OldMultiXactReader;
+
+OldMultiXactReader *
+StartOldMultiXactRead(void)
+{
+	OldMultiXactReader *state;
+	char	   *dir;
+
+	state = pg_malloc(sizeof(OldMultiXactReader));
+	state->nextMXact = old_cluster.controldata.chkpnt_nxtmulti;
+	state->nextOffset = old_cluster.controldata.chkpnt_nxtmxoff;
+
+	dir = psprintf("%s/pg_multixact/offsets", old_cluster.pgdata);
+	state->offset = OpenSlruRead(dir);
+	pg_free(dir);
+
+	dir = psprintf("%s/pg_multixact/members", old_cluster.pgdata);
+	state->members = OpenSlruRead(dir);
+	pg_free(dir);
+
+	return state;
+}
+
+/*
+ * This is a simplified version of the GetMultiXactIdMembers() server function.
+ *
+ * - Only return the updating member, if any. Upgrade only cares about the updaters.
+ *   If there is no updating member, return the first locking-only member. We don't
+ *   have any way to represent "no members", but we also don't need to preserve all
+ *   the locking members.
+ *
+ * - We don't need to worry about locking and some corner cases because there's
+ *   no concurrent activity.
+ */
+void
+GetOldMultiXactIdSingleMember(OldMultiXactReader *state, MultiXactId multi,
+							  TransactionId *result, bool *isupdate)
+{
+	TransactionId result_xid;
+	bool		result_isupdate;
+	int64		pageno;
+	int64		prev_pageno;
+	int			entryno;
+	OldMultiXactOffset *offptr;
+	OldMultiXactOffset offset;
+	int			length;
+	MultiXactId nextMXact;
+	MultiXactId tmpMXact;
+	OldMultiXactOffset nextOffset;
+	char	   *buf;
+
+	nextMXact = state->nextMXact;
+	nextOffset = state->nextOffset;
+
+	/*
+	 * Find out the offset at which we need to start reading MultiXactMembers
+	 * and the number of members in the multixact.  We determine the latter as
+	 * the difference between this multixact's starting offset and the next
+	 * one's.  However, there are some corner cases to worry about:
+	 *
+	 * 1. This multixact may be the latest one created, in which case there is
+	 * no next one to look at.  In this case the nextOffset value we just
+	 * saved is the correct endpoint.
+	 *
+	 * 2. (this cannot happen during upgrade)
+	 *
+	 * 3. Because GetNewMultiXactId increments offset zero to offset one to
+	 * handle case #2, there is an ambiguity near the point of offset
+	 * wraparound.  If we see next multixact's offset is one, is that our
+	 * multixact's actual endpoint, or did it end at zero with a subsequent
+	 * increment?  We handle this using the knowledge that if the zero'th
+	 * member slot wasn't filled, it'll contain zero, and zero isn't a valid
+	 * transaction ID so it can't be a multixact member.  Therefore, if we
+	 * read a zero from the members array, just ignore it.
+	 */
+	pageno = MultiXactIdToOffsetPage(multi);
+	entryno = MultiXactIdToOffsetEntry(multi);
+
+	buf = SlruReadSwitchPage(state->offset, pageno);
+	offptr = (OldMultiXactOffset *) buf;
+	offptr += entryno;
+	offset = *offptr;
+
+	Assert(offset != 0);
+
+	/*
+	 * Use the same increment rule as GetNewMultiXactId(), that is, don't
+	 * handle wraparound explicitly until needed.
+	 */
+	tmpMXact = multi + 1;
+
+	if (nextMXact == tmpMXact)
+	{
+		/* Corner case 1: there is no next multixact */
+		length = nextOffset - offset;
+	}
+	else
+	{
+		OldMultiXactOffset nextMXOffset;
+
+		/* handle wraparound if needed */
+		if (tmpMXact < FirstMultiXactId)
+			tmpMXact = FirstMultiXactId;
+
+		prev_pageno = pageno;
+
+		pageno = MultiXactIdToOffsetPage(tmpMXact);
+		entryno = MultiXactIdToOffsetEntry(tmpMXact);
+
+		if (pageno != prev_pageno)
+		{
+			buf = SlruReadSwitchPage(state->offset, pageno);
+		}
+
+		offptr = (OldMultiXactOffset *) buf;
+		offptr += entryno;
+		nextMXOffset = *offptr;
+
+		if (nextMXOffset == 0)
+		{
+			/* Corner case 2: next multixact is still being filled in */
+			Assert(false); /* shouldn't happen during upgrade */
+		}
+
+		length = nextMXOffset - offset;
+	}
+
+	result_xid = InvalidTransactionId;
+	result_isupdate = false;
+	prev_pageno = -1;
+	for (int i = 0; i < length; i++, offset++)
+	{
+		TransactionId *xactptr;
+		uint32	   *flagsptr;
+		int			flagsoff;
+		int			bshift;
+		int			memberoff;
+		MultiXactStatus status;
+
+		pageno = MXOffsetToMemberPage(offset);
+		memberoff = MXOffsetToMemberOffset(offset);
+
+		if (pageno != prev_pageno)
+		{
+			buf = SlruReadSwitchPage(state->members, pageno);
+			prev_pageno = pageno;
+		}
+
+		xactptr = (TransactionId *) (buf + memberoff);
+
+		if (!TransactionIdIsValid(*xactptr))
+		{
+			/* Corner case 3: we must be looking at unused slot zero */
+			Assert(offset == 0);
+			continue;
+		}
+
+		flagsoff = MXOffsetToFlagsOffset(offset);
+		bshift = MXOffsetToFlagsBitShift(offset);
+		flagsptr = (uint32 *) (buf + flagsoff);
+
+		status = (*flagsptr >> bshift) & MXACT_MEMBER_XACT_BITMASK;
+
+		/* Verify that there is a single update Xid among the given members. */
+		if (ISUPDATE_from_mxstatus(status))
+		{
+			if (result_isupdate)
+				pg_fatal("multixact %u has more than one updating member",
+						 multi);
+			result_xid = *xactptr;
+			result_isupdate = true;
+		}
+		else if (!TransactionIdIsValid(result_xid))
+			result_xid = *xactptr;
+	}
+
+	/* A multixid with zero members should not happen */
+	Assert(TransactionIdIsValid(result_xid));
+
+	*result = result_xid;
+	*isupdate = result_isupdate;
+}
+
+
diff --git a/src/bin/pg_upgrade/multixact_old.h b/src/bin/pg_upgrade/multixact_old.h
new file mode 100644
index 00000000000..70800c1cda5
--- /dev/null
+++ b/src/bin/pg_upgrade/multixact_old.h
@@ -0,0 +1,12 @@
+/*
+ *	multixact_old.h
+ *
+ *	Copyright (c) 2010-2024, PostgreSQL Global Development Group
+ *	src/bin/pg_upgrade/multixact_old.h
+ */
+
+typedef struct OldMultiXactReader OldMultiXactReader;
+
+extern OldMultiXactReader *StartOldMultiXactRead(void);
+extern void GetOldMultiXactIdSingleMember(OldMultiXactReader *state, MultiXactId multi,
+										  TransactionId *result, bool *isupdate);
diff --git a/src/bin/pg_upgrade/multixact_rewrite.c b/src/bin/pg_upgrade/multixact_rewrite.c
new file mode 100644
index 00000000000..7b3aeb80c0b
--- /dev/null
+++ b/src/bin/pg_upgrade/multixact_rewrite.c
@@ -0,0 +1,238 @@
+/*
+ *	multixact_rewrite.c
+ *
+ *	Rewrite pre-v18 multixacts to new format with 64-bit MultiXactOffsets
+ *
+ *	Copyright (c) 2024, PostgreSQL Global Development Group
+ *	src/bin/pg_upgrade/multixact_rewrite.c
+ */
+
+#include "postgres_fe.h"
+
+#include "multixact_old.h"
+#include "pg_upgrade.h"
+#include "slru_io.h"
+
+#include "access/multixact.h"
+#include "access/multixact_internal.h"
+
+typedef struct
+{
+	MultiXactId nextMXact;
+	MultiXactOffset nextOffset;
+
+	SlruSegState *offset;
+	SlruSegState *members;
+} MultiXactWriter;
+
+static MultiXactWriter *StartMultiXactWrite(MultiXactId firstMulti, MultiXactOffset firstOffset);
+static MultiXactId GetNewMultiXactId(MultiXactWriter *state, int nmembers, MultiXactOffset *offset);
+static void RecordNewMultiXact(MultiXactWriter *state,
+							   MultiXactOffset offset,
+							   MultiXactId multi,
+							   int nmembers, MultiXactMember *members);
+static void CloseMultiXactWrite(MultiXactWriter *state);
+
+
+/*
+ * Convert pg_multixact/offsets and pg_multixact/members to new format with 64-bit offsets.
+ */
+void
+convert_multixacts(MultiXactId *new_nxtmulti, MultiXactOffset *new_nxtmxoff)
+{
+	MultiXactWriter	   *new_writer;
+	MultiXactId			oldest_multi = old_cluster.controldata.chkpnt_oldstMulti,
+						next_multi = old_cluster.controldata.chkpnt_nxtmulti,
+						multi;
+	OldMultiXactReader *old_reader;
+
+	if (next_multi < FirstMultiXactId)
+		next_multi = FirstMultiXactId;
+
+	old_reader = StartOldMultiXactRead();
+	new_writer = StartMultiXactWrite(oldest_multi, 1);
+
+	/*
+	 * Read multixids from old files one by one, and write them back in the
+	 * new format.
+	 *
+	 * The locking-only XIDs that may be part of multi-xids don't matter after
+	 * upgrade, as there can be no transactions running across upgrade. So as
+	 * a little optimization, we only read one member from each multixid: the
+	 * updating one, or if there was no update, arbitrarily the first
+	 * locking xid.
+	 */
+	for (multi = oldest_multi; multi != next_multi;)
+	{
+		TransactionId xid;
+		bool		isupdate;
+		MultiXactMember member;
+		MultiXactId newmulti;
+		MultiXactOffset offset;
+
+		/* Read the old multixid */
+		GetOldMultiXactIdSingleMember(old_reader, multi, &xid, &isupdate);
+
+		/* Write it out in new format */
+		member.xid = xid;
+		member.status = isupdate ? MultiXactStatusUpdate : MultiXactStatusForKeyShare;
+		newmulti = GetNewMultiXactId(new_writer, 1, &offset);
+		Assert(newmulti == multi);
+		RecordNewMultiXact(new_writer, offset, multi, 1, &member);
+
+		multi++;
+		if (multi < FirstMultiXactId)
+			multi = FirstMultiXactId;
+	}
+
+	/*
+	 * Update the nextMXact/Offset values in the control file to match what we
+	 * wrote. The nextMXact should be unchanged, but because we ignored the
+	 * locking-only members, the nextOffset will be different.
+	 */
+	Assert(new_writer->nextMXact == next_multi);
+	*new_nxtmulti = next_multi;
+	*new_nxtmxoff = new_writer->nextOffset;
+
+	/* Release resources */
+	CloseMultiXactWrite(new_writer);
+}
+
+/* Support routines for writing the new format */
+
+static MultiXactWriter *
+StartMultiXactWrite(MultiXactId firstMulti, MultiXactOffset firstOffset)
+{
+	MultiXactWriter *state;
+	char	   *dir;
+
+	state = pg_malloc(sizeof(MultiXactWriter));
+	state->nextMXact = firstMulti;
+	state->nextOffset = firstOffset;
+
+	dir = psprintf("%s/pg_multixact/offsets", new_cluster.pgdata);
+	state->offset = OpenSlruWrite(dir, MultiXactIdToOffsetPage(firstMulti));
+	pg_free(dir);
+
+	dir = psprintf("%s/pg_multixact/members", new_cluster.pgdata);
+	state->members = OpenSlruWrite(dir, MXOffsetToMemberPage(1));
+	pg_free(dir);
+
+	return state;
+}
+
+static void
+CloseMultiXactWrite(MultiXactWriter *state)
+{
+	CloseSlruWrite(state->offset);
+	CloseSlruWrite(state->members);
+	pg_free(state);
+}
+
+/*
+ * Simplified copy of the corresponding server function
+ */
+static MultiXactId
+GetNewMultiXactId(MultiXactWriter *state, int nmembers, MultiXactOffset *offset)
+{
+	MultiXactId		result;
+
+	/* Handle wraparound of the nextMXact counter */
+	if (state->nextMXact < FirstMultiXactId)
+		state->nextMXact = FirstMultiXactId;
+
+	/* Assign the MXID */
+	result = state->nextMXact;
+
+	/*
+	 * Reserve the members space, similarly to above.
+	 */
+	*offset = state->nextOffset;
+
+	/*
+	 * Advance counters.  As in GetNewTransactionId(), this must not happen
+	 * until after file extension has succeeded!
+	 *
+	 * We don't care about MultiXactId wraparound here; it will be handled by
+	 * the next iteration.  But note that nextMXact may be InvalidMultiXactId
+	 * or the first value on a segment-beginning page after this routine
+	 * exits, so anyone else looking at the variable must be prepared to deal
+	 * with either case.  Similarly, nextOffset may be zero, but we won't use
+	 * that as the actual start offset of the next multixact.
+	 */
+	(state->nextMXact)++;
+
+	state->nextOffset += nmembers;
+
+	return result;
+}
+
+/*
+ * Write a new multixact with members.
+ *
+ * Simplified version of the corresponding server function.
+ */
+static void
+RecordNewMultiXact(MultiXactWriter *state, MultiXactOffset offset,
+				   MultiXactId multi,
+				   int nmembers, MultiXactMember *members)
+{
+	int64		pageno;
+	int64		prev_pageno;
+	int			entryno;
+
+	char	   *buf;
+	MultiXactOffset *offptr;
+
+	pageno = MultiXactIdToOffsetPage(multi);
+	entryno = MultiXactIdToOffsetEntry(multi);
+
+	/*
+	 * Note: unlike the server function, this does not use SimpleLruReadPage().
+	 * SlruWriteSwitchPage() returns the writer's one-page buffer for 'pageno';
+	 * when switching to a different page it first flushes the previous page
+	 * and zeroes the buffer.  I/O errors are reported via pg_fatal() with the
+	 * full pathname of the segment file.
+	 */
+	buf = SlruWriteSwitchPage(state->offset, pageno);
+	offptr = (MultiXactOffset *) buf;
+	offptr += entryno;
+
+	*offptr = offset;
+
+	prev_pageno = -1;
+
+	for (int i = 0; i < nmembers; i++, offset++)
+	{
+		TransactionId *memberptr;
+		uint32	   *flagsptr;
+		uint32		flagsval;
+		int			bshift;
+		int			flagsoff;
+		int			memberoff;
+
+		Assert(members[i].status <= MultiXactStatusUpdate);
+
+		pageno = MXOffsetToMemberPage(offset);
+		memberoff = MXOffsetToMemberOffset(offset);
+		flagsoff = MXOffsetToFlagsOffset(offset);
+		bshift = MXOffsetToFlagsBitShift(offset);
+
+		if (pageno != prev_pageno)
+		{
+			buf = SlruWriteSwitchPage(state->members, pageno);
+			prev_pageno = pageno;
+		}
+
+		memberptr = (TransactionId *) (buf + memberoff);
+
+		*memberptr = members[i].xid;
+
+		flagsptr = (uint32 *) (buf + flagsoff);
+
+		flagsval = *flagsptr;
+		flagsval &= ~(((1 << MXACT_MEMBER_BITS_PER_XACT) - 1) << bshift);
+		flagsval |= (members[i].status << bshift);
+		*flagsptr = flagsval;
+	}
+}
diff --git a/src/bin/pg_upgrade/pg_upgrade.c b/src/bin/pg_upgrade/pg_upgrade.c
index 1654e877c07..484536853a1 100644
--- a/src/bin/pg_upgrade/pg_upgrade.c
+++ b/src/bin/pg_upgrade/pg_upgrade.c
@@ -750,6 +750,9 @@ copy_xact_xlog_xid(void)
 	if (old_cluster.controldata.cat_ver >= MULTIXACT_FORMATCHANGE_CAT_VER &&
 		new_cluster.controldata.cat_ver >= MULTIXACT_FORMATCHANGE_CAT_VER)
 	{
+		MultiXactId new_nxtmulti = old_cluster.controldata.chkpnt_nxtmulti;
+		MultiXactOffset new_nxtmxoff = old_cluster.controldata.chkpnt_nxtmxoff;
+
 		/*
 		 * If the old server is before the MULTIXACTOFFSET_FORMATCHANGE_CAT_VER
 		 * it must have 32-bit multixid offsets, thus it should be converted.
@@ -757,29 +760,11 @@ copy_xact_xlog_xid(void)
 		if (old_cluster.controldata.cat_ver < MULTIXACTOFFSET_FORMATCHANGE_CAT_VER &&
 			new_cluster.controldata.cat_ver >= MULTIXACTOFFSET_FORMATCHANGE_CAT_VER)
 		{
-			MultiXactOffset		oldest_offset,
-								next_offset;
-
+			remove_new_subdir("pg_multixact/members", false);
 			remove_new_subdir("pg_multixact/offsets", false);
 			prep_status("Converting pg_multixact/offsets to 64-bit");
-			oldest_offset = convert_multixact_offsets();
-			check_ok();
-
-			remove_new_subdir("pg_multixact/members", false);
-			prep_status("Converting pg_multixact/members");
-			convert_multixact_members(oldest_offset);
+			convert_multixacts(&new_nxtmulti, &new_nxtmxoff);
 			check_ok();
-
-			next_offset = old_cluster.controldata.chkpnt_nxtmxoff;
-			if (oldest_offset)
-			{
-				if (next_offset < oldest_offset)
-					next_offset += ((MultiXactOffset) 1 << 32) - 1;
-
-				next_offset -= oldest_offset - 1;
-
-				old_cluster.controldata.chkpnt_nxtmxoff = next_offset;
-			}
 		}
 		else
 		{
@@ -796,8 +781,8 @@ copy_xact_xlog_xid(void)
 		exec_prog(UTILITY_LOG_FILE, NULL, true, true,
 				  "\"%s/pg_resetwal\" -O %llu -m %u,%u \"%s\"",
 				  new_cluster.bindir,
-				  (unsigned long long) old_cluster.controldata.chkpnt_nxtmxoff,
-				  old_cluster.controldata.chkpnt_nxtmulti,
+				  (unsigned long long) new_nxtmxoff,
+				  new_nxtmulti,
 				  old_cluster.controldata.chkpnt_oldstMulti,
 				  new_cluster.pgdata);
 		check_ok();
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index 2c85ec1e949..c13293b4add 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -523,7 +523,6 @@ typedef struct
 	char		path[MAXPGPATH];
 } UpgradeTaskReport;
 
-/* segresize.c */
+/* multixact_rewrite.c */
 
-MultiXactOffset		convert_multixact_offsets(void);
-void				convert_multixact_members(MultiXactOffset oldest_offset);
+void convert_multixacts(MultiXactId *new_nxtmulti, MultiXactOffset *new_nxtmxoff);
diff --git a/src/bin/pg_upgrade/segresize.c b/src/bin/pg_upgrade/segresize.c
deleted file mode 100644
index 73064c77deb..00000000000
--- a/src/bin/pg_upgrade/segresize.c
+++ /dev/null
@@ -1,527 +0,0 @@
-/*
- *	segresize.c
- *
- *	SLRU segment resize utility
- *
- *	Copyright (c) 2024, PostgreSQL Global Development Group
- *	src/bin/pg_upgrade/segresize.c
- */
-
-#include "postgres_fe.h"
-
-#include "pg_upgrade.h"
-#include "access/multixact.h"
-
-/* See slru.h */
-#define SLRU_PAGES_PER_SEGMENT		32
-
-/*
- * Some kind of iterator associated with a particular SLRU segment.  The idea is
- * to specify the segment and page number and then move through the pages.
- */
-typedef struct SlruSegState
-{
-	char	   *dir;
-	char	   *fn;
-	FILE	   *file;
-	int64		segno;
-	uint64		pageno;
-	bool		leading_gap;
-} SlruSegState;
-
-/*
- * Mirrors the SlruFileName from slru.c
- */
-static inline char *
-SlruFileName(SlruSegState *state)
-{
-	Assert(state->segno >= 0 && state->segno <= INT64CONST(0xFFFFFF));
-	return psprintf("%s/%04X", state->dir, (unsigned int) state->segno);
-}
-
-/*
- * Create new SLRU segment file.
- */
-static void
-create_segment(SlruSegState *state)
-{
-	Assert(state->fn == NULL);
-	Assert(state->file == NULL);
-
-	state->fn = SlruFileName(state);
-	state->file = fopen(state->fn, "wb");
-	if (!state->file)
-		pg_fatal("could not create file \"%s\": %m", state->fn);
-}
-
-/*
- * Open existing SLRU segment file.
- */
-static void
-open_segment(SlruSegState *state)
-{
-	Assert(state->fn == NULL);
-	Assert(state->file == NULL);
-
-	state->fn = SlruFileName(state);
-	state->file = fopen(state->fn, "rb");
-	if (!state->file)
-		pg_fatal("could not open file \"%s\": %m", state->fn);
-}
-
-/*
- * Close SLRU segment file.
- */
-static void
-close_segment(SlruSegState *state)
-{
-	if (state->file)
-	{
-		fclose(state->file);
-		state->file = NULL;
-	}
-
-	if (state->fn)
-	{
-		pfree(state->fn);
-		state->fn = NULL;
-	}
-}
-
-/*
- * Read next page from the old 32-bit offset segment file.
- */
-static int
-read_old_segment_page(SlruSegState *state, void *buf, bool *empty)
-{
-	int		len;
-
-	/* Open next segment file, if needed. */
-	if (!state->fn)
-	{
-		if (!state->segno)
-			state->leading_gap = true;
-
-		open_segment(state);
-
-		/* Set position to the needed page. */
-		if (state->pageno > 0 &&
-			fseek(state->file, state->pageno * BLCKSZ, SEEK_SET))
-		{
-			close_segment(state);
-		}
-	}
-
-	if (state->file)
-	{
-		/* Segment file do exists, read page from it. */
-		state->leading_gap = false;
-
-		len = fread(buf, sizeof(char), BLCKSZ, state->file);
-
-		/* Are we done or was there an error? */
-		if (len <= 0)
-		{
-			if (ferror(state->file))
-				pg_fatal("error reading file \"%s\": %m", state->fn);
-
-			if (feof(state->file))
-			{
-				*empty = true;
-				len = -1;
-
-				close_segment(state);
-			}
-		}
-		else
-			*empty = false;
-	}
-	else if (!state->leading_gap)
-	{
-		/* We reached the last segment. */
-		len = -1;
-		*empty = true;
-	}
-	else
-	{
-		/* Skip few first segments if they were frozen and removed. */
-		len = BLCKSZ;
-		*empty = true;
-	}
-
-	if (++state->pageno >= SLRU_PAGES_PER_SEGMENT)
-	{
-		/* Start a new segment. */
-		state->segno++;
-		state->pageno = 0;
-
-		close_segment(state);
-	}
-
-	return len;
-}
-
-/*
- * Write next page to the new 64-bit offset segment file.
- */
-static void
-write_new_segment_page(SlruSegState *state, void *buf)
-{
-	/*
-	 * Create a new segment file if we still didn't.  Creation is
-	 * postponed until the first non-empty page is found.  This helps
-	 * not to create completely empty segments.
-	 */
-	if (!state->file)
-	{
-		create_segment(state);
-
-		/* Write zeroes to the previously skipped prefix. */
-		if (state->pageno > 0)
-		{
-			char		zerobuf[BLCKSZ] = {0};
-
-			for (int64 i = 0; i < state->pageno; i++)
-			{
-				if (fwrite(zerobuf, sizeof(char), BLCKSZ, state->file) != BLCKSZ)
-					pg_fatal("could not write file \"%s\": %m", state->fn);
-			}
-		}
-	}
-
-	/* Write page to the new segment (if it was created). */
-	if (state->file)
-	{
-		if (fwrite(buf, sizeof(char), BLCKSZ, state->file) != BLCKSZ)
-			pg_fatal("could not write file \"%s\": %m", state->fn);
-	}
-
-	/*
-	 * Did we reach the maximum page number?  Then close segment file
-	 * and create a new one on the next iteration.
-	 */
-	if (++state->pageno >= SLRU_PAGES_PER_SEGMENT)
-	{
-		/* Start a new segment. */
-		state->segno++;
-		state->pageno = 0;
-
-		close_segment(state);
-	}
-}
-
-typedef uint32 MultiXactOffsetOld;
-
-#define MaxMultiXactOffsetOld	((MultiXactOffsetOld) 0xFFFFFFFF)
-
-#define MULTIXACT_OFFSETS_PER_PAGE_OLD (BLCKSZ / sizeof(MultiXactOffsetOld))
-#define MULTIXACT_OFFSETS_PER_PAGE_NEW (BLCKSZ / sizeof(MultiXactOffset))
-
-/*
- * Convert pg_multixact/offsets segments and return oldest multi offset.
- */
-MultiXactOffset
-convert_multixact_offsets(void)
-{
-	SlruSegState		oldseg = {0},
-						newseg = {0};
-	MultiXactOffsetOld	oldbuf[MULTIXACT_OFFSETS_PER_PAGE_OLD] = {0};
-	MultiXactOffset		newbuf[MULTIXACT_OFFSETS_PER_PAGE_NEW] = {0},
-						oldest_offset = 0;
-	uint64				oldest_multi = old_cluster.controldata.chkpnt_oldstMulti,
-						next_multi = old_cluster.controldata.chkpnt_nxtmulti,
-						multi,
-						old_entry,
-						new_entry;
-	bool				oldest_offset_known = false;
-
-	oldseg.dir = psprintf("%s/pg_multixact/offsets", old_cluster.pgdata);
-	newseg.dir = psprintf("%s/pg_multixact/offsets", new_cluster.pgdata);
-
-	old_entry = oldest_multi % MULTIXACT_OFFSETS_PER_PAGE_OLD;
-	oldseg.pageno = oldest_multi / MULTIXACT_OFFSETS_PER_PAGE_OLD;
-	oldseg.segno = oldseg.pageno / SLRU_PAGES_PER_SEGMENT;
-	oldseg.pageno %= SLRU_PAGES_PER_SEGMENT;
-
-	new_entry = oldest_multi % MULTIXACT_OFFSETS_PER_PAGE_NEW;
-	newseg.pageno = oldest_multi / MULTIXACT_OFFSETS_PER_PAGE_NEW;
-	newseg.segno = newseg.pageno / SLRU_PAGES_PER_SEGMENT;
-	newseg.pageno %= SLRU_PAGES_PER_SEGMENT;
-
-	if (next_multi < oldest_multi)
-		next_multi += (uint64) 1 << 32;	/* wraparound */
-
-	/* Copy multi offsets reading only needed segment pages */
-	for (multi = oldest_multi; multi < next_multi; old_entry = 0)
-	{
-		int		oldlen;
-		bool	empty;
-
-		/* Handle possible segment wraparound */
-#define OLD_OFFSET_SEGNO_MAX	\
-	(MaxMultiXactId / MULTIXACT_OFFSETS_PER_PAGE_OLD / SLRU_PAGES_PER_SEGMENT)
-		if (oldseg.segno > OLD_OFFSET_SEGNO_MAX)
-		{
-			oldseg.segno = 0;
-			oldseg.pageno = 0;
-		}
-
-		oldlen = read_old_segment_page(&oldseg, oldbuf, &empty);
-		if (empty || oldlen != BLCKSZ)
-			pg_fatal("cannot read page %llu from file \"%s\": %m",
-					 (unsigned long long) oldseg.pageno, oldseg.fn);
-
-		/* Save oldest multi offset */
-		if (!oldest_offset_known)
-		{
-			oldest_offset = oldbuf[old_entry];
-			oldest_offset_known = true;
-		}
-
-		/* Skip wrapped-around invalid MultiXactIds */
-		if (multi == (uint64) 1 << 32)
-		{
-			Assert(oldseg.segno == 0);
-			Assert(oldseg.pageno == 1);
-			Assert(old_entry == 0);
-			Assert(new_entry == 0);
-
-			multi += FirstMultiXactId;
-			old_entry = FirstMultiXactId;
-			new_entry = FirstMultiXactId;
-		}
-
-		/* Copy entries to the new page */
-		for (; multi < next_multi && old_entry < MULTIXACT_OFFSETS_PER_PAGE_OLD;
-			 multi++, old_entry++)
-		{
-			MultiXactOffset offset = oldbuf[old_entry];
-
-			/* Handle possible offset wraparound (1 becomes 2^32) */
-			if (offset < oldest_offset)
-				offset += ((uint64) 1 << 32) - 1;
-
-			/* Subtract oldest_offset, so new offsets will start from 1 */
-			newbuf[new_entry++] = offset - oldest_offset + 1;
-
-			if (new_entry >= MULTIXACT_OFFSETS_PER_PAGE_NEW)
-			{
-				/* Handle possible segment wraparound */
-#define NEW_OFFSET_SEGNO_MAX	\
-	(MaxMultiXactId / MULTIXACT_OFFSETS_PER_PAGE_NEW / SLRU_PAGES_PER_SEGMENT)
-				if (newseg.segno > NEW_OFFSET_SEGNO_MAX)
-				{
-					newseg.segno = 0;
-					newseg.pageno = 0;
-				}
-
-				/* Write new page */
-				write_new_segment_page(&newseg, newbuf);
-				new_entry = 0;
-			}
-		}
-	}
-
-	/* Write the last incomplete page */
-	if (new_entry > 0 || oldest_multi == next_multi)
-	{
-		memset(&newbuf[new_entry], 0,
-			   sizeof(newbuf[0]) * (MULTIXACT_OFFSETS_PER_PAGE_NEW - new_entry));
-		write_new_segment_page(&newseg, newbuf);
-	}
-
-	/* Use next_offset as oldest_offset, if oldest_multi == next_multi */
-	if (!oldest_offset_known)
-	{
-		Assert(oldest_multi == next_multi);
-		oldest_offset = (MultiXactOffset) old_cluster.controldata.chkpnt_nxtmxoff;
-	}
-
-	/* Release resources */
-	close_segment(&oldseg);
-	close_segment(&newseg);
-
-	pfree(oldseg.dir);
-	pfree(newseg.dir);
-
-	return oldest_offset;
-}
-
-#define MXACT_MEMBERS_FLAG_BYTES			1
-
-#define MULTIXACT_MEMBERS_PER_GROUP			4
-#define MULTIXACT_MEMBERGROUP_SIZE			\
-	(MULTIXACT_MEMBERS_PER_GROUP * (sizeof(TransactionId) + MXACT_MEMBERS_FLAG_BYTES))
-#define MULTIXACT_MEMBERGROUPS_PER_PAGE		\
-	(BLCKSZ / MULTIXACT_MEMBERGROUP_SIZE)
-
-#define MULTIXACT_MEMBERS_PER_PAGE				\
-	(MULTIXACT_MEMBERS_PER_GROUP * MULTIXACT_MEMBERGROUPS_PER_PAGE)
-#define MULTIXACT_MEMBER_FLAG_BYTES_PER_GROUP	\
-	(MXACT_MEMBERS_FLAG_BYTES * MULTIXACT_MEMBERS_PER_GROUP)
-
-typedef struct MultiXactMembersCtx
-{
-	SlruSegState	seg;
-	char			buf[BLCKSZ];
-	int				group;
-	int				member;
-	char		   *flag;
-	TransactionId  *xid;
-} MultiXactMembersCtx;
-
-static void
-MultiXactMembersCtxInit(MultiXactMembersCtx *ctx)
-{
-	ctx->seg.dir = psprintf("%s/pg_multixact/members", new_cluster.pgdata);
-
-	ctx->group = 0;
-	ctx->member = 1;		/* skip invalid zero offset */
-
-	ctx->flag = (char *) ctx->buf + ctx->group * MULTIXACT_MEMBERGROUP_SIZE;
-	ctx->xid = (TransactionId *)(ctx->flag + MXACT_MEMBERS_FLAG_BYTES * MULTIXACT_MEMBERS_PER_GROUP);
-
-	ctx->flag += ctx->member;
-	ctx->xid += ctx->member;
-}
-
-static void
-MultiXactMembersCtxAdd(MultiXactMembersCtx *ctx, char flag, TransactionId xid)
-{
-	/* Copy member's xid and flags to the new page */
-	*ctx->flag++ = flag;
-	*ctx->xid++ = xid;
-
-	if (++ctx->member < MULTIXACT_MEMBERS_PER_GROUP)
-		return;
-
-	/* Start next member group */
-	ctx->member = 0;
-
-	if (++ctx->group >= MULTIXACT_MEMBERGROUPS_PER_PAGE)
-	{
-		/* Write current page and start new */
-		write_new_segment_page(&ctx->seg, ctx->buf);
-
-		ctx->group = 0;
-		memset(ctx->buf, 0, BLCKSZ);
-	}
-
-	ctx->flag = (char *) ctx->buf + ctx->group * MULTIXACT_MEMBERGROUP_SIZE;
-	ctx->xid = (TransactionId *)(ctx->flag + MXACT_MEMBERS_FLAG_BYTES * MULTIXACT_MEMBERS_PER_GROUP);
-}
-
-static void
-MultiXactMembersCtxFinit(MultiXactMembersCtx *ctx)
-{
-	if (ctx->flag > (char *) ctx->buf)
-		write_new_segment_page(&ctx->seg, ctx->buf);
-
-	close_segment(&ctx->seg);
-
-	pfree(ctx->seg.dir);
-}
-
-/*
- * Convert pg_multixact/members segments, offsets will start from 1.
- *
- */
-void
-convert_multixact_members(MultiXactOffset oldest_offset)
-{
-	MultiXactOffset			next_offset,
-							offset;
-	SlruSegState			oldseg = {0};
-	char					oldbuf[BLCKSZ] = {0};
-	int						oldidx;
-	MultiXactMembersCtx		newctx = {0};
-
-	oldseg.dir = psprintf("%s/pg_multixact/members", old_cluster.pgdata);
-
-	next_offset = (MultiXactOffset) old_cluster.controldata.chkpnt_nxtmxoff;
-	if (next_offset < oldest_offset)
-		next_offset += ((uint64) 1 << 32) - 1;
-
-	/* Initialize the old starting position */
-	oldseg.pageno = oldest_offset / MULTIXACT_MEMBERS_PER_PAGE;
-	oldseg.segno = oldseg.pageno / SLRU_PAGES_PER_SEGMENT;
-	oldseg.pageno %= SLRU_PAGES_PER_SEGMENT;
-
-	/* Initialize new starting position */
-	MultiXactMembersCtxInit(&newctx);
-
-	/* Iterate through the original directory */
-	oldidx = oldest_offset % MULTIXACT_MEMBERS_PER_PAGE;
-	for (offset = oldest_offset; offset < next_offset;)
-	{
-		bool	empty;
-		int		oldlen;
-		int		ngroups;
-		int		oldgroup;
-		int		oldmember;
-
-		oldlen = read_old_segment_page(&oldseg, oldbuf, &empty);
-		if (empty || oldlen != BLCKSZ)
-			pg_fatal("cannot read page %llu from file \"%s\": %m",
-					 (unsigned long long) oldseg.pageno, oldseg.fn);
-
-		/* Iterate through the old member groups */
-		ngroups = oldlen / MULTIXACT_MEMBERGROUP_SIZE;
-		oldmember = oldidx % MULTIXACT_MEMBERS_PER_GROUP;
-		oldgroup = oldidx / MULTIXACT_MEMBERS_PER_GROUP;
-		while (oldgroup < ngroups && offset < next_offset)
-		{
-			char		   *oldflag;
-			TransactionId  *oldxid;
-			int				i;
-
-			oldflag = (char *) oldbuf + oldgroup * MULTIXACT_MEMBERGROUP_SIZE;
-			oldxid = (TransactionId *)(oldflag + MULTIXACT_MEMBER_FLAG_BYTES_PER_GROUP);
-
-			oldxid += oldmember;
-			oldflag += oldmember;
-
-			/* Iterate through the old members */
-			for (i = oldmember;
-				 i < MULTIXACT_MEMBERS_PER_GROUP && offset < next_offset;
-				 i++)
-			{
-				MultiXactMembersCtxAdd(&newctx, *oldflag++, *oldxid++);
-
-				if (++offset == (uint64) 1 << 32)
-				{
-					Assert(i == MaxMultiXactOffsetOld % MULTIXACT_MEMBERS_PER_GROUP);
-					goto wraparound;
-				}
-			}
-
-			oldgroup++;
-			oldmember = 0;
-		}
-
-		oldidx = 0;
-
-		continue;
-
-wraparound:
-#define SEGNO_MAX	MaxMultiXactOffsetOld / MULTIXACT_MEMBERS_PER_PAGE / SLRU_PAGES_PER_SEGMENT
-#define PAGENO_MAX	MaxMultiXactOffsetOld / MULTIXACT_MEMBERS_PER_PAGE % SLRU_PAGES_PER_SEGMENT
-		Assert((oldseg.segno == SEGNO_MAX && oldseg.pageno == PAGENO_MAX + 1) ||
-			   (oldseg.segno == SEGNO_MAX + 1 && oldseg.pageno == 0));
-
-		/* Switch to segment 0000 */
-		close_segment(&oldseg);
-		oldseg.segno = 0;
-		oldseg.pageno = 0;
-
-		/* skip invalid zero multi offset */
-		oldidx = 1;
-	}
-
-	MultiXactMembersCtxFinit(&newctx);
-
-	/* Release resources */
-	close_segment(&oldseg);
-
-	pfree(oldseg.dir);
-}
diff --git a/src/bin/pg_upgrade/slru_io.c b/src/bin/pg_upgrade/slru_io.c
new file mode 100644
index 00000000000..152ecfdce59
--- /dev/null
+++ b/src/bin/pg_upgrade/slru_io.c
@@ -0,0 +1,214 @@
+/*
+ *	slru_io.c
+ *
+ *	Routines for reading and writing SLRU files during upgrade.
+ *
+ *	Copyright (c) 2024, PostgreSQL Global Development Group
+ *	src/bin/pg_upgrade/slru_io.c
+ */
+
+#include "postgres_fe.h"
+
+#include <fcntl.h>
+
+#include "pg_upgrade.h"
+#include "slru_io.h"
+
+#include "common/file_perm.h"
+#include "common/file_utils.h"
+#include "port/pg_iovec.h"
+
+/* See slru.h */
+#define SLRU_PAGES_PER_SEGMENT		32
+
+/*
+ * State for reading or writing an SLRU, with a one page buffer.
+ */
+typedef struct SlruSegState
+{
+	bool		writing;
+
+	char	   *dir;
+	char	   *fn;
+	int			fd;
+	int64		segno;
+	uint64		pageno;
+
+	PGAlignedBlock buf;
+} SlruSegState;
+
+static void SlruFlush(SlruSegState *state);
+
+
+SlruSegState *
+OpenSlruRead(char *dir)
+{
+	SlruSegState *state;
+
+	state = pg_malloc(sizeof(SlruSegState));
+	state->writing = false;
+	state->segno = -1;
+	state->pageno = 0;
+	state->dir = pstrdup(dir);
+	state->fd = -1;
+	state->fn = NULL;
+
+	return state;
+}
+
+void
+CloseSlruRead(SlruSegState *state)
+{
+	Assert(!state->writing);
+	close(state->fd);
+	pg_free(state);
+}
+
+SlruSegState *
+OpenSlruWrite(char *dir, int64 startPageno)
+{
+	SlruSegState *state;
+
+	state = pg_malloc(sizeof(SlruSegState));
+	state->writing = true;
+	state->segno = -1;
+	state->pageno = 0;
+	state->dir = pstrdup(dir);
+	state->fd = -1;
+	state->fn = NULL;
+
+	return state;
+}
+
+void
+CloseSlruWrite(SlruSegState *state)
+{
+	Assert(state->writing);
+	SlruFlush(state);
+
+	close(state->fd);
+	pg_free(state);
+}
+
+static void
+SlruFlush(SlruSegState *state)
+{
+	struct iovec iovec = {
+		.iov_base = &state->buf,
+		.iov_len = BLCKSZ,
+	};
+	off_t		offset;
+
+	if (state->segno == -1)
+		return;
+
+	offset = (state->pageno % SLRU_PAGES_PER_SEGMENT) * BLCKSZ;
+
+	if (pg_pwritev_with_retry(state->fd, &iovec, 1, offset) < 0)
+		pg_fatal("could not write file \"%s\": %m", state->fn);
+}
+
+/*
+ * Open the given page for writing.
+ *
+ * NOTE: This uses O_EXCL when stepping to a new segment, so this assumes that
+ * each segment is written in full before moving on to next one. This
+ * limitation would be easy to lift if needed, but it fits the usage pattern
+ * of current callers.
+ */
+char *
+SlruWriteSwitchPage(SlruSegState *state, uint64 pageno)
+{
+	int64		segno = pageno / SLRU_PAGES_PER_SEGMENT;
+	off_t		offset;
+
+	if (state->segno != -1 && pageno == state->pageno)
+		return state->buf.data;
+
+	segno = pageno / SLRU_PAGES_PER_SEGMENT;
+	offset = (pageno % SLRU_PAGES_PER_SEGMENT) * BLCKSZ;
+
+	SlruFlush(state);
+	memset(state->buf.data, 0, BLCKSZ);
+
+	if (segno != state->segno)
+	{
+		if (state->segno != -1)
+		{
+			close(state->fd);
+			state->fd = -1;
+			pg_free(state->fn);
+			state->fn = NULL;
+		}
+
+		/* Create the segment */
+		state->fn = psprintf("%s/%04X", state->dir, (unsigned int) segno);
+		if ((state->fd = open(state->fn, O_RDWR | O_CREAT | O_EXCL | PG_BINARY,
+							  pg_file_create_mode)) < 0)
+		{
+			pg_fatal("could not create file \"%s\": %m", state->fn);
+		}
+		state->segno = segno;
+
+		if (offset > 0)
+		{
+			if (pg_pwrite_zeros(state->fd, offset, 0) < 0)
+				pg_fatal("could not write file \"%s\": %m", state->fn);
+		}
+	}
+
+	state->pageno = pageno;
+	return state->buf.data;
+}
+
+/*
+ * Open given page for reading.
+ *
+ * Reading can be done in random order.
+ */
+char *
+SlruReadSwitchPage(SlruSegState *state, uint64 pageno)
+{
+	int64		segno;
+
+	if (state->segno != -1 && pageno == state->pageno)
+		return state->buf.data;
+
+	segno = pageno / SLRU_PAGES_PER_SEGMENT;
+
+	if (segno != state->segno)
+	{
+		if (state->segno != -1)
+		{
+			close(state->fd);
+			state->fd = -1;
+			pg_free(state->fn);
+			state->fn = NULL;
+		}
+
+		/* Open new segment */
+		state->fn = psprintf("%s/%04X", state->dir, (unsigned int) segno);
+		if ((state->fd = open(state->fn, O_RDONLY | PG_BINARY, 0)) < 0)
+		{
+			pg_fatal("could not open file \"%s\": %m", state->fn);
+		}
+		state->segno = segno;
+	}
+
+	{
+		struct iovec iovec = {
+			.iov_base = &state->buf,
+			.iov_len = BLCKSZ,
+		};
+		off_t		offset;
+
+		offset = (pageno % SLRU_PAGES_PER_SEGMENT) * BLCKSZ;
+
+		if (pg_preadv(state->fd, &iovec, 1, offset) < 0)
+			pg_fatal("could not read file \"%s\": %m", state->fn);
+
+		state->pageno = pageno;
+	}
+
+	return state->buf.data;
+}
diff --git a/src/bin/pg_upgrade/slru_io.h b/src/bin/pg_upgrade/slru_io.h
new file mode 100644
index 00000000000..e1a9c063139
--- /dev/null
+++ b/src/bin/pg_upgrade/slru_io.h
@@ -0,0 +1,23 @@
+/*
+ *	slru_io.h
+ *
+ *	Copyright (c) 2010-2024, PostgreSQL Global Development Group
+ *	src/bin/pg_upgrade/slru_io.h
+ */
+
+/* XXX: copied from slru.h */
+#define SLRU_PAGES_PER_SEGMENT	32
+
+/*
+ * Cursor over the pages of one SLRU directory.  The caller asks for a page
+ * number and the state tracks the open segment file and a one-page buffer.
+ */
+typedef struct SlruSegState SlruSegState;
+
+extern SlruSegState *OpenSlruRead(char *dir);
+extern void CloseSlruRead(SlruSegState *state);
+extern char *SlruReadSwitchPage(SlruSegState *state, uint64 pageno);
+
+extern SlruSegState *OpenSlruWrite(char *dir, int64 startPageno);
+extern void CloseSlruWrite(SlruSegState *state);
+extern char *SlruWriteSwitchPage(SlruSegState *state, uint64 pageno);
-- 
2.39.5
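
The page-switching helpers in slru_io.c above map a page number to a segment file and a byte offset within it. A minimal sketch of that arithmetic follows, assuming the default 8 kB block size; SlruPageLocation, slru_locate_page() and slru_segment_name() are hypothetical names used only for illustration and are not part of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define BLCKSZ 8192					/* assumption: default PostgreSQL block size */
#define SLRU_PAGES_PER_SEGMENT 32	/* same value as in slru_io.h */

/* Hypothetical helper type, not part of the patch. */
typedef struct SlruPageLocation
{
	int64_t		segno;			/* index of the segment file */
	int64_t		offset;			/* byte offset of the page in that file */
} SlruPageLocation;

/* Split a page number into segment number and in-segment byte offset. */
static SlruPageLocation
slru_locate_page(uint64_t pageno)
{
	SlruPageLocation loc;

	loc.segno = (int64_t) (pageno / SLRU_PAGES_PER_SEGMENT);
	loc.offset = (int64_t) (pageno % SLRU_PAGES_PER_SEGMENT) * BLCKSZ;
	return loc;
}

/* Format the file name the way the patch does: "%04X" of the segment number. */
static void
slru_segment_name(int64_t segno, char *buf, size_t buflen)
{
	assert(buflen >= 9);		/* up to 8 hex digits plus the terminator */
	snprintf(buf, buflen, "%04X", (unsigned int) segno);
}
```

For example, page 33 lands in segment 1 (file 0001) at byte offset 8192, and segment 26 is stored in a file named 001A.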

v10-0011-TEST-initdb-option-to-initialize-cluster-with-no.patch (text/x-patch; charset=UTF-8)
From 1ddbce46f95f26417934fd232181f2cb51b5d2b5 Mon Sep 17 00:00:00 2001
From: Maxim Orlov <m.orlov@postgrespro.ru>
Date: Wed, 4 May 2022 15:53:36 +0300
Subject: [PATCH v10 11/14] TEST: initdb option to initialize cluster with
 non-standard xid/mxid/mxoff

Until now, testing database cluster wraparound has not been easy, because
initdb always initializes the cluster with the default xid/mxid/mxoff. An option
to specify any valid xid/mxid/mxoff at cluster initialization makes this easier.

Author: Maxim Orlov <orlovmg@gmail.com>
Author: Pavel Borisov <pashkin.elfe@gmail.com>
Author: Svetlana Derevyanko <s.derevyanko@postgrespro.ru>
Discussion: https://www.postgresql.org/message-id/flat/CACG%3Dezaa4vqYjJ16yoxgrpa-%3DgXnf0Vv3Ey9bjGrRRFN2YyWFQ%40mail.gmail.com
---
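
The new -m/-o/-x options are parsed as unsigned integers and then range-checked. When the destination type is 32 bits wide, as TransactionId is, one possible approach is to parse into a 64-bit temporary, check the range, and only then narrow. The sketch below uses the standard strtoull() rather than the strtou64() wrapper, and parse_start_value() is a hypothetical helper, not something taken from the patch.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Parse "arg" as an unsigned integer (base auto-detected, as with base 0 in
 * the patch) and store it in *result only if it fits into 32 bits.
 * Returns false on syntax errors, trailing garbage, or out-of-range input.
 */
static bool
parse_start_value(const char *arg, uint32_t *result)
{
	char	   *endptr;
	unsigned long long value;

	errno = 0;
	value = strtoull(arg, &endptr, 0);

	/* No digits consumed, trailing characters, or overflow in strtoull */
	if (endptr == arg || *endptr != '\0' || errno != 0)
		return false;

	/* Check the range while the value is still 64 bits wide */
	if (value > UINT32_MAX)
		return false;

	*result = (uint32_t) value;
	return true;
}
```

With this ordering, an input such as 4294967296 is rejected instead of being narrowed first and then compared against the 32-bit limit.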
 src/backend/access/transam/clog.c      |  21 +++++
 src/backend/access/transam/multixact.c |  53 ++++++++++++
 src/backend/access/transam/subtrans.c  |   8 +-
 src/backend/access/transam/xlog.c      |  15 ++--
 src/backend/bootstrap/bootstrap.c      |  50 +++++++++++-
 src/backend/main/main.c                |   6 ++
 src/backend/postmaster/postmaster.c    |  14 +++-
 src/backend/tcop/postgres.c            |  53 +++++++++++-
 src/bin/initdb/initdb.c                | 107 ++++++++++++++++++++++++-
 src/bin/initdb/t/001_initdb.pl         |  60 ++++++++++++++
 src/include/access/xlog.h              |   3 +
 src/include/c.h                        |   4 +
 src/include/catalog/pg_class.h         |   2 +-
 13 files changed, 382 insertions(+), 14 deletions(-)

diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index e6f79320e94..17e29f44978 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -834,6 +834,7 @@ BootStrapCLOG(void)
 {
 	int			slotno;
 	LWLock	   *lock = SimpleLruGetBankLock(XactCtl, 0);
+	int64		pageno;
 
 	LWLockAcquire(lock, LW_EXCLUSIVE);
 
@@ -844,6 +845,26 @@ BootStrapCLOG(void)
 	SimpleLruWritePage(XactCtl, slotno);
 	Assert(!XactCtl->shared->page_dirty[slotno]);
 
+	pageno = TransactionIdToPage(XidFromFullTransactionId(TransamVariables->nextXid));
+	if (pageno != 0)
+	{
+		LWLock *nextlock = SimpleLruGetBankLock(XactCtl, pageno);
+
+		if (nextlock != lock)
+		{
+			LWLockRelease(lock);
+			LWLockAcquire(nextlock, LW_EXCLUSIVE);
+			lock = nextlock;
+		}
+
+		/* Create and zero the commit log page that contains nextXid */
+		slotno = ZeroCLOGPage(pageno, false);
+
+		/* Make sure it's written out */
+		SimpleLruWritePage(XactCtl, slotno);
+		Assert(!XactCtl->shared->page_dirty[slotno]);
+	}
+
 	LWLockRelease(lock);
 }
 
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index ea09f8606cf..4231be3b665 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1815,6 +1815,7 @@ BootStrapMultiXact(void)
 {
 	int			slotno;
 	LWLock	   *lock;
+	int64		pageno;
 
 	lock = SimpleLruGetBankLock(MultiXactOffsetCtl, 0);
 	LWLockAcquire(lock, LW_EXCLUSIVE);
@@ -1826,6 +1827,26 @@ BootStrapMultiXact(void)
 	SimpleLruWritePage(MultiXactOffsetCtl, slotno);
 	Assert(!MultiXactOffsetCtl->shared->page_dirty[slotno]);
 
+	pageno = MultiXactIdToOffsetPage(MultiXactState->nextMXact);
+	if (pageno != 0)
+	{
+		LWLock *nextlock = SimpleLruGetBankLock(MultiXactOffsetCtl, pageno);
+
+		if (nextlock != lock)
+		{
+			LWLockRelease(lock);
+			LWLockAcquire(nextlock, LW_EXCLUSIVE);
+			lock = nextlock;
+		}
+
+		/* Create and zero the offsets page that contains nextMXact */
+		slotno = ZeroMultiXactOffsetPage(pageno, false);
+
+		/* Make sure it's written out */
+		SimpleLruWritePage(MultiXactOffsetCtl, slotno);
+		Assert(!MultiXactOffsetCtl->shared->page_dirty[slotno]);
+	}
+
 	LWLockRelease(lock);
 
 	lock = SimpleLruGetBankLock(MultiXactMemberCtl, 0);
@@ -1838,7 +1859,39 @@ BootStrapMultiXact(void)
 	SimpleLruWritePage(MultiXactMemberCtl, slotno);
 	Assert(!MultiXactMemberCtl->shared->page_dirty[slotno]);
 
+	pageno = MXOffsetToMemberPage(MultiXactState->nextOffset);
+	if (pageno != 0)
+	{
+		LWLock *nextlock = SimpleLruGetBankLock(MultiXactMemberCtl, pageno);
+
+		if (nextlock != lock)
+		{
+			LWLockRelease(lock);
+			LWLockAcquire(nextlock, LW_EXCLUSIVE);
+			lock = nextlock;
+		}
+
+		/* Create and zero the members page that contains nextOffset */
+		slotno = ZeroMultiXactMemberPage(pageno, false);
+
+		/* Make sure it's written out */
+		SimpleLruWritePage(MultiXactMemberCtl, slotno);
+		Assert(!MultiXactMemberCtl->shared->page_dirty[slotno]);
+	}
+
 	LWLockRelease(lock);
+
+	/*
+	 * If we're not starting from offset zero, initialize a dummy multixact to
+	 * avoid an overly long loop in PerformMembersTruncation().
+	 */
+	if (MultiXactState->nextOffset > 0 && MultiXactState->nextMXact > 0)
+	{
+		RecordNewMultiXact(FirstMultiXactId,
+						   MultiXactState->nextOffset, 0, NULL);
+		RecordNewMultiXact(MultiXactState->nextMXact,
+						   MultiXactState->nextOffset, 0, NULL);
+	}
 }
 
 /*
diff --git a/src/backend/access/transam/subtrans.c b/src/backend/access/transam/subtrans.c
index 50bb1d8cfc5..a5e6e8f0905 100644
--- a/src/backend/access/transam/subtrans.c
+++ b/src/backend/access/transam/subtrans.c
@@ -270,12 +270,16 @@ void
 BootStrapSUBTRANS(void)
 {
 	int			slotno;
-	LWLock	   *lock = SimpleLruGetBankLock(SubTransCtl, 0);
+	LWLock	   *lock;
+	int64		pageno;
+
+	pageno = TransactionIdToPage(XidFromFullTransactionId(TransamVariables->nextXid));
+	lock = SimpleLruGetBankLock(SubTransCtl, pageno);
 
 	LWLockAcquire(lock, LW_EXCLUSIVE);
 
 	/* Create and zero the first page of the subtrans log */
-	slotno = ZeroSUBTRANSPage(0);
+	slotno = ZeroSUBTRANSPage(pageno);
 
 	/* Make sure it's written out */
 	SimpleLruWritePage(SubTransCtl, slotno);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 067cb70938a..c147bf114cf 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -136,6 +136,10 @@ int			max_slot_wal_keep_size_mb = -1;
 int			wal_decode_buffer_size = 512 * 1024;
 bool		track_wal_io_timing = false;
 
+TransactionId		start_xid = FirstNormalTransactionId;
+MultiXactId			start_mxid = FirstMultiXactId;
+MultiXactOffset		start_mxoff = 0;
+
 #ifdef WAL_DEBUG
 bool		XLOG_DEBUG = false;
 #endif
@@ -5080,13 +5084,14 @@ BootStrapXLOG(uint32 data_checksum_version)
 	checkPoint.fullPageWrites = fullPageWrites;
 	checkPoint.wal_level = wal_level;
 	checkPoint.nextXid =
-		FullTransactionIdFromEpochAndXid(0, FirstNormalTransactionId);
+		FullTransactionIdFromEpochAndXid(0, Max(FirstNormalTransactionId,
+												start_xid));
 	checkPoint.nextOid = FirstGenbkiObjectId;
-	checkPoint.nextMulti = FirstMultiXactId;
-	checkPoint.nextMultiOffset = 1;
-	checkPoint.oldestXid = FirstNormalTransactionId;
+	checkPoint.nextMulti = Max(FirstMultiXactId, start_mxid);
+	checkPoint.nextMultiOffset = Max(1, start_mxoff);
+	checkPoint.oldestXid = XidFromFullTransactionId(checkPoint.nextXid);
 	checkPoint.oldestXidDB = Template1DbOid;
-	checkPoint.oldestMulti = FirstMultiXactId;
+	checkPoint.oldestMulti = checkPoint.nextMulti;
 	checkPoint.oldestMultiDB = Template1DbOid;
 	checkPoint.oldestCommitTsXid = InvalidTransactionId;
 	checkPoint.newestCommitTsXid = InvalidTransactionId;
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index f3a7a007f77..485213e126e 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -217,7 +217,7 @@ BootstrapModeMain(int argc, char *argv[], bool check_only)
 	argv++;
 	argc--;
 
-	while ((flag = getopt(argc, argv, "B:c:d:D:Fkr:X:-:")) != -1)
+	while ((flag = getopt(argc, argv, "B:c:d:D:Fkm:o:r:X:x:-:")) != -1)
 	{
 		switch (flag)
 		{
@@ -285,12 +285,60 @@ BootstrapModeMain(int argc, char *argv[], bool check_only)
 			case 'k':
 				bootstrap_data_checksum_version = PG_DATA_CHECKSUM_VERSION;
 				break;
+			case 'm':
+				{
+					char	   *endptr;
+
+					errno = 0;
+					start_mxid = strtou64(optarg, &endptr, 0);
+
+					if (endptr == optarg || *endptr != '\0' || errno != 0 ||
+						!StartMultiXactIdIsValid(start_mxid))
+					{
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("invalid initial database cluster multixact id")));
+					}
+				}
+				break;
+			case 'o':
+				{
+					char	   *endptr;
+
+					errno = 0;
+					start_mxoff = strtou64(optarg, &endptr, 0);
+
+					if (endptr == optarg || *endptr != '\0' || errno != 0 ||
+						!StartMultiXactOffsetIsValid(start_mxoff))
+					{
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("invalid initial database cluster multixact offset")));
+					}
+				}
+				break;
 			case 'r':
 				strlcpy(OutputFileName, optarg, MAXPGPATH);
 				break;
 			case 'X':
 				SetConfigOption("wal_segment_size", optarg, PGC_INTERNAL, PGC_S_DYNAMIC_DEFAULT);
 				break;
+			case 'x':
+				{
+					char	   *endptr;
+
+					errno = 0;
+					start_xid = strtou64(optarg, &endptr, 0);
+
+					if (endptr == optarg || *endptr != '\0' || errno != 0 ||
+						!StartTransactionIdIsValid(start_xid))
+					{
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("invalid initial database cluster xid")));
+					}
+				}
+				break;
 			default:
 				write_stderr("Try \"%s --help\" for more information.\n",
 							 progname);
diff --git a/src/backend/main/main.c b/src/backend/main/main.c
index 864714107cb..7bd8d99f816 100644
--- a/src/backend/main/main.c
+++ b/src/backend/main/main.c
@@ -426,12 +426,18 @@ help(const char *progname)
 	printf(_("  -E                 echo statement before execution\n"));
 	printf(_("  -j                 do not use newline as interactive query delimiter\n"));
 	printf(_("  -r FILENAME        send stdout and stderr to given file\n"));
+	printf(_("  -m START_MXID      set initial database cluster multixact id\n"));
+	printf(_("  -o START_MXOFF     set initial database cluster multixact offset\n"));
+	printf(_("  -x START_XID       set initial database cluster xid\n"));
 
 	printf(_("\nOptions for bootstrapping mode:\n"));
 	printf(_("  --boot             selects bootstrapping mode (must be first argument)\n"));
 	printf(_("  --check            selects check mode (must be first argument)\n"));
 	printf(_("  DBNAME             database name (mandatory argument in bootstrapping mode)\n"));
 	printf(_("  -r FILENAME        send stdout and stderr to given file\n"));
+	printf(_("  -m START_MXID      set initial database cluster multixact id\n"));
+	printf(_("  -o START_MXOFF     set initial database cluster multixact offset\n"));
+	printf(_("  -x START_XID       set initial database cluster xid\n"));
 
 	printf(_("\nPlease read the documentation for the complete list of run-time\n"
 			 "configuration settings and how to set them on the command line or in\n"
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 6f849ffbcb5..03262f2906c 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -572,7 +572,7 @@ PostmasterMain(int argc, char *argv[])
 	 * tcop/postgres.c (the option sets should not conflict) and with the
 	 * common help() function in main/main.c.
 	 */
-	while ((opt = getopt(argc, argv, "B:bC:c:D:d:EeFf:h:ijk:lN:OPp:r:S:sTt:W:-:")) != -1)
+	while ((opt = getopt(argc, argv, "B:bC:c:D:d:EeFf:h:ijk:lm:N:Oo:Pp:r:S:sTt:W:x:-:")) != -1)
 	{
 		switch (opt)
 		{
@@ -682,10 +682,18 @@ PostmasterMain(int argc, char *argv[])
 				SetConfigOption("max_connections", optarg, PGC_POSTMASTER, PGC_S_ARGV);
 				break;
 
+			case 'm':
+				/* only used by single-user backend */
+				break;
+
 			case 'O':
 				SetConfigOption("allow_system_table_mods", "true", PGC_POSTMASTER, PGC_S_ARGV);
 				break;
 
+			case 'o':
+				/* only used by single-user backend */
+				break;
+
 			case 'P':
 				SetConfigOption("ignore_system_indexes", "true", PGC_POSTMASTER, PGC_S_ARGV);
 				break;
@@ -736,6 +744,10 @@ PostmasterMain(int argc, char *argv[])
 				SetConfigOption("post_auth_delay", optarg, PGC_POSTMASTER, PGC_S_ARGV);
 				break;
 
+			case 'x':
+				/* only used by single-user backend */
+				break;
+
 			default:
 				write_stderr("Try \"%s --help\" for more information.\n",
 							 progname);
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 85902788181..5859e18b2e5 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3788,7 +3788,7 @@ process_postgres_switches(int argc, char *argv[], GucContext ctx,
 	 * postmaster/postmaster.c (the option sets should not conflict) and with
 	 * the common help() function in main/main.c.
 	 */
-	while ((flag = getopt(argc, argv, "B:bC:c:D:d:EeFf:h:ijk:lN:nOPp:r:S:sTt:v:W:-:")) != -1)
+	while ((flag = getopt(argc, argv, "B:bC:c:D:d:EeFf:h:ijk:lm:N:nOo:Pp:r:S:sTt:v:W:x:-:")) != -1)
 	{
 		switch (flag)
 		{
@@ -3893,6 +3893,23 @@ process_postgres_switches(int argc, char *argv[], GucContext ctx,
 				SetConfigOption("ssl", "true", ctx, gucsource);
 				break;
 
+			case 'm':
+				{
+					char	   *endptr;
+
+					errno = 0;
+					start_mxid = strtou64(optarg, &endptr, 0);
+
+					if (endptr == optarg || *endptr != '\0' || errno != 0 ||
+						!StartMultiXactIdIsValid(start_mxid))
+					{
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("invalid initial database cluster multixact id")));
+					}
+				}
+				break;
+
 			case 'N':
 				SetConfigOption("max_connections", optarg, ctx, gucsource);
 				break;
@@ -3905,6 +3922,23 @@ process_postgres_switches(int argc, char *argv[], GucContext ctx,
 				SetConfigOption("allow_system_table_mods", "true", ctx, gucsource);
 				break;
 
+			case 'o':
+				{
+					char	   *endptr;
+
+					errno = 0;
+					start_mxoff = strtou64(optarg, &endptr, 0);
+
+					if (endptr == optarg || *endptr != '\0' || errno != 0 ||
+						!StartMultiXactOffsetIsValid(start_mxoff))
+					{
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("invalid initial database cluster multixact offset")));
+					}
+				}
+				break;
+
 			case 'P':
 				SetConfigOption("ignore_system_indexes", "true", ctx, gucsource);
 				break;
@@ -3959,6 +3993,23 @@ process_postgres_switches(int argc, char *argv[], GucContext ctx,
 				SetConfigOption("post_auth_delay", optarg, ctx, gucsource);
 				break;
 
+			case 'x':
+				{
+					char	   *endptr;
+
+					errno = 0;
+					start_xid = strtou64(optarg, &endptr, 0);
+
+					if (endptr == optarg || *endptr != '\0' || errno != 0 ||
+						!StartTransactionIdIsValid(start_xid))
+					{
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("invalid initial database cluster xid")));
+					}
+				}
+				break;
+
 			default:
 				errs++;
 				break;
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 9a91830783e..410868dddf1 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -168,6 +168,9 @@ static bool data_checksums = true;
 static char *xlog_dir = NULL;
 static int	wal_segment_size_mb = (DEFAULT_XLOG_SEG_SIZE) / (1024 * 1024);
 static DataDirSyncMethod sync_method = DATA_DIR_SYNC_METHOD_FSYNC;
+static TransactionId start_xid = 0;
+static MultiXactId start_mxid = 0;
+static MultiXactOffset start_mxoff = 0;
 
 
 /* internal vars */
@@ -1568,6 +1571,11 @@ bootstrap_template1(void)
 	bki_lines = replace_token(bki_lines, "POSTGRES",
 							  escape_quotes_bki(username));
 
+	/* relfrozenxid must not be less than FirstNormalTransactionId */
+	sprintf(buf, "%llu", (unsigned long long) Max(start_xid, 3));
+	bki_lines = replace_token(bki_lines, "RECENTXMIN",
+							  buf);
+
 	bki_lines = replace_token(bki_lines, "ENCODING",
 							  encodingid_to_string(encodingid));
 
@@ -1593,6 +1601,9 @@ bootstrap_template1(void)
 
 	printfPQExpBuffer(&cmd, "\"%s\" --boot %s %s", backend_exec, boot_options, extra_options);
 	appendPQExpBuffer(&cmd, " -X %d", wal_segment_size_mb * (1024 * 1024));
+	appendPQExpBuffer(&cmd, " -m %llu", (unsigned long long) start_mxid);
+	appendPQExpBuffer(&cmd, " -o %llu", (unsigned long long) start_mxoff);
+	appendPQExpBuffer(&cmd, " -x %llu", (unsigned long long) start_xid);
 	if (data_checksums)
 		appendPQExpBuffer(&cmd, " -k");
 	if (debug)
@@ -2532,12 +2543,20 @@ usage(const char *progname)
 	printf(_("  -d, --debug               generate lots of debugging output\n"));
 	printf(_("      --discard-caches      set debug_discard_caches=1\n"));
 	printf(_("  -L DIRECTORY              where to find the input files\n"));
+	printf(_("  -m, --multixact-id=START_MXID\n"
+			 "                            set initial database cluster multixact id\n"
+			 "                            max value is 2^32-1\n"));
 	printf(_("  -n, --no-clean            do not clean up after errors\n"));
 	printf(_("  -N, --no-sync             do not wait for changes to be written safely to disk\n"));
 	printf(_("      --no-instructions     do not print instructions for next steps\n"));
+	printf(_("  -o, --multixact-offset=START_MXOFF\n"
+			 "                            set initial database cluster multixact offset\n"
+			 "                            max value is 2^32-1\n"));
 	printf(_("  -s, --show                show internal settings, then exit\n"));
 	printf(_("      --sync-method=METHOD  set method for syncing files to disk\n"));
 	printf(_("  -S, --sync-only           only sync database files to disk, then exit\n"));
+	printf(_("  -x, --xid=START_XID       set initial database cluster xid\n"
+			 "                            max value is 2^32-1\n"));
 	printf(_("\nOther options:\n"));
 	printf(_("  -V, --version             output version information, then exit\n"));
 	printf(_("  -?, --help                show this help, then exit\n"));
@@ -3079,6 +3098,18 @@ initialize_data_directory(void)
 	/* Now create all the text config files */
 	setup_config();
 
+	if (start_mxid != 0)
+		printf(_("selecting initial multixact id ... %llu\n"),
+				 (unsigned long long) start_mxid);
+
+	if (start_mxoff != 0)
+		printf(_("selecting initial multixact offset ... %llu\n"),
+				 (unsigned long long) start_mxoff);
+
+	if (start_xid != 0)
+		printf(_("selecting initial xid ... %llu\n"),
+				 (unsigned long long) start_xid);
+
 	/* Bootstrap template1 */
 	bootstrap_template1();
 
@@ -3095,8 +3126,12 @@ initialize_data_directory(void)
 	fflush(stdout);
 
 	initPQExpBuffer(&cmd);
-	printfPQExpBuffer(&cmd, "\"%s\" %s %s template1 >%s",
-					  backend_exec, backend_options, extra_options, DEVNULL);
+	printfPQExpBuffer(&cmd, "\"%s\" %s %s",
+					  backend_exec, backend_options, extra_options);
+	appendPQExpBuffer(&cmd, " -m %llu", (unsigned long long) start_mxid);
+	appendPQExpBuffer(&cmd, " -o %llu", (unsigned long long) start_mxoff);
+	appendPQExpBuffer(&cmd, " -x %llu", (unsigned long long) start_xid);
+	appendPQExpBuffer(&cmd, " template1 >%s", DEVNULL);
 
 	PG_CMD_OPEN(cmd.data);
 
@@ -3183,6 +3218,9 @@ main(int argc, char *argv[])
 		{"icu-rules", required_argument, NULL, 18},
 		{"sync-method", required_argument, NULL, 19},
 		{"no-data-checksums", no_argument, NULL, 20},
+		{"xid", required_argument, NULL, 'x'},
+		{"multixact-id", required_argument, NULL, 'm'},
+		{"multixact-offset", required_argument, NULL, 'o'},
 		{NULL, 0, NULL, 0}
 	};
 
@@ -3224,7 +3262,7 @@ main(int argc, char *argv[])
 
 	/* process command-line options */
 
-	while ((c = getopt_long(argc, argv, "A:c:dD:E:gkL:nNsST:U:WX:",
+	while ((c = getopt_long(argc, argv, "A:c:dD:E:gkL:m:nNo:sST:U:Wx:X:",
 							long_options, &option_index)) != -1)
 	{
 		switch (c)
@@ -3282,6 +3320,30 @@ main(int argc, char *argv[])
 				debug = true;
 				printf(_("Running in debug mode.\n"));
 				break;
+			case 'm':
+				{
+					char	   *endptr;
+
+					errno = 0;
+					start_mxid = strtou64(optarg, &endptr, 0);
+
+					if (endptr == optarg || *endptr != '\0' || errno != 0 ||
+						!StartMultiXactIdIsValid(start_mxid))
+					{
+						pg_log_error("invalid initial database cluster multixact id");
+						exit(1);
+					}
+					else if (start_mxid < 1) /* FirstMultiXactId */
+					{
+						/*
+						 * Reject the value instead of silently raising it to
+						 * FirstMultiXactId, although that would do no harm.
+						 */
+						pg_log_error("multixact id should be greater than 0");
+						exit(1);
+					}
+				}
+				break;
 			case 'n':
 				noclean = true;
 				printf(_("Running in no-clean mode.  Mistakes will not be cleaned up.\n"));
@@ -3289,6 +3351,21 @@ main(int argc, char *argv[])
 			case 'N':
 				do_sync = false;
 				break;
+			case 'o':
+				{
+					char	   *endptr;
+
+					errno = 0;
+					start_mxoff = strtou64(optarg, &endptr, 0);
+
+					if (endptr == optarg || *endptr != '\0' || errno != 0 ||
+						!StartMultiXactOffsetIsValid(start_mxoff))
+					{
+						pg_log_error("invalid initial database cluster multixact offset");
+						exit(1);
+					}
+				}
+				break;
 			case 'S':
 				sync_only = true;
 				break;
@@ -3377,6 +3454,30 @@ main(int argc, char *argv[])
 			case 20:
 				data_checksums = false;
 				break;
+			case 'x':
+				{
+					char	   *endptr;
+
+					errno = 0;
+					start_xid = strtou64(optarg, &endptr, 0);
+
+					if (endptr == optarg || *endptr != '\0' || errno != 0 ||
+						!StartTransactionIdIsValid(start_xid))
+					{
+						pg_log_error("invalid value for initial database cluster xid");
+						exit(1);
+					}
+					else if (start_xid < 3) /* FirstNormalTransactionId */
+					{
+						/*
+						 * Reject the value instead of silently raising it to
+						 * FirstNormalTransactionId, although that would do no harm.
+						 */
+						pg_log_error("xid should be greater than 2");
+						exit(1);
+					}
+				}
+				break;
 			default:
 				/* getopt_long already emitted a complaint */
 				pg_log_error_hint("Try \"%s --help\" for more information.", progname);
diff --git a/src/bin/initdb/t/001_initdb.pl b/src/bin/initdb/t/001_initdb.pl
index 7520d3d0dda..91a85d9f4d1 100644
--- a/src/bin/initdb/t/001_initdb.pl
+++ b/src/bin/initdb/t/001_initdb.pl
@@ -282,4 +282,64 @@ command_fails(
 	[ 'pg_checksums', '-D', $datadir_nochecksums ],
 	"pg_checksums fails with data checksum disabled");
 
+# Set non-standard initial mxid/mxoff/xid.
+command_fails_like(
+	[ 'initdb', '-m', 'seven', $datadir ],
+	qr/initdb: error: invalid initial database cluster multixact id/,
+	'fails for invalid initial database cluster multixact id');
+command_fails_like(
+	[ 'initdb', '-o', 'seven', $datadir ],
+	qr/initdb: error: invalid initial database cluster multixact offset/,
+	'fails for invalid initial database cluster multixact offset');
+command_fails_like(
+	[ 'initdb', '-x', 'seven', $datadir ],
+	qr/initdb: error: invalid value for initial database cluster xid/,
+	'fails for invalid initial database cluster xid');
+
+command_checks_all(
+	[ 'initdb', '-m', '65535', "$tempdir/data-m65535" ],
+	0,
+	[qr/selecting initial multixact id ... 65535/],
+	[],
+	'selecting initial multixact id');
+command_checks_all(
+	[ 'initdb', '-o', '65535', "$tempdir/data-o65535" ],
+	0,
+	[qr/selecting initial multixact offset ... 65535/],
+	[],
+	'selecting initial multixact offset');
+command_checks_all(
+	[ 'initdb', '-x', '65535', "$tempdir/data-x65535" ],
+	0,
+	[qr/selecting initial xid ... 65535/],
+	[],
+	'selecting initial xid');
+
+# Setup new cluster with given mxid/mxoff/xid.
+my $node;
+my $result;
+
+$node = PostgreSQL::Test::Cluster->new('test-mxid');
+$node->init(extra => ['-m', '16777215']); # 0xFFFFFF
+$node->start;
+$result = $node->safe_psql('postgres', "SELECT next_multixact_id FROM pg_control_checkpoint();");
+ok($result >= 16777215, 'setup cluster with given mxid');
+$node->stop;
+
+$node = PostgreSQL::Test::Cluster->new('test-mxoff');
+$node->init(extra => ['-o', '16777215']); # 0xFFFFFF
+$node->start;
+$result = $node->safe_psql('postgres', "SELECT next_multi_offset FROM pg_control_checkpoint();");
+ok($result >= 16777215, 'setup cluster with given mxoff');
+$node->stop;
+
+$node = PostgreSQL::Test::Cluster->new('test-xid');
+$node->init(extra => ['-x', '16777215']); # 0xFFFFFF
+$node->start;
+$result = $node->safe_psql('postgres', "SELECT txid_current();");
+ok($result >= 16777215, 'setup cluster with given xid - check 1');
+$result = $node->safe_psql('postgres', "SELECT oldest_xid FROM pg_control_checkpoint();");
+ok($result >= 16777215, 'setup cluster with given xid - check 2');
+$node->stop;
+
 done_testing();
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 34ad46c067b..4ce79b12e35 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -94,6 +94,9 @@ typedef enum RecoveryState
 } RecoveryState;
 
 extern PGDLLIMPORT int wal_level;
+extern PGDLLIMPORT TransactionId start_xid;
+extern PGDLLIMPORT MultiXactId start_mxid;
+extern PGDLLIMPORT MultiXactOffset start_mxoff;
 
 /* Is WAL archiving enabled (always or only while server is running normally)? */
 #define XLogArchivingActive() \
diff --git a/src/include/c.h b/src/include/c.h
index b5b6b9261b0..8c15663e3fe 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -619,6 +619,10 @@ typedef uint64 MultiXactOffset;
 
 typedef uint32 CommandId;
 
+#define StartTransactionIdIsValid(xid)			((xid) <= 0xFFFFFFFF)
+#define StartMultiXactIdIsValid(mxid)			((mxid) <= 0xFFFFFFFF)
+#define StartMultiXactOffsetIsValid(offset)		((offset) <= 0xFFFFFFFF)
+
 #define FirstCommandId	((CommandId) 0)
 #define InvalidCommandId	(~(CommandId)0)
 
diff --git a/src/include/catalog/pg_class.h b/src/include/catalog/pg_class.h
index 0fc2c093b0d..0a7518df0db 100644
--- a/src/include/catalog/pg_class.h
+++ b/src/include/catalog/pg_class.h
@@ -123,7 +123,7 @@ CATALOG(pg_class,1259,RelationRelationId) BKI_BOOTSTRAP BKI_ROWTYPE_OID(83,Relat
 	Oid			relrewrite BKI_DEFAULT(0) BKI_LOOKUP_OPT(pg_class);
 
 	/* all Xids < this are frozen in this rel */
-	TransactionId relfrozenxid BKI_DEFAULT(3);	/* FirstNormalTransactionId */
+	TransactionId relfrozenxid BKI_DEFAULT(RECENTXMIN);	/* set by initdb; >= FirstNormalTransactionId */
 
 	/* all multixacts in this rel are >= this; it is really a MultiXactId */
 	TransactionId relminmxid BKI_DEFAULT(1);	/* FirstMultiXactId */
-- 
2.39.5
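
The BootStrapCLOG() change in the patch above zeroes the commit log page that holds nextXid in addition to page 0. A rough sketch of how a starting xid maps to a page and a segment follows, assuming the default 8 kB block size and two status bits per transaction; clog_page_for_xid() and clog_segment_for_xid() are hypothetical stand-ins for TransactionIdToPage() and the SLRU segment arithmetic.

```c
#include <stdint.h>

#define BLCKSZ 8192					/* assumption: default block size */
#define CLOG_XACTS_PER_BYTE 4		/* two status bits per transaction */
#define CLOG_XACTS_PER_PAGE (BLCKSZ * CLOG_XACTS_PER_BYTE)
#define SLRU_PAGES_PER_SEGMENT 32

/* Page of the commit log that stores the status bits of "xid". */
static int64_t
clog_page_for_xid(uint32_t xid)
{
	return (int64_t) (xid / CLOG_XACTS_PER_PAGE);
}

/* Segment file (by number) that contains that page. */
static int64_t
clog_segment_for_xid(uint32_t xid)
{
	return clog_page_for_xid(xid) / SLRU_PAGES_PER_SEGMENT;
}
```

With -x 16777215, the value used in 001_initdb.pl, the starting xid falls on page 511, which belongs to segment 15 (file 000F), so page 0 alone would not cover it.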

v10-0012-TEST-add-src-bin-pg_upgrade-t-005_offset.pl.patch (text/x-patch; charset=UTF-8)
From 12ef8d6da03a309dd8adf06aad6439e9010ba30f Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Tue, 19 Nov 2024 17:08:10 +0300
Subject: [PATCH v10 12/14] TEST: add src/bin/pg_upgrade/t/005_offset.pl

---
 src/bin/pg_upgrade/t/005_offset.pl | 562 +++++++++++++++++++++++++++++
 1 file changed, 562 insertions(+)
 create mode 100644 src/bin/pg_upgrade/t/005_offset.pl

diff --git a/src/bin/pg_upgrade/t/005_offset.pl b/src/bin/pg_upgrade/t/005_offset.pl
new file mode 100644
index 00000000000..1cfd8b364a9
--- /dev/null
+++ b/src/bin/pg_upgrade/t/005_offset.pl
@@ -0,0 +1,562 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+use strict;
+use warnings FATAL => 'all';
+
+use File::Find qw(find);
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# This pair of calls will create significantly more member segments than offset
+# segments.
+sub prep
+{
+	my $node = shift;
+	my $tbl = shift;
+
+	$node->safe_psql('postgres',
+		"CREATE TABLE ${tbl} (I INT PRIMARY KEY, N_UPDATED INT) " .
+		"       WITH (AUTOVACUUM_ENABLED=FALSE);" .
+		"INSERT INTO ${tbl} SELECT G, 0 FROM GENERATE_SERIES(1, 50) G;");
+}
+
+sub fill
+{
+	my $node = shift;
+	my $tbl = shift;
+
+	my $nclients = 50;
+	my $update_every = 90;
+	my @connections = ();
+
+	for (1..$nclients)
+	{
+		my $conn = $node->background_psql('postgres');
+		$conn->query_safe("BEGIN");
+
+		push(@connections, $conn);
+	}
+
+	for (my $i = 0; $i < 20000; $i++)
+	{
+		my $conn = $connections[$i % $nclients];
+
+		$conn->query_safe("COMMIT;");
+		$conn->query_safe("BEGIN");
+
+		if ($i % $update_every == 0)
+		{
+			$conn->query_safe(
+				"UPDATE ${tbl} SET " .
+				"N_UPDATED = N_UPDATED + 1 " .
+				"WHERE I = ${i} % 50");
+		}
+		else
+		{
+			$conn->query_safe(
+				"SELECT * FROM ${tbl} FOR KEY SHARE");
+		}
+	}
+
+	for my $conn (@connections)
+	{
+		$conn->quit();
+	}
+}
+
+# This pair of calls will create roughly the same number of member and
+# offset segments.
+sub prep2
+{
+	my $node = shift;
+	my $tbl = shift;
+
+	$node->safe_psql('postgres',
+		"CREATE TABLE ${tbl}(BAR INT PRIMARY KEY, BAZ INT); " .
+		"CREATE OR REPLACE PROCEDURE MXIDFILLER(N_STEPS INT DEFAULT 1000) " .
+		"LANGUAGE PLPGSQL " .
+		"AS \$\$ " .
+		"BEGIN " .
+		"	FOR I IN 1..N_STEPS LOOP " .
+		"		UPDATE ${tbl} SET BAZ = RANDOM(1, 1000) " .
+		"		WHERE BAR IN (SELECT BAR FROM ${tbl} " .
+		"						TABLESAMPLE BERNOULLI(80)); " .
+		"		COMMIT; " .
+		"	END LOOP; " .
+		"END; \$\$; " .
+		"INSERT INTO ${tbl} (BAR, BAZ) " .
+		"SELECT ID, ID FROM GENERATE_SERIES(1, 1024) ID;");
+}
+
+sub fill2
+{
+	my $node = shift;
+	my $tbl = shift;
+	my $scale = shift // 1;
+
+	$node->safe_psql('postgres',
+		"BEGIN; " .
+		"SELECT * FROM ${tbl} FOR KEY SHARE; " .
+		"PREPARE TRANSACTION 'A'; " .
+		"CALL MXIDFILLER((365 * ${scale})::int); " .
+		"COMMIT PREPARED 'A';");
+}
+
+
+# generate around 2 offset segments and 55 member segments
+sub mxid_gen1
+{
+	my $node = shift;
+	my $tbl = shift;
+
+	prep($node, $tbl);
+	fill($node, $tbl);
+
+	$node->safe_psql('postgres', q(CHECKPOINT));
+}
+
+# generate around 10 offset segments and 12 member segments
+sub mxid_gen2
+{
+	my $node = shift;
+	my $tbl = shift;
+	my $scale = shift // 1;
+
+	prep2($node, $tbl);
+	fill2($node, $tbl, $scale);
+
+	$node->safe_psql('postgres', q(CHECKPOINT));
+}
+
+# Fetch latest multixact checkpoint values.
+sub multi_bounds
+{
+	my ($node) = @_;
+	my $path = $node->config_data('--bindir');
+	my ($stdout, $stderr) = run_command([
+									$path . '/pg_controldata',
+									$node->data_dir
+								]);
+	my @control_data = split("\n", $stdout);
+	my $next = undef;
+	my $oldest = undef;
+	my $next_offset = undef;
+
+	foreach (@control_data)
+	{
+		if ($_ =~ /^Latest checkpoint's NextMultiXactId:\s*(.*)$/mg)
+		{
+			$next = $1;
+			print ">>> @ node ". $node->name . ", " . $_ . "\n";
+		}
+
+		if ($_ =~ /^Latest checkpoint's oldestMultiXid:\s*(.*)$/mg)
+		{
+			$oldest = $1;
+			print ">>> @ node ". $node->name . ", " . $_ . "\n";
+		}
+
+		if ($_ =~ /^Latest checkpoint's NextMultiOffset:\s*(.*)$/mg)
+		{
+			$next_offset = $1;
+			print ">>> @ node ". $node->name . ", " . $_ . "\n";
+		}
+
+		if (defined($oldest) && defined($next) && defined($next_offset))
+		{
+			last;
+		}
+	}
+
+	die "Latest checkpoint's NextMultiXactId not found in control file!\n"
+	unless defined($next);
+
+	die "Latest checkpoint's oldestMultiXid not found in control file!\n"
+	unless defined($oldest);
+
+	die "Latest checkpoint's NextMultiOffset not found in control file!\n"
+	unless defined($next_offset);
+
+	return ($oldest, $next, $next_offset);
+}
+
+# Create node from existing bins.
+sub create_new_node
+{
+	my ($name, %params) = @_;
+
+	create_node(0, @_);
+}
+
+# Create node from ENV oldinstall
+sub create_old_node
+{
+	my ($name, %params) = @_;
+
+	if (!defined($ENV{oldinstall}))
+	{
+		die "oldinstall is not defined";
+	}
+
+	create_node(1, @_);
+}
+
+sub create_node
+{
+	my ($install_path_from_env, $name, %params) = @_;
+	my $scale = defined $params{scale} ? $params{scale} : 1;
+	my $multi = defined $params{multi} ? $params{multi} : undef;
+	my $offset = defined $params{offset} ? $params{offset} : undef;
+
+	my $node =
+		$install_path_from_env ?
+			PostgreSQL::Test::Cluster->new($name,
+					install_path => $ENV{oldinstall}) :
+			PostgreSQL::Test::Cluster->new($name);
+
+	$node->init(force_initdb => 1,
+		extra => [
+			$multi ? ('-m', $multi) : (),
+			$offset ? ('-o', $offset) : (),
+		]);
+
+	# Fixup MOX patch quirk
+	if ($multi)
+	{
+		unlink $node->data_dir . '/pg_multixact/offsets/0000';
+	}
+	if ($offset)
+	{
+		unlink $node->data_dir . '/pg_multixact/members/0000';
+	}
+
+	$node->append_conf('postgresql.conf', 'fsync = off');
+	$node->append_conf('postgresql.conf', 'max_prepared_transactions = 2');
+
+	$node->start();
+	mxid_gen2($node, 'FOO', $scale);
+	mxid_gen1($node, 'BAR', $scale);
+	$node->restart();
+	$node->safe_psql('postgres', q(SELECT * FROM FOO));		# just in case...
+	$node->safe_psql('postgres', q(SELECT * FROM BAR));
+	$node->safe_psql('postgres', q(CHECKPOINT));
+	$node->stop();
+
+	return $node;
+}
+
+sub do_upgrade
+{
+	my ($oldnode, $newnode) = @_;
+
+	command_ok(
+		[
+			'pg_upgrade', '--no-sync',
+			'-d', $oldnode->data_dir,
+			'-D', $newnode->data_dir,
+			'-b', $oldnode->config_data('--bindir'),
+			'-B', $newnode->config_data('--bindir'),
+			'-s', $newnode->host,
+			'-p', $oldnode->port,
+			'-P', $newnode->port,
+			'--check'
+		],
+		'run of pg_upgrade --check');
+
+	command_ok(
+		[
+			'pg_upgrade', '--no-sync',
+			'-d', $oldnode->data_dir,
+			'-D', $newnode->data_dir,
+			'-b', $oldnode->config_data('--bindir'),
+			'-B', $newnode->config_data('--bindir'),
+			'-s', $newnode->host,
+			'-p', $oldnode->port,
+			'-P', $newnode->port,
+			'--copy'
+		],
+		'run of pg_upgrade');
+
+	$oldnode->start();
+	$newnode->start();
+
+	my $oldfoo = $oldnode->safe_psql('postgres', q(SELECT * FROM FOO));
+	my $newfoo = $newnode->safe_psql('postgres', q(SELECT * FROM FOO));
+	is($oldfoo, $newfoo, "select foo eq");
+
+	my $oldbar = $oldnode->safe_psql('postgres', q(SELECT * FROM BAR));
+	my $newbar = $newnode->safe_psql('postgres', q(SELECT * FROM BAR));
+	is($oldbar, $newbar, "select bar eq");
+
+	$oldnode->stop();
+	$newnode->stop();
+
+	multi_bounds($oldnode);
+	multi_bounds($newnode);
+}
+
+my @TESTS = (
+	# tests without ENV oldinstall
+	0, 1, 2, 3, 4, 5, 6,
+	# tests with "real" pg_upgrade
+	100, 101, 102, 103, 104, 105, 106,
+	# self upgrade
+	1000,
+);
+
+# =============================================================================
+# Basic sanity tests on a NEW bin
+# =============================================================================
+
+# starts from the zero
+SKIP:
+{
+	my $TEST_NO = 0;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $node = create_new_node('simple_mo',
+						scale => 1);
+	multi_bounds($node);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# multi starts from the value
+SKIP:
+{
+	my $TEST_NO = 1;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $node = create_new_node('simple_Mo',
+						scale => 1.15,
+						multi => '0x123400');
+	multi_bounds($node);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# offsets starts from the value
+SKIP:
+{
+	my $TEST_NO = 2;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $node = create_new_node('simple_mO',
+						scale => 1.15,
+						offset => '0x432100');
+	multi_bounds($node);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# multi and offsets starts from the value
+SKIP:
+{
+	my $TEST_NO = 3;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $node = create_new_node('simple_MO',
+						scale => 1.15,
+						multi => '0xDEAD00', offset => '0xBEEF00');
+	multi_bounds($node);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# multi starts from the value, multi wrap
+SKIP:
+{
+	my $TEST_NO = 4;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $node = create_new_node('simple_Mo_wrap',
+						scale => 1.15,
+						multi => '0xFFFF7000');
+	multi_bounds($node);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# offsets starts from the value, offsets wrap
+SKIP:
+{
+	my $TEST_NO = 5;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $node = create_new_node('simple_mO_wrap',
+						scale => 1.15,
+						offset => '0xFFFFFC00');
+	multi_bounds($node);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# multi starts from the value, offsets starts from the value,
+# multi wrap, offsets wrap
+SKIP:
+{
+	my $TEST_NO = 6;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $node = create_new_node('simple_MO_wrap',
+						scale => 1.15,
+						multi => '0xFFFF7000', offset => '0xFFFFFC00');
+	multi_bounds($node);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# =============================================================================
+# pg_upgrade tests
+# =============================================================================
+
+# starts from the zero
+SKIP:
+{
+	my $TEST_NO = 100;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $dbname = 'mo';
+	my $oldnode = create_old_node("old_$dbname",
+						scale => 1);
+	my $newnode = PostgreSQL::Test::Cluster->new("new_$dbname");
+	$newnode->init();
+
+	do_upgrade($oldnode, $newnode);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# multi starts from the value
+SKIP:
+{
+	my $TEST_NO = 101;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $dbname = 'Mo';
+	my $oldnode = create_old_node("old_$dbname",
+						scale => 1.2,
+						multi => '0x123400');
+	my $newnode = PostgreSQL::Test::Cluster->new("new_$dbname");
+	$newnode->init();
+
+	do_upgrade($oldnode, $newnode);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# offsets starts from the value
+SKIP:
+{
+	my $TEST_NO = 102;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $dbname = 'mO';
+	my $oldnode = create_old_node("old_$dbname",
+						scale => 1.2,
+						offset => '0x432100');
+	my $newnode = PostgreSQL::Test::Cluster->new("new_$dbname");
+	$newnode->init();
+
+	do_upgrade($oldnode, $newnode);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# multi and offsets starts from the value
+SKIP:
+{
+	my $TEST_NO = 103;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $dbname = 'MO';
+	my $oldnode = create_old_node("old_$dbname",
+						scale => 1.2,
+						multi => '0xDEAD00', offset => '0xBEEF00');
+	my $newnode = PostgreSQL::Test::Cluster->new("new_$dbname");
+	$newnode->init();
+
+	do_upgrade($oldnode, $newnode);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# multi starts from the value, multi wrap
+SKIP:
+{
+	my $TEST_NO = 104;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $dbname = 'Mo_wrap';
+	my $oldnode = create_old_node("old_$dbname",
+						scale => 1.2,
+						multi => '0xFFFF7000');
+	my $newnode = PostgreSQL::Test::Cluster->new("new_$dbname");
+	$newnode->init();
+
+	do_upgrade($oldnode, $newnode);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# offsets starts from the value, offsets wrap
+SKIP:
+{
+	my $TEST_NO = 105;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $dbname = 'mO_wrap';
+	my $oldnode = create_old_node("old_$dbname",
+						scale => 1.2,
+						offset => '0xFFFFFC00');
+	my $newnode = PostgreSQL::Test::Cluster->new("new_$dbname");
+	$newnode->init();
+
+	do_upgrade($oldnode, $newnode);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# multi starts from the value, offsets starts from the value,
+# multi wrap, offsets wrap
+SKIP:
+{
+	my $TEST_NO = 106;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $dbname = 'MO_wrap';
+	my $oldnode = create_old_node("old_$dbname",
+						scale => 1.2,
+						multi => '0xFFFF7000', offset => '0xFFFFFC00');
+	my $newnode = PostgreSQL::Test::Cluster->new("new_$dbname");
+	$newnode->init();
+
+	do_upgrade($oldnode, $newnode);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# =============================================================================
+# Self upgrade
+# =============================================================================
+
+# starts from the zero
+SKIP:
+{
+	my $TEST_NO = 1000;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $dbname = 'self_upgrade';
+	my $oldnode = create_new_node("old_$dbname",
+						scale => 1);
+	my $newnode = PostgreSQL::Test::Cluster->new("new_$dbname");
+	$newnode->init();
+
+	do_upgrade($oldnode, $newnode);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+done_testing();
-- 
2.39.5

v10-0014-TEST-bump-catver.patch (text/x-patch; charset=UTF-8)
From d009ae1a4ec7fafd60a63594b53d20a256e53182 Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Wed, 13 Nov 2024 16:34:34 +0300
Subject: [PATCH v10 14/14] TEST: bump catver

---
 src/bin/pg_upgrade/pg_upgrade.h  | 2 +-
 src/include/catalog/catversion.h | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index c13293b4add..7acccf2900e 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -119,7 +119,7 @@ extern char *output_files[];
  *
  * XXX: should be changed to the actual CATALOG_VERSION_NO on commit.
  */
-#define MULTIXACTOFFSET_FORMATCHANGE_CAT_VER 202409041
+#define MULTIXACTOFFSET_FORMATCHANGE_CAT_VER 202411112
 
 /*
  * large object chunk size added to pg_controldata,
diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h
index f815d15415f..307a9d88471 100644
--- a/src/include/catalog/catversion.h
+++ b/src/include/catalog/catversion.h
@@ -57,6 +57,6 @@
  */
 
 /*							yyyymmddN */
-#define CATALOG_VERSION_NO	202412112
+#define CATALOG_VERSION_NO	202412113
 
 #endif
-- 
2.39.5

#27Maxim Orlov
orlovmg@gmail.com
In reply to: Heikki Linnakangas (#26)
1 attachment(s)
Re: POC: make mxidoff 64 bits

On Wed, 18 Dec 2024 at 13:21, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Attached is some more cleanup on top of patch set v9, removing more dead
stuff related to wraparound. I also removed the oldestOffsetKnown
variable and related code. It was needed to deal with clusters upgraded
from buggy 9.3 and 9.4 era versions, but now that pg_upgrade will
rewrite the SLRUs, it's no longer needed.

Yep, multixact.c looks correct to me. As for "XXX could use
SimpleLruTruncate()", yes, for sure.
Actually, xl_multixact_truncate.startTruncMemb is also no longer needed, is
it?

Does the pg_upgrade code work though, if you have that buggy situation
where oldestOffsetKnown == false ?

if (!TransactionIdIsValid(*xactptr))
{
    /* Corner case 3: we must be looking at unused slot zero */
    Assert(offset == 0);
    continue;
}

After upgrade, this corner case 3 would *not* happen on offset == 0. So
looks like we're still missing test coverage for this upgrade corner case.

Am I understanding correctly that you want to have a test corresponding to
the buggy 9.3 and 9.4 era versions?
Do you think we could imitate this scenario on a current master branch like
that:
1) generate a couple of offsets segments for the first table;
2) generate more segments for a second table;
3) drop first table;
4) stop pg cluster;
5) remove pg_multixact/offsets/0000
6) upgrade?

PFA, v10-0016-TEST-try-to-replicate-buggy-oldest-offset.patch
This test will fail now, for an obvious reason, but is this case a relevant
one?

--
Best regards,
Maxim Orlov.

Attachments:

v10-64-bit-mxoff.zip (application/zip)
#28Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Maxim Orlov (#27)
Re: POC: make mxidoff 64 bits

On 27/12/2024 19:09, Maxim Orlov wrote:

On Wed, 18 Dec 2024 at 13:21, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
Does the pg_upgrade code work though, if you have that buggy situation
where oldestOffsetKnown == false ?

...

               if (!TransactionIdIsValid(*xactptr))
               {
                       /* Corner case 3: we must be looking at unused slot zero */
                       Assert(offset == 0);
                       continue;
               }

After upgrade, this corner case 3 would *not* happen on offset == 0. So
looks like we're still missing test coverage for this upgrade corner
case.

Am I understanding correctly that you want to have a test corresponding
to the buggy 9.3 and 9.4 era versions?

No, those were two different things. I think there might be two things
wrong here:

1. I suspect pg_upgrade might not correctly handle the situation where
oldestOffsetKnown==false, and

2. The above assertion in "corner case 3" would not hold. It seems that
we don't have a test case for it, or it would've hit the assertion.

Now that I think about it, yes, a test case for 1. would be good too.
But I was talking about 2.

Do you think we could imitate this scenario on a current master branch
like that:
1) generate a couple of offsets segments for the first table;
2) generate more segments for a second table;
3) drop first table;
4) stop pg cluster;
5) remove pg_multixact/offsets/0000
6) upgrade?

I don't remember off the top of my head.

It might be best to just refuse the upgrade if oldestOffsetKnown==false.
It's a very ancient corner case. It seems reasonable to require you to
upgrade to a newer minor version and run VACUUM before upgrading. IIRC
that sets oldestOffsetKnown.

--
Heikki Linnakangas
Neon (https://neon.tech)

#29Maxim Orlov
orlovmg@gmail.com
In reply to: Heikki Linnakangas (#28)
Re: POC: make mxidoff 64 bits

On Thu, 2 Jan 2025 at 01:12, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

It might be best to just refuse the upgrade if oldestOffsetKnown==false.
It's a very ancient corner case. It seems reasonable to require you to
upgrade to a newer minor version and run VACUUM before upgrading. IIRC
that sets oldestOffsetKnown.

I agree. After all, we already have a ready-made solution in the form
of a vacuum, don't we?

If I understand all this multixact_old.c machinery correctly, in the case of
oldestOffsetKnown==false we should either fail with "could not open file" or
get an offset of 0 in GetOldMultiXactIdSingleMember.
So, I suppose we can put an analogue of the SimpleLruDoesPhysicalPageExist
call at the beginning of GetOldMultiXactIdSingleMember. If either
SimpleLruDoesPhysicalPageExist returns false or the corresponding offset is
0, we have to bail out with "oldest offset does not exist, consider running
vacuum before pg_upgrade" or something similar. Please correct me if I'm
wrong.
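
To make the first half of that check concrete, here is a rough sketch of
what such an analogue could look like on the pg_upgrade side. It assumes
the default 8 kB BLCKSZ, 32 pages per segment and 4-hex-digit segment file
names; old_slru_page_exists() is a made-up name, not something from the
patch set:

```c
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

#define BLCKSZ 8192					/* assumption: default block size */
#define SLRU_PAGES_PER_SEGMENT 32

/*
 * Hypothetical stand-in for SimpleLruDoesPhysicalPageExist(): report
 * whether page 'pageno' is physically present in the SLRU directory 'dir'.
 */
static bool
old_slru_page_exists(const char *dir, uint64_t pageno)
{
	char		path[1024];
	uint64_t	segno = pageno / SLRU_PAGES_PER_SEGMENT;
	off_t		offset = (off_t) (pageno % SLRU_PAGES_PER_SEGMENT) * BLCKSZ;
	off_t		endpos;
	int			fd;

	assert(dir != NULL);
	snprintf(path, sizeof(path), "%s/%04llX", dir,
			 (unsigned long long) segno);

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return false;			/* segment file is missing */

	/* the page exists only if the file is long enough to contain it */
	endpos = lseek(fd, 0, SEEK_END);
	close(fd);

	return endpos >= offset + BLCKSZ;
}
```

The caller would still have to treat a zero offset read from an existing
page as the second failure condition and report the "run vacuum first"
error in both cases.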

--
Best regards,
Maxim Orlov.

#30Maxim Orlov
orlovmg@gmail.com
In reply to: Maxim Orlov (#29)
7 attachment(s)
Re: POC: make mxidoff 64 bits

Looks like there is a bit of a pause in the discussion. Here is a small
update: consider v12.
No major changes, just a rebase onto the current master and a squash of
multiple commits to make the patch set easier to review.

AFAICS, we have reached a consensus on the core patch for switching to
64-bit offsets. The only concern is about more comprehensive test coverage
for pg_upgrade, isn't it?

--
Best regards,
Maxim Orlov.

Attachments:

v12-0007-TEST-bump-catver.patch.txt (text/plain; charset=US-ASCII)
From bbd878821f997c6b8e3053091bdbecbdeff79d2b Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Wed, 13 Nov 2024 16:34:34 +0300
Subject: [PATCH v12 7/7] TEST: bump catver

---
 src/bin/pg_upgrade/pg_upgrade.h  | 2 +-
 src/include/catalog/catversion.h | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index 9b3d645b08..df915e6382 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -119,7 +119,7 @@ extern char *output_files[];
  *
  * XXX: should be changed to the actual CATALOG_VERSION_NO on commit.
  */
-#define MULTIXACTOFFSET_FORMATCHANGE_CAT_VER 202409041
+#define MULTIXACTOFFSET_FORMATCHANGE_CAT_VER 202412202
 
 /*
  * large object chunk size added to pg_controldata,
diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h
index 7c7133cd88..9ee5102259 100644
--- a/src/include/catalog/catversion.h
+++ b/src/include/catalog/catversion.h
@@ -57,6 +57,6 @@
  */
 
 /*							yyyymmddN */
-#define CATALOG_VERSION_NO	202501401
+#define CATALOG_VERSION_NO	202412202
 
 #endif
-- 
2.43.0

v12-0001-Use-64-bit-format-output-for-multixact-offsets.patch (application/octet-stream)
From a75408003a23163b6d39906590e3fd859d79abcc Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Wed, 7 Aug 2024 16:35:22 +0300
Subject: [PATCH v12 1/7] Use 64-bit format output for multixact offsets

Author: Maxim Orlov <orlovmg@gmail.com>
---
 src/backend/access/rmgrdesc/mxactdesc.c   |  9 ++++----
 src/backend/access/rmgrdesc/xlogdesc.c    |  4 ++--
 src/backend/access/transam/multixact.c    | 26 +++++++++++++----------
 src/backend/access/transam/xlogrecovery.c |  5 +++--
 src/bin/pg_controldata/pg_controldata.c   |  4 ++--
 src/bin/pg_resetwal/pg_resetwal.c         |  8 +++----
 6 files changed, 31 insertions(+), 25 deletions(-)

diff --git a/src/backend/access/rmgrdesc/mxactdesc.c b/src/backend/access/rmgrdesc/mxactdesc.c
index 8bd3d5b63c..b792e9d939 100644
--- a/src/backend/access/rmgrdesc/mxactdesc.c
+++ b/src/backend/access/rmgrdesc/mxactdesc.c
@@ -65,8 +65,8 @@ multixact_desc(StringInfo buf, XLogReaderState *record)
 		xl_multixact_create *xlrec = (xl_multixact_create *) rec;
 		int			i;
 
-		appendStringInfo(buf, "%u offset %u nmembers %d: ", xlrec->mid,
-						 xlrec->moff, xlrec->nmembers);
+		appendStringInfo(buf, "%u offset %llu nmembers %d: ", xlrec->mid,
+						 (unsigned long long) xlrec->moff, xlrec->nmembers);
 		for (i = 0; i < xlrec->nmembers; i++)
 			out_member(buf, &xlrec->members[i]);
 	}
@@ -74,9 +74,10 @@ multixact_desc(StringInfo buf, XLogReaderState *record)
 	{
 		xl_multixact_truncate *xlrec = (xl_multixact_truncate *) rec;
 
-		appendStringInfo(buf, "offsets [%u, %u), members [%u, %u)",
+		appendStringInfo(buf, "offsets [%u, %u), members [%llu, %llu)",
 						 xlrec->startTruncOff, xlrec->endTruncOff,
-						 xlrec->startTruncMemb, xlrec->endTruncMemb);
+						 (unsigned long long) xlrec->startTruncMemb,
+						 (unsigned long long) xlrec->endTruncMemb);
 	}
 }
 
diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c
index 58040f2865..e52a5625a8 100644
--- a/src/backend/access/rmgrdesc/xlogdesc.c
+++ b/src/backend/access/rmgrdesc/xlogdesc.c
@@ -66,7 +66,7 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 		CheckPoint *checkpoint = (CheckPoint *) rec;
 
 		appendStringInfo(buf, "redo %X/%X; "
-						 "tli %u; prev tli %u; fpw %s; wal_level %s; xid %u:%u; oid %u; multi %u; offset %u; "
+						 "tli %u; prev tli %u; fpw %s; wal_level %s; xid %u:%u; oid %u; multi %u; offset %llu; "
 						 "oldest xid %u in DB %u; oldest multi %u in DB %u; "
 						 "oldest/newest commit timestamp xid: %u/%u; "
 						 "oldest running xid %u; %s",
@@ -79,7 +79,7 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 						 XidFromFullTransactionId(checkpoint->nextXid),
 						 checkpoint->nextOid,
 						 checkpoint->nextMulti,
-						 checkpoint->nextMultiOffset,
+						 (unsigned long long) checkpoint->nextMultiOffset,
 						 checkpoint->oldestXid,
 						 checkpoint->oldestXidDB,
 						 checkpoint->oldestMulti,
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 27ccdf9500..623fc8bdac 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1264,7 +1264,8 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 
 	LWLockRelease(MultiXactGenLock);
 
-	debug_elog4(DEBUG2, "GetNew: returning %u offset %u", result, *offset);
+	debug_elog4(DEBUG2, "GetNew: returning %u offset %llu", result,
+				(unsigned long long) *offset);
 	return result;
 }
 
@@ -2293,8 +2294,9 @@ MultiXactGetCheckptMulti(bool is_shutdown,
 	LWLockRelease(MultiXactGenLock);
 
 	debug_elog6(DEBUG2,
-				"MultiXact: checkpoint is nextMulti %u, nextOffset %u, oldestMulti %u in DB %u",
-				*nextMulti, *nextMultiOffset, *oldestMulti, *oldestMultiDB);
+				"MultiXact: checkpoint is nextMulti %u, nextOffset %llu, oldestMulti %u in DB %u",
+				*nextMulti, (unsigned long long) *nextMultiOffset, *oldestMulti,
+				*oldestMultiDB);
 }
 
 /*
@@ -2328,8 +2330,8 @@ void
 MultiXactSetNextMXact(MultiXactId nextMulti,
 					  MultiXactOffset nextMultiOffset)
 {
-	debug_elog4(DEBUG2, "MultiXact: setting next multi to %u offset %u",
-				nextMulti, nextMultiOffset);
+	debug_elog4(DEBUG2, "MultiXact: setting next multi to %u offset %llu",
+				nextMulti, (unsigned long long) nextMultiOffset);
 	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
 	MultiXactState->nextMXact = nextMulti;
 	MultiXactState->nextOffset = nextMultiOffset;
@@ -2519,8 +2521,8 @@ MultiXactAdvanceNextMXact(MultiXactId minMulti,
 	}
 	if (MultiXactOffsetPrecedes(MultiXactState->nextOffset, minMultiOffset))
 	{
-		debug_elog3(DEBUG2, "MultiXact: setting next offset to %u",
-					minMultiOffset);
+		debug_elog3(DEBUG2, "MultiXact: setting next offset to %llu",
+					(unsigned long long) minMultiOffset);
 		MultiXactState->nextOffset = minMultiOffset;
 	}
 	LWLockRelease(MultiXactGenLock);
@@ -3211,11 +3213,12 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 
 	elog(DEBUG1, "performing multixact truncation: "
 		 "offsets [%u, %u), offsets segments [%llx, %llx), "
-		 "members [%u, %u), members segments [%llx, %llx)",
+		 "members [%llu, %llu), members segments [%llx, %llx)",
 		 oldestMulti, newOldestMulti,
 		 (unsigned long long) MultiXactIdToOffsetSegment(oldestMulti),
 		 (unsigned long long) MultiXactIdToOffsetSegment(newOldestMulti),
-		 oldestOffset, newOldestOffset,
+		 (unsigned long long) oldestOffset,
+		 (unsigned long long) newOldestOffset,
 		 (unsigned long long) MXOffsetToMemberSegment(oldestOffset),
 		 (unsigned long long) MXOffsetToMemberSegment(newOldestOffset));
 
@@ -3471,11 +3474,12 @@ multixact_redo(XLogReaderState *record)
 
 		elog(DEBUG1, "replaying multixact truncation: "
 			 "offsets [%u, %u), offsets segments [%llx, %llx), "
-			 "members [%u, %u), members segments [%llx, %llx)",
+			 "members [%llu, %llu), members segments [%llx, %llx)",
 			 xlrec.startTruncOff, xlrec.endTruncOff,
 			 (unsigned long long) MultiXactIdToOffsetSegment(xlrec.startTruncOff),
 			 (unsigned long long) MultiXactIdToOffsetSegment(xlrec.endTruncOff),
-			 xlrec.startTruncMemb, xlrec.endTruncMemb,
+			 (unsigned long long) xlrec.startTruncMemb,
+			 (unsigned long long) xlrec.endTruncMemb,
 			 (unsigned long long) MXOffsetToMemberSegment(xlrec.startTruncMemb),
 			 (unsigned long long) MXOffsetToMemberSegment(xlrec.endTruncMemb));
 
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 0bbe2eea20..13a20eb8d2 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -876,8 +876,9 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 							 U64FromFullTransactionId(checkPoint.nextXid),
 							 checkPoint.nextOid)));
 	ereport(DEBUG1,
-			(errmsg_internal("next MultiXactId: %u; next MultiXactOffset: %u",
-							 checkPoint.nextMulti, checkPoint.nextMultiOffset)));
+			(errmsg_internal("next MultiXactId: %u; next MultiXactOffset: %llu",
+							 checkPoint.nextMulti,
+							 (unsigned long long) checkPoint.nextMultiOffset)));
 	ereport(DEBUG1,
 			(errmsg_internal("oldest unfrozen transaction ID: %u, in database %u",
 							 checkPoint.oldestXid, checkPoint.oldestXidDB)));
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index 93a05d80ca..43b6727570 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -253,8 +253,8 @@ main(int argc, char *argv[])
 		   ControlFile->checkPointCopy.nextOid);
 	printf(_("Latest checkpoint's NextMultiXactId:  %u\n"),
 		   ControlFile->checkPointCopy.nextMulti);
-	printf(_("Latest checkpoint's NextMultiOffset:  %u\n"),
-		   ControlFile->checkPointCopy.nextMultiOffset);
+	printf(_("Latest checkpoint's NextMultiOffset:  %llu\n"),
+		   (unsigned long long) ControlFile->checkPointCopy.nextMultiOffset);
 	printf(_("Latest checkpoint's oldestXID:        %u\n"),
 		   ControlFile->checkPointCopy.oldestXid);
 	printf(_("Latest checkpoint's oldestXID's DB:   %u\n"),
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index ed73607a46..fff401e469 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -737,8 +737,8 @@ PrintControlValues(bool guessed)
 		   ControlFile.checkPointCopy.nextOid);
 	printf(_("Latest checkpoint's NextMultiXactId:  %u\n"),
 		   ControlFile.checkPointCopy.nextMulti);
-	printf(_("Latest checkpoint's NextMultiOffset:  %u\n"),
-		   ControlFile.checkPointCopy.nextMultiOffset);
+	printf(_("Latest checkpoint's NextMultiOffset:  %llu\n"),
+		   (unsigned long long) ControlFile.checkPointCopy.nextMultiOffset);
 	printf(_("Latest checkpoint's oldestXID:        %u\n"),
 		   ControlFile.checkPointCopy.oldestXid);
 	printf(_("Latest checkpoint's oldestXID's DB:   %u\n"),
@@ -809,8 +809,8 @@ PrintNewControlValues(void)
 
 	if (set_mxoff != -1)
 	{
-		printf(_("NextMultiOffset:                      %u\n"),
-			   ControlFile.checkPointCopy.nextMultiOffset);
+		printf(_("NextMultiOffset:                      %llu\n"),
+			   (unsigned long long) ControlFile.checkPointCopy.nextMultiOffset);
 	}
 
 	if (set_oid != 0)
-- 
2.43.0
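
The change in v12-0001 is mechanical: every place that prints a multixact
offset switches from %u to %llu with an explicit cast to unsigned long long,
as the hunks above show. A standalone sketch of that pattern, with
MultiXactOffset64 and format_next_offset() as made-up names for
illustration only:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical name for the widened offset type. */
typedef uint64_t MultiXactOffset64;

/*
 * Format an offset the way the patch does: %llu plus an explicit cast,
 * so the format string does not depend on how the platform defines
 * the 64-bit integer type.
 */
static int
format_next_offset(char *buf, size_t buflen, MultiXactOffset64 off)
{
	assert(buf != NULL && buflen > 0);

	return snprintf(buf, buflen, "NextMultiOffset: %llu",
					(unsigned long long) off);
}
```

A value just past the old 32-bit limit, such as 4294967296, then prints in
full instead of being truncated by a %u conversion.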

v12-0003-Make-pg_upgrade-convert-multixact-offsets.patch (application/octet-stream)
From 0537a94691c44f5fba4330c6d22656c5f5c1466b Mon Sep 17 00:00:00 2001
From: Maxim Orlov <m.orlov@postgrespro.ru>
Date: Tue, 13 Aug 2024 14:44:50 +0300
Subject: [PATCH v12 3/7] Make pg_upgrade convert multixact offsets.

Author: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Author: Maxim Orlov <orlovmg@gmail.com>
Author: Yura Sokolov <y.sokolov@postgrespro.ru>
---
 src/backend/access/transam/multixact.c |  35 +--
 src/bin/pg_upgrade/Makefile            |   3 +
 src/bin/pg_upgrade/meson.build         |   3 +
 src/bin/pg_upgrade/multixact_old.c     | 338 +++++++++++++++++++++++++
 src/bin/pg_upgrade/multixact_old.h     |  12 +
 src/bin/pg_upgrade/multixact_rewrite.c | 238 +++++++++++++++++
 src/bin/pg_upgrade/pg_upgrade.c        |  29 ++-
 src/bin/pg_upgrade/pg_upgrade.h        |  13 +-
 src/bin/pg_upgrade/slru_io.c           | 211 +++++++++++++++
 src/bin/pg_upgrade/slru_io.h           |  23 ++
 10 files changed, 873 insertions(+), 32 deletions(-)
 create mode 100644 src/bin/pg_upgrade/multixact_old.c
 create mode 100644 src/bin/pg_upgrade/multixact_old.h
 create mode 100644 src/bin/pg_upgrade/multixact_rewrite.c
 create mode 100644 src/bin/pg_upgrade/slru_io.c
 create mode 100644 src/bin/pg_upgrade/slru_io.h

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index cd9db52e95..d63ae17330 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1103,7 +1103,6 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
 	MultiXactOffset *offptr;
 	MultiXactOffset offset;
 	int			length;
-	int			truelength;
 	MultiXactId oldestMXact;
 	MultiXactId nextMXact;
 	MultiXactId tmpMXact;
@@ -1202,15 +1201,6 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
 	 * we have just for this; the process in charge will signal the CV as soon
 	 * as it has finished writing the multixact offset.
 	 *
-	 * 3. Because GetNewMultiXactId increments offset zero to offset one to
-	 * handle case #2, there is an ambiguity near the point of offset
-	 * wraparound.  If we see next multixact's offset is one, is that our
-	 * multixact's actual endpoint, or did it end at zero with a subsequent
-	 * increment?  We handle this using the knowledge that if the zero'th
-	 * member slot wasn't filled, it'll contain zero, and zero isn't a valid
-	 * transaction ID so it can't be a multixact member.  Therefore, if we
-	 * read a zero from the members array, just ignore it.
-	 *
 	 * This is all pretty messy, but the mess occurs only in infrequent corner
 	 * cases, so it seems better than holding the MultiXactGenLock for a long
 	 * time on every multixact creation.
@@ -1297,6 +1287,9 @@ retry:
 	LWLockRelease(lock);
 	lock = NULL;
 
+	/* A multixid with zero members should not happen */
+	Assert(length > 0);
+
 	/*
 	 * If we slept above, clean up state; it's no longer needed.
 	 */
@@ -1305,7 +1298,6 @@ retry:
 
 	ptr = (MultiXactMember *) palloc(length * sizeof(MultiXactMember));
 
-	truelength = 0;
 	prev_pageno = -1;
 	for (int i = 0; i < length; i++, offset++)
 	{
@@ -1343,36 +1335,27 @@ retry:
 		xactptr = (TransactionId *)
 			(MultiXactMemberCtl->shared->page_buffer[slotno] + memberoff);
 
-		if (!TransactionIdIsValid(*xactptr))
-		{
-			/* Corner case 3: we must be looking at unused slot zero */
-			Assert(offset == 0);
-			continue;
-		}
+		Assert(TransactionIdIsValid(*xactptr));
 
 		flagsoff = MXOffsetToFlagsOffset(offset);
 		bshift = MXOffsetToFlagsBitShift(offset);
 		flagsptr = (uint32 *) (MultiXactMemberCtl->shared->page_buffer[slotno] + flagsoff);
 
-		ptr[truelength].xid = *xactptr;
-		ptr[truelength].status = (*flagsptr >> bshift) & MXACT_MEMBER_XACT_BITMASK;
-		truelength++;
+		ptr[i].xid = *xactptr;
+		ptr[i].status = (*flagsptr >> bshift) & MXACT_MEMBER_XACT_BITMASK;
 	}
 
 	LWLockRelease(lock);
 
-	/* A multixid with zero members should not happen */
-	Assert(truelength > 0);
-
 	/*
 	 * Copy the result into the local cache.
 	 */
-	mXactCachePut(multi, truelength, ptr);
+	mXactCachePut(multi, length, ptr);
 
 	debug_elog3(DEBUG2, "GetMembers: no cache for %s",
-				mxid_to_string(multi, truelength, ptr));
+				mxid_to_string(multi, length, ptr));
 	*members = ptr;
-	return truelength;
+	return length;
 }
 
 /*
diff --git a/src/bin/pg_upgrade/Makefile b/src/bin/pg_upgrade/Makefile
index f83d2b5d30..b4ad01c00b 100644
--- a/src/bin/pg_upgrade/Makefile
+++ b/src/bin/pg_upgrade/Makefile
@@ -19,11 +19,14 @@ OBJS = \
 	file.o \
 	function.o \
 	info.o \
+	multixact_old.o \
+	multixact_rewrite.o \
 	option.o \
 	parallel.o \
 	pg_upgrade.o \
 	relfilenumber.o \
 	server.o \
+	slru_io.o \
 	tablespace.o \
 	task.o \
 	util.o \
diff --git a/src/bin/pg_upgrade/meson.build b/src/bin/pg_upgrade/meson.build
index cc2ba97d9a..76c8f2005d 100644
--- a/src/bin/pg_upgrade/meson.build
+++ b/src/bin/pg_upgrade/meson.build
@@ -8,11 +8,14 @@ pg_upgrade_sources = files(
   'file.c',
   'function.c',
   'info.c',
+  'multixact_old.c',
+  'multixact_rewrite.c',
   'option.c',
   'parallel.c',
   'pg_upgrade.c',
   'relfilenumber.c',
   'server.c',
+  'slru_io.c',
   'tablespace.c',
   'task.c',
   'util.c',
diff --git a/src/bin/pg_upgrade/multixact_old.c b/src/bin/pg_upgrade/multixact_old.c
new file mode 100644
index 0000000000..0442928e89
--- /dev/null
+++ b/src/bin/pg_upgrade/multixact_old.c
@@ -0,0 +1,338 @@
+/*
+ *	multixact_old.c
+ *
+ *	Support for reading pre-v18 format pg_multixact files
+ *
+ *	Copyright (c) 2024, PostgreSQL Global Development Group
+ *	src/bin/pg_upgrade/multixact_old.c
+ */
+
+#include "postgres_fe.h"
+
+#include "access/transam.h"
+#include "pg_upgrade.h"
+#include "multixact_old.h"
+#include "slru_io.h"
+
+/*
+ * Below are a bunch of definitions that are copy-pasted from multixact.c from
+ * version 17. They shadow the new definitions in access/multixact.h, so it's
+ * important that we *don't* include that here. That is a big reason this
+ * code has to be in a separate source file.
+ *
+ * All references to MultiXactOffset have been replaced with OldMultiXactOffset.
+ */
+typedef uint32 OldMultiXactOffset;
+
+#define FirstMultiXactId	((MultiXactId) 1)
+
+/*
+ * Possible multixact lock modes ("status").  The first four modes are for
+ * tuple locks (FOR KEY SHARE, FOR SHARE, FOR NO KEY UPDATE, FOR UPDATE); the
+ * next two are used for update and delete modes.
+ */
+typedef enum
+{
+	MultiXactStatusForKeyShare = 0x00,
+	MultiXactStatusForShare = 0x01,
+	MultiXactStatusForNoKeyUpdate = 0x02,
+	MultiXactStatusForUpdate = 0x03,
+	/* an update that doesn't touch "key" columns */
+	MultiXactStatusNoKeyUpdate = 0x04,
+	/* other updates, and delete */
+	MultiXactStatusUpdate = 0x05,
+} MultiXactStatus;
+
+/* does a status value correspond to a tuple update? */
+#define ISUPDATE_from_mxstatus(status) \
+			((status) > MultiXactStatusForUpdate)
+
+/*
+ * Defines for OldMultiXactOffset page sizes.  A page is the same BLCKSZ as is
+ * used everywhere else in Postgres.
+ *
+ * Note: because OldMultiXactOffsets are 32 bits and wrap around at 0xFFFFFFFF,
+ * MultiXact page numbering also wraps around at
+ * 0xFFFFFFFF/MULTIXACT_OFFSETS_PER_PAGE, and segment numbering at
+ * 0xFFFFFFFF/MULTIXACT_OFFSETS_PER_PAGE/SLRU_PAGES_PER_SEGMENT.  We need
+ * take no explicit notice of that fact in this module, except when comparing
+ * segment and page numbers in TruncateMultiXact (see
+ * OldMultiXactOffsetPagePrecedes).
+ */
+
+/* We need four bytes per offset */
+#define MULTIXACT_OFFSETS_PER_PAGE (BLCKSZ / sizeof(uint32))
+
+static inline int64
+MultiXactIdToOffsetPage(MultiXactId multi)
+{
+	return multi / MULTIXACT_OFFSETS_PER_PAGE;
+}
+
+static inline int
+MultiXactIdToOffsetEntry(MultiXactId multi)
+{
+	return multi % MULTIXACT_OFFSETS_PER_PAGE;
+}
+
+static inline int64
+MultiXactIdToOffsetSegment(MultiXactId multi)
+{
+	return MultiXactIdToOffsetPage(multi) / SLRU_PAGES_PER_SEGMENT;
+}
+
+/*
+ * The situation for members is a bit more complex: we store one byte of
+ * additional flag bits for each TransactionId.  To do this without getting
+ * into alignment issues, we store four bytes of flags, and then the
+ * corresponding 4 Xids.  Each such 5-word (20-byte) set we call a "group", and
+ * are stored as a whole in pages.  Thus, with 8kB BLCKSZ, we keep 409 groups
+ * per page.  This wastes 12 bytes per page, but that's OK -- simplicity (and
+ * performance) trumps space efficiency here.
+ *
+ * Note that the "offset" macros work with byte offset, not array indexes, so
+ * arithmetic must be done using "char *" pointers.
+ */
+/* We need eight bits per xact, so one xact fits in a byte */
+#define MXACT_MEMBER_BITS_PER_XACT			8
+#define MXACT_MEMBER_FLAGS_PER_BYTE			1
+#define MXACT_MEMBER_XACT_BITMASK	((1 << MXACT_MEMBER_BITS_PER_XACT) - 1)
+
+/* how many full bytes of flags are there in a group? */
+#define MULTIXACT_FLAGBYTES_PER_GROUP		4
+#define MULTIXACT_MEMBERS_PER_MEMBERGROUP	\
+	(MULTIXACT_FLAGBYTES_PER_GROUP * MXACT_MEMBER_FLAGS_PER_BYTE)
+/* size in bytes of a complete group */
+#define MULTIXACT_MEMBERGROUP_SIZE \
+	(sizeof(TransactionId) * MULTIXACT_MEMBERS_PER_MEMBERGROUP + MULTIXACT_FLAGBYTES_PER_GROUP)
+#define MULTIXACT_MEMBERGROUPS_PER_PAGE (BLCKSZ / MULTIXACT_MEMBERGROUP_SIZE)
+#define MULTIXACT_MEMBERS_PER_PAGE	\
+	(MULTIXACT_MEMBERGROUPS_PER_PAGE * MULTIXACT_MEMBERS_PER_MEMBERGROUP)
+
+/* page in which a member is to be found */
+static inline int64
+MXOffsetToMemberPage(OldMultiXactOffset offset)
+{
+	return offset / MULTIXACT_MEMBERS_PER_PAGE;
+}
+
+/* Location (byte offset within page) of flag word for a given member */
+static inline int
+MXOffsetToFlagsOffset(OldMultiXactOffset offset)
+{
+	OldMultiXactOffset group = offset / MULTIXACT_MEMBERS_PER_MEMBERGROUP;
+	int			grouponpg = group % MULTIXACT_MEMBERGROUPS_PER_PAGE;
+	int			byteoff = grouponpg * MULTIXACT_MEMBERGROUP_SIZE;
+
+	return byteoff;
+}
+
+static inline int
+MXOffsetToFlagsBitShift(OldMultiXactOffset offset)
+{
+	int			member_in_group = offset % MULTIXACT_MEMBERS_PER_MEMBERGROUP;
+	int			bshift = member_in_group * MXACT_MEMBER_BITS_PER_XACT;
+
+	return bshift;
+}
+
+/* Location (byte offset within page) of TransactionId of given member */
+static inline int
+MXOffsetToMemberOffset(OldMultiXactOffset offset)
+{
+	int			member_in_group = offset % MULTIXACT_MEMBERS_PER_MEMBERGROUP;
+
+	return MXOffsetToFlagsOffset(offset) +
+		MULTIXACT_FLAGBYTES_PER_GROUP +
+		member_in_group * sizeof(TransactionId);
+}
+
+typedef struct OldMultiXactReader
+{
+	MultiXactId nextMXact;
+	uint32	nextOffset;
+
+	SlruSegState *offset;
+	SlruSegState *members;
+} OldMultiXactReader;
+
+OldMultiXactReader *
+StartOldMultiXactRead(void)
+{
+	OldMultiXactReader *state;
+	char	   *dir;
+
+	state = pg_malloc(sizeof(OldMultiXactReader));
+	state->nextMXact = old_cluster.controldata.chkpnt_nxtmulti;
+	state->nextOffset = old_cluster.controldata.chkpnt_nxtmxoff;
+
+	dir = psprintf("%s/pg_multixact/offsets", old_cluster.pgdata);
+	state->offset = OpenSlruRead(dir);
+	pg_free(dir);
+
+	dir = psprintf("%s/pg_multixact/members", old_cluster.pgdata);
+	state->members = OpenSlruRead(dir);
+	pg_free(dir);
+
+	return state;
+}
+
+/*
+ * This is a simplified version of the GetMultiXactIdMembers() server function.
+ *
+ * - Only return the updating member, if any. Upgrade only cares about the updaters.
+ *   If there is no updating member, return the first locking-only member. We don't
+ *   have any way to represent "no members", but we also don't need to preserve all
+ *   the locking members.
+ *
+ * - We don't need to worry about locking and some corner cases because there's
+ *   no concurrent activity.
+ */
+void
+GetOldMultiXactIdSingleMember(OldMultiXactReader *state, MultiXactId multi,
+							  TransactionId *result, bool *isupdate)
+{
+	TransactionId result_xid;
+	bool		result_isupdate;
+	int64		pageno;
+	int64		prev_pageno;
+	int			entryno;
+	OldMultiXactOffset *offptr;
+	OldMultiXactOffset offset;
+	int			length;
+	MultiXactId nextMXact;
+	MultiXactId tmpMXact;
+	OldMultiXactOffset nextOffset;
+	char	   *buf;
+
+	nextMXact = state->nextMXact;
+	nextOffset = state->nextOffset;
+
+	/*
+	 * Find out the offset at which we need to start reading MultiXactMembers
+	 * and the number of members in the multixact.  We determine the latter as
+	 * the difference between this multixact's starting offset and the next
+	 * one's.  However, there are some corner cases to worry about:
+	 *
+	 * 1. This multixact may be the latest one created, in which case there is
+	 * no next one to look at.  In this case the nextOffset value we just
+	 * saved is the correct endpoint.
+	 *
+	 * 2. (this cannot happen during upgrade)
+	 *
+	 * 3. Because GetNewMultiXactId increments offset zero to offset one to
+	 * handle case #2, there is an ambiguity near the point of offset
+	 * wraparound.  If we see next multixact's offset is one, is that our
+	 * multixact's actual endpoint, or did it end at zero with a subsequent
+	 * increment?  We handle this using the knowledge that if the zero'th
+	 * member slot wasn't filled, it'll contain zero, and zero isn't a valid
+	 * transaction ID so it can't be a multixact member.  Therefore, if we
+	 * read a zero from the members array, just ignore it.
+	 */
+	pageno = MultiXactIdToOffsetPage(multi);
+	entryno = MultiXactIdToOffsetEntry(multi);
+
+	buf = SlruReadSwitchPage(state->offset, pageno);
+	offptr = (OldMultiXactOffset *) buf;
+	offptr += entryno;
+	offset = *offptr;
+
+	Assert(offset != 0);
+
+	/*
+	 * Use the same increment rule as GetNewMultiXactId(), that is, don't
+	 * handle wraparound explicitly until needed.
+	 */
+	tmpMXact = multi + 1;
+
+	if (nextMXact == tmpMXact)
+	{
+		/* Corner case 1: there is no next multixact */
+		length = nextOffset - offset;
+	}
+	else
+	{
+		OldMultiXactOffset nextMXOffset;
+
+		/* handle wraparound if needed */
+		if (tmpMXact < FirstMultiXactId)
+			tmpMXact = FirstMultiXactId;
+
+		prev_pageno = pageno;
+
+		pageno = MultiXactIdToOffsetPage(tmpMXact);
+		entryno = MultiXactIdToOffsetEntry(tmpMXact);
+
+		if (pageno != prev_pageno)
+		{
+			buf = SlruReadSwitchPage(state->offset, pageno);
+		}
+
+		offptr = (OldMultiXactOffset *) buf;
+		offptr += entryno;
+		nextMXOffset = *offptr;
+
+		if (nextMXOffset == 0)
+		{
+			/* Corner case 2: next multixact is still being filled in */
+			Assert(false); /* shouldn't happen during upgrade */
+		}
+
+		length = nextMXOffset - offset;
+	}
+
+	result_xid = InvalidTransactionId;
+	result_isupdate = false;
+	prev_pageno = -1;
+	for (int i = 0; i < length; i++, offset++)
+	{
+		TransactionId *xactptr;
+		uint32	   *flagsptr;
+		int			flagsoff;
+		int			bshift;
+		int			memberoff;
+		MultiXactStatus status;
+
+		pageno = MXOffsetToMemberPage(offset);
+		memberoff = MXOffsetToMemberOffset(offset);
+
+		if (pageno != prev_pageno)
+		{
+			buf = SlruReadSwitchPage(state->members, pageno);
+			prev_pageno = pageno;
+		}
+
+		xactptr = (TransactionId *) (buf + memberoff);
+
+		if (!TransactionIdIsValid(*xactptr))
+		{
+			/* Corner case 3: we must be looking at unused slot zero */
+			Assert(offset == 0);
+			continue;
+		}
+
+		flagsoff = MXOffsetToFlagsOffset(offset);
+		bshift = MXOffsetToFlagsBitShift(offset);
+		flagsptr = (uint32 *) (buf + flagsoff);
+
+		status = (*flagsptr >> bshift) & MXACT_MEMBER_XACT_BITMASK;
+
+		/* Verify that there is a single update Xid among the given members. */
+		if (ISUPDATE_from_mxstatus(status))
+		{
+			if (result_isupdate)
+				pg_fatal("multixact %u has more than one updating member",
+						 multi);
+			result_xid = *xactptr;
+			result_isupdate = true;
+		}
+		else if (!TransactionIdIsValid(result_xid))
+			result_xid = *xactptr;
+	}
+
+	/* A multixid with zero members should not happen */
+	Assert(TransactionIdIsValid(result_xid));
+
+	*result = result_xid;
+	*isupdate = result_isupdate;
+}
diff --git a/src/bin/pg_upgrade/multixact_old.h b/src/bin/pg_upgrade/multixact_old.h
new file mode 100644
index 0000000000..70800c1cda
--- /dev/null
+++ b/src/bin/pg_upgrade/multixact_old.h
@@ -0,0 +1,12 @@
+/*
+ *	multixact_old.h
+ *
+ *	Copyright (c) 2010-2024, PostgreSQL Global Development Group
+ *	src/bin/pg_upgrade/multixact_old.h
+ */
+
+typedef struct OldMultiXactReader OldMultiXactReader;
+
+extern OldMultiXactReader *StartOldMultiXactRead(void);
+extern void GetOldMultiXactIdSingleMember(OldMultiXactReader *state, MultiXactId multi,
+										  TransactionId *result, bool *isupdate);
diff --git a/src/bin/pg_upgrade/multixact_rewrite.c b/src/bin/pg_upgrade/multixact_rewrite.c
new file mode 100644
index 0000000000..7b3aeb80c0
--- /dev/null
+++ b/src/bin/pg_upgrade/multixact_rewrite.c
@@ -0,0 +1,238 @@
+/*
+ *	multixact_rewrite.c
+ *
+ *	Rewrite pre-v18 multixacts to new format with 64-bit MultiXactOffsets
+ *
+ *	Copyright (c) 2024, PostgreSQL Global Development Group
+ *	src/bin/pg_upgrade/multixact_rewrite.c
+ */
+
+#include "postgres_fe.h"
+
+#include "multixact_old.h"
+#include "pg_upgrade.h"
+#include "slru_io.h"
+
+#include "access/multixact.h"
+#include "access/multixact_internal.h"
+
+typedef struct
+{
+	MultiXactId nextMXact;
+	MultiXactOffset nextOffset;
+
+	SlruSegState *offset;
+	SlruSegState *members;
+} MultiXactWriter;
+
+static MultiXactWriter *StartMultiXactWrite(MultiXactId firstMulti, MultiXactOffset firstOffset);
+static MultiXactId GetNewMultiXactId(MultiXactWriter *state, int nmembers, MultiXactOffset *offset);
+static void RecordNewMultiXact(MultiXactWriter *state,
+							   MultiXactOffset offset,
+							   MultiXactId multi,
+							   int nmembers, MultiXactMember *members);
+static void CloseMultiXactWrite(MultiXactWriter *state);
+
+
+/*
+ * Convert pg_multixact/offsets and /members to the new format with 64-bit offsets.
+ */
+void
+convert_multixacts(MultiXactId *new_nxtmulti, MultiXactOffset *new_nxtmxoff)
+{
+	MultiXactWriter	   *new_writer;
+	MultiXactId			oldest_multi = old_cluster.controldata.chkpnt_oldstMulti,
+						next_multi = old_cluster.controldata.chkpnt_nxtmulti,
+						multi;
+	OldMultiXactReader *old_reader;
+
+	if (next_multi < FirstMultiXactId)
+		next_multi = FirstMultiXactId;
+
+	old_reader = StartOldMultiXactRead();
+	new_writer = StartMultiXactWrite(oldest_multi, 1);
+
+	/*
+	 * Read multixids from old files one by one, and write them back in the
+	 * new format.
+	 *
+	 * The locking-only XIDs that may be part of multi-xids don't matter after
+	 * upgrade, as there can be no transactions running across upgrade. So as
+	 * a little optimization, we only read one member from each multixid: the
+	 * updating one, or if there was no update, arbitrarily the first
+	 * locking xid.
+	 */
+	for (multi = oldest_multi; multi != next_multi;)
+	{
+		TransactionId xid;
+		bool		isupdate;
+		MultiXactMember member;
+		MultiXactId newmulti;
+		MultiXactOffset offset;
+
+		/* Read the old multixid */
+		GetOldMultiXactIdSingleMember(old_reader, multi, &xid, &isupdate);
+
+		/* Write it out in new format */
+		member.xid = xid;
+		member.status = isupdate ? MultiXactStatusUpdate : MultiXactStatusForKeyShare;
+		newmulti = GetNewMultiXactId(new_writer, 1, &offset);
+		Assert(newmulti == multi);
+		RecordNewMultiXact(new_writer, offset, multi, 1, &member);
+
+		multi++;
+		if (multi < FirstMultiXactId)
+			multi = FirstMultiXactId;
+	}
+
+	/*
+	 * Update the nextMXact/Offset values in the control file to match what we
+	 * wrote. The nextMXact should be unchanged, but because we ignored the
+	 * locking-only XID members, the nextOffset will be different.
+	 */
+	Assert(new_writer->nextMXact == next_multi);
+	*new_nxtmulti = next_multi;
+	*new_nxtmxoff = new_writer->nextOffset;
+
+	/* Release resources */
+	CloseMultiXactWrite(new_writer);
+}
+
+/* Support routines for writing the new format */
+
+static MultiXactWriter *
+StartMultiXactWrite(MultiXactId firstMulti, MultiXactOffset firstOffset)
+{
+	MultiXactWriter *state;
+	char	   *dir;
+
+	state = pg_malloc(sizeof(MultiXactWriter));
+	state->nextMXact = firstMulti;
+	state->nextOffset = firstOffset;
+
+	dir = psprintf("%s/pg_multixact/offsets", new_cluster.pgdata);
+	state->offset = OpenSlruWrite(dir, MultiXactIdToOffsetPage(firstMulti));
+	pg_free(dir);
+
+	dir = psprintf("%s/pg_multixact/members", new_cluster.pgdata);
+	state->members = OpenSlruWrite(dir, MXOffsetToMemberPage(1));
+	pg_free(dir);
+
+	return state;
+}
+
+static void
+CloseMultiXactWrite(MultiXactWriter *state)
+{
+	CloseSlruWrite(state->offset);
+	CloseSlruWrite(state->members);
+	pg_free(state);
+}
+
+/*
+ * Simplified copy of the corresponding server function
+ */
+static MultiXactId
+GetNewMultiXactId(MultiXactWriter *state, int nmembers, MultiXactOffset *offset)
+{
+	MultiXactId		result;
+
+	/* Handle wraparound of the nextMXact counter */
+	if (state->nextMXact < FirstMultiXactId)
+		state->nextMXact = FirstMultiXactId;
+
+	/* Assign the MXID */
+	result = state->nextMXact;
+
+	/*
+	 * Reserve the members space, similarly to above.
+	 */
+	*offset = state->nextOffset;
+
+	/*
+	 * Advance counters.  As in GetNewTransactionId(), this must not happen
+	 * until after file extension has succeeded!
+	 *
+	 * We don't care about MultiXactId wraparound here; it will be handled by
+	 * the next iteration.  But note that nextMXact may be InvalidMultiXactId
+	 * or the first value on a segment-beginning page after this routine
+	 * exits, so anyone else looking at the variable must be prepared to deal
+	 * with either case.  Similarly, nextOffset may be zero, but we won't use
+	 * that as the actual start offset of the next multixact.
+	 */
+	(state->nextMXact)++;
+
+	state->nextOffset += nmembers;
+
+	return result;
+}
+
+/*
+ * Write a new multixact with members.
+ *
+ * Simplified version of the corresponding server function.
+ */
+static void
+RecordNewMultiXact(MultiXactWriter *state, MultiXactOffset offset,
+				   MultiXactId multi,
+				   int nmembers, MultiXactMember *members)
+{
+	int64		pageno;
+	int64		prev_pageno;
+	int			entryno;
+
+	char	   *buf;
+	MultiXactOffset *offptr;
+
+	pageno = MultiXactIdToOffsetPage(multi);
+	entryno = MultiXactIdToOffsetEntry(multi);
+
+	/*
+	 * Store this multixact's starting offset in the offsets SLRU.
+	 * SlruWriteSwitchPage() returns the one-page buffer for the requested
+	 * page; when the page changes, it first flushes the previous page and
+	 * zeroes the buffer.  Unlike the server's RecordNewMultiXact(), there is
+	 * no shared buffer pool or locking here, and I/O errors are fatal.
+	 */
+	buf = SlruWriteSwitchPage(state->offset, pageno);
+	offptr = (MultiXactOffset *) buf;
+	offptr += entryno;
+
+	*offptr = offset;
+
+	prev_pageno = -1;
+
+	for (int i = 0; i < nmembers; i++, offset++)
+	{
+		TransactionId *memberptr;
+		uint32	   *flagsptr;
+		uint32		flagsval;
+		int			bshift;
+		int			flagsoff;
+		int			memberoff;
+
+		Assert(members[i].status <= MultiXactStatusUpdate);
+
+		pageno = MXOffsetToMemberPage(offset);
+		memberoff = MXOffsetToMemberOffset(offset);
+		flagsoff = MXOffsetToFlagsOffset(offset);
+		bshift = MXOffsetToFlagsBitShift(offset);
+
+		if (pageno != prev_pageno)
+		{
+			buf = SlruWriteSwitchPage(state->members, pageno);
+			prev_pageno = pageno;
+		}
+
+		memberptr = (TransactionId *) (buf + memberoff);
+
+		*memberptr = members[i].xid;
+
+		flagsptr = (uint32 *) (buf + flagsoff);
+
+		flagsval = *flagsptr;
+		flagsval &= ~(((1 << MXACT_MEMBER_BITS_PER_XACT) - 1) << bshift);
+		flagsval |= (members[i].status << bshift);
+		*flagsptr = flagsval;
+	}
+}
diff --git a/src/bin/pg_upgrade/pg_upgrade.c b/src/bin/pg_upgrade/pg_upgrade.c
index 36c7f3879d..9bf191b984 100644
--- a/src/bin/pg_upgrade/pg_upgrade.c
+++ b/src/bin/pg_upgrade/pg_upgrade.c
@@ -750,8 +750,27 @@ copy_xact_xlog_xid(void)
 	if (old_cluster.controldata.cat_ver >= MULTIXACT_FORMATCHANGE_CAT_VER &&
 		new_cluster.controldata.cat_ver >= MULTIXACT_FORMATCHANGE_CAT_VER)
 	{
-		copy_subdir_files("pg_multixact/offsets", "pg_multixact/offsets");
-		copy_subdir_files("pg_multixact/members", "pg_multixact/members");
+		MultiXactId new_nxtmulti = old_cluster.controldata.chkpnt_nxtmulti;
+		MultiXactOffset new_nxtmxoff = old_cluster.controldata.chkpnt_nxtmxoff;
+
+		/*
+		 * If the old cluster predates MULTIXACTOFFSET_FORMATCHANGE_CAT_VER,
+		 * it has 32-bit multixid offsets, which must be converted.
+		 */
+		if (old_cluster.controldata.cat_ver < MULTIXACTOFFSET_FORMATCHANGE_CAT_VER &&
+			new_cluster.controldata.cat_ver >= MULTIXACTOFFSET_FORMATCHANGE_CAT_VER)
+		{
+			remove_new_subdir("pg_multixact/members", false);
+			remove_new_subdir("pg_multixact/offsets", false);
+			prep_status("Converting pg_multixact/offsets to 64-bit");
+			convert_multixacts(&new_nxtmulti, &new_nxtmxoff);
+			check_ok();
+		}
+		else
+		{
+			copy_subdir_files("pg_multixact/offsets", "pg_multixact/offsets");
+			copy_subdir_files("pg_multixact/members", "pg_multixact/members");
+		}
 
 		prep_status("Setting next multixact ID and offset for new cluster");
 
@@ -760,10 +779,10 @@ copy_xact_xlog_xid(void)
 		 * counters here and the oldest multi present on system.
 		 */
 		exec_prog(UTILITY_LOG_FILE, NULL, true, true,
-				  "\"%s/pg_resetwal\" -O %u -m %u,%u \"%s\"",
+				  "\"%s/pg_resetwal\" -O %llu -m %u,%u \"%s\"",
 				  new_cluster.bindir,
-				  old_cluster.controldata.chkpnt_nxtmxoff,
-				  old_cluster.controldata.chkpnt_nxtmulti,
+				  (unsigned long long) new_nxtmxoff,
+				  new_nxtmulti,
 				  old_cluster.controldata.chkpnt_oldstMulti,
 				  new_cluster.pgdata);
 		check_ok();
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index 0cdd675e4f..9b3d645b08 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -114,6 +114,13 @@ extern char *output_files[];
  */
 #define MULTIXACT_FORMATCHANGE_CAT_VER 201301231
 
+/*
+ * Switch from 32-bit to 64-bit multixid offsets.
+ *
+ * XXX: should be changed to the actual CATALOG_VERSION_NO on commit.
+ */
+#define MULTIXACTOFFSET_FORMATCHANGE_CAT_VER 202409041
+
 /*
  * large object chunk size added to pg_controldata,
  * commit 5f93c37805e7485488480916b4585e098d3cc883
@@ -230,7 +237,7 @@ typedef struct
 	uint32		chkpnt_nxtepoch;
 	uint32		chkpnt_nxtoid;
 	uint32		chkpnt_nxtmulti;
-	uint32		chkpnt_nxtmxoff;
+	uint64		chkpnt_nxtmxoff;
 	uint32		chkpnt_oldstMulti;
 	uint32		chkpnt_oldstxid;
 	uint32		align;
@@ -515,3 +522,7 @@ typedef struct
 	FILE	   *file;
 	char		path[MAXPGPATH];
 } UpgradeTaskReport;
+
+/* multixact_rewrite.c */
+
+void convert_multixacts(MultiXactId *new_nxtmulti, MultiXactOffset *new_nxtmxoff);
diff --git a/src/bin/pg_upgrade/slru_io.c b/src/bin/pg_upgrade/slru_io.c
new file mode 100644
index 0000000000..87acf16732
--- /dev/null
+++ b/src/bin/pg_upgrade/slru_io.c
@@ -0,0 +1,211 @@
+/*
+ *	slru_io.c
+ *
+ *	Routines for reading and writing SLRU files during upgrade.
+ *
+ *	Copyright (c) 2024, PostgreSQL Global Development Group
+ *	src/bin/pg_upgrade/slru_io.c
+ */
+
+#include "postgres_fe.h"
+
+#include <fcntl.h>
+
+#include "pg_upgrade.h"
+#include "slru_io.h"
+
+#include "common/file_perm.h"
+#include "common/file_utils.h"
+#include "port/pg_iovec.h"
+
+/*
+ * State for reading or writing an SLRU, with a one page buffer.
+ */
+typedef struct SlruSegState
+{
+	bool		writing;
+
+	char	   *dir;
+	char	   *fn;
+	int			fd;
+	int64		segno;
+	uint64		pageno;
+
+	PGAlignedBlock buf;
+} SlruSegState;
+
+static void SlruFlush(SlruSegState *state);
+
+
+SlruSegState *
+OpenSlruRead(char *dir)
+{
+	SlruSegState *state;
+
+	state = pg_malloc(sizeof(SlruSegState));
+	state->writing = false;
+	state->segno = -1;
+	state->pageno = 0;
+	state->dir = pstrdup(dir);
+	state->fd = -1;
+	state->fn = NULL;
+
+	return state;
+}
+
+void
+CloseSlruRead(SlruSegState *state)
+{
+	Assert(!state->writing);
+	close(state->fd);
+	pg_free(state);
+}
+
+SlruSegState *
+OpenSlruWrite(char *dir, int64 startPageno)
+{
+	SlruSegState *state;
+
+	state = pg_malloc(sizeof(SlruSegState));
+	state->writing = true;
+	state->segno = -1;
+	state->pageno = 0;
+	state->dir = pstrdup(dir);
+	state->fd = -1;
+	state->fn = NULL;
+
+	return state;
+}
+
+void
+CloseSlruWrite(SlruSegState *state)
+{
+	Assert(state->writing);
+	SlruFlush(state);
+
+	close(state->fd);
+	pg_free(state);
+}
+
+static void
+SlruFlush(SlruSegState *state)
+{
+	struct iovec iovec = {
+		.iov_base = &state->buf,
+		.iov_len = BLCKSZ,
+	};
+	off_t		offset;
+
+	if (state->segno == -1)
+		return;
+
+	offset = (state->pageno % SLRU_PAGES_PER_SEGMENT) * BLCKSZ;
+
+	if (pg_pwritev_with_retry(state->fd, &iovec, 1, offset) < 0)
+		pg_fatal("could not write file \"%s\": %m", state->fn);
+}
+
+/*
+ * Open the given page for writing.
+ *
+ * NOTE: This uses O_EXCL when stepping to a new segment, so this assumes that
+ * each segment is written in full before moving on to next one. This
+ * limitation would be easy to lift if needed, but it fits the usage pattern
+ * of current callers.
+ */
+char *
+SlruWriteSwitchPage(SlruSegState *state, uint64 pageno)
+{
+	int64		segno = pageno / SLRU_PAGES_PER_SEGMENT;
+	off_t		offset;
+
+	if (state->segno != -1 && pageno == state->pageno)
+		return state->buf.data;
+
+	segno = pageno / SLRU_PAGES_PER_SEGMENT;
+	offset = (pageno % SLRU_PAGES_PER_SEGMENT) * BLCKSZ;
+
+	SlruFlush(state);
+	memset(state->buf.data, 0, BLCKSZ);
+
+	if (segno != state->segno)
+	{
+		if (state->segno != -1)
+		{
+			close(state->fd);
+			state->fd = -1;
+			pg_free(state->fn);
+			state->fn = NULL;
+		}
+
+		/* Create the segment */
+		state->fn = psprintf("%s/%04X", state->dir, (unsigned int) segno);
+		if ((state->fd = open(state->fn, O_RDWR | O_CREAT | O_EXCL | PG_BINARY,
+							  pg_file_create_mode)) < 0)
+		{
+			pg_fatal("could not create file \"%s\": %m", state->fn);
+		}
+		state->segno = segno;
+
+		if (offset > 0)
+		{
+			if (pg_pwrite_zeros(state->fd, offset, 0) < 0)
+				pg_fatal("could not write file \"%s\": %m", state->fn);
+		}
+	}
+
+	state->pageno = pageno;
+	return state->buf.data;
+}
+
+/*
+ * Open given page for reading.
+ *
+ * Reading can be done in random order.
+ */
+char *
+SlruReadSwitchPage(SlruSegState *state, uint64 pageno)
+{
+	int64		segno;
+
+	if (state->segno != -1 && pageno == state->pageno)
+		return state->buf.data;
+
+	segno = pageno / SLRU_PAGES_PER_SEGMENT;
+
+	if (segno != state->segno)
+	{
+		if (state->segno != -1)
+		{
+			close(state->fd);
+			state->fd = -1;
+			pg_free(state->fn);
+			state->fn = NULL;
+		}
+
+		/* Open new segment */
+		state->fn = psprintf("%s/%04X", state->dir, (unsigned int) segno);
+		if ((state->fd = open(state->fn, O_RDONLY | PG_BINARY, 0)) < 0)
+		{
+			pg_fatal("could not open file \"%s\": %m", state->fn);
+		}
+		state->segno = segno;
+	}
+
+	{
+		struct iovec iovec = {
+			.iov_base = &state->buf,
+			.iov_len = BLCKSZ,
+		};
+		off_t		offset;
+
+		offset = (pageno % SLRU_PAGES_PER_SEGMENT) * BLCKSZ;
+
+		if (pg_preadv(state->fd, &iovec, 1, offset) < 0)
+			pg_fatal("could not read file \"%s\": %m", state->fn);
+
+		state->pageno = pageno;
+	}
+
+	return state->buf.data;
+}
diff --git a/src/bin/pg_upgrade/slru_io.h b/src/bin/pg_upgrade/slru_io.h
new file mode 100644
index 0000000000..e1a9c06313
--- /dev/null
+++ b/src/bin/pg_upgrade/slru_io.h
@@ -0,0 +1,23 @@
+/*
+ *	slru_io.h
+ *
+ *	Copyright (c) 2010-2024, PostgreSQL Global Development Group
+ *	src/bin/pg_upgrade/slru_io.h
+ */
+
+/* XXX: copied from slru.h */
+#define SLRU_PAGES_PER_SEGMENT	32
+
+/*
+ * Some kind of iterator associated with a particular SLRU segment.  The idea is
+ * to specify the segment and page number and then move through the pages.
+ */
+typedef struct SlruSegState SlruSegState;
+
+extern SlruSegState *OpenSlruRead(char *dir);
+extern void CloseSlruRead(SlruSegState *state);
+extern char *SlruReadSwitchPage(SlruSegState *state, uint64 pageno);
+
+extern SlruSegState *OpenSlruWrite(char *dir, int64 startPageno);
+extern void CloseSlruWrite(SlruSegState *state);
+extern char *SlruWriteSwitchPage(SlruSegState *state, uint64 pageno);
-- 
2.43.0
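
A note on the members page layout that multixact_old.c above copies from
version 17: the arithmetic can be checked in isolation. The following is a
minimal standalone sketch, not part of the patch. It assumes the default 8 kB
BLCKSZ and a 4-byte TransactionId; all SKETCH_* names, SketchMemberLoc and
sketch_locate_member() are hypothetical and exist only for this illustration.
It takes a 64-bit offset to show that the same per-page arithmetic carries
over unchanged once offsets are widened.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Assumptions: default 8 kB page, 4-byte TransactionId. */
#define SKETCH_BLCKSZ				8192
#define SKETCH_XID_SIZE				4
#define SKETCH_FLAGBYTES_PER_GROUP	4	/* one flag byte per member */
#define SKETCH_MEMBERS_PER_GROUP	4
#define SKETCH_GROUP_SIZE \
	(SKETCH_XID_SIZE * SKETCH_MEMBERS_PER_GROUP + SKETCH_FLAGBYTES_PER_GROUP)
#define SKETCH_GROUPS_PER_PAGE		(SKETCH_BLCKSZ / SKETCH_GROUP_SIZE)
#define SKETCH_MEMBERS_PER_PAGE \
	(SKETCH_GROUPS_PER_PAGE * SKETCH_MEMBERS_PER_GROUP)

/* Where one member lives: page number plus byte offsets within that page. */
typedef struct SketchMemberLoc
{
	int64_t		page;			/* members page number */
	int			flags_off;		/* byte offset of the group's flag word */
	int			bit_shift;		/* shift of this member's flag byte */
	int			member_off;		/* byte offset of this member's xid */
} SketchMemberLoc;

/* Mirrors the MXOffsetTo*() helpers, but for a 64-bit offset. */
static SketchMemberLoc
sketch_locate_member(uint64_t offset)
{
	SketchMemberLoc loc;
	uint64_t	group = offset / SKETCH_MEMBERS_PER_GROUP;
	int			member_in_group = (int) (offset % SKETCH_MEMBERS_PER_GROUP);

	loc.page = (int64_t) (offset / SKETCH_MEMBERS_PER_PAGE);
	loc.flags_off = (int) (group % SKETCH_GROUPS_PER_PAGE) * SKETCH_GROUP_SIZE;
	loc.bit_shift = member_in_group * 8;
	loc.member_off = loc.flags_off + SKETCH_FLAGBYTES_PER_GROUP +
		member_in_group * SKETCH_XID_SIZE;
	return loc;
}
```

With these assumptions a group is 20 bytes, a page holds 409 groups (1636
members), and the last member on a page ends at byte 8180, which leaves the 12
unused bytes per page that the comment in multixact_old.c mentions.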

v12-0002-Use-64-bit-multixact-offsets.patch (application/octet-stream)
From b4db0517df2510bd452ae0e82d4b707e70fa29f5 Mon Sep 17 00:00:00 2001
From: Maxim Orlov <m.orlov@postgrespro.ru>
Date: Wed, 6 Mar 2024 11:11:33 +0300
Subject: [PATCH v12 2/7] Use 64-bit multixact offsets.

Author: Maxim Orlov <orlovmg@gmail.com>
---
 src/backend/access/transam/multixact.c  | 541 ++----------------------
 src/backend/access/transam/xlog.c       |   2 +-
 src/backend/commands/vacuum.c           |   2 +-
 src/backend/postmaster/autovacuum.c     |   4 +-
 src/bin/pg_resetwal/pg_resetwal.c       |   2 +-
 src/bin/pg_resetwal/t/001_basic.pl      |   2 +-
 src/include/access/multixact.h          |   3 +-
 src/include/access/multixact_internal.h | 115 +++++
 src/include/c.h                         |   2 +-
 9 files changed, 156 insertions(+), 517 deletions(-)
 create mode 100644 src/include/access/multixact_internal.h

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 623fc8bdac..cd9db52e95 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -69,6 +69,7 @@
 #include "postgres.h"
 
 #include "access/multixact.h"
+#include "access/multixact_internal.h"
 #include "access/slru.h"
 #include "access/transam.h"
 #include "access/twophase.h"
@@ -92,130 +93,14 @@
 #include "utils/injection_point.h"
 #include "utils/memutils.h"
 
-
-/*
- * Defines for MultiXactOffset page sizes.  A page is the same BLCKSZ as is
- * used everywhere else in Postgres.
- *
- * Note: because MultiXactOffsets are 32 bits and wrap around at 0xFFFFFFFF,
- * MultiXact page numbering also wraps around at
- * 0xFFFFFFFF/MULTIXACT_OFFSETS_PER_PAGE, and segment numbering at
- * 0xFFFFFFFF/MULTIXACT_OFFSETS_PER_PAGE/SLRU_PAGES_PER_SEGMENT.  We need
- * take no explicit notice of that fact in this module, except when comparing
- * segment and page numbers in TruncateMultiXact (see
- * MultiXactOffsetPagePrecedes).
- */
-
-/* We need four bytes per offset */
-#define MULTIXACT_OFFSETS_PER_PAGE (BLCKSZ / sizeof(MultiXactOffset))
-
-static inline int64
-MultiXactIdToOffsetPage(MultiXactId multi)
-{
-	return multi / MULTIXACT_OFFSETS_PER_PAGE;
-}
-
-static inline int
-MultiXactIdToOffsetEntry(MultiXactId multi)
-{
-	return multi % MULTIXACT_OFFSETS_PER_PAGE;
-}
-
-static inline int64
-MultiXactIdToOffsetSegment(MultiXactId multi)
-{
-	return MultiXactIdToOffsetPage(multi) / SLRU_PAGES_PER_SEGMENT;
-}
-
-/*
- * The situation for members is a bit more complex: we store one byte of
- * additional flag bits for each TransactionId.  To do this without getting
- * into alignment issues, we store four bytes of flags, and then the
- * corresponding 4 Xids.  Each such 5-word (20-byte) set we call a "group", and
- * are stored as a whole in pages.  Thus, with 8kB BLCKSZ, we keep 409 groups
- * per page.  This wastes 12 bytes per page, but that's OK -- simplicity (and
- * performance) trumps space efficiency here.
- *
- * Note that the "offset" macros work with byte offset, not array indexes, so
- * arithmetic must be done using "char *" pointers.
- */
-/* We need eight bits per xact, so one xact fits in a byte */
-#define MXACT_MEMBER_BITS_PER_XACT			8
-#define MXACT_MEMBER_FLAGS_PER_BYTE			1
-#define MXACT_MEMBER_XACT_BITMASK	((1 << MXACT_MEMBER_BITS_PER_XACT) - 1)
-
-/* how many full bytes of flags are there in a group? */
-#define MULTIXACT_FLAGBYTES_PER_GROUP		4
-#define MULTIXACT_MEMBERS_PER_MEMBERGROUP	\
-	(MULTIXACT_FLAGBYTES_PER_GROUP * MXACT_MEMBER_FLAGS_PER_BYTE)
-/* size in bytes of a complete group */
-#define MULTIXACT_MEMBERGROUP_SIZE \
-	(sizeof(TransactionId) * MULTIXACT_MEMBERS_PER_MEMBERGROUP + MULTIXACT_FLAGBYTES_PER_GROUP)
-#define MULTIXACT_MEMBERGROUPS_PER_PAGE (BLCKSZ / MULTIXACT_MEMBERGROUP_SIZE)
-#define MULTIXACT_MEMBERS_PER_PAGE	\
-	(MULTIXACT_MEMBERGROUPS_PER_PAGE * MULTIXACT_MEMBERS_PER_MEMBERGROUP)
-
 /*
- * Because the number of items per page is not a divisor of the last item
- * number (member 0xFFFFFFFF), the last segment does not use the maximum number
- * of pages, and moreover the last used page therein does not use the same
- * number of items as previous pages.  (Another way to say it is that the
- * 0xFFFFFFFF member is somewhere in the middle of the last page, so the page
- * has some empty space after that item.)
- *
- * This constant is the number of members in the last page of the last segment.
+ * Multixact members warning threshold.
+ *
+ * If the difference between nextOffset and oldestOffset exceeds this value,
+ * we trigger autovacuum in order to release disk space and, if possible,
+ * reduce table bloat.
  */
-#define MAX_MEMBERS_IN_LAST_MEMBERS_PAGE \
-		((uint32) ((0xFFFFFFFF % MULTIXACT_MEMBERS_PER_PAGE) + 1))
-
-/* page in which a member is to be found */
-static inline int64
-MXOffsetToMemberPage(MultiXactOffset offset)
-{
-	return offset / MULTIXACT_MEMBERS_PER_PAGE;
-}
-
-static inline int64
-MXOffsetToMemberSegment(MultiXactOffset offset)
-{
-	return MXOffsetToMemberPage(offset) / SLRU_PAGES_PER_SEGMENT;
-}
-
-/* Location (byte offset within page) of flag word for a given member */
-static inline int
-MXOffsetToFlagsOffset(MultiXactOffset offset)
-{
-	MultiXactOffset group = offset / MULTIXACT_MEMBERS_PER_MEMBERGROUP;
-	int			grouponpg = group % MULTIXACT_MEMBERGROUPS_PER_PAGE;
-	int			byteoff = grouponpg * MULTIXACT_MEMBERGROUP_SIZE;
-
-	return byteoff;
-}
-
-static inline int
-MXOffsetToFlagsBitShift(MultiXactOffset offset)
-{
-	int			member_in_group = offset % MULTIXACT_MEMBERS_PER_MEMBERGROUP;
-	int			bshift = member_in_group * MXACT_MEMBER_BITS_PER_XACT;
-
-	return bshift;
-}
-
-/* Location (byte offset within page) of TransactionId of given member */
-static inline int
-MXOffsetToMemberOffset(MultiXactOffset offset)
-{
-	int			member_in_group = offset % MULTIXACT_MEMBERS_PER_MEMBERGROUP;
-
-	return MXOffsetToFlagsOffset(offset) +
-		MULTIXACT_FLAGBYTES_PER_GROUP +
-		member_in_group * sizeof(TransactionId);
-}
-
-/* Multixact members wraparound thresholds. */
-#define MULTIXACT_MEMBER_SAFE_THRESHOLD		(MaxMultiXactOffset / 2)
-#define MULTIXACT_MEMBER_DANGER_THRESHOLD	\
-	(MaxMultiXactOffset - MaxMultiXactOffset / 4)
+#define MULTIXACT_MEMBER_AUTOVAC_THRESHOLD		UINT64CONST(0xFFFFFFFF)
 
 static inline MultiXactId
 PreviousMultiXactId(MultiXactId multi)
@@ -260,11 +145,9 @@ typedef struct MultiXactStateData
 
 	/*
 	 * Oldest multixact offset that is potentially referenced by a multixact
-	 * referenced by a relation.  We don't always know this value, so there's
-	 * a flag here to indicate whether or not we currently do.
+	 * referenced by a relation.
 	 */
 	MultiXactOffset oldestOffset;
-	bool		oldestOffsetKnown;
 
 	/* support for anti-wraparound measures */
 	MultiXactId multiVacLimit;
@@ -272,9 +155,6 @@ typedef struct MultiXactStateData
 	MultiXactId multiStopLimit;
 	MultiXactId multiWrapLimit;
 
-	/* support for members anti-wraparound measures */
-	MultiXactOffset offsetStopLimit;	/* known if oldestOffsetKnown */
-
 	/*
 	 * This is used to sleep until a multixact offset is written when we want
 	 * to create the next one.
@@ -409,10 +289,8 @@ static bool MultiXactOffsetPrecedes(MultiXactOffset offset1,
 									MultiXactOffset offset2);
 static void ExtendMultiXactOffset(MultiXactId multi);
 static void ExtendMultiXactMember(MultiXactOffset offset, int nmembers);
-static bool MultiXactOffsetWouldWrap(MultiXactOffset boundary,
-									 MultiXactOffset start, uint32 distance);
 static bool SetOffsetVacuumLimit(bool is_startup);
-static bool find_multixact_start(MultiXactId multi, MultiXactOffset *result);
+static MultiXactOffset find_multixact_start(MultiXactId multi);
 static void WriteMZeroPageXlogRec(int64 pageno, uint8 info);
 static void WriteMTruncateXlogRec(Oid oldestMultiDB,
 								  MultiXactId startTruncOff,
@@ -1054,9 +932,7 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 	 * against catastrophic data loss due to multixact wraparound.  The basic
 	 * rules are:
 	 *
-	 * If we're past multiVacLimit or the safe threshold for member storage
-	 * space, or we don't know what the safe threshold for member storage is,
-	 * start trying to force autovacuum cycles.
+	 * If we're past multiVacLimit, start trying to force autovacuum cycles.
 	 * If we're past multiWarnLimit, start issuing warnings.
 	 * If we're past multiStopLimit, refuse to create new MultiXactIds.
 	 *
@@ -1151,90 +1027,10 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 	ExtendMultiXactOffset(result);
 
 	/*
-	 * Reserve the members space, similarly to above.  Also, be careful not to
-	 * return zero as the starting offset for any multixact. See
-	 * GetMultiXactIdMembers() for motivation.
+	 * Reserve the members space, similarly to above.
 	 */
 	nextOffset = MultiXactState->nextOffset;
-	if (nextOffset == 0)
-	{
-		*offset = 1;
-		nmembers++;				/* allocate member slot 0 too */
-	}
-	else
-		*offset = nextOffset;
-
-	/*----------
-	 * Protect against overrun of the members space as well, with the
-	 * following rules:
-	 *
-	 * If we're past offsetStopLimit, refuse to generate more multis.
-	 * If we're close to offsetStopLimit, emit a warning.
-	 *
-	 * Arbitrarily, we start emitting warnings when we're 20 segments or less
-	 * from offsetStopLimit.
-	 *
-	 * Note we haven't updated the shared state yet, so if we fail at this
-	 * point, the multixact ID we grabbed can still be used by the next guy.
-	 *
-	 * Note that there is no point in forcing autovacuum runs here: the
-	 * multixact freeze settings would have to be reduced for that to have any
-	 * effect.
-	 *----------
-	 */
-#define OFFSET_WARN_SEGMENTS	20
-	if (MultiXactState->oldestOffsetKnown &&
-		MultiXactOffsetWouldWrap(MultiXactState->offsetStopLimit, nextOffset,
-								 nmembers))
-	{
-		/* see comment in the corresponding offsets wraparound case */
-		SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER);
-
-		ereport(ERROR,
-				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg("multixact \"members\" limit exceeded"),
-				 errdetail_plural("This command would create a multixact with %u members, but the remaining space is only enough for %u member.",
-								  "This command would create a multixact with %u members, but the remaining space is only enough for %u members.",
-								  MultiXactState->offsetStopLimit - nextOffset - 1,
-								  nmembers,
-								  MultiXactState->offsetStopLimit - nextOffset - 1),
-				 errhint("Execute a database-wide VACUUM in database with OID %u with reduced \"vacuum_multixact_freeze_min_age\" and \"vacuum_multixact_freeze_table_age\" settings.",
-						 MultiXactState->oldestMultiXactDB)));
-	}
-
-	/*
-	 * Check whether we should kick autovacuum into action, to prevent members
-	 * wraparound. NB we use a much larger window to trigger autovacuum than
-	 * just the warning limit. The warning is just a measure of last resort -
-	 * this is in line with GetNewTransactionId's behaviour.
-	 */
-	if (!MultiXactState->oldestOffsetKnown ||
-		(MultiXactState->nextOffset - MultiXactState->oldestOffset
-		 > MULTIXACT_MEMBER_SAFE_THRESHOLD))
-	{
-		/*
-		 * To avoid swamping the postmaster with signals, we issue the autovac
-		 * request only when crossing a segment boundary. With default
-		 * compilation settings that's roughly after 50k members.  This still
-		 * gives plenty of chances before we get into real trouble.
-		 */
-		if ((MXOffsetToMemberPage(nextOffset) / SLRU_PAGES_PER_SEGMENT) !=
-			(MXOffsetToMemberPage(nextOffset + nmembers) / SLRU_PAGES_PER_SEGMENT))
-			SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER);
-	}
-
-	if (MultiXactState->oldestOffsetKnown &&
-		MultiXactOffsetWouldWrap(MultiXactState->offsetStopLimit,
-								 nextOffset,
-								 nmembers + MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT * OFFSET_WARN_SEGMENTS))
-		ereport(WARNING,
-				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg_plural("database with OID %u must be vacuumed before %d more multixact member is used",
-							   "database with OID %u must be vacuumed before %d more multixact members are used",
-							   MultiXactState->offsetStopLimit - nextOffset + nmembers,
-							   MultiXactState->oldestMultiXactDB,
-							   MultiXactState->offsetStopLimit - nextOffset + nmembers),
-				 errhint("Execute a database-wide VACUUM in that database with reduced \"vacuum_multixact_freeze_min_age\" and \"vacuum_multixact_freeze_table_age\" settings.")));
+	*offset = nextOffset;
 
 	ExtendMultiXactMember(nextOffset, nmembers);
 
@@ -2620,22 +2416,9 @@ ExtendMultiXactMember(MultiXactOffset offset, int nmembers)
 		}
 
 		/*
-		 * Compute the number of items till end of current page.  Careful: if
-		 * addition of unsigned ints wraps around, we're at the last page of
-		 * the last segment; since that page holds a different number of items
-		 * than other pages, we need to do it differently.
+		 * Compute the number of items till end of current page.
 		 */
-		if (offset + MAX_MEMBERS_IN_LAST_MEMBERS_PAGE < offset)
-		{
-			/*
-			 * This is the last page of the last segment; we can compute the
-			 * number of items left to allocate in it without modulo
-			 * arithmetic.
-			 */
-			difference = MaxMultiXactOffset - offset + 1;
-		}
-		else
-			difference = MULTIXACT_MEMBERS_PER_PAGE - offset % MULTIXACT_MEMBERS_PER_PAGE;
+		difference = MULTIXACT_MEMBERS_PER_PAGE - offset % MULTIXACT_MEMBERS_PER_PAGE;
 
 		/*
 		 * Advance to next page, taking care to properly handle the wraparound
@@ -2701,15 +2484,13 @@ GetOldestMultiXactId(void)
 }
 
 /*
- * Determine how aggressively we need to vacuum in order to prevent member
- * wraparound.
+ * Determine whether we need to vacuum in order to reclaim member space.
  *
  * To do so determine what's the oldest member offset and install the limit
  * info in MultiXactState, where it can be used to prevent overrun of old data
  * in the members SLRU area.
  *
- * The return value is true if emergency autovacuum is required and false
- * otherwise.
+ * The return value is true if autovacuum is required and false otherwise.
  */
 static bool
 SetOffsetVacuumLimit(bool is_startup)
@@ -2717,12 +2498,7 @@ SetOffsetVacuumLimit(bool is_startup)
 	MultiXactId oldestMultiXactId;
 	MultiXactId nextMXact;
 	MultiXactOffset oldestOffset = 0;	/* placate compiler */
-	MultiXactOffset prevOldestOffset;
 	MultiXactOffset nextOffset;
-	bool		oldestOffsetKnown = false;
-	bool		prevOldestOffsetKnown;
-	MultiXactOffset offsetStopLimit = 0;
-	MultiXactOffset prevOffsetStopLimit;
 
 	/*
 	 * NB: Have to prevent concurrent truncation, we might otherwise try to
@@ -2735,9 +2511,6 @@ SetOffsetVacuumLimit(bool is_startup)
 	oldestMultiXactId = MultiXactState->oldestMultiXactId;
 	nextMXact = MultiXactState->nextMXact;
 	nextOffset = MultiXactState->nextOffset;
-	prevOldestOffsetKnown = MultiXactState->oldestOffsetKnown;
-	prevOldestOffset = MultiXactState->oldestOffset;
-	prevOffsetStopLimit = MultiXactState->offsetStopLimit;
 	Assert(MultiXactState->finishedStartup);
 	LWLockRelease(MultiXactGenLock);
 
@@ -2755,139 +2528,31 @@ SetOffsetVacuumLimit(bool is_startup)
 		 * offset.
 		 */
 		oldestOffset = nextOffset;
-		oldestOffsetKnown = true;
 	}
 	else
-	{
-		/*
-		 * Figure out where the oldest existing multixact's offsets are
-		 * stored. Due to bugs in early release of PostgreSQL 9.3.X and 9.4.X,
-		 * the supposedly-earliest multixact might not really exist.  We are
-		 * careful not to fail in that case.
-		 */
-		oldestOffsetKnown =
-			find_multixact_start(oldestMultiXactId, &oldestOffset);
-
-		if (oldestOffsetKnown)
-			ereport(DEBUG1,
-					(errmsg_internal("oldest MultiXactId member is at offset %u",
-									 oldestOffset)));
-		else
-			ereport(LOG,
-					(errmsg("MultiXact member wraparound protections are disabled because oldest checkpointed MultiXact %u does not exist on disk",
-							oldestMultiXactId)));
-	}
+		oldestOffset = find_multixact_start(oldestMultiXactId);
 
 	LWLockRelease(MultiXactTruncationLock);
 
-	/*
-	 * If we can, compute limits (and install them MultiXactState) to prevent
-	 * overrun of old data in the members SLRU area. We can only do so if the
-	 * oldest offset is known though.
-	 */
-	if (oldestOffsetKnown)
-	{
-		/* move back to start of the corresponding segment */
-		offsetStopLimit = oldestOffset - (oldestOffset %
-										  (MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT));
-
-		/* always leave one segment before the wraparound point */
-		offsetStopLimit -= (MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT);
-
-		if (!prevOldestOffsetKnown && !is_startup)
-			ereport(LOG,
-					(errmsg("MultiXact member wraparound protections are now enabled")));
-
-		ereport(DEBUG1,
-				(errmsg_internal("MultiXact member stop limit is now %u based on MultiXact %u",
-								 offsetStopLimit, oldestMultiXactId)));
-	}
-	else if (prevOldestOffsetKnown)
-	{
-		/*
-		 * If we failed to get the oldest offset this time, but we have a
-		 * value from a previous pass through this function, use the old
-		 * values rather than automatically forcing an emergency autovacuum
-		 * cycle again.
-		 */
-		oldestOffset = prevOldestOffset;
-		oldestOffsetKnown = true;
-		offsetStopLimit = prevOffsetStopLimit;
-	}
-
 	/* Install the computed values */
 	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
 	MultiXactState->oldestOffset = oldestOffset;
-	MultiXactState->oldestOffsetKnown = oldestOffsetKnown;
-	MultiXactState->offsetStopLimit = offsetStopLimit;
 	LWLockRelease(MultiXactGenLock);
 
 	/*
-	 * Do we need an emergency autovacuum?	If we're not sure, assume yes.
-	 */
-	return !oldestOffsetKnown ||
-		(nextOffset - oldestOffset > MULTIXACT_MEMBER_SAFE_THRESHOLD);
-}
-
-/*
- * Return whether adding "distance" to "start" would move past "boundary".
- *
- * We use this to determine whether the addition is "wrapping around" the
- * boundary point, hence the name.  The reason we don't want to use the regular
- * 2^31-modulo arithmetic here is that we want to be able to use the whole of
- * the 2^32-1 space here, allowing for more multixacts than would fit
- * otherwise.
- */
-static bool
-MultiXactOffsetWouldWrap(MultiXactOffset boundary, MultiXactOffset start,
-						 uint32 distance)
-{
-	MultiXactOffset finish;
-
-	/*
-	 * Note that offset number 0 is not used (see GetMultiXactIdMembers), so
-	 * if the addition wraps around the UINT_MAX boundary, skip that value.
-	 */
-	finish = start + distance;
-	if (finish < start)
-		finish++;
-
-	/*-----------------------------------------------------------------------
-	 * When the boundary is numerically greater than the starting point, any
-	 * value numerically between the two is not wrapped:
-	 *
-	 *	<----S----B---->
-	 *	[---)			 = F wrapped past B (and UINT_MAX)
-	 *		 [---)		 = F not wrapped
-	 *			  [----] = F wrapped past B
-	 *
-	 * When the boundary is numerically less than the starting point (i.e. the
-	 * UINT_MAX wraparound occurs somewhere in between) then all values in
-	 * between are wrapped:
-	 *
-	 *	<----B----S---->
-	 *	[---)			 = F not wrapped past B (but wrapped past UINT_MAX)
-	 *		 [---)		 = F wrapped past B (and UINT_MAX)
-	 *			  [----] = F not wrapped
-	 *-----------------------------------------------------------------------
+	 * Do we need autovacuum?
 	 */
-	if (start < boundary)
-		return finish >= boundary || finish < start;
-	else
-		return finish >= boundary && finish < start;
+	return (nextOffset - oldestOffset > MULTIXACT_MEMBER_AUTOVAC_THRESHOLD);
 }
 
 /*
  * Find the starting offset of the given MultiXactId.
  *
- * Returns false if the file containing the multi does not exist on disk.
- * Otherwise, returns true and sets *result to the starting member offset.
- *
  * This function does not prevent concurrent truncation, so if that's
  * required, the caller has to protect against that.
  */
-static bool
-find_multixact_start(MultiXactId multi, MultiXactOffset *result)
+static MultiXactOffset
+find_multixact_start(MultiXactId multi)
 {
 	MultiXactOffset offset;
 	int64		pageno;
@@ -2900,15 +2565,6 @@ find_multixact_start(MultiXactId multi, MultiXactOffset *result)
 	pageno = MultiXactIdToOffsetPage(multi);
 	entryno = MultiXactIdToOffsetEntry(multi);
 
-	/*
-	 * Write out dirty data, so PhysicalPageExists can work correctly.
-	 */
-	SimpleLruWriteAll(MultiXactOffsetCtl, true);
-	SimpleLruWriteAll(MultiXactMemberCtl, true);
-
-	if (!SimpleLruDoesPhysicalPageExist(MultiXactOffsetCtl, pageno))
-		return false;
-
 	/* lock is acquired by SimpleLruReadPage_ReadOnly */
 	slotno = SimpleLruReadPage_ReadOnly(MultiXactOffsetCtl, pageno, multi);
 	offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
@@ -2916,102 +2572,7 @@ find_multixact_start(MultiXactId multi, MultiXactOffset *result)
 	offset = *offptr;
 	LWLockRelease(SimpleLruGetBankLock(MultiXactOffsetCtl, pageno));
 
-	*result = offset;
-	return true;
-}
-
-/*
- * Determine how many multixacts, and how many multixact members, currently
- * exist.  Return false if unable to determine.
- */
-static bool
-ReadMultiXactCounts(uint32 *multixacts, MultiXactOffset *members)
-{
-	MultiXactOffset nextOffset;
-	MultiXactOffset oldestOffset;
-	MultiXactId oldestMultiXactId;
-	MultiXactId nextMultiXactId;
-	bool		oldestOffsetKnown;
-
-	LWLockAcquire(MultiXactGenLock, LW_SHARED);
-	nextOffset = MultiXactState->nextOffset;
-	oldestMultiXactId = MultiXactState->oldestMultiXactId;
-	nextMultiXactId = MultiXactState->nextMXact;
-	oldestOffset = MultiXactState->oldestOffset;
-	oldestOffsetKnown = MultiXactState->oldestOffsetKnown;
-	LWLockRelease(MultiXactGenLock);
-
-	if (!oldestOffsetKnown)
-		return false;
-
-	*members = nextOffset - oldestOffset;
-	*multixacts = nextMultiXactId - oldestMultiXactId;
-	return true;
-}
-
-/*
- * Multixact members can be removed once the multixacts that refer to them
- * are older than every datminmxid.  autovacuum_multixact_freeze_max_age and
- * vacuum_multixact_freeze_table_age work together to make sure we never have
- * too many multixacts; we hope that, at least under normal circumstances,
- * this will also be sufficient to keep us from using too many offsets.
- * However, if the average multixact has many members, we might exhaust the
- * members space while still using few enough members that these limits fail
- * to trigger relminmxid advancement by VACUUM.  At that point, we'd have no
- * choice but to start failing multixact-creating operations with an error.
- *
- * To prevent that, if more than a threshold portion of the members space is
- * used, we effectively reduce autovacuum_multixact_freeze_max_age and
- * to a value just less than the number of multixacts in use.  We hope that
- * this will quickly trigger autovacuuming on the table or tables with the
- * oldest relminmxid, thus allowing datminmxid values to advance and removing
- * some members.
- *
- * As the fraction of the member space currently in use grows, we become
- * more aggressive in clamping this value.  That not only causes autovacuum
- * to ramp up, but also makes any manual vacuums the user issues more
- * aggressive.  This happens because vacuum_get_cutoffs() will clamp the
- * freeze table and the minimum freeze age cutoffs based on the effective
- * autovacuum_multixact_freeze_max_age this function returns.  In the worst
- * case, we'll claim the freeze_max_age to zero, and every vacuum of any
- * table will freeze every multixact.
- */
-int
-MultiXactMemberFreezeThreshold(void)
-{
-	MultiXactOffset members;
-	uint32		multixacts;
-	uint32		victim_multixacts;
-	double		fraction;
-	int			result;
-
-	/* If we can't determine member space utilization, assume the worst. */
-	if (!ReadMultiXactCounts(&multixacts, &members))
-		return 0;
-
-	/* If member space utilization is low, no special action is required. */
-	if (members <= MULTIXACT_MEMBER_SAFE_THRESHOLD)
-		return autovacuum_multixact_freeze_max_age;
-
-	/*
-	 * Compute a target for relminmxid advancement.  The number of multixacts
-	 * we try to eliminate from the system is based on how far we are past
-	 * MULTIXACT_MEMBER_SAFE_THRESHOLD.
-	 */
-	fraction = (double) (members - MULTIXACT_MEMBER_SAFE_THRESHOLD) /
-		(MULTIXACT_MEMBER_DANGER_THRESHOLD - MULTIXACT_MEMBER_SAFE_THRESHOLD);
-	victim_multixacts = multixacts * fraction;
-
-	/* fraction could be > 1.0, but lowest possible freeze age is zero */
-	if (victim_multixacts > multixacts)
-		return 0;
-	result = multixacts - victim_multixacts;
-
-	/*
-	 * Clamp to autovacuum_multixact_freeze_max_age, so that we never make
-	 * autovacuum less aggressive than it would otherwise be.
-	 */
-	return Min(result, autovacuum_multixact_freeze_max_age);
+	return offset;
 }
 
 typedef struct mxtruncinfo
@@ -3039,37 +2600,13 @@ SlruScanDirCbFindEarliest(SlruCtl ctl, char *filename, int64 segpage, void *data
 
 
 /*
- * Delete members segments [oldest, newOldest)
- *
- * The members SLRU can, in contrast to the offsets one, be filled to almost
- * the full range at once. This means SimpleLruTruncate() can't trivially be
- * used - instead the to-be-deleted range is computed using the offsets
- * SLRU. C.f. TruncateMultiXact().
+ * Delete members segments before the newOldestOffset.
  */
 static void
-PerformMembersTruncation(MultiXactOffset oldestOffset, MultiXactOffset newOldestOffset)
+PerformMembersTruncation(MultiXactOffset newOldestOffset)
 {
-	const int64 maxsegment = MXOffsetToMemberSegment(MaxMultiXactOffset);
-	int64		startsegment = MXOffsetToMemberSegment(oldestOffset);
-	int64		endsegment = MXOffsetToMemberSegment(newOldestOffset);
-	int64		segment = startsegment;
-
-	/*
-	 * Delete all the segments but the last one. The last segment can still
-	 * contain, possibly partially, valid data.
-	 */
-	while (segment != endsegment)
-	{
-		elog(DEBUG2, "truncating multixact members segment %llx",
-			 (unsigned long long) segment);
-		SlruDeleteSegment(MultiXactMemberCtl, segment);
-
-		/* move to next segment, handling wraparound correctly */
-		if (segment == maxsegment)
-			segment = 0;
-		else
-			segment += 1;
-	}
+	SimpleLruTruncate(MultiXactMemberCtl,
+					  MXOffsetToMemberPage(newOldestOffset));
 }
 
 /*
@@ -3174,23 +2711,15 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 	/*
 	 * First, compute the safe truncation point for MultiXactMember. This is
 	 * the starting offset of the oldest multixact.
-	 *
-	 * Hopefully, find_multixact_start will always work here, because we've
-	 * already checked that it doesn't precede the earliest MultiXact on disk.
-	 * But if it fails, don't truncate anything, and log a message.
 	 */
 	if (oldestMulti == nextMulti)
 	{
 		/* there are NO MultiXacts */
 		oldestOffset = nextOffset;
 	}
-	else if (!find_multixact_start(oldestMulti, &oldestOffset))
+	else
 	{
-		ereport(LOG,
-				(errmsg("oldest MultiXact %u not found, earliest MultiXact %u, skipping truncation",
-						oldestMulti, earliest)));
-		LWLockRelease(MultiXactTruncationLock);
-		return;
+		oldestOffset = find_multixact_start(oldestMulti);
 	}
 
 	/*
@@ -3202,13 +2731,9 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 		/* there are NO MultiXacts */
 		newOldestOffset = nextOffset;
 	}
-	else if (!find_multixact_start(newOldestMulti, &newOldestOffset))
+	else
 	{
-		ereport(LOG,
-				(errmsg("cannot truncate up to MultiXact %u because it does not exist on disk, skipping truncation",
-						newOldestMulti)));
-		LWLockRelease(MultiXactTruncationLock);
-		return;
+		newOldestOffset = find_multixact_start(newOldestMulti);
 	}
 
 	elog(DEBUG1, "performing multixact truncation: "
@@ -3258,7 +2783,7 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 	LWLockRelease(MultiXactGenLock);
 
 	/* First truncate members */
-	PerformMembersTruncation(oldestOffset, newOldestOffset);
+	PerformMembersTruncation(newOldestOffset);
 
 	/* Then offsets */
 	PerformOffsetsTruncation(oldestMulti, newOldestMulti);
@@ -3345,7 +2870,7 @@ MultiXactIdPrecedesOrEquals(MultiXactId multi1, MultiXactId multi2)
 static bool
 MultiXactOffsetPrecedes(MultiXactOffset offset1, MultiXactOffset offset2)
 {
-	int32		diff = (int32) (offset1 - offset2);
+	int64		diff = (int64) (offset1 - offset2);
 
 	return (diff < 0);
 }
@@ -3492,7 +3017,7 @@ multixact_redo(XLogReaderState *record)
 		 */
 		SetMultiXactIdLimit(xlrec.endTruncOff, xlrec.oldestMultiDB, false);
 
-		PerformMembersTruncation(xlrec.startTruncMemb, xlrec.endTruncMemb);
+		PerformMembersTruncation(xlrec.endTruncMemb);
 
 		/*
 		 * During XLOG replay, latest_page_number isn't necessarily set up
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index bf3dbda901..a813a090fa 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5083,7 +5083,7 @@ BootStrapXLOG(uint32 data_checksum_version)
 		FullTransactionIdFromEpochAndXid(0, FirstNormalTransactionId);
 	checkPoint.nextOid = FirstGenbkiObjectId;
 	checkPoint.nextMulti = FirstMultiXactId;
-	checkPoint.nextMultiOffset = 0;
+	checkPoint.nextMultiOffset = 1;
 	checkPoint.oldestXid = FirstNormalTransactionId;
 	checkPoint.oldestXidDB = Template1DbOid;
 	checkPoint.oldestMulti = FirstMultiXactId;
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index e6745e6145..c96fbf004d 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -1134,7 +1134,7 @@ vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
 	 * normally autovacuum_multixact_freeze_max_age, but may be less if we are
 	 * short of multixact member space.
 	 */
-	effective_multixact_freeze_max_age = MultiXactMemberFreezeThreshold();
+	effective_multixact_freeze_max_age = autovacuum_multixact_freeze_max_age;
 
 	/*
 	 * Almost ready to set freeze output parameters; check if OldestXmin or
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 0ab921a169..ed5fc09c38 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -1134,7 +1134,7 @@ do_start_worker(void)
 
 	/* Also determine the oldest datminmxid we will consider. */
 	recentMulti = ReadNextMultiXactId();
-	multiForceLimit = recentMulti - MultiXactMemberFreezeThreshold();
+	multiForceLimit = recentMulti - autovacuum_multixact_freeze_max_age;
 	if (multiForceLimit < FirstMultiXactId)
 		multiForceLimit -= FirstMultiXactId;
 
@@ -1922,7 +1922,7 @@ do_autovacuum(void)
 	 * normally autovacuum_multixact_freeze_max_age, but may be less if we are
 	 * short of multixact member space.
 	 */
-	effective_multixact_freeze_max_age = MultiXactMemberFreezeThreshold();
+	effective_multixact_freeze_max_age = autovacuum_multixact_freeze_max_age;
 
 	/*
 	 * Find the pg_database entry and select the default freeze ages. We use
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index fff401e469..4ad64cf1ed 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -264,7 +264,7 @@ main(int argc, char *argv[])
 
 			case 'O':
 				errno = 0;
-				set_mxoff = strtoul(optarg, &endptr, 0);
+				set_mxoff = strtou64(optarg, &endptr, 0);
 				if (endptr == optarg || *endptr != '\0' || errno != 0)
 				{
 					pg_log_error("invalid argument for option %s", "-O");
diff --git a/src/bin/pg_resetwal/t/001_basic.pl b/src/bin/pg_resetwal/t/001_basic.pl
index d0bd1f7ace..79d03d79de 100644
--- a/src/bin/pg_resetwal/t/001_basic.pl
+++ b/src/bin/pg_resetwal/t/001_basic.pl
@@ -206,7 +206,7 @@ push @cmd,
   sprintf("%d,%d", hex($files[0]) == 0 ? 3 : hex($files[0]), hex($files[-1]));
 
 @files = get_slru_files('pg_multixact/offsets');
-$mult = 32 * $blcksz / 4;
+$mult = 32 * $blcksz / 8;
 # -m argument is "new,old"
 push @cmd, '-m',
   sprintf("%d,%d",
diff --git a/src/include/access/multixact.h b/src/include/access/multixact.h
index 4e6b0eec2f..5ee632dfe6 100644
--- a/src/include/access/multixact.h
+++ b/src/include/access/multixact.h
@@ -27,7 +27,7 @@
 
 #define MultiXactIdIsValid(multi) ((multi) != InvalidMultiXactId)
 
-#define MaxMultiXactOffset	((MultiXactOffset) 0xFFFFFFFF)
+#define MaxMultiXactOffset	UINT64CONST(0xFFFFFFFFFFFFFFFF)
 
 /*
  * Possible multixact lock modes ("status").  The first four modes are for
@@ -143,7 +143,6 @@ extern void MultiXactSetNextMXact(MultiXactId nextMulti,
 extern void MultiXactAdvanceNextMXact(MultiXactId minMulti,
 									  MultiXactOffset minMultiOffset);
 extern void MultiXactAdvanceOldest(MultiXactId oldestMulti, Oid oldestMultiDB);
-extern int	MultiXactMemberFreezeThreshold(void);
 
 extern void multixact_twophase_recover(TransactionId xid, uint16 info,
 									   void *recdata, uint32 len);
diff --git a/src/include/access/multixact_internal.h b/src/include/access/multixact_internal.h
new file mode 100644
index 0000000000..39e74a21c7
--- /dev/null
+++ b/src/include/access/multixact_internal.h
@@ -0,0 +1,115 @@
+/*
+ * multixact_internal.h
+ *
+ * Internal definitions for the on-disk format of multixact manager.
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/multixact_internal.h
+ */
+#ifndef MULTIXACT_INTERNAL_H
+#define MULTIXACT_INTERNAL_H
+
+/* FIXME: had to duplicate this */
+#define SLRU_PAGES_PER_SEGMENT	32
+
+/*
+ * Defines for MultiXactOffset page sizes.  A page is the same BLCKSZ as is
+ * used everywhere else in Postgres.
+ */
+
+/* We need eight bytes per offset */
+#define MULTIXACT_OFFSETS_PER_PAGE (BLCKSZ / sizeof(MultiXactOffset))
+
+static inline int64
+MultiXactIdToOffsetPage(MultiXactId multi)
+{
+	return multi / MULTIXACT_OFFSETS_PER_PAGE;
+}
+
+static inline int
+MultiXactIdToOffsetEntry(MultiXactId multi)
+{
+	return multi % MULTIXACT_OFFSETS_PER_PAGE;
+}
+
+static inline int64
+MultiXactIdToOffsetSegment(MultiXactId multi)
+{
+	return MultiXactIdToOffsetPage(multi) / SLRU_PAGES_PER_SEGMENT;
+}
+
+/*
+ * The situation for members is a bit more complex: we store one byte of
+ * additional flag bits for each TransactionId.  To do this without getting
+ * into alignment issues, we store four bytes of flags, and then the
+ * corresponding 4 Xids.  Each such 5-word (20-byte) set we call a "group", and
+ * are stored as a whole in pages.  Thus, with 8kB BLCKSZ, we keep 409 groups
+ * per page.  This wastes 12 bytes per page, but that's OK -- simplicity (and
+ * performance) trumps space efficiency here.
+ *
+ * Note that the "offset" macros work with byte offset, not array indexes, so
+ * arithmetic must be done using "char *" pointers.
+ */
+/* We need eight bits per xact, so one xact fits in a byte */
+#define MXACT_MEMBER_BITS_PER_XACT			8
+#define MXACT_MEMBER_FLAGS_PER_BYTE			1
+#define MXACT_MEMBER_XACT_BITMASK	((1 << MXACT_MEMBER_BITS_PER_XACT) - 1)
+
+/* how many full bytes of flags are there in a group? */
+#define MULTIXACT_FLAGBYTES_PER_GROUP		4
+#define MULTIXACT_MEMBERS_PER_MEMBERGROUP	\
+	(MULTIXACT_FLAGBYTES_PER_GROUP * MXACT_MEMBER_FLAGS_PER_BYTE)
+/* size in bytes of a complete group */
+#define MULTIXACT_MEMBERGROUP_SIZE \
+	(sizeof(TransactionId) * MULTIXACT_MEMBERS_PER_MEMBERGROUP + MULTIXACT_FLAGBYTES_PER_GROUP)
+#define MULTIXACT_MEMBERGROUPS_PER_PAGE (BLCKSZ / MULTIXACT_MEMBERGROUP_SIZE)
+#define MULTIXACT_MEMBERS_PER_PAGE	\
+	(MULTIXACT_MEMBERGROUPS_PER_PAGE * MULTIXACT_MEMBERS_PER_MEMBERGROUP)
+
+/* page in which a member is to be found */
+static inline int64
+MXOffsetToMemberPage(MultiXactOffset offset)
+{
+	return offset / MULTIXACT_MEMBERS_PER_PAGE;
+}
+
+static inline int64
+MXOffsetToMemberSegment(MultiXactOffset offset)
+{
+	return MXOffsetToMemberPage(offset) / SLRU_PAGES_PER_SEGMENT;
+}
+
+/* Location (byte offset within page) of flag word for a given member */
+static inline int
+MXOffsetToFlagsOffset(MultiXactOffset offset)
+{
+	MultiXactOffset group = offset / MULTIXACT_MEMBERS_PER_MEMBERGROUP;
+	int			grouponpg = group % MULTIXACT_MEMBERGROUPS_PER_PAGE;
+	int			byteoff = grouponpg * MULTIXACT_MEMBERGROUP_SIZE;
+
+	return byteoff;
+}
+
+static inline int
+MXOffsetToFlagsBitShift(MultiXactOffset offset)
+{
+	int			member_in_group = offset % MULTIXACT_MEMBERS_PER_MEMBERGROUP;
+	int			bshift = member_in_group * MXACT_MEMBER_BITS_PER_XACT;
+
+	return bshift;
+}
+
+/* Location (byte offset within page) of TransactionId of given member */
+static inline int
+MXOffsetToMemberOffset(MultiXactOffset offset)
+{
+	int			member_in_group = offset % MULTIXACT_MEMBERS_PER_MEMBERGROUP;
+
+	return MXOffsetToFlagsOffset(offset) +
+		MULTIXACT_FLAGBYTES_PER_GROUP +
+		member_in_group * sizeof(TransactionId);
+}
+
+#endif							/* MULTIXACT_INTERNAL_H */
diff --git a/src/include/c.h b/src/include/c.h
index a14c631516..318194f78d 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -618,7 +618,7 @@ typedef uint32 SubTransactionId;
 /* MultiXactId must be equivalent to TransactionId, to fit in t_xmax */
 typedef TransactionId MultiXactId;
 
-typedef uint32 MultiXactOffset;
+typedef uint64 MultiXactOffset;
 
 typedef uint32 CommandId;
 
-- 
2.43.0
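
Note for readers of the multixact_internal.h hunk above: the member-page
arithmetic can be checked outside the tree. What follows is a minimal
standalone sketch, not part of the patch. It assumes the default BLCKSZ of
8192 and a 4-byte TransactionId, and the lower-case helper names are made up
for illustration; they only mirror the MXOffsetTo*() inline functions.

```c
#include <assert.h>
#include <stdint.h>

/* Assumptions for this sketch: default 8 kB block, 4-byte TransactionId. */
#define BLCKSZ 8192
typedef uint32_t TransactionId;
typedef uint64_t MultiXactOffset;

#define FLAGBYTES_PER_GROUP 4
#define MEMBERS_PER_GROUP   4	/* one flag byte per member */
/* 4 flag bytes followed by 4 TransactionIds: 20 bytes */
#define GROUP_SIZE \
	((int) sizeof(TransactionId) * MEMBERS_PER_GROUP + FLAGBYTES_PER_GROUP)
#define GROUPS_PER_PAGE  (BLCKSZ / GROUP_SIZE)
#define MEMBERS_PER_PAGE (GROUPS_PER_PAGE * MEMBERS_PER_GROUP)

/* Page on which a member lives; 64-bit because the offset is 64-bit. */
static inline int64_t
member_page(MultiXactOffset offset)
{
	return (int64_t) (offset / MEMBERS_PER_PAGE);
}

/* Byte offset within the page of the flag bytes of the member's group. */
static inline int
flags_offset(MultiXactOffset offset)
{
	MultiXactOffset group = offset / MEMBERS_PER_GROUP;
	int			grouponpg = (int) (group % GROUPS_PER_PAGE);

	return grouponpg * GROUP_SIZE;
}

/* Bit shift of the member's flag byte within the group's flag bytes. */
static inline int
flags_bit_shift(MultiXactOffset offset)
{
	return (int) (offset % MEMBERS_PER_GROUP) * 8;
}

/* Byte offset within the page of the member's TransactionId. */
static inline int
member_offset(MultiXactOffset offset)
{
	int			member_in_group = (int) (offset % MEMBERS_PER_GROUP);
	int			result = flags_offset(offset) + FLAGBYTES_PER_GROUP +
		member_in_group * (int) sizeof(TransactionId);

	/* the xid must fit entirely inside the page */
	assert(result + (int) sizeof(TransactionId) <= BLCKSZ);
	return result;
}
```

Under these assumptions a group is 20 bytes and a page holds 409 groups, that
is 1636 members with 12 bytes left unused. Member 5 lands at byte 28 of page
0, and member 1636 is the first member of page 1.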

v12-0005-TEST-add-src-bin-pg_upgrade-t-005_offset.pl.patch (application/octet-stream)
From c4e9dcf1525065e155a9d625590afc9ff3e5655c Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Tue, 19 Nov 2024 17:08:10 +0300
Subject: [PATCH v12 5/7] TEST: add src/bin/pg_upgrade/t/005_offset.pl

---
 src/bin/pg_upgrade/t/005_offset.pl | 563 +++++++++++++++++++++++++++++
 1 file changed, 563 insertions(+)
 create mode 100644 src/bin/pg_upgrade/t/005_offset.pl
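
Note on the "offsets wrap" cases in this test: they start the cluster at
offset 0xFFFFFC00, which is 1024 below 2^32. The sketch below is illustrative
only and not part of the patch. It uses plain modular arithmetic and
hypothetical helper names to show what happens to the next offset once more
than 1024 members are added, first with a 32-bit offset and then with a
64-bit one. It deliberately ignores any special handling the server applies
around offset 0.

```c
#include <assert.h>
#include <stdint.h>

/* Next offset when offsets are 32 bits wide: wraps modulo 2^32. */
static inline uint32_t
advance_offset32(uint32_t next, uint32_t nmembers)
{
	return (uint32_t) (next + nmembers);
}

/* Next offset when offsets are 64 bits wide: keeps growing. */
static inline uint64_t
advance_offset64(uint64_t next, uint32_t nmembers)
{
	uint64_t	result = next + nmembers;

	/* with 64 bits, overflow is not reachable from these start values */
	assert(result >= next);
	return result;
}
```

Starting from 0xFFFFFC00 and adding 2048 members, the 32-bit variant yields
1024 while the 64-bit variant yields 0x100000400. The old-cluster side of the
pg_upgrade tests exercises the first behaviour and the new cluster the second.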

diff --git a/src/bin/pg_upgrade/t/005_offset.pl b/src/bin/pg_upgrade/t/005_offset.pl
new file mode 100644
index 0000000000..df84186de4
--- /dev/null
+++ b/src/bin/pg_upgrade/t/005_offset.pl
@@ -0,0 +1,563 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+use strict;
+use warnings FATAL => 'all';
+
+use File::Find qw(find);
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# This pair of calls will create significantly more member segments than offset
+# segments.
+sub prep
+{
+	my $node = shift;
+	my $tbl = shift;
+
+	$node->safe_psql('postgres',
+		"CREATE TABLE ${tbl} (I INT PRIMARY KEY, N_UPDATED INT) " .
+		"       WITH (AUTOVACUUM_ENABLED=FALSE);" .
+		"INSERT INTO ${tbl} SELECT G, 0 FROM GENERATE_SERIES(1, 50) G;");
+}
+
+sub fill
+{
+	my $node = shift;
+	my $tbl = shift;
+
+	my $nclients = 50;
+	my $update_every = 90;
+	my @connections = ();
+
+	for (1..$nclients)
+	{
+		my $conn = $node->background_psql('postgres');
+		$conn->query_safe("BEGIN");
+
+		push(@connections, $conn);
+	}
+
+	for (my $i = 0; $i < 20000; $i++)
+	{
+		my $conn = $connections[$i % $nclients];
+
+		$conn->query_safe("COMMIT;");
+		$conn->query_safe("BEGIN");
+
+		if ($i % $update_every == 0)
+		{
+			$conn->query_safe(
+				"UPDATE ${tbl} SET " .
+				"N_UPDATED = N_UPDATED + 1 " .
+				"WHERE I = ${i} % 50");
+		}
+		else
+		{
+			$conn->query_safe(
+				"SELECT * FROM ${tbl} FOR KEY SHARE");
+		}
+	}
+
+	for my $conn (@connections)
+	{
+		$conn->quit();
+	}
+}
+
+# This pair of calls will create roughly the same number of member and
+# offset segments.
+sub prep2
+{
+	my $node = shift;
+	my $tbl = shift;
+
+	$node->safe_psql('postgres',
+		"CREATE TABLE ${tbl}(BAR INT PRIMARY KEY, BAZ INT); " .
+		"CREATE OR REPLACE PROCEDURE MXIDFILLER(N_STEPS INT DEFAULT 1000) " .
+		"LANGUAGE PLPGSQL " .
+		"AS \$\$ " .
+		"BEGIN " .
+		"	FOR I IN 1..N_STEPS LOOP " .
+		"		UPDATE ${tbl} SET BAZ = RANDOM(1, 1000) " .
+		"		WHERE BAR IN (SELECT BAR FROM ${tbl} " .
+		"						TABLESAMPLE BERNOULLI(80)); " .
+		"		COMMIT; " .
+		"	END LOOP; " .
+		"END; \$\$; " .
+		"INSERT INTO ${tbl} (BAR, BAZ) " .
+		"SELECT ID, ID FROM GENERATE_SERIES(1, 1024) ID;");
+}
+
+sub fill2
+{
+	my $node = shift;
+	my $tbl = shift;
+	my $scale = shift // 1;
+
+	$node->safe_psql('postgres',
+		"BEGIN; " .
+		"SELECT * FROM ${tbl} FOR KEY SHARE; " .
+		"PREPARE TRANSACTION 'A'; " .
+		"CALL MXIDFILLER((365 * ${scale})::int); " .
+		"COMMIT PREPARED 'A';");
+}
+
+
+# generate around 2 offset segments and 55 member segments
+sub mxid_gen1
+{
+	my $node = shift;
+	my $tbl = shift;
+
+	prep($node, $tbl);
+	fill($node, $tbl);
+
+	$node->safe_psql('postgres', q(CHECKPOINT));
+}
+
+# generate around 10 offset segments and 12 member segments
+sub mxid_gen2
+{
+	my $node = shift;
+	my $tbl = shift;
+	my $scale = shift // 1;
+
+	prep2($node, $tbl);
+	fill2($node, $tbl, $scale);
+
+	$node->safe_psql('postgres', q(CHECKPOINT));
+}
+
+# Fetch latest multixact checkpoint values.
+sub multi_bounds
+{
+	my ($node) = @_;
+	my $path = $node->config_data('--bindir');
+	my ($stdout, $stderr) = run_command([
+									$path . '/pg_controldata',
+									$node->data_dir
+								]);
+	my @control_data = split("\n", $stdout);
+	my $next = undef;
+	my $oldest = undef;
+	my $next_offset = undef;
+
+	foreach (@control_data)
+	{
+		if ($_ =~ /^Latest checkpoint's NextMultiXactId:\s*(.*)$/mg)
+		{
+			$next = $1;
+			print ">>> @ node ". $node->name . ", " . $_ . "\n";
+		}
+
+		if ($_ =~ /^Latest checkpoint's oldestMultiXid:\s*(.*)$/mg)
+		{
+			$oldest = $1;
+			print ">>> @ node ". $node->name . ", " . $_ . "\n";
+		}
+
+		if ($_ =~ /^Latest checkpoint's NextMultiOffset:\s*(.*)$/mg)
+		{
+			$next_offset = $1;
+			print ">>> @ node ". $node->name . ", " . $_ . "\n";
+		}
+
+		if (defined($oldest) && defined($next) && defined($next_offset))
+		{
+			last;
+		}
+	}
+
+	die "Latest checkpoint's NextMultiXactId not found in control file!\n"
+	unless defined($next);
+
+	die "Latest checkpoint's oldestMultiXid not found in control file!\n"
+	unless defined($oldest);
+
+	die "Latest checkpoint's NextMultiOffset not found in control file!\n"
+	unless defined($next_offset);
+
+	return ($oldest, $next, $next_offset);
+}
+
+# Create node from existing bins.
+sub create_new_node
+{
+	my ($name, %params) = @_;
+
+	create_node(0, @_);
+}
+
+# Create node from ENV oldinstall
+sub create_old_node
+{
+	my ($name, %params) = @_;
+
+	if (!defined($ENV{oldinstall}))
+	{
+		die "oldinstall is not defined";
+	}
+
+	create_node(1, @_);
+}
+
+sub create_node
+{
+	my ($install_path_from_env, $name, %params) = @_;
+	my $scale = defined $params{scale} ? $params{scale} : 1;
+	my $multi = defined $params{multi} ? $params{multi} : undef;
+	my $offset = defined $params{offset} ? $params{offset} : undef;
+
+	my $node =
+		$install_path_from_env ?
+			PostgreSQL::Test::Cluster->new($name,
+					install_path => $ENV{oldinstall}) :
+			PostgreSQL::Test::Cluster->new($name);
+
+	$node->init(force_initdb => 1,
+		extra => [
+			$multi ? ('-m', $multi) : (),
+			$offset ? ('-o', $offset) : (),
+			('-k'),
+		]);
+
+	# Fixup MOX patch quirk
+	if ($multi)
+	{
+		unlink $node->data_dir . '/pg_multixact/offsets/0000';
+	}
+	if ($offset)
+	{
+		unlink $node->data_dir . '/pg_multixact/members/0000';
+	}
+
+	$node->append_conf('postgresql.conf', 'fsync = off');
+	$node->append_conf('postgresql.conf', 'max_prepared_transactions = 2');
+
+	$node->start();
+	mxid_gen2($node, 'FOO', $scale);
+	mxid_gen1($node, 'BAR', $scale);
+	$node->restart();
+	$node->safe_psql('postgres', q(SELECT * FROM FOO));		# just in case...
+	$node->safe_psql('postgres', q(SELECT * FROM BAR));
+	$node->safe_psql('postgres', q(CHECKPOINT));
+	$node->stop();
+
+	return $node;
+}
+
+sub do_upgrade
+{
+	my ($oldnode, $newnode) = @_;
+
+	command_ok(
+		[
+			'pg_upgrade', '--no-sync',
+			'-d', $oldnode->data_dir,
+			'-D', $newnode->data_dir,
+			'-b', $oldnode->config_data('--bindir'),
+			'-B', $newnode->config_data('--bindir'),
+			'-s', $newnode->host,
+			'-p', $oldnode->port,
+			'-P', $newnode->port,
+			'--check'
+		],
+		'run of pg_upgrade --check');
+
+	command_ok(
+		[
+			'pg_upgrade', '--no-sync',
+			'-d', $oldnode->data_dir,
+			'-D', $newnode->data_dir,
+			'-b', $oldnode->config_data('--bindir'),
+			'-B', $newnode->config_data('--bindir'),
+			'-s', $newnode->host,
+			'-p', $oldnode->port,
+			'-P', $newnode->port,
+			'--copy'
+		],
+		'run of pg_upgrade');
+
+	$oldnode->start();
+	$newnode->start();
+
+	my $oldfoo = $oldnode->safe_psql('postgres', q(SELECT * FROM FOO));
+	my $newfoo = $newnode->safe_psql('postgres', q(SELECT * FROM FOO));
+	is($oldfoo, $newfoo, "select foo eq");
+
+	my $oldbar = $oldnode->safe_psql('postgres', q(SELECT * FROM BAR));
+	my $newbar = $newnode->safe_psql('postgres', q(SELECT * FROM BAR));
+	is($oldbar, $newbar, "select bar eq");
+
+	$oldnode->stop();
+	$newnode->stop();
+
+	multi_bounds($oldnode);
+	multi_bounds($newnode);
+}
+
+my @TESTS = (
+	# tests without ENV oldinstall
+	0, 1, 2, 3, 4, 5, 6,
+	# tests with "real" pg_upgrade
+	100, 101, 102, 103, 104, 105, 106,
+	# self upgrade
+	1000,
+);
+
+# =============================================================================
+# Basic sanity tests on a NEW bin
+# =============================================================================
+
+# starts from zero
+SKIP:
+{
+	my $TEST_NO = 0;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $node = create_new_node('simple_mo',
+						scale => 1);
+	multi_bounds($node);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# multi starts from the value
+SKIP:
+{
+	my $TEST_NO = 1;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $node = create_new_node('simple_Mo',
+						scale => 1.15,
+						multi => '0x123400');
+	multi_bounds($node);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# offsets starts from the value
+SKIP:
+{
+	my $TEST_NO = 2;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $node = create_new_node('simple_mO',
+						scale => 1.15,
+						offset => '0x432100');
+	multi_bounds($node);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# multi and offsets starts from the value
+SKIP:
+{
+	my $TEST_NO = 3;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $node = create_new_node('simple_MO',
+						scale => 1.15,
+						multi => '0xDEAD00', offset => '0xBEEF00');
+	multi_bounds($node);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# multi starts from the value, multi wrap
+SKIP:
+{
+	my $TEST_NO = 4;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $node = create_new_node('simple_Mo_wrap',
+						scale => 1.15,
+						multi => '0xFFFF7000');
+	multi_bounds($node);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# offsets starts from the value, offsets wrap
+SKIP:
+{
+	my $TEST_NO = 5;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $node = create_new_node('simple_mO_wrap',
+						scale => 1.15,
+						offset => '0xFFFFFC00');
+	multi_bounds($node);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# multi starts from the value, offsets starts from the value,
+# multi wrap, offsets wrap
+SKIP:
+{
+	my $TEST_NO = 6;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $node = create_new_node('simple_MO_wrap',
+						scale => 1.15,
+						multi => '0xFFFF7000', offset => '0xFFFFFC00');
+	multi_bounds($node);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# =============================================================================
+# pg_upgrade tests
+# =============================================================================
+
+# starts from zero
+SKIP:
+{
+	my $TEST_NO = 100;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $dbname = 'mo';
+	my $oldnode = create_old_node("old_$dbname",
+						scale => 1);
+	my $newnode = PostgreSQL::Test::Cluster->new("new_$dbname");
+	$newnode->init();
+
+	do_upgrade($oldnode, $newnode);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# multi starts from the value
+SKIP:
+{
+	my $TEST_NO = 101;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $dbname = 'Mo';
+	my $oldnode = create_old_node("old_$dbname",
+						scale => 1.2,
+						multi => '0x123400');
+	my $newnode = PostgreSQL::Test::Cluster->new("new_$dbname");
+	$newnode->init();
+
+	do_upgrade($oldnode, $newnode);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# offsets starts from the value
+SKIP:
+{
+	my $TEST_NO = 102;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $dbname = 'mO';
+	my $oldnode = create_old_node("old_$dbname",
+						scale => 1.2,
+						offset => '0x432100');
+	my $newnode = PostgreSQL::Test::Cluster->new("new_$dbname");
+	$newnode->init();
+
+	do_upgrade($oldnode, $newnode);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# multi and offsets starts from the value
+SKIP:
+{
+	my $TEST_NO = 103;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $dbname = 'MO';
+	my $oldnode = create_old_node("old_$dbname",
+						scale => 1.2,
+						multi => '0xDEAD00', offset => '0xBEEF00');
+	my $newnode = PostgreSQL::Test::Cluster->new("new_$dbname");
+	$newnode->init();
+
+	do_upgrade($oldnode, $newnode);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# multi starts from the value, multi wrap
+SKIP:
+{
+	my $TEST_NO = 104;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $dbname = 'Mo_wrap';
+	my $oldnode = create_old_node("old_$dbname",
+						scale => 1.2,
+						multi => '0xFFFF7000');
+	my $newnode = PostgreSQL::Test::Cluster->new("new_$dbname");
+	$newnode->init();
+
+	do_upgrade($oldnode, $newnode);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# offsets starts from the value, offsets wrap
+SKIP:
+{
+	my $TEST_NO = 105;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $dbname = 'mO_wrap';
+	my $oldnode = create_old_node("old_$dbname",
+						scale => 1.2,
+						offset => '0xFFFFFC00');
+	my $newnode = PostgreSQL::Test::Cluster->new("new_$dbname");
+	$newnode->init();
+
+	do_upgrade($oldnode, $newnode);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# multi starts from the value, offsets starts from the value,
+# multi wrap, offsets wrap
+SKIP:
+{
+	my $TEST_NO = 106;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $dbname = 'MO_wrap';
+	my $oldnode = create_old_node("old_$dbname",
+						scale => 1.2,
+						multi => '0xFFFF7000', offset => '0xFFFFFC00');
+	my $newnode = PostgreSQL::Test::Cluster->new("new_$dbname");
+	$newnode->init();
+
+	do_upgrade($oldnode, $newnode);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# =============================================================================
+# Self upgrade
+# =============================================================================
+
+# starts from zero
+SKIP:
+{
+	my $TEST_NO = 1000;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $dbname = 'self_upgrade';
+	my $oldnode = create_new_node("old_$dbname",
+						scale => 1);
+	my $newnode = PostgreSQL::Test::Cluster->new("new_$dbname");
+	$newnode->init();
+
+	do_upgrade($oldnode, $newnode);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+done_testing();
-- 
2.43.0

v12-0004-TEST-initdb-option-to-initialize-cluster-with-no.patch (application/octet-stream)
From b7d6cd34aec728dd143f04d44b416896183267fb Mon Sep 17 00:00:00 2001
From: Maxim Orlov <m.orlov@postgrespro.ru>
Date: Wed, 4 May 2022 15:53:36 +0300
Subject: [PATCH v12 4/7] TEST: initdb option to initialize cluster with
 non-standard xid/mxid/mxoff

Until now, testing database cluster wraparound has not been easy, because
initdb always initializes the cluster with the default xid/mxid/mxoff. An
option to specify any valid xid/mxid/mxoff at cluster creation makes this
easier.

Author: Maxim Orlov <orlovmg@gmail.com>
Author: Pavel Borisov <pashkin.elfe@gmail.com>
Author: Svetlana Derevyanko <s.derevyanko@postgrespro.ru>
Discussion: https://www.postgresql.org/message-id/flat/CACG%3Dezaa4vqYjJ16yoxgrpa-%3DgXnf0Vv3Ey9bjGrRRFN2YyWFQ%40mail.gmail.com
---
 src/backend/access/transam/clog.c      |  21 +++++
 src/backend/access/transam/multixact.c |  53 ++++++++++++
 src/backend/access/transam/subtrans.c  |   8 +-
 src/backend/access/transam/xlog.c      |  15 ++--
 src/backend/bootstrap/bootstrap.c      |  50 +++++++++++-
 src/backend/main/main.c                |   6 ++
 src/backend/postmaster/postmaster.c    |  14 +++-
 src/backend/tcop/postgres.c            |  53 +++++++++++-
 src/bin/initdb/initdb.c                | 107 ++++++++++++++++++++++++-
 src/bin/initdb/t/001_initdb.pl         |  60 ++++++++++++++
 src/include/access/xlog.h              |   3 +
 src/include/c.h                        |   4 +
 src/include/catalog/pg_class.h         |   2 +-
 13 files changed, 382 insertions(+), 14 deletions(-)
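
Note on the -m/-o/-x parsing added below: the three option handlers share one
pattern (reset errno, parse with base 0, reject empty input, trailing garbage
and range errors). The following standalone sketch shows that pattern with
the standard strtoull(). It is not part of the patch; parse_start_value() and
its max parameter are hypothetical. The one deliberate difference is that the
range is checked while the value is still 64 bits wide, before it is stored
into a narrower type. Whether the patch's Start*IsValid() macros already
cover that is not visible in this excerpt.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse an unsigned value in decimal, hex (0x...) or
 * octal (0...) notation. Returns true and stores the value in *result only
 * if the whole string was consumed, no range error occurred, and the value
 * does not exceed max.
 */
static bool
parse_start_value(const char *arg, uint64_t max, uint64_t *result)
{
	char	   *endptr;
	unsigned long long val;

	assert(result != NULL);

	if (arg == NULL)
		return false;

	errno = 0;
	val = strtoull(arg, &endptr, 0);

	/* no digits, trailing garbage, or out of range for unsigned long long */
	if (endptr == arg || *endptr != '\0' || errno != 0)
		return false;

	/* check the range before any narrowing assignment */
	if (val > max)
		return false;

	*result = (uint64_t) val;
	return true;
}
```

A caller that stores the result in a 32-bit xid or multixact id would pass
UINT32_MAX as max, so that an input such as 0x100000000 is rejected instead
of being silently truncated to 0.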

diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index 0d556c00b8..89516e9f52 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -834,6 +834,7 @@ BootStrapCLOG(void)
 {
 	int			slotno;
 	LWLock	   *lock = SimpleLruGetBankLock(XactCtl, 0);
+	int64		pageno;
 
 	LWLockAcquire(lock, LW_EXCLUSIVE);
 
@@ -844,6 +845,26 @@ BootStrapCLOG(void)
 	SimpleLruWritePage(XactCtl, slotno);
 	Assert(!XactCtl->shared->page_dirty[slotno]);
 
+	pageno = TransactionIdToPage(XidFromFullTransactionId(TransamVariables->nextXid));
+	if (pageno != 0)
+	{
+		LWLock *nextlock = SimpleLruGetBankLock(XactCtl, pageno);
+
+		if (nextlock != lock)
+		{
+			LWLockRelease(lock);
+			LWLockAcquire(nextlock, LW_EXCLUSIVE);
+			lock = nextlock;
+		}
+
+		/* Create and zero the commit log page that holds the starting xid */
+		slotno = ZeroCLOGPage(pageno, false);
+
+		/* Make sure it's written out */
+		SimpleLruWritePage(XactCtl, slotno);
+		Assert(!XactCtl->shared->page_dirty[slotno]);
+	}
+
 	LWLockRelease(lock);
 }
 
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index d63ae17330..70c9d2f6ee 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1815,6 +1815,7 @@ BootStrapMultiXact(void)
 {
 	int			slotno;
 	LWLock	   *lock;
+	int64		pageno;
 
 	lock = SimpleLruGetBankLock(MultiXactOffsetCtl, 0);
 	LWLockAcquire(lock, LW_EXCLUSIVE);
@@ -1826,6 +1827,26 @@ BootStrapMultiXact(void)
 	SimpleLruWritePage(MultiXactOffsetCtl, slotno);
 	Assert(!MultiXactOffsetCtl->shared->page_dirty[slotno]);
 
+	pageno = MultiXactIdToOffsetPage(MultiXactState->nextMXact);
+	if (pageno != 0)
+	{
+		LWLock *nextlock = SimpleLruGetBankLock(MultiXactOffsetCtl, pageno);
+
+		if (nextlock != lock)
+		{
+			LWLockRelease(lock);
+			LWLockAcquire(nextlock, LW_EXCLUSIVE);
+			lock = nextlock;
+		}
+
+		/* Create and zero the offsets page that holds the starting multixact */
+		slotno = ZeroMultiXactOffsetPage(pageno, false);
+
+		/* Make sure it's written out */
+		SimpleLruWritePage(MultiXactOffsetCtl, slotno);
+		Assert(!MultiXactOffsetCtl->shared->page_dirty[slotno]);
+	}
+
 	LWLockRelease(lock);
 
 	lock = SimpleLruGetBankLock(MultiXactMemberCtl, 0);
@@ -1838,7 +1859,39 @@ BootStrapMultiXact(void)
 	SimpleLruWritePage(MultiXactMemberCtl, slotno);
 	Assert(!MultiXactMemberCtl->shared->page_dirty[slotno]);
 
+	pageno = MXOffsetToMemberPage(MultiXactState->nextOffset);
+	if (pageno != 0)
+	{
+		LWLock *nextlock = SimpleLruGetBankLock(MultiXactMemberCtl, pageno);
+
+		if (nextlock != lock)
+		{
+			LWLockRelease(lock);
+			LWLockAcquire(nextlock, LW_EXCLUSIVE);
+			lock = nextlock;
+		}
+
+		/* Create and zero the members page that holds the starting offset */
+		slotno = ZeroMultiXactMemberPage(pageno, false);
+
+		/* Make sure it's written out */
+		SimpleLruWritePage(MultiXactMemberCtl, slotno);
+		Assert(!MultiXactMemberCtl->shared->page_dirty[slotno]);
+	}
+
 	LWLockRelease(lock);
+
+	/*
+	 * If we're not starting from offset zero, initialize dummy multixacts to
+	 * avoid an overly long loop in PerformMembersTruncation().
+	 */
+	if (MultiXactState->nextOffset > 0 && MultiXactState->nextMXact > 0)
+	{
+		RecordNewMultiXact(FirstMultiXactId,
+						   MultiXactState->nextOffset, 0, NULL);
+		RecordNewMultiXact(MultiXactState->nextMXact,
+						   MultiXactState->nextOffset, 0, NULL);
+	}
 }
 
 /*
diff --git a/src/backend/access/transam/subtrans.c b/src/backend/access/transam/subtrans.c
index 15153618fa..218675fa60 100644
--- a/src/backend/access/transam/subtrans.c
+++ b/src/backend/access/transam/subtrans.c
@@ -270,12 +270,16 @@ void
 BootStrapSUBTRANS(void)
 {
 	int			slotno;
-	LWLock	   *lock = SimpleLruGetBankLock(SubTransCtl, 0);
+	LWLock	   *lock;
+	int64		pageno;
+
+	pageno = TransactionIdToPage(XidFromFullTransactionId(TransamVariables->nextXid));
+	lock = SimpleLruGetBankLock(SubTransCtl, pageno);
 
 	LWLockAcquire(lock, LW_EXCLUSIVE);
 
 	/* Create and zero the first page of the subtrans log */
-	slotno = ZeroSUBTRANSPage(0);
+	slotno = ZeroSUBTRANSPage(pageno);
 
 	/* Make sure it's written out */
 	SimpleLruWritePage(SubTransCtl, slotno);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index a813a090fa..9f78a3e34a 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -136,6 +136,10 @@ int			max_slot_wal_keep_size_mb = -1;
 int			wal_decode_buffer_size = 512 * 1024;
 bool		track_wal_io_timing = false;
 
+TransactionId		start_xid = FirstNormalTransactionId;
+MultiXactId			start_mxid = FirstMultiXactId;
+MultiXactOffset		start_mxoff = 0;
+
 #ifdef WAL_DEBUG
 bool		XLOG_DEBUG = false;
 #endif
@@ -5080,13 +5084,14 @@ BootStrapXLOG(uint32 data_checksum_version)
 	checkPoint.fullPageWrites = fullPageWrites;
 	checkPoint.wal_level = wal_level;
 	checkPoint.nextXid =
-		FullTransactionIdFromEpochAndXid(0, FirstNormalTransactionId);
+		FullTransactionIdFromEpochAndXid(0, Max(FirstNormalTransactionId,
+												start_xid));
 	checkPoint.nextOid = FirstGenbkiObjectId;
-	checkPoint.nextMulti = FirstMultiXactId;
-	checkPoint.nextMultiOffset = 1;
-	checkPoint.oldestXid = FirstNormalTransactionId;
+	checkPoint.nextMulti = Max(FirstMultiXactId, start_mxid);
+	checkPoint.nextMultiOffset = Max(1, start_mxoff);
+	checkPoint.oldestXid = XidFromFullTransactionId(checkPoint.nextXid);
 	checkPoint.oldestXidDB = Template1DbOid;
-	checkPoint.oldestMulti = FirstMultiXactId;
+	checkPoint.oldestMulti = checkPoint.nextMulti;
 	checkPoint.oldestMultiDB = Template1DbOid;
 	checkPoint.oldestCommitTsXid = InvalidTransactionId;
 	checkPoint.newestCommitTsXid = InvalidTransactionId;
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 359f58a8f9..b697138b7e 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -218,7 +218,7 @@ BootstrapModeMain(int argc, char *argv[], bool check_only)
 	argv++;
 	argc--;
 
-	while ((flag = getopt(argc, argv, "B:c:d:D:Fkr:X:-:")) != -1)
+	while ((flag = getopt(argc, argv, "B:c:d:D:Fkm:o:r:X:x:-:")) != -1)
 	{
 		switch (flag)
 		{
@@ -286,12 +286,60 @@ BootstrapModeMain(int argc, char *argv[], bool check_only)
 			case 'k':
 				bootstrap_data_checksum_version = PG_DATA_CHECKSUM_VERSION;
 				break;
+			case 'm':
+				{
+					char	   *endptr;
+
+					errno = 0;
+					start_mxid = strtou64(optarg, &endptr, 0);
+
+					if (endptr == optarg || *endptr != '\0' || errno != 0 ||
+						!StartMultiXactIdIsValid(start_mxid))
+					{
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("invalid initial database cluster multixact id")));
+					}
+				}
+				break;
+			case 'o':
+				{
+					char	   *endptr;
+
+					errno = 0;
+					start_mxoff = strtou64(optarg, &endptr, 0);
+
+					if (endptr == optarg || *endptr != '\0' || errno != 0 ||
+						!StartMultiXactOffsetIsValid(start_mxoff))
+					{
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("invalid initial database cluster multixact offset")));
+					}
+				}
+				break;
 			case 'r':
 				strlcpy(OutputFileName, optarg, MAXPGPATH);
 				break;
 			case 'X':
 				SetConfigOption("wal_segment_size", optarg, PGC_INTERNAL, PGC_S_DYNAMIC_DEFAULT);
 				break;
+			case 'x':
+				{
+					char	   *endptr;
+
+					errno = 0;
+					start_xid = strtou64(optarg, &endptr, 0);
+
+					if (endptr == optarg || *endptr != '\0' || errno != 0 ||
+						!StartTransactionIdIsValid(start_xid))
+					{
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("invalid initial database cluster xid value")));
+					}
+				}
+				break;
 			default:
 				write_stderr("Try \"%s --help\" for more information.\n",
 							 progname);
diff --git a/src/backend/main/main.c b/src/backend/main/main.c
index e8effe5024..ff252dffbd 100644
--- a/src/backend/main/main.c
+++ b/src/backend/main/main.c
@@ -426,12 +426,18 @@ help(const char *progname)
 	printf(_("  -E                 echo statement before execution\n"));
 	printf(_("  -j                 do not use newline as interactive query delimiter\n"));
 	printf(_("  -r FILENAME        send stdout and stderr to given file\n"));
+	printf(_("  -m START_MXID      set initial database cluster multixact id\n"));
+	printf(_("  -o START_MXOFF     set initial database cluster multixact offset\n"));
+	printf(_("  -x START_XID       set initial database cluster xid\n"));
 
 	printf(_("\nOptions for bootstrapping mode:\n"));
 	printf(_("  --boot             selects bootstrapping mode (must be first argument)\n"));
 	printf(_("  --check            selects check mode (must be first argument)\n"));
 	printf(_("  DBNAME             database name (mandatory argument in bootstrapping mode)\n"));
 	printf(_("  -r FILENAME        send stdout and stderr to given file\n"));
+	printf(_("  -m START_MXID      set initial database cluster multixact id\n"));
+	printf(_("  -o START_MXOFF     set initial database cluster multixact offset\n"));
+	printf(_("  -x START_XID       set initial database cluster xid\n"));
 
 	printf(_("\nPlease read the documentation for the complete list of run-time\n"
 			 "configuration settings and how to set them on the command line or in\n"
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 5f615d0f60..9f6bc6e33d 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -578,7 +578,7 @@ PostmasterMain(int argc, char *argv[])
 	 * tcop/postgres.c (the option sets should not conflict) and with the
 	 * common help() function in main/main.c.
 	 */
-	while ((opt = getopt(argc, argv, "B:bC:c:D:d:EeFf:h:ijk:lN:OPp:r:S:sTt:W:-:")) != -1)
+	while ((opt = getopt(argc, argv, "B:bC:c:D:d:EeFf:h:ijk:lm:N:Oo:Pp:r:S:sTt:W:x:-:")) != -1)
 	{
 		switch (opt)
 		{
@@ -688,10 +688,18 @@ PostmasterMain(int argc, char *argv[])
 				SetConfigOption("max_connections", optarg, PGC_POSTMASTER, PGC_S_ARGV);
 				break;
 
+			case 'm':
+				/* only used by single-user backend */
+				break;
+
 			case 'O':
 				SetConfigOption("allow_system_table_mods", "true", PGC_POSTMASTER, PGC_S_ARGV);
 				break;
 
+			case 'o':
+				/* only used by single-user backend */
+				break;
+
 			case 'P':
 				SetConfigOption("ignore_system_indexes", "true", PGC_POSTMASTER, PGC_S_ARGV);
 				break;
@@ -742,6 +750,10 @@ PostmasterMain(int argc, char *argv[])
 				SetConfigOption("post_auth_delay", optarg, PGC_POSTMASTER, PGC_S_ARGV);
 				break;
 
+			case 'x':
+				/* only used by single-user backend */
+				break;
+
 			default:
 				write_stderr("Try \"%s --help\" for more information.\n",
 							 progname);
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 5655348a2e..9c170f4906 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3788,7 +3788,7 @@ process_postgres_switches(int argc, char *argv[], GucContext ctx,
 	 * postmaster/postmaster.c (the option sets should not conflict) and with
 	 * the common help() function in main/main.c.
 	 */
-	while ((flag = getopt(argc, argv, "B:bC:c:D:d:EeFf:h:ijk:lN:nOPp:r:S:sTt:v:W:-:")) != -1)
+	while ((flag = getopt(argc, argv, "B:bC:c:D:d:EeFf:h:ijk:lm:N:nOo:Pp:r:S:sTt:v:W:x:-:")) != -1)
 	{
 		switch (flag)
 		{
@@ -3893,6 +3893,23 @@ process_postgres_switches(int argc, char *argv[], GucContext ctx,
 				SetConfigOption("ssl", "true", ctx, gucsource);
 				break;
 
+			case 'm':
+				{
+					char	   *endptr;
+
+					errno = 0;
+					start_mxid = strtou64(optarg, &endptr, 0);
+
+					if (endptr == optarg || *endptr != '\0' || errno != 0 ||
+						!StartMultiXactIdIsValid(start_mxid))
+					{
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("invalid initial database cluster multixact id")));
+					}
+				}
+				break;
+
 			case 'N':
 				SetConfigOption("max_connections", optarg, ctx, gucsource);
 				break;
@@ -3905,6 +3922,23 @@ process_postgres_switches(int argc, char *argv[], GucContext ctx,
 				SetConfigOption("allow_system_table_mods", "true", ctx, gucsource);
 				break;
 
+			case 'o':
+				{
+					char	   *endptr;
+
+					errno = 0;
+					start_mxoff = strtou64(optarg, &endptr, 0);
+
+					if (endptr == optarg || *endptr != '\0' || errno != 0 ||
+						!StartMultiXactOffsetIsValid(start_mxoff))
+					{
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("invalid initial database cluster multixact offset")));
+					}
+				}
+				break;
+
 			case 'P':
 				SetConfigOption("ignore_system_indexes", "true", ctx, gucsource);
 				break;
@@ -3959,6 +3993,23 @@ process_postgres_switches(int argc, char *argv[], GucContext ctx,
 				SetConfigOption("post_auth_delay", optarg, ctx, gucsource);
 				break;
 
+			case 'x':
+				{
+					char	   *endptr;
+
+					errno = 0;
+					start_xid = strtou64(optarg, &endptr, 0);
+
+					if (endptr == optarg || *endptr != '\0' || errno != 0 ||
+						!StartTransactionIdIsValid(start_xid))
+					{
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("invalid initial database cluster xid")));
+					}
+				}
+				break;
+
 			default:
 				errs++;
 				break;
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 101c780012..336ec9cdde 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -168,6 +168,9 @@ static bool data_checksums = true;
 static char *xlog_dir = NULL;
 static int	wal_segment_size_mb = (DEFAULT_XLOG_SEG_SIZE) / (1024 * 1024);
 static DataDirSyncMethod sync_method = DATA_DIR_SYNC_METHOD_FSYNC;
+static TransactionId start_xid = 0;
+static MultiXactId start_mxid = 0;
+static MultiXactOffset start_mxoff = 0;
 
 
 /* internal vars */
@@ -1596,6 +1599,11 @@ bootstrap_template1(void)
 	bki_lines = replace_token(bki_lines, "POSTGRES",
 							  escape_quotes_bki(username));
 
+	/* relfrozenxid must not be less than FirstNormalTransactionId */
+	sprintf(buf, "%llu", (unsigned long long) Max(start_xid, 3));
+	bki_lines = replace_token(bki_lines, "RECENTXMIN",
+							  buf);
+
 	bki_lines = replace_token(bki_lines, "ENCODING",
 							  encodingid_to_string(encodingid));
 
@@ -1621,6 +1629,9 @@ bootstrap_template1(void)
 
 	printfPQExpBuffer(&cmd, "\"%s\" --boot %s %s", backend_exec, boot_options, extra_options);
 	appendPQExpBuffer(&cmd, " -X %d", wal_segment_size_mb * (1024 * 1024));
+	appendPQExpBuffer(&cmd, " -m %llu", (unsigned long long) start_mxid);
+	appendPQExpBuffer(&cmd, " -o %llu", (unsigned long long) start_mxoff);
+	appendPQExpBuffer(&cmd, " -x %llu", (unsigned long long) start_xid);
 	if (data_checksums)
 		appendPQExpBuffer(&cmd, " -k");
 	if (debug)
@@ -2560,12 +2571,20 @@ usage(const char *progname)
 	printf(_("  -d, --debug               generate lots of debugging output\n"));
 	printf(_("      --discard-caches      set debug_discard_caches=1\n"));
 	printf(_("  -L DIRECTORY              where to find the input files\n"));
+	printf(_("  -m, --multixact-id=START_MXID\n"
+			 "                            set initial database cluster multixact id\n"
+			 "                            max value is 2^62-1\n"));
 	printf(_("  -n, --no-clean            do not clean up after errors\n"));
 	printf(_("  -N, --no-sync             do not wait for changes to be written safely to disk\n"));
 	printf(_("      --no-instructions     do not print instructions for next steps\n"));
+	printf(_("  -o, --multixact-offset=START_MXOFF\n"
+			 "                            set initial database cluster multixact offset\n"
+			 "                            max value is 2^62-1\n"));
 	printf(_("  -s, --show                show internal settings, then exit\n"));
 	printf(_("      --sync-method=METHOD  set method for syncing files to disk\n"));
 	printf(_("  -S, --sync-only           only sync database files to disk, then exit\n"));
+	printf(_("  -x, --xid=START_XID       set initial database cluster xid\n"
+			 "                            max value is 2^32-1\n"));
 	printf(_("\nOther options:\n"));
 	printf(_("  -V, --version             output version information, then exit\n"));
 	printf(_("  -?, --help                show this help, then exit\n"));
@@ -3107,6 +3126,18 @@ initialize_data_directory(void)
 	/* Now create all the text config files */
 	setup_config();
 
+	if (start_mxid != 0)
+		printf(_("selecting initial multixact id ... %llu\n"),
+				 (unsigned long long) start_mxid);
+
+	if (start_mxoff != 0)
+		printf(_("selecting initial multixact offset ... %llu\n"),
+				 (unsigned long long) start_mxoff);
+
+	if (start_xid != 0)
+		printf(_("selecting initial xid ... %llu\n"),
+				 (unsigned long long) start_xid);
+
 	/* Bootstrap template1 */
 	bootstrap_template1();
 
@@ -3123,8 +3154,12 @@ initialize_data_directory(void)
 	fflush(stdout);
 
 	initPQExpBuffer(&cmd);
-	printfPQExpBuffer(&cmd, "\"%s\" %s %s template1 >%s",
-					  backend_exec, backend_options, extra_options, DEVNULL);
+	printfPQExpBuffer(&cmd, "\"%s\" %s %s",
+					  backend_exec, backend_options, extra_options);
+	appendPQExpBuffer(&cmd, " -m %llu", (unsigned long long) start_mxid);
+	appendPQExpBuffer(&cmd, " -o %llu", (unsigned long long) start_mxoff);
+	appendPQExpBuffer(&cmd, " -x %llu", (unsigned long long) start_xid);
+	appendPQExpBuffer(&cmd, " template1 >%s", DEVNULL);
 
 	PG_CMD_OPEN(cmd.data);
 
@@ -3211,6 +3246,9 @@ main(int argc, char *argv[])
 		{"icu-rules", required_argument, NULL, 18},
 		{"sync-method", required_argument, NULL, 19},
 		{"no-data-checksums", no_argument, NULL, 20},
+		{"xid", required_argument, NULL, 'x'},
+		{"multixact-id", required_argument, NULL, 'm'},
+		{"multixact-offset", required_argument, NULL, 'o'},
 		{NULL, 0, NULL, 0}
 	};
 
@@ -3252,7 +3290,7 @@ main(int argc, char *argv[])
 
 	/* process command-line options */
 
-	while ((c = getopt_long(argc, argv, "A:c:dD:E:gkL:nNsST:U:WX:",
+	while ((c = getopt_long(argc, argv, "A:c:dD:E:gkL:m:nNo:sST:U:Wx:X:",
 							long_options, &option_index)) != -1)
 	{
 		switch (c)
@@ -3310,6 +3348,30 @@ main(int argc, char *argv[])
 				debug = true;
 				printf(_("Running in debug mode.\n"));
 				break;
+			case 'm':
+				{
+					char	   *endptr;
+
+					errno = 0;
+					start_mxid = strtou64(optarg, &endptr, 0);
+
+					if (endptr == optarg || *endptr != '\0' || errno != 0 ||
+						!StartMultiXactIdIsValid(start_mxid))
+					{
+						pg_log_error("invalid initial database cluster multixact id");
+						exit(1);
+					}
+					else if (start_mxid < 1) /* FirstMultiXactId */
+					{
+						/*
+						 * Reject the value rather than silently bumping it to
+						 * FirstMultiXactId, even though that would be harmless.
+						 */
+						pg_log_error("multixact id should be greater than 0");
+						exit(1);
+					}
+				}
+				break;
 			case 'n':
 				noclean = true;
 				printf(_("Running in no-clean mode.  Mistakes will not be cleaned up.\n"));
@@ -3317,6 +3379,21 @@ main(int argc, char *argv[])
 			case 'N':
 				do_sync = false;
 				break;
+			case 'o':
+				{
+					char	   *endptr;
+
+					errno = 0;
+					start_mxoff = strtou64(optarg, &endptr, 0);
+
+					if (endptr == optarg || *endptr != '\0' || errno != 0 ||
+						!StartMultiXactOffsetIsValid(start_mxoff))
+					{
+						pg_log_error("invalid initial database cluster multixact offset");
+						exit(1);
+					}
+				}
+				break;
 			case 'S':
 				sync_only = true;
 				break;
@@ -3405,6 +3482,30 @@ main(int argc, char *argv[])
 			case 20:
 				data_checksums = false;
 				break;
+			case 'x':
+				{
+					char	   *endptr;
+
+					errno = 0;
+					start_xid = strtou64(optarg, &endptr, 0);
+
+					if (endptr == optarg || *endptr != '\0' || errno != 0 ||
+						!StartTransactionIdIsValid(start_xid))
+					{
+						pg_log_error("invalid value for initial database cluster xid");
+						exit(1);
+					}
+					else if (start_xid < 3) /* FirstNormalTransactionId */
+					{
+						/*
+						 * Reject the value rather than silently bumping it to
+						 * FirstNormalTransactionId, even though that would be harmless.
+						 */
+						pg_log_error("xid should be greater than 2");
+						exit(1);
+					}
+				}
+				break;
 			default:
 				/* getopt_long already emitted a complaint */
 				pg_log_error_hint("Try \"%s --help\" for more information.", progname);
diff --git a/src/bin/initdb/t/001_initdb.pl b/src/bin/initdb/t/001_initdb.pl
index f114c2a1b6..85a133280c 100644
--- a/src/bin/initdb/t/001_initdb.pl
+++ b/src/bin/initdb/t/001_initdb.pl
@@ -282,4 +282,64 @@ command_fails(
 	[ 'pg_checksums', '-D', $datadir_nochecksums ],
 	"pg_checksums fails with data checksum disabled");
 
+# Set non-standard initial mxid/mxoff/xid.
+command_fails_like(
+	[ 'initdb', '-m', 'seven', $datadir ],
+	qr/initdb: error: invalid initial database cluster multixact id/,
+	'fails for invalid initial database cluster multixact id');
+command_fails_like(
+	[ 'initdb', '-o', 'seven', $datadir ],
+	qr/initdb: error: invalid initial database cluster multixact offset/,
+	'fails for invalid initial database cluster multixact offset');
+command_fails_like(
+	[ 'initdb', '-x', 'seven', $datadir ],
+	qr/initdb: error: invalid value for initial database cluster xid/,
+	'fails for invalid initial database cluster xid');
+
+command_checks_all(
+	[ 'initdb', '-m', '65535', "$tempdir/data-m65535" ],
+	0,
+	[qr/selecting initial multixact id ... 65535/],
+	[],
+	'selecting initial multixact id');
+command_checks_all(
+	[ 'initdb', '-o', '65535', "$tempdir/data-o65535" ],
+	0,
+	[qr/selecting initial multixact offset ... 65535/],
+	[],
+	'selecting initial multixact offset');
+command_checks_all(
+	[ 'initdb', '-x', '65535', "$tempdir/data-x65535" ],
+	0,
+	[qr/selecting initial xid ... 65535/],
+	[],
+	'selecting initial xid');
+
+# Setup new cluster with given mxid/mxoff/xid.
+my $node;
+my $result;
+
+$node = PostgreSQL::Test::Cluster->new('test-mxid');
+$node->init(extra => ['-m', '16777215']); # 0xFFFFFF
+$node->start;
+$result = $node->safe_psql('postgres', "SELECT next_multixact_id FROM pg_control_checkpoint();");
+ok($result >= 16777215, 'setup cluster with given mxid');
+$node->stop;
+
+$node = PostgreSQL::Test::Cluster->new('test-mxoff');
+$node->init(extra => ['-o', '16777215']); # 0xFFFFFF
+$node->start;
+$result = $node->safe_psql('postgres', "SELECT next_multi_offset FROM pg_control_checkpoint();");
+ok($result >= 16777215, 'setup cluster with given mxoff');
+$node->stop;
+
+$node = PostgreSQL::Test::Cluster->new('test-xid');
+$node->init(extra => ['-x', '16777215']); # 0xFFFFFF
+$node->start;
+$result = $node->safe_psql('postgres', "SELECT txid_current();");
+ok($result >= 16777215, 'setup cluster with given xid - check 1');
+$result = $node->safe_psql('postgres', "SELECT oldest_xid FROM pg_control_checkpoint();");
+ok($result >= 16777215, 'setup cluster with given xid - check 2');
+$node->stop;
+
 done_testing();
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 4411c1468a..8eb34846da 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -94,6 +94,9 @@ typedef enum RecoveryState
 } RecoveryState;
 
 extern PGDLLIMPORT int wal_level;
+extern PGDLLIMPORT TransactionId start_xid;
+extern PGDLLIMPORT MultiXactId start_mxid;
+extern PGDLLIMPORT MultiXactOffset start_mxoff;
 
 /* Is WAL archiving enabled (always or only while server is running normally)? */
 #define XLogArchivingActive() \
diff --git a/src/include/c.h b/src/include/c.h
index 318194f78d..4f2b5432e5 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -622,6 +622,10 @@ typedef uint64 MultiXactOffset;
 
 typedef uint32 CommandId;
 
+#define StartTransactionIdIsValid(xid)			((xid) <= 0xFFFFFFFF)
+#define StartMultiXactIdIsValid(mxid)			((mxid) <= 0xFFFFFFFF)
+#define StartMultiXactOffsetIsValid(offset)		((offset) <= 0xFFFFFFFF)
+
 #define FirstCommandId	((CommandId) 0)
 #define InvalidCommandId	(~(CommandId)0)
 
diff --git a/src/include/catalog/pg_class.h b/src/include/catalog/pg_class.h
index f0d612ca48..5c63290a72 100644
--- a/src/include/catalog/pg_class.h
+++ b/src/include/catalog/pg_class.h
@@ -123,7 +123,7 @@ CATALOG(pg_class,1259,RelationRelationId) BKI_BOOTSTRAP BKI_ROWTYPE_OID(83,Relat
 	Oid			relrewrite BKI_DEFAULT(0) BKI_LOOKUP_OPT(pg_class);
 
 	/* all Xids < this are frozen in this rel */
-	TransactionId relfrozenxid BKI_DEFAULT(3);	/* FirstNormalTransactionId */
+	TransactionId relfrozenxid BKI_DEFAULT(RECENTXMIN);	/* FirstNormalTransactionId */
 
 	/* all multixacts in this rel are >= this; it is really a MultiXactId */
 	TransactionId relminmxid BKI_DEFAULT(1);	/* FirstMultiXactId */
-- 
2.43.0
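
For readers skimming the initdb patch above: the -m, -o and -x options all use the same parse-and-validate pattern. Below is a minimal stand-alone sketch of that pattern. It is not code from the patch. It uses the standard strtoull() in place of PostgreSQL's strtou64(), and the names parse_start_value and START_VALUE_MAX are hypothetical, standing in for the per-option code and the Start*IsValid() macros.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the patch's Start*IsValid() upper bound. */
#define START_VALUE_MAX UINT64_C(0xFFFFFFFF)

/*
 * Parse a start value given on the command line.
 *
 * Returns true and stores the value in *result when "arg" is a complete
 * number (decimal, octal or hex, because base 0 is used) that lies in
 * [min, START_VALUE_MAX].  Returns false otherwise.
 */
static bool
parse_start_value(const char *arg, uint64_t min, uint64_t *result)
{
	char	   *endptr;
	unsigned long long val;

	assert(arg != NULL && result != NULL);

	errno = 0;
	val = strtoull(arg, &endptr, 0);

	/* no digits consumed, trailing garbage, or out-of-range input */
	if (endptr == arg || *endptr != '\0' || errno != 0)
		return false;

	/* range check, mirroring the validity macro plus the lower bound */
	if (val > START_VALUE_MAX || val < min)
		return false;

	*result = (uint64_t) val;
	return true;
}
```

With min set to 1 this corresponds to the -m check, with 3 to the -x check, and with 0 to the -o check. Because base 0 is used, hex values such as 0x123400 (as used by the pg_upgrade tests) are accepted as well.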

v12-0006-TEST-try-to-replicate-buggy-oldest-offset.patch (application/octet-stream)
From 7183e7d256ab35cd0ef6a2a7d5f9c1d4ff001f70 Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Fri, 27 Dec 2024 19:39:58 +0300
Subject: [PATCH v12 6/7] TEST: try to replicate buggy oldest offset

---
 src/bin/pg_upgrade/t/005_offset.pl | 59 ++++++++++++++++++++++++++++++
 1 file changed, 59 insertions(+)

diff --git a/src/bin/pg_upgrade/t/005_offset.pl b/src/bin/pg_upgrade/t/005_offset.pl
index df84186de4..2d91d101fa 100644
--- a/src/bin/pg_upgrade/t/005_offset.pl
+++ b/src/bin/pg_upgrade/t/005_offset.pl
@@ -305,6 +305,8 @@ my @TESTS = (
 	100, 101, 102, 103, 104, 105, 106,
 	# self upgrade
 	1000,
+	# buggy
+	2000
 );
 
 # =============================================================================
@@ -560,4 +562,61 @@ SKIP:
 	ok(1, "TEST $TEST_NO PASSED");
 }
 
+# =============================================================================
+# Buggy
+# =============================================================================
+
+SKIP:
+{
+	my $TEST_NO = 2000;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	if (!defined($ENV{oldinstall}))
+	{
+		die "oldinstall is not defined";
+	}
+
+	my $dbname = 'buggy';
+	my $newnode = PostgreSQL::Test::Cluster->new("new_$dbname");
+	my $oldnode = PostgreSQL::Test::Cluster->new("old_$dbname",
+					install_path => $ENV{oldinstall});
+	$newnode->init;
+	$oldnode->init(force_initdb => 1);
+
+	$oldnode->append_conf('postgresql.conf', q(
+		autovacuum = off
+		max_prepared_transactions = 2
+		fsync = off
+	));
+	$oldnode->start;
+
+	mxid_gen2($oldnode, 'FOO', 1.25);
+	mxid_gen2($oldnode, 'BAR', 1.25);
+
+	$oldnode->safe_psql('postgres', q(
+		DROP TABLE BAR;
+		CHECKPOINT;
+	));
+	$oldnode->stop;
+
+	unlink($oldnode->data_dir . "/pg_multixact/offsets/0000");
+
+	command_ok(
+		[
+			'pg_upgrade', '--no-sync',
+			'-d', $oldnode->data_dir,
+			'-D', $newnode->data_dir,
+			'-b', $oldnode->config_data('--bindir'),
+			'-B', $newnode->config_data('--bindir'),
+			'-s', $newnode->host,
+			'-p', $oldnode->port,
+			'-P', $newnode->port,
+			'--copy'
+		],
+		'run of pg_upgrade');
+
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
 done_testing();
-- 
2.43.0

#31wenhui qiu
qiuwenhuifx@gmail.com
In reply to: Maxim Orlov (#30)
Re: POC: make mxidoff 64 bits

Hi Maxim,

Looks like there is a bit of a pause in the discussion. Here is a small
update. Consider v12.
No major changes, rebase to the actual master and a squash of multiple
commits to make the patch set easier to review.

AFAICS, we have reached a consensus on a core patch for switching to 64
bits offsets. The only concern is about more comprehensive test coverage
for pg_upgrade, is it?

Agreed. When the upgrade hits the extreme case (oldestOffsetKnown == false),
just follow the solution mentioned by Heikki Linnakangas.

Thanks

On Thu, Jan 16, 2025 at 9:32 PM Maxim Orlov <orlovmg@gmail.com> wrote:

Looks like there is a bit of a pause in the discussion. Here is a small
update. Consider v12.
No major changes, rebase to the actual master and a squash of multiple
commits to make the patch set easier to review.

AFAICS, we have reached a consensus on a core patch for switching to 64
bits offsets. The only concern is about more comprehensive test coverage
for pg_upgrade, is it?

--
Best regards,
Maxim Orlov.

#32Maxim Orlov
orlovmg@gmail.com
In reply to: wenhui qiu (#31)
7 attachment(s)
Re: POC: make mxidoff 64 bits

Here is a v13 version with small changes to make cf bot happy.

--
Best regards,
Maxim Orlov.

Attachments:

v13-0005-TEST-add-src-bin-pg_upgrade-t-005_offset.pl.patch.txt (text/plain; charset=US-ASCII)
From a989c58abfabb07c8778de339d617690f6654f79 Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Tue, 19 Nov 2024 17:08:10 +0300
Subject: [PATCH v13 5/7] TEST: add src/bin/pg_upgrade/t/005_offset.pl

---
 src/bin/pg_upgrade/t/005_offset.pl | 563 +++++++++++++++++++++++++++++
 1 file changed, 563 insertions(+)
 create mode 100644 src/bin/pg_upgrade/t/005_offset.pl

diff --git a/src/bin/pg_upgrade/t/005_offset.pl b/src/bin/pg_upgrade/t/005_offset.pl
new file mode 100644
index 0000000000..df84186de4
--- /dev/null
+++ b/src/bin/pg_upgrade/t/005_offset.pl
@@ -0,0 +1,563 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+use strict;
+use warnings FATAL => 'all';
+
+use File::Find qw(find);
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# This pair of calls will create significantly more member segments than offset
+# segments.
+sub prep
+{
+	my $node = shift;
+	my $tbl = shift;
+
+	$node->safe_psql('postgres',
+		"CREATE TABLE ${tbl} (I INT PRIMARY KEY, N_UPDATED INT) " .
+		"       WITH (AUTOVACUUM_ENABLED=FALSE);" .
+		"INSERT INTO ${tbl} SELECT G, 0 FROM GENERATE_SERIES(1, 50) G;");
+}
+
+sub fill
+{
+	my $node = shift;
+	my $tbl = shift;
+
+	my $nclients = 50;
+	my $update_every = 90;
+	my @connections = ();
+
+	for (1 .. $nclients)
+	{
+		my $conn = $node->background_psql('postgres');
+		$conn->query_safe("BEGIN");
+
+		push(@connections, $conn);
+	}
+
+	for (my $i = 0; $i < 20000; $i++)
+	{
+		my $conn = $connections[$i % $nclients];
+
+		$conn->query_safe("COMMIT;");
+		$conn->query_safe("BEGIN");
+
+		if ($i % $update_every == 0)
+		{
+			$conn->query_safe(
+				"UPDATE ${tbl} SET " .
+				"N_UPDATED = N_UPDATED + 1 " .
+				"WHERE I = ${i} % 50");
+		}
+		else
+		{
+			$conn->query_safe(
+				"SELECT * FROM ${tbl} FOR KEY SHARE");
+		}
+	}
+
+	for my $conn (@connections)
+	{
+		$conn->quit();
+	}
+}
+
+# This pair of calls will create more or less the same number of member and
+# offset segments.
+sub prep2
+{
+	my $node = shift;
+	my $tbl = shift;
+
+	$node->safe_psql('postgres',
+		"CREATE TABLE ${tbl}(BAR INT PRIMARY KEY, BAZ INT); " .
+		"CREATE OR REPLACE PROCEDURE MXIDFILLER(N_STEPS INT DEFAULT 1000) " .
+		"LANGUAGE PLPGSQL " .
+		"AS \$\$ " .
+		"BEGIN " .
+		"	FOR I IN 1..N_STEPS LOOP " .
+		"		UPDATE ${tbl} SET BAZ = RANDOM(1, 1000) " .
+		"		WHERE BAR IN (SELECT BAR FROM ${tbl} " .
+		"						TABLESAMPLE BERNOULLI(80)); " .
+		"		COMMIT; " .
+		"	END LOOP; " .
+		"END; \$\$; " .
+		"INSERT INTO ${tbl} (BAR, BAZ) " .
+		"SELECT ID, ID FROM GENERATE_SERIES(1, 1024) ID;");
+}
+
+sub fill2
+{
+	my $node = shift;
+	my $tbl = shift;
+	my $scale = shift // 1;
+
+	$node->safe_psql('postgres',
+		"BEGIN; " .
+		"SELECT * FROM ${tbl} FOR KEY SHARE; " .
+		"PREPARE TRANSACTION 'A'; " .
+		"CALL MXIDFILLER((365 * ${scale})::int); " .
+		"COMMIT PREPARED 'A';");
+}
+
+
+# generate around 2 offset segments and 55 member segments
+sub mxid_gen1
+{
+	my $node = shift;
+	my $tbl = shift;
+
+	prep($node, $tbl);
+	fill($node, $tbl);
+
+	$node->safe_psql('postgres', q(CHECKPOINT));
+}
+
+# generate around 10 offset segments and 12 member segments
+sub mxid_gen2
+{
+	my $node = shift;
+	my $tbl = shift;
+	my $scale = shift // 1;
+
+	prep2($node, $tbl);
+	fill2($node, $tbl, $scale);
+
+	$node->safe_psql('postgres', q(CHECKPOINT));
+}
+
+# Fetch latest multixact checkpoint values.
+sub multi_bounds
+{
+	my ($node) = @_;
+	my $path = $node->config_data('--bindir');
+	my ($stdout, $stderr) = run_command([
+									$path . '/pg_controldata',
+									$node->data_dir
+								]);
+	my @control_data = split("\n", $stdout);
+	my $next = undef;
+	my $oldest = undef;
+	my $next_offset = undef;
+
+	foreach (@control_data)
+	{
+		if ($_ =~ /^Latest checkpoint's NextMultiXactId:\s*(.*)$/mg)
+		{
+			$next = $1;
+			print ">>> @ node ". $node->name . ", " . $_ . "\n";
+		}
+
+		if ($_ =~ /^Latest checkpoint's oldestMultiXid:\s*(.*)$/mg)
+		{
+			$oldest = $1;
+			print ">>> @ node ". $node->name . ", " . $_ . "\n";
+		}
+
+		if ($_ =~ /^Latest checkpoint's NextMultiOffset:\s*(.*)$/mg)
+		{
+			$next_offset = $1;
+			print ">>> @ node ". $node->name . ", " . $_ . "\n";
+		}
+
+		if (defined($oldest) && defined($next) && defined($next_offset))
+		{
+			last;
+		}
+	}
+
+	die "Latest checkpoint's NextMultiXactId not found in control file!\n"
+	unless defined($next);
+
+	die "Latest checkpoint's oldestMultiXid not found in control file!\n"
+	unless defined($oldest);
+
+	die "Latest checkpoint's NextMultiOffset not found in control file!\n"
+	unless defined($next_offset);
+
+	return ($oldest, $next, $next_offset);
+}
+
+# Create node from existing bins.
+sub create_new_node
+{
+	my ($name, %params) = @_;
+
+	create_node(0, @_);
+}
+
+# Create node from ENV oldinstall
+sub create_old_node
+{
+	my ($name, %params) = @_;
+
+	if (!defined($ENV{oldinstall}))
+	{
+		die "oldinstall is not defined";
+	}
+
+	create_node(1, @_);
+}
+
+sub create_node
+{
+	my ($install_path_from_env, $name, %params) = @_;
+	my $scale = defined $params{scale} ? $params{scale} : 1;
+	my $multi = defined $params{multi} ? $params{multi} : undef;
+	my $offset = defined $params{offset} ? $params{offset} : undef;
+
+	my $node =
+		$install_path_from_env ?
+			PostgreSQL::Test::Cluster->new($name,
+					install_path => $ENV{oldinstall}) :
+			PostgreSQL::Test::Cluster->new($name);
+
+	$node->init(force_initdb => 1,
+		extra => [
+			$multi ? ('-m', $multi) : (),
+			$offset ? ('-o', $offset) : (),
+			('-k'),
+		]);
+
+	# Fixup MOX patch quirk
+	if ($multi)
+	{
+		unlink $node->data_dir . '/pg_multixact/offsets/0000';
+	}
+	if ($offset)
+	{
+		unlink $node->data_dir . '/pg_multixact/members/0000';
+	}
+
+	$node->append_conf('postgresql.conf', 'fsync = off');
+	$node->append_conf('postgresql.conf', 'max_prepared_transactions = 2');
+
+	$node->start();
+	mxid_gen2($node, 'FOO', $scale);
+	mxid_gen1($node, 'BAR', $scale);
+	$node->restart();
+	$node->safe_psql('postgres', q(SELECT * FROM FOO));		# just in case...
+	$node->safe_psql('postgres', q(SELECT * FROM BAR));
+	$node->safe_psql('postgres', q(CHECKPOINT));
+	$node->stop();
+
+	return $node;
+}
+
+sub do_upgrade
+{
+	my ($oldnode, $newnode) = @_;
+
+	command_ok(
+		[
+			'pg_upgrade', '--no-sync',
+			'-d', $oldnode->data_dir,
+			'-D', $newnode->data_dir,
+			'-b', $oldnode->config_data('--bindir'),
+			'-B', $newnode->config_data('--bindir'),
+			'-s', $newnode->host,
+			'-p', $oldnode->port,
+			'-P', $newnode->port,
+			'--check'
+		],
+		'run of pg_upgrade');
+
+	command_ok(
+		[
+			'pg_upgrade', '--no-sync',
+			'-d', $oldnode->data_dir,
+			'-D', $newnode->data_dir,
+			'-b', $oldnode->config_data('--bindir'),
+			'-B', $newnode->config_data('--bindir'),
+			'-s', $newnode->host,
+			'-p', $oldnode->port,
+			'-P', $newnode->port,
+			'--copy'
+		],
+		'run of pg_upgrade');
+
+	$oldnode->start();
+	$newnode->start();
+
+	my $oldfoo = $oldnode->safe_psql('postgres', q(SELECT * FROM FOO));
+	my $newfoo = $newnode->safe_psql('postgres', q(SELECT * FROM FOO));
+	is($oldfoo, $newfoo, "select foo eq");
+
+	my $oldbar = $oldnode->safe_psql('postgres', q(SELECT * FROM BAR));
+	my $newbar = $newnode->safe_psql('postgres', q(SELECT * FROM BAR));
+	is($oldbar, $newbar, "select bar eq");
+
+	$oldnode->stop();
+	$newnode->stop();
+
+	multi_bounds($oldnode);
+	multi_bounds($newnode);
+}
+
+my @TESTS = (
+	# tests without ENV oldinstall
+	0, 1, 2, 3, 4, 5, 6,
+	# tests with "real" pg_upgrade
+	100, 101, 102, 103, 104, 105, 106,
+	# self upgrade
+	1000,
+);
+
+# =============================================================================
+# Basic sanity tests on a NEW bin
+# =============================================================================
+
+# starts from the zero
+SKIP:
+{
+	my $TEST_NO = 0;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $node = create_new_node('simple_mo',
+						scale => 1);
+	multi_bounds($node);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# multi starts from the value
+SKIP:
+{
+	my $TEST_NO = 1;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $node = create_new_node('simple_Mo',
+						scale => 1.15,
+						multi => '0x123400');
+	multi_bounds($node);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# offsets starts from the value
+SKIP:
+{
+	my $TEST_NO = 2;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $node = create_new_node('simple_mO',
+						scale => 1.15,
+						offset => '0x432100');
+	multi_bounds($node);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# multi and offsets starts from the value
+SKIP:
+{
+	my $TEST_NO = 3;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $node = create_new_node('simple_MO',
+						scale => 1.15,
+						multi => '0xDEAD00', offset => '0xBEEF00');
+	multi_bounds($node);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# multi starts from the value, multi wrap
+SKIP:
+{
+	my $TEST_NO = 4;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $node = create_new_node('simple_Mo_wrap',
+						scale => 1.15,
+						multi => '0xFFFF7000');
+	multi_bounds($node);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# offsets starts from the value, offsets wrap
+SKIP:
+{
+	my $TEST_NO = 5;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $node = create_new_node('simple_mO_wrap',
+						scale => 1.15,
+						offset => '0xFFFFFC00');
+	multi_bounds($node);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# multi starts from the value, offsets starts from the value,
+# multi wrap, offsets wrap
+SKIP:
+{
+	my $TEST_NO = 6;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $node = create_new_node('simple_MO_wrap',
+						scale => 1.15,
+						multi => '0xFFFF7000', offset => '0xFFFFFC00');
+	multi_bounds($node);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# =============================================================================
+# pg_upgrade tests
+# =============================================================================
+
+# starts from the zero
+SKIP:
+{
+	my $TEST_NO = 100;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $dbname = 'mo';
+	my $oldnode = create_old_node("old_$dbname",
+						scale => 1);
+	my $newnode = PostgreSQL::Test::Cluster->new("new_$dbname");
+	$newnode->init();
+
+	do_upgrade($oldnode, $newnode);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# multi starts from the value
+SKIP:
+{
+	my $TEST_NO = 101;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $dbname = 'Mo';
+	my $oldnode = create_old_node("old_$dbname",
+						scale => 1.2,
+						multi => '0x123400');
+	my $newnode = PostgreSQL::Test::Cluster->new("new_$dbname");
+	$newnode->init();
+
+	do_upgrade($oldnode, $newnode);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# offsets starts from the value
+SKIP:
+{
+	my $TEST_NO = 102;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $dbname = 'mO';
+	my $oldnode = create_old_node("old_$dbname",
+						scale => 1.2,
+						offset => '0x432100');
+	my $newnode = PostgreSQL::Test::Cluster->new("new_$dbname");
+	$newnode->init();
+
+	do_upgrade($oldnode, $newnode);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# multi and offsets starts from the value
+SKIP:
+{
+	my $TEST_NO = 103;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $dbname = 'MO';
+	my $oldnode = create_old_node("old_$dbname",
+						scale => 1.2,
+						multi => '0xDEAD00', offset => '0xBEEF00');
+	my $newnode = PostgreSQL::Test::Cluster->new("new_$dbname");
+	$newnode->init();
+
+	do_upgrade($oldnode, $newnode);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# multi starts from the value, multi wrap
+SKIP:
+{
+	my $TEST_NO = 104;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $dbname = 'Mo_wrap';
+	my $oldnode = create_old_node("old_$dbname",
+						scale => 1.2,
+						multi => '0xFFFF7000');
+	my $newnode = PostgreSQL::Test::Cluster->new("new_$dbname");
+	$newnode->init();
+
+	do_upgrade($oldnode, $newnode);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# offsets starts from the value, offsets wrap
+SKIP:
+{
+	my $TEST_NO = 105;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $dbname = 'mO_wrap';
+	my $oldnode = create_old_node("old_$dbname",
+						scale => 1.2,
+						offset => '0xFFFFFC00');
+	my $newnode = PostgreSQL::Test::Cluster->new("new_$dbname");
+	$newnode->init();
+
+	do_upgrade($oldnode, $newnode);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# multi starts from the value, offsets starts from the value,
+# multi wrap, offsets wrap
+SKIP:
+{
+	my $TEST_NO = 106;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $dbname = 'MO_wrap';
+	my $oldnode = create_old_node("old_$dbname",
+						scale => 1.2,
+						multi => '0xFFFF7000', offset => '0xFFFFFC00');
+	my $newnode = PostgreSQL::Test::Cluster->new("new_$dbname");
+	$newnode->init();
+
+	do_upgrade($oldnode, $newnode);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# =============================================================================
+# Self upgrade
+# =============================================================================
+
+# starts from the zero
+SKIP:
+{
+	my $TEST_NO = 1000;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $dbname = 'self_upgrade';
+	my $oldnode = create_new_node("old_$dbname",
+						scale => 1);
+	my $newnode = PostgreSQL::Test::Cluster->new("new_$dbname");
+	$newnode->init();
+
+	do_upgrade($oldnode, $newnode);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+done_testing();
-- 
2.43.0
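
A note on the "wrap" test cases in 005_offset.pl above: they start the old cluster with an offset of 0xFFFFFC00 so that a modest number of new members pushes a 32-bit offset past its maximum. The sketch below illustrates only the arithmetic. The type and function names are hypothetical, not taken from the patch set, and it ignores the special handling of offset zero in the old on-disk format.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t OldOffset;		/* hypothetical: pre-patch 32-bit offset */
typedef uint64_t NewOffset;		/* hypothetical: 64-bit offset */

/* Advance a 32-bit offset; unsigned arithmetic wraps modulo 2^32. */
static OldOffset
advance_old(OldOffset start, uint32_t nmembers)
{
	return (OldOffset) (start + nmembers);
}

/* Advance a 64-bit offset; no wraparound for any realistic workload. */
static NewOffset
advance_new(NewOffset start, uint32_t nmembers)
{
	return start + nmembers;
}

/* Would adding nmembers overflow a 32-bit offset starting at "start"? */
static bool
old_offset_wraps(OldOffset start, uint32_t nmembers)
{
	return nmembers > UINT32_MAX - start;
}
```

Starting at 0xFFFFFC00, only 1023 more members fit before the 32-bit counter wraps, so adding 2048 members yields 0x400 with the old type and 0x100000400 with the 64-bit one.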

v13-0003-Make-pg_upgrade-convert-multixact-offsets.patch (application/octet-stream)
From e15f89143dd8aef70957e87d59c177fab66f9ce2 Mon Sep 17 00:00:00 2001
From: Maxim Orlov <m.orlov@postgrespro.ru>
Date: Tue, 13 Aug 2024 14:44:50 +0300
Subject: [PATCH v13 3/7] Make pg_upgrade convert multixact offsets.

Author: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Author: Maxim Orlov <orlovmg@gmail.com>
Author: Yura Sokolov <y.sokolov@postgrespro.ru>
---
 src/backend/access/transam/multixact.c |  35 +--
 src/bin/pg_upgrade/Makefile            |   3 +
 src/bin/pg_upgrade/meson.build         |   3 +
 src/bin/pg_upgrade/multixact_old.c     | 338 +++++++++++++++++++++++++
 src/bin/pg_upgrade/multixact_old.h     |  12 +
 src/bin/pg_upgrade/multixact_rewrite.c | 238 +++++++++++++++++
 src/bin/pg_upgrade/pg_upgrade.c        |  29 ++-
 src/bin/pg_upgrade/pg_upgrade.h        |  13 +-
 src/bin/pg_upgrade/slru_io.c           | 211 +++++++++++++++
 src/bin/pg_upgrade/slru_io.h           |  23 ++
 10 files changed, 873 insertions(+), 32 deletions(-)
 create mode 100644 src/bin/pg_upgrade/multixact_old.c
 create mode 100644 src/bin/pg_upgrade/multixact_old.h
 create mode 100644 src/bin/pg_upgrade/multixact_rewrite.c
 create mode 100644 src/bin/pg_upgrade/slru_io.c
 create mode 100644 src/bin/pg_upgrade/slru_io.h

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index cd9db52e95..d63ae17330 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1103,7 +1103,6 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
 	MultiXactOffset *offptr;
 	MultiXactOffset offset;
 	int			length;
-	int			truelength;
 	MultiXactId oldestMXact;
 	MultiXactId nextMXact;
 	MultiXactId tmpMXact;
@@ -1202,15 +1201,6 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
 	 * we have just for this; the process in charge will signal the CV as soon
 	 * as it has finished writing the multixact offset.
 	 *
-	 * 3. Because GetNewMultiXactId increments offset zero to offset one to
-	 * handle case #2, there is an ambiguity near the point of offset
-	 * wraparound.  If we see next multixact's offset is one, is that our
-	 * multixact's actual endpoint, or did it end at zero with a subsequent
-	 * increment?  We handle this using the knowledge that if the zero'th
-	 * member slot wasn't filled, it'll contain zero, and zero isn't a valid
-	 * transaction ID so it can't be a multixact member.  Therefore, if we
-	 * read a zero from the members array, just ignore it.
-	 *
 	 * This is all pretty messy, but the mess occurs only in infrequent corner
 	 * cases, so it seems better than holding the MultiXactGenLock for a long
 	 * time on every multixact creation.
@@ -1297,6 +1287,9 @@ retry:
 	LWLockRelease(lock);
 	lock = NULL;
 
+	/* A multixid with zero members should not happen */
+	Assert(length > 0);
+
 	/*
 	 * If we slept above, clean up state; it's no longer needed.
 	 */
@@ -1305,7 +1298,6 @@ retry:
 
 	ptr = (MultiXactMember *) palloc(length * sizeof(MultiXactMember));
 
-	truelength = 0;
 	prev_pageno = -1;
 	for (int i = 0; i < length; i++, offset++)
 	{
@@ -1343,36 +1335,27 @@ retry:
 		xactptr = (TransactionId *)
 			(MultiXactMemberCtl->shared->page_buffer[slotno] + memberoff);
 
-		if (!TransactionIdIsValid(*xactptr))
-		{
-			/* Corner case 3: we must be looking at unused slot zero */
-			Assert(offset == 0);
-			continue;
-		}
+		Assert(TransactionIdIsValid(*xactptr));
 
 		flagsoff = MXOffsetToFlagsOffset(offset);
 		bshift = MXOffsetToFlagsBitShift(offset);
 		flagsptr = (uint32 *) (MultiXactMemberCtl->shared->page_buffer[slotno] + flagsoff);
 
-		ptr[truelength].xid = *xactptr;
-		ptr[truelength].status = (*flagsptr >> bshift) & MXACT_MEMBER_XACT_BITMASK;
-		truelength++;
+		ptr[i].xid = *xactptr;
+		ptr[i].status = (*flagsptr >> bshift) & MXACT_MEMBER_XACT_BITMASK;
 	}
 
 	LWLockRelease(lock);
 
-	/* A multixid with zero members should not happen */
-	Assert(truelength > 0);
-
 	/*
 	 * Copy the result into the local cache.
 	 */
-	mXactCachePut(multi, truelength, ptr);
+	mXactCachePut(multi, length, ptr);
 
 	debug_elog3(DEBUG2, "GetMembers: no cache for %s",
-				mxid_to_string(multi, truelength, ptr));
+				mxid_to_string(multi, length, ptr));
 	*members = ptr;
-	return truelength;
+	return length;
 }
 
 /*
diff --git a/src/bin/pg_upgrade/Makefile b/src/bin/pg_upgrade/Makefile
index f83d2b5d30..b4ad01c00b 100644
--- a/src/bin/pg_upgrade/Makefile
+++ b/src/bin/pg_upgrade/Makefile
@@ -19,11 +19,14 @@ OBJS = \
 	file.o \
 	function.o \
 	info.o \
+	multixact_old.o \
+	multixact_rewrite.o \
 	option.o \
 	parallel.o \
 	pg_upgrade.o \
 	relfilenumber.o \
 	server.o \
+	slru_io.o \
 	tablespace.o \
 	task.o \
 	util.o \
diff --git a/src/bin/pg_upgrade/meson.build b/src/bin/pg_upgrade/meson.build
index cc2ba97d9a..76c8f2005d 100644
--- a/src/bin/pg_upgrade/meson.build
+++ b/src/bin/pg_upgrade/meson.build
@@ -8,11 +8,14 @@ pg_upgrade_sources = files(
   'file.c',
   'function.c',
   'info.c',
+  'multixact_old.c',
+  'multixact_rewrite.c',
   'option.c',
   'parallel.c',
   'pg_upgrade.c',
   'relfilenumber.c',
   'server.c',
+  'slru_io.c',
   'tablespace.c',
   'task.c',
   'util.c',
diff --git a/src/bin/pg_upgrade/multixact_old.c b/src/bin/pg_upgrade/multixact_old.c
new file mode 100644
index 0000000000..0442928e89
--- /dev/null
+++ b/src/bin/pg_upgrade/multixact_old.c
@@ -0,0 +1,338 @@
+/*
+ *	multixact_old.c
+ *
+ *	Support for reading pre-v18 format pg_multixact files
+ *
+ *	Copyright (c) 2024, PostgreSQL Global Development Group
+ *	src/bin/pg_upgrade/multixact_old.c
+ */
+
+#include "postgres_fe.h"
+
+#include "access/transam.h"
+#include "pg_upgrade.h"
+#include "multixact_old.h"
+#include "slru_io.h"
+
+/*
+ * Below are a bunch of definitions that are copy-pasted from multixact.c from
+ * version 17. They shadow the new definitions in access/multixact.h, so it's
+ * important that we *don't* include that here. That is a big reason this
+ * code has to be in a separate source file.
+ *
+ * All references to MultiXactOffset have been replaced with OldMultiXactOffset.
+ */
+typedef uint32 OldMultiXactOffset;
+
+#define FirstMultiXactId	((MultiXactId) 1)
+
+/*
+ * Possible multixact lock modes ("status").  The first four modes are for
+ * tuple locks (FOR KEY SHARE, FOR SHARE, FOR NO KEY UPDATE, FOR UPDATE); the
+ * next two are used for update and delete modes.
+ */
+typedef enum
+{
+	MultiXactStatusForKeyShare = 0x00,
+	MultiXactStatusForShare = 0x01,
+	MultiXactStatusForNoKeyUpdate = 0x02,
+	MultiXactStatusForUpdate = 0x03,
+	/* an update that doesn't touch "key" columns */
+	MultiXactStatusNoKeyUpdate = 0x04,
+	/* other updates, and delete */
+	MultiXactStatusUpdate = 0x05,
+} MultiXactStatus;
+
+/* does a status value correspond to a tuple update? */
+#define ISUPDATE_from_mxstatus(status) \
+			((status) > MultiXactStatusForUpdate)
+
+/*
+ * Defines for OldMultiXactOffset page sizes.  A page is the same BLCKSZ as is
+ * used everywhere else in Postgres.
+ *
+ * Note: because OldMultiXactOffsets are 32 bits and wrap around at 0xFFFFFFFF,
+ * MultiXact page numbering also wraps around at
+ * 0xFFFFFFFF/MULTIXACT_OFFSETS_PER_PAGE, and segment numbering at
+ * 0xFFFFFFFF/MULTIXACT_OFFSETS_PER_PAGE/SLRU_PAGES_PER_SEGMENT.  We need
+ * take no explicit notice of that fact in this module, except when comparing
+ * segment and page numbers in TruncateMultiXact (see
+ * OldMultiXactOffsetPagePrecedes).
+ */
+
+/* We need four bytes per offset */
+#define MULTIXACT_OFFSETS_PER_PAGE (BLCKSZ / sizeof(uint32))
+
+static inline int64
+MultiXactIdToOffsetPage(MultiXactId multi)
+{
+	return multi / MULTIXACT_OFFSETS_PER_PAGE;
+}
+
+static inline int
+MultiXactIdToOffsetEntry(MultiXactId multi)
+{
+	return multi % MULTIXACT_OFFSETS_PER_PAGE;
+}
+
+static inline int64
+MultiXactIdToOffsetSegment(MultiXactId multi)
+{
+	return MultiXactIdToOffsetPage(multi) / SLRU_PAGES_PER_SEGMENT;
+}
+
+/*
+ * The situation for members is a bit more complex: we store one byte of
+ * additional flag bits for each TransactionId.  To do this without getting
+ * into alignment issues, we store four bytes of flags, and then the
+ * corresponding 4 Xids.  Each such 5-word (20-byte) set we call a "group";
+ * groups are stored whole in pages.  Thus, with 8kB BLCKSZ, we keep 409 groups
+ * per page.  This wastes 12 bytes per page, but that's OK -- simplicity (and
+ * performance) trumps space efficiency here.
+ *
+ * Note that the "offset" macros work with byte offset, not array indexes, so
+ * arithmetic must be done using "char *" pointers.
+ */
+/* We need eight bits per xact, so one xact fits in a byte */
+#define MXACT_MEMBER_BITS_PER_XACT			8
+#define MXACT_MEMBER_FLAGS_PER_BYTE			1
+#define MXACT_MEMBER_XACT_BITMASK	((1 << MXACT_MEMBER_BITS_PER_XACT) - 1)
+
+/* how many full bytes of flags are there in a group? */
+#define MULTIXACT_FLAGBYTES_PER_GROUP		4
+#define MULTIXACT_MEMBERS_PER_MEMBERGROUP	\
+	(MULTIXACT_FLAGBYTES_PER_GROUP * MXACT_MEMBER_FLAGS_PER_BYTE)
+/* size in bytes of a complete group */
+#define MULTIXACT_MEMBERGROUP_SIZE \
+	(sizeof(TransactionId) * MULTIXACT_MEMBERS_PER_MEMBERGROUP + MULTIXACT_FLAGBYTES_PER_GROUP)
+#define MULTIXACT_MEMBERGROUPS_PER_PAGE (BLCKSZ / MULTIXACT_MEMBERGROUP_SIZE)
+#define MULTIXACT_MEMBERS_PER_PAGE	\
+	(MULTIXACT_MEMBERGROUPS_PER_PAGE * MULTIXACT_MEMBERS_PER_MEMBERGROUP)
+
+/* page in which a member is to be found */
+static inline int64
+MXOffsetToMemberPage(OldMultiXactOffset offset)
+{
+	return offset / MULTIXACT_MEMBERS_PER_PAGE;
+}
+
+/* Location (byte offset within page) of flag word for a given member */
+static inline int
+MXOffsetToFlagsOffset(OldMultiXactOffset offset)
+{
+	OldMultiXactOffset group = offset / MULTIXACT_MEMBERS_PER_MEMBERGROUP;
+	int			grouponpg = group % MULTIXACT_MEMBERGROUPS_PER_PAGE;
+	int			byteoff = grouponpg * MULTIXACT_MEMBERGROUP_SIZE;
+
+	return byteoff;
+}
+
+static inline int
+MXOffsetToFlagsBitShift(OldMultiXactOffset offset)
+{
+	int			member_in_group = offset % MULTIXACT_MEMBERS_PER_MEMBERGROUP;
+	int			bshift = member_in_group * MXACT_MEMBER_BITS_PER_XACT;
+
+	return bshift;
+}
+
+/* Location (byte offset within page) of TransactionId of given member */
+static inline int
+MXOffsetToMemberOffset(OldMultiXactOffset offset)
+{
+	int			member_in_group = offset % MULTIXACT_MEMBERS_PER_MEMBERGROUP;
+
+	return MXOffsetToFlagsOffset(offset) +
+		MULTIXACT_FLAGBYTES_PER_GROUP +
+		member_in_group * sizeof(TransactionId);
+}
+
+typedef struct OldMultiXactReader
+{
+	MultiXactId nextMXact;
+	uint32	nextOffset;
+
+	SlruSegState *offset;
+	SlruSegState *members;
+} OldMultiXactReader;
+
+OldMultiXactReader *
+StartOldMultiXactRead(void)
+{
+	OldMultiXactReader *state;
+	char	   *dir;
+
+	state = pg_malloc(sizeof(OldMultiXactReader));
+	state->nextMXact = old_cluster.controldata.chkpnt_nxtmulti;
+	state->nextOffset = old_cluster.controldata.chkpnt_nxtmxoff;
+
+	dir = psprintf("%s/pg_multixact/offsets", old_cluster.pgdata);
+	state->offset = OpenSlruRead(dir);
+	pg_free(dir);
+
+	dir = psprintf("%s/pg_multixact/members", old_cluster.pgdata);
+	state->members = OpenSlruRead(dir);
+	pg_free(dir);
+
+	return state;
+}
+
+/*
+ * This is a simplified version of the GetMultiXactIdMembers() server function.
+ *
+ * - Only return the updating member, if any. Upgrade only cares about the updaters.
+ *   If there is no updating member, return the first locking-only member. We don't
+ *   have any way to represent "no members", but we also don't need to preserve all
+ *   the locking members.
+ *
+ * - We don't need to worry about locking and some corner cases because there's
+ *   no concurrent activity.
+ */
+void
+GetOldMultiXactIdSingleMember(OldMultiXactReader *state, MultiXactId multi,
+							  TransactionId *result, bool *isupdate)
+{
+	TransactionId result_xid;
+	bool		result_isupdate;
+	int64		pageno;
+	int64		prev_pageno;
+	int			entryno;
+	OldMultiXactOffset *offptr;
+	OldMultiXactOffset offset;
+	int			length;
+	MultiXactId nextMXact;
+	MultiXactId tmpMXact;
+	OldMultiXactOffset nextOffset;
+	char	   *buf;
+
+	nextMXact = state->nextMXact;
+	nextOffset = state->nextOffset;
+
+	/*
+	 * Find out the offset at which we need to start reading MultiXactMembers
+	 * and the number of members in the multixact.  We determine the latter as
+	 * the difference between this multixact's starting offset and the next
+	 * one's.  However, there are some corner cases to worry about:
+	 *
+	 * 1. This multixact may be the latest one created, in which case there is
+	 * no next one to look at.  In this case the nextOffset value we just
+	 * saved is the correct endpoint.
+	 *
+	 * 2. (this cannot happen during upgrade)
+	 *
+	 * 3. Because GetNewMultiXactId increments offset zero to offset one to
+	 * handle case #2, there is an ambiguity near the point of offset
+	 * wraparound.  If we see next multixact's offset is one, is that our
+	 * multixact's actual endpoint, or did it end at zero with a subsequent
+	 * increment?  We handle this using the knowledge that if the zero'th
+	 * member slot wasn't filled, it'll contain zero, and zero isn't a valid
+	 * transaction ID so it can't be a multixact member.  Therefore, if we
+	 * read a zero from the members array, just ignore it.
+	 */
+	pageno = MultiXactIdToOffsetPage(multi);
+	entryno = MultiXactIdToOffsetEntry(multi);
+
+	buf = SlruReadSwitchPage(state->offset, pageno);
+	offptr = (OldMultiXactOffset *) buf;
+	offptr += entryno;
+	offset = *offptr;
+
+	Assert(offset != 0);
+
+	/*
+	 * Use the same increment rule as GetNewMultiXactId(), that is, don't
+	 * handle wraparound explicitly until needed.
+	 */
+	tmpMXact = multi + 1;
+
+	if (nextMXact == tmpMXact)
+	{
+		/* Corner case 1: there is no next multixact */
+		length = nextOffset - offset;
+	}
+	else
+	{
+		OldMultiXactOffset nextMXOffset;
+
+		/* handle wraparound if needed */
+		if (tmpMXact < FirstMultiXactId)
+			tmpMXact = FirstMultiXactId;
+
+		prev_pageno = pageno;
+
+		pageno = MultiXactIdToOffsetPage(tmpMXact);
+		entryno = MultiXactIdToOffsetEntry(tmpMXact);
+
+		if (pageno != prev_pageno)
+		{
+			buf = SlruReadSwitchPage(state->offset, pageno);
+		}
+
+		offptr = (OldMultiXactOffset *) buf;
+		offptr += entryno;
+		nextMXOffset = *offptr;
+
+		if (nextMXOffset == 0)
+		{
+			/* Corner case 2: next multixact is still being filled in */
+			Assert(false); /* shouldn't happen during upgrade */
+		}
+
+		length = nextMXOffset - offset;
+	}
+
+	result_xid = InvalidTransactionId;
+	result_isupdate = false;
+	prev_pageno = -1;
+	for (int i = 0; i < length; i++, offset++)
+	{
+		TransactionId *xactptr;
+		uint32	   *flagsptr;
+		int			flagsoff;
+		int			bshift;
+		int			memberoff;
+		MultiXactStatus status;
+
+		pageno = MXOffsetToMemberPage(offset);
+		memberoff = MXOffsetToMemberOffset(offset);
+
+		if (pageno != prev_pageno)
+		{
+			buf = SlruReadSwitchPage(state->members, pageno);
+			prev_pageno = pageno;
+		}
+
+		xactptr = (TransactionId *) (buf + memberoff);
+
+		if (!TransactionIdIsValid(*xactptr))
+		{
+			/* Corner case 3: we must be looking at unused slot zero */
+			Assert(offset == 0);
+			continue;
+		}
+
+		flagsoff = MXOffsetToFlagsOffset(offset);
+		bshift = MXOffsetToFlagsBitShift(offset);
+		flagsptr = (uint32 *) (buf + flagsoff);
+
+		status = (*flagsptr >> bshift) & MXACT_MEMBER_XACT_BITMASK;
+
+		/* Verify that there is a single update Xid among the given members. */
+		if (ISUPDATE_from_mxstatus(status))
+		{
+			if (result_isupdate)
+				pg_fatal("multixact %u has more than one updating member",
+						 multi);
+			result_xid = *xactptr;
+			result_isupdate = true;
+		}
+		else if (!TransactionIdIsValid(result_xid))
+			result_xid = *xactptr;
+	}
+
+	/* A multixid with zero members should not happen */
+	Assert(TransactionIdIsValid(result_xid));
+
+	*result = result_xid;
+	*isupdate = result_isupdate;
+}
diff --git a/src/bin/pg_upgrade/multixact_old.h b/src/bin/pg_upgrade/multixact_old.h
new file mode 100644
index 0000000000..70800c1cda
--- /dev/null
+++ b/src/bin/pg_upgrade/multixact_old.h
@@ -0,0 +1,12 @@
+/*
+ *	multixact_old.h
+ *
+ *	Copyright (c) 2010-2024, PostgreSQL Global Development Group
+ *	src/bin/pg_upgrade/multixact_old.h
+ */
+
+typedef struct OldMultiXactReader OldMultiXactReader;
+
+extern OldMultiXactReader *StartOldMultiXactRead(void);
+extern void GetOldMultiXactIdSingleMember(OldMultiXactReader *state, MultiXactId multi,
+										  TransactionId *result, bool *isupdate);
diff --git a/src/bin/pg_upgrade/multixact_rewrite.c b/src/bin/pg_upgrade/multixact_rewrite.c
new file mode 100644
index 0000000000..8c3f538cc9
--- /dev/null
+++ b/src/bin/pg_upgrade/multixact_rewrite.c
@@ -0,0 +1,238 @@
+/*
+ *	multixact_rewrite.c
+ *
+ *	Rewrite pre-v18 multixacts to new format with 64-bit MultiXactOffsets
+ *
+ *	Copyright (c) 2024, PostgreSQL Global Development Group
+ *	src/bin/pg_upgrade/multixact_rewrite.c
+ */
+
+#include "postgres_fe.h"
+
+#include "multixact_old.h"
+#include "pg_upgrade.h"
+#include "slru_io.h"
+
+#include "access/multixact.h"
+#include "access/multixact_internal.h"
+
+typedef struct
+{
+	MultiXactId nextMXact;
+	MultiXactOffset nextOffset;
+
+	SlruSegState *offset;
+	SlruSegState *members;
+} MultiXactWriter;
+
+static MultiXactWriter *StartMultiXactWrite(MultiXactId firstMulti, MultiXactOffset firstOffset);
+static MultiXactId GetNewMultiXactId(MultiXactWriter *state, int nmembers, MultiXactOffset *offset);
+static void RecordNewMultiXact(MultiXactWriter *state,
+							   MultiXactOffset offset,
+							   MultiXactId multi,
+							   int nmembers, MultiXactMember *members);
+static void CloseMultiXactWrite(MultiXactWriter *state);
+
+
+/*
+ * Convert pg_multixact/offsets and /members to new format with 64-bit offsets.
+ */
+void
+convert_multixacts(MultiXactId *new_nxtmulti, MultiXactOffset *new_nxtmxoff)
+{
+	MultiXactWriter	   *new_writer;
+	MultiXactId			oldest_multi = old_cluster.controldata.chkpnt_oldstMulti,
+						next_multi = old_cluster.controldata.chkpnt_nxtmulti,
+						multi;
+	OldMultiXactReader *old_reader;
+
+	if (next_multi < FirstMultiXactId)
+		next_multi = FirstMultiXactId;
+
+	old_reader = StartOldMultiXactRead();
+	new_writer = StartMultiXactWrite(oldest_multi, 1);
+
+	/*
+	 * Read multixids from old files one by one, and write them back in the
+	 * new format.
+	 *
+	 * The locking-only XIDs that may be part of multi-xids don't matter after
+	 * upgrade, as there can be no transactions running across upgrade. So as
+	 * a little optimization, we only read one member from each multixid: the
+	 * updating one or, if there was no update, arbitrarily the first
+	 * locking xid.
+	 */
+	for (multi = oldest_multi; multi != next_multi;)
+	{
+		TransactionId xid;
+		bool		isupdate;
+		MultiXactMember member;
+		MultiXactId newmulti PG_USED_FOR_ASSERTS_ONLY;
+		MultiXactOffset offset;
+
+		/* Read the old multixid */
+		GetOldMultiXactIdSingleMember(old_reader, multi, &xid, &isupdate);
+
+		/* Write it out in new format */
+		member.xid = xid;
+		member.status = isupdate ? MultiXactStatusUpdate : MultiXactStatusForKeyShare;
+		newmulti = GetNewMultiXactId(new_writer, 1, &offset);
+		Assert(newmulti == multi);
+		RecordNewMultiXact(new_writer, offset, multi, 1, &member);
+
+		multi++;
+		if (multi < FirstMultiXactId)
+			multi = FirstMultiXactId;
+	}
+
+	/*
+	 * Update the nextMXact/Offset values in the control file to match what we
+	 * wrote. The nextMXact should be unchanged, but because we ignored the
+	 * locking-only members, the nextOffset will be different.
+	 */
+	Assert(new_writer->nextMXact == next_multi);
+	*new_nxtmulti = next_multi;
+	*new_nxtmxoff = new_writer->nextOffset;
+
+	/* Release resources */
+	CloseMultiXactWrite(new_writer);
+}
+
+/* Support routines for writing the new format */
+
+static MultiXactWriter *
+StartMultiXactWrite(MultiXactId firstMulti, MultiXactOffset firstOffset)
+{
+	MultiXactWriter *state;
+	char	   *dir;
+
+	state = pg_malloc(sizeof(MultiXactWriter));
+	state->nextMXact = firstMulti;
+	state->nextOffset = firstOffset;
+
+	dir = psprintf("%s/pg_multixact/offsets", new_cluster.pgdata);
+	state->offset = OpenSlruWrite(dir, MultiXactIdToOffsetPage(firstMulti));
+	pg_free(dir);
+
+	dir = psprintf("%s/pg_multixact/members", new_cluster.pgdata);
+	state->members = OpenSlruWrite(dir, MXOffsetToMemberPage(firstOffset));
+	pg_free(dir);
+
+	return state;
+}
+
+static void
+CloseMultiXactWrite(MultiXactWriter *state)
+{
+	CloseSlruWrite(state->offset);
+	CloseSlruWrite(state->members);
+	pg_free(state);
+}
+
+/*
+ * Simplified copy of the corresponding server function
+ */
+static MultiXactId
+GetNewMultiXactId(MultiXactWriter *state, int nmembers, MultiXactOffset *offset)
+{
+	MultiXactId		result;
+
+	/* Handle wraparound of the nextMXact counter */
+	if (state->nextMXact < FirstMultiXactId)
+		state->nextMXact = FirstMultiXactId;
+
+	/* Assign the MXID */
+	result = state->nextMXact;
+
+	/*
+	 * Reserve the members space, similarly to above.
+	 */
+	*offset = state->nextOffset;
+
+	/*
+	 * Advance counters.  As in GetNewTransactionId(), this must not happen
+	 * until after file extension has succeeded!
+	 *
+	 * We don't care about MultiXactId wraparound here; it will be handled by
+	 * the next iteration.  But note that nextMXact may be InvalidMultiXactId
+	 * or the first value on a segment-beginning page after this routine
+	 * exits, so anyone else looking at the variable must be prepared to deal
+	 * with either case.  Similarly, nextOffset may be zero, but we won't use
+	 * that as the actual start offset of the next multixact.
+	 */
+	(state->nextMXact)++;
+
+	state->nextOffset += nmembers;
+
+	return result;
+}
+
+/*
+ * Write a new multixact with members.
+ *
+ * Simplified version of the corresponding server function.
+ */
+static void
+RecordNewMultiXact(MultiXactWriter *state, MultiXactOffset offset,
+				   MultiXactId multi,
+				   int nmembers, MultiXactMember *members)
+{
+	int64		pageno;
+	int64		prev_pageno;
+	int			entryno;
+
+	char	   *buf;
+	MultiXactOffset *offptr;
+
+	pageno = MultiXactIdToOffsetPage(multi);
+	entryno = MultiXactIdToOffsetEntry(multi);
+
+	/*
+	 * Store this multixact's starting offset in the offsets SLRU.  Unlike the
+	 * server's RecordNewMultiXact(), there is no SimpleLruReadPage() call
+	 * here: SlruWriteSwitchPage() flushes the previously buffered page, if
+	 * any, and returns a zeroed buffer when stepping onto a new page.  I/O
+	 * errors are reported by slru_io.c with pg_fatal() and the file name.
+	 */
+	buf = SlruWriteSwitchPage(state->offset, pageno);
+	offptr = (MultiXactOffset *) buf;
+	offptr += entryno;
+
+	*offptr = offset;
+
+	prev_pageno = -1;
+
+	for (int i = 0; i < nmembers; i++, offset++)
+	{
+		TransactionId *memberptr;
+		uint32	   *flagsptr;
+		uint32		flagsval;
+		int			bshift;
+		int			flagsoff;
+		int			memberoff;
+
+		Assert(members[i].status <= MultiXactStatusUpdate);
+
+		pageno = MXOffsetToMemberPage(offset);
+		memberoff = MXOffsetToMemberOffset(offset);
+		flagsoff = MXOffsetToFlagsOffset(offset);
+		bshift = MXOffsetToFlagsBitShift(offset);
+
+		if (pageno != prev_pageno)
+		{
+			buf = SlruWriteSwitchPage(state->members, pageno);
+			prev_pageno = pageno;
+		}
+
+		memberptr = (TransactionId *) (buf + memberoff);
+
+		*memberptr = members[i].xid;
+
+		flagsptr = (uint32 *) (buf + flagsoff);
+
+		flagsval = *flagsptr;
+		flagsval &= ~(((1 << MXACT_MEMBER_BITS_PER_XACT) - 1) << bshift);
+		flagsval |= (members[i].status << bshift);
+		*flagsptr = flagsval;
+	}
+}
diff --git a/src/bin/pg_upgrade/pg_upgrade.c b/src/bin/pg_upgrade/pg_upgrade.c
index 36c7f3879d..9bf191b984 100644
--- a/src/bin/pg_upgrade/pg_upgrade.c
+++ b/src/bin/pg_upgrade/pg_upgrade.c
@@ -750,8 +750,27 @@ copy_xact_xlog_xid(void)
 	if (old_cluster.controldata.cat_ver >= MULTIXACT_FORMATCHANGE_CAT_VER &&
 		new_cluster.controldata.cat_ver >= MULTIXACT_FORMATCHANGE_CAT_VER)
 	{
-		copy_subdir_files("pg_multixact/offsets", "pg_multixact/offsets");
-		copy_subdir_files("pg_multixact/members", "pg_multixact/members");
+		MultiXactId new_nxtmulti = old_cluster.controldata.chkpnt_nxtmulti;
+		MultiXactOffset new_nxtmxoff = old_cluster.controldata.chkpnt_nxtmxoff;
+
+		/*
+		 * If the old cluster predates MULTIXACTOFFSET_FORMATCHANGE_CAT_VER, it
+		 * has 32-bit multixact offsets, which must be converted to 64-bit.
+		 */
+		if (old_cluster.controldata.cat_ver < MULTIXACTOFFSET_FORMATCHANGE_CAT_VER &&
+			new_cluster.controldata.cat_ver >= MULTIXACTOFFSET_FORMATCHANGE_CAT_VER)
+		{
+			remove_new_subdir("pg_multixact/members", false);
+			remove_new_subdir("pg_multixact/offsets", false);
+			prep_status("Converting pg_multixact/offsets to 64-bit");
+			convert_multixacts(&new_nxtmulti, &new_nxtmxoff);
+			check_ok();
+		}
+		else
+		{
+			copy_subdir_files("pg_multixact/offsets", "pg_multixact/offsets");
+			copy_subdir_files("pg_multixact/members", "pg_multixact/members");
+		}
 
 		prep_status("Setting next multixact ID and offset for new cluster");
 
@@ -760,10 +779,10 @@ copy_xact_xlog_xid(void)
 		 * counters here and the oldest multi present on system.
 		 */
 		exec_prog(UTILITY_LOG_FILE, NULL, true, true,
-				  "\"%s/pg_resetwal\" -O %u -m %u,%u \"%s\"",
+				  "\"%s/pg_resetwal\" -O %llu -m %u,%u \"%s\"",
 				  new_cluster.bindir,
-				  old_cluster.controldata.chkpnt_nxtmxoff,
-				  old_cluster.controldata.chkpnt_nxtmulti,
+				  (unsigned long long) new_nxtmxoff,
+				  new_nxtmulti,
 				  old_cluster.controldata.chkpnt_oldstMulti,
 				  new_cluster.pgdata);
 		check_ok();
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index 0cdd675e4f..9b3d645b08 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -114,6 +114,13 @@ extern char *output_files[];
  */
 #define MULTIXACT_FORMATCHANGE_CAT_VER 201301231
 
+/*
+ * Switch from 32-bit to 64-bit for multixid offsets.
+ *
+ * XXX: should be changed to the actual CATALOG_VERSION_NO on commit.
+ */
+#define MULTIXACTOFFSET_FORMATCHANGE_CAT_VER 202409041
+
 /*
  * large object chunk size added to pg_controldata,
  * commit 5f93c37805e7485488480916b4585e098d3cc883
@@ -230,7 +237,7 @@ typedef struct
 	uint32		chkpnt_nxtepoch;
 	uint32		chkpnt_nxtoid;
 	uint32		chkpnt_nxtmulti;
-	uint32		chkpnt_nxtmxoff;
+	uint64		chkpnt_nxtmxoff;
 	uint32		chkpnt_oldstMulti;
 	uint32		chkpnt_oldstxid;
 	uint32		align;
@@ -515,3 +522,7 @@ typedef struct
 	FILE	   *file;
 	char		path[MAXPGPATH];
 } UpgradeTaskReport;
+
+/* multixact_rewrite.c */
+
+void convert_multixacts(MultiXactId *new_nxtmulti, MultiXactOffset *new_nxtmxoff);
diff --git a/src/bin/pg_upgrade/slru_io.c b/src/bin/pg_upgrade/slru_io.c
new file mode 100644
index 0000000000..87acf16732
--- /dev/null
+++ b/src/bin/pg_upgrade/slru_io.c
@@ -0,0 +1,211 @@
+/*
+ *	slru_io.c
+ *
+ *	Routines for reading and writing SLRU files during upgrade.
+ *
+ *	Copyright (c) 2024, PostgreSQL Global Development Group
+ *	src/bin/pg_upgrade/slru_io.c
+ */
+
+#include "postgres_fe.h"
+
+#include <fcntl.h>
+
+#include "pg_upgrade.h"
+#include "slru_io.h"
+
+#include "common/file_perm.h"
+#include "common/file_utils.h"
+#include "port/pg_iovec.h"
+
+/*
+ * State for reading or writing an SLRU, with a one page buffer.
+ */
+typedef struct SlruSegState
+{
+	bool		writing;
+
+	char	   *dir;
+	char	   *fn;
+	int			fd;
+	int64		segno;
+	uint64		pageno;
+
+	PGAlignedBlock buf;
+} SlruSegState;
+
+static void SlruFlush(SlruSegState *state);
+
+
+SlruSegState *
+OpenSlruRead(char *dir)
+{
+	SlruSegState *state;
+
+	state = pg_malloc(sizeof(SlruSegState));
+	state->writing = false;
+	state->segno = -1;
+	state->pageno = 0;
+	state->dir = pstrdup(dir);
+	state->fd = -1;
+	state->fn = NULL;
+
+	return state;
+}
+
+void
+CloseSlruRead(SlruSegState *state)
+{
+	Assert(!state->writing);
+	close(state->fd);
+	pg_free(state);
+}
+
+SlruSegState *
+OpenSlruWrite(char *dir, int64 startPageno)
+{
+	SlruSegState *state;
+
+	state = pg_malloc(sizeof(SlruSegState));
+	state->writing = true;
+	state->segno = -1;
+	state->pageno = 0;
+	state->dir = pstrdup(dir);
+	state->fd = -1;
+	state->fn = NULL;
+
+	return state;
+}
+
+void
+CloseSlruWrite(SlruSegState *state)
+{
+	Assert(state->writing);
+	SlruFlush(state);
+
+	close(state->fd);
+	pg_free(state);
+}
+
+static void
+SlruFlush(SlruSegState *state)
+{
+	struct iovec iovec = {
+		.iov_base = &state->buf,
+		.iov_len = BLCKSZ,
+	};
+	off_t		offset;
+
+	if (state->segno == -1)
+		return;
+
+	offset = (state->pageno % SLRU_PAGES_PER_SEGMENT) * BLCKSZ;
+
+	if (pg_pwritev_with_retry(state->fd, &iovec, 1, offset) < 0)
+		pg_fatal("could not write file \"%s\": %m", state->fn);
+}
+
+/*
+ * Open the given page for writing.
+ *
+ * NOTE: This uses O_EXCL when stepping to a new segment, so this assumes that
+ * each segment is written in full before moving on to next one. This
+ * limitation would be easy to lift if needed, but it fits the usage pattern
+ * of current callers.
+ */
+char *
+SlruWriteSwitchPage(SlruSegState *state, uint64 pageno)
+{
+	int64		segno;
+	off_t		offset;
+
+	if (state->segno != -1 && pageno == state->pageno)
+		return state->buf.data;
+
+	segno = pageno / SLRU_PAGES_PER_SEGMENT;
+	offset = (pageno % SLRU_PAGES_PER_SEGMENT) * BLCKSZ;
+
+	SlruFlush(state);
+	memset(state->buf.data, 0, BLCKSZ);
+
+	if (segno != state->segno)
+	{
+		if (state->segno != -1)
+		{
+			close(state->fd);
+			state->fd = -1;
+			pg_free(state->fn);
+			state->fn = NULL;
+		}
+
+		/* Create the segment */
+		state->fn = psprintf("%s/%04X", state->dir, (unsigned int) segno);
+		if ((state->fd = open(state->fn, O_RDWR | O_CREAT | O_EXCL | PG_BINARY,
+							  pg_file_create_mode)) < 0)
+		{
+			pg_fatal("could not create file \"%s\": %m", state->fn);
+		}
+		state->segno = segno;
+
+		if (offset > 0)
+		{
+			if (pg_pwrite_zeros(state->fd, offset, 0) < 0)
+				pg_fatal("could not write file \"%s\": %m", state->fn);
+		}
+	}
+
+	state->pageno = pageno;
+	return state->buf.data;
+}
+
+/*
+ * Open given page for reading.
+ *
+ * Reading can be done in random order.
+ */
+char *
+SlruReadSwitchPage(SlruSegState *state, uint64 pageno)
+{
+	int64		segno;
+
+	if (state->segno != -1 && pageno == state->pageno)
+		return state->buf.data;
+
+	segno = pageno / SLRU_PAGES_PER_SEGMENT;
+
+	if (segno != state->segno)
+	{
+		if (state->segno != -1)
+		{
+			close(state->fd);
+			state->fd = -1;
+			pg_free(state->fn);
+			state->fn = NULL;
+		}
+
+		/* Open new segment */
+		state->fn = psprintf("%s/%04X", state->dir, (unsigned int) segno);
+		if ((state->fd = open(state->fn, O_RDONLY | PG_BINARY, 0)) < 0)
+		{
+			pg_fatal("could not open file \"%s\": %m", state->fn);
+		}
+		state->segno = segno;
+	}
+
+	{
+		struct iovec iovec = {
+			.iov_base = &state->buf,
+			.iov_len = BLCKSZ,
+		};
+		off_t		offset;
+
+		offset = (pageno % SLRU_PAGES_PER_SEGMENT) * BLCKSZ;
+
+		if (pg_preadv(state->fd, &iovec, 1, offset) < 0)
+			pg_fatal("could not read file \"%s\": %m", state->fn);
+
+		state->pageno = pageno;
+	}
+
+	return state->buf.data;
+}
diff --git a/src/bin/pg_upgrade/slru_io.h b/src/bin/pg_upgrade/slru_io.h
new file mode 100644
index 0000000000..e1a9c06313
--- /dev/null
+++ b/src/bin/pg_upgrade/slru_io.h
@@ -0,0 +1,23 @@
+/*
+ *	slru_io.h
+ *
+ *	Copyright (c) 2010-2024, PostgreSQL Global Development Group
+ *	src/bin/pg_upgrade/slru_io.h
+ */
+
+/* XXX: copied from slru.h */
+#define SLRU_PAGES_PER_SEGMENT	32
+
+/*
+ * Some kind of iterator associated with a particular SLRU segment.  The idea is
+ * to specify the segment and page number and then move through the pages.
+ */
+typedef struct SlruSegState SlruSegState;
+
+extern SlruSegState *OpenSlruRead(char *dir);
+extern void CloseSlruRead(SlruSegState *state);
+extern char *SlruReadSwitchPage(SlruSegState *state, uint64 pageno);
+
+extern SlruSegState *OpenSlruWrite(char *dir, int64 startPageno);
+extern void CloseSlruWrite(SlruSegState *state);
+extern char *SlruWriteSwitchPage(SlruSegState *state, uint64 pageno);
-- 
2.43.0
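
For readers following the pre-v18 members layout that multixact_old.c copies
from version 17, the offset arithmetic can be sketched as a standalone helper.
This is only an illustration, not part of the patch: the SKETCH_* names and
sketch_locate_member() are hypothetical, and the numbers assume the default
8kB BLCKSZ and a 4-byte TransactionId.

```c
#include <assert.h>
#include <stdint.h>

/* Assumptions (not taken from a real header): 8kB pages, 4-byte xids. */
#define SKETCH_BLCKSZ				8192
#define SKETCH_XID_SIZE				4
#define SKETCH_FLAGBYTES_PER_GROUP	4	/* one flag byte per member */
#define SKETCH_MEMBERS_PER_GROUP	4
/* 4 flag bytes + 4 xids = 20 bytes per group */
#define SKETCH_GROUP_SIZE \
	(SKETCH_XID_SIZE * SKETCH_MEMBERS_PER_GROUP + SKETCH_FLAGBYTES_PER_GROUP)
/* 8192 / 20 = 409 groups, leaving 12 bytes unused per page */
#define SKETCH_GROUPS_PER_PAGE		(SKETCH_BLCKSZ / SKETCH_GROUP_SIZE)
/* 409 * 4 = 1636 members per page */
#define SKETCH_MEMBERS_PER_PAGE \
	(SKETCH_GROUPS_PER_PAGE * SKETCH_MEMBERS_PER_GROUP)

/* Where one member lives: page number plus byte offsets within that page. */
typedef struct
{
	int64_t		pageno;		/* members page holding this offset */
	int			flags_off;	/* byte offset of the group's flag word */
	int			bit_shift;	/* shift of this member's flag byte */
	int			xid_off;	/* byte offset of this member's xid */
} sketch_member_loc;

static sketch_member_loc
sketch_locate_member(uint32_t offset)
{
	sketch_member_loc loc;
	uint32_t	group = offset / SKETCH_MEMBERS_PER_GROUP;
	int			member_in_group = (int) (offset % SKETCH_MEMBERS_PER_GROUP);

	loc.pageno = offset / SKETCH_MEMBERS_PER_PAGE;
	loc.flags_off = (int) (group % SKETCH_GROUPS_PER_PAGE) * SKETCH_GROUP_SIZE;
	loc.bit_shift = member_in_group * 8;
	loc.xid_off = loc.flags_off + SKETCH_FLAGBYTES_PER_GROUP +
		member_in_group * SKETCH_XID_SIZE;

	/* The xid must lie entirely within the page. */
	assert(loc.xid_off + SKETCH_XID_SIZE <= SKETCH_BLCKSZ);
	return loc;
}
```

For example, sketch_locate_member(1635) describes the last member on page 0:
its xid ends at byte 8180, which is where the 12 unused bytes per page come
from, and offset 1636 starts again at the beginning of page 1.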

v13-0004-TEST-initdb-option-to-initialize-cluster-with-no.patch.txt (text/plain; charset=US-ASCII)
From be906a9c2161e6972a396a9d283bb76ca023a808 Mon Sep 17 00:00:00 2001
From: Maxim Orlov <m.orlov@postgrespro.ru>
Date: Wed, 4 May 2022 15:53:36 +0300
Subject: [PATCH v13 4/7] TEST: initdb option to initialize cluster with
 non-standard xid/mxid/mxoff

Until now, testing cluster wraparound has not been easy, because initdb always
initializes the cluster with the default xid/mxid/mxoff. An option to specify
any valid xid/mxid/mxoff at cluster creation makes such testing easier.

Author: Maxim Orlov <orlovmg@gmail.com>
Author: Pavel Borisov <pashkin.elfe@gmail.com>
Author: Svetlana Derevyanko <s.derevyanko@postgrespro.ru>
Discussion: https://www.postgresql.org/message-id/flat/CACG%3Dezaa4vqYjJ16yoxgrpa-%3DgXnf0Vv3Ey9bjGrRRFN2YyWFQ%40mail.gmail.com
---
 src/backend/access/transam/clog.c      |  21 +++++
 src/backend/access/transam/multixact.c |  53 ++++++++++++
 src/backend/access/transam/subtrans.c  |   8 +-
 src/backend/access/transam/xlog.c      |  15 ++--
 src/backend/bootstrap/bootstrap.c      |  50 +++++++++++-
 src/backend/main/main.c                |   6 ++
 src/backend/postmaster/postmaster.c    |  14 +++-
 src/backend/tcop/postgres.c            |  53 +++++++++++-
 src/bin/initdb/initdb.c                | 107 ++++++++++++++++++++++++-
 src/bin/initdb/t/001_initdb.pl         |  60 ++++++++++++++
 src/include/access/xlog.h              |   3 +
 src/include/c.h                        |   4 +
 src/include/catalog/pg_class.h         |   2 +-
 13 files changed, 382 insertions(+), 14 deletions(-)
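
As a rough illustration of why the bootstrap routines in this patch zero a
second SLRU page when the cluster starts from a non-default xid, the page
arithmetic can be sketched as below. This is not code from the patch: the
sketch_* names are hypothetical, and the constants assume the default 8kB
BLCKSZ, two clog status bits per transaction, and one 4-byte parent xid per
transaction in subtrans.

```c
#include <stdint.h>

/* Assumption: default 8kB block size. */
#define SKETCH_BLCKSZ					8192
/* clog: 2 status bits per xact, so 4 xacts per byte */
#define SKETCH_CLOG_XACTS_PER_PAGE		(SKETCH_BLCKSZ * 4)
/* subtrans: one 4-byte parent xid per xact */
#define SKETCH_SUBTRANS_XACTS_PER_PAGE	(SKETCH_BLCKSZ / 4)

/* clog page that holds the status bits of the given xid */
static int64_t
sketch_clog_page(uint32_t xid)
{
	return xid / SKETCH_CLOG_XACTS_PER_PAGE;
}

/* subtrans page that holds the parent pointer of the given xid */
static int64_t
sketch_subtrans_page(uint32_t xid)
{
	return xid / SKETCH_SUBTRANS_XACTS_PER_PAGE;
}

/*
 * Mirror of the bootstrap rule for clog: page 0 is always zeroed, and the
 * page containing next_xid is zeroed too when it is a different page.
 * Fills pages[] and returns how many entries were written (1 or 2).
 */
static int
sketch_clog_pages_to_zero(uint32_t next_xid, int64_t pages[2])
{
	int			n = 0;
	int64_t		pageno = sketch_clog_page(next_xid);

	pages[n++] = 0;
	if (pageno != 0)
		pages[n++] = pageno;
	return n;
}
```

With the default starting xid of 3 only page 0 is needed. With a starting xid
of 4000000000 the clog page is 122070 and the subtrans page is 1953125, so
bootstrap has to create those pages as well.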

diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index 0d556c00b8..89516e9f52 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -834,6 +834,7 @@ BootStrapCLOG(void)
 {
 	int			slotno;
 	LWLock	   *lock = SimpleLruGetBankLock(XactCtl, 0);
+	int64		pageno;
 
 	LWLockAcquire(lock, LW_EXCLUSIVE);
 
@@ -844,6 +845,26 @@ BootStrapCLOG(void)
 	SimpleLruWritePage(XactCtl, slotno);
 	Assert(!XactCtl->shared->page_dirty[slotno]);
 
+	pageno = TransactionIdToPage(XidFromFullTransactionId(TransamVariables->nextXid));
+	if (pageno != 0)
+	{
+		LWLock *nextlock = SimpleLruGetBankLock(XactCtl, pageno);
+
+		if (nextlock != lock)
+		{
+			LWLockRelease(lock);
+			LWLockAcquire(nextlock, LW_EXCLUSIVE);
+			lock = nextlock;
+		}
+
+		/* Create and zero the commit log page that holds nextXid */
+		slotno = ZeroCLOGPage(pageno, false);
+
+		/* Make sure it's written out */
+		SimpleLruWritePage(XactCtl, slotno);
+		Assert(!XactCtl->shared->page_dirty[slotno]);
+	}
+
 	LWLockRelease(lock);
 }
 
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index d63ae17330..70c9d2f6ee 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1815,6 +1815,7 @@ BootStrapMultiXact(void)
 {
 	int			slotno;
 	LWLock	   *lock;
+	int64		pageno;
 
 	lock = SimpleLruGetBankLock(MultiXactOffsetCtl, 0);
 	LWLockAcquire(lock, LW_EXCLUSIVE);
@@ -1826,6 +1827,26 @@ BootStrapMultiXact(void)
 	SimpleLruWritePage(MultiXactOffsetCtl, slotno);
 	Assert(!MultiXactOffsetCtl->shared->page_dirty[slotno]);
 
+	pageno = MultiXactIdToOffsetPage(MultiXactState->nextMXact);
+	if (pageno != 0)
+	{
+		LWLock *nextlock = SimpleLruGetBankLock(MultiXactOffsetCtl, pageno);
+
+		if (nextlock != lock)
+		{
+			LWLockRelease(lock);
+			LWLockAcquire(nextlock, LW_EXCLUSIVE);
+			lock = nextlock;
+		}
+
+		/* Create and zero the offsets page that holds nextMXact */
+		slotno = ZeroMultiXactOffsetPage(pageno, false);
+
+		/* Make sure it's written out */
+		SimpleLruWritePage(MultiXactOffsetCtl, slotno);
+		Assert(!MultiXactOffsetCtl->shared->page_dirty[slotno]);
+	}
+
 	LWLockRelease(lock);
 
 	lock = SimpleLruGetBankLock(MultiXactMemberCtl, 0);
@@ -1838,7 +1859,39 @@ BootStrapMultiXact(void)
 	SimpleLruWritePage(MultiXactMemberCtl, slotno);
 	Assert(!MultiXactMemberCtl->shared->page_dirty[slotno]);
 
+	pageno = MXOffsetToMemberPage(MultiXactState->nextOffset);
+	if (pageno != 0)
+	{
+		LWLock *nextlock = SimpleLruGetBankLock(MultiXactMemberCtl, pageno);
+
+		if (nextlock != lock)
+		{
+			LWLockRelease(lock);
+			LWLockAcquire(nextlock, LW_EXCLUSIVE);
+			lock = nextlock;
+		}
+
+		/* Create and zero the members page that holds nextOffset */
+		slotno = ZeroMultiXactMemberPage(pageno, false);
+
+		/* Make sure it's written out */
+		SimpleLruWritePage(MultiXactMemberCtl, slotno);
+		Assert(!MultiXactMemberCtl->shared->page_dirty[slotno]);
+	}
+
 	LWLockRelease(lock);
+
+	/*
+	 * If we're not starting from offset zero, initialize dummy multixacts to
+	 * avoid an overly long loop in PerformMembersTruncation().
+	 */
+	if (MultiXactState->nextOffset > 0 && MultiXactState->nextMXact > 0)
+	{
+		RecordNewMultiXact(FirstMultiXactId,
+						   MultiXactState->nextOffset, 0, NULL);
+		RecordNewMultiXact(MultiXactState->nextMXact,
+						   MultiXactState->nextOffset, 0, NULL);
+	}
 }
 
 /*
diff --git a/src/backend/access/transam/subtrans.c b/src/backend/access/transam/subtrans.c
index 15153618fa..218675fa60 100644
--- a/src/backend/access/transam/subtrans.c
+++ b/src/backend/access/transam/subtrans.c
@@ -270,12 +270,16 @@ void
 BootStrapSUBTRANS(void)
 {
 	int			slotno;
-	LWLock	   *lock = SimpleLruGetBankLock(SubTransCtl, 0);
+	LWLock	   *lock;
+	int64		pageno;
+
+	pageno = TransactionIdToPage(XidFromFullTransactionId(TransamVariables->nextXid));
+	lock = SimpleLruGetBankLock(SubTransCtl, pageno);
 
 	LWLockAcquire(lock, LW_EXCLUSIVE);
 
 	/* Create and zero the first page of the subtrans log */
-	slotno = ZeroSUBTRANSPage(0);
+	slotno = ZeroSUBTRANSPage(pageno);
 
 	/* Make sure it's written out */
 	SimpleLruWritePage(SubTransCtl, slotno);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index a813a090fa..9f78a3e34a 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -136,6 +136,10 @@ int			max_slot_wal_keep_size_mb = -1;
 int			wal_decode_buffer_size = 512 * 1024;
 bool		track_wal_io_timing = false;
 
+TransactionId		start_xid = FirstNormalTransactionId;
+MultiXactId			start_mxid = FirstMultiXactId;
+MultiXactOffset		start_mxoff = 0;
+
 #ifdef WAL_DEBUG
 bool		XLOG_DEBUG = false;
 #endif
@@ -5080,13 +5084,14 @@ BootStrapXLOG(uint32 data_checksum_version)
 	checkPoint.fullPageWrites = fullPageWrites;
 	checkPoint.wal_level = wal_level;
 	checkPoint.nextXid =
-		FullTransactionIdFromEpochAndXid(0, FirstNormalTransactionId);
+		FullTransactionIdFromEpochAndXid(0, Max(FirstNormalTransactionId,
+												start_xid));
 	checkPoint.nextOid = FirstGenbkiObjectId;
-	checkPoint.nextMulti = FirstMultiXactId;
-	checkPoint.nextMultiOffset = 1;
-	checkPoint.oldestXid = FirstNormalTransactionId;
+	checkPoint.nextMulti = Max(FirstMultiXactId, start_mxid);
+	checkPoint.nextMultiOffset = Max(1, start_mxoff);
+	checkPoint.oldestXid = XidFromFullTransactionId(checkPoint.nextXid);
 	checkPoint.oldestXidDB = Template1DbOid;
-	checkPoint.oldestMulti = FirstMultiXactId;
+	checkPoint.oldestMulti = checkPoint.nextMulti;
 	checkPoint.oldestMultiDB = Template1DbOid;
 	checkPoint.oldestCommitTsXid = InvalidTransactionId;
 	checkPoint.newestCommitTsXid = InvalidTransactionId;
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 359f58a8f9..b697138b7e 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -218,7 +218,7 @@ BootstrapModeMain(int argc, char *argv[], bool check_only)
 	argv++;
 	argc--;
 
-	while ((flag = getopt(argc, argv, "B:c:d:D:Fkr:X:-:")) != -1)
+	while ((flag = getopt(argc, argv, "B:c:d:D:Fkm:o:r:X:x:-:")) != -1)
 	{
 		switch (flag)
 		{
@@ -286,12 +286,60 @@ BootstrapModeMain(int argc, char *argv[], bool check_only)
 			case 'k':
 				bootstrap_data_checksum_version = PG_DATA_CHECKSUM_VERSION;
 				break;
+			case 'm':
+				{
+					char	   *endptr;
+
+					errno = 0;
+					start_mxid = strtou64(optarg, &endptr, 0);
+
+					if (endptr == optarg || *endptr != '\0' || errno != 0 ||
+						!StartMultiXactIdIsValid(start_mxid))
+					{
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("invalid initial database cluster multixact id")));
+					}
+				}
+				break;
+			case 'o':
+				{
+					char	   *endptr;
+
+					errno = 0;
+					start_mxoff = strtou64(optarg, &endptr, 0);
+
+					if (endptr == optarg || *endptr != '\0' || errno != 0 ||
+						!StartMultiXactOffsetIsValid(start_mxoff))
+					{
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("invalid initial database cluster multixact offset")));
+					}
+				}
+				break;
 			case 'r':
 				strlcpy(OutputFileName, optarg, MAXPGPATH);
 				break;
 			case 'X':
 				SetConfigOption("wal_segment_size", optarg, PGC_INTERNAL, PGC_S_DYNAMIC_DEFAULT);
 				break;
+			case 'x':
+				{
+					char	   *endptr;
+
+					errno = 0;
+					start_xid = strtou64(optarg, &endptr, 0);
+
+					if (endptr == optarg || *endptr != '\0' || errno != 0 ||
+						!StartTransactionIdIsValid(start_xid))
+					{
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("invalid initial database cluster xid")));
+					}
+				}
+				break;
 			default:
 				write_stderr("Try \"%s --help\" for more information.\n",
 							 progname);
diff --git a/src/backend/main/main.c b/src/backend/main/main.c
index e8effe5024..ff252dffbd 100644
--- a/src/backend/main/main.c
+++ b/src/backend/main/main.c
@@ -426,12 +426,18 @@ help(const char *progname)
 	printf(_("  -E                 echo statement before execution\n"));
 	printf(_("  -j                 do not use newline as interactive query delimiter\n"));
 	printf(_("  -r FILENAME        send stdout and stderr to given file\n"));
+	printf(_("  -m START_MXID      set initial database cluster multixact id\n"));
+	printf(_("  -o START_MXOFF     set initial database cluster multixact offset\n"));
+	printf(_("  -x START_XID       set initial database cluster xid\n"));
 
 	printf(_("\nOptions for bootstrapping mode:\n"));
 	printf(_("  --boot             selects bootstrapping mode (must be first argument)\n"));
 	printf(_("  --check            selects check mode (must be first argument)\n"));
 	printf(_("  DBNAME             database name (mandatory argument in bootstrapping mode)\n"));
 	printf(_("  -r FILENAME        send stdout and stderr to given file\n"));
+	printf(_("  -m START_MXID      set initial database cluster multixact id\n"));
+	printf(_("  -o START_MXOFF     set initial database cluster multixact offset\n"));
+	printf(_("  -x START_XID       set initial database cluster xid\n"));
 
 	printf(_("\nPlease read the documentation for the complete list of run-time\n"
 			 "configuration settings and how to set them on the command line or in\n"
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index bb22b13ade..028a734517 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -585,7 +585,7 @@ PostmasterMain(int argc, char *argv[])
 	 * tcop/postgres.c (the option sets should not conflict) and with the
 	 * common help() function in main/main.c.
 	 */
-	while ((opt = getopt(argc, argv, "B:bC:c:D:d:EeFf:h:ijk:lN:OPp:r:S:sTt:W:-:")) != -1)
+	while ((opt = getopt(argc, argv, "B:bC:c:D:d:EeFf:h:ijk:lm:N:Oo:Pp:r:S:sTt:W:x:-:")) != -1)
 	{
 		switch (opt)
 		{
@@ -695,10 +695,18 @@ PostmasterMain(int argc, char *argv[])
 				SetConfigOption("max_connections", optarg, PGC_POSTMASTER, PGC_S_ARGV);
 				break;
 
+			case 'm':
+				/* only used by single-user backend */
+				break;
+
 			case 'O':
 				SetConfigOption("allow_system_table_mods", "true", PGC_POSTMASTER, PGC_S_ARGV);
 				break;
 
+			case 'o':
+				/* only used by single-user backend */
+				break;
+
 			case 'P':
 				SetConfigOption("ignore_system_indexes", "true", PGC_POSTMASTER, PGC_S_ARGV);
 				break;
@@ -749,6 +757,10 @@ PostmasterMain(int argc, char *argv[])
 				SetConfigOption("post_auth_delay", optarg, PGC_POSTMASTER, PGC_S_ARGV);
 				break;
 
+			case 'x':
+				/* only used by single-user backend */
+				break;
+
 			default:
 				write_stderr("Try \"%s --help\" for more information.\n",
 							 progname);
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 5655348a2e..9c170f4906 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3788,7 +3788,7 @@ process_postgres_switches(int argc, char *argv[], GucContext ctx,
 	 * postmaster/postmaster.c (the option sets should not conflict) and with
 	 * the common help() function in main/main.c.
 	 */
-	while ((flag = getopt(argc, argv, "B:bC:c:D:d:EeFf:h:ijk:lN:nOPp:r:S:sTt:v:W:-:")) != -1)
+	while ((flag = getopt(argc, argv, "B:bC:c:D:d:EeFf:h:ijk:lm:N:nOo:Pp:r:S:sTt:v:W:x:-:")) != -1)
 	{
 		switch (flag)
 		{
@@ -3893,6 +3893,23 @@ process_postgres_switches(int argc, char *argv[], GucContext ctx,
 				SetConfigOption("ssl", "true", ctx, gucsource);
 				break;
 
+			case 'm':
+				{
+					char	   *endptr;
+
+					errno = 0;
+					start_mxid = strtou64(optarg, &endptr, 0);
+
+					if (endptr == optarg || *endptr != '\0' || errno != 0 ||
+						!StartMultiXactIdIsValid(start_mxid))
+					{
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("invalid initial database cluster multixact id")));
+					}
+				}
+				break;
+
 			case 'N':
 				SetConfigOption("max_connections", optarg, ctx, gucsource);
 				break;
@@ -3905,6 +3922,23 @@ process_postgres_switches(int argc, char *argv[], GucContext ctx,
 				SetConfigOption("allow_system_table_mods", "true", ctx, gucsource);
 				break;
 
+			case 'o':
+				{
+					char	   *endptr;
+
+					errno = 0;
+					start_mxoff = strtou64(optarg, &endptr, 0);
+
+					if (endptr == optarg || *endptr != '\0' || errno != 0 ||
+						!StartMultiXactOffsetIsValid(start_mxoff))
+					{
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("invalid initial database cluster multixact offset")));
+					}
+				}
+				break;
+
 			case 'P':
 				SetConfigOption("ignore_system_indexes", "true", ctx, gucsource);
 				break;
@@ -3959,6 +3993,23 @@ process_postgres_switches(int argc, char *argv[], GucContext ctx,
 				SetConfigOption("post_auth_delay", optarg, ctx, gucsource);
 				break;
 
+			case 'x':
+				{
+					char	   *endptr;
+
+					errno = 0;
+					start_xid = strtou64(optarg, &endptr, 0);
+
+					if (endptr == optarg || *endptr != '\0' || errno != 0 ||
+						!StartTransactionIdIsValid(start_xid))
+					{
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("invalid initial database cluster xid")));
+					}
+				}
+				break;
+
 			default:
 				errs++;
 				break;
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 759672a9b9..125bfb6736 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -168,6 +168,9 @@ static bool data_checksums = true;
 static char *xlog_dir = NULL;
 static int	wal_segment_size_mb = (DEFAULT_XLOG_SEG_SIZE) / (1024 * 1024);
 static DataDirSyncMethod sync_method = DATA_DIR_SYNC_METHOD_FSYNC;
+static TransactionId start_xid = 0;
+static MultiXactId start_mxid = 0;
+static MultiXactOffset start_mxoff = 0;
 
 
 /* internal vars */
@@ -1596,6 +1599,11 @@ bootstrap_template1(void)
 	bki_lines = replace_token(bki_lines, "POSTGRES",
 							  escape_quotes_bki(username));
 
+	/* relfrozenxid must not be less than FirstNormalTransactionId */
+	sprintf(buf, "%llu", (unsigned long long) Max(start_xid, 3));
+	bki_lines = replace_token(bki_lines, "RECENTXMIN",
+							  buf);
+
 	bki_lines = replace_token(bki_lines, "ENCODING",
 							  encodingid_to_string(encodingid));
 
@@ -1621,6 +1629,9 @@ bootstrap_template1(void)
 
 	printfPQExpBuffer(&cmd, "\"%s\" --boot %s %s", backend_exec, boot_options, extra_options);
 	appendPQExpBuffer(&cmd, " -X %d", wal_segment_size_mb * (1024 * 1024));
+	appendPQExpBuffer(&cmd, " -m %llu", (unsigned long long) start_mxid);
+	appendPQExpBuffer(&cmd, " -o %llu", (unsigned long long) start_mxoff);
+	appendPQExpBuffer(&cmd, " -x %llu", (unsigned long long) start_xid);
 	if (data_checksums)
 		appendPQExpBuffer(&cmd, " -k");
 	if (debug)
@@ -2562,12 +2573,20 @@ usage(const char *progname)
 	printf(_("  -d, --debug               generate lots of debugging output\n"));
 	printf(_("      --discard-caches      set debug_discard_caches=1\n"));
 	printf(_("  -L DIRECTORY              where to find the input files\n"));
+	printf(_("  -m, --multixact-id=START_MXID\n"
+			 "                            set initial database cluster multixact id\n"
+			 "                            max value is 2^32-1\n"));
 	printf(_("  -n, --no-clean            do not clean up after errors\n"));
 	printf(_("  -N, --no-sync             do not wait for changes to be written safely to disk\n"));
 	printf(_("      --no-instructions     do not print instructions for next steps\n"));
+	printf(_("  -o, --multixact-offset=START_MXOFF\n"
+			 "                            set initial database cluster multixact offset\n"
+			 "                            max value is 2^32-1\n"));
 	printf(_("  -s, --show                show internal settings, then exit\n"));
 	printf(_("      --sync-method=METHOD  set method for syncing files to disk\n"));
 	printf(_("  -S, --sync-only           only sync database files to disk, then exit\n"));
+	printf(_("  -x, --xid=START_XID       set initial database cluster xid\n"
+			 "                            max value is 2^32-1\n"));
 	printf(_("\nOther options:\n"));
 	printf(_("  -V, --version             output version information, then exit\n"));
 	printf(_("  -?, --help                show this help, then exit\n"));
@@ -3102,6 +3121,18 @@ initialize_data_directory(void)
 	/* Now create all the text config files */
 	setup_config();
 
+	if (start_mxid != 0)
+		printf(_("selecting initial multixact id ... %llu\n"),
+				 (unsigned long long) start_mxid);
+
+	if (start_mxoff != 0)
+		printf(_("selecting initial multixact offset ... %llu\n"),
+				 (unsigned long long) start_mxoff);
+
+	if (start_xid != 0)
+		printf(_("selecting initial xid ... %llu\n"),
+				 (unsigned long long) start_xid);
+
 	/* Bootstrap template1 */
 	bootstrap_template1();
 
@@ -3118,8 +3149,12 @@ initialize_data_directory(void)
 	fflush(stdout);
 
 	initPQExpBuffer(&cmd);
-	printfPQExpBuffer(&cmd, "\"%s\" %s %s template1 >%s",
-					  backend_exec, backend_options, extra_options, DEVNULL);
+	printfPQExpBuffer(&cmd, "\"%s\" %s %s",
+					  backend_exec, backend_options, extra_options);
+	appendPQExpBuffer(&cmd, " -m %llu", (unsigned long long) start_mxid);
+	appendPQExpBuffer(&cmd, " -o %llu", (unsigned long long) start_mxoff);
+	appendPQExpBuffer(&cmd, " -x %llu", (unsigned long long) start_xid);
+	appendPQExpBuffer(&cmd, " template1 >%s", DEVNULL);
 
 	PG_CMD_OPEN(cmd.data);
 
@@ -3206,6 +3241,9 @@ main(int argc, char *argv[])
 		{"icu-rules", required_argument, NULL, 18},
 		{"sync-method", required_argument, NULL, 19},
 		{"no-data-checksums", no_argument, NULL, 20},
+		{"xid", required_argument, NULL, 'x'},
+		{"multixact-id", required_argument, NULL, 'm'},
+		{"multixact-offset", required_argument, NULL, 'o'},
 		{NULL, 0, NULL, 0}
 	};
 
@@ -3247,7 +3285,7 @@ main(int argc, char *argv[])
 
 	/* process command-line options */
 
-	while ((c = getopt_long(argc, argv, "A:c:dD:E:gkL:nNsST:U:WX:",
+	while ((c = getopt_long(argc, argv, "A:c:dD:E:gkL:m:nNo:sST:U:Wx:X:",
 							long_options, &option_index)) != -1)
 	{
 		switch (c)
@@ -3305,6 +3343,30 @@ main(int argc, char *argv[])
 				debug = true;
 				printf(_("Running in debug mode.\n"));
 				break;
+			case 'm':
+				{
+					char	   *endptr;
+
+					errno = 0;
+					start_mxid = strtou64(optarg, &endptr, 0);
+
+					if (endptr == optarg || *endptr != '\0' || errno != 0 ||
+						!StartMultiXactIdIsValid(start_mxid))
+					{
+						pg_log_error("invalid initial database cluster multixact id");
+						exit(1);
+					}
+					else if (start_mxid < 1) /* FirstMultiXactId */
+					{
+						/*
+						 * Reject the value rather than silently raising it to
+						 * FirstMultiXactId, although that would be harmless.
+						 */
+						pg_log_error("multixact id should be greater than 0");
+						exit(1);
+					}
+				}
+				break;
 			case 'n':
 				noclean = true;
 				printf(_("Running in no-clean mode.  Mistakes will not be cleaned up.\n"));
@@ -3312,6 +3374,21 @@ main(int argc, char *argv[])
 			case 'N':
 				do_sync = false;
 				break;
+			case 'o':
+				{
+					char	   *endptr;
+
+					errno = 0;
+					start_mxoff = strtou64(optarg, &endptr, 0);
+
+					if (endptr == optarg || *endptr != '\0' || errno != 0 ||
+						!StartMultiXactOffsetIsValid(start_mxoff))
+					{
+						pg_log_error("invalid initial database cluster multixact offset");
+						exit(1);
+					}
+				}
+				break;
 			case 'S':
 				sync_only = true;
 				break;
@@ -3400,6 +3477,30 @@ main(int argc, char *argv[])
 			case 20:
 				data_checksums = false;
 				break;
+			case 'x':
+				{
+					char	   *endptr;
+
+					errno = 0;
+					start_xid = strtou64(optarg, &endptr, 0);
+
+					if (endptr == optarg || *endptr != '\0' || errno != 0 ||
+						!StartTransactionIdIsValid(start_xid))
+					{
+						pg_log_error("invalid value for initial database cluster xid");
+						exit(1);
+					}
+					else if (start_xid < 3) /* FirstNormalTransactionId */
+					{
+						/*
+						 * Reject the value rather than silently raising it to
+						 * FirstNormalTransactionId, although that would be harmless.
+						 */
+						pg_log_error("xid should be greater than 2");
+						exit(1);
+					}
+				}
+				break;
 			default:
 				/* getopt_long already emitted a complaint */
 				pg_log_error_hint("Try \"%s --help\" for more information.", progname);
diff --git a/src/bin/initdb/t/001_initdb.pl b/src/bin/initdb/t/001_initdb.pl
index 01cc4a1602..8b017eb907 100644
--- a/src/bin/initdb/t/001_initdb.pl
+++ b/src/bin/initdb/t/001_initdb.pl
@@ -329,4 +329,64 @@ command_fails(
 	[ 'pg_checksums', '--pgdata' => $datadir_nochecksums ],
 	"pg_checksums fails with data checksum disabled");
 
+# Set non-standard initial mxid/mxoff/xid.
+command_fails_like(
+	[ 'initdb', '-m', 'seven', $datadir ],
+	qr/initdb: error: invalid initial database cluster multixact id/,
+	'fails for invalid initial database cluster multixact id');
+command_fails_like(
+	[ 'initdb', '-o', 'seven', $datadir ],
+	qr/initdb: error: invalid initial database cluster multixact offset/,
+	'fails for invalid initial database cluster multixact offset');
+command_fails_like(
+	[ 'initdb', '-x', 'seven', $datadir ],
+	qr/initdb: error: invalid value for initial database cluster xid/,
+	'fails for invalid initial database cluster xid');
+
+command_checks_all(
+	[ 'initdb', '-m', '65535', "$tempdir/data-m65535" ],
+	0,
+	[qr/selecting initial multixact id ... 65535/],
+	[],
+	'selecting initial multixact id');
+command_checks_all(
+	[ 'initdb', '-o', '65535', "$tempdir/data-o65535" ],
+	0,
+	[qr/selecting initial multixact offset ... 65535/],
+	[],
+	'selecting initial multixact offset');
+command_checks_all(
+	[ 'initdb', '-x', '65535', "$tempdir/data-x65535" ],
+	0,
+	[qr/selecting initial xid ... 65535/],
+	[],
+	'selecting initial xid');
+
+# Set up a new cluster with the given mxid/mxoff/xid.
+my $node;
+my $result;
+
+$node = PostgreSQL::Test::Cluster->new('test-mxid');
+$node->init(extra => ['-m', '16777215']); # 0xFFFFFF
+$node->start;
+$result = $node->safe_psql('postgres', "SELECT next_multixact_id FROM pg_control_checkpoint();");
+ok($result >= 16777215, 'setup cluster with given mxid');
+$node->stop;
+
+$node = PostgreSQL::Test::Cluster->new('test-mxoff');
+$node->init(extra => ['-o', '16777215']); # 0xFFFFFF
+$node->start;
+$result = $node->safe_psql('postgres', "SELECT next_multi_offset FROM pg_control_checkpoint();");
+ok($result >= 16777215, 'setup cluster with given mxoff');
+$node->stop;
+
+$node = PostgreSQL::Test::Cluster->new('test-xid');
+$node->init(extra => ['-x', '16777215']); # 0xFFFFFF
+$node->start;
+$result = $node->safe_psql('postgres', "SELECT txid_current();");
+ok($result >= 16777215, 'setup cluster with given xid - check 1');
+$result = $node->safe_psql('postgres', "SELECT oldest_xid FROM pg_control_checkpoint();");
+ok($result >= 16777215, 'setup cluster with given xid - check 2');
+$node->stop;
+
 done_testing();
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 4411c1468a..8eb34846da 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -94,6 +94,9 @@ typedef enum RecoveryState
 } RecoveryState;
 
 extern PGDLLIMPORT int wal_level;
+extern PGDLLIMPORT TransactionId start_xid;
+extern PGDLLIMPORT MultiXactId start_mxid;
+extern PGDLLIMPORT MultiXactOffset start_mxoff;
 
 /* Is WAL archiving enabled (always or only while server is running normally)? */
 #define XLogArchivingActive() \
diff --git a/src/include/c.h b/src/include/c.h
index 318194f78d..4f2b5432e5 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -622,6 +622,10 @@ typedef uint64 MultiXactOffset;
 
 typedef uint32 CommandId;
 
+#define StartTransactionIdIsValid(xid)			((xid) <= 0xFFFFFFFF)
+#define StartMultiXactIdIsValid(mxid)			((mxid) <= 0xFFFFFFFF)
+#define StartMultiXactOffsetIsValid(offset)		((offset) <= 0xFFFFFFFF)
+
 #define FirstCommandId	((CommandId) 0)
 #define InvalidCommandId	(~(CommandId)0)
 
diff --git a/src/include/catalog/pg_class.h b/src/include/catalog/pg_class.h
index f0d612ca48..5c63290a72 100644
--- a/src/include/catalog/pg_class.h
+++ b/src/include/catalog/pg_class.h
@@ -123,7 +123,7 @@ CATALOG(pg_class,1259,RelationRelationId) BKI_BOOTSTRAP BKI_ROWTYPE_OID(83,Relat
 	Oid			relrewrite BKI_DEFAULT(0) BKI_LOOKUP_OPT(pg_class);
 
 	/* all Xids < this are frozen in this rel */
-	TransactionId relfrozenxid BKI_DEFAULT(3);	/* FirstNormalTransactionId */
+	TransactionId relfrozenxid BKI_DEFAULT(RECENTXMIN);	/* replaced by initdb; at least FirstNormalTransactionId */
 
 	/* all multixacts in this rel are >= this; it is really a MultiXactId */
 	TransactionId relminmxid BKI_DEFAULT(1);	/* FirstMultiXactId */
-- 
2.43.0
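
A note on the option parsing in the patch above: each of the -m/-o/-x handlers assigns the strtou64() result straight into the target variable and only then calls the Start...IsValid() macro. If TransactionId and MultiXactId are still 32-bit types at that point (an assumption based on current PostgreSQL, not something this excerpt shows), the value would be narrowed before the range check sees it, so the `<= 0xFFFFFFFF` test could never fail. A minimal sketch of one way to avoid that, parsing into a wide temporary first; parse_start_value32() is a hypothetical helper and is not part of the patch:

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper (not in the patch): parse a start value into a 64-bit
 * temporary and range-check it before narrowing to 32 bits.
 */
static bool
parse_start_value32(const char *arg, uint32_t *result)
{
	char	   *endptr;
	unsigned long long value;

	errno = 0;
	value = strtoull(arg, &endptr, 0);

	/* reject empty input, trailing junk, and overflow inside strtoull */
	if (endptr == arg || *endptr != '\0' || errno != 0)
		return false;

	/* the check runs on the wide value, so 2^32 and above are rejected */
	if (value > UINT32_MAX)
		return false;

	*result = (uint32_t) value;
	return true;
}
```

The same shape would apply in bootstrap.c, postgres.c and initdb.c; only the error reporting (ereport versus pg_log_error) differs between them.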

v13-0001-Use-64-bit-format-output-for-multixact-offsets.patch (application/octet-stream)
From ba345510f8e52c4504c238b85512bfe864a8a6c3 Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Wed, 7 Aug 2024 16:35:22 +0300
Subject: [PATCH v13 1/7] Use 64-bit format output for multixact offsets

Author: Maxim Orlov <orlovmg@gmail.com>
---
 src/backend/access/rmgrdesc/mxactdesc.c   |  9 ++++----
 src/backend/access/rmgrdesc/xlogdesc.c    |  4 ++--
 src/backend/access/transam/multixact.c    | 26 +++++++++++++----------
 src/backend/access/transam/xlogrecovery.c |  5 +++--
 src/bin/pg_controldata/pg_controldata.c   |  4 ++--
 src/bin/pg_resetwal/pg_resetwal.c         |  8 +++----
 6 files changed, 31 insertions(+), 25 deletions(-)

diff --git a/src/backend/access/rmgrdesc/mxactdesc.c b/src/backend/access/rmgrdesc/mxactdesc.c
index 8bd3d5b63c..b792e9d939 100644
--- a/src/backend/access/rmgrdesc/mxactdesc.c
+++ b/src/backend/access/rmgrdesc/mxactdesc.c
@@ -65,8 +65,8 @@ multixact_desc(StringInfo buf, XLogReaderState *record)
 		xl_multixact_create *xlrec = (xl_multixact_create *) rec;
 		int			i;
 
-		appendStringInfo(buf, "%u offset %u nmembers %d: ", xlrec->mid,
-						 xlrec->moff, xlrec->nmembers);
+		appendStringInfo(buf, "%u offset %llu nmembers %d: ", xlrec->mid,
+						 (unsigned long long) xlrec->moff, xlrec->nmembers);
 		for (i = 0; i < xlrec->nmembers; i++)
 			out_member(buf, &xlrec->members[i]);
 	}
@@ -74,9 +74,10 @@ multixact_desc(StringInfo buf, XLogReaderState *record)
 	{
 		xl_multixact_truncate *xlrec = (xl_multixact_truncate *) rec;
 
-		appendStringInfo(buf, "offsets [%u, %u), members [%u, %u)",
+		appendStringInfo(buf, "offsets [%u, %u), members [%llu, %llu)",
 						 xlrec->startTruncOff, xlrec->endTruncOff,
-						 xlrec->startTruncMemb, xlrec->endTruncMemb);
+						 (unsigned long long) xlrec->startTruncMemb,
+						 (unsigned long long) xlrec->endTruncMemb);
 	}
 }
 
diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c
index 58040f2865..e52a5625a8 100644
--- a/src/backend/access/rmgrdesc/xlogdesc.c
+++ b/src/backend/access/rmgrdesc/xlogdesc.c
@@ -66,7 +66,7 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 		CheckPoint *checkpoint = (CheckPoint *) rec;
 
 		appendStringInfo(buf, "redo %X/%X; "
-						 "tli %u; prev tli %u; fpw %s; wal_level %s; xid %u:%u; oid %u; multi %u; offset %u; "
+						 "tli %u; prev tli %u; fpw %s; wal_level %s; xid %u:%u; oid %u; multi %u; offset %llu; "
 						 "oldest xid %u in DB %u; oldest multi %u in DB %u; "
 						 "oldest/newest commit timestamp xid: %u/%u; "
 						 "oldest running xid %u; %s",
@@ -79,7 +79,7 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 						 XidFromFullTransactionId(checkpoint->nextXid),
 						 checkpoint->nextOid,
 						 checkpoint->nextMulti,
-						 checkpoint->nextMultiOffset,
+						 (unsigned long long) checkpoint->nextMultiOffset,
 						 checkpoint->oldestXid,
 						 checkpoint->oldestXidDB,
 						 checkpoint->oldestMulti,
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 27ccdf9500..623fc8bdac 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1264,7 +1264,8 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 
 	LWLockRelease(MultiXactGenLock);
 
-	debug_elog4(DEBUG2, "GetNew: returning %u offset %u", result, *offset);
+	debug_elog4(DEBUG2, "GetNew: returning %u offset %llu", result,
+				(unsigned long long) *offset);
 	return result;
 }
 
@@ -2293,8 +2294,9 @@ MultiXactGetCheckptMulti(bool is_shutdown,
 	LWLockRelease(MultiXactGenLock);
 
 	debug_elog6(DEBUG2,
-				"MultiXact: checkpoint is nextMulti %u, nextOffset %u, oldestMulti %u in DB %u",
-				*nextMulti, *nextMultiOffset, *oldestMulti, *oldestMultiDB);
+				"MultiXact: checkpoint is nextMulti %u, nextOffset %llu, oldestMulti %u in DB %u",
+				*nextMulti, (unsigned long long) *nextMultiOffset, *oldestMulti,
+				*oldestMultiDB);
 }
 
 /*
@@ -2328,8 +2330,8 @@ void
 MultiXactSetNextMXact(MultiXactId nextMulti,
 					  MultiXactOffset nextMultiOffset)
 {
-	debug_elog4(DEBUG2, "MultiXact: setting next multi to %u offset %u",
-				nextMulti, nextMultiOffset);
+	debug_elog4(DEBUG2, "MultiXact: setting next multi to %u offset %llu",
+				nextMulti, (unsigned long long) nextMultiOffset);
 	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
 	MultiXactState->nextMXact = nextMulti;
 	MultiXactState->nextOffset = nextMultiOffset;
@@ -2519,8 +2521,8 @@ MultiXactAdvanceNextMXact(MultiXactId minMulti,
 	}
 	if (MultiXactOffsetPrecedes(MultiXactState->nextOffset, minMultiOffset))
 	{
-		debug_elog3(DEBUG2, "MultiXact: setting next offset to %u",
-					minMultiOffset);
+		debug_elog3(DEBUG2, "MultiXact: setting next offset to %llu",
+					(unsigned long long) minMultiOffset);
 		MultiXactState->nextOffset = minMultiOffset;
 	}
 	LWLockRelease(MultiXactGenLock);
@@ -3211,11 +3213,12 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 
 	elog(DEBUG1, "performing multixact truncation: "
 		 "offsets [%u, %u), offsets segments [%llx, %llx), "
-		 "members [%u, %u), members segments [%llx, %llx)",
+		 "members [%llu, %llu), members segments [%llx, %llx)",
 		 oldestMulti, newOldestMulti,
 		 (unsigned long long) MultiXactIdToOffsetSegment(oldestMulti),
 		 (unsigned long long) MultiXactIdToOffsetSegment(newOldestMulti),
-		 oldestOffset, newOldestOffset,
+		 (unsigned long long) oldestOffset,
+		 (unsigned long long) newOldestOffset,
 		 (unsigned long long) MXOffsetToMemberSegment(oldestOffset),
 		 (unsigned long long) MXOffsetToMemberSegment(newOldestOffset));
 
@@ -3471,11 +3474,12 @@ multixact_redo(XLogReaderState *record)
 
 		elog(DEBUG1, "replaying multixact truncation: "
 			 "offsets [%u, %u), offsets segments [%llx, %llx), "
-			 "members [%u, %u), members segments [%llx, %llx)",
+			 "members [%llu, %llu), members segments [%llx, %llx)",
 			 xlrec.startTruncOff, xlrec.endTruncOff,
 			 (unsigned long long) MultiXactIdToOffsetSegment(xlrec.startTruncOff),
 			 (unsigned long long) MultiXactIdToOffsetSegment(xlrec.endTruncOff),
-			 xlrec.startTruncMemb, xlrec.endTruncMemb,
+			 (unsigned long long) xlrec.startTruncMemb,
+			 (unsigned long long) xlrec.endTruncMemb,
 			 (unsigned long long) MXOffsetToMemberSegment(xlrec.startTruncMemb),
 			 (unsigned long long) MXOffsetToMemberSegment(xlrec.endTruncMemb));
 
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index cf2b007806..d5464d426c 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -876,8 +876,9 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 							 U64FromFullTransactionId(checkPoint.nextXid),
 							 checkPoint.nextOid)));
 	ereport(DEBUG1,
-			(errmsg_internal("next MultiXactId: %u; next MultiXactOffset: %u",
-							 checkPoint.nextMulti, checkPoint.nextMultiOffset)));
+			(errmsg_internal("next MultiXactId: %u; next MultiXactOffset: %llu",
+							 checkPoint.nextMulti,
+							 (unsigned long long) checkPoint.nextMultiOffset)));
 	ereport(DEBUG1,
 			(errmsg_internal("oldest unfrozen transaction ID: %u, in database %u",
 							 checkPoint.oldestXid, checkPoint.oldestXidDB)));
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index 93a05d80ca..43b6727570 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -253,8 +253,8 @@ main(int argc, char *argv[])
 		   ControlFile->checkPointCopy.nextOid);
 	printf(_("Latest checkpoint's NextMultiXactId:  %u\n"),
 		   ControlFile->checkPointCopy.nextMulti);
-	printf(_("Latest checkpoint's NextMultiOffset:  %u\n"),
-		   ControlFile->checkPointCopy.nextMultiOffset);
+	printf(_("Latest checkpoint's NextMultiOffset:  %llu\n"),
+		   (unsigned long long) ControlFile->checkPointCopy.nextMultiOffset);
 	printf(_("Latest checkpoint's oldestXID:        %u\n"),
 		   ControlFile->checkPointCopy.oldestXid);
 	printf(_("Latest checkpoint's oldestXID's DB:   %u\n"),
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index ed73607a46..fff401e469 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -737,8 +737,8 @@ PrintControlValues(bool guessed)
 		   ControlFile.checkPointCopy.nextOid);
 	printf(_("Latest checkpoint's NextMultiXactId:  %u\n"),
 		   ControlFile.checkPointCopy.nextMulti);
-	printf(_("Latest checkpoint's NextMultiOffset:  %u\n"),
-		   ControlFile.checkPointCopy.nextMultiOffset);
+	printf(_("Latest checkpoint's NextMultiOffset:  %llu\n"),
+		   (unsigned long long) ControlFile.checkPointCopy.nextMultiOffset);
 	printf(_("Latest checkpoint's oldestXID:        %u\n"),
 		   ControlFile.checkPointCopy.oldestXid);
 	printf(_("Latest checkpoint's oldestXID's DB:   %u\n"),
@@ -809,8 +809,8 @@ PrintNewControlValues(void)
 
 	if (set_mxoff != -1)
 	{
-		printf(_("NextMultiOffset:                      %u\n"),
-			   ControlFile.checkPointCopy.nextMultiOffset);
+		printf(_("NextMultiOffset:                      %llu\n"),
+			   (unsigned long long) ControlFile.checkPointCopy.nextMultiOffset);
 	}
 
 	if (set_oid != 0)
-- 
2.43.0

v13-0002-Use-64-bit-multixact-offsets.patch (application/octet-stream)
From f3499102e2893e4b2e24d48975cbbd49385e190f Mon Sep 17 00:00:00 2001
From: Maxim Orlov <m.orlov@postgrespro.ru>
Date: Wed, 6 Mar 2024 11:11:33 +0300
Subject: [PATCH v13 2/7] Use 64-bit multixact offsets.

Author: Maxim Orlov <orlovmg@gmail.com>
---
 src/backend/access/transam/multixact.c  | 541 ++----------------------
 src/backend/access/transam/xlog.c       |   2 +-
 src/backend/commands/vacuum.c           |   2 +-
 src/backend/postmaster/autovacuum.c     |   4 +-
 src/bin/pg_resetwal/pg_resetwal.c       |   2 +-
 src/bin/pg_resetwal/t/001_basic.pl      |   2 +-
 src/include/access/multixact.h          |   3 +-
 src/include/access/multixact_internal.h | 115 +++++
 src/include/c.h                         |   2 +-
 9 files changed, 156 insertions(+), 517 deletions(-)
 create mode 100644 src/include/access/multixact_internal.h

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 623fc8bdac..cd9db52e95 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -69,6 +69,7 @@
 #include "postgres.h"
 
 #include "access/multixact.h"
+#include "access/multixact_internal.h"
 #include "access/slru.h"
 #include "access/transam.h"
 #include "access/twophase.h"
@@ -92,130 +93,14 @@
 #include "utils/injection_point.h"
 #include "utils/memutils.h"
 
-
-/*
- * Defines for MultiXactOffset page sizes.  A page is the same BLCKSZ as is
- * used everywhere else in Postgres.
- *
- * Note: because MultiXactOffsets are 32 bits and wrap around at 0xFFFFFFFF,
- * MultiXact page numbering also wraps around at
- * 0xFFFFFFFF/MULTIXACT_OFFSETS_PER_PAGE, and segment numbering at
- * 0xFFFFFFFF/MULTIXACT_OFFSETS_PER_PAGE/SLRU_PAGES_PER_SEGMENT.  We need
- * take no explicit notice of that fact in this module, except when comparing
- * segment and page numbers in TruncateMultiXact (see
- * MultiXactOffsetPagePrecedes).
- */
-
-/* We need four bytes per offset */
-#define MULTIXACT_OFFSETS_PER_PAGE (BLCKSZ / sizeof(MultiXactOffset))
-
-static inline int64
-MultiXactIdToOffsetPage(MultiXactId multi)
-{
-	return multi / MULTIXACT_OFFSETS_PER_PAGE;
-}
-
-static inline int
-MultiXactIdToOffsetEntry(MultiXactId multi)
-{
-	return multi % MULTIXACT_OFFSETS_PER_PAGE;
-}
-
-static inline int64
-MultiXactIdToOffsetSegment(MultiXactId multi)
-{
-	return MultiXactIdToOffsetPage(multi) / SLRU_PAGES_PER_SEGMENT;
-}
-
-/*
- * The situation for members is a bit more complex: we store one byte of
- * additional flag bits for each TransactionId.  To do this without getting
- * into alignment issues, we store four bytes of flags, and then the
- * corresponding 4 Xids.  Each such 5-word (20-byte) set we call a "group", and
- * are stored as a whole in pages.  Thus, with 8kB BLCKSZ, we keep 409 groups
- * per page.  This wastes 12 bytes per page, but that's OK -- simplicity (and
- * performance) trumps space efficiency here.
- *
- * Note that the "offset" macros work with byte offset, not array indexes, so
- * arithmetic must be done using "char *" pointers.
- */
-/* We need eight bits per xact, so one xact fits in a byte */
-#define MXACT_MEMBER_BITS_PER_XACT			8
-#define MXACT_MEMBER_FLAGS_PER_BYTE			1
-#define MXACT_MEMBER_XACT_BITMASK	((1 << MXACT_MEMBER_BITS_PER_XACT) - 1)
-
-/* how many full bytes of flags are there in a group? */
-#define MULTIXACT_FLAGBYTES_PER_GROUP		4
-#define MULTIXACT_MEMBERS_PER_MEMBERGROUP	\
-	(MULTIXACT_FLAGBYTES_PER_GROUP * MXACT_MEMBER_FLAGS_PER_BYTE)
-/* size in bytes of a complete group */
-#define MULTIXACT_MEMBERGROUP_SIZE \
-	(sizeof(TransactionId) * MULTIXACT_MEMBERS_PER_MEMBERGROUP + MULTIXACT_FLAGBYTES_PER_GROUP)
-#define MULTIXACT_MEMBERGROUPS_PER_PAGE (BLCKSZ / MULTIXACT_MEMBERGROUP_SIZE)
-#define MULTIXACT_MEMBERS_PER_PAGE	\
-	(MULTIXACT_MEMBERGROUPS_PER_PAGE * MULTIXACT_MEMBERS_PER_MEMBERGROUP)
-
 /*
- * Because the number of items per page is not a divisor of the last item
- * number (member 0xFFFFFFFF), the last segment does not use the maximum number
- * of pages, and moreover the last used page therein does not use the same
- * number of items as previous pages.  (Another way to say it is that the
- * 0xFFFFFFFF member is somewhere in the middle of the last page, so the page
- * has some empty space after that item.)
- *
- * This constant is the number of members in the last page of the last segment.
+ * Multixact members warning threshold.
+ *
+ * If the difference between nextOffset and oldestOffset exceeds this value, we
+ * trigger autovacuum in order to release disk space and reduce table bloat
+ * where possible.
  */
-#define MAX_MEMBERS_IN_LAST_MEMBERS_PAGE \
-		((uint32) ((0xFFFFFFFF % MULTIXACT_MEMBERS_PER_PAGE) + 1))
-
-/* page in which a member is to be found */
-static inline int64
-MXOffsetToMemberPage(MultiXactOffset offset)
-{
-	return offset / MULTIXACT_MEMBERS_PER_PAGE;
-}
-
-static inline int64
-MXOffsetToMemberSegment(MultiXactOffset offset)
-{
-	return MXOffsetToMemberPage(offset) / SLRU_PAGES_PER_SEGMENT;
-}
-
-/* Location (byte offset within page) of flag word for a given member */
-static inline int
-MXOffsetToFlagsOffset(MultiXactOffset offset)
-{
-	MultiXactOffset group = offset / MULTIXACT_MEMBERS_PER_MEMBERGROUP;
-	int			grouponpg = group % MULTIXACT_MEMBERGROUPS_PER_PAGE;
-	int			byteoff = grouponpg * MULTIXACT_MEMBERGROUP_SIZE;
-
-	return byteoff;
-}
-
-static inline int
-MXOffsetToFlagsBitShift(MultiXactOffset offset)
-{
-	int			member_in_group = offset % MULTIXACT_MEMBERS_PER_MEMBERGROUP;
-	int			bshift = member_in_group * MXACT_MEMBER_BITS_PER_XACT;
-
-	return bshift;
-}
-
-/* Location (byte offset within page) of TransactionId of given member */
-static inline int
-MXOffsetToMemberOffset(MultiXactOffset offset)
-{
-	int			member_in_group = offset % MULTIXACT_MEMBERS_PER_MEMBERGROUP;
-
-	return MXOffsetToFlagsOffset(offset) +
-		MULTIXACT_FLAGBYTES_PER_GROUP +
-		member_in_group * sizeof(TransactionId);
-}
-
-/* Multixact members wraparound thresholds. */
-#define MULTIXACT_MEMBER_SAFE_THRESHOLD		(MaxMultiXactOffset / 2)
-#define MULTIXACT_MEMBER_DANGER_THRESHOLD	\
-	(MaxMultiXactOffset - MaxMultiXactOffset / 4)
+#define MULTIXACT_MEMBER_AUTOVAC_THRESHOLD		UINT64CONST(0xFFFFFFFF)
 
 static inline MultiXactId
 PreviousMultiXactId(MultiXactId multi)
@@ -260,11 +145,9 @@ typedef struct MultiXactStateData
 
 	/*
 	 * Oldest multixact offset that is potentially referenced by a multixact
-	 * referenced by a relation.  We don't always know this value, so there's
-	 * a flag here to indicate whether or not we currently do.
+	 * referenced by a relation.
 	 */
 	MultiXactOffset oldestOffset;
-	bool		oldestOffsetKnown;
 
 	/* support for anti-wraparound measures */
 	MultiXactId multiVacLimit;
@@ -272,9 +155,6 @@ typedef struct MultiXactStateData
 	MultiXactId multiStopLimit;
 	MultiXactId multiWrapLimit;
 
-	/* support for members anti-wraparound measures */
-	MultiXactOffset offsetStopLimit;	/* known if oldestOffsetKnown */
-
 	/*
 	 * This is used to sleep until a multixact offset is written when we want
 	 * to create the next one.
@@ -409,10 +289,8 @@ static bool MultiXactOffsetPrecedes(MultiXactOffset offset1,
 									MultiXactOffset offset2);
 static void ExtendMultiXactOffset(MultiXactId multi);
 static void ExtendMultiXactMember(MultiXactOffset offset, int nmembers);
-static bool MultiXactOffsetWouldWrap(MultiXactOffset boundary,
-									 MultiXactOffset start, uint32 distance);
 static bool SetOffsetVacuumLimit(bool is_startup);
-static bool find_multixact_start(MultiXactId multi, MultiXactOffset *result);
+static MultiXactOffset find_multixact_start(MultiXactId multi);
 static void WriteMZeroPageXlogRec(int64 pageno, uint8 info);
 static void WriteMTruncateXlogRec(Oid oldestMultiDB,
 								  MultiXactId startTruncOff,
@@ -1054,9 +932,7 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 	 * against catastrophic data loss due to multixact wraparound.  The basic
 	 * rules are:
 	 *
-	 * If we're past multiVacLimit or the safe threshold for member storage
-	 * space, or we don't know what the safe threshold for member storage is,
-	 * start trying to force autovacuum cycles.
+	 * If we're past multiVacLimit, start trying to force autovacuum cycles.
 	 * If we're past multiWarnLimit, start issuing warnings.
 	 * If we're past multiStopLimit, refuse to create new MultiXactIds.
 	 *
@@ -1151,90 +1027,10 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 	ExtendMultiXactOffset(result);
 
 	/*
-	 * Reserve the members space, similarly to above.  Also, be careful not to
-	 * return zero as the starting offset for any multixact. See
-	 * GetMultiXactIdMembers() for motivation.
+	 * Reserve the members space, similarly to above.
 	 */
 	nextOffset = MultiXactState->nextOffset;
-	if (nextOffset == 0)
-	{
-		*offset = 1;
-		nmembers++;				/* allocate member slot 0 too */
-	}
-	else
-		*offset = nextOffset;
-
-	/*----------
-	 * Protect against overrun of the members space as well, with the
-	 * following rules:
-	 *
-	 * If we're past offsetStopLimit, refuse to generate more multis.
-	 * If we're close to offsetStopLimit, emit a warning.
-	 *
-	 * Arbitrarily, we start emitting warnings when we're 20 segments or less
-	 * from offsetStopLimit.
-	 *
-	 * Note we haven't updated the shared state yet, so if we fail at this
-	 * point, the multixact ID we grabbed can still be used by the next guy.
-	 *
-	 * Note that there is no point in forcing autovacuum runs here: the
-	 * multixact freeze settings would have to be reduced for that to have any
-	 * effect.
-	 *----------
-	 */
-#define OFFSET_WARN_SEGMENTS	20
-	if (MultiXactState->oldestOffsetKnown &&
-		MultiXactOffsetWouldWrap(MultiXactState->offsetStopLimit, nextOffset,
-								 nmembers))
-	{
-		/* see comment in the corresponding offsets wraparound case */
-		SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER);
-
-		ereport(ERROR,
-				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg("multixact \"members\" limit exceeded"),
-				 errdetail_plural("This command would create a multixact with %u members, but the remaining space is only enough for %u member.",
-								  "This command would create a multixact with %u members, but the remaining space is only enough for %u members.",
-								  MultiXactState->offsetStopLimit - nextOffset - 1,
-								  nmembers,
-								  MultiXactState->offsetStopLimit - nextOffset - 1),
-				 errhint("Execute a database-wide VACUUM in database with OID %u with reduced \"vacuum_multixact_freeze_min_age\" and \"vacuum_multixact_freeze_table_age\" settings.",
-						 MultiXactState->oldestMultiXactDB)));
-	}
-
-	/*
-	 * Check whether we should kick autovacuum into action, to prevent members
-	 * wraparound. NB we use a much larger window to trigger autovacuum than
-	 * just the warning limit. The warning is just a measure of last resort -
-	 * this is in line with GetNewTransactionId's behaviour.
-	 */
-	if (!MultiXactState->oldestOffsetKnown ||
-		(MultiXactState->nextOffset - MultiXactState->oldestOffset
-		 > MULTIXACT_MEMBER_SAFE_THRESHOLD))
-	{
-		/*
-		 * To avoid swamping the postmaster with signals, we issue the autovac
-		 * request only when crossing a segment boundary. With default
-		 * compilation settings that's roughly after 50k members.  This still
-		 * gives plenty of chances before we get into real trouble.
-		 */
-		if ((MXOffsetToMemberPage(nextOffset) / SLRU_PAGES_PER_SEGMENT) !=
-			(MXOffsetToMemberPage(nextOffset + nmembers) / SLRU_PAGES_PER_SEGMENT))
-			SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER);
-	}
-
-	if (MultiXactState->oldestOffsetKnown &&
-		MultiXactOffsetWouldWrap(MultiXactState->offsetStopLimit,
-								 nextOffset,
-								 nmembers + MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT * OFFSET_WARN_SEGMENTS))
-		ereport(WARNING,
-				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg_plural("database with OID %u must be vacuumed before %d more multixact member is used",
-							   "database with OID %u must be vacuumed before %d more multixact members are used",
-							   MultiXactState->offsetStopLimit - nextOffset + nmembers,
-							   MultiXactState->oldestMultiXactDB,
-							   MultiXactState->offsetStopLimit - nextOffset + nmembers),
-				 errhint("Execute a database-wide VACUUM in that database with reduced \"vacuum_multixact_freeze_min_age\" and \"vacuum_multixact_freeze_table_age\" settings.")));
+	*offset = nextOffset;
 
 	ExtendMultiXactMember(nextOffset, nmembers);
 
@@ -2620,22 +2416,9 @@ ExtendMultiXactMember(MultiXactOffset offset, int nmembers)
 		}
 
 		/*
-		 * Compute the number of items till end of current page.  Careful: if
-		 * addition of unsigned ints wraps around, we're at the last page of
-		 * the last segment; since that page holds a different number of items
-		 * than other pages, we need to do it differently.
+		 * Compute the number of items till end of current page.
 		 */
-		if (offset + MAX_MEMBERS_IN_LAST_MEMBERS_PAGE < offset)
-		{
-			/*
-			 * This is the last page of the last segment; we can compute the
-			 * number of items left to allocate in it without modulo
-			 * arithmetic.
-			 */
-			difference = MaxMultiXactOffset - offset + 1;
-		}
-		else
-			difference = MULTIXACT_MEMBERS_PER_PAGE - offset % MULTIXACT_MEMBERS_PER_PAGE;
+		difference = MULTIXACT_MEMBERS_PER_PAGE - offset % MULTIXACT_MEMBERS_PER_PAGE;
 
 		/*
 		 * Advance to next page, taking care to properly handle the wraparound
@@ -2701,15 +2484,13 @@ GetOldestMultiXactId(void)
 }
 
 /*
- * Determine how aggressively we need to vacuum in order to prevent member
- * wraparound.
+ * Determine whether autovacuum is needed to reclaim multixact member space.
  *
  * To do so determine what's the oldest member offset and install the limit
  * info in MultiXactState, where it can be used to prevent overrun of old data
  * in the members SLRU area.
  *
- * The return value is true if emergency autovacuum is required and false
- * otherwise.
+ * The return value is true if autovacuum is required and false otherwise.
  */
 static bool
 SetOffsetVacuumLimit(bool is_startup)
@@ -2717,12 +2498,7 @@ SetOffsetVacuumLimit(bool is_startup)
 	MultiXactId oldestMultiXactId;
 	MultiXactId nextMXact;
 	MultiXactOffset oldestOffset = 0;	/* placate compiler */
-	MultiXactOffset prevOldestOffset;
 	MultiXactOffset nextOffset;
-	bool		oldestOffsetKnown = false;
-	bool		prevOldestOffsetKnown;
-	MultiXactOffset offsetStopLimit = 0;
-	MultiXactOffset prevOffsetStopLimit;
 
 	/*
 	 * NB: Have to prevent concurrent truncation, we might otherwise try to
@@ -2735,9 +2511,6 @@ SetOffsetVacuumLimit(bool is_startup)
 	oldestMultiXactId = MultiXactState->oldestMultiXactId;
 	nextMXact = MultiXactState->nextMXact;
 	nextOffset = MultiXactState->nextOffset;
-	prevOldestOffsetKnown = MultiXactState->oldestOffsetKnown;
-	prevOldestOffset = MultiXactState->oldestOffset;
-	prevOffsetStopLimit = MultiXactState->offsetStopLimit;
 	Assert(MultiXactState->finishedStartup);
 	LWLockRelease(MultiXactGenLock);
 
@@ -2755,139 +2528,31 @@ SetOffsetVacuumLimit(bool is_startup)
 		 * offset.
 		 */
 		oldestOffset = nextOffset;
-		oldestOffsetKnown = true;
 	}
 	else
-	{
-		/*
-		 * Figure out where the oldest existing multixact's offsets are
-		 * stored. Due to bugs in early release of PostgreSQL 9.3.X and 9.4.X,
-		 * the supposedly-earliest multixact might not really exist.  We are
-		 * careful not to fail in that case.
-		 */
-		oldestOffsetKnown =
-			find_multixact_start(oldestMultiXactId, &oldestOffset);
-
-		if (oldestOffsetKnown)
-			ereport(DEBUG1,
-					(errmsg_internal("oldest MultiXactId member is at offset %u",
-									 oldestOffset)));
-		else
-			ereport(LOG,
-					(errmsg("MultiXact member wraparound protections are disabled because oldest checkpointed MultiXact %u does not exist on disk",
-							oldestMultiXactId)));
-	}
+		oldestOffset = find_multixact_start(oldestMultiXactId);
 
 	LWLockRelease(MultiXactTruncationLock);
 
-	/*
-	 * If we can, compute limits (and install them MultiXactState) to prevent
-	 * overrun of old data in the members SLRU area. We can only do so if the
-	 * oldest offset is known though.
-	 */
-	if (oldestOffsetKnown)
-	{
-		/* move back to start of the corresponding segment */
-		offsetStopLimit = oldestOffset - (oldestOffset %
-										  (MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT));
-
-		/* always leave one segment before the wraparound point */
-		offsetStopLimit -= (MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT);
-
-		if (!prevOldestOffsetKnown && !is_startup)
-			ereport(LOG,
-					(errmsg("MultiXact member wraparound protections are now enabled")));
-
-		ereport(DEBUG1,
-				(errmsg_internal("MultiXact member stop limit is now %u based on MultiXact %u",
-								 offsetStopLimit, oldestMultiXactId)));
-	}
-	else if (prevOldestOffsetKnown)
-	{
-		/*
-		 * If we failed to get the oldest offset this time, but we have a
-		 * value from a previous pass through this function, use the old
-		 * values rather than automatically forcing an emergency autovacuum
-		 * cycle again.
-		 */
-		oldestOffset = prevOldestOffset;
-		oldestOffsetKnown = true;
-		offsetStopLimit = prevOffsetStopLimit;
-	}
-
 	/* Install the computed values */
 	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
 	MultiXactState->oldestOffset = oldestOffset;
-	MultiXactState->oldestOffsetKnown = oldestOffsetKnown;
-	MultiXactState->offsetStopLimit = offsetStopLimit;
 	LWLockRelease(MultiXactGenLock);
 
 	/*
-	 * Do we need an emergency autovacuum?	If we're not sure, assume yes.
-	 */
-	return !oldestOffsetKnown ||
-		(nextOffset - oldestOffset > MULTIXACT_MEMBER_SAFE_THRESHOLD);
-}
-
-/*
- * Return whether adding "distance" to "start" would move past "boundary".
- *
- * We use this to determine whether the addition is "wrapping around" the
- * boundary point, hence the name.  The reason we don't want to use the regular
- * 2^31-modulo arithmetic here is that we want to be able to use the whole of
- * the 2^32-1 space here, allowing for more multixacts than would fit
- * otherwise.
- */
-static bool
-MultiXactOffsetWouldWrap(MultiXactOffset boundary, MultiXactOffset start,
-						 uint32 distance)
-{
-	MultiXactOffset finish;
-
-	/*
-	 * Note that offset number 0 is not used (see GetMultiXactIdMembers), so
-	 * if the addition wraps around the UINT_MAX boundary, skip that value.
-	 */
-	finish = start + distance;
-	if (finish < start)
-		finish++;
-
-	/*-----------------------------------------------------------------------
-	 * When the boundary is numerically greater than the starting point, any
-	 * value numerically between the two is not wrapped:
-	 *
-	 *	<----S----B---->
-	 *	[---)			 = F wrapped past B (and UINT_MAX)
-	 *		 [---)		 = F not wrapped
-	 *			  [----] = F wrapped past B
-	 *
-	 * When the boundary is numerically less than the starting point (i.e. the
-	 * UINT_MAX wraparound occurs somewhere in between) then all values in
-	 * between are wrapped:
-	 *
-	 *	<----B----S---->
-	 *	[---)			 = F not wrapped past B (but wrapped past UINT_MAX)
-	 *		 [---)		 = F wrapped past B (and UINT_MAX)
-	 *			  [----] = F not wrapped
-	 *-----------------------------------------------------------------------
+	 * Do we need autovacuum?
 	 */
-	if (start < boundary)
-		return finish >= boundary || finish < start;
-	else
-		return finish >= boundary && finish < start;
+	return (nextOffset - oldestOffset > MULTIXACT_MEMBER_AUTOVAC_THRESHOLD);
 }
 
 /*
  * Find the starting offset of the given MultiXactId.
  *
- * Returns false if the file containing the multi does not exist on disk.
- * Otherwise, returns true and sets *result to the starting member offset.
- *
  * This function does not prevent concurrent truncation, so if that's
  * required, the caller has to protect against that.
  */
-static bool
-find_multixact_start(MultiXactId multi, MultiXactOffset *result)
+static MultiXactOffset
+find_multixact_start(MultiXactId multi)
 {
 	MultiXactOffset offset;
 	int64		pageno;
@@ -2900,15 +2565,6 @@ find_multixact_start(MultiXactId multi, MultiXactOffset *result)
 	pageno = MultiXactIdToOffsetPage(multi);
 	entryno = MultiXactIdToOffsetEntry(multi);
 
-	/*
-	 * Write out dirty data, so PhysicalPageExists can work correctly.
-	 */
-	SimpleLruWriteAll(MultiXactOffsetCtl, true);
-	SimpleLruWriteAll(MultiXactMemberCtl, true);
-
-	if (!SimpleLruDoesPhysicalPageExist(MultiXactOffsetCtl, pageno))
-		return false;
-
 	/* lock is acquired by SimpleLruReadPage_ReadOnly */
 	slotno = SimpleLruReadPage_ReadOnly(MultiXactOffsetCtl, pageno, multi);
 	offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
@@ -2916,102 +2572,7 @@ find_multixact_start(MultiXactId multi, MultiXactOffset *result)
 	offset = *offptr;
 	LWLockRelease(SimpleLruGetBankLock(MultiXactOffsetCtl, pageno));
 
-	*result = offset;
-	return true;
-}
-
-/*
- * Determine how many multixacts, and how many multixact members, currently
- * exist.  Return false if unable to determine.
- */
-static bool
-ReadMultiXactCounts(uint32 *multixacts, MultiXactOffset *members)
-{
-	MultiXactOffset nextOffset;
-	MultiXactOffset oldestOffset;
-	MultiXactId oldestMultiXactId;
-	MultiXactId nextMultiXactId;
-	bool		oldestOffsetKnown;
-
-	LWLockAcquire(MultiXactGenLock, LW_SHARED);
-	nextOffset = MultiXactState->nextOffset;
-	oldestMultiXactId = MultiXactState->oldestMultiXactId;
-	nextMultiXactId = MultiXactState->nextMXact;
-	oldestOffset = MultiXactState->oldestOffset;
-	oldestOffsetKnown = MultiXactState->oldestOffsetKnown;
-	LWLockRelease(MultiXactGenLock);
-
-	if (!oldestOffsetKnown)
-		return false;
-
-	*members = nextOffset - oldestOffset;
-	*multixacts = nextMultiXactId - oldestMultiXactId;
-	return true;
-}
-
-/*
- * Multixact members can be removed once the multixacts that refer to them
- * are older than every datminmxid.  autovacuum_multixact_freeze_max_age and
- * vacuum_multixact_freeze_table_age work together to make sure we never have
- * too many multixacts; we hope that, at least under normal circumstances,
- * this will also be sufficient to keep us from using too many offsets.
- * However, if the average multixact has many members, we might exhaust the
- * members space while still using few enough members that these limits fail
- * to trigger relminmxid advancement by VACUUM.  At that point, we'd have no
- * choice but to start failing multixact-creating operations with an error.
- *
- * To prevent that, if more than a threshold portion of the members space is
- * used, we effectively reduce autovacuum_multixact_freeze_max_age and
- * to a value just less than the number of multixacts in use.  We hope that
- * this will quickly trigger autovacuuming on the table or tables with the
- * oldest relminmxid, thus allowing datminmxid values to advance and removing
- * some members.
- *
- * As the fraction of the member space currently in use grows, we become
- * more aggressive in clamping this value.  That not only causes autovacuum
- * to ramp up, but also makes any manual vacuums the user issues more
- * aggressive.  This happens because vacuum_get_cutoffs() will clamp the
- * freeze table and the minimum freeze age cutoffs based on the effective
- * autovacuum_multixact_freeze_max_age this function returns.  In the worst
- * case, we'll claim the freeze_max_age to zero, and every vacuum of any
- * table will freeze every multixact.
- */
-int
-MultiXactMemberFreezeThreshold(void)
-{
-	MultiXactOffset members;
-	uint32		multixacts;
-	uint32		victim_multixacts;
-	double		fraction;
-	int			result;
-
-	/* If we can't determine member space utilization, assume the worst. */
-	if (!ReadMultiXactCounts(&multixacts, &members))
-		return 0;
-
-	/* If member space utilization is low, no special action is required. */
-	if (members <= MULTIXACT_MEMBER_SAFE_THRESHOLD)
-		return autovacuum_multixact_freeze_max_age;
-
-	/*
-	 * Compute a target for relminmxid advancement.  The number of multixacts
-	 * we try to eliminate from the system is based on how far we are past
-	 * MULTIXACT_MEMBER_SAFE_THRESHOLD.
-	 */
-	fraction = (double) (members - MULTIXACT_MEMBER_SAFE_THRESHOLD) /
-		(MULTIXACT_MEMBER_DANGER_THRESHOLD - MULTIXACT_MEMBER_SAFE_THRESHOLD);
-	victim_multixacts = multixacts * fraction;
-
-	/* fraction could be > 1.0, but lowest possible freeze age is zero */
-	if (victim_multixacts > multixacts)
-		return 0;
-	result = multixacts - victim_multixacts;
-
-	/*
-	 * Clamp to autovacuum_multixact_freeze_max_age, so that we never make
-	 * autovacuum less aggressive than it would otherwise be.
-	 */
-	return Min(result, autovacuum_multixact_freeze_max_age);
+	return offset;
 }
 
 typedef struct mxtruncinfo
@@ -3039,37 +2600,13 @@ SlruScanDirCbFindEarliest(SlruCtl ctl, char *filename, int64 segpage, void *data
 
 
 /*
- * Delete members segments [oldest, newOldest)
- *
- * The members SLRU can, in contrast to the offsets one, be filled to almost
- * the full range at once. This means SimpleLruTruncate() can't trivially be
- * used - instead the to-be-deleted range is computed using the offsets
- * SLRU. C.f. TruncateMultiXact().
+ * Delete members segments before the newOldestOffset.
  */
 static void
-PerformMembersTruncation(MultiXactOffset oldestOffset, MultiXactOffset newOldestOffset)
+PerformMembersTruncation(MultiXactOffset newOldestOffset)
 {
-	const int64 maxsegment = MXOffsetToMemberSegment(MaxMultiXactOffset);
-	int64		startsegment = MXOffsetToMemberSegment(oldestOffset);
-	int64		endsegment = MXOffsetToMemberSegment(newOldestOffset);
-	int64		segment = startsegment;
-
-	/*
-	 * Delete all the segments but the last one. The last segment can still
-	 * contain, possibly partially, valid data.
-	 */
-	while (segment != endsegment)
-	{
-		elog(DEBUG2, "truncating multixact members segment %llx",
-			 (unsigned long long) segment);
-		SlruDeleteSegment(MultiXactMemberCtl, segment);
-
-		/* move to next segment, handling wraparound correctly */
-		if (segment == maxsegment)
-			segment = 0;
-		else
-			segment += 1;
-	}
+	SimpleLruTruncate(MultiXactMemberCtl,
+					  MXOffsetToMemberPage(newOldestOffset));
 }
 
 /*
@@ -3174,23 +2711,15 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 	/*
 	 * First, compute the safe truncation point for MultiXactMember. This is
 	 * the starting offset of the oldest multixact.
-	 *
-	 * Hopefully, find_multixact_start will always work here, because we've
-	 * already checked that it doesn't precede the earliest MultiXact on disk.
-	 * But if it fails, don't truncate anything, and log a message.
 	 */
 	if (oldestMulti == nextMulti)
 	{
 		/* there are NO MultiXacts */
 		oldestOffset = nextOffset;
 	}
-	else if (!find_multixact_start(oldestMulti, &oldestOffset))
+	else
 	{
-		ereport(LOG,
-				(errmsg("oldest MultiXact %u not found, earliest MultiXact %u, skipping truncation",
-						oldestMulti, earliest)));
-		LWLockRelease(MultiXactTruncationLock);
-		return;
+		oldestOffset = find_multixact_start(oldestMulti);
 	}
 
 	/*
@@ -3202,13 +2731,9 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 		/* there are NO MultiXacts */
 		newOldestOffset = nextOffset;
 	}
-	else if (!find_multixact_start(newOldestMulti, &newOldestOffset))
+	else
 	{
-		ereport(LOG,
-				(errmsg("cannot truncate up to MultiXact %u because it does not exist on disk, skipping truncation",
-						newOldestMulti)));
-		LWLockRelease(MultiXactTruncationLock);
-		return;
+		newOldestOffset = find_multixact_start(newOldestMulti);
 	}
 
 	elog(DEBUG1, "performing multixact truncation: "
@@ -3258,7 +2783,7 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 	LWLockRelease(MultiXactGenLock);
 
 	/* First truncate members */
-	PerformMembersTruncation(oldestOffset, newOldestOffset);
+	PerformMembersTruncation(newOldestOffset);
 
 	/* Then offsets */
 	PerformOffsetsTruncation(oldestMulti, newOldestMulti);
@@ -3345,7 +2870,7 @@ MultiXactIdPrecedesOrEquals(MultiXactId multi1, MultiXactId multi2)
 static bool
 MultiXactOffsetPrecedes(MultiXactOffset offset1, MultiXactOffset offset2)
 {
-	int32		diff = (int32) (offset1 - offset2);
+	int64		diff = (int64) (offset1 - offset2);
 
 	return (diff < 0);
 }
@@ -3492,7 +3017,7 @@ multixact_redo(XLogReaderState *record)
 		 */
 		SetMultiXactIdLimit(xlrec.endTruncOff, xlrec.oldestMultiDB, false);
 
-		PerformMembersTruncation(xlrec.startTruncMemb, xlrec.endTruncMemb);
+		PerformMembersTruncation(xlrec.endTruncMemb);
 
 		/*
 		 * During XLOG replay, latest_page_number isn't necessarily set up
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index bf3dbda901..a813a090fa 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5083,7 +5083,7 @@ BootStrapXLOG(uint32 data_checksum_version)
 		FullTransactionIdFromEpochAndXid(0, FirstNormalTransactionId);
 	checkPoint.nextOid = FirstGenbkiObjectId;
 	checkPoint.nextMulti = FirstMultiXactId;
-	checkPoint.nextMultiOffset = 0;
+	checkPoint.nextMultiOffset = 1;
 	checkPoint.oldestXid = FirstNormalTransactionId;
 	checkPoint.oldestXidDB = Template1DbOid;
 	checkPoint.oldestMulti = FirstMultiXactId;
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index e6745e6145..c96fbf004d 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -1134,7 +1134,7 @@ vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
 	 * normally autovacuum_multixact_freeze_max_age, but may be less if we are
 	 * short of multixact member space.
 	 */
-	effective_multixact_freeze_max_age = MultiXactMemberFreezeThreshold();
+	effective_multixact_freeze_max_age = autovacuum_multixact_freeze_max_age;
 
 	/*
 	 * Almost ready to set freeze output parameters; check if OldestXmin or
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 0ab921a169..ed5fc09c38 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -1134,7 +1134,7 @@ do_start_worker(void)
 
 	/* Also determine the oldest datminmxid we will consider. */
 	recentMulti = ReadNextMultiXactId();
-	multiForceLimit = recentMulti - MultiXactMemberFreezeThreshold();
+	multiForceLimit = recentMulti - autovacuum_multixact_freeze_max_age;
 	if (multiForceLimit < FirstMultiXactId)
 		multiForceLimit -= FirstMultiXactId;
 
@@ -1922,7 +1922,7 @@ do_autovacuum(void)
 	 * normally autovacuum_multixact_freeze_max_age, but may be less if we are
 	 * short of multixact member space.
 	 */
-	effective_multixact_freeze_max_age = MultiXactMemberFreezeThreshold();
+	effective_multixact_freeze_max_age = autovacuum_multixact_freeze_max_age;
 
 	/*
 	 * Find the pg_database entry and select the default freeze ages. We use
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index fff401e469..4ad64cf1ed 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -264,7 +264,7 @@ main(int argc, char *argv[])
 
 			case 'O':
 				errno = 0;
-				set_mxoff = strtoul(optarg, &endptr, 0);
+				set_mxoff = strtou64(optarg, &endptr, 0);
 				if (endptr == optarg || *endptr != '\0' || errno != 0)
 				{
 					pg_log_error("invalid argument for option %s", "-O");
diff --git a/src/bin/pg_resetwal/t/001_basic.pl b/src/bin/pg_resetwal/t/001_basic.pl
index 323cd483cf..e107646875 100644
--- a/src/bin/pg_resetwal/t/001_basic.pl
+++ b/src/bin/pg_resetwal/t/001_basic.pl
@@ -207,7 +207,7 @@ push @cmd,
   sprintf("%d,%d", hex($files[0]) == 0 ? 3 : hex($files[0]), hex($files[-1]));
 
 @files = get_slru_files('pg_multixact/offsets');
-$mult = 32 * $blcksz / 4;
+$mult = 32 * $blcksz / 8;
 # --multixact-ids argument is "new,old"
 push @cmd,
   '--multixact-ids' => sprintf("%d,%d",
diff --git a/src/include/access/multixact.h b/src/include/access/multixact.h
index 4e6b0eec2f..5ee632dfe6 100644
--- a/src/include/access/multixact.h
+++ b/src/include/access/multixact.h
@@ -27,7 +27,7 @@
 
 #define MultiXactIdIsValid(multi) ((multi) != InvalidMultiXactId)
 
-#define MaxMultiXactOffset	((MultiXactOffset) 0xFFFFFFFF)
+#define MaxMultiXactOffset	UINT64CONST(0xFFFFFFFFFFFFFFFF)
 
 /*
  * Possible multixact lock modes ("status").  The first four modes are for
@@ -143,7 +143,6 @@ extern void MultiXactSetNextMXact(MultiXactId nextMulti,
 extern void MultiXactAdvanceNextMXact(MultiXactId minMulti,
 									  MultiXactOffset minMultiOffset);
 extern void MultiXactAdvanceOldest(MultiXactId oldestMulti, Oid oldestMultiDB);
-extern int	MultiXactMemberFreezeThreshold(void);
 
 extern void multixact_twophase_recover(TransactionId xid, uint16 info,
 									   void *recdata, uint32 len);
diff --git a/src/include/access/multixact_internal.h b/src/include/access/multixact_internal.h
new file mode 100644
index 0000000000..39e74a21c7
--- /dev/null
+++ b/src/include/access/multixact_internal.h
@@ -0,0 +1,115 @@
+/*
+ * multixact_internal.h
+ *
+ * Internal definitions for the on-disk format of multixact manager.
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/multixact_internal.h
+ */
+#ifndef MULTIXACT_INTERNAL_H
+#define MULTIXACT_INTERNAL_H
+
+/* FIXME: had to duplicate this */
+#define SLRU_PAGES_PER_SEGMENT	32
+
+/*
+ * Defines for MultiXactOffset page sizes.  A page is the same BLCKSZ as is
+ * used everywhere else in Postgres.
+ */
+
+/* We need eight bytes per offset */
+#define MULTIXACT_OFFSETS_PER_PAGE (BLCKSZ / sizeof(MultiXactOffset))
+
+static inline int64
+MultiXactIdToOffsetPage(MultiXactId multi)
+{
+	return multi / MULTIXACT_OFFSETS_PER_PAGE;
+}
+
+static inline int
+MultiXactIdToOffsetEntry(MultiXactId multi)
+{
+	return multi % MULTIXACT_OFFSETS_PER_PAGE;
+}
+
+static inline int64
+MultiXactIdToOffsetSegment(MultiXactId multi)
+{
+	return MultiXactIdToOffsetPage(multi) / SLRU_PAGES_PER_SEGMENT;
+}
+
+/*
+ * The situation for members is a bit more complex: we store one byte of
+ * additional flag bits for each TransactionId.  To do this without getting
+ * into alignment issues, we store four bytes of flags, and then the
+ * corresponding 4 Xids.  Each such 5-word (20-byte) set we call a "group", and
+ * are stored as a whole in pages.  Thus, with 8kB BLCKSZ, we keep 409 groups
+ * per page.  This wastes 12 bytes per page, but that's OK -- simplicity (and
+ * performance) trumps space efficiency here.
+ *
+ * Note that the "offset" macros work with byte offset, not array indexes, so
+ * arithmetic must be done using "char *" pointers.
+ */
+/* We need eight bits per xact, so one xact fits in a byte */
+#define MXACT_MEMBER_BITS_PER_XACT			8
+#define MXACT_MEMBER_FLAGS_PER_BYTE			1
+#define MXACT_MEMBER_XACT_BITMASK	((1 << MXACT_MEMBER_BITS_PER_XACT) - 1)
+
+/* how many full bytes of flags are there in a group? */
+#define MULTIXACT_FLAGBYTES_PER_GROUP		4
+#define MULTIXACT_MEMBERS_PER_MEMBERGROUP	\
+	(MULTIXACT_FLAGBYTES_PER_GROUP * MXACT_MEMBER_FLAGS_PER_BYTE)
+/* size in bytes of a complete group */
+#define MULTIXACT_MEMBERGROUP_SIZE \
+	(sizeof(TransactionId) * MULTIXACT_MEMBERS_PER_MEMBERGROUP + MULTIXACT_FLAGBYTES_PER_GROUP)
+#define MULTIXACT_MEMBERGROUPS_PER_PAGE (BLCKSZ / MULTIXACT_MEMBERGROUP_SIZE)
+#define MULTIXACT_MEMBERS_PER_PAGE	\
+	(MULTIXACT_MEMBERGROUPS_PER_PAGE * MULTIXACT_MEMBERS_PER_MEMBERGROUP)
+
+/* page in which a member is to be found */
+static inline int64
+MXOffsetToMemberPage(MultiXactOffset offset)
+{
+	return offset / MULTIXACT_MEMBERS_PER_PAGE;
+}
+
+static inline int64
+MXOffsetToMemberSegment(MultiXactOffset offset)
+{
+	return MXOffsetToMemberPage(offset) / SLRU_PAGES_PER_SEGMENT;
+}
+
+/* Location (byte offset within page) of flag word for a given member */
+static inline int
+MXOffsetToFlagsOffset(MultiXactOffset offset)
+{
+	MultiXactOffset group = offset / MULTIXACT_MEMBERS_PER_MEMBERGROUP;
+	int			grouponpg = group % MULTIXACT_MEMBERGROUPS_PER_PAGE;
+	int			byteoff = grouponpg * MULTIXACT_MEMBERGROUP_SIZE;
+
+	return byteoff;
+}
+
+static inline int
+MXOffsetToFlagsBitShift(MultiXactOffset offset)
+{
+	int			member_in_group = offset % MULTIXACT_MEMBERS_PER_MEMBERGROUP;
+	int			bshift = member_in_group * MXACT_MEMBER_BITS_PER_XACT;
+
+	return bshift;
+}
+
+/* Location (byte offset within page) of TransactionId of given member */
+static inline int
+MXOffsetToMemberOffset(MultiXactOffset offset)
+{
+	int			member_in_group = offset % MULTIXACT_MEMBERS_PER_MEMBERGROUP;
+
+	return MXOffsetToFlagsOffset(offset) +
+		MULTIXACT_FLAGBYTES_PER_GROUP +
+		member_in_group * sizeof(TransactionId);
+}
+
+#endif							/* MULTIXACT_INTERNAL_H */
diff --git a/src/include/c.h b/src/include/c.h
index a14c631516..318194f78d 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -618,7 +618,7 @@ typedef uint32 SubTransactionId;
 /* MultiXactId must be equivalent to TransactionId, to fit in t_xmax */
 typedef TransactionId MultiXactId;
 
-typedef uint32 MultiXactOffset;
+typedef uint64 MultiXactOffset;
 
 typedef uint32 CommandId;
 
-- 
2.43.0
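
The page arithmetic described in the multixact_internal.h comments can be checked with a small standalone sketch. It assumes the default 8 kB BLCKSZ and a 32-bit TransactionId; the shortened macro names and the helper functions flags_offset(), member_offset() and member_page() are illustrative stand-ins for the patch's definitions, not copies of them.

```c
#include <assert.h>
#include <stdint.h>

/* Assumptions for this sketch: default 8 kB block, 32-bit TransactionId. */
#define BLCKSZ 8192
typedef uint32_t TransactionId;
typedef uint64_t MultiXactOffset;	/* 64 bits wide, as in the patch */

/* Offsets SLRU: one MultiXactOffset per multixact. */
#define OFFSETS_PER_PAGE (BLCKSZ / sizeof(MultiXactOffset))

/* Members SLRU: 4 flag bytes followed by 4 xids form one group. */
#define FLAGBYTES_PER_GROUP 4
#define MEMBERS_PER_GROUP 4
#define GROUP_SIZE \
	(sizeof(TransactionId) * MEMBERS_PER_GROUP + FLAGBYTES_PER_GROUP)
#define GROUPS_PER_PAGE (BLCKSZ / GROUP_SIZE)
#define MEMBERS_PER_PAGE (GROUPS_PER_PAGE * MEMBERS_PER_GROUP)

/* Compile-time checks of the numbers quoted in the header comments. */
static_assert(OFFSETS_PER_PAGE == 1024, "8 bytes per offset");
static_assert(GROUP_SIZE == 20, "5-word group");
static_assert(GROUPS_PER_PAGE == 409, "groups per 8 kB page");
static_assert(MEMBERS_PER_PAGE == 1636, "members per 8 kB page");
static_assert(BLCKSZ - GROUPS_PER_PAGE * GROUP_SIZE == 12, "wasted bytes");

/* Byte offset within the page of the flag word for a member. */
static inline int
flags_offset(MultiXactOffset offset)
{
	MultiXactOffset group = offset / MEMBERS_PER_GROUP;

	return (int) ((group % GROUPS_PER_PAGE) * GROUP_SIZE);
}

/* Byte offset within the page of the member's TransactionId. */
static inline int
member_offset(MultiXactOffset offset)
{
	int			member_in_group = (int) (offset % MEMBERS_PER_GROUP);

	return flags_offset(offset) + FLAGBYTES_PER_GROUP +
		member_in_group * (int) sizeof(TransactionId);
}

/* Members page on which a given offset lives. */
static inline int64_t
member_page(MultiXactOffset offset)
{
	return (int64_t) (offset / MEMBERS_PER_PAGE);
}
```

With these definitions, the last member on a page (offset 1635) has its xid at byte 8176, so the final group ends at byte 8180 and the remaining 12 bytes of the page stay unused.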

v13-0006-TEST-try-to-replicate-buggy-oldest-offset.patch.txt (text/plain; charset=US-ASCII)
From 4a2e64b44cf8ec22a264d8fa495432f535482fb4 Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Fri, 27 Dec 2024 19:39:58 +0300
Subject: [PATCH v13 6/7] TEST: try to replicate buggy oldest offset

---
 src/bin/pg_upgrade/t/005_offset.pl | 59 ++++++++++++++++++++++++++++++
 1 file changed, 59 insertions(+)

diff --git a/src/bin/pg_upgrade/t/005_offset.pl b/src/bin/pg_upgrade/t/005_offset.pl
index df84186de4..2d91d101fa 100644
--- a/src/bin/pg_upgrade/t/005_offset.pl
+++ b/src/bin/pg_upgrade/t/005_offset.pl
@@ -305,6 +305,8 @@ my @TESTS = (
 	100, 101, 102, 103, 104, 105, 106,
 	# self upgrade
 	1000,
+	# buggy
+	2000
 );
 
 # =============================================================================
@@ -560,4 +562,61 @@ SKIP:
 	ok(1, "TEST $TEST_NO PASSED");
 }
 
+# =============================================================================
+# Buggy
+# =============================================================================
+
+SKIP:
+{
+	my $TEST_NO = 2000;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	if (!defined($ENV{oldinstall}))
+	{
+		die "oldinstall is not defined";
+	}
+
+	my $dbname = 'buggy';
+	my $newnode = PostgreSQL::Test::Cluster->new("new_$dbname");
+	my $oldnode = PostgreSQL::Test::Cluster->new("old_$dbname",
+					install_path => $ENV{oldinstall});
+	$newnode->init;
+	$oldnode->init(force_initdb => 1);
+
+	$oldnode->append_conf('postgresql.conf', q(
+		autovacuum = off
+		max_prepared_transactions = 2
+		fsync = off
+	));
+	$oldnode->start;
+
+	mxid_gen2($oldnode, 'FOO', 1.25);
+	mxid_gen2($oldnode, 'BAR', 1.25);
+
+	$oldnode->safe_psql('postgres', q(
+		DROP TABLE BAR;
+		CHECKPOINT;
+	));
+	$oldnode->stop;
+
+	unlink($oldnode->data_dir . "/pg_multixact/offsets/0000");
+
+	command_ok(
+		[
+			'pg_upgrade', '--no-sync',
+			'-d', $oldnode->data_dir,
+			'-D', $newnode->data_dir,
+			'-b', $oldnode->config_data('--bindir'),
+			'-B', $newnode->config_data('--bindir'),
+			'-s', $newnode->host,
+			'-p', $oldnode->port,
+			'-P', $newnode->port,
+			'--copy'
+		],
+		'run of pg_upgrade');
+
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
 done_testing();
-- 
2.43.0

v13-0007-TEST-bump-catver.patch.txt (text/plain; charset=US-ASCII)
From 52b7019b4b964bd221de2c891d2b1f073b7465bf Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Wed, 13 Nov 2024 16:34:34 +0300
Subject: [PATCH v13 7/7] TEST: bump catver

---
 src/bin/pg_upgrade/pg_upgrade.h  | 2 +-
 src/include/catalog/catversion.h | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index 9b3d645b08..0fd791c442 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -119,7 +119,7 @@ extern char *output_files[];
  *
  * XXX: should be changed to the actual CATALOG_VERSION_NO on commit.
  */
-#define MULTIXACTOFFSET_FORMATCHANGE_CAT_VER 202409041
+#define MULTIXACTOFFSET_FORMATCHANGE_CAT_VER 202501283
 
 /*
  * large object chunk size added to pg_controldata,
diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h
index 6edaa20368..dfcb940501 100644
--- a/src/include/catalog/catversion.h
+++ b/src/include/catalog/catversion.h
@@ -57,6 +57,6 @@
  */
 
 /*							yyyymmddN */
-#define CATALOG_VERSION_NO	202501282
+#define CATALOG_VERSION_NO	202501283
 
 #endif
-- 
2.43.0

#33 Maxim Orlov
orlovmg@gmail.com
In reply to: Maxim Orlov (#32)
7 attachment(s)
Re: POC: make mxidoff 64 bits

Here is a rebase, v14.

--
Best regards,
Maxim Orlov.

Attachments:

v14-0005-TEST-initdb-option-to-initialize-cluster-with-no.patch.txt (text/plain; charset=US-ASCII)
From ee4b3b3c3ad3293460eb1f0418d87a065b9a589b Mon Sep 17 00:00:00 2001
From: Maxim Orlov <m.orlov@postgrespro.ru>
Date: Wed, 4 May 2022 15:53:36 +0300
Subject: [PATCH v14 5/7] TEST: initdb option to initialize cluster with
 non-standard xid/mxid/mxoff

Until now, testing cluster wraparound has been awkward because initdb always
initializes the cluster with the default xid/mxid/mxoff. An option to specify
any valid xid/mxid/mxoff at cluster creation makes this easier.

Author: Maxim Orlov <orlovmg@gmail.com>
Author: Pavel Borisov <pashkin.elfe@gmail.com>
Author: Svetlana Derevyanko <s.derevyanko@postgrespro.ru>
Discussion: https://www.postgresql.org/message-id/flat/CACG%3Dezaa4vqYjJ16yoxgrpa-%3DgXnf0Vv3Ey9bjGrRRFN2YyWFQ%40mail.gmail.com
---
 src/backend/access/transam/clog.c      |  21 +++++
 src/backend/access/transam/multixact.c |  53 ++++++++++++
 src/backend/access/transam/subtrans.c  |   8 +-
 src/backend/access/transam/xlog.c      |  15 ++--
 src/backend/bootstrap/bootstrap.c      |  50 +++++++++++-
 src/backend/main/main.c                |   6 ++
 src/backend/postmaster/postmaster.c    |  14 +++-
 src/backend/tcop/postgres.c            |  53 +++++++++++-
 src/bin/initdb/initdb.c                | 107 ++++++++++++++++++++++++-
 src/bin/initdb/t/001_initdb.pl         |  60 ++++++++++++++
 src/include/access/xlog.h              |   3 +
 src/include/c.h                        |   4 +
 src/include/catalog/pg_class.h         |   2 +-
 13 files changed, 382 insertions(+), 14 deletions(-)

diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index 48f10bec91..eb8a9791ab 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -834,6 +834,7 @@ BootStrapCLOG(void)
 {
 	int			slotno;
 	LWLock	   *lock = SimpleLruGetBankLock(XactCtl, 0);
+	int64		pageno;
 
 	LWLockAcquire(lock, LW_EXCLUSIVE);
 
@@ -844,6 +845,26 @@ BootStrapCLOG(void)
 	SimpleLruWritePage(XactCtl, slotno);
 	Assert(!XactCtl->shared->page_dirty[slotno]);
 
+	pageno = TransactionIdToPage(XidFromFullTransactionId(TransamVariables->nextXid));
+	if (pageno != 0)
+	{
+		LWLock *nextlock = SimpleLruGetBankLock(XactCtl, pageno);
+
+		if (nextlock != lock)
+		{
+			LWLockRelease(lock);
+			LWLockAcquire(nextlock, LW_EXCLUSIVE);
+			lock = nextlock;
+		}
+
+		/* Create and zero the commit log page that holds nextXid */
+		slotno = ZeroCLOGPage(pageno, false);
+
+		/* Make sure it's written out */
+		SimpleLruWritePage(XactCtl, slotno);
+		Assert(!XactCtl->shared->page_dirty[slotno]);
+	}
+
 	LWLockRelease(lock);
 }
 
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 4dae8f4799..ac68becf8b 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1955,6 +1955,7 @@ BootStrapMultiXact(void)
 {
 	int			slotno;
 	LWLock	   *lock;
+	int64		pageno;
 
 	lock = SimpleLruGetBankLock(MultiXactOffsetCtl, 0);
 	LWLockAcquire(lock, LW_EXCLUSIVE);
@@ -1966,6 +1967,26 @@ BootStrapMultiXact(void)
 	SimpleLruWritePage(MultiXactOffsetCtl, slotno);
 	Assert(!MultiXactOffsetCtl->shared->page_dirty[slotno]);
 
+	pageno = MultiXactIdToOffsetPage(MultiXactState->nextMXact);
+	if (pageno != 0)
+	{
+		LWLock *nextlock = SimpleLruGetBankLock(MultiXactOffsetCtl, pageno);
+
+		if (nextlock != lock)
+		{
+			LWLockRelease(lock);
+			LWLockAcquire(nextlock, LW_EXCLUSIVE);
+			lock = nextlock;
+		}
+
+		/* Create and zero the offsets page that holds nextMXact */
+		slotno = ZeroMultiXactOffsetPage(pageno, false);
+
+		/* Make sure it's written out */
+		SimpleLruWritePage(MultiXactOffsetCtl, slotno);
+		Assert(!MultiXactOffsetCtl->shared->page_dirty[slotno]);
+	}
+
 	LWLockRelease(lock);
 
 	lock = SimpleLruGetBankLock(MultiXactMemberCtl, 0);
@@ -1978,7 +1999,39 @@ BootStrapMultiXact(void)
 	SimpleLruWritePage(MultiXactMemberCtl, slotno);
 	Assert(!MultiXactMemberCtl->shared->page_dirty[slotno]);
 
+	pageno = MXOffsetToMemberPage(MultiXactState->nextOffset);
+	if (pageno != 0)
+	{
+		LWLock *nextlock = SimpleLruGetBankLock(MultiXactMemberCtl, pageno);
+
+		if (nextlock != lock)
+		{
+			LWLockRelease(lock);
+			LWLockAcquire(nextlock, LW_EXCLUSIVE);
+			lock = nextlock;
+		}
+
+		/* Create and zero the members page that holds nextOffset */
+		slotno = ZeroMultiXactMemberPage(pageno, false);
+
+		/* Make sure it's written out */
+		SimpleLruWritePage(MultiXactMemberCtl, slotno);
+		Assert(!MultiXactMemberCtl->shared->page_dirty[slotno]);
+	}
+
 	LWLockRelease(lock);
+
+	/*
+	 * If we're not starting from offset zero, initialize a dummy multixact
+	 * to avoid an overly long loop in PerformMembersTruncation().
+	 */
+	if (MultiXactState->nextOffset > 0 && MultiXactState->nextMXact > 0)
+	{
+		RecordNewMultiXact(FirstMultiXactId,
+						   MultiXactState->nextOffset, 0, NULL);
+		RecordNewMultiXact(MultiXactState->nextMXact,
+						   MultiXactState->nextOffset, 0, NULL);
+	}
 }
 
 /*
diff --git a/src/backend/access/transam/subtrans.c b/src/backend/access/transam/subtrans.c
index 15153618fa..218675fa60 100644
--- a/src/backend/access/transam/subtrans.c
+++ b/src/backend/access/transam/subtrans.c
@@ -270,12 +270,16 @@ void
 BootStrapSUBTRANS(void)
 {
 	int			slotno;
-	LWLock	   *lock = SimpleLruGetBankLock(SubTransCtl, 0);
+	LWLock	   *lock;
+	int64		pageno;
+
+	pageno = TransactionIdToPage(XidFromFullTransactionId(TransamVariables->nextXid));
+	lock = SimpleLruGetBankLock(SubTransCtl, pageno);
 
 	LWLockAcquire(lock, LW_EXCLUSIVE);
 
 	/* Create and zero the first page of the subtrans log */
-	slotno = ZeroSUBTRANSPage(0);
+	slotno = ZeroSUBTRANSPage(pageno);
 
 	/* Make sure it's written out */
 	SimpleLruWritePage(SubTransCtl, slotno);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 799fc739e1..ced9d5cd26 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -136,6 +136,10 @@ int			max_slot_wal_keep_size_mb = -1;
 int			wal_decode_buffer_size = 512 * 1024;
 bool		track_wal_io_timing = false;
 
+TransactionId		start_xid = FirstNormalTransactionId;
+MultiXactId			start_mxid = FirstMultiXactId;
+MultiXactOffset		start_mxoff = 0;
+
 #ifdef WAL_DEBUG
 bool		XLOG_DEBUG = false;
 #endif
@@ -5126,13 +5130,14 @@ BootStrapXLOG(uint32 data_checksum_version)
 	checkPoint.fullPageWrites = fullPageWrites;
 	checkPoint.wal_level = wal_level;
 	checkPoint.nextXid =
-		FullTransactionIdFromEpochAndXid(0, FirstNormalTransactionId);
+		FullTransactionIdFromEpochAndXid(0, Max(FirstNormalTransactionId,
+												start_xid));
 	checkPoint.nextOid = FirstGenbkiObjectId;
-	checkPoint.nextMulti = FirstMultiXactId;
-	checkPoint.nextMultiOffset = 0;
-	checkPoint.oldestXid = FirstNormalTransactionId;
+	checkPoint.nextMulti = Max(FirstMultiXactId, start_mxid);
+	checkPoint.nextMultiOffset = start_mxoff;
+	checkPoint.oldestXid = XidFromFullTransactionId(checkPoint.nextXid);
 	checkPoint.oldestXidDB = Template1DbOid;
-	checkPoint.oldestMulti = FirstMultiXactId;
+	checkPoint.oldestMulti = checkPoint.nextMulti;
 	checkPoint.oldestMultiDB = Template1DbOid;
 	checkPoint.oldestCommitTsXid = InvalidTransactionId;
 	checkPoint.newestCommitTsXid = InvalidTransactionId;
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 6db864892d..458dc3eb29 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -218,7 +218,7 @@ BootstrapModeMain(int argc, char *argv[], bool check_only)
 	argv++;
 	argc--;
 
-	while ((flag = getopt(argc, argv, "B:c:d:D:Fkr:X:-:")) != -1)
+	while ((flag = getopt(argc, argv, "B:c:d:D:Fkm:o:r:X:x:-:")) != -1)
 	{
 		switch (flag)
 		{
@@ -286,12 +286,60 @@ BootstrapModeMain(int argc, char *argv[], bool check_only)
 			case 'k':
 				bootstrap_data_checksum_version = PG_DATA_CHECKSUM_VERSION;
 				break;
+			case 'm':
+				{
+					char	   *endptr;
+
+					errno = 0;
+					start_mxid = strtou64(optarg, &endptr, 0);
+
+					if (endptr == optarg || *endptr != '\0' || errno != 0 ||
+						!StartMultiXactIdIsValid(start_mxid))
+					{
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("invalid initial database cluster multixact id")));
+					}
+				}
+				break;
+			case 'o':
+				{
+					char	   *endptr;
+
+					errno = 0;
+					start_mxoff = strtou64(optarg, &endptr, 0);
+
+					if (endptr == optarg || *endptr != '\0' || errno != 0 ||
+						!StartMultiXactOffsetIsValid(start_mxoff))
+					{
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("invalid initial database cluster multixact offset")));
+					}
+				}
+				break;
 			case 'r':
 				strlcpy(OutputFileName, optarg, MAXPGPATH);
 				break;
 			case 'X':
 				SetConfigOption("wal_segment_size", optarg, PGC_INTERNAL, PGC_S_DYNAMIC_DEFAULT);
 				break;
+			case 'x':
+				{
+					char	   *endptr;
+
+					errno = 0;
+					start_xid = strtou64(optarg, &endptr, 0);
+
+					if (endptr == optarg || *endptr != '\0' || errno != 0 ||
+						!StartTransactionIdIsValid(start_xid))
+					{
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("invalid initial database cluster xid value")));
+					}
+				}
+				break;
 			default:
 				write_stderr("Try \"%s --help\" for more information.\n",
 							 progname);
diff --git a/src/backend/main/main.c b/src/backend/main/main.c
index e8effe5024..ff252dffbd 100644
--- a/src/backend/main/main.c
+++ b/src/backend/main/main.c
@@ -426,12 +426,18 @@ help(const char *progname)
 	printf(_("  -E                 echo statement before execution\n"));
 	printf(_("  -j                 do not use newline as interactive query delimiter\n"));
 	printf(_("  -r FILENAME        send stdout and stderr to given file\n"));
+	printf(_("  -m START_MXID      set initial database cluster multixact id\n"));
+	printf(_("  -o START_MXOFF     set initial database cluster multixact offset\n"));
+	printf(_("  -x START_XID       set initial database cluster xid\n"));
 
 	printf(_("\nOptions for bootstrapping mode:\n"));
 	printf(_("  --boot             selects bootstrapping mode (must be first argument)\n"));
 	printf(_("  --check            selects check mode (must be first argument)\n"));
 	printf(_("  DBNAME             database name (mandatory argument in bootstrapping mode)\n"));
 	printf(_("  -r FILENAME        send stdout and stderr to given file\n"));
+	printf(_("  -m START_MXID      set initial database cluster multixact id\n"));
+	printf(_("  -o START_MXOFF     set initial database cluster multixact offset\n"));
+	printf(_("  -x START_XID       set initial database cluster xid\n"));
 
 	printf(_("\nPlease read the documentation for the complete list of run-time\n"
 			 "configuration settings and how to set them on the command line or in\n"
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 5dd3b6a4fd..d166c26f4c 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -585,7 +585,7 @@ PostmasterMain(int argc, char *argv[])
 	 * tcop/postgres.c (the option sets should not conflict) and with the
 	 * common help() function in main/main.c.
 	 */
-	while ((opt = getopt(argc, argv, "B:bC:c:D:d:EeFf:h:ijk:lN:OPp:r:S:sTt:W:-:")) != -1)
+	while ((opt = getopt(argc, argv, "B:bC:c:D:d:EeFf:h:ijk:lm:N:Oo:Pp:r:S:sTt:W:x:-:")) != -1)
 	{
 		switch (opt)
 		{
@@ -695,10 +695,18 @@ PostmasterMain(int argc, char *argv[])
 				SetConfigOption("max_connections", optarg, PGC_POSTMASTER, PGC_S_ARGV);
 				break;
 
+			case 'm':
+				/* only used by single-user backend */
+				break;
+
 			case 'O':
 				SetConfigOption("allow_system_table_mods", "true", PGC_POSTMASTER, PGC_S_ARGV);
 				break;
 
+			case 'o':
+				/* only used by single-user backend */
+				break;
+
 			case 'P':
 				SetConfigOption("ignore_system_indexes", "true", PGC_POSTMASTER, PGC_S_ARGV);
 				break;
@@ -749,6 +757,10 @@ PostmasterMain(int argc, char *argv[])
 				SetConfigOption("post_auth_delay", optarg, PGC_POSTMASTER, PGC_S_ARGV);
 				break;
 
+			case 'x':
+				/* only used by single-user backend */
+				break;
+
 			default:
 				write_stderr("Try \"%s --help\" for more information.\n",
 							 progname);
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index f2f75aa0f8..9b8326ed6a 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3790,7 +3790,7 @@ process_postgres_switches(int argc, char *argv[], GucContext ctx,
 	 * postmaster/postmaster.c (the option sets should not conflict) and with
 	 * the common help() function in main/main.c.
 	 */
-	while ((flag = getopt(argc, argv, "B:bC:c:D:d:EeFf:h:ijk:lN:nOPp:r:S:sTt:v:W:-:")) != -1)
+	while ((flag = getopt(argc, argv, "B:bC:c:D:d:EeFf:h:ijk:lm:N:nOo:Pp:r:S:sTt:v:W:x:-:")) != -1)
 	{
 		switch (flag)
 		{
@@ -3895,6 +3895,23 @@ process_postgres_switches(int argc, char *argv[], GucContext ctx,
 				SetConfigOption("ssl", "true", ctx, gucsource);
 				break;
 
+			case 'm':
+				{
+					char	   *endptr;
+
+					errno = 0;
+					start_mxid = strtou64(optarg, &endptr, 0);
+
+					if (endptr == optarg || *endptr != '\0' || errno != 0 ||
+						!StartMultiXactIdIsValid(start_mxid))
+					{
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("invalid initial database cluster multixact id")));
+					}
+				}
+				break;
+
 			case 'N':
 				SetConfigOption("max_connections", optarg, ctx, gucsource);
 				break;
@@ -3907,6 +3924,23 @@ process_postgres_switches(int argc, char *argv[], GucContext ctx,
 				SetConfigOption("allow_system_table_mods", "true", ctx, gucsource);
 				break;
 
+			case 'o':
+				{
+					char	   *endptr;
+
+					errno = 0;
+					start_mxoff = strtou64(optarg, &endptr, 0);
+
+					if (endptr == optarg || *endptr != '\0' || errno != 0 ||
+						!StartMultiXactOffsetIsValid(start_mxoff))
+					{
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("invalid initial database cluster multixact offset")));
+					}
+				}
+				break;
+
 			case 'P':
 				SetConfigOption("ignore_system_indexes", "true", ctx, gucsource);
 				break;
@@ -3961,6 +3995,23 @@ process_postgres_switches(int argc, char *argv[], GucContext ctx,
 				SetConfigOption("post_auth_delay", optarg, ctx, gucsource);
 				break;
 
+			case 'x':
+				{
+					char	   *endptr;
+
+					errno = 0;
+					start_xid = strtou64(optarg, &endptr, 0);
+
+					if (endptr == optarg || *endptr != '\0' || errno != 0 ||
+						!StartTransactionIdIsValid(start_xid))
+					{
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("invalid initial database cluster xid")));
+					}
+				}
+				break;
+
 			default:
 				errs++;
 				break;
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 21a0fe3ecd..04d56cca4f 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -168,6 +168,9 @@ static bool data_checksums = true;
 static char *xlog_dir = NULL;
 static int	wal_segment_size_mb = (DEFAULT_XLOG_SEG_SIZE) / (1024 * 1024);
 static DataDirSyncMethod sync_method = DATA_DIR_SYNC_METHOD_FSYNC;
+static TransactionId start_xid = 0;
+static MultiXactId start_mxid = 0;
+static MultiXactOffset start_mxoff = 0;
 
 
 /* internal vars */
@@ -1598,6 +1601,11 @@ bootstrap_template1(void)
 	bki_lines = replace_token(bki_lines, "POSTGRES",
 							  escape_quotes_bki(username));
 
+	/* relfrozenxid must not be less than FirstNormalTransactionId */
+	sprintf(buf, "%llu", (unsigned long long) Max(start_xid, 3));
+	bki_lines = replace_token(bki_lines, "RECENTXMIN",
+							  buf);
+
 	bki_lines = replace_token(bki_lines, "ENCODING",
 							  encodingid_to_string(encodingid));
 
@@ -1623,6 +1631,9 @@ bootstrap_template1(void)
 
 	printfPQExpBuffer(&cmd, "\"%s\" --boot %s %s", backend_exec, boot_options, extra_options);
 	appendPQExpBuffer(&cmd, " -X %d", wal_segment_size_mb * (1024 * 1024));
+	appendPQExpBuffer(&cmd, " -m %llu", (unsigned long long) start_mxid);
+	appendPQExpBuffer(&cmd, " -o %llu", (unsigned long long) start_mxoff);
+	appendPQExpBuffer(&cmd, " -x %llu", (unsigned long long) start_xid);
 	if (data_checksums)
 		appendPQExpBuffer(&cmd, " -k");
 	if (debug)
@@ -2564,12 +2575,20 @@ usage(const char *progname)
 	printf(_("  -d, --debug               generate lots of debugging output\n"));
 	printf(_("      --discard-caches      set debug_discard_caches=1\n"));
 	printf(_("  -L DIRECTORY              where to find the input files\n"));
+	printf(_("  -m, --multixact-id=START_MXID\n"
+			 "                            set initial database cluster multixact id\n"
+			 "                            max value is 2^32-1\n"));
 	printf(_("  -n, --no-clean            do not clean up after errors\n"));
 	printf(_("  -N, --no-sync             do not wait for changes to be written safely to disk\n"));
 	printf(_("      --no-instructions     do not print instructions for next steps\n"));
+	printf(_("  -o, --multixact-offset=START_MXOFF\n"
+			 "                            set initial database cluster multixact offset\n"
+			 "                            max value is 2^32-1\n"));
 	printf(_("  -s, --show                show internal settings, then exit\n"));
 	printf(_("      --sync-method=METHOD  set method for syncing files to disk\n"));
 	printf(_("  -S, --sync-only           only sync database files to disk, then exit\n"));
+	printf(_("  -x, --xid=START_XID       set initial database cluster xid\n"
+			 "                            max value is 2^32-1\n"));
 	printf(_("\nOther options:\n"));
 	printf(_("  -V, --version             output version information, then exit\n"));
 	printf(_("  -?, --help                show this help, then exit\n"));
@@ -3104,6 +3123,18 @@ initialize_data_directory(void)
 	/* Now create all the text config files */
 	setup_config();
 
+	if (start_mxid != 0)
+		printf(_("selecting initial multixact id ... %llu\n"),
+				 (unsigned long long) start_mxid);
+
+	if (start_mxoff != 0)
+		printf(_("selecting initial multixact offset ... %llu\n"),
+				 (unsigned long long) start_mxoff);
+
+	if (start_xid != 0)
+		printf(_("selecting initial xid ... %llu\n"),
+				 (unsigned long long) start_xid);
+
 	/* Bootstrap template1 */
 	bootstrap_template1();
 
@@ -3120,8 +3151,12 @@ initialize_data_directory(void)
 	fflush(stdout);
 
 	initPQExpBuffer(&cmd);
-	printfPQExpBuffer(&cmd, "\"%s\" %s %s template1 >%s",
-					  backend_exec, backend_options, extra_options, DEVNULL);
+	printfPQExpBuffer(&cmd, "\"%s\" %s %s",
+					  backend_exec, backend_options, extra_options);
+	appendPQExpBuffer(&cmd, " -m %llu", (unsigned long long) start_mxid);
+	appendPQExpBuffer(&cmd, " -o %llu", (unsigned long long) start_mxoff);
+	appendPQExpBuffer(&cmd, " -x %llu", (unsigned long long) start_xid);
+	appendPQExpBuffer(&cmd, " template1 >%s", DEVNULL);
 
 	PG_CMD_OPEN(cmd.data);
 
@@ -3208,6 +3243,9 @@ main(int argc, char *argv[])
 		{"icu-rules", required_argument, NULL, 18},
 		{"sync-method", required_argument, NULL, 19},
 		{"no-data-checksums", no_argument, NULL, 20},
+		{"xid", required_argument, NULL, 'x'},
+		{"multixact-id", required_argument, NULL, 'm'},
+		{"multixact-offset", required_argument, NULL, 'o'},
 		{NULL, 0, NULL, 0}
 	};
 
@@ -3249,7 +3287,7 @@ main(int argc, char *argv[])
 
 	/* process command-line options */
 
-	while ((c = getopt_long(argc, argv, "A:c:dD:E:gkL:nNsST:U:WX:",
+	while ((c = getopt_long(argc, argv, "A:c:dD:E:gkL:m:nNo:sST:U:Wx:X:",
 							long_options, &option_index)) != -1)
 	{
 		switch (c)
@@ -3307,6 +3345,30 @@ main(int argc, char *argv[])
 				debug = true;
 				printf(_("Running in debug mode.\n"));
 				break;
+			case 'm':
+				{
+					char	   *endptr;
+
+					errno = 0;
+					start_mxid = strtou64(optarg, &endptr, 0);
+
+					if (endptr == optarg || *endptr != '\0' || errno != 0 ||
+						!StartMultiXactIdIsValid(start_mxid))
+					{
+						pg_log_error("invalid initial database cluster multixact id");
+						exit(1);
+					}
+					else if (start_mxid < 1) /* FirstMultiXactId */
+					{
+						/*
+						 * Reject the value rather than silently raising it to
+						 * FirstMultiXactId, although that would do no harm.
+						 */
+						pg_log_error("multixact id should be greater than 0");
+						exit(1);
+					}
+				}
+				break;
 			case 'n':
 				noclean = true;
 				printf(_("Running in no-clean mode.  Mistakes will not be cleaned up.\n"));
@@ -3314,6 +3376,21 @@ main(int argc, char *argv[])
 			case 'N':
 				do_sync = false;
 				break;
+			case 'o':
+				{
+					char	   *endptr;
+
+					errno = 0;
+					start_mxoff = strtou64(optarg, &endptr, 0);
+
+					if (endptr == optarg || *endptr != '\0' || errno != 0 ||
+						!StartMultiXactOffsetIsValid(start_mxoff))
+					{
+						pg_log_error("invalid initial database cluster multixact offset");
+						exit(1);
+					}
+				}
+				break;
 			case 'S':
 				sync_only = true;
 				break;
@@ -3402,6 +3479,30 @@ main(int argc, char *argv[])
 			case 20:
 				data_checksums = false;
 				break;
+			case 'x':
+				{
+					char	   *endptr;
+
+					errno = 0;
+					start_xid = strtou64(optarg, &endptr, 0);
+
+					if (endptr == optarg || *endptr != '\0' || errno != 0 ||
+						!StartTransactionIdIsValid(start_xid))
+					{
+						pg_log_error("invalid value for initial database cluster xid");
+						exit(1);
+					}
+					else if (start_xid < 3) /* FirstNormalTransactionId */
+					{
+						/*
+						 * Reject the value rather than silently raising it to
+						 * FirstNormalTransactionId, although that would do no harm.
+						 */
+						pg_log_error("xid should be greater than 2");
+						exit(1);
+					}
+				}
+				break;
 			default:
 				/* getopt_long already emitted a complaint */
 				pg_log_error_hint("Try \"%s --help\" for more information.", progname);
diff --git a/src/bin/initdb/t/001_initdb.pl b/src/bin/initdb/t/001_initdb.pl
index 01cc4a1602..8b017eb907 100644
--- a/src/bin/initdb/t/001_initdb.pl
+++ b/src/bin/initdb/t/001_initdb.pl
@@ -329,4 +329,64 @@ command_fails(
 	[ 'pg_checksums', '--pgdata' => $datadir_nochecksums ],
 	"pg_checksums fails with data checksum disabled");
 
+# Set non-standard initial mxid/mxoff/xid.
+command_fails_like(
+	[ 'initdb', '-m', 'seven', $datadir ],
+	qr/initdb: error: invalid initial database cluster multixact id/,
+	'fails for invalid initial database cluster multixact id');
+command_fails_like(
+	[ 'initdb', '-o', 'seven', $datadir ],
+	qr/initdb: error: invalid initial database cluster multixact offset/,
+	'fails for invalid initial database cluster multixact offset');
+command_fails_like(
+	[ 'initdb', '-x', 'seven', $datadir ],
+	qr/initdb: error: invalid value for initial database cluster xid/,
+	'fails for invalid initial database cluster xid');
+
+command_checks_all(
+	[ 'initdb', '-m', '65535', "$tempdir/data-m65535" ],
+	0,
+	[qr/selecting initial multixact id ... 65535/],
+	[],
+	'selecting initial multixact id');
+command_checks_all(
+	[ 'initdb', '-o', '65535', "$tempdir/data-o65535" ],
+	0,
+	[qr/selecting initial multixact offset ... 65535/],
+	[],
+	'selecting initial multixact offset');
+command_checks_all(
+	[ 'initdb', '-x', '65535', "$tempdir/data-x65535" ],
+	0,
+	[qr/selecting initial xid ... 65535/],
+	[],
+	'selecting initial xid');
+
+# Setup new cluster with given mxid/mxoff/xid.
+my $node;
+my $result;
+
+$node = PostgreSQL::Test::Cluster->new('test-mxid');
+$node->init(extra => ['-m', '16777215']); # 0xFFFFFF
+$node->start;
+$result = $node->safe_psql('postgres', "SELECT next_multixact_id FROM pg_control_checkpoint();");
+ok($result >= 16777215, 'setup cluster with given mxid');
+$node->stop;
+
+$node = PostgreSQL::Test::Cluster->new('test-mxoff');
+$node->init(extra => ['-o', '16777215']); # 0xFFFFFF
+$node->start;
+$result = $node->safe_psql('postgres', "SELECT next_multi_offset FROM pg_control_checkpoint();");
+ok($result >= 16777215, 'setup cluster with given mxoff');
+$node->stop;
+
+$node = PostgreSQL::Test::Cluster->new('test-xid');
+$node->init(extra => ['-x', '16777215']); # 0xFFFFFF
+$node->start;
+$result = $node->safe_psql('postgres', "SELECT txid_current();");
+ok($result >= 16777215, 'setup cluster with given xid - check 1');
+$result = $node->safe_psql('postgres', "SELECT oldest_xid FROM pg_control_checkpoint();");
+ok($result >= 16777215, 'setup cluster with given xid - check 2');
+$node->stop;
+
 done_testing();
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index d313099c02..23b8dd0375 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -94,6 +94,9 @@ typedef enum RecoveryState
 } RecoveryState;
 
 extern PGDLLIMPORT int wal_level;
+extern PGDLLIMPORT TransactionId start_xid;
+extern PGDLLIMPORT MultiXactId start_mxid;
+extern PGDLLIMPORT MultiXactOffset start_mxoff;
 
 /* Is WAL archiving enabled (always or only while server is running normally)? */
 #define XLogArchivingActive() \
diff --git a/src/include/c.h b/src/include/c.h
index 318194f78d..4f2b5432e5 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -622,6 +622,10 @@ typedef uint64 MultiXactOffset;
 
 typedef uint32 CommandId;
 
+#define StartTransactionIdIsValid(xid)			((xid) <= 0xFFFFFFFF)
+#define StartMultiXactIdIsValid(mxid)			((mxid) <= 0xFFFFFFFF)
+#define StartMultiXactOffsetIsValid(offset)		((offset) <= 0xFFFFFFFF)
+
 #define FirstCommandId	((CommandId) 0)
 #define InvalidCommandId	(~(CommandId)0)
 
diff --git a/src/include/catalog/pg_class.h b/src/include/catalog/pg_class.h
index fa96ba07bf..44c2f9cdf9 100644
--- a/src/include/catalog/pg_class.h
+++ b/src/include/catalog/pg_class.h
@@ -126,7 +126,7 @@ CATALOG(pg_class,1259,RelationRelationId) BKI_BOOTSTRAP BKI_ROWTYPE_OID(83,Relat
 	Oid			relrewrite BKI_DEFAULT(0) BKI_LOOKUP_OPT(pg_class);
 
 	/* all Xids < this are frozen in this rel */
-	TransactionId relfrozenxid BKI_DEFAULT(3);	/* FirstNormalTransactionId */
+	TransactionId relfrozenxid BKI_DEFAULT(RECENTXMIN);	/* FirstNormalTransactionId */
 
 	/* all multixacts in this rel are >= this; it is really a MultiXactId */
 	TransactionId relminmxid BKI_DEFAULT(1);	/* FirstMultiXactId */
-- 
2.43.0
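
A note for readers of the initdb changes above: the Start*IsValid macros added to c.h accept any value up to 0xFFFFFFFF. Below is a minimal sketch of how such an option value could be parsed and range-checked. parse_start_value is a hypothetical helper that does not appear in the patch, and the use of strtoull is an assumption modelled on the pg_resetwal option parsing elsewhere in this series.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper (not from the patch): parse a start xid/mxid/mxoff
 * argument and accept it only if it fits in 32 bits, which is the same
 * bound the Start*IsValid macros check.
 */
bool
parse_start_value(const char *arg, uint64_t *value)
{
    char       *endptr;
    unsigned long long parsed;

    assert(arg != NULL && value != NULL);

    errno = 0;
    parsed = strtoull(arg, &endptr, 0);     /* base 0: decimal, hex or octal */

    /* Reject empty input, trailing garbage and out-of-range conversions. */
    if (endptr == arg || *endptr != '\0' || errno != 0)
        return false;

    /* Same bound as StartTransactionIdIsValid() and friends. */
    if (parsed > UINT64_C(0xFFFFFFFF))
        return false;

    *value = (uint64_t) parsed;
    return true;
}
```

With this sketch, "65535" and "0xFFFFFF" are accepted, while "4294967296" is rejected because it no longer fits in 32 bits.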

Attachment: v14-0001-Use-64-bit-format-output-for-multixact-offsets.patch (application/octet-stream)
From fecc0c940df767c78d3bb24640a421dde707d7da Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Wed, 7 Aug 2024 16:35:22 +0300
Subject: [PATCH v14 1/7] Use 64-bit format output for multixact offsets

Author: Maxim Orlov <orlovmg@gmail.com>
---
 src/backend/access/rmgrdesc/mxactdesc.c   |  9 ++++----
 src/backend/access/rmgrdesc/xlogdesc.c    |  4 ++--
 src/backend/access/transam/multixact.c    | 26 +++++++++++++----------
 src/backend/access/transam/xlogrecovery.c |  5 +++--
 src/bin/pg_controldata/pg_controldata.c   |  4 ++--
 src/bin/pg_resetwal/pg_resetwal.c         |  8 +++----
 6 files changed, 31 insertions(+), 25 deletions(-)

diff --git a/src/backend/access/rmgrdesc/mxactdesc.c b/src/backend/access/rmgrdesc/mxactdesc.c
index 8bd3d5b63c..b792e9d939 100644
--- a/src/backend/access/rmgrdesc/mxactdesc.c
+++ b/src/backend/access/rmgrdesc/mxactdesc.c
@@ -65,8 +65,8 @@ multixact_desc(StringInfo buf, XLogReaderState *record)
 		xl_multixact_create *xlrec = (xl_multixact_create *) rec;
 		int			i;
 
-		appendStringInfo(buf, "%u offset %u nmembers %d: ", xlrec->mid,
-						 xlrec->moff, xlrec->nmembers);
+		appendStringInfo(buf, "%u offset %llu nmembers %d: ", xlrec->mid,
+						 (unsigned long long) xlrec->moff, xlrec->nmembers);
 		for (i = 0; i < xlrec->nmembers; i++)
 			out_member(buf, &xlrec->members[i]);
 	}
@@ -74,9 +74,10 @@ multixact_desc(StringInfo buf, XLogReaderState *record)
 	{
 		xl_multixact_truncate *xlrec = (xl_multixact_truncate *) rec;
 
-		appendStringInfo(buf, "offsets [%u, %u), members [%u, %u)",
+		appendStringInfo(buf, "offsets [%u, %u), members [%llu, %llu)",
 						 xlrec->startTruncOff, xlrec->endTruncOff,
-						 xlrec->startTruncMemb, xlrec->endTruncMemb);
+						 (unsigned long long) xlrec->startTruncMemb,
+						 (unsigned long long) xlrec->endTruncMemb);
 	}
 }
 
diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c
index 58040f2865..e52a5625a8 100644
--- a/src/backend/access/rmgrdesc/xlogdesc.c
+++ b/src/backend/access/rmgrdesc/xlogdesc.c
@@ -66,7 +66,7 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 		CheckPoint *checkpoint = (CheckPoint *) rec;
 
 		appendStringInfo(buf, "redo %X/%X; "
-						 "tli %u; prev tli %u; fpw %s; wal_level %s; xid %u:%u; oid %u; multi %u; offset %u; "
+						 "tli %u; prev tli %u; fpw %s; wal_level %s; xid %u:%u; oid %u; multi %u; offset %llu; "
 						 "oldest xid %u in DB %u; oldest multi %u in DB %u; "
 						 "oldest/newest commit timestamp xid: %u/%u; "
 						 "oldest running xid %u; %s",
@@ -79,7 +79,7 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 						 XidFromFullTransactionId(checkpoint->nextXid),
 						 checkpoint->nextOid,
 						 checkpoint->nextMulti,
-						 checkpoint->nextMultiOffset,
+						 (unsigned long long) checkpoint->nextMultiOffset,
 						 checkpoint->oldestXid,
 						 checkpoint->oldestXidDB,
 						 checkpoint->oldestMulti,
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index c1e2c42e1b..83b0956dbc 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1264,7 +1264,8 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 
 	LWLockRelease(MultiXactGenLock);
 
-	debug_elog4(DEBUG2, "GetNew: returning %u offset %u", result, *offset);
+	debug_elog4(DEBUG2, "GetNew: returning %u offset %llu", result,
+				(unsigned long long) *offset);
 	return result;
 }
 
@@ -2293,8 +2294,9 @@ MultiXactGetCheckptMulti(bool is_shutdown,
 	LWLockRelease(MultiXactGenLock);
 
 	debug_elog6(DEBUG2,
-				"MultiXact: checkpoint is nextMulti %u, nextOffset %u, oldestMulti %u in DB %u",
-				*nextMulti, *nextMultiOffset, *oldestMulti, *oldestMultiDB);
+				"MultiXact: checkpoint is nextMulti %u, nextOffset %llu, oldestMulti %u in DB %u",
+				*nextMulti, (unsigned long long) *nextMultiOffset, *oldestMulti,
+				*oldestMultiDB);
 }
 
 /*
@@ -2328,8 +2330,8 @@ void
 MultiXactSetNextMXact(MultiXactId nextMulti,
 					  MultiXactOffset nextMultiOffset)
 {
-	debug_elog4(DEBUG2, "MultiXact: setting next multi to %u offset %u",
-				nextMulti, nextMultiOffset);
+	debug_elog4(DEBUG2, "MultiXact: setting next multi to %u offset %llu",
+				nextMulti, (unsigned long long) nextMultiOffset);
 	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
 	MultiXactState->nextMXact = nextMulti;
 	MultiXactState->nextOffset = nextMultiOffset;
@@ -2519,8 +2521,8 @@ MultiXactAdvanceNextMXact(MultiXactId minMulti,
 	}
 	if (MultiXactOffsetPrecedes(MultiXactState->nextOffset, minMultiOffset))
 	{
-		debug_elog3(DEBUG2, "MultiXact: setting next offset to %u",
-					minMultiOffset);
+		debug_elog3(DEBUG2, "MultiXact: setting next offset to %llu",
+					(unsigned long long) minMultiOffset);
 		MultiXactState->nextOffset = minMultiOffset;
 	}
 	LWLockRelease(MultiXactGenLock);
@@ -3211,11 +3213,12 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 
 	elog(DEBUG1, "performing multixact truncation: "
 		 "offsets [%u, %u), offsets segments [%llx, %llx), "
-		 "members [%u, %u), members segments [%llx, %llx)",
+		 "members [%llu, %llu), members segments [%llx, %llx)",
 		 oldestMulti, newOldestMulti,
 		 (unsigned long long) MultiXactIdToOffsetSegment(oldestMulti),
 		 (unsigned long long) MultiXactIdToOffsetSegment(newOldestMulti),
-		 oldestOffset, newOldestOffset,
+		 (unsigned long long) oldestOffset,
+		 (unsigned long long) newOldestOffset,
 		 (unsigned long long) MXOffsetToMemberSegment(oldestOffset),
 		 (unsigned long long) MXOffsetToMemberSegment(newOldestOffset));
 
@@ -3471,11 +3474,12 @@ multixact_redo(XLogReaderState *record)
 
 		elog(DEBUG1, "replaying multixact truncation: "
 			 "offsets [%u, %u), offsets segments [%llx, %llx), "
-			 "members [%u, %u), members segments [%llx, %llx)",
+			 "members [%llu, %llu), members segments [%llx, %llx)",
 			 xlrec.startTruncOff, xlrec.endTruncOff,
 			 (unsigned long long) MultiXactIdToOffsetSegment(xlrec.startTruncOff),
 			 (unsigned long long) MultiXactIdToOffsetSegment(xlrec.endTruncOff),
-			 xlrec.startTruncMemb, xlrec.endTruncMemb,
+			 (unsigned long long) xlrec.startTruncMemb,
+			 (unsigned long long) xlrec.endTruncMemb,
 			 (unsigned long long) MXOffsetToMemberSegment(xlrec.startTruncMemb),
 			 (unsigned long long) MXOffsetToMemberSegment(xlrec.endTruncMemb));
 
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 52f53fa12e..5820b18af0 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -879,8 +879,9 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 							 U64FromFullTransactionId(checkPoint.nextXid),
 							 checkPoint.nextOid)));
 	ereport(DEBUG1,
-			(errmsg_internal("next MultiXactId: %u; next MultiXactOffset: %u",
-							 checkPoint.nextMulti, checkPoint.nextMultiOffset)));
+			(errmsg_internal("next MultiXactId: %u; next MultiXactOffset: %llu",
+							 checkPoint.nextMulti,
+							 (unsigned long long) checkPoint.nextMultiOffset)));
 	ereport(DEBUG1,
 			(errmsg_internal("oldest unfrozen transaction ID: %u, in database %u",
 							 checkPoint.oldestXid, checkPoint.oldestXidDB)));
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index bea779eef9..edd635cb2d 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -264,8 +264,8 @@ main(int argc, char *argv[])
 		   ControlFile->checkPointCopy.nextOid);
 	printf(_("Latest checkpoint's NextMultiXactId:  %u\n"),
 		   ControlFile->checkPointCopy.nextMulti);
-	printf(_("Latest checkpoint's NextMultiOffset:  %u\n"),
-		   ControlFile->checkPointCopy.nextMultiOffset);
+	printf(_("Latest checkpoint's NextMultiOffset:  %llu\n"),
+		   (unsigned long long) ControlFile->checkPointCopy.nextMultiOffset);
 	printf(_("Latest checkpoint's oldestXID:        %u\n"),
 		   ControlFile->checkPointCopy.oldestXid);
 	printf(_("Latest checkpoint's oldestXID's DB:   %u\n"),
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index 31bc0abff1..ca9f01c10b 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -759,8 +759,8 @@ PrintControlValues(bool guessed)
 		   ControlFile.checkPointCopy.nextOid);
 	printf(_("Latest checkpoint's NextMultiXactId:  %u\n"),
 		   ControlFile.checkPointCopy.nextMulti);
-	printf(_("Latest checkpoint's NextMultiOffset:  %u\n"),
-		   ControlFile.checkPointCopy.nextMultiOffset);
+	printf(_("Latest checkpoint's NextMultiOffset:  %llu\n"),
+		   (unsigned long long) ControlFile.checkPointCopy.nextMultiOffset);
 	printf(_("Latest checkpoint's oldestXID:        %u\n"),
 		   ControlFile.checkPointCopy.oldestXid);
 	printf(_("Latest checkpoint's oldestXID's DB:   %u\n"),
@@ -833,8 +833,8 @@ PrintNewControlValues(void)
 
 	if (set_mxoff != -1)
 	{
-		printf(_("NextMultiOffset:                      %u\n"),
-			   ControlFile.checkPointCopy.nextMultiOffset);
+		printf(_("NextMultiOffset:                      %llu\n"),
+			   (unsigned long long) ControlFile.checkPointCopy.nextMultiOffset);
 	}
 
 	if (set_oid != 0)
-- 
2.43.0
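
The pattern used throughout 0001 is to print offsets with %llu and an explicit cast to unsigned long long, so the format string stays correct whatever the width of MultiXactOffset. A minimal sketch of that pattern follows. format_next_offset is a hypothetical helper for illustration only, and the typedef mirrors the uint64 definition that 0002 puts in c.h.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Mirrors the typedef that patch 0002 introduces in c.h. */
typedef uint64_t MultiXactOffset;

/*
 * Hypothetical helper (not from the patch): format an offset the way the
 * patched messages do.  The cast makes the argument match %llu exactly,
 * regardless of how uint64_t is defined on the platform.
 */
int
format_next_offset(char *buf, size_t buflen, MultiXactOffset offset)
{
    assert(buf != NULL && buflen > 0);

    return snprintf(buf, buflen, "next MultiXactOffset: %llu",
                    (unsigned long long) offset);
}
```

Passing an offset above 0xFFFFFFFF, for example 4294967296, shows the point of the change: a 32-bit %u conversion could not represent that value.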

Attachment: v14-0004-Get-rid-of-MultiXactMemberFreezeThreshold-call.patch (application/octet-stream)
From 75eeee8a3731acad56436ba35c5a2f55c9dcfe9a Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Wed, 23 Oct 2024 18:23:39 +0300
Subject: [PATCH v14 4/7] Get rid of MultiXactMemberFreezeThreshold call.

Since MaxMultiXactOffset is now UINT64_MAX, the MULTIXACT_MEMBER_SAFE_THRESHOLD
and MULTIXACT_MEMBER_DANGER_THRESHOLD values are no longer meaningful. Thus,
MultiXactMemberFreezeThreshold is not needed either.

Instead, switch to a MULTIXACT_MEMBER_AUTOVAC_THRESHOLD (2^32 - 1) members
threshold. It is used to determine whether we need to force autovacuum.

Author: Maxim Orlov <orlovmg@gmail.com>
---
 src/backend/access/transam/multixact.c | 117 +++----------------------
 src/backend/commands/vacuum.c          |   2 +-
 src/backend/postmaster/autovacuum.c    |   4 +-
 src/include/access/multixact.h         |   1 -
 4 files changed, 15 insertions(+), 109 deletions(-)

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 259abc9922..4dae8f4799 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -204,10 +204,14 @@ MXOffsetToMemberOffset(MultiXactOffset offset)
 		member_in_group * sizeof(TransactionId);
 }
 
-/* Multixact members wraparound thresholds. */
-#define MULTIXACT_MEMBER_SAFE_THRESHOLD		(MaxMultiXactOffset / 2)
-#define MULTIXACT_MEMBER_DANGER_THRESHOLD	\
-	(MaxMultiXactOffset - MaxMultiXactOffset / 4)
+/*
+ * Multixact members warning threshold.
+ *
+ * If the difference between nextOffset and oldestOffset exceeds this value,
+ * we trigger autovacuum in order to release disk space and reduce table
+ * bloat if possible.
+ */
+#define MULTIXACT_MEMBER_AUTOVAC_THRESHOLD		UINT64CONST(0xFFFFFFFF)
 
 static inline MultiXactId
 PreviousMultiXactId(MultiXactId multi)
@@ -2616,15 +2620,13 @@ GetOldestMultiXactId(void)
 }
 
 /*
- * Determine how aggressively we need to vacuum in order to prevent member
- * wraparound.
+ * Determine whether we need to vacuum to reclaim multixact members.
  *
  * To do so determine what's the oldest member offset and install the limit
  * info in MultiXactState, where it can be used to prevent overrun of old data
  * in the members SLRU area.
  *
- * The return value is true if emergency autovacuum is required and false
- * otherwise.
+ * The return value is true if autovacuum is required and false otherwise.
  */
 static bool
 SetOffsetVacuumLimit(bool is_startup)
@@ -2712,10 +2714,10 @@ SetOffsetVacuumLimit(bool is_startup)
 	LWLockRelease(MultiXactGenLock);
 
 	/*
-	 * Do we need an emergency autovacuum?	If we're not sure, assume yes.
+	 * Do we need autovacuum?	If we're not sure, assume yes.
 	 */
 	return !oldestOffsetKnown ||
-		(nextOffset - oldestOffset > MULTIXACT_MEMBER_SAFE_THRESHOLD);
+		(nextOffset - oldestOffset > MULTIXACT_MEMBER_AUTOVAC_THRESHOLD);
 }
 
 /*
@@ -2761,101 +2763,6 @@ find_multixact_start(MultiXactId multi, MultiXactOffset *result)
 	return true;
 }
 
-/*
- * Determine how many multixacts, and how many multixact members, currently
- * exist.  Return false if unable to determine.
- */
-static bool
-ReadMultiXactCounts(uint32 *multixacts, MultiXactOffset *members)
-{
-	MultiXactOffset nextOffset;
-	MultiXactOffset oldestOffset;
-	MultiXactId oldestMultiXactId;
-	MultiXactId nextMultiXactId;
-	bool		oldestOffsetKnown;
-
-	LWLockAcquire(MultiXactGenLock, LW_SHARED);
-	nextOffset = MultiXactState->nextOffset;
-	oldestMultiXactId = MultiXactState->oldestMultiXactId;
-	nextMultiXactId = MultiXactState->nextMXact;
-	oldestOffset = MultiXactState->oldestOffset;
-	oldestOffsetKnown = MultiXactState->oldestOffsetKnown;
-	LWLockRelease(MultiXactGenLock);
-
-	if (!oldestOffsetKnown)
-		return false;
-
-	*members = nextOffset - oldestOffset;
-	*multixacts = nextMultiXactId - oldestMultiXactId;
-	return true;
-}
-
-/*
- * Multixact members can be removed once the multixacts that refer to them
- * are older than every datminmxid.  autovacuum_multixact_freeze_max_age and
- * vacuum_multixact_freeze_table_age work together to make sure we never have
- * too many multixacts; we hope that, at least under normal circumstances,
- * this will also be sufficient to keep us from using too many offsets.
- * However, if the average multixact has many members, we might exhaust the
- * members space while still using few enough members that these limits fail
- * to trigger relminmxid advancement by VACUUM.  At that point, we'd have no
- * choice but to start failing multixact-creating operations with an error.
- *
- * To prevent that, if more than a threshold portion of the members space is
- * used, we effectively reduce autovacuum_multixact_freeze_max_age and
- * to a value just less than the number of multixacts in use.  We hope that
- * this will quickly trigger autovacuuming on the table or tables with the
- * oldest relminmxid, thus allowing datminmxid values to advance and removing
- * some members.
- *
- * As the fraction of the member space currently in use grows, we become
- * more aggressive in clamping this value.  That not only causes autovacuum
- * to ramp up, but also makes any manual vacuums the user issues more
- * aggressive.  This happens because vacuum_get_cutoffs() will clamp the
- * freeze table and the minimum freeze age cutoffs based on the effective
- * autovacuum_multixact_freeze_max_age this function returns.  In the worst
- * case, we'll claim the freeze_max_age to zero, and every vacuum of any
- * table will freeze every multixact.
- */
-int
-MultiXactMemberFreezeThreshold(void)
-{
-	MultiXactOffset members;
-	uint32		multixacts;
-	uint32		victim_multixacts;
-	double		fraction;
-	int			result;
-
-	/* If we can't determine member space utilization, assume the worst. */
-	if (!ReadMultiXactCounts(&multixacts, &members))
-		return 0;
-
-	/* If member space utilization is low, no special action is required. */
-	if (members <= MULTIXACT_MEMBER_SAFE_THRESHOLD)
-		return autovacuum_multixact_freeze_max_age;
-
-	/*
-	 * Compute a target for relminmxid advancement.  The number of multixacts
-	 * we try to eliminate from the system is based on how far we are past
-	 * MULTIXACT_MEMBER_SAFE_THRESHOLD.
-	 */
-	fraction = (double) (members - MULTIXACT_MEMBER_SAFE_THRESHOLD);
-	fraction /= (double) (MULTIXACT_MEMBER_DANGER_THRESHOLD - MULTIXACT_MEMBER_SAFE_THRESHOLD);
-
-	victim_multixacts = multixacts * fraction;
-
-	/* fraction could be > 1.0, but lowest possible freeze age is zero */
-	if (victim_multixacts > multixacts)
-		return 0;
-	result = multixacts - victim_multixacts;
-
-	/*
-	 * Clamp to autovacuum_multixact_freeze_max_age, so that we never make
-	 * autovacuum less aggressive than it would otherwise be.
-	 */
-	return Min(result, autovacuum_multixact_freeze_max_age);
-}
-
 typedef struct mxtruncinfo
 {
 	int64		earliestExistingPage;
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index e81c9a8aba..da9c3d2f11 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -1151,7 +1151,7 @@ vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
 	 * normally autovacuum_multixact_freeze_max_age, but may be less if we are
 	 * short of multixact member space.
 	 */
-	effective_multixact_freeze_max_age = MultiXactMemberFreezeThreshold();
+	effective_multixact_freeze_max_age = autovacuum_multixact_freeze_max_age;
 
 	/*
 	 * Almost ready to set freeze output parameters; check if OldestXmin or
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index dfb8d068ec..4165c35eb7 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -1135,7 +1135,7 @@ do_start_worker(void)
 
 	/* Also determine the oldest datminmxid we will consider. */
 	recentMulti = ReadNextMultiXactId();
-	multiForceLimit = recentMulti - MultiXactMemberFreezeThreshold();
+	multiForceLimit = recentMulti - autovacuum_multixact_freeze_max_age;
 	if (multiForceLimit < FirstMultiXactId)
 		multiForceLimit -= FirstMultiXactId;
 
@@ -1923,7 +1923,7 @@ do_autovacuum(void)
 	 * normally autovacuum_multixact_freeze_max_age, but may be less if we are
 	 * short of multixact member space.
 	 */
-	effective_multixact_freeze_max_age = MultiXactMemberFreezeThreshold();
+	effective_multixact_freeze_max_age = autovacuum_multixact_freeze_max_age;
 
 	/*
 	 * Find the pg_database entry and select the default freeze ages. We use
diff --git a/src/include/access/multixact.h b/src/include/access/multixact.h
index b1510532ee..5ee632dfe6 100644
--- a/src/include/access/multixact.h
+++ b/src/include/access/multixact.h
@@ -143,7 +143,6 @@ extern void MultiXactSetNextMXact(MultiXactId nextMulti,
 extern void MultiXactAdvanceNextMXact(MultiXactId minMulti,
 									  MultiXactOffset minMultiOffset);
 extern void MultiXactAdvanceOldest(MultiXactId oldestMulti, Oid oldestMultiDB);
-extern int	MultiXactMemberFreezeThreshold(void);
 
 extern void multixact_twophase_recover(TransactionId xid, uint16 info,
 									   void *recdata, uint32 len);
-- 
2.43.0
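
For reference, the decision that SetOffsetVacuumLimit makes after 0004 can be sketched in isolation as below. members_need_autovacuum and MEMBER_AUTOVAC_THRESHOLD are hypothetical names chosen for this illustration. The threshold value is the one the patch uses, and the rule "if we're not sure, assume yes" is taken from the patched comment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t MultiXactOffset;

static_assert(sizeof(MultiXactOffset) == 8, "sketch assumes 64-bit offsets");

/* Same value as MULTIXACT_MEMBER_AUTOVAC_THRESHOLD in the patch. */
#define MEMBER_AUTOVAC_THRESHOLD UINT64_C(0xFFFFFFFF)

/*
 * Hypothetical stand-alone version of the return expression in
 * SetOffsetVacuumLimit(): request autovacuum when the oldest offset is
 * unknown, or when more members are in use than the threshold allows.
 */
bool
members_need_autovacuum(bool oldest_offset_known,
                        MultiXactOffset next_offset,
                        MultiXactOffset oldest_offset)
{
    if (!oldest_offset_known)
        return true;            /* not sure, so assume yes */

    return (next_offset - oldest_offset) > MEMBER_AUTOVAC_THRESHOLD;
}
```

Note that the comparison is strict: exactly 0xFFFFFFFF members in use does not yet trigger autovacuum, one more does.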

Attachment: v14-0002-Use-64-bit-multixact-offsets.patch (application/octet-stream)
From d6cf2fcb4d5616dc77bba67ad9e06bded8fb1f35 Mon Sep 17 00:00:00 2001
From: Maxim Orlov <m.orlov@postgrespro.ru>
Date: Wed, 6 Mar 2024 11:11:33 +0300
Subject: [PATCH v14 2/7] Use 64-bit multixact offsets.

Author: Maxim Orlov <orlovmg@gmail.com>
---
 src/backend/access/transam/multixact.c | 170 +------------------------
 src/bin/pg_resetwal/pg_resetwal.c      |   2 +-
 src/bin/pg_resetwal/t/001_basic.pl     |   2 +-
 src/include/access/multixact.h         |   2 +-
 src/include/c.h                        |   2 +-
 5 files changed, 10 insertions(+), 168 deletions(-)

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 83b0956dbc..259abc9922 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -96,14 +96,6 @@
 /*
  * Defines for MultiXactOffset page sizes.  A page is the same BLCKSZ as is
  * used everywhere else in Postgres.
- *
- * Note: because MultiXactOffsets are 32 bits and wrap around at 0xFFFFFFFF,
- * MultiXact page numbering also wraps around at
- * 0xFFFFFFFF/MULTIXACT_OFFSETS_PER_PAGE, and segment numbering at
- * 0xFFFFFFFF/MULTIXACT_OFFSETS_PER_PAGE/SLRU_PAGES_PER_SEGMENT.  We need
- * take no explicit notice of that fact in this module, except when comparing
- * segment and page numbers in TruncateMultiXact (see
- * MultiXactOffsetPagePrecedes).
  */
 
 /* We need four bytes per offset */
@@ -272,9 +264,6 @@ typedef struct MultiXactStateData
 	MultiXactId multiStopLimit;
 	MultiXactId multiWrapLimit;
 
-	/* support for members anti-wraparound measures */
-	MultiXactOffset offsetStopLimit;	/* known if oldestOffsetKnown */
-
 	/*
 	 * This is used to sleep until a multixact offset is written when we want
 	 * to create the next one.
@@ -409,8 +398,6 @@ static bool MultiXactOffsetPrecedes(MultiXactOffset offset1,
 									MultiXactOffset offset2);
 static void ExtendMultiXactOffset(MultiXactId multi);
 static void ExtendMultiXactMember(MultiXactOffset offset, int nmembers);
-static bool MultiXactOffsetWouldWrap(MultiXactOffset boundary,
-									 MultiXactOffset start, uint32 distance);
 static bool SetOffsetVacuumLimit(bool is_startup);
 static bool find_multixact_start(MultiXactId multi, MultiXactOffset *result);
 static void WriteMZeroPageXlogRec(int64 pageno, uint8 info);
@@ -1164,78 +1151,6 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 	else
 		*offset = nextOffset;
 
-	/*----------
-	 * Protect against overrun of the members space as well, with the
-	 * following rules:
-	 *
-	 * If we're past offsetStopLimit, refuse to generate more multis.
-	 * If we're close to offsetStopLimit, emit a warning.
-	 *
-	 * Arbitrarily, we start emitting warnings when we're 20 segments or less
-	 * from offsetStopLimit.
-	 *
-	 * Note we haven't updated the shared state yet, so if we fail at this
-	 * point, the multixact ID we grabbed can still be used by the next guy.
-	 *
-	 * Note that there is no point in forcing autovacuum runs here: the
-	 * multixact freeze settings would have to be reduced for that to have any
-	 * effect.
-	 *----------
-	 */
-#define OFFSET_WARN_SEGMENTS	20
-	if (MultiXactState->oldestOffsetKnown &&
-		MultiXactOffsetWouldWrap(MultiXactState->offsetStopLimit, nextOffset,
-								 nmembers))
-	{
-		/* see comment in the corresponding offsets wraparound case */
-		SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER);
-
-		ereport(ERROR,
-				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg("multixact \"members\" limit exceeded"),
-				 errdetail_plural("This command would create a multixact with %u members, but the remaining space is only enough for %u member.",
-								  "This command would create a multixact with %u members, but the remaining space is only enough for %u members.",
-								  MultiXactState->offsetStopLimit - nextOffset - 1,
-								  nmembers,
-								  MultiXactState->offsetStopLimit - nextOffset - 1),
-				 errhint("Execute a database-wide VACUUM in database with OID %u with reduced \"vacuum_multixact_freeze_min_age\" and \"vacuum_multixact_freeze_table_age\" settings.",
-						 MultiXactState->oldestMultiXactDB)));
-	}
-
-	/*
-	 * Check whether we should kick autovacuum into action, to prevent members
-	 * wraparound. NB we use a much larger window to trigger autovacuum than
-	 * just the warning limit. The warning is just a measure of last resort -
-	 * this is in line with GetNewTransactionId's behaviour.
-	 */
-	if (!MultiXactState->oldestOffsetKnown ||
-		(MultiXactState->nextOffset - MultiXactState->oldestOffset
-		 > MULTIXACT_MEMBER_SAFE_THRESHOLD))
-	{
-		/*
-		 * To avoid swamping the postmaster with signals, we issue the autovac
-		 * request only when crossing a segment boundary. With default
-		 * compilation settings that's roughly after 50k members.  This still
-		 * gives plenty of chances before we get into real trouble.
-		 */
-		if ((MXOffsetToMemberPage(nextOffset) / SLRU_PAGES_PER_SEGMENT) !=
-			(MXOffsetToMemberPage(nextOffset + nmembers) / SLRU_PAGES_PER_SEGMENT))
-			SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER);
-	}
-
-	if (MultiXactState->oldestOffsetKnown &&
-		MultiXactOffsetWouldWrap(MultiXactState->offsetStopLimit,
-								 nextOffset,
-								 nmembers + MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT * OFFSET_WARN_SEGMENTS))
-		ereport(WARNING,
-				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg_plural("database with OID %u must be vacuumed before %d more multixact member is used",
-							   "database with OID %u must be vacuumed before %d more multixact members are used",
-							   MultiXactState->offsetStopLimit - nextOffset + nmembers,
-							   MultiXactState->oldestMultiXactDB,
-							   MultiXactState->offsetStopLimit - nextOffset + nmembers),
-				 errhint("Execute a database-wide VACUUM in that database with reduced \"vacuum_multixact_freeze_min_age\" and \"vacuum_multixact_freeze_table_age\" settings.")));
-
 	ExtendMultiXactMember(nextOffset, nmembers);
 
 	/*
@@ -2721,8 +2636,6 @@ SetOffsetVacuumLimit(bool is_startup)
 	MultiXactOffset nextOffset;
 	bool		oldestOffsetKnown = false;
 	bool		prevOldestOffsetKnown;
-	MultiXactOffset offsetStopLimit = 0;
-	MultiXactOffset prevOffsetStopLimit;
 
 	/*
 	 * NB: Have to prevent concurrent truncation, we might otherwise try to
@@ -2737,7 +2650,6 @@ SetOffsetVacuumLimit(bool is_startup)
 	nextOffset = MultiXactState->nextOffset;
 	prevOldestOffsetKnown = MultiXactState->oldestOffsetKnown;
 	prevOldestOffset = MultiXactState->oldestOffset;
-	prevOffsetStopLimit = MultiXactState->offsetStopLimit;
 	Assert(MultiXactState->finishedStartup);
 	LWLockRelease(MultiXactGenLock);
 
@@ -2768,11 +2680,7 @@ SetOffsetVacuumLimit(bool is_startup)
 		oldestOffsetKnown =
 			find_multixact_start(oldestMultiXactId, &oldestOffset);
 
-		if (oldestOffsetKnown)
-			ereport(DEBUG1,
-					(errmsg_internal("oldest MultiXactId member is at offset %u",
-									 oldestOffset)));
-		else
+		if (!oldestOffsetKnown)
 			ereport(LOG,
 					(errmsg("MultiXact member wraparound protections are disabled because oldest checkpointed MultiXact %u does not exist on disk",
 							oldestMultiXactId)));
@@ -2785,24 +2693,7 @@ SetOffsetVacuumLimit(bool is_startup)
 	 * overrun of old data in the members SLRU area. We can only do so if the
 	 * oldest offset is known though.
 	 */
-	if (oldestOffsetKnown)
-	{
-		/* move back to start of the corresponding segment */
-		offsetStopLimit = oldestOffset - (oldestOffset %
-										  (MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT));
-
-		/* always leave one segment before the wraparound point */
-		offsetStopLimit -= (MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT);
-
-		if (!prevOldestOffsetKnown && !is_startup)
-			ereport(LOG,
-					(errmsg("MultiXact member wraparound protections are now enabled")));
-
-		ereport(DEBUG1,
-				(errmsg_internal("MultiXact member stop limit is now %u based on MultiXact %u",
-								 offsetStopLimit, oldestMultiXactId)));
-	}
-	else if (prevOldestOffsetKnown)
+	if (prevOldestOffsetKnown)
 	{
 		/*
 		 * If we failed to get the oldest offset this time, but we have a
@@ -2812,14 +2703,12 @@ SetOffsetVacuumLimit(bool is_startup)
 		 */
 		oldestOffset = prevOldestOffset;
 		oldestOffsetKnown = true;
-		offsetStopLimit = prevOffsetStopLimit;
 	}
 
 	/* Install the computed values */
 	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
 	MultiXactState->oldestOffset = oldestOffset;
 	MultiXactState->oldestOffsetKnown = oldestOffsetKnown;
-	MultiXactState->offsetStopLimit = offsetStopLimit;
 	LWLockRelease(MultiXactGenLock);
 
 	/*
@@ -2829,54 +2718,6 @@ SetOffsetVacuumLimit(bool is_startup)
 		(nextOffset - oldestOffset > MULTIXACT_MEMBER_SAFE_THRESHOLD);
 }
 
-/*
- * Return whether adding "distance" to "start" would move past "boundary".
- *
- * We use this to determine whether the addition is "wrapping around" the
- * boundary point, hence the name.  The reason we don't want to use the regular
- * 2^31-modulo arithmetic here is that we want to be able to use the whole of
- * the 2^32-1 space here, allowing for more multixacts than would fit
- * otherwise.
- */
-static bool
-MultiXactOffsetWouldWrap(MultiXactOffset boundary, MultiXactOffset start,
-						 uint32 distance)
-{
-	MultiXactOffset finish;
-
-	/*
-	 * Note that offset number 0 is not used (see GetMultiXactIdMembers), so
-	 * if the addition wraps around the UINT_MAX boundary, skip that value.
-	 */
-	finish = start + distance;
-	if (finish < start)
-		finish++;
-
-	/*-----------------------------------------------------------------------
-	 * When the boundary is numerically greater than the starting point, any
-	 * value numerically between the two is not wrapped:
-	 *
-	 *	<----S----B---->
-	 *	[---)			 = F wrapped past B (and UINT_MAX)
-	 *		 [---)		 = F not wrapped
-	 *			  [----] = F wrapped past B
-	 *
-	 * When the boundary is numerically less than the starting point (i.e. the
-	 * UINT_MAX wraparound occurs somewhere in between) then all values in
-	 * between are wrapped:
-	 *
-	 *	<----B----S---->
-	 *	[---)			 = F not wrapped past B (but wrapped past UINT_MAX)
-	 *		 [---)		 = F wrapped past B (and UINT_MAX)
-	 *			  [----] = F not wrapped
-	 *-----------------------------------------------------------------------
-	 */
-	if (start < boundary)
-		return finish >= boundary || finish < start;
-	else
-		return finish >= boundary && finish < start;
-}
-
 /*
  * Find the starting offset of the given MultiXactId.
  *
@@ -2998,8 +2839,9 @@ MultiXactMemberFreezeThreshold(void)
 	 * we try to eliminate from the system is based on how far we are past
 	 * MULTIXACT_MEMBER_SAFE_THRESHOLD.
 	 */
-	fraction = (double) (members - MULTIXACT_MEMBER_SAFE_THRESHOLD) /
-		(MULTIXACT_MEMBER_DANGER_THRESHOLD - MULTIXACT_MEMBER_SAFE_THRESHOLD);
+	fraction = (double) (members - MULTIXACT_MEMBER_SAFE_THRESHOLD);
+	fraction /= (double) (MULTIXACT_MEMBER_DANGER_THRESHOLD - MULTIXACT_MEMBER_SAFE_THRESHOLD);
+
 	victim_multixacts = multixacts * fraction;
 
 	/* fraction could be > 1.0, but lowest possible freeze age is zero */
@@ -3345,7 +3187,7 @@ MultiXactIdPrecedesOrEquals(MultiXactId multi1, MultiXactId multi2)
 static bool
 MultiXactOffsetPrecedes(MultiXactOffset offset1, MultiXactOffset offset2)
 {
-	int32		diff = (int32) (offset1 - offset2);
+	int64		diff = (int64) (offset1 - offset2);
 
 	return (diff < 0);
 }
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index ca9f01c10b..e7acf54cf3 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -266,7 +266,7 @@ main(int argc, char *argv[])
 
 			case 'O':
 				errno = 0;
-				set_mxoff = strtoul(optarg, &endptr, 0);
+				set_mxoff = strtou64(optarg, &endptr, 0);
 				if (endptr == optarg || *endptr != '\0' || errno != 0)
 				{
 					pg_log_error("invalid argument for option %s", "-O");
diff --git a/src/bin/pg_resetwal/t/001_basic.pl b/src/bin/pg_resetwal/t/001_basic.pl
index d6bbbd0ced..cc89e0764a 100644
--- a/src/bin/pg_resetwal/t/001_basic.pl
+++ b/src/bin/pg_resetwal/t/001_basic.pl
@@ -213,7 +213,7 @@ push @cmd,
   sprintf("%d,%d", hex($files[0]) == 0 ? 3 : hex($files[0]), hex($files[-1]));
 
 @files = get_slru_files('pg_multixact/offsets');
-$mult = 32 * $blcksz / 4;
+$mult = 32 * $blcksz / 8;
 # --multixact-ids argument is "new,old"
 push @cmd,
   '--multixact-ids' => sprintf("%d,%d",
diff --git a/src/include/access/multixact.h b/src/include/access/multixact.h
index 4e6b0eec2f..b1510532ee 100644
--- a/src/include/access/multixact.h
+++ b/src/include/access/multixact.h
@@ -27,7 +27,7 @@
 
 #define MultiXactIdIsValid(multi) ((multi) != InvalidMultiXactId)
 
-#define MaxMultiXactOffset	((MultiXactOffset) 0xFFFFFFFF)
+#define MaxMultiXactOffset	UINT64CONST(0xFFFFFFFFFFFFFFFF)
 
 /*
  * Possible multixact lock modes ("status").  The first four modes are for
diff --git a/src/include/c.h b/src/include/c.h
index a14c631516..318194f78d 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -618,7 +618,7 @@ typedef uint32 SubTransactionId;
 /* MultiXactId must be equivalent to TransactionId, to fit in t_xmax */
 typedef TransactionId MultiXactId;
 
-typedef uint32 MultiXactOffset;
+typedef uint64 MultiXactOffset;
 
 typedef uint32 CommandId;
 
-- 
2.43.0
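
For readers following the `$mult = 32 * $blcksz / 8` change in 001_basic.pl above: widening MultiXactOffset from 4 to 8 bytes halves the number of offset entries that fit into one pg_multixact/offsets segment, so the same multixact ID lands in a different segment file. A minimal sketch of that arithmetic follows. The function names are made up for illustration and are not part of the patch; 32 pages per segment is assumed to match SLRU_PAGES_PER_SEGMENT.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed to match SLRU_PAGES_PER_SEGMENT from slru.h. */
#define PAGES_PER_SEGMENT 32

/* Number of offset entries stored in one SLRU segment file (hypothetical helper). */
uint64_t
offsets_per_segment(uint32_t blcksz, uint32_t entry_size)
{
	assert(entry_size > 0 && blcksz % entry_size == 0);
	return (uint64_t) PAGES_PER_SEGMENT * (blcksz / entry_size);
}

/* Segment number that holds the offset entry of a given multixact ID. */
uint64_t
offsets_segno(uint32_t multi, uint32_t blcksz, uint32_t entry_size)
{
	return multi / offsets_per_segment(blcksz, entry_size);
}
```

With the default 8 kB block size this gives 65536 entries per segment for 32-bit offsets and 32768 for 64-bit ones, which is why the test's multiplier changes.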

Attachment: v14-0003-Make-pg_upgrade-convert-multixact-offsets.patch (application/octet-stream)
From dafc3ae60a1f4b3f9f141fb307bfde520c77fe32 Mon Sep 17 00:00:00 2001
From: Maxim Orlov <m.orlov@postgrespro.ru>
Date: Tue, 13 Aug 2024 14:44:50 +0300
Subject: [PATCH v14 3/7] Make pg_upgrade convert multixact offsets.

Author: Maxim Orlov <orlovmg@gmail.com>
Author: Yura Sokolov <y.sokolov@postgrespro.ru>
---
 src/bin/pg_upgrade/Makefile     |   1 +
 src/bin/pg_upgrade/meson.build  |   1 +
 src/bin/pg_upgrade/pg_upgrade.c |  42 ++-
 src/bin/pg_upgrade/pg_upgrade.h |  14 +-
 src/bin/pg_upgrade/segresize.c  | 527 ++++++++++++++++++++++++++++++++
 5 files changed, 580 insertions(+), 5 deletions(-)
 create mode 100644 src/bin/pg_upgrade/segresize.c

diff --git a/src/bin/pg_upgrade/Makefile b/src/bin/pg_upgrade/Makefile
index f83d2b5d30..70908d63a3 100644
--- a/src/bin/pg_upgrade/Makefile
+++ b/src/bin/pg_upgrade/Makefile
@@ -21,6 +21,7 @@ OBJS = \
 	info.o \
 	option.o \
 	parallel.o \
+	segresize.o \
 	pg_upgrade.o \
 	relfilenumber.o \
 	server.o \
diff --git a/src/bin/pg_upgrade/meson.build b/src/bin/pg_upgrade/meson.build
index da84344966..a06236f2a3 100644
--- a/src/bin/pg_upgrade/meson.build
+++ b/src/bin/pg_upgrade/meson.build
@@ -10,6 +10,7 @@ pg_upgrade_sources = files(
   'info.c',
   'option.c',
   'parallel.c',
+  'segresize.c',
   'pg_upgrade.c',
   'relfilenumber.c',
   'server.c',
diff --git a/src/bin/pg_upgrade/pg_upgrade.c b/src/bin/pg_upgrade/pg_upgrade.c
index 174cd92084..5cc6283494 100644
--- a/src/bin/pg_upgrade/pg_upgrade.c
+++ b/src/bin/pg_upgrade/pg_upgrade.c
@@ -783,8 +783,42 @@ copy_xact_xlog_xid(void)
 	if (old_cluster.controldata.cat_ver >= MULTIXACT_FORMATCHANGE_CAT_VER &&
 		new_cluster.controldata.cat_ver >= MULTIXACT_FORMATCHANGE_CAT_VER)
 	{
-		copy_subdir_files("pg_multixact/offsets", "pg_multixact/offsets");
-		copy_subdir_files("pg_multixact/members", "pg_multixact/members");
+		/*
+		 * If the old server is before the MULTIXACTOFFSET_FORMATCHANGE_CAT_VER
+		 * it must have 32-bit multixid offsets, thus it should be converted.
+		 */
+		if (old_cluster.controldata.cat_ver < MULTIXACTOFFSET_FORMATCHANGE_CAT_VER &&
+			new_cluster.controldata.cat_ver >= MULTIXACTOFFSET_FORMATCHANGE_CAT_VER)
+		{
+			MultiXactOffset		oldest_offset,
+								next_offset;
+
+			remove_new_subdir("pg_multixact/offsets", false);
+			prep_status("Converting pg_multixact/offsets to 64-bit");
+			oldest_offset = convert_multixact_offsets();
+			check_ok();
+
+			remove_new_subdir("pg_multixact/members", false);
+			prep_status("Converting pg_multixact/members");
+			convert_multixact_members(oldest_offset);
+			check_ok();
+
+			next_offset = old_cluster.controldata.chkpnt_nxtmxoff;
+			if (oldest_offset)
+			{
+				if (next_offset < oldest_offset)
+					next_offset += ((MultiXactOffset) 1 << 32) - 1;
+
+				next_offset -= oldest_offset - 1;
+
+				old_cluster.controldata.chkpnt_nxtmxoff = next_offset;
+			}
+		}
+		else
+		{
+			copy_subdir_files("pg_multixact/offsets", "pg_multixact/offsets");
+			copy_subdir_files("pg_multixact/members", "pg_multixact/members");
+		}
 
 		prep_status("Setting next multixact ID and offset for new cluster");
 
@@ -793,9 +827,9 @@ copy_xact_xlog_xid(void)
 		 * counters here and the oldest multi present on system.
 		 */
 		exec_prog(UTILITY_LOG_FILE, NULL, true, true,
-				  "\"%s/pg_resetwal\" -O %u -m %u,%u \"%s\"",
+				  "\"%s/pg_resetwal\" -O %llu -m %u,%u \"%s\"",
 				  new_cluster.bindir,
-				  old_cluster.controldata.chkpnt_nxtmxoff,
+				  (unsigned long long) old_cluster.controldata.chkpnt_nxtmxoff,
 				  old_cluster.controldata.chkpnt_nxtmulti,
 				  old_cluster.controldata.chkpnt_oldstMulti,
 				  new_cluster.pgdata);
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index f4e375d27c..1adea73bd3 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -114,6 +114,13 @@ extern char *output_files[];
  */
 #define MULTIXACT_FORMATCHANGE_CAT_VER 201301231
 
+/*
+ * Switch from 32-bit to 64-bit for multixid offsets.
+ *
+ * XXX: should be changed to the actual CATALOG_VERSION_NO on commit.
+ */
+#define MULTIXACTOFFSET_FORMATCHANGE_CAT_VER 202409041
+
 /*
  * large object chunk size added to pg_controldata,
  * commit 5f93c37805e7485488480916b4585e098d3cc883
@@ -235,7 +242,7 @@ typedef struct
 	uint32		chkpnt_nxtepoch;
 	uint32		chkpnt_nxtoid;
 	uint32		chkpnt_nxtmulti;
-	uint32		chkpnt_nxtmxoff;
+	uint64		chkpnt_nxtmxoff;
 	uint32		chkpnt_oldstMulti;
 	uint32		chkpnt_oldstxid;
 	uint32		align;
@@ -525,3 +532,8 @@ typedef struct
 	FILE	   *file;
 	char		path[MAXPGPATH];
 } UpgradeTaskReport;
+
+/* segresize.c */
+
+MultiXactOffset		convert_multixact_offsets(void);
+void				convert_multixact_members(MultiXactOffset oldest_offset);
diff --git a/src/bin/pg_upgrade/segresize.c b/src/bin/pg_upgrade/segresize.c
new file mode 100644
index 0000000000..73064c77de
--- /dev/null
+++ b/src/bin/pg_upgrade/segresize.c
@@ -0,0 +1,527 @@
+/*
+ *	segresize.c
+ *
+ *	SLRU segment resize utility
+ *
+ *	Copyright (c) 2024, PostgreSQL Global Development Group
+ *	src/bin/pg_upgrade/segresize.c
+ */
+
+#include "postgres_fe.h"
+
+#include "pg_upgrade.h"
+#include "access/multixact.h"
+
+/* See slru.h */
+#define SLRU_PAGES_PER_SEGMENT		32
+
+/*
+ * Some kind of iterator associated with a particular SLRU segment.  The idea is
+ * to specify the segment and page number and then move through the pages.
+ */
+typedef struct SlruSegState
+{
+	char	   *dir;
+	char	   *fn;
+	FILE	   *file;
+	int64		segno;
+	uint64		pageno;
+	bool		leading_gap;
+} SlruSegState;
+
+/*
+ * Mirrors the SlruFileName from slru.c
+ */
+static inline char *
+SlruFileName(SlruSegState *state)
+{
+	Assert(state->segno >= 0 && state->segno <= INT64CONST(0xFFFFFF));
+	return psprintf("%s/%04X", state->dir, (unsigned int) state->segno);
+}
+
+/*
+ * Create new SLRU segment file.
+ */
+static void
+create_segment(SlruSegState *state)
+{
+	Assert(state->fn == NULL);
+	Assert(state->file == NULL);
+
+	state->fn = SlruFileName(state);
+	state->file = fopen(state->fn, "wb");
+	if (!state->file)
+		pg_fatal("could not create file \"%s\": %m", state->fn);
+}
+
+/*
+ * Open existing SLRU segment file.
+ */
+static void
+open_segment(SlruSegState *state)
+{
+	Assert(state->fn == NULL);
+	Assert(state->file == NULL);
+
+	state->fn = SlruFileName(state);
+	state->file = fopen(state->fn, "rb");
+	if (!state->file)
+		pg_fatal("could not open file \"%s\": %m", state->fn);
+}
+
+/*
+ * Close SLRU segment file.
+ */
+static void
+close_segment(SlruSegState *state)
+{
+	if (state->file)
+	{
+		fclose(state->file);
+		state->file = NULL;
+	}
+
+	if (state->fn)
+	{
+		pfree(state->fn);
+		state->fn = NULL;
+	}
+}
+
+/*
+ * Read next page from the old 32-bit offset segment file.
+ */
+static int
+read_old_segment_page(SlruSegState *state, void *buf, bool *empty)
+{
+	int		len;
+
+	/* Open next segment file, if needed. */
+	if (!state->fn)
+	{
+		if (!state->segno)
+			state->leading_gap = true;
+
+		open_segment(state);
+
+		/* Set position to the needed page. */
+		if (state->pageno > 0 &&
+			fseek(state->file, state->pageno * BLCKSZ, SEEK_SET))
+		{
+			close_segment(state);
+		}
+	}
+
+	if (state->file)
+	{
+		/* Segment file does exist, read page from it. */
+		state->leading_gap = false;
+
+		len = fread(buf, sizeof(char), BLCKSZ, state->file);
+
+		/* Are we done or was there an error? */
+		if (len <= 0)
+		{
+			if (ferror(state->file))
+				pg_fatal("error reading file \"%s\": %m", state->fn);
+
+			if (feof(state->file))
+			{
+				*empty = true;
+				len = -1;
+
+				close_segment(state);
+			}
+		}
+		else
+			*empty = false;
+	}
+	else if (!state->leading_gap)
+	{
+		/* We reached the last segment. */
+		len = -1;
+		*empty = true;
+	}
+	else
+	{
+		/* Skip the first few segments if they were frozen and removed. */
+		len = BLCKSZ;
+		*empty = true;
+	}
+
+	if (++state->pageno >= SLRU_PAGES_PER_SEGMENT)
+	{
+		/* Start a new segment. */
+		state->segno++;
+		state->pageno = 0;
+
+		close_segment(state);
+	}
+
+	return len;
+}
+
+/*
+ * Write next page to the new 64-bit offset segment file.
+ */
+static void
+write_new_segment_page(SlruSegState *state, void *buf)
+{
+	/*
+	 * Create a new segment file if we haven't yet.  Creation is
+	 * postponed until the first non-empty page is found.  This helps
+	 * not to create completely empty segments.
+	 */
+	if (!state->file)
+	{
+		create_segment(state);
+
+		/* Write zeroes to the previously skipped prefix. */
+		if (state->pageno > 0)
+		{
+			char		zerobuf[BLCKSZ] = {0};
+
+			for (int64 i = 0; i < state->pageno; i++)
+			{
+				if (fwrite(zerobuf, sizeof(char), BLCKSZ, state->file) != BLCKSZ)
+					pg_fatal("could not write file \"%s\": %m", state->fn);
+			}
+		}
+	}
+
+	/* Write page to the new segment (if it was created). */
+	if (state->file)
+	{
+		if (fwrite(buf, sizeof(char), BLCKSZ, state->file) != BLCKSZ)
+			pg_fatal("could not write file \"%s\": %m", state->fn);
+	}
+
+	/*
+	 * Did we reach the maximum page number?  Then close segment file
+	 * and create a new one on the next iteration.
+	 */
+	if (++state->pageno >= SLRU_PAGES_PER_SEGMENT)
+	{
+		/* Start a new segment. */
+		state->segno++;
+		state->pageno = 0;
+
+		close_segment(state);
+	}
+}
+
+typedef uint32 MultiXactOffsetOld;
+
+#define MaxMultiXactOffsetOld	((MultiXactOffsetOld) 0xFFFFFFFF)
+
+#define MULTIXACT_OFFSETS_PER_PAGE_OLD (BLCKSZ / sizeof(MultiXactOffsetOld))
+#define MULTIXACT_OFFSETS_PER_PAGE_NEW (BLCKSZ / sizeof(MultiXactOffset))
+
+/*
+ * Convert pg_multixact/offsets segments and return oldest multi offset.
+ */
+MultiXactOffset
+convert_multixact_offsets(void)
+{
+	SlruSegState		oldseg = {0},
+						newseg = {0};
+	MultiXactOffsetOld	oldbuf[MULTIXACT_OFFSETS_PER_PAGE_OLD] = {0};
+	MultiXactOffset		newbuf[MULTIXACT_OFFSETS_PER_PAGE_NEW] = {0},
+						oldest_offset = 0;
+	uint64				oldest_multi = old_cluster.controldata.chkpnt_oldstMulti,
+						next_multi = old_cluster.controldata.chkpnt_nxtmulti,
+						multi,
+						old_entry,
+						new_entry;
+	bool				oldest_offset_known = false;
+
+	oldseg.dir = psprintf("%s/pg_multixact/offsets", old_cluster.pgdata);
+	newseg.dir = psprintf("%s/pg_multixact/offsets", new_cluster.pgdata);
+
+	old_entry = oldest_multi % MULTIXACT_OFFSETS_PER_PAGE_OLD;
+	oldseg.pageno = oldest_multi / MULTIXACT_OFFSETS_PER_PAGE_OLD;
+	oldseg.segno = oldseg.pageno / SLRU_PAGES_PER_SEGMENT;
+	oldseg.pageno %= SLRU_PAGES_PER_SEGMENT;
+
+	new_entry = oldest_multi % MULTIXACT_OFFSETS_PER_PAGE_NEW;
+	newseg.pageno = oldest_multi / MULTIXACT_OFFSETS_PER_PAGE_NEW;
+	newseg.segno = newseg.pageno / SLRU_PAGES_PER_SEGMENT;
+	newseg.pageno %= SLRU_PAGES_PER_SEGMENT;
+
+	if (next_multi < oldest_multi)
+		next_multi += (uint64) 1 << 32;	/* wraparound */
+
+	/* Copy multi offsets reading only needed segment pages */
+	for (multi = oldest_multi; multi < next_multi; old_entry = 0)
+	{
+		int		oldlen;
+		bool	empty;
+
+		/* Handle possible segment wraparound */
+#define OLD_OFFSET_SEGNO_MAX	\
+	(MaxMultiXactId / MULTIXACT_OFFSETS_PER_PAGE_OLD / SLRU_PAGES_PER_SEGMENT)
+		if (oldseg.segno > OLD_OFFSET_SEGNO_MAX)
+		{
+			oldseg.segno = 0;
+			oldseg.pageno = 0;
+		}
+
+		oldlen = read_old_segment_page(&oldseg, oldbuf, &empty);
+		if (empty || oldlen != BLCKSZ)
+			pg_fatal("cannot read page %llu from file \"%s\": %m",
+					 (unsigned long long) oldseg.pageno, oldseg.fn);
+
+		/* Save oldest multi offset */
+		if (!oldest_offset_known)
+		{
+			oldest_offset = oldbuf[old_entry];
+			oldest_offset_known = true;
+		}
+
+		/* Skip wrapped-around invalid MultiXactIds */
+		if (multi == (uint64) 1 << 32)
+		{
+			Assert(oldseg.segno == 0);
+			Assert(oldseg.pageno == 1);
+			Assert(old_entry == 0);
+			Assert(new_entry == 0);
+
+			multi += FirstMultiXactId;
+			old_entry = FirstMultiXactId;
+			new_entry = FirstMultiXactId;
+		}
+
+		/* Copy entries to the new page */
+		for (; multi < next_multi && old_entry < MULTIXACT_OFFSETS_PER_PAGE_OLD;
+			 multi++, old_entry++)
+		{
+			MultiXactOffset offset = oldbuf[old_entry];
+
+			/* Handle possible offset wraparound (1 becomes 2^32) */
+			if (offset < oldest_offset)
+				offset += ((uint64) 1 << 32) - 1;
+
+			/* Subtract oldest_offset, so new offsets will start from 1 */
+			newbuf[new_entry++] = offset - oldest_offset + 1;
+
+			if (new_entry >= MULTIXACT_OFFSETS_PER_PAGE_NEW)
+			{
+				/* Handle possible segment wraparound */
+#define NEW_OFFSET_SEGNO_MAX	\
+	(MaxMultiXactId / MULTIXACT_OFFSETS_PER_PAGE_NEW / SLRU_PAGES_PER_SEGMENT)
+				if (newseg.segno > NEW_OFFSET_SEGNO_MAX)
+				{
+					newseg.segno = 0;
+					newseg.pageno = 0;
+				}
+
+				/* Write new page */
+				write_new_segment_page(&newseg, newbuf);
+				new_entry = 0;
+			}
+		}
+	}
+
+	/* Write the last incomplete page */
+	if (new_entry > 0 || oldest_multi == next_multi)
+	{
+		memset(&newbuf[new_entry], 0,
+			   sizeof(newbuf[0]) * (MULTIXACT_OFFSETS_PER_PAGE_NEW - new_entry));
+		write_new_segment_page(&newseg, newbuf);
+	}
+
+	/* Use next_offset as oldest_offset, if oldest_multi == next_multi */
+	if (!oldest_offset_known)
+	{
+		Assert(oldest_multi == next_multi);
+		oldest_offset = (MultiXactOffset) old_cluster.controldata.chkpnt_nxtmxoff;
+	}
+
+	/* Release resources */
+	close_segment(&oldseg);
+	close_segment(&newseg);
+
+	pfree(oldseg.dir);
+	pfree(newseg.dir);
+
+	return oldest_offset;
+}
+
+#define MXACT_MEMBERS_FLAG_BYTES			1
+
+#define MULTIXACT_MEMBERS_PER_GROUP			4
+#define MULTIXACT_MEMBERGROUP_SIZE			\
+	(MULTIXACT_MEMBERS_PER_GROUP * (sizeof(TransactionId) + MXACT_MEMBERS_FLAG_BYTES))
+#define MULTIXACT_MEMBERGROUPS_PER_PAGE		\
+	(BLCKSZ / MULTIXACT_MEMBERGROUP_SIZE)
+
+#define MULTIXACT_MEMBERS_PER_PAGE				\
+	(MULTIXACT_MEMBERS_PER_GROUP * MULTIXACT_MEMBERGROUPS_PER_PAGE)
+#define MULTIXACT_MEMBER_FLAG_BYTES_PER_GROUP	\
+	(MXACT_MEMBERS_FLAG_BYTES * MULTIXACT_MEMBERS_PER_GROUP)
+
+typedef struct MultiXactMembersCtx
+{
+	SlruSegState	seg;
+	char			buf[BLCKSZ];
+	int				group;
+	int				member;
+	char		   *flag;
+	TransactionId  *xid;
+} MultiXactMembersCtx;
+
+static void
+MultiXactMembersCtxInit(MultiXactMembersCtx *ctx)
+{
+	ctx->seg.dir = psprintf("%s/pg_multixact/members", new_cluster.pgdata);
+
+	ctx->group = 0;
+	ctx->member = 1;		/* skip invalid zero offset */
+
+	ctx->flag = (char *) ctx->buf + ctx->group * MULTIXACT_MEMBERGROUP_SIZE;
+	ctx->xid = (TransactionId *)(ctx->flag + MXACT_MEMBERS_FLAG_BYTES * MULTIXACT_MEMBERS_PER_GROUP);
+
+	ctx->flag += ctx->member;
+	ctx->xid += ctx->member;
+}
+
+static void
+MultiXactMembersCtxAdd(MultiXactMembersCtx *ctx, char flag, TransactionId xid)
+{
+	/* Copy member's xid and flags to the new page */
+	*ctx->flag++ = flag;
+	*ctx->xid++ = xid;
+
+	if (++ctx->member < MULTIXACT_MEMBERS_PER_GROUP)
+		return;
+
+	/* Start next member group */
+	ctx->member = 0;
+
+	if (++ctx->group >= MULTIXACT_MEMBERGROUPS_PER_PAGE)
+	{
+		/* Write current page and start new */
+		write_new_segment_page(&ctx->seg, ctx->buf);
+
+		ctx->group = 0;
+		memset(ctx->buf, 0, BLCKSZ);
+	}
+
+	ctx->flag = (char *) ctx->buf + ctx->group * MULTIXACT_MEMBERGROUP_SIZE;
+	ctx->xid = (TransactionId *)(ctx->flag + MXACT_MEMBERS_FLAG_BYTES * MULTIXACT_MEMBERS_PER_GROUP);
+}
+
+static void
+MultiXactMembersCtxFinit(MultiXactMembersCtx *ctx)
+{
+	if (ctx->flag > (char *) ctx->buf)
+		write_new_segment_page(&ctx->seg, ctx->buf);
+
+	close_segment(&ctx->seg);
+
+	pfree(ctx->seg.dir);
+}
+
+/*
+ * Convert pg_multixact/members segments, offsets will start from 1.
+ *
+ */
+void
+convert_multixact_members(MultiXactOffset oldest_offset)
+{
+	MultiXactOffset			next_offset,
+							offset;
+	SlruSegState			oldseg = {0};
+	char					oldbuf[BLCKSZ] = {0};
+	int						oldidx;
+	MultiXactMembersCtx		newctx = {0};
+
+	oldseg.dir = psprintf("%s/pg_multixact/members", old_cluster.pgdata);
+
+	next_offset = (MultiXactOffset) old_cluster.controldata.chkpnt_nxtmxoff;
+	if (next_offset < oldest_offset)
+		next_offset += ((uint64) 1 << 32) - 1;
+
+	/* Initialize the old starting position */
+	oldseg.pageno = oldest_offset / MULTIXACT_MEMBERS_PER_PAGE;
+	oldseg.segno = oldseg.pageno / SLRU_PAGES_PER_SEGMENT;
+	oldseg.pageno %= SLRU_PAGES_PER_SEGMENT;
+
+	/* Initialize new starting position */
+	MultiXactMembersCtxInit(&newctx);
+
+	/* Iterate through the original directory */
+	oldidx = oldest_offset % MULTIXACT_MEMBERS_PER_PAGE;
+	for (offset = oldest_offset; offset < next_offset;)
+	{
+		bool	empty;
+		int		oldlen;
+		int		ngroups;
+		int		oldgroup;
+		int		oldmember;
+
+		oldlen = read_old_segment_page(&oldseg, oldbuf, &empty);
+		if (empty || oldlen != BLCKSZ)
+			pg_fatal("cannot read page %llu from file \"%s\": %m",
+					 (unsigned long long) oldseg.pageno, oldseg.fn);
+
+		/* Iterate through the old member groups */
+		ngroups = oldlen / MULTIXACT_MEMBERGROUP_SIZE;
+		oldmember = oldidx % MULTIXACT_MEMBERS_PER_GROUP;
+		oldgroup = oldidx / MULTIXACT_MEMBERS_PER_GROUP;
+		while (oldgroup < ngroups && offset < next_offset)
+		{
+			char		   *oldflag;
+			TransactionId  *oldxid;
+			int				i;
+
+			oldflag = (char *) oldbuf + oldgroup * MULTIXACT_MEMBERGROUP_SIZE;
+			oldxid = (TransactionId *)(oldflag + MULTIXACT_MEMBER_FLAG_BYTES_PER_GROUP);
+
+			oldxid += oldmember;
+			oldflag += oldmember;
+
+			/* Iterate through the old members */
+			for (i = oldmember;
+				 i < MULTIXACT_MEMBERS_PER_GROUP && offset < next_offset;
+				 i++)
+			{
+				MultiXactMembersCtxAdd(&newctx, *oldflag++, *oldxid++);
+
+				if (++offset == (uint64) 1 << 32)
+				{
+					Assert(i == MaxMultiXactOffsetOld % MULTIXACT_MEMBERS_PER_GROUP);
+					goto wraparound;
+				}
+			}
+
+			oldgroup++;
+			oldmember = 0;
+		}
+
+		oldidx = 0;
+
+		continue;
+
+wraparound:
+#define SEGNO_MAX	(MaxMultiXactOffsetOld / MULTIXACT_MEMBERS_PER_PAGE / SLRU_PAGES_PER_SEGMENT)
+#define PAGENO_MAX	(MaxMultiXactOffsetOld / MULTIXACT_MEMBERS_PER_PAGE % SLRU_PAGES_PER_SEGMENT)
+		Assert((oldseg.segno == SEGNO_MAX && oldseg.pageno == PAGENO_MAX + 1) ||
+			   (oldseg.segno == SEGNO_MAX + 1 && oldseg.pageno == 0));
+
+		/* Switch to segment 0000 */
+		close_segment(&oldseg);
+		oldseg.segno = 0;
+		oldseg.pageno = 0;
+
+		/* skip invalid zero multi offset */
+		oldidx = 1;
+	}
+
+	MultiXactMembersCtxFinit(&newctx);
+
+	/* Release resources */
+	close_segment(&oldseg);
+
+	pfree(oldseg.dir);
+}
-- 
2.43.0
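
To make the offset conversion in segresize.c easier to review, here is a small standalone sketch of the rebasing rule it applies to every entry: old 32-bit offsets are shifted so that the oldest surviving offset becomes 1, and offsets that had already wrapped past 2^32 are unwrapped first. Because offset 0 is invalid in the old format, the wrap adds 2^32 - 1 rather than 2^32. The function name `rebase_offset` is hypothetical; only the arithmetic is taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Map an old 32-bit member offset to the new 64-bit numbering, in which the
 * oldest offset becomes 1 (hypothetical helper mirroring the patch arithmetic).
 */
uint64_t
rebase_offset(uint32_t old_offset, uint32_t oldest_offset)
{
	uint64_t	offset = old_offset;

	/* Assumption: zero is never a valid offset in the old format. */
	assert(oldest_offset != 0);

	/* Unwrap: old offset 1 follows 0xFFFFFFFF because 0 is skipped. */
	if (offset < oldest_offset)
		offset += ((uint64_t) 1 << 32) - 1;

	/* Shift so that the oldest offset maps to 1. */
	return offset - oldest_offset + 1;
}
```

For example, with oldest_offset = 0xFFFFFFF0 the old offsets 0xFFFFFFF0..0xFFFFFFFF map to 1..16, and the wrapped old offset 1 maps to 17, so the new sequence has no gap.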

Attachment: v14-0006-TEST-add-src-bin-pg_upgrade-t-006_offset.pl.patch.txt (text/plain; charset=US-ASCII)
From 325612fc61a5f4728b62070169eb2dd013baa119 Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Tue, 19 Nov 2024 17:08:10 +0300
Subject: [PATCH v14 6/7] TEST: add src/bin/pg_upgrade/t/006_offset.pl

---
 src/bin/pg_upgrade/t/006_offset.pl | 562 +++++++++++++++++++++++++++++
 1 file changed, 562 insertions(+)
 create mode 100644 src/bin/pg_upgrade/t/006_offset.pl

diff --git a/src/bin/pg_upgrade/t/006_offset.pl b/src/bin/pg_upgrade/t/006_offset.pl
new file mode 100644
index 0000000000..f5dc733a30
--- /dev/null
+++ b/src/bin/pg_upgrade/t/006_offset.pl
@@ -0,0 +1,562 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+use strict;
+use warnings FATAL => 'all';
+
+use File::Find qw(find);
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# This pair of calls will create significantly more member segments than offset
+# segments.
+sub prep
+{
+	my $node = shift;
+	my $tbl = shift;
+
+	$node->safe_psql('postgres',
+		"CREATE TABLE ${tbl} (I INT PRIMARY KEY, N_UPDATED INT) " .
+		"       WITH (AUTOVACUUM_ENABLED=FALSE);" .
+		"INSERT INTO ${tbl} SELECT G, 0 FROM GENERATE_SERIES(1, 50) G;");
+}
+
+sub fill
+{
+	my $node = shift;
+	my $tbl = shift;
+
+	my $nclients = 50;
+	my $update_every = 90;
+	my @connections = ();
+
+	for (1..$nclients)
+	{
+		my $conn = $node->background_psql('postgres');
+		$conn->query_safe("BEGIN");
+
+		push(@connections, $conn);
+	}
+
+	for (my $i = 0; $i < 20000; $i++)
+	{
+		my $conn = $connections[$i % $nclients];
+
+		$conn->query_safe("COMMIT;");
+		$conn->query_safe("BEGIN");
+
+		if ($i % $update_every == 0)
+		{
+			$conn->query_safe(
+				"UPDATE ${tbl} SET " .
+				"N_UPDATED = N_UPDATED + 1 " .
+				"WHERE I = ${i} % 50");
+		}
+		else
+		{
+			$conn->query_safe(
+				"SELECT * FROM ${tbl} FOR KEY SHARE");
+		}
+	}
+
+	for my $conn (@connections)
+	{
+		$conn->quit();
+	}
+}
+
+# This pair of calls will create more or less the same number of members and
+# offsets segments.
+sub prep2
+{
+	my $node = shift;
+	my $tbl = shift;
+
+	$node->safe_psql('postgres',
+		"CREATE TABLE ${tbl}(BAR INT PRIMARY KEY, BAZ INT); " .
+		"CREATE OR REPLACE PROCEDURE MXIDFILLER(N_STEPS INT DEFAULT 1000) " .
+		"LANGUAGE PLPGSQL " .
+		"AS \$\$ " .
+		"BEGIN " .
+		"	FOR I IN 1..N_STEPS LOOP " .
+		"		UPDATE ${tbl} SET BAZ = RANDOM(1, 1000) " .
+		"		WHERE BAR IN (SELECT BAR FROM ${tbl} " .
+		"						TABLESAMPLE BERNOULLI(80)); " .
+		"		COMMIT; " .
+		"	END LOOP; " .
+		"END; \$\$; " .
+		"INSERT INTO ${tbl} (BAR, BAZ) " .
+		"SELECT ID, ID FROM GENERATE_SERIES(1, 1024) ID;");
+}
+
+sub fill2
+{
+	my $node = shift;
+	my $tbl = shift;
+	my $scale = shift // 1;
+
+	$node->safe_psql('postgres',
+		"BEGIN; " .
+		"SELECT * FROM ${tbl} FOR KEY SHARE; " .
+		"PREPARE TRANSACTION 'A'; " .
+		"CALL MXIDFILLER((365 * ${scale})::int); " .
+		"COMMIT PREPARED 'A';");
+}
+
+
+# generate around 2 offset segments and 55 member segments
+sub mxid_gen1
+{
+	my $node = shift;
+	my $tbl = shift;
+
+	prep($node, $tbl);
+	fill($node, $tbl);
+
+	$node->safe_psql('postgres', q(CHECKPOINT));
+}
+
+# generate around 10 offset segments and 12 member segments
+sub mxid_gen2
+{
+	my $node = shift;
+	my $tbl = shift;
+	my $scale = shift // 1;
+
+	prep2($node, $tbl);
+	fill2($node, $tbl, $scale);
+
+	$node->safe_psql('postgres', q(CHECKPOINT));
+}
+
+# Fetch latest multixact checkpoint values.
+sub multi_bounds
+{
+	my ($node) = @_;
+	my $path = $node->config_data('--bindir');
+	my ($stdout, $stderr) = run_command([
+									$path . '/pg_controldata',
+									$node->data_dir
+								]);
+	my @control_data = split("\n", $stdout);
+	my $next = undef;
+	my $oldest = undef;
+	my $next_offset = undef;
+
+	foreach (@control_data)
+	{
+		if ($_ =~ /^Latest checkpoint's NextMultiXactId:\s*(.*)$/mg)
+		{
+			$next = $1;
+			print ">>> @ node ". $node->name . ", " . $_ . "\n";
+		}
+
+		if ($_ =~ /^Latest checkpoint's oldestMultiXid:\s*(.*)$/mg)
+		{
+			$oldest = $1;
+			print ">>> @ node ". $node->name . ", " . $_ . "\n";
+		}
+
+		if ($_ =~ /^Latest checkpoint's NextMultiOffset:\s*(.*)$/mg)
+		{
+			$next_offset = $1;
+			print ">>> @ node ". $node->name . ", " . $_ . "\n";
+		}
+
+		if (defined($oldest) && defined($next) && defined($next_offset))
+		{
+			last;
+		}
+	}
+
+	die "Latest checkpoint's NextMultiXactId not found in control file!\n"
+	unless defined($next);
+
+	die "Latest checkpoint's oldestMultiXid not found in control file!\n"
+	unless defined($oldest);
+
+	die "Latest checkpoint's NextMultiOffset not found in control file!\n"
+	unless defined($next_offset);
+
+	return ($oldest, $next, $next_offset);
+}
+
+# Create node from existing bins.
+sub create_new_node
+{
+	my ($name, %params) = @_;
+
+	create_node(0, @_);
+}
+
+# Create node from ENV oldinstall
+sub create_old_node
+{
+	my ($name, %params) = @_;
+
+	if (!defined($ENV{oldinstall}))
+	{
+		die "oldinstall is not defined";
+	}
+
+	create_node(1, @_);
+}
+
+sub create_node
+{
+	my ($install_path_from_env, $name, %params) = @_;
+	my $scale = defined $params{scale} ? $params{scale} : 1;
+	my $multi = defined $params{multi} ? $params{multi} : undef;
+	my $offset = defined $params{offset} ? $params{offset} : undef;
+
+	my $node =
+		$install_path_from_env ?
+			PostgreSQL::Test::Cluster->new($name,
+					install_path => $ENV{oldinstall}) :
+			PostgreSQL::Test::Cluster->new($name);
+
+	$node->init(force_initdb => 1,
+		extra => [
+			$multi ? ('-m', $multi) : (),
+			$offset ? ('-o', $offset) : (),
+		]);
+
+	# Fixup MOX patch quirk
+	if ($multi)
+	{
+		unlink $node->data_dir . '/pg_multixact/offsets/0000';
+	}
+	if ($offset)
+	{
+		unlink $node->data_dir . '/pg_multixact/members/0000';
+	}
+
+	$node->append_conf('postgresql.conf', 'fsync = off');
+	$node->append_conf('postgresql.conf', 'max_prepared_transactions = 2');
+
+	$node->start();
+	mxid_gen2($node, 'FOO', $scale);
+	mxid_gen1($node, 'BAR', $scale);
+	$node->restart();
+	$node->safe_psql('postgres', q(SELECT * FROM FOO));		# just in case...
+	$node->safe_psql('postgres', q(SELECT * FROM BAR));
+	$node->safe_psql('postgres', q(CHECKPOINT));
+	$node->stop();
+
+	return $node;
+}
+
+sub do_upgrade
+{
+	my ($oldnode, $newnode) = @_;
+
+	command_ok(
+		[
+			'pg_upgrade', '--no-sync',
+			'-d', $oldnode->data_dir,
+			'-D', $newnode->data_dir,
+			'-b', $oldnode->config_data('--bindir'),
+			'-B', $newnode->config_data('--bindir'),
+			'-s', $newnode->host,
+			'-p', $oldnode->port,
+			'-P', $newnode->port,
+			'--check'
+		],
+		'run of pg_upgrade --check');
+
+	command_ok(
+		[
+			'pg_upgrade', '--no-sync',
+			'-d', $oldnode->data_dir,
+			'-D', $newnode->data_dir,
+			'-b', $oldnode->config_data('--bindir'),
+			'-B', $newnode->config_data('--bindir'),
+			'-s', $newnode->host,
+			'-p', $oldnode->port,
+			'-P', $newnode->port,
+			'--copy'
+		],
+		'run of pg_upgrade');
+
+	$oldnode->start();
+	$newnode->start();
+
+	my $oldfoo = $oldnode->safe_psql('postgres', q(SELECT * FROM FOO));
+	my $newfoo = $newnode->safe_psql('postgres', q(SELECT * FROM FOO));
+	is($oldfoo, $newfoo, "select foo eq");
+
+	my $oldbar = $oldnode->safe_psql('postgres', q(SELECT * FROM BAR));
+	my $newbar = $newnode->safe_psql('postgres', q(SELECT * FROM BAR));
+	is($oldbar, $newbar, "select bar eq");
+
+	$oldnode->stop();
+	$newnode->stop();
+
+	multi_bounds($oldnode);
+	multi_bounds($newnode);
+}
+
+my @TESTS = (
+	# tests without ENV oldinstall
+	#0, 1, 2, 3, 4, 5, 6,
+	# tests with "real" pg_upgrade
+	#100, 101, 102, 103, 104, 105, 106,
+	# self upgrade
+	1000,
+);
+
+# =============================================================================
+# Basic sanity tests on a NEW bin
+# =============================================================================
+
+# starts from the zero
+SKIP:
+{
+	my $TEST_NO = 0;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $node = create_new_node('simple_mo',
+						scale => 1);
+	multi_bounds($node);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# multi starts from the value
+SKIP:
+{
+	my $TEST_NO = 1;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $node = create_new_node('simple_Mo',
+						scale => 1.15,
+						multi => '0x123400');
+	multi_bounds($node);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# offsets starts from the value
+SKIP:
+{
+	my $TEST_NO = 2;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $node = create_new_node('simple_mO',
+						scale => 1.15,
+						offset => '0x432100');
+	multi_bounds($node);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# multi and offsets starts from the value
+SKIP:
+{
+	my $TEST_NO = 3;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $node = create_new_node('simple_MO',
+						scale => 1.15,
+						multi => '0xDEAD00', offset => '0xBEEF00');
+	multi_bounds($node);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# multi starts from the value, multi wrap
+SKIP:
+{
+	my $TEST_NO = 4;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $node = create_new_node('simple_Mo_wrap',
+						scale => 1.15,
+						multi => '0xFFFF7000');
+	multi_bounds($node);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# offsets starts from the value, offsets wrap
+SKIP:
+{
+	my $TEST_NO = 5;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $node = create_new_node('simple_mO_wrap',
+						scale => 1.15,
+						offset => '0xFFFFFC00');
+	multi_bounds($node);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# multi starts from the value, offsets starts from the value,
+# multi wrap, offsets wrap
+SKIP:
+{
+	my $TEST_NO = 6;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $node = create_new_node('simple_MO_wrap',
+						scale => 1.15,
+						multi => '0xFFFF7000', offset => '0xFFFFFC00');
+	multi_bounds($node);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# =============================================================================
+# pg_upgrade tests
+# =============================================================================
+
+# starts from the zero
+SKIP:
+{
+	my $TEST_NO = 100;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $dbname = 'mo';
+	my $oldnode = create_old_node("old_$dbname",
+						scale => 1);
+	my $newnode = PostgreSQL::Test::Cluster->new("new_$dbname");
+	$newnode->init();
+
+	do_upgrade($oldnode, $newnode);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# multi starts from the value
+SKIP:
+{
+	my $TEST_NO = 101;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $dbname = 'Mo';
+	my $oldnode = create_old_node("old_$dbname",
+						scale => 1.2,
+						multi => '0x123400');
+	my $newnode = PostgreSQL::Test::Cluster->new("new_$dbname");
+	$newnode->init();
+
+	do_upgrade($oldnode, $newnode);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# offsets starts from the value
+SKIP:
+{
+	my $TEST_NO = 102;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $dbname = 'mO';
+	my $oldnode = create_old_node("old_$dbname",
+						scale => 1.2,
+						offset => '0x432100');
+	my $newnode = PostgreSQL::Test::Cluster->new("new_$dbname");
+	$newnode->init();
+
+	do_upgrade($oldnode, $newnode);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# multi and offsets starts from the value
+SKIP:
+{
+	my $TEST_NO = 103;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $dbname = 'MO';
+	my $oldnode = create_old_node("old_$dbname",
+						scale => 1.2,
+						multi => '0xDEAD00', offset => '0xBEEF00');
+	my $newnode = PostgreSQL::Test::Cluster->new("new_$dbname");
+	$newnode->init();
+
+	do_upgrade($oldnode, $newnode);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# multi starts from the value, multi wrap
+SKIP:
+{
+	my $TEST_NO = 104;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $dbname = 'Mo_wrap';
+	my $oldnode = create_old_node("old_$dbname",
+						scale => 1.2,
+						multi => '0xFFFF7000');
+	my $newnode = PostgreSQL::Test::Cluster->new("new_$dbname");
+	$newnode->init();
+
+	do_upgrade($oldnode, $newnode);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# offsets start from a nonzero value and wrap
+SKIP:
+{
+	my $TEST_NO = 105;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $dbname = 'mO_wrap';
+	my $oldnode = create_old_node("old_$dbname",
+						scale => 1.2,
+						offset => '0xFFFFFC00');
+	my $newnode = PostgreSQL::Test::Cluster->new("new_$dbname");
+	$newnode->init();
+
+	do_upgrade($oldnode, $newnode);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# multi and offsets both start from nonzero values,
+# multi wraps, offsets wrap
+SKIP:
+{
+	my $TEST_NO = 106;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $dbname = 'MO_wrap';
+	my $oldnode = create_old_node("old_$dbname",
+						scale => 1.2,
+						multi => '0xFFFF7000', offset => '0xFFFFFC00');
+	my $newnode = PostgreSQL::Test::Cluster->new("new_$dbname");
+	$newnode->init();
+
+	do_upgrade($oldnode, $newnode);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# =============================================================================
+# Self upgrade
+# =============================================================================
+
+# multi and offsets start from zero
+SKIP:
+{
+	my $TEST_NO = 1000;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $dbname = 'self_upgrade';
+	my $oldnode = create_new_node("old_$dbname",
+						scale => 1);
+	my $newnode = PostgreSQL::Test::Cluster->new("new_$dbname");
+	$newnode->init();
+
+	do_upgrade($oldnode, $newnode);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+done_testing();
-- 
2.43.0

v14-0007-TEST-bump-catver.patch.txt (text/plain; charset=US-ASCII)
From b80320d07ade1fba3fce94326bd2e5e06a9e7f58 Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Wed, 13 Nov 2024 16:34:34 +0300
Subject: [PATCH v14 7/7] TEST: bump catver

---
 src/bin/pg_upgrade/pg_upgrade.h  | 2 +-
 src/include/catalog/catversion.h | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index 1adea73bd3..14d49be040 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -119,7 +119,7 @@ extern char *output_files[];
  *
  * XXX: should be changed to the actual CATALOG_VERSION_NO on commit.
  */
-#define MULTIXACTOFFSET_FORMATCHANGE_CAT_VER 202409041
+#define MULTIXACTOFFSET_FORMATCHANGE_CAT_VER 202503032
 
 /*
  * large object chunk size added to pg_controldata,
diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h
index f0962e17b3..c952e122f2 100644
--- a/src/include/catalog/catversion.h
+++ b/src/include/catalog/catversion.h
@@ -57,6 +57,6 @@
  */
 
 /*							yyyymmddN */
-#define CATALOG_VERSION_NO	202503031
+#define CATALOG_VERSION_NO	202503032
 
 #endif
-- 
2.43.0

#34wenhui qiu
qiuwenhuifx@gmail.com
In reply to: Maxim Orlov (#33)
Re: POC: make mxidoff 64 bits

Hi Maxim, Heikki,

Thank you for working on this. Code freeze is only a few days away, and
this thread seems to have been quiet for a while, but I believe implementing
xid64 is absolutely necessary. For instance, there is often concern about
performance jitter caused by frequent freezes. With xid64 implemented,
freezing could be much smoother.

Thanks

On Fri, Mar 7, 2025 at 7:30 PM Maxim Orlov <orlovmg@gmail.com> wrote:

Here is a rebase, v14.

--
Best regards,
Maxim Orlov.

#35Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Maxim Orlov (#33)
Re: POC: make mxidoff 64 bits

On 07/03/2025 13:30, Maxim Orlov wrote:

Here is a rebase, v14.

Thanks! I did some manual testing of this. I created a little helper
function to consume multixids, to test the autovacuum behavior, and
found one issue:

If you consume a lot of multixid members space, by creating lots of
multixids with a huge number of members in each, you can end up with a
very bloated members SLRU, and autovacuum is in no hurry to clean it up.
Here's what I did:

1. Installed the attached test module
2. Ran "select consume_multixids(10000, 100000);" many times
3. Ran:

$ du -h data/pg_multixact/members/
26G data/pg_multixact/members/

When I run "vacuum freeze; select * from pg_database;", I can see that
'datminmxid' for the current database is advanced. However, autovacuum
is in no hurry to vacuum 'template0' and 'template1', so
pg_multixact/members/ does not get truncated. Eventually, when
autovacuum_multixact_freeze_max_age is reached, it presumably will, but
you will run out of disk space before that.
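
For a rough sense of scale, the disk usage of those calls can be estimated
from the members page layout. This is a minimal sketch, assuming the layout
used by multixact.c (groups of four 4-byte XIDs preceded by four flag bytes)
and the default 8 kB block size. The macro and function names are invented
for illustration, and the estimate ignores segment-file rounding:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout constants; names are illustrative, not from the patch. */
#define SKETCH_BLCKSZ            8192
#define SKETCH_MEMBERS_PER_GROUP 4
/* 4 flag bytes + 4 XIDs of 4 bytes each = 20 bytes per group */
#define SKETCH_GROUP_SIZE        (4 + SKETCH_MEMBERS_PER_GROUP * 4)
#define SKETCH_GROUPS_PER_PAGE   (SKETCH_BLCKSZ / SKETCH_GROUP_SIZE)
#define SKETCH_MEMBERS_PER_PAGE  (SKETCH_GROUPS_PER_PAGE * SKETCH_MEMBERS_PER_GROUP)

/*
 * Estimate how many bytes of members SLRU pages are needed to store
 * nmultis multixacts with nmembers members each.
 */
static uint64_t
estimate_members_bytes(uint64_t nmultis, uint64_t nmembers)
{
    uint64_t total_members;
    uint64_t pages;

    assert(nmembers > 0);

    total_members = nmultis * nmembers;
    /* round up to whole pages */
    pages = (total_members + SKETCH_MEMBERS_PER_PAGE - 1) / SKETCH_MEMBERS_PER_PAGE;
    return pages * SKETCH_BLCKSZ;
}
```

Under these assumptions one call to consume_multixids(10000, 100000) stores
10^9 members, which comes to about 5 GB, so the 26G above corresponds to
only a handful of calls.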

There is this check for members size at the end of SetOffsetVacuumLimit():

	/*
	 * Do we need autovacuum? If we're not sure, assume yes.
	 */
	return !oldestOffsetKnown ||
		(nextOffset - oldestOffset > MULTIXACT_MEMBER_AUTOVAC_THRESHOLD);

And the caller (SetMultiXactIdLimit()) will in fact signal the
autovacuum launcher after "vacuum freeze" because of that. But the
autovacuum launcher will look at the datminmxid / relminmxid values, see
that they are well within autovacuum_multixact_freeze_max_age, and do
nothing.

This is a very extreme case, but clearly the code to signal autovacuum
launcher, and the freeze age cutoff that autovacuum then uses, are not
in sync.

This patch removed MultiXactMemberFreezeThreshold(), per my suggestion,
but we threw the baby out with the bathwater. We discussed that in this
thread, but didn't come up with any solution. ISTM we still need
something like MultiXactMemberFreezeThreshold() to trigger autovacuum
freezing if the members have grown too large.
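
The mismatch can be shown in isolation. The following is a simplified
sketch, not the actual autovacuum code: the threshold value and both
function names are hypothetical stand-ins, and the age test is reduced to a
plain modular subtraction. One predicate looks at member space consumed, the
other only at the multixid age, so they can disagree:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for MULTIXACT_MEMBER_AUTOVAC_THRESHOLD. */
#define SKETCH_MEMBER_AUTOVAC_THRESHOLD UINT64_C(2000000000)

/* Mirrors the check quoted above: is the launcher signalled? */
static bool
launcher_is_signalled(bool oldest_offset_known,
                      uint64_t next_offset, uint64_t oldest_offset)
{
    return !oldest_offset_known ||
        (next_offset - oldest_offset > SKETCH_MEMBER_AUTOVAC_THRESHOLD);
}

/*
 * Simplified version of the launcher's decision: a database is forced to
 * be vacuumed only when its datminmxid is older than freeze_max_age.
 */
static bool
database_needs_mxid_freeze(uint32_t next_multi, uint32_t datminmxid,
                           uint32_t freeze_max_age)
{
    return (uint32_t) (next_multi - datminmxid) > freeze_max_age;
}
```

With 50,000 multixids of 100,000 members each, the first predicate is true
(5 * 10^9 members consumed) while the second is false for any sensible
freeze_max_age (the age is only 50,000), which is the situation described
above.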

--
Heikki Linnakangas
Neon (https://neon.tech)

#36Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Heikki Linnakangas (#35)
1 attachment(s)
Re: POC: make mxidoff 64 bits

On 01/04/2025 21:25, Heikki Linnakangas wrote:

On 07/03/2025 13:30, Maxim Orlov wrote:

Here is a rebase, v14.

Thanks! I did some manual testing of this. I created a little helper
function to consume multixids, to test the autovacuum behavior, and
found one issue:

Forgot to attach the test function I used, here it is.

--
Heikki Linnakangas
Neon (https://neon.tech)

Attachments:

0001-TEST-add-consume_multixids-function.patch (text/x-patch; charset=UTF-8)
From b2f156bfdd15df21ae25b3369d090cf9899e80aa Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Tue, 1 Apr 2025 21:01:07 +0300
Subject: [PATCH 1/1] TEST: add consume_multixids function

---
 src/test/modules/xid_wraparound/Makefile      |  1 +
 src/test/modules/xid_wraparound/meson.build   |  1 +
 .../xid_wraparound/multixid_wraparound.c      | 96 +++++++++++++++++++
 .../xid_wraparound/xid_wraparound--1.0.sql    |  4 +
 4 files changed, 102 insertions(+)
 create mode 100644 src/test/modules/xid_wraparound/multixid_wraparound.c

diff --git a/src/test/modules/xid_wraparound/Makefile b/src/test/modules/xid_wraparound/Makefile
index 7a6e0f66762..ebb3d8fcb3e 100644
--- a/src/test/modules/xid_wraparound/Makefile
+++ b/src/test/modules/xid_wraparound/Makefile
@@ -3,6 +3,7 @@
 MODULE_big = xid_wraparound
 OBJS = \
 	$(WIN32RES) \
+	multixid_wraparound.o \
 	xid_wraparound.o
 PGFILEDESC = "xid_wraparound - tests for XID wraparound"
 
diff --git a/src/test/modules/xid_wraparound/meson.build b/src/test/modules/xid_wraparound/meson.build
index f7dada67f67..98ad381614c 100644
--- a/src/test/modules/xid_wraparound/meson.build
+++ b/src/test/modules/xid_wraparound/meson.build
@@ -1,6 +1,7 @@
 # Copyright (c) 2023-2025, PostgreSQL Global Development Group
 
 xid_wraparound_sources = files(
+  'multixid_wraparound.c',
   'xid_wraparound.c',
 )
 
diff --git a/src/test/modules/xid_wraparound/multixid_wraparound.c b/src/test/modules/xid_wraparound/multixid_wraparound.c
new file mode 100644
index 00000000000..af567c6e541
--- /dev/null
+++ b/src/test/modules/xid_wraparound/multixid_wraparound.c
@@ -0,0 +1,96 @@
+/*--------------------------------------------------------------------------
+ *
+ * multixid_wraparound.c
+ *		Utilities for testing multixids
+ *
+ *
+ * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *		src/test/modules/xid_wraparound/multixid_wraparound.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/multixact.h"
+#include "access/xact.h"
+#include "miscadmin.h"
+#include "storage/proc.h"
+#include "utils/xid8.h"
+
+static int mxactMemberComparator(const void *arg1, const void *arg2);
+
+/*
+ * Consume the specified number of multi-XIDs, with specified number of
+ * members each.
+ */
+PG_FUNCTION_INFO_V1(consume_multixids);
+Datum
+consume_multixids(PG_FUNCTION_ARGS)
+{
+	int64		nmultis = PG_GETARG_INT64(0);
+	int32		nmembers = PG_GETARG_INT32(1);
+	MultiXactMember *members;
+	MultiXactId	lastmxid = InvalidMultiXactId;
+
+	if (nmultis < 0)
+		elog(ERROR, "invalid nmultis argument: %" PRId64, nmultis);
+	if (nmembers < 1)
+		elog(ERROR, "invalid nmembers argument: %d", nmembers);
+
+	/*
+	 * We consume XIDs by calling GetNewTransactionId(true), which marks the
+	 * consumed XIDs as subtransactions of the current top-level transaction.
+	 * For that to work, this transaction must have a top-level XID.
+	 *
+	 * GetNewTransactionId registers them in the subxid cache in PGPROC, until
+	 * the cache overflows, but beyond that, we don't keep track of the
+	 * consumed XIDs.
+	 */
+	(void) GetTopTransactionId();
+
+	members = palloc((nmultis + nmembers) * sizeof(MultiXactMember));
+	for (int64 i = 0; i < nmultis + nmembers; i++)
+	{
+		FullTransactionId xid;
+
+		xid = GetNewTransactionId(true);
+		members[i].xid = XidFromFullTransactionId(xid);
+		members[i].status = MultiXactStatusForKeyShare;
+	}
+	/*
+	 * pre-sort the array like mXactCacheGetBySet does, so that the qsort call
+	 * in mXactCacheGetBySet() is cheaper.
+	 */
+	qsort(members, nmultis + nmembers, sizeof(MultiXactMember), mxactMemberComparator);
+
+	for (int64 i = 0; i < nmultis; i++)
+	{
+		lastmxid = MultiXactIdCreateFromMembers(nmembers, &members[i]);
+		CHECK_FOR_INTERRUPTS();
+	}
+
+	pfree(members);
+
+	PG_RETURN_INT64((int64) lastmxid);
+}
+
+/* copied from multixact.c */
+static int
+mxactMemberComparator(const void *arg1, const void *arg2)
+{
+	MultiXactMember member1 = *(const MultiXactMember *) arg1;
+	MultiXactMember member2 = *(const MultiXactMember *) arg2;
+
+	if (member1.xid > member2.xid)
+		return 1;
+	if (member1.xid < member2.xid)
+		return -1;
+	if (member1.status > member2.status)
+		return 1;
+	if (member1.status < member2.status)
+		return -1;
+	return 0;
+}
diff --git a/src/test/modules/xid_wraparound/xid_wraparound--1.0.sql b/src/test/modules/xid_wraparound/xid_wraparound--1.0.sql
index 96356b4b974..ed7520c3d86 100644
--- a/src/test/modules/xid_wraparound/xid_wraparound--1.0.sql
+++ b/src/test/modules/xid_wraparound/xid_wraparound--1.0.sql
@@ -10,3 +10,7 @@ AS 'MODULE_PATHNAME' LANGUAGE C;
 CREATE FUNCTION consume_xids_until(targetxid xid8)
 RETURNS xid8 VOLATILE PARALLEL UNSAFE STRICT
 AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION consume_multixids(nmultis bigint, nmembers int4)
+RETURNS bigint VOLATILE PARALLEL UNSAFE STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
-- 
2.39.5

#37wenhui qiu
qiuwenhuifx@gmail.com
In reply to: Heikki Linnakangas (#36)
Re: POC: make mxidoff 64 bits

HI Heikki
Could you please create an entry at https://commitfest.postgresql.org/53/ ?
I cannot find this patch in Commitfest 2025-07.

Thanks

On Wed, Apr 2, 2025 at 2:26 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 01/04/2025 21:25, Heikki Linnakangas wrote:

On 07/03/2025 13:30, Maxim Orlov wrote:

Here is a rebase, v14.

Thanks! I did some manual testing of this. I created a little helper
function to consume multixids, to test the autovacuum behavior, and
found one issue:

Forgot to attach the test function I used, here it is.

--
Heikki Linnakangas
Neon (https://neon.tech)

#38Maxim Orlov
orlovmg@gmail.com
In reply to: wenhui qiu (#37)
Re: POC: make mxidoff 64 bits

I moved the topic to the next commitfest.

--
Best regards,
Maxim Orlov.

#39Ashutosh Bapat
ashutosh.bapat.oss@gmail.com
In reply to: Maxim Orlov (#38)
Re: POC: make mxidoff 64 bits

Hi Maxim,

On Wed, Apr 16, 2025 at 1:42 PM Maxim Orlov <orlovmg@gmail.com> wrote:

I moved the topic to the next commitfest.

I am reviewing these patches.

I notice that transam/README does not mention multixacts except in one
place, in the section "Transaction Emulation during Recovery". I expected it
to document how pg_multixact/members and pg_multixact/offsets are used and
what their layout is. It's not the responsibility of this patchset to
document that, but it would be good to add a section about multixacts to
transam/README. It would make reviews easier.
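
As a starting point for such a section, here is a hedged sketch of how a
multixact ID and a member offset map to SLRU page positions. It assumes the
default 8 kB block size, 8-byte offset entries as proposed in this thread,
and the member-group layout used by multixact.c. The type and function names
are invented for illustration and are not taken from the patch:

```c
#include <stdint.h>

#define SKETCH_BLCKSZ 8192

/* offsets SLRU: one starting offset (assumed 8 bytes wide) per multixact ID */
#define SKETCH_OFFSETS_PER_PAGE  (SKETCH_BLCKSZ / sizeof(uint64_t))

/* members SLRU: groups of 4 flag bytes followed by four 4-byte XIDs */
#define SKETCH_FLAG_BYTES        4
#define SKETCH_MEMBERS_PER_GROUP 4
#define SKETCH_GROUP_SIZE        (SKETCH_FLAG_BYTES + SKETCH_MEMBERS_PER_GROUP * sizeof(uint32_t))
#define SKETCH_GROUPS_PER_PAGE   (SKETCH_BLCKSZ / SKETCH_GROUP_SIZE)
#define SKETCH_MEMBERS_PER_PAGE  (SKETCH_GROUPS_PER_PAGE * SKETCH_MEMBERS_PER_GROUP)

typedef struct
{
    uint64_t page;          /* page number in pg_multixact/offsets */
    uint32_t entry;         /* index of the entry within that page */
} SketchOffsetSlot;

typedef struct
{
    uint64_t page;          /* page number in pg_multixact/members */
    uint32_t flags_byte;    /* byte position of the group's flag bytes */
    uint32_t xid_byte;      /* byte position of this member's XID */
} SketchMemberSlot;

/* Where is the starting offset of multixact "multi" stored? */
static SketchOffsetSlot
locate_offset_entry(uint32_t multi)
{
    SketchOffsetSlot slot;

    slot.page = multi / SKETCH_OFFSETS_PER_PAGE;
    slot.entry = (uint32_t) (multi % SKETCH_OFFSETS_PER_PAGE);
    return slot;
}

/* Where is the member at position "offset" stored? */
static SketchMemberSlot
locate_member(uint64_t offset)
{
    SketchMemberSlot slot;
    uint64_t group = (offset / SKETCH_MEMBERS_PER_GROUP) % SKETCH_GROUPS_PER_PAGE;

    slot.page = offset / SKETCH_MEMBERS_PER_PAGE;
    slot.flags_byte = (uint32_t) (group * SKETCH_GROUP_SIZE);
    slot.xid_byte = (uint32_t) (slot.flags_byte + SKETCH_FLAG_BYTES +
                                (offset % SKETCH_MEMBERS_PER_GROUP) * sizeof(uint32_t));
    return slot;
}
```

The members of a multixact are then the entries from its starting offset up
to the starting offset of the next multixact.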

--
Best Wishes,
Ashutosh Bapat

#40Maxim Orlov
orlovmg@gmail.com
In reply to: Ashutosh Bapat (#39)
7 attachment(s)
Re: POC: make mxidoff 64 bits

On Tue, 29 Apr 2025 at 15:01, Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
wrote:

I notice that transam/README does not mention multixacts except in one
place, in the section "Transaction Emulation during Recovery". I expected it
to document how pg_multixact/members and pg_multixact/offsets are used and
what their layout is. It's not the responsibility of this patchset to
document that, but it would be good to add a section about multixacts to
transam/README. It would make reviews easier.

Yeah, I agree, this is a big oversight. Anyone who tries to understand
how pg_multixact works currently has to read the code.
We certainly need to address this.

But for now, here is a new rebase @ 70a13c528b6e382a381f.
The only change is that following commits 15a79c7 and a0ed19e, we must also
switch to PRIu64 format.
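
For readers unfamiliar with the convention, this is the standard
<inttypes.h> way of printing a 64-bit unsigned value portably. A minimal
sketch in plain C, where the typedef and the function name are hypothetical
and only PRIu64 itself is what the patch switches to:

```c
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for a 64-bit MultiXactOffset. */
typedef uint64_t SketchMultiXactOffset;

/*
 * Format an offset into buf. PRIu64 expands to the correct conversion
 * specifier for uint64_t on the current platform, so no cast to
 * unsigned long long is needed.
 */
static int
format_offset(char *buf, size_t len, SketchMultiXactOffset off)
{
    return snprintf(buf, len, "offset %" PRIu64, off);
}
```

The macro is a string literal that is concatenated with the rest of the
format string, which is why it appears outside the quotes.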

--
Best regards,
Maxim Orlov.

Attachments:

v15-0005-TEST-initdb-option-to-initialize-cluster-with-no.patch.txt (text/plain; charset=US-ASCII)
From cd2af98bef93b18d3f64afed7239f2a18958a878 Mon Sep 17 00:00:00 2001
From: Maxim Orlov <m.orlov@postgrespro.ru>
Date: Wed, 4 May 2022 15:53:36 +0300
Subject: [PATCH v15 5/7] TEST: initdb option to initialize cluster with
 non-standard xid/mxid/mxoff

Until now, testing database cluster wraparound was not easy, as initdb always
initialized the cluster with the default xid/mxid/mxoff. An option to specify
any valid xid/mxid/mxoff at initdb time makes this easier.

Author: Maxim Orlov <orlovmg@gmail.com>
Author: Pavel Borisov <pashkin.elfe@gmail.com>
Author: Svetlana Derevyanko <s.derevyanko@postgrespro.ru>
Discussion: https://www.postgresql.org/message-id/flat/CACG%3Dezaa4vqYjJ16yoxgrpa-%3DgXnf0Vv3Ey9bjGrRRFN2YyWFQ%40mail.gmail.com
---
 src/backend/access/transam/clog.c      |  21 +++++
 src/backend/access/transam/multixact.c |  53 ++++++++++++
 src/backend/access/transam/subtrans.c  |   8 +-
 src/backend/access/transam/xlog.c      |  15 ++--
 src/backend/bootstrap/bootstrap.c      |  50 +++++++++++-
 src/backend/main/main.c                |   6 ++
 src/backend/postmaster/postmaster.c    |  14 +++-
 src/backend/tcop/postgres.c            |  53 +++++++++++-
 src/bin/initdb/initdb.c                | 107 ++++++++++++++++++++++++-
 src/bin/initdb/t/001_initdb.pl         |  60 ++++++++++++++
 src/include/access/xlog.h              |   3 +
 src/include/c.h                        |   4 +
 src/include/catalog/pg_class.h         |   2 +-
 13 files changed, 382 insertions(+), 14 deletions(-)

diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index 48f10bec91..eb8a9791ab 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -834,6 +834,7 @@ BootStrapCLOG(void)
 {
 	int			slotno;
 	LWLock	   *lock = SimpleLruGetBankLock(XactCtl, 0);
+	int64		pageno;
 
 	LWLockAcquire(lock, LW_EXCLUSIVE);
 
@@ -844,6 +845,26 @@ BootStrapCLOG(void)
 	SimpleLruWritePage(XactCtl, slotno);
 	Assert(!XactCtl->shared->page_dirty[slotno]);
 
+	pageno = TransactionIdToPage(XidFromFullTransactionId(TransamVariables->nextXid));
+	if (pageno != 0)
+	{
+		LWLock *nextlock = SimpleLruGetBankLock(XactCtl, pageno);
+
+		if (nextlock != lock)
+		{
+			LWLockRelease(lock);
+			LWLockAcquire(nextlock, LW_EXCLUSIVE);
+			lock = nextlock;
+		}
+
+		/* Create and zero the commit log page containing nextXid */
+		slotno = ZeroCLOGPage(pageno, false);
+
+		/* Make sure it's written out */
+		SimpleLruWritePage(XactCtl, slotno);
+		Assert(!XactCtl->shared->page_dirty[slotno]);
+	}
+
 	LWLockRelease(lock);
 }
 
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 059a72f106..9e24198a1e 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1955,6 +1955,7 @@ BootStrapMultiXact(void)
 {
 	int			slotno;
 	LWLock	   *lock;
+	int64		pageno;
 
 	lock = SimpleLruGetBankLock(MultiXactOffsetCtl, 0);
 	LWLockAcquire(lock, LW_EXCLUSIVE);
@@ -1966,6 +1967,26 @@ BootStrapMultiXact(void)
 	SimpleLruWritePage(MultiXactOffsetCtl, slotno);
 	Assert(!MultiXactOffsetCtl->shared->page_dirty[slotno]);
 
+	pageno = MultiXactIdToOffsetPage(MultiXactState->nextMXact);
+	if (pageno != 0)
+	{
+		LWLock *nextlock = SimpleLruGetBankLock(MultiXactOffsetCtl, pageno);
+
+		if (nextlock != lock)
+		{
+			LWLockRelease(lock);
+			LWLockAcquire(nextlock, LW_EXCLUSIVE);
+			lock = nextlock;
+		}
+
+		/* Create and zero the offsets page containing nextMXact */
+		slotno = ZeroMultiXactOffsetPage(pageno, false);
+
+		/* Make sure it's written out */
+		SimpleLruWritePage(MultiXactOffsetCtl, slotno);
+		Assert(!MultiXactOffsetCtl->shared->page_dirty[slotno]);
+	}
+
 	LWLockRelease(lock);
 
 	lock = SimpleLruGetBankLock(MultiXactMemberCtl, 0);
@@ -1978,7 +1999,39 @@ BootStrapMultiXact(void)
 	SimpleLruWritePage(MultiXactMemberCtl, slotno);
 	Assert(!MultiXactMemberCtl->shared->page_dirty[slotno]);
 
+	pageno = MXOffsetToMemberPage(MultiXactState->nextOffset);
+	if (pageno != 0)
+	{
+		LWLock *nextlock = SimpleLruGetBankLock(MultiXactMemberCtl, pageno);
+
+		if (nextlock != lock)
+		{
+			LWLockRelease(lock);
+			LWLockAcquire(nextlock, LW_EXCLUSIVE);
+			lock = nextlock;
+		}
+
+		/* Create and zero the members page containing nextOffset */
+		slotno = ZeroMultiXactMemberPage(pageno, false);
+
+		/* Make sure it's written out */
+		SimpleLruWritePage(MultiXactMemberCtl, slotno);
+		Assert(!MultiXactMemberCtl->shared->page_dirty[slotno]);
+	}
+
 	LWLockRelease(lock);
+
+	/*
+	 * If we're not starting from offset zero, initialize dummy multixacts to
+	 * avoid an overly long loop in PerformMembersTruncation().
+	 */
+	if (MultiXactState->nextOffset > 0 && MultiXactState->nextMXact > 0)
+	{
+		RecordNewMultiXact(FirstMultiXactId,
+						   MultiXactState->nextOffset, 0, NULL);
+		RecordNewMultiXact(MultiXactState->nextMXact,
+						   MultiXactState->nextOffset, 0, NULL);
+	}
 }
 
 /*
diff --git a/src/backend/access/transam/subtrans.c b/src/backend/access/transam/subtrans.c
index 15153618fa..218675fa60 100644
--- a/src/backend/access/transam/subtrans.c
+++ b/src/backend/access/transam/subtrans.c
@@ -270,12 +270,16 @@ void
 BootStrapSUBTRANS(void)
 {
 	int			slotno;
-	LWLock	   *lock = SimpleLruGetBankLock(SubTransCtl, 0);
+	LWLock	   *lock;
+	int64		pageno;
+
+	pageno = TransactionIdToPage(XidFromFullTransactionId(TransamVariables->nextXid));
+	lock = SimpleLruGetBankLock(SubTransCtl, pageno);
 
 	LWLockAcquire(lock, LW_EXCLUSIVE);
 
 	/* Create and zero the first page of the subtrans log */
-	slotno = ZeroSUBTRANSPage(0);
+	slotno = ZeroSUBTRANSPage(pageno);
 
 	/* Make sure it's written out */
 	SimpleLruWritePage(SubTransCtl, slotno);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 1914859b2e..09954fd5f8 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -136,6 +136,10 @@ int			max_slot_wal_keep_size_mb = -1;
 int			wal_decode_buffer_size = 512 * 1024;
 bool		track_wal_io_timing = false;
 
+TransactionId		start_xid = FirstNormalTransactionId;
+MultiXactId			start_mxid = FirstMultiXactId;
+MultiXactOffset		start_mxoff = 0;
+
 #ifdef WAL_DEBUG
 bool		XLOG_DEBUG = false;
 #endif
@@ -5265,13 +5269,14 @@ BootStrapXLOG(uint32 data_checksum_version)
 	checkPoint.fullPageWrites = fullPageWrites;
 	checkPoint.wal_level = wal_level;
 	checkPoint.nextXid =
-		FullTransactionIdFromEpochAndXid(0, FirstNormalTransactionId);
+		FullTransactionIdFromEpochAndXid(0, Max(FirstNormalTransactionId,
+												start_xid));
 	checkPoint.nextOid = FirstGenbkiObjectId;
-	checkPoint.nextMulti = FirstMultiXactId;
-	checkPoint.nextMultiOffset = 0;
-	checkPoint.oldestXid = FirstNormalTransactionId;
+	checkPoint.nextMulti = Max(FirstMultiXactId, start_mxid);
+	checkPoint.nextMultiOffset = start_mxoff;
+	checkPoint.oldestXid = XidFromFullTransactionId(checkPoint.nextXid);
 	checkPoint.oldestXidDB = Template1DbOid;
-	checkPoint.oldestMulti = FirstMultiXactId;
+	checkPoint.oldestMulti = checkPoint.nextMulti;
 	checkPoint.oldestMultiDB = Template1DbOid;
 	checkPoint.oldestCommitTsXid = InvalidTransactionId;
 	checkPoint.newestCommitTsXid = InvalidTransactionId;
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 6db864892d..458dc3eb29 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -218,7 +218,7 @@ BootstrapModeMain(int argc, char *argv[], bool check_only)
 	argv++;
 	argc--;
 
-	while ((flag = getopt(argc, argv, "B:c:d:D:Fkr:X:-:")) != -1)
+	while ((flag = getopt(argc, argv, "B:c:d:D:Fkm:o:r:X:x:-:")) != -1)
 	{
 		switch (flag)
 		{
@@ -286,12 +286,60 @@ BootstrapModeMain(int argc, char *argv[], bool check_only)
 			case 'k':
 				bootstrap_data_checksum_version = PG_DATA_CHECKSUM_VERSION;
 				break;
+			case 'm':
+				{
+					char	   *endptr;
+
+					errno = 0;
+					start_mxid = strtou64(optarg, &endptr, 0);
+
+					if (endptr == optarg || *endptr != '\0' || errno != 0 ||
+						!StartMultiXactIdIsValid(start_mxid))
+					{
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("invalid initial database cluster multixact id")));
+					}
+				}
+				break;
+			case 'o':
+				{
+					char	   *endptr;
+
+					errno = 0;
+					start_mxoff = strtou64(optarg, &endptr, 0);
+
+					if (endptr == optarg || *endptr != '\0' || errno != 0 ||
+						!StartMultiXactOffsetIsValid(start_mxoff))
+					{
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("invalid initial database cluster multixact offset")));
+					}
+				}
+				break;
 			case 'r':
 				strlcpy(OutputFileName, optarg, MAXPGPATH);
 				break;
 			case 'X':
 				SetConfigOption("wal_segment_size", optarg, PGC_INTERNAL, PGC_S_DYNAMIC_DEFAULT);
 				break;
+			case 'x':
+				{
+					char	   *endptr;
+
+					errno = 0;
+					start_xid = strtou64(optarg, &endptr, 0);
+
+					if (endptr == optarg || *endptr != '\0' || errno != 0 ||
+						!StartTransactionIdIsValid(start_xid))
+					{
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("invalid initial database cluster xid value")));
+					}
+				}
+				break;
 			default:
 				write_stderr("Try \"%s --help\" for more information.\n",
 							 progname);
diff --git a/src/backend/main/main.c b/src/backend/main/main.c
index 7d63cf94a6..0a0ee16220 100644
--- a/src/backend/main/main.c
+++ b/src/backend/main/main.c
@@ -423,12 +423,18 @@ help(const char *progname)
 	printf(_("  -E                 echo statement before execution\n"));
 	printf(_("  -j                 do not use newline as interactive query delimiter\n"));
 	printf(_("  -r FILENAME        send stdout and stderr to given file\n"));
+	printf(_("  -m START_MXID      set initial database cluster multixact id\n"));
+	printf(_("  -o START_MXOFF     set initial database cluster multixact offset\n"));
+	printf(_("  -x START_XID       set initial database cluster xid\n"));
 
 	printf(_("\nOptions for bootstrapping mode:\n"));
 	printf(_("  --boot             selects bootstrapping mode (must be first argument)\n"));
 	printf(_("  --check            selects check mode (must be first argument)\n"));
 	printf(_("  DBNAME             database name (mandatory argument in bootstrapping mode)\n"));
 	printf(_("  -r FILENAME        send stdout and stderr to given file\n"));
+	printf(_("  -m START_MXID      set initial database cluster multixact id\n"));
+	printf(_("  -o START_MXOFF     set initial database cluster multixact offset\n"));
+	printf(_("  -x START_XID       set initial database cluster xid\n"));
 
 	printf(_("\nPlease read the documentation for the complete list of run-time\n"
 			 "configuration settings and how to set them on the command line or in\n"
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 490f7ce366..b41915deed 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -595,7 +595,7 @@ PostmasterMain(int argc, char *argv[])
 	 * tcop/postgres.c (the option sets should not conflict) and with the
 	 * common help() function in main/main.c.
 	 */
-	while ((opt = getopt(argc, argv, "B:bC:c:D:d:EeFf:h:ijk:lN:OPp:r:S:sTt:W:-:")) != -1)
+	while ((opt = getopt(argc, argv, "B:bC:c:D:d:EeFf:h:ijk:lm:N:Oo:Pp:r:S:sTt:W:x:-:")) != -1)
 	{
 		switch (opt)
 		{
@@ -705,10 +705,18 @@ PostmasterMain(int argc, char *argv[])
 				SetConfigOption("max_connections", optarg, PGC_POSTMASTER, PGC_S_ARGV);
 				break;
 
+			case 'm':
+				/* only used by single-user backend */
+				break;
+
 			case 'O':
 				SetConfigOption("allow_system_table_mods", "true", PGC_POSTMASTER, PGC_S_ARGV);
 				break;
 
+			case 'o':
+				/* only used by single-user backend */
+				break;
+
 			case 'P':
 				SetConfigOption("ignore_system_indexes", "true", PGC_POSTMASTER, PGC_S_ARGV);
 				break;
@@ -759,6 +767,10 @@ PostmasterMain(int argc, char *argv[])
 				SetConfigOption("post_auth_delay", optarg, PGC_POSTMASTER, PGC_S_ARGV);
 				break;
 
+			case 'x':
+				/* only used by single-user backend */
+				break;
+
 			default:
 				write_stderr("Try \"%s --help\" for more information.\n",
 							 progname);
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 37784b7816..07e1c3d3f4 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3829,7 +3829,7 @@ process_postgres_switches(int argc, char *argv[], GucContext ctx,
 	 * postmaster/postmaster.c (the option sets should not conflict) and with
 	 * the common help() function in main/main.c.
 	 */
-	while ((flag = getopt(argc, argv, "B:bC:c:D:d:EeFf:h:ijk:lN:nOPp:r:S:sTt:v:W:-:")) != -1)
+	while ((flag = getopt(argc, argv, "B:bC:c:D:d:EeFf:h:ijk:lm:N:nOo:Pp:r:S:sTt:v:W:x:-:")) != -1)
 	{
 		switch (flag)
 		{
@@ -3934,6 +3934,23 @@ process_postgres_switches(int argc, char *argv[], GucContext ctx,
 				SetConfigOption("ssl", "true", ctx, gucsource);
 				break;
 
+			case 'm':
+				{
+					char	   *endptr;
+
+					errno = 0;
+					start_mxid = strtou64(optarg, &endptr, 0);
+
+					if (endptr == optarg || *endptr != '\0' || errno != 0 ||
+						!StartMultiXactIdIsValid(start_mxid))
+					{
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("invalid initial database cluster multixact id")));
+					}
+				}
+				break;
+
 			case 'N':
 				SetConfigOption("max_connections", optarg, ctx, gucsource);
 				break;
@@ -3946,6 +3963,23 @@ process_postgres_switches(int argc, char *argv[], GucContext ctx,
 				SetConfigOption("allow_system_table_mods", "true", ctx, gucsource);
 				break;
 
+			case 'o':
+				{
+					char	   *endptr;
+
+					errno = 0;
+					start_mxoff = strtou64(optarg, &endptr, 0);
+
+					if (endptr == optarg || *endptr != '\0' || errno != 0 ||
+						!StartMultiXactOffsetIsValid(start_mxoff))
+					{
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("invalid initial database cluster multixact offset")));
+					}
+				}
+				break;
+
 			case 'P':
 				SetConfigOption("ignore_system_indexes", "true", ctx, gucsource);
 				break;
@@ -4000,6 +4034,23 @@ process_postgres_switches(int argc, char *argv[], GucContext ctx,
 				SetConfigOption("post_auth_delay", optarg, ctx, gucsource);
 				break;
 
+			case 'x':
+				{
+					char	   *endptr;
+
+					errno = 0;
+					start_xid = strtou64(optarg, &endptr, 0);
+
+					if (endptr == optarg || *endptr != '\0' || errno != 0 ||
+						!StartTransactionIdIsValid(start_xid))
+					{
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("invalid initial database cluster xid")));
+					}
+				}
+				break;
+
 			default:
 				errs++;
 				break;
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 62bbd08d9f..c61b12de9c 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -169,6 +169,9 @@ static char *xlog_dir = NULL;
 static int	wal_segment_size_mb = (DEFAULT_XLOG_SEG_SIZE) / (1024 * 1024);
 static DataDirSyncMethod sync_method = DATA_DIR_SYNC_METHOD_FSYNC;
 static bool sync_data_files = true;
+static TransactionId start_xid = 0;
+static MultiXactId start_mxid = 0;
+static MultiXactOffset start_mxoff = 0;
 
 
 /* internal vars */
@@ -1586,6 +1589,11 @@ bootstrap_template1(void)
 	bki_lines = replace_token(bki_lines, "POSTGRES",
 							  escape_quotes_bki(username));
 
+	/* relfrozenxid must not be less than FirstNormalTransactionId */
+	sprintf(buf, "%u", Max(start_xid, 3));
+	bki_lines = replace_token(bki_lines, "RECENTXMIN",
+							  buf);
+
 	bki_lines = replace_token(bki_lines, "ENCODING",
 							  encodingid_to_string(encodingid));
 
@@ -1611,6 +1619,9 @@ bootstrap_template1(void)
 
 	printfPQExpBuffer(&cmd, "\"%s\" --boot %s %s", backend_exec, boot_options, extra_options);
 	appendPQExpBuffer(&cmd, " -X %d", wal_segment_size_mb * (1024 * 1024));
+	appendPQExpBuffer(&cmd, " -m %u", start_mxid);
+	appendPQExpBuffer(&cmd, " -o %" PRIu64, start_mxoff);
+	appendPQExpBuffer(&cmd, " -x %u", start_xid);
 	if (data_checksums)
 		appendPQExpBufferStr(&cmd, " -k");
 	if (debug)
@@ -2552,13 +2563,21 @@ usage(const char *progname)
 	printf(_("  -d, --debug               generate lots of debugging output\n"));
 	printf(_("      --discard-caches      set debug_discard_caches=1\n"));
 	printf(_("  -L DIRECTORY              where to find the input files\n"));
+	printf(_("  -m, --multixact-id=START_MXID\n"
+			 "                            set initial database cluster multixact id\n"
+			 "                            max value is 2^62-1\n"));
 	printf(_("  -n, --no-clean            do not clean up after errors\n"));
 	printf(_("  -N, --no-sync             do not wait for changes to be written safely to disk\n"));
 	printf(_("      --no-sync-data-files  do not sync files within database directories\n"));
 	printf(_("      --no-instructions     do not print instructions for next steps\n"));
+	printf(_("  -o, --multixact-offset=START_MXOFF\n"
+			 "                            set initial database cluster multixact offset\n"
+			 "                            max value is 2^62-1\n"));
 	printf(_("  -s, --show                show internal settings, then exit\n"));
 	printf(_("      --sync-method=METHOD  set method for syncing files to disk\n"));
 	printf(_("  -S, --sync-only           only sync database files to disk, then exit\n"));
+	printf(_("  -x, --xid=START_XID       set initial database cluster xid\n"
+			 "                            max value is 2^62-1\n"));
 	printf(_("\nOther options:\n"));
 	printf(_("  -V, --version             output version information, then exit\n"));
 	printf(_("  -?, --help                show this help, then exit\n"));
@@ -3093,6 +3112,18 @@ initialize_data_directory(void)
 	/* Now create all the text config files */
 	setup_config();
 
+	if (start_mxid != 0)
+		printf(_("selecting initial multixact id ... %u\n"),
+				 start_mxid);
+
+	if (start_mxoff != 0)
+		printf(_("selecting initial multixact offset ... %" PRIu64 "\n"),
+				 start_mxoff);
+
+	if (start_xid != 0)
+		printf(_("selecting initial xid ... %u\n"),
+				 start_xid);
+
 	/* Bootstrap template1 */
 	bootstrap_template1();
 
@@ -3109,8 +3140,12 @@ initialize_data_directory(void)
 	fflush(stdout);
 
 	initPQExpBuffer(&cmd);
-	printfPQExpBuffer(&cmd, "\"%s\" %s %s template1 >%s",
-					  backend_exec, backend_options, extra_options, DEVNULL);
+	printfPQExpBuffer(&cmd, "\"%s\" %s %s",
+					  backend_exec, backend_options, extra_options);
+	appendPQExpBuffer(&cmd, " -m %u", start_mxid);
+	appendPQExpBuffer(&cmd, " -o %" PRIu64, start_mxoff);
+	appendPQExpBuffer(&cmd, " -x %u", start_xid);
+	appendPQExpBuffer(&cmd, " template1 >%s", DEVNULL);
 
 	PG_CMD_OPEN(cmd.data);
 
@@ -3198,6 +3233,9 @@ main(int argc, char *argv[])
 		{"sync-method", required_argument, NULL, 19},
 		{"no-data-checksums", no_argument, NULL, 20},
 		{"no-sync-data-files", no_argument, NULL, 21},
+		{"xid", required_argument, NULL, 'x'},
+		{"multixact-id", required_argument, NULL, 'm'},
+		{"multixact-offset", required_argument, NULL, 'o'},
 		{NULL, 0, NULL, 0}
 	};
 
@@ -3239,7 +3277,7 @@ main(int argc, char *argv[])
 
 	/* process command-line options */
 
-	while ((c = getopt_long(argc, argv, "A:c:dD:E:gkL:nNsST:U:WX:",
+	while ((c = getopt_long(argc, argv, "A:c:dD:E:gkL:m:nNo:sST:U:Wx:X:",
 							long_options, &option_index)) != -1)
 	{
 		switch (c)
@@ -3297,6 +3335,30 @@ main(int argc, char *argv[])
 				debug = true;
 				printf(_("Running in debug mode.\n"));
 				break;
+			case 'm':
+				{
+					char	   *endptr;
+
+					errno = 0;
+					start_mxid = strtou64(optarg, &endptr, 0);
+
+					if (endptr == optarg || *endptr != '\0' || errno != 0 ||
+						!StartMultiXactIdIsValid(start_mxid))
+					{
+						pg_log_error("invalid initial database cluster multixact id");
+						exit(1);
+					}
+					else if (start_mxid < 1) /* FirstMultiXactId */
+					{
+						/*
+						 * Reject 0 explicitly rather than letting it be silently
+						 * replaced by FirstMultiXactId, harmless as that would be.
+						 */
+						pg_log_error("multixact id should be greater than 0");
+						exit(1);
+					}
+				}
+				break;
 			case 'n':
 				noclean = true;
 				printf(_("Running in no-clean mode.  Mistakes will not be cleaned up.\n"));
@@ -3304,6 +3366,21 @@ main(int argc, char *argv[])
 			case 'N':
 				do_sync = false;
 				break;
+			case 'o':
+				{
+					char	   *endptr;
+
+					errno = 0;
+					start_mxoff = strtou64(optarg, &endptr, 0);
+
+					if (endptr == optarg || *endptr != '\0' || errno != 0 ||
+						!StartMultiXactOffsetIsValid(start_mxoff))
+					{
+						pg_log_error("invalid initial database cluster multixact offset");
+						exit(1);
+					}
+				}
+				break;
 			case 'S':
 				sync_only = true;
 				break;
@@ -3395,6 +3472,30 @@ main(int argc, char *argv[])
 			case 21:
 				sync_data_files = false;
 				break;
+			case 'x':
+				{
+					char	   *endptr;
+
+					errno = 0;
+					start_xid = strtou64(optarg, &endptr, 0);
+
+					if (endptr == optarg || *endptr != '\0' || errno != 0 ||
+						!StartTransactionIdIsValid(start_xid))
+					{
+						pg_log_error("invalid value for initial database cluster xid");
+						exit(1);
+					}
+					else if (start_xid < 3) /* FirstNormalTransactionId */
+					{
+						/*
+						 * Reject values below 3 explicitly rather than letting them be
+						 * silently replaced by FirstNormalTransactionId, harmless as that would be.
+						 */
+						pg_log_error("xid should be greater than 2");
+						exit(1);
+					}
+				}
+				break;
 			default:
 				/* getopt_long already emitted a complaint */
 				pg_log_error_hint("Try \"%s --help\" for more information.", progname);
diff --git a/src/bin/initdb/t/001_initdb.pl b/src/bin/initdb/t/001_initdb.pl
index 15dd10ce40..668d0f108d 100644
--- a/src/bin/initdb/t/001_initdb.pl
+++ b/src/bin/initdb/t/001_initdb.pl
@@ -330,4 +330,64 @@ command_fails(
 	[ 'pg_checksums', '--pgdata' => $datadir_nochecksums ],
 	"pg_checksums fails with data checksum disabled");
 
+# Set non-standard initial mxid/mxoff/xid.
+command_fails_like(
+	[ 'initdb', '-m', 'seven', $datadir ],
+	qr/initdb: error: invalid initial database cluster multixact id/,
+	'fails for invalid initial database cluster multixact id');
+command_fails_like(
+	[ 'initdb', '-o', 'seven', $datadir ],
+	qr/initdb: error: invalid initial database cluster multixact offset/,
+	'fails for invalid initial database cluster multixact offset');
+command_fails_like(
+	[ 'initdb', '-x', 'seven', $datadir ],
+	qr/initdb: error: invalid value for initial database cluster xid/,
+	'fails for invalid initial database cluster xid');
+
+command_checks_all(
+	[ 'initdb', '-m', '65535', "$tempdir/data-m65535" ],
+	0,
+	[qr/selecting initial multixact id ... 65535/],
+	[],
+	'selecting initial multixact id');
+command_checks_all(
+	[ 'initdb', '-o', '65535', "$tempdir/data-o65535" ],
+	0,
+	[qr/selecting initial multixact offset ... 65535/],
+	[],
+	'selecting initial multixact offset');
+command_checks_all(
+	[ 'initdb', '-x', '65535', "$tempdir/data-x65535" ],
+	0,
+	[qr/selecting initial xid ... 65535/],
+	[],
+	'selecting initial xid');
+
+# Setup new cluster with given mxid/mxoff/xid.
+my $node;
+my $result;
+
+$node = PostgreSQL::Test::Cluster->new('test-mxid');
+$node->init(extra => ['-m', '16777215']); # 0xFFFFFF
+$node->start;
+$result = $node->safe_psql('postgres', "SELECT next_multixact_id FROM pg_control_checkpoint();");
+ok($result >= 16777215, 'setup cluster with given mxid');
+$node->stop;
+
+$node = PostgreSQL::Test::Cluster->new('test-mxoff');
+$node->init(extra => ['-o', '16777215']); # 0xFFFFFF
+$node->start;
+$result = $node->safe_psql('postgres', "SELECT next_multi_offset FROM pg_control_checkpoint();");
+ok($result >= 16777215, 'setup cluster with given mxoff');
+$node->stop;
+
+$node = PostgreSQL::Test::Cluster->new('test-xid');
+$node->init(extra => ['-x', '16777215']); # 0xFFFFFF
+$node->start;
+$result = $node->safe_psql('postgres', "SELECT txid_current();");
+ok($result >= 16777215, 'setup cluster with given xid - check 1');
+$result = $node->safe_psql('postgres', "SELECT oldest_xid FROM pg_control_checkpoint();");
+ok($result >= 16777215, 'setup cluster with given xid - check 2');
+$node->stop;
+
 done_testing();
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index d313099c02..23b8dd0375 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -94,6 +94,9 @@ typedef enum RecoveryState
 } RecoveryState;
 
 extern PGDLLIMPORT int wal_level;
+extern PGDLLIMPORT TransactionId start_xid;
+extern PGDLLIMPORT MultiXactId start_mxid;
+extern PGDLLIMPORT MultiXactOffset start_mxoff;
 
 /* Is WAL archiving enabled (always or only while server is running normally)? */
 #define XLogArchivingActive() \
diff --git a/src/include/c.h b/src/include/c.h
index 45ba4c523f..270cc71f7c 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -636,6 +636,10 @@ typedef uint64 MultiXactOffset;
 
 typedef uint32 CommandId;
 
+#define StartTransactionIdIsValid(xid)			((xid) <= 0xFFFFFFFF)
+#define StartMultiXactIdIsValid(mxid)			((mxid) <= 0xFFFFFFFF)
+#define StartMultiXactOffsetIsValid(offset)		((offset) <= 0xFFFFFFFF)
+
 #define FirstCommandId	((CommandId) 0)
 #define InvalidCommandId	(~(CommandId)0)
 
diff --git a/src/include/catalog/pg_class.h b/src/include/catalog/pg_class.h
index 07d182da79..06e9969862 100644
--- a/src/include/catalog/pg_class.h
+++ b/src/include/catalog/pg_class.h
@@ -126,7 +126,7 @@ CATALOG(pg_class,1259,RelationRelationId) BKI_BOOTSTRAP BKI_ROWTYPE_OID(83,Relat
 	Oid			relrewrite BKI_DEFAULT(0) BKI_LOOKUP_OPT(pg_class);
 
 	/* all Xids < this are frozen in this rel */
-	TransactionId relfrozenxid BKI_DEFAULT(3);	/* FirstNormalTransactionId */
+	TransactionId relfrozenxid BKI_DEFAULT(RECENTXMIN);	/* FirstNormalTransactionId */
 
 	/* all multixacts in this rel are >= this; it is really a MultiXactId */
 	TransactionId relminmxid BKI_DEFAULT(1);	/* FirstMultiXactId */
-- 
2.43.0
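
The -m, -o and -x handling in initdb above follows one pattern: parse with strtou64, reject trailing garbage, then range-check. Below is a minimal standalone sketch of that pattern, not code from the patch. The helper name parse_start_value and the constant START_VALUE_MAX are made up for illustration, plain strtoull stands in for strtou64, and the upper bound is taken from the Start*IsValid macros added to c.h. The sketch checks the range while the value is still 64 bits wide, and it also rejects a leading minus sign, which strtoull would otherwise accept and negate.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Upper bound used by the Start*IsValid macros in c.h. */
#define START_VALUE_MAX UINT64_C(0xFFFFFFFF)

/*
 * Hypothetical helper, not part of the patch: parse an initial counter
 * value and accept it only if it lies in [min, START_VALUE_MAX].
 */
static bool
parse_start_value(const char *arg, uint64_t min, uint64_t *result)
{
	char	   *endptr;
	unsigned long long val;

	assert(arg != NULL && result != NULL);

	/* strtoull silently negates "-N", so refuse any minus sign up front. */
	if (strchr(arg, '-') != NULL)
		return false;

	errno = 0;
	val = strtoull(arg, &endptr, 0);

	/* No digits consumed, trailing garbage, or overflow. */
	if (endptr == arg || *endptr != '\0' || errno != 0)
		return false;

	/* Range check on the full 64-bit value, before any narrowing. */
	if (val < min || val > START_VALUE_MAX)
		return false;

	*result = (uint64_t) val;
	return true;
}
```

With this shape, "-x" would pass min = 3 (FirstNormalTransactionId), "-m" would pass min = 1 (FirstMultiXactId), and "-o" would pass min = 0.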

v15-0003-Make-pg_upgrade-convert-multixact-offsets.patch (application/octet-stream)
From bde44bf4f2bc265dd1c4c46383f5ce729f37d46d Mon Sep 17 00:00:00 2001
From: Maxim Orlov <m.orlov@postgrespro.ru>
Date: Tue, 13 Aug 2024 14:44:50 +0300
Subject: [PATCH v15 3/7] Make pg_upgrade convert multixact offsets.

Author: Maxim Orlov <orlovmg@gmail.com>
Author: Yura Sokolov <y.sokolov@postgrespro.ru>
---
 src/bin/pg_upgrade/Makefile     |   1 +
 src/bin/pg_upgrade/meson.build  |   1 +
 src/bin/pg_upgrade/pg_upgrade.c |  40 ++-
 src/bin/pg_upgrade/pg_upgrade.h |  14 +-
 src/bin/pg_upgrade/segresize.c  | 527 ++++++++++++++++++++++++++++++++
 5 files changed, 579 insertions(+), 4 deletions(-)
 create mode 100644 src/bin/pg_upgrade/segresize.c
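
For readers following the conversion in segresize.c, here is a rough standalone sketch of the two calculations it relies on: locating a multixact's entry in the offsets SLRU for a given entry width, and rebasing an old 32-bit offset so that the new 64-bit offsets start from 1. The names SlruPos, locate_multi and rebase_offset are illustrative only and do not appear in the patch, and BLCKSZ is assumed to be the default 8192.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BLCKSZ					8192	/* assumption: default block size */
#define SLRU_PAGES_PER_SEGMENT	32		/* same value as in segresize.c */

/* Illustrative type, not part of the patch. */
typedef struct SlruPos
{
	uint64_t	segno;		/* segment file number */
	uint64_t	pageno;		/* page within the segment */
	uint64_t	entry;		/* entry within the page */
} SlruPos;

/* Locate a multixact's offsets entry for a given entry width in bytes. */
static SlruPos
locate_multi(uint64_t multi, size_t entry_size)
{
	SlruPos		pos;
	uint64_t	per_page;

	assert(entry_size > 0 && BLCKSZ % entry_size == 0);
	per_page = BLCKSZ / entry_size;

	pos.entry = multi % per_page;
	pos.pageno = multi / per_page;
	pos.segno = pos.pageno / SLRU_PAGES_PER_SEGMENT;
	pos.pageno %= SLRU_PAGES_PER_SEGMENT;
	return pos;
}

/*
 * Rebase an old 32-bit offset against the oldest offset.  Offset 0 is
 * invalid, so a wrapped-around old offset of 1 corresponds to 2^32.
 */
static uint64_t
rebase_offset(uint32_t old_offset, uint64_t oldest_offset)
{
	uint64_t	offset = old_offset;

	assert(oldest_offset <= UINT32_MAX);

	if (offset < oldest_offset)
		offset += (UINT64_C(1) << 32) - 1;

	return offset - oldest_offset + 1;
}
```

Because the entry width doubles from 4 to 8 bytes, the same multixact id lands on a different page and usually a different segment in the new layout, which is why the offsets files are rewritten rather than copied. The adjustment of chkpnt_nxtmxoff in pg_upgrade.c applies the same rebasing arithmetic to the next offset.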

diff --git a/src/bin/pg_upgrade/Makefile b/src/bin/pg_upgrade/Makefile
index f83d2b5d30..70908d63a3 100644
--- a/src/bin/pg_upgrade/Makefile
+++ b/src/bin/pg_upgrade/Makefile
@@ -21,6 +21,7 @@ OBJS = \
 	info.o \
 	option.o \
 	parallel.o \
+	segresize.o \
 	pg_upgrade.o \
 	relfilenumber.o \
 	server.o \
diff --git a/src/bin/pg_upgrade/meson.build b/src/bin/pg_upgrade/meson.build
index ac992f0d14..a8b1d77c2d 100644
--- a/src/bin/pg_upgrade/meson.build
+++ b/src/bin/pg_upgrade/meson.build
@@ -10,6 +10,7 @@ pg_upgrade_sources = files(
   'info.c',
   'option.c',
   'parallel.c',
+  'segresize.c',
   'pg_upgrade.c',
   'relfilenumber.c',
   'server.c',
diff --git a/src/bin/pg_upgrade/pg_upgrade.c b/src/bin/pg_upgrade/pg_upgrade.c
index 536e49d261..7fae173016 100644
--- a/src/bin/pg_upgrade/pg_upgrade.c
+++ b/src/bin/pg_upgrade/pg_upgrade.c
@@ -793,8 +793,42 @@ copy_xact_xlog_xid(void)
 	if (old_cluster.controldata.cat_ver >= MULTIXACT_FORMATCHANGE_CAT_VER &&
 		new_cluster.controldata.cat_ver >= MULTIXACT_FORMATCHANGE_CAT_VER)
 	{
-		copy_subdir_files("pg_multixact/offsets", "pg_multixact/offsets");
-		copy_subdir_files("pg_multixact/members", "pg_multixact/members");
+		/*
+		 * If the old cluster predates MULTIXACTOFFSET_FORMATCHANGE_CAT_VER,
+		 * it has 32-bit multixact offsets, which must be converted.
+		 */
+		if (old_cluster.controldata.cat_ver < MULTIXACTOFFSET_FORMATCHANGE_CAT_VER &&
+			new_cluster.controldata.cat_ver >= MULTIXACTOFFSET_FORMATCHANGE_CAT_VER)
+		{
+			MultiXactOffset		oldest_offset,
+								next_offset;
+
+			remove_new_subdir("pg_multixact/offsets", false);
+			prep_status("Converting pg_multixact/offsets to 64-bit");
+			oldest_offset = convert_multixact_offsets();
+			check_ok();
+
+			remove_new_subdir("pg_multixact/members", false);
+			prep_status("Converting pg_multixact/members");
+			convert_multixact_members(oldest_offset);
+			check_ok();
+
+			next_offset = old_cluster.controldata.chkpnt_nxtmxoff;
+			if (oldest_offset)
+			{
+				if (next_offset < oldest_offset)
+					next_offset += ((MultiXactOffset) 1 << 32) - 1;
+
+				next_offset -= oldest_offset - 1;
+
+				old_cluster.controldata.chkpnt_nxtmxoff = next_offset;
+			}
+		}
+		else
+		{
+			copy_subdir_files("pg_multixact/offsets", "pg_multixact/offsets");
+			copy_subdir_files("pg_multixact/members", "pg_multixact/members");
+		}
 
 		prep_status("Setting next multixact ID and offset for new cluster");
 
@@ -803,7 +837,7 @@ copy_xact_xlog_xid(void)
 		 * counters here and the oldest multi present on system.
 		 */
 		exec_prog(UTILITY_LOG_FILE, NULL, true, true,
-				  "\"%s/pg_resetwal\" -O %u -m %u,%u \"%s\"",
+				  "\"%s/pg_resetwal\" -O %" PRIu64 " -m %u,%u \"%s\"",
 				  new_cluster.bindir,
 				  old_cluster.controldata.chkpnt_nxtmxoff,
 				  old_cluster.controldata.chkpnt_nxtmulti,
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index 69c965bb7d..2d4f1d39e5 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -114,6 +114,13 @@ extern char *output_files[];
  */
 #define MULTIXACT_FORMATCHANGE_CAT_VER 201301231
 
+/*
+ * Switch from 32-bit to 64-bit for multixid offsets.
+ *
+ * XXX: should be changed to the actual CATALOG_VERSION_NO on commit.
+ */
+#define MULTIXACTOFFSET_FORMATCHANGE_CAT_VER 202409041
+
 /*
  * large object chunk size added to pg_controldata,
  * commit 5f93c37805e7485488480916b4585e098d3cc883
@@ -235,7 +242,7 @@ typedef struct
 	uint32		chkpnt_nxtepoch;
 	uint32		chkpnt_nxtoid;
 	uint32		chkpnt_nxtmulti;
-	uint32		chkpnt_nxtmxoff;
+	uint64		chkpnt_nxtmxoff;
 	uint32		chkpnt_oldstMulti;
 	uint32		chkpnt_oldstxid;
 	uint32		align;
@@ -526,3 +533,8 @@ typedef struct
 	FILE	   *file;
 	char		path[MAXPGPATH];
 } UpgradeTaskReport;
+
+/* segresize.c */
+
+MultiXactOffset		convert_multixact_offsets(void);
+void				convert_multixact_members(MultiXactOffset oldest_offset);
diff --git a/src/bin/pg_upgrade/segresize.c b/src/bin/pg_upgrade/segresize.c
new file mode 100644
index 0000000000..7f80d0652a
--- /dev/null
+++ b/src/bin/pg_upgrade/segresize.c
@@ -0,0 +1,527 @@
+/*
+ *	segresize.c
+ *
+ *	SLRU segment resize utility
+ *
+ *	Copyright (c) 2024, PostgreSQL Global Development Group
+ *	src/bin/pg_upgrade/segresize.c
+ */
+
+#include "postgres_fe.h"
+
+#include "pg_upgrade.h"
+#include "access/multixact.h"
+
+/* See slru.h */
+#define SLRU_PAGES_PER_SEGMENT		32
+
+/*
+ * Some kind of iterator associated with a particular SLRU segment.  The idea is
+ * to specify the segment and page number and then move through the pages.
+ */
+typedef struct SlruSegState
+{
+	char	   *dir;
+	char	   *fn;
+	FILE	   *file;
+	int64		segno;
+	uint64		pageno;
+	bool		leading_gap;
+} SlruSegState;
+
+/*
+ * Mirrors the SlruFileName from slru.c
+ */
+static inline char *
+SlruFileName(SlruSegState *state)
+{
+	Assert(state->segno >= 0 && state->segno <= INT64CONST(0xFFFFFF));
+	return psprintf("%s/%04X", state->dir, (unsigned int) state->segno);
+}
+
+/*
+ * Create new SLRU segment file.
+ */
+static void
+create_segment(SlruSegState *state)
+{
+	Assert(state->fn == NULL);
+	Assert(state->file == NULL);
+
+	state->fn = SlruFileName(state);
+	state->file = fopen(state->fn, "wb");
+	if (!state->file)
+		pg_fatal("could not create file \"%s\": %m", state->fn);
+}
+
+/*
+ * Open existing SLRU segment file.
+ */
+static void
+open_segment(SlruSegState *state)
+{
+	Assert(state->fn == NULL);
+	Assert(state->file == NULL);
+
+	state->fn = SlruFileName(state);
+	state->file = fopen(state->fn, "rb");
+	if (!state->file)
+		pg_fatal("could not open file \"%s\": %m", state->fn);
+}
+
+/*
+ * Close SLRU segment file.
+ */
+static void
+close_segment(SlruSegState *state)
+{
+	if (state->file)
+	{
+		fclose(state->file);
+		state->file = NULL;
+	}
+
+	if (state->fn)
+	{
+		pfree(state->fn);
+		state->fn = NULL;
+	}
+}
+
+/*
+ * Read next page from the old 32-bit offset segment file.
+ */
+static int
+read_old_segment_page(SlruSegState *state, void *buf, bool *empty)
+{
+	int		len;
+
+	/* Open next segment file, if needed. */
+	if (!state->fn)
+	{
+		if (!state->segno)
+			state->leading_gap = true;
+
+		open_segment(state);
+
+		/* Set position to the needed page. */
+		if (state->pageno > 0 &&
+			fseek(state->file, state->pageno * BLCKSZ, SEEK_SET))
+		{
+			close_segment(state);
+		}
+	}
+
+	if (state->file)
+	{
+		/* Segment file exists, read a page from it. */
+		state->leading_gap = false;
+
+		len = fread(buf, sizeof(char), BLCKSZ, state->file);
+
+		/* Are we done or was there an error? */
+		if (len <= 0)
+		{
+			if (ferror(state->file))
+				pg_fatal("error reading file \"%s\": %m", state->fn);
+
+			if (feof(state->file))
+			{
+				*empty = true;
+				len = -1;
+
+				close_segment(state);
+			}
+		}
+		else
+			*empty = false;
+	}
+	else if (!state->leading_gap)
+	{
+		/* We reached the last segment. */
+		len = -1;
+		*empty = true;
+	}
+	else
+	{
+		/* Skip the first few segments if they were frozen and removed. */
+		len = BLCKSZ;
+		*empty = true;
+	}
+
+	if (++state->pageno >= SLRU_PAGES_PER_SEGMENT)
+	{
+		/* Start a new segment. */
+		state->segno++;
+		state->pageno = 0;
+
+		close_segment(state);
+	}
+
+	return len;
+}
+
+/*
+ * Write next page to the new 64-bit offset segment file.
+ */
+static void
+write_new_segment_page(SlruSegState *state, void *buf)
+{
+	/*
+	 * Create a new segment file if we have not done so yet.  Creation is
+	 * postponed until the first non-empty page is found, which avoids
+	 * creating completely empty segments.
+	 */
+	if (!state->file)
+	{
+		create_segment(state);
+
+		/* Write zeroes to the previously skipped prefix. */
+		if (state->pageno > 0)
+		{
+			char		zerobuf[BLCKSZ] = {0};
+
+			for (int64 i = 0; i < state->pageno; i++)
+			{
+				if (fwrite(zerobuf, sizeof(char), BLCKSZ, state->file) != BLCKSZ)
+					pg_fatal("could not write file \"%s\": %m", state->fn);
+			}
+		}
+	}
+
+	/* Write page to the new segment (if it was created). */
+	if (state->file)
+	{
+		if (fwrite(buf, sizeof(char), BLCKSZ, state->file) != BLCKSZ)
+			pg_fatal("could not write file \"%s\": %m", state->fn);
+	}
+
+	/*
+	 * Did we reach the maximum page number?  Then close segment file
+	 * and create a new one on the next iteration.
+	 */
+	if (++state->pageno >= SLRU_PAGES_PER_SEGMENT)
+	{
+		/* Start a new segment. */
+		state->segno++;
+		state->pageno = 0;
+
+		close_segment(state);
+	}
+}
+
+typedef uint32 MultiXactOffsetOld;
+
+#define MaxMultiXactOffsetOld	((MultiXactOffsetOld) 0xFFFFFFFF)
+
+#define MULTIXACT_OFFSETS_PER_PAGE_OLD (BLCKSZ / sizeof(MultiXactOffsetOld))
+#define MULTIXACT_OFFSETS_PER_PAGE_NEW (BLCKSZ / sizeof(MultiXactOffset))
+
+/*
+ * Convert pg_multixact/offsets segments and return oldest multi offset.
+ */
+MultiXactOffset
+convert_multixact_offsets(void)
+{
+	SlruSegState		oldseg = {0},
+						newseg = {0};
+	MultiXactOffsetOld	oldbuf[MULTIXACT_OFFSETS_PER_PAGE_OLD] = {0};
+	MultiXactOffset		newbuf[MULTIXACT_OFFSETS_PER_PAGE_NEW] = {0},
+						oldest_offset = 0;
+	uint64				oldest_multi = old_cluster.controldata.chkpnt_oldstMulti,
+						next_multi = old_cluster.controldata.chkpnt_nxtmulti,
+						multi,
+						old_entry,
+						new_entry;
+	bool				oldest_offset_known = false;
+
+	oldseg.dir = psprintf("%s/pg_multixact/offsets", old_cluster.pgdata);
+	newseg.dir = psprintf("%s/pg_multixact/offsets", new_cluster.pgdata);
+
+	old_entry = oldest_multi % MULTIXACT_OFFSETS_PER_PAGE_OLD;
+	oldseg.pageno = oldest_multi / MULTIXACT_OFFSETS_PER_PAGE_OLD;
+	oldseg.segno = oldseg.pageno / SLRU_PAGES_PER_SEGMENT;
+	oldseg.pageno %= SLRU_PAGES_PER_SEGMENT;
+
+	new_entry = oldest_multi % MULTIXACT_OFFSETS_PER_PAGE_NEW;
+	newseg.pageno = oldest_multi / MULTIXACT_OFFSETS_PER_PAGE_NEW;
+	newseg.segno = newseg.pageno / SLRU_PAGES_PER_SEGMENT;
+	newseg.pageno %= SLRU_PAGES_PER_SEGMENT;
+
+	if (next_multi < oldest_multi)
+		next_multi += (uint64) 1 << 32;	/* wraparound */
+
+	/* Copy multi offsets reading only needed segment pages */
+	for (multi = oldest_multi; multi < next_multi; old_entry = 0)
+	{
+		int		oldlen;
+		bool	empty;
+
+		/* Handle possible segment wraparound */
+#define OLD_OFFSET_SEGNO_MAX	\
+	(MaxMultiXactId / MULTIXACT_OFFSETS_PER_PAGE_OLD / SLRU_PAGES_PER_SEGMENT)
+		if (oldseg.segno > OLD_OFFSET_SEGNO_MAX)
+		{
+			oldseg.segno = 0;
+			oldseg.pageno = 0;
+		}
+
+		oldlen = read_old_segment_page(&oldseg, oldbuf, &empty);
+		if (empty || oldlen != BLCKSZ)
+			pg_fatal("cannot read page %" PRIu64 " from file \"%s\": %m",
+					 oldseg.pageno, oldseg.fn);
+
+		/* Save oldest multi offset */
+		if (!oldest_offset_known)
+		{
+			oldest_offset = oldbuf[old_entry];
+			oldest_offset_known = true;
+		}
+
+		/* Skip wrapped-around invalid MultiXactIds */
+		if (multi == (uint64) 1 << 32)
+		{
+			Assert(oldseg.segno == 0);
+			Assert(oldseg.pageno == 1);
+			Assert(old_entry == 0);
+			Assert(new_entry == 0);
+
+			multi += FirstMultiXactId;
+			old_entry = FirstMultiXactId;
+			new_entry = FirstMultiXactId;
+		}
+
+		/* Copy entries to the new page */
+		for (; multi < next_multi && old_entry < MULTIXACT_OFFSETS_PER_PAGE_OLD;
+			 multi++, old_entry++)
+		{
+			MultiXactOffset offset = oldbuf[old_entry];
+
+			/* Handle possible offset wraparound (1 becomes 2^32) */
+			if (offset < oldest_offset)
+				offset += ((uint64) 1 << 32) - 1;
+
+			/* Subtract oldest_offset, so new offsets will start from 1 */
+			newbuf[new_entry++] = offset - oldest_offset + 1;
+
+			if (new_entry >= MULTIXACT_OFFSETS_PER_PAGE_NEW)
+			{
+				/* Handle possible segment wraparound */
+#define NEW_OFFSET_SEGNO_MAX	\
+	(MaxMultiXactId / MULTIXACT_OFFSETS_PER_PAGE_NEW / SLRU_PAGES_PER_SEGMENT)
+				if (newseg.segno > NEW_OFFSET_SEGNO_MAX)
+				{
+					newseg.segno = 0;
+					newseg.pageno = 0;
+				}
+
+				/* Write new page */
+				write_new_segment_page(&newseg, newbuf);
+				new_entry = 0;
+			}
+		}
+	}
+
+	/* Write the last incomplete page */
+	if (new_entry > 0 || oldest_multi == next_multi)
+	{
+		memset(&newbuf[new_entry], 0,
+			   sizeof(newbuf[0]) * (MULTIXACT_OFFSETS_PER_PAGE_NEW - new_entry));
+		write_new_segment_page(&newseg, newbuf);
+	}
+
+	/* Use next_offset as oldest_offset, if oldest_multi == next_multi */
+	if (!oldest_offset_known)
+	{
+		Assert(oldest_multi == next_multi);
+		oldest_offset = (MultiXactOffset) old_cluster.controldata.chkpnt_nxtmxoff;
+	}
+
+	/* Release resources */
+	close_segment(&oldseg);
+	close_segment(&newseg);
+
+	pfree(oldseg.dir);
+	pfree(newseg.dir);
+
+	return oldest_offset;
+}
+
+#define MXACT_MEMBERS_FLAG_BYTES			1
+
+#define MULTIXACT_MEMBERS_PER_GROUP			4
+#define MULTIXACT_MEMBERGROUP_SIZE			\
+	(MULTIXACT_MEMBERS_PER_GROUP * (sizeof(TransactionId) + MXACT_MEMBERS_FLAG_BYTES))
+#define MULTIXACT_MEMBERGROUPS_PER_PAGE		\
+	(BLCKSZ / MULTIXACT_MEMBERGROUP_SIZE)
+
+#define MULTIXACT_MEMBERS_PER_PAGE				\
+	(MULTIXACT_MEMBERS_PER_GROUP * MULTIXACT_MEMBERGROUPS_PER_PAGE)
+#define MULTIXACT_MEMBER_FLAG_BYTES_PER_GROUP	\
+	(MXACT_MEMBERS_FLAG_BYTES * MULTIXACT_MEMBERS_PER_GROUP)
+
+typedef struct MultiXactMembersCtx
+{
+	SlruSegState	seg;
+	char			buf[BLCKSZ];
+	int				group;
+	int				member;
+	char		   *flag;
+	TransactionId  *xid;
+} MultiXactMembersCtx;
+
+static void
+MultiXactMembersCtxInit(MultiXactMembersCtx *ctx)
+{
+	ctx->seg.dir = psprintf("%s/pg_multixact/members", new_cluster.pgdata);
+
+	ctx->group = 0;
+	ctx->member = 1;		/* skip invalid zero offset */
+
+	ctx->flag = (char *) ctx->buf + ctx->group * MULTIXACT_MEMBERGROUP_SIZE;
+	ctx->xid = (TransactionId *)(ctx->flag + MXACT_MEMBERS_FLAG_BYTES * MULTIXACT_MEMBERS_PER_GROUP);
+
+	ctx->flag += ctx->member;
+	ctx->xid += ctx->member;
+}
+
+static void
+MultiXactMembersCtxAdd(MultiXactMembersCtx *ctx, char flag, TransactionId xid)
+{
+	/* Copy member's xid and flags to the new page */
+	*ctx->flag++ = flag;
+	*ctx->xid++ = xid;
+
+	if (++ctx->member < MULTIXACT_MEMBERS_PER_GROUP)
+		return;
+
+	/* Start next member group */
+	ctx->member = 0;
+
+	if (++ctx->group >= MULTIXACT_MEMBERGROUPS_PER_PAGE)
+	{
+		/* Write current page and start new */
+		write_new_segment_page(&ctx->seg, ctx->buf);
+
+		ctx->group = 0;
+		memset(ctx->buf, 0, BLCKSZ);
+	}
+
+	ctx->flag = (char *) ctx->buf + ctx->group * MULTIXACT_MEMBERGROUP_SIZE;
+	ctx->xid = (TransactionId *)(ctx->flag + MXACT_MEMBERS_FLAG_BYTES * MULTIXACT_MEMBERS_PER_GROUP);
+}
+
+static void
+MultiXactMembersCtxFinit(MultiXactMembersCtx *ctx)
+{
+	if (ctx->flag > (char *) ctx->buf)
+		write_new_segment_page(&ctx->seg, ctx->buf);
+
+	close_segment(&ctx->seg);
+
+	pfree(ctx->seg.dir);
+}
+
+/*
+ * Convert pg_multixact/members segments; the new offsets start from 1.
+ *
+ */
+void
+convert_multixact_members(MultiXactOffset oldest_offset)
+{
+	MultiXactOffset			next_offset,
+							offset;
+	SlruSegState			oldseg = {0};
+	char					oldbuf[BLCKSZ] = {0};
+	int						oldidx;
+	MultiXactMembersCtx		newctx = {0};
+
+	oldseg.dir = psprintf("%s/pg_multixact/members", old_cluster.pgdata);
+
+	next_offset = (MultiXactOffset) old_cluster.controldata.chkpnt_nxtmxoff;
+	if (next_offset < oldest_offset)
+		next_offset += ((uint64) 1 << 32) - 1;
+
+	/* Initialize the old starting position */
+	oldseg.pageno = oldest_offset / MULTIXACT_MEMBERS_PER_PAGE;
+	oldseg.segno = oldseg.pageno / SLRU_PAGES_PER_SEGMENT;
+	oldseg.pageno %= SLRU_PAGES_PER_SEGMENT;
+
+	/* Initialize new starting position */
+	MultiXactMembersCtxInit(&newctx);
+
+	/* Iterate through the original directory */
+	oldidx = oldest_offset % MULTIXACT_MEMBERS_PER_PAGE;
+	for (offset = oldest_offset; offset < next_offset;)
+	{
+		bool	empty;
+		int		oldlen;
+		int		ngroups;
+		int		oldgroup;
+		int		oldmember;
+
+		oldlen = read_old_segment_page(&oldseg, oldbuf, &empty);
+		if (empty || oldlen != BLCKSZ)
+			pg_fatal("cannot read page %" PRIu64 " from file \"%s\": %m",
+					 oldseg.pageno, oldseg.fn);
+
+		/* Iterate through the old member groups */
+		ngroups = oldlen / MULTIXACT_MEMBERGROUP_SIZE;
+		oldmember = oldidx % MULTIXACT_MEMBERS_PER_GROUP;
+		oldgroup = oldidx / MULTIXACT_MEMBERS_PER_GROUP;
+		while (oldgroup < ngroups && offset < next_offset)
+		{
+			char		   *oldflag;
+			TransactionId  *oldxid;
+			int				i;
+
+			oldflag = (char *) oldbuf + oldgroup * MULTIXACT_MEMBERGROUP_SIZE;
+			oldxid = (TransactionId *)(oldflag + MULTIXACT_MEMBER_FLAG_BYTES_PER_GROUP);
+
+			oldxid += oldmember;
+			oldflag += oldmember;
+
+			/* Iterate through the old members */
+			for (i = oldmember;
+				 i < MULTIXACT_MEMBERS_PER_GROUP && offset < next_offset;
+				 i++)
+			{
+				MultiXactMembersCtxAdd(&newctx, *oldflag++, *oldxid++);
+
+				if (++offset == (uint64) 1 << 32)
+				{
+					Assert(i == MaxMultiXactOffsetOld % MULTIXACT_MEMBERS_PER_GROUP);
+					goto wraparound;
+				}
+			}
+
+			oldgroup++;
+			oldmember = 0;
+		}
+
+		oldidx = 0;
+
+		continue;
+
+wraparound:
+#define SEGNO_MAX	(MaxMultiXactOffsetOld / MULTIXACT_MEMBERS_PER_PAGE / SLRU_PAGES_PER_SEGMENT)
+#define PAGENO_MAX	(MaxMultiXactOffsetOld / MULTIXACT_MEMBERS_PER_PAGE % SLRU_PAGES_PER_SEGMENT)
+		Assert((oldseg.segno == SEGNO_MAX && oldseg.pageno == PAGENO_MAX + 1) ||
+			   (oldseg.segno == SEGNO_MAX + 1 && oldseg.pageno == 0));
+
+		/* Switch to segment 0000 */
+		close_segment(&oldseg);
+		oldseg.segno = 0;
+		oldseg.pageno = 0;
+
+		/* skip invalid zero multi offset */
+		oldidx = 1;
+	}
+
+	MultiXactMembersCtxFinit(&newctx);
+
+	/* Release resources */
+	close_segment(&oldseg);
+
+	pfree(oldseg.dir);
+}
-- 
2.43.0
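
The members conversion above depends on the on-disk group layout: each group stores four flag bytes followed by four 32-bit xids. As a rough illustration only, the sketch below computes where a given member offset lives on its page. The names MemberPos and member_position are hypothetical, the macros are shortened versions of the ones in segresize.c, and BLCKSZ is assumed to be the default 8192.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BLCKSZ					8192	/* assumption: default block size */
#define FLAG_BYTES				1
#define MEMBERS_PER_GROUP		4
#define XID_BYTES				sizeof(uint32_t)	/* TransactionId is 32 bits */
#define MEMBERGROUP_SIZE		(MEMBERS_PER_GROUP * (XID_BYTES + FLAG_BYTES))
#define MEMBERGROUPS_PER_PAGE	(BLCKSZ / MEMBERGROUP_SIZE)
#define MEMBERS_PER_PAGE		(MEMBERS_PER_GROUP * MEMBERGROUPS_PER_PAGE)

/* Illustrative type, not part of the patch. */
typedef struct MemberPos
{
	uint64_t	pageno;		/* absolute page number in the members SLRU */
	size_t		flag_off;	/* byte offset of the flag within the page */
	size_t		xid_off;	/* byte offset of the xid within the page */
} MemberPos;

static MemberPos
member_position(uint64_t offset)
{
	MemberPos	pos;
	uint64_t	idx = offset % MEMBERS_PER_PAGE;
	size_t		group = (size_t) (idx / MEMBERS_PER_GROUP);
	size_t		member = (size_t) (idx % MEMBERS_PER_GROUP);
	size_t		group_start = group * MEMBERGROUP_SIZE;

	pos.pageno = offset / MEMBERS_PER_PAGE;

	/* Flags come first in the group, one byte per member. */
	pos.flag_off = group_start + member * FLAG_BYTES;

	/* The xids follow the group's flag bytes. */
	pos.xid_off = group_start + FLAG_BYTES * MEMBERS_PER_GROUP +
		member * XID_BYTES;

	assert(pos.xid_off + XID_BYTES <= BLCKSZ);
	return pos;
}
```

Offset 1, the first valid one, maps to flag byte 1 and xid byte 8 of page 0, which matches the position MultiXactMembersCtxInit starts from after skipping the invalid zero offset.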

v15-0004-Get-rid-of-MultiXactMemberFreezeThreshold-call.patch (application/octet-stream)
From cfbcf575ede1eb1c55950fe64cbeb1216b5eb1ef Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Wed, 23 Oct 2024 18:23:39 +0300
Subject: [PATCH v15 4/7] Get rid of MultiXactMemberFreezeThreshold call.

Since MaxMultiXactOffset is UINT64_MAX now, the MULTIXACT_MEMBER_SAFE_THRESHOLD
and MULTIXACT_MEMBER_DANGER_THRESHOLD values are no longer meaningful. Thus,
MultiXactMemberFreezeThreshold is not needed either.

Instead, switch to a MULTIXACT_MEMBER_AUTOVAC_THRESHOLD of 2^32-1 members. It
is used to determine whether we need to force autovacuum.

Author: Maxim Orlov <orlovmg@gmail.com>
---
 src/backend/access/transam/multixact.c | 117 +++----------------------
 src/backend/commands/vacuum.c          |   2 +-
 src/backend/postmaster/autovacuum.c    |   4 +-
 src/include/access/multixact.h         |   1 -
 4 files changed, 15 insertions(+), 109 deletions(-)
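
To make the new rule concrete, here is a small standalone sketch of the decision that SetOffsetVacuumLimit returns with this patch applied. The function name need_member_autovacuum is made up for illustration, and the assertion encodes an assumption of this sketch (64-bit offsets do not wrap in practice) that the patch itself does not check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Same value as MULTIXACT_MEMBER_AUTOVAC_THRESHOLD in the patch. */
#define MEMBER_AUTOVAC_THRESHOLD UINT64_C(0xFFFFFFFF)

/* Hypothetical helper mirroring the return expression in the patch. */
static bool
need_member_autovacuum(uint64_t next_offset, uint64_t oldest_offset,
					   bool oldest_known)
{
	/* If the oldest offset is unknown, assume autovacuum is needed. */
	if (!oldest_known)
		return true;

	/* Assumption of this sketch: offsets only grow, so no wraparound. */
	assert(next_offset >= oldest_offset);

	return next_offset - oldest_offset > MEMBER_AUTOVAC_THRESHOLD;
}
```

Unlike the removed MultiXactMemberFreezeThreshold, this is a plain yes/no test: there is no gradual clamping of autovacuum_multixact_freeze_max_age as member usage grows.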

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 98ea1690b5..059a72f106 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -204,10 +204,14 @@ MXOffsetToMemberOffset(MultiXactOffset offset)
 		member_in_group * sizeof(TransactionId);
 }
 
-/* Multixact members wraparound thresholds. */
-#define MULTIXACT_MEMBER_SAFE_THRESHOLD		(MaxMultiXactOffset / 2)
-#define MULTIXACT_MEMBER_DANGER_THRESHOLD	\
-	(MaxMultiXactOffset - MaxMultiXactOffset / 4)
+/*
+ * Multixact members warning threshold.
+ *
+ * If the difference between nextOffset and oldestOffset exceeds this value,
+ * we trigger autovacuum in order to release the disk space and reduce table
+ * bloat if possible.
+ */
+#define MULTIXACT_MEMBER_AUTOVAC_THRESHOLD		UINT64CONST(0xFFFFFFFF)
 
 static inline MultiXactId
 PreviousMultiXactId(MultiXactId multi)
@@ -2615,15 +2619,13 @@ GetOldestMultiXactId(void)
 }
 
 /*
- * Determine how aggressively we need to vacuum in order to prevent member
- * wraparound.
+ * Determine whether we need to vacuum to free member space.
  *
  * To do so determine what's the oldest member offset and install the limit
  * info in MultiXactState, where it can be used to prevent overrun of old data
  * in the members SLRU area.
  *
- * The return value is true if emergency autovacuum is required and false
- * otherwise.
+ * The return value is true if autovacuum is required and false otherwise.
  */
 static bool
 SetOffsetVacuumLimit(bool is_startup)
@@ -2711,10 +2713,10 @@ SetOffsetVacuumLimit(bool is_startup)
 	LWLockRelease(MultiXactGenLock);
 
 	/*
-	 * Do we need an emergency autovacuum?	If we're not sure, assume yes.
+	 * Do we need autovacuum?	If we're not sure, assume yes.
 	 */
 	return !oldestOffsetKnown ||
-		(nextOffset - oldestOffset > MULTIXACT_MEMBER_SAFE_THRESHOLD);
+		(nextOffset - oldestOffset > MULTIXACT_MEMBER_AUTOVAC_THRESHOLD);
 }
 
 /*
@@ -2760,101 +2762,6 @@ find_multixact_start(MultiXactId multi, MultiXactOffset *result)
 	return true;
 }
 
-/*
- * Determine how many multixacts, and how many multixact members, currently
- * exist.  Return false if unable to determine.
- */
-static bool
-ReadMultiXactCounts(uint32 *multixacts, MultiXactOffset *members)
-{
-	MultiXactOffset nextOffset;
-	MultiXactOffset oldestOffset;
-	MultiXactId oldestMultiXactId;
-	MultiXactId nextMultiXactId;
-	bool		oldestOffsetKnown;
-
-	LWLockAcquire(MultiXactGenLock, LW_SHARED);
-	nextOffset = MultiXactState->nextOffset;
-	oldestMultiXactId = MultiXactState->oldestMultiXactId;
-	nextMultiXactId = MultiXactState->nextMXact;
-	oldestOffset = MultiXactState->oldestOffset;
-	oldestOffsetKnown = MultiXactState->oldestOffsetKnown;
-	LWLockRelease(MultiXactGenLock);
-
-	if (!oldestOffsetKnown)
-		return false;
-
-	*members = nextOffset - oldestOffset;
-	*multixacts = nextMultiXactId - oldestMultiXactId;
-	return true;
-}
-
-/*
- * Multixact members can be removed once the multixacts that refer to them
- * are older than every datminmxid.  autovacuum_multixact_freeze_max_age and
- * vacuum_multixact_freeze_table_age work together to make sure we never have
- * too many multixacts; we hope that, at least under normal circumstances,
- * this will also be sufficient to keep us from using too many offsets.
- * However, if the average multixact has many members, we might exhaust the
- * members space while still using few enough members that these limits fail
- * to trigger relminmxid advancement by VACUUM.  At that point, we'd have no
- * choice but to start failing multixact-creating operations with an error.
- *
- * To prevent that, if more than a threshold portion of the members space is
- * used, we effectively reduce autovacuum_multixact_freeze_max_age and
- * to a value just less than the number of multixacts in use.  We hope that
- * this will quickly trigger autovacuuming on the table or tables with the
- * oldest relminmxid, thus allowing datminmxid values to advance and removing
- * some members.
- *
- * As the fraction of the member space currently in use grows, we become
- * more aggressive in clamping this value.  That not only causes autovacuum
- * to ramp up, but also makes any manual vacuums the user issues more
- * aggressive.  This happens because vacuum_get_cutoffs() will clamp the
- * freeze table and the minimum freeze age cutoffs based on the effective
- * autovacuum_multixact_freeze_max_age this function returns.  In the worst
- * case, we'll claim the freeze_max_age to zero, and every vacuum of any
- * table will freeze every multixact.
- */
-int
-MultiXactMemberFreezeThreshold(void)
-{
-	MultiXactOffset members;
-	uint32		multixacts;
-	uint32		victim_multixacts;
-	double		fraction;
-	int			result;
-
-	/* If we can't determine member space utilization, assume the worst. */
-	if (!ReadMultiXactCounts(&multixacts, &members))
-		return 0;
-
-	/* If member space utilization is low, no special action is required. */
-	if (members <= MULTIXACT_MEMBER_SAFE_THRESHOLD)
-		return autovacuum_multixact_freeze_max_age;
-
-	/*
-	 * Compute a target for relminmxid advancement.  The number of multixacts
-	 * we try to eliminate from the system is based on how far we are past
-	 * MULTIXACT_MEMBER_SAFE_THRESHOLD.
-	 */
-	fraction = (double) (members - MULTIXACT_MEMBER_SAFE_THRESHOLD);
-	fraction /= (double) (MULTIXACT_MEMBER_DANGER_THRESHOLD - MULTIXACT_MEMBER_SAFE_THRESHOLD);
-
-	victim_multixacts = multixacts * fraction;
-
-	/* fraction could be > 1.0, but lowest possible freeze age is zero */
-	if (victim_multixacts > multixacts)
-		return 0;
-	result = multixacts - victim_multixacts;
-
-	/*
-	 * Clamp to autovacuum_multixact_freeze_max_age, so that we never make
-	 * autovacuum less aggressive than it would otherwise be.
-	 */
-	return Min(result, autovacuum_multixact_freeze_max_age);
-}
-
 typedef struct mxtruncinfo
 {
 	int64		earliestExistingPage;
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 33a33bf6b1..0fcfc08a44 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -1154,7 +1154,7 @@ vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
 	 * normally autovacuum_multixact_freeze_max_age, but may be less if we are
 	 * short of multixact member space.
 	 */
-	effective_multixact_freeze_max_age = MultiXactMemberFreezeThreshold();
+	effective_multixact_freeze_max_age = autovacuum_multixact_freeze_max_age;
 
 	/*
 	 * Almost ready to set freeze output parameters; check if OldestXmin or
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 4d4a1a3197..29aa23c1a4 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -1141,7 +1141,7 @@ do_start_worker(void)
 
 	/* Also determine the oldest datminmxid we will consider. */
 	recentMulti = ReadNextMultiXactId();
-	multiForceLimit = recentMulti - MultiXactMemberFreezeThreshold();
+	multiForceLimit = recentMulti - autovacuum_multixact_freeze_max_age;
 	if (multiForceLimit < FirstMultiXactId)
 		multiForceLimit -= FirstMultiXactId;
 
@@ -1929,7 +1929,7 @@ do_autovacuum(void)
 	 * normally autovacuum_multixact_freeze_max_age, but may be less if we are
 	 * short of multixact member space.
 	 */
-	effective_multixact_freeze_max_age = MultiXactMemberFreezeThreshold();
+	effective_multixact_freeze_max_age = autovacuum_multixact_freeze_max_age;
 
 	/*
 	 * Find the pg_database entry and select the default freeze ages. We use
diff --git a/src/include/access/multixact.h b/src/include/access/multixact.h
index b1510532ee..5ee632dfe6 100644
--- a/src/include/access/multixact.h
+++ b/src/include/access/multixact.h
@@ -143,7 +143,6 @@ extern void MultiXactSetNextMXact(MultiXactId nextMulti,
 extern void MultiXactAdvanceNextMXact(MultiXactId minMulti,
 									  MultiXactOffset minMultiOffset);
 extern void MultiXactAdvanceOldest(MultiXactId oldestMulti, Oid oldestMultiDB);
-extern int	MultiXactMemberFreezeThreshold(void);
 
 extern void multixact_twophase_recover(TransactionId xid, uint16 info,
 									   void *recdata, uint32 len);
-- 
2.43.0
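
For readers following what the patch above removes: MultiXactMemberFreezeThreshold() scaled the effective freeze age down as member space filled up. A minimal sketch of that arithmetic follows. It is not the PostgreSQL source. The function name and the parameters `safe`, `danger` and `freeze_max_age` are hypothetical stand-ins for MULTIXACT_MEMBER_SAFE_THRESHOLD, MULTIXACT_MEMBER_DANGER_THRESHOLD and autovacuum_multixact_freeze_max_age, and the sketch assumes `danger > safe` and `freeze_max_age >= 0`.

```c
#include <stdint.h>

/*
 * Sketch of the removed clamp (hypothetical helper, not PostgreSQL code).
 * Returns the freeze age to use given current multixact and member counts.
 */
static int
member_freeze_threshold(uint32_t multixacts, uint64_t members,
						uint64_t safe, uint64_t danger,
						int freeze_max_age)
{
	double		fraction;
	uint32_t	victims;
	uint32_t	remaining;

	/* Low member-space usage: leave the configured setting alone. */
	if (members <= safe)
		return freeze_max_age;

	/* How far past the safe threshold we are, as a share of the window. */
	fraction = (double) (members - safe) / (double) (danger - safe);

	/* At or beyond the danger threshold: freeze everything. */
	if (fraction >= 1.0)
		return 0;

	victims = (uint32_t) (multixacts * fraction);
	remaining = multixacts - victims;

	/* Never make autovacuum less aggressive than the configured age. */
	if (remaining > (uint32_t) freeze_max_age)
		return freeze_max_age;
	return (int) remaining;
}
```

With 64-bit offsets the member space is, for practical purposes, never close to full, which appears to be why the callers in vacuum.c and autovacuum.c can use autovacuum_multixact_freeze_max_age directly.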

v15-0002-Use-64-bit-multixact-offsets.patch (application/octet-stream)
From 4ec4639fe3637253f82a79576a89a2fe3bd48a48 Mon Sep 17 00:00:00 2001
From: Maxim Orlov <m.orlov@postgrespro.ru>
Date: Wed, 6 Mar 2024 11:11:33 +0300
Subject: [PATCH v15 2/7] Use 64-bit multixact offsets.

Author: Maxim Orlov <orlovmg@gmail.com>
---
 src/backend/access/transam/multixact.c | 170 +------------------------
 src/bin/pg_resetwal/pg_resetwal.c      |   2 +-
 src/bin/pg_resetwal/t/001_basic.pl     |   2 +-
 src/include/access/multixact.h         |   2 +-
 src/include/c.h                        |   2 +-
 5 files changed, 10 insertions(+), 168 deletions(-)

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index d4b70c1c1f..98ea1690b5 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -96,14 +96,6 @@
 /*
  * Defines for MultiXactOffset page sizes.  A page is the same BLCKSZ as is
  * used everywhere else in Postgres.
- *
- * Note: because MultiXactOffsets are 32 bits and wrap around at 0xFFFFFFFF,
- * MultiXact page numbering also wraps around at
- * 0xFFFFFFFF/MULTIXACT_OFFSETS_PER_PAGE, and segment numbering at
- * 0xFFFFFFFF/MULTIXACT_OFFSETS_PER_PAGE/SLRU_PAGES_PER_SEGMENT.  We need
- * take no explicit notice of that fact in this module, except when comparing
- * segment and page numbers in TruncateMultiXact (see
- * MultiXactOffsetPagePrecedes).
  */
 
 /* We need four bytes per offset */
@@ -272,9 +264,6 @@ typedef struct MultiXactStateData
 	MultiXactId multiStopLimit;
 	MultiXactId multiWrapLimit;
 
-	/* support for members anti-wraparound measures */
-	MultiXactOffset offsetStopLimit;	/* known if oldestOffsetKnown */
-
 	/*
 	 * This is used to sleep until a multixact offset is written when we want
 	 * to create the next one.
@@ -409,8 +398,6 @@ static bool MultiXactOffsetPrecedes(MultiXactOffset offset1,
 									MultiXactOffset offset2);
 static void ExtendMultiXactOffset(MultiXactId multi);
 static void ExtendMultiXactMember(MultiXactOffset offset, int nmembers);
-static bool MultiXactOffsetWouldWrap(MultiXactOffset boundary,
-									 MultiXactOffset start, uint32 distance);
 static bool SetOffsetVacuumLimit(bool is_startup);
 static bool find_multixact_start(MultiXactId multi, MultiXactOffset *result);
 static void WriteMZeroPageXlogRec(int64 pageno, uint8 info);
@@ -1164,78 +1151,6 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 	else
 		*offset = nextOffset;
 
-	/*----------
-	 * Protect against overrun of the members space as well, with the
-	 * following rules:
-	 *
-	 * If we're past offsetStopLimit, refuse to generate more multis.
-	 * If we're close to offsetStopLimit, emit a warning.
-	 *
-	 * Arbitrarily, we start emitting warnings when we're 20 segments or less
-	 * from offsetStopLimit.
-	 *
-	 * Note we haven't updated the shared state yet, so if we fail at this
-	 * point, the multixact ID we grabbed can still be used by the next guy.
-	 *
-	 * Note that there is no point in forcing autovacuum runs here: the
-	 * multixact freeze settings would have to be reduced for that to have any
-	 * effect.
-	 *----------
-	 */
-#define OFFSET_WARN_SEGMENTS	20
-	if (MultiXactState->oldestOffsetKnown &&
-		MultiXactOffsetWouldWrap(MultiXactState->offsetStopLimit, nextOffset,
-								 nmembers))
-	{
-		/* see comment in the corresponding offsets wraparound case */
-		SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER);
-
-		ereport(ERROR,
-				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg("multixact \"members\" limit exceeded"),
-				 errdetail_plural("This command would create a multixact with %u members, but the remaining space is only enough for %u member.",
-								  "This command would create a multixact with %u members, but the remaining space is only enough for %u members.",
-								  MultiXactState->offsetStopLimit - nextOffset - 1,
-								  nmembers,
-								  MultiXactState->offsetStopLimit - nextOffset - 1),
-				 errhint("Execute a database-wide VACUUM in database with OID %u with reduced \"vacuum_multixact_freeze_min_age\" and \"vacuum_multixact_freeze_table_age\" settings.",
-						 MultiXactState->oldestMultiXactDB)));
-	}
-
-	/*
-	 * Check whether we should kick autovacuum into action, to prevent members
-	 * wraparound. NB we use a much larger window to trigger autovacuum than
-	 * just the warning limit. The warning is just a measure of last resort -
-	 * this is in line with GetNewTransactionId's behaviour.
-	 */
-	if (!MultiXactState->oldestOffsetKnown ||
-		(MultiXactState->nextOffset - MultiXactState->oldestOffset
-		 > MULTIXACT_MEMBER_SAFE_THRESHOLD))
-	{
-		/*
-		 * To avoid swamping the postmaster with signals, we issue the autovac
-		 * request only when crossing a segment boundary. With default
-		 * compilation settings that's roughly after 50k members.  This still
-		 * gives plenty of chances before we get into real trouble.
-		 */
-		if ((MXOffsetToMemberPage(nextOffset) / SLRU_PAGES_PER_SEGMENT) !=
-			(MXOffsetToMemberPage(nextOffset + nmembers) / SLRU_PAGES_PER_SEGMENT))
-			SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER);
-	}
-
-	if (MultiXactState->oldestOffsetKnown &&
-		MultiXactOffsetWouldWrap(MultiXactState->offsetStopLimit,
-								 nextOffset,
-								 nmembers + MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT * OFFSET_WARN_SEGMENTS))
-		ereport(WARNING,
-				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg_plural("database with OID %u must be vacuumed before %d more multixact member is used",
-							   "database with OID %u must be vacuumed before %d more multixact members are used",
-							   MultiXactState->offsetStopLimit - nextOffset + nmembers,
-							   MultiXactState->oldestMultiXactDB,
-							   MultiXactState->offsetStopLimit - nextOffset + nmembers),
-				 errhint("Execute a database-wide VACUUM in that database with reduced \"vacuum_multixact_freeze_min_age\" and \"vacuum_multixact_freeze_table_age\" settings.")));
-
 	ExtendMultiXactMember(nextOffset, nmembers);
 
 	/*
@@ -2720,8 +2635,6 @@ SetOffsetVacuumLimit(bool is_startup)
 	MultiXactOffset nextOffset;
 	bool		oldestOffsetKnown = false;
 	bool		prevOldestOffsetKnown;
-	MultiXactOffset offsetStopLimit = 0;
-	MultiXactOffset prevOffsetStopLimit;
 
 	/*
 	 * NB: Have to prevent concurrent truncation, we might otherwise try to
@@ -2736,7 +2649,6 @@ SetOffsetVacuumLimit(bool is_startup)
 	nextOffset = MultiXactState->nextOffset;
 	prevOldestOffsetKnown = MultiXactState->oldestOffsetKnown;
 	prevOldestOffset = MultiXactState->oldestOffset;
-	prevOffsetStopLimit = MultiXactState->offsetStopLimit;
 	Assert(MultiXactState->finishedStartup);
 	LWLockRelease(MultiXactGenLock);
 
@@ -2767,11 +2679,7 @@ SetOffsetVacuumLimit(bool is_startup)
 		oldestOffsetKnown =
 			find_multixact_start(oldestMultiXactId, &oldestOffset);
 
-		if (oldestOffsetKnown)
-			ereport(DEBUG1,
-					(errmsg_internal("oldest MultiXactId member is at offset %u",
-									 oldestOffset)));
-		else
+		if (!oldestOffsetKnown)
 			ereport(LOG,
 					(errmsg("MultiXact member wraparound protections are disabled because oldest checkpointed MultiXact %u does not exist on disk",
 							oldestMultiXactId)));
@@ -2784,24 +2692,7 @@ SetOffsetVacuumLimit(bool is_startup)
 	 * overrun of old data in the members SLRU area. We can only do so if the
 	 * oldest offset is known though.
 	 */
-	if (oldestOffsetKnown)
-	{
-		/* move back to start of the corresponding segment */
-		offsetStopLimit = oldestOffset - (oldestOffset %
-										  (MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT));
-
-		/* always leave one segment before the wraparound point */
-		offsetStopLimit -= (MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT);
-
-		if (!prevOldestOffsetKnown && !is_startup)
-			ereport(LOG,
-					(errmsg("MultiXact member wraparound protections are now enabled")));
-
-		ereport(DEBUG1,
-				(errmsg_internal("MultiXact member stop limit is now %u based on MultiXact %u",
-								 offsetStopLimit, oldestMultiXactId)));
-	}
-	else if (prevOldestOffsetKnown)
+	if (!oldestOffsetKnown && prevOldestOffsetKnown)
 	{
 		/*
 		 * If we failed to get the oldest offset this time, but we have a
@@ -2811,14 +2702,12 @@ SetOffsetVacuumLimit(bool is_startup)
 		 */
 		oldestOffset = prevOldestOffset;
 		oldestOffsetKnown = true;
-		offsetStopLimit = prevOffsetStopLimit;
 	}
 
 	/* Install the computed values */
 	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
 	MultiXactState->oldestOffset = oldestOffset;
 	MultiXactState->oldestOffsetKnown = oldestOffsetKnown;
-	MultiXactState->offsetStopLimit = offsetStopLimit;
 	LWLockRelease(MultiXactGenLock);
 
 	/*
@@ -2828,54 +2717,6 @@ SetOffsetVacuumLimit(bool is_startup)
 		(nextOffset - oldestOffset > MULTIXACT_MEMBER_SAFE_THRESHOLD);
 }
 
-/*
- * Return whether adding "distance" to "start" would move past "boundary".
- *
- * We use this to determine whether the addition is "wrapping around" the
- * boundary point, hence the name.  The reason we don't want to use the regular
- * 2^31-modulo arithmetic here is that we want to be able to use the whole of
- * the 2^32-1 space here, allowing for more multixacts than would fit
- * otherwise.
- */
-static bool
-MultiXactOffsetWouldWrap(MultiXactOffset boundary, MultiXactOffset start,
-						 uint32 distance)
-{
-	MultiXactOffset finish;
-
-	/*
-	 * Note that offset number 0 is not used (see GetMultiXactIdMembers), so
-	 * if the addition wraps around the UINT_MAX boundary, skip that value.
-	 */
-	finish = start + distance;
-	if (finish < start)
-		finish++;
-
-	/*-----------------------------------------------------------------------
-	 * When the boundary is numerically greater than the starting point, any
-	 * value numerically between the two is not wrapped:
-	 *
-	 *	<----S----B---->
-	 *	[---)			 = F wrapped past B (and UINT_MAX)
-	 *		 [---)		 = F not wrapped
-	 *			  [----] = F wrapped past B
-	 *
-	 * When the boundary is numerically less than the starting point (i.e. the
-	 * UINT_MAX wraparound occurs somewhere in between) then all values in
-	 * between are wrapped:
-	 *
-	 *	<----B----S---->
-	 *	[---)			 = F not wrapped past B (but wrapped past UINT_MAX)
-	 *		 [---)		 = F wrapped past B (and UINT_MAX)
-	 *			  [----] = F not wrapped
-	 *-----------------------------------------------------------------------
-	 */
-	if (start < boundary)
-		return finish >= boundary || finish < start;
-	else
-		return finish >= boundary && finish < start;
-}
-
 /*
  * Find the starting offset of the given MultiXactId.
  *
@@ -2997,8 +2838,9 @@ MultiXactMemberFreezeThreshold(void)
 	 * we try to eliminate from the system is based on how far we are past
 	 * MULTIXACT_MEMBER_SAFE_THRESHOLD.
 	 */
-	fraction = (double) (members - MULTIXACT_MEMBER_SAFE_THRESHOLD) /
-		(MULTIXACT_MEMBER_DANGER_THRESHOLD - MULTIXACT_MEMBER_SAFE_THRESHOLD);
+	fraction = (double) (members - MULTIXACT_MEMBER_SAFE_THRESHOLD);
+	fraction /= (double) (MULTIXACT_MEMBER_DANGER_THRESHOLD - MULTIXACT_MEMBER_SAFE_THRESHOLD);
+
 	victim_multixacts = multixacts * fraction;
 
 	/* fraction could be > 1.0, but lowest possible freeze age is zero */
@@ -3343,7 +3185,7 @@ MultiXactIdPrecedesOrEquals(MultiXactId multi1, MultiXactId multi2)
 static bool
 MultiXactOffsetPrecedes(MultiXactOffset offset1, MultiXactOffset offset2)
 {
-	int32		diff = (int32) (offset1 - offset2);
+	int64		diff = (int64) (offset1 - offset2);
 
 	return (diff < 0);
 }
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index df2d8f37cf..9252160280 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -266,7 +266,7 @@ main(int argc, char *argv[])
 
 			case 'O':
 				errno = 0;
-				set_mxoff = strtoul(optarg, &endptr, 0);
+				set_mxoff = strtou64(optarg, &endptr, 0);
 				if (endptr == optarg || *endptr != '\0' || errno != 0)
 				{
 					pg_log_error("invalid argument for option %s", "-O");
diff --git a/src/bin/pg_resetwal/t/001_basic.pl b/src/bin/pg_resetwal/t/001_basic.pl
index d6bbbd0ced..cc89e0764a 100644
--- a/src/bin/pg_resetwal/t/001_basic.pl
+++ b/src/bin/pg_resetwal/t/001_basic.pl
@@ -213,7 +213,7 @@ push @cmd,
   sprintf("%d,%d", hex($files[0]) == 0 ? 3 : hex($files[0]), hex($files[-1]));
 
 @files = get_slru_files('pg_multixact/offsets');
-$mult = 32 * $blcksz / 4;
+$mult = 32 * $blcksz / 8;
 # --multixact-ids argument is "new,old"
 push @cmd,
   '--multixact-ids' => sprintf("%d,%d",
diff --git a/src/include/access/multixact.h b/src/include/access/multixact.h
index 4e6b0eec2f..b1510532ee 100644
--- a/src/include/access/multixact.h
+++ b/src/include/access/multixact.h
@@ -27,7 +27,7 @@
 
 #define MultiXactIdIsValid(multi) ((multi) != InvalidMultiXactId)
 
-#define MaxMultiXactOffset	((MultiXactOffset) 0xFFFFFFFF)
+#define MaxMultiXactOffset	UINT64CONST(0xFFFFFFFFFFFFFFFF)
 
 /*
  * Possible multixact lock modes ("status").  The first four modes are for
diff --git a/src/include/c.h b/src/include/c.h
index 8cdc16a0f4..45ba4c523f 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -632,7 +632,7 @@ typedef uint32 SubTransactionId;
 /* MultiXactId must be equivalent to TransactionId, to fit in t_xmax */
 typedef TransactionId MultiXactId;
 
-typedef uint32 MultiXactOffset;
+typedef uint64 MultiXactOffset;
 
 typedef uint32 CommandId;
 
-- 
2.43.0
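
The change to MultiXactOffsetPrecedes() in the patch above keeps the usual circular comparison and only widens it. A minimal standalone sketch of that idea, assuming a 64-bit offset type; the typedef is a local stand-in for the one in c.h, and `offset_precedes` is a hypothetical name:

```c
#include <stdbool.h>
#include <stdint.h>

/* Local stand-in for the patched typedef in c.h. */
typedef uint64_t MultiXactOffset;

/*
 * Returns true if offset1 logically precedes offset2, treating the 64-bit
 * space as circular.  The unsigned difference wraps modulo 2^64; viewed as
 * a signed value, it is negative when offset1 is less than 2^63 behind
 * offset2.  The unsigned-to-signed conversion is implementation-defined
 * before C23, but behaves as two's complement on common platforms.
 */
static bool
offset_precedes(MultiXactOffset offset1, MultiXactOffset offset2)
{
	int64_t		diff = (int64_t) (offset1 - offset2);

	return diff < 0;
}
```

Note that under this comparison the maximum offset precedes 0, because the space is treated as a ring rather than a line.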

v15-0001-Use-64-bit-format-output-for-multixact-offsets.patch (application/octet-stream)
From 3754ef3615260fea7469102127cb41aa8772bc08 Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Wed, 7 Aug 2024 16:35:22 +0300
Subject: [PATCH v15 1/7] Use 64-bit format output for multixact offsets

Author: Maxim Orlov <orlovmg@gmail.com>
---
 src/backend/access/rmgrdesc/mxactdesc.c   |  4 ++--
 src/backend/access/rmgrdesc/xlogdesc.c    |  2 +-
 src/backend/access/transam/multixact.c    | 13 +++++++------
 src/backend/access/transam/xlogrecovery.c |  2 +-
 src/bin/pg_controldata/pg_controldata.c   |  2 +-
 src/bin/pg_resetwal/pg_resetwal.c         |  4 ++--
 6 files changed, 14 insertions(+), 13 deletions(-)

diff --git a/src/backend/access/rmgrdesc/mxactdesc.c b/src/backend/access/rmgrdesc/mxactdesc.c
index 3ca0582db3..052dd0a4ce 100644
--- a/src/backend/access/rmgrdesc/mxactdesc.c
+++ b/src/backend/access/rmgrdesc/mxactdesc.c
@@ -65,7 +65,7 @@ multixact_desc(StringInfo buf, XLogReaderState *record)
 		xl_multixact_create *xlrec = (xl_multixact_create *) rec;
 		int			i;
 
-		appendStringInfo(buf, "%u offset %u nmembers %d: ", xlrec->mid,
+		appendStringInfo(buf, "%u offset %" PRIu64 " nmembers %d: ", xlrec->mid,
 						 xlrec->moff, xlrec->nmembers);
 		for (i = 0; i < xlrec->nmembers; i++)
 			out_member(buf, &xlrec->members[i]);
@@ -74,7 +74,7 @@ multixact_desc(StringInfo buf, XLogReaderState *record)
 	{
 		xl_multixact_truncate *xlrec = (xl_multixact_truncate *) rec;
 
-		appendStringInfo(buf, "offsets [%u, %u), members [%u, %u)",
+		appendStringInfo(buf, "offsets [%u, %u), members [%" PRIu64 ", %" PRIu64 ")",
 						 xlrec->startTruncOff, xlrec->endTruncOff,
 						 xlrec->startTruncMemb, xlrec->endTruncMemb);
 	}
diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c
index 58040f2865..3c42aa3e39 100644
--- a/src/backend/access/rmgrdesc/xlogdesc.c
+++ b/src/backend/access/rmgrdesc/xlogdesc.c
@@ -66,7 +66,7 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 		CheckPoint *checkpoint = (CheckPoint *) rec;
 
 		appendStringInfo(buf, "redo %X/%X; "
-						 "tli %u; prev tli %u; fpw %s; wal_level %s; xid %u:%u; oid %u; multi %u; offset %u; "
+						 "tli %u; prev tli %u; fpw %s; wal_level %s; xid %u:%u; oid %u; multi %u; offset %" PRIu64 "; "
 						 "oldest xid %u in DB %u; oldest multi %u in DB %u; "
 						 "oldest/newest commit timestamp xid: %u/%u; "
 						 "oldest running xid %u; %s",
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 3c06ac4553..d4b70c1c1f 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1264,7 +1264,8 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 
 	LWLockRelease(MultiXactGenLock);
 
-	debug_elog4(DEBUG2, "GetNew: returning %u offset %u", result, *offset);
+	debug_elog4(DEBUG2, "GetNew: returning %u offset %" PRIu64, result,
+				*offset);
 	return result;
 }
 
@@ -2293,7 +2294,7 @@ MultiXactGetCheckptMulti(bool is_shutdown,
 	LWLockRelease(MultiXactGenLock);
 
 	debug_elog6(DEBUG2,
-				"MultiXact: checkpoint is nextMulti %u, nextOffset %u, oldestMulti %u in DB %u",
+				"MultiXact: checkpoint is nextMulti %u, nextOffset %" PRIu64 ", oldestMulti %u in DB %u",
 				*nextMulti, *nextMultiOffset, *oldestMulti, *oldestMultiDB);
 }
 
@@ -2328,7 +2329,7 @@ void
 MultiXactSetNextMXact(MultiXactId nextMulti,
 					  MultiXactOffset nextMultiOffset)
 {
-	debug_elog4(DEBUG2, "MultiXact: setting next multi to %u offset %u",
+	debug_elog4(DEBUG2, "MultiXact: setting next multi to %u offset %" PRIu64,
 				nextMulti, nextMultiOffset);
 	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
 	MultiXactState->nextMXact = nextMulti;
@@ -2519,7 +2520,7 @@ MultiXactAdvanceNextMXact(MultiXactId minMulti,
 	}
 	if (MultiXactOffsetPrecedes(MultiXactState->nextOffset, minMultiOffset))
 	{
-		debug_elog3(DEBUG2, "MultiXact: setting next offset to %u",
+		debug_elog3(DEBUG2, "MultiXact: setting next offset to %" PRIu64,
 					minMultiOffset);
 		MultiXactState->nextOffset = minMultiOffset;
 	}
@@ -3211,7 +3212,7 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 
 	elog(DEBUG1, "performing multixact truncation: "
 		 "offsets [%u, %u), offsets segments [%" PRIx64 ", %" PRIx64 "), "
-		 "members [%u, %u), members segments [%" PRIx64 ", %" PRIx64 ")",
+		 "members [%" PRIu64 ", %" PRIu64 "), members segments [%" PRIx64 ", %" PRIx64 ")",
 		 oldestMulti, newOldestMulti,
 		 MultiXactIdToOffsetSegment(oldestMulti),
 		 MultiXactIdToOffsetSegment(newOldestMulti),
@@ -3471,7 +3472,7 @@ multixact_redo(XLogReaderState *record)
 
 		elog(DEBUG1, "replaying multixact truncation: "
 			 "offsets [%u, %u), offsets segments [%" PRIx64 ", %" PRIx64 "), "
-			 "members [%u, %u), members segments [%" PRIx64 ", %" PRIx64 ")",
+			 "members [%" PRIu64 ", %" PRIu64 "), members segments [%" PRIx64 ", %" PRIx64 ")",
 			 xlrec.startTruncOff, xlrec.endTruncOff,
 			 MultiXactIdToOffsetSegment(xlrec.startTruncOff),
 			 MultiXactIdToOffsetSegment(xlrec.endTruncOff),
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 6ce979f2d8..27b6059291 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -884,7 +884,7 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 							 U64FromFullTransactionId(checkPoint.nextXid),
 							 checkPoint.nextOid)));
 	ereport(DEBUG1,
-			(errmsg_internal("next MultiXactId: %u; next MultiXactOffset: %u",
+			(errmsg_internal("next MultiXactId: %u; next MultiXactOffset: %" PRIu64,
 							 checkPoint.nextMulti, checkPoint.nextMultiOffset)));
 	ereport(DEBUG1,
 			(errmsg_internal("oldest unfrozen transaction ID: %u, in database %u",
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index 7bb801bb88..b195806699 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -264,7 +264,7 @@ main(int argc, char *argv[])
 		   ControlFile->checkPointCopy.nextOid);
 	printf(_("Latest checkpoint's NextMultiXactId:  %u\n"),
 		   ControlFile->checkPointCopy.nextMulti);
-	printf(_("Latest checkpoint's NextMultiOffset:  %u\n"),
+	printf(_("Latest checkpoint's NextMultiOffset:  %" PRIu64 "\n"),
 		   ControlFile->checkPointCopy.nextMultiOffset);
 	printf(_("Latest checkpoint's oldestXID:        %u\n"),
 		   ControlFile->checkPointCopy.oldestXid);
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index e876f35f38..df2d8f37cf 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -759,7 +759,7 @@ PrintControlValues(bool guessed)
 		   ControlFile.checkPointCopy.nextOid);
 	printf(_("Latest checkpoint's NextMultiXactId:  %u\n"),
 		   ControlFile.checkPointCopy.nextMulti);
-	printf(_("Latest checkpoint's NextMultiOffset:  %u\n"),
+	printf(_("Latest checkpoint's NextMultiOffset:  %" PRIu64 "\n"),
 		   ControlFile.checkPointCopy.nextMultiOffset);
 	printf(_("Latest checkpoint's oldestXID:        %u\n"),
 		   ControlFile.checkPointCopy.oldestXid);
@@ -833,7 +833,7 @@ PrintNewControlValues(void)
 
 	if (set_mxoff != -1)
 	{
-		printf(_("NextMultiOffset:                      %u\n"),
+		printf(_("NextMultiOffset:                      %" PRIu64 "\n"),
 			   ControlFile.checkPointCopy.nextMultiOffset);
 	}
 
-- 
2.43.0
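
The format-string changes in the patch above all follow one pattern: a `%u` that used to print a 32-bit offset becomes the `PRIu64` macro from `<inttypes.h>`, spliced in by string-literal concatenation. A minimal sketch in plain C, outside PostgreSQL; `format_offset` is a hypothetical helper:

```c
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Writes "offset N" into buf for a 64-bit N and returns snprintf's result.
 * PRIu64 expands to the correct conversion specifier for uint64_t on the
 * current platform; the macro name is case-sensitive (lowercase "u").
 */
static int
format_offset(char *buf, size_t buflen, uint64_t offset)
{
	return snprintf(buf, buflen, "offset %" PRIu64, offset);
}
```

Passing a 64-bit value to a plain `%u` is undefined behaviour and in practice truncates or misaligns the remaining arguments, so every message that prints an offset has to be converted together with the type change.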

v15-0007-TEST-bump-catver.patch.txt (text/plain; charset=US-ASCII)
From 263953e78d78051d25e310616f966ae5bea69b33 Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Wed, 13 Nov 2024 16:34:34 +0300
Subject: [PATCH v15 7/7] TEST: bump catver

---
 src/bin/pg_upgrade/pg_upgrade.h  | 2 +-
 src/include/catalog/catversion.h | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index e9c5d3177f..a8e08716f0 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -119,7 +119,7 @@ extern char *output_files[];
  *
  * XXX: should be changed to the actual CATALOG_VERSION_NO on commit.
  */
-#define MULTIXACTOFFSET_FORMATCHANGE_CAT_VER 202504092
+#define MULTIXACTOFFSET_FORMATCHANGE_CAT_VER 202505072
 
 /*
  * large object chunk size added to pg_controldata,
diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h
index 82988d2443..e552b8491f 100644
--- a/src/include/catalog/catversion.h
+++ b/src/include/catalog/catversion.h
@@ -57,6 +57,6 @@
  */
 
 /*							yyyymmddN */
-#define CATALOG_VERSION_NO	202505071
+#define CATALOG_VERSION_NO	202505072
 
 #endif
-- 
2.43.0

v15-0006-TEST-add-src-bin-pg_upgrade-t-006_offset.pl.patch.txt (text/plain; charset=US-ASCII)
From 8a2ad294913b0ced96dde5bda4e451d98a0640cb Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Tue, 19 Nov 2024 17:08:10 +0300
Subject: [PATCH v15 6/7] TEST: add src/bin/pg_upgrade/t/006_offset.pl

---
 src/bin/pg_upgrade/pg_upgrade.h    |   2 +-
 src/bin/pg_upgrade/t/006_offset.pl | 562 +++++++++++++++++++++++++++++
 2 files changed, 563 insertions(+), 1 deletion(-)
 create mode 100644 src/bin/pg_upgrade/t/006_offset.pl

diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index 2d4f1d39e5..e9c5d3177f 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -119,7 +119,7 @@ extern char *output_files[];
  *
  * XXX: should be changed to the actual CATALOG_VERSION_NO on commit.
  */
-#define MULTIXACTOFFSET_FORMATCHANGE_CAT_VER 202409041
+#define MULTIXACTOFFSET_FORMATCHANGE_CAT_VER 202504092
 
 /*
  * large object chunk size added to pg_controldata,
diff --git a/src/bin/pg_upgrade/t/006_offset.pl b/src/bin/pg_upgrade/t/006_offset.pl
new file mode 100644
index 0000000000..f5dc733a30
--- /dev/null
+++ b/src/bin/pg_upgrade/t/006_offset.pl
@@ -0,0 +1,562 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+use strict;
+use warnings FATAL => 'all';
+
+use File::Find qw(find);
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# This pair of calls will create significantly more member segments than offset
+# segments.
+sub prep
+{
+	my $node = shift;
+	my $tbl = shift;
+
+	$node->safe_psql('postgres',
+		"CREATE TABLE ${tbl} (I INT PRIMARY KEY, N_UPDATED INT) " .
+		"       WITH (AUTOVACUUM_ENABLED=FALSE);" .
+		"INSERT INTO ${tbl} SELECT G, 0 FROM GENERATE_SERIES(1, 50) G;");
+}
+
+sub fill
+{
+	my $node = shift;
+	my $tbl = shift;
+
+	my $nclients = 50;
+	my $update_every = 90;
+	my @connections = ();
+
+	for (1..$nclients)
+	{
+		my $conn = $node->background_psql('postgres');
+		$conn->query_safe("BEGIN");
+
+		push(@connections, $conn);
+	}
+
+	for (my $i = 0; $i < 20000; $i++)
+	{
+		my $conn = $connections[$i % $nclients];
+
+		$conn->query_safe("COMMIT;");
+		$conn->query_safe("BEGIN");
+
+		if ($i % $update_every == 0)
+		{
+			$conn->query_safe(
+				"UPDATE ${tbl} SET " .
+				"N_UPDATED = N_UPDATED + 1 " .
+				"WHERE I = ${i} % 50");
+		}
+		else
+		{
+			$conn->query_safe(
+				"SELECT * FROM ${tbl} FOR KEY SHARE");
+		}
+	}
+
+	for my $conn (@connections)
+	{
+		$conn->quit();
+	}
+}
+
+# This pair of calls will create more or less the same number of member and
+# offset segments.
+sub prep2
+{
+	my $node = shift;
+	my $tbl = shift;
+
+	$node->safe_psql('postgres',
+		"CREATE TABLE ${tbl}(BAR INT PRIMARY KEY, BAZ INT); " .
+		"CREATE OR REPLACE PROCEDURE MXIDFILLER(N_STEPS INT DEFAULT 1000) " .
+		"LANGUAGE PLPGSQL " .
+		"AS \$\$ " .
+		"BEGIN " .
+		"	FOR I IN 1..N_STEPS LOOP " .
+		"		UPDATE ${tbl} SET BAZ = RANDOM(1, 1000) " .
+		"		WHERE BAR IN (SELECT BAR FROM ${tbl} " .
+		"						TABLESAMPLE BERNOULLI(80)); " .
+		"		COMMIT; " .
+		"	END LOOP; " .
+		"END; \$\$; " .
+		"INSERT INTO ${tbl} (BAR, BAZ) " .
+		"SELECT ID, ID FROM GENERATE_SERIES(1, 1024) ID;");
+}
+
+sub fill2
+{
+	my $node = shift;
+	my $tbl = shift;
+	my $scale = shift // 1;
+
+	$node->safe_psql('postgres',
+		"BEGIN; " .
+		"SELECT * FROM ${tbl} FOR KEY SHARE; " .
+		"PREPARE TRANSACTION 'A'; " .
+		"CALL MXIDFILLER((365 * ${scale})::int); " .
+		"COMMIT PREPARED 'A';");
+}
+
+
+# generate around 2 offset segments and 55 member segments
+sub mxid_gen1
+{
+	my $node = shift;
+	my $tbl = shift;
+
+	prep($node, $tbl);
+	fill($node, $tbl);
+
+	$node->safe_psql('postgres', q(CHECKPOINT));
+}
+
+# generate around 10 offset segments and 12 member segments
+sub mxid_gen2
+{
+	my $node = shift;
+	my $tbl = shift;
+	my $scale = shift // 1;
+
+	prep2($node, $tbl);
+	fill2($node, $tbl, $scale);
+
+	$node->safe_psql('postgres', q(CHECKPOINT));
+}
+
+# Fetch latest multixact checkpoint values.
+sub multi_bounds
+{
+	my ($node) = @_;
+	my $path = $node->config_data('--bindir');
+	my ($stdout, $stderr) = run_command([
+									$path . '/pg_controldata',
+									$node->data_dir
+								]);
+	my @control_data = split("\n", $stdout);
+	my $next = undef;
+	my $oldest = undef;
+	my $next_offset = undef;
+
+	foreach (@control_data)
+	{
+		if ($_ =~ /^Latest checkpoint's NextMultiXactId:\s*(.*)$/mg)
+		{
+			$next = $1;
+			print ">>> @ node ". $node->name . ", " . $_ . "\n";
+		}
+
+		if ($_ =~ /^Latest checkpoint's oldestMultiXid:\s*(.*)$/mg)
+		{
+			$oldest = $1;
+			print ">>> @ node ". $node->name . ", " . $_ . "\n";
+		}
+
+		if ($_ =~ /^Latest checkpoint's NextMultiOffset:\s*(.*)$/mg)
+		{
+			$next_offset = $1;
+			print ">>> @ node ". $node->name . ", " . $_ . "\n";
+		}
+
+		if (defined($oldest) && defined($next) && defined($next_offset))
+		{
+			last;
+		}
+	}
+
+	die "Latest checkpoint's NextMultiXactId not found in control file!\n"
+	unless defined($next);
+
+	die "Latest checkpoint's oldestMultiXid not found in control file!\n"
+	unless defined($oldest);
+
+	die "Latest checkpoint's NextMultiOffset not found in control file!\n"
+	unless defined($next_offset);
+
+	return ($oldest, $next, $next_offset);
+}
+
+# Create node from existing bins.
+sub create_new_node
+{
+	my ($name, %params) = @_;
+
+	create_node(0, @_);
+}
+
+# Create node from ENV oldinstall
+sub create_old_node
+{
+	my ($name, %params) = @_;
+
+	if (!defined($ENV{oldinstall}))
+	{
+		die "oldinstall is not defined";
+	}
+
+	create_node(1, @_);
+}
+
+sub create_node
+{
+	my ($install_path_from_env, $name, %params) = @_;
+	my $scale = defined $params{scale} ? $params{scale} : 1;
+	my $multi = defined $params{multi} ? $params{multi} : undef;
+	my $offset = defined $params{offset} ? $params{offset} : undef;
+
+	my $node =
+		$install_path_from_env ?
+			PostgreSQL::Test::Cluster->new($name,
+					install_path => $ENV{oldinstall}) :
+			PostgreSQL::Test::Cluster->new($name);
+
+	$node->init(force_initdb => 1,
+		extra => [
+			$multi ? ('-m', $multi) : (),
+			$offset ? ('-o', $offset) : (),
+		]);
+
+	# Fixup MOX patch quirk
+	if ($multi)
+	{
+		unlink $node->data_dir . '/pg_multixact/offsets/0000';
+	}
+	if ($offset)
+	{
+		unlink $node->data_dir . '/pg_multixact/members/0000';
+	}
+
+	$node->append_conf('postgresql.conf', 'fsync = off');
+	$node->append_conf('postgresql.conf', 'max_prepared_transactions = 2');
+
+	$node->start();
+	mxid_gen2($node, 'FOO', $scale);
+	mxid_gen1($node, 'BAR', $scale);
+	$node->restart();
+	$node->safe_psql('postgres', q(SELECT * FROM FOO));		# just in case...
+	$node->safe_psql('postgres', q(SELECT * FROM BAR));
+	$node->safe_psql('postgres', q(CHECKPOINT));
+	$node->stop();
+
+	return $node;
+}
+
+sub do_upgrade
+{
+	my ($oldnode, $newnode) = @_;
+
+	command_ok(
+		[
+			'pg_upgrade', '--no-sync',
+			'-d', $oldnode->data_dir,
+			'-D', $newnode->data_dir,
+			'-b', $oldnode->config_data('--bindir'),
+			'-B', $newnode->config_data('--bindir'),
+			'-s', $newnode->host,
+			'-p', $oldnode->port,
+			'-P', $newnode->port,
+			'--check'
+		],
+		'run of pg_upgrade --check');
+
+	command_ok(
+		[
+			'pg_upgrade', '--no-sync',
+			'-d', $oldnode->data_dir,
+			'-D', $newnode->data_dir,
+			'-b', $oldnode->config_data('--bindir'),
+			'-B', $newnode->config_data('--bindir'),
+			'-s', $newnode->host,
+			'-p', $oldnode->port,
+			'-P', $newnode->port,
+			'--copy'
+		],
+		'run of pg_upgrade');
+
+	$oldnode->start();
+	$newnode->start();
+
+	my $oldfoo = $oldnode->safe_psql('postgres', q(SELECT * FROM FOO));
+	my $newfoo = $newnode->safe_psql('postgres', q(SELECT * FROM FOO));
+	is($oldfoo, $newfoo, "select foo eq");
+
+	my $oldbar = $oldnode->safe_psql('postgres', q(SELECT * FROM BAR));
+	my $newbar = $newnode->safe_psql('postgres', q(SELECT * FROM BAR));
+	is($oldbar, $newbar, "select bar eq");
+
+	$oldnode->stop();
+	$newnode->stop();
+
+	multi_bounds($oldnode);
+	multi_bounds($newnode);
+}
+
+my @TESTS = (
+	# tests without ENV oldinstall
+	#0, 1, 2, 3, 4, 5, 6,
+	# tests with "real" pg_upgrade
+	#100, 101, 102, 103, 104, 105, 106,
+	# self upgrade
+	1000,
+);
+
+# =============================================================================
+# Basic sanity tests on a NEW bin
+# =============================================================================
+
+# start from zero
+SKIP:
+{
+	my $TEST_NO = 0;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $node = create_new_node('simple_mo',
+						scale => 1);
+	multi_bounds($node);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# multi starts from a given value
+SKIP:
+{
+	my $TEST_NO = 1;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $node = create_new_node('simple_Mo',
+						scale => 1.15,
+						multi => '0x123400');
+	multi_bounds($node);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# offsets start from a given value
+SKIP:
+{
+	my $TEST_NO = 2;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $node = create_new_node('simple_mO',
+						scale => 1.15,
+						offset => '0x432100');
+	multi_bounds($node);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# multi and offsets start from given values
+SKIP:
+{
+	my $TEST_NO = 3;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $node = create_new_node('simple_MO',
+						scale => 1.15,
+						multi => '0xDEAD00', offset => '0xBEEF00');
+	multi_bounds($node);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# multi starts from a given value and wraps around
+SKIP:
+{
+	my $TEST_NO = 4;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $node = create_new_node('simple_Mo_wrap',
+						scale => 1.15,
+						multi => '0xFFFF7000');
+	multi_bounds($node);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# offsets start from a given value and wrap around
+SKIP:
+{
+	my $TEST_NO = 5;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $node = create_new_node('simple_mO_wrap',
+						scale => 1.15,
+						offset => '0xFFFFFC00');
+	multi_bounds($node);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# multi and offsets start from given values,
+# and both wrap around
+SKIP:
+{
+	my $TEST_NO = 6;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $node = create_new_node('simple_MO_wrap',
+						scale => 1.15,
+						multi => '0xFFFF7000', offset => '0xFFFFFC00');
+	multi_bounds($node);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# =============================================================================
+# pg_upgrade tests
+# =============================================================================
+
+# start from zero
+SKIP:
+{
+	my $TEST_NO = 100;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $dbname = 'mo';
+	my $oldnode = create_old_node("old_$dbname",
+						scale => 1);
+	my $newnode = PostgreSQL::Test::Cluster->new("new_$dbname");
+	$newnode->init();
+
+	do_upgrade($oldnode, $newnode);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# multi starts from a given value
+SKIP:
+{
+	my $TEST_NO = 101;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $dbname = 'Mo';
+	my $oldnode = create_old_node("old_$dbname",
+						scale => 1.2,
+						multi => '0x123400');
+	my $newnode = PostgreSQL::Test::Cluster->new("new_$dbname");
+	$newnode->init();
+
+	do_upgrade($oldnode, $newnode);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# offsets start from a given value
+SKIP:
+{
+	my $TEST_NO = 102;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $dbname = 'mO';
+	my $oldnode = create_old_node("old_$dbname",
+						scale => 1.2,
+						offset => '0x432100');
+	my $newnode = PostgreSQL::Test::Cluster->new("new_$dbname");
+	$newnode->init();
+
+	do_upgrade($oldnode, $newnode);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# multi and offsets start from given values
+SKIP:
+{
+	my $TEST_NO = 103;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $dbname = 'MO';
+	my $oldnode = create_old_node("old_$dbname",
+						scale => 1.2,
+						multi => '0xDEAD00', offset => '0xBEEF00');
+	my $newnode = PostgreSQL::Test::Cluster->new("new_$dbname");
+	$newnode->init();
+
+	do_upgrade($oldnode, $newnode);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# multi starts from a given value and wraps around
+SKIP:
+{
+	my $TEST_NO = 104;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $dbname = 'Mo_wrap';
+	my $oldnode = create_old_node("old_$dbname",
+						scale => 1.2,
+						multi => '0xFFFF7000');
+	my $newnode = PostgreSQL::Test::Cluster->new("new_$dbname");
+	$newnode->init();
+
+	do_upgrade($oldnode, $newnode);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# offsets start from a given value and wrap around
+SKIP:
+{
+	my $TEST_NO = 105;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $dbname = 'mO_wrap';
+	my $oldnode = create_old_node("old_$dbname",
+						scale => 1.2,
+						offset => '0xFFFFFC00');
+	my $newnode = PostgreSQL::Test::Cluster->new("new_$dbname");
+	$newnode->init();
+
+	do_upgrade($oldnode, $newnode);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# multi and offsets start from given values,
+# and both wrap around
+SKIP:
+{
+	my $TEST_NO = 106;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $dbname = 'MO_wrap';
+	my $oldnode = create_old_node("old_$dbname",
+						scale => 1.2,
+						multi => '0xFFFF7000', offset => '0xFFFFFC00');
+	my $newnode = PostgreSQL::Test::Cluster->new("new_$dbname");
+	$newnode->init();
+
+	do_upgrade($oldnode, $newnode);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# =============================================================================
+# Self upgrade
+# =============================================================================
+
+# start from zero
+SKIP:
+{
+	my $TEST_NO = 1000;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $dbname = 'self_upgrade';
+	my $oldnode = create_new_node("old_$dbname",
+						scale => 1);
+	my $newnode = PostgreSQL::Test::Cluster->new("new_$dbname");
+	$newnode->init();
+
+	do_upgrade($oldnode, $newnode);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+done_testing();
-- 
2.43.0

#41Maxim Orlov
orlovmg@gmail.com
In reply to: Maxim Orlov (#40)
7 attachment(s)
Re: POC: make mxidoff 64 bits

Yet another rebase @ f5a987c0e5

--
Best regards,
Maxim Orlov.

Attachments:

v16-0001-Use-64-bit-format-output-for-multixact-offsets.patch (application/octet-stream)
From 9faa201c251420d89dae2d0284fa574acfe7fdc5 Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Wed, 7 Aug 2024 16:35:22 +0300
Subject: [PATCH v16 1/7] Use 64-bit format output for multixact offsets

Author: Maxim Orlov <orlovmg@gmail.com>
---
 src/backend/access/rmgrdesc/mxactdesc.c   |  4 ++--
 src/backend/access/rmgrdesc/xlogdesc.c    |  2 +-
 src/backend/access/transam/multixact.c    | 13 +++++++------
 src/backend/access/transam/xlogrecovery.c |  2 +-
 src/bin/pg_controldata/pg_controldata.c   |  2 +-
 src/bin/pg_resetwal/pg_resetwal.c         |  4 ++--
 6 files changed, 14 insertions(+), 13 deletions(-)

diff --git a/src/backend/access/rmgrdesc/mxactdesc.c b/src/backend/access/rmgrdesc/mxactdesc.c
index 3ca0582db3..052dd0a4ce 100644
--- a/src/backend/access/rmgrdesc/mxactdesc.c
+++ b/src/backend/access/rmgrdesc/mxactdesc.c
@@ -65,7 +65,7 @@ multixact_desc(StringInfo buf, XLogReaderState *record)
 		xl_multixact_create *xlrec = (xl_multixact_create *) rec;
 		int			i;
 
-		appendStringInfo(buf, "%u offset %u nmembers %d: ", xlrec->mid,
+		appendStringInfo(buf, "%u offset %" PRIu64 " nmembers %d: ", xlrec->mid,
 						 xlrec->moff, xlrec->nmembers);
 		for (i = 0; i < xlrec->nmembers; i++)
 			out_member(buf, &xlrec->members[i]);
@@ -74,7 +74,7 @@ multixact_desc(StringInfo buf, XLogReaderState *record)
 	{
 		xl_multixact_truncate *xlrec = (xl_multixact_truncate *) rec;
 
-		appendStringInfo(buf, "offsets [%u, %u), members [%u, %u)",
+		appendStringInfo(buf, "offsets [%u, %u), members [%" PRIu64 ", %" PRIu64 ")",
 						 xlrec->startTruncOff, xlrec->endTruncOff,
 						 xlrec->startTruncMemb, xlrec->endTruncMemb);
 	}
diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c
index cd6c2a2f65..441034f592 100644
--- a/src/backend/access/rmgrdesc/xlogdesc.c
+++ b/src/backend/access/rmgrdesc/xlogdesc.c
@@ -66,7 +66,7 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 		CheckPoint *checkpoint = (CheckPoint *) rec;
 
 		appendStringInfo(buf, "redo %X/%08X; "
-						 "tli %u; prev tli %u; fpw %s; wal_level %s; xid %u:%u; oid %u; multi %u; offset %u; "
+						 "tli %u; prev tli %u; fpw %s; wal_level %s; xid %u:%u; oid %u; multi %u; offset %" PRIu64 "; "
 						 "oldest xid %u in DB %u; oldest multi %u in DB %u; "
 						 "oldest/newest commit timestamp xid: %u/%u; "
 						 "oldest running xid %u; %s",
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 3cb09c3d59..3a1fb4746c 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1261,7 +1261,8 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 
 	LWLockRelease(MultiXactGenLock);
 
-	debug_elog4(DEBUG2, "GetNew: returning %u offset %u", result, *offset);
+	debug_elog4(DEBUG2, "GetNew: returning %u offset %" PRIu64, result,
+				*offset);
 	return result;
 }
 
@@ -2229,7 +2230,7 @@ MultiXactGetCheckptMulti(bool is_shutdown,
 	LWLockRelease(MultiXactGenLock);
 
 	debug_elog6(DEBUG2,
-				"MultiXact: checkpoint is nextMulti %u, nextOffset %u, oldestMulti %u in DB %u",
+				"MultiXact: checkpoint is nextMulti %u, nextOffset %" PRIu64 ", oldestMulti %u in DB %u",
 				*nextMulti, *nextMultiOffset, *oldestMulti, *oldestMultiDB);
 }
 
@@ -2264,7 +2265,7 @@ void
 MultiXactSetNextMXact(MultiXactId nextMulti,
 					  MultiXactOffset nextMultiOffset)
 {
-	debug_elog4(DEBUG2, "MultiXact: setting next multi to %u offset %u",
+	debug_elog4(DEBUG2, "MultiXact: setting next multi to %u offset %" PRIu64,
 				nextMulti, nextMultiOffset);
 	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
 	MultiXactState->nextMXact = nextMulti;
@@ -2455,7 +2456,7 @@ MultiXactAdvanceNextMXact(MultiXactId minMulti,
 	}
 	if (MultiXactOffsetPrecedes(MultiXactState->nextOffset, minMultiOffset))
 	{
-		debug_elog3(DEBUG2, "MultiXact: setting next offset to %u",
+		debug_elog3(DEBUG2, "MultiXact: setting next offset to %" PRIu64,
 					minMultiOffset);
 		MultiXactState->nextOffset = minMultiOffset;
 	}
@@ -3151,7 +3152,7 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 
 	elog(DEBUG1, "performing multixact truncation: "
 		 "offsets [%u, %u), offsets segments [%" PRIx64 ", %" PRIx64 "), "
-		 "members [%u, %u), members segments [%" PRIx64 ", %" PRIx64 ")",
+		 "members [%" PRIu64 ", %" PRIu64 "), members segments [%" PRIx64 ", %" PRIx64 ")",
 		 oldestMulti, newOldestMulti,
 		 MultiXactIdToOffsetSegment(oldestMulti),
 		 MultiXactIdToOffsetSegment(newOldestMulti),
@@ -3379,7 +3380,7 @@ multixact_redo(XLogReaderState *record)
 
 		elog(DEBUG1, "replaying multixact truncation: "
 			 "offsets [%u, %u), offsets segments [%" PRIx64 ", %" PRIx64 "), "
-			 "members [%u, %u), members segments [%" PRIx64 ", %" PRIx64 ")",
+			 "members [%" PRIu64 ", %" PRIu64 "), members segments [%" PRIx64 ", %" PRIx64 ")",
 			 xlrec.startTruncOff, xlrec.endTruncOff,
 			 MultiXactIdToOffsetSegment(xlrec.startTruncOff),
 			 MultiXactIdToOffsetSegment(xlrec.endTruncOff),
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 23878b2dd9..51dd52f127 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -884,7 +884,7 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 							 U64FromFullTransactionId(checkPoint.nextXid),
 							 checkPoint.nextOid)));
 	ereport(DEBUG1,
-			(errmsg_internal("next MultiXactId: %u; next MultiXactOffset: %u",
+			(errmsg_internal("next MultiXactId: %u; next MultiXactOffset: %" PRIu64,
 							 checkPoint.nextMulti, checkPoint.nextMultiOffset)));
 	ereport(DEBUG1,
 			(errmsg_internal("oldest unfrozen transaction ID: %u, in database %u",
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index 10de058ce9..5295108ade 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -264,7 +264,7 @@ main(int argc, char *argv[])
 		   ControlFile->checkPointCopy.nextOid);
 	printf(_("Latest checkpoint's NextMultiXactId:  %u\n"),
 		   ControlFile->checkPointCopy.nextMulti);
-	printf(_("Latest checkpoint's NextMultiOffset:  %u\n"),
+	printf(_("Latest checkpoint's NextMultiOffset:  %" PRIu64 "\n"),
 		   ControlFile->checkPointCopy.nextMultiOffset);
 	printf(_("Latest checkpoint's oldestXID:        %u\n"),
 		   ControlFile->checkPointCopy.oldestXid);
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index e876f35f38..df2d8f37cf 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -759,7 +759,7 @@ PrintControlValues(bool guessed)
 		   ControlFile.checkPointCopy.nextOid);
 	printf(_("Latest checkpoint's NextMultiXactId:  %u\n"),
 		   ControlFile.checkPointCopy.nextMulti);
-	printf(_("Latest checkpoint's NextMultiOffset:  %u\n"),
+	printf(_("Latest checkpoint's NextMultiOffset:  %" PRIu64 "\n"),
 		   ControlFile.checkPointCopy.nextMultiOffset);
 	printf(_("Latest checkpoint's oldestXID:        %u\n"),
 		   ControlFile.checkPointCopy.oldestXid);
@@ -833,7 +833,7 @@ PrintNewControlValues(void)
 
 	if (set_mxoff != -1)
 	{
-		printf(_("NextMultiOffset:                      %u\n"),
+		printf(_("NextMultiOffset:                      %" PRIu64 "\n"),
 			   ControlFile.checkPointCopy.nextMultiOffset);
 	}
 
-- 
2.49.0
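
As a side note on the patch above, here is a minimal standalone sketch (not part of the patch set) of why the `%u` conversions have to become `PRIu64` once the offset is 64 bits wide. It uses `<stdint.h>` types instead of PostgreSQL's own typedefs, and `format_offset` and `truncated_to_32` are hypothetical helpers introduced only for illustration.

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Stand-in for the PostgreSQL typedef; assumed to be 64 bits wide here. */
typedef uint64_t MultiXactOffset;

/*
 * Hypothetical helper: format an offset with PRIu64, as the patched
 * messages do.  Returns the number of characters written (excluding NUL).
 */
static int
format_offset(char *buf, size_t buflen, MultiXactOffset off)
{
	assert(buf != NULL && buflen > 0);
	return snprintf(buf, buflen, "offset %" PRIu64, off);
}

/*
 * Hypothetical helper: what a reader would see if the value were narrowed
 * to 32 bits before printing.  Anything at or above 2^32 loses its high bits.
 */
static uint32_t
truncated_to_32(MultiXactOffset off)
{
	return (uint32_t) off;
}
```

With an offset of 2^32, `format_offset` yields the full value while `truncated_to_32` yields 0, which is the kind of silent loss the format change is meant to avoid.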

v16-0002-Use-64-bit-multixact-offsets.patch (application/octet-stream)
From c8d9ccafc847cdc6e02fa14b2af4dcf6f8b661de Mon Sep 17 00:00:00 2001
From: Maxim Orlov <m.orlov@postgrespro.ru>
Date: Wed, 6 Mar 2024 11:11:33 +0300
Subject: [PATCH v16 2/7] Use 64-bit multixact offsets.

Author: Maxim Orlov <orlovmg@gmail.com>
---
 src/backend/access/transam/multixact.c | 170 +------------------------
 src/bin/pg_resetwal/pg_resetwal.c      |   2 +-
 src/bin/pg_resetwal/t/001_basic.pl     |   2 +-
 src/include/access/multixact.h         |   2 +-
 src/include/c.h                        |   2 +-
 5 files changed, 10 insertions(+), 168 deletions(-)

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 3a1fb4746c..e62cb523d4 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -96,14 +96,6 @@
 /*
  * Defines for MultiXactOffset page sizes.  A page is the same BLCKSZ as is
  * used everywhere else in Postgres.
- *
- * Note: because MultiXactOffsets are 32 bits and wrap around at 0xFFFFFFFF,
- * MultiXact page numbering also wraps around at
- * 0xFFFFFFFF/MULTIXACT_OFFSETS_PER_PAGE, and segment numbering at
- * 0xFFFFFFFF/MULTIXACT_OFFSETS_PER_PAGE/SLRU_PAGES_PER_SEGMENT.  We need
- * take no explicit notice of that fact in this module, except when comparing
- * segment and page numbers in TruncateMultiXact (see
- * MultiXactOffsetPagePrecedes).
  */
 
 /* We need four bytes per offset */
@@ -272,9 +264,6 @@ typedef struct MultiXactStateData
 	MultiXactId multiStopLimit;
 	MultiXactId multiWrapLimit;
 
-	/* support for members anti-wraparound measures */
-	MultiXactOffset offsetStopLimit;	/* known if oldestOffsetKnown */
-
 	/*
 	 * This is used to sleep until a multixact offset is written when we want
 	 * to create the next one.
@@ -407,8 +396,6 @@ static bool MultiXactOffsetPrecedes(MultiXactOffset offset1,
 									MultiXactOffset offset2);
 static void ExtendMultiXactOffset(MultiXactId multi);
 static void ExtendMultiXactMember(MultiXactOffset offset, int nmembers);
-static bool MultiXactOffsetWouldWrap(MultiXactOffset boundary,
-									 MultiXactOffset start, uint32 distance);
 static bool SetOffsetVacuumLimit(bool is_startup);
 static bool find_multixact_start(MultiXactId multi, MultiXactOffset *result);
 static void WriteMTruncateXlogRec(Oid oldestMultiDB,
@@ -1161,78 +1148,6 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 	else
 		*offset = nextOffset;
 
-	/*----------
-	 * Protect against overrun of the members space as well, with the
-	 * following rules:
-	 *
-	 * If we're past offsetStopLimit, refuse to generate more multis.
-	 * If we're close to offsetStopLimit, emit a warning.
-	 *
-	 * Arbitrarily, we start emitting warnings when we're 20 segments or less
-	 * from offsetStopLimit.
-	 *
-	 * Note we haven't updated the shared state yet, so if we fail at this
-	 * point, the multixact ID we grabbed can still be used by the next guy.
-	 *
-	 * Note that there is no point in forcing autovacuum runs here: the
-	 * multixact freeze settings would have to be reduced for that to have any
-	 * effect.
-	 *----------
-	 */
-#define OFFSET_WARN_SEGMENTS	20
-	if (MultiXactState->oldestOffsetKnown &&
-		MultiXactOffsetWouldWrap(MultiXactState->offsetStopLimit, nextOffset,
-								 nmembers))
-	{
-		/* see comment in the corresponding offsets wraparound case */
-		SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER);
-
-		ereport(ERROR,
-				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg("multixact \"members\" limit exceeded"),
-				 errdetail_plural("This command would create a multixact with %u members, but the remaining space is only enough for %u member.",
-								  "This command would create a multixact with %u members, but the remaining space is only enough for %u members.",
-								  MultiXactState->offsetStopLimit - nextOffset - 1,
-								  nmembers,
-								  MultiXactState->offsetStopLimit - nextOffset - 1),
-				 errhint("Execute a database-wide VACUUM in database with OID %u with reduced \"vacuum_multixact_freeze_min_age\" and \"vacuum_multixact_freeze_table_age\" settings.",
-						 MultiXactState->oldestMultiXactDB)));
-	}
-
-	/*
-	 * Check whether we should kick autovacuum into action, to prevent members
-	 * wraparound. NB we use a much larger window to trigger autovacuum than
-	 * just the warning limit. The warning is just a measure of last resort -
-	 * this is in line with GetNewTransactionId's behaviour.
-	 */
-	if (!MultiXactState->oldestOffsetKnown ||
-		(MultiXactState->nextOffset - MultiXactState->oldestOffset
-		 > MULTIXACT_MEMBER_SAFE_THRESHOLD))
-	{
-		/*
-		 * To avoid swamping the postmaster with signals, we issue the autovac
-		 * request only when crossing a segment boundary. With default
-		 * compilation settings that's roughly after 50k members.  This still
-		 * gives plenty of chances before we get into real trouble.
-		 */
-		if ((MXOffsetToMemberPage(nextOffset) / SLRU_PAGES_PER_SEGMENT) !=
-			(MXOffsetToMemberPage(nextOffset + nmembers) / SLRU_PAGES_PER_SEGMENT))
-			SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER);
-	}
-
-	if (MultiXactState->oldestOffsetKnown &&
-		MultiXactOffsetWouldWrap(MultiXactState->offsetStopLimit,
-								 nextOffset,
-								 nmembers + MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT * OFFSET_WARN_SEGMENTS))
-		ereport(WARNING,
-				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg_plural("database with OID %u must be vacuumed before %d more multixact member is used",
-							   "database with OID %u must be vacuumed before %d more multixact members are used",
-							   MultiXactState->offsetStopLimit - nextOffset + nmembers,
-							   MultiXactState->oldestMultiXactDB,
-							   MultiXactState->offsetStopLimit - nextOffset + nmembers),
-				 errhint("Execute a database-wide VACUUM in that database with reduced \"vacuum_multixact_freeze_min_age\" and \"vacuum_multixact_freeze_table_age\" settings.")));
-
 	ExtendMultiXactMember(nextOffset, nmembers);
 
 	/*
@@ -2660,8 +2575,6 @@ SetOffsetVacuumLimit(bool is_startup)
 	MultiXactOffset nextOffset;
 	bool		oldestOffsetKnown = false;
 	bool		prevOldestOffsetKnown;
-	MultiXactOffset offsetStopLimit = 0;
-	MultiXactOffset prevOffsetStopLimit;
 
 	/*
 	 * NB: Have to prevent concurrent truncation, we might otherwise try to
@@ -2676,7 +2589,6 @@ SetOffsetVacuumLimit(bool is_startup)
 	nextOffset = MultiXactState->nextOffset;
 	prevOldestOffsetKnown = MultiXactState->oldestOffsetKnown;
 	prevOldestOffset = MultiXactState->oldestOffset;
-	prevOffsetStopLimit = MultiXactState->offsetStopLimit;
 	Assert(MultiXactState->finishedStartup);
 	LWLockRelease(MultiXactGenLock);
 
@@ -2707,11 +2619,7 @@ SetOffsetVacuumLimit(bool is_startup)
 		oldestOffsetKnown =
 			find_multixact_start(oldestMultiXactId, &oldestOffset);
 
-		if (oldestOffsetKnown)
-			ereport(DEBUG1,
-					(errmsg_internal("oldest MultiXactId member is at offset %u",
-									 oldestOffset)));
-		else
+		if (!oldestOffsetKnown)
 			ereport(LOG,
 					(errmsg("MultiXact member wraparound protections are disabled because oldest checkpointed MultiXact %u does not exist on disk",
 							oldestMultiXactId)));
@@ -2724,24 +2632,7 @@ SetOffsetVacuumLimit(bool is_startup)
 	 * overrun of old data in the members SLRU area. We can only do so if the
 	 * oldest offset is known though.
 	 */
-	if (oldestOffsetKnown)
-	{
-		/* move back to start of the corresponding segment */
-		offsetStopLimit = oldestOffset - (oldestOffset %
-										  (MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT));
-
-		/* always leave one segment before the wraparound point */
-		offsetStopLimit -= (MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT);
-
-		if (!prevOldestOffsetKnown && !is_startup)
-			ereport(LOG,
-					(errmsg("MultiXact member wraparound protections are now enabled")));
-
-		ereport(DEBUG1,
-				(errmsg_internal("MultiXact member stop limit is now %u based on MultiXact %u",
-								 offsetStopLimit, oldestMultiXactId)));
-	}
-	else if (prevOldestOffsetKnown)
+	if (prevOldestOffsetKnown)
 	{
 		/*
 		 * If we failed to get the oldest offset this time, but we have a
@@ -2751,14 +2642,12 @@ SetOffsetVacuumLimit(bool is_startup)
 		 */
 		oldestOffset = prevOldestOffset;
 		oldestOffsetKnown = true;
-		offsetStopLimit = prevOffsetStopLimit;
 	}
 
 	/* Install the computed values */
 	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
 	MultiXactState->oldestOffset = oldestOffset;
 	MultiXactState->oldestOffsetKnown = oldestOffsetKnown;
-	MultiXactState->offsetStopLimit = offsetStopLimit;
 	LWLockRelease(MultiXactGenLock);
 
 	/*
@@ -2768,54 +2657,6 @@ SetOffsetVacuumLimit(bool is_startup)
 		(nextOffset - oldestOffset > MULTIXACT_MEMBER_SAFE_THRESHOLD);
 }
 
-/*
- * Return whether adding "distance" to "start" would move past "boundary".
- *
- * We use this to determine whether the addition is "wrapping around" the
- * boundary point, hence the name.  The reason we don't want to use the regular
- * 2^31-modulo arithmetic here is that we want to be able to use the whole of
- * the 2^32-1 space here, allowing for more multixacts than would fit
- * otherwise.
- */
-static bool
-MultiXactOffsetWouldWrap(MultiXactOffset boundary, MultiXactOffset start,
-						 uint32 distance)
-{
-	MultiXactOffset finish;
-
-	/*
-	 * Note that offset number 0 is not used (see GetMultiXactIdMembers), so
-	 * if the addition wraps around the UINT_MAX boundary, skip that value.
-	 */
-	finish = start + distance;
-	if (finish < start)
-		finish++;
-
-	/*-----------------------------------------------------------------------
-	 * When the boundary is numerically greater than the starting point, any
-	 * value numerically between the two is not wrapped:
-	 *
-	 *	<----S----B---->
-	 *	[---)			 = F wrapped past B (and UINT_MAX)
-	 *		 [---)		 = F not wrapped
-	 *			  [----] = F wrapped past B
-	 *
-	 * When the boundary is numerically less than the starting point (i.e. the
-	 * UINT_MAX wraparound occurs somewhere in between) then all values in
-	 * between are wrapped:
-	 *
-	 *	<----B----S---->
-	 *	[---)			 = F not wrapped past B (but wrapped past UINT_MAX)
-	 *		 [---)		 = F wrapped past B (and UINT_MAX)
-	 *			  [----] = F not wrapped
-	 *-----------------------------------------------------------------------
-	 */
-	if (start < boundary)
-		return finish >= boundary || finish < start;
-	else
-		return finish >= boundary && finish < start;
-}
-
 /*
  * Find the starting offset of the given MultiXactId.
  *
@@ -2937,8 +2778,9 @@ MultiXactMemberFreezeThreshold(void)
 	 * we try to eliminate from the system is based on how far we are past
 	 * MULTIXACT_MEMBER_SAFE_THRESHOLD.
 	 */
-	fraction = (double) (members - MULTIXACT_MEMBER_SAFE_THRESHOLD) /
-		(MULTIXACT_MEMBER_DANGER_THRESHOLD - MULTIXACT_MEMBER_SAFE_THRESHOLD);
+	fraction = (double) (members - MULTIXACT_MEMBER_SAFE_THRESHOLD);
+	fraction /= (double) (MULTIXACT_MEMBER_DANGER_THRESHOLD - MULTIXACT_MEMBER_SAFE_THRESHOLD);
+
 	victim_multixacts = multixacts * fraction;
 
 	/* fraction could be > 1.0, but lowest possible freeze age is zero */
@@ -3283,7 +3125,7 @@ MultiXactIdPrecedesOrEquals(MultiXactId multi1, MultiXactId multi2)
 static bool
 MultiXactOffsetPrecedes(MultiXactOffset offset1, MultiXactOffset offset2)
 {
-	int32		diff = (int32) (offset1 - offset2);
+	int64		diff = (int64) (offset1 - offset2);
 
 	return (diff < 0);
 }
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index df2d8f37cf..9252160280 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -266,7 +266,7 @@ main(int argc, char *argv[])
 
 			case 'O':
 				errno = 0;
-				set_mxoff = strtoul(optarg, &endptr, 0);
+				set_mxoff = strtou64(optarg, &endptr, 0);
 				if (endptr == optarg || *endptr != '\0' || errno != 0)
 				{
 					pg_log_error("invalid argument for option %s", "-O");
diff --git a/src/bin/pg_resetwal/t/001_basic.pl b/src/bin/pg_resetwal/t/001_basic.pl
index d6bbbd0ced..cc89e0764a 100644
--- a/src/bin/pg_resetwal/t/001_basic.pl
+++ b/src/bin/pg_resetwal/t/001_basic.pl
@@ -213,7 +213,7 @@ push @cmd,
   sprintf("%d,%d", hex($files[0]) == 0 ? 3 : hex($files[0]), hex($files[-1]));
 
 @files = get_slru_files('pg_multixact/offsets');
-$mult = 32 * $blcksz / 4;
+$mult = 32 * $blcksz / 8;
 # --multixact-ids argument is "new,old"
 push @cmd,
   '--multixact-ids' => sprintf("%d,%d",
diff --git a/src/include/access/multixact.h b/src/include/access/multixact.h
index b876e98f46..f143e1d116 100644
--- a/src/include/access/multixact.h
+++ b/src/include/access/multixact.h
@@ -28,7 +28,7 @@
 
 #define MultiXactIdIsValid(multi) ((multi) != InvalidMultiXactId)
 
-#define MaxMultiXactOffset	((MultiXactOffset) 0xFFFFFFFF)
+#define MaxMultiXactOffset	UINT64CONST(0xFFFFFFFFFFFFFFFF)
 
 /*
  * Possible multixact lock modes ("status").  The first four modes are for
diff --git a/src/include/c.h b/src/include/c.h
index 04fd23577d..efa6e099f8 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -613,7 +613,7 @@ typedef uint32 SubTransactionId;
 /* MultiXactId must be equivalent to TransactionId, to fit in t_xmax */
 typedef TransactionId MultiXactId;
 
-typedef uint32 MultiXactOffset;
+typedef uint64 MultiXactOffset;
 
 typedef uint32 CommandId;
 
-- 
2.49.0
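
Two details of the patch above can be illustrated with a small standalone sketch (again, not part of the patch set): the modular "precedes" comparison widened to 64 bits, and the number of offsets that fit in one SLRU segment once each entry takes 8 bytes. The names `offset_precedes` and `offsets_per_segment` are hypothetical, and the constant 32 is taken from the `$mult = 32 * $blcksz / 8` line in the test change, on the assumption that it is the number of pages per segment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for the PostgreSQL typedef after the change (64 bits wide). */
typedef uint64_t MultiXactOffset;

/*
 * Same shape as the patched MultiXactOffsetPrecedes: subtract modulo 2^64
 * and look at the sign.  The unsigned-to-signed conversion assumes a
 * two's-complement platform.
 */
static bool
offset_precedes(MultiXactOffset a, MultiXactOffset b)
{
	int64_t		diff = (int64_t) (a - b);

	return diff < 0;
}

/*
 * Offsets stored in one SLRU segment: pages per segment (assumed 32)
 * times the block size, divided by the size of one entry.
 */
static uint64_t
offsets_per_segment(uint64_t blcksz)
{
	assert(blcksz % sizeof(MultiXactOffset) == 0);
	return 32 * blcksz / sizeof(MultiXactOffset);
}
```

For an 8 kB block size this gives 32768 entries per segment, half of what a 4-byte entry would allow, which is why the multiplier in `001_basic.pl` changes from `/ 4` to `/ 8`.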

v16-0004-Get-rid-of-MultiXactMemberFreezeThreshold-call.patch (application/octet-stream)
From bae3329f59dc998817f18a395da4c8119923529d Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Wed, 23 Oct 2024 18:23:39 +0300
Subject: [PATCH v16 4/7] Get rid of MultiXactMemberFreezeThreshold call.

Since MaxMultiXactOffset is now UINT64_MAX, the MULTIXACT_MEMBER_SAFE_THRESHOLD
and MULTIXACT_MEMBER_DANGER_THRESHOLD values are no longer meaningful. Thus,
MultiXactMemberFreezeThreshold is no longer needed either.

Instead, switch to a MULTIXACT_MEMBER_AUTOVAC_THRESHOLD of 2^32 - 1 members.
It is used to determine whether we need to force autovacuum.

Author: Maxim Orlov <orlovmg@gmail.com>
---
 src/backend/access/transam/multixact.c | 117 +++----------------------
 src/backend/commands/vacuum.c          |   2 +-
 src/backend/postmaster/autovacuum.c    |   4 +-
 src/include/access/multixact.h         |   1 -
 4 files changed, 15 insertions(+), 109 deletions(-)

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index e62cb523d4..6e7a099617 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -204,10 +204,14 @@ MXOffsetToMemberOffset(MultiXactOffset offset)
 		member_in_group * sizeof(TransactionId);
 }
 
-/* Multixact members wraparound thresholds. */
-#define MULTIXACT_MEMBER_SAFE_THRESHOLD		(MaxMultiXactOffset / 2)
-#define MULTIXACT_MEMBER_DANGER_THRESHOLD	\
-	(MaxMultiXactOffset - MaxMultiXactOffset / 4)
+/*
+ * Multixact members warning threshold.
+ *
+ * If the difference between nextOffset and oldestOffset exceeds this value,
+ * we trigger autovacuum in order to release disk space and, if possible,
+ * reduce table bloat.
+ */
+#define MULTIXACT_MEMBER_AUTOVAC_THRESHOLD		UINT64CONST(0xFFFFFFFF)
 
 static inline MultiXactId
 PreviousMultiXactId(MultiXactId multi)
@@ -2555,15 +2559,13 @@ GetOldestMultiXactId(void)
 }
 
 /*
- * Determine how aggressively we need to vacuum in order to prevent member
- * wraparound.
+ * Determine whether a vacuum is needed for multixact members.
  *
  * To do so determine what's the oldest member offset and install the limit
  * info in MultiXactState, where it can be used to prevent overrun of old data
  * in the members SLRU area.
  *
- * The return value is true if emergency autovacuum is required and false
- * otherwise.
+ * The return value is true if autovacuum is required and false otherwise.
  */
 static bool
 SetOffsetVacuumLimit(bool is_startup)
@@ -2651,10 +2653,10 @@ SetOffsetVacuumLimit(bool is_startup)
 	LWLockRelease(MultiXactGenLock);
 
 	/*
-	 * Do we need an emergency autovacuum?	If we're not sure, assume yes.
+	 * Do we need autovacuum?	If we're not sure, assume yes.
 	 */
 	return !oldestOffsetKnown ||
-		(nextOffset - oldestOffset > MULTIXACT_MEMBER_SAFE_THRESHOLD);
+		(nextOffset - oldestOffset > MULTIXACT_MEMBER_AUTOVAC_THRESHOLD);
 }
 
 /*
@@ -2700,101 +2702,6 @@ find_multixact_start(MultiXactId multi, MultiXactOffset *result)
 	return true;
 }
 
-/*
- * Determine how many multixacts, and how many multixact members, currently
- * exist.  Return false if unable to determine.
- */
-static bool
-ReadMultiXactCounts(uint32 *multixacts, MultiXactOffset *members)
-{
-	MultiXactOffset nextOffset;
-	MultiXactOffset oldestOffset;
-	MultiXactId oldestMultiXactId;
-	MultiXactId nextMultiXactId;
-	bool		oldestOffsetKnown;
-
-	LWLockAcquire(MultiXactGenLock, LW_SHARED);
-	nextOffset = MultiXactState->nextOffset;
-	oldestMultiXactId = MultiXactState->oldestMultiXactId;
-	nextMultiXactId = MultiXactState->nextMXact;
-	oldestOffset = MultiXactState->oldestOffset;
-	oldestOffsetKnown = MultiXactState->oldestOffsetKnown;
-	LWLockRelease(MultiXactGenLock);
-
-	if (!oldestOffsetKnown)
-		return false;
-
-	*members = nextOffset - oldestOffset;
-	*multixacts = nextMultiXactId - oldestMultiXactId;
-	return true;
-}
-
-/*
- * Multixact members can be removed once the multixacts that refer to them
- * are older than every datminmxid.  autovacuum_multixact_freeze_max_age and
- * vacuum_multixact_freeze_table_age work together to make sure we never have
- * too many multixacts; we hope that, at least under normal circumstances,
- * this will also be sufficient to keep us from using too many offsets.
- * However, if the average multixact has many members, we might exhaust the
- * members space while still using few enough members that these limits fail
- * to trigger relminmxid advancement by VACUUM.  At that point, we'd have no
- * choice but to start failing multixact-creating operations with an error.
- *
- * To prevent that, if more than a threshold portion of the members space is
- * used, we effectively reduce autovacuum_multixact_freeze_max_age and
- * to a value just less than the number of multixacts in use.  We hope that
- * this will quickly trigger autovacuuming on the table or tables with the
- * oldest relminmxid, thus allowing datminmxid values to advance and removing
- * some members.
- *
- * As the fraction of the member space currently in use grows, we become
- * more aggressive in clamping this value.  That not only causes autovacuum
- * to ramp up, but also makes any manual vacuums the user issues more
- * aggressive.  This happens because vacuum_get_cutoffs() will clamp the
- * freeze table and the minimum freeze age cutoffs based on the effective
- * autovacuum_multixact_freeze_max_age this function returns.  In the worst
- * case, we'll claim the freeze_max_age to zero, and every vacuum of any
- * table will freeze every multixact.
- */
-int
-MultiXactMemberFreezeThreshold(void)
-{
-	MultiXactOffset members;
-	uint32		multixacts;
-	uint32		victim_multixacts;
-	double		fraction;
-	int			result;
-
-	/* If we can't determine member space utilization, assume the worst. */
-	if (!ReadMultiXactCounts(&multixacts, &members))
-		return 0;
-
-	/* If member space utilization is low, no special action is required. */
-	if (members <= MULTIXACT_MEMBER_SAFE_THRESHOLD)
-		return autovacuum_multixact_freeze_max_age;
-
-	/*
-	 * Compute a target for relminmxid advancement.  The number of multixacts
-	 * we try to eliminate from the system is based on how far we are past
-	 * MULTIXACT_MEMBER_SAFE_THRESHOLD.
-	 */
-	fraction = (double) (members - MULTIXACT_MEMBER_SAFE_THRESHOLD);
-	fraction /= (double) (MULTIXACT_MEMBER_DANGER_THRESHOLD - MULTIXACT_MEMBER_SAFE_THRESHOLD);
-
-	victim_multixacts = multixacts * fraction;
-
-	/* fraction could be > 1.0, but lowest possible freeze age is zero */
-	if (victim_multixacts > multixacts)
-		return 0;
-	result = multixacts - victim_multixacts;
-
-	/*
-	 * Clamp to autovacuum_multixact_freeze_max_age, so that we never make
-	 * autovacuum less aggressive than it would otherwise be.
-	 */
-	return Min(result, autovacuum_multixact_freeze_max_age);
-}
-
 typedef struct mxtruncinfo
 {
 	int64		earliestExistingPage;
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 733ef40ae7..8f5092670b 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -1153,7 +1153,7 @@ vacuum_get_cutoffs(Relation rel, const VacuumParams params,
 	 * normally autovacuum_multixact_freeze_max_age, but may be less if we are
 	 * short of multixact member space.
 	 */
-	effective_multixact_freeze_max_age = MultiXactMemberFreezeThreshold();
+	effective_multixact_freeze_max_age = autovacuum_multixact_freeze_max_age;
 
 	/*
 	 * Almost ready to set freeze output parameters; check if OldestXmin or
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 9474095f27..9379cd6b8a 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -1137,7 +1137,7 @@ do_start_worker(void)
 
 	/* Also determine the oldest datminmxid we will consider. */
 	recentMulti = ReadNextMultiXactId();
-	multiForceLimit = recentMulti - MultiXactMemberFreezeThreshold();
+	multiForceLimit = recentMulti - autovacuum_multixact_freeze_max_age;
 	if (multiForceLimit < FirstMultiXactId)
 		multiForceLimit -= FirstMultiXactId;
 
@@ -1925,7 +1925,7 @@ do_autovacuum(void)
 	 * normally autovacuum_multixact_freeze_max_age, but may be less if we are
 	 * short of multixact member space.
 	 */
-	effective_multixact_freeze_max_age = MultiXactMemberFreezeThreshold();
+	effective_multixact_freeze_max_age = autovacuum_multixact_freeze_max_age;
 
 	/*
 	 * Find the pg_database entry and select the default freeze ages. We use
diff --git a/src/include/access/multixact.h b/src/include/access/multixact.h
index f143e1d116..0d699831f1 100644
--- a/src/include/access/multixact.h
+++ b/src/include/access/multixact.h
@@ -144,7 +144,6 @@ extern void MultiXactSetNextMXact(MultiXactId nextMulti,
 extern void MultiXactAdvanceNextMXact(MultiXactId minMulti,
 									  MultiXactOffset minMultiOffset);
 extern void MultiXactAdvanceOldest(MultiXactId oldestMulti, Oid oldestMultiDB);
-extern int	MultiXactMemberFreezeThreshold(void);
 
 extern void multixact_twophase_recover(FullTransactionId fxid, uint16 info,
 									   void *recdata, uint32 len);
-- 
2.49.0

Attachment: v16-0005-TEST-initdb-option-to-initialize-cluster-with-no.patch (application/octet-stream)
From 3edcb71f21a655d16949de3cb23a7f6d5402c605 Mon Sep 17 00:00:00 2001
From: Maxim Orlov <m.orlov@postgrespro.ru>
Date: Wed, 4 May 2022 15:53:36 +0300
Subject: [PATCH v16 5/7] TEST: initdb option to initialize cluster with
 non-standard xid/mxid/mxoff

Until now, testing database cluster wraparound has not been easy, because
initdb always initializes the cluster with the default xid/mxid/mxoff. An
option to specify any valid xid/mxid/mxoff at cluster creation makes this easier.

Author: Maxim Orlov <orlovmg@gmail.com>
Author: Pavel Borisov <pashkin.elfe@gmail.com>
Author: Svetlana Derevyanko <s.derevyanko@postgrespro.ru>
Discussion: https://www.postgresql.org/message-id/flat/CACG%3Dezaa4vqYjJ16yoxgrpa-%3DgXnf0Vv3Ey9bjGrRRFN2YyWFQ%40mail.gmail.com
---
 src/backend/access/transam/clog.c      |   6 ++
 src/backend/access/transam/multixact.c |  10 +++
 src/backend/access/transam/subtrans.c  |   6 ++
 src/backend/access/transam/xlog.c      |  15 ++--
 src/backend/bootstrap/bootstrap.c      |  50 +++++++++++-
 src/backend/main/main.c                |   6 ++
 src/backend/postmaster/postmaster.c    |  14 +++-
 src/backend/tcop/postgres.c            |  53 +++++++++++-
 src/bin/initdb/initdb.c                | 107 ++++++++++++++++++++++++-
 src/bin/initdb/t/001_initdb.pl         |  60 ++++++++++++++
 src/include/access/xlog.h              |   3 +
 src/include/c.h                        |   4 +
 src/include/catalog/pg_class.h         |   2 +-
 13 files changed, 324 insertions(+), 12 deletions(-)
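
The option handling added in this patch follows one pattern for -m, -o and -x:
parse the argument as an unsigned integer, then reject empty input, trailing
garbage, overflow and out-of-range values. A minimal standalone sketch of that
pattern is shown below. The helper name parse_start_value() and the constant
START_VALUE_MAX are hypothetical, and the standard strtoull() is used here in
place of the strtou64() call in the patch.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Same upper bound as the Start*IsValid() macros in this patch. */
#define START_VALUE_MAX UINT64_C(0xFFFFFFFF)

/*
 * Hypothetical helper: parse a start value from a command-line argument.
 * Returns true and stores the value in *result on success.
 */
static bool
parse_start_value(const char *arg, uint64_t *result)
{
    char       *endptr;
    unsigned long long val;

    errno = 0;
    val = strtoull(arg, &endptr, 0);    /* base 0: decimal, octal or hex */

    /* Reject empty input, trailing garbage, overflow and out-of-range values. */
    if (endptr == arg || *endptr != '\0' || errno != 0 ||
        val > START_VALUE_MAX)
        return false;

    *result = (uint64_t) val;
    return true;
}
```

Checking the range on the 64-bit parsed value, before it is assigned to a
32-bit variable, keeps an oversized argument from being truncated silently.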

diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index e80fbe109c..bff6926bf7 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -830,8 +830,14 @@ check_transaction_buffers(int *newval, void **extra, GucSource source)
 void
 BootStrapCLOG(void)
 {
+	int64	pageno;
+
 	/* Zero the initial page and flush it to disk */
 	SimpleLruZeroAndWritePage(XactCtl, 0);
+
+	pageno = TransactionIdToPage(XidFromFullTransactionId(TransamVariables->nextXid));
+	if (pageno)
+		SimpleLruZeroAndWritePage(XactCtl, pageno);
 }
 
 /*
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 6e7a099617..9b6eeee1c9 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1950,9 +1950,19 @@ check_multixact_member_buffers(int *newval, void **extra, GucSource source)
 void
 BootStrapMultiXact(void)
 {
+	int64	pageno;
+
 	/* Zero the initial pages and flush them to disk */
 	SimpleLruZeroAndWritePage(MultiXactOffsetCtl, 0);
 	SimpleLruZeroAndWritePage(MultiXactMemberCtl, 0);
+
+	pageno = MultiXactIdToOffsetPage(MultiXactState->nextMXact);
+	if (pageno)
+		SimpleLruZeroAndWritePage(MultiXactOffsetCtl, pageno);
+
+	pageno = MXOffsetToMemberPage(MultiXactState->nextOffset);
+	if (pageno)
+		SimpleLruZeroAndWritePage(MultiXactMemberCtl, pageno);
 }
 
 /*
diff --git a/src/backend/access/transam/subtrans.c b/src/backend/access/transam/subtrans.c
index 09aace9e09..e73e0c3a0f 100644
--- a/src/backend/access/transam/subtrans.c
+++ b/src/backend/access/transam/subtrans.c
@@ -268,8 +268,14 @@ check_subtrans_buffers(int *newval, void **extra, GucSource source)
 void
 BootStrapSUBTRANS(void)
 {
+	int64	pageno;
+
 	/* Zero the initial page and flush it to disk */
 	SimpleLruZeroAndWritePage(SubTransCtl, 0);
+
+	pageno = TransactionIdToPage(XidFromFullTransactionId(TransamVariables->nextXid));
+	if (pageno)
+		SimpleLruZeroAndWritePage(SubTransCtl, pageno);
 }
 
 /*
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index a8cc6402d6..d75f332548 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -136,6 +136,10 @@ int			max_slot_wal_keep_size_mb = -1;
 int			wal_decode_buffer_size = 512 * 1024;
 bool		track_wal_io_timing = false;
 
+TransactionId		start_xid = FirstNormalTransactionId;
+MultiXactId			start_mxid = FirstMultiXactId;
+MultiXactOffset		start_mxoff = 0;
+
 #ifdef WAL_DEBUG
 bool		XLOG_DEBUG = false;
 #endif
@@ -5265,13 +5269,14 @@ BootStrapXLOG(uint32 data_checksum_version)
 	checkPoint.fullPageWrites = fullPageWrites;
 	checkPoint.wal_level = wal_level;
 	checkPoint.nextXid =
-		FullTransactionIdFromEpochAndXid(0, FirstNormalTransactionId);
+		FullTransactionIdFromEpochAndXid(0, Max(FirstNormalTransactionId,
+												start_xid));
 	checkPoint.nextOid = FirstGenbkiObjectId;
-	checkPoint.nextMulti = FirstMultiXactId;
-	checkPoint.nextMultiOffset = 0;
-	checkPoint.oldestXid = FirstNormalTransactionId;
+	checkPoint.nextMulti = Max(FirstMultiXactId, start_mxid);
+	checkPoint.nextMultiOffset = start_mxoff;
+	checkPoint.oldestXid = XidFromFullTransactionId(checkPoint.nextXid);
 	checkPoint.oldestXidDB = Template1DbOid;
-	checkPoint.oldestMulti = FirstMultiXactId;
+	checkPoint.oldestMulti = checkPoint.nextMulti;
 	checkPoint.oldestMultiDB = Template1DbOid;
 	checkPoint.oldestCommitTsXid = InvalidTransactionId;
 	checkPoint.newestCommitTsXid = InvalidTransactionId;
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index fc8638c1b6..51fc1106fe 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -220,7 +220,7 @@ BootstrapModeMain(int argc, char *argv[], bool check_only)
 	argv++;
 	argc--;
 
-	while ((flag = getopt(argc, argv, "B:c:d:D:Fkr:X:-:")) != -1)
+	while ((flag = getopt(argc, argv, "B:c:d:D:Fkm:o:r:X:x:-:")) != -1)
 	{
 		switch (flag)
 		{
@@ -288,12 +288,60 @@ BootstrapModeMain(int argc, char *argv[], bool check_only)
 			case 'k':
 				bootstrap_data_checksum_version = PG_DATA_CHECKSUM_VERSION;
 				break;
+			case 'm':
+				{
+					char	   *endptr;
+
+					errno = 0;
+					start_mxid = strtou64(optarg, &endptr, 0);
+
+					if (endptr == optarg || *endptr != '\0' || errno != 0 ||
+						!StartMultiXactIdIsValid(start_mxid))
+					{
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("invalid initial database cluster multixact id")));
+					}
+				}
+				break;
+			case 'o':
+				{
+					char	   *endptr;
+
+					errno = 0;
+					start_mxoff = strtou64(optarg, &endptr, 0);
+
+					if (endptr == optarg || *endptr != '\0' || errno != 0 ||
+						!StartMultiXactOffsetIsValid(start_mxoff))
+					{
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("invalid initial database cluster multixact offset")));
+					}
+				}
+				break;
 			case 'r':
 				strlcpy(OutputFileName, optarg, MAXPGPATH);
 				break;
 			case 'X':
 				SetConfigOption("wal_segment_size", optarg, PGC_INTERNAL, PGC_S_DYNAMIC_DEFAULT);
 				break;
+			case 'x':
+				{
+					char	   *endptr;
+
+					errno = 0;
+					start_xid = strtou64(optarg, &endptr, 0);
+
+					if (endptr == optarg || *endptr != '\0' || errno != 0 ||
+						!StartTransactionIdIsValid(start_xid))
+					{
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("invalid initial database cluster xid value")));
+					}
+				}
+				break;
 			default:
 				write_stderr("Try \"%s --help\" for more information.\n",
 							 progname);
diff --git a/src/backend/main/main.c b/src/backend/main/main.c
index 7d63cf94a6..0a0ee16220 100644
--- a/src/backend/main/main.c
+++ b/src/backend/main/main.c
@@ -423,12 +423,18 @@ help(const char *progname)
 	printf(_("  -E                 echo statement before execution\n"));
 	printf(_("  -j                 do not use newline as interactive query delimiter\n"));
 	printf(_("  -r FILENAME        send stdout and stderr to given file\n"));
+	printf(_("  -m START_MXID      set initial database cluster multixact id\n"));
+	printf(_("  -o START_MXOFF     set initial database cluster multixact offset\n"));
+	printf(_("  -x START_XID       set initial database cluster xid\n"));
 
 	printf(_("\nOptions for bootstrapping mode:\n"));
 	printf(_("  --boot             selects bootstrapping mode (must be first argument)\n"));
 	printf(_("  --check            selects check mode (must be first argument)\n"));
 	printf(_("  DBNAME             database name (mandatory argument in bootstrapping mode)\n"));
 	printf(_("  -r FILENAME        send stdout and stderr to given file\n"));
+	printf(_("  -m START_MXID      set initial database cluster multixact id\n"));
+	printf(_("  -o START_MXOFF     set initial database cluster multixact offset\n"));
+	printf(_("  -x START_XID       set initial database cluster xid\n"));
 
 	printf(_("\nPlease read the documentation for the complete list of run-time\n"
 			 "configuration settings and how to set them on the command line or in\n"
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 490f7ce366..b41915deed 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -595,7 +595,7 @@ PostmasterMain(int argc, char *argv[])
 	 * tcop/postgres.c (the option sets should not conflict) and with the
 	 * common help() function in main/main.c.
 	 */
-	while ((opt = getopt(argc, argv, "B:bC:c:D:d:EeFf:h:ijk:lN:OPp:r:S:sTt:W:-:")) != -1)
+	while ((opt = getopt(argc, argv, "B:bC:c:D:d:EeFf:h:ijk:lm:N:Oo:Pp:r:S:sTt:W:x:-:")) != -1)
 	{
 		switch (opt)
 		{
@@ -705,10 +705,18 @@ PostmasterMain(int argc, char *argv[])
 				SetConfigOption("max_connections", optarg, PGC_POSTMASTER, PGC_S_ARGV);
 				break;
 
+			case 'm':
+				/* only used by single-user backend */
+				break;
+
 			case 'O':
 				SetConfigOption("allow_system_table_mods", "true", PGC_POSTMASTER, PGC_S_ARGV);
 				break;
 
+			case 'o':
+				/* only used by single-user backend */
+				break;
+
 			case 'P':
 				SetConfigOption("ignore_system_indexes", "true", PGC_POSTMASTER, PGC_S_ARGV);
 				break;
@@ -759,6 +767,10 @@ PostmasterMain(int argc, char *argv[])
 				SetConfigOption("post_auth_delay", optarg, PGC_POSTMASTER, PGC_S_ARGV);
 				break;
 
+			case 'x':
+				/* only used by single-user backend */
+				break;
+
 			default:
 				write_stderr("Try \"%s --help\" for more information.\n",
 							 progname);
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 2f8c3d5f91..b4a83130e3 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3826,7 +3826,7 @@ process_postgres_switches(int argc, char *argv[], GucContext ctx,
 	 * postmaster/postmaster.c (the option sets should not conflict) and with
 	 * the common help() function in main/main.c.
 	 */
-	while ((flag = getopt(argc, argv, "B:bC:c:D:d:EeFf:h:ijk:lN:nOPp:r:S:sTt:v:W:-:")) != -1)
+	while ((flag = getopt(argc, argv, "B:bC:c:D:d:EeFf:h:ijk:lm:N:nOo:Pp:r:S:sTt:v:W:x:-:")) != -1)
 	{
 		switch (flag)
 		{
@@ -3931,6 +3931,23 @@ process_postgres_switches(int argc, char *argv[], GucContext ctx,
 				SetConfigOption("ssl", "true", ctx, gucsource);
 				break;
 
+			case 'm':
+				{
+					char	   *endptr;
+
+					errno = 0;
+					start_mxid = strtou64(optarg, &endptr, 0);
+
+					if (endptr == optarg || *endptr != '\0' || errno != 0 ||
+						!StartMultiXactIdIsValid(start_mxid))
+					{
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("invalid initial database cluster multixact id")));
+					}
+				}
+				break;
+
 			case 'N':
 				SetConfigOption("max_connections", optarg, ctx, gucsource);
 				break;
@@ -3943,6 +3960,23 @@ process_postgres_switches(int argc, char *argv[], GucContext ctx,
 				SetConfigOption("allow_system_table_mods", "true", ctx, gucsource);
 				break;
 
+			case 'o':
+				{
+					char	   *endptr;
+
+					errno = 0;
+					start_mxoff = strtou64(optarg, &endptr, 0);
+
+					if (endptr == optarg || *endptr != '\0' || errno != 0 ||
+						!StartMultiXactOffsetIsValid(start_mxoff))
+					{
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("invalid initial database cluster multixact offset")));
+					}
+				}
+				break;
+
 			case 'P':
 				SetConfigOption("ignore_system_indexes", "true", ctx, gucsource);
 				break;
@@ -3997,6 +4031,23 @@ process_postgres_switches(int argc, char *argv[], GucContext ctx,
 				SetConfigOption("post_auth_delay", optarg, ctx, gucsource);
 				break;
 
+			case 'x':
+				{
+					char	   *endptr;
+
+					errno = 0;
+					start_xid = strtou64(optarg, &endptr, 0);
+
+					if (endptr == optarg || *endptr != '\0' || errno != 0 ||
+						!StartTransactionIdIsValid(start_xid))
+					{
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("invalid initial database cluster xid")));
+					}
+				}
+				break;
+
 			default:
 				errs++;
 				break;
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 62bbd08d9f..c61b12de9c 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -169,6 +169,9 @@ static char *xlog_dir = NULL;
 static int	wal_segment_size_mb = (DEFAULT_XLOG_SEG_SIZE) / (1024 * 1024);
 static DataDirSyncMethod sync_method = DATA_DIR_SYNC_METHOD_FSYNC;
 static bool sync_data_files = true;
+static TransactionId start_xid = 0;
+static MultiXactId start_mxid = 0;
+static MultiXactOffset start_mxoff = 0;
 
 
 /* internal vars */
@@ -1586,6 +1589,11 @@ bootstrap_template1(void)
 	bki_lines = replace_token(bki_lines, "POSTGRES",
 							  escape_quotes_bki(username));
 
+	/* relfrozenxid must not be less than FirstNormalTransactionId */
+	sprintf(buf, "%u", Max(start_xid, 3));
+	bki_lines = replace_token(bki_lines, "RECENTXMIN",
+							  buf);
+
 	bki_lines = replace_token(bki_lines, "ENCODING",
 							  encodingid_to_string(encodingid));
 
@@ -1611,6 +1619,9 @@ bootstrap_template1(void)
 
 	printfPQExpBuffer(&cmd, "\"%s\" --boot %s %s", backend_exec, boot_options, extra_options);
 	appendPQExpBuffer(&cmd, " -X %d", wal_segment_size_mb * (1024 * 1024));
+	appendPQExpBuffer(&cmd, " -m %u", start_mxid);
+	appendPQExpBuffer(&cmd, " -o %" PRIu64, start_mxoff);
+	appendPQExpBuffer(&cmd, " -x %u", start_xid);
 	if (data_checksums)
 		appendPQExpBufferStr(&cmd, " -k");
 	if (debug)
@@ -2552,13 +2563,21 @@ usage(const char *progname)
 	printf(_("  -d, --debug               generate lots of debugging output\n"));
 	printf(_("      --discard-caches      set debug_discard_caches=1\n"));
 	printf(_("  -L DIRECTORY              where to find the input files\n"));
+	printf(_("  -m, --multixact-id=START_MXID\n"
+			 "                            set initial database cluster multixact id\n"
+			 "                            max value is 2^32-1\n"));
 	printf(_("  -n, --no-clean            do not clean up after errors\n"));
 	printf(_("  -N, --no-sync             do not wait for changes to be written safely to disk\n"));
 	printf(_("      --no-sync-data-files  do not sync files within database directories\n"));
 	printf(_("      --no-instructions     do not print instructions for next steps\n"));
+	printf(_("  -o, --multixact-offset=START_MXOFF\n"
+			 "                            set initial database cluster multixact offset\n"
+			 "                            max value is 2^32-1\n"));
 	printf(_("  -s, --show                show internal settings, then exit\n"));
 	printf(_("      --sync-method=METHOD  set method for syncing files to disk\n"));
 	printf(_("  -S, --sync-only           only sync database files to disk, then exit\n"));
+	printf(_("  -x, --xid=START_XID       set initial database cluster xid\n"
+			 "                            max value is 2^32-1\n"));
 	printf(_("\nOther options:\n"));
 	printf(_("  -V, --version             output version information, then exit\n"));
 	printf(_("  -?, --help                show this help, then exit\n"));
@@ -3093,6 +3112,18 @@ initialize_data_directory(void)
 	/* Now create all the text config files */
 	setup_config();
 
+	if (start_mxid != 0)
+		printf(_("selecting initial multixact id ... %u\n"),
+				 start_mxid);
+
+	if (start_mxoff != 0)
+		printf(_("selecting initial multixact offset ... %" PRIu64 "\n"),
+				 start_mxoff);
+
+	if (start_xid != 0)
+		printf(_("selecting initial xid ... %u\n"),
+				 start_xid);
+
 	/* Bootstrap template1 */
 	bootstrap_template1();
 
@@ -3109,8 +3140,12 @@ initialize_data_directory(void)
 	fflush(stdout);
 
 	initPQExpBuffer(&cmd);
-	printfPQExpBuffer(&cmd, "\"%s\" %s %s template1 >%s",
-					  backend_exec, backend_options, extra_options, DEVNULL);
+	printfPQExpBuffer(&cmd, "\"%s\" %s %s",
+					  backend_exec, backend_options, extra_options);
+	appendPQExpBuffer(&cmd, " -m %u", start_mxid);
+	appendPQExpBuffer(&cmd, " -o %" PRIu64, start_mxoff);
+	appendPQExpBuffer(&cmd, " -x %u", start_xid);
+	appendPQExpBuffer(&cmd, " template1 >%s", DEVNULL);
 
 	PG_CMD_OPEN(cmd.data);
 
@@ -3198,6 +3233,9 @@ main(int argc, char *argv[])
 		{"sync-method", required_argument, NULL, 19},
 		{"no-data-checksums", no_argument, NULL, 20},
 		{"no-sync-data-files", no_argument, NULL, 21},
+		{"xid", required_argument, NULL, 'x'},
+		{"multixact-id", required_argument, NULL, 'm'},
+		{"multixact-offset", required_argument, NULL, 'o'},
 		{NULL, 0, NULL, 0}
 	};
 
@@ -3239,7 +3277,7 @@ main(int argc, char *argv[])
 
 	/* process command-line options */
 
-	while ((c = getopt_long(argc, argv, "A:c:dD:E:gkL:nNsST:U:WX:",
+	while ((c = getopt_long(argc, argv, "A:c:dD:E:gkL:m:nNo:sST:U:Wx:X:",
 							long_options, &option_index)) != -1)
 	{
 		switch (c)
@@ -3297,6 +3335,30 @@ main(int argc, char *argv[])
 				debug = true;
 				printf(_("Running in debug mode.\n"));
 				break;
+			case 'm':
+				{
+					char	   *endptr;
+
+					errno = 0;
+					start_mxid = strtou64(optarg, &endptr, 0);
+
+					if (endptr == optarg || *endptr != '\0' || errno != 0 ||
+						!StartMultiXactIdIsValid(start_mxid))
+					{
+						pg_log_error("invalid initial database cluster multixact id");
+						exit(1);
+					}
+					else if (start_mxid < 1) /* FirstMultiXactId */
+					{
+						/*
+						 * Reject this rather than silently setting mxid to
+						 * FirstMultiXactId, although doing so would be harmless.
+						 */
+						pg_log_error("multixact id should be greater than 0");
+						exit(1);
+					}
+				}
+				break;
 			case 'n':
 				noclean = true;
 				printf(_("Running in no-clean mode.  Mistakes will not be cleaned up.\n"));
@@ -3304,6 +3366,21 @@ main(int argc, char *argv[])
 			case 'N':
 				do_sync = false;
 				break;
+			case 'o':
+				{
+					char	   *endptr;
+
+					errno = 0;
+					start_mxoff = strtou64(optarg, &endptr, 0);
+
+					if (endptr == optarg || *endptr != '\0' || errno != 0 ||
+						!StartMultiXactOffsetIsValid(start_mxoff))
+					{
+						pg_log_error("invalid initial database cluster multixact offset");
+						exit(1);
+					}
+				}
+				break;
 			case 'S':
 				sync_only = true;
 				break;
@@ -3395,6 +3472,30 @@ main(int argc, char *argv[])
 			case 21:
 				sync_data_files = false;
 				break;
+			case 'x':
+				{
+					char	   *endptr;
+
+					errno = 0;
+					start_xid = strtou64(optarg, &endptr, 0);
+
+					if (endptr == optarg || *endptr != '\0' || errno != 0 ||
+						!StartTransactionIdIsValid(start_xid))
+					{
+						pg_log_error("invalid value for initial database cluster xid");
+						exit(1);
+					}
+					else if (start_xid < 3) /* FirstNormalTransactionId */
+					{
+						/*
+						 * Reject this rather than silently setting xid to
+						 * FirstNormalTransactionId, although doing so would be harmless.
+						 */
+						pg_log_error("xid should be greater than 2");
+						exit(1);
+					}
+				}
+				break;
 			default:
 				/* getopt_long already emitted a complaint */
 				pg_log_error_hint("Try \"%s --help\" for more information.", progname);
diff --git a/src/bin/initdb/t/001_initdb.pl b/src/bin/initdb/t/001_initdb.pl
index b7ef7ed8d0..1c2e167110 100644
--- a/src/bin/initdb/t/001_initdb.pl
+++ b/src/bin/initdb/t/001_initdb.pl
@@ -331,4 +331,64 @@ command_fails(
 	[ 'pg_checksums', '--pgdata' => $datadir_nochecksums ],
 	"pg_checksums fails with data checksum disabled");
 
+# Set non-standard initial mxid/mxoff/xid.
+command_fails_like(
+	[ 'initdb', '-m', 'seven', $datadir ],
+	qr/initdb: error: invalid initial database cluster multixact id/,
+	'fails for invalid initial database cluster multixact id');
+command_fails_like(
+	[ 'initdb', '-o', 'seven', $datadir ],
+	qr/initdb: error: invalid initial database cluster multixact offset/,
+	'fails for invalid initial database cluster multixact offset');
+command_fails_like(
+	[ 'initdb', '-x', 'seven', $datadir ],
+	qr/initdb: error: invalid value for initial database cluster xid/,
+	'fails for invalid initial database cluster xid');
+
+command_checks_all(
+	[ 'initdb', '-m', '65535', "$tempdir/data-m65535" ],
+	0,
+	[qr/selecting initial multixact id ... 65535/],
+	[],
+	'selecting initial multixact id');
+command_checks_all(
+	[ 'initdb', '-o', '65535', "$tempdir/data-o65535" ],
+	0,
+	[qr/selecting initial multixact offset ... 65535/],
+	[],
+	'selecting initial multixact offset');
+command_checks_all(
+	[ 'initdb', '-x', '65535', "$tempdir/data-x65535" ],
+	0,
+	[qr/selecting initial xid ... 65535/],
+	[],
+	'selecting initial xid');
+
+# Setup new cluster with given mxid/mxoff/xid.
+my $node;
+my $result;
+
+$node = PostgreSQL::Test::Cluster->new('test-mxid');
+$node->init(extra => ['-m', '16777215']); # 0xFFFFFF
+$node->start;
+$result = $node->safe_psql('postgres', "SELECT next_multixact_id FROM pg_control_checkpoint();");
+ok($result >= 16777215, 'setup cluster with given mxid');
+$node->stop;
+
+$node = PostgreSQL::Test::Cluster->new('test-mxoff');
+$node->init(extra => ['-o', '16777215']); # 0xFFFFFF
+$node->start;
+$result = $node->safe_psql('postgres', "SELECT next_multi_offset FROM pg_control_checkpoint();");
+ok($result >= 16777215, 'setup cluster with given mxoff');
+$node->stop;
+
+$node = PostgreSQL::Test::Cluster->new('test-xid');
+$node->init(extra => ['-x', '16777215']); # 0xFFFFFF
+$node->start;
+$result = $node->safe_psql('postgres', "SELECT txid_current();");
+ok($result >= 16777215, 'setup cluster with given xid - check 1');
+$result = $node->safe_psql('postgres', "SELECT oldest_xid FROM pg_control_checkpoint();");
+ok($result >= 16777215, 'setup cluster with given xid - check 2');
+$node->stop;
+
 done_testing();
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index d313099c02..23b8dd0375 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -94,6 +94,9 @@ typedef enum RecoveryState
 } RecoveryState;
 
 extern PGDLLIMPORT int wal_level;
+extern PGDLLIMPORT TransactionId start_xid;
+extern PGDLLIMPORT MultiXactId start_mxid;
+extern PGDLLIMPORT MultiXactOffset start_mxoff;
 
 /* Is WAL archiving enabled (always or only while server is running normally)? */
 #define XLogArchivingActive() \
diff --git a/src/include/c.h b/src/include/c.h
index efa6e099f8..ff4e46c363 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -617,6 +617,10 @@ typedef uint64 MultiXactOffset;
 
 typedef uint32 CommandId;
 
+#define StartTransactionIdIsValid(xid)			((xid) <= 0xFFFFFFFF)
+#define StartMultiXactIdIsValid(mxid)			((mxid) <= 0xFFFFFFFF)
+#define StartMultiXactOffsetIsValid(offset)		((offset) <= 0xFFFFFFFF)
+
 #define FirstCommandId	((CommandId) 0)
 #define InvalidCommandId	(~(CommandId)0)
 
diff --git a/src/include/catalog/pg_class.h b/src/include/catalog/pg_class.h
index 07d182da79..06e9969862 100644
--- a/src/include/catalog/pg_class.h
+++ b/src/include/catalog/pg_class.h
@@ -126,7 +126,7 @@ CATALOG(pg_class,1259,RelationRelationId) BKI_BOOTSTRAP BKI_ROWTYPE_OID(83,Relat
 	Oid			relrewrite BKI_DEFAULT(0) BKI_LOOKUP_OPT(pg_class);
 
 	/* all Xids < this are frozen in this rel */
-	TransactionId relfrozenxid BKI_DEFAULT(3);	/* FirstNormalTransactionId */
+	TransactionId relfrozenxid BKI_DEFAULT(RECENTXMIN);	/* FirstNormalTransactionId */
 
 	/* all multixacts in this rel are >= this; it is really a MultiXactId */
 	TransactionId relminmxid BKI_DEFAULT(1);	/* FirstMultiXactId */
-- 
2.49.0

Attachment: v16-0003-Make-pg_upgrade-convert-multixact-offsets.patch (application/octet-stream)
From 54008c24d98034e272fed45f459f6ab7001213c7 Mon Sep 17 00:00:00 2001
From: Maxim Orlov <m.orlov@postgrespro.ru>
Date: Tue, 13 Aug 2024 14:44:50 +0300
Subject: [PATCH v16 3/7] Make pg_upgrade convert multixact offsets.

Author: Maxim Orlov <orlovmg@gmail.com>
Author: Yura Sokolov <y.sokolov@postgrespro.ru>
---
 src/bin/pg_upgrade/Makefile     |   1 +
 src/bin/pg_upgrade/meson.build  |   1 +
 src/bin/pg_upgrade/pg_upgrade.c |  40 ++-
 src/bin/pg_upgrade/pg_upgrade.h |  14 +-
 src/bin/pg_upgrade/segresize.c  | 527 ++++++++++++++++++++++++++++++++
 5 files changed, 579 insertions(+), 4 deletions(-)
 create mode 100644 src/bin/pg_upgrade/segresize.c
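
The change to copy_xact_xlog_xid() in this patch rebases the old cluster's next
multixact offset so that it is counted from the oldest surviving offset. The
sketch below reproduces only that arithmetic in a standalone form. The typedef
and the helper name rebase_next_offset() are assumptions made for the example.
Reading the "next < oldest" branch as the old 32-bit counter having wrapped is
an interpretation of the patch, not something it states.

```c
#include <stdint.h>

/* Assumption: stand-in for the 64-bit MultiXactOffset used by this series. */
typedef uint64_t MultiXactOffset;

/*
 * Hypothetical helper mirroring the arithmetic in copy_xact_xlog_xid():
 * returns the next offset as it would be passed on to the new cluster.
 */
static MultiXactOffset
rebase_next_offset(MultiXactOffset next_offset, MultiXactOffset oldest_offset)
{
    /* An oldest offset of zero means there is nothing to rebase. */
    if (oldest_offset == 0)
        return next_offset;

    /* Presumably the old 32-bit counter wrapped: extend by 2^32 - 1. */
    if (next_offset < oldest_offset)
        next_offset += ((MultiXactOffset) 1 << 32) - 1;

    /* Shift so that the oldest surviving offset maps to 1. */
    next_offset -= oldest_offset - 1;

    return next_offset;
}
```

For example, with oldest_offset = 40 and next_offset = 100 the result is 61,
and with oldest_offset = 0xFFFFFFF0 and next_offset = 5 the result is 21.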

diff --git a/src/bin/pg_upgrade/Makefile b/src/bin/pg_upgrade/Makefile
index f83d2b5d30..70908d63a3 100644
--- a/src/bin/pg_upgrade/Makefile
+++ b/src/bin/pg_upgrade/Makefile
@@ -21,6 +21,7 @@ OBJS = \
 	info.o \
 	option.o \
 	parallel.o \
+	segresize.o \
 	pg_upgrade.o \
 	relfilenumber.o \
 	server.o \
diff --git a/src/bin/pg_upgrade/meson.build b/src/bin/pg_upgrade/meson.build
index ac992f0d14..a8b1d77c2d 100644
--- a/src/bin/pg_upgrade/meson.build
+++ b/src/bin/pg_upgrade/meson.build
@@ -10,6 +10,7 @@ pg_upgrade_sources = files(
   'info.c',
   'option.c',
   'parallel.c',
+  'segresize.c',
   'pg_upgrade.c',
   'relfilenumber.c',
   'server.c',
diff --git a/src/bin/pg_upgrade/pg_upgrade.c b/src/bin/pg_upgrade/pg_upgrade.c
index 536e49d261..7fae173016 100644
--- a/src/bin/pg_upgrade/pg_upgrade.c
+++ b/src/bin/pg_upgrade/pg_upgrade.c
@@ -793,8 +793,42 @@ copy_xact_xlog_xid(void)
 	if (old_cluster.controldata.cat_ver >= MULTIXACT_FORMATCHANGE_CAT_VER &&
 		new_cluster.controldata.cat_ver >= MULTIXACT_FORMATCHANGE_CAT_VER)
 	{
-		copy_subdir_files("pg_multixact/offsets", "pg_multixact/offsets");
-		copy_subdir_files("pg_multixact/members", "pg_multixact/members");
+		/*
+		 * If the old server is before the MULTIXACTOFFSET_FORMATCHANGE_CAT_VER
+		 * it must have 32-bit multixid offsets, thus it should be converted.
+		 */
+		if (old_cluster.controldata.cat_ver < MULTIXACTOFFSET_FORMATCHANGE_CAT_VER &&
+			new_cluster.controldata.cat_ver >= MULTIXACTOFFSET_FORMATCHANGE_CAT_VER)
+		{
+			MultiXactOffset		oldest_offset,
+								next_offset;
+
+			remove_new_subdir("pg_multixact/offsets", false);
+			prep_status("Converting pg_multixact/offsets to 64-bit");
+			oldest_offset = convert_multixact_offsets();
+			check_ok();
+
+			remove_new_subdir("pg_multixact/members", false);
+			prep_status("Converting pg_multixact/members");
+			convert_multixact_members(oldest_offset);
+			check_ok();
+
+			next_offset = old_cluster.controldata.chkpnt_nxtmxoff;
+			if (oldest_offset)
+			{
+				if (next_offset < oldest_offset)
+					next_offset += ((MultiXactOffset) 1 << 32) - 1;
+
+				next_offset -= oldest_offset - 1;
+
+				old_cluster.controldata.chkpnt_nxtmxoff = next_offset;
+			}
+		}
+		else
+		{
+			copy_subdir_files("pg_multixact/offsets", "pg_multixact/offsets");
+			copy_subdir_files("pg_multixact/members", "pg_multixact/members");
+		}
 
 		prep_status("Setting next multixact ID and offset for new cluster");
 
@@ -803,7 +837,7 @@ copy_xact_xlog_xid(void)
 		 * counters here and the oldest multi present on system.
 		 */
 		exec_prog(UTILITY_LOG_FILE, NULL, true, true,
-				  "\"%s/pg_resetwal\" -O %u -m %u,%u \"%s\"",
+				  "\"%s/pg_resetwal\" -O %" PRIu64 " -m %u,%u \"%s\"",
 				  new_cluster.bindir,
 				  old_cluster.controldata.chkpnt_nxtmxoff,
 				  old_cluster.controldata.chkpnt_nxtmulti,
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index 69c965bb7d..2d4f1d39e5 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -114,6 +114,13 @@ extern char *output_files[];
  */
 #define MULTIXACT_FORMATCHANGE_CAT_VER 201301231
 
+/*
+ * Switch from 32-bit to 64-bit for multixid offsets.
+ *
+ * XXX: should be changed to the actual CATALOG_VERSION_NO on commit.
+ */
+#define MULTIXACTOFFSET_FORMATCHANGE_CAT_VER 202409041
+
 /*
  * large object chunk size added to pg_controldata,
  * commit 5f93c37805e7485488480916b4585e098d3cc883
@@ -235,7 +242,7 @@ typedef struct
 	uint32		chkpnt_nxtepoch;
 	uint32		chkpnt_nxtoid;
 	uint32		chkpnt_nxtmulti;
-	uint32		chkpnt_nxtmxoff;
+	uint64		chkpnt_nxtmxoff;
 	uint32		chkpnt_oldstMulti;
 	uint32		chkpnt_oldstxid;
 	uint32		align;
@@ -526,3 +533,8 @@ typedef struct
 	FILE	   *file;
 	char		path[MAXPGPATH];
 } UpgradeTaskReport;
+
+/* segresize.c */
+
+MultiXactOffset		convert_multixact_offsets(void);
+void				convert_multixact_members(MultiXactOffset oldest_offset);
diff --git a/src/bin/pg_upgrade/segresize.c b/src/bin/pg_upgrade/segresize.c
new file mode 100644
index 0000000000..7f80d0652a
--- /dev/null
+++ b/src/bin/pg_upgrade/segresize.c
@@ -0,0 +1,527 @@
+/*
+ *	segresize.c
+ *
+ *	SLRU segment resize utility
+ *
+ *	Copyright (c) 2024, PostgreSQL Global Development Group
+ *	src/bin/pg_upgrade/segresize.c
+ */
+
+#include "postgres_fe.h"
+
+#include "pg_upgrade.h"
+#include "access/multixact.h"
+
+/* See slru.h */
+#define SLRU_PAGES_PER_SEGMENT		32
+
+/*
+ * Some kind of iterator associated with a particular SLRU segment.  The idea is
+ * to specify the segment and page number and then move through the pages.
+ */
+typedef struct SlruSegState
+{
+	char	   *dir;
+	char	   *fn;
+	FILE	   *file;
+	int64		segno;
+	uint64		pageno;
+	bool		leading_gap;
+} SlruSegState;
+
+/*
+ * Mirrors the SlruFileName from slru.c
+ */
+static inline char *
+SlruFileName(SlruSegState *state)
+{
+	Assert(state->segno >= 0 && state->segno <= INT64CONST(0xFFFFFF));
+	return psprintf("%s/%04X", state->dir, (unsigned int) state->segno);
+}
+
+/*
+ * Create new SLRU segment file.
+ */
+static void
+create_segment(SlruSegState *state)
+{
+	Assert(state->fn == NULL);
+	Assert(state->file == NULL);
+
+	state->fn = SlruFileName(state);
+	state->file = fopen(state->fn, "wb");
+	if (!state->file)
+		pg_fatal("could not create file \"%s\": %m", state->fn);
+}
+
+/*
+ * Open existing SLRU segment file.
+ */
+static void
+open_segment(SlruSegState *state)
+{
+	Assert(state->fn == NULL);
+	Assert(state->file == NULL);
+
+	state->fn = SlruFileName(state);
+	state->file = fopen(state->fn, "rb");
+	if (!state->file)
+		pg_fatal("could not open file \"%s\": %m", state->fn);
+}
+
+/*
+ * Close SLRU segment file.
+ */
+static void
+close_segment(SlruSegState *state)
+{
+	if (state->file)
+	{
+		fclose(state->file);
+		state->file = NULL;
+	}
+
+	if (state->fn)
+	{
+		pfree(state->fn);
+		state->fn = NULL;
+	}
+}
+
+/*
+ * Read next page from the old 32-bit offset segment file.
+ */
+static int
+read_old_segment_page(SlruSegState *state, void *buf, bool *empty)
+{
+	int		len;
+
+	/* Open next segment file, if needed. */
+	if (!state->fn)
+	{
+		if (!state->segno)
+			state->leading_gap = true;
+
+		open_segment(state);
+
+		/* Set position to the needed page. */
+		if (state->pageno > 0 &&
+			fseek(state->file, state->pageno * BLCKSZ, SEEK_SET))
+		{
+			close_segment(state);
+		}
+	}
+
+	if (state->file)
+	{
+		/* Segment file exists; read a page from it. */
+		state->leading_gap = false;
+
+		len = fread(buf, sizeof(char), BLCKSZ, state->file);
+
+		/* Are we done or was there an error? */
+		if (len <= 0)
+		{
+			if (ferror(state->file))
+				pg_fatal("error reading file \"%s\": %m", state->fn);
+
+			if (feof(state->file))
+			{
+				*empty = true;
+				len = -1;
+
+				close_segment(state);
+			}
+		}
+		else
+			*empty = false;
+	}
+	else if (!state->leading_gap)
+	{
+		/* We reached the last segment. */
+		len = -1;
+		*empty = true;
+	}
+	else
+	{
+		/* Skip the first few segments if they were frozen and removed. */
+		len = BLCKSZ;
+		*empty = true;
+	}
+
+	if (++state->pageno >= SLRU_PAGES_PER_SEGMENT)
+	{
+		/* Start a new segment. */
+		state->segno++;
+		state->pageno = 0;
+
+		close_segment(state);
+	}
+
+	return len;
+}
+
+/*
+ * Write next page to the new 64-bit offset segment file.
+ */
+static void
+write_new_segment_page(SlruSegState *state, void *buf)
+{
+	/*
+	 * Create a new segment file if we haven't yet.  Creation is
+	 * postponed until the first non-empty page is found.  This avoids
+	 * creating completely empty segments.
+	 */
+	if (!state->file)
+	{
+		create_segment(state);
+
+		/* Write zeroes to the previously skipped prefix. */
+		if (state->pageno > 0)
+		{
+			char		zerobuf[BLCKSZ] = {0};
+
+			for (int64 i = 0; i < state->pageno; i++)
+			{
+				if (fwrite(zerobuf, sizeof(char), BLCKSZ, state->file) != BLCKSZ)
+					pg_fatal("could not write file \"%s\": %m", state->fn);
+			}
+		}
+	}
+
+	/* Write page to the new segment (if it was created). */
+	if (state->file)
+	{
+		if (fwrite(buf, sizeof(char), BLCKSZ, state->file) != BLCKSZ)
+			pg_fatal("could not write file \"%s\": %m", state->fn);
+	}
+
+	/*
+	 * Did we reach the maximum page number?  Then close segment file
+	 * and create a new one on the next iteration.
+	 */
+	if (++state->pageno >= SLRU_PAGES_PER_SEGMENT)
+	{
+		/* Start a new segment. */
+		state->segno++;
+		state->pageno = 0;
+
+		close_segment(state);
+	}
+}
+
+typedef uint32 MultiXactOffsetOld;
+
+#define MaxMultiXactOffsetOld	((MultiXactOffsetOld) 0xFFFFFFFF)
+
+#define MULTIXACT_OFFSETS_PER_PAGE_OLD (BLCKSZ / sizeof(MultiXactOffsetOld))
+#define MULTIXACT_OFFSETS_PER_PAGE_NEW (BLCKSZ / sizeof(MultiXactOffset))
+
+/*
+ * Convert pg_multixact/offsets segments and return oldest multi offset.
+ */
+MultiXactOffset
+convert_multixact_offsets(void)
+{
+	SlruSegState		oldseg = {0},
+						newseg = {0};
+	MultiXactOffsetOld	oldbuf[MULTIXACT_OFFSETS_PER_PAGE_OLD] = {0};
+	MultiXactOffset		newbuf[MULTIXACT_OFFSETS_PER_PAGE_NEW] = {0},
+						oldest_offset = 0;
+	uint64				oldest_multi = old_cluster.controldata.chkpnt_oldstMulti,
+						next_multi = old_cluster.controldata.chkpnt_nxtmulti,
+						multi,
+						old_entry,
+						new_entry;
+	bool				oldest_offset_known = false;
+
+	oldseg.dir = psprintf("%s/pg_multixact/offsets", old_cluster.pgdata);
+	newseg.dir = psprintf("%s/pg_multixact/offsets", new_cluster.pgdata);
+
+	old_entry = oldest_multi % MULTIXACT_OFFSETS_PER_PAGE_OLD;
+	oldseg.pageno = oldest_multi / MULTIXACT_OFFSETS_PER_PAGE_OLD;
+	oldseg.segno = oldseg.pageno / SLRU_PAGES_PER_SEGMENT;
+	oldseg.pageno %= SLRU_PAGES_PER_SEGMENT;
+
+	new_entry = oldest_multi % MULTIXACT_OFFSETS_PER_PAGE_NEW;
+	newseg.pageno = oldest_multi / MULTIXACT_OFFSETS_PER_PAGE_NEW;
+	newseg.segno = newseg.pageno / SLRU_PAGES_PER_SEGMENT;
+	newseg.pageno %= SLRU_PAGES_PER_SEGMENT;
+
+	if (next_multi < oldest_multi)
+		next_multi += (uint64) 1 << 32;	/* wraparound */
+
+	/* Copy multi offsets reading only needed segment pages */
+	for (multi = oldest_multi; multi < next_multi; old_entry = 0)
+	{
+		int		oldlen;
+		bool	empty;
+
+		/* Handle possible segment wraparound */
+#define OLD_OFFSET_SEGNO_MAX	\
+	(MaxMultiXactId / MULTIXACT_OFFSETS_PER_PAGE_OLD / SLRU_PAGES_PER_SEGMENT)
+		if (oldseg.segno > OLD_OFFSET_SEGNO_MAX)
+		{
+			oldseg.segno = 0;
+			oldseg.pageno = 0;
+		}
+
+		oldlen = read_old_segment_page(&oldseg, oldbuf, &empty);
+		if (empty || oldlen != BLCKSZ)
+			pg_fatal("cannot read page %" PRIu64 " from file \"%s\": %m",
+					 oldseg.pageno, oldseg.fn);
+
+		/* Save oldest multi offset */
+		if (!oldest_offset_known)
+		{
+			oldest_offset = oldbuf[old_entry];
+			oldest_offset_known = true;
+		}
+
+		/* Skip wrapped-around invalid MultiXactIds */
+		if (multi == (uint64) 1 << 32)
+		{
+			Assert(oldseg.segno == 0);
+			Assert(oldseg.pageno == 1);
+			Assert(old_entry == 0);
+			Assert(new_entry == 0);
+
+			multi += FirstMultiXactId;
+			old_entry = FirstMultiXactId;
+			new_entry = FirstMultiXactId;
+		}
+
+		/* Copy entries to the new page */
+		for (; multi < next_multi && old_entry < MULTIXACT_OFFSETS_PER_PAGE_OLD;
+			 multi++, old_entry++)
+		{
+			MultiXactOffset offset = oldbuf[old_entry];
+
+			/* Handle possible offset wraparound (1 becomes 2^32) */
+			if (offset < oldest_offset)
+				offset += ((uint64) 1 << 32) - 1;
+
+			/* Subtract oldest_offset, so new offsets will start from 1 */
+			newbuf[new_entry++] = offset - oldest_offset + 1;
+
+			if (new_entry >= MULTIXACT_OFFSETS_PER_PAGE_NEW)
+			{
+				/* Handle possible segment wraparound */
+#define NEW_OFFSET_SEGNO_MAX	\
+	(MaxMultiXactId / MULTIXACT_OFFSETS_PER_PAGE_NEW / SLRU_PAGES_PER_SEGMENT)
+				if (newseg.segno > NEW_OFFSET_SEGNO_MAX)
+				{
+					newseg.segno = 0;
+					newseg.pageno = 0;
+				}
+
+				/* Write new page */
+				write_new_segment_page(&newseg, newbuf);
+				new_entry = 0;
+			}
+		}
+	}
+
+	/* Write the last incomplete page */
+	if (new_entry > 0 || oldest_multi == next_multi)
+	{
+		memset(&newbuf[new_entry], 0,
+			   sizeof(newbuf[0]) * (MULTIXACT_OFFSETS_PER_PAGE_NEW - new_entry));
+		write_new_segment_page(&newseg, newbuf);
+	}
+
+	/* Use next_offset as oldest_offset, if oldest_multi == next_multi */
+	if (!oldest_offset_known)
+	{
+		Assert(oldest_multi == next_multi);
+		oldest_offset = (MultiXactOffset) old_cluster.controldata.chkpnt_nxtmxoff;
+	}
+
+	/* Release resources */
+	close_segment(&oldseg);
+	close_segment(&newseg);
+
+	pfree(oldseg.dir);
+	pfree(newseg.dir);
+
+	return oldest_offset;
+}
+
+#define MXACT_MEMBERS_FLAG_BYTES			1
+
+#define MULTIXACT_MEMBERS_PER_GROUP			4
+#define MULTIXACT_MEMBERGROUP_SIZE			\
+	(MULTIXACT_MEMBERS_PER_GROUP * (sizeof(TransactionId) + MXACT_MEMBERS_FLAG_BYTES))
+#define MULTIXACT_MEMBERGROUPS_PER_PAGE		\
+	(BLCKSZ / MULTIXACT_MEMBERGROUP_SIZE)
+
+#define MULTIXACT_MEMBERS_PER_PAGE				\
+	(MULTIXACT_MEMBERS_PER_GROUP * MULTIXACT_MEMBERGROUPS_PER_PAGE)
+#define MULTIXACT_MEMBER_FLAG_BYTES_PER_GROUP	\
+	(MXACT_MEMBERS_FLAG_BYTES * MULTIXACT_MEMBERS_PER_GROUP)
+
+typedef struct MultiXactMembersCtx
+{
+	SlruSegState	seg;
+	char			buf[BLCKSZ];
+	int				group;
+	int				member;
+	char		   *flag;
+	TransactionId  *xid;
+} MultiXactMembersCtx;
+
+static void
+MultiXactMembersCtxInit(MultiXactMembersCtx *ctx)
+{
+	ctx->seg.dir = psprintf("%s/pg_multixact/members", new_cluster.pgdata);
+
+	ctx->group = 0;
+	ctx->member = 1;		/* skip invalid zero offset */
+
+	ctx->flag = (char *) ctx->buf + ctx->group * MULTIXACT_MEMBERGROUP_SIZE;
+	ctx->xid = (TransactionId *)(ctx->flag + MXACT_MEMBERS_FLAG_BYTES * MULTIXACT_MEMBERS_PER_GROUP);
+
+	ctx->flag += ctx->member;
+	ctx->xid += ctx->member;
+}
+
+static void
+MultiXactMembersCtxAdd(MultiXactMembersCtx *ctx, char flag, TransactionId xid)
+{
+	/* Copy member's xid and flags to the new page */
+	*ctx->flag++ = flag;
+	*ctx->xid++ = xid;
+
+	if (++ctx->member < MULTIXACT_MEMBERS_PER_GROUP)
+		return;
+
+	/* Start next member group */
+	ctx->member = 0;
+
+	if (++ctx->group >= MULTIXACT_MEMBERGROUPS_PER_PAGE)
+	{
+		/* Write current page and start new */
+		write_new_segment_page(&ctx->seg, ctx->buf);
+
+		ctx->group = 0;
+		memset(ctx->buf, 0, BLCKSZ);
+	}
+
+	ctx->flag = (char *) ctx->buf + ctx->group * MULTIXACT_MEMBERGROUP_SIZE;
+	ctx->xid = (TransactionId *)(ctx->flag + MXACT_MEMBERS_FLAG_BYTES * MULTIXACT_MEMBERS_PER_GROUP);
+}
+
+static void
+MultiXactMembersCtxFinit(MultiXactMembersCtx *ctx)
+{
+	if (ctx->flag > (char *) ctx->buf)
+		write_new_segment_page(&ctx->seg, ctx->buf);
+
+	close_segment(&ctx->seg);
+
+	pfree(ctx->seg.dir);
+}
+
+/*
+ * Convert pg_multixact/members segments, offsets will start from 1.
+ *
+ */
+void
+convert_multixact_members(MultiXactOffset oldest_offset)
+{
+	MultiXactOffset			next_offset,
+							offset;
+	SlruSegState			oldseg = {0};
+	char					oldbuf[BLCKSZ] = {0};
+	int						oldidx;
+	MultiXactMembersCtx		newctx = {0};
+
+	oldseg.dir = psprintf("%s/pg_multixact/members", old_cluster.pgdata);
+
+	next_offset = (MultiXactOffset) old_cluster.controldata.chkpnt_nxtmxoff;
+	if (next_offset < oldest_offset)
+		next_offset += ((uint64) 1 << 32) - 1;
+
+	/* Initialize the old starting position */
+	oldseg.pageno = oldest_offset / MULTIXACT_MEMBERS_PER_PAGE;
+	oldseg.segno = oldseg.pageno / SLRU_PAGES_PER_SEGMENT;
+	oldseg.pageno %= SLRU_PAGES_PER_SEGMENT;
+
+	/* Initialize new starting position */
+	MultiXactMembersCtxInit(&newctx);
+
+	/* Iterate through the original directory */
+	oldidx = oldest_offset % MULTIXACT_MEMBERS_PER_PAGE;
+	for (offset = oldest_offset; offset < next_offset;)
+	{
+		bool	empty;
+		int		oldlen;
+		int		ngroups;
+		int		oldgroup;
+		int		oldmember;
+
+		oldlen = read_old_segment_page(&oldseg, oldbuf, &empty);
+		if (empty || oldlen != BLCKSZ)
+			pg_fatal("cannot read page %" PRIu64 " from file \"%s\": %m",
+					 oldseg.pageno, oldseg.fn);
+
+		/* Iterate through the old member groups */
+		ngroups = oldlen / MULTIXACT_MEMBERGROUP_SIZE;
+		oldmember = oldidx % MULTIXACT_MEMBERS_PER_GROUP;
+		oldgroup = oldidx / MULTIXACT_MEMBERS_PER_GROUP;
+		while (oldgroup < ngroups && offset < next_offset)
+		{
+			char		   *oldflag;
+			TransactionId  *oldxid;
+			int				i;
+
+			oldflag = (char *) oldbuf + oldgroup * MULTIXACT_MEMBERGROUP_SIZE;
+			oldxid = (TransactionId *)(oldflag + MULTIXACT_MEMBER_FLAG_BYTES_PER_GROUP);
+
+			oldxid += oldmember;
+			oldflag += oldmember;
+
+			/* Iterate through the old members */
+			for (i = oldmember;
+				 i < MULTIXACT_MEMBERS_PER_GROUP && offset < next_offset;
+				 i++)
+			{
+				MultiXactMembersCtxAdd(&newctx, *oldflag++, *oldxid++);
+
+				if (++offset == (uint64) 1 << 32)
+				{
+					Assert(i == MaxMultiXactOffsetOld % MULTIXACT_MEMBERS_PER_GROUP);
+					goto wraparound;
+				}
+			}
+
+			oldgroup++;
+			oldmember = 0;
+		}
+
+		oldidx = 0;
+
+		continue;
+
+wraparound:
+#define SEGNO_MAX	(MaxMultiXactOffsetOld / MULTIXACT_MEMBERS_PER_PAGE / SLRU_PAGES_PER_SEGMENT)
+#define PAGENO_MAX	(MaxMultiXactOffsetOld / MULTIXACT_MEMBERS_PER_PAGE % SLRU_PAGES_PER_SEGMENT)
+		Assert((oldseg.segno == SEGNO_MAX && oldseg.pageno == PAGENO_MAX + 1) ||
+			   (oldseg.segno == SEGNO_MAX + 1 && oldseg.pageno == 0));
+
+		/* Switch to segment 0000 */
+		close_segment(&oldseg);
+		oldseg.segno = 0;
+		oldseg.pageno = 0;
+
+		/* skip invalid zero multi offset */
+		oldidx = 1;
+	}
+
+	MultiXactMembersCtxFinit(&newctx);
+
+	/* Release resources */
+	close_segment(&oldseg);
+
+	pfree(oldseg.dir);
+}
-- 
2.49.0
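
The offset rebasing in convert_multixact_offsets() above can be sketched in isolation. This is only an illustration of the arithmetic, not code from the patch: rebase_offset() is a hypothetical helper name, and it assumes (as the patch comments state) that old offsets are 32-bit, that offset 0 is invalid and skipped on wraparound, and that new offsets should start from 1.

```c
#include <stdint.h>

/*
 * Hypothetical helper, not part of the patch: map an old 32-bit member
 * offset to its new 64-bit value, given the oldest offset still in use.
 */
uint64_t
rebase_offset(uint32_t old_offset, uint32_t oldest_offset)
{
	uint64_t	offset = old_offset;

	/*
	 * An old value below the oldest offset means the 32-bit counter
	 * wrapped around.  Offset 0 is never used, so one full lap is
	 * 2^32 - 1 values (old offset 1 follows old offset 0xFFFFFFFF).
	 */
	if (offset < oldest_offset)
		offset += (UINT64_C(1) << 32) - 1;

	/* Shift so that the oldest offset becomes 1. */
	return offset - oldest_offset + 1;
}
```

With oldest_offset = 0xFFFFFC00 (the value used by the wraparound test cases), old offset 0xFFFFFC00 maps to 1, old offset 0xFFFFFFFF maps to 1024, and the wrapped old offset 1 maps to 1025, so the new sequence stays contiguous across the old wraparound point.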

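The starting position computed at the top of convert_multixact_offsets() follows the same pattern for the old and the new layout; only the entry size differs. A minimal sketch, assuming the default BLCKSZ of 8192 and using a hypothetical SlruPos struct and locate_offset_entry() function that do not exist in the patch:

```c
#include <stddef.h>
#include <stdint.h>

#define BLCKSZ					8192	/* assumed default block size */
#define SLRU_PAGES_PER_SEGMENT	32		/* same value as in the patch */

/* Hypothetical type: where a multixact's offset entry lives on disk. */
typedef struct SlruPos
{
	uint64_t	segno;		/* segment file number */
	uint64_t	pageno;		/* page within the segment */
	uint64_t	entry;		/* entry within the page */
} SlruPos;

/*
 * Locate the offsets entry of a multixact for a given entry size:
 * 4 bytes for the old 32-bit layout, 8 bytes for the new 64-bit one.
 */
SlruPos
locate_offset_entry(uint64_t multi, size_t entry_size)
{
	uint64_t	per_page = BLCKSZ / entry_size;
	SlruPos		pos;

	pos.entry = multi % per_page;
	pos.pageno = multi / per_page;
	pos.segno = pos.pageno / SLRU_PAGES_PER_SEGMENT;
	pos.pageno %= SLRU_PAGES_PER_SEGMENT;

	return pos;
}
```

For multi = 0x123400 this gives segment 18, page 6, entry 1024 with 4-byte entries, and segment 36, page 13, entry 0 with 8-byte entries. Doubling the entry size halves the entries per page, which is why the same multixact lands in a different segment file after conversion.
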
v16-0007-TEST-bump-catver.patch (application/octet-stream)
From 877c76cef1db331bd8782f92f63224f317d7df18 Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Wed, 13 Nov 2024 16:34:34 +0300
Subject: [PATCH v16 7/7] TEST: bump catver

---
 src/bin/pg_upgrade/pg_upgrade.h  | 2 +-
 src/include/catalog/catversion.h | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index e9c5d3177f..ca4fcbf66c 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -119,7 +119,7 @@ extern char *output_files[];
  *
  * XXX: should be changed to the actual CATALOG_VERSION_NO on commit.
  */
-#define MULTIXACTOFFSET_FORMATCHANGE_CAT_VER 202504092
+#define MULTIXACTOFFSET_FORMATCHANGE_CAT_VER 202506302
 
 /*
  * large object chunk size added to pg_controldata,
diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h
index ff9ffd9d47..6631a3008a 100644
--- a/src/include/catalog/catversion.h
+++ b/src/include/catalog/catversion.h
@@ -57,6 +57,6 @@
  */
 
 /*							yyyymmddN */
-#define CATALOG_VERSION_NO	202506301
+#define CATALOG_VERSION_NO	202506302
 
 #endif
-- 
2.49.0

v16-0006-TEST-add-src-bin-pg_upgrade-t-006_offset.pl.patch (application/octet-stream)
From 5686a434c47bf1348d23c8c918a4e15017a7065a Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Tue, 19 Nov 2024 17:08:10 +0300
Subject: [PATCH v16 6/7] TEST: add src/bin/pg_upgrade/t/006_offset.pl

---
 src/bin/pg_upgrade/pg_upgrade.h    |   2 +-
 src/bin/pg_upgrade/t/006_offset.pl | 562 +++++++++++++++++++++++++++++
 2 files changed, 563 insertions(+), 1 deletion(-)
 create mode 100644 src/bin/pg_upgrade/t/006_offset.pl

diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index 2d4f1d39e5..e9c5d3177f 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -119,7 +119,7 @@ extern char *output_files[];
  *
  * XXX: should be changed to the actual CATALOG_VERSION_NO on commit.
  */
-#define MULTIXACTOFFSET_FORMATCHANGE_CAT_VER 202409041
+#define MULTIXACTOFFSET_FORMATCHANGE_CAT_VER 202504092
 
 /*
  * large object chunk size added to pg_controldata,
diff --git a/src/bin/pg_upgrade/t/006_offset.pl b/src/bin/pg_upgrade/t/006_offset.pl
new file mode 100644
index 0000000000..f5dc733a30
--- /dev/null
+++ b/src/bin/pg_upgrade/t/006_offset.pl
@@ -0,0 +1,562 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+use strict;
+use warnings FATAL => 'all';
+
+use File::Find qw(find);
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# This pair of calls will create significantly more member segments than offset
+# segments.
+sub prep
+{
+	my $node = shift;
+	my $tbl = shift;
+
+	$node->safe_psql('postgres',
+		"CREATE TABLE ${tbl} (I INT PRIMARY KEY, N_UPDATED INT) " .
+		"       WITH (AUTOVACUUM_ENABLED=FALSE);" .
+		"INSERT INTO ${tbl} SELECT G, 0 FROM GENERATE_SERIES(1, 50) G;");
+}
+
+sub fill
+{
+	my $node = shift;
+	my $tbl = shift;
+
+	my $nclients = 50;
+	my $update_every = 90;
+	my @connections = ();
+
+	for (1..$nclients)
+	{
+		my $conn = $node->background_psql('postgres');
+		$conn->query_safe("BEGIN");
+
+		push(@connections, $conn);
+	}
+
+	for (my $i = 0; $i < 20000; $i++)
+	{
+		my $conn = $connections[$i % $nclients];
+
+		$conn->query_safe("COMMIT;");
+		$conn->query_safe("BEGIN");
+
+		if ($i % $update_every == 0)
+		{
+			$conn->query_safe(
+				"UPDATE ${tbl} SET " .
+				"N_UPDATED = N_UPDATED + 1 " .
+				"WHERE I = ${i} % 50");
+		}
+		else
+		{
+			$conn->query_safe(
+				"SELECT * FROM ${tbl} FOR KEY SHARE");
+		}
+	}
+
+	for my $conn (@connections)
+	{
+		$conn->quit();
+	}
+}
+
+# This pair of calls will create roughly the same number of member and
+# offset segments.
+sub prep2
+{
+	my $node = shift;
+	my $tbl = shift;
+
+	$node->safe_psql('postgres',
+		"CREATE TABLE ${tbl}(BAR INT PRIMARY KEY, BAZ INT); " .
+		"CREATE OR REPLACE PROCEDURE MXIDFILLER(N_STEPS INT DEFAULT 1000) " .
+		"LANGUAGE PLPGSQL " .
+		"AS \$\$ " .
+		"BEGIN " .
+		"	FOR I IN 1..N_STEPS LOOP " .
+		"		UPDATE ${tbl} SET BAZ = RANDOM(1, 1000) " .
+		"		WHERE BAR IN (SELECT BAR FROM ${tbl} " .
+		"						TABLESAMPLE BERNOULLI(80)); " .
+		"		COMMIT; " .
+		"	END LOOP; " .
+		"END; \$\$; " .
+		"INSERT INTO ${tbl} (BAR, BAZ) " .
+		"SELECT ID, ID FROM GENERATE_SERIES(1, 1024) ID;");
+}
+
+sub fill2
+{
+	my $node = shift;
+	my $tbl = shift;
+	my $scale = shift // 1;
+
+	$node->safe_psql('postgres',
+		"BEGIN; " .
+		"SELECT * FROM ${tbl} FOR KEY SHARE; " .
+		"PREPARE TRANSACTION 'A'; " .
+		"CALL MXIDFILLER((365 * ${scale})::int); " .
+		"COMMIT PREPARED 'A';");
+}
+
+
+# generate around 2 offset segments and 55 member segments
+sub mxid_gen1
+{
+	my $node = shift;
+	my $tbl = shift;
+
+	prep($node, $tbl);
+	fill($node, $tbl);
+
+	$node->safe_psql('postgres', q(CHECKPOINT));
+}
+
+# generate around 10 offset segments and 12 member segments
+sub mxid_gen2
+{
+	my $node = shift;
+	my $tbl = shift;
+	my $scale = shift // 1;
+
+	prep2($node, $tbl);
+	fill2($node, $tbl, $scale);
+
+	$node->safe_psql('postgres', q(CHECKPOINT));
+}
+
+# Fetch latest multixact checkpoint values.
+sub multi_bounds
+{
+	my ($node) = @_;
+	my $path = $node->config_data('--bindir');
+	my ($stdout, $stderr) = run_command([
+									$path . '/pg_controldata',
+									$node->data_dir
+								]);
+	my @control_data = split("\n", $stdout);
+	my $next = undef;
+	my $oldest = undef;
+	my $next_offset = undef;
+
+	foreach (@control_data)
+	{
+		if ($_ =~ /^Latest checkpoint's NextMultiXactId:\s*(.*)$/mg)
+		{
+			$next = $1;
+			print ">>> @ node ". $node->name . ", " . $_ . "\n";
+		}
+
+		if ($_ =~ /^Latest checkpoint's oldestMultiXid:\s*(.*)$/mg)
+		{
+			$oldest = $1;
+			print ">>> @ node ". $node->name . ", " . $_ . "\n";
+		}
+
+		if ($_ =~ /^Latest checkpoint's NextMultiOffset:\s*(.*)$/mg)
+		{
+			$next_offset = $1;
+			print ">>> @ node ". $node->name . ", " . $_ . "\n";
+		}
+
+		if (defined($oldest) && defined($next) && defined($next_offset))
+		{
+			last;
+		}
+	}
+
+	die "Latest checkpoint's NextMultiXactId not found in control file!\n"
+	unless defined($next);
+
+	die "Latest checkpoint's oldestMultiXid not found in control file!\n"
+	unless defined($oldest);
+
+	die "Latest checkpoint's NextMultiOffset not found in control file!\n"
+	unless defined($next_offset);
+
+	return ($oldest, $next, $next_offset);
+}
+
+# Create node from existing bins.
+sub create_new_node
+{
+	my ($name, %params) = @_;
+
+	create_node(0, @_);
+}
+
+# Create node from ENV oldinstall
+sub create_old_node
+{
+	my ($name, %params) = @_;
+
+	if (!defined($ENV{oldinstall}))
+	{
+		die "oldinstall is not defined";
+	}
+
+	create_node(1, @_);
+}
+
+sub create_node
+{
+	my ($install_path_from_env, $name, %params) = @_;
+	my $scale = defined $params{scale} ? $params{scale} : 1;
+	my $multi = defined $params{multi} ? $params{multi} : undef;
+	my $offset = defined $params{offset} ? $params{offset} : undef;
+
+	my $node =
+		$install_path_from_env ?
+			PostgreSQL::Test::Cluster->new($name,
+					install_path => $ENV{oldinstall}) :
+			PostgreSQL::Test::Cluster->new($name);
+
+	$node->init(force_initdb => 1,
+		extra => [
+			$multi ? ('-m', $multi) : (),
+			$offset ? ('-o', $offset) : (),
+		]);
+
+	# Fixup MOX patch quirk
+	if ($multi)
+	{
+		unlink $node->data_dir . '/pg_multixact/offsets/0000';
+	}
+	if ($offset)
+	{
+		unlink $node->data_dir . '/pg_multixact/members/0000';
+	}
+
+	$node->append_conf('postgresql.conf', 'fsync = off');
+	$node->append_conf('postgresql.conf', 'max_prepared_transactions = 2');
+
+	$node->start();
+	mxid_gen2($node, 'FOO', $scale);
+	mxid_gen1($node, 'BAR', $scale);
+	$node->restart();
+	$node->safe_psql('postgres', q(SELECT * FROM FOO));		# just in case...
+	$node->safe_psql('postgres', q(SELECT * FROM BAR));
+	$node->safe_psql('postgres', q(CHECKPOINT));
+	$node->stop();
+
+	return $node;
+}
+
+sub do_upgrade
+{
+	my ($oldnode, $newnode) = @_;
+
+	command_ok(
+		[
+			'pg_upgrade', '--no-sync',
+			'-d', $oldnode->data_dir,
+			'-D', $newnode->data_dir,
+			'-b', $oldnode->config_data('--bindir'),
+			'-B', $newnode->config_data('--bindir'),
+			'-s', $newnode->host,
+			'-p', $oldnode->port,
+			'-P', $newnode->port,
+			'--check'
+		],
+		'run of pg_upgrade');
+
+	command_ok(
+		[
+			'pg_upgrade', '--no-sync',
+			'-d', $oldnode->data_dir,
+			'-D', $newnode->data_dir,
+			'-b', $oldnode->config_data('--bindir'),
+			'-B', $newnode->config_data('--bindir'),
+			'-s', $newnode->host,
+			'-p', $oldnode->port,
+			'-P', $newnode->port,
+			'--copy'
+		],
+		'run of pg_upgrade');
+
+	$oldnode->start();
+	$newnode->start();
+
+	my $oldfoo = $oldnode->safe_psql('postgres', q(SELECT * FROM FOO));
+	my $newfoo = $newnode->safe_psql('postgres', q(SELECT * FROM FOO));
+	is($oldfoo, $newfoo, "select foo eq");
+
+	my $oldbar = $oldnode->safe_psql('postgres', q(SELECT * FROM BAR));
+	my $newbar = $newnode->safe_psql('postgres', q(SELECT * FROM BAR));
+	is($oldbar, $newbar, "select bar eq");
+
+	$oldnode->stop();
+	$newnode->stop();
+
+	multi_bounds($oldnode);
+	multi_bounds($newnode);
+}
+
+my @TESTS = (
+	# tests without ENV oldinstall
+	#0, 1, 2, 3, 4, 5, 6,
+	# tests with "real" pg_upgrade
+	#100, 101, 102, 103, 104, 105, 106,
+	# self upgrade
+	1000,
+);
+
+# =============================================================================
+# Basic sanity tests on a NEW bin
+# =============================================================================
+
+# starts from the zero
+SKIP:
+{
+	my $TEST_NO = 0;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $node = create_new_node('simple_mo',
+						scale => 1);
+	multi_bounds($node);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# multi starts from the value
+SKIP:
+{
+	my $TEST_NO = 1;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $node = create_new_node('simple_Mo',
+						scale => 1.15,
+						multi => '0x123400');
+	multi_bounds($node);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# offsets starts from the value
+SKIP:
+{
+	my $TEST_NO = 2;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $node = create_new_node('simple_mO',
+						scale => 1.15,
+						offset => '0x432100');
+	multi_bounds($node);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# multi and offsets starts from the value
+SKIP:
+{
+	my $TEST_NO = 3;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $node = create_new_node('simple_MO',
+						scale => 1.15,
+						multi => '0xDEAD00', offset => '0xBEEF00');
+	multi_bounds($node);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# multi starts from the value, multi wrap
+SKIP:
+{
+	my $TEST_NO = 4;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $node = create_new_node('simple_Mo_wrap',
+						scale => 1.15,
+						multi => '0xFFFF7000');
+	multi_bounds($node);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# offsets starts from the value, offsets wrap
+SKIP:
+{
+	my $TEST_NO = 5;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $node = create_new_node('simple_mO_wrap',
+						scale => 1.15,
+						offset => '0xFFFFFC00');
+	multi_bounds($node);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# multi starts from the value, offsets starts from the value,
+# multi wrap, offsets wrap
+SKIP:
+{
+	my $TEST_NO = 6;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $node = create_new_node('simple_MO_wrap',
+						scale => 1.15,
+						multi => '0xFFFF7000', offset => '0xFFFFFC00');
+	multi_bounds($node);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# =============================================================================
+# pg_upgrade tests
+# =============================================================================
+
+# starts from the zero
+SKIP:
+{
+	my $TEST_NO = 100;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $dbname = 'mo';
+	my $oldnode = create_old_node("old_$dbname",
+						scale => 1);
+	my $newnode = PostgreSQL::Test::Cluster->new("new_$dbname");
+	$newnode->init();
+
+	do_upgrade($oldnode, $newnode);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# multi starts from the value
+SKIP:
+{
+	my $TEST_NO = 101;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $dbname = 'Mo';
+	my $oldnode = create_old_node("old_$dbname",
+						scale => 1.2,
+						multi => '0x123400');
+	my $newnode = PostgreSQL::Test::Cluster->new("new_$dbname");
+	$newnode->init();
+
+	do_upgrade($oldnode, $newnode);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# offsets starts from the value
+SKIP:
+{
+	my $TEST_NO = 102;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $dbname = 'mO';
+	my $oldnode = create_old_node("old_$dbname",
+						scale => 1.2,
+						offset => '0x432100');
+	my $newnode = PostgreSQL::Test::Cluster->new("new_$dbname");
+	$newnode->init();
+
+	do_upgrade($oldnode, $newnode);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# multi and offsets starts from the value
+SKIP:
+{
+	my $TEST_NO = 103;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $dbname = 'MO';
+	my $oldnode = create_old_node("old_$dbname",
+						scale => 1.2,
+						multi => '0xDEAD00', offset => '0xBEEF00');
+	my $newnode = PostgreSQL::Test::Cluster->new("new_$dbname");
+	$newnode->init();
+
+	do_upgrade($oldnode, $newnode);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# multi starts from the value, multi wrap
+SKIP:
+{
+	my $TEST_NO = 104;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $dbname = 'Mo_wrap';
+	my $oldnode = create_old_node("old_$dbname",
+						scale => 1.2,
+						multi => '0xFFFF7000');
+	my $newnode = PostgreSQL::Test::Cluster->new("new_$dbname");
+	$newnode->init();
+
+	do_upgrade($oldnode, $newnode);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# offsets starts from the value, offsets wrap
+SKIP:
+{
+	my $TEST_NO = 105;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $dbname = 'mO_wrap';
+	my $oldnode = create_old_node("old_$dbname",
+						scale => 1.2,
+						offset => '0xFFFFFC00');
+	my $newnode = PostgreSQL::Test::Cluster->new("new_$dbname");
+	$newnode->init();
+
+	do_upgrade($oldnode, $newnode);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# multi starts from the value, offsets starts from the value,
+# multi wrap, offsets wrap
+SKIP:
+{
+	my $TEST_NO = 106;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $dbname = 'MO_wrap';
+	my $oldnode = create_old_node("old_$dbname",
+						scale => 1.2,
+						multi => '0xFFFF7000', offset => '0xFFFFFC00');
+	my $newnode = PostgreSQL::Test::Cluster->new("new_$dbname");
+	$newnode->init();
+
+	do_upgrade($oldnode, $newnode);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# =============================================================================
+# Self upgrade
+# =============================================================================
+
+# starts from zero
+SKIP:
+{
+	my $TEST_NO = 1000;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $dbname = 'self_upgrade';
+	my $oldnode = create_new_node("old_$dbname",
+						scale => 1);
+	my $newnode = PostgreSQL::Test::Cluster->new("new_$dbname");
+	$newnode->init();
+
+	do_upgrade($oldnode, $newnode);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+done_testing();
-- 
2.49.0
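
For readers following the "wrap" test cases above: the old-cluster offset of 0xFFFFFC00 sits just below the 32-bit limit, so the counter wraps after a small number of members. Below is a minimal sketch of that arithmetic, not code from the patch set. The helper names are hypothetical, and the sketch ignores any special handling of offset 0 that the server applies on wraparound.

```c
#include <stdint.h>

/*
 * Hypothetical helper: how many members can still be allocated before a
 * 32-bit offset counter that currently stands at "start" wraps past
 * 0xFFFFFFFF.
 */
static uint64_t
members_until_wrap32(uint32_t start)
{
	return (UINT64_C(1) << 32) - start;
}

/* Advance a 32-bit offset; unsigned arithmetic wraps modulo 2^32. */
static uint32_t
advance_offset32(uint32_t off, uint32_t nmembers)
{
	return off + nmembers;
}

/* Advance a 64-bit offset; the same step simply keeps counting. */
static uint64_t
advance_offset64(uint64_t off, uint32_t nmembers)
{
	return off + nmembers;
}
```

With the value used in tests 105 and 106, members_until_wrap32(0xFFFFFC00) gives 1024, so a short workload on the old cluster is enough to push the 32-bit counter across the boundary before the upgrade.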

#42Maxim Orlov
orlovmg@gmail.com
In reply to: Maxim Orlov (#41)
7 attachment(s)
Re: POC: make mxidoff 64 bits

Once again, @ 8191e0c16a

--
Best regards,
Maxim Orlov.

Attachments:

v17-0005-TEST-initdb-option-to-initialize-cluster-with-no.patch (application/octet-stream)
From 99324c51139fb28f176e6a2c0bd0fd43385e09f9 Mon Sep 17 00:00:00 2001
From: Maxim Orlov <m.orlov@postgrespro.ru>
Date: Wed, 4 May 2022 15:53:36 +0300
Subject: [PATCH v17 5/7] TEST: initdb option to initialize cluster with
 non-standard xid/mxid/mxoff

Until now, testing database cluster wraparound was not easy, as initdb has
always initialized the cluster with the default xid/mxid/mxoff. The option to
specify any valid xid/mxid/mxoff at cluster initialization makes this easier.

Author: Maxim Orlov <orlovmg@gmail.com>
Author: Pavel Borisov <pashkin.elfe@gmail.com>
Author: Svetlana Derevyanko <s.derevyanko@postgrespro.ru>
Discussion: https://www.postgresql.org/message-id/flat/CACG%3Dezaa4vqYjJ16yoxgrpa-%3DgXnf0Vv3Ey9bjGrRRFN2YyWFQ%40mail.gmail.com
---
 src/backend/access/transam/clog.c      |   6 ++
 src/backend/access/transam/multixact.c |  10 +++
 src/backend/access/transam/subtrans.c  |   6 ++
 src/backend/access/transam/xlog.c      |  15 ++--
 src/backend/bootstrap/bootstrap.c      |  50 +++++++++++-
 src/backend/main/main.c                |   6 ++
 src/backend/postmaster/postmaster.c    |  14 +++-
 src/backend/tcop/postgres.c            |  53 +++++++++++-
 src/bin/initdb/initdb.c                | 107 ++++++++++++++++++++++++-
 src/bin/initdb/t/001_initdb.pl         |  60 ++++++++++++++
 src/include/access/xlog.h              |   3 +
 src/include/c.h                        |   4 +
 src/include/catalog/pg_class.h         |   2 +-
 13 files changed, 324 insertions(+), 12 deletions(-)

diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index e80fbe109c..bff6926bf7 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -830,8 +830,14 @@ check_transaction_buffers(int *newval, void **extra, GucSource source)
 void
 BootStrapCLOG(void)
 {
+	int64	pageno;
+
 	/* Zero the initial page and flush it to disk */
 	SimpleLruZeroAndWritePage(XactCtl, 0);
+
+	pageno = TransactionIdToPage(XidFromFullTransactionId(TransamVariables->nextXid));
+	if (pageno)
+		SimpleLruZeroAndWritePage(XactCtl, pageno);
 }
 
 /*
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 48e2f8a4dd..64d00078d9 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1943,9 +1943,19 @@ check_multixact_member_buffers(int *newval, void **extra, GucSource source)
 void
 BootStrapMultiXact(void)
 {
+	int64	pageno;
+
 	/* Zero the initial pages and flush them to disk */
 	SimpleLruZeroAndWritePage(MultiXactOffsetCtl, 0);
 	SimpleLruZeroAndWritePage(MultiXactMemberCtl, 0);
+
+	pageno = MultiXactIdToOffsetPage(MultiXactState->nextMXact);
+	if (pageno)
+		SimpleLruZeroAndWritePage(MultiXactOffsetCtl, pageno);
+
+	pageno = MXOffsetToMemberPage(MultiXactState->nextOffset);
+	if (pageno)
+		SimpleLruZeroAndWritePage(MultiXactMemberCtl, pageno);
 }
 
 /*
diff --git a/src/backend/access/transam/subtrans.c b/src/backend/access/transam/subtrans.c
index 09aace9e09..e73e0c3a0f 100644
--- a/src/backend/access/transam/subtrans.c
+++ b/src/backend/access/transam/subtrans.c
@@ -268,8 +268,14 @@ check_subtrans_buffers(int *newval, void **extra, GucSource source)
 void
 BootStrapSUBTRANS(void)
 {
+	int64	pageno;
+
 	/* Zero the initial page and flush it to disk */
 	SimpleLruZeroAndWritePage(SubTransCtl, 0);
+
+	pageno = TransactionIdToPage(XidFromFullTransactionId(TransamVariables->nextXid));
+	if (pageno)
+		SimpleLruZeroAndWritePage(SubTransCtl, pageno);
 }
 
 /*
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 7ffb217915..ebf67d0713 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -137,6 +137,10 @@ int			max_slot_wal_keep_size_mb = -1;
 int			wal_decode_buffer_size = 512 * 1024;
 bool		track_wal_io_timing = false;
 
+TransactionId		start_xid = FirstNormalTransactionId;
+MultiXactId			start_mxid = FirstMultiXactId;
+MultiXactOffset		start_mxoff = 0;
+
 #ifdef WAL_DEBUG
 bool		XLOG_DEBUG = false;
 #endif
@@ -5102,13 +5106,14 @@ BootStrapXLOG(uint32 data_checksum_version)
 	checkPoint.fullPageWrites = fullPageWrites;
 	checkPoint.wal_level = wal_level;
 	checkPoint.nextXid =
-		FullTransactionIdFromEpochAndXid(0, FirstNormalTransactionId);
+		FullTransactionIdFromEpochAndXid(0, Max(FirstNormalTransactionId,
+												start_xid));
 	checkPoint.nextOid = FirstGenbkiObjectId;
-	checkPoint.nextMulti = FirstMultiXactId;
-	checkPoint.nextMultiOffset = 0;
-	checkPoint.oldestXid = FirstNormalTransactionId;
+	checkPoint.nextMulti = Max(FirstMultiXactId, start_mxid);
+	checkPoint.nextMultiOffset = start_mxoff;
+	checkPoint.oldestXid = XidFromFullTransactionId(checkPoint.nextXid);
 	checkPoint.oldestXidDB = Template1DbOid;
-	checkPoint.oldestMulti = FirstMultiXactId;
+	checkPoint.oldestMulti = checkPoint.nextMulti;
 	checkPoint.oldestMultiDB = Template1DbOid;
 	checkPoint.oldestCommitTsXid = InvalidTransactionId;
 	checkPoint.newestCommitTsXid = InvalidTransactionId;
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index fc8638c1b6..51fc1106fe 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -220,7 +220,7 @@ BootstrapModeMain(int argc, char *argv[], bool check_only)
 	argv++;
 	argc--;
 
-	while ((flag = getopt(argc, argv, "B:c:d:D:Fkr:X:-:")) != -1)
+	while ((flag = getopt(argc, argv, "B:c:d:D:Fkm:o:r:X:x:-:")) != -1)
 	{
 		switch (flag)
 		{
@@ -288,12 +288,60 @@ BootstrapModeMain(int argc, char *argv[], bool check_only)
 			case 'k':
 				bootstrap_data_checksum_version = PG_DATA_CHECKSUM_VERSION;
 				break;
+			case 'm':
+				{
+					char	   *endptr;
+
+					errno = 0;
+					start_mxid = strtou64(optarg, &endptr, 0);
+
+					if (endptr == optarg || *endptr != '\0' || errno != 0 ||
+						!StartMultiXactIdIsValid(start_mxid))
+					{
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("invalid initial database cluster multixact id")));
+					}
+				}
+				break;
+			case 'o':
+				{
+					char	   *endptr;
+
+					errno = 0;
+					start_mxoff = strtou64(optarg, &endptr, 0);
+
+					if (endptr == optarg || *endptr != '\0' || errno != 0 ||
+						!StartMultiXactOffsetIsValid(start_mxoff))
+					{
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("invalid initial database cluster multixact offset")));
+					}
+				}
+				break;
 			case 'r':
 				strlcpy(OutputFileName, optarg, MAXPGPATH);
 				break;
 			case 'X':
 				SetConfigOption("wal_segment_size", optarg, PGC_INTERNAL, PGC_S_DYNAMIC_DEFAULT);
 				break;
+			case 'x':
+				{
+					char	   *endptr;
+
+					errno = 0;
+					start_xid = strtou64(optarg, &endptr, 0);
+
+					if (endptr == optarg || *endptr != '\0' || errno != 0 ||
+						!StartTransactionIdIsValid(start_xid))
+					{
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("invalid initial database cluster xid value")));
+					}
+				}
+				break;
 			default:
 				write_stderr("Try \"%s --help\" for more information.\n",
 							 progname);
diff --git a/src/backend/main/main.c b/src/backend/main/main.c
index bdcb5e4f26..910f2ceb6a 100644
--- a/src/backend/main/main.c
+++ b/src/backend/main/main.c
@@ -427,12 +427,18 @@ help(const char *progname)
 	printf(_("  -E                 echo statement before execution\n"));
 	printf(_("  -j                 do not use newline as interactive query delimiter\n"));
 	printf(_("  -r FILENAME        send stdout and stderr to given file\n"));
+	printf(_("  -m START_MXID      set initial database cluster multixact id\n"));
+	printf(_("  -o START_MXOFF     set initial database cluster multixact offset\n"));
+	printf(_("  -x START_XID       set initial database cluster xid\n"));
 
 	printf(_("\nOptions for bootstrapping mode:\n"));
 	printf(_("  --boot             selects bootstrapping mode (must be first argument)\n"));
 	printf(_("  --check            selects check mode (must be first argument)\n"));
 	printf(_("  DBNAME             database name (mandatory argument in bootstrapping mode)\n"));
 	printf(_("  -r FILENAME        send stdout and stderr to given file\n"));
+	printf(_("  -m START_MXID      set initial database cluster multixact id\n"));
+	printf(_("  -o START_MXOFF     set initial database cluster multixact offset\n"));
+	printf(_("  -x START_XID       set initial database cluster xid\n"));
 
 	printf(_("\nPlease read the documentation for the complete list of run-time\n"
 			 "configuration settings and how to set them on the command line or in\n"
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index e1d643b013..e48e528497 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -595,7 +595,7 @@ PostmasterMain(int argc, char *argv[])
 	 * tcop/postgres.c (the option sets should not conflict) and with the
 	 * common help() function in main/main.c.
 	 */
-	while ((opt = getopt(argc, argv, "B:bC:c:D:d:EeFf:h:ijk:lN:OPp:r:S:sTt:W:-:")) != -1)
+	while ((opt = getopt(argc, argv, "B:bC:c:D:d:EeFf:h:ijk:lm:N:Oo:Pp:r:S:sTt:W:x:-:")) != -1)
 	{
 		switch (opt)
 		{
@@ -705,10 +705,18 @@ PostmasterMain(int argc, char *argv[])
 				SetConfigOption("max_connections", optarg, PGC_POSTMASTER, PGC_S_ARGV);
 				break;
 
+			case 'm':
+				/* only used by single-user backend */
+				break;
+
 			case 'O':
 				SetConfigOption("allow_system_table_mods", "true", PGC_POSTMASTER, PGC_S_ARGV);
 				break;
 
+			case 'o':
+				/* only used by single-user backend */
+				break;
+
 			case 'P':
 				SetConfigOption("ignore_system_indexes", "true", PGC_POSTMASTER, PGC_S_ARGV);
 				break;
@@ -759,6 +767,10 @@ PostmasterMain(int argc, char *argv[])
 				SetConfigOption("post_auth_delay", optarg, PGC_POSTMASTER, PGC_S_ARGV);
 				break;
 
+			case 'x':
+				/* only used by single-user backend */
+				break;
+
 			default:
 				write_stderr("Try \"%s --help\" for more information.\n",
 							 progname);
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index d356830f75..4ec39702f1 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3834,7 +3834,7 @@ process_postgres_switches(int argc, char *argv[], GucContext ctx,
 	 * postmaster/postmaster.c (the option sets should not conflict) and with
 	 * the common help() function in main/main.c.
 	 */
-	while ((flag = getopt(argc, argv, "B:bC:c:D:d:EeFf:h:ijk:lN:nOPp:r:S:sTt:v:W:-:")) != -1)
+	while ((flag = getopt(argc, argv, "B:bC:c:D:d:EeFf:h:ijk:lm:N:nOo:Pp:r:S:sTt:v:W:x:-:")) != -1)
 	{
 		switch (flag)
 		{
@@ -3939,6 +3939,23 @@ process_postgres_switches(int argc, char *argv[], GucContext ctx,
 				SetConfigOption("ssl", "true", ctx, gucsource);
 				break;
 
+			case 'm':
+				{
+					char	   *endptr;
+
+					errno = 0;
+					start_mxid = strtou64(optarg, &endptr, 0);
+
+					if (endptr == optarg || *endptr != '\0' || errno != 0 ||
+						!StartMultiXactIdIsValid(start_mxid))
+					{
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("invalid initial database cluster multixact id")));
+					}
+				}
+				break;
+
 			case 'N':
 				SetConfigOption("max_connections", optarg, ctx, gucsource);
 				break;
@@ -3951,6 +3968,23 @@ process_postgres_switches(int argc, char *argv[], GucContext ctx,
 				SetConfigOption("allow_system_table_mods", "true", ctx, gucsource);
 				break;
 
+			case 'o':
+				{
+					char	   *endptr;
+
+					errno = 0;
+					start_mxoff = strtou64(optarg, &endptr, 0);
+
+					if (endptr == optarg || *endptr != '\0' || errno != 0 ||
+						!StartMultiXactOffsetIsValid(start_mxoff))
+					{
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("invalid initial database cluster multixact offset")));
+					}
+				}
+				break;
+
 			case 'P':
 				SetConfigOption("ignore_system_indexes", "true", ctx, gucsource);
 				break;
@@ -4005,6 +4039,23 @@ process_postgres_switches(int argc, char *argv[], GucContext ctx,
 				SetConfigOption("post_auth_delay", optarg, ctx, gucsource);
 				break;
 
+			case 'x':
+				{
+					char	   *endptr;
+
+					errno = 0;
+					start_xid = strtou64(optarg, &endptr, 0);
+
+					if (endptr == optarg || *endptr != '\0' || errno != 0 ||
+						!StartTransactionIdIsValid(start_xid))
+					{
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("invalid initial database cluster xid")));
+					}
+				}
+				break;
+
 			default:
 				errs++;
 				break;
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 92fe2f531f..9974dd9f6b 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -169,6 +169,9 @@ static char *xlog_dir = NULL;
 static int	wal_segment_size_mb = (DEFAULT_XLOG_SEG_SIZE) / (1024 * 1024);
 static DataDirSyncMethod sync_method = DATA_DIR_SYNC_METHOD_FSYNC;
 static bool sync_data_files = true;
+static TransactionId start_xid = 0;
+static MultiXactId start_mxid = 0;
+static MultiXactOffset start_mxoff = 0;
 
 
 /* internal vars */
@@ -1583,6 +1586,11 @@ bootstrap_template1(void)
 	bki_lines = replace_token(bki_lines, "POSTGRES",
 							  escape_quotes_bki(username));
 
+	/* relfrozenxid must not be less than FirstNormalTransactionId */
+	sprintf(buf, "%u", Max(start_xid, 3));
+	bki_lines = replace_token(bki_lines, "RECENTXMIN",
+							  buf);
+
 	bki_lines = replace_token(bki_lines, "ENCODING",
 							  encodingid_to_string(encodingid));
 
@@ -1608,6 +1616,9 @@ bootstrap_template1(void)
 
 	printfPQExpBuffer(&cmd, "\"%s\" --boot %s %s", backend_exec, boot_options, extra_options);
 	appendPQExpBuffer(&cmd, " -X %d", wal_segment_size_mb * (1024 * 1024));
+	appendPQExpBuffer(&cmd, " -m %u", start_mxid);
+	appendPQExpBuffer(&cmd, " -o %" PRIu64, start_mxoff);
+	appendPQExpBuffer(&cmd, " -x %u", start_xid);
 	if (data_checksums)
 		appendPQExpBufferStr(&cmd, " -k");
 	if (debug)
@@ -2549,13 +2560,21 @@ usage(const char *progname)
 	printf(_("  -d, --debug               generate lots of debugging output\n"));
 	printf(_("      --discard-caches      set debug_discard_caches=1\n"));
 	printf(_("  -L DIRECTORY              where to find the input files\n"));
+	printf(_("  -m, --multixact-id=START_MXID\n"
+			 "                            set initial database cluster multixact id\n"
+			 "                            max value is 2^32-1\n"));
 	printf(_("  -n, --no-clean            do not clean up after errors\n"));
 	printf(_("  -N, --no-sync             do not wait for changes to be written safely to disk\n"));
 	printf(_("      --no-sync-data-files  do not sync files within database directories\n"));
 	printf(_("      --no-instructions     do not print instructions for next steps\n"));
+	printf(_("  -o, --multixact-offset=START_MXOFF\n"
+			 "                            set initial database cluster multixact offset\n"
+			 "                            max value is 2^32-1\n"));
 	printf(_("  -s, --show                show internal settings, then exit\n"));
 	printf(_("      --sync-method=METHOD  set method for syncing files to disk\n"));
 	printf(_("  -S, --sync-only           only sync database files to disk, then exit\n"));
+	printf(_("  -x, --xid=START_XID       set initial database cluster xid\n"
+			 "                            max value is 2^32-1\n"));
 	printf(_("\nOther options:\n"));
 	printf(_("  -V, --version             output version information, then exit\n"));
 	printf(_("  -?, --help                show this help, then exit\n"));
@@ -3090,6 +3109,18 @@ initialize_data_directory(void)
 	/* Now create all the text config files */
 	setup_config();
 
+	if (start_mxid != 0)
+		printf(_("selecting initial multixact id ... %u\n"),
+				 start_mxid);
+
+	if (start_mxoff != 0)
+		printf(_("selecting initial multixact offset ... %" PRIu64 "\n"),
+				 start_mxoff);
+
+	if (start_xid != 0)
+		printf(_("selecting initial xid ... %u\n"),
+				 start_xid);
+
 	/* Bootstrap template1 */
 	bootstrap_template1();
 
@@ -3106,8 +3137,12 @@ initialize_data_directory(void)
 	fflush(stdout);
 
 	initPQExpBuffer(&cmd);
-	printfPQExpBuffer(&cmd, "\"%s\" %s %s template1 >%s",
-					  backend_exec, backend_options, extra_options, DEVNULL);
+	printfPQExpBuffer(&cmd, "\"%s\" %s %s",
+					  backend_exec, backend_options, extra_options);
+	appendPQExpBuffer(&cmd, " -m %u", start_mxid);
+	appendPQExpBuffer(&cmd, " -o %" PRIu64, start_mxoff);
+	appendPQExpBuffer(&cmd, " -x %u", start_xid);
+	appendPQExpBuffer(&cmd, " template1 >%s", DEVNULL);
 
 	PG_CMD_OPEN(cmd.data);
 
@@ -3195,6 +3230,9 @@ main(int argc, char *argv[])
 		{"sync-method", required_argument, NULL, 19},
 		{"no-data-checksums", no_argument, NULL, 20},
 		{"no-sync-data-files", no_argument, NULL, 21},
+		{"xid", required_argument, NULL, 'x'},
+		{"multixact-id", required_argument, NULL, 'm'},
+		{"multixact-offset", required_argument, NULL, 'o'},
 		{NULL, 0, NULL, 0}
 	};
 
@@ -3236,7 +3274,7 @@ main(int argc, char *argv[])
 
 	/* process command-line options */
 
-	while ((c = getopt_long(argc, argv, "A:c:dD:E:gkL:nNsST:U:WX:",
+	while ((c = getopt_long(argc, argv, "A:c:dD:E:gkL:m:nNo:sST:U:Wx:X:",
 							long_options, &option_index)) != -1)
 	{
 		switch (c)
@@ -3294,6 +3332,30 @@ main(int argc, char *argv[])
 				debug = true;
 				printf(_("Running in debug mode.\n"));
 				break;
+			case 'm':
+				{
+					char	   *endptr;
+
+					errno = 0;
+					start_mxid = strtou64(optarg, &endptr, 0);
+
+					if (endptr == optarg || *endptr != '\0' || errno != 0 ||
+						!StartMultiXactIdIsValid(start_mxid))
+					{
+						pg_log_error("invalid initial database cluster multixact id");
+						exit(1);
+					}
+					else if (start_mxid < 1) /* FirstMultiXactId */
+					{
+						/*
+						 * Reject the value rather than silently raising mxid
+						 * to FirstMultiXactId, although that would be harmless.
+						 */
+						pg_log_error("multixact id should be greater than 0");
+						exit(1);
+					}
+				}
+				break;
 			case 'n':
 				noclean = true;
 				printf(_("Running in no-clean mode.  Mistakes will not be cleaned up.\n"));
@@ -3301,6 +3363,21 @@ main(int argc, char *argv[])
 			case 'N':
 				do_sync = false;
 				break;
+			case 'o':
+				{
+					char	   *endptr;
+
+					errno = 0;
+					start_mxoff = strtou64(optarg, &endptr, 0);
+
+					if (endptr == optarg || *endptr != '\0' || errno != 0 ||
+						!StartMultiXactOffsetIsValid(start_mxoff))
+					{
+						pg_log_error("invalid initial database cluster multixact offset");
+						exit(1);
+					}
+				}
+				break;
 			case 'S':
 				sync_only = true;
 				break;
@@ -3392,6 +3469,30 @@ main(int argc, char *argv[])
 			case 21:
 				sync_data_files = false;
 				break;
+			case 'x':
+				{
+					char	   *endptr;
+
+					errno = 0;
+					start_xid = strtou64(optarg, &endptr, 0);
+
+					if (endptr == optarg || *endptr != '\0' || errno != 0 ||
+						!StartTransactionIdIsValid(start_xid))
+					{
+						pg_log_error("invalid value for initial database cluster xid");
+						exit(1);
+					}
+					else if (start_xid < 3) /* FirstNormalTransactionId */
+					{
+						/*
+						 * Reject the value rather than silently raising xid to
+						 * FirstNormalTransactionId, although that would be harmless.
+						 */
+						pg_log_error("xid should be greater than 2");
+						exit(1);
+					}
+				}
+				break;
 			default:
 				/* getopt_long already emitted a complaint */
 				pg_log_error_hint("Try \"%s --help\" for more information.", progname);
diff --git a/src/bin/initdb/t/001_initdb.pl b/src/bin/initdb/t/001_initdb.pl
index b7ef7ed8d0..1c2e167110 100644
--- a/src/bin/initdb/t/001_initdb.pl
+++ b/src/bin/initdb/t/001_initdb.pl
@@ -331,4 +331,64 @@ command_fails(
 	[ 'pg_checksums', '--pgdata' => $datadir_nochecksums ],
 	"pg_checksums fails with data checksum disabled");
 
+# Set non-standard initial mxid/mxoff/xid.
+command_fails_like(
+	[ 'initdb', '-m', 'seven', $datadir ],
+	qr/initdb: error: invalid initial database cluster multixact id/,
+	'fails for invalid initial database cluster multixact id');
+command_fails_like(
+	[ 'initdb', '-o', 'seven', $datadir ],
+	qr/initdb: error: invalid initial database cluster multixact offset/,
+	'fails for invalid initial database cluster multixact offset');
+command_fails_like(
+	[ 'initdb', '-x', 'seven', $datadir ],
+	qr/initdb: error: invalid value for initial database cluster xid/,
+	'fails for invalid initial database cluster xid');
+
+command_checks_all(
+	[ 'initdb', '-m', '65535', "$tempdir/data-m65535" ],
+	0,
+	[qr/selecting initial multixact id ... 65535/],
+	[],
+	'selecting initial multixact id');
+command_checks_all(
+	[ 'initdb', '-o', '65535', "$tempdir/data-o65535" ],
+	0,
+	[qr/selecting initial multixact offset ... 65535/],
+	[],
+	'selecting initial multixact offset');
+command_checks_all(
+	[ 'initdb', '-x', '65535', "$tempdir/data-x65535" ],
+	0,
+	[qr/selecting initial xid ... 65535/],
+	[],
+	'selecting initial xid');
+
+# Setup new cluster with given mxid/mxoff/xid.
+my $node;
+my $result;
+
+$node = PostgreSQL::Test::Cluster->new('test-mxid');
+$node->init(extra => ['-m', '16777215']); # 0xFFFFFF
+$node->start;
+$result = $node->safe_psql('postgres', "SELECT next_multixact_id FROM pg_control_checkpoint();");
+ok($result >= 16777215, 'setup cluster with given mxid');
+$node->stop;
+
+$node = PostgreSQL::Test::Cluster->new('test-mxoff');
+$node->init(extra => ['-o', '16777215']); # 0xFFFFFF
+$node->start;
+$result = $node->safe_psql('postgres', "SELECT next_multi_offset FROM pg_control_checkpoint();");
+ok($result >= 16777215, 'setup cluster with given mxoff');
+$node->stop;
+
+$node = PostgreSQL::Test::Cluster->new('test-xid');
+$node->init(extra => ['-x', '16777215']); # 0xFFFFFF
+$node->start;
+$result = $node->safe_psql('postgres', "SELECT txid_current();");
+ok($result >= 16777215, 'setup cluster with given xid - check 1');
+$result = $node->safe_psql('postgres', "SELECT oldest_xid FROM pg_control_checkpoint();");
+ok($result >= 16777215, 'setup cluster with given xid - check 2');
+$node->stop;
+
 done_testing();
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index d12798be3d..61a5c0c269 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -94,6 +94,9 @@ typedef enum RecoveryState
 } RecoveryState;
 
 extern PGDLLIMPORT int wal_level;
+extern PGDLLIMPORT TransactionId start_xid;
+extern PGDLLIMPORT MultiXactId start_mxid;
+extern PGDLLIMPORT MultiXactOffset start_mxoff;
 
 /* Is WAL archiving enabled (always or only while server is running normally)? */
 #define XLogArchivingActive() \
diff --git a/src/include/c.h b/src/include/c.h
index de9ac13be7..20d01805af 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -645,6 +645,10 @@ typedef uint64 MultiXactOffset;
 
 typedef uint32 CommandId;
 
+#define StartTransactionIdIsValid(xid)			((xid) <= 0xFFFFFFFF)
+#define StartMultiXactIdIsValid(mxid)			((mxid) <= 0xFFFFFFFF)
+#define StartMultiXactOffsetIsValid(offset)		((offset) <= 0xFFFFFFFF)
+
 #define FirstCommandId	((CommandId) 0)
 #define InvalidCommandId	(~(CommandId)0)
 
diff --git a/src/include/catalog/pg_class.h b/src/include/catalog/pg_class.h
index 07d182da79..06e9969862 100644
--- a/src/include/catalog/pg_class.h
+++ b/src/include/catalog/pg_class.h
@@ -126,7 +126,7 @@ CATALOG(pg_class,1259,RelationRelationId) BKI_BOOTSTRAP BKI_ROWTYPE_OID(83,Relat
 	Oid			relrewrite BKI_DEFAULT(0) BKI_LOOKUP_OPT(pg_class);
 
 	/* all Xids < this are frozen in this rel */
-	TransactionId relfrozenxid BKI_DEFAULT(3);	/* FirstNormalTransactionId */
+	TransactionId relfrozenxid BKI_DEFAULT(RECENTXMIN);	/* FirstNormalTransactionId */
 
 	/* all multixacts in this rel are >= this; it is really a MultiXactId */
 	TransactionId relminmxid BKI_DEFAULT(1);	/* FirstMultiXactId */
-- 
2.50.1
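
The -m/-o/-x handling in the patch above repeats one pattern: parse the argument with base auto-detection, reject trailing garbage, then range-check against 0xFFFFFFFF. The bootstrap functions then zero the SLRU page that holds the starting value. Here is a rough standalone sketch of both steps, under some assumptions: it uses the standard strtoull() in place of PostgreSQL's strtou64(), the names parse_start_value and xid_to_clog_page are hypothetical, and the page size (8 kB) and four transactions per pg_xact byte are the default build values rather than anything read from the patch.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumed defaults: 8 kB pages, 2 status bits (4 xacts per byte) in pg_xact. */
#define SKETCH_BLCKSZ				8192
#define SKETCH_CLOG_XACTS_PER_PAGE	(SKETCH_BLCKSZ * 4)

/*
 * Hypothetical helper: parse a start value given in decimal, octal or hex,
 * and accept it only if the whole string was consumed and the value fits
 * in 32 bits (the same bound as the Start*IsValid macros in the patch).
 * The range check happens on the 64-bit value, before any narrowing.
 */
static bool
parse_start_value(const char *arg, uint64_t *result)
{
	char	   *endptr;
	unsigned long long val;

	errno = 0;
	val = strtoull(arg, &endptr, 0);

	if (endptr == arg || *endptr != '\0' || errno != 0)
		return false;
	if (val > UINT64_C(0xFFFFFFFF))
		return false;

	*result = val;
	return true;
}

/* Hypothetical helper: pg_xact page that holds the status of "xid". */
static int64_t
xid_to_clog_page(uint32_t xid)
{
	return xid / SKETCH_CLOG_XACTS_PER_PAGE;
}
```

For the value used in the TAP test, 16777215 (0xFFFFFF), xid_to_clog_page() gives page 511 under these assumptions, which would be the extra page BootStrapCLOG has to zero in addition to page 0.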

v17-0003-Make-pg_upgrade-convert-multixact-offsets.patch (application/octet-stream)
From 5402fa0f390047bb8f94a52151522d3fddb71d12 Mon Sep 17 00:00:00 2001
From: Maxim Orlov <m.orlov@postgrespro.ru>
Date: Tue, 13 Aug 2024 14:44:50 +0300
Subject: [PATCH v17 3/7] Make pg_upgrade convert multixact offsets.

Author: Maxim Orlov <orlovmg@gmail.com>
Author: Yura Sokolov <y.sokolov@postgrespro.ru>
---
 src/bin/pg_upgrade/Makefile     |   1 +
 src/bin/pg_upgrade/meson.build  |   1 +
 src/bin/pg_upgrade/pg_upgrade.c |  40 ++-
 src/bin/pg_upgrade/pg_upgrade.h |  14 +-
 src/bin/pg_upgrade/segresize.c  | 527 ++++++++++++++++++++++++++++++++
 5 files changed, 579 insertions(+), 4 deletions(-)
 create mode 100644 src/bin/pg_upgrade/segresize.c

diff --git a/src/bin/pg_upgrade/Makefile b/src/bin/pg_upgrade/Makefile
index f83d2b5d30..70908d63a3 100644
--- a/src/bin/pg_upgrade/Makefile
+++ b/src/bin/pg_upgrade/Makefile
@@ -21,6 +21,7 @@ OBJS = \
 	info.o \
 	option.o \
 	parallel.o \
+	segresize.o \
 	pg_upgrade.o \
 	relfilenumber.o \
 	server.o \
diff --git a/src/bin/pg_upgrade/meson.build b/src/bin/pg_upgrade/meson.build
index ac992f0d14..a8b1d77c2d 100644
--- a/src/bin/pg_upgrade/meson.build
+++ b/src/bin/pg_upgrade/meson.build
@@ -10,6 +10,7 @@ pg_upgrade_sources = files(
   'info.c',
   'option.c',
   'parallel.c',
+  'segresize.c',
   'pg_upgrade.c',
   'relfilenumber.c',
   'server.c',
diff --git a/src/bin/pg_upgrade/pg_upgrade.c b/src/bin/pg_upgrade/pg_upgrade.c
index d5cd5bf0b3..4837ee02a4 100644
--- a/src/bin/pg_upgrade/pg_upgrade.c
+++ b/src/bin/pg_upgrade/pg_upgrade.c
@@ -816,8 +816,42 @@ copy_xact_xlog_xid(void)
 	if (old_cluster.controldata.cat_ver >= MULTIXACT_FORMATCHANGE_CAT_VER &&
 		new_cluster.controldata.cat_ver >= MULTIXACT_FORMATCHANGE_CAT_VER)
 	{
-		copy_subdir_files("pg_multixact/offsets", "pg_multixact/offsets");
-		copy_subdir_files("pg_multixact/members", "pg_multixact/members");
+		/*
+		 * If the old server predates MULTIXACTOFFSET_FORMATCHANGE_CAT_VER,
+		 * it has 32-bit multixact offsets, which must be converted.
+		 */
+		if (old_cluster.controldata.cat_ver < MULTIXACTOFFSET_FORMATCHANGE_CAT_VER &&
+			new_cluster.controldata.cat_ver >= MULTIXACTOFFSET_FORMATCHANGE_CAT_VER)
+		{
+			MultiXactOffset		oldest_offset,
+								next_offset;
+
+			remove_new_subdir("pg_multixact/offsets", false);
+			prep_status("Converting pg_multixact/offsets to 64-bit");
+			oldest_offset = convert_multixact_offsets();
+			check_ok();
+
+			remove_new_subdir("pg_multixact/members", false);
+			prep_status("Converting pg_multixact/members");
+			convert_multixact_members(oldest_offset);
+			check_ok();
+
+			next_offset = old_cluster.controldata.chkpnt_nxtmxoff;
+			if (oldest_offset)
+			{
+				if (next_offset < oldest_offset)
+					next_offset += ((MultiXactOffset) 1 << 32) - 1;
+
+				next_offset -= oldest_offset - 1;
+
+				old_cluster.controldata.chkpnt_nxtmxoff = next_offset;
+			}
+		}
+		else
+		{
+			copy_subdir_files("pg_multixact/offsets", "pg_multixact/offsets");
+			copy_subdir_files("pg_multixact/members", "pg_multixact/members");
+		}
 
 		prep_status("Setting next multixact ID and offset for new cluster");
 
@@ -826,7 +860,7 @@ copy_xact_xlog_xid(void)
 		 * counters here and the oldest multi present on system.
 		 */
 		exec_prog(UTILITY_LOG_FILE, NULL, true, true,
-				  "\"%s/pg_resetwal\" -O %u -m %u,%u \"%s\"",
+				  "\"%s/pg_resetwal\" -O %" PRIu64 " -m %u,%u \"%s\"",
 				  new_cluster.bindir,
 				  old_cluster.controldata.chkpnt_nxtmxoff,
 				  old_cluster.controldata.chkpnt_nxtmulti,
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index 0ef47be0dc..54ce718e68 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -114,6 +114,13 @@ extern char *output_files[];
  */
 #define MULTIXACT_FORMATCHANGE_CAT_VER 201301231
 
+/*
+ * Switch from 32-bit to 64-bit for multixact offsets.
+ *
+ * XXX: should be changed to the actual CATALOG_VERSION_NO on commit.
+ */
+#define MULTIXACTOFFSET_FORMATCHANGE_CAT_VER 202409041
+
 /*
  * large object chunk size added to pg_controldata,
  * commit 5f93c37805e7485488480916b4585e098d3cc883
@@ -235,7 +242,7 @@ typedef struct
 	uint32		chkpnt_nxtepoch;
 	uint32		chkpnt_nxtoid;
 	uint32		chkpnt_nxtmulti;
-	uint32		chkpnt_nxtmxoff;
+	uint64		chkpnt_nxtmxoff;
 	uint32		chkpnt_oldstMulti;
 	uint32		chkpnt_oldstxid;
 	uint32		align;
@@ -528,3 +535,8 @@ typedef struct
 	FILE	   *file;
 	char		path[MAXPGPATH];
 } UpgradeTaskReport;
+
+/* segresize.c */
+
+MultiXactOffset		convert_multixact_offsets(void);
+void				convert_multixact_members(MultiXactOffset oldest_offset);
diff --git a/src/bin/pg_upgrade/segresize.c b/src/bin/pg_upgrade/segresize.c
new file mode 100644
index 0000000000..7f80d0652a
--- /dev/null
+++ b/src/bin/pg_upgrade/segresize.c
@@ -0,0 +1,527 @@
+/*
+ *	segresize.c
+ *
+ *	SLRU segment resize utility
+ *
+ *	Copyright (c) 2024, PostgreSQL Global Development Group
+ *	src/bin/pg_upgrade/segresize.c
+ */
+
+#include "postgres_fe.h"
+
+#include "pg_upgrade.h"
+#include "access/multixact.h"
+
+/* See slru.h */
+#define SLRU_PAGES_PER_SEGMENT		32
+
+/*
+ * Some kind of iterator associated with a particular SLRU segment.  The idea is
+ * to specify the segment and page number and then move through the pages.
+ */
+typedef struct SlruSegState
+{
+	char	   *dir;
+	char	   *fn;
+	FILE	   *file;
+	int64		segno;
+	uint64		pageno;
+	bool		leading_gap;
+} SlruSegState;
+
+/*
+ * Mirrors the SlruFileName from slru.c
+ */
+static inline char *
+SlruFileName(SlruSegState *state)
+{
+	Assert(state->segno >= 0 && state->segno <= INT64CONST(0xFFFFFF));
+	return psprintf("%s/%04X", state->dir, (unsigned int) state->segno);
+}
+
+/*
+ * Create new SLRU segment file.
+ */
+static void
+create_segment(SlruSegState *state)
+{
+	Assert(state->fn == NULL);
+	Assert(state->file == NULL);
+
+	state->fn = SlruFileName(state);
+	state->file = fopen(state->fn, "wb");
+	if (!state->file)
+		pg_fatal("could not create file \"%s\": %m", state->fn);
+}
+
+/*
+ * Open existing SLRU segment file.
+ */
+static void
+open_segment(SlruSegState *state)
+{
+	Assert(state->fn == NULL);
+	Assert(state->file == NULL);
+
+	state->fn = SlruFileName(state);
+	state->file = fopen(state->fn, "rb");
+	if (!state->file)
+		pg_fatal("could not open file \"%s\": %m", state->fn);
+}
+
+/*
+ * Close SLRU segment file.
+ */
+static void
+close_segment(SlruSegState *state)
+{
+	if (state->file)
+	{
+		fclose(state->file);
+		state->file = NULL;
+	}
+
+	if (state->fn)
+	{
+		pfree(state->fn);
+		state->fn = NULL;
+	}
+}
+
+/*
+ * Read next page from the old 32-bit offset segment file.
+ */
+static int
+read_old_segment_page(SlruSegState *state, void *buf, bool *empty)
+{
+	int		len;
+
+	/* Open next segment file, if needed. */
+	if (!state->fn)
+	{
+		if (!state->segno)
+			state->leading_gap = true;
+
+		open_segment(state);
+
+		/* Set position to the needed page. */
+		if (state->pageno > 0 &&
+			fseek(state->file, state->pageno * BLCKSZ, SEEK_SET))
+		{
+			close_segment(state);
+		}
+	}
+
+	if (state->file)
+	{
+		/* Segment file exists, read a page from it. */
+		state->leading_gap = false;
+
+		len = fread(buf, sizeof(char), BLCKSZ, state->file);
+
+		/* Are we done or was there an error? */
+		if (len <= 0)
+		{
+			if (ferror(state->file))
+				pg_fatal("error reading file \"%s\": %m", state->fn);
+
+			if (feof(state->file))
+			{
+				*empty = true;
+				len = -1;
+
+				close_segment(state);
+			}
+		}
+		else
+			*empty = false;
+	}
+	else if (!state->leading_gap)
+	{
+		/* We reached the last segment. */
+		len = -1;
+		*empty = true;
+	}
+	else
+	{
+		/* Skip the first few segments if they were frozen and removed. */
+		len = BLCKSZ;
+		*empty = true;
+	}
+
+	if (++state->pageno >= SLRU_PAGES_PER_SEGMENT)
+	{
+		/* Start a new segment. */
+		state->segno++;
+		state->pageno = 0;
+
+		close_segment(state);
+	}
+
+	return len;
+}
+
+/*
+ * Write next page to the new 64-bit offset segment file.
+ */
+static void
+write_new_segment_page(SlruSegState *state, void *buf)
+{
+	/*
+	 * Create a new segment file if we haven't done so yet.  Creation is
+	 * postponed until the first non-empty page is found.  This avoids
+	 * creating completely empty segments.
+	 */
+	if (!state->file)
+	{
+		create_segment(state);
+
+		/* Write zeroes to the previously skipped prefix. */
+		if (state->pageno > 0)
+		{
+			char		zerobuf[BLCKSZ] = {0};
+
+			for (int64 i = 0; i < state->pageno; i++)
+			{
+				if (fwrite(zerobuf, sizeof(char), BLCKSZ, state->file) != BLCKSZ)
+					pg_fatal("could not write file \"%s\": %m", state->fn);
+			}
+		}
+	}
+
+	/* Write page to the new segment (if it was created). */
+	if (state->file)
+	{
+		if (fwrite(buf, sizeof(char), BLCKSZ, state->file) != BLCKSZ)
+			pg_fatal("could not write file \"%s\": %m", state->fn);
+	}
+
+	/*
+	 * Did we reach the maximum page number?  Then close segment file
+	 * and create a new one on the next iteration.
+	 */
+	if (++state->pageno >= SLRU_PAGES_PER_SEGMENT)
+	{
+		/* Start a new segment. */
+		state->segno++;
+		state->pageno = 0;
+
+		close_segment(state);
+	}
+}
+
+typedef uint32 MultiXactOffsetOld;
+
+#define MaxMultiXactOffsetOld	((MultiXactOffsetOld) 0xFFFFFFFF)
+
+#define MULTIXACT_OFFSETS_PER_PAGE_OLD (BLCKSZ / sizeof(MultiXactOffsetOld))
+#define MULTIXACT_OFFSETS_PER_PAGE_NEW (BLCKSZ / sizeof(MultiXactOffset))
+
+/*
+ * Convert pg_multixact/offsets segments and return oldest multi offset.
+ */
+MultiXactOffset
+convert_multixact_offsets(void)
+{
+	SlruSegState		oldseg = {0},
+						newseg = {0};
+	MultiXactOffsetOld	oldbuf[MULTIXACT_OFFSETS_PER_PAGE_OLD] = {0};
+	MultiXactOffset		newbuf[MULTIXACT_OFFSETS_PER_PAGE_NEW] = {0},
+						oldest_offset = 0;
+	uint64				oldest_multi = old_cluster.controldata.chkpnt_oldstMulti,
+						next_multi = old_cluster.controldata.chkpnt_nxtmulti,
+						multi,
+						old_entry,
+						new_entry;
+	bool				oldest_offset_known = false;
+
+	oldseg.dir = psprintf("%s/pg_multixact/offsets", old_cluster.pgdata);
+	newseg.dir = psprintf("%s/pg_multixact/offsets", new_cluster.pgdata);
+
+	old_entry = oldest_multi % MULTIXACT_OFFSETS_PER_PAGE_OLD;
+	oldseg.pageno = oldest_multi / MULTIXACT_OFFSETS_PER_PAGE_OLD;
+	oldseg.segno = oldseg.pageno / SLRU_PAGES_PER_SEGMENT;
+	oldseg.pageno %= SLRU_PAGES_PER_SEGMENT;
+
+	new_entry = oldest_multi % MULTIXACT_OFFSETS_PER_PAGE_NEW;
+	newseg.pageno = oldest_multi / MULTIXACT_OFFSETS_PER_PAGE_NEW;
+	newseg.segno = newseg.pageno / SLRU_PAGES_PER_SEGMENT;
+	newseg.pageno %= SLRU_PAGES_PER_SEGMENT;
+
+	if (next_multi < oldest_multi)
+		next_multi += (uint64) 1 << 32;	/* wraparound */
+
+	/* Copy multi offsets reading only needed segment pages */
+	for (multi = oldest_multi; multi < next_multi; old_entry = 0)
+	{
+		int		oldlen;
+		bool	empty;
+
+		/* Handle possible segment wraparound */
+#define OLD_OFFSET_SEGNO_MAX	\
+	(MaxMultiXactId / MULTIXACT_OFFSETS_PER_PAGE_OLD / SLRU_PAGES_PER_SEGMENT)
+		if (oldseg.segno > OLD_OFFSET_SEGNO_MAX)
+		{
+			oldseg.segno = 0;
+			oldseg.pageno = 0;
+		}
+
+		oldlen = read_old_segment_page(&oldseg, oldbuf, &empty);
+		if (empty || oldlen != BLCKSZ)
+			pg_fatal("cannot read page %" PRIu64 " from file \"%s\": %m",
+					 oldseg.pageno, oldseg.fn);
+
+		/* Save oldest multi offset */
+		if (!oldest_offset_known)
+		{
+			oldest_offset = oldbuf[old_entry];
+			oldest_offset_known = true;
+		}
+
+		/* Skip wrapped-around invalid MultiXactIds */
+		if (multi == (uint64) 1 << 32)
+		{
+			Assert(oldseg.segno == 0);
+			Assert(oldseg.pageno == 1);
+			Assert(old_entry == 0);
+			Assert(new_entry == 0);
+
+			multi += FirstMultiXactId;
+			old_entry = FirstMultiXactId;
+			new_entry = FirstMultiXactId;
+		}
+
+		/* Copy entries to the new page */
+		for (; multi < next_multi && old_entry < MULTIXACT_OFFSETS_PER_PAGE_OLD;
+			 multi++, old_entry++)
+		{
+			MultiXactOffset offset = oldbuf[old_entry];
+
+			/* Handle possible offset wraparound (1 becomes 2^32) */
+			if (offset < oldest_offset)
+				offset += ((uint64) 1 << 32) - 1;
+
+			/* Subtract oldest_offset, so new offsets will start from 1 */
+			newbuf[new_entry++] = offset - oldest_offset + 1;
+
+			if (new_entry >= MULTIXACT_OFFSETS_PER_PAGE_NEW)
+			{
+				/* Handle possible segment wraparound */
+#define NEW_OFFSET_SEGNO_MAX	\
+	(MaxMultiXactId / MULTIXACT_OFFSETS_PER_PAGE_NEW / SLRU_PAGES_PER_SEGMENT)
+				if (newseg.segno > NEW_OFFSET_SEGNO_MAX)
+				{
+					newseg.segno = 0;
+					newseg.pageno = 0;
+				}
+
+				/* Write new page */
+				write_new_segment_page(&newseg, newbuf);
+				new_entry = 0;
+			}
+		}
+	}
+
+	/* Write the last incomplete page */
+	if (new_entry > 0 || oldest_multi == next_multi)
+	{
+		memset(&newbuf[new_entry], 0,
+			   sizeof(newbuf[0]) * (MULTIXACT_OFFSETS_PER_PAGE_NEW - new_entry));
+		write_new_segment_page(&newseg, newbuf);
+	}
+
+	/* Use next_offset as oldest_offset, if oldest_multi == next_multi */
+	if (!oldest_offset_known)
+	{
+		Assert(oldest_multi == next_multi);
+		oldest_offset = (MultiXactOffset) old_cluster.controldata.chkpnt_nxtmxoff;
+	}
+
+	/* Release resources */
+	close_segment(&oldseg);
+	close_segment(&newseg);
+
+	pfree(oldseg.dir);
+	pfree(newseg.dir);
+
+	return oldest_offset;
+}
+
+#define MXACT_MEMBERS_FLAG_BYTES			1
+
+#define MULTIXACT_MEMBERS_PER_GROUP			4
+#define MULTIXACT_MEMBERGROUP_SIZE			\
+	(MULTIXACT_MEMBERS_PER_GROUP * (sizeof(TransactionId) + MXACT_MEMBERS_FLAG_BYTES))
+#define MULTIXACT_MEMBERGROUPS_PER_PAGE		\
+	(BLCKSZ / MULTIXACT_MEMBERGROUP_SIZE)
+
+#define MULTIXACT_MEMBERS_PER_PAGE				\
+	(MULTIXACT_MEMBERS_PER_GROUP * MULTIXACT_MEMBERGROUPS_PER_PAGE)
+#define MULTIXACT_MEMBER_FLAG_BYTES_PER_GROUP	\
+	(MXACT_MEMBERS_FLAG_BYTES * MULTIXACT_MEMBERS_PER_GROUP)
+
+typedef struct MultiXactMembersCtx
+{
+	SlruSegState	seg;
+	char			buf[BLCKSZ];
+	int				group;
+	int				member;
+	char		   *flag;
+	TransactionId  *xid;
+} MultiXactMembersCtx;
+
+static void
+MultiXactMembersCtxInit(MultiXactMembersCtx *ctx)
+{
+	ctx->seg.dir = psprintf("%s/pg_multixact/members", new_cluster.pgdata);
+
+	ctx->group = 0;
+	ctx->member = 1;		/* skip invalid zero offset */
+
+	ctx->flag = (char *) ctx->buf + ctx->group * MULTIXACT_MEMBERGROUP_SIZE;
+	ctx->xid = (TransactionId *)(ctx->flag + MXACT_MEMBERS_FLAG_BYTES * MULTIXACT_MEMBERS_PER_GROUP);
+
+	ctx->flag += ctx->member;
+	ctx->xid += ctx->member;
+}
+
+static void
+MultiXactMembersCtxAdd(MultiXactMembersCtx *ctx, char flag, TransactionId xid)
+{
+	/* Copy member's xid and flags to the new page */
+	*ctx->flag++ = flag;
+	*ctx->xid++ = xid;
+
+	if (++ctx->member < MULTIXACT_MEMBERS_PER_GROUP)
+		return;
+
+	/* Start next member group */
+	ctx->member = 0;
+
+	if (++ctx->group >= MULTIXACT_MEMBERGROUPS_PER_PAGE)
+	{
+		/* Write current page and start new */
+		write_new_segment_page(&ctx->seg, ctx->buf);
+
+		ctx->group = 0;
+		memset(ctx->buf, 0, BLCKSZ);
+	}
+
+	ctx->flag = (char *) ctx->buf + ctx->group * MULTIXACT_MEMBERGROUP_SIZE;
+	ctx->xid = (TransactionId *)(ctx->flag + MXACT_MEMBERS_FLAG_BYTES * MULTIXACT_MEMBERS_PER_GROUP);
+}
+
+static void
+MultiXactMembersCtxFinit(MultiXactMembersCtx *ctx)
+{
+	if (ctx->flag > (char *) ctx->buf)
+		write_new_segment_page(&ctx->seg, ctx->buf);
+
+	close_segment(&ctx->seg);
+
+	pfree(ctx->seg.dir);
+}
+
+/*
+ * Convert pg_multixact/members segments, offsets will start from 1.
+ *
+ */
+void
+convert_multixact_members(MultiXactOffset oldest_offset)
+{
+	MultiXactOffset			next_offset,
+							offset;
+	SlruSegState			oldseg = {0};
+	char					oldbuf[BLCKSZ] = {0};
+	int						oldidx;
+	MultiXactMembersCtx		newctx = {0};
+
+	oldseg.dir = psprintf("%s/pg_multixact/members", old_cluster.pgdata);
+
+	next_offset = (MultiXactOffset) old_cluster.controldata.chkpnt_nxtmxoff;
+	if (next_offset < oldest_offset)
+		next_offset += ((uint64) 1 << 32) - 1;
+
+	/* Initialize the old starting position */
+	oldseg.pageno = oldest_offset / MULTIXACT_MEMBERS_PER_PAGE;
+	oldseg.segno = oldseg.pageno / SLRU_PAGES_PER_SEGMENT;
+	oldseg.pageno %= SLRU_PAGES_PER_SEGMENT;
+
+	/* Initialize new starting position */
+	MultiXactMembersCtxInit(&newctx);
+
+	/* Iterate through the original directory */
+	oldidx = oldest_offset % MULTIXACT_MEMBERS_PER_PAGE;
+	for (offset = oldest_offset; offset < next_offset;)
+	{
+		bool	empty;
+		int		oldlen;
+		int		ngroups;
+		int		oldgroup;
+		int		oldmember;
+
+		oldlen = read_old_segment_page(&oldseg, oldbuf, &empty);
+		if (empty || oldlen != BLCKSZ)
+			pg_fatal("cannot read page %" PRIu64 " from file \"%s\": %m",
+					 oldseg.pageno, oldseg.fn);
+
+		/* Iterate through the old member groups */
+		ngroups = oldlen / MULTIXACT_MEMBERGROUP_SIZE;
+		oldmember = oldidx % MULTIXACT_MEMBERS_PER_GROUP;
+		oldgroup = oldidx / MULTIXACT_MEMBERS_PER_GROUP;
+		while (oldgroup < ngroups && offset < next_offset)
+		{
+			char		   *oldflag;
+			TransactionId  *oldxid;
+			int				i;
+
+			oldflag = (char *) oldbuf + oldgroup * MULTIXACT_MEMBERGROUP_SIZE;
+			oldxid = (TransactionId *)(oldflag + MULTIXACT_MEMBER_FLAG_BYTES_PER_GROUP);
+
+			oldxid += oldmember;
+			oldflag += oldmember;
+
+			/* Iterate through the old members */
+			for (i = oldmember;
+				 i < MULTIXACT_MEMBERS_PER_GROUP && offset < next_offset;
+				 i++)
+			{
+				MultiXactMembersCtxAdd(&newctx, *oldflag++, *oldxid++);
+
+				if (++offset == (uint64) 1 << 32)
+				{
+					Assert(i == MaxMultiXactOffsetOld % MULTIXACT_MEMBERS_PER_GROUP);
+					goto wraparound;
+				}
+			}
+
+			oldgroup++;
+			oldmember = 0;
+		}
+
+		oldidx = 0;
+
+		continue;
+
+wraparound:
+#define SEGNO_MAX	(MaxMultiXactOffsetOld / MULTIXACT_MEMBERS_PER_PAGE / SLRU_PAGES_PER_SEGMENT)
+#define PAGENO_MAX	(MaxMultiXactOffsetOld / MULTIXACT_MEMBERS_PER_PAGE % SLRU_PAGES_PER_SEGMENT)
+		Assert((oldseg.segno == SEGNO_MAX && oldseg.pageno == PAGENO_MAX + 1) ||
+			   (oldseg.segno == SEGNO_MAX + 1 && oldseg.pageno == 0));
+
+		/* Switch to segment 0000 */
+		close_segment(&oldseg);
+		oldseg.segno = 0;
+		oldseg.pageno = 0;
+
+		/* skip invalid zero multi offset */
+		oldidx = 1;
+	}
+
+	MultiXactMembersCtxFinit(&newctx);
+
+	/* Release resources */
+	close_segment(&oldseg);
+
+	pfree(oldseg.dir);
+}
-- 
2.50.1
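
The offset rebasing done in convert_multixact_offsets() above can be illustrated in isolation. The following is only a sketch under assumptions: rebase_old_offset() is a hypothetical helper name that does not exist in the patch, and it assumes (as the patch comment "1 becomes 2^32" suggests) that old 32-bit offsets skip the value 0 when they wrap, so one full cycle covers 2^32 - 1 values.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of the patch: map one old 32-bit member
 * offset to its new 64-bit value, so that the oldest offset becomes 1.
 */
static uint64_t
rebase_old_offset(uint32_t old_offset, uint32_t oldest_offset)
{
    uint64_t    offset = old_offset;

    /*
     * An old offset numerically below the oldest one must have wrapped.
     * Offset 0 is assumed to be skipped on wraparound, hence 2^32 - 1.
     */
    if (offset < oldest_offset)
        offset += (UINT64_C(1) << 32) - 1;

    /* Subtract the oldest offset so that new offsets start from 1. */
    return offset - oldest_offset + 1;
}
```

With an oldest offset of 0xFFFFFFF0, the last pre-wraparound offset 0xFFFFFFFF maps to 16 and the first post-wraparound offset 1 maps to 17, so the rebased sequence stays contiguous across the wrap.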

Attachment: v17-0004-Get-rid-of-MultiXactMemberFreezeThreshold-call.patch (application/octet-stream)
From 328d9fc96597144972ee3ac424ba2a1c8b290b79 Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Wed, 23 Oct 2024 18:23:39 +0300
Subject: [PATCH v17 4/7] Get rid of MultiXactMemberFreezeThreshold call.

Since MaxMultiXactOffset is UINT64_MAX now, MULTIXACT_MEMBER_SAFE_THRESHOLD and
MULTIXACT_MEMBER_DANGER_THRESHOLD are no longer meaningful. Thus,
MultiXactMemberFreezeThreshold is no longer needed either.

Instead, switch to a MULTIXACT_MEMBER_AUTOVAC_THRESHOLD (0xFFFFFFFF, i.e.
2^32 - 1) members threshold. It is used to determine whether we need to force
autovacuum.

Author: Maxim Orlov <orlovmg@gmail.com>
---
 src/backend/access/transam/multixact.c | 22 ++++++++++++----------
 src/backend/commands/vacuum.c          |  2 +-
 src/backend/postmaster/autovacuum.c    |  4 ++--
 src/include/access/multixact.h         |  1 -
 4 files changed, 15 insertions(+), 14 deletions(-)

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 0e9ba324ca..48e2f8a4dd 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -199,10 +199,14 @@ MXOffsetToMemberOffset(MultiXactOffset offset)
 		member_in_group * sizeof(TransactionId);
 }
 
-/* Multixact members wraparound thresholds. */
-#define MULTIXACT_MEMBER_SAFE_THRESHOLD		(MaxMultiXactOffset / 2)
-#define MULTIXACT_MEMBER_DANGER_THRESHOLD	\
-	(MaxMultiXactOffset - MaxMultiXactOffset / 4)
+/*
+ * Multixact members warning threshold.
+ *
+ * If the difference between nextOffset and oldestOffset exceeds this value, we
+ * trigger autovacuum in order to release disk space and, if possible, reduce
+ * table bloat.
+ */
+#define MULTIXACT_MEMBER_AUTOVAC_THRESHOLD		UINT64CONST(0xFFFFFFFF)
 
 static inline MultiXactId
 PreviousMultiXactId(MultiXactId multi)
@@ -2548,15 +2552,13 @@ GetOldestMultiXactId(void)
 }
 
 /*
- * Determine how aggressively we need to vacuum in order to prevent member
- * wraparound.
+ * Determine whether we need to vacuum to reclaim multixact member space.
  *
  * To do so determine what's the oldest member offset and install the limit
  * info in MultiXactState, where it can be used to prevent overrun of old data
  * in the members SLRU area.
  *
- * The return value is true if emergency autovacuum is required and false
- * otherwise.
+ * The return value is true if autovacuum is required and false otherwise.
  */
 static bool
 SetOffsetVacuumLimit(bool is_startup)
@@ -2644,10 +2646,10 @@ SetOffsetVacuumLimit(bool is_startup)
 	LWLockRelease(MultiXactGenLock);
 
 	/*
-	 * Do we need an emergency autovacuum?	If we're not sure, assume yes.
+	 * Do we need autovacuum?	If we're not sure, assume yes.
 	 */
 	return !oldestOffsetKnown ||
-		(nextOffset - oldestOffset > MULTIXACT_MEMBER_SAFE_THRESHOLD);
+		(nextOffset - oldestOffset > MULTIXACT_MEMBER_AUTOVAC_THRESHOLD);
 }
 
 /*
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 733ef40ae7..8f5092670b 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -1153,7 +1153,7 @@ vacuum_get_cutoffs(Relation rel, const VacuumParams params,
 	 * normally autovacuum_multixact_freeze_max_age, but may be less if we are
 	 * short of multixact member space.
 	 */
-	effective_multixact_freeze_max_age = MultiXactMemberFreezeThreshold();
+	effective_multixact_freeze_max_age = autovacuum_multixact_freeze_max_age;
 
 	/*
 	 * Almost ready to set freeze output parameters; check if OldestXmin or
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index dce4c8c45b..9bf03734c2 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -1150,7 +1150,7 @@ do_start_worker(void)
 
 	/* Also determine the oldest datminmxid we will consider. */
 	recentMulti = ReadNextMultiXactId();
-	multiForceLimit = recentMulti - MultiXactMemberFreezeThreshold();
+	multiForceLimit = recentMulti - autovacuum_multixact_freeze_max_age;
 	if (multiForceLimit < FirstMultiXactId)
 		multiForceLimit -= FirstMultiXactId;
 
@@ -1938,7 +1938,7 @@ do_autovacuum(void)
 	 * normally autovacuum_multixact_freeze_max_age, but may be less if we are
 	 * short of multixact member space.
 	 */
-	effective_multixact_freeze_max_age = MultiXactMemberFreezeThreshold();
+	effective_multixact_freeze_max_age = autovacuum_multixact_freeze_max_age;
 
 	/*
 	 * Find the pg_database entry and select the default freeze ages. We use
diff --git a/src/include/access/multixact.h b/src/include/access/multixact.h
index 792b5f140f..16a0772308 100644
--- a/src/include/access/multixact.h
+++ b/src/include/access/multixact.h
@@ -147,7 +147,6 @@ extern void MultiXactSetNextMXact(MultiXactId nextMulti,
 extern void MultiXactAdvanceNextMXact(MultiXactId minMulti,
 									  MultiXactOffset minMultiOffset);
 extern void MultiXactAdvanceOldest(MultiXactId oldestMulti, Oid oldestMultiDB);
-extern int	MultiXactMemberFreezeThreshold(void);
 
 extern void multixact_twophase_recover(FullTransactionId fxid, uint16 info,
 									   void *recdata, uint32 len);
-- 
2.50.1
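
The return condition that this patch leaves in SetOffsetVacuumLimit() can be read as a small predicate. The sketch below only mirrors that expression for illustration; sketch_need_autovacuum() and SKETCH_MEMBER_AUTOVAC_THRESHOLD are hypothetical names, with the threshold value copied from the MULTIXACT_MEMBER_AUTOVAC_THRESHOLD define in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Same value as MULTIXACT_MEMBER_AUTOVAC_THRESHOLD in the patch. */
#define SKETCH_MEMBER_AUTOVAC_THRESHOLD UINT64_C(0xFFFFFFFF)

/*
 * Hypothetical stand-alone version of the check: request autovacuum when
 * the oldest offset is unknown, or when more than the threshold number of
 * members lies between the oldest and the next offset.
 */
static bool
sketch_need_autovacuum(bool oldest_offset_known,
                       uint64_t next_offset,
                       uint64_t oldest_offset)
{
    return !oldest_offset_known ||
        (next_offset - oldest_offset > SKETCH_MEMBER_AUTOVAC_THRESHOLD);
}
```

Note that the comparison is strict, so a distance of exactly 0xFFFFFFFF members does not yet trigger autovacuum, while a distance of 2^32 does.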

Attachment: v17-0002-Use-64-bit-multixact-offsets.patch (application/octet-stream)
From f4c9a1c7dd3e70fbfa2a7a986bc5e405b18022e2 Mon Sep 17 00:00:00 2001
From: Maxim Orlov <m.orlov@postgrespro.ru>
Date: Wed, 6 Mar 2024 11:11:33 +0300
Subject: [PATCH v17 2/7] Use 64-bit multixact offsets.

Author: Maxim Orlov <orlovmg@gmail.com>
---
 src/backend/access/transam/multixact.c | 232 +------------------------
 src/bin/pg_resetwal/pg_resetwal.c      |   2 +-
 src/bin/pg_resetwal/t/001_basic.pl     |   2 +-
 src/include/access/multixact.h         |   2 +-
 src/include/c.h                        |   2 +-
 5 files changed, 7 insertions(+), 233 deletions(-)

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index dc71c8f44e..0e9ba324ca 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -91,14 +91,6 @@
 /*
  * Defines for MultiXactOffset page sizes.  A page is the same BLCKSZ as is
  * used everywhere else in Postgres.
- *
- * Note: because MultiXactOffsets are 32 bits and wrap around at 0xFFFFFFFF,
- * MultiXact page numbering also wraps around at
- * 0xFFFFFFFF/MULTIXACT_OFFSETS_PER_PAGE, and segment numbering at
- * 0xFFFFFFFF/MULTIXACT_OFFSETS_PER_PAGE/SLRU_PAGES_PER_SEGMENT.  We need
- * take no explicit notice of that fact in this module, except when comparing
- * segment and page numbers in TruncateMultiXact (see
- * MultiXactOffsetPagePrecedes).
  */
 
 /* We need four bytes per offset */
@@ -267,9 +259,6 @@ typedef struct MultiXactStateData
 	MultiXactId multiStopLimit;
 	MultiXactId multiWrapLimit;
 
-	/* support for members anti-wraparound measures */
-	MultiXactOffset offsetStopLimit;	/* known if oldestOffsetKnown */
-
 	/*
 	 * This is used to sleep until a multixact offset is written when we want
 	 * to create the next one.
@@ -400,8 +389,6 @@ static bool MultiXactOffsetPrecedes(MultiXactOffset offset1,
 									MultiXactOffset offset2);
 static void ExtendMultiXactOffset(MultiXactId multi);
 static void ExtendMultiXactMember(MultiXactOffset offset, int nmembers);
-static bool MultiXactOffsetWouldWrap(MultiXactOffset boundary,
-									 MultiXactOffset start, uint32 distance);
 static bool SetOffsetVacuumLimit(bool is_startup);
 static bool find_multixact_start(MultiXactId multi, MultiXactOffset *result);
 static void WriteMTruncateXlogRec(Oid oldestMultiDB,
@@ -1154,78 +1141,6 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 	else
 		*offset = nextOffset;
 
-	/*----------
-	 * Protect against overrun of the members space as well, with the
-	 * following rules:
-	 *
-	 * If we're past offsetStopLimit, refuse to generate more multis.
-	 * If we're close to offsetStopLimit, emit a warning.
-	 *
-	 * Arbitrarily, we start emitting warnings when we're 20 segments or less
-	 * from offsetStopLimit.
-	 *
-	 * Note we haven't updated the shared state yet, so if we fail at this
-	 * point, the multixact ID we grabbed can still be used by the next guy.
-	 *
-	 * Note that there is no point in forcing autovacuum runs here: the
-	 * multixact freeze settings would have to be reduced for that to have any
-	 * effect.
-	 *----------
-	 */
-#define OFFSET_WARN_SEGMENTS	20
-	if (MultiXactState->oldestOffsetKnown &&
-		MultiXactOffsetWouldWrap(MultiXactState->offsetStopLimit, nextOffset,
-								 nmembers))
-	{
-		/* see comment in the corresponding offsets wraparound case */
-		SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER);
-
-		ereport(ERROR,
-				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg("multixact \"members\" limit exceeded"),
-				 errdetail_plural("This command would create a multixact with %u members, but the remaining space is only enough for %u member.",
-								  "This command would create a multixact with %u members, but the remaining space is only enough for %u members.",
-								  MultiXactState->offsetStopLimit - nextOffset - 1,
-								  nmembers,
-								  MultiXactState->offsetStopLimit - nextOffset - 1),
-				 errhint("Execute a database-wide VACUUM in database with OID %u with reduced \"vacuum_multixact_freeze_min_age\" and \"vacuum_multixact_freeze_table_age\" settings.",
-						 MultiXactState->oldestMultiXactDB)));
-	}
-
-	/*
-	 * Check whether we should kick autovacuum into action, to prevent members
-	 * wraparound. NB we use a much larger window to trigger autovacuum than
-	 * just the warning limit. The warning is just a measure of last resort -
-	 * this is in line with GetNewTransactionId's behaviour.
-	 */
-	if (!MultiXactState->oldestOffsetKnown ||
-		(MultiXactState->nextOffset - MultiXactState->oldestOffset
-		 > MULTIXACT_MEMBER_SAFE_THRESHOLD))
-	{
-		/*
-		 * To avoid swamping the postmaster with signals, we issue the autovac
-		 * request only when crossing a segment boundary. With default
-		 * compilation settings that's roughly after 50k members.  This still
-		 * gives plenty of chances before we get into real trouble.
-		 */
-		if ((MXOffsetToMemberPage(nextOffset) / SLRU_PAGES_PER_SEGMENT) !=
-			(MXOffsetToMemberPage(nextOffset + nmembers) / SLRU_PAGES_PER_SEGMENT))
-			SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER);
-	}
-
-	if (MultiXactState->oldestOffsetKnown &&
-		MultiXactOffsetWouldWrap(MultiXactState->offsetStopLimit,
-								 nextOffset,
-								 nmembers + MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT * OFFSET_WARN_SEGMENTS))
-		ereport(WARNING,
-				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg_plural("database with OID %u must be vacuumed before %d more multixact member is used",
-							   "database with OID %u must be vacuumed before %d more multixact members are used",
-							   MultiXactState->offsetStopLimit - nextOffset + nmembers,
-							   MultiXactState->oldestMultiXactDB,
-							   MultiXactState->offsetStopLimit - nextOffset + nmembers),
-				 errhint("Execute a database-wide VACUUM in that database with reduced \"vacuum_multixact_freeze_min_age\" and \"vacuum_multixact_freeze_table_age\" settings.")));
-
 	ExtendMultiXactMember(nextOffset, nmembers);
 
 	/*
@@ -2653,8 +2568,6 @@ SetOffsetVacuumLimit(bool is_startup)
 	MultiXactOffset nextOffset;
 	bool		oldestOffsetKnown = false;
 	bool		prevOldestOffsetKnown;
-	MultiXactOffset offsetStopLimit = 0;
-	MultiXactOffset prevOffsetStopLimit;
 
 	/*
 	 * NB: Have to prevent concurrent truncation, we might otherwise try to
@@ -2669,7 +2582,6 @@ SetOffsetVacuumLimit(bool is_startup)
 	nextOffset = MultiXactState->nextOffset;
 	prevOldestOffsetKnown = MultiXactState->oldestOffsetKnown;
 	prevOldestOffset = MultiXactState->oldestOffset;
-	prevOffsetStopLimit = MultiXactState->offsetStopLimit;
 	Assert(MultiXactState->finishedStartup);
 	LWLockRelease(MultiXactGenLock);
 
@@ -2700,11 +2612,7 @@ SetOffsetVacuumLimit(bool is_startup)
 		oldestOffsetKnown =
 			find_multixact_start(oldestMultiXactId, &oldestOffset);
 
-		if (oldestOffsetKnown)
-			ereport(DEBUG1,
-					(errmsg_internal("oldest MultiXactId member is at offset %u",
-									 oldestOffset)));
-		else
+		if (!oldestOffsetKnown)
 			ereport(LOG,
 					(errmsg("MultiXact member wraparound protections are disabled because oldest checkpointed MultiXact %u does not exist on disk",
 							oldestMultiXactId)));
@@ -2717,24 +2625,7 @@ SetOffsetVacuumLimit(bool is_startup)
 	 * overrun of old data in the members SLRU area. We can only do so if the
 	 * oldest offset is known though.
 	 */
-	if (oldestOffsetKnown)
-	{
-		/* move back to start of the corresponding segment */
-		offsetStopLimit = oldestOffset - (oldestOffset %
-										  (MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT));
-
-		/* always leave one segment before the wraparound point */
-		offsetStopLimit -= (MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT);
-
-		if (!prevOldestOffsetKnown && !is_startup)
-			ereport(LOG,
-					(errmsg("MultiXact member wraparound protections are now enabled")));
-
-		ereport(DEBUG1,
-				(errmsg_internal("MultiXact member stop limit is now %u based on MultiXact %u",
-								 offsetStopLimit, oldestMultiXactId)));
-	}
-	else if (prevOldestOffsetKnown)
+	if (prevOldestOffsetKnown)
 	{
 		/*
 		 * If we failed to get the oldest offset this time, but we have a
@@ -2744,14 +2635,12 @@ SetOffsetVacuumLimit(bool is_startup)
 		 */
 		oldestOffset = prevOldestOffset;
 		oldestOffsetKnown = true;
-		offsetStopLimit = prevOffsetStopLimit;
 	}
 
 	/* Install the computed values */
 	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
 	MultiXactState->oldestOffset = oldestOffset;
 	MultiXactState->oldestOffsetKnown = oldestOffsetKnown;
-	MultiXactState->offsetStopLimit = offsetStopLimit;
 	LWLockRelease(MultiXactGenLock);
 
 	/*
@@ -2761,54 +2650,6 @@ SetOffsetVacuumLimit(bool is_startup)
 		(nextOffset - oldestOffset > MULTIXACT_MEMBER_SAFE_THRESHOLD);
 }
 
-/*
- * Return whether adding "distance" to "start" would move past "boundary".
- *
- * We use this to determine whether the addition is "wrapping around" the
- * boundary point, hence the name.  The reason we don't want to use the regular
- * 2^31-modulo arithmetic here is that we want to be able to use the whole of
- * the 2^32-1 space here, allowing for more multixacts than would fit
- * otherwise.
- */
-static bool
-MultiXactOffsetWouldWrap(MultiXactOffset boundary, MultiXactOffset start,
-						 uint32 distance)
-{
-	MultiXactOffset finish;
-
-	/*
-	 * Note that offset number 0 is not used (see GetMultiXactIdMembers), so
-	 * if the addition wraps around the UINT_MAX boundary, skip that value.
-	 */
-	finish = start + distance;
-	if (finish < start)
-		finish++;
-
-	/*-----------------------------------------------------------------------
-	 * When the boundary is numerically greater than the starting point, any
-	 * value numerically between the two is not wrapped:
-	 *
-	 *	<----S----B---->
-	 *	[---)			 = F wrapped past B (and UINT_MAX)
-	 *		 [---)		 = F not wrapped
-	 *			  [----] = F wrapped past B
-	 *
-	 * When the boundary is numerically less than the starting point (i.e. the
-	 * UINT_MAX wraparound occurs somewhere in between) then all values in
-	 * between are wrapped:
-	 *
-	 *	<----B----S---->
-	 *	[---)			 = F not wrapped past B (but wrapped past UINT_MAX)
-	 *		 [---)		 = F wrapped past B (and UINT_MAX)
-	 *			  [----] = F not wrapped
-	 *-----------------------------------------------------------------------
-	 */
-	if (start < boundary)
-		return finish >= boundary || finish < start;
-	else
-		return finish >= boundary && finish < start;
-}
-
 /*
  * Find the starting offset of the given MultiXactId.
  *
@@ -2893,73 +2734,6 @@ GetMultiXactInfo(uint32 *multixacts, MultiXactOffset *members,
 	return true;
 }
 
-/*
- * Multixact members can be removed once the multixacts that refer to them
- * are older than every datminmxid.  autovacuum_multixact_freeze_max_age and
- * vacuum_multixact_freeze_table_age work together to make sure we never have
- * too many multixacts; we hope that, at least under normal circumstances,
- * this will also be sufficient to keep us from using too many offsets.
- * However, if the average multixact has many members, we might exhaust the
- * members space while still using few enough members that these limits fail
- * to trigger relminmxid advancement by VACUUM.  At that point, we'd have no
- * choice but to start failing multixact-creating operations with an error.
- *
- * To prevent that, if more than a threshold portion of the members space is
- * used, we effectively reduce autovacuum_multixact_freeze_max_age and
- * to a value just less than the number of multixacts in use.  We hope that
- * this will quickly trigger autovacuuming on the table or tables with the
- * oldest relminmxid, thus allowing datminmxid values to advance and removing
- * some members.
- *
- * As the fraction of the member space currently in use grows, we become
- * more aggressive in clamping this value.  That not only causes autovacuum
- * to ramp up, but also makes any manual vacuums the user issues more
- * aggressive.  This happens because vacuum_get_cutoffs() will clamp the
- * freeze table and the minimum freeze age cutoffs based on the effective
- * autovacuum_multixact_freeze_max_age this function returns.  In the worst
- * case, we'll claim the freeze_max_age to zero, and every vacuum of any
- * table will freeze every multixact.
- */
-int
-MultiXactMemberFreezeThreshold(void)
-{
-	MultiXactOffset members;
-	uint32		multixacts;
-	uint32		victim_multixacts;
-	double		fraction;
-	int			result;
-	MultiXactId oldestMultiXactId;
-	MultiXactOffset oldestOffset;
-
-	/* If we can't determine member space utilization, assume the worst. */
-	if (!GetMultiXactInfo(&multixacts, &members, &oldestMultiXactId, &oldestOffset))
-		return 0;
-
-	/* If member space utilization is low, no special action is required. */
-	if (members <= MULTIXACT_MEMBER_SAFE_THRESHOLD)
-		return autovacuum_multixact_freeze_max_age;
-
-	/*
-	 * Compute a target for relminmxid advancement.  The number of multixacts
-	 * we try to eliminate from the system is based on how far we are past
-	 * MULTIXACT_MEMBER_SAFE_THRESHOLD.
-	 */
-	fraction = (double) (members - MULTIXACT_MEMBER_SAFE_THRESHOLD) /
-		(MULTIXACT_MEMBER_DANGER_THRESHOLD - MULTIXACT_MEMBER_SAFE_THRESHOLD);
-	victim_multixacts = multixacts * fraction;
-
-	/* fraction could be > 1.0, but lowest possible freeze age is zero */
-	if (victim_multixacts > multixacts)
-		return 0;
-	result = multixacts - victim_multixacts;
-
-	/*
-	 * Clamp to autovacuum_multixact_freeze_max_age, so that we never make
-	 * autovacuum less aggressive than it would otherwise be.
-	 */
-	return Min(result, autovacuum_multixact_freeze_max_age);
-}
-
 typedef struct mxtruncinfo
 {
 	int64		earliestExistingPage;
@@ -3290,7 +3064,7 @@ MultiXactIdPrecedesOrEquals(MultiXactId multi1, MultiXactId multi2)
 static bool
 MultiXactOffsetPrecedes(MultiXactOffset offset1, MultiXactOffset offset2)
 {
-	int32		diff = (int32) (offset1 - offset2);
+	int64		diff = (int64) (offset1 - offset2);
 
 	return (diff < 0);
 }
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index f9f3b5da96..646ab1b80d 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -266,7 +266,7 @@ main(int argc, char *argv[])
 
 			case 'O':
 				errno = 0;
-				set_mxoff = strtoul(optarg, &endptr, 0);
+				set_mxoff = strtou64(optarg, &endptr, 0);
 				if (endptr == optarg || *endptr != '\0' || errno != 0)
 				{
 					pg_log_error("invalid argument for option %s", "-O");
diff --git a/src/bin/pg_resetwal/t/001_basic.pl b/src/bin/pg_resetwal/t/001_basic.pl
index d6bbbd0ced..cc89e0764a 100644
--- a/src/bin/pg_resetwal/t/001_basic.pl
+++ b/src/bin/pg_resetwal/t/001_basic.pl
@@ -213,7 +213,7 @@ push @cmd,
   sprintf("%d,%d", hex($files[0]) == 0 ? 3 : hex($files[0]), hex($files[-1]));
 
 @files = get_slru_files('pg_multixact/offsets');
-$mult = 32 * $blcksz / 4;
+$mult = 32 * $blcksz / 8;
 # --multixact-ids argument is "new,old"
 push @cmd,
   '--multixact-ids' => sprintf("%d,%d",
diff --git a/src/include/access/multixact.h b/src/include/access/multixact.h
index 82e4bb90dd..792b5f140f 100644
--- a/src/include/access/multixact.h
+++ b/src/include/access/multixact.h
@@ -28,7 +28,7 @@
 
 #define MultiXactIdIsValid(multi) ((multi) != InvalidMultiXactId)
 
-#define MaxMultiXactOffset	((MultiXactOffset) 0xFFFFFFFF)
+#define MaxMultiXactOffset	UINT64CONST(0xFFFFFFFFFFFFFFFF)
 
 /*
  * Possible multixact lock modes ("status").  The first four modes are for
diff --git a/src/include/c.h b/src/include/c.h
index 39022f8a9d..de9ac13be7 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -641,7 +641,7 @@ typedef uint32 SubTransactionId;
 /* MultiXactId must be equivalent to TransactionId, to fit in t_xmax */
 typedef TransactionId MultiXactId;
 
-typedef uint32 MultiXactOffset;
+typedef uint64 MultiXactOffset;
 
 typedef uint32 CommandId;
 
-- 
2.50.1

v17-0001-Use-64-bit-format-output-for-multixact-offsets.patch (application/octet-stream)
From 039b54d23fe5bb3e6c933e0603fe844d39b7b4f5 Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Wed, 7 Aug 2024 16:35:22 +0300
Subject: [PATCH v17 1/7] Use 64-bit format output for multixact offsets

Author: Maxim Orlov <orlovmg@gmail.com>
---
 src/backend/access/rmgrdesc/mxactdesc.c   |  4 ++--
 src/backend/access/rmgrdesc/xlogdesc.c    |  2 +-
 src/backend/access/transam/multixact.c    | 13 +++++++------
 src/backend/access/transam/xlogrecovery.c |  2 +-
 src/bin/pg_controldata/pg_controldata.c   |  2 +-
 src/bin/pg_resetwal/pg_resetwal.c         |  4 ++--
 6 files changed, 14 insertions(+), 13 deletions(-)

diff --git a/src/backend/access/rmgrdesc/mxactdesc.c b/src/backend/access/rmgrdesc/mxactdesc.c
index 3ca0582db3..052dd0a4ce 100644
--- a/src/backend/access/rmgrdesc/mxactdesc.c
+++ b/src/backend/access/rmgrdesc/mxactdesc.c
@@ -65,7 +65,7 @@ multixact_desc(StringInfo buf, XLogReaderState *record)
 		xl_multixact_create *xlrec = (xl_multixact_create *) rec;
 		int			i;
 
-		appendStringInfo(buf, "%u offset %u nmembers %d: ", xlrec->mid,
+		appendStringInfo(buf, "%u offset %" PRIu64 " nmembers %d: ", xlrec->mid,
 						 xlrec->moff, xlrec->nmembers);
 		for (i = 0; i < xlrec->nmembers; i++)
 			out_member(buf, &xlrec->members[i]);
@@ -74,7 +74,7 @@ multixact_desc(StringInfo buf, XLogReaderState *record)
 	{
 		xl_multixact_truncate *xlrec = (xl_multixact_truncate *) rec;
 
-		appendStringInfo(buf, "offsets [%u, %u), members [%u, %u)",
+		appendStringInfo(buf, "offsets [%u, %u), members [%" PRIu64 ", %" PRIu64 ")",
 						 xlrec->startTruncOff, xlrec->endTruncOff,
 						 xlrec->startTruncMemb, xlrec->endTruncMemb);
 	}
diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c
index cd6c2a2f65..441034f592 100644
--- a/src/backend/access/rmgrdesc/xlogdesc.c
+++ b/src/backend/access/rmgrdesc/xlogdesc.c
@@ -66,7 +66,7 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 		CheckPoint *checkpoint = (CheckPoint *) rec;
 
 		appendStringInfo(buf, "redo %X/%08X; "
-						 "tli %u; prev tli %u; fpw %s; wal_level %s; xid %u:%u; oid %u; multi %u; offset %u; "
+						 "tli %u; prev tli %u; fpw %s; wal_level %s; xid %u:%u; oid %u; multi %u; offset %" PRIu64 "; "
 						 "oldest xid %u in DB %u; oldest multi %u in DB %u; "
 						 "oldest/newest commit timestamp xid: %u/%u; "
 						 "oldest running xid %u; %s",
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 8bf59d369f..dc71c8f44e 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1254,7 +1254,8 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 
 	LWLockRelease(MultiXactGenLock);
 
-	debug_elog4(DEBUG2, "GetNew: returning %u offset %u", result, *offset);
+	debug_elog4(DEBUG2, "GetNew: returning %u offset %" PRIu64, result,
+				*offset);
 	return result;
 }
 
@@ -2222,7 +2223,7 @@ MultiXactGetCheckptMulti(bool is_shutdown,
 	LWLockRelease(MultiXactGenLock);
 
 	debug_elog6(DEBUG2,
-				"MultiXact: checkpoint is nextMulti %u, nextOffset %u, oldestMulti %u in DB %u",
+				"MultiXact: checkpoint is nextMulti %u, nextOffset %" PRIu64 ", oldestMulti %u in DB %u",
 				*nextMulti, *nextMultiOffset, *oldestMulti, *oldestMultiDB);
 }
 
@@ -2257,7 +2258,7 @@ void
 MultiXactSetNextMXact(MultiXactId nextMulti,
 					  MultiXactOffset nextMultiOffset)
 {
-	debug_elog4(DEBUG2, "MultiXact: setting next multi to %u offset %u",
+	debug_elog4(DEBUG2, "MultiXact: setting next multi to %u offset %" PRIu64,
 				nextMulti, nextMultiOffset);
 	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
 	MultiXactState->nextMXact = nextMulti;
@@ -2448,7 +2449,7 @@ MultiXactAdvanceNextMXact(MultiXactId minMulti,
 	}
 	if (MultiXactOffsetPrecedes(MultiXactState->nextOffset, minMultiOffset))
 	{
-		debug_elog3(DEBUG2, "MultiXact: setting next offset to %u",
+		debug_elog3(DEBUG2, "MultiXact: setting next offset to %" PRIu64,
 					minMultiOffset);
 		MultiXactState->nextOffset = minMultiOffset;
 	}
@@ -3158,7 +3159,7 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 
 	elog(DEBUG1, "performing multixact truncation: "
 		 "offsets [%u, %u), offsets segments [%" PRIx64 ", %" PRIx64 "), "
-		 "members [%u, %u), members segments [%" PRIx64 ", %" PRIx64 ")",
+		 "members [%" PRIu64 ", %" PRIu64 "), members segments [%" PRIx64 ", %" PRIx64 ")",
 		 oldestMulti, newOldestMulti,
 		 MultiXactIdToOffsetSegment(oldestMulti),
 		 MultiXactIdToOffsetSegment(newOldestMulti),
@@ -3386,7 +3387,7 @@ multixact_redo(XLogReaderState *record)
 
 		elog(DEBUG1, "replaying multixact truncation: "
 			 "offsets [%u, %u), offsets segments [%" PRIx64 ", %" PRIx64 "), "
-			 "members [%u, %u), members segments [%" PRIx64 ", %" PRIx64 ")",
+			 "members [%" PRIu64 ", %" PRIu64 "), members segments [%" PRIx64 ", %" PRIx64 ")",
 			 xlrec.startTruncOff, xlrec.endTruncOff,
 			 MultiXactIdToOffsetSegment(xlrec.startTruncOff),
 			 MultiXactIdToOffsetSegment(xlrec.endTruncOff),
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 346319338a..3af08d579a 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -884,7 +884,7 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 							 U64FromFullTransactionId(checkPoint.nextXid),
 							 checkPoint.nextOid)));
 	ereport(DEBUG1,
-			(errmsg_internal("next MultiXactId: %u; next MultiXactOffset: %u",
+			(errmsg_internal("next MultiXactId: %u; next MultiXactOffset: %" PRIu64,
 							 checkPoint.nextMulti, checkPoint.nextMultiOffset)));
 	ereport(DEBUG1,
 			(errmsg_internal("oldest unfrozen transaction ID: %u, in database %u",
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index 10de058ce9..5295108ade 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -264,7 +264,7 @@ main(int argc, char *argv[])
 		   ControlFile->checkPointCopy.nextOid);
 	printf(_("Latest checkpoint's NextMultiXactId:  %u\n"),
 		   ControlFile->checkPointCopy.nextMulti);
-	printf(_("Latest checkpoint's NextMultiOffset:  %u\n"),
+	printf(_("Latest checkpoint's NextMultiOffset:  %" PRIu64 "\n"),
 		   ControlFile->checkPointCopy.nextMultiOffset);
 	printf(_("Latest checkpoint's oldestXID:        %u\n"),
 		   ControlFile->checkPointCopy.oldestXid);
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index 7a4e4eb957..f9f3b5da96 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -759,7 +759,7 @@ PrintControlValues(bool guessed)
 		   ControlFile.checkPointCopy.nextOid);
 	printf(_("Latest checkpoint's NextMultiXactId:  %u\n"),
 		   ControlFile.checkPointCopy.nextMulti);
-	printf(_("Latest checkpoint's NextMultiOffset:  %u\n"),
+	printf(_("Latest checkpoint's NextMultiOffset:  %" PRIu64 "\n"),
 		   ControlFile.checkPointCopy.nextMultiOffset);
 	printf(_("Latest checkpoint's oldestXID:        %u\n"),
 		   ControlFile.checkPointCopy.oldestXid);
@@ -833,7 +833,7 @@ PrintNewControlValues(void)
 
 	if (set_mxoff != -1)
 	{
-		printf(_("NextMultiOffset:                      %u\n"),
+		printf(_("NextMultiOffset:                      %" PRIu64 "\n"),
 			   ControlFile.checkPointCopy.nextMultiOffset);
 	}
 
-- 
2.50.1

v17-0006-TEST-add-src-bin-pg_upgrade-t-006_offset.pl.patch (application/octet-stream)
From 5ba2757abca2e63298de169c4567a59c69a038ab Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Tue, 19 Nov 2024 17:08:10 +0300
Subject: [PATCH v17 6/7] TEST: add src/bin/pg_upgrade/t/006_offset.pl

---
 src/bin/pg_upgrade/pg_upgrade.h    |   2 +-
 src/bin/pg_upgrade/t/006_offset.pl | 562 +++++++++++++++++++++++++++++
 2 files changed, 563 insertions(+), 1 deletion(-)
 create mode 100644 src/bin/pg_upgrade/t/006_offset.pl

diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index 54ce718e68..6577567860 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -119,7 +119,7 @@ extern char *output_files[];
  *
  * XXX: should be changed to the actual CATALOG_VERSION_NO on commit.
  */
-#define MULTIXACTOFFSET_FORMATCHANGE_CAT_VER 202409041
+#define MULTIXACTOFFSET_FORMATCHANGE_CAT_VER 202504092
 
 /*
  * large object chunk size added to pg_controldata,
diff --git a/src/bin/pg_upgrade/t/006_offset.pl b/src/bin/pg_upgrade/t/006_offset.pl
new file mode 100644
index 0000000000..f5dc733a30
--- /dev/null
+++ b/src/bin/pg_upgrade/t/006_offset.pl
@@ -0,0 +1,562 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+use strict;
+use warnings FATAL => 'all';
+
+use File::Find qw(find);
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# This pair of calls will create significantly more member segments than offset
+# segments.
+sub prep
+{
+	my $node = shift;
+	my $tbl = shift;
+
+	$node->safe_psql('postgres',
+		"CREATE TABLE ${tbl} (I INT PRIMARY KEY, N_UPDATED INT) " .
+		"       WITH (AUTOVACUUM_ENABLED=FALSE);" .
+		"INSERT INTO ${tbl} SELECT G, 0 FROM GENERATE_SERIES(1, 50) G;");
+}
+
+sub fill
+{
+	my $node = shift;
+	my $tbl = shift;
+
+	my $nclients = 50;
+	my $update_every = 90;
+	my @connections = ();
+
+	for (0..$nclients)
+	{
+		my $conn = $node->background_psql('postgres');
+		$conn->query_safe("BEGIN");
+
+		push(@connections, $conn);
+	}
+
+	for (my $i = 0; $i < 20000; $i++)
+	{
+		my $conn = $connections[$i % $nclients];
+
+		$conn->query_safe("COMMIT;");
+		$conn->query_safe("BEGIN");
+
+		if ($i % $update_every == 0)
+		{
+			$conn->query_safe(
+				"UPDATE ${tbl} SET " .
+				"N_UPDATED = N_UPDATED + 1 " .
+				"WHERE I = ${i} % 50");
+		}
+		else
+		{
+			$conn->query_safe(
+				"SELECT * FROM ${tbl} FOR KEY SHARE");
+		}
+	}
+
+	for my $conn (@connections)
+	{
+		$conn->quit();
+	}
+}
+
+# This pair of calls will create roughly the same number of member and
+# offset segments.
+sub prep2
+{
+	my $node = shift;
+	my $tbl = shift;
+
+	$node->safe_psql('postgres',
+		"CREATE TABLE ${tbl}(BAR INT PRIMARY KEY, BAZ INT); " .
+		"CREATE OR REPLACE PROCEDURE MXIDFILLER(N_STEPS INT DEFAULT 1000) " .
+		"LANGUAGE PLPGSQL " .
+		"AS \$\$ " .
+		"BEGIN " .
+		"	FOR I IN 1..N_STEPS LOOP " .
+		"		UPDATE ${tbl} SET BAZ = RANDOM(1, 1000) " .
+		"		WHERE BAR IN (SELECT BAR FROM ${tbl} " .
+		"						TABLESAMPLE BERNOULLI(80)); " .
+		"		COMMIT; " .
+		"	END LOOP; " .
+		"END; \$\$; " .
+		"INSERT INTO ${tbl} (BAR, BAZ) " .
+		"SELECT ID, ID FROM GENERATE_SERIES(1, 1024) ID;");
+}
+
+sub fill2
+{
+	my $node = shift;
+	my $tbl = shift;
+	my $scale = shift // 1;
+
+	$node->safe_psql('postgres',
+		"BEGIN; " .
+		"SELECT * FROM ${tbl} FOR KEY SHARE; " .
+		"PREPARE TRANSACTION 'A'; " .
+		"CALL MXIDFILLER((365 * ${scale})::int); " .
+		"COMMIT PREPARED 'A';");
+}
+
+
+# generate around 2 offset segments and 55 member segments
+sub mxid_gen1
+{
+	my $node = shift;
+	my $tbl = shift;
+
+	prep($node, $tbl);
+	fill($node, $tbl);
+
+	$node->safe_psql('postgres', q(CHECKPOINT));
+}
+
+# generate around 10 offset segments and 12 member segments
+sub mxid_gen2
+{
+	my $node = shift;
+	my $tbl = shift;
+	my $scale = shift // 1;
+
+	prep2($node, $tbl);
+	fill2($node, $tbl, $scale);
+
+	$node->safe_psql('postgres', q(CHECKPOINT));
+}
+
+# Fetch latest multixact checkpoint values.
+sub multi_bounds
+{
+	my ($node) = @_;
+	my $path = $node->config_data('--bindir');
+	my ($stdout, $stderr) = run_command([
+									$path . '/pg_controldata',
+									$node->data_dir
+								]);
+	my @control_data = split("\n", $stdout);
+	my $next = undef;
+	my $oldest = undef;
+	my $next_offset = undef;
+
+	foreach (@control_data)
+	{
+		if ($_ =~ /^Latest checkpoint's NextMultiXactId:\s*(.*)$/mg)
+		{
+			$next = $1;
+			print ">>> @ node ". $node->name . ", " . $_ . "\n";
+		}
+
+		if ($_ =~ /^Latest checkpoint's oldestMultiXid:\s*(.*)$/mg)
+		{
+			$oldest = $1;
+			print ">>> @ node ". $node->name . ", " . $_ . "\n";
+		}
+
+		if ($_ =~ /^Latest checkpoint's NextMultiOffset:\s*(.*)$/mg)
+		{
+			$next_offset = $1;
+			print ">>> @ node ". $node->name . ", " . $_ . "\n";
+		}
+
+		if (defined($oldest) && defined($next) && defined($next_offset))
+		{
+			last;
+		}
+	}
+
+	die "Latest checkpoint's NextMultiXactId not found in control file!\n"
+	unless defined($next);
+
+	die "Latest checkpoint's oldestMultiXid not found in control file!\n"
+	unless defined($oldest);
+
+	die "Latest checkpoint's NextMultiOffset not found in control file!\n"
+	unless defined($next_offset);
+
+	return ($oldest, $next, $next_offset);
+}
+
+# Create node from existing bins.
+sub create_new_node
+{
+	my ($name, %params) = @_;
+
+	create_node(0, @_);
+}
+
+# Create node from ENV oldinstall
+sub create_old_node
+{
+	my ($name, %params) = @_;
+
+	if (!defined($ENV{oldinstall}))
+	{
+		die "oldinstall is not defined";
+	}
+
+	create_node(1, @_);
+}
+
+sub create_node
+{
+	my ($install_path_from_env, $name, %params) = @_;
+	my $scale = defined $params{scale} ? $params{scale} : 1;
+	my $multi = defined $params{multi} ? $params{multi} : undef;
+	my $offset = defined $params{offset} ? $params{offset} : undef;
+
+	my $node =
+		$install_path_from_env ?
+			PostgreSQL::Test::Cluster->new($name,
+					install_path => $ENV{oldinstall}) :
+			PostgreSQL::Test::Cluster->new($name);
+
+	$node->init(force_initdb => 1,
+		extra => [
+			$multi ? ('-m', $multi) : (),
+			$offset ? ('-o', $offset) : (),
+		]);
+
+	# Fixup MOX patch quirk
+	if ($multi)
+	{
+		unlink $node->data_dir . '/pg_multixact/offsets/0000';
+	}
+	if ($offset)
+	{
+		unlink $node->data_dir . '/pg_multixact/members/0000';
+	}
+
+	$node->append_conf('postgresql.conf', 'fsync = off');
+	$node->append_conf('postgresql.conf', 'max_prepared_transactions = 2');
+
+	$node->start();
+	mxid_gen2($node, 'FOO', $scale);
+	mxid_gen1($node, 'BAR', $scale);
+	$node->restart();
+	$node->safe_psql('postgres', q(SELECT * FROM FOO));		# just in case...
+	$node->safe_psql('postgres', q(SELECT * FROM BAR));
+	$node->safe_psql('postgres', q(CHECKPOINT));
+	$node->stop();
+
+	return $node;
+}
+
+sub do_upgrade
+{
+	my ($oldnode, $newnode) = @_;
+
+	command_ok(
+		[
+			'pg_upgrade', '--no-sync',
+			'-d', $oldnode->data_dir,
+			'-D', $newnode->data_dir,
+			'-b', $oldnode->config_data('--bindir'),
+			'-B', $newnode->config_data('--bindir'),
+			'-s', $newnode->host,
+			'-p', $oldnode->port,
+			'-P', $newnode->port,
+			'--check'
+		],
+		'run of pg_upgrade');
+
+	command_ok(
+		[
+			'pg_upgrade', '--no-sync',
+			'-d', $oldnode->data_dir,
+			'-D', $newnode->data_dir,
+			'-b', $oldnode->config_data('--bindir'),
+			'-B', $newnode->config_data('--bindir'),
+			'-s', $newnode->host,
+			'-p', $oldnode->port,
+			'-P', $newnode->port,
+			'--copy'
+		],
+		'run of pg_upgrade');
+
+	$oldnode->start();
+	$newnode->start();
+
+	my $oldfoo = $oldnode->safe_psql('postgres', q(SELECT * FROM FOO));
+	my $newfoo = $newnode->safe_psql('postgres', q(SELECT * FROM FOO));
+	is($oldfoo, $newfoo, "select foo eq");
+
+	my $oldbar = $oldnode->safe_psql('postgres', q(SELECT * FROM BAR));
+	my $newbar = $newnode->safe_psql('postgres', q(SELECT * FROM BAR));
+	is($oldbar, $newbar, "select bar eq");
+
+	$oldnode->stop();
+	$newnode->stop();
+
+	multi_bounds($oldnode);
+	multi_bounds($newnode);
+}
+
+my @TESTS = (
+	# tests without ENV oldinstall
+	#0, 1, 2, 3, 4, 5, 6,
+	# tests with "real" pg_upgrade
+	#100, 101, 102, 103, 104, 105, 106,
+	# self upgrade
+	1000,
+);
+
+# =============================================================================
+# Basic sanity tests on a NEW bin
+# =============================================================================
+
+# starts from the zero
+SKIP:
+{
+	my $TEST_NO = 0;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $node = create_new_node('simple_mo',
+						scale => 1);
+	multi_bounds($node);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# multi starts from the value
+SKIP:
+{
+	my $TEST_NO = 1;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $node = create_new_node('simple_Mo',
+						scale => 1.15,
+						multi => '0x123400');
+	multi_bounds($node);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# offsets starts from the value
+SKIP:
+{
+	my $TEST_NO = 2;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $node = create_new_node('simple_mO',
+						scale => 1.15,
+						offset => '0x432100');
+	multi_bounds($node);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# multi and offsets starts from the value
+SKIP:
+{
+	my $TEST_NO = 3;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $node = create_new_node('simple_MO',
+						scale => 1.15,
+						multi => '0xDEAD00', offset => '0xBEEF00');
+	multi_bounds($node);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# multi starts from the value, multi wrap
+SKIP:
+{
+	my $TEST_NO = 4;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $node = create_new_node('simple_Mo_wrap',
+						scale => 1.15,
+						multi => '0xFFFF7000');
+	multi_bounds($node);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# offsets starts from the value, offsets wrap
+SKIP:
+{
+	my $TEST_NO = 5;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $node = create_new_node('simple_mO_wrap',
+						scale => 1.15,
+						offset => '0xFFFFFC00');
+	multi_bounds($node);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# multi starts from the value, offsets starts from the value,
+# multi wrap, offsets wrap
+SKIP:
+{
+	my $TEST_NO = 6;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $node = create_new_node('simple_MO_wrap',
+						scale => 1.15,
+						multi => '0xFFFF7000', offset => '0xFFFFFC00');
+	multi_bounds($node);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# =============================================================================
+# pg_upgrade tests
+# =============================================================================
+
+# starts from the zero
+SKIP:
+{
+	my $TEST_NO = 100;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $dbname = 'mo';
+	my $oldnode = create_old_node("old_$dbname",
+						scale => 1);
+	my $newnode = PostgreSQL::Test::Cluster->new("new_$dbname");
+	$newnode->init();
+
+	do_upgrade($oldnode, $newnode);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# multi starts from the value
+SKIP:
+{
+	my $TEST_NO = 101;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $dbname = 'Mo';
+	my $oldnode = create_old_node("old_$dbname",
+						scale => 1.2,
+						multi => '0x123400');
+	my $newnode = PostgreSQL::Test::Cluster->new("new_$dbname");
+	$newnode->init();
+
+	do_upgrade($oldnode, $newnode);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# offsets starts from the value
+SKIP:
+{
+	my $TEST_NO = 102;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $dbname = 'mO';
+	my $oldnode = create_old_node("old_$dbname",
+						scale => 1.2,
+						offset => '0x432100');
+	my $newnode = PostgreSQL::Test::Cluster->new("new_$dbname");
+	$newnode->init();
+
+	do_upgrade($oldnode, $newnode);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# multi and offsets starts from the value
+SKIP:
+{
+	my $TEST_NO = 103;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $dbname = 'MO';
+	my $oldnode = create_old_node("old_$dbname",
+						scale => 1.2,
+						multi => '0xDEAD00', offset => '0xBEEF00');
+	my $newnode = PostgreSQL::Test::Cluster->new("new_$dbname");
+	$newnode->init();
+
+	do_upgrade($oldnode, $newnode);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# multi starts from the value, multi wrap
+SKIP:
+{
+	my $TEST_NO = 104;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $dbname = 'Mo_wrap';
+	my $oldnode = create_old_node("old_$dbname",
+						scale => 1.2,
+						multi => '0xFFFF7000');
+	my $newnode = PostgreSQL::Test::Cluster->new("new_$dbname");
+	$newnode->init();
+
+	do_upgrade($oldnode, $newnode);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# offsets starts from the value, offsets wrap
+SKIP:
+{
+	my $TEST_NO = 105;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $dbname = 'mO_wrap';
+	my $oldnode = create_old_node("old_$dbname",
+						scale => 1.2,
+						offset => '0xFFFFFC00');
+	my $newnode = PostgreSQL::Test::Cluster->new("new_$dbname");
+	$newnode->init();
+
+	do_upgrade($oldnode, $newnode);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# multi starts from the value, offsets starts from the value,
+# multi wrap, offsets wrap
+SKIP:
+{
+	my $TEST_NO = 106;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $dbname = 'MO_wrap';
+	my $oldnode = create_old_node("old_$dbname",
+						scale => 1.2,
+						multi => '0xFFFF7000', offset => '0xFFFFFC00');
+	my $newnode = PostgreSQL::Test::Cluster->new("new_$dbname");
+	$newnode->init();
+
+	do_upgrade($oldnode, $newnode);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+# =============================================================================
+# Self upgrade
+# =============================================================================
+
+# starts from the zero
+SKIP:
+{
+	my $TEST_NO = 1000;
+	skip "do not test case $TEST_NO", 1
+		unless ( grep( /^$TEST_NO$/, @TESTS ) );
+
+	my $dbname = 'self_upgrade';
+	my $oldnode = create_new_node("old_$dbname",
+						scale => 1);
+	my $newnode = PostgreSQL::Test::Cluster->new("new_$dbname");
+	$newnode->init();
+
+	do_upgrade($oldnode, $newnode);
+	ok(1, "TEST $TEST_NO PASSED");
+}
+
+done_testing();
-- 
2.50.1

v17-0007-TEST-bump-catver.patch (application/octet-stream)
From 85dfdd1078d2b0c7f1020a0100747ae6fa17093d Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Wed, 13 Nov 2024 16:34:34 +0300
Subject: [PATCH v17 7/7] TEST: bump catver

---
 src/bin/pg_upgrade/pg_upgrade.h  | 2 +-
 src/include/catalog/catversion.h | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index 6577567860..a5acc8654d 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -119,7 +119,7 @@ extern char *output_files[];
  *
  * XXX: should be changed to the actual CATALOG_VERSION_NO on commit.
  */
-#define MULTIXACTOFFSET_FORMATCHANGE_CAT_VER 202504092
+#define MULTIXACTOFFSET_FORMATCHANGE_CAT_VER 202509022
 
 /*
  * large object chunk size added to pg_controldata,
diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h
index 836369f163..492f6e25fb 100644
--- a/src/include/catalog/catversion.h
+++ b/src/include/catalog/catversion.h
@@ -57,6 +57,6 @@
  */
 
 /*							yyyymmddN */
-#define CATALOG_VERSION_NO	202509021
+#define CATALOG_VERSION_NO	202509022
 
 #endif
-- 
2.50.1

#43 Alexander Korotkov
aekorotkov@gmail.com
In reply to: Maxim Orlov (#42)
Re: POC: make mxidoff 64 bits

Hello Maxim!

On Thu, Sep 11, 2025 at 11:58 AM Maxim Orlov <orlovmg@gmail.com> wrote:

Once again, @ 8191e0c16a

Thank you for your work on this subject. Multixact members can really
grow much faster than multixact offsets, and avoiding wraparound just
here might make sense. At the same time, making multixact offsets
64-bit is local and doesn't require changing the tuple xmin/xmax
interpretation.

I went through the patchset. The shape does not look bad, but I have
a concern about the size of the multixact offsets. As I understand it,
this patchset doubles the size of the multixact offsets: each offset
grows from 32 bits to 64 bits. This seems quite a high price for
avoiding wraparound of the multixact members.

We can try to squeeze the multixact offsets, given that they form an
ascending sequence in which each value increases by the size of a
multixact. But how many members can a multixact contain at most?
Looking at MultiXactIdExpand(), I get that we keep locks from
in-progress transactions and committed non-lock transactions (I guess
there can be only one of the latter). The number of transactions run by
backends should fit within MAX_BACKENDS (2^18 - 1), and the number of
prepared transactions should also fit within MAX_BACKENDS. So, I guess
we can cap the number of members of a single multixact at 2^24.

Therefore, we can replace each group of eight 32-bit multixact offsets
(32 bytes) with one 64-bit offset plus seven 24-bit offset increments
(29 bytes). The actual multixact offsets can be calculated on the fly;
the overhead shouldn't be significant. What do you think?
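
(Illustration only, not code from any patch in this thread.) The grouping described above could be sketched roughly as follows. The names group_encode, group_decode, GROUP_SIZE and GROUP_BYTES are hypothetical, and the sketch assumes that each 24-bit increment is the distance from the previous offset in the group and that offsets within a group never wrap around.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define GROUP_SIZE  8               /* offsets per group, as in the proposal */
#define GROUP_BYTES (8 + 7 * 3)     /* 64-bit base + seven 24-bit increments */

/*
 * Pack eight ascending offsets into 29 bytes.  Returns 0 on success, or -1
 * if the offsets are not ascending or an increment needs more than 24 bits.
 */
static int
group_encode(const uint64_t offsets[GROUP_SIZE], uint8_t out[GROUP_BYTES])
{
	memcpy(out, &offsets[0], sizeof(uint64_t));	/* base, native byte order */

	for (int i = 1; i < GROUP_SIZE; i++)
	{
		uint8_t    *p = out + 8 + (i - 1) * 3;
		uint64_t	inc;

		if (offsets[i] < offsets[i - 1])
			return -1;
		inc = offsets[i] - offsets[i - 1];
		if (inc >= (UINT64_C(1) << 24))
			return -1;

		/* store the increment as three little-endian bytes */
		p[0] = (uint8_t) (inc & 0xFF);
		p[1] = (uint8_t) ((inc >> 8) & 0xFF);
		p[2] = (uint8_t) ((inc >> 16) & 0xFF);
	}
	return 0;
}

/* Recover the idx-th offset: the base plus the first idx increments. */
static uint64_t
group_decode(const uint8_t in[GROUP_BYTES], int idx)
{
	uint64_t	off;

	assert(idx >= 0 && idx < GROUP_SIZE);
	memcpy(&off, in, sizeof(uint64_t));

	for (int i = 0; i < idx; i++)
	{
		const uint8_t *p = in + 8 + i * 3;

		off += (uint64_t) p[0] | ((uint64_t) p[1] << 8) | ((uint64_t) p[2] << 16);
	}
	return off;
}
```

Decoding the last entry of a group costs at most seven additions, which is the "calculated on the fly" overhead mentioned above.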

------
Regards,
Alexander Korotkov
Supabase

#44 Maxim Orlov
orlovmg@gmail.com
In reply to: Alexander Korotkov (#43)
Re: POC: make mxidoff 64 bits

On Sat, 13 Sept 2025 at 16:34, Alexander Korotkov <aekorotkov@gmail.com>
wrote:

Therefore, we can replace each group of eight 32-bit multixact offsets
(32 bytes) with one 64-bit offset plus seven 24-bit offset increments
(29 bytes). The actual multixact offsets can be calculated on the fly;
the overhead shouldn't be significant. What do you think?

Thank you for your review; I'm pleased to hear from you again.

Yes, because the maximum number of mxoff is limited by the number of
running transactions, we may do it that way.
However, it is a bit weird to have offsets with the 7-byte "base".

I believe we may take advantage of the 64XID patch's notion of putting
an 8-byte base followed by 4-byte offsets on each page.

A 32kB page may then contain 2^13-2 offsets, each capped at 2^18+1.
Therefore, an offset from the base will never overflow 2^31 and will
always fit in uint32.

It appears logical to me.
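
(Illustration only, not the code from the attached patch.) A minimal sketch of the "8-byte base plus 4-byte deltas" page layout might look like this. PAGE_SIZE stands in for BLCKSZ, and page_init, page_set and page_get are hypothetical helper names. The sketch assumes every entry stores its distance from the page base.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE		8192	/* stands in for BLCKSZ (assumed default) */
#define DELTAS_PER_PAGE	((PAGE_SIZE - sizeof(uint64_t)) / sizeof(uint32_t))

/* The bound quoted above: (2^13 - 2) entries, each at most 2^18 + 1. */
_Static_assert(UINT64_C(8190) * UINT64_C(262145) < (UINT64_C(1) << 31),
			   "largest per-page distance must stay below 2^31");

/* Zero the page and store the 64-bit base in its first eight bytes. */
static void
page_init(uint8_t *page, uint64_t base)
{
	memset(page, 0, PAGE_SIZE);
	memcpy(page, &base, sizeof(base));
}

/* Store an offset as a 32-bit delta; returns -1 if it does not fit. */
static int
page_set(uint8_t *page, size_t entry, uint64_t offset)
{
	uint64_t	base;
	uint32_t	delta;

	assert(entry < DELTAS_PER_PAGE);
	memcpy(&base, page, sizeof(base));
	if (offset < base || offset - base > UINT32_MAX)
		return -1;

	delta = (uint32_t) (offset - base);
	memcpy(page + sizeof(base) + entry * sizeof(delta), &delta, sizeof(delta));
	return 0;
}

/* Rebuild the full 64-bit offset from the page base and the stored delta. */
static uint64_t
page_get(const uint8_t *page, size_t entry)
{
	uint64_t	base;
	uint32_t	delta;

	assert(entry < DELTAS_PER_PAGE);
	memcpy(&base, page, sizeof(base));
	memcpy(&delta, page + sizeof(base) + entry * sizeof(delta), sizeof(delta));
	return base + delta;
}
```

With an 8kB page this leaves room for 2046 entries instead of 2048, so the space cost relative to plain 32-bit offsets is one 8-byte base per page.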

--
Best regards,
Maxim Orlov.

#45 wenhui qiu
qiuwenhuifx@gmail.com
In reply to: Maxim Orlov (#44)
Re: POC: make mxidoff 64 bits

Hi Maxim
Thanks for your continued efforts to get XID64 implemented.

A 32kB page may then contain 2^13-2 offsets, each capped at 2^18+1.
Therefore, an offset from the base will never overflow 2^31 and will
always fit in uint32.

It appears logical to me.

Agreed, +1, but I have a question: I remember the XID64 patch got split into
a few threads. How are these threads related? The original one was seen as
too big a change, so it was broken up after people raised concerns.

Thanks

#46 Maxim Orlov
orlovmg@gmail.com
In reply to: wenhui qiu (#45)
1 attachment(s)
Re: POC: make mxidoff 64 bits

On Tue, 16 Sept 2025 at 15:12, wenhui qiu <qiuwenhuifx@gmail.com> wrote:

Agreed, +1, but I have a question: I remember the XID64 patch got split
into a few threads. How are these threads related? The original one was
seen as too big a change, so it was broken up after people raised
concerns.

Yeah, you're absolutely correct. This thread is a part of the overall
work on the transition to XID64 in Postgres. As far as I remember, no
one explicitly raised concerns, but it's clear to me that it won't be
committed as one big patch set.

Here is a WIP patch @ 8191e0c16a for the issue discussed above.
I also had to merge several patches from the previous set, since the
consensus is to use the PRI* format for outputting 64-bit values, and a
separate patch that only changed the format with a cast and %llu lost
its meaning.
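
(Illustration only, in plain C rather than the backend's elog/ereport.) The two output styles mentioned above differ roughly as sketched below. The function names format_with_pri and format_with_cast are hypothetical, and snprintf stands in for whatever output routine is actually used.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* PRI* style: the macro expands to the right conversion for uint64_t. */
static int
format_with_pri(char *buf, size_t buflen, uint64_t next_offset)
{
	assert(buf != NULL && buflen > 0);
	return snprintf(buf, buflen, "next MultiXactOffset: %" PRIu64,
					next_offset);
}

/* Older style: cast to unsigned long long and print with %llu. */
static int
format_with_cast(char *buf, size_t buflen, uint64_t next_offset)
{
	assert(buf != NULL && buflen > 0);
	return snprintf(buf, buflen, "next MultiXactOffset: %llu",
					(unsigned long long) next_offset);
}
```

Both produce the same text. The PRI* form simply avoids the cast at every call site.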

If this option suits everyone, the next step is to add the part related
to pg_upgrade.

--
Best regards,
Maxim Orlov.

Attachments:

v18-wip-0001-Use-64-bit-multixact-offsets.patch (application/octet-stream)
From 434d27269936be180cef9ab4f4b9ed23f6bda288 Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Wed, 7 Aug 2024 16:35:22 +0300
Subject: [PATCH v18-wip] Use 64-bit multixact offsets

Author: Maxim Orlov <orlovmg@gmail.com>
---
 src/backend/access/rmgrdesc/mxactdesc.c   |   4 +-
 src/backend/access/rmgrdesc/xlogdesc.c    |   2 +-
 src/backend/access/transam/multixact.c    | 388 +++++++---------------
 src/backend/access/transam/xlogrecovery.c |   2 +-
 src/backend/commands/vacuum.c             |   2 +-
 src/backend/postmaster/autovacuum.c       |   4 +-
 src/bin/pg_controldata/pg_controldata.c   |   2 +-
 src/bin/pg_resetwal/pg_resetwal.c         |   6 +-
 src/bin/pg_resetwal/t/001_basic.pl        |   2 +-
 src/include/access/multixact.h            |   3 +-
 src/include/c.h                           |   2 +-
 11 files changed, 134 insertions(+), 283 deletions(-)

diff --git a/src/backend/access/rmgrdesc/mxactdesc.c b/src/backend/access/rmgrdesc/mxactdesc.c
index 3ca0582db3..052dd0a4ce 100644
--- a/src/backend/access/rmgrdesc/mxactdesc.c
+++ b/src/backend/access/rmgrdesc/mxactdesc.c
@@ -65,7 +65,7 @@ multixact_desc(StringInfo buf, XLogReaderState *record)
 		xl_multixact_create *xlrec = (xl_multixact_create *) rec;
 		int			i;
 
-		appendStringInfo(buf, "%u offset %u nmembers %d: ", xlrec->mid,
+		appendStringInfo(buf, "%u offset %" PRIu64 " nmembers %d: ", xlrec->mid,
 						 xlrec->moff, xlrec->nmembers);
 		for (i = 0; i < xlrec->nmembers; i++)
 			out_member(buf, &xlrec->members[i]);
@@ -74,7 +74,7 @@ multixact_desc(StringInfo buf, XLogReaderState *record)
 	{
 		xl_multixact_truncate *xlrec = (xl_multixact_truncate *) rec;
 
-		appendStringInfo(buf, "offsets [%u, %u), members [%u, %u)",
+		appendStringInfo(buf, "offsets [%u, %u), members [%" PRIu64 ", %" PRIu64 ")",
 						 xlrec->startTruncOff, xlrec->endTruncOff,
 						 xlrec->startTruncMemb, xlrec->endTruncMemb);
 	}
diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c
index cd6c2a2f65..441034f592 100644
--- a/src/backend/access/rmgrdesc/xlogdesc.c
+++ b/src/backend/access/rmgrdesc/xlogdesc.c
@@ -66,7 +66,7 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 		CheckPoint *checkpoint = (CheckPoint *) rec;
 
 		appendStringInfo(buf, "redo %X/%08X; "
-						 "tli %u; prev tli %u; fpw %s; wal_level %s; xid %u:%u; oid %u; multi %u; offset %u; "
+						 "tli %u; prev tli %u; fpw %s; wal_level %s; xid %u:%u; oid %u; multi %u; offset %" PRIu64 "; "
 						 "oldest xid %u in DB %u; oldest multi %u in DB %u; "
 						 "oldest/newest commit timestamp xid: %u/%u; "
 						 "oldest running xid %u; %s",
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 8bf59d369f..09607ff00d 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -88,21 +88,31 @@
 #include "utils/memutils.h"
 
 
+typedef uint32 ShortMultiXactOffset;	/* for a disk storage */
+
 /*
  * Defines for MultiXactOffset page sizes.  A page is the same BLCKSZ as is
  * used everywhere else in Postgres.
  *
- * Note: because MultiXactOffsets are 32 bits and wrap around at 0xFFFFFFFF,
- * MultiXact page numbering also wraps around at
- * 0xFFFFFFFF/MULTIXACT_OFFSETS_PER_PAGE, and segment numbering at
- * 0xFFFFFFFF/MULTIXACT_OFFSETS_PER_PAGE/SLRU_PAGES_PER_SEGMENT.  We need
- * take no explicit notice of that fact in this module, except when comparing
- * segment and page numbers in TruncateMultiXact (see
- * MultiXactOffsetPagePrecedes).
+ * There are two key reasons why storing straightforward 64-bit offset values
+ * would be wasteful in terms of disc space usage:
+ * 1) offset values are recorded in ascending order and are not overwritten;
+ * 2) the largest supported BLCKSZ is 32k, which can store up to 2^13 32-bit
+ *    items on a single page;  thus, with MAX_BACKENDS limited to 2^18-1 we have
+ *    2^13 * (2^18-1), which is less than 2^31 and fits in 32 bits.
+ *
+ * In other words, the maximum "distance" between offsets on a single page
+ * does not exceed 32 bits.  To optimise disc space allocation, we employ the
+ * following scheme.  On each page, the basic 64-bit offset, known as the page
+ * base, is located first.  It is followed by 32-bit deltas relative to the
+ * base element.  Thus, the required offset for the 0-th element is the page's
+ * base; the value for each subsequent offset on the same page is calculated
+ * by adding its delta to the page base (0-th) element.
  */
 
 /* We need four bytes per offset */
-#define MULTIXACT_OFFSETS_PER_PAGE (BLCKSZ / sizeof(MultiXactOffset))
+#define MULTIXACT_OFFSETS_PER_PAGE		\
+	((BLCKSZ - sizeof(MultiXactOffset)) / sizeof(ShortMultiXactOffset))
 
 static inline int64
 MultiXactIdToOffsetPage(MultiXactId multi)
@@ -207,10 +217,14 @@ MXOffsetToMemberOffset(MultiXactOffset offset)
 		member_in_group * sizeof(TransactionId);
 }
 
-/* Multixact members wraparound thresholds. */
-#define MULTIXACT_MEMBER_SAFE_THRESHOLD		(MaxMultiXactOffset / 2)
-#define MULTIXACT_MEMBER_DANGER_THRESHOLD	\
-	(MaxMultiXactOffset - MaxMultiXactOffset / 4)
+/*
+ * Multixact members warning threshold.
+ *
+ * If the difference between nextOffset and oldestOffset exceeds this value, we
+ * trigger autovacuum in order to release disk space and reduce table bloat if
+ * possible.
+ */
+#define MULTIXACT_MEMBER_AUTOVAC_THRESHOLD		UINT64CONST(0xFFFFFFFF)
 
 static inline MultiXactId
 PreviousMultiXactId(MultiXactId multi)
@@ -227,6 +241,49 @@ static SlruCtlData MultiXactMemberCtlData;
 #define MultiXactOffsetCtl	(&MultiXactOffsetCtlData)
 #define MultiXactMemberCtl	(&MultiXactMemberCtlData)
 
+static inline MultiXactOffset
+MXOffsetRead(int entryno, int slotno)
+{
+	MultiXactOffset		   *offptr;
+	ShortMultiXactOffset   *off32ptr;
+
+	offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
+	if (entryno != 0)
+	{
+		off32ptr = (ShortMultiXactOffset *) (offptr + 1);	/* bypass base */
+		off32ptr += entryno - 1;
+
+		return *off32ptr + *offptr;		/* 64-bit base + 32-bit value */
+	}
+
+	/* the 0-th element is a 64-bit value */
+	return *offptr;
+}
+
+static inline void
+MXOffsetWrite(int entryno, int slotno, MultiXactOffset offset)
+{
+	MultiXactOffset		   *offptr;
+	ShortMultiXactOffset   *off32ptr;
+
+	offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
+	if (entryno != 0)
+	{
+		off32ptr = (ShortMultiXactOffset *) (offptr + 1);	/* bypass base */
+		off32ptr += entryno - 1;
+		*off32ptr = (ShortMultiXactOffset) (offset - *offptr);
+
+		return;
+	}
+
+	/*
+	 * The first offset on the page is stored as a 64-bit value.  All other
+	 * elements on the page are computed by adding their 32-bit delta to this
+	 * base value.
+	 */
+	*offptr = offset;
+}
+
 /*
  * MultiXact state shared across all backends.  All this state is protected
  * by MultiXactGenLock.  (We also use SLRU bank's lock of MultiXactOffset and
@@ -267,9 +324,6 @@ typedef struct MultiXactStateData
 	MultiXactId multiStopLimit;
 	MultiXactId multiWrapLimit;
 
-	/* support for members anti-wraparound measures */
-	MultiXactOffset offsetStopLimit;	/* known if oldestOffsetKnown */
-
 	/*
 	 * This is used to sleep until a multixact offset is written when we want
 	 * to create the next one.
@@ -400,8 +454,6 @@ static bool MultiXactOffsetPrecedes(MultiXactOffset offset1,
 									MultiXactOffset offset2);
 static void ExtendMultiXactOffset(MultiXactId multi);
 static void ExtendMultiXactMember(MultiXactOffset offset, int nmembers);
-static bool MultiXactOffsetWouldWrap(MultiXactOffset boundary,
-									 MultiXactOffset start, uint32 distance);
 static bool SetOffsetVacuumLimit(bool is_startup);
 static bool find_multixact_start(MultiXactId multi, MultiXactOffset *result);
 static void WriteMTruncateXlogRec(Oid oldestMultiDB,
@@ -910,7 +962,6 @@ RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
 	int64		prev_pageno;
 	int			entryno;
 	int			slotno;
-	MultiXactOffset *offptr;
 	int			i;
 	LWLock	   *lock;
 	LWLock	   *prevlock = NULL;
@@ -929,10 +980,8 @@ RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
 	 * take the trouble to generalize the slru.c error reporting code.
 	 */
 	slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, multi);
-	offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
-	offptr += entryno;
 
-	*offptr = offset;
+	MXOffsetWrite(entryno, slotno, offset);
 
 	MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
 
@@ -1154,78 +1203,6 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 	else
 		*offset = nextOffset;
 
-	/*----------
-	 * Protect against overrun of the members space as well, with the
-	 * following rules:
-	 *
-	 * If we're past offsetStopLimit, refuse to generate more multis.
-	 * If we're close to offsetStopLimit, emit a warning.
-	 *
-	 * Arbitrarily, we start emitting warnings when we're 20 segments or less
-	 * from offsetStopLimit.
-	 *
-	 * Note we haven't updated the shared state yet, so if we fail at this
-	 * point, the multixact ID we grabbed can still be used by the next guy.
-	 *
-	 * Note that there is no point in forcing autovacuum runs here: the
-	 * multixact freeze settings would have to be reduced for that to have any
-	 * effect.
-	 *----------
-	 */
-#define OFFSET_WARN_SEGMENTS	20
-	if (MultiXactState->oldestOffsetKnown &&
-		MultiXactOffsetWouldWrap(MultiXactState->offsetStopLimit, nextOffset,
-								 nmembers))
-	{
-		/* see comment in the corresponding offsets wraparound case */
-		SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER);
-
-		ereport(ERROR,
-				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg("multixact \"members\" limit exceeded"),
-				 errdetail_plural("This command would create a multixact with %u members, but the remaining space is only enough for %u member.",
-								  "This command would create a multixact with %u members, but the remaining space is only enough for %u members.",
-								  MultiXactState->offsetStopLimit - nextOffset - 1,
-								  nmembers,
-								  MultiXactState->offsetStopLimit - nextOffset - 1),
-				 errhint("Execute a database-wide VACUUM in database with OID %u with reduced \"vacuum_multixact_freeze_min_age\" and \"vacuum_multixact_freeze_table_age\" settings.",
-						 MultiXactState->oldestMultiXactDB)));
-	}
-
-	/*
-	 * Check whether we should kick autovacuum into action, to prevent members
-	 * wraparound. NB we use a much larger window to trigger autovacuum than
-	 * just the warning limit. The warning is just a measure of last resort -
-	 * this is in line with GetNewTransactionId's behaviour.
-	 */
-	if (!MultiXactState->oldestOffsetKnown ||
-		(MultiXactState->nextOffset - MultiXactState->oldestOffset
-		 > MULTIXACT_MEMBER_SAFE_THRESHOLD))
-	{
-		/*
-		 * To avoid swamping the postmaster with signals, we issue the autovac
-		 * request only when crossing a segment boundary. With default
-		 * compilation settings that's roughly after 50k members.  This still
-		 * gives plenty of chances before we get into real trouble.
-		 */
-		if ((MXOffsetToMemberPage(nextOffset) / SLRU_PAGES_PER_SEGMENT) !=
-			(MXOffsetToMemberPage(nextOffset + nmembers) / SLRU_PAGES_PER_SEGMENT))
-			SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER);
-	}
-
-	if (MultiXactState->oldestOffsetKnown &&
-		MultiXactOffsetWouldWrap(MultiXactState->offsetStopLimit,
-								 nextOffset,
-								 nmembers + MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT * OFFSET_WARN_SEGMENTS))
-		ereport(WARNING,
-				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg_plural("database with OID %u must be vacuumed before %d more multixact member is used",
-							   "database with OID %u must be vacuumed before %d more multixact members are used",
-							   MultiXactState->offsetStopLimit - nextOffset + nmembers,
-							   MultiXactState->oldestMultiXactDB,
-							   MultiXactState->offsetStopLimit - nextOffset + nmembers),
-				 errhint("Execute a database-wide VACUUM in that database with reduced \"vacuum_multixact_freeze_min_age\" and \"vacuum_multixact_freeze_table_age\" settings.")));
-
 	ExtendMultiXactMember(nextOffset, nmembers);
 
 	/*
@@ -1254,7 +1231,8 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 
 	LWLockRelease(MultiXactGenLock);
 
-	debug_elog4(DEBUG2, "GetNew: returning %u offset %u", result, *offset);
+	debug_elog4(DEBUG2, "GetNew: returning %u offset %" PRIu64, result,
+				*offset);
 	return result;
 }
 
@@ -1293,7 +1271,6 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
 	int64		prev_pageno;
 	int			entryno;
 	int			slotno;
-	MultiXactOffset *offptr;
 	MultiXactOffset offset;
 	int			length;
 	int			truelength;
@@ -1417,9 +1394,8 @@ retry:
 	LWLockAcquire(lock, LW_EXCLUSIVE);
 
 	slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, multi);
-	offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
-	offptr += entryno;
-	offset = *offptr;
+
+	offset = MXOffsetRead(entryno, slotno);
 
 	Assert(offset != 0);
 
@@ -1466,9 +1442,7 @@ retry:
 			slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, tmpMXact);
 		}
 
-		offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
-		offptr += entryno;
-		nextMXOffset = *offptr;
+		nextMXOffset = MXOffsetRead(entryno, slotno);
 
 		if (nextMXOffset == 0)
 		{
@@ -2142,18 +2116,40 @@ TrimMultiXact(void)
 	if (entryno != 0)
 	{
 		int			slotno;
-		MultiXactOffset *offptr;
 		LWLock	   *lock = SimpleLruGetBankLock(MultiXactOffsetCtl, pageno);
 
-		LWLockAcquire(lock, LW_EXCLUSIVE);
-		slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, nextMXact);
-		offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
-		offptr += entryno;
+		if (SimpleLruDoesPhysicalPageExist(MultiXactOffsetCtl, pageno))
+		{
+			MultiXactOffset *offptr;
 
-		MemSet(offptr, 0, BLCKSZ - (entryno * sizeof(MultiXactOffset)));
+			LWLockAcquire(lock, LW_EXCLUSIVE);
+			slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true,
+									   nextMXact);
+			offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
 
-		MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
-		LWLockRelease(lock);
+			if (entryno == 0)
+				MemSet(offptr, 0, BLCKSZ);
+			else
+			{
+				ShortMultiXactOffset *off32ptr;
+
+				off32ptr = (ShortMultiXactOffset *) (offptr + 1);
+				off32ptr += entryno - 1;
+
+				/*
+				 * Knowing that offptr points to the beginning of the buffer,
+				 * address arithmetic can be used to determine the amount of
+				 * bytes remaining.
+				 */
+				MemSet(off32ptr, 0,
+					   BLCKSZ - (((char *) off32ptr - (char *) offptr)));
+			}
+
+			MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
+			LWLockRelease(lock);
+		}
+		else
+			SimpleLruZeroAndWritePage(MultiXactOffsetCtl, pageno);
 	}
 
 	/*
@@ -2222,7 +2218,7 @@ MultiXactGetCheckptMulti(bool is_shutdown,
 	LWLockRelease(MultiXactGenLock);
 
 	debug_elog6(DEBUG2,
-				"MultiXact: checkpoint is nextMulti %u, nextOffset %u, oldestMulti %u in DB %u",
+				"MultiXact: checkpoint is nextMulti %u, nextOffset %" PRIu64 ", oldestMulti %u in DB %u",
 				*nextMulti, *nextMultiOffset, *oldestMulti, *oldestMultiDB);
 }
 
@@ -2257,7 +2253,7 @@ void
 MultiXactSetNextMXact(MultiXactId nextMulti,
 					  MultiXactOffset nextMultiOffset)
 {
-	debug_elog4(DEBUG2, "MultiXact: setting next multi to %u offset %u",
+	debug_elog4(DEBUG2, "MultiXact: setting next multi to %u offset %" PRIu64,
 				nextMulti, nextMultiOffset);
 	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
 	MultiXactState->nextMXact = nextMulti;
@@ -2448,7 +2444,7 @@ MultiXactAdvanceNextMXact(MultiXactId minMulti,
 	}
 	if (MultiXactOffsetPrecedes(MultiXactState->nextOffset, minMultiOffset))
 	{
-		debug_elog3(DEBUG2, "MultiXact: setting next offset to %u",
+		debug_elog3(DEBUG2, "MultiXact: setting next offset to %" PRIu64,
 					minMultiOffset);
 		MultiXactState->nextOffset = minMultiOffset;
 	}
@@ -2632,15 +2628,13 @@ GetOldestMultiXactId(void)
 }
 
 /*
- * Determine how aggressively we need to vacuum in order to prevent member
- * wraparound.
+ * Determine whether we need to vacuum to reclaim member space.
  *
  * To do so determine what's the oldest member offset and install the limit
  * info in MultiXactState, where it can be used to prevent overrun of old data
  * in the members SLRU area.
  *
- * The return value is true if emergency autovacuum is required and false
- * otherwise.
+ * The return value is true if autovacuum is required and false otherwise.
  */
 static bool
 SetOffsetVacuumLimit(bool is_startup)
@@ -2652,8 +2646,6 @@ SetOffsetVacuumLimit(bool is_startup)
 	MultiXactOffset nextOffset;
 	bool		oldestOffsetKnown = false;
 	bool		prevOldestOffsetKnown;
-	MultiXactOffset offsetStopLimit = 0;
-	MultiXactOffset prevOffsetStopLimit;
 
 	/*
 	 * NB: Have to prevent concurrent truncation, we might otherwise try to
@@ -2668,7 +2660,6 @@ SetOffsetVacuumLimit(bool is_startup)
 	nextOffset = MultiXactState->nextOffset;
 	prevOldestOffsetKnown = MultiXactState->oldestOffsetKnown;
 	prevOldestOffset = MultiXactState->oldestOffset;
-	prevOffsetStopLimit = MultiXactState->offsetStopLimit;
 	Assert(MultiXactState->finishedStartup);
 	LWLockRelease(MultiXactGenLock);
 
@@ -2699,11 +2690,7 @@ SetOffsetVacuumLimit(bool is_startup)
 		oldestOffsetKnown =
 			find_multixact_start(oldestMultiXactId, &oldestOffset);
 
-		if (oldestOffsetKnown)
-			ereport(DEBUG1,
-					(errmsg_internal("oldest MultiXactId member is at offset %u",
-									 oldestOffset)));
-		else
+		if (!oldestOffsetKnown)
 			ereport(LOG,
 					(errmsg("MultiXact member wraparound protections are disabled because oldest checkpointed MultiXact %u does not exist on disk",
 							oldestMultiXactId)));
@@ -2716,24 +2703,7 @@ SetOffsetVacuumLimit(bool is_startup)
 	 * overrun of old data in the members SLRU area. We can only do so if the
 	 * oldest offset is known though.
 	 */
-	if (oldestOffsetKnown)
-	{
-		/* move back to start of the corresponding segment */
-		offsetStopLimit = oldestOffset - (oldestOffset %
-										  (MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT));
-
-		/* always leave one segment before the wraparound point */
-		offsetStopLimit -= (MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT);
-
-		if (!prevOldestOffsetKnown && !is_startup)
-			ereport(LOG,
-					(errmsg("MultiXact member wraparound protections are now enabled")));
-
-		ereport(DEBUG1,
-				(errmsg_internal("MultiXact member stop limit is now %u based on MultiXact %u",
-								 offsetStopLimit, oldestMultiXactId)));
-	}
-	else if (prevOldestOffsetKnown)
+	if (prevOldestOffsetKnown)
 	{
 		/*
 		 * If we failed to get the oldest offset this time, but we have a
@@ -2743,69 +2713,19 @@ SetOffsetVacuumLimit(bool is_startup)
 		 */
 		oldestOffset = prevOldestOffset;
 		oldestOffsetKnown = true;
-		offsetStopLimit = prevOffsetStopLimit;
 	}
 
 	/* Install the computed values */
 	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
 	MultiXactState->oldestOffset = oldestOffset;
 	MultiXactState->oldestOffsetKnown = oldestOffsetKnown;
-	MultiXactState->offsetStopLimit = offsetStopLimit;
 	LWLockRelease(MultiXactGenLock);
 
 	/*
-	 * Do we need an emergency autovacuum?	If we're not sure, assume yes.
+	 * Do we need autovacuum?	If we're not sure, assume yes.
 	 */
 	return !oldestOffsetKnown ||
-		(nextOffset - oldestOffset > MULTIXACT_MEMBER_SAFE_THRESHOLD);
-}
-
-/*
- * Return whether adding "distance" to "start" would move past "boundary".
- *
- * We use this to determine whether the addition is "wrapping around" the
- * boundary point, hence the name.  The reason we don't want to use the regular
- * 2^31-modulo arithmetic here is that we want to be able to use the whole of
- * the 2^32-1 space here, allowing for more multixacts than would fit
- * otherwise.
- */
-static bool
-MultiXactOffsetWouldWrap(MultiXactOffset boundary, MultiXactOffset start,
-						 uint32 distance)
-{
-	MultiXactOffset finish;
-
-	/*
-	 * Note that offset number 0 is not used (see GetMultiXactIdMembers), so
-	 * if the addition wraps around the UINT_MAX boundary, skip that value.
-	 */
-	finish = start + distance;
-	if (finish < start)
-		finish++;
-
-	/*-----------------------------------------------------------------------
-	 * When the boundary is numerically greater than the starting point, any
-	 * value numerically between the two is not wrapped:
-	 *
-	 *	<----S----B---->
-	 *	[---)			 = F wrapped past B (and UINT_MAX)
-	 *		 [---)		 = F not wrapped
-	 *			  [----] = F wrapped past B
-	 *
-	 * When the boundary is numerically less than the starting point (i.e. the
-	 * UINT_MAX wraparound occurs somewhere in between) then all values in
-	 * between are wrapped:
-	 *
-	 *	<----B----S---->
-	 *	[---)			 = F not wrapped past B (but wrapped past UINT_MAX)
-	 *		 [---)		 = F wrapped past B (and UINT_MAX)
-	 *			  [----] = F not wrapped
-	 *-----------------------------------------------------------------------
-	 */
-	if (start < boundary)
-		return finish >= boundary || finish < start;
-	else
-		return finish >= boundary && finish < start;
+		(nextOffset - oldestOffset > MULTIXACT_MEMBER_AUTOVAC_THRESHOLD);
 }
 
 /*
@@ -2824,7 +2744,6 @@ find_multixact_start(MultiXactId multi, MultiXactOffset *result)
 	int64		pageno;
 	int			entryno;
 	int			slotno;
-	MultiXactOffset *offptr;
 
 	Assert(MultiXactState->finishedStartup);
 
@@ -2842,9 +2761,9 @@ find_multixact_start(MultiXactId multi, MultiXactOffset *result)
 
 	/* lock is acquired by SimpleLruReadPage_ReadOnly */
 	slotno = SimpleLruReadPage_ReadOnly(MultiXactOffsetCtl, pageno, multi);
-	offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
-	offptr += entryno;
-	offset = *offptr;
+
+	offset = MXOffsetRead(entryno, slotno);
+
 	LWLockRelease(SimpleLruGetBankLock(MultiXactOffsetCtl, pageno));
 
 	*result = offset;
@@ -2892,73 +2811,6 @@ GetMultiXactInfo(uint32 *multixacts, MultiXactOffset *members,
 	return true;
 }
 
-/*
- * Multixact members can be removed once the multixacts that refer to them
- * are older than every datminmxid.  autovacuum_multixact_freeze_max_age and
- * vacuum_multixact_freeze_table_age work together to make sure we never have
- * too many multixacts; we hope that, at least under normal circumstances,
- * this will also be sufficient to keep us from using too many offsets.
- * However, if the average multixact has many members, we might exhaust the
- * members space while still using few enough members that these limits fail
- * to trigger relminmxid advancement by VACUUM.  At that point, we'd have no
- * choice but to start failing multixact-creating operations with an error.
- *
- * To prevent that, if more than a threshold portion of the members space is
- * used, we effectively reduce autovacuum_multixact_freeze_max_age and
- * to a value just less than the number of multixacts in use.  We hope that
- * this will quickly trigger autovacuuming on the table or tables with the
- * oldest relminmxid, thus allowing datminmxid values to advance and removing
- * some members.
- *
- * As the fraction of the member space currently in use grows, we become
- * more aggressive in clamping this value.  That not only causes autovacuum
- * to ramp up, but also makes any manual vacuums the user issues more
- * aggressive.  This happens because vacuum_get_cutoffs() will clamp the
- * freeze table and the minimum freeze age cutoffs based on the effective
- * autovacuum_multixact_freeze_max_age this function returns.  In the worst
- * case, we'll claim the freeze_max_age to zero, and every vacuum of any
- * table will freeze every multixact.
- */
-int
-MultiXactMemberFreezeThreshold(void)
-{
-	MultiXactOffset members;
-	uint32		multixacts;
-	uint32		victim_multixacts;
-	double		fraction;
-	int			result;
-	MultiXactId oldestMultiXactId;
-	MultiXactOffset oldestOffset;
-
-	/* If we can't determine member space utilization, assume the worst. */
-	if (!GetMultiXactInfo(&multixacts, &members, &oldestMultiXactId, &oldestOffset))
-		return 0;
-
-	/* If member space utilization is low, no special action is required. */
-	if (members <= MULTIXACT_MEMBER_SAFE_THRESHOLD)
-		return autovacuum_multixact_freeze_max_age;
-
-	/*
-	 * Compute a target for relminmxid advancement.  The number of multixacts
-	 * we try to eliminate from the system is based on how far we are past
-	 * MULTIXACT_MEMBER_SAFE_THRESHOLD.
-	 */
-	fraction = (double) (members - MULTIXACT_MEMBER_SAFE_THRESHOLD) /
-		(MULTIXACT_MEMBER_DANGER_THRESHOLD - MULTIXACT_MEMBER_SAFE_THRESHOLD);
-	victim_multixacts = multixacts * fraction;
-
-	/* fraction could be > 1.0, but lowest possible freeze age is zero */
-	if (victim_multixacts > multixacts)
-		return 0;
-	result = multixacts - victim_multixacts;
-
-	/*
-	 * Clamp to autovacuum_multixact_freeze_max_age, so that we never make
-	 * autovacuum less aggressive than it would otherwise be.
-	 */
-	return Min(result, autovacuum_multixact_freeze_max_age);
-}
-
 typedef struct mxtruncinfo
 {
 	int64		earliestExistingPage;
@@ -3158,7 +3010,7 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 
 	elog(DEBUG1, "performing multixact truncation: "
 		 "offsets [%u, %u), offsets segments [%" PRIx64 ", %" PRIx64 "), "
-		 "members [%u, %u), members segments [%" PRIx64 ", %" PRIx64 ")",
+		 "members [%" PRIu64 ", %" PRIu64 "), members segments [%" PRIx64 ", %" PRIx64 ")",
 		 oldestMulti, newOldestMulti,
 		 MultiXactIdToOffsetSegment(oldestMulti),
 		 MultiXactIdToOffsetSegment(newOldestMulti),
@@ -3289,7 +3141,7 @@ MultiXactIdPrecedesOrEquals(MultiXactId multi1, MultiXactId multi2)
 static bool
 MultiXactOffsetPrecedes(MultiXactOffset offset1, MultiXactOffset offset2)
 {
-	int32		diff = (int32) (offset1 - offset2);
+	int64		diff = (int64) (offset1 - offset2);
 
 	return (diff < 0);
 }
@@ -3386,7 +3238,7 @@ multixact_redo(XLogReaderState *record)
 
 		elog(DEBUG1, "replaying multixact truncation: "
 			 "offsets [%u, %u), offsets segments [%" PRIx64 ", %" PRIx64 "), "
-			 "members [%u, %u), members segments [%" PRIx64 ", %" PRIx64 ")",
+			 "members [%" PRIu64 ", %" PRIu64 "), members segments [%" PRIx64 ", %" PRIx64 ")",
 			 xlrec.startTruncOff, xlrec.endTruncOff,
 			 MultiXactIdToOffsetSegment(xlrec.startTruncOff),
 			 MultiXactIdToOffsetSegment(xlrec.endTruncOff),
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 346319338a..3af08d579a 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -884,7 +884,7 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 							 U64FromFullTransactionId(checkPoint.nextXid),
 							 checkPoint.nextOid)));
 	ereport(DEBUG1,
-			(errmsg_internal("next MultiXactId: %u; next MultiXactOffset: %u",
+			(errmsg_internal("next MultiXactId: %u; next MultiXactOffset: %" PRIu64,
 							 checkPoint.nextMulti, checkPoint.nextMultiOffset)));
 	ereport(DEBUG1,
 			(errmsg_internal("oldest unfrozen transaction ID: %u, in database %u",
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 733ef40ae7..8f5092670b 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -1153,7 +1153,7 @@ vacuum_get_cutoffs(Relation rel, const VacuumParams params,
 	 * normally autovacuum_multixact_freeze_max_age, but may be less if we are
 	 * short of multixact member space.
 	 */
-	effective_multixact_freeze_max_age = MultiXactMemberFreezeThreshold();
+	effective_multixact_freeze_max_age = autovacuum_multixact_freeze_max_age;
 
 	/*
 	 * Almost ready to set freeze output parameters; check if OldestXmin or
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index dce4c8c45b..9bf03734c2 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -1150,7 +1150,7 @@ do_start_worker(void)
 
 	/* Also determine the oldest datminmxid we will consider. */
 	recentMulti = ReadNextMultiXactId();
-	multiForceLimit = recentMulti - MultiXactMemberFreezeThreshold();
+	multiForceLimit = recentMulti - autovacuum_multixact_freeze_max_age;
 	if (multiForceLimit < FirstMultiXactId)
 		multiForceLimit -= FirstMultiXactId;
 
@@ -1938,7 +1938,7 @@ do_autovacuum(void)
 	 * normally autovacuum_multixact_freeze_max_age, but may be less if we are
 	 * short of multixact member space.
 	 */
-	effective_multixact_freeze_max_age = MultiXactMemberFreezeThreshold();
+	effective_multixact_freeze_max_age = autovacuum_multixact_freeze_max_age;
 
 	/*
 	 * Find the pg_database entry and select the default freeze ages. We use
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index 10de058ce9..5295108ade 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -264,7 +264,7 @@ main(int argc, char *argv[])
 		   ControlFile->checkPointCopy.nextOid);
 	printf(_("Latest checkpoint's NextMultiXactId:  %u\n"),
 		   ControlFile->checkPointCopy.nextMulti);
-	printf(_("Latest checkpoint's NextMultiOffset:  %u\n"),
+	printf(_("Latest checkpoint's NextMultiOffset:  %" PRIu64 "\n"),
 		   ControlFile->checkPointCopy.nextMultiOffset);
 	printf(_("Latest checkpoint's oldestXID:        %u\n"),
 		   ControlFile->checkPointCopy.oldestXid);
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index 7a4e4eb957..646ab1b80d 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -266,7 +266,7 @@ main(int argc, char *argv[])
 
 			case 'O':
 				errno = 0;
-				set_mxoff = strtoul(optarg, &endptr, 0);
+				set_mxoff = strtou64(optarg, &endptr, 0);
 				if (endptr == optarg || *endptr != '\0' || errno != 0)
 				{
 					pg_log_error("invalid argument for option %s", "-O");
@@ -759,7 +759,7 @@ PrintControlValues(bool guessed)
 		   ControlFile.checkPointCopy.nextOid);
 	printf(_("Latest checkpoint's NextMultiXactId:  %u\n"),
 		   ControlFile.checkPointCopy.nextMulti);
-	printf(_("Latest checkpoint's NextMultiOffset:  %u\n"),
+	printf(_("Latest checkpoint's NextMultiOffset:  %" PRIu64 "\n"),
 		   ControlFile.checkPointCopy.nextMultiOffset);
 	printf(_("Latest checkpoint's oldestXID:        %u\n"),
 		   ControlFile.checkPointCopy.oldestXid);
@@ -833,7 +833,7 @@ PrintNewControlValues(void)
 
 	if (set_mxoff != -1)
 	{
-		printf(_("NextMultiOffset:                      %u\n"),
+		printf(_("NextMultiOffset:                      %" PRIu64 "\n"),
 			   ControlFile.checkPointCopy.nextMultiOffset);
 	}
 
diff --git a/src/bin/pg_resetwal/t/001_basic.pl b/src/bin/pg_resetwal/t/001_basic.pl
index d6bbbd0ced..cc89e0764a 100644
--- a/src/bin/pg_resetwal/t/001_basic.pl
+++ b/src/bin/pg_resetwal/t/001_basic.pl
@@ -213,7 +213,7 @@ push @cmd,
   sprintf("%d,%d", hex($files[0]) == 0 ? 3 : hex($files[0]), hex($files[-1]));
 
 @files = get_slru_files('pg_multixact/offsets');
-$mult = 32 * $blcksz / 4;
+$mult = 32 * $blcksz / 8;
 # --multixact-ids argument is "new,old"
 push @cmd,
   '--multixact-ids' => sprintf("%d,%d",
diff --git a/src/include/access/multixact.h b/src/include/access/multixact.h
index 82e4bb90dd..16a0772308 100644
--- a/src/include/access/multixact.h
+++ b/src/include/access/multixact.h
@@ -28,7 +28,7 @@
 
 #define MultiXactIdIsValid(multi) ((multi) != InvalidMultiXactId)
 
-#define MaxMultiXactOffset	((MultiXactOffset) 0xFFFFFFFF)
+#define MaxMultiXactOffset	UINT64CONST(0xFFFFFFFFFFFFFFFF)
 
 /*
  * Possible multixact lock modes ("status").  The first four modes are for
@@ -147,7 +147,6 @@ extern void MultiXactSetNextMXact(MultiXactId nextMulti,
 extern void MultiXactAdvanceNextMXact(MultiXactId minMulti,
 									  MultiXactOffset minMultiOffset);
 extern void MultiXactAdvanceOldest(MultiXactId oldestMulti, Oid oldestMultiDB);
-extern int	MultiXactMemberFreezeThreshold(void);
 
 extern void multixact_twophase_recover(FullTransactionId fxid, uint16 info,
 									   void *recdata, uint32 len);
diff --git a/src/include/c.h b/src/include/c.h
index 39022f8a9d..de9ac13be7 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -641,7 +641,7 @@ typedef uint32 SubTransactionId;
 /* MultiXactId must be equivalent to TransactionId, to fit in t_xmax */
 typedef TransactionId MultiXactId;
 
-typedef uint32 MultiXactOffset;
+typedef uint64 MultiXactOffset;
 
 typedef uint32 CommandId;
 
-- 
2.43.0

#47wenhui qiu
qiuwenhuifx@gmail.com
In reply to: Maxim Orlov (#46)
Re: POC: make mxidoff 64 bits

Hi Maxim
Thank you for your feedback. As I remember, the primary challenges for
this patch are in the upgrade section, where we're still debating how to
address a few edge cases. Right now, this thread is missing input from core
code contributors.

Thanks

On Fri, Sep 19, 2025 at 12:37 AM Maxim Orlov <orlovmg@gmail.com> wrote:

On Tue, 16 Sept 2025 at 15:12, wenhui qiu <qiuwenhuifx@gmail.com> wrote:

Agree, +1, but I have a question: I remember the XID64 patch got split
into a few threads. How are these threads related? The original one was
seen as too big a change, so it was broken up after people raised
concerns.

Yeah, you're absolutely correct. This thread is a part of the overall
work on the transition to XID64 in Postgres. As far as I remember, no
one explicitly raised concerns, but it's clear to me that it won't be
committed as one big patch set.

Here is a WIP patch @ 8191e0c16a for the issue discussed above.
I also had to merge several patches from the previous set, since the
consensus is to use the PRI* format for outputting 64-bit values, and a
separate patch that only changed the format with a cast and %llu lost
its meaning.
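
As a small illustration of the PRI* convention mentioned above, the sketch
below formats a 64-bit offset with PRIu64 from <inttypes.h> and, for
comparison, with a cast to unsigned long long and %llu. The helper names
are hypothetical and do not come from the patch; only the PRIu64 macro
itself is standard C.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: format a 64-bit offset using the PRIu64 macro. */
static int
format_offset_pri(char *buf, size_t buflen, uint64_t offset)
{
	assert(buf != NULL && buflen > 0);
	return snprintf(buf, buflen, "next MultiXactOffset: %" PRIu64, offset);
}

/* Hypothetical helper: the older style, with an explicit cast and %llu. */
static int
format_offset_cast(char *buf, size_t buflen, uint64_t offset)
{
	assert(buf != NULL && buflen > 0);
	return snprintf(buf, buflen, "next MultiXactOffset: %llu",
					(unsigned long long) offset);
}
```

Both helpers should produce the same text; the PRIu64 form simply avoids
the cast at every call site.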

If this option suits everyone, the next step is to add a part related
to the pg_upgrade.

--
Best regards,
Maxim Orlov.

#48Maxim Orlov
orlovmg@gmail.com
In reply to: wenhui qiu (#47)
3 attachment(s)
Re: POC: make mxidoff 64 bits

Here is a new patch set @ 10b5bb3bffaee8

As previously stated, the patch set implements the concept of storing the
"difference" between each offset on a page and that page's 64-bit base
offset, rather than full 64-bit values, in order to save disc space.
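
A minimal standalone sketch of that base-plus-delta layout, assuming an
8 kB page and mirroring the idea of MXOffsetRead/MXOffsetWrite from the
patch. The type and function names here (SketchPage, sketch_read,
sketch_write, sketch_page_init) are hypothetical, and the code works on a
plain struct rather than an SLRU buffer.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_BLCKSZ	8192	/* assumed page size for this sketch */
/* Entries per page: one 64-bit base, the rest of the page as 32-bit deltas. */
#define SKETCH_ENTRIES	\
	((int) ((SKETCH_BLCKSZ - sizeof(uint64_t)) / sizeof(uint32_t)))

typedef struct SketchPage
{
	uint64_t	base;						/* entry 0, full 64-bit offset */
	uint32_t	delta[SKETCH_ENTRIES - 1];	/* entries 1..N-1, relative to base */
} SketchPage;

/* Zero the whole page, as a freshly extended page would be. */
static void
sketch_page_init(SketchPage *page)
{
	memset(page, 0, sizeof(*page));
}

/* Return the 64-bit offset stored for the given entry. */
static uint64_t
sketch_read(const SketchPage *page, int entryno)
{
	assert(entryno >= 0 && entryno < SKETCH_ENTRIES);
	if (entryno == 0)
		return page->base;
	return page->base + page->delta[entryno - 1];
}

/* Store a 64-bit offset; entries after the first keep only the delta. */
static void
sketch_write(SketchPage *page, int entryno, uint64_t offset)
{
	assert(entryno >= 0 && entryno < SKETCH_ENTRIES);
	if (entryno == 0)
	{
		page->base = offset;
		return;
	}
	/* The scheme relies on the distance from the base fitting in 32 bits. */
	assert(offset >= page->base && offset - page->base <= UINT32_MAX);
	page->delta[entryno - 1] = (uint32_t) (offset - page->base);
}
```

Note that in this sketch an unwritten delta slot reads back as the base
value rather than zero, so a caller that treats zero as "not yet written"
would need a separate check.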

The second significant change was that I decided to modify the
pg_upgrade portion as suggested by Heikki above.

At the very least, the logic becomes much simpler, completely
replicates what is done in the multixact.c module, and provides some
optimization.

Perhaps this will be easier to comprehend and analyse than attempting
to correctly convert bytes from one format to another.

In the near future, I intend to focus on testing and on implementing a
test suite.

--
Best regards,
Maxim Orlov.

Attachments:

v18-0003-TEST-bump-catversion.patch.txt (text/plain; charset=US-ASCII)
From 8bfa7e2365a0ce7fe4d30f84efbd5b1636b7740e Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Fri, 24 Oct 2025 11:47:50 +0300
Subject: [PATCH v18 3/3] TEST: bump catversion

To avoid constant CF-bot complaints, make the catversion bump in a
separate commit.

NOTE: keep it in sync with MULTIXACTOFFSET_FORMATCHANGE_CAT_VER
---
 src/include/catalog/catversion.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h
index 1b0b16a343f..6a13fa3cdb0 100644
--- a/src/include/catalog/catversion.h
+++ b/src/include/catalog/catversion.h
@@ -57,6 +57,6 @@
  */
 
 /*							yyyymmddN */
-#define CATALOG_VERSION_NO	202510221
+#define CATALOG_VERSION_NO	999999999
 
 #endif
-- 
2.51.0

v18-0002-Add-pg_upgarde-for-64-bit-multixact-offset.patch (application/octet-stream)
From 5ccc29bb23bfff509596bb2f251e420bb3d45ba0 Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Fri, 24 Oct 2025 10:58:37 +0300
Subject: [PATCH v18 2/3] Add pg_upgrade for 64 bit multixact offset

Author: Maxim Orlov <orlovmg@gmail.com>
Author: Heikki Linnakangas <hlinnaka@iki.fi>
---
 src/backend/access/transam/multixact.c |  35 +--
 src/bin/pg_upgrade/Makefile            |   3 +
 src/bin/pg_upgrade/meson.build         |   3 +
 src/bin/pg_upgrade/multixact_new.c     | 245 ++++++++++++++++++++
 src/bin/pg_upgrade/multixact_new.h     |  31 +++
 src/bin/pg_upgrade/multixact_old.c     | 296 +++++++++++++++++++++++++
 src/bin/pg_upgrade/multixact_old.h     |  31 +++
 src/bin/pg_upgrade/pg_upgrade.c        | 108 ++++++++-
 src/bin/pg_upgrade/pg_upgrade.h        |   5 +
 src/bin/pg_upgrade/slru_io.c           | 228 +++++++++++++++++++
 src/bin/pg_upgrade/slru_io.h           |  30 +++
 11 files changed, 983 insertions(+), 32 deletions(-)
 create mode 100644 src/bin/pg_upgrade/multixact_new.c
 create mode 100644 src/bin/pg_upgrade/multixact_new.h
 create mode 100644 src/bin/pg_upgrade/multixact_old.c
 create mode 100644 src/bin/pg_upgrade/multixact_old.h
 create mode 100644 src/bin/pg_upgrade/slru_io.c
 create mode 100644 src/bin/pg_upgrade/slru_io.h

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index e7861614bec..2d44c781f93 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1276,7 +1276,6 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
 	int			slotno;
 	MultiXactOffset offset;
 	int			length;
-	int			truelength;
 	MultiXactId oldestMXact;
 	MultiXactId nextMXact;
 	MultiXactId tmpMXact;
@@ -1375,15 +1374,6 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
 	 * we have just for this; the process in charge will signal the CV as soon
 	 * as it has finished writing the multixact offset.
 	 *
-	 * 3. Because GetNewMultiXactId increments offset zero to offset one to
-	 * handle case #2, there is an ambiguity near the point of offset
-	 * wraparound.  If we see next multixact's offset is one, is that our
-	 * multixact's actual endpoint, or did it end at zero with a subsequent
-	 * increment?  We handle this using the knowledge that if the zero'th
-	 * member slot wasn't filled, it'll contain zero, and zero isn't a valid
-	 * transaction ID so it can't be a multixact member.  Therefore, if we
-	 * read a zero from the members array, just ignore it.
-	 *
 	 * This is all pretty messy, but the mess occurs only in infrequent corner
 	 * cases, so it seems better than holding the MultiXactGenLock for a long
 	 * time on every multixact creation.
@@ -1467,6 +1457,9 @@ retry:
 	LWLockRelease(lock);
 	lock = NULL;
 
+	/* A multixid with zero members should not happen */
+	Assert(length > 0);
+
 	/*
 	 * If we slept above, clean up state; it's no longer needed.
 	 */
@@ -1475,7 +1468,6 @@ retry:
 
 	ptr = (MultiXactMember *) palloc(length * sizeof(MultiXactMember));
 
-	truelength = 0;
 	prev_pageno = -1;
 	for (int i = 0; i < length; i++, offset++)
 	{
@@ -1513,36 +1505,27 @@ retry:
 		xactptr = (TransactionId *)
 			(MultiXactMemberCtl->shared->page_buffer[slotno] + memberoff);
 
-		if (!TransactionIdIsValid(*xactptr))
-		{
-			/* Corner case 3: we must be looking at unused slot zero */
-			Assert(offset == 0);
-			continue;
-		}
+		Assert(TransactionIdIsValid(*xactptr));
 
 		flagsoff = MXOffsetToFlagsOffset(offset);
 		bshift = MXOffsetToFlagsBitShift(offset);
 		flagsptr = (uint32 *) (MultiXactMemberCtl->shared->page_buffer[slotno] + flagsoff);
 
-		ptr[truelength].xid = *xactptr;
-		ptr[truelength].status = (*flagsptr >> bshift) & MXACT_MEMBER_XACT_BITMASK;
-		truelength++;
+		ptr[i].xid = *xactptr;
+		ptr[i].status = (*flagsptr >> bshift) & MXACT_MEMBER_XACT_BITMASK;
 	}
 
 	LWLockRelease(lock);
 
-	/* A multixid with zero members should not happen */
-	Assert(truelength > 0);
-
 	/*
 	 * Copy the result into the local cache.
 	 */
-	mXactCachePut(multi, truelength, ptr);
+	mXactCachePut(multi, length, ptr);
 
 	debug_elog3(DEBUG2, "GetMembers: no cache for %s",
-				mxid_to_string(multi, truelength, ptr));
+				mxid_to_string(multi, length, ptr));
 	*members = ptr;
-	return truelength;
+	return length;
 }
 
 /*
diff --git a/src/bin/pg_upgrade/Makefile b/src/bin/pg_upgrade/Makefile
index 69fcf593cae..42995d53b0b 100644
--- a/src/bin/pg_upgrade/Makefile
+++ b/src/bin/pg_upgrade/Makefile
@@ -18,11 +18,14 @@ OBJS = \
 	file.o \
 	function.o \
 	info.o \
+	multixact_new.o \
+	multixact_old.o \
 	option.o \
 	parallel.o \
 	pg_upgrade.o \
 	relfilenumber.o \
 	server.o \
+	slru_io.o \
 	tablespace.o \
 	task.o \
 	util.o \
diff --git a/src/bin/pg_upgrade/meson.build b/src/bin/pg_upgrade/meson.build
index ac992f0d14b..3e46c4512cf 100644
--- a/src/bin/pg_upgrade/meson.build
+++ b/src/bin/pg_upgrade/meson.build
@@ -8,11 +8,14 @@ pg_upgrade_sources = files(
   'file.c',
   'function.c',
   'info.c',
+  'multixact_new.c',
+  'multixact_old.c',
   'option.c',
   'parallel.c',
   'pg_upgrade.c',
   'relfilenumber.c',
   'server.c',
+  'slru_io.c',
   'tablespace.c',
   'task.c',
   'util.c',
diff --git a/src/bin/pg_upgrade/multixact_new.c b/src/bin/pg_upgrade/multixact_new.c
new file mode 100644
index 00000000000..d6f2e38c287
--- /dev/null
+++ b/src/bin/pg_upgrade/multixact_new.c
@@ -0,0 +1,245 @@
+/*
+ * multixact_new.c
+ *
+ * Rewrite pre-v19 multixacts to new format with 64-bit MultiXactOffsets
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ * src/bin/pg_upgrade/multixact_new.c
+ */
+
+#include "multixact_new.h"
+
+/*
+ * NOTE: Below are a bunch of definitions and simple inline functions that are
+ * copy-pasted from multixact.c
+ */
+typedef uint32 ShortMultiXactOffset;
+
+/* We need four bytes per offset, 8 bytes for the base */
+#define MULTIXACT_OFFSETS_PER_PAGE		\
+	((BLCKSZ - sizeof(MultiXactOffset)) / sizeof(ShortMultiXactOffset))
+
+static inline int64
+MultiXactIdToOffsetPage(MultiXactId multi)
+{
+	return multi / MULTIXACT_OFFSETS_PER_PAGE;
+}
+
+static inline int
+MultiXactIdToOffsetEntry(MultiXactId multi)
+{
+	return multi % MULTIXACT_OFFSETS_PER_PAGE;
+}
+
+/* We need eight bits per xact, so one xact fits in a byte */
+#define MXACT_MEMBER_BITS_PER_XACT			8
+#define MXACT_MEMBER_FLAGS_PER_BYTE			1
+#define MXACT_MEMBER_XACT_BITMASK	((1 << MXACT_MEMBER_BITS_PER_XACT) - 1)
+
+/* how many full bytes of flags are there in a group? */
+#define MULTIXACT_FLAGBYTES_PER_GROUP		4
+#define MULTIXACT_MEMBERS_PER_MEMBERGROUP	\
+	(MULTIXACT_FLAGBYTES_PER_GROUP * MXACT_MEMBER_FLAGS_PER_BYTE)
+/* size in bytes of a complete group */
+#define MULTIXACT_MEMBERGROUP_SIZE \
+	(sizeof(TransactionId) * MULTIXACT_MEMBERS_PER_MEMBERGROUP + MULTIXACT_FLAGBYTES_PER_GROUP)
+#define MULTIXACT_MEMBERGROUPS_PER_PAGE (BLCKSZ / MULTIXACT_MEMBERGROUP_SIZE)
+#define MULTIXACT_MEMBERS_PER_PAGE	\
+	(MULTIXACT_MEMBERGROUPS_PER_PAGE * MULTIXACT_MEMBERS_PER_MEMBERGROUP)
+
+/*
+ * Because the number of items per page is not a divisor of the last item
+ * number (member 0xFFFFFFFF), the last segment does not use the maximum number
+ * of pages, and moreover the last used page therein does not use the same
+ * number of items as previous pages.  (Another way to say it is that the
+ * 0xFFFFFFFF member is somewhere in the middle of the last page, so the page
+ * has some empty space after that item.)
+ *
+ * This constant is the number of members in the last page of the last segment.
+ */
+#define MAX_MEMBERS_IN_LAST_MEMBERS_PAGE \
+		((uint32) ((0xFFFFFFFF % MULTIXACT_MEMBERS_PER_PAGE) + 1))
+
+/* page in which a member is to be found */
+static inline int64
+MXOffsetToMemberPage(MultiXactOffset offset)
+{
+	return offset / MULTIXACT_MEMBERS_PER_PAGE;
+}
+
+/* Location (byte offset within page) of flag word for a given member */
+static inline int
+MXOffsetToFlagsOffset(MultiXactOffset offset)
+{
+	MultiXactOffset group = offset / MULTIXACT_MEMBERS_PER_MEMBERGROUP;
+	int			grouponpg = group % MULTIXACT_MEMBERGROUPS_PER_PAGE;
+	int			byteoff = grouponpg * MULTIXACT_MEMBERGROUP_SIZE;
+
+	return byteoff;
+}
+
+static inline int
+MXOffsetToFlagsBitShift(MultiXactOffset offset)
+{
+	int			member_in_group = offset % MULTIXACT_MEMBERS_PER_MEMBERGROUP;
+	int			bshift = member_in_group * MXACT_MEMBER_BITS_PER_XACT;
+
+	return bshift;
+}
+
+/* Location (byte offset within page) of TransactionId of given member */
+static inline int
+MXOffsetToMemberOffset(MultiXactOffset offset)
+{
+	int			member_in_group = offset % MULTIXACT_MEMBERS_PER_MEMBERGROUP;
+
+	return MXOffsetToFlagsOffset(offset) +
+		MULTIXACT_FLAGBYTES_PER_GROUP +
+		member_in_group * sizeof(TransactionId);
+}
+
+static inline void
+MXOffsetWrite(char *buf, int entryno, MultiXactOffset offset)
+{
+	MultiXactOffset			*offptr;
+	ShortMultiXactOffset	*off32ptr;
+
+	offptr = (MultiXactOffset *) buf;
+	if (entryno != 0)
+	{
+		off32ptr = (ShortMultiXactOffset *) (offptr + 1);	/* bypass base */
+		off32ptr += entryno - 1;
+		*off32ptr = (ShortMultiXactOffset) (offset - *offptr);
+
+		return;
+	}
+
+	/*
+	 * The first offset on the page is stored as a full 64-bit value.  Every
+	 * other offset on the page is stored as a 32-bit delta that is added to
+	 * this base value.
+	 */
+	*offptr = offset;
+}
+
+MultiXactWriter *
+AllocMultiXactWrite(char *pgdata, MultiXactId firstMulti,
+					MultiXactOffset firstOffset)
+{
+	MultiXactWriter *state = pg_malloc(sizeof(*state));
+	char				dir[MAXPGPATH] = {0};
+
+	state->nextMXact = firstMulti;
+	state->nextOffset = firstOffset;
+
+	pg_sprintf(dir, "%s/pg_multixact/offsets", pgdata);
+	state->offset = AllocSlruWrite(dir);
+
+	pg_sprintf(dir, "%s/pg_multixact/members", pgdata);
+	state->members = AllocSlruWrite(dir);
+
+	return state;
+}
+
+/*
+ * Simplified copy of the corresponding server function
+ */
+MultiXactId
+GetNewMultiXactId(MultiXactWriter *state, int nmembers, MultiXactOffset *offset)
+{
+	MultiXactId		result;
+
+	/* Handle wraparound of the nextMXact counter */
+	if (state->nextMXact < FirstMultiXactId)
+		state->nextMXact = FirstMultiXactId;
+
+	/* Assign the MXID */
+	result = state->nextMXact;
+
+	/* Reserve the members space, similarly to above. */
+	*offset = state->nextOffset;
+
+	/*
+	 * Advance counters.  As in GetNewTransactionId(), this must not happen
+	 * until after file extension has succeeded!
+	 *
+	 * We don't care about MultiXactId wraparound here; it will be handled by
+	 * the next iteration.  But note that nextMXact may be InvalidMultiXactId
+	 * or the first value on a segment-beginning page after this routine
+	 * exits, so anyone else looking at the variable must be prepared to deal
+	 * with either case.  Similarly, nextOffset may be zero, but we won't use
+	 * that as the actual start offset of the next multixact.
+	 */
+	(state->nextMXact)++;
+
+	state->nextOffset += nmembers;
+
+	return result;
+}
+
+/*
+ * Write a new multixact with members.
+ *
+ * Simplified version of the corresponding server function, hence the name.
+ */
+void
+RecordNewMultiXact(MultiXactWriter *state, MultiXactOffset offset,
+				   MultiXactId multi, int nmembers, MultiXactMember *members)
+{
+	int64		pageno;
+	int64		prev_pageno;
+	int			entryno,
+				i;
+	char	   *buf;
+
+	pageno = MultiXactIdToOffsetPage(multi);
+	entryno = MultiXactIdToOffsetEntry(multi);
+
+	buf = SlruWriteSwitchPage(state->offset, pageno);
+	MXOffsetWrite(buf, entryno, offset);
+
+	prev_pageno = -1;
+
+	for (i = 0; i < nmembers; i++, offset++)
+	{
+		TransactionId *memberptr;
+		uint32	   *flagsptr;
+		uint32		flagsval;
+		int			bshift;
+		int			flagsoff;
+		int			memberoff;
+
+		Assert(members[i].status <= MultiXactStatusUpdate);
+
+		pageno = MXOffsetToMemberPage(offset);
+		memberoff = MXOffsetToMemberOffset(offset);
+		flagsoff = MXOffsetToFlagsOffset(offset);
+		bshift = MXOffsetToFlagsBitShift(offset);
+
+		if (pageno != prev_pageno)
+		{
+			buf = SlruWriteSwitchPage(state->members, pageno);
+			prev_pageno = pageno;
+		}
+
+		memberptr = (TransactionId *) (buf + memberoff);
+
+		*memberptr = members[i].xid;
+
+		flagsptr = (uint32 *) (buf + flagsoff);
+
+		flagsval = *flagsptr;
+		flagsval &= ~(((1 << MXACT_MEMBER_BITS_PER_XACT) - 1) << bshift);
+		flagsval |= (members[i].status << bshift);
+		*flagsptr = flagsval;
+	}
+}
+
+void
+FreeMultiXactWrite(MultiXactWriter *state)
+{
+	FreeSlruWrite(state->offset);
+	FreeSlruWrite(state->members);
+
+	pfree(state);
+}
diff --git a/src/bin/pg_upgrade/multixact_new.h b/src/bin/pg_upgrade/multixact_new.h
new file mode 100644
index 00000000000..33d5d1b8222
--- /dev/null
+++ b/src/bin/pg_upgrade/multixact_new.h
@@ -0,0 +1,31 @@
+/*
+ * multixact_new.h
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ * src/bin/pg_upgrade/multixact_new.h
+ */
+
+#include "postgres_fe.h"
+
+#include "access/multixact.h"
+
+#include "slru_io.h"
+
+typedef struct MultiXactWriter
+{
+	MultiXactId			nextMXact;
+	MultiXactOffset		nextOffset;
+
+	SlruSegState	   *offset;
+	SlruSegState	   *members;
+} MultiXactWriter;
+
+extern MultiXactWriter *AllocMultiXactWrite(char *pgdata,
+											MultiXactId firstMulti,
+											MultiXactOffset firstOffset);
+extern MultiXactId GetNewMultiXactId(MultiXactWriter *state, int nmembers,
+									 MultiXactOffset *offset);
+extern void RecordNewMultiXact(MultiXactWriter *state, MultiXactOffset offset,
+							   MultiXactId multi, int nmembers,
+							   MultiXactMember *members);
+extern void FreeMultiXactWrite(MultiXactWriter *writer);
diff --git a/src/bin/pg_upgrade/multixact_old.c b/src/bin/pg_upgrade/multixact_old.c
new file mode 100644
index 00000000000..6cc384d2cf2
--- /dev/null
+++ b/src/bin/pg_upgrade/multixact_old.c
@@ -0,0 +1,296 @@
+/*
+ * multixact_old.c
+ *
+ * Read pre-v19 multixacts stored with 32-bit MultiXactOffsets
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ * src/bin/pg_upgrade/multixact_old.c
+ */
+
+#include "multixact_old.h"
+
+#include "pg_upgrade.h"
+
+/*
+ * NOTE: below are a bunch of definitions and simple static inline functions
+ * that are copy-pasted from multixact.c from version 18.  The only difference
+ * is that we use the OldMultiXactOffset type equal to uint32 instead of
+ * MultiXactOffset which became uint64.
+ */
+
+/* We need four bytes per offset; the old format has no per-page base. */
+#define MULTIXACT_OFFSETS_PER_PAGE (BLCKSZ / sizeof(OldMultiXactOffset))
+
+static inline int64
+MultiXactIdToOffsetPage(MultiXactId multi)
+{
+	return multi / MULTIXACT_OFFSETS_PER_PAGE;
+}
+
+static inline int
+MultiXactIdToOffsetEntry(MultiXactId multi)
+{
+	return multi % MULTIXACT_OFFSETS_PER_PAGE;
+}
+
+/*
+ * The situation for members is a bit more complex: we store one byte of
+ * additional flag bits for each TransactionId.  To do this without getting
+ * into alignment issues, we store four bytes of flags, and then the
+ * corresponding 4 Xids.  Each such 5-word (20-byte) set we call a "group", and
+ * are stored as a whole in pages.  Thus, with 8kB BLCKSZ, we keep 409 groups
+ * per page.  This wastes 12 bytes per page, but that's OK -- simplicity (and
+ * performance) trumps space efficiency here.
+ *
+ * Note that the "offset" macros work with byte offset, not array indexes, so
+ * arithmetic must be done using "char *" pointers.
+ */
+/* We need eight bits per xact, so one xact fits in a byte */
+#define MXACT_MEMBER_BITS_PER_XACT			8
+#define MXACT_MEMBER_FLAGS_PER_BYTE			1
+#define MXACT_MEMBER_XACT_BITMASK	((1 << MXACT_MEMBER_BITS_PER_XACT) - 1)
+
+/* how many full bytes of flags are there in a group? */
+#define MULTIXACT_FLAGBYTES_PER_GROUP		4
+#define MULTIXACT_MEMBERS_PER_MEMBERGROUP	\
+	(MULTIXACT_FLAGBYTES_PER_GROUP * MXACT_MEMBER_FLAGS_PER_BYTE)
+/* size in bytes of a complete group */
+#define MULTIXACT_MEMBERGROUP_SIZE \
+	(sizeof(TransactionId) * MULTIXACT_MEMBERS_PER_MEMBERGROUP + MULTIXACT_FLAGBYTES_PER_GROUP)
+#define MULTIXACT_MEMBERGROUPS_PER_PAGE (BLCKSZ / MULTIXACT_MEMBERGROUP_SIZE)
+#define MULTIXACT_MEMBERS_PER_PAGE	\
+	(MULTIXACT_MEMBERGROUPS_PER_PAGE * MULTIXACT_MEMBERS_PER_MEMBERGROUP)
+
+/* page in which a member is to be found */
+static inline int64
+MXOffsetToMemberPage(OldMultiXactOffset offset)
+{
+	return offset / MULTIXACT_MEMBERS_PER_PAGE;
+}
+
+/* Location (byte offset within page) of flag word for a given member */
+static inline int
+MXOffsetToFlagsOffset(OldMultiXactOffset offset)
+{
+	OldMultiXactOffset group = offset / MULTIXACT_MEMBERS_PER_MEMBERGROUP;
+	int			grouponpg = group % MULTIXACT_MEMBERGROUPS_PER_PAGE;
+	int			byteoff = grouponpg * MULTIXACT_MEMBERGROUP_SIZE;
+
+	return byteoff;
+}
+
+/* Location (byte offset within page) of TransactionId of given member */
+static inline int
+MXOffsetToMemberOffset(OldMultiXactOffset offset)
+{
+	int			member_in_group = offset % MULTIXACT_MEMBERS_PER_MEMBERGROUP;
+
+	return MXOffsetToFlagsOffset(offset) +
+		MULTIXACT_FLAGBYTES_PER_GROUP +
+		member_in_group * sizeof(TransactionId);
+}
+
+static inline int
+MXOffsetToFlagsBitShift(OldMultiXactOffset offset)
+{
+	int			member_in_group = offset % MULTIXACT_MEMBERS_PER_MEMBERGROUP;
+	int			bshift = member_in_group * MXACT_MEMBER_BITS_PER_XACT;
+
+	return bshift;
+}
+
+/*
+ * Construct reader of old multixacts.
+ *
+ * Returns the malloced memory used by all the other calls in this module.
+ */
+OldMultiXactReader *
+AllocOldMultiXactRead(char *pgdata, MultiXactId nextMulti,
+					  OldMultiXactOffset nextOffset)
+{
+	OldMultiXactReader *state = pg_malloc(sizeof(*state));
+	char				dir[MAXPGPATH] = {0};
+
+	state->nextMXact = nextMulti;
+	state->nextOffset = nextOffset;
+
+	pg_sprintf(dir, "%s/pg_multixact/offsets", pgdata);
+	state->offset = AllocSlruRead(dir);
+
+	pg_sprintf(dir, "%s/pg_multixact/members", pgdata);
+	state->members = AllocSlruRead(dir);
+
+	return state;
+}
+
+/*
+ * This is a simplified version of the GetMultiXactIdMembers() server function.
+ *
+ * - Only return the updating member, if any. Upgrade only cares about the
+ *   updaters. If there is no updating member, return the first locking-only
+ *   member. We don't have any way to represent "no members", but we also don't
+ *   need to preserve all the locking members.
+ *
+ * - We don't need to worry about locking and some corner cases because there's
+ *   no concurrent activity.
+ */
+void
+GetOldMultiXactIdSingleMember(OldMultiXactReader *state, MultiXactId multi,
+							  TransactionId *result, MultiXactStatus *status)
+{
+	MultiXactId		nextMXact,
+					tmpMXact;
+	OldMultiXactOffset nextOffset;
+	int64			pageno,
+					prev_pageno;
+	int				entryno,
+					length;
+	char		   *buf;
+	OldMultiXactOffset *offptr,
+						offset;
+	TransactionId	result_xid = InvalidTransactionId;
+	bool			result_isupdate = false;
+
+	nextMXact = state->nextMXact;
+	nextOffset = state->nextOffset;
+
+	/*
+	 * See GetMultiXactIdMembers in multixact.c
+	 *
+	 * Find out the offset at which we need to start reading MultiXactMembers
+	 * and the number of members in the multixact.  We determine the latter as
+	 * the difference between this multixact's starting offset and the next
+	 * one's.  However, there are some corner cases to worry about:
+	 *
+	 * 1. This multixact may be the latest one created, in which case there is
+	 * no next one to look at.  In this case the nextOffset value we just
+	 * saved is the correct endpoint.
+	 *
+	 * 2. The next multixact may still be in process of being filled in...
+	 * This cannot happen during upgrade.
+	 *
+	 * 3. Because GetNewMultiXactId increments offset zero to offset one to
+	 * handle case #2, there is an ambiguity near the point of offset
+	 * wraparound.  If we see next multixact's offset is one, is that our
+	 * multixact's actual endpoint, or did it end at zero with a subsequent
+	 * increment?  We handle this using the knowledge that if the zero'th
+	 * member slot wasn't filled, it'll contain zero, and zero isn't a valid
+	 * transaction ID so it can't be a multixact member.  Therefore, if we
+	 * read a zero from the members array, just ignore it.
+	 */
+
+	pageno = MultiXactIdToOffsetPage(multi);
+	entryno = MultiXactIdToOffsetEntry(multi);
+
+	buf = SlruReadSwitchPage(state->offset, pageno);
+	offptr = (OldMultiXactOffset *) buf;
+	offptr += entryno;
+	offset = *offptr;
+
+	Assert(offset != 0);
+
+	/*
+	 * Use the same increment rule as GetNewMultiXactId(), that is, don't
+	 * handle wraparound explicitly until needed.
+	 */
+	tmpMXact = multi + 1;
+
+	if (nextMXact == tmpMXact)
+	{
+		/* Corner case 1: there is no next multixact */
+		length = nextOffset - offset;
+	}
+	else
+	{
+		OldMultiXactOffset nextMXOffset;
+
+		/* handle wraparound if needed */
+		if (tmpMXact < FirstMultiXactId)
+			tmpMXact = FirstMultiXactId;
+
+		prev_pageno = pageno;
+
+		pageno = MultiXactIdToOffsetPage(tmpMXact);
+		entryno = MultiXactIdToOffsetEntry(tmpMXact);
+
+		if (pageno != prev_pageno)
+			buf = SlruReadSwitchPage(state->offset, pageno);
+
+		offptr = (OldMultiXactOffset *) buf;
+		offptr += entryno;
+		nextMXOffset = *offptr;
+
+		/*
+		 * Corner case 2: next multixact is still being filled in, this must
+		 * not happen during upgrade.
+		 */
+		Assert(nextMXOffset != 0);
+
+		length = nextMXOffset - offset;
+	}
+
+	prev_pageno = -1;
+	for (int i = 0; i < length; i++, offset++)
+	{
+		TransactionId *xactptr;
+		uint32	   *flagsptr;
+		int			flagsoff;
+		int			bshift;
+		int			memberoff;
+		MultiXactStatus st;
+
+		pageno = MXOffsetToMemberPage(offset);
+		memberoff = MXOffsetToMemberOffset(offset);
+
+		if (pageno != prev_pageno)
+		{
+			buf = SlruReadSwitchPage(state->members, pageno);
+			prev_pageno = pageno;
+		}
+
+		xactptr = (TransactionId *) (buf + memberoff);
+		if (!TransactionIdIsValid(*xactptr))
+		{
+			/* Corner case 3: we must be looking at unused slot zero */
+			Assert(offset == 0);
+			continue;
+		}
+
+		flagsoff = MXOffsetToFlagsOffset(offset);
+		bshift = MXOffsetToFlagsBitShift(offset);
+		flagsptr = (uint32 *) (buf + flagsoff);
+
+		st = (*flagsptr >> bshift) & MXACT_MEMBER_XACT_BITMASK;
+
+		/* Verify that there is a single update Xid among the given members. */
+		if (ISUPDATE_from_mxstatus(st))
+		{
+			if (result_isupdate)
+				pg_fatal("multixact %u has more than one updating member",
+						 multi);
+			result_xid = *xactptr;
+			result_isupdate = true;
+		}
+		else if (!TransactionIdIsValid(result_xid))
+			result_xid = *xactptr;
+	}
+
+	/* A multixid with zero members should not happen */
+	Assert(TransactionIdIsValid(result_xid));
+
+	*result = result_xid;
+	*status = result_isupdate ? MultiXactStatusUpdate :
+								MultiXactStatusForKeyShare;
+}
+
+/*
+ * Frees the malloced reader.
+ */
+void
+FreeOldMultiXactReader(OldMultiXactReader *state)
+{
+	FreeSlruRead(state->offset);
+	FreeSlruRead(state->members);
+
+	pfree(state);
+}
diff --git a/src/bin/pg_upgrade/multixact_old.h b/src/bin/pg_upgrade/multixact_old.h
new file mode 100644
index 00000000000..8d4659ba6a0
--- /dev/null
+++ b/src/bin/pg_upgrade/multixact_old.h
@@ -0,0 +1,31 @@
+/*
+ * multixact_old.h
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ * src/bin/pg_upgrade/multixact_old.h
+ */
+
+#include "postgres_fe.h"
+
+#include "access/multixact.h"
+#include "slru_io.h"
+
+typedef uint32 OldMultiXactOffset;
+
+typedef struct OldMultiXactReader
+{
+	MultiXactId			nextMXact;
+	OldMultiXactOffset	nextOffset;
+
+	SlruSegState	   *offset;
+	SlruSegState	   *members;
+} OldMultiXactReader;
+
+extern OldMultiXactReader *AllocOldMultiXactRead(char *pgdata,
+												 MultiXactId nextMulti,
+												 OldMultiXactOffset nextOffset);
+extern void GetOldMultiXactIdSingleMember(OldMultiXactReader *state,
+										  MultiXactId multi,
+										  TransactionId *result,
+										  MultiXactStatus *status);
+extern void FreeOldMultiXactReader(OldMultiXactReader *reader);
diff --git a/src/bin/pg_upgrade/pg_upgrade.c b/src/bin/pg_upgrade/pg_upgrade.c
index 490e98fa26f..5432c03a2b0 100644
--- a/src/bin/pg_upgrade/pg_upgrade.c
+++ b/src/bin/pg_upgrade/pg_upgrade.c
@@ -49,6 +49,8 @@
 #include "common/restricted_token.h"
 #include "fe_utils/string_utils.h"
 #include "pg_upgrade.h"
+#include "multixact_old.h"
+#include "multixact_new.h"
 
 /*
  * Maximum number of pg_restore actions (TOC entries) to process within one
@@ -769,6 +771,82 @@ copy_subdir_files(const char *old_subdir, const char *new_subdir)
 	check_ok();
 }
 
+/*
+ * Convert pg_multixact/offsets and /members to new format with 64-bit offsets.
+ */
+static void
+convert_multixacts(MultiXactId *new_nxtmulti, MultiXactOffset *new_nxtmxoff)
+{
+	MultiXactId			oldest_multi,
+						next_multi;
+	OldMultiXactReader *old_reader;
+	MultiXactWriter	   *new_writer;
+
+	old_reader = AllocOldMultiXactRead(old_cluster.pgdata,
+									   old_cluster.controldata.chkpnt_nxtmulti,
+									   old_cluster.controldata.chkpnt_nxtmxoff);
+	new_writer = AllocMultiXactWrite(new_cluster.pgdata,
+									 old_cluster.controldata.chkpnt_oldstMulti,
+									 1 /* see below */);
+
+	oldest_multi = old_cluster.controldata.chkpnt_oldstMulti;
+	next_multi = old_cluster.controldata.chkpnt_nxtmulti;
+
+	/* handle wraparound */
+	if (next_multi < FirstMultiXactId)
+		next_multi = FirstMultiXactId;
+
+	/*
+	 * Read multixids from old files one by one, and write them back in the new
+	 * format.
+	 *
+	 * The locking-only XIDs that may be part of multi-xids don't matter after
+	 * upgrade, as there can be no transactions running across upgrade.  So as
+	 * a little optimization, we only read one member from each multixid: the
+	 * updating member, or if there was no update, arbitrarily the first
+	 * locking xid.
+	 */
+	for (MultiXactId multi = oldest_multi; multi != next_multi;)
+	{
+		TransactionId		xid;
+		MultiXactStatus		status;
+		MultiXactMember		member;
+		MultiXactId			new_multi PG_USED_FOR_ASSERTS_ONLY;
+		MultiXactOffset		offset;
+
+		/* Read the old multixid */
+		GetOldMultiXactIdSingleMember(old_reader, multi, &xid, &status);
+
+		/* Write it out in new format */
+		member.xid = xid;
+		member.status = status;
+		new_multi = GetNewMultiXactId(new_writer, 1, &offset);
+
+		Assert(new_multi == multi);
+
+		RecordNewMultiXact(new_writer, offset, multi, 1, &member);
+
+		multi++;
+		/* handle wraparound */
+		if (multi < FirstMultiXactId)
+			multi = FirstMultiXactId;
+	}
+
+	/*
+	 * Update the nextMXact/Offset values in the control file to match what we
+	 * wrote.  The nextMXact should be unchanged, but because we ignored the
+	 * locking-only members, the nextOffset will be different.
+	 */
+	Assert(new_writer->nextMXact == next_multi);
+
+	*new_nxtmulti = next_multi;
+	*new_nxtmxoff = new_writer->nextOffset;
+
+	/* Release resources */
+	FreeMultiXactWrite(new_writer);
+	FreeOldMultiXactReader(old_reader);
+}
+
 static void
 copy_xact_xlog_xid(void)
 {
@@ -816,8 +894,28 @@ copy_xact_xlog_xid(void)
 	if (old_cluster.controldata.cat_ver >= MULTIXACT_FORMATCHANGE_CAT_VER &&
 		new_cluster.controldata.cat_ver >= MULTIXACT_FORMATCHANGE_CAT_VER)
 	{
-		copy_subdir_files("pg_multixact/offsets", "pg_multixact/offsets");
-		copy_subdir_files("pg_multixact/members", "pg_multixact/members");
+		MultiXactId		new_nxtmulti = old_cluster.controldata.chkpnt_nxtmulti;
+		MultiXactOffset new_nxtmxoff = old_cluster.controldata.chkpnt_nxtmxoff;
+
+		/*
+		 * If the old cluster predates MULTIXACTOFFSET_FORMATCHANGE_CAT_VER, it
+		 * has 32-bit multixid offsets, which must be converted.
+		 */
+		if (old_cluster.controldata.cat_ver < MULTIXACTOFFSET_FORMATCHANGE_CAT_VER &&
+			new_cluster.controldata.cat_ver >= MULTIXACTOFFSET_FORMATCHANGE_CAT_VER)
+		{
+			remove_new_subdir("pg_multixact/members", false);
+			remove_new_subdir("pg_multixact/offsets", false);
+
+			prep_status("Converting pg_multixact/offsets to 64-bit");
+			convert_multixacts(&new_nxtmulti, &new_nxtmxoff);
+			check_ok();
+		}
+		else
+		{
+			copy_subdir_files("pg_multixact/offsets", "pg_multixact/offsets");
+			copy_subdir_files("pg_multixact/members", "pg_multixact/members");
+		}
 
 		prep_status("Setting next multixact ID and offset for new cluster");
 
@@ -826,10 +924,8 @@ copy_xact_xlog_xid(void)
 		 * counters here and the oldest multi present on system.
 		 */
 		exec_prog(UTILITY_LOG_FILE, NULL, true, true,
-				  "\"%s/pg_resetwal\" -O %u -m %u,%u \"%s\"",
-				  new_cluster.bindir,
-				  old_cluster.controldata.chkpnt_nxtmxoff,
-				  old_cluster.controldata.chkpnt_nxtmulti,
+				  "\"%s/pg_resetwal\" -O %" PRIu64 " -m %u,%u \"%s\"",
+				  new_cluster.bindir, new_nxtmxoff, new_nxtmulti,
 				  old_cluster.controldata.chkpnt_oldstMulti,
 				  new_cluster.pgdata);
 		check_ok();
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index e86336f4be9..127b2cb00fa 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -114,6 +114,11 @@ extern char *output_files[];
  */
 #define MULTIXACT_FORMATCHANGE_CAT_VER 201301231
 
+/*
+ * Switch from 32-bit to 64-bit for multixid offsets.
+ */
+#define MULTIXACTOFFSET_FORMATCHANGE_CAT_VER 999999999
+
 /*
  * large object chunk size added to pg_controldata,
  * commit 5f93c37805e7485488480916b4585e098d3cc883
diff --git a/src/bin/pg_upgrade/slru_io.c b/src/bin/pg_upgrade/slru_io.c
new file mode 100644
index 00000000000..6e5e0dc5b12
--- /dev/null
+++ b/src/bin/pg_upgrade/slru_io.c
@@ -0,0 +1,228 @@
+/*
+ * slru_io.c
+ *
+ * Routines for reading and writing SLRU files during upgrade.
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ * src/bin/pg_upgrade/slru_io.c
+ */
+
+#include "postgres_fe.h"
+
+#include <fcntl.h>
+
+#include "pg_upgrade.h"
+#include "slru_io.h"
+
+#include "common/fe_memutils.h"
+#include "common/file_perm.h"
+#include "common/file_utils.h"
+#include "port/pg_iovec.h"
+
+/*
+ * State for reading or writing an SLRU, with a one page buffer.
+ */
+typedef struct SlruSegState
+{
+	bool		writing;
+
+	char	   *dir;
+	char	   *fn;
+	int			fd;
+	int64		segno;
+	uint64		pageno;
+
+	PGAlignedBlock buf;
+} SlruSegState;
+
+static inline SlruSegState *
+AllocSlruSegState(char *dir)
+{
+	SlruSegState *state = pg_malloc(sizeof(*state));
+
+	state->segno = -1;
+	state->pageno = 0;
+	state->dir = pstrdup(dir);
+	state->fd = -1;
+	state->fn = NULL;
+
+	return state;
+}
+
+static inline void
+SlruFlush(SlruSegState *state)
+{
+	struct iovec	iovec = {
+		.iov_base = &state->buf,
+		.iov_len = BLCKSZ,
+	};
+	off_t			offset;
+
+	if (state->segno == -1)
+		return;
+
+	offset = (state->pageno % SLRU_PAGES_PER_SEGMENT) * BLCKSZ;
+
+	if (pg_pwritev_with_retry(state->fd, &iovec, 1, offset) < 0)
+		pg_fatal("could not write file \"%s\": %m", state->fn);
+}
+
+/*
+ * Create an SLRU reader for dir.
+ *
+ * Returns the malloced memory used by all the other read calls in this module.
+ */
+SlruSegState *
+AllocSlruRead(char *dir)
+{
+	SlruSegState *state = AllocSlruSegState(dir);
+
+	state->writing = false;
+
+	return state;
+}
+
+/*
+ * Open given page for reading.
+ *
+ * Reading can be done in random order.
+ */
+char *
+SlruReadSwitchPage(SlruSegState *state, uint64 pageno)
+{
+	int64 segno;
+
+	Assert(!state->writing);	/* read only mode */
+
+	if (state->segno != -1 && pageno == state->pageno)
+		return state->buf.data;
+
+	segno = pageno / SLRU_PAGES_PER_SEGMENT;
+	if (segno != state->segno)
+	{
+		if (state->segno != -1)
+		{
+			close(state->fd);
+			state->fd = -1;
+
+			pg_free(state->fn);
+			state->fn = NULL;
+		}
+
+		/* Open new segment */
+		state->fn = psprintf("%s/%04X", state->dir, (unsigned int) segno);
+		if ((state->fd = open(state->fn, O_RDONLY | PG_BINARY, 0)) < 0)
+			pg_fatal("could not open file \"%s\": %m", state->fn);
+	}
+
+	state->segno = segno;
+
+	{
+		struct iovec	iovec = {
+			.iov_base = &state->buf,
+			.iov_len = BLCKSZ,
+		};
+		off_t			offset = (pageno % SLRU_PAGES_PER_SEGMENT) * BLCKSZ;
+
+		if (pg_preadv(state->fd, &iovec, 1, offset) < 0)
+			pg_fatal("could not read file \"%s\": %m", state->fn);
+
+		state->pageno = pageno;
+	}
+
+	return state->buf.data;
+}
+
+/*
+ * Frees the malloced reader.
+ */
+void
+FreeSlruRead(SlruSegState *state)
+{
+	Assert(!state->writing);	/* read only mode */
+
+	close(state->fd);
+	pg_free(state);
+}
+
+/*
+ * Open the given page for writing.
+ *
+ * NOTE: This uses O_EXCL when stepping to a new segment, so this assumes that
+ * each segment is written in full before moving on to next one.  This
+ * limitation would be easy to lift if needed, but it fits the usage pattern of
+ * current callers.
+ */
+char *
+SlruWriteSwitchPage(SlruSegState *state, uint64 pageno)
+{
+	int64	segno = pageno / SLRU_PAGES_PER_SEGMENT;
+	off_t	offset;
+
+	if (state->segno != -1 && pageno == state->pageno)
+		return state->buf.data;
+
+	segno = pageno / SLRU_PAGES_PER_SEGMENT;
+	offset = (pageno % SLRU_PAGES_PER_SEGMENT) * BLCKSZ;
+
+	SlruFlush(state);
+	memset(state->buf.data, 0, BLCKSZ);
+
+	if (segno != state->segno)
+	{
+		if (state->segno != -1)
+		{
+			close(state->fd);
+			state->fd = -1;
+
+			pg_free(state->fn);
+			state->fn = NULL;
+		}
+
+		/* Create the segment */
+		state->fn = psprintf("%s/%04X", state->dir, (unsigned int) segno);
+		if ((state->fd = open(state->fn, O_RDWR | O_CREAT | O_EXCL | PG_BINARY,
+							  pg_file_create_mode)) < 0)
+		{
+			pg_fatal("could not create file \"%s\": %m", state->fn);
+		}
+
+		state->segno = segno;
+
+		if (offset > 0 && pg_pwrite_zeros(state->fd, offset, 0) < 0)
+			pg_fatal("could not write file \"%s\": %m", state->fn);
+	}
+
+	state->pageno = pageno;
+
+	return state->buf.data;
+}
+
+/*
+ * Create slru writer for dir.
+ *
+ * Returns the malloc'ed state used by all the other write calls in this module.
+ */
+SlruSegState *
+AllocSlruWrite(char *dir)
+{
+	SlruSegState *state = AllocSlruSegState(dir);
+
+	state->writing = true;
+
+	return state;
+}
+
+/*
+ * Frees the malloced writer.
+ */
+void
+FreeSlruWrite(SlruSegState *state)
+{
+	Assert(state->writing);
+
+	SlruFlush(state);
+
+	close(state->fd);
+	pg_free(state);
+}
diff --git a/src/bin/pg_upgrade/slru_io.h b/src/bin/pg_upgrade/slru_io.h
new file mode 100644
index 00000000000..0ac7cec440c
--- /dev/null
+++ b/src/bin/pg_upgrade/slru_io.h
@@ -0,0 +1,30 @@
+/*
+ * slru_io.h
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ * src/bin/pg_upgrade/slru_io.h
+ */
+
+/*
+ * Some kind of iterator associated with a particular SLRU segment.  The idea is
+ * to specify the segment and page number and then move through the pages.
+ */
+
+#include "postgres_fe.h"
+
+/*
+ * See access/slru.h
+ *
+ * Copied here, since slru.h cannot be included in frontend code.
+ */
+#define SLRU_PAGES_PER_SEGMENT 32
+
+typedef struct SlruSegState SlruSegState;
+
+extern SlruSegState *AllocSlruRead(char *dir);
+extern char *SlruReadSwitchPage(SlruSegState *state, uint64 pageno);
+extern void FreeSlruRead(SlruSegState *state);
+
+extern SlruSegState *AllocSlruWrite(char *dir);
+extern char *SlruWriteSwitchPage(SlruSegState *state, uint64 pageno);
+extern void FreeSlruWrite(SlruSegState *state);
\ No newline at end of file
-- 
2.51.0
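
For readers skimming slru_io.c above: the page-to-file mapping it relies on can be sketched in isolation. This is only an illustration, not code from the patch. The helper names (page_to_segno, page_to_file_offset, segment_name) are hypothetical, and BLCKSZ = 8192 is an assumption (the default build setting); SLRU_PAGES_PER_SEGMENT and the "%04X" file name format are taken from the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define BLCKSZ 8192					/* assumption: default block size */
#define SLRU_PAGES_PER_SEGMENT 32	/* same value as in slru_io.h */

/* Hypothetical helper: number of the segment file that holds a page. */
static int64_t
page_to_segno(uint64_t pageno)
{
	return (int64_t) (pageno / SLRU_PAGES_PER_SEGMENT);
}

/* Hypothetical helper: byte offset of a page inside its segment file. */
static int64_t
page_to_file_offset(uint64_t pageno)
{
	return (int64_t) (pageno % SLRU_PAGES_PER_SEGMENT) * BLCKSZ;
}

/* Hypothetical helper: build "<dir>/<segno as 4 hex digits>". */
static void
segment_name(char *buf, size_t len, const char *dir, int64_t segno)
{
	assert(segno >= 0);
	snprintf(buf, len, "%s/%04X", dir, (unsigned int) segno);
}
```

With these definitions, page 33 lives in segment 1 at byte offset 8192, and segment 26 of pg_multixact/offsets is the file pg_multixact/offsets/001A.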

Attachment: v18-0001-Use-64-bit-multixact-offsets.patch (application/octet-stream)
From c1b911cffbbb52a4479f1b47318c4d73bb671bb6 Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Wed, 7 Aug 2024 16:35:22 +0300
Subject: [PATCH v18 1/3] Use 64-bit multixact offsets

Switching to 64-bit multitransaction offsets removes wraparound and the
2^32 limit on their total number.

On the other hand, this move inevitably increases disc space
utilisation.  Fortunately, multitransaction offsets rise monotonically
and without gaps.  To conserve disc space consumed by segments, we
write a 64-bit "base" at the start of each page, which also serves as
the page's first offset.  All subsequent offsets on the page are
calculated relative to this "base".

Author: Maxim Orlov <orlovmg@gmail.com>
---
 src/backend/access/rmgrdesc/mxactdesc.c   |   4 +-
 src/backend/access/rmgrdesc/xlogdesc.c    |   2 +-
 src/backend/access/transam/multixact.c    | 375 +++++++---------------
 src/backend/access/transam/xlogrecovery.c |   2 +-
 src/backend/commands/vacuum.c             |   2 +-
 src/backend/postmaster/autovacuum.c       |   4 +-
 src/bin/pg_controldata/pg_controldata.c   |   2 +-
 src/bin/pg_resetwal/pg_resetwal.c         |   6 +-
 src/bin/pg_resetwal/t/001_basic.pl        |   2 +-
 src/include/access/multixact.h            |   3 +-
 src/include/c.h                           |   2 +-
 11 files changed, 125 insertions(+), 279 deletions(-)
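
A minimal sketch of the base-plus-delta page layout described in the commit message, for illustration only. It is deliberately simpler than MXOffsetRead/MXOffsetWrite in the patch: it assumes entries on a page are written in ascending offset order, stores unsigned deltas, and sets or masks the high bit instead of XOR-ing it. The names OffsetPage, page_write, page_read and WRITTEN_FLAG are hypothetical, and ENTRIES_PER_PAGE assumes BLCKSZ = 8192.

```c
#include <assert.h>
#include <stdint.h>

#define ENTRIES_PER_PAGE 2046					/* (8192 - 8) / 4 */
#define WRITTEN_FLAG UINT32_C(0x80000000)		/* marks a written entry */

/* Hypothetical page layout: one 64-bit base, then 32-bit deltas. */
typedef struct OffsetPage
{
	uint64_t	base;
	uint32_t	delta[ENTRIES_PER_PAGE];
} OffsetPage;

/* Store a 64-bit offset as a flagged 31-bit delta from the page base. */
static void
page_write(OffsetPage *page, int entryno, uint64_t offset)
{
	/* The first offset written to an all-zero page becomes the base. */
	if (page->base == 0)
		page->base = offset;

	/* Assumption of this sketch: ascending writes, delta fits in 31 bits. */
	assert(offset >= page->base);
	assert(offset - page->base < WRITTEN_FLAG);

	page->delta[entryno] = (uint32_t) (offset - page->base) | WRITTEN_FLAG;
}

/* Return the full 64-bit offset, or 0 if the entry was never written. */
static uint64_t
page_read(const OffsetPage *page, int entryno)
{
	uint32_t	raw = page->delta[entryno];

	if (raw == 0)
		return 0;
	return page->base + (raw & ~WRITTEN_FLAG);
}
```

The flag bit is what lets the entry that equals the base (delta 0) be told apart from an entry that has not been written yet, which is still all zeroes.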

diff --git a/src/backend/access/rmgrdesc/mxactdesc.c b/src/backend/access/rmgrdesc/mxactdesc.c
index 3ca0582db36..052dd0a4ce5 100644
--- a/src/backend/access/rmgrdesc/mxactdesc.c
+++ b/src/backend/access/rmgrdesc/mxactdesc.c
@@ -65,7 +65,7 @@ multixact_desc(StringInfo buf, XLogReaderState *record)
 		xl_multixact_create *xlrec = (xl_multixact_create *) rec;
 		int			i;
 
-		appendStringInfo(buf, "%u offset %u nmembers %d: ", xlrec->mid,
+		appendStringInfo(buf, "%u offset %" PRIu64 " nmembers %d: ", xlrec->mid,
 						 xlrec->moff, xlrec->nmembers);
 		for (i = 0; i < xlrec->nmembers; i++)
 			out_member(buf, &xlrec->members[i]);
@@ -74,7 +74,7 @@ multixact_desc(StringInfo buf, XLogReaderState *record)
 	{
 		xl_multixact_truncate *xlrec = (xl_multixact_truncate *) rec;
 
-		appendStringInfo(buf, "offsets [%u, %u), members [%u, %u)",
+		appendStringInfo(buf, "offsets [%u, %u), members [%" PRIu64 ", %" PRIu64 ")",
 						 xlrec->startTruncOff, xlrec->endTruncOff,
 						 xlrec->startTruncMemb, xlrec->endTruncMemb);
 	}
diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c
index cd6c2a2f650..441034f5929 100644
--- a/src/backend/access/rmgrdesc/xlogdesc.c
+++ b/src/backend/access/rmgrdesc/xlogdesc.c
@@ -66,7 +66,7 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 		CheckPoint *checkpoint = (CheckPoint *) rec;
 
 		appendStringInfo(buf, "redo %X/%08X; "
-						 "tli %u; prev tli %u; fpw %s; wal_level %s; xid %u:%u; oid %u; multi %u; offset %u; "
+						 "tli %u; prev tli %u; fpw %s; wal_level %s; xid %u:%u; oid %u; multi %u; offset %" PRIu64 "; "
 						 "oldest xid %u in DB %u; oldest multi %u in DB %u; "
 						 "oldest/newest commit timestamp xid: %u/%u; "
 						 "oldest running xid %u; %s",
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 9d5f130af7e..e7861614bec 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -89,21 +89,31 @@
 #include "utils/memutils.h"
 
 
+typedef int32 ShortMultiXactOffset;	/* for a disk storage */
+
 /*
  * Defines for MultiXactOffset page sizes.  A page is the same BLCKSZ as is
  * used everywhere else in Postgres.
  *
- * Note: because MultiXactOffsets are 32 bits and wrap around at 0xFFFFFFFF,
- * MultiXact page numbering also wraps around at
- * 0xFFFFFFFF/MULTIXACT_OFFSETS_PER_PAGE, and segment numbering at
- * 0xFFFFFFFF/MULTIXACT_OFFSETS_PER_PAGE/SLRU_PAGES_PER_SEGMENT.  We need
- * take no explicit notice of that fact in this module, except when comparing
- * segment and page numbers in TruncateMultiXact (see
- * MultiXactOffsetPagePrecedes).
+ * There are two key reasons why storing straightforward 64-bit offset values
+ * would be wasteful in terms of disc space usage:
+ * 1) offset values are recorded in ascending order and not overwritten;
+ * 2) the largest supported BLCKSZ is 32k, which can store up to 2^13 32-bit
+ *    items on a single page;  thus, with MAX_BACKENDS limited to 2^18-1 we have
+ *    2^13 * (2^18-1), which is less than 2^31 and fits in 32 bits.
+ *
+ * In other words, the maximum "distance" between offsets on a single page does
+ * not exceed 32 bits.  To optimise disc space allocation, we employ the
+ * following scheme.  On each page, the basic 64-bit offset, known as the page
+ * base, is located first.  It is followed by 32-bit deltas relative to the
+ * base.  Thus, the required offset for the 0-th element is the page's
+ * base; the value for each subsequent offset on the same page is calculated
+ * by adding its delta to the page base (0-th) element.
  */
 
-/* We need four bytes per offset */
-#define MULTIXACT_OFFSETS_PER_PAGE (BLCKSZ / sizeof(MultiXactOffset))
+/* We need four bytes per offset, plus eight bytes for the page base */
+#define MULTIXACT_OFFSETS_PER_PAGE		\
+	((BLCKSZ - sizeof(MultiXactOffset)) / sizeof(ShortMultiXactOffset))
 
 static inline int64
 MultiXactIdToOffsetPage(MultiXactId multi)
@@ -208,10 +218,14 @@ MXOffsetToMemberOffset(MultiXactOffset offset)
 		member_in_group * sizeof(TransactionId);
 }
 
-/* Multixact members wraparound thresholds. */
-#define MULTIXACT_MEMBER_SAFE_THRESHOLD		(MaxMultiXactOffset / 2)
-#define MULTIXACT_MEMBER_DANGER_THRESHOLD	\
-	(MaxMultiXactOffset - MaxMultiXactOffset / 4)
+/*
+ * Multixact members warning threshold.
+ *
+ * If the difference between nextOffset and oldestOffset exceeds this value,
+ * we trigger autovacuum in order to release disk space and reduce table bloat
+ * if possible.
+ */
+#define MULTIXACT_MEMBER_AUTOVAC_THRESHOLD		UINT64CONST(0xFFFFFFFF)
 
 static inline MultiXactId
 PreviousMultiXactId(MultiXactId multi)
@@ -228,6 +242,51 @@ static SlruCtlData MultiXactMemberCtlData;
 #define MultiXactOffsetCtl	(&MultiXactOffsetCtlData)
 #define MultiXactMemberCtl	(&MultiXactMemberCtlData)
 
+/*
+ * To avoid diving deep into address arithmetic, we declare an auxiliary
+ * structure that is used to access a MultiXactOffset SLRU page.
+ */
+typedef struct MultiXactOffsetSLRUPage
+{
+	MultiXactOffset			base;
+	ShortMultiXactOffset	offset[FLEXIBLE_ARRAY_MEMBER];
+} MultiXactOffsetSLRUPage;
+
+static inline MultiXactOffset
+MXOffsetRead(int entryno, int slotno)
+{
+	MultiXactOffsetSLRUPage *page =
+		(MultiXactOffsetSLRUPage *) MultiXactOffsetCtl->shared->page_buffer[slotno];
+
+	if (page->offset[entryno] != 0)
+		return page->base + (page->offset[entryno] ^ 0x80000000);
+
+	return 0;
+}
+
+static inline void
+MXOffsetWrite(int entryno, int slotno, MultiXactOffset offset)
+{
+	MultiXactOffsetSLRUPage *page =
+		(MultiXactOffsetSLRUPage *) MultiXactOffsetCtl->shared->page_buffer[slotno];
+
+	if (page->base != 0)
+		page->offset[entryno] = offset - page->base;
+	else
+	{
+		page->base = offset;
+		page->offset[entryno] = 0;
+	}
+
+	/*
+	 * We need to distinguish an entry that has been written (possibly with a
+	 * zero delta) from one that has not been written yet.  See case 2 in
+	 * GetMultiXactIdMembers.
+	 * So, mark this offset as initialized.
+	 */
+	page->offset[entryno] ^= 0x80000000;
+}
+
 /*
  * MultiXact state shared across all backends.  All this state is protected
  * by MultiXactGenLock.  (We also use SLRU bank's lock of MultiXactOffset and
@@ -268,9 +327,6 @@ typedef struct MultiXactStateData
 	MultiXactId multiStopLimit;
 	MultiXactId multiWrapLimit;
 
-	/* support for members anti-wraparound measures */
-	MultiXactOffset offsetStopLimit;	/* known if oldestOffsetKnown */
-
 	/*
 	 * This is used to sleep until a multixact offset is written when we want
 	 * to create the next one.
@@ -401,8 +457,6 @@ static bool MultiXactOffsetPrecedes(MultiXactOffset offset1,
 									MultiXactOffset offset2);
 static void ExtendMultiXactOffset(MultiXactId multi);
 static void ExtendMultiXactMember(MultiXactOffset offset, int nmembers);
-static bool MultiXactOffsetWouldWrap(MultiXactOffset boundary,
-									 MultiXactOffset start, uint32 distance);
 static bool SetOffsetVacuumLimit(bool is_startup);
 static bool find_multixact_start(MultiXactId multi, MultiXactOffset *result);
 static void WriteMTruncateXlogRec(Oid oldestMultiDB,
@@ -911,7 +965,6 @@ RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
 	int64		prev_pageno;
 	int			entryno;
 	int			slotno;
-	MultiXactOffset *offptr;
 	int			i;
 	LWLock	   *lock;
 	LWLock	   *prevlock = NULL;
@@ -930,10 +983,8 @@ RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
 	 * take the trouble to generalize the slru.c error reporting code.
 	 */
 	slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, multi);
-	offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
-	offptr += entryno;
 
-	*offptr = offset;
+	MXOffsetWrite(entryno, slotno, offset);
 
 	MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
 
@@ -1155,78 +1206,6 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 	else
 		*offset = nextOffset;
 
-	/*----------
-	 * Protect against overrun of the members space as well, with the
-	 * following rules:
-	 *
-	 * If we're past offsetStopLimit, refuse to generate more multis.
-	 * If we're close to offsetStopLimit, emit a warning.
-	 *
-	 * Arbitrarily, we start emitting warnings when we're 20 segments or less
-	 * from offsetStopLimit.
-	 *
-	 * Note we haven't updated the shared state yet, so if we fail at this
-	 * point, the multixact ID we grabbed can still be used by the next guy.
-	 *
-	 * Note that there is no point in forcing autovacuum runs here: the
-	 * multixact freeze settings would have to be reduced for that to have any
-	 * effect.
-	 *----------
-	 */
-#define OFFSET_WARN_SEGMENTS	20
-	if (MultiXactState->oldestOffsetKnown &&
-		MultiXactOffsetWouldWrap(MultiXactState->offsetStopLimit, nextOffset,
-								 nmembers))
-	{
-		/* see comment in the corresponding offsets wraparound case */
-		SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER);
-
-		ereport(ERROR,
-				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg("multixact \"members\" limit exceeded"),
-				 errdetail_plural("This command would create a multixact with %u members, but the remaining space is only enough for %u member.",
-								  "This command would create a multixact with %u members, but the remaining space is only enough for %u members.",
-								  MultiXactState->offsetStopLimit - nextOffset - 1,
-								  nmembers,
-								  MultiXactState->offsetStopLimit - nextOffset - 1),
-				 errhint("Execute a database-wide VACUUM in database with OID %u with reduced \"vacuum_multixact_freeze_min_age\" and \"vacuum_multixact_freeze_table_age\" settings.",
-						 MultiXactState->oldestMultiXactDB)));
-	}
-
-	/*
-	 * Check whether we should kick autovacuum into action, to prevent members
-	 * wraparound. NB we use a much larger window to trigger autovacuum than
-	 * just the warning limit. The warning is just a measure of last resort -
-	 * this is in line with GetNewTransactionId's behaviour.
-	 */
-	if (!MultiXactState->oldestOffsetKnown ||
-		(MultiXactState->nextOffset - MultiXactState->oldestOffset
-		 > MULTIXACT_MEMBER_SAFE_THRESHOLD))
-	{
-		/*
-		 * To avoid swamping the postmaster with signals, we issue the autovac
-		 * request only when crossing a segment boundary. With default
-		 * compilation settings that's roughly after 50k members.  This still
-		 * gives plenty of chances before we get into real trouble.
-		 */
-		if ((MXOffsetToMemberPage(nextOffset) / SLRU_PAGES_PER_SEGMENT) !=
-			(MXOffsetToMemberPage(nextOffset + nmembers) / SLRU_PAGES_PER_SEGMENT))
-			SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER);
-	}
-
-	if (MultiXactState->oldestOffsetKnown &&
-		MultiXactOffsetWouldWrap(MultiXactState->offsetStopLimit,
-								 nextOffset,
-								 nmembers + MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT * OFFSET_WARN_SEGMENTS))
-		ereport(WARNING,
-				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg_plural("database with OID %u must be vacuumed before %d more multixact member is used",
-							   "database with OID %u must be vacuumed before %d more multixact members are used",
-							   MultiXactState->offsetStopLimit - nextOffset + nmembers,
-							   MultiXactState->oldestMultiXactDB,
-							   MultiXactState->offsetStopLimit - nextOffset + nmembers),
-				 errhint("Execute a database-wide VACUUM in that database with reduced \"vacuum_multixact_freeze_min_age\" and \"vacuum_multixact_freeze_table_age\" settings.")));
-
 	ExtendMultiXactMember(nextOffset, nmembers);
 
 	/*
@@ -1255,7 +1234,8 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 
 	LWLockRelease(MultiXactGenLock);
 
-	debug_elog4(DEBUG2, "GetNew: returning %u offset %u", result, *offset);
+	debug_elog4(DEBUG2, "GetNew: returning %u offset %" PRIu64, result,
+				*offset);
 	return result;
 }
 
@@ -1294,7 +1274,6 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
 	int64		prev_pageno;
 	int			entryno;
 	int			slotno;
-	MultiXactOffset *offptr;
 	MultiXactOffset offset;
 	int			length;
 	int			truelength;
@@ -1418,9 +1397,8 @@ retry:
 	LWLockAcquire(lock, LW_EXCLUSIVE);
 
 	slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, multi);
-	offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
-	offptr += entryno;
-	offset = *offptr;
+
+	offset = MXOffsetRead(entryno, slotno);
 
 	Assert(offset != 0);
 
@@ -1467,9 +1445,7 @@ retry:
 			slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, tmpMXact);
 		}
 
-		offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
-		offptr += entryno;
-		nextMXOffset = *offptr;
+		nextMXOffset = MXOffsetRead(entryno, slotno);
 
 		if (nextMXOffset == 0)
 		{
@@ -1973,7 +1949,7 @@ MultiXactShmemInit(void)
 				  "pg_multixact/members", LWTRANCHE_MULTIXACTMEMBER_BUFFER,
 				  LWTRANCHE_MULTIXACTMEMBER_SLRU,
 				  SYNC_HANDLER_MULTIXACT_MEMBER,
-				  false);
+				  true);
 	/* doesn't call SimpleLruTruncate() or meet criteria for unit tests */
 
 	/* Initialize our shared state struct */
@@ -2149,9 +2125,24 @@ TrimMultiXact(void)
 		LWLockAcquire(lock, LW_EXCLUSIVE);
 		slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, nextMXact);
 		offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
-		offptr += entryno;
 
-		MemSet(offptr, 0, BLCKSZ - (entryno * sizeof(MultiXactOffset)));
+		if (entryno == 0)
+			MemSet(offptr, 0, BLCKSZ);
+		else
+		{
+			ShortMultiXactOffset *off32ptr;
+
+			off32ptr = (ShortMultiXactOffset *) (offptr + 1);
+			off32ptr += entryno;
+
+			/*
+			 * Knowing that offptr points to the beginning of the buffer,
+			 * address arithmetic can be used to determine the amount of
+			 * bytes remaining.
+			 */
+			MemSet(off32ptr, 0,
+				   BLCKSZ - (((char *) off32ptr - (char *) offptr)));
+		}
 
 		MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
 		LWLockRelease(lock);
@@ -2223,7 +2214,7 @@ MultiXactGetCheckptMulti(bool is_shutdown,
 	LWLockRelease(MultiXactGenLock);
 
 	debug_elog6(DEBUG2,
-				"MultiXact: checkpoint is nextMulti %u, nextOffset %u, oldestMulti %u in DB %u",
+				"MultiXact: checkpoint is nextMulti %u, nextOffset %" PRIu64 ", oldestMulti %u in DB %u",
 				*nextMulti, *nextMultiOffset, *oldestMulti, *oldestMultiDB);
 }
 
@@ -2258,7 +2249,7 @@ void
 MultiXactSetNextMXact(MultiXactId nextMulti,
 					  MultiXactOffset nextMultiOffset)
 {
-	debug_elog4(DEBUG2, "MultiXact: setting next multi to %u offset %u",
+	debug_elog4(DEBUG2, "MultiXact: setting next multi to %u offset %" PRIu64,
 				nextMulti, nextMultiOffset);
 	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
 	MultiXactState->nextMXact = nextMulti;
@@ -2449,7 +2440,7 @@ MultiXactAdvanceNextMXact(MultiXactId minMulti,
 	}
 	if (MultiXactOffsetPrecedes(MultiXactState->nextOffset, minMultiOffset))
 	{
-		debug_elog3(DEBUG2, "MultiXact: setting next offset to %u",
+		debug_elog3(DEBUG2, "MultiXact: setting next offset to %" PRIu64,
 					minMultiOffset);
 		MultiXactState->nextOffset = minMultiOffset;
 	}
@@ -2633,15 +2624,13 @@ GetOldestMultiXactId(void)
 }
 
 /*
- * Determine how aggressively we need to vacuum in order to prevent member
- * wraparound.
+ * Determine whether we need to vacuum to reclaim member space.
  *
  * To do so determine what's the oldest member offset and install the limit
  * info in MultiXactState, where it can be used to prevent overrun of old data
  * in the members SLRU area.
  *
- * The return value is true if emergency autovacuum is required and false
- * otherwise.
+ * The return value is true if autovacuum is required and false otherwise.
  */
 static bool
 SetOffsetVacuumLimit(bool is_startup)
@@ -2653,8 +2642,6 @@ SetOffsetVacuumLimit(bool is_startup)
 	MultiXactOffset nextOffset;
 	bool		oldestOffsetKnown = false;
 	bool		prevOldestOffsetKnown;
-	MultiXactOffset offsetStopLimit = 0;
-	MultiXactOffset prevOffsetStopLimit;
 
 	/*
 	 * NB: Have to prevent concurrent truncation, we might otherwise try to
@@ -2669,7 +2656,6 @@ SetOffsetVacuumLimit(bool is_startup)
 	nextOffset = MultiXactState->nextOffset;
 	prevOldestOffsetKnown = MultiXactState->oldestOffsetKnown;
 	prevOldestOffset = MultiXactState->oldestOffset;
-	prevOffsetStopLimit = MultiXactState->offsetStopLimit;
 	Assert(MultiXactState->finishedStartup);
 	LWLockRelease(MultiXactGenLock);
 
@@ -2700,11 +2686,7 @@ SetOffsetVacuumLimit(bool is_startup)
 		oldestOffsetKnown =
 			find_multixact_start(oldestMultiXactId, &oldestOffset);
 
-		if (oldestOffsetKnown)
-			ereport(DEBUG1,
-					(errmsg_internal("oldest MultiXactId member is at offset %u",
-									 oldestOffset)));
-		else
+		if (!oldestOffsetKnown)
 			ereport(LOG,
 					(errmsg("MultiXact member wraparound protections are disabled because oldest checkpointed MultiXact %u does not exist on disk",
 							oldestMultiXactId)));
@@ -2717,24 +2699,7 @@ SetOffsetVacuumLimit(bool is_startup)
 	 * overrun of old data in the members SLRU area. We can only do so if the
 	 * oldest offset is known though.
 	 */
-	if (oldestOffsetKnown)
-	{
-		/* move back to start of the corresponding segment */
-		offsetStopLimit = oldestOffset - (oldestOffset %
-										  (MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT));
-
-		/* always leave one segment before the wraparound point */
-		offsetStopLimit -= (MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT);
-
-		if (!prevOldestOffsetKnown && !is_startup)
-			ereport(LOG,
-					(errmsg("MultiXact member wraparound protections are now enabled")));
-
-		ereport(DEBUG1,
-				(errmsg_internal("MultiXact member stop limit is now %u based on MultiXact %u",
-								 offsetStopLimit, oldestMultiXactId)));
-	}
-	else if (prevOldestOffsetKnown)
+	if (prevOldestOffsetKnown)
 	{
 		/*
 		 * If we failed to get the oldest offset this time, but we have a
@@ -2744,69 +2709,19 @@ SetOffsetVacuumLimit(bool is_startup)
 		 */
 		oldestOffset = prevOldestOffset;
 		oldestOffsetKnown = true;
-		offsetStopLimit = prevOffsetStopLimit;
 	}
 
 	/* Install the computed values */
 	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
 	MultiXactState->oldestOffset = oldestOffset;
 	MultiXactState->oldestOffsetKnown = oldestOffsetKnown;
-	MultiXactState->offsetStopLimit = offsetStopLimit;
 	LWLockRelease(MultiXactGenLock);
 
 	/*
-	 * Do we need an emergency autovacuum?	If we're not sure, assume yes.
+	 * Do we need autovacuum?	If we're not sure, assume yes.
 	 */
 	return !oldestOffsetKnown ||
-		(nextOffset - oldestOffset > MULTIXACT_MEMBER_SAFE_THRESHOLD);
-}
-
-/*
- * Return whether adding "distance" to "start" would move past "boundary".
- *
- * We use this to determine whether the addition is "wrapping around" the
- * boundary point, hence the name.  The reason we don't want to use the regular
- * 2^31-modulo arithmetic here is that we want to be able to use the whole of
- * the 2^32-1 space here, allowing for more multixacts than would fit
- * otherwise.
- */
-static bool
-MultiXactOffsetWouldWrap(MultiXactOffset boundary, MultiXactOffset start,
-						 uint32 distance)
-{
-	MultiXactOffset finish;
-
-	/*
-	 * Note that offset number 0 is not used (see GetMultiXactIdMembers), so
-	 * if the addition wraps around the UINT_MAX boundary, skip that value.
-	 */
-	finish = start + distance;
-	if (finish < start)
-		finish++;
-
-	/*-----------------------------------------------------------------------
-	 * When the boundary is numerically greater than the starting point, any
-	 * value numerically between the two is not wrapped:
-	 *
-	 *	<----S----B---->
-	 *	[---)			 = F wrapped past B (and UINT_MAX)
-	 *		 [---)		 = F not wrapped
-	 *			  [----] = F wrapped past B
-	 *
-	 * When the boundary is numerically less than the starting point (i.e. the
-	 * UINT_MAX wraparound occurs somewhere in between) then all values in
-	 * between are wrapped:
-	 *
-	 *	<----B----S---->
-	 *	[---)			 = F not wrapped past B (but wrapped past UINT_MAX)
-	 *		 [---)		 = F wrapped past B (and UINT_MAX)
-	 *			  [----] = F not wrapped
-	 *-----------------------------------------------------------------------
-	 */
-	if (start < boundary)
-		return finish >= boundary || finish < start;
-	else
-		return finish >= boundary && finish < start;
+		(nextOffset - oldestOffset > MULTIXACT_MEMBER_AUTOVAC_THRESHOLD);
 }
 
 /*
@@ -2825,7 +2740,6 @@ find_multixact_start(MultiXactId multi, MultiXactOffset *result)
 	int64		pageno;
 	int			entryno;
 	int			slotno;
-	MultiXactOffset *offptr;
 
 	Assert(MultiXactState->finishedStartup);
 
@@ -2843,9 +2757,9 @@ find_multixact_start(MultiXactId multi, MultiXactOffset *result)
 
 	/* lock is acquired by SimpleLruReadPage_ReadOnly */
 	slotno = SimpleLruReadPage_ReadOnly(MultiXactOffsetCtl, pageno, multi);
-	offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
-	offptr += entryno;
-	offset = *offptr;
+
+	offset = MXOffsetRead(entryno, slotno);
+
 	LWLockRelease(SimpleLruGetBankLock(MultiXactOffsetCtl, pageno));
 
 	*result = offset;
@@ -2893,73 +2807,6 @@ GetMultiXactInfo(uint32 *multixacts, MultiXactOffset *members,
 	return true;
 }
 
-/*
- * Multixact members can be removed once the multixacts that refer to them
- * are older than every datminmxid.  autovacuum_multixact_freeze_max_age and
- * vacuum_multixact_freeze_table_age work together to make sure we never have
- * too many multixacts; we hope that, at least under normal circumstances,
- * this will also be sufficient to keep us from using too many offsets.
- * However, if the average multixact has many members, we might exhaust the
- * members space while still using few enough members that these limits fail
- * to trigger relminmxid advancement by VACUUM.  At that point, we'd have no
- * choice but to start failing multixact-creating operations with an error.
- *
- * To prevent that, if more than a threshold portion of the members space is
- * used, we effectively reduce autovacuum_multixact_freeze_max_age and
- * to a value just less than the number of multixacts in use.  We hope that
- * this will quickly trigger autovacuuming on the table or tables with the
- * oldest relminmxid, thus allowing datminmxid values to advance and removing
- * some members.
- *
- * As the fraction of the member space currently in use grows, we become
- * more aggressive in clamping this value.  That not only causes autovacuum
- * to ramp up, but also makes any manual vacuums the user issues more
- * aggressive.  This happens because vacuum_get_cutoffs() will clamp the
- * freeze table and the minimum freeze age cutoffs based on the effective
- * autovacuum_multixact_freeze_max_age this function returns.  In the worst
- * case, we'll claim the freeze_max_age to zero, and every vacuum of any
- * table will freeze every multixact.
- */
-int
-MultiXactMemberFreezeThreshold(void)
-{
-	MultiXactOffset members;
-	uint32		multixacts;
-	uint32		victim_multixacts;
-	double		fraction;
-	int			result;
-	MultiXactId oldestMultiXactId;
-	MultiXactOffset oldestOffset;
-
-	/* If we can't determine member space utilization, assume the worst. */
-	if (!GetMultiXactInfo(&multixacts, &members, &oldestMultiXactId, &oldestOffset))
-		return 0;
-
-	/* If member space utilization is low, no special action is required. */
-	if (members <= MULTIXACT_MEMBER_SAFE_THRESHOLD)
-		return autovacuum_multixact_freeze_max_age;
-
-	/*
-	 * Compute a target for relminmxid advancement.  The number of multixacts
-	 * we try to eliminate from the system is based on how far we are past
-	 * MULTIXACT_MEMBER_SAFE_THRESHOLD.
-	 */
-	fraction = (double) (members - MULTIXACT_MEMBER_SAFE_THRESHOLD) /
-		(MULTIXACT_MEMBER_DANGER_THRESHOLD - MULTIXACT_MEMBER_SAFE_THRESHOLD);
-	victim_multixacts = multixacts * fraction;
-
-	/* fraction could be > 1.0, but lowest possible freeze age is zero */
-	if (victim_multixacts > multixacts)
-		return 0;
-	result = multixacts - victim_multixacts;
-
-	/*
-	 * Clamp to autovacuum_multixact_freeze_max_age, so that we never make
-	 * autovacuum less aggressive than it would otherwise be.
-	 */
-	return Min(result, autovacuum_multixact_freeze_max_age);
-}
-
 typedef struct mxtruncinfo
 {
 	int64		earliestExistingPage;
@@ -3159,7 +3006,7 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 
 	elog(DEBUG1, "performing multixact truncation: "
 		 "offsets [%u, %u), offsets segments [%" PRIx64 ", %" PRIx64 "), "
-		 "members [%u, %u), members segments [%" PRIx64 ", %" PRIx64 ")",
+		 "members [%" PRIu64 ", %" PRIu64 "), members segments [%" PRIx64 ", %" PRIx64 ")",
 		 oldestMulti, newOldestMulti,
 		 MultiXactIdToOffsetSegment(oldestMulti),
 		 MultiXactIdToOffsetSegment(newOldestMulti),
@@ -3290,7 +3137,7 @@ MultiXactIdPrecedesOrEquals(MultiXactId multi1, MultiXactId multi2)
 static bool
 MultiXactOffsetPrecedes(MultiXactOffset offset1, MultiXactOffset offset2)
 {
-	int32		diff = (int32) (offset1 - offset2);
+	int64		diff = (int64) (offset1 - offset2);
 
 	return (diff < 0);
 }
@@ -3387,7 +3234,7 @@ multixact_redo(XLogReaderState *record)
 
 		elog(DEBUG1, "replaying multixact truncation: "
 			 "offsets [%u, %u), offsets segments [%" PRIx64 ", %" PRIx64 "), "
-			 "members [%u, %u), members segments [%" PRIx64 ", %" PRIx64 ")",
+			 "members [%" PRIu64 ", %" PRIu64 "), members segments [%" PRIx64 ", %" PRIx64 ")",
 			 xlrec.startTruncOff, xlrec.endTruncOff,
 			 MultiXactIdToOffsetSegment(xlrec.startTruncOff),
 			 MultiXactIdToOffsetSegment(xlrec.endTruncOff),
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 3e3c4da01a2..3b2b0a522cb 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -885,7 +885,7 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 							 U64FromFullTransactionId(checkPoint.nextXid),
 							 checkPoint.nextOid)));
 	ereport(DEBUG1,
-			(errmsg_internal("next MultiXactId: %u; next MultiXactOffset: %u",
+			(errmsg_internal("next MultiXactId: %u; next MultiXactOffset: %" PRIu64,
 							 checkPoint.nextMulti, checkPoint.nextMultiOffset)));
 	ereport(DEBUG1,
 			(errmsg_internal("oldest unfrozen transaction ID: %u, in database %u",
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index ed03e3bd50d..259ef60bd31 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -1147,7 +1147,7 @@ vacuum_get_cutoffs(Relation rel, const VacuumParams params,
 	 * normally autovacuum_multixact_freeze_max_age, but may be less if we are
 	 * short of multixact member space.
 	 */
-	effective_multixact_freeze_max_age = MultiXactMemberFreezeThreshold();
+	effective_multixact_freeze_max_age = autovacuum_multixact_freeze_max_age;
 
 	/*
 	 * Almost ready to set freeze output parameters; check if OldestXmin or
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 5084af7dfb6..26385470c19 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -1151,7 +1151,7 @@ do_start_worker(void)
 
 	/* Also determine the oldest datminmxid we will consider. */
 	recentMulti = ReadNextMultiXactId();
-	multiForceLimit = recentMulti - MultiXactMemberFreezeThreshold();
+	multiForceLimit = recentMulti - autovacuum_multixact_freeze_max_age;
 	if (multiForceLimit < FirstMultiXactId)
 		multiForceLimit -= FirstMultiXactId;
 
@@ -1939,7 +1939,7 @@ do_autovacuum(void)
 	 * normally autovacuum_multixact_freeze_max_age, but may be less if we are
 	 * short of multixact member space.
 	 */
-	effective_multixact_freeze_max_age = MultiXactMemberFreezeThreshold();
+	effective_multixact_freeze_max_age = autovacuum_multixact_freeze_max_age;
 
 	/*
 	 * Find the pg_database entry and select the default freeze ages. We use
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index 10de058ce91..5295108ade3 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -264,7 +264,7 @@ main(int argc, char *argv[])
 		   ControlFile->checkPointCopy.nextOid);
 	printf(_("Latest checkpoint's NextMultiXactId:  %u\n"),
 		   ControlFile->checkPointCopy.nextMulti);
-	printf(_("Latest checkpoint's NextMultiOffset:  %u\n"),
+	printf(_("Latest checkpoint's NextMultiOffset:  %" PRIu64 "\n"),
 		   ControlFile->checkPointCopy.nextMultiOffset);
 	printf(_("Latest checkpoint's oldestXID:        %u\n"),
 		   ControlFile->checkPointCopy.oldestXid);
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index a89d72fc5cf..4e5eeced89d 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -267,7 +267,7 @@ main(int argc, char *argv[])
 
 			case 'O':
 				errno = 0;
-				set_mxoff = strtoul(optarg, &endptr, 0);
+				set_mxoff = strtou64(optarg, &endptr, 0);
 				if (endptr == optarg || *endptr != '\0' || errno != 0)
 				{
 					pg_log_error("invalid argument for option %s", "-O");
@@ -743,7 +743,7 @@ PrintControlValues(bool guessed)
 		   ControlFile.checkPointCopy.nextOid);
 	printf(_("Latest checkpoint's NextMultiXactId:  %u\n"),
 		   ControlFile.checkPointCopy.nextMulti);
-	printf(_("Latest checkpoint's NextMultiOffset:  %u\n"),
+	printf(_("Latest checkpoint's NextMultiOffset:  %" PRIu64 "\n"),
 		   ControlFile.checkPointCopy.nextMultiOffset);
 	printf(_("Latest checkpoint's oldestXID:        %u\n"),
 		   ControlFile.checkPointCopy.oldestXid);
@@ -817,7 +817,7 @@ PrintNewControlValues(void)
 
 	if (set_mxoff != -1)
 	{
-		printf(_("NextMultiOffset:                      %u\n"),
+		printf(_("NextMultiOffset:                      %" PRIu64 "\n"),
 			   ControlFile.checkPointCopy.nextMultiOffset);
 	}
 
diff --git a/src/bin/pg_resetwal/t/001_basic.pl b/src/bin/pg_resetwal/t/001_basic.pl
index d6bbbd0ceda..16b5a623900 100644
--- a/src/bin/pg_resetwal/t/001_basic.pl
+++ b/src/bin/pg_resetwal/t/001_basic.pl
@@ -213,7 +213,7 @@ push @cmd,
   sprintf("%d,%d", hex($files[0]) == 0 ? 3 : hex($files[0]), hex($files[-1]));
 
 @files = get_slru_files('pg_multixact/offsets');
-$mult = 32 * $blcksz / 4;
+$mult = 32 * ($blcksz - 8) / 4;
 # --multixact-ids argument is "new,old"
 push @cmd,
   '--multixact-ids' => sprintf("%d,%d",
diff --git a/src/include/access/multixact.h b/src/include/access/multixact.h
index 82e4bb90dd5..16a07723088 100644
--- a/src/include/access/multixact.h
+++ b/src/include/access/multixact.h
@@ -28,7 +28,7 @@
 
 #define MultiXactIdIsValid(multi) ((multi) != InvalidMultiXactId)
 
-#define MaxMultiXactOffset	((MultiXactOffset) 0xFFFFFFFF)
+#define MaxMultiXactOffset	UINT64CONST(0xFFFFFFFFFFFFFFFF)
 
 /*
  * Possible multixact lock modes ("status").  The first four modes are for
@@ -147,7 +147,6 @@ extern void MultiXactSetNextMXact(MultiXactId nextMulti,
 extern void MultiXactAdvanceNextMXact(MultiXactId minMulti,
 									  MultiXactOffset minMultiOffset);
 extern void MultiXactAdvanceOldest(MultiXactId oldestMulti, Oid oldestMultiDB);
-extern int	MultiXactMemberFreezeThreshold(void);
 
 extern void multixact_twophase_recover(FullTransactionId fxid, uint16 info,
 									   void *recdata, uint32 len);
diff --git a/src/include/c.h b/src/include/c.h
index 9ab5e617995..7ab9e68af6a 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -666,7 +666,7 @@ typedef uint32 SubTransactionId;
 /* MultiXactId must be equivalent to TransactionId, to fit in t_xmax */
 typedef TransactionId MultiXactId;
 
-typedef uint32 MultiXactOffset;
+typedef uint64 MultiXactOffset;
 
 typedef uint32 CommandId;
 
-- 
2.51.0

#49Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Maxim Orlov (#48)
Re: POC: make mxidoff 64 bits

On 27/10/2025 17:54, Maxim Orlov wrote:

Here is a new patch set @ 10b5bb3bffaee8

As previously stated, the patch set implements the concept of saving the
"difference" between page offsets in order to save disc space.

Hmm, is that safe? We do the assignment of multixact and offset, in the
GetNewMultiXactId() function, separately from updating the SLRU pages in
the RecordNewMultiXact() function. I believe this can happen:

To keep the arithmetic simple, let's assume that multixid 100 is the
first multixid on an offsets SLRU page. So the 'base' on the page is
initialized when multixid 100 is written.

1. Backend A calls GetNewMultiXactId(), is assigned multixid 100, offset
1000
2. Backend B calls GetNewMultiXactId(), is assigned multixid 101, offset
1010
3. Backend B calls RecordNewMultiXact() and sets 'page->offsets[1] = 10'
4. Backend A calls RecordNewMultiXact() and sets 'page->base = 1000' and
'page->offsets[0] = 0'

If backend C looks up multixid 101 in between steps 3 and 4, it would
read the offset incorrectly, because 'base' isn't set yet.
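
The interleaving above can be replayed in a small standalone C model. This is
only an illustrative sketch: ToyPage, toy_write_first(), toy_write_delta() and
toy_lookup() are hypothetical names, not part of the patch, and the model
assumes (as the scenario does) that the base is written only together with
the first entry on the page.

```c
#include <stdint.h>

/* Hypothetical, simplified model of one offsets page (not the patch's struct). */
typedef struct
{
	uint64_t	base;		/* assumed to be set only when entry 0 is written */
	int32_t		delta[4];	/* per-entry difference from the base */
} ToyPage;

/* Step 4 of the scenario: the first entry sets the base and a zero delta. */
static inline void
toy_write_first(ToyPage *page, uint64_t offset)
{
	page->base = offset;
	page->delta[0] = 0;
}

/* Step 3 of the scenario: a later entry stores only its delta. */
static inline void
toy_write_delta(ToyPage *page, int entryno, int32_t delta)
{
	page->delta[entryno] = delta;
}

/* A reader reconstructs the full offset from whatever is on the page now. */
static inline uint64_t
toy_lookup(const ToyPage *page, int entryno)
{
	return page->base + (uint64_t) (int64_t) page->delta[entryno];
}
```

On a zeroed page, calling toy_write_delta(&page, 1, 10) alone makes
toy_lookup(&page, 1) return 10 rather than 1010; the result becomes correct
only after toy_write_first(&page, 1000) has run.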

- Heikki

#50Maxim Orlov
orlovmg@gmail.com
In reply to: Heikki Linnakangas (#49)
5 attachment(s)
Re: POC: make mxidoff 64 bits

Unfortunately, I have to admit that I messed up v18 a bit.
I forgot to sync the pg_upgrade portion with the rest of the patch,
among other things. Here's a proper version with additional testing.

pg_upgrade/t/007_mxoff.pl is not meant to be committed, it's just
for test purposes. In order to run it, env var oldinstall must be set.

On Tue, 28 Oct 2025 at 17:17, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 27/10/2025 17:54, Maxim Orlov wrote:

If backend C looks up multixid 101 in between steps 3 and 4, it would
read the offset incorrectly, because 'base' isn't set yet.

Hmm, maybe I'm missing something? We set the page base on the first write
of any offset on the page, not only of the first entry. In other words,
there should never be a case where we read an offset without a previously
defined page base. Correct me if I'm wrong:
1. Backend A is assigned mxact=100, offset=1000.
2. Backend B is assigned mxact=101, offset=1010.
3. Backend B calls RecordNewMultiXact()/MXOffsetWrite() and, while
holding the lock on the page, sets the page base=1010 and stores the
delta 0 ^ 0x80000000.
4. Backend C looks up mxact=101 by calling MXOffsetRead() and
should get exactly what it is looking for:
base (1010) + delta (0), with the 0x80000000 bit removed.
5. Backend A calls RecordNewMultiXact() and sets its offset using
the existing base from step 3.
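
For comparison, the behaviour described in steps 1-5 can be sketched as a
standalone C model. It is an illustration under stated assumptions, not the
patch's code: ToyOffsetPage, toy_offset_write() and toy_offset_read() are
hypothetical names, a base of 0 is taken to mean "not set yet", and the delta
is decoded as a signed value by subtracting 0x80000000 (which matches flipping
the top bit and reading the result as a signed 32-bit number).

```c
#include <stdint.h>

/* Marks an entry as written, so a delta of 0 differs from an empty slot. */
#define TOY_WRITTEN_FLAG	UINT32_C(0x80000000)

/* Hypothetical, simplified model of one offsets page. */
typedef struct
{
	uint64_t	base;		/* 0 means "no entry written on this page yet" */
	uint32_t	stored[4];	/* flagged 32-bit deltas; 0 means "not written" */
} ToyOffsetPage;

/* The first writer on the page, whichever entry it owns, defines the base. */
static inline void
toy_offset_write(ToyOffsetPage *page, int entryno, uint64_t offset)
{
	if (page->base == 0)
		page->base = offset;

	/* Unsigned wraparound keeps the low 32 bits of a negative delta. */
	page->stored[entryno] =
		(uint32_t) (offset - page->base) ^ TOY_WRITTEN_FLAG;
}

/* Returns 0 for an entry that has not been written yet. */
static inline uint64_t
toy_offset_read(const ToyOffsetPage *page, int entryno)
{
	int64_t		delta;

	if (page->stored[entryno] == 0)
		return 0;

	/* Remove the flag and recover the signed delta in one subtraction. */
	delta = (int64_t) page->stored[entryno] - INT64_C(0x80000000);
	return page->base + (uint64_t) delta;
}
```

Replaying the steps: toy_offset_write(&page, 1, 1010) sets the base to 1010,
so a reader gets 1010 for entry 1 immediately; the later
toy_offset_write(&page, 0, 1000) keeps that base and stores a delta of -10,
which reads back as 1000.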

--
Best regards,
Maxim Orlov.

Attachments:

v19-0005-Add-test-for-64-bit-mxoff-in-pg_upgrade.patch.txt (text/plain; charset=US-ASCII)
From a0199ead32003c4fb4edb7cc7fc0225ae7452209 Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Wed, 29 Oct 2025 14:19:56 +0300
Subject: [PATCH v19 5/5] Add test for 64-bit mxoff in pg_upgrade

---
 src/bin/pg_upgrade/t/007_mxoff.pl | 463 ++++++++++++++++++++++++++++++
 1 file changed, 463 insertions(+)
 create mode 100644 src/bin/pg_upgrade/t/007_mxoff.pl

diff --git a/src/bin/pg_upgrade/t/007_mxoff.pl b/src/bin/pg_upgrade/t/007_mxoff.pl
new file mode 100644
index 0000000000..10e4387953
--- /dev/null
+++ b/src/bin/pg_upgrade/t/007_mxoff.pl
@@ -0,0 +1,463 @@
+
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+use strict;
+use warnings FATAL => 'all';
+
+use Math::BigInt;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use PostgreSQL::Test::AdjustDump;
+use PostgreSQL::Test::AdjustUpgrade;
+use Test::More;
+
+# This test involves different multitransaction states, similarly to that of
+# 002_pg_upgrade.pl.
+
+note "checking oldinstall environment variable set";
+unless (defined($ENV{oldinstall}))
+{
+	diag "Set oldinstall environment variable to the pre 64-bit mxoff cluster.";
+	die;
+}
+
+# Temp dir for dumps.
+my $tempdir = PostgreSQL::Test::Utils::tempdir;
+
+# Can be changed to test the other modes.
+my $mode = $ENV{PG_TEST_PG_UPGRADE_MODE} || '--copy';
+
+# Get NextMultiOffset.
+sub next_mxoff
+{
+	my $node = shift;
+
+	my $pg_controldata_path =
+		defined($node->install_path) ?
+			$node->install_path . '/bin/pg_controldata' :
+			'pg_controldata';
+	my ($stdout, $stderr) = run_command([ $pg_controldata_path,
+											$node->data_dir ]);
+	my @control_data = split("\n", $stdout);
+	my $next_mxoff = undef;
+
+	foreach (@control_data)
+	{
+		if ($_ =~ /^Latest checkpoint's NextMultiOffset:\s*(.*)$/mg)
+		{
+			$next_mxoff = $1;
+			last;
+		}
+	}
+	die "NextMultiOffset not found in control file\n"
+		unless defined($next_mxoff);
+
+	return $next_mxoff;
+}
+
+# Consume around 10k of mxoffsets.
+sub mxact_eater
+{
+	my $node = shift;
+	my $tbl = 'FOO';
+
+	my ($mxoff1, $mxoff2);
+
+	$mxoff1 = next_mxoff($node);
+	$node->start;
+	$node->safe_psql('postgres',
+		"CREATE TABLE ${tbl} (I INT PRIMARY KEY, N_UPDATED INT) " .
+		"       WITH (AUTOVACUUM_ENABLED=FALSE);" .
+		"INSERT INTO ${tbl} SELECT G, 0 FROM GENERATE_SERIES(1, 50) G;");
+
+	# consume around 10k mxoff
+	my $nclients = 10;
+	my $update_every = 75;
+	my @connections = ();
+
+	for (0..$nclients)
+	{
+		my $conn = $node->background_psql('postgres');
+		$conn->query_safe("BEGIN");
+
+		push(@connections, $conn);
+	}
+
+	for (my $i = 0; $i < 1000; $i++)
+	{
+		my $conn = $connections[$i % $nclients];
+
+		$conn->query_safe("COMMIT;");
+		$conn->query_safe("BEGIN");
+
+		if ($i % $update_every == 0)
+		{
+			$conn->query_safe(
+				"UPDATE ${tbl} SET " .
+				"N_UPDATED = N_UPDATED + 1 " .
+				"WHERE I = ${i} % 50");
+		}
+		else
+		{
+			$conn->query_safe(
+				"SELECT * FROM ${tbl} FOR KEY SHARE");
+		}
+	}
+
+	for my $conn (@connections)
+	{
+		$conn->quit();
+	}
+
+	$node->stop;
+	$mxoff2 = next_mxoff($node);
+
+	return $mxoff1, $mxoff2;
+}
+
+# Consume around 2M of mxoffsets.
+sub mxact_huge_eater
+{
+	my $node = shift;
+	my $tbl = 'FOO';
+
+	my ($mxoff1, $mxoff2);
+
+	$mxoff1 = next_mxoff($node);
+	$node->start;
+	$node->safe_psql('postgres',
+		"CREATE TABLE ${tbl} (I INT PRIMARY KEY, N_UPDATED INT) " .
+		"       WITH (AUTOVACUUM_ENABLED=FALSE);" .
+		"INSERT INTO ${tbl} SELECT G, 0 FROM GENERATE_SERIES(1, 50) G;");
+
+	# consume around 1M mxoff
+	my $nclients = 10;
+	my $update_every = 95;
+	my @connections = ();
+	my $timeout = 10 * $PostgreSQL::Test::Utils::timeout_default;
+
+	for (0..$nclients)
+	{
+		my $conn = $node->background_psql('postgres',
+										  timeout => $timeout);
+		$conn->query_safe("BEGIN");
+
+		push(@connections, $conn);
+	}
+
+	# It's a long process, better to tell about progress.
+	my $n_steps = 200_000;
+	my $step = int($n_steps / 10);
+
+	diag "\nstart to consume mxoffsets ...\n";
+	for (my $i = 0; $i < $n_steps; $i++)
+	{
+		my $conn = $connections[$i % $nclients];
+
+		$conn->query_safe("COMMIT;");
+		$conn->query_safe("BEGIN");
+
+		if ($i % $update_every == 0)
+		{
+			# Perform some non-key UPDATEs too, to exercise different multixact
+			# member statuses.
+			$conn->query_safe(
+				"UPDATE ${tbl} SET " .
+				"N_UPDATED = N_UPDATED + 1 " .
+				"WHERE I = ${i} % 50");
+		}
+		else
+		{
+			$conn->query_safe(
+				"SELECT * FROM ${tbl} " .
+				"TABLESAMPLE SYSTEM (85) " .
+				"FOR KEY SHARE");
+		}
+
+		if ($i % $step == 0)
+		{
+			my $done = int(($i / $n_steps) * 100);
+			diag "$done% done...";
+		}
+	}
+
+	for my $conn (@connections)
+	{
+		$conn->quit();
+	}
+
+	$node->stop;
+	$mxoff2 = next_mxoff($node);
+
+	return $mxoff1, $mxoff2;
+}
+
+# Set oldest multixact-offset
+sub reset_mxoff
+{
+	my $node = shift;
+	my $offset = shift;
+
+	my $pg_resetwal_path = $node->install_path . '/bin/pg_resetwal';
+
+	# Get block size
+	my $out = (run_command([ $pg_resetwal_path, '--dry-run',
+							 $node->data_dir ]))[0];
+		$out =~ /^Database block size: *(\d+)$/m or die;
+	my $blcksz = $1;
+
+	# Reset to new offset
+	my @cmd = ($pg_resetwal_path, '--pgdata' => $node->data_dir);
+	push @cmd, '--multixact-offset' => $offset;
+	command_ok(\@cmd, 'set oldest multixact-offset');
+
+	# Fill empty pg_multixact/members segment
+	my $mult = 32 * int($blcksz / 20) * 4;
+	my $segname = sprintf "%04X", $offset / $mult;
+
+	my @dd = ('dd');
+	push @dd, "if=/dev/zero";
+	push @dd, "of=" . $node->data_dir . "/pg_multixact/members/" . $segname;
+	push @dd, "bs=$blcksz";
+	push @dd, "count=32";
+	command_ok(\@dd, 'fill empty multixact-members');
+}
+
+sub get_dump_for_comparison
+{
+	my ($node, $db, $file_prefix, $adjust_child_columns) = @_;
+
+	my $dumpfile = $tempdir . '/' . $file_prefix . '.sql';
+	my $dump_adjusted = "${dumpfile}_adjusted";
+
+	open(my $dh, '>', $dump_adjusted)
+	  || die "could not open $dump_adjusted for writing $!";
+
+	$node->run_log(
+		[
+			'pg_dump', '--no-sync',
+			'--restrict-key' => 'test',
+			'-d' => $node->connstr($db),
+			'-f' => $dumpfile
+		]);
+
+	print $dh adjust_regress_dumpfile(slurp_file($dumpfile),
+		$adjust_child_columns);
+	close($dh);
+
+	return $dump_adjusted;
+}
+
+# Main test workhorse routine.
+# Make pg_upgrade, dump data and compare it.
+sub run_test
+{
+	my $tag = shift;
+	my $oldnode = shift;
+	my $newnode = shift;
+
+	command_ok(
+		[
+			'pg_upgrade', '--no-sync',
+			'--old-datadir' => $oldnode->data_dir,
+			'--new-datadir' => $newnode->data_dir,
+			'--old-bindir' => $oldnode->config_data('--bindir'),
+			'--new-bindir' => $newnode->config_data('--bindir'),
+			'--socketdir' => $newnode->host,
+			'--old-port' => $oldnode->port,
+			'--new-port' => $newnode->port,
+			$mode,
+		],
+		'run of pg_upgrade for new instance');
+	ok( !-d $newnode->data_dir . "/pg_upgrade_output.d",
+		"pg_upgrade_output.d/ removed after pg_upgrade success");
+
+	$oldnode->start;
+	my $src_dump =
+		get_dump_for_comparison($oldnode, 'postgres',
+								"oldnode_${tag}_dump", 0);
+	$oldnode->stop;
+
+	$newnode->start;
+	my $dst_dump =
+		get_dump_for_comparison($newnode, 'postgres',
+								"newnode_${tag}_dump", 0);
+	$newnode->stop;
+
+	compare_files($src_dump, $dst_dump,
+		'dump outputs from original and restored regression databases match');
+}
+
+sub to_hex
+{
+	my $arg = shift;
+
+	$arg = Math::BigInt->new($arg);
+	$arg = $arg->as_hex();
+
+	return $arg;
+}
+
+# case #1: start old node from defaults
+{
+	my $tag = 1;
+	my $old =
+		PostgreSQL::Test::Cluster->new("oldnode${tag}",
+									   install_path => $ENV{oldinstall});
+	$old->init(extra => ['-k']);
+
+	my ($start_mxoff, $finish_mxoff) = mxact_eater($old);
+
+	my $new = PostgreSQL::Test::Cluster->new("newnode${tag}");
+	$new->init;
+
+	run_test($tag, $old, $new);
+
+	$start_mxoff = to_hex($start_mxoff);
+	$finish_mxoff = to_hex($finish_mxoff);
+
+	my $next_mxoff = to_hex(next_mxoff($new));
+
+	note ">>> case #${tag}\n" .
+		 " oldnode mxoff from ${start_mxoff} to ${finish_mxoff}\n" .
+		 " newnode mxoff ${next_mxoff}\n";
+}
+
+# case #2: start old node from before 32-bit wraparound
+{
+	my $tag = 2;
+	my $old =
+		PostgreSQL::Test::Cluster->new("oldnode${tag}",
+									   install_path => $ENV{oldinstall});
+
+	$old->init(extra => ['-k']);
+	reset_mxoff($old, 0xFFFF0000);
+
+	my ($start_mxoff, $finish_mxoff) = mxact_eater($old);
+
+	my $new = PostgreSQL::Test::Cluster->new("newnode${tag}");
+	$new->init;
+
+	run_test($tag, $old, $new);
+
+	$start_mxoff = to_hex($start_mxoff);
+	$finish_mxoff = to_hex($finish_mxoff);
+
+	my $next_mxoff = to_hex(next_mxoff($new));
+
+	note ">>> case #${tag}\n" .
+		 " oldnode mxoff from ${start_mxoff} to ${finish_mxoff}\n" .
+		 " newnode mxoff ${next_mxoff}\n";
+}
+
+# case #3: start old node near 32-bit wraparound and reach wraparound state.
+{
+	my $tag = 3;
+	my $old =
+		PostgreSQL::Test::Cluster->new("oldnode${tag}",
+									   install_path => $ENV{oldinstall});
+
+	$old->init(extra => ['-k']);
+
+	reset_mxoff($old, 0xFFFFEC77);
+	my ($start_mxoff, $finish_mxoff) = mxact_eater($old);
+
+	my $new = PostgreSQL::Test::Cluster->new("newnode${tag}");
+	$new->init;
+
+	run_test($tag, $old, $new);
+
+	$start_mxoff = to_hex($start_mxoff);
+	$finish_mxoff = to_hex($finish_mxoff);
+
+	my $next_mxoff = to_hex(next_mxoff($new));
+
+	note ">>> case #${tag}\n" .
+		 " oldnode mxoff from ${start_mxoff} to ${finish_mxoff}\n" .
+		 " newnode mxoff ${next_mxoff}\n";
+}
+
+# case #4: start old node from defaults
+{
+	my $tag = 4;
+	my $old =
+		PostgreSQL::Test::Cluster->new("oldnode${tag}",
+									   install_path => $ENV{oldinstall});
+
+	$old->init(extra => ['-k']);
+
+	diag "test #${tag} for multiple mxoff segments";
+	my ($start_mxoff, $finish_mxoff) = mxact_huge_eater($old);
+
+	my $new = PostgreSQL::Test::Cluster->new("newnode${tag}");
+	$new->init;
+
+	run_test($tag, $old, $new);
+
+	$start_mxoff = to_hex($start_mxoff);
+	$finish_mxoff = to_hex($finish_mxoff);
+
+	my $next_mxoff = to_hex(next_mxoff($new));
+
+	note ">>> case #${tag}\n" .
+		 " oldnode mxoff from ${start_mxoff} to ${finish_mxoff}\n" .
+		 " newnode mxoff ${next_mxoff}\n";
+}
+
+# case #5: start old node from before 32-bit wraparound
+{
+	my $tag = 5;
+	my $old =
+		PostgreSQL::Test::Cluster->new("oldnode${tag}",
+									   install_path => $ENV{oldinstall});
+
+	$old->init(extra => ['-k']);
+	reset_mxoff($old, 0xFFFF0000);
+
+	diag "test #${tag} for multiple mxoff segments";
+	my ($start_mxoff, $finish_mxoff) = mxact_huge_eater($old);
+
+	my $new = PostgreSQL::Test::Cluster->new("newnode${tag}");
+	$new->init;
+
+	run_test($tag, $old, $new);
+
+	$start_mxoff = to_hex($start_mxoff);
+	$finish_mxoff = to_hex($finish_mxoff);
+
+	my $next_mxoff = to_hex(next_mxoff($new));
+
+	note ">>> case #${tag}\n" .
+		 " oldnode mxoff from ${start_mxoff} to ${finish_mxoff}\n" .
+		 " newnode mxoff ${next_mxoff}\n";
+}
+
+# case #6: start old node near 32-bit wraparound and reach wraparound state.
+{
+	my $tag = 6;
+	my $old =
+		PostgreSQL::Test::Cluster->new("oldnode${tag}",
+									   install_path => $ENV{oldinstall});
+
+	$old->init(extra => ['-k']);
+
+	reset_mxoff($old, 0xFFFFFFFF - 1_000_000);
+	my ($start_mxoff, $finish_mxoff) = mxact_huge_eater($old);
+
+	diag "test #${tag} for multiple mxoff segments";
+	my $new = PostgreSQL::Test::Cluster->new("newnode${tag}");
+	$new->init;
+
+	run_test($tag, $old, $new);
+
+	$start_mxoff = to_hex($start_mxoff);
+	$finish_mxoff = to_hex($finish_mxoff);
+
+	my $next_mxoff = to_hex(next_mxoff($new));
+
+	note ">>> case #${tag}\n" .
+		 " oldnode mxoff from ${start_mxoff} to ${finish_mxoff}\n" .
+		 " newnode mxoff ${next_mxoff}\n";
+}
+
+done_testing();
-- 
2.51.0

v19-0003-Add-test-for-64-bit-mxoff-in-pg_resetwal.patch (application/octet-stream)
From b7808bed198fdcc0ffe321f541cca1361b918346 Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Tue, 28 Oct 2025 19:08:26 +0300
Subject: [PATCH v19 3/5] Add test for 64-bit mxoff in pg_resetwal

---
 src/bin/pg_resetwal/meson.build    |   1 +
 src/bin/pg_resetwal/t/003_mxoff.pl | 170 +++++++++++++++++++++++++++++
 2 files changed, 171 insertions(+)
 create mode 100644 src/bin/pg_resetwal/t/003_mxoff.pl

diff --git a/src/bin/pg_resetwal/meson.build b/src/bin/pg_resetwal/meson.build
index 290832b229..1e2dfb38a5 100644
--- a/src/bin/pg_resetwal/meson.build
+++ b/src/bin/pg_resetwal/meson.build
@@ -25,6 +25,7 @@ tests += {
     'tests': [
       't/001_basic.pl',
       't/002_corrupted.pl',
+      't/003_mxoff.pl',
     ],
   },
 }
diff --git a/src/bin/pg_resetwal/t/003_mxoff.pl b/src/bin/pg_resetwal/t/003_mxoff.pl
new file mode 100644
index 0000000000..3c1b7fa1d3
--- /dev/null
+++ b/src/bin/pg_resetwal/t/003_mxoff.pl
@@ -0,0 +1,170 @@
+
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+use strict;
+use warnings FATAL => 'all';
+
+use Math::BigInt;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+sub mxact_eater
+{
+	my $node = shift;
+	my $tbl = shift;
+
+	$node->start;
+	$node->safe_psql('postgres',
+		"CREATE TABLE ${tbl} (I INT PRIMARY KEY, N_UPDATED INT) " .
+		"       WITH (AUTOVACUUM_ENABLED=FALSE);" .
+		"INSERT INTO ${tbl} SELECT G, 0 FROM GENERATE_SERIES(1, 50) G;");
+
+	# consume around 10k multixact-offsets
+	my $nclients = 10;
+	my $update_every = 75;
+	my @connections = ();
+
+	for (0..$nclients)
+	{
+		my $conn = $node->background_psql('postgres');
+		$conn->query_safe("BEGIN");
+
+		push(@connections, $conn);
+	}
+
+	for (my $i = 0; $i < 1000; $i++)
+	{
+		my $conn = $connections[$i % $nclients];
+
+		$conn->query_safe("COMMIT;");
+		$conn->query_safe("BEGIN");
+
+		if ($i % $update_every == 0)
+		{
+			$conn->query_safe(
+				"UPDATE ${tbl} SET " .
+				"N_UPDATED = N_UPDATED + 1 " .
+				"WHERE I = ${i} % 50");
+		}
+		else
+		{
+			$conn->query_safe(
+				"SELECT * FROM ${tbl} FOR KEY SHARE");
+		}
+	}
+
+	for my $conn (@connections)
+	{
+		$conn->quit();
+	}
+
+	$node->stop;
+}
+
+sub next_mxoff
+{
+	my $node = shift;
+	my ($stdout, $stderr) =
+	  run_command([ 'pg_controldata', $node->data_dir ]);
+	my @control_data = split("\n", $stdout);
+	my $next_mxoff = undef;
+
+	foreach (@control_data)
+	{
+		if ($_ =~ /^Latest checkpoint's NextMultiOffset:\s*(.*)$/mg)
+		{
+			$next_mxoff = $1;
+			last;
+		}
+	}
+	die "NextMultiOffset not found in control file\n"
+		unless defined($next_mxoff);
+
+	return $next_mxoff;
+}
+
+sub reset_mxoff
+{
+	my $node = shift;
+	my $offset = shift;
+		$offset = Math::BigInt->new($offset);
+
+	# Get block size
+	my $out = (run_command([ 'pg_resetwal', '--dry-run', $node->data_dir ]))[0];
+		$out =~ /^Database block size: *(\d+)$/m or die;
+	my $blcksz = $1;
+
+	# Reset to new offset
+	my @cmd = ('pg_resetwal', '--pgdata' => $node->data_dir);
+	push @cmd, '--multixact-offset' => $offset->as_hex();
+	command_ok(\@cmd, 'set oldest multixact-offset');
+
+	# Fill empty pg_multixact/members segment
+	my $mult = 32 * int($blcksz / 20) * 4;
+	my $segname = sprintf "%015X", $offset / $mult;
+
+	my @dd = ('dd');
+	push @dd, "if=/dev/zero";
+	push @dd, "of=" . $node->data_dir . "/pg_multixact/members/" . $segname;
+	push @dd, "bs=$blcksz";
+	push @dd, "count=32";
+	command_ok(\@dd, 'fill empty multixact-members');
+}
+
+my ($off1, $off2);
+
+# start from defaults
+my $node1 = PostgreSQL::Test::Cluster->new('node1');
+$node1->init;
+$off1 = next_mxoff($node1);
+mxact_eater($node1, "FOO");
+$off2 = next_mxoff($node1);
+note "> start from $off1, finished at $off2\n";
+
+# start from before 32-bit wraparound
+my $node2 = PostgreSQL::Test::Cluster->new('node2');
+$node2->init;
+reset_mxoff($node2, 0xFFFF0000);
+$off1 = next_mxoff($node2);
+mxact_eater($node2, "FOO");
+$off2 = next_mxoff($node2);
+note "> start from $off1, finished at $off2\n";
+
+# start near 32-bit wraparound
+my $node3 = PostgreSQL::Test::Cluster->new('node3');
+$node3->init;
+reset_mxoff($node3, 0xFFFFEC77);
+$off1 = next_mxoff($node3);
+mxact_eater($node3, "FOO");
+$off2 = next_mxoff($node3);
+note "> start from $off1, finished at $off2\n";
+
+# start over 32-bit wraparound
+my $node4 = PostgreSQL::Test::Cluster->new('node4');
+$node4->init;
+reset_mxoff($node4, '0xFFFFFFFF0000');
+$off1 = next_mxoff($node4);
+mxact_eater($node4, "FOO");
+$off2 = next_mxoff($node4);
+note "> start from $off1, finished at $off2\n";
+
+# check invariant
+$node1->start;
+$node2->start;
+$node3->start;
+$node4->start;
+
+my $var1 = $node1->safe_psql('postgres', 'TABLE FOO');
+my $var2 = $node2->safe_psql('postgres', 'TABLE FOO');
+my $var3 = $node3->safe_psql('postgres', 'TABLE FOO');
+my $var4 = $node4->safe_psql('postgres', 'TABLE FOO');
+ok($var1 eq $var2 && $var2 eq $var3 && $var3 eq $var4,
+	'check table invariant in all nodes');
+
+$node4->stop;
+$node3->stop;
+$node2->stop;
+$node1->stop;
+
+done_testing();
-- 
2.51.0

v19-0001-Use-64-bit-multixact-offsets.patch (application/octet-stream)
From c1b911cffbbb52a4479f1b47318c4d73bb671bb6 Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Wed, 7 Aug 2024 16:35:22 +0300
Subject: [PATCH v19 1/5] Use 64-bit multixact offsets

Switching to 64-bit multitransaction offsets removes wraparound and the
2^32 limit on their total number.

On the other hand, this change inevitably increases disc space
utilisation.  Fortunately, multitransaction offsets rise monotonically
and without gaps.  To conserve disc space consumed by segments, we
write a 64-bit "base" at the start of each page, which also serves as
the page's first offset.  All subsequent offsets on the page are
calculated relative to this "base".

Author: Maxim Orlov <orlovmg@gmail.com>
---
 src/backend/access/rmgrdesc/mxactdesc.c   |   4 +-
 src/backend/access/rmgrdesc/xlogdesc.c    |   2 +-
 src/backend/access/transam/multixact.c    | 375 +++++++---------------
 src/backend/access/transam/xlogrecovery.c |   2 +-
 src/backend/commands/vacuum.c             |   2 +-
 src/backend/postmaster/autovacuum.c       |   4 +-
 src/bin/pg_controldata/pg_controldata.c   |   2 +-
 src/bin/pg_resetwal/pg_resetwal.c         |   6 +-
 src/bin/pg_resetwal/t/001_basic.pl        |   2 +-
 src/include/access/multixact.h            |   3 +-
 src/include/c.h                           |   2 +-
 11 files changed, 125 insertions(+), 279 deletions(-)

diff --git a/src/backend/access/rmgrdesc/mxactdesc.c b/src/backend/access/rmgrdesc/mxactdesc.c
index 3ca0582db3..052dd0a4ce 100644
--- a/src/backend/access/rmgrdesc/mxactdesc.c
+++ b/src/backend/access/rmgrdesc/mxactdesc.c
@@ -65,7 +65,7 @@ multixact_desc(StringInfo buf, XLogReaderState *record)
 		xl_multixact_create *xlrec = (xl_multixact_create *) rec;
 		int			i;
 
-		appendStringInfo(buf, "%u offset %u nmembers %d: ", xlrec->mid,
+		appendStringInfo(buf, "%u offset %" PRIu64 " nmembers %d: ", xlrec->mid,
 						 xlrec->moff, xlrec->nmembers);
 		for (i = 0; i < xlrec->nmembers; i++)
 			out_member(buf, &xlrec->members[i]);
@@ -74,7 +74,7 @@ multixact_desc(StringInfo buf, XLogReaderState *record)
 	{
 		xl_multixact_truncate *xlrec = (xl_multixact_truncate *) rec;
 
-		appendStringInfo(buf, "offsets [%u, %u), members [%u, %u)",
+		appendStringInfo(buf, "offsets [%u, %u), members [%" PRIu64 ", %" PRIu64 ")",
 						 xlrec->startTruncOff, xlrec->endTruncOff,
 						 xlrec->startTruncMemb, xlrec->endTruncMemb);
 	}
diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c
index cd6c2a2f65..441034f592 100644
--- a/src/backend/access/rmgrdesc/xlogdesc.c
+++ b/src/backend/access/rmgrdesc/xlogdesc.c
@@ -66,7 +66,7 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 		CheckPoint *checkpoint = (CheckPoint *) rec;
 
 		appendStringInfo(buf, "redo %X/%08X; "
-						 "tli %u; prev tli %u; fpw %s; wal_level %s; xid %u:%u; oid %u; multi %u; offset %u; "
+						 "tli %u; prev tli %u; fpw %s; wal_level %s; xid %u:%u; oid %u; multi %u; offset %" PRIu64 "; "
 						 "oldest xid %u in DB %u; oldest multi %u in DB %u; "
 						 "oldest/newest commit timestamp xid: %u/%u; "
 						 "oldest running xid %u; %s",
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 9d5f130af7..e7861614be 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -89,21 +89,31 @@
 #include "utils/memutils.h"
 
 
+typedef int32 ShortMultiXactOffset;	/* for a disk storage */
+
 /*
  * Defines for MultiXactOffset page sizes.  A page is the same BLCKSZ as is
  * used everywhere else in Postgres.
  *
- * Note: because MultiXactOffsets are 32 bits and wrap around at 0xFFFFFFFF,
- * MultiXact page numbering also wraps around at
- * 0xFFFFFFFF/MULTIXACT_OFFSETS_PER_PAGE, and segment numbering at
- * 0xFFFFFFFF/MULTIXACT_OFFSETS_PER_PAGE/SLRU_PAGES_PER_SEGMENT.  We need
- * take no explicit notice of that fact in this module, except when comparing
- * segment and page numbers in TruncateMultiXact (see
- * MultiXactOffsetPagePrecedes).
+ * There are two reasons why storing straightforward 64-bit offset values
+ * would be wasteful in terms of disc space usage:
+ * 1) offset values are recorded in ascending order and not overwritten;
+ * 2) the largest supported BLCKSZ is 32k, which can store up to 2^13 32-bit
+ *    items on a single page;  thus, with MAX_BACKENDS limited to 2^18-1 we have
+ *    2^13 * (2^18-1), which is less than 2^31 and fits in 32 bits.
+ *
+ * In other words, the maximum "distance" between offsets on a single page does
+ * not exceed 32 bits.  To optimise disc space allocation, we employ the
+ * following scheme.  On each page, the 64-bit offset known as the page base is
+ * located first.  It is followed by 32-bit deltas relative to the base.
+ * Thus, the offset for the 0-th element is the page's base; the value for
+ * each subsequent offset on the same page is calculated by adding its delta
+ * to the page base.
  */
 
-/* We need four bytes per offset */
-#define MULTIXACT_OFFSETS_PER_PAGE (BLCKSZ / sizeof(MultiXactOffset))
+/* We need four bytes per offset, 8 bytes for the base */
+#define MULTIXACT_OFFSETS_PER_PAGE		\
+	((BLCKSZ - sizeof(MultiXactOffset)) / sizeof(ShortMultiXactOffset))
 
 static inline int64
 MultiXactIdToOffsetPage(MultiXactId multi)
@@ -208,10 +218,14 @@ MXOffsetToMemberOffset(MultiXactOffset offset)
 		member_in_group * sizeof(TransactionId);
 }
 
-/* Multixact members wraparound thresholds. */
-#define MULTIXACT_MEMBER_SAFE_THRESHOLD		(MaxMultiXactOffset / 2)
-#define MULTIXACT_MEMBER_DANGER_THRESHOLD	\
-	(MaxMultiXactOffset - MaxMultiXactOffset / 4)
+/*
+ * Multixact members warning threshold.
+ *
+ * If the difference between nextOffset and oldestOffset exceeds this value, we
+ * trigger autovacuum in order to release disk space and reduce table bloat if
+ * possible.
+ */
+#define MULTIXACT_MEMBER_AUTOVAC_THRESHOLD		UINT64CONST(0xFFFFFFFF)
 
 static inline MultiXactId
 PreviousMultiXactId(MultiXactId multi)
@@ -228,6 +242,51 @@ static SlruCtlData MultiXactMemberCtlData;
 #define MultiXactOffsetCtl	(&MultiXactOffsetCtlData)
 #define MultiXactMemberCtl	(&MultiXactMemberCtlData)
 
+/*
+ * To avoid diving deep into address arithmetic, we declare an auxiliary
+ * structure for accessing the MultiXactOffset SLRU page.
+ */
+typedef struct MultiXactOffsetSLRUPage
+{
+	MultiXactOffset			base;
+	ShortMultiXactOffset	offset[FLEXIBLE_ARRAY_MEMBER];
+} MultiXactOffsetSLRUPage;
+
+static inline MultiXactOffset
+MXOffsetRead(int entryno, int slotno)
+{
+	MultiXactOffsetSLRUPage *page =
+		(MultiXactOffsetSLRUPage *) MultiXactOffsetCtl->shared->page_buffer[slotno];
+
+	if (page->offset[entryno] != 0)
+		return page->base + (page->offset[entryno] ^ 0x80000000);
+
+	return 0;
+}
+
+static inline void
+MXOffsetWrite(int entryno, int slotno, MultiXactOffset offset)
+{
+	MultiXactOffsetSLRUPage *page =
+		(MultiXactOffsetSLRUPage *) MultiXactOffsetCtl->shared->page_buffer[slotno];
+
+	if (page->base != 0)
+		page->offset[entryno] = offset - page->base;
+	else
+	{
+		page->base = offset;
+		page->offset[entryno] = 0;
+	}
+
+	/*
+	 * We need to distinguish a written delta of zero from an offset that has
+	 * not been written yet.  See case 2 in GetMultiXactIdMembers.
+	 *
+	 * So, mark this offset as initialised.
+	 */
+	page->offset[entryno] ^= 0x80000000;
+}
+
 /*
  * MultiXact state shared across all backends.  All this state is protected
  * by MultiXactGenLock.  (We also use SLRU bank's lock of MultiXactOffset and
@@ -268,9 +327,6 @@ typedef struct MultiXactStateData
 	MultiXactId multiStopLimit;
 	MultiXactId multiWrapLimit;
 
-	/* support for members anti-wraparound measures */
-	MultiXactOffset offsetStopLimit;	/* known if oldestOffsetKnown */
-
 	/*
 	 * This is used to sleep until a multixact offset is written when we want
 	 * to create the next one.
@@ -401,8 +457,6 @@ static bool MultiXactOffsetPrecedes(MultiXactOffset offset1,
 									MultiXactOffset offset2);
 static void ExtendMultiXactOffset(MultiXactId multi);
 static void ExtendMultiXactMember(MultiXactOffset offset, int nmembers);
-static bool MultiXactOffsetWouldWrap(MultiXactOffset boundary,
-									 MultiXactOffset start, uint32 distance);
 static bool SetOffsetVacuumLimit(bool is_startup);
 static bool find_multixact_start(MultiXactId multi, MultiXactOffset *result);
 static void WriteMTruncateXlogRec(Oid oldestMultiDB,
@@ -911,7 +965,6 @@ RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
 	int64		prev_pageno;
 	int			entryno;
 	int			slotno;
-	MultiXactOffset *offptr;
 	int			i;
 	LWLock	   *lock;
 	LWLock	   *prevlock = NULL;
@@ -930,10 +983,8 @@ RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
 	 * take the trouble to generalize the slru.c error reporting code.
 	 */
 	slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, multi);
-	offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
-	offptr += entryno;
 
-	*offptr = offset;
+	MXOffsetWrite(entryno, slotno, offset);
 
 	MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
 
@@ -1155,78 +1206,6 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 	else
 		*offset = nextOffset;
 
-	/*----------
-	 * Protect against overrun of the members space as well, with the
-	 * following rules:
-	 *
-	 * If we're past offsetStopLimit, refuse to generate more multis.
-	 * If we're close to offsetStopLimit, emit a warning.
-	 *
-	 * Arbitrarily, we start emitting warnings when we're 20 segments or less
-	 * from offsetStopLimit.
-	 *
-	 * Note we haven't updated the shared state yet, so if we fail at this
-	 * point, the multixact ID we grabbed can still be used by the next guy.
-	 *
-	 * Note that there is no point in forcing autovacuum runs here: the
-	 * multixact freeze settings would have to be reduced for that to have any
-	 * effect.
-	 *----------
-	 */
-#define OFFSET_WARN_SEGMENTS	20
-	if (MultiXactState->oldestOffsetKnown &&
-		MultiXactOffsetWouldWrap(MultiXactState->offsetStopLimit, nextOffset,
-								 nmembers))
-	{
-		/* see comment in the corresponding offsets wraparound case */
-		SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER);
-
-		ereport(ERROR,
-				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg("multixact \"members\" limit exceeded"),
-				 errdetail_plural("This command would create a multixact with %u members, but the remaining space is only enough for %u member.",
-								  "This command would create a multixact with %u members, but the remaining space is only enough for %u members.",
-								  MultiXactState->offsetStopLimit - nextOffset - 1,
-								  nmembers,
-								  MultiXactState->offsetStopLimit - nextOffset - 1),
-				 errhint("Execute a database-wide VACUUM in database with OID %u with reduced \"vacuum_multixact_freeze_min_age\" and \"vacuum_multixact_freeze_table_age\" settings.",
-						 MultiXactState->oldestMultiXactDB)));
-	}
-
-	/*
-	 * Check whether we should kick autovacuum into action, to prevent members
-	 * wraparound. NB we use a much larger window to trigger autovacuum than
-	 * just the warning limit. The warning is just a measure of last resort -
-	 * this is in line with GetNewTransactionId's behaviour.
-	 */
-	if (!MultiXactState->oldestOffsetKnown ||
-		(MultiXactState->nextOffset - MultiXactState->oldestOffset
-		 > MULTIXACT_MEMBER_SAFE_THRESHOLD))
-	{
-		/*
-		 * To avoid swamping the postmaster with signals, we issue the autovac
-		 * request only when crossing a segment boundary. With default
-		 * compilation settings that's roughly after 50k members.  This still
-		 * gives plenty of chances before we get into real trouble.
-		 */
-		if ((MXOffsetToMemberPage(nextOffset) / SLRU_PAGES_PER_SEGMENT) !=
-			(MXOffsetToMemberPage(nextOffset + nmembers) / SLRU_PAGES_PER_SEGMENT))
-			SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER);
-	}
-
-	if (MultiXactState->oldestOffsetKnown &&
-		MultiXactOffsetWouldWrap(MultiXactState->offsetStopLimit,
-								 nextOffset,
-								 nmembers + MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT * OFFSET_WARN_SEGMENTS))
-		ereport(WARNING,
-				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg_plural("database with OID %u must be vacuumed before %d more multixact member is used",
-							   "database with OID %u must be vacuumed before %d more multixact members are used",
-							   MultiXactState->offsetStopLimit - nextOffset + nmembers,
-							   MultiXactState->oldestMultiXactDB,
-							   MultiXactState->offsetStopLimit - nextOffset + nmembers),
-				 errhint("Execute a database-wide VACUUM in that database with reduced \"vacuum_multixact_freeze_min_age\" and \"vacuum_multixact_freeze_table_age\" settings.")));
-
 	ExtendMultiXactMember(nextOffset, nmembers);
 
 	/*
@@ -1255,7 +1234,8 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 
 	LWLockRelease(MultiXactGenLock);
 
-	debug_elog4(DEBUG2, "GetNew: returning %u offset %u", result, *offset);
+	debug_elog4(DEBUG2, "GetNew: returning %u offset %" PRIu64, result,
+				*offset);
 	return result;
 }
 
@@ -1294,7 +1274,6 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
 	int64		prev_pageno;
 	int			entryno;
 	int			slotno;
-	MultiXactOffset *offptr;
 	MultiXactOffset offset;
 	int			length;
 	int			truelength;
@@ -1418,9 +1397,8 @@ retry:
 	LWLockAcquire(lock, LW_EXCLUSIVE);
 
 	slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, multi);
-	offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
-	offptr += entryno;
-	offset = *offptr;
+
+	offset = MXOffsetRead(entryno, slotno);
 
 	Assert(offset != 0);
 
@@ -1467,9 +1445,7 @@ retry:
 			slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, tmpMXact);
 		}
 
-		offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
-		offptr += entryno;
-		nextMXOffset = *offptr;
+		nextMXOffset = MXOffsetRead(entryno, slotno);
 
 		if (nextMXOffset == 0)
 		{
@@ -1973,7 +1949,7 @@ MultiXactShmemInit(void)
 				  "pg_multixact/members", LWTRANCHE_MULTIXACTMEMBER_BUFFER,
 				  LWTRANCHE_MULTIXACTMEMBER_SLRU,
 				  SYNC_HANDLER_MULTIXACT_MEMBER,
-				  false);
+				  true);
 	/* doesn't call SimpleLruTruncate() or meet criteria for unit tests */
 
 	/* Initialize our shared state struct */
@@ -2149,9 +2125,24 @@ TrimMultiXact(void)
 		LWLockAcquire(lock, LW_EXCLUSIVE);
 		slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, nextMXact);
 		offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
-		offptr += entryno;
 
-		MemSet(offptr, 0, BLCKSZ - (entryno * sizeof(MultiXactOffset)));
+		if (entryno == 0)
+			MemSet(offptr, 0, BLCKSZ);
+		else
+		{
+			ShortMultiXactOffset *off32ptr;
+
+			off32ptr = (ShortMultiXactOffset *) (offptr + 1);
+			off32ptr += entryno;
+
+			/*
+			 * Knowing that offptr points to the beginning of the buffer,
+			 * address arithmetic can be used to determine the number of
+			 * bytes remaining.
+			 */
+			MemSet(off32ptr, 0,
+				   BLCKSZ - (((char *) off32ptr - (char *) offptr)));
+		}
 
 		MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
 		LWLockRelease(lock);
@@ -2223,7 +2214,7 @@ MultiXactGetCheckptMulti(bool is_shutdown,
 	LWLockRelease(MultiXactGenLock);
 
 	debug_elog6(DEBUG2,
-				"MultiXact: checkpoint is nextMulti %u, nextOffset %u, oldestMulti %u in DB %u",
+				"MultiXact: checkpoint is nextMulti %u, nextOffset %" PRIu64 ", oldestMulti %u in DB %u",
 				*nextMulti, *nextMultiOffset, *oldestMulti, *oldestMultiDB);
 }
 
@@ -2258,7 +2249,7 @@ void
 MultiXactSetNextMXact(MultiXactId nextMulti,
 					  MultiXactOffset nextMultiOffset)
 {
-	debug_elog4(DEBUG2, "MultiXact: setting next multi to %u offset %u",
+	debug_elog4(DEBUG2, "MultiXact: setting next multi to %u offset %" PRIu64,
 				nextMulti, nextMultiOffset);
 	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
 	MultiXactState->nextMXact = nextMulti;
@@ -2449,7 +2440,7 @@ MultiXactAdvanceNextMXact(MultiXactId minMulti,
 	}
 	if (MultiXactOffsetPrecedes(MultiXactState->nextOffset, minMultiOffset))
 	{
-		debug_elog3(DEBUG2, "MultiXact: setting next offset to %u",
+		debug_elog3(DEBUG2, "MultiXact: setting next offset to %" PRIu64,
 					minMultiOffset);
 		MultiXactState->nextOffset = minMultiOffset;
 	}
@@ -2633,15 +2624,13 @@ GetOldestMultiXactId(void)
 }
 
 /*
- * Determine how aggressively we need to vacuum in order to prevent member
- * wraparound.
+ * Determine whether we need to vacuum in order to free up member space.
  *
  * To do so determine what's the oldest member offset and install the limit
  * info in MultiXactState, where it can be used to prevent overrun of old data
  * in the members SLRU area.
  *
- * The return value is true if emergency autovacuum is required and false
- * otherwise.
+ * The return value is true if autovacuum is required and false otherwise.
  */
 static bool
 SetOffsetVacuumLimit(bool is_startup)
@@ -2653,8 +2642,6 @@ SetOffsetVacuumLimit(bool is_startup)
 	MultiXactOffset nextOffset;
 	bool		oldestOffsetKnown = false;
 	bool		prevOldestOffsetKnown;
-	MultiXactOffset offsetStopLimit = 0;
-	MultiXactOffset prevOffsetStopLimit;
 
 	/*
 	 * NB: Have to prevent concurrent truncation, we might otherwise try to
@@ -2669,7 +2656,6 @@ SetOffsetVacuumLimit(bool is_startup)
 	nextOffset = MultiXactState->nextOffset;
 	prevOldestOffsetKnown = MultiXactState->oldestOffsetKnown;
 	prevOldestOffset = MultiXactState->oldestOffset;
-	prevOffsetStopLimit = MultiXactState->offsetStopLimit;
 	Assert(MultiXactState->finishedStartup);
 	LWLockRelease(MultiXactGenLock);
 
@@ -2700,11 +2686,7 @@ SetOffsetVacuumLimit(bool is_startup)
 		oldestOffsetKnown =
 			find_multixact_start(oldestMultiXactId, &oldestOffset);
 
-		if (oldestOffsetKnown)
-			ereport(DEBUG1,
-					(errmsg_internal("oldest MultiXactId member is at offset %u",
-									 oldestOffset)));
-		else
+		if (!oldestOffsetKnown)
 			ereport(LOG,
 					(errmsg("MultiXact member wraparound protections are disabled because oldest checkpointed MultiXact %u does not exist on disk",
 							oldestMultiXactId)));
@@ -2717,24 +2699,7 @@ SetOffsetVacuumLimit(bool is_startup)
 	 * overrun of old data in the members SLRU area. We can only do so if the
 	 * oldest offset is known though.
 	 */
-	if (oldestOffsetKnown)
-	{
-		/* move back to start of the corresponding segment */
-		offsetStopLimit = oldestOffset - (oldestOffset %
-										  (MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT));
-
-		/* always leave one segment before the wraparound point */
-		offsetStopLimit -= (MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT);
-
-		if (!prevOldestOffsetKnown && !is_startup)
-			ereport(LOG,
-					(errmsg("MultiXact member wraparound protections are now enabled")));
-
-		ereport(DEBUG1,
-				(errmsg_internal("MultiXact member stop limit is now %u based on MultiXact %u",
-								 offsetStopLimit, oldestMultiXactId)));
-	}
-	else if (prevOldestOffsetKnown)
+	if (prevOldestOffsetKnown)
 	{
 		/*
 		 * If we failed to get the oldest offset this time, but we have a
@@ -2744,69 +2709,19 @@ SetOffsetVacuumLimit(bool is_startup)
 		 */
 		oldestOffset = prevOldestOffset;
 		oldestOffsetKnown = true;
-		offsetStopLimit = prevOffsetStopLimit;
 	}
 
 	/* Install the computed values */
 	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
 	MultiXactState->oldestOffset = oldestOffset;
 	MultiXactState->oldestOffsetKnown = oldestOffsetKnown;
-	MultiXactState->offsetStopLimit = offsetStopLimit;
 	LWLockRelease(MultiXactGenLock);
 
 	/*
-	 * Do we need an emergency autovacuum?	If we're not sure, assume yes.
+	 * Do we need autovacuum?	If we're not sure, assume yes.
 	 */
 	return !oldestOffsetKnown ||
-		(nextOffset - oldestOffset > MULTIXACT_MEMBER_SAFE_THRESHOLD);
-}
-
-/*
- * Return whether adding "distance" to "start" would move past "boundary".
- *
- * We use this to determine whether the addition is "wrapping around" the
- * boundary point, hence the name.  The reason we don't want to use the regular
- * 2^31-modulo arithmetic here is that we want to be able to use the whole of
- * the 2^32-1 space here, allowing for more multixacts than would fit
- * otherwise.
- */
-static bool
-MultiXactOffsetWouldWrap(MultiXactOffset boundary, MultiXactOffset start,
-						 uint32 distance)
-{
-	MultiXactOffset finish;
-
-	/*
-	 * Note that offset number 0 is not used (see GetMultiXactIdMembers), so
-	 * if the addition wraps around the UINT_MAX boundary, skip that value.
-	 */
-	finish = start + distance;
-	if (finish < start)
-		finish++;
-
-	/*-----------------------------------------------------------------------
-	 * When the boundary is numerically greater than the starting point, any
-	 * value numerically between the two is not wrapped:
-	 *
-	 *	<----S----B---->
-	 *	[---)			 = F wrapped past B (and UINT_MAX)
-	 *		 [---)		 = F not wrapped
-	 *			  [----] = F wrapped past B
-	 *
-	 * When the boundary is numerically less than the starting point (i.e. the
-	 * UINT_MAX wraparound occurs somewhere in between) then all values in
-	 * between are wrapped:
-	 *
-	 *	<----B----S---->
-	 *	[---)			 = F not wrapped past B (but wrapped past UINT_MAX)
-	 *		 [---)		 = F wrapped past B (and UINT_MAX)
-	 *			  [----] = F not wrapped
-	 *-----------------------------------------------------------------------
-	 */
-	if (start < boundary)
-		return finish >= boundary || finish < start;
-	else
-		return finish >= boundary && finish < start;
+		(nextOffset - oldestOffset > MULTIXACT_MEMBER_AUTOVAC_THRESHOLD);
 }
 
 /*
@@ -2825,7 +2740,6 @@ find_multixact_start(MultiXactId multi, MultiXactOffset *result)
 	int64		pageno;
 	int			entryno;
 	int			slotno;
-	MultiXactOffset *offptr;
 
 	Assert(MultiXactState->finishedStartup);
 
@@ -2843,9 +2757,9 @@ find_multixact_start(MultiXactId multi, MultiXactOffset *result)
 
 	/* lock is acquired by SimpleLruReadPage_ReadOnly */
 	slotno = SimpleLruReadPage_ReadOnly(MultiXactOffsetCtl, pageno, multi);
-	offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
-	offptr += entryno;
-	offset = *offptr;
+
+	offset = MXOffsetRead(entryno, slotno);
+
 	LWLockRelease(SimpleLruGetBankLock(MultiXactOffsetCtl, pageno));
 
 	*result = offset;
@@ -2893,73 +2807,6 @@ GetMultiXactInfo(uint32 *multixacts, MultiXactOffset *members,
 	return true;
 }
 
-/*
- * Multixact members can be removed once the multixacts that refer to them
- * are older than every datminmxid.  autovacuum_multixact_freeze_max_age and
- * vacuum_multixact_freeze_table_age work together to make sure we never have
- * too many multixacts; we hope that, at least under normal circumstances,
- * this will also be sufficient to keep us from using too many offsets.
- * However, if the average multixact has many members, we might exhaust the
- * members space while still using few enough members that these limits fail
- * to trigger relminmxid advancement by VACUUM.  At that point, we'd have no
- * choice but to start failing multixact-creating operations with an error.
- *
- * To prevent that, if more than a threshold portion of the members space is
- * used, we effectively reduce autovacuum_multixact_freeze_max_age and
- * to a value just less than the number of multixacts in use.  We hope that
- * this will quickly trigger autovacuuming on the table or tables with the
- * oldest relminmxid, thus allowing datminmxid values to advance and removing
- * some members.
- *
- * As the fraction of the member space currently in use grows, we become
- * more aggressive in clamping this value.  That not only causes autovacuum
- * to ramp up, but also makes any manual vacuums the user issues more
- * aggressive.  This happens because vacuum_get_cutoffs() will clamp the
- * freeze table and the minimum freeze age cutoffs based on the effective
- * autovacuum_multixact_freeze_max_age this function returns.  In the worst
- * case, we'll claim the freeze_max_age to zero, and every vacuum of any
- * table will freeze every multixact.
- */
-int
-MultiXactMemberFreezeThreshold(void)
-{
-	MultiXactOffset members;
-	uint32		multixacts;
-	uint32		victim_multixacts;
-	double		fraction;
-	int			result;
-	MultiXactId oldestMultiXactId;
-	MultiXactOffset oldestOffset;
-
-	/* If we can't determine member space utilization, assume the worst. */
-	if (!GetMultiXactInfo(&multixacts, &members, &oldestMultiXactId, &oldestOffset))
-		return 0;
-
-	/* If member space utilization is low, no special action is required. */
-	if (members <= MULTIXACT_MEMBER_SAFE_THRESHOLD)
-		return autovacuum_multixact_freeze_max_age;
-
-	/*
-	 * Compute a target for relminmxid advancement.  The number of multixacts
-	 * we try to eliminate from the system is based on how far we are past
-	 * MULTIXACT_MEMBER_SAFE_THRESHOLD.
-	 */
-	fraction = (double) (members - MULTIXACT_MEMBER_SAFE_THRESHOLD) /
-		(MULTIXACT_MEMBER_DANGER_THRESHOLD - MULTIXACT_MEMBER_SAFE_THRESHOLD);
-	victim_multixacts = multixacts * fraction;
-
-	/* fraction could be > 1.0, but lowest possible freeze age is zero */
-	if (victim_multixacts > multixacts)
-		return 0;
-	result = multixacts - victim_multixacts;
-
-	/*
-	 * Clamp to autovacuum_multixact_freeze_max_age, so that we never make
-	 * autovacuum less aggressive than it would otherwise be.
-	 */
-	return Min(result, autovacuum_multixact_freeze_max_age);
-}
-
 typedef struct mxtruncinfo
 {
 	int64		earliestExistingPage;
@@ -3159,7 +3006,7 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 
 	elog(DEBUG1, "performing multixact truncation: "
 		 "offsets [%u, %u), offsets segments [%" PRIx64 ", %" PRIx64 "), "
-		 "members [%u, %u), members segments [%" PRIx64 ", %" PRIx64 ")",
+		 "members [%" PRIu64 ", %" PRIu64 "), members segments [%" PRIx64 ", %" PRIx64 ")",
 		 oldestMulti, newOldestMulti,
 		 MultiXactIdToOffsetSegment(oldestMulti),
 		 MultiXactIdToOffsetSegment(newOldestMulti),
@@ -3290,7 +3137,7 @@ MultiXactIdPrecedesOrEquals(MultiXactId multi1, MultiXactId multi2)
 static bool
 MultiXactOffsetPrecedes(MultiXactOffset offset1, MultiXactOffset offset2)
 {
-	int32		diff = (int32) (offset1 - offset2);
+	int64		diff = (int64) (offset1 - offset2);
 
 	return (diff < 0);
 }
@@ -3387,7 +3234,7 @@ multixact_redo(XLogReaderState *record)
 
 		elog(DEBUG1, "replaying multixact truncation: "
 			 "offsets [%u, %u), offsets segments [%" PRIx64 ", %" PRIx64 "), "
-			 "members [%u, %u), members segments [%" PRIx64 ", %" PRIx64 ")",
+			 "members [%" PRIu64 ", %" PRIu64 "), members segments [%" PRIx64 ", %" PRIx64 ")",
 			 xlrec.startTruncOff, xlrec.endTruncOff,
 			 MultiXactIdToOffsetSegment(xlrec.startTruncOff),
 			 MultiXactIdToOffsetSegment(xlrec.endTruncOff),
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 3e3c4da01a..3b2b0a522c 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -885,7 +885,7 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 							 U64FromFullTransactionId(checkPoint.nextXid),
 							 checkPoint.nextOid)));
 	ereport(DEBUG1,
-			(errmsg_internal("next MultiXactId: %u; next MultiXactOffset: %u",
+			(errmsg_internal("next MultiXactId: %u; next MultiXactOffset: %" PRIu64,
 							 checkPoint.nextMulti, checkPoint.nextMultiOffset)));
 	ereport(DEBUG1,
 			(errmsg_internal("oldest unfrozen transaction ID: %u, in database %u",
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index ed03e3bd50..259ef60bd3 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -1147,7 +1147,7 @@ vacuum_get_cutoffs(Relation rel, const VacuumParams params,
 	 * normally autovacuum_multixact_freeze_max_age, but may be less if we are
 	 * short of multixact member space.
 	 */
-	effective_multixact_freeze_max_age = MultiXactMemberFreezeThreshold();
+	effective_multixact_freeze_max_age = autovacuum_multixact_freeze_max_age;
 
 	/*
 	 * Almost ready to set freeze output parameters; check if OldestXmin or
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 5084af7dfb..26385470c1 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -1151,7 +1151,7 @@ do_start_worker(void)
 
 	/* Also determine the oldest datminmxid we will consider. */
 	recentMulti = ReadNextMultiXactId();
-	multiForceLimit = recentMulti - MultiXactMemberFreezeThreshold();
+	multiForceLimit = recentMulti - autovacuum_multixact_freeze_max_age;
 	if (multiForceLimit < FirstMultiXactId)
 		multiForceLimit -= FirstMultiXactId;
 
@@ -1939,7 +1939,7 @@ do_autovacuum(void)
 	 * normally autovacuum_multixact_freeze_max_age, but may be less if we are
 	 * short of multixact member space.
 	 */
-	effective_multixact_freeze_max_age = MultiXactMemberFreezeThreshold();
+	effective_multixact_freeze_max_age = autovacuum_multixact_freeze_max_age;
 
 	/*
 	 * Find the pg_database entry and select the default freeze ages. We use
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index 10de058ce9..5295108ade 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -264,7 +264,7 @@ main(int argc, char *argv[])
 		   ControlFile->checkPointCopy.nextOid);
 	printf(_("Latest checkpoint's NextMultiXactId:  %u\n"),
 		   ControlFile->checkPointCopy.nextMulti);
-	printf(_("Latest checkpoint's NextMultiOffset:  %u\n"),
+	printf(_("Latest checkpoint's NextMultiOffset:  %" PRIu64 "\n"),
 		   ControlFile->checkPointCopy.nextMultiOffset);
 	printf(_("Latest checkpoint's oldestXID:        %u\n"),
 		   ControlFile->checkPointCopy.oldestXid);
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index a89d72fc5c..4e5eeced89 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -267,7 +267,7 @@ main(int argc, char *argv[])
 
 			case 'O':
 				errno = 0;
-				set_mxoff = strtoul(optarg, &endptr, 0);
+				set_mxoff = strtou64(optarg, &endptr, 0);
 				if (endptr == optarg || *endptr != '\0' || errno != 0)
 				{
 					pg_log_error("invalid argument for option %s", "-O");
@@ -743,7 +743,7 @@ PrintControlValues(bool guessed)
 		   ControlFile.checkPointCopy.nextOid);
 	printf(_("Latest checkpoint's NextMultiXactId:  %u\n"),
 		   ControlFile.checkPointCopy.nextMulti);
-	printf(_("Latest checkpoint's NextMultiOffset:  %u\n"),
+	printf(_("Latest checkpoint's NextMultiOffset:  %" PRIu64 "\n"),
 		   ControlFile.checkPointCopy.nextMultiOffset);
 	printf(_("Latest checkpoint's oldestXID:        %u\n"),
 		   ControlFile.checkPointCopy.oldestXid);
@@ -817,7 +817,7 @@ PrintNewControlValues(void)
 
 	if (set_mxoff != -1)
 	{
-		printf(_("NextMultiOffset:                      %u\n"),
+		printf(_("NextMultiOffset:                      %" PRIu64 "\n"),
 			   ControlFile.checkPointCopy.nextMultiOffset);
 	}
 
diff --git a/src/bin/pg_resetwal/t/001_basic.pl b/src/bin/pg_resetwal/t/001_basic.pl
index d6bbbd0ced..16b5a62390 100644
--- a/src/bin/pg_resetwal/t/001_basic.pl
+++ b/src/bin/pg_resetwal/t/001_basic.pl
@@ -213,7 +213,7 @@ push @cmd,
   sprintf("%d,%d", hex($files[0]) == 0 ? 3 : hex($files[0]), hex($files[-1]));
 
 @files = get_slru_files('pg_multixact/offsets');
-$mult = 32 * $blcksz / 4;
+$mult = 32 * ($blcksz - 8) / 4;
 # --multixact-ids argument is "new,old"
 push @cmd,
   '--multixact-ids' => sprintf("%d,%d",
diff --git a/src/include/access/multixact.h b/src/include/access/multixact.h
index 82e4bb90dd..16a0772308 100644
--- a/src/include/access/multixact.h
+++ b/src/include/access/multixact.h
@@ -28,7 +28,7 @@
 
 #define MultiXactIdIsValid(multi) ((multi) != InvalidMultiXactId)
 
-#define MaxMultiXactOffset	((MultiXactOffset) 0xFFFFFFFF)
+#define MaxMultiXactOffset	UINT64CONST(0xFFFFFFFFFFFFFFFF)
 
 /*
  * Possible multixact lock modes ("status").  The first four modes are for
@@ -147,7 +147,6 @@ extern void MultiXactSetNextMXact(MultiXactId nextMulti,
 extern void MultiXactAdvanceNextMXact(MultiXactId minMulti,
 									  MultiXactOffset minMultiOffset);
 extern void MultiXactAdvanceOldest(MultiXactId oldestMulti, Oid oldestMultiDB);
-extern int	MultiXactMemberFreezeThreshold(void);
 
 extern void multixact_twophase_recover(FullTransactionId fxid, uint16 info,
 									   void *recdata, uint32 len);
diff --git a/src/include/c.h b/src/include/c.h
index 9ab5e61799..7ab9e68af6 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -666,7 +666,7 @@ typedef uint32 SubTransactionId;
 /* MultiXactId must be equivalent to TransactionId, to fit in t_xmax */
 typedef TransactionId MultiXactId;
 
-typedef uint32 MultiXactOffset;
+typedef uint64 MultiXactOffset;
 
 typedef uint32 CommandId;
 
-- 
2.51.0
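
For readers following the thread, the per-page "base plus 32-bit delta" layout that MXOffsetRead() and MXOffsetWrite() implement in the patch above can be sketched in isolation. This is only an illustration, not the patch code: OffsetPage, page_read, page_write, PAGE_SIZE and INIT_FLAG are hypothetical names, the page size is assumed to be the default 8192 bytes, the short offset is modelled as uint32_t, and the sketch assumes every delta on a page stays below 2^31 so that the top bit is free to act as the "written" flag.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE	8192				/* assumption: default BLCKSZ */
#define INIT_FLAG	UINT32_C(0x80000000)	/* marks an entry as written */
#define ENTRIES_PER_PAGE \
	((PAGE_SIZE - sizeof(uint64_t)) / sizeof(uint32_t))

/* Hypothetical stand-in for one offsets page: one base, many deltas. */
typedef struct OffsetPage
{
	uint64_t	base;
	uint32_t	delta[ENTRIES_PER_PAGE];
} OffsetPage;

/* Return the full 64-bit offset, or 0 if the entry was never written. */
static uint64_t
page_read(const OffsetPage *page, int entryno)
{
	if (page->delta[entryno] == 0)
		return 0;
	return page->base + (page->delta[entryno] ^ INIT_FLAG);
}

/* Store an offset as a delta from the page base, setting the base on
 * the first write to the page. */
static void
page_write(OffsetPage *page, int entryno, uint64_t offset)
{
	uint32_t	delta = 0;

	if (page->base == 0)
		page->base = offset;
	else
	{
		/* sketch-only assumption: deltas are non-negative and < 2^31 */
		assert(offset >= page->base);
		assert(offset - page->base < INIT_FLAG);
		delta = (uint32_t) (offset - page->base);
	}

	/* A written delta of 0 must not look like an unwritten entry. */
	page->delta[entryno] = delta ^ INIT_FLAG;
}
```

With these assumptions a page holds (8192 - 8) / 4 = 2046 entries, which matches the `32 * ($blcksz - 8) / 4` segment size used in the adjusted pg_resetwal test.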

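The patch above also widens the signed difference in MultiXactOffsetPrecedes() to 64 bits and replaces the old SAFE/DANGER thresholds with a single autovacuum threshold. A minimal standalone sketch of both checks follows. The names offset_precedes, needs_autovacuum and AUTOVAC_THRESHOLD are hypothetical; the threshold value mirrors MULTIXACT_MEMBER_AUTOVAC_THRESHOLD from the patch, and the unsigned-to-signed cast assumes a two's complement platform, as the original code does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mirrors MULTIXACT_MEMBER_AUTOVAC_THRESHOLD from the patch. */
#define AUTOVAC_THRESHOLD	UINT64_C(0xFFFFFFFF)

/* True if a logically precedes b, using modulo-2^64 arithmetic. */
static bool
offset_precedes(uint64_t a, uint64_t b)
{
	int64_t		diff = (int64_t) (a - b);

	return diff < 0;
}

/* True if more members are in use than the autovacuum threshold allows. */
static bool
needs_autovacuum(uint64_t next_offset, uint64_t oldest_offset)
{
	/* sketch-only sanity check: the oldest offset never passes the next */
	assert(!offset_precedes(next_offset, oldest_offset));

	return next_offset - oldest_offset > AUTOVAC_THRESHOLD;
}
```

The comparison is strict, so a gap of exactly 0xFFFFFFFF members does not yet request autovacuum, while a gap of 0x100000000 does.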
v19-0004-TEST-bump-catversion.patch.txt (text/plain; charset=US-ASCII)
From 5e02b776a1783cdc9a39fa61cdb63e53882c5232 Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Fri, 24 Oct 2025 11:47:50 +0300
Subject: [PATCH v19 4/5] TEST: bump catversion

To avoid constant CF-bot complaints, bump catversion in a separate
commit.

NOTE: keep it in sync with MULTIXACTOFFSET_FORMATCHANGE_CAT_VER
---
 src/include/catalog/catversion.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h
index 1b0b16a343..6a13fa3cdb 100644
--- a/src/include/catalog/catversion.h
+++ b/src/include/catalog/catversion.h
@@ -57,6 +57,6 @@
  */
 
 /*							yyyymmddN */
-#define CATALOG_VERSION_NO	202510221
+#define CATALOG_VERSION_NO	999999999
 
 #endif
-- 
2.51.0

v19-0002-Add-pg_upgarde-for-64-bit-multixact-offsets.patch (application/octet-stream)
From 8d41a3ddc0cc5681c8bbf3fdb5a1e22dd9a16604 Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Fri, 24 Oct 2025 10:58:37 +0300
Subject: [PATCH v19 2/5] Add pg_upgrade for 64 bit multixact offsets

Author: Maxim Orlov <orlovmg@gmail.com>
Author: Heikki Linnakangas <hlinnaka@iki.fi>
---
 src/backend/access/transam/multixact.c |  35 +--
 src/bin/pg_upgrade/Makefile            |   3 +
 src/bin/pg_upgrade/meson.build         |   3 +
 src/bin/pg_upgrade/multixact_new.c     | 253 +++++++++++++++++++++
 src/bin/pg_upgrade/multixact_new.h     |  31 +++
 src/bin/pg_upgrade/multixact_old.c     | 296 +++++++++++++++++++++++++
 src/bin/pg_upgrade/multixact_old.h     |  31 +++
 src/bin/pg_upgrade/pg_upgrade.c        | 108 ++++++++-
 src/bin/pg_upgrade/pg_upgrade.h        |   5 +
 src/bin/pg_upgrade/slru_io.c           | 240 ++++++++++++++++++++
 src/bin/pg_upgrade/slru_io.h           |  30 +++
 11 files changed, 1003 insertions(+), 32 deletions(-)
 create mode 100644 src/bin/pg_upgrade/multixact_new.c
 create mode 100644 src/bin/pg_upgrade/multixact_new.h
 create mode 100644 src/bin/pg_upgrade/multixact_old.c
 create mode 100644 src/bin/pg_upgrade/multixact_old.h
 create mode 100644 src/bin/pg_upgrade/slru_io.c
 create mode 100644 src/bin/pg_upgrade/slru_io.h

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index e7861614be..2d44c781f9 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1276,7 +1276,6 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
 	int			slotno;
 	MultiXactOffset offset;
 	int			length;
-	int			truelength;
 	MultiXactId oldestMXact;
 	MultiXactId nextMXact;
 	MultiXactId tmpMXact;
@@ -1375,15 +1374,6 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
 	 * we have just for this; the process in charge will signal the CV as soon
 	 * as it has finished writing the multixact offset.
 	 *
-	 * 3. Because GetNewMultiXactId increments offset zero to offset one to
-	 * handle case #2, there is an ambiguity near the point of offset
-	 * wraparound.  If we see next multixact's offset is one, is that our
-	 * multixact's actual endpoint, or did it end at zero with a subsequent
-	 * increment?  We handle this using the knowledge that if the zero'th
-	 * member slot wasn't filled, it'll contain zero, and zero isn't a valid
-	 * transaction ID so it can't be a multixact member.  Therefore, if we
-	 * read a zero from the members array, just ignore it.
-	 *
 	 * This is all pretty messy, but the mess occurs only in infrequent corner
 	 * cases, so it seems better than holding the MultiXactGenLock for a long
 	 * time on every multixact creation.
@@ -1467,6 +1457,9 @@ retry:
 	LWLockRelease(lock);
 	lock = NULL;
 
+	/* A multixid with zero members should not happen */
+	Assert(length > 0);
+
 	/*
 	 * If we slept above, clean up state; it's no longer needed.
 	 */
@@ -1475,7 +1468,6 @@ retry:
 
 	ptr = (MultiXactMember *) palloc(length * sizeof(MultiXactMember));
 
-	truelength = 0;
 	prev_pageno = -1;
 	for (int i = 0; i < length; i++, offset++)
 	{
@@ -1513,36 +1505,27 @@ retry:
 		xactptr = (TransactionId *)
 			(MultiXactMemberCtl->shared->page_buffer[slotno] + memberoff);
 
-		if (!TransactionIdIsValid(*xactptr))
-		{
-			/* Corner case 3: we must be looking at unused slot zero */
-			Assert(offset == 0);
-			continue;
-		}
+		Assert(TransactionIdIsValid(*xactptr));
 
 		flagsoff = MXOffsetToFlagsOffset(offset);
 		bshift = MXOffsetToFlagsBitShift(offset);
 		flagsptr = (uint32 *) (MultiXactMemberCtl->shared->page_buffer[slotno] + flagsoff);
 
-		ptr[truelength].xid = *xactptr;
-		ptr[truelength].status = (*flagsptr >> bshift) & MXACT_MEMBER_XACT_BITMASK;
-		truelength++;
+		ptr[i].xid = *xactptr;
+		ptr[i].status = (*flagsptr >> bshift) & MXACT_MEMBER_XACT_BITMASK;
 	}
 
 	LWLockRelease(lock);
 
-	/* A multixid with zero members should not happen */
-	Assert(truelength > 0);
-
 	/*
 	 * Copy the result into the local cache.
 	 */
-	mXactCachePut(multi, truelength, ptr);
+	mXactCachePut(multi, length, ptr);
 
 	debug_elog3(DEBUG2, "GetMembers: no cache for %s",
-				mxid_to_string(multi, truelength, ptr));
+				mxid_to_string(multi, length, ptr));
 	*members = ptr;
-	return truelength;
+	return length;
 }
 
 /*
diff --git a/src/bin/pg_upgrade/Makefile b/src/bin/pg_upgrade/Makefile
index 69fcf593ca..42995d53b0 100644
--- a/src/bin/pg_upgrade/Makefile
+++ b/src/bin/pg_upgrade/Makefile
@@ -18,11 +18,14 @@ OBJS = \
 	file.o \
 	function.o \
 	info.o \
+	multixact_new.o \
+	multixact_old.o \
 	option.o \
 	parallel.o \
 	pg_upgrade.o \
 	relfilenumber.o \
 	server.o \
+	slru_io.o \
 	tablespace.o \
 	task.o \
 	util.o \
diff --git a/src/bin/pg_upgrade/meson.build b/src/bin/pg_upgrade/meson.build
index ac992f0d14..3e46c4512c 100644
--- a/src/bin/pg_upgrade/meson.build
+++ b/src/bin/pg_upgrade/meson.build
@@ -8,11 +8,14 @@ pg_upgrade_sources = files(
   'file.c',
   'function.c',
   'info.c',
+  'multixact_new.c',
+  'multixact_old.c',
   'option.c',
   'parallel.c',
   'pg_upgrade.c',
   'relfilenumber.c',
   'server.c',
+  'slru_io.c',
   'tablespace.c',
   'task.c',
   'util.c',
diff --git a/src/bin/pg_upgrade/multixact_new.c b/src/bin/pg_upgrade/multixact_new.c
new file mode 100644
index 0000000000..d7a58a75de
--- /dev/null
+++ b/src/bin/pg_upgrade/multixact_new.c
@@ -0,0 +1,253 @@
+/*
+ * multixact_new.c
+ *
+ * Rewrite pre-v19 multixacts to new format with 64-bit MultiXactOffsets
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ * src/bin/pg_upgrade/multixact_new.c
+ */
+
+#include "multixact_new.h"
+
+/*
+ * NOTE: Below are a bunch of definitions and simple inline functions that are
+ * copy-pasted from multixact.c
+ */
+typedef int32 ShortMultiXactOffset;
+
+/* We need four bytes per offset, 8 bytes for the base */
+#define MULTIXACT_OFFSETS_PER_PAGE		\
+	((BLCKSZ - sizeof(MultiXactOffset)) / sizeof(ShortMultiXactOffset))
+
+static inline int64
+MultiXactIdToOffsetPage(MultiXactId multi)
+{
+	return multi / MULTIXACT_OFFSETS_PER_PAGE;
+}
+
+static inline int
+MultiXactIdToOffsetEntry(MultiXactId multi)
+{
+	return multi % MULTIXACT_OFFSETS_PER_PAGE;
+}
+
+/* We need eight bits per xact, so one xact fits in a byte */
+#define MXACT_MEMBER_BITS_PER_XACT			8
+#define MXACT_MEMBER_FLAGS_PER_BYTE			1
+#define MXACT_MEMBER_XACT_BITMASK	((1 << MXACT_MEMBER_BITS_PER_XACT) - 1)
+
+/* how many full bytes of flags are there in a group? */
+#define MULTIXACT_FLAGBYTES_PER_GROUP		4
+#define MULTIXACT_MEMBERS_PER_MEMBERGROUP	\
+	(MULTIXACT_FLAGBYTES_PER_GROUP * MXACT_MEMBER_FLAGS_PER_BYTE)
+/* size in bytes of a complete group */
+#define MULTIXACT_MEMBERGROUP_SIZE \
+	(sizeof(TransactionId) * MULTIXACT_MEMBERS_PER_MEMBERGROUP + MULTIXACT_FLAGBYTES_PER_GROUP)
+#define MULTIXACT_MEMBERGROUPS_PER_PAGE (BLCKSZ / MULTIXACT_MEMBERGROUP_SIZE)
+#define MULTIXACT_MEMBERS_PER_PAGE	\
+	(MULTIXACT_MEMBERGROUPS_PER_PAGE * MULTIXACT_MEMBERS_PER_MEMBERGROUP)
+
+/*
+ * Because the number of items per page is not a divisor of the last item
+ * number (member 0xFFFFFFFF), the last segment does not use the maximum number
+ * of pages, and moreover the last used page therein does not use the same
+ * number of items as previous pages.  (Another way to say it is that the
+ * 0xFFFFFFFF member is somewhere in the middle of the last page, so the page
+ * has some empty space after that item.)
+ *
+ * This constant is the number of members in the last page of the last segment.
+ */
+#define MAX_MEMBERS_IN_LAST_MEMBERS_PAGE \
+		((uint32) ((0xFFFFFFFF % MULTIXACT_MEMBERS_PER_PAGE) + 1))
+
+/* page in which a member is to be found */
+static inline int64
+MXOffsetToMemberPage(MultiXactOffset offset)
+{
+	return offset / MULTIXACT_MEMBERS_PER_PAGE;
+}
+
+/* Location (byte offset within page) of flag word for a given member */
+static inline int
+MXOffsetToFlagsOffset(MultiXactOffset offset)
+{
+	MultiXactOffset group = offset / MULTIXACT_MEMBERS_PER_MEMBERGROUP;
+	int			grouponpg = group % MULTIXACT_MEMBERGROUPS_PER_PAGE;
+	int			byteoff = grouponpg * MULTIXACT_MEMBERGROUP_SIZE;
+
+	return byteoff;
+}
+
+static inline int
+MXOffsetToFlagsBitShift(MultiXactOffset offset)
+{
+	int			member_in_group = offset % MULTIXACT_MEMBERS_PER_MEMBERGROUP;
+	int			bshift = member_in_group * MXACT_MEMBER_BITS_PER_XACT;
+
+	return bshift;
+}
+
+/* Location (byte offset within page) of TransactionId of given member */
+static inline int
+MXOffsetToMemberOffset(MultiXactOffset offset)
+{
+	int			member_in_group = offset % MULTIXACT_MEMBERS_PER_MEMBERGROUP;
+
+	return MXOffsetToFlagsOffset(offset) +
+		MULTIXACT_FLAGBYTES_PER_GROUP +
+		member_in_group * sizeof(TransactionId);
+}
+
+/*
+ * To avoid diving deep into address arithmetic, we declare an auxiliary
+ * structure that accesses the MultiXactOffset SLRU page.
+ */
+typedef struct MultiXactOffsetSLRUPage
+{
+	MultiXactOffset			base;
+	ShortMultiXactOffset	offset[FLEXIBLE_ARRAY_MEMBER];
+} MultiXactOffsetSLRUPage;
+
+static inline void
+MXOffsetWrite(char *buf, int entryno, MultiXactOffset offset)
+{
+	MultiXactOffsetSLRUPage *page = (MultiXactOffsetSLRUPage *) buf;
+
+	if (page->base != 0)
+		page->offset[entryno] = offset - page->base;
+	else
+	{
+		page->base = offset;
+		page->offset[entryno] = 0;
+	}
+
+	/*
+	 * We need to distinguish between uninited value and not yet written offset.
+	 * See case 2 in GetMultiXactIdMembers.
+	 *
+	 * So, mark this offset inited.
+	 */
+	page->offset[entryno] ^= 0x80000000;
+}
+
+MultiXactWriter *
+AllocMultiXactWrite(char *pgdata, MultiXactId firstMulti,
+					MultiXactOffset firstOffset)
+{
+	MultiXactWriter *state = pg_malloc(sizeof(*state));
+	char				dir[MAXPGPATH] = {0};
+
+	state->nextMXact = firstMulti;
+	state->nextOffset = firstOffset;
+
+	pg_sprintf(dir, "%s/pg_multixact/offsets", pgdata);
+	state->offset = AllocSlruWrite(dir, false);
+
+	pg_sprintf(dir, "%s/pg_multixact/members", pgdata);
+	state->members = AllocSlruWrite(dir, true /* use long segment names */);
+
+	return state;
+}
+
+/*
+ * Simplified copy of the corresponding server function
+ */
+MultiXactId
+GetNewMultiXactId(MultiXactWriter *state, int nmembers, MultiXactOffset *offset)
+{
+	MultiXactId		result;
+
+	/* Handle wraparound of the nextMXact counter */
+	if (state->nextMXact < FirstMultiXactId)
+		state->nextMXact = FirstMultiXactId;
+
+	/* Assign the MXID */
+	result = state->nextMXact;
+
+	/* Reserve the members space, similarly to above. */
+	*offset = state->nextOffset;
+
+	/*
+	 * Advance counters.  As in GetNewTransactionId(), this must not happen
+	 * until after file extension has succeeded!
+	 *
+	 * We don't care about MultiXactId wraparound here; it will be handled by
+	 * the next iteration.  But note that nextMXact may be InvalidMultiXactId
+	 * or the first value on a segment-beginning page after this routine
+	 * exits, so anyone else looking at the variable must be prepared to deal
+	 * with either case.  Similarly, nextOffset may be zero, but we won't use
+	 * that as the actual start offset of the next multixact.
+	 */
+	(state->nextMXact)++;
+
+	state->nextOffset += nmembers;
+
+	return result;
+}
+
+/*
+ * Write a new multixact with members.
+ *
+ * Simplified version of the corresponding server function, hence the name.
+ */
+void
+RecordNewMultiXact(MultiXactWriter *state, MultiXactOffset offset,
+				   MultiXactId multi, int nmembers, MultiXactMember *members)
+{
+	int64		pageno;
+	int64		prev_pageno;
+	int			entryno,
+				i;
+	char	   *buf;
+
+	pageno = MultiXactIdToOffsetPage(multi);
+	entryno = MultiXactIdToOffsetEntry(multi);
+
+	buf = SlruWriteSwitchPage(state->offset, pageno);
+	MXOffsetWrite(buf, entryno, offset);
+
+	prev_pageno = -1;
+
+	for (i = 0; i < nmembers; i++, offset++)
+	{
+		TransactionId *memberptr;
+		uint32	   *flagsptr;
+		uint32		flagsval;
+		int			bshift;
+		int			flagsoff;
+		int			memberoff;
+
+		Assert(members[i].status <= MultiXactStatusUpdate);
+
+		pageno = MXOffsetToMemberPage(offset);
+		memberoff = MXOffsetToMemberOffset(offset);
+		flagsoff = MXOffsetToFlagsOffset(offset);
+		bshift = MXOffsetToFlagsBitShift(offset);
+
+		if (pageno != prev_pageno)
+		{
+			buf = SlruWriteSwitchPage(state->members, pageno);
+			prev_pageno = pageno;
+		}
+
+		memberptr = (TransactionId *) (buf + memberoff);
+
+		*memberptr = members[i].xid;
+
+		flagsptr = (uint32 *) (buf + flagsoff);
+
+		flagsval = *flagsptr;
+		flagsval &= ~(((1 << MXACT_MEMBER_BITS_PER_XACT) - 1) << bshift);
+		flagsval |= (members[i].status << bshift);
+		*flagsptr = flagsval;
+	}
+}
+
+void
+FreeMultiXactWrite(MultiXactWriter *state)
+{
+	FreeSlruWrite(state->offset);
+	FreeSlruWrite(state->members);
+
+	pfree(state);
+}
diff --git a/src/bin/pg_upgrade/multixact_new.h b/src/bin/pg_upgrade/multixact_new.h
new file mode 100644
index 0000000000..33d5d1b822
--- /dev/null
+++ b/src/bin/pg_upgrade/multixact_new.h
@@ -0,0 +1,31 @@
+/*
+ * multixact_new.h
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ * src/bin/pg_upgrade/multixact_new.h
+ */
+
+#include "postgres_fe.h"
+
+#include "access/multixact.h"
+
+#include "slru_io.h"
+
+typedef struct MultiXactWriter
+{
+	MultiXactId			nextMXact;
+	MultiXactOffset		nextOffset;
+
+	SlruSegState	   *offset;
+	SlruSegState	   *members;
+} MultiXactWriter;
+
+extern MultiXactWriter *AllocMultiXactWrite(char *pgdata,
+											MultiXactId firstMulti,
+											MultiXactOffset firstOffset);
+extern MultiXactId GetNewMultiXactId(MultiXactWriter *state, int nmembers,
+									 MultiXactOffset *offset);
+extern void RecordNewMultiXact(MultiXactWriter *state, MultiXactOffset offset,
+							   MultiXactId multi, int nmembers,
+							   MultiXactMember *members);
+extern void FreeMultiXactWrite(MultiXactWriter *writer);
diff --git a/src/bin/pg_upgrade/multixact_old.c b/src/bin/pg_upgrade/multixact_old.c
new file mode 100644
index 0000000000..6cc384d2cf
--- /dev/null
+++ b/src/bin/pg_upgrade/multixact_old.c
@@ -0,0 +1,296 @@
+/*
+ * multixact_old.c
+ *
+ * Rewrite pre-v19 multixacts to new format with 64-bit MultiXactOffsets
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ * src/bin/pg_upgrade/multixact_old.c
+ */
+
+#include "multixact_old.h"
+
+#include "pg_upgrade.h"
+
+/*
+ * NOTE: below are a bunch of definitions and simple static inline functions
+ * that are copy-pasted from multixact.c from version 18.  The only difference
+ * is that we use the OldMultiXactOffset type equal to uint32 instead of
+ * MultiXactOffset which became uint64.
+ */
+
+/* We need four bytes per offset */
+#define MULTIXACT_OFFSETS_PER_PAGE (BLCKSZ / sizeof(OldMultiXactOffset))
+
+static inline int64
+MultiXactIdToOffsetPage(MultiXactId multi)
+{
+	return multi / MULTIXACT_OFFSETS_PER_PAGE;
+}
+
+static inline int
+MultiXactIdToOffsetEntry(MultiXactId multi)
+{
+	return multi % MULTIXACT_OFFSETS_PER_PAGE;
+}
+
+/*
+ * The situation for members is a bit more complex: we store one byte of
+ * additional flag bits for each TransactionId.  To do this without getting
+ * into alignment issues, we store four bytes of flags, and then the
+ * corresponding 4 Xids.  Each such 5-word (20-byte) set we call a "group", and
+ * are stored as a whole in pages.  Thus, with 8kB BLCKSZ, we keep 409 groups
+ * per page.  This wastes 12 bytes per page, but that's OK -- simplicity (and
+ * performance) trumps space efficiency here.
+ *
+ * Note that the "offset" macros work with byte offset, not array indexes, so
+ * arithmetic must be done using "char *" pointers.
+ */
+/* We need eight bits per xact, so one xact fits in a byte */
+#define MXACT_MEMBER_BITS_PER_XACT			8
+#define MXACT_MEMBER_FLAGS_PER_BYTE			1
+#define MXACT_MEMBER_XACT_BITMASK	((1 << MXACT_MEMBER_BITS_PER_XACT) - 1)
+
+/* how many full bytes of flags are there in a group? */
+#define MULTIXACT_FLAGBYTES_PER_GROUP		4
+#define MULTIXACT_MEMBERS_PER_MEMBERGROUP	\
+	(MULTIXACT_FLAGBYTES_PER_GROUP * MXACT_MEMBER_FLAGS_PER_BYTE)
+/* size in bytes of a complete group */
+#define MULTIXACT_MEMBERGROUP_SIZE \
+	(sizeof(TransactionId) * MULTIXACT_MEMBERS_PER_MEMBERGROUP + MULTIXACT_FLAGBYTES_PER_GROUP)
+#define MULTIXACT_MEMBERGROUPS_PER_PAGE (BLCKSZ / MULTIXACT_MEMBERGROUP_SIZE)
+#define MULTIXACT_MEMBERS_PER_PAGE	\
+	(MULTIXACT_MEMBERGROUPS_PER_PAGE * MULTIXACT_MEMBERS_PER_MEMBERGROUP)
+
+/* page in which a member is to be found */
+static inline int64
+MXOffsetToMemberPage(OldMultiXactOffset offset)
+{
+	return offset / MULTIXACT_MEMBERS_PER_PAGE;
+}
+
+/* Location (byte offset within page) of flag word for a given member */
+static inline int
+MXOffsetToFlagsOffset(OldMultiXactOffset offset)
+{
+	OldMultiXactOffset group = offset / MULTIXACT_MEMBERS_PER_MEMBERGROUP;
+	int			grouponpg = group % MULTIXACT_MEMBERGROUPS_PER_PAGE;
+	int			byteoff = grouponpg * MULTIXACT_MEMBERGROUP_SIZE;
+
+	return byteoff;
+}
+
+/* Location (byte offset within page) of TransactionId of given member */
+static inline int
+MXOffsetToMemberOffset(OldMultiXactOffset offset)
+{
+	int			member_in_group = offset % MULTIXACT_MEMBERS_PER_MEMBERGROUP;
+
+	return MXOffsetToFlagsOffset(offset) +
+		MULTIXACT_FLAGBYTES_PER_GROUP +
+		member_in_group * sizeof(TransactionId);
+}
+
+static inline int
+MXOffsetToFlagsBitShift(OldMultiXactOffset offset)
+{
+	int			member_in_group = offset % MULTIXACT_MEMBERS_PER_MEMBERGROUP;
+	int			bshift = member_in_group * MXACT_MEMBER_BITS_PER_XACT;
+
+	return bshift;
+}
+
+/*
+ * Construct reader of old multixacts.
+ *
+ * Returns the malloced memory used by all other calls in this module.
+ */
+OldMultiXactReader *
+AllocOldMultiXactRead(char *pgdata, MultiXactId nextMulti,
+					  OldMultiXactOffset nextOffset)
+{
+	OldMultiXactReader *state = pg_malloc(sizeof(*state));
+	char				dir[MAXPGPATH] = {0};
+
+	state->nextMXact = nextMulti;
+	state->nextOffset = nextOffset;
+
+	pg_sprintf(dir, "%s/pg_multixact/offsets", pgdata);
+	state->offset = AllocSlruRead(dir);
+
+	pg_sprintf(dir, "%s/pg_multixact/members", pgdata);
+	state->members = AllocSlruRead(dir);
+
+	return state;
+}
+
+/*
+ * This is a simplified version of the GetMultiXactIdMembers() server function.
+ *
+ * - Only return the updating member, if any. Upgrade only cares about the
+ *   updaters. If there is no updating member, return the first locking-only
+ *   member. We don't have any way to represent "no members", but we also don't
+ *   need to preserve all the locking members.
+ *
+ * - We don't need to worry about locking and some corner cases because there's
+ *   no concurrent activity.
+ */
+void
+GetOldMultiXactIdSingleMember(OldMultiXactReader *state, MultiXactId multi,
+							  TransactionId *result, MultiXactStatus *status)
+{
+	MultiXactId		nextMXact,
+					nextOffset,
+					tmpMXact;
+	int64			pageno,
+					prev_pageno;
+	int				entryno,
+					length;
+	char		   *buf;
+	OldMultiXactOffset *offptr,
+						offset;
+	TransactionId	result_xid = InvalidTransactionId;
+	bool			result_isupdate = false;
+
+	nextMXact = state->nextMXact;
+	nextOffset = state->nextOffset;
+
+	/*
+	 * See GetMultiXactIdMembers in multixact.c
+	 *
+	 * Find out the offset at which we need to start reading MultiXactMembers
+	 * and the number of members in the multixact.  We determine the latter as
+	 * the difference between this multixact's starting offset and the next
+	 * one's.  However, there are some corner cases to worry about:
+	 *
+	 * 1. This multixact may be the latest one created, in which case there is
+	 * no next one to look at.  In this case the nextOffset value we just
+	 * saved is the correct endpoint.
+	 *
+	 * 2. The next multixact may still be in process of being filled in...
+	 * This cannot happen during upgrade.
+	 *
+	 * 3. Because GetNewMultiXactId increments offset zero to offset one to
+	 * handle case #2, there is an ambiguity near the point of offset
+	 * wraparound.  If we see next multixact's offset is one, is that our
+	 * multixact's actual endpoint, or did it end at zero with a subsequent
+	 * increment?  We handle this using the knowledge that if the zero'th
+	 * member slot wasn't filled, it'll contain zero, and zero isn't a valid
+	 * transaction ID so it can't be a multixact member.  Therefore, if we
+	 * read a zero from the members array, just ignore it.
+	 */
+
+	pageno = MultiXactIdToOffsetPage(multi);
+	entryno = MultiXactIdToOffsetEntry(multi);
+
+	buf = SlruReadSwitchPage(state->offset, pageno);
+	offptr = (OldMultiXactOffset *) buf;
+	offptr += entryno;
+	offset = *offptr;
+
+	Assert(offset != 0);
+
+	/*
+	 * Use the same increment rule as GetNewMultiXactId(), that is, don't
+	 * handle wraparound explicitly until needed.
+	 */
+	tmpMXact = multi + 1;
+
+	if (nextMXact == tmpMXact)
+	{
+		/* Corner case 1: there is no next multixact */
+		length = nextOffset - offset;
+	}
+	else
+	{
+		OldMultiXactOffset nextMXOffset;
+
+		/* handle wraparound if needed */
+		if (tmpMXact < FirstMultiXactId)
+			tmpMXact = FirstMultiXactId;
+
+		prev_pageno = pageno;
+
+		pageno = MultiXactIdToOffsetPage(tmpMXact);
+		entryno = MultiXactIdToOffsetEntry(tmpMXact);
+
+		if (pageno != prev_pageno)
+			buf = SlruReadSwitchPage(state->offset, pageno);
+
+		offptr = (OldMultiXactOffset *) buf;
+		offptr += entryno;
+		nextMXOffset = *offptr;
+
+		/*
+		 * Corner case 2: next multixact is still being filled in, this must
+		 * not happen during upgrade.
+		 */
+		Assert(nextMXOffset != 0);
+
+		length = nextMXOffset - offset;
+	}
+
+	prev_pageno = -1;
+	for (int i = 0; i < length; i++, offset++)
+	{
+		TransactionId *xactptr;
+		uint32	   *flagsptr;
+		int			flagsoff;
+		int			bshift;
+		int			memberoff;
+		MultiXactStatus st;
+
+		pageno = MXOffsetToMemberPage(offset);
+		memberoff = MXOffsetToMemberOffset(offset);
+
+		if (pageno != prev_pageno)
+		{
+			buf = SlruReadSwitchPage(state->members, pageno);
+			prev_pageno = pageno;
+		}
+
+		xactptr = (TransactionId *) (buf + memberoff);
+		if (!TransactionIdIsValid(*xactptr))
+		{
+			/* Corner case 3: we must be looking at unused slot zero */
+			Assert(offset == 0);
+			continue;
+		}
+
+		flagsoff = MXOffsetToFlagsOffset(offset);
+		bshift = MXOffsetToFlagsBitShift(offset);
+		flagsptr = (uint32 *) (buf + flagsoff);
+
+		st = (*flagsptr >> bshift) & MXACT_MEMBER_XACT_BITMASK;
+
+		/* Verify that there is a single update Xid among the given members. */
+		if (ISUPDATE_from_mxstatus(st))
+		{
+			if (result_isupdate)
+				pg_fatal("multixact %u has more than one updating member",
+						 multi);
+			result_xid = *xactptr;
+			result_isupdate = true;
+		}
+		else if (!TransactionIdIsValid(result_xid))
+			result_xid = *xactptr;
+	}
+
+	/* A multixid with zero members should not happen */
+	Assert(TransactionIdIsValid(result_xid));
+
+	*result = result_xid;
+	*status = result_isupdate ? MultiXactStatusUpdate :
+								MultiXactStatusForKeyShare;
+}
+
+/*
+ * Frees the malloced reader.
+ */
+void
+FreeOldMultiXactReader(OldMultiXactReader *state)
+{
+	FreeSlruRead(state->offset);
+	FreeSlruRead(state->members);
+
+	pfree(state);
+}
diff --git a/src/bin/pg_upgrade/multixact_old.h b/src/bin/pg_upgrade/multixact_old.h
new file mode 100644
index 0000000000..8d4659ba6a
--- /dev/null
+++ b/src/bin/pg_upgrade/multixact_old.h
@@ -0,0 +1,31 @@
+/*
+ * multixact_old.h
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ * src/bin/pg_upgrade/multixact_old.h
+ */
+
+#include "postgres_fe.h"
+
+#include "access/multixact.h"
+#include "slru_io.h"
+
+typedef uint32 OldMultiXactOffset;
+
+typedef struct OldMultiXactReader
+{
+	MultiXactId			nextMXact;
+	OldMultiXactOffset	nextOffset;
+
+	SlruSegState	   *offset;
+	SlruSegState	   *members;
+} OldMultiXactReader;
+
+extern OldMultiXactReader *AllocOldMultiXactRead(char *pgdata,
+												 MultiXactId nextMulti,
+												 OldMultiXactOffset nextOffset);
+extern void GetOldMultiXactIdSingleMember(OldMultiXactReader *state,
+										  MultiXactId multi,
+										  TransactionId *result,
+										  MultiXactStatus *status);
+extern void FreeOldMultiXactReader(OldMultiXactReader *reader);
diff --git a/src/bin/pg_upgrade/pg_upgrade.c b/src/bin/pg_upgrade/pg_upgrade.c
index 490e98fa26..5432c03a2b 100644
--- a/src/bin/pg_upgrade/pg_upgrade.c
+++ b/src/bin/pg_upgrade/pg_upgrade.c
@@ -49,6 +49,8 @@
 #include "common/restricted_token.h"
 #include "fe_utils/string_utils.h"
 #include "pg_upgrade.h"
+#include "multixact_old.h"
+#include "multixact_new.h"
 
 /*
  * Maximum number of pg_restore actions (TOC entries) to process within one
@@ -769,6 +771,82 @@ copy_subdir_files(const char *old_subdir, const char *new_subdir)
 	check_ok();
 }
 
+/*
+ * Convert pg_multixact/offsets and /members to new format with 64-bit offsets.
+ */
+static void
+convert_multixacts(MultiXactId *new_nxtmulti, MultiXactOffset *new_nxtmxoff)
+{
+	MultiXactId			oldest_multi,
+						next_multi;
+	OldMultiXactReader *old_reader;
+	MultiXactWriter	   *new_writer;
+
+	old_reader = AllocOldMultiXactRead(old_cluster.pgdata,
+									   old_cluster.controldata.chkpnt_nxtmulti,
+									   old_cluster.controldata.chkpnt_nxtmxoff);
+	new_writer = AllocMultiXactWrite(new_cluster.pgdata,
+									 old_cluster.controldata.chkpnt_oldstMulti,
+									 1 /* see below */);
+
+	oldest_multi = old_cluster.controldata.chkpnt_oldstMulti;
+	next_multi = old_cluster.controldata.chkpnt_nxtmulti;
+
+	/* handle wraparound */
+	if (next_multi < FirstMultiXactId)
+		next_multi = FirstMultiXactId;
+
+	/*
+	 * Read multixids from old files one by one, and write them back in the new
+	 * format.
+	 *
+	 * The locking-only XIDs that may be part of multi-xids don't matter after
+	 * upgrade, as there can be no transactions running across upgrade.  So as
+	 * a little optimization, we only read one member from each multixid: the
+	 * updating one, or if there was no update, arbitrarily the first
+	 * locking xid.
+	 */
+	for (MultiXactId multi = oldest_multi; multi != next_multi;)
+	{
+		TransactionId		xid;
+		MultiXactStatus		status;
+		MultiXactMember		member;
+		MultiXactId			new_multi PG_USED_FOR_ASSERTS_ONLY;
+		MultiXactOffset		offset;
+
+		/* Read the old multixid */
+		GetOldMultiXactIdSingleMember(old_reader, multi, &xid, &status);
+
+		/* Write it out in new format */
+		member.xid = xid;
+		member.status = status;
+		new_multi = GetNewMultiXactId(new_writer, 1, &offset);
+
+		Assert(new_multi == multi);
+
+		RecordNewMultiXact(new_writer, offset, multi, 1, &member);
+
+		multi++;
+		/* handle wraparound */
+		if (multi < FirstMultiXactId)
+			multi = FirstMultiXactId;
+	}
+
+	/*
+	 * Update the nextMXact/Offset values in the control file to match what we
+	 * wrote.  The nextMXact should be unchanged, but because we ignored the
+	 * locking XIDs members, the nextOffset will be different.
+	 */
+	Assert(new_writer->nextMXact == next_multi);
+
+	*new_nxtmulti = next_multi;
+	*new_nxtmxoff = new_writer->nextOffset;
+
+	/* Release resources */
+	FreeMultiXactWrite(new_writer);
+	FreeOldMultiXactReader(old_reader);
+}
+
 static void
 copy_xact_xlog_xid(void)
 {
@@ -816,8 +894,28 @@ copy_xact_xlog_xid(void)
 	if (old_cluster.controldata.cat_ver >= MULTIXACT_FORMATCHANGE_CAT_VER &&
 		new_cluster.controldata.cat_ver >= MULTIXACT_FORMATCHANGE_CAT_VER)
 	{
-		copy_subdir_files("pg_multixact/offsets", "pg_multixact/offsets");
-		copy_subdir_files("pg_multixact/members", "pg_multixact/members");
+		MultiXactId		new_nxtmulti = old_cluster.controldata.chkpnt_nxtmulti;
+		MultiXactOffset new_nxtmxoff = old_cluster.controldata.chkpnt_nxtmxoff;
+
+		/*
+		 * If the old server is before the MULTIXACTOFFSET_FORMATCHANGE_CAT_VER
+		 * it must have 32-bit multixid offsets, thus it should be converted.
+		 */
+		if (old_cluster.controldata.cat_ver < MULTIXACTOFFSET_FORMATCHANGE_CAT_VER &&
+			new_cluster.controldata.cat_ver >= MULTIXACTOFFSET_FORMATCHANGE_CAT_VER)
+		{
+			remove_new_subdir("pg_multixact/members", false);
+			remove_new_subdir("pg_multixact/offsets", false);
+
+			prep_status("Converting pg_multixact/offsets to 64-bit");
+			convert_multixacts(&new_nxtmulti, &new_nxtmxoff);
+			check_ok();
+		}
+		else
+		{
+			copy_subdir_files("pg_multixact/offsets", "pg_multixact/offsets");
+			copy_subdir_files("pg_multixact/members", "pg_multixact/members");
+		}
 
 		prep_status("Setting next multixact ID and offset for new cluster");
 
@@ -826,10 +924,8 @@ copy_xact_xlog_xid(void)
 		 * counters here and the oldest multi present on system.
 		 */
 		exec_prog(UTILITY_LOG_FILE, NULL, true, true,
-				  "\"%s/pg_resetwal\" -O %u -m %u,%u \"%s\"",
-				  new_cluster.bindir,
-				  old_cluster.controldata.chkpnt_nxtmxoff,
-				  old_cluster.controldata.chkpnt_nxtmulti,
+				  "\"%s/pg_resetwal\" -O %" PRIu64 " -m %u,%u \"%s\"",
+				  new_cluster.bindir, new_nxtmxoff, new_nxtmulti,
 				  old_cluster.controldata.chkpnt_oldstMulti,
 				  new_cluster.pgdata);
 		check_ok();
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index e86336f4be..127b2cb00f 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -114,6 +114,11 @@ extern char *output_files[];
  */
 #define MULTIXACT_FORMATCHANGE_CAT_VER 201301231
 
+/*
+ * Switch from 32-bit to 64-bit for multixid offsets.
+ */
+#define MULTIXACTOFFSET_FORMATCHANGE_CAT_VER 999999999
+
 /*
  * large object chunk size added to pg_controldata,
  * commit 5f93c37805e7485488480916b4585e098d3cc883
diff --git a/src/bin/pg_upgrade/slru_io.c b/src/bin/pg_upgrade/slru_io.c
new file mode 100644
index 0000000000..4e82319930
--- /dev/null
+++ b/src/bin/pg_upgrade/slru_io.c
@@ -0,0 +1,240 @@
+/*
+ * slru_io.c
+ *
+ * Routines for reading and writing SLRU files during upgrade.
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ * src/bin/pg_upgrade/slru_io.c
+ */
+
+#include "postgres_fe.h"
+
+#include <fcntl.h>
+
+#include "pg_upgrade.h"
+#include "slru_io.h"
+
+#include "common/fe_memutils.h"
+#include "common/file_perm.h"
+#include "common/file_utils.h"
+#include "port/pg_iovec.h"
+
+/*
+ * State for reading or writing an SLRU, with a one page buffer.
+ */
+typedef struct SlruSegState
+{
+	bool		writing;
+	bool		long_segment_names;
+
+	char	   *dir;
+	char	   *fn;
+	int			fd;
+	int64		segno;
+	uint64		pageno;
+
+	PGAlignedBlock buf;
+} SlruSegState;
+
+static inline SlruSegState *
+AllocSlruSegState(char *dir)
+{
+	SlruSegState *state = pg_malloc(sizeof(*state));
+
+	state->segno = -1;
+	state->pageno = 0;
+	state->dir = pstrdup(dir);
+	state->fd = -1;
+	state->fn = NULL;
+
+	return state;
+}
+
+static inline void
+SlruFlush(SlruSegState *state)
+{
+	struct iovec	iovec = {
+		.iov_base = &state->buf,
+		.iov_len = BLCKSZ,
+	};
+	off_t			offset;
+
+	if (state->segno == -1)
+		return;
+
+	offset = (state->pageno % SLRU_PAGES_PER_SEGMENT) * BLCKSZ;
+
+	if (pg_pwritev_with_retry(state->fd, &iovec, 1, offset) < 0)
+		pg_fatal("could not write file \"%s\": %m", state->fn);
+}
+
+/*
+ * Create slru reader for dir.
+ *
+ * Returns the malloced memory used by all other read calls in this module.
+ */
+SlruSegState *
+AllocSlruRead(char *dir)
+{
+	SlruSegState *state = AllocSlruSegState(dir);
+
+	state->writing = false;
+
+	return state;
+}
+
+/*
+ * Open given page for reading.
+ *
+ * Reading can be done in random order.
+ */
+char *
+SlruReadSwitchPage(SlruSegState *state, uint64 pageno)
+{
+	int64 segno;
+
+	Assert(!state->writing);	/* read only mode */
+
+	if (state->segno != -1 && pageno == state->pageno)
+		return state->buf.data;
+
+	segno = pageno / SLRU_PAGES_PER_SEGMENT;
+	if (segno != state->segno)
+	{
+		if (state->segno != -1)
+		{
+			close(state->fd);
+			state->fd = -1;
+
+			pg_free(state->fn);
+			state->fn = NULL;
+		}
+
+		/* Open new segment */
+		state->fn = psprintf("%s/%04X", state->dir, (unsigned int) segno);
+		if ((state->fd = open(state->fn, O_RDONLY | PG_BINARY, 0)) < 0)
+			pg_fatal("could not open file \"%s\": %m", state->fn);
+	}
+
+	state->segno = segno;
+
+	{
+		struct iovec	iovec = {
+			.iov_base = &state->buf,
+			.iov_len = BLCKSZ,
+		};
+		off_t			offset = (pageno % SLRU_PAGES_PER_SEGMENT) * BLCKSZ;
+
+		if (pg_preadv(state->fd, &iovec, 1, offset) < 0)
+			pg_fatal("could not read file \"%s\": %m", state->fn);
+
+		state->pageno = pageno;
+	}
+
+	return state->buf.data;
+}
+
+/*
+ * Frees the malloced reader.
+ */
+void
+FreeSlruRead(SlruSegState *state)
+{
+	Assert(!state->writing);	/* read only mode */
+
+	close(state->fd);
+	pg_free(state);
+}
+
+/*
+ * Open the given page for writing.
+ *
+ * NOTE: This uses O_EXCL when stepping to a new segment, so this assumes that
+ * each segment is written in full before moving on to next one.  This
+ * limitation would be easy to lift if needed, but it fits the usage pattern of
+ * current callers.
+ */
+char *
+SlruWriteSwitchPage(SlruSegState *state, uint64 pageno)
+{
+	int64	segno = pageno / SLRU_PAGES_PER_SEGMENT;
+	off_t	offset;
+
+	if (state->segno != -1 && pageno == state->pageno)
+		return state->buf.data;
+
+	segno = pageno / SLRU_PAGES_PER_SEGMENT;
+	offset = (pageno % SLRU_PAGES_PER_SEGMENT) * BLCKSZ;
+
+	SlruFlush(state);
+	memset(state->buf.data, 0, BLCKSZ);
+
+	if (segno != state->segno)
+	{
+		if (state->segno != -1)
+		{
+			close(state->fd);
+			state->fd = -1;
+
+			pg_free(state->fn);
+			state->fn = NULL;
+		}
+
+		/* Create the segment */
+		if (state->long_segment_names)
+		{
+			Assert(segno >= 0 && segno <= INT64CONST(0xFFFFFFFFFFFFFFF));
+			state->fn = psprintf("%s/%015" PRIX64, state->dir, segno);
+		}
+		else
+		{
+			Assert(segno >= 0 && segno <= INT64CONST(0xFFFFFF));
+			state->fn = psprintf("%s/%04X", state->dir, (unsigned int) segno);
+		}
+
+		if ((state->fd = open(state->fn, O_RDWR | O_CREAT | O_EXCL | PG_BINARY,
+							  pg_file_create_mode)) < 0)
+		{
+			pg_fatal("could not create file \"%s\": %m", state->fn);
+		}
+
+		state->segno = segno;
+
+		if (offset > 0 && pg_pwrite_zeros(state->fd, offset, 0) < 0)
+			pg_fatal("could not write file \"%s\": %m", state->fn);
+	}
+
+	state->pageno = pageno;
+
+	return state->buf.data;
+}
+
+/*
+ * Create slru writer for dir.
+ *
+ * Returns the malloced memory used by all other write calls in this module.
+ */
+SlruSegState *
+AllocSlruWrite(char *dir, bool long_segment_names)
+{
+	SlruSegState *state = AllocSlruSegState(dir);
+
+	state->writing = true;
+	state->long_segment_names = long_segment_names;
+
+	return state;
+}
+
+/*
+ * Frees the malloced writer.
+ */
+void
+FreeSlruWrite(SlruSegState *state)
+{
+	Assert(state->writing);
+
+	SlruFlush(state);
+
+	close(state->fd);
+	pg_free(state);
+}
diff --git a/src/bin/pg_upgrade/slru_io.h b/src/bin/pg_upgrade/slru_io.h
new file mode 100644
index 0000000000..920b8ae82e
--- /dev/null
+++ b/src/bin/pg_upgrade/slru_io.h
@@ -0,0 +1,30 @@
+/*
+ * slru_io.h
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ * src/bin/pg_upgrade/slru_io.h
+ */
+
+/*
+ * Some kind of iterator associated with a particular SLRU segment.  The idea is
+ * to specify the segment and page number and then move through the pages.
+ */
+
+#include "postgres_fe.h"
+
+/*
+ * See access/slru.h
+ *
+ * Copy here, since slru.h could not be included in fe code.
+ */
+#define SLRU_PAGES_PER_SEGMENT 32
+
+typedef struct SlruSegState SlruSegState;
+
+extern SlruSegState *AllocSlruRead(char *dir);
+extern char *SlruReadSwitchPage(SlruSegState *state, uint64 pageno);
+extern void FreeSlruRead(SlruSegState *state);
+
+extern SlruSegState *AllocSlruWrite(char *dir, bool long_segment_names);
+extern char *SlruWriteSwitchPage(SlruSegState *state, uint64 pageno);
+extern void FreeSlruWrite(SlruSegState *state);
-- 
2.51.0
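
The member-group layout that the patch above copies from multixact.c (four flag bytes followed by four xids, 20 bytes per group) can be checked with a few lines of arithmetic. The following is only a sketch of that arithmetic, assuming the default 8kB BLCKSZ and 4-byte TransactionIds; the `MemberLocation` struct and `locate_member()` are hypothetical names used for illustration and are not part of the patch.

```c
#include <stdint.h>

#define BLCKSZ              8192    /* assumption: default block size */
#define FLAGBYTES_PER_GROUP 4
#define MEMBERS_PER_GROUP   4
#define XID_SIZE            4       /* assumption: sizeof(TransactionId) */
#define GROUP_SIZE          (XID_SIZE * MEMBERS_PER_GROUP + FLAGBYTES_PER_GROUP)
#define GROUPS_PER_PAGE     (BLCKSZ / GROUP_SIZE)
#define MEMBERS_PER_PAGE    (GROUPS_PER_PAGE * MEMBERS_PER_GROUP)
#define WASTED_PER_PAGE     (BLCKSZ - GROUPS_PER_PAGE * GROUP_SIZE)

/* Hypothetical helper type: where one member lives on disk. */
typedef struct MemberLocation
{
    int64_t pageno;     /* members SLRU page */
    int     flagsoff;   /* byte offset of the group's flag word in the page */
    int     bshift;     /* bit shift of this member's flag byte in that word */
    int     memberoff;  /* byte offset of this member's xid in the page */
} MemberLocation;

/* Mirrors the MXOffsetTo*() helpers, folded into one function. */
static MemberLocation
locate_member(uint64_t offset)
{
    MemberLocation loc;
    uint64_t    group = offset / MEMBERS_PER_GROUP;
    int         member_in_group = (int) (offset % MEMBERS_PER_GROUP);

    loc.pageno = (int64_t) (offset / MEMBERS_PER_PAGE);
    loc.flagsoff = (int) (group % GROUPS_PER_PAGE) * GROUP_SIZE;
    loc.bshift = member_in_group * 8;
    loc.memberoff = loc.flagsoff + FLAGBYTES_PER_GROUP +
        member_in_group * XID_SIZE;
    return loc;
}
```

With these assumptions a page holds 409 groups (1636 members) and leaves 12 bytes unused, which matches the comment in multixact_old.c. The last member of a page, offset 1635, ends exactly at byte 8180.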

#51Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Maxim Orlov (#50)
Re: POC: make mxidoff 64 bits

On 30/10/2025 08:13, Maxim Orlov wrote:

On Tue, 28 Oct 2025 at 17:17, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 27/10/2025 17:54, Maxim Orlov wrote:

If backend C looks up multixid 101 in between steps 3 and 4, it would
read the offset incorrectly, because 'base' isn't set yet.

Hmm, maybe I'm missing something? We set page base on first write of any
offset on the page, not only the first one. In other words, there
should never be a case when we read an offset without a previously
defined page base. Correct me if I'm wrong:
1. Backend A assigned mxact=100, offset=1000.
2. Backend B assigned mxact=101, offset=1010.
3. Backend B calls RecordNewMultiXact()/MXOffsetWrite() and, while
    holding the lock on the page, sets page base=1010 and stores
    offset 0 with the 0x80000000 bit set (0 ^ 0x80000000).
4. Backend C looks up mxact=101 by calling MXOffsetRead()
    and should get exactly what it's looking for:
    base (1010) + offset (0) with the 0x80000000 bit removed.
5. Backend A calls RecordNewMultiXact() and sets his offset using
    existing base from step 3.

Oh I see, the 'base' is not necessarily the base offset of the first
multixact on the page, it's the base offset of the first multixid that
is written to the page. And the (short) offsets can be negative. That's
a frighteningly clever encoding scheme. One upshot of that is that WAL
redo might construct the page with a different 'base'. I guess that
works, but it scares me. Could we come up with a more deterministic scheme?
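
To make the order dependence concrete, here is a rough sketch of the scheme as I understand it. It follows MXOffsetWrite() from the patch, but the read side, the `OffsetPage` struct, and the function names are my own assumptions for illustration, not the patch's code.

```c
#include <stdbool.h>
#include <stdint.h>

#define ENTRIES   8                         /* tiny page, for illustration only */
#define INIT_BIT  UINT32_C(0x80000000)      /* marks a slot as written */

/* Hypothetical stand-in for the offsets SLRU page: 64-bit base + short slots. */
typedef struct OffsetPage
{
    uint64_t    base;
    uint32_t    slot[ENTRIES];
} OffsetPage;

/* The first write to the page fixes the base; later writes store a delta. */
static void
offset_write(OffsetPage *page, int entryno, uint64_t offset)
{
    uint32_t    delta;

    if (page->base == 0)        /* assumes 0 is never a real start offset */
        page->base = offset;

    /* two's complement delta; "negative" if offset is below the base */
    delta = (uint32_t) (offset - page->base);
    page->slot[entryno] = delta ^ INIT_BIT;
}

/* Assumed inverse of the write above; returns false for an unwritten slot. */
static bool
offset_read(const OffsetPage *page, int entryno, uint64_t *offset)
{
    uint32_t    raw = page->slot[entryno];
    uint32_t    delta;
    uint64_t    ext;

    if (raw == 0)
        return false;

    delta = raw ^ INIT_BIT;
    /* sign-extend the 32-bit delta to 64 bits using unsigned arithmetic */
    ext = (delta & INIT_BIT) ? (delta | UINT64_C(0xFFFFFFFF00000000)) : delta;
    *offset = page->base + ext;
    return true;
}
```

Writing multixid 101 (offset 1010) before multixid 100 (offset 1000) leaves base=1010 and a delta of -10 in slot 100, while the opposite order leaves base=1000 and a delta of +10 in slot 101. Both pages decode to the same offsets, but their bytes differ, which is the non-determinism I'm worried about for WAL redo.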

- Heikki

#52Maxim Orlov
orlovmg@gmail.com
In reply to: Heikki Linnakangas (#51)
5 attachment(s)
Re: POC: make mxidoff 64 bits

On Thu, 30 Oct 2025 at 12:10, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Oh I see, the 'base' is not necessarily the base offset of the first
multixact on the page, it's the base offset of the first multixid that
is written to the page. And the (short) offsets can be negative. That's
a frighteningly clever encoding scheme. One upshot of that is that WAL
redo might construct the page with a different 'base'. I guess that
works, but it scares me. Could we come up with a more deterministic scheme?

Definitely! The most stable approach is the one we had before, which
used actual 64-bit offsets in the SLRU. To be honest, I'm completely
happy with it. After all, what's most important for me is to have 64-bit
xids in Postgres, and this patch is a step towards that goal.

PFA v20 returns to using actual 64-bit offsets for on-disk SLRU
segments.

Fortunately, now that I've separated reading and writing offsets into
different functions, switching from one implementation to another is
easy to do.

Here's a quick overview of the current state of the patch:
1) Access to the offsets is moved into separate functions:
MXOffsetWrite/MXOffsetRead.
2) I abandoned byte juggling in pg_upgrade and moved to using logic that
replicates the work with offsets in multixact.c.
3) As a result, the upgrade issue comes down to the correct implementation
of the MXOffsetWrite/MXOffsetRead functions.
4) The only question that remains is the on-disk representation
of 64-bit offsets in SLRU segments.

My thoughts on point (4).

Using 32-bit offsets + some kind of packing:
Pros:
 + Reduces the total disk space used by the segments; ideally it is
   almost the same as before.
Cons:
 - Reduces reliability (losing a part will most likely result in losing
   the entire page).
 - Complicates code, especially considering that segments may be written
   to the page in random order.

Using 64-bit offsets in SLRU:
Pros:
 + Easy to implement/transparent logic.
Cons:
 - Increases the amount of disk space used.

In terms of speed, I'm not sure which will be faster. On the one hand,
64-bit eliminates the necessity for calculations and branching. On the
other hand, the amount of data used will increase.

I am not opposed to either of these options, as our primary goal is getting
64-bit offsets. However, I like the approach using full 64-bit offsets
in SLRU, because it is clearer and, shall we say, more robust. Yes, it
will increase the number of segments; however, this is not heap data
for a table. Under typical circumstances, there should not be too many
such segments.
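
The disk-space cost can be estimated with a bit of arithmetic. The sketch
below assumes the default 8 kB block size and 32 pages per SLRU segment; the
helper names are made up for this example and do not exist in the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed to match the server's SLRU segment size of 32 pages. */
#define SLRU_PAGES_PER_SEGMENT 32

/* Entries per page, i.e. BLCKSZ / sizeof(MultiXactOffset). */
uint64_t
offsets_per_page(uint32_t blcksz, uint32_t entry_size)
{
	assert(entry_size > 0 && blcksz >= entry_size);
	return blcksz / entry_size;
}

/* Entries that fit into one offsets segment file. */
uint64_t
offsets_per_segment(uint32_t blcksz, uint32_t entry_size)
{
	return offsets_per_page(blcksz, entry_size) * SLRU_PAGES_PER_SEGMENT;
}

/* Segment files needed to hold nmultis entries, rounded up. */
uint64_t
segments_needed(uint64_t nmultis, uint32_t blcksz, uint32_t entry_size)
{
	uint64_t	per_segment = offsets_per_segment(blcksz, entry_size);

	return (nmultis + per_segment - 1) / per_segment;
}
```

With 8192-byte blocks, a page holds 2048 four-byte offsets or 1024 eight-byte
offsets, so one million live multixacts need 16 segments instead of 31. At
256 kB per segment that is roughly 4 MB versus 8 MB.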

--
Best regards,
Maxim Orlov.

Attachments:

v20-0005-TEST-Add-test-for-64-bit-mxoff-in-pg_upgrade.patch.txt (text/plain; charset=US-ASCII)
From dd45bfb97be126c03a1c4e41f5794c5726dcd413 Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Wed, 29 Oct 2025 14:19:56 +0300
Subject: [PATCH v20 5/5] TEST: Add test for 64-bit mxoff in pg_upgrade

---
 src/bin/pg_upgrade/t/007_mxoff.pl | 461 ++++++++++++++++++++++++++++++
 1 file changed, 461 insertions(+)
 create mode 100644 src/bin/pg_upgrade/t/007_mxoff.pl

diff --git a/src/bin/pg_upgrade/t/007_mxoff.pl b/src/bin/pg_upgrade/t/007_mxoff.pl
new file mode 100644
index 00000000000..26fd6e9c5d0
--- /dev/null
+++ b/src/bin/pg_upgrade/t/007_mxoff.pl
@@ -0,0 +1,461 @@
+
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+use strict;
+use warnings FATAL => 'all';
+
+use Math::BigInt;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use PostgreSQL::Test::AdjustDump;
+use PostgreSQL::Test::AdjustUpgrade;
+use Test::More;
+
+# This test involves different multitransaction states, similarly to that of
+# 002_pg_upgrade.pl.
+
+unless (defined($ENV{oldinstall}))
+{
+	plan skip_all => 'to run test set oldinstall environment variable to the pre 64-bit mxoff cluster';
+}
+
+# Temp dir for dumps.
+my $tempdir = PostgreSQL::Test::Utils::tempdir;
+
+# Can be changed to test the other modes.
+my $mode = $ENV{PG_TEST_PG_UPGRADE_MODE} || '--copy';
+
+# Get NextMultiOffset.
+sub next_mxoff
+{
+	my $node = shift;
+
+	my $pg_controldata_path =
+		defined($node->install_path) ?
+			$node->install_path . '/bin/pg_controldata' :
+			'pg_controldata';
+	my ($stdout, $stderr) = run_command([ $pg_controldata_path,
+											$node->data_dir ]);
+	my @control_data = split("\n", $stdout);
+	my $next_mxoff = undef;
+
+	foreach (@control_data)
+	{
+		if ($_ =~ /^Latest checkpoint's NextMultiOffset:\s*(.*)$/mg)
+		{
+			$next_mxoff = $1;
+			last;
+		}
+	}
+	die "NextMultiOffset not found in control file\n"
+		unless defined($next_mxoff);
+
+	return $next_mxoff;
+}
+
+# Consume around 10k of mxoffsets.
+sub mxact_eater
+{
+	my $node = shift;
+	my $tbl = 'FOO';
+
+	my ($mxoff1, $mxoff2);
+
+	$mxoff1 = next_mxoff($node);
+	$node->start;
+	$node->safe_psql('postgres',
+		"CREATE TABLE ${tbl} (I INT PRIMARY KEY, N_UPDATED INT) " .
+		"       WITH (AUTOVACUUM_ENABLED=FALSE);" .
+		"INSERT INTO ${tbl} SELECT G, 0 FROM GENERATE_SERIES(1, 50) G;");
+
+	# consume around 10k mxoff
+	my $nclients = 10;
+	my $update_every = 75;
+	my @connections = ();
+
+	for (0..$nclients)
+	{
+		my $conn = $node->background_psql('postgres');
+		$conn->query_safe("BEGIN");
+
+		push(@connections, $conn);
+	}
+
+	for (my $i = 0; $i < 1000; $i++)
+	{
+		my $conn = $connections[$i % $nclients];
+
+		$conn->query_safe("COMMIT;");
+		$conn->query_safe("BEGIN");
+
+		if ($i % $update_every == 0)
+		{
+			$conn->query_safe(
+				"UPDATE ${tbl} SET " .
+				"N_UPDATED = N_UPDATED + 1 " .
+				"WHERE I = ${i} % 50");
+		}
+		else
+		{
+			$conn->query_safe(
+				"SELECT * FROM ${tbl} FOR KEY SHARE");
+		}
+	}
+
+	for my $conn (@connections)
+	{
+		$conn->quit();
+	}
+
+	$node->stop;
+	$mxoff2 = next_mxoff($node);
+
+	return $mxoff1, $mxoff2;
+}
+
+# Consume around 2M of mxoffsets.
+sub mxact_huge_eater
+{
+	my $node = shift;
+	my $tbl = 'FOO';
+
+	my ($mxoff1, $mxoff2);
+
+	$mxoff1 = next_mxoff($node);
+	$node->start;
+	$node->safe_psql('postgres',
+		"CREATE TABLE ${tbl} (I INT PRIMARY KEY, N_UPDATED INT) " .
+		"       WITH (AUTOVACUUM_ENABLED=FALSE);" .
+		"INSERT INTO ${tbl} SELECT G, 0 FROM GENERATE_SERIES(1, 50) G;");
+
+	# consume around 1M mxoff
+	my $nclients = 10;
+	my $update_every = 95;
+	my @connections = ();
+	my $timeout = 10 * $PostgreSQL::Test::Utils::timeout_default;
+
+	for (0..$nclients)
+	{
+		my $conn = $node->background_psql('postgres',
+										  timeout => $timeout);
+		$conn->query_safe("BEGIN");
+
+		push(@connections, $conn);
+	}
+
+	# It's a long process, better to tell about progress.
+	my $n_steps = 200_000;
+	my $step = int($n_steps / 10);
+
+	diag "\nstart to consume mxoffsets ...\n";
+	for (my $i = 0; $i < $n_steps; $i++)
+	{
+		my $conn = $connections[$i % $nclients];
+
+		$conn->query_safe("COMMIT;");
+		$conn->query_safe("BEGIN");
+
+		if ($i % $update_every == 0)
+		{
+			# Perform some non-key UPDATEs too, to exercise different multixact
+			# member statuses.
+			$conn->query_safe(
+				"UPDATE ${tbl} SET " .
+				"N_UPDATED = N_UPDATED + 1 " .
+				"WHERE I = ${i} % 50");
+		}
+		else
+		{
+			$conn->query_safe(
+				"SELECT * FROM ${tbl} " .
+				"TABLESAMPLE SYSTEM (85) " .
+				"FOR KEY SHARE");
+		}
+
+		if ($i % $step == 0)
+		{
+			my $done = int(($i / $n_steps) * 100);
+			diag "$done% done...";
+		}
+	}
+
+	for my $conn (@connections)
+	{
+		$conn->quit();
+	}
+
+	$node->stop;
+	$mxoff2 = next_mxoff($node);
+
+	return $mxoff1, $mxoff2;
+}
+
+# Set oldest multixact-offset
+sub reset_mxoff
+{
+	my $node = shift;
+	my $offset = shift;
+
+	my $pg_resetwal_path = $node->install_path . '/bin/pg_resetwal';
+
+	# Get block size
+	my $out = (run_command([ $pg_resetwal_path, '--dry-run',
+							 $node->data_dir ]))[0];
+		$out =~ /^Database block size: *(\d+)$/m or die;
+	my $blcksz = $1;
+
+	# Reset to new offset
+	my @cmd = ($pg_resetwal_path, '--pgdata' => $node->data_dir);
+	push @cmd, '--multixact-offset' => $offset;
+	command_ok(\@cmd, 'set oldest multixact-offset');
+
+	# Fill empty pg_multixact/members segment
+	my $mult = 32 * int($blcksz / 20) * 4;
+	my $segname = sprintf "%04X", $offset / $mult;
+
+	my @dd = ('dd');
+	push @dd, "if=/dev/zero";
+	push @dd, "of=" . $node->data_dir . "/pg_multixact/members/" . $segname;
+	push @dd, "bs=$blcksz";
+	push @dd, "count=32";
+	command_ok(\@dd, 'fill empty multixact-members');
+}
+
+sub get_dump_for_comparison
+{
+	my ($node, $db, $file_prefix, $adjust_child_columns) = @_;
+
+	my $dumpfile = $tempdir . '/' . $file_prefix . '.sql';
+	my $dump_adjusted = "${dumpfile}_adjusted";
+
+	open(my $dh, '>', $dump_adjusted)
+	  || die "could not open $dump_adjusted for writing $!";
+
+	$node->run_log(
+		[
+			'pg_dump', '--no-sync',
+			'--restrict-key' => 'test',
+			'-d' => $node->connstr($db),
+			'-f' => $dumpfile
+		]);
+
+	print $dh adjust_regress_dumpfile(slurp_file($dumpfile),
+		$adjust_child_columns);
+	close($dh);
+
+	return $dump_adjusted;
+}
+
+# Main test workhorse routine.
+# Make pg_upgrade, dump data and compare it.
+sub run_test
+{
+	my $tag = shift;
+	my $oldnode = shift;
+	my $newnode = shift;
+
+	command_ok(
+		[
+			'pg_upgrade', '--no-sync',
+			'--old-datadir' => $oldnode->data_dir,
+			'--new-datadir' => $newnode->data_dir,
+			'--old-bindir' => $oldnode->config_data('--bindir'),
+			'--new-bindir' => $newnode->config_data('--bindir'),
+			'--socketdir' => $newnode->host,
+			'--old-port' => $oldnode->port,
+			'--new-port' => $newnode->port,
+			$mode,
+		],
+		'run of pg_upgrade for new instance');
+	ok( !-d $newnode->data_dir . "/pg_upgrade_output.d",
+		"pg_upgrade_output.d/ removed after pg_upgrade success");
+
+	$oldnode->start;
+	my $src_dump =
+		get_dump_for_comparison($oldnode, 'postgres',
+								"oldnode_${tag}_dump", 0);
+	$oldnode->stop;
+
+	$newnode->start;
+	my $dst_dump =
+		get_dump_for_comparison($newnode, 'postgres',
+								"newnode_${tag}_dump", 0);
+	$newnode->stop;
+
+	compare_files($src_dump, $dst_dump,
+		'dump outputs from original and restored regression databases match');
+}
+
+sub to_hex
+{
+	my $arg = shift;
+
+	$arg = Math::BigInt->new($arg);
+	$arg = $arg->as_hex();
+
+	return $arg;
+}
+
+# case #1: start old node from defaults
+{
+	my $tag = 1;
+	my $old =
+		PostgreSQL::Test::Cluster->new("oldnode${tag}",
+									   install_path => $ENV{oldinstall});
+	$old->init(extra => ['-k']);
+
+	my ($start_mxoff, $finish_mxoff) = mxact_eater($old);
+
+	my $new = PostgreSQL::Test::Cluster->new("newnode${tag}");
+	$new->init;
+
+	run_test($tag, $old, $new);
+
+	$start_mxoff = to_hex($start_mxoff);
+	$finish_mxoff = to_hex($finish_mxoff);
+
+	my $next_mxoff = to_hex(next_mxoff($new));
+
+	note ">>> case #${tag}\n" .
+		 " oldnode mxoff from ${start_mxoff} to ${finish_mxoff}\n" .
+		 " newnode mxoff ${next_mxoff}\n";
+}
+
+# case #2: start old node from before 32-bit wraparound
+{
+	my $tag = 2;
+	my $old =
+		PostgreSQL::Test::Cluster->new("oldnode${tag}",
+									   install_path => $ENV{oldinstall});
+
+	$old->init(extra => ['-k']);
+	reset_mxoff($old, 0xFFFF0000);
+
+	my ($start_mxoff, $finish_mxoff) = mxact_eater($old);
+
+	my $new = PostgreSQL::Test::Cluster->new("newnode${tag}");
+	$new->init;
+
+	run_test($tag, $old, $new);
+
+	$start_mxoff = to_hex($start_mxoff);
+	$finish_mxoff = to_hex($finish_mxoff);
+
+	my $next_mxoff = to_hex(next_mxoff($new));
+
+	note ">>> case #${tag}\n" .
+		 " oldnode mxoff from ${start_mxoff} to ${finish_mxoff}\n" .
+		 " newnode mxoff ${next_mxoff}\n";
+}
+
+# case #3: start old node near 32-bit wraparound and reach wraparound state.
+{
+	my $tag = 3;
+	my $old =
+		PostgreSQL::Test::Cluster->new("oldnode${tag}",
+									   install_path => $ENV{oldinstall});
+
+	$old->init(extra => ['-k']);
+
+	reset_mxoff($old, 0xFFFFEC77);
+	my ($start_mxoff, $finish_mxoff) = mxact_eater($old);
+
+	my $new = PostgreSQL::Test::Cluster->new("newnode${tag}");
+	$new->init;
+
+	run_test($tag, $old, $new);
+
+	$start_mxoff = to_hex($start_mxoff);
+	$finish_mxoff = to_hex($finish_mxoff);
+
+	my $next_mxoff = to_hex(next_mxoff($new));
+
+	note ">>> case #${tag}\n" .
+		 " oldnode mxoff from ${start_mxoff} to ${finish_mxoff}\n" .
+		 " newnode mxoff ${next_mxoff}\n";
+}
+
+# case #4: start old node from defaults
+{
+	my $tag = 4;
+	my $old =
+		PostgreSQL::Test::Cluster->new("oldnode${tag}",
+									   install_path => $ENV{oldinstall});
+
+	$old->init(extra => ['-k']);
+
+	diag "test #${tag} for multiple mxoff segments";
+	my ($start_mxoff, $finish_mxoff) = mxact_huge_eater($old);
+
+	my $new = PostgreSQL::Test::Cluster->new("newnode${tag}");
+	$new->init;
+
+	run_test($tag, $old, $new);
+
+	$start_mxoff = to_hex($start_mxoff);
+	$finish_mxoff = to_hex($finish_mxoff);
+
+	my $next_mxoff = to_hex(next_mxoff($new));
+
+	note ">>> case #${tag}\n" .
+		 " oldnode mxoff from ${start_mxoff} to ${finish_mxoff}\n" .
+		 " newnode mxoff ${next_mxoff}\n";
+}
+
+# case #5: start old node from before 32-bit wraparound
+{
+	my $tag = 5;
+	my $old =
+		PostgreSQL::Test::Cluster->new("oldnode${tag}",
+									   install_path => $ENV{oldinstall});
+
+	$old->init(extra => ['-k']);
+	reset_mxoff($old, 0xFFFF0000);
+
+	diag "test #${tag} for multiple mxoff segments";
+	my ($start_mxoff, $finish_mxoff) = mxact_huge_eater($old);
+
+	my $new = PostgreSQL::Test::Cluster->new("newnode${tag}");
+	$new->init;
+
+	run_test($tag, $old, $new);
+
+	$start_mxoff = to_hex($start_mxoff);
+	$finish_mxoff = to_hex($finish_mxoff);
+
+	my $next_mxoff = to_hex(next_mxoff($new));
+
+	note ">>> case #${tag}\n" .
+		 " oldnode mxoff from ${start_mxoff} to ${finish_mxoff}\n" .
+		 " newnode mxoff ${next_mxoff}\n";
+}
+
+# case #6: start old node near 32-bit wraparound and reach wraparound state.
+{
+	my $tag = 6;
+	my $old =
+		PostgreSQL::Test::Cluster->new("oldnode${tag}",
+									   install_path => $ENV{oldinstall});
+
+	$old->init(extra => ['-k']);
+
+	reset_mxoff($old, 0xFFFFFFFF - 1_000_000);
+	my ($start_mxoff, $finish_mxoff) = mxact_huge_eater($old);
+
+	diag "test #${tag} for multiple mxoff segments";
+	my $new = PostgreSQL::Test::Cluster->new("newnode${tag}");
+	$new->init;
+
+	run_test($tag, $old, $new);
+
+	$start_mxoff = to_hex($start_mxoff);
+	$finish_mxoff = to_hex($finish_mxoff);
+
+	my $next_mxoff = to_hex(next_mxoff($new));
+
+	note ">>> case #${tag}\n" .
+		 " oldnode mxoff from ${start_mxoff} to ${finish_mxoff}\n" .
+		 " newnode mxoff ${next_mxoff}\n";
+}
+
+done_testing();
-- 
2.43.0

v20-0003-Add-test-for-64-bit-mxoff-in-pg_resetwal.patch (application/octet-stream)
From 2a6ae43f5be3a45741c233b001dfb8721c7f8217 Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Tue, 28 Oct 2025 19:08:26 +0300
Subject: [PATCH v20 3/5] Add test for 64-bit mxoff in pg_resetwal

---
 src/bin/pg_resetwal/meson.build    |   1 +
 src/bin/pg_resetwal/t/003_mxoff.pl | 170 +++++++++++++++++++++++++++++
 2 files changed, 171 insertions(+)
 create mode 100644 src/bin/pg_resetwal/t/003_mxoff.pl

diff --git a/src/bin/pg_resetwal/meson.build b/src/bin/pg_resetwal/meson.build
index 290832b2299..1e2dfb38a5b 100644
--- a/src/bin/pg_resetwal/meson.build
+++ b/src/bin/pg_resetwal/meson.build
@@ -25,6 +25,7 @@ tests += {
     'tests': [
       't/001_basic.pl',
       't/002_corrupted.pl',
+      't/003_mxoff.pl',
     ],
   },
 }
diff --git a/src/bin/pg_resetwal/t/003_mxoff.pl b/src/bin/pg_resetwal/t/003_mxoff.pl
new file mode 100644
index 00000000000..3c1b7fa1d33
--- /dev/null
+++ b/src/bin/pg_resetwal/t/003_mxoff.pl
@@ -0,0 +1,170 @@
+
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+use strict;
+use warnings FATAL => 'all';
+
+use Math::BigInt;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+sub mxact_eater
+{
+	my $node = shift;
+	my $tbl = shift;
+
+	$node->start;
+	$node->safe_psql('postgres',
+		"CREATE TABLE ${tbl} (I INT PRIMARY KEY, N_UPDATED INT) " .
+		"       WITH (AUTOVACUUM_ENABLED=FALSE);" .
+		"INSERT INTO ${tbl} SELECT G, 0 FROM GENERATE_SERIES(1, 50) G;");
+
+	# consume around 10k multixact offsets
+	my $nclients = 10;
+	my $update_every = 75;
+	my @connections = ();
+
+	for (0..$nclients)
+	{
+		my $conn = $node->background_psql('postgres');
+		$conn->query_safe("BEGIN");
+
+		push(@connections, $conn);
+	}
+
+	for (my $i = 0; $i < 1000; $i++)
+	{
+		my $conn = $connections[$i % $nclients];
+
+		$conn->query_safe("COMMIT;");
+		$conn->query_safe("BEGIN");
+
+		if ($i % $update_every == 0)
+		{
+			$conn->query_safe(
+				"UPDATE ${tbl} SET " .
+				"N_UPDATED = N_UPDATED + 1 " .
+				"WHERE I = ${i} % 50");
+		}
+		else
+		{
+			$conn->query_safe(
+				"SELECT * FROM ${tbl} FOR KEY SHARE");
+		}
+	}
+
+	for my $conn (@connections)
+	{
+		$conn->quit();
+	}
+
+	$node->stop;
+}
+
+sub next_mxoff
+{
+	my $node = shift;
+	my ($stdout, $stderr) =
+	  run_command([ 'pg_controldata', $node->data_dir ]);
+	my @control_data = split("\n", $stdout);
+	my $next_mxoff = undef;
+
+	foreach (@control_data)
+	{
+		if ($_ =~ /^Latest checkpoint's NextMultiOffset:\s*(.*)$/mg)
+		{
+			$next_mxoff = $1;
+			last;
+		}
+	}
+	die "NextMultiOffset not found in control file\n"
+		unless defined($next_mxoff);
+
+	return $next_mxoff;
+}
+
+sub reset_mxoff
+{
+	my $node = shift;
+	my $offset = shift;
+		$offset = Math::BigInt->new($offset);
+
+	# Get block size
+	my $out = (run_command([ 'pg_resetwal', '--dry-run', $node->data_dir ]))[0];
+		$out =~ /^Database block size: *(\d+)$/m or die;
+	my $blcksz = $1;
+
+	# Reset to new offset
+	my @cmd = ('pg_resetwal', '--pgdata' => $node->data_dir);
+	push @cmd, '--multixact-offset' => $offset->as_hex();
+	command_ok(\@cmd, 'set oldest multixact-offset');
+
+	# Fill empty pg_multixact/members segment
+	my $mult = 32 * int($blcksz / 20) * 4;
+	my $segname = sprintf "%015X", $offset / $mult;
+
+	my @dd = ('dd');
+	push @dd, "if=/dev/zero";
+	push @dd, "of=" . $node->data_dir . "/pg_multixact/members/" . $segname;
+	push @dd, "bs=$blcksz";
+	push @dd, "count=32";
+	command_ok(\@dd, 'fill empty multixact-members');
+}
+
+my ($off1, $off2);
+
+# start from defaults
+my $node1 = PostgreSQL::Test::Cluster->new('node1');
+$node1->init;
+$off1 = next_mxoff($node1);
+mxact_eater($node1, "FOO");
+$off2 = next_mxoff($node1);
+note "> start from $off1, finished at $off2\n";
+
+# start from before 32-bit wraparound
+my $node2 = PostgreSQL::Test::Cluster->new('node2');
+$node2->init;
+reset_mxoff($node2, 0xFFFF0000);
+$off1 = next_mxoff($node2);
+mxact_eater($node2, "FOO");
+$off2 = next_mxoff($node2);
+note "> start from $off1, finished at $off2\n";
+
+# start near 32-bit wraparound
+my $node3 = PostgreSQL::Test::Cluster->new('node3');
+$node3->init;
+reset_mxoff($node3, 0xFFFFEC77);
+$off1 = next_mxoff($node3);
+mxact_eater($node3, "FOO");
+$off2 = next_mxoff($node3);
+note "> start from $off1, finished at $off2\n";
+
+# start over 32-bit wraparound
+my $node4 = PostgreSQL::Test::Cluster->new('node4');
+$node4->init;
+reset_mxoff($node4, '0xFFFFFFFF0000');
+$off1 = next_mxoff($node4);
+mxact_eater($node4, "FOO");
+$off2 = next_mxoff($node4);
+note "> start from $off1, finished at $off2\n";
+
+# check invariant
+$node1->start;
+$node2->start;
+$node3->start;
+$node4->start;
+
+my $var1 = $node1->safe_psql('postgres', 'TABLE FOO');
+my $var2 = $node2->safe_psql('postgres', 'TABLE FOO');
+my $var3 = $node3->safe_psql('postgres', 'TABLE FOO');
+my $var4 = $node4->safe_psql('postgres', 'TABLE FOO');
+ok($var1 eq $var2 && $var2 eq $var3 && $var3 eq $var4,
+	'check table invariant in all nodes');
+
+$node4->stop;
+$node3->stop;
+$node2->stop;
+$node1->stop;
+
+done_testing();
-- 
2.43.0
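
Both reset_mxoff helpers above compute the name of the pg_multixact/members
segment that contains a given offset. The following sketch redoes that
arithmetic in C. The 20-byte group size and four members per group mirror the
tests' "32 * int($blcksz / 20) * 4" expression, and the two name widths
mirror their "%04X" and "%015X" formats. The function names are hypothetical.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define SLRU_PAGES_PER_SEGMENT 32
#define MEMBER_GROUP_SIZE 20	/* 4 flag bytes + 4 member xids of 4 bytes */
#define MEMBERS_PER_GROUP 4

/* Members stored in one segment file for the given block size. */
uint64_t
members_per_segment(uint32_t blcksz)
{
	assert(blcksz >= MEMBER_GROUP_SIZE);
	return (uint64_t) SLRU_PAGES_PER_SEGMENT *
		(blcksz / MEMBER_GROUP_SIZE) * MEMBERS_PER_GROUP;
}

/* Number of the segment that contains the given member offset. */
uint64_t
members_segno(uint64_t offset, uint32_t blcksz)
{
	return offset / members_per_segment(blcksz);
}

/*
 * Format the segment file name: 15 hex digits when long_names is set,
 * at least 4 otherwise.  Returns the length reported by snprintf().
 */
int
members_segment_name(char *buf, size_t buflen, uint64_t offset,
					 uint32_t blcksz, bool long_names)
{
	uint64_t	segno = members_segno(offset, blcksz);

	if (long_names)
		return snprintf(buf, buflen, "%015" PRIX64, segno);
	return snprintf(buf, buflen, "%04" PRIX64, segno);
}
```

With 8192-byte blocks a segment holds 52352 members, so offset 0xFFFF0000
falls into segment 14076 under the short naming scheme. Offset
0xFFFFFFFF0000 falls into segment 000000140782D0F under the long one.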

v20-0001-Use-64-bit-multixact-offsets.patch (application/octet-stream)
From 651dbd070e96059980c50d3dc414ffb7b590bb4b Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Wed, 7 Aug 2024 16:35:22 +0300
Subject: [PATCH v20 1/5] Use 64-bit multixact offsets

Switching to 64-bit multitransaction offsets removes wraparound and the
2^32 limit on their total number.

Author: Maxim Orlov <orlovmg@gmail.com>
---
 src/backend/access/rmgrdesc/mxactdesc.c   |   4 +-
 src/backend/access/rmgrdesc/xlogdesc.c    |   2 +-
 src/backend/access/transam/multixact.c    | 312 ++++------------------
 src/backend/access/transam/xlogrecovery.c |   2 +-
 src/backend/commands/vacuum.c             |   2 +-
 src/backend/postmaster/autovacuum.c       |   4 +-
 src/bin/pg_controldata/pg_controldata.c   |   2 +-
 src/bin/pg_resetwal/pg_resetwal.c         |   6 +-
 src/bin/pg_resetwal/t/001_basic.pl        |   2 +-
 src/include/access/multixact.h            |   3 +-
 src/include/c.h                           |   2 +-
 11 files changed, 63 insertions(+), 278 deletions(-)

diff --git a/src/backend/access/rmgrdesc/mxactdesc.c b/src/backend/access/rmgrdesc/mxactdesc.c
index 3ca0582db36..052dd0a4ce5 100644
--- a/src/backend/access/rmgrdesc/mxactdesc.c
+++ b/src/backend/access/rmgrdesc/mxactdesc.c
@@ -65,7 +65,7 @@ multixact_desc(StringInfo buf, XLogReaderState *record)
 		xl_multixact_create *xlrec = (xl_multixact_create *) rec;
 		int			i;
 
-		appendStringInfo(buf, "%u offset %u nmembers %d: ", xlrec->mid,
+		appendStringInfo(buf, "%u offset %" PRIu64 " nmembers %d: ", xlrec->mid,
 						 xlrec->moff, xlrec->nmembers);
 		for (i = 0; i < xlrec->nmembers; i++)
 			out_member(buf, &xlrec->members[i]);
@@ -74,7 +74,7 @@ multixact_desc(StringInfo buf, XLogReaderState *record)
 	{
 		xl_multixact_truncate *xlrec = (xl_multixact_truncate *) rec;
 
-		appendStringInfo(buf, "offsets [%u, %u), members [%u, %u)",
+		appendStringInfo(buf, "offsets [%u, %u), members [%" PRIu64 ", %" PRIu64 ")",
 						 xlrec->startTruncOff, xlrec->endTruncOff,
 						 xlrec->startTruncMemb, xlrec->endTruncMemb);
 	}
diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c
index cd6c2a2f650..441034f5929 100644
--- a/src/backend/access/rmgrdesc/xlogdesc.c
+++ b/src/backend/access/rmgrdesc/xlogdesc.c
@@ -66,7 +66,7 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 		CheckPoint *checkpoint = (CheckPoint *) rec;
 
 		appendStringInfo(buf, "redo %X/%08X; "
-						 "tli %u; prev tli %u; fpw %s; wal_level %s; xid %u:%u; oid %u; multi %u; offset %u; "
+						 "tli %u; prev tli %u; fpw %s; wal_level %s; xid %u:%u; oid %u; multi %u; offset %" PRIu64 "; "
 						 "oldest xid %u in DB %u; oldest multi %u in DB %u; "
 						 "oldest/newest commit timestamp xid: %u/%u; "
 						 "oldest running xid %u; %s",
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 9d5f130af7e..93a1e4cfd2a 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -92,17 +92,9 @@
 /*
  * Defines for MultiXactOffset page sizes.  A page is the same BLCKSZ as is
  * used everywhere else in Postgres.
- *
- * Note: because MultiXactOffsets are 32 bits and wrap around at 0xFFFFFFFF,
- * MultiXact page numbering also wraps around at
- * 0xFFFFFFFF/MULTIXACT_OFFSETS_PER_PAGE, and segment numbering at
- * 0xFFFFFFFF/MULTIXACT_OFFSETS_PER_PAGE/SLRU_PAGES_PER_SEGMENT.  We need
- * take no explicit notice of that fact in this module, except when comparing
- * segment and page numbers in TruncateMultiXact (see
- * MultiXactOffsetPagePrecedes).
  */
 
-/* We need four bytes per offset */
+/* We need 8 bytes per offset */
 #define MULTIXACT_OFFSETS_PER_PAGE (BLCKSZ / sizeof(MultiXactOffset))
 
 static inline int64
@@ -208,10 +200,14 @@ MXOffsetToMemberOffset(MultiXactOffset offset)
 		member_in_group * sizeof(TransactionId);
 }
 
-/* Multixact members wraparound thresholds. */
-#define MULTIXACT_MEMBER_SAFE_THRESHOLD		(MaxMultiXactOffset / 2)
-#define MULTIXACT_MEMBER_DANGER_THRESHOLD	\
-	(MaxMultiXactOffset - MaxMultiXactOffset / 4)
+/*
+ * Multixact members warning threshold.
+ *
+ * If the difference between nextOffset and oldestOffset exceeds this value,
+ * we trigger autovacuum in order to release disk space and, if possible,
+ * reduce table bloat.
+ */
+#define MULTIXACT_MEMBER_AUTOVAC_THRESHOLD		UINT64CONST(0xFFFFFFFF)
 
 static inline MultiXactId
 PreviousMultiXactId(MultiXactId multi)
@@ -228,6 +224,24 @@ static SlruCtlData MultiXactMemberCtlData;
 #define MultiXactOffsetCtl	(&MultiXactOffsetCtlData)
 #define MultiXactMemberCtl	(&MultiXactMemberCtlData)
 
+static inline MultiXactOffset
+MXOffsetRead(int entryno, int slotno)
+{
+	MultiXactOffset *offptr =
+		(MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
+
+	return offptr[entryno];
+}
+
+static inline void
+MXOffsetWrite(int entryno, int slotno, MultiXactOffset offset)
+{
+	MultiXactOffset *offptr =
+		(MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
+
+	offptr[entryno] = offset;
+}
+
 /*
  * MultiXact state shared across all backends.  All this state is protected
  * by MultiXactGenLock.  (We also use SLRU bank's lock of MultiXactOffset and
@@ -268,9 +282,6 @@ typedef struct MultiXactStateData
 	MultiXactId multiStopLimit;
 	MultiXactId multiWrapLimit;
 
-	/* support for members anti-wraparound measures */
-	MultiXactOffset offsetStopLimit;	/* known if oldestOffsetKnown */
-
 	/*
 	 * This is used to sleep until a multixact offset is written when we want
 	 * to create the next one.
@@ -401,8 +412,6 @@ static bool MultiXactOffsetPrecedes(MultiXactOffset offset1,
 									MultiXactOffset offset2);
 static void ExtendMultiXactOffset(MultiXactId multi);
 static void ExtendMultiXactMember(MultiXactOffset offset, int nmembers);
-static bool MultiXactOffsetWouldWrap(MultiXactOffset boundary,
-									 MultiXactOffset start, uint32 distance);
 static bool SetOffsetVacuumLimit(bool is_startup);
 static bool find_multixact_start(MultiXactId multi, MultiXactOffset *result);
 static void WriteMTruncateXlogRec(Oid oldestMultiDB,
@@ -911,7 +920,6 @@ RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
 	int64		prev_pageno;
 	int			entryno;
 	int			slotno;
-	MultiXactOffset *offptr;
 	int			i;
 	LWLock	   *lock;
 	LWLock	   *prevlock = NULL;
@@ -930,10 +938,8 @@ RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
 	 * take the trouble to generalize the slru.c error reporting code.
 	 */
 	slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, multi);
-	offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
-	offptr += entryno;
 
-	*offptr = offset;
+	MXOffsetWrite(entryno, slotno, offset);
 
 	MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
 
@@ -1155,78 +1161,6 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 	else
 		*offset = nextOffset;
 
-	/*----------
-	 * Protect against overrun of the members space as well, with the
-	 * following rules:
-	 *
-	 * If we're past offsetStopLimit, refuse to generate more multis.
-	 * If we're close to offsetStopLimit, emit a warning.
-	 *
-	 * Arbitrarily, we start emitting warnings when we're 20 segments or less
-	 * from offsetStopLimit.
-	 *
-	 * Note we haven't updated the shared state yet, so if we fail at this
-	 * point, the multixact ID we grabbed can still be used by the next guy.
-	 *
-	 * Note that there is no point in forcing autovacuum runs here: the
-	 * multixact freeze settings would have to be reduced for that to have any
-	 * effect.
-	 *----------
-	 */
-#define OFFSET_WARN_SEGMENTS	20
-	if (MultiXactState->oldestOffsetKnown &&
-		MultiXactOffsetWouldWrap(MultiXactState->offsetStopLimit, nextOffset,
-								 nmembers))
-	{
-		/* see comment in the corresponding offsets wraparound case */
-		SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER);
-
-		ereport(ERROR,
-				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg("multixact \"members\" limit exceeded"),
-				 errdetail_plural("This command would create a multixact with %u members, but the remaining space is only enough for %u member.",
-								  "This command would create a multixact with %u members, but the remaining space is only enough for %u members.",
-								  MultiXactState->offsetStopLimit - nextOffset - 1,
-								  nmembers,
-								  MultiXactState->offsetStopLimit - nextOffset - 1),
-				 errhint("Execute a database-wide VACUUM in database with OID %u with reduced \"vacuum_multixact_freeze_min_age\" and \"vacuum_multixact_freeze_table_age\" settings.",
-						 MultiXactState->oldestMultiXactDB)));
-	}
-
-	/*
-	 * Check whether we should kick autovacuum into action, to prevent members
-	 * wraparound. NB we use a much larger window to trigger autovacuum than
-	 * just the warning limit. The warning is just a measure of last resort -
-	 * this is in line with GetNewTransactionId's behaviour.
-	 */
-	if (!MultiXactState->oldestOffsetKnown ||
-		(MultiXactState->nextOffset - MultiXactState->oldestOffset
-		 > MULTIXACT_MEMBER_SAFE_THRESHOLD))
-	{
-		/*
-		 * To avoid swamping the postmaster with signals, we issue the autovac
-		 * request only when crossing a segment boundary. With default
-		 * compilation settings that's roughly after 50k members.  This still
-		 * gives plenty of chances before we get into real trouble.
-		 */
-		if ((MXOffsetToMemberPage(nextOffset) / SLRU_PAGES_PER_SEGMENT) !=
-			(MXOffsetToMemberPage(nextOffset + nmembers) / SLRU_PAGES_PER_SEGMENT))
-			SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER);
-	}
-
-	if (MultiXactState->oldestOffsetKnown &&
-		MultiXactOffsetWouldWrap(MultiXactState->offsetStopLimit,
-								 nextOffset,
-								 nmembers + MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT * OFFSET_WARN_SEGMENTS))
-		ereport(WARNING,
-				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg_plural("database with OID %u must be vacuumed before %d more multixact member is used",
-							   "database with OID %u must be vacuumed before %d more multixact members are used",
-							   MultiXactState->offsetStopLimit - nextOffset + nmembers,
-							   MultiXactState->oldestMultiXactDB,
-							   MultiXactState->offsetStopLimit - nextOffset + nmembers),
-				 errhint("Execute a database-wide VACUUM in that database with reduced \"vacuum_multixact_freeze_min_age\" and \"vacuum_multixact_freeze_table_age\" settings.")));
-
 	ExtendMultiXactMember(nextOffset, nmembers);
 
 	/*
@@ -1255,7 +1189,8 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 
 	LWLockRelease(MultiXactGenLock);
 
-	debug_elog4(DEBUG2, "GetNew: returning %u offset %u", result, *offset);
+	debug_elog4(DEBUG2, "GetNew: returning %u offset %" PRIu64, result,
+				*offset);
 	return result;
 }
 
@@ -1294,7 +1229,6 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
 	int64		prev_pageno;
 	int			entryno;
 	int			slotno;
-	MultiXactOffset *offptr;
 	MultiXactOffset offset;
 	int			length;
 	int			truelength;
@@ -1418,9 +1352,8 @@ retry:
 	LWLockAcquire(lock, LW_EXCLUSIVE);
 
 	slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, multi);
-	offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
-	offptr += entryno;
-	offset = *offptr;
+
+	offset = MXOffsetRead(entryno, slotno);
 
 	Assert(offset != 0);
 
@@ -1467,9 +1400,7 @@ retry:
 			slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, tmpMXact);
 		}
 
-		offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
-		offptr += entryno;
-		nextMXOffset = *offptr;
+		nextMXOffset = MXOffsetRead(entryno, slotno);
 
 		if (nextMXOffset == 0)
 		{
@@ -1973,7 +1904,7 @@ MultiXactShmemInit(void)
 				  "pg_multixact/members", LWTRANCHE_MULTIXACTMEMBER_BUFFER,
 				  LWTRANCHE_MULTIXACTMEMBER_SLRU,
 				  SYNC_HANDLER_MULTIXACT_MEMBER,
-				  false);
+				  true);
 	/* doesn't call SimpleLruTruncate() or meet criteria for unit tests */
 
 	/* Initialize our shared state struct */
@@ -2150,7 +2081,6 @@ TrimMultiXact(void)
 		slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, nextMXact);
 		offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
 		offptr += entryno;
-
 		MemSet(offptr, 0, BLCKSZ - (entryno * sizeof(MultiXactOffset)));
 
 		MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
@@ -2223,7 +2153,7 @@ MultiXactGetCheckptMulti(bool is_shutdown,
 	LWLockRelease(MultiXactGenLock);
 
 	debug_elog6(DEBUG2,
-				"MultiXact: checkpoint is nextMulti %u, nextOffset %u, oldestMulti %u in DB %u",
+				"MultiXact: checkpoint is nextMulti %u, nextOffset %" PRIu64 ", oldestMulti %u in DB %u",
 				*nextMulti, *nextMultiOffset, *oldestMulti, *oldestMultiDB);
 }
 
@@ -2258,7 +2188,7 @@ void
 MultiXactSetNextMXact(MultiXactId nextMulti,
 					  MultiXactOffset nextMultiOffset)
 {
-	debug_elog4(DEBUG2, "MultiXact: setting next multi to %u offset %u",
+	debug_elog4(DEBUG2, "MultiXact: setting next multi to %u offset %" PRIu64,
 				nextMulti, nextMultiOffset);
 	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
 	MultiXactState->nextMXact = nextMulti;
@@ -2449,7 +2379,7 @@ MultiXactAdvanceNextMXact(MultiXactId minMulti,
 	}
 	if (MultiXactOffsetPrecedes(MultiXactState->nextOffset, minMultiOffset))
 	{
-		debug_elog3(DEBUG2, "MultiXact: setting next offset to %u",
+		debug_elog3(DEBUG2, "MultiXact: setting next offset to %" PRIu64,
 					minMultiOffset);
 		MultiXactState->nextOffset = minMultiOffset;
 	}
@@ -2633,15 +2563,13 @@ GetOldestMultiXactId(void)
 }
 
 /*
- * Determine how aggressively we need to vacuum in order to prevent member
- * wraparound.
+ * Determine whether we need to vacuum to reclaim member space.
  *
  * To do so determine what's the oldest member offset and install the limit
  * info in MultiXactState, where it can be used to prevent overrun of old data
  * in the members SLRU area.
  *
- * The return value is true if emergency autovacuum is required and false
- * otherwise.
+ * The return value is true if autovacuum is required and false otherwise.
  */
 static bool
 SetOffsetVacuumLimit(bool is_startup)
@@ -2653,8 +2581,6 @@ SetOffsetVacuumLimit(bool is_startup)
 	MultiXactOffset nextOffset;
 	bool		oldestOffsetKnown = false;
 	bool		prevOldestOffsetKnown;
-	MultiXactOffset offsetStopLimit = 0;
-	MultiXactOffset prevOffsetStopLimit;
 
 	/*
 	 * NB: Have to prevent concurrent truncation, we might otherwise try to
@@ -2669,7 +2595,6 @@ SetOffsetVacuumLimit(bool is_startup)
 	nextOffset = MultiXactState->nextOffset;
 	prevOldestOffsetKnown = MultiXactState->oldestOffsetKnown;
 	prevOldestOffset = MultiXactState->oldestOffset;
-	prevOffsetStopLimit = MultiXactState->offsetStopLimit;
 	Assert(MultiXactState->finishedStartup);
 	LWLockRelease(MultiXactGenLock);
 
@@ -2700,11 +2625,7 @@ SetOffsetVacuumLimit(bool is_startup)
 		oldestOffsetKnown =
 			find_multixact_start(oldestMultiXactId, &oldestOffset);
 
-		if (oldestOffsetKnown)
-			ereport(DEBUG1,
-					(errmsg_internal("oldest MultiXactId member is at offset %u",
-									 oldestOffset)));
-		else
+		if (!oldestOffsetKnown)
 			ereport(LOG,
 					(errmsg("MultiXact member wraparound protections are disabled because oldest checkpointed MultiXact %u does not exist on disk",
 							oldestMultiXactId)));
@@ -2717,24 +2638,7 @@ SetOffsetVacuumLimit(bool is_startup)
 	 * overrun of old data in the members SLRU area. We can only do so if the
 	 * oldest offset is known though.
 	 */
-	if (oldestOffsetKnown)
-	{
-		/* move back to start of the corresponding segment */
-		offsetStopLimit = oldestOffset - (oldestOffset %
-										  (MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT));
-
-		/* always leave one segment before the wraparound point */
-		offsetStopLimit -= (MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT);
-
-		if (!prevOldestOffsetKnown && !is_startup)
-			ereport(LOG,
-					(errmsg("MultiXact member wraparound protections are now enabled")));
-
-		ereport(DEBUG1,
-				(errmsg_internal("MultiXact member stop limit is now %u based on MultiXact %u",
-								 offsetStopLimit, oldestMultiXactId)));
-	}
-	else if (prevOldestOffsetKnown)
+	if (prevOldestOffsetKnown)
 	{
 		/*
 		 * If we failed to get the oldest offset this time, but we have a
@@ -2744,69 +2648,19 @@ SetOffsetVacuumLimit(bool is_startup)
 		 */
 		oldestOffset = prevOldestOffset;
 		oldestOffsetKnown = true;
-		offsetStopLimit = prevOffsetStopLimit;
 	}
 
 	/* Install the computed values */
 	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
 	MultiXactState->oldestOffset = oldestOffset;
 	MultiXactState->oldestOffsetKnown = oldestOffsetKnown;
-	MultiXactState->offsetStopLimit = offsetStopLimit;
 	LWLockRelease(MultiXactGenLock);
 
 	/*
-	 * Do we need an emergency autovacuum?	If we're not sure, assume yes.
+	 * Do we need autovacuum?	If we're not sure, assume yes.
 	 */
 	return !oldestOffsetKnown ||
-		(nextOffset - oldestOffset > MULTIXACT_MEMBER_SAFE_THRESHOLD);
-}
-
-/*
- * Return whether adding "distance" to "start" would move past "boundary".
- *
- * We use this to determine whether the addition is "wrapping around" the
- * boundary point, hence the name.  The reason we don't want to use the regular
- * 2^31-modulo arithmetic here is that we want to be able to use the whole of
- * the 2^32-1 space here, allowing for more multixacts than would fit
- * otherwise.
- */
-static bool
-MultiXactOffsetWouldWrap(MultiXactOffset boundary, MultiXactOffset start,
-						 uint32 distance)
-{
-	MultiXactOffset finish;
-
-	/*
-	 * Note that offset number 0 is not used (see GetMultiXactIdMembers), so
-	 * if the addition wraps around the UINT_MAX boundary, skip that value.
-	 */
-	finish = start + distance;
-	if (finish < start)
-		finish++;
-
-	/*-----------------------------------------------------------------------
-	 * When the boundary is numerically greater than the starting point, any
-	 * value numerically between the two is not wrapped:
-	 *
-	 *	<----S----B---->
-	 *	[---)			 = F wrapped past B (and UINT_MAX)
-	 *		 [---)		 = F not wrapped
-	 *			  [----] = F wrapped past B
-	 *
-	 * When the boundary is numerically less than the starting point (i.e. the
-	 * UINT_MAX wraparound occurs somewhere in between) then all values in
-	 * between are wrapped:
-	 *
-	 *	<----B----S---->
-	 *	[---)			 = F not wrapped past B (but wrapped past UINT_MAX)
-	 *		 [---)		 = F wrapped past B (and UINT_MAX)
-	 *			  [----] = F not wrapped
-	 *-----------------------------------------------------------------------
-	 */
-	if (start < boundary)
-		return finish >= boundary || finish < start;
-	else
-		return finish >= boundary && finish < start;
+		(nextOffset - oldestOffset > MULTIXACT_MEMBER_AUTOVAC_THRESHOLD);
 }
 
 /*
@@ -2825,7 +2679,6 @@ find_multixact_start(MultiXactId multi, MultiXactOffset *result)
 	int64		pageno;
 	int			entryno;
 	int			slotno;
-	MultiXactOffset *offptr;
 
 	Assert(MultiXactState->finishedStartup);
 
@@ -2843,9 +2696,9 @@ find_multixact_start(MultiXactId multi, MultiXactOffset *result)
 
 	/* lock is acquired by SimpleLruReadPage_ReadOnly */
 	slotno = SimpleLruReadPage_ReadOnly(MultiXactOffsetCtl, pageno, multi);
-	offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
-	offptr += entryno;
-	offset = *offptr;
+
+	offset = MXOffsetRead(entryno, slotno);
+
 	LWLockRelease(SimpleLruGetBankLock(MultiXactOffsetCtl, pageno));
 
 	*result = offset;
@@ -2893,73 +2746,6 @@ GetMultiXactInfo(uint32 *multixacts, MultiXactOffset *members,
 	return true;
 }
 
-/*
- * Multixact members can be removed once the multixacts that refer to them
- * are older than every datminmxid.  autovacuum_multixact_freeze_max_age and
- * vacuum_multixact_freeze_table_age work together to make sure we never have
- * too many multixacts; we hope that, at least under normal circumstances,
- * this will also be sufficient to keep us from using too many offsets.
- * However, if the average multixact has many members, we might exhaust the
- * members space while still using few enough members that these limits fail
- * to trigger relminmxid advancement by VACUUM.  At that point, we'd have no
- * choice but to start failing multixact-creating operations with an error.
- *
- * To prevent that, if more than a threshold portion of the members space is
- * used, we effectively reduce autovacuum_multixact_freeze_max_age and
- * to a value just less than the number of multixacts in use.  We hope that
- * this will quickly trigger autovacuuming on the table or tables with the
- * oldest relminmxid, thus allowing datminmxid values to advance and removing
- * some members.
- *
- * As the fraction of the member space currently in use grows, we become
- * more aggressive in clamping this value.  That not only causes autovacuum
- * to ramp up, but also makes any manual vacuums the user issues more
- * aggressive.  This happens because vacuum_get_cutoffs() will clamp the
- * freeze table and the minimum freeze age cutoffs based on the effective
- * autovacuum_multixact_freeze_max_age this function returns.  In the worst
- * case, we'll claim the freeze_max_age to zero, and every vacuum of any
- * table will freeze every multixact.
- */
-int
-MultiXactMemberFreezeThreshold(void)
-{
-	MultiXactOffset members;
-	uint32		multixacts;
-	uint32		victim_multixacts;
-	double		fraction;
-	int			result;
-	MultiXactId oldestMultiXactId;
-	MultiXactOffset oldestOffset;
-
-	/* If we can't determine member space utilization, assume the worst. */
-	if (!GetMultiXactInfo(&multixacts, &members, &oldestMultiXactId, &oldestOffset))
-		return 0;
-
-	/* If member space utilization is low, no special action is required. */
-	if (members <= MULTIXACT_MEMBER_SAFE_THRESHOLD)
-		return autovacuum_multixact_freeze_max_age;
-
-	/*
-	 * Compute a target for relminmxid advancement.  The number of multixacts
-	 * we try to eliminate from the system is based on how far we are past
-	 * MULTIXACT_MEMBER_SAFE_THRESHOLD.
-	 */
-	fraction = (double) (members - MULTIXACT_MEMBER_SAFE_THRESHOLD) /
-		(MULTIXACT_MEMBER_DANGER_THRESHOLD - MULTIXACT_MEMBER_SAFE_THRESHOLD);
-	victim_multixacts = multixacts * fraction;
-
-	/* fraction could be > 1.0, but lowest possible freeze age is zero */
-	if (victim_multixacts > multixacts)
-		return 0;
-	result = multixacts - victim_multixacts;
-
-	/*
-	 * Clamp to autovacuum_multixact_freeze_max_age, so that we never make
-	 * autovacuum less aggressive than it would otherwise be.
-	 */
-	return Min(result, autovacuum_multixact_freeze_max_age);
-}
-
 typedef struct mxtruncinfo
 {
 	int64		earliestExistingPage;
@@ -3159,7 +2945,7 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 
 	elog(DEBUG1, "performing multixact truncation: "
 		 "offsets [%u, %u), offsets segments [%" PRIx64 ", %" PRIx64 "), "
-		 "members [%u, %u), members segments [%" PRIx64 ", %" PRIx64 ")",
+		 "members [%" PRIu64 ", %" PRIu64 "), members segments [%" PRIx64 ", %" PRIx64 ")",
 		 oldestMulti, newOldestMulti,
 		 MultiXactIdToOffsetSegment(oldestMulti),
 		 MultiXactIdToOffsetSegment(newOldestMulti),
@@ -3290,7 +3076,7 @@ MultiXactIdPrecedesOrEquals(MultiXactId multi1, MultiXactId multi2)
 static bool
 MultiXactOffsetPrecedes(MultiXactOffset offset1, MultiXactOffset offset2)
 {
-	int32		diff = (int32) (offset1 - offset2);
+	int64		diff = (int64) (offset1 - offset2);
 
 	return (diff < 0);
 }
@@ -3387,7 +3173,7 @@ multixact_redo(XLogReaderState *record)
 
 		elog(DEBUG1, "replaying multixact truncation: "
 			 "offsets [%u, %u), offsets segments [%" PRIx64 ", %" PRIx64 "), "
-			 "members [%u, %u), members segments [%" PRIx64 ", %" PRIx64 ")",
+			 "members [%" PRIu64 ", %" PRIu64 "), members segments [%" PRIx64 ", %" PRIx64 ")",
 			 xlrec.startTruncOff, xlrec.endTruncOff,
 			 MultiXactIdToOffsetSegment(xlrec.startTruncOff),
 			 MultiXactIdToOffsetSegment(xlrec.endTruncOff),
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 3e3c4da01a2..3b2b0a522cb 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -885,7 +885,7 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 							 U64FromFullTransactionId(checkPoint.nextXid),
 							 checkPoint.nextOid)));
 	ereport(DEBUG1,
-			(errmsg_internal("next MultiXactId: %u; next MultiXactOffset: %u",
+			(errmsg_internal("next MultiXactId: %u; next MultiXactOffset: %" PRIu64,
 							 checkPoint.nextMulti, checkPoint.nextMultiOffset)));
 	ereport(DEBUG1,
 			(errmsg_internal("oldest unfrozen transaction ID: %u, in database %u",
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index ed03e3bd50d..259ef60bd31 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -1147,7 +1147,7 @@ vacuum_get_cutoffs(Relation rel, const VacuumParams params,
 	 * normally autovacuum_multixact_freeze_max_age, but may be less if we are
 	 * short of multixact member space.
 	 */
-	effective_multixact_freeze_max_age = MultiXactMemberFreezeThreshold();
+	effective_multixact_freeze_max_age = autovacuum_multixact_freeze_max_age;
 
 	/*
 	 * Almost ready to set freeze output parameters; check if OldestXmin or
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 5084af7dfb6..26385470c19 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -1151,7 +1151,7 @@ do_start_worker(void)
 
 	/* Also determine the oldest datminmxid we will consider. */
 	recentMulti = ReadNextMultiXactId();
-	multiForceLimit = recentMulti - MultiXactMemberFreezeThreshold();
+	multiForceLimit = recentMulti - autovacuum_multixact_freeze_max_age;
 	if (multiForceLimit < FirstMultiXactId)
 		multiForceLimit -= FirstMultiXactId;
 
@@ -1939,7 +1939,7 @@ do_autovacuum(void)
 	 * normally autovacuum_multixact_freeze_max_age, but may be less if we are
 	 * short of multixact member space.
 	 */
-	effective_multixact_freeze_max_age = MultiXactMemberFreezeThreshold();
+	effective_multixact_freeze_max_age = autovacuum_multixact_freeze_max_age;
 
 	/*
 	 * Find the pg_database entry and select the default freeze ages. We use
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index 10de058ce91..5295108ade3 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -264,7 +264,7 @@ main(int argc, char *argv[])
 		   ControlFile->checkPointCopy.nextOid);
 	printf(_("Latest checkpoint's NextMultiXactId:  %u\n"),
 		   ControlFile->checkPointCopy.nextMulti);
-	printf(_("Latest checkpoint's NextMultiOffset:  %u\n"),
+	printf(_("Latest checkpoint's NextMultiOffset:  %" PRIu64 "\n"),
 		   ControlFile->checkPointCopy.nextMultiOffset);
 	printf(_("Latest checkpoint's oldestXID:        %u\n"),
 		   ControlFile->checkPointCopy.oldestXid);
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index a89d72fc5cf..4e5eeced89d 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -267,7 +267,7 @@ main(int argc, char *argv[])
 
 			case 'O':
 				errno = 0;
-				set_mxoff = strtoul(optarg, &endptr, 0);
+				set_mxoff = strtou64(optarg, &endptr, 0);
 				if (endptr == optarg || *endptr != '\0' || errno != 0)
 				{
 					pg_log_error("invalid argument for option %s", "-O");
@@ -743,7 +743,7 @@ PrintControlValues(bool guessed)
 		   ControlFile.checkPointCopy.nextOid);
 	printf(_("Latest checkpoint's NextMultiXactId:  %u\n"),
 		   ControlFile.checkPointCopy.nextMulti);
-	printf(_("Latest checkpoint's NextMultiOffset:  %u\n"),
+	printf(_("Latest checkpoint's NextMultiOffset:  %" PRIu64 "\n"),
 		   ControlFile.checkPointCopy.nextMultiOffset);
 	printf(_("Latest checkpoint's oldestXID:        %u\n"),
 		   ControlFile.checkPointCopy.oldestXid);
@@ -817,7 +817,7 @@ PrintNewControlValues(void)
 
 	if (set_mxoff != -1)
 	{
-		printf(_("NextMultiOffset:                      %u\n"),
+		printf(_("NextMultiOffset:                      %" PRIu64 "\n"),
 			   ControlFile.checkPointCopy.nextMultiOffset);
 	}
 
diff --git a/src/bin/pg_resetwal/t/001_basic.pl b/src/bin/pg_resetwal/t/001_basic.pl
index d6bbbd0ceda..cc89e0764ae 100644
--- a/src/bin/pg_resetwal/t/001_basic.pl
+++ b/src/bin/pg_resetwal/t/001_basic.pl
@@ -213,7 +213,7 @@ push @cmd,
   sprintf("%d,%d", hex($files[0]) == 0 ? 3 : hex($files[0]), hex($files[-1]));
 
 @files = get_slru_files('pg_multixact/offsets');
-$mult = 32 * $blcksz / 4;
+$mult = 32 * $blcksz / 8;
 # --multixact-ids argument is "new,old"
 push @cmd,
   '--multixact-ids' => sprintf("%d,%d",
diff --git a/src/include/access/multixact.h b/src/include/access/multixact.h
index 82e4bb90dd5..16a07723088 100644
--- a/src/include/access/multixact.h
+++ b/src/include/access/multixact.h
@@ -28,7 +28,7 @@
 
 #define MultiXactIdIsValid(multi) ((multi) != InvalidMultiXactId)
 
-#define MaxMultiXactOffset	((MultiXactOffset) 0xFFFFFFFF)
+#define MaxMultiXactOffset	UINT64CONST(0xFFFFFFFFFFFFFFFF)
 
 /*
  * Possible multixact lock modes ("status").  The first four modes are for
@@ -147,7 +147,6 @@ extern void MultiXactSetNextMXact(MultiXactId nextMulti,
 extern void MultiXactAdvanceNextMXact(MultiXactId minMulti,
 									  MultiXactOffset minMultiOffset);
 extern void MultiXactAdvanceOldest(MultiXactId oldestMulti, Oid oldestMultiDB);
-extern int	MultiXactMemberFreezeThreshold(void);
 
 extern void multixact_twophase_recover(FullTransactionId fxid, uint16 info,
 									   void *recdata, uint32 len);
diff --git a/src/include/c.h b/src/include/c.h
index 757dfff4782..bc92a6f4565 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -670,7 +670,7 @@ typedef uint32 SubTransactionId;
 /* MultiXactId must be equivalent to TransactionId, to fit in t_xmax */
 typedef TransactionId MultiXactId;
 
-typedef uint32 MultiXactOffset;
+typedef uint64 MultiXactOffset;
 
 typedef uint32 CommandId;
 
-- 
2.43.0
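
A note on the patch above: two of its changes are easiest to see with concrete numbers. The sketch below is not part of the patch. It reproduces the signed-difference comparison from the patched MultiXactOffsetPrecedes() and the arithmetic behind the "$mult = 32 * $blcksz / 8" change in 001_basic.pl. The names offset_precedes and multixacts_per_segment are hypothetical, and the 8192-byte block size in the usage note is an assumption (the default BLCKSZ).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t MultiXactOffset;	/* mirrors the patched typedef in c.h */

#define SLRU_PAGES_PER_SEGMENT 32	/* the "32" in the 001_basic.pl test */

/* Signed-difference test, as in the patched MultiXactOffsetPrecedes(). */
static bool
offset_precedes(MultiXactOffset a, MultiXactOffset b)
{
	int64_t		diff = (int64_t) (a - b);

	return diff < 0;
}

/* Hypothetical helper: multixacts covered by one offsets segment. */
static uint64_t
multixacts_per_segment(uint64_t blcksz, uint64_t entry_size)
{
	assert(entry_size > 0);
	return SLRU_PAGES_PER_SEGMENT * blcksz / entry_size;
}
```

With 8192-byte blocks, multixacts_per_segment(8192, 4) is 65536 and multixacts_per_segment(8192, 8) is 32768, so each offsets segment covers half as many multixacts once an entry is 8 bytes wide. offset_precedes(UINT64_MAX, 3) is true, because the unsigned difference reinterpreted as int64 is -4.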

Attachment: v20-0004-TEST-bump-catversion.patch.txt (text/plain; charset=US-ASCII)
From 9ae48f1a66fcb2f1651a168c481214f1f9b6e201 Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Fri, 24 Oct 2025 11:47:50 +0300
Subject: [PATCH v20 4/5] TEST: bump catversion

To avoid constant CF-bot complaints, do the catversion bump in a separate
commit.

NOTE: keep it in sync with MULTIXACTOFFSET_FORMATCHANGE_CAT_VER
---
 src/include/catalog/catversion.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h
index 18e95179ab6..6a13fa3cdb0 100644
--- a/src/include/catalog/catversion.h
+++ b/src/include/catalog/catversion.h
@@ -57,6 +57,6 @@
  */
 
 /*							yyyymmddN */
-#define CATALOG_VERSION_NO	202510281
+#define CATALOG_VERSION_NO	999999999
 
 #endif
-- 
2.43.0
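
A note on the pg_upgrade patch that follows: it copies the member-page layout macros from multixact.c (four flag bytes followed by four TransactionIds per 20-byte group). The sketch below is not part of the patch. It redoes that arithmetic with shortened, hypothetical names and assumes an 8192-byte BLCKSZ and 4-byte TransactionIds, which may help when checking where a given member offset lands on a page.

```c
#include <assert.h>
#include <stdint.h>

#define BLCKSZ				8192	/* assumed default block size */
#define FLAGBYTES_PER_GROUP	4
#define MEMBERS_PER_GROUP	4
/* four 4-byte xids plus four flag bytes */
#define GROUP_SIZE \
	(sizeof(uint32_t) * MEMBERS_PER_GROUP + FLAGBYTES_PER_GROUP)
#define GROUPS_PER_PAGE		(BLCKSZ / GROUP_SIZE)
#define MEMBERS_PER_PAGE	(GROUPS_PER_PAGE * MEMBERS_PER_GROUP)

static_assert(GROUP_SIZE == 20, "a group is five 4-byte words");
static_assert(GROUPS_PER_PAGE == 409, "409 groups fit in an 8kB page");

/* Page that holds the member at this offset. */
static int64_t
member_page(uint64_t offset)
{
	return (int64_t) (offset / MEMBERS_PER_PAGE);
}

/* Byte position within the page of the group's flag word. */
static int
flags_offset(uint64_t offset)
{
	uint64_t	group = offset / MEMBERS_PER_GROUP;

	return (int) ((group % GROUPS_PER_PAGE) * GROUP_SIZE);
}

/* Bit shift of this member's flag byte inside the flag word. */
static int
flags_bitshift(uint64_t offset)
{
	return (int) (offset % MEMBERS_PER_GROUP) * 8;
}

/* Byte position within the page of this member's TransactionId. */
static int
member_offset(uint64_t offset)
{
	return flags_offset(offset) + FLAGBYTES_PER_GROUP +
		(int) (offset % MEMBERS_PER_GROUP) * (int) sizeof(uint32_t);
}
```

Under these assumptions a page holds 1636 members. Member offset 1635 is the last one on page 0: its xid sits at byte 8176 and ends at byte 8180, leaving the 12 unused bytes that the copied comment mentions. Offset 1636 starts page 1 at byte 4.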

Attachment: v20-0002-Add-pg_upgarde-for-64-bit-multixact-offsets.patch (application/octet-stream)
From 4076702a39d5ab0c0cd5af860fb47d2b0742c7ee Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Fri, 24 Oct 2025 10:58:37 +0300
Subject: [PATCH v20 2/5] Add pg_upgarde for 64 bit multixact offsets

Author: Maxim Orlov <orlovmg@gmail.com>
Author: Heikki Linnakangas <hlinnaka@iki.fi>
---
 src/backend/access/transam/multixact.c |  35 +--
 src/bin/pg_upgrade/Makefile            |   3 +
 src/bin/pg_upgrade/meson.build         |   3 +
 src/bin/pg_upgrade/multixact_new.c     | 227 +++++++++++++++++++
 src/bin/pg_upgrade/multixact_new.h     |  31 +++
 src/bin/pg_upgrade/multixact_old.c     | 296 +++++++++++++++++++++++++
 src/bin/pg_upgrade/multixact_old.h     |  31 +++
 src/bin/pg_upgrade/pg_upgrade.c        | 108 ++++++++-
 src/bin/pg_upgrade/pg_upgrade.h        |   5 +
 src/bin/pg_upgrade/slru_io.c           | 240 ++++++++++++++++++++
 src/bin/pg_upgrade/slru_io.h           |  30 +++
 11 files changed, 977 insertions(+), 32 deletions(-)
 create mode 100644 src/bin/pg_upgrade/multixact_new.c
 create mode 100644 src/bin/pg_upgrade/multixact_new.h
 create mode 100644 src/bin/pg_upgrade/multixact_old.c
 create mode 100644 src/bin/pg_upgrade/multixact_old.h
 create mode 100644 src/bin/pg_upgrade/slru_io.c
 create mode 100644 src/bin/pg_upgrade/slru_io.h

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 93a1e4cfd2a..5a13596cd86 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1231,7 +1231,6 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
 	int			slotno;
 	MultiXactOffset offset;
 	int			length;
-	int			truelength;
 	MultiXactId oldestMXact;
 	MultiXactId nextMXact;
 	MultiXactId tmpMXact;
@@ -1330,15 +1329,6 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
 	 * we have just for this; the process in charge will signal the CV as soon
 	 * as it has finished writing the multixact offset.
 	 *
-	 * 3. Because GetNewMultiXactId increments offset zero to offset one to
-	 * handle case #2, there is an ambiguity near the point of offset
-	 * wraparound.  If we see next multixact's offset is one, is that our
-	 * multixact's actual endpoint, or did it end at zero with a subsequent
-	 * increment?  We handle this using the knowledge that if the zero'th
-	 * member slot wasn't filled, it'll contain zero, and zero isn't a valid
-	 * transaction ID so it can't be a multixact member.  Therefore, if we
-	 * read a zero from the members array, just ignore it.
-	 *
 	 * This is all pretty messy, but the mess occurs only in infrequent corner
 	 * cases, so it seems better than holding the MultiXactGenLock for a long
 	 * time on every multixact creation.
@@ -1422,6 +1412,9 @@ retry:
 	LWLockRelease(lock);
 	lock = NULL;
 
+	/* A multixid with zero members should not happen */
+	Assert(length > 0);
+
 	/*
 	 * If we slept above, clean up state; it's no longer needed.
 	 */
@@ -1430,7 +1423,6 @@ retry:
 
 	ptr = (MultiXactMember *) palloc(length * sizeof(MultiXactMember));
 
-	truelength = 0;
 	prev_pageno = -1;
 	for (int i = 0; i < length; i++, offset++)
 	{
@@ -1468,36 +1460,27 @@ retry:
 		xactptr = (TransactionId *)
 			(MultiXactMemberCtl->shared->page_buffer[slotno] + memberoff);
 
-		if (!TransactionIdIsValid(*xactptr))
-		{
-			/* Corner case 3: we must be looking at unused slot zero */
-			Assert(offset == 0);
-			continue;
-		}
+		Assert(TransactionIdIsValid(*xactptr));
 
 		flagsoff = MXOffsetToFlagsOffset(offset);
 		bshift = MXOffsetToFlagsBitShift(offset);
 		flagsptr = (uint32 *) (MultiXactMemberCtl->shared->page_buffer[slotno] + flagsoff);
 
-		ptr[truelength].xid = *xactptr;
-		ptr[truelength].status = (*flagsptr >> bshift) & MXACT_MEMBER_XACT_BITMASK;
-		truelength++;
+		ptr[i].xid = *xactptr;
+		ptr[i].status = (*flagsptr >> bshift) & MXACT_MEMBER_XACT_BITMASK;
 	}
 
 	LWLockRelease(lock);
 
-	/* A multixid with zero members should not happen */
-	Assert(truelength > 0);
-
 	/*
 	 * Copy the result into the local cache.
 	 */
-	mXactCachePut(multi, truelength, ptr);
+	mXactCachePut(multi, length, ptr);
 
 	debug_elog3(DEBUG2, "GetMembers: no cache for %s",
-				mxid_to_string(multi, truelength, ptr));
+				mxid_to_string(multi, length, ptr));
 	*members = ptr;
-	return truelength;
+	return length;
 }
 
 /*
diff --git a/src/bin/pg_upgrade/Makefile b/src/bin/pg_upgrade/Makefile
index 69fcf593cae..42995d53b0b 100644
--- a/src/bin/pg_upgrade/Makefile
+++ b/src/bin/pg_upgrade/Makefile
@@ -18,11 +18,14 @@ OBJS = \
 	file.o \
 	function.o \
 	info.o \
+	multixact_new.o \
+	multixact_old.o \
 	option.o \
 	parallel.o \
 	pg_upgrade.o \
 	relfilenumber.o \
 	server.o \
+	slru_io.o \
 	tablespace.o \
 	task.o \
 	util.o \
diff --git a/src/bin/pg_upgrade/meson.build b/src/bin/pg_upgrade/meson.build
index ac992f0d14b..3e46c4512cf 100644
--- a/src/bin/pg_upgrade/meson.build
+++ b/src/bin/pg_upgrade/meson.build
@@ -8,11 +8,14 @@ pg_upgrade_sources = files(
   'file.c',
   'function.c',
   'info.c',
+  'multixact_new.c',
+  'multixact_old.c',
   'option.c',
   'parallel.c',
   'pg_upgrade.c',
   'relfilenumber.c',
   'server.c',
+  'slru_io.c',
   'tablespace.c',
   'task.c',
   'util.c',
diff --git a/src/bin/pg_upgrade/multixact_new.c b/src/bin/pg_upgrade/multixact_new.c
new file mode 100644
index 00000000000..d43442fb9a7
--- /dev/null
+++ b/src/bin/pg_upgrade/multixact_new.c
@@ -0,0 +1,227 @@
+/*
+ * multixact_new.c
+ *
+ * Rewrite pre-v19 multixacts to new format with 64-bit MultiXactOffsets
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ * src/bin/pg_upgrade/multixact_new.c
+ */
+
+#include "multixact_new.h"
+
+/*
+ * NOTE: Below are a bunch of definitions and simple inline functions that are
+ * copy-pasted from multixact.c
+ */
+
+/* We need eight bytes per offset */
+#define MULTIXACT_OFFSETS_PER_PAGE (BLCKSZ / sizeof(MultiXactOffset))
+
+static inline int64
+MultiXactIdToOffsetPage(MultiXactId multi)
+{
+	return multi / MULTIXACT_OFFSETS_PER_PAGE;
+}
+
+static inline int
+MultiXactIdToOffsetEntry(MultiXactId multi)
+{
+	return multi % MULTIXACT_OFFSETS_PER_PAGE;
+}
+
+/* We need eight bits per xact, so one xact fits in a byte */
+#define MXACT_MEMBER_BITS_PER_XACT			8
+#define MXACT_MEMBER_FLAGS_PER_BYTE			1
+#define MXACT_MEMBER_XACT_BITMASK	((1 << MXACT_MEMBER_BITS_PER_XACT) - 1)
+
+/* how many full bytes of flags are there in a group? */
+#define MULTIXACT_FLAGBYTES_PER_GROUP		4
+#define MULTIXACT_MEMBERS_PER_MEMBERGROUP	\
+	(MULTIXACT_FLAGBYTES_PER_GROUP * MXACT_MEMBER_FLAGS_PER_BYTE)
+/* size in bytes of a complete group */
+#define MULTIXACT_MEMBERGROUP_SIZE \
+	(sizeof(TransactionId) * MULTIXACT_MEMBERS_PER_MEMBERGROUP + MULTIXACT_FLAGBYTES_PER_GROUP)
+#define MULTIXACT_MEMBERGROUPS_PER_PAGE (BLCKSZ / MULTIXACT_MEMBERGROUP_SIZE)
+#define MULTIXACT_MEMBERS_PER_PAGE	\
+	(MULTIXACT_MEMBERGROUPS_PER_PAGE * MULTIXACT_MEMBERS_PER_MEMBERGROUP)
+
+/*
+ * Because the number of items per page is not a divisor of the last item
+ * number (member 0xFFFFFFFF), the last segment does not use the maximum number
+ * of pages, and moreover the last used page therein does not use the same
+ * number of items as previous pages.  (Another way to say it is that the
+ * 0xFFFFFFFF member is somewhere in the middle of the last page, so the page
+ * has some empty space after that item.)
+ *
+ * This constant is the number of members in the last page of the last segment.
+ */
+#define MAX_MEMBERS_IN_LAST_MEMBERS_PAGE \
+		((uint32) ((0xFFFFFFFF % MULTIXACT_MEMBERS_PER_PAGE) + 1))
+
+/* page in which a member is to be found */
+static inline int64
+MXOffsetToMemberPage(MultiXactOffset offset)
+{
+	return offset / MULTIXACT_MEMBERS_PER_PAGE;
+}
+
+/* Location (byte offset within page) of flag word for a given member */
+static inline int
+MXOffsetToFlagsOffset(MultiXactOffset offset)
+{
+	MultiXactOffset group = offset / MULTIXACT_MEMBERS_PER_MEMBERGROUP;
+	int			grouponpg = group % MULTIXACT_MEMBERGROUPS_PER_PAGE;
+	int			byteoff = grouponpg * MULTIXACT_MEMBERGROUP_SIZE;
+
+	return byteoff;
+}
+
+static inline int
+MXOffsetToFlagsBitShift(MultiXactOffset offset)
+{
+	int			member_in_group = offset % MULTIXACT_MEMBERS_PER_MEMBERGROUP;
+	int			bshift = member_in_group * MXACT_MEMBER_BITS_PER_XACT;
+
+	return bshift;
+}
+
+/* Location (byte offset within page) of TransactionId of given member */
+static inline int
+MXOffsetToMemberOffset(MultiXactOffset offset)
+{
+	int			member_in_group = offset % MULTIXACT_MEMBERS_PER_MEMBERGROUP;
+
+	return MXOffsetToFlagsOffset(offset) +
+		MULTIXACT_FLAGBYTES_PER_GROUP +
+		member_in_group * sizeof(TransactionId);
+}
+
+static inline void
+MXOffsetWrite(char *buf, int entryno, MultiXactOffset offset)
+{
+	MultiXactOffset *offptr = (MultiXactOffset *) buf;
+
+	offptr[entryno] = offset;
+}
+
+MultiXactWriter *
+AllocMultiXactWrite(char *pgdata, MultiXactId firstMulti,
+					MultiXactOffset firstOffset)
+{
+	MultiXactWriter    *state = pg_malloc(sizeof(*state));
+	char				dir[MAXPGPATH] = {0};
+
+	state->nextMXact = firstMulti;
+	state->nextOffset = firstOffset;
+
+	pg_sprintf(dir, "%s/pg_multixact/offsets", pgdata);
+	state->offset = AllocSlruWrite(dir, false);
+
+	pg_sprintf(dir, "%s/pg_multixact/members", pgdata);
+	state->members = AllocSlruWrite(dir, true /* use long segment names */);
+
+	return state;
+}
+
+/*
+ * Simplified copy of the corresponding server function
+ */
+MultiXactId
+GetNewMultiXactId(MultiXactWriter *state, int nmembers, MultiXactOffset *offset)
+{
+	MultiXactId		result;
+
+	/* Handle wraparound of the nextMXact counter */
+	if (state->nextMXact < FirstMultiXactId)
+		state->nextMXact = FirstMultiXactId;
+
+	/* Assign the MXID */
+	result = state->nextMXact;
+
+	/* Reserve the members space, similarly to above. */
+	*offset = state->nextOffset;
+
+	/*
+	 * Advance counters.  As in GetNewTransactionId(), this must not happen
+	 * until after file extension has succeeded!
+	 *
+	 * We don't care about MultiXactId wraparound here; it will be handled by
+	 * the next iteration.  But note that nextMXact may be InvalidMultiXactId
+	 * or the first value on a segment-beginning page after this routine
+	 * exits, so anyone else looking at the variable must be prepared to deal
+	 * with either case.  Similarly, nextOffset may be zero, but we won't use
+	 * that as the actual start offset of the next multixact.
+	 */
+	(state->nextMXact)++;
+
+	state->nextOffset += nmembers;
+
+	return result;
+}
+
+/*
+ * Write a new multixact with members.
+ *
+ * Simplified version of the corresponding server function, hence the name.
+ */
+void
+RecordNewMultiXact(MultiXactWriter *state, MultiXactOffset offset,
+				   MultiXactId multi, int nmembers, MultiXactMember *members)
+{
+	int64		pageno;
+	int64		prev_pageno;
+	int			entryno,
+				i;
+	char	   *buf;
+
+	pageno = MultiXactIdToOffsetPage(multi);
+	entryno = MultiXactIdToOffsetEntry(multi);
+
+	buf = SlruWriteSwitchPage(state->offset, pageno);
+	MXOffsetWrite(buf, entryno, offset);
+
+	prev_pageno = -1;
+
+	for (i = 0; i < nmembers; i++, offset++)
+	{
+		TransactionId *memberptr;
+		uint32	   *flagsptr;
+		uint32		flagsval;
+		int			bshift;
+		int			flagsoff;
+		int			memberoff;
+
+		Assert(members[i].status <= MultiXactStatusUpdate);
+
+		pageno = MXOffsetToMemberPage(offset);
+		memberoff = MXOffsetToMemberOffset(offset);
+		flagsoff = MXOffsetToFlagsOffset(offset);
+		bshift = MXOffsetToFlagsBitShift(offset);
+
+		if (pageno != prev_pageno)
+		{
+			buf = SlruWriteSwitchPage(state->members, pageno);
+			prev_pageno = pageno;
+		}
+
+		memberptr = (TransactionId *) (buf + memberoff);
+
+		*memberptr = members[i].xid;
+
+		flagsptr = (uint32 *) (buf + flagsoff);
+
+		flagsval = *flagsptr;
+		flagsval &= ~(((1 << MXACT_MEMBER_BITS_PER_XACT) - 1) << bshift);
+		flagsval |= (members[i].status << bshift);
+		*flagsptr = flagsval;
+	}
+}
+
+void
+FreeMultiXactWrite(MultiXactWriter *state)
+{
+	FreeSlruWrite(state->offset);
+	FreeSlruWrite(state->members);
+
+	pfree(state);
+}
diff --git a/src/bin/pg_upgrade/multixact_new.h b/src/bin/pg_upgrade/multixact_new.h
new file mode 100644
index 00000000000..33d5d1b8222
--- /dev/null
+++ b/src/bin/pg_upgrade/multixact_new.h
@@ -0,0 +1,31 @@
+/*
+ * multixact_new.h
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ * src/bin/pg_upgrade/multixact_new.h
+ */
+
+#include "postgres_fe.h"
+
+#include "access/multixact.h"
+
+#include "slru_io.h"
+
+typedef struct MultiXactWriter
+{
+	MultiXactId			nextMXact;
+	MultiXactOffset		nextOffset;
+
+	SlruSegState	   *offset;
+	SlruSegState	   *members;
+} MultiXactWriter;
+
+extern MultiXactWriter *AllocMultiXactWrite(char *pgdata,
+											MultiXactId firstMulti,
+											MultiXactOffset firstOffset);
+extern MultiXactId GetNewMultiXactId(MultiXactWriter *state, int nmembers,
+									 MultiXactOffset *offset);
+extern void RecordNewMultiXact(MultiXactWriter *state, MultiXactOffset offset,
+							   MultiXactId multi, int nmembers,
+							   MultiXactMember *members);
+extern void FreeMultiXactWrite(MultiXactWriter *writer);
diff --git a/src/bin/pg_upgrade/multixact_old.c b/src/bin/pg_upgrade/multixact_old.c
new file mode 100644
index 00000000000..6cc384d2cf2
--- /dev/null
+++ b/src/bin/pg_upgrade/multixact_old.c
@@ -0,0 +1,296 @@
+/*
+ * multixact_old.c
+ *
+ * Rewrite pre-v19 multixacts to new format with 64-bit MultiXactOffsets
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ * src/bin/pg_upgrade/multixact_old.c
+ */
+
+#include "multixact_old.h"
+
+#include "pg_upgrade.h"
+
+/*
+ * NOTE: below are a bunch of definitions and simple static inline functions
+ * that are copy-pasted from multixact.c from version 18.  The only difference
+ * is that we use the OldMultiXactOffset type equal to uint32 instead of
+ * MultiXactOffset which became uint64.
+ */
+
+/* We need four bytes per offset */
+#define MULTIXACT_OFFSETS_PER_PAGE (BLCKSZ / sizeof(OldMultiXactOffset))
+
+static inline int64
+MultiXactIdToOffsetPage(MultiXactId multi)
+{
+	return multi / MULTIXACT_OFFSETS_PER_PAGE;
+}
+
+static inline int
+MultiXactIdToOffsetEntry(MultiXactId multi)
+{
+	return multi % MULTIXACT_OFFSETS_PER_PAGE;
+}
+
+/*
+ * The situation for members is a bit more complex: we store one byte of
+ * additional flag bits for each TransactionId.  To do this without getting
+ * into alignment issues, we store four bytes of flags, and then the
+ * corresponding 4 Xids.  Each such 5-word (20-byte) set we call a "group", and
+ * are stored as a whole in pages.  Thus, with 8kB BLCKSZ, we keep 409 groups
+ * per page.  This wastes 12 bytes per page, but that's OK -- simplicity (and
+ * performance) trumps space efficiency here.
+ *
+ * Note that the "offset" macros work with byte offset, not array indexes, so
+ * arithmetic must be done using "char *" pointers.
+ */
+/* We need eight bits per xact, so one xact fits in a byte */
+#define MXACT_MEMBER_BITS_PER_XACT			8
+#define MXACT_MEMBER_FLAGS_PER_BYTE			1
+#define MXACT_MEMBER_XACT_BITMASK	((1 << MXACT_MEMBER_BITS_PER_XACT) - 1)
+
+/* how many full bytes of flags are there in a group? */
+#define MULTIXACT_FLAGBYTES_PER_GROUP		4
+#define MULTIXACT_MEMBERS_PER_MEMBERGROUP	\
+	(MULTIXACT_FLAGBYTES_PER_GROUP * MXACT_MEMBER_FLAGS_PER_BYTE)
+/* size in bytes of a complete group */
+#define MULTIXACT_MEMBERGROUP_SIZE \
+	(sizeof(TransactionId) * MULTIXACT_MEMBERS_PER_MEMBERGROUP + MULTIXACT_FLAGBYTES_PER_GROUP)
+#define MULTIXACT_MEMBERGROUPS_PER_PAGE (BLCKSZ / MULTIXACT_MEMBERGROUP_SIZE)
+#define MULTIXACT_MEMBERS_PER_PAGE	\
+	(MULTIXACT_MEMBERGROUPS_PER_PAGE * MULTIXACT_MEMBERS_PER_MEMBERGROUP)
+
+/* page in which a member is to be found */
+static inline int64
+MXOffsetToMemberPage(OldMultiXactOffset offset)
+{
+	return offset / MULTIXACT_MEMBERS_PER_PAGE;
+}
+
+/* Location (byte offset within page) of flag word for a given member */
+static inline int
+MXOffsetToFlagsOffset(OldMultiXactOffset offset)
+{
+	OldMultiXactOffset group = offset / MULTIXACT_MEMBERS_PER_MEMBERGROUP;
+	int			grouponpg = group % MULTIXACT_MEMBERGROUPS_PER_PAGE;
+	int			byteoff = grouponpg * MULTIXACT_MEMBERGROUP_SIZE;
+
+	return byteoff;
+}
+
+/* Location (byte offset within page) of TransactionId of given member */
+static inline int
+MXOffsetToMemberOffset(OldMultiXactOffset offset)
+{
+	int			member_in_group = offset % MULTIXACT_MEMBERS_PER_MEMBERGROUP;
+
+	return MXOffsetToFlagsOffset(offset) +
+		MULTIXACT_FLAGBYTES_PER_GROUP +
+		member_in_group * sizeof(TransactionId);
+}
+
+static inline int
+MXOffsetToFlagsBitShift(OldMultiXactOffset offset)
+{
+	int			member_in_group = offset % MULTIXACT_MEMBERS_PER_MEMBERGROUP;
+	int			bshift = member_in_group * MXACT_MEMBER_BITS_PER_XACT;
+
+	return bshift;
+}
+
+/*
+ * Construct reader of old multixacts.
+ *
+ * Returns the malloced memory used by all the other calls in this module.
+ */
+OldMultiXactReader *
+AllocOldMultiXactRead(char *pgdata, MultiXactId nextMulti,
+					  OldMultiXactOffset nextOffset)
+{
+	OldMultiXactReader *state = pg_malloc(sizeof(*state));
+	char				dir[MAXPGPATH] = {0};
+
+	state->nextMXact = nextMulti;
+	state->nextOffset = nextOffset;
+
+	pg_sprintf(dir, "%s/pg_multixact/offsets", pgdata);
+	state->offset = AllocSlruRead(dir);
+
+	pg_sprintf(dir, "%s/pg_multixact/members", pgdata);
+	state->members = AllocSlruRead(dir);
+
+	return state;
+}
+
+/*
+ * This is a simplified version of the GetMultiXactIdMembers() server function.
+ *
+ * - Only return the updating member, if any. Upgrade only cares about the
+ *   updaters. If there is no updating member, return the first locking-only
+ *   member. We don't have any way to represent "no members", but we also don't
+ *   need to preserve all the locking members.
+ *
+ * - We don't need to worry about locking and some corner cases because there's
+ *   no concurrent activity.
+ */
+void
+GetOldMultiXactIdSingleMember(OldMultiXactReader *state, MultiXactId multi,
+							  TransactionId *result, MultiXactStatus *status)
+{
+	MultiXactId		nextMXact,
+					tmpMXact;
+	OldMultiXactOffset nextOffset;
+	int64			pageno,
+					prev_pageno;
+	int				entryno,
+					length;
+	char		   *buf;
+	OldMultiXactOffset *offptr,
+						offset;
+	TransactionId	result_xid = InvalidTransactionId;
+	bool			result_isupdate = false;
+
+	nextMXact = state->nextMXact;
+	nextOffset = state->nextOffset;
+
+	/*
+	 * See GetMultiXactIdMembers in multixact.c
+	 *
+	 * Find out the offset at which we need to start reading MultiXactMembers
+	 * and the number of members in the multixact.  We determine the latter as
+	 * the difference between this multixact's starting offset and the next
+	 * one's.  However, there are some corner cases to worry about:
+	 *
+	 * 1. This multixact may be the latest one created, in which case there is
+	 * no next one to look at.  In this case the nextOffset value we just
+	 * saved is the correct endpoint.
+	 *
+	 * 2. The next multixact may still be in process of being filled in...
+	 * This cannot happen during upgrade.
+	 *
+	 * 3. Because GetNewMultiXactId increments offset zero to offset one to
+	 * handle case #2, there is an ambiguity near the point of offset
+	 * wraparound.  If we see next multixact's offset is one, is that our
+	 * multixact's actual endpoint, or did it end at zero with a subsequent
+	 * increment?  We handle this using the knowledge that if the zero'th
+	 * member slot wasn't filled, it'll contain zero, and zero isn't a valid
+	 * transaction ID so it can't be a multixact member.  Therefore, if we
+	 * read a zero from the members array, just ignore it.
+	 */
+
+	pageno = MultiXactIdToOffsetPage(multi);
+	entryno = MultiXactIdToOffsetEntry(multi);
+
+	buf = SlruReadSwitchPage(state->offset, pageno);
+	offptr = (OldMultiXactOffset *) buf;
+	offptr += entryno;
+	offset = *offptr;
+
+	Assert(offset != 0);
+
+	/*
+	 * Use the same increment rule as GetNewMultiXactId(), that is, don't
+	 * handle wraparound explicitly until needed.
+	 */
+	tmpMXact = multi + 1;
+
+	if (nextMXact == tmpMXact)
+	{
+		/* Corner case 1: there is no next multixact */
+		length = nextOffset - offset;
+	}
+	else
+	{
+		OldMultiXactOffset nextMXOffset;
+
+		/* handle wraparound if needed */
+		if (tmpMXact < FirstMultiXactId)
+			tmpMXact = FirstMultiXactId;
+
+		prev_pageno = pageno;
+
+		pageno = MultiXactIdToOffsetPage(tmpMXact);
+		entryno = MultiXactIdToOffsetEntry(tmpMXact);
+
+		if (pageno != prev_pageno)
+			buf = SlruReadSwitchPage(state->offset, pageno);
+
+		offptr = (OldMultiXactOffset *) buf;
+		offptr += entryno;
+		nextMXOffset = *offptr;
+
+		/*
+		 * Corner case 2: next multixact is still being filled in, this must
+		 * not happen during upgrade.
+		 */
+		Assert(nextMXOffset != 0);
+
+		length = nextMXOffset - offset;
+	}
+
+	prev_pageno = -1;
+	for (int i = 0; i < length; i++, offset++)
+	{
+		TransactionId *xactptr;
+		uint32	   *flagsptr;
+		int			flagsoff;
+		int			bshift;
+		int			memberoff;
+		MultiXactStatus st;
+
+		pageno = MXOffsetToMemberPage(offset);
+		memberoff = MXOffsetToMemberOffset(offset);
+
+		if (pageno != prev_pageno)
+		{
+			buf = SlruReadSwitchPage(state->members, pageno);
+			prev_pageno = pageno;
+		}
+
+		xactptr = (TransactionId *) (buf + memberoff);
+		if (!TransactionIdIsValid(*xactptr))
+		{
+			/* Corner case 3: we must be looking at unused slot zero */
+			Assert(offset == 0);
+			continue;
+		}
+
+		flagsoff = MXOffsetToFlagsOffset(offset);
+		bshift = MXOffsetToFlagsBitShift(offset);
+		flagsptr = (uint32 *) (buf + flagsoff);
+
+		st = (*flagsptr >> bshift) & MXACT_MEMBER_XACT_BITMASK;
+
+		/* Verify that there is a single update Xid among the given members. */
+		if (ISUPDATE_from_mxstatus(st))
+		{
+			if (result_isupdate)
+				pg_fatal("multixact %u has more than one updating member",
+						 multi);
+			result_xid = *xactptr;
+			result_isupdate = true;
+		}
+		else if (!TransactionIdIsValid(result_xid))
+			result_xid = *xactptr;
+	}
+
+	/* A multixid with zero members should not happen */
+	Assert(TransactionIdIsValid(result_xid));
+
+	*result = result_xid;
+	*status = result_isupdate ? MultiXactStatusUpdate :
+								MultiXactStatusForKeyShare;
+}
+
+/*
+ * Frees the malloced reader.
+ */
+void
+FreeOldMultiXactReader(OldMultiXactReader *state)
+{
+	FreeSlruRead(state->offset);
+	FreeSlruRead(state->members);
+
+	pfree(state);
+}
diff --git a/src/bin/pg_upgrade/multixact_old.h b/src/bin/pg_upgrade/multixact_old.h
new file mode 100644
index 00000000000..8d4659ba6a0
--- /dev/null
+++ b/src/bin/pg_upgrade/multixact_old.h
@@ -0,0 +1,31 @@
+/*
+ * multixact_old.h
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ * src/bin/pg_upgrade/multixact_old.h
+ */
+
+#include "postgres_fe.h"
+
+#include "access/multixact.h"
+#include "slru_io.h"
+
+typedef uint32 OldMultiXactOffset;
+
+typedef struct OldMultiXactReader
+{
+	MultiXactId			nextMXact;
+	OldMultiXactOffset	nextOffset;
+
+	SlruSegState	   *offset;
+	SlruSegState	   *members;
+} OldMultiXactReader;
+
+extern OldMultiXactReader *AllocOldMultiXactRead(char *pgdata,
+												 MultiXactId nextMulti,
+												 OldMultiXactOffset nextOffset);
+extern void GetOldMultiXactIdSingleMember(OldMultiXactReader *state,
+										  MultiXactId multi,
+										  TransactionId *result,
+										  MultiXactStatus *status);
+extern void FreeOldMultiXactReader(OldMultiXactReader *reader);
diff --git a/src/bin/pg_upgrade/pg_upgrade.c b/src/bin/pg_upgrade/pg_upgrade.c
index 490e98fa26f..5432c03a2b0 100644
--- a/src/bin/pg_upgrade/pg_upgrade.c
+++ b/src/bin/pg_upgrade/pg_upgrade.c
@@ -49,6 +49,8 @@
 #include "common/restricted_token.h"
 #include "fe_utils/string_utils.h"
 #include "pg_upgrade.h"
+#include "multixact_old.h"
+#include "multixact_new.h"
 
 /*
  * Maximum number of pg_restore actions (TOC entries) to process within one
@@ -769,6 +771,82 @@ copy_subdir_files(const char *old_subdir, const char *new_subdir)
 	check_ok();
 }
 
+/*
+ * Convert pg_multixact/offsets and /members to new format with 64-bit offsets.
+ */
+static void
+convert_multixacts(MultiXactId *new_nxtmulti, MultiXactOffset *new_nxtmxoff)
+{
+	MultiXactId			oldest_multi,
+						next_multi;
+	OldMultiXactReader *old_reader;
+	MultiXactWriter	   *new_writer;
+
+	old_reader = AllocOldMultiXactRead(old_cluster.pgdata,
+									   old_cluster.controldata.chkpnt_nxtmulti,
+									   old_cluster.controldata.chkpnt_nxtmxoff);
+	new_writer = AllocMultiXactWrite(new_cluster.pgdata,
+									 old_cluster.controldata.chkpnt_oldstMulti,
+									 1 /* see below */);
+
+	oldest_multi = old_cluster.controldata.chkpnt_oldstMulti;
+	next_multi = old_cluster.controldata.chkpnt_nxtmulti;
+
+	/* handle wraparound */
+	if (next_multi < FirstMultiXactId)
+		next_multi = FirstMultiXactId;
+
+	/*
+	 * Read multixids from old files one by one, and write them back in the new
+	 * format.
+	 *
+	 * The locking-only XIDs that may be part of multixids don't matter after
+	 * upgrade, as there can be no transactions running across upgrade.  So as
+	 * a little optimization, we only read one member from each multixid: the
+	 * updating one, or if there was no update, arbitrarily the first locking
+	 * xid.
+	 */
+	for (MultiXactId multi = oldest_multi; multi != next_multi;)
+	{
+		TransactionId		xid;
+		MultiXactStatus		status;
+		MultiXactMember		member;
+		MultiXactId			new_multi PG_USED_FOR_ASSERTS_ONLY;
+		MultiXactOffset		offset;
+
+		/* Read the old multixid */
+		GetOldMultiXactIdSingleMember(old_reader, multi, &xid, &status);
+
+		/* Write it out in new format */
+		member.xid = xid;
+		member.status = status;
+		new_multi = GetNewMultiXactId(new_writer, 1, &offset);
+
+		Assert(new_multi == multi);
+
+		RecordNewMultiXact(new_writer, offset, multi, 1, &member);
+
+		multi++;
+		/* handle wraparound */
+		if (multi < FirstMultiXactId)
+			multi = FirstMultiXactId;
+	}
+
+	/*
+	 * Update the nextMXact/Offset values in the control file to match what we
+	 * wrote.  The nextMXact should be unchanged, but because we ignored the
+	 * locking-only members, the nextOffset will be different.
+	 */
+	Assert(new_writer->nextMXact == next_multi);
+
+	*new_nxtmulti = next_multi;
+	*new_nxtmxoff = new_writer->nextOffset;
+
+	/* Release resources */
+	FreeMultiXactWrite(new_writer);
+	FreeOldMultiXactReader(old_reader);
+}
+
 static void
 copy_xact_xlog_xid(void)
 {
@@ -816,8 +894,28 @@ copy_xact_xlog_xid(void)
 	if (old_cluster.controldata.cat_ver >= MULTIXACT_FORMATCHANGE_CAT_VER &&
 		new_cluster.controldata.cat_ver >= MULTIXACT_FORMATCHANGE_CAT_VER)
 	{
-		copy_subdir_files("pg_multixact/offsets", "pg_multixact/offsets");
-		copy_subdir_files("pg_multixact/members", "pg_multixact/members");
+		MultiXactId		new_nxtmulti = old_cluster.controldata.chkpnt_nxtmulti;
+		MultiXactOffset new_nxtmxoff = old_cluster.controldata.chkpnt_nxtmxoff;
+
+		/*
+		 * If the old cluster predates MULTIXACTOFFSET_FORMATCHANGE_CAT_VER, it
+		 * has 32-bit multixact offsets, so they must be converted.
+		 */
+		if (old_cluster.controldata.cat_ver < MULTIXACTOFFSET_FORMATCHANGE_CAT_VER &&
+			new_cluster.controldata.cat_ver >= MULTIXACTOFFSET_FORMATCHANGE_CAT_VER)
+		{
+			remove_new_subdir("pg_multixact/members", false);
+			remove_new_subdir("pg_multixact/offsets", false);
+
+			prep_status("Converting pg_multixact/offsets to 64-bit");
+			convert_multixacts(&new_nxtmulti, &new_nxtmxoff);
+			check_ok();
+		}
+		else
+		{
+			copy_subdir_files("pg_multixact/offsets", "pg_multixact/offsets");
+			copy_subdir_files("pg_multixact/members", "pg_multixact/members");
+		}
 
 		prep_status("Setting next multixact ID and offset for new cluster");
 
@@ -826,10 +924,8 @@ copy_xact_xlog_xid(void)
 		 * counters here and the oldest multi present on system.
 		 */
 		exec_prog(UTILITY_LOG_FILE, NULL, true, true,
-				  "\"%s/pg_resetwal\" -O %u -m %u,%u \"%s\"",
-				  new_cluster.bindir,
-				  old_cluster.controldata.chkpnt_nxtmxoff,
-				  old_cluster.controldata.chkpnt_nxtmulti,
+				  "\"%s/pg_resetwal\" -O %" PRIu64 " -m %u,%u \"%s\"",
+				  new_cluster.bindir, new_nxtmxoff, new_nxtmulti,
 				  old_cluster.controldata.chkpnt_oldstMulti,
 				  new_cluster.pgdata);
 		check_ok();
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index e86336f4be9..127b2cb00fa 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -114,6 +114,11 @@ extern char *output_files[];
  */
 #define MULTIXACT_FORMATCHANGE_CAT_VER 201301231
 
+/*
+ * Switch from 32-bit to 64-bit for multixid offsets.
+ */
+#define MULTIXACTOFFSET_FORMATCHANGE_CAT_VER 999999999
+
 /*
  * large object chunk size added to pg_controldata,
  * commit 5f93c37805e7485488480916b4585e098d3cc883
diff --git a/src/bin/pg_upgrade/slru_io.c b/src/bin/pg_upgrade/slru_io.c
new file mode 100644
index 00000000000..4e823199303
--- /dev/null
+++ b/src/bin/pg_upgrade/slru_io.c
@@ -0,0 +1,240 @@
+/*
+ * slru_io.c
+ *
+ * Routines for reading and writing SLRU files during upgrade.
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ * src/bin/pg_upgrade/slru_io.c
+ */
+
+#include "postgres_fe.h"
+
+#include <fcntl.h>
+
+#include "pg_upgrade.h"
+#include "slru_io.h"
+
+#include "common/fe_memutils.h"
+#include "common/file_perm.h"
+#include "common/file_utils.h"
+#include "port/pg_iovec.h"
+
+/*
+ * State for reading or writing an SLRU, with a one page buffer.
+ */
+typedef struct SlruSegState
+{
+	bool		writing;
+	bool		long_segment_names;
+
+	char	   *dir;
+	char	   *fn;
+	int			fd;
+	int64		segno;
+	uint64		pageno;
+
+	PGAlignedBlock buf;
+} SlruSegState;
+
+static inline SlruSegState *
+AllocSlruSegState(char *dir)
+{
+	SlruSegState *state = pg_malloc(sizeof(*state));
+
+	state->segno = -1;
+	state->pageno = 0;
+	state->dir = pstrdup(dir);
+	state->fd = -1;
+	state->fn = NULL;
+
+	return state;
+}
+
+static inline void
+SlruFlush(SlruSegState *state)
+{
+	struct iovec	iovec = {
+		.iov_base = &state->buf,
+		.iov_len = BLCKSZ,
+	};
+	off_t			offset;
+
+	if (state->segno == -1)
+		return;
+
+	offset = (state->pageno % SLRU_PAGES_PER_SEGMENT) * BLCKSZ;
+
+	if (pg_pwritev_with_retry(state->fd, &iovec, 1, offset) < 0)
+		pg_fatal("could not write file \"%s\": %m", state->fn);
+}
+
+/*
+ * Create slru reader for dir.
+ *
+ * Returns the malloced memory used by all the other read calls in this module.
+ */
+SlruSegState *
+AllocSlruRead(char *dir)
+{
+	SlruSegState *state = AllocSlruSegState(dir);
+
+	state->writing = false;
+
+	return state;
+}
+
+/*
+ * Open given page for reading.
+ *
+ * Reading can be done in random order.
+ */
+char *
+SlruReadSwitchPage(SlruSegState *state, uint64 pageno)
+{
+	int64 segno;
+
+	Assert(!state->writing);	/* read only mode */
+
+	if (state->segno != -1 && pageno == state->pageno)
+		return state->buf.data;
+
+	segno = pageno / SLRU_PAGES_PER_SEGMENT;
+	if (segno != state->segno)
+	{
+		if (state->segno != -1)
+		{
+			close(state->fd);
+			state->fd = -1;
+
+			pg_free(state->fn);
+			state->fn = NULL;
+		}
+
+		/* Open new segment */
+		state->fn = psprintf("%s/%04X", state->dir, (unsigned int) segno);
+		if ((state->fd = open(state->fn, O_RDONLY | PG_BINARY, 0)) < 0)
+			pg_fatal("could not open file \"%s\": %m", state->fn);
+	}
+
+	state->segno = segno;
+
+	{
+		struct iovec	iovec = {
+			.iov_base = &state->buf,
+			.iov_len = BLCKSZ,
+		};
+		off_t			offset = (pageno % SLRU_PAGES_PER_SEGMENT) * BLCKSZ;
+
+		if (pg_preadv(state->fd, &iovec, 1, offset) < 0)
+			pg_fatal("could not read file \"%s\": %m", state->fn);
+
+		state->pageno = pageno;
+	}
+
+	return state->buf.data;
+}
+
+/*
+ * Frees the malloced reader.
+ */
+void
+FreeSlruRead(SlruSegState *state)
+{
+	Assert(!state->writing);	/* read only mode */
+
+	close(state->fd);
+	pg_free(state);
+}
+
+/*
+ * Open the given page for writing.
+ *
+ * NOTE: This uses O_EXCL when stepping to a new segment, so this assumes that
+ * each segment is written in full before moving on to next one.  This
+ * limitation would be easy to lift if needed, but it fits the usage pattern of
+ * current callers.
+ */
+char *
+SlruWriteSwitchPage(SlruSegState *state, uint64 pageno)
+{
+	int64	segno = pageno / SLRU_PAGES_PER_SEGMENT;
+	off_t	offset;
+
+	if (state->segno != -1 && pageno == state->pageno)
+		return state->buf.data;
+
+	segno = pageno / SLRU_PAGES_PER_SEGMENT;
+	offset = (pageno % SLRU_PAGES_PER_SEGMENT) * BLCKSZ;
+
+	SlruFlush(state);
+	memset(state->buf.data, 0, BLCKSZ);
+
+	if (segno != state->segno)
+	{
+		if (state->segno != -1)
+		{
+			close(state->fd);
+			state->fd = -1;
+
+			pg_free(state->fn);
+			state->fn = NULL;
+		}
+
+		/* Create the segment */
+		if (state->long_segment_names)
+		{
+			Assert(segno >= 0 && segno <= INT64CONST(0xFFFFFFFFFFFFFFF));
+			state->fn = psprintf("%s/%015" PRIX64, state->dir, segno);
+		}
+		else
+		{
+			Assert(segno >= 0 && segno <= INT64CONST(0xFFFFFF));
+			state->fn = psprintf("%s/%04X", state->dir, (unsigned int) segno);
+		}
+
+		if ((state->fd = open(state->fn, O_RDWR | O_CREAT | O_EXCL | PG_BINARY,
+							  pg_file_create_mode)) < 0)
+		{
+			pg_fatal("could not create file \"%s\": %m", state->fn);
+		}
+
+		state->segno = segno;
+
+		if (offset > 0 && pg_pwrite_zeros(state->fd, offset, 0) < 0)
+			pg_fatal("could not write file \"%s\": %m", state->fn);
+	}
+
+	state->pageno = pageno;
+
+	return state->buf.data;
+}
+
+/*
+ * Create slru writer for dir.
+ *
+ * Returns the malloced memory used by all the other write calls in this module.
+ */
+SlruSegState *
+AllocSlruWrite(char *dir, bool long_segment_names)
+{
+	SlruSegState *state = AllocSlruSegState(dir);
+
+	state->writing = true;
+	state->long_segment_names = long_segment_names;
+
+	return state;
+}
+
+/*
+ * Frees the malloced writer.
+ */
+void
+FreeSlruWrite(SlruSegState *state)
+{
+	Assert(state->writing);
+
+	SlruFlush(state);
+
+	close(state->fd);
+	pg_free(state);
+}
diff --git a/src/bin/pg_upgrade/slru_io.h b/src/bin/pg_upgrade/slru_io.h
new file mode 100644
index 00000000000..920b8ae82e2
--- /dev/null
+++ b/src/bin/pg_upgrade/slru_io.h
@@ -0,0 +1,30 @@
+/*
+ * slru_io.h
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ * src/bin/pg_upgrade/slru_io.h
+ */
+
+/*
+ * Some kind of iterator associated with a particular SLRU segment.  The idea is
+ * to specify the segment and page number and then move through the pages.
+ */
+
+#include "postgres_fe.h"
+
+/*
+ * See access/slru.h
+ *
+ * Copied here, since slru.h cannot be included in frontend code.
+ */
+#define SLRU_PAGES_PER_SEGMENT 32
+
+typedef struct SlruSegState SlruSegState;
+
+extern SlruSegState *AllocSlruRead(char *dir);
+extern char *SlruReadSwitchPage(SlruSegState *state, uint64 pageno);
+extern void FreeSlruRead(SlruSegState *state);
+
+extern SlruSegState *AllocSlruWrite(char *dir, bool long_segment_names);
+extern char *SlruWriteSwitchPage(SlruSegState *state, uint64 pageno);
+extern void FreeSlruWrite(SlruSegState *state);
-- 
2.43.0

#53Alexander Korotkov
aekorotkov@gmail.com
In reply to: Maxim Orlov (#52)
3 attachment(s)
Re: POC: make mxidoff 64 bits

On Thu, Oct 30, 2025 at 6:17 PM Maxim Orlov <orlovmg@gmail.com> wrote:

On Thu, 30 Oct 2025 at 12:10, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Oh I see, the 'base' is not necessarily the base offset of the first
multixact on the page, it's the base offset of the first multixid that
is written to the page. And the (short) offsets can be negative. That's
a frighteningly clever encoding scheme. One upshot of that is that WAL
redo might construct the page with a different 'base'. I guess that
works, but it scares me. Could we come up with a more deterministic scheme?

Definitely! The most stable approach is the one we had before, which
used actual 64-bit offsets in the SLRU. To be honest, I'm completely
happy with it. After all, what's most important for me is to have 64-bit
xids in Postgres, and this patch is a step towards that goal.

Yes, but why can't we have an encoding scheme that is both
deterministic and provides compression? The attached is what I meant
in [1]. It's based on v19 and provides a deterministic conversion of
every 8 64-bit offsets into a chunk containing a 64-bit base and seven
24-bit increments. I didn't touch the pg_upgrade code though.
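
For illustration, the chunk layout described above can be sketched roughly as
follows. This is a minimal standalone sketch, not the patch code: the names
SketchChunk, sketch_chunk_encode and sketch_chunk_decode are made up here, and
it assumes (as the attached patch appears to do) that zero means "slot not
written yet" and that a spare high bit marks a stored value as present.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical names; assumed flag bits marking "this slot holds a value". */
#define SKETCH_BASE_FLAG	(UINT64_C(1) << 63)
#define SKETCH_DELTA_FLAG	(UINT32_C(1) << 23)

typedef struct SketchChunk
{
	uint64_t	base;			/* first known offset in the chunk */
	uint8_t		deltas[7][3];	/* 24-bit increments, low byte first */
} SketchChunk;

/* Pack 8 offsets (0 = unset) into a base plus 7 increments. */
static void
sketch_chunk_encode(const uint64_t values[8], SketchChunk *chunk)
{
	uint64_t	prev = values[0];

	memset(chunk, 0, sizeof(*chunk));
	if (values[0] != 0)
	{
		assert((values[0] & SKETCH_BASE_FLAG) == 0);
		chunk->base = values[0] | SKETCH_BASE_FLAG;
	}

	for (int i = 1; i < 8; i++)
	{
		uint64_t	diff;

		if (values[i] == 0)
			continue;			/* unset slot stays all-zero */

		/* Slot 0 is unset: remember the first known value, unflagged. */
		if (chunk->base == 0)
			prev = chunk->base = values[i];

		diff = values[i] - prev;
		prev = values[i];
		assert(diff < SKETCH_DELTA_FLAG);
		diff |= SKETCH_DELTA_FLAG;

		chunk->deltas[i - 1][0] = (uint8_t) (diff & 0xFF);
		chunk->deltas[i - 1][1] = (uint8_t) ((diff >> 8) & 0xFF);
		chunk->deltas[i - 1][2] = (uint8_t) ((diff >> 16) & 0xFF);
	}
}

/* Inverse of sketch_chunk_encode(): rebuild the 8 offsets. */
static void
sketch_chunk_decode(const SketchChunk *chunk, uint64_t values[8])
{
	uint64_t	prev;

	if (chunk->base & SKETCH_BASE_FLAG)
		prev = values[0] = chunk->base & ~SKETCH_BASE_FLAG;
	else
	{
		values[0] = 0;
		prev = chunk->base;
	}

	for (int i = 1; i < 8; i++)
	{
		uint32_t	diff = (uint32_t) chunk->deltas[i - 1][0] |
			((uint32_t) chunk->deltas[i - 1][1] << 8) |
			((uint32_t) chunk->deltas[i - 1][2] << 16);

		if (diff == 0)
			values[i] = 0;
		else
		{
			values[i] = prev + (diff & ~SKETCH_DELTA_FLAG);
			prev = values[i];
		}
	}
}
```

The result depends only on the 8 input values, not on the order in which they
were written, which is what makes the encoding deterministic. The price is
that consecutive offsets within a chunk must differ by less than 2^23.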

Links.
1. /messages/by-id/CAPpHfdtPybyMYBj-x3-Z5=4bj_vhYk2R0nezfy=Vjcz4QBMDgw@mail.gmail.com

------
Regards,
Alexander Korotkov
Supabase

Attachments:

v21-0001-Use-64-bit-multixact-offsets.patch (application/octet-stream)
From b5714701a7583ec085ff5ead04b65c7a6addaa9b Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Wed, 7 Aug 2024 16:35:22 +0300
Subject: [PATCH v21 1/3] Use 64-bit multixact offsets

Switching to 64-bit multitransaction offsets removes wraparound and the
2^32 limit on their total number.

On the other hand, this change inevitably increases disc space
utilisation.  Fortunately, multitransaction offsets rise monotonically
and without gaps.  To conserve the disc space consumed by segments, we
encode every 8 values as a chunk containing an 8-byte base and seven
3-byte increments.

Author: Maxim Orlov <orlovmg@gmail.com>
---
 src/backend/access/rmgrdesc/mxactdesc.c   |   4 +-
 src/backend/access/rmgrdesc/xlogdesc.c    |   2 +-
 src/backend/access/transam/multixact.c    | 452 +++++++++-------------
 src/backend/access/transam/xlogrecovery.c |   2 +-
 src/backend/commands/vacuum.c             |   2 +-
 src/backend/postmaster/autovacuum.c       |   4 +-
 src/bin/pg_controldata/pg_controldata.c   |   2 +-
 src/bin/pg_resetwal/pg_resetwal.c         |   6 +-
 src/bin/pg_resetwal/t/001_basic.pl        |   2 +-
 src/include/access/multixact.h            |   3 +-
 src/include/c.h                           |   2 +-
 11 files changed, 202 insertions(+), 279 deletions(-)

diff --git a/src/backend/access/rmgrdesc/mxactdesc.c b/src/backend/access/rmgrdesc/mxactdesc.c
index 3ca0582db36..052dd0a4ce5 100644
--- a/src/backend/access/rmgrdesc/mxactdesc.c
+++ b/src/backend/access/rmgrdesc/mxactdesc.c
@@ -65,7 +65,7 @@ multixact_desc(StringInfo buf, XLogReaderState *record)
 		xl_multixact_create *xlrec = (xl_multixact_create *) rec;
 		int			i;
 
-		appendStringInfo(buf, "%u offset %u nmembers %d: ", xlrec->mid,
+		appendStringInfo(buf, "%u offset %" PRIu64 " nmembers %d: ", xlrec->mid,
 						 xlrec->moff, xlrec->nmembers);
 		for (i = 0; i < xlrec->nmembers; i++)
 			out_member(buf, &xlrec->members[i]);
@@ -74,7 +74,7 @@ multixact_desc(StringInfo buf, XLogReaderState *record)
 	{
 		xl_multixact_truncate *xlrec = (xl_multixact_truncate *) rec;
 
-		appendStringInfo(buf, "offsets [%u, %u), members [%u, %u)",
+		appendStringInfo(buf, "offsets [%u, %u), members [%" PRIu64 ", %" PRIu64 ")",
 						 xlrec->startTruncOff, xlrec->endTruncOff,
 						 xlrec->startTruncMemb, xlrec->endTruncMemb);
 	}
diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c
index cd6c2a2f650..441034f5929 100644
--- a/src/backend/access/rmgrdesc/xlogdesc.c
+++ b/src/backend/access/rmgrdesc/xlogdesc.c
@@ -66,7 +66,7 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 		CheckPoint *checkpoint = (CheckPoint *) rec;
 
 		appendStringInfo(buf, "redo %X/%08X; "
-						 "tli %u; prev tli %u; fpw %s; wal_level %s; xid %u:%u; oid %u; multi %u; offset %u; "
+						 "tli %u; prev tli %u; fpw %s; wal_level %s; xid %u:%u; oid %u; multi %u; offset %" PRIu64 "; "
 						 "oldest xid %u in DB %u; oldest multi %u in DB %u; "
 						 "oldest/newest commit timestamp xid: %u/%u; "
 						 "oldest running xid %u; %s",
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 9d5f130af7e..1f59587c42e 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -89,21 +89,31 @@
 #include "utils/memutils.h"
 
 
+typedef int32 ShortMultiXactOffset;	/* for on-disk storage */
+
 /*
  * Defines for MultiXactOffset page sizes.  A page is the same BLCKSZ as is
  * used everywhere else in Postgres.
  *
- * Note: because MultiXactOffsets are 32 bits and wrap around at 0xFFFFFFFF,
- * MultiXact page numbering also wraps around at
- * 0xFFFFFFFF/MULTIXACT_OFFSETS_PER_PAGE, and segment numbering at
- * 0xFFFFFFFF/MULTIXACT_OFFSETS_PER_PAGE/SLRU_PAGES_PER_SEGMENT.  We need
- * take no explicit notice of that fact in this module, except when comparing
- * segment and page numbers in TruncateMultiXact (see
- * MultiXactOffsetPagePrecedes).
+ * There are two key reasons why storing plain 64-bit offset values is
+ * wasteful in terms of disc space:
+ * 1) offset values are recorded in ascending order and not overwritten;
+ * 2) the largest supported BLCKSZ is 32k, which can store up to 2^13 32-bit
+ *    items on a single page;  thus, with MAX_BACKENDS limited to 2^18-1 we have
+ *    2^13 * (2^18-1), which is less than 2^31 and fits in 32 bits.
+ *
+ * In other words, the maximum "distance" between offsets on a single page does
+ * not exceed 32 bits.  To optimise disc space usage, we employ the following
+ * scheme.  Each page starts with a 64-bit offset, known as the page base,
+ * followed by 32-bit deltas relative to that base.  Thus, the offset for the
+ * 0-th element is the page base itself; the value of each subsequent offset
+ * on the same page is calculated by adding its delta to the page base (0-th)
+ * element.
  */
 
-/* We need four bytes per offset */
-#define MULTIXACT_OFFSETS_PER_PAGE (BLCKSZ / sizeof(MultiXactOffset))
+/* We need four bytes per offset, 8 bytes for the base */
+#define MULTIXACT_OFFSETS_PER_PAGE		\
+	((BLCKSZ - sizeof(MultiXactOffset)) / sizeof(ShortMultiXactOffset))
 
 static inline int64
 MultiXactIdToOffsetPage(MultiXactId multi)
@@ -208,10 +218,14 @@ MXOffsetToMemberOffset(MultiXactOffset offset)
 		member_in_group * sizeof(TransactionId);
 }
 
-/* Multixact members wraparound thresholds. */
-#define MULTIXACT_MEMBER_SAFE_THRESHOLD		(MaxMultiXactOffset / 2)
-#define MULTIXACT_MEMBER_DANGER_THRESHOLD	\
-	(MaxMultiXactOffset - MaxMultiXactOffset / 4)
+/*
+ * Multixact members warning threshold.
+ *
+ * If the difference between nextOffset and oldestOffset exceeds this value,
+ * we trigger autovacuum in order to release disk space and reduce table bloat
+ * if possible.
+ */
+#define MULTIXACT_MEMBER_AUTOVAC_THRESHOLD		UINT64CONST(0xFFFFFFFF)
 
 static inline MultiXactId
 PreviousMultiXactId(MultiXactId multi)
@@ -228,6 +242,128 @@ static SlruCtlData MultiXactMemberCtlData;
 #define MultiXactOffsetCtl	(&MultiXactOffsetCtlData)
 #define MultiXactMemberCtl	(&MultiXactMemberCtlData)
 
+typedef struct
+{
+	MultiXactOffset			base;
+	uint8					offsets[7][3];
+} MultixactEncodedChunk;
+
+/*
+ * To avoid diving deep into address arithmetic, we declare an auxiliary
+ * structure for accessing the MultiXactOffset SLRU page.
+ */
+typedef struct MultiXactOffsetSLRUPage
+{
+	MultixactEncodedChunk	chunks[FLEXIBLE_ARRAY_MEMBER];
+} MultiXactOffsetSLRUPage;
+
+#define BASE_HIGH_BIT ((UINT64CONST(1) << 63))
+#define OFFSET_HIGH_BIT (1 << 23)
+
+static inline void
+MultixactChunkEncode(MultiXactOffset values[8], MultixactEncodedChunk *chunk)
+{
+	MultiXactOffset			prevValue;
+	int		i;
+
+	Assert((values[0] & BASE_HIGH_BIT) == 0);
+
+	if (values[0] != 0)
+	{
+		chunk->base = values[0] | BASE_HIGH_BIT;
+		prevValue = values[0];
+	}
+	else
+	{
+		chunk->base = prevValue = 0;
+	}
+
+	for (i = 1; i < 8; i++)
+	{
+		uint64		diff;
+
+		if (values[i] == 0)
+		{
+			chunk->offsets[i - 1][0] = chunk->offsets[i - 1][1] = chunk->offsets[i - 1][2] = 0;
+			continue;
+		}
+
+		if (chunk->base == 0)
+			prevValue = chunk->base = values[i];
+
+		diff = (values[i] - prevValue);
+		prevValue = values[i];
+
+		Assert(diff < OFFSET_HIGH_BIT);
+		diff |= OFFSET_HIGH_BIT;
+
+		chunk->offsets[i - 1][0] = diff & 0xFF;
+		chunk->offsets[i - 1][1] = (diff >> 8) & 0xFF;
+		chunk->offsets[i - 1][2] = (diff >> 16) & 0xFF;
+	}
+}
+
+static inline void
+MultixactChunkDecode(MultiXactOffset values[8], MultixactEncodedChunk *chunk)
+{
+	MultiXactOffset		prevValue = chunk->base;
+	int			i;
+
+	if (chunk->base & BASE_HIGH_BIT)
+	{
+		prevValue = values[0] = chunk->base - BASE_HIGH_BIT;
+	}
+	else
+	{
+		values[0] = 0;
+		prevValue = chunk->base;
+	}
+
+	for (i = 1; i < 8; i++)
+	{
+		uint64		diff;
+
+		diff = chunk->offsets[i - 1][0] | (chunk->offsets[i - 1][1] << 8) | (chunk->offsets[i - 1][2] << 16);
+
+		if (diff == 0)
+		{
+			values[i] = 0;
+		}
+		else
+		{
+			Assert(diff & OFFSET_HIGH_BIT);
+			values[i] = prevValue + (diff - OFFSET_HIGH_BIT);
+			prevValue = values[i];
+		}
+	}
+
+}
+
+static inline MultiXactOffset
+MXOffsetRead(int entryno, int slotno)
+{
+	MultiXactOffsetSLRUPage *page =
+		(MultiXactOffsetSLRUPage *) MultiXactOffsetCtl->shared->page_buffer[slotno];
+	MultiXactOffset values[8];
+
+	MultixactChunkDecode(values, &page->chunks[entryno / 8]);
+
+	return values[entryno % 8];
+}
+
+
+static inline void
+MXOffsetWrite(int entryno, int slotno, MultiXactOffset offset)
+{
+	MultiXactOffsetSLRUPage *page =
+		(MultiXactOffsetSLRUPage *) MultiXactOffsetCtl->shared->page_buffer[slotno];
+	MultiXactOffset values[8];
+
+	MultixactChunkDecode(values, &page->chunks[entryno / 8]);
+	values[entryno % 8] = offset;
+	MultixactChunkEncode(values, &page->chunks[entryno / 8]);
+}
+
 /*
  * MultiXact state shared across all backends.  All this state is protected
  * by MultiXactGenLock.  (We also use SLRU bank's lock of MultiXactOffset and
@@ -268,9 +404,6 @@ typedef struct MultiXactStateData
 	MultiXactId multiStopLimit;
 	MultiXactId multiWrapLimit;
 
-	/* support for members anti-wraparound measures */
-	MultiXactOffset offsetStopLimit;	/* known if oldestOffsetKnown */
-
 	/*
 	 * This is used to sleep until a multixact offset is written when we want
 	 * to create the next one.
@@ -401,8 +534,6 @@ static bool MultiXactOffsetPrecedes(MultiXactOffset offset1,
 									MultiXactOffset offset2);
 static void ExtendMultiXactOffset(MultiXactId multi);
 static void ExtendMultiXactMember(MultiXactOffset offset, int nmembers);
-static bool MultiXactOffsetWouldWrap(MultiXactOffset boundary,
-									 MultiXactOffset start, uint32 distance);
 static bool SetOffsetVacuumLimit(bool is_startup);
 static bool find_multixact_start(MultiXactId multi, MultiXactOffset *result);
 static void WriteMTruncateXlogRec(Oid oldestMultiDB,
@@ -911,7 +1042,6 @@ RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
 	int64		prev_pageno;
 	int			entryno;
 	int			slotno;
-	MultiXactOffset *offptr;
 	int			i;
 	LWLock	   *lock;
 	LWLock	   *prevlock = NULL;
@@ -930,10 +1060,8 @@ RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
 	 * take the trouble to generalize the slru.c error reporting code.
 	 */
 	slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, multi);
-	offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
-	offptr += entryno;
 
-	*offptr = offset;
+	MXOffsetWrite(entryno, slotno, offset);
 
 	MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
 
@@ -1155,78 +1283,6 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 	else
 		*offset = nextOffset;
 
-	/*----------
-	 * Protect against overrun of the members space as well, with the
-	 * following rules:
-	 *
-	 * If we're past offsetStopLimit, refuse to generate more multis.
-	 * If we're close to offsetStopLimit, emit a warning.
-	 *
-	 * Arbitrarily, we start emitting warnings when we're 20 segments or less
-	 * from offsetStopLimit.
-	 *
-	 * Note we haven't updated the shared state yet, so if we fail at this
-	 * point, the multixact ID we grabbed can still be used by the next guy.
-	 *
-	 * Note that there is no point in forcing autovacuum runs here: the
-	 * multixact freeze settings would have to be reduced for that to have any
-	 * effect.
-	 *----------
-	 */
-#define OFFSET_WARN_SEGMENTS	20
-	if (MultiXactState->oldestOffsetKnown &&
-		MultiXactOffsetWouldWrap(MultiXactState->offsetStopLimit, nextOffset,
-								 nmembers))
-	{
-		/* see comment in the corresponding offsets wraparound case */
-		SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER);
-
-		ereport(ERROR,
-				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg("multixact \"members\" limit exceeded"),
-				 errdetail_plural("This command would create a multixact with %u members, but the remaining space is only enough for %u member.",
-								  "This command would create a multixact with %u members, but the remaining space is only enough for %u members.",
-								  MultiXactState->offsetStopLimit - nextOffset - 1,
-								  nmembers,
-								  MultiXactState->offsetStopLimit - nextOffset - 1),
-				 errhint("Execute a database-wide VACUUM in database with OID %u with reduced \"vacuum_multixact_freeze_min_age\" and \"vacuum_multixact_freeze_table_age\" settings.",
-						 MultiXactState->oldestMultiXactDB)));
-	}
-
-	/*
-	 * Check whether we should kick autovacuum into action, to prevent members
-	 * wraparound. NB we use a much larger window to trigger autovacuum than
-	 * just the warning limit. The warning is just a measure of last resort -
-	 * this is in line with GetNewTransactionId's behaviour.
-	 */
-	if (!MultiXactState->oldestOffsetKnown ||
-		(MultiXactState->nextOffset - MultiXactState->oldestOffset
-		 > MULTIXACT_MEMBER_SAFE_THRESHOLD))
-	{
-		/*
-		 * To avoid swamping the postmaster with signals, we issue the autovac
-		 * request only when crossing a segment boundary. With default
-		 * compilation settings that's roughly after 50k members.  This still
-		 * gives plenty of chances before we get into real trouble.
-		 */
-		if ((MXOffsetToMemberPage(nextOffset) / SLRU_PAGES_PER_SEGMENT) !=
-			(MXOffsetToMemberPage(nextOffset + nmembers) / SLRU_PAGES_PER_SEGMENT))
-			SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER);
-	}
-
-	if (MultiXactState->oldestOffsetKnown &&
-		MultiXactOffsetWouldWrap(MultiXactState->offsetStopLimit,
-								 nextOffset,
-								 nmembers + MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT * OFFSET_WARN_SEGMENTS))
-		ereport(WARNING,
-				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg_plural("database with OID %u must be vacuumed before %d more multixact member is used",
-							   "database with OID %u must be vacuumed before %d more multixact members are used",
-							   MultiXactState->offsetStopLimit - nextOffset + nmembers,
-							   MultiXactState->oldestMultiXactDB,
-							   MultiXactState->offsetStopLimit - nextOffset + nmembers),
-				 errhint("Execute a database-wide VACUUM in that database with reduced \"vacuum_multixact_freeze_min_age\" and \"vacuum_multixact_freeze_table_age\" settings.")));
-
 	ExtendMultiXactMember(nextOffset, nmembers);
 
 	/*
@@ -1255,7 +1311,8 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 
 	LWLockRelease(MultiXactGenLock);
 
-	debug_elog4(DEBUG2, "GetNew: returning %u offset %u", result, *offset);
+	debug_elog4(DEBUG2, "GetNew: returning %u offset %" PRIu64, result,
+				*offset);
 	return result;
 }
 
@@ -1294,7 +1351,6 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
 	int64		prev_pageno;
 	int			entryno;
 	int			slotno;
-	MultiXactOffset *offptr;
 	MultiXactOffset offset;
 	int			length;
 	int			truelength;
@@ -1418,9 +1474,8 @@ retry:
 	LWLockAcquire(lock, LW_EXCLUSIVE);
 
 	slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, multi);
-	offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
-	offptr += entryno;
-	offset = *offptr;
+
+	offset = MXOffsetRead(entryno, slotno);
 
 	Assert(offset != 0);
 
@@ -1467,9 +1522,7 @@ retry:
 			slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, tmpMXact);
 		}
 
-		offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
-		offptr += entryno;
-		nextMXOffset = *offptr;
+		nextMXOffset = MXOffsetRead(entryno, slotno);
 
 		if (nextMXOffset == 0)
 		{
@@ -1973,7 +2026,7 @@ MultiXactShmemInit(void)
 				  "pg_multixact/members", LWTRANCHE_MULTIXACTMEMBER_BUFFER,
 				  LWTRANCHE_MULTIXACTMEMBER_SLRU,
 				  SYNC_HANDLER_MULTIXACT_MEMBER,
-				  false);
+				  true);
 	/* doesn't call SimpleLruTruncate() or meet criteria for unit tests */
 
 	/* Initialize our shared state struct */
@@ -2149,9 +2202,24 @@ TrimMultiXact(void)
 		LWLockAcquire(lock, LW_EXCLUSIVE);
 		slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, nextMXact);
 		offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
-		offptr += entryno;
 
-		MemSet(offptr, 0, BLCKSZ - (entryno * sizeof(MultiXactOffset)));
+		if (entryno == 0)
+			MemSet(offptr, 0, BLCKSZ);
+		else
+		{
+			ShortMultiXactOffset *off32ptr;
+
+			off32ptr = (ShortMultiXactOffset *) (offptr + 1);
+			off32ptr += entryno;
+
+			/*
+			 * Knowing that offptr points to the beginning of the buffer,
+			 * address arithmetic can be used to determine the number of
+			 * bytes remaining.
+			 */
+			MemSet(off32ptr, 0,
+				   BLCKSZ - (((char *) off32ptr - (char *) offptr)));
+		}
 
 		MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
 		LWLockRelease(lock);
@@ -2223,7 +2291,7 @@ MultiXactGetCheckptMulti(bool is_shutdown,
 	LWLockRelease(MultiXactGenLock);
 
 	debug_elog6(DEBUG2,
-				"MultiXact: checkpoint is nextMulti %u, nextOffset %u, oldestMulti %u in DB %u",
+				"MultiXact: checkpoint is nextMulti %u, nextOffset %" PRIu64 ", oldestMulti %u in DB %u",
 				*nextMulti, *nextMultiOffset, *oldestMulti, *oldestMultiDB);
 }
 
@@ -2258,7 +2326,7 @@ void
 MultiXactSetNextMXact(MultiXactId nextMulti,
 					  MultiXactOffset nextMultiOffset)
 {
-	debug_elog4(DEBUG2, "MultiXact: setting next multi to %u offset %u",
+	debug_elog4(DEBUG2, "MultiXact: setting next multi to %u offset %" PRIu64,
 				nextMulti, nextMultiOffset);
 	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
 	MultiXactState->nextMXact = nextMulti;
@@ -2449,7 +2517,7 @@ MultiXactAdvanceNextMXact(MultiXactId minMulti,
 	}
 	if (MultiXactOffsetPrecedes(MultiXactState->nextOffset, minMultiOffset))
 	{
-		debug_elog3(DEBUG2, "MultiXact: setting next offset to %u",
+		debug_elog3(DEBUG2, "MultiXact: setting next offset to %" PRIu64,
 					minMultiOffset);
 		MultiXactState->nextOffset = minMultiOffset;
 	}
@@ -2633,15 +2701,13 @@ GetOldestMultiXactId(void)
 }
 
 /*
- * Determine how aggressively we need to vacuum in order to prevent member
- * wraparound.
+ * Determine whether autovacuum is needed to reclaim multixact member space.
  *
  * To do so determine what's the oldest member offset and install the limit
  * info in MultiXactState, where it can be used to prevent overrun of old data
  * in the members SLRU area.
  *
- * The return value is true if emergency autovacuum is required and false
- * otherwise.
+ * The return value is true if autovacuum is required and false otherwise.
  */
 static bool
 SetOffsetVacuumLimit(bool is_startup)
@@ -2653,8 +2719,6 @@ SetOffsetVacuumLimit(bool is_startup)
 	MultiXactOffset nextOffset;
 	bool		oldestOffsetKnown = false;
 	bool		prevOldestOffsetKnown;
-	MultiXactOffset offsetStopLimit = 0;
-	MultiXactOffset prevOffsetStopLimit;
 
 	/*
 	 * NB: Have to prevent concurrent truncation, we might otherwise try to
@@ -2669,7 +2733,6 @@ SetOffsetVacuumLimit(bool is_startup)
 	nextOffset = MultiXactState->nextOffset;
 	prevOldestOffsetKnown = MultiXactState->oldestOffsetKnown;
 	prevOldestOffset = MultiXactState->oldestOffset;
-	prevOffsetStopLimit = MultiXactState->offsetStopLimit;
 	Assert(MultiXactState->finishedStartup);
 	LWLockRelease(MultiXactGenLock);
 
@@ -2700,11 +2763,7 @@ SetOffsetVacuumLimit(bool is_startup)
 		oldestOffsetKnown =
 			find_multixact_start(oldestMultiXactId, &oldestOffset);
 
-		if (oldestOffsetKnown)
-			ereport(DEBUG1,
-					(errmsg_internal("oldest MultiXactId member is at offset %u",
-									 oldestOffset)));
-		else
+		if (!oldestOffsetKnown)
 			ereport(LOG,
 					(errmsg("MultiXact member wraparound protections are disabled because oldest checkpointed MultiXact %u does not exist on disk",
 							oldestMultiXactId)));
@@ -2717,24 +2776,7 @@ SetOffsetVacuumLimit(bool is_startup)
 	 * overrun of old data in the members SLRU area. We can only do so if the
 	 * oldest offset is known though.
 	 */
-	if (oldestOffsetKnown)
-	{
-		/* move back to start of the corresponding segment */
-		offsetStopLimit = oldestOffset - (oldestOffset %
-										  (MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT));
-
-		/* always leave one segment before the wraparound point */
-		offsetStopLimit -= (MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT);
-
-		if (!prevOldestOffsetKnown && !is_startup)
-			ereport(LOG,
-					(errmsg("MultiXact member wraparound protections are now enabled")));
-
-		ereport(DEBUG1,
-				(errmsg_internal("MultiXact member stop limit is now %u based on MultiXact %u",
-								 offsetStopLimit, oldestMultiXactId)));
-	}
-	else if (prevOldestOffsetKnown)
+	if (prevOldestOffsetKnown)
 	{
 		/*
 		 * If we failed to get the oldest offset this time, but we have a
@@ -2744,69 +2786,19 @@ SetOffsetVacuumLimit(bool is_startup)
 		 */
 		oldestOffset = prevOldestOffset;
 		oldestOffsetKnown = true;
-		offsetStopLimit = prevOffsetStopLimit;
 	}
 
 	/* Install the computed values */
 	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
 	MultiXactState->oldestOffset = oldestOffset;
 	MultiXactState->oldestOffsetKnown = oldestOffsetKnown;
-	MultiXactState->offsetStopLimit = offsetStopLimit;
 	LWLockRelease(MultiXactGenLock);
 
 	/*
-	 * Do we need an emergency autovacuum?	If we're not sure, assume yes.
+	 * Do we need autovacuum?	If we're not sure, assume yes.
 	 */
 	return !oldestOffsetKnown ||
-		(nextOffset - oldestOffset > MULTIXACT_MEMBER_SAFE_THRESHOLD);
-}
-
-/*
- * Return whether adding "distance" to "start" would move past "boundary".
- *
- * We use this to determine whether the addition is "wrapping around" the
- * boundary point, hence the name.  The reason we don't want to use the regular
- * 2^31-modulo arithmetic here is that we want to be able to use the whole of
- * the 2^32-1 space here, allowing for more multixacts than would fit
- * otherwise.
- */
-static bool
-MultiXactOffsetWouldWrap(MultiXactOffset boundary, MultiXactOffset start,
-						 uint32 distance)
-{
-	MultiXactOffset finish;
-
-	/*
-	 * Note that offset number 0 is not used (see GetMultiXactIdMembers), so
-	 * if the addition wraps around the UINT_MAX boundary, skip that value.
-	 */
-	finish = start + distance;
-	if (finish < start)
-		finish++;
-
-	/*-----------------------------------------------------------------------
-	 * When the boundary is numerically greater than the starting point, any
-	 * value numerically between the two is not wrapped:
-	 *
-	 *	<----S----B---->
-	 *	[---)			 = F wrapped past B (and UINT_MAX)
-	 *		 [---)		 = F not wrapped
-	 *			  [----] = F wrapped past B
-	 *
-	 * When the boundary is numerically less than the starting point (i.e. the
-	 * UINT_MAX wraparound occurs somewhere in between) then all values in
-	 * between are wrapped:
-	 *
-	 *	<----B----S---->
-	 *	[---)			 = F not wrapped past B (but wrapped past UINT_MAX)
-	 *		 [---)		 = F wrapped past B (and UINT_MAX)
-	 *			  [----] = F not wrapped
-	 *-----------------------------------------------------------------------
-	 */
-	if (start < boundary)
-		return finish >= boundary || finish < start;
-	else
-		return finish >= boundary && finish < start;
+		(nextOffset - oldestOffset > MULTIXACT_MEMBER_AUTOVAC_THRESHOLD);
 }
 
 /*
@@ -2825,7 +2817,6 @@ find_multixact_start(MultiXactId multi, MultiXactOffset *result)
 	int64		pageno;
 	int			entryno;
 	int			slotno;
-	MultiXactOffset *offptr;
 
 	Assert(MultiXactState->finishedStartup);
 
@@ -2843,9 +2834,9 @@ find_multixact_start(MultiXactId multi, MultiXactOffset *result)
 
 	/* lock is acquired by SimpleLruReadPage_ReadOnly */
 	slotno = SimpleLruReadPage_ReadOnly(MultiXactOffsetCtl, pageno, multi);
-	offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
-	offptr += entryno;
-	offset = *offptr;
+
+	offset = MXOffsetRead(entryno, slotno);
+
 	LWLockRelease(SimpleLruGetBankLock(MultiXactOffsetCtl, pageno));
 
 	*result = offset;
@@ -2893,73 +2884,6 @@ GetMultiXactInfo(uint32 *multixacts, MultiXactOffset *members,
 	return true;
 }
 
-/*
- * Multixact members can be removed once the multixacts that refer to them
- * are older than every datminmxid.  autovacuum_multixact_freeze_max_age and
- * vacuum_multixact_freeze_table_age work together to make sure we never have
- * too many multixacts; we hope that, at least under normal circumstances,
- * this will also be sufficient to keep us from using too many offsets.
- * However, if the average multixact has many members, we might exhaust the
- * members space while still using few enough members that these limits fail
- * to trigger relminmxid advancement by VACUUM.  At that point, we'd have no
- * choice but to start failing multixact-creating operations with an error.
- *
- * To prevent that, if more than a threshold portion of the members space is
- * used, we effectively reduce autovacuum_multixact_freeze_max_age and
- * to a value just less than the number of multixacts in use.  We hope that
- * this will quickly trigger autovacuuming on the table or tables with the
- * oldest relminmxid, thus allowing datminmxid values to advance and removing
- * some members.
- *
- * As the fraction of the member space currently in use grows, we become
- * more aggressive in clamping this value.  That not only causes autovacuum
- * to ramp up, but also makes any manual vacuums the user issues more
- * aggressive.  This happens because vacuum_get_cutoffs() will clamp the
- * freeze table and the minimum freeze age cutoffs based on the effective
- * autovacuum_multixact_freeze_max_age this function returns.  In the worst
- * case, we'll claim the freeze_max_age to zero, and every vacuum of any
- * table will freeze every multixact.
- */
-int
-MultiXactMemberFreezeThreshold(void)
-{
-	MultiXactOffset members;
-	uint32		multixacts;
-	uint32		victim_multixacts;
-	double		fraction;
-	int			result;
-	MultiXactId oldestMultiXactId;
-	MultiXactOffset oldestOffset;
-
-	/* If we can't determine member space utilization, assume the worst. */
-	if (!GetMultiXactInfo(&multixacts, &members, &oldestMultiXactId, &oldestOffset))
-		return 0;
-
-	/* If member space utilization is low, no special action is required. */
-	if (members <= MULTIXACT_MEMBER_SAFE_THRESHOLD)
-		return autovacuum_multixact_freeze_max_age;
-
-	/*
-	 * Compute a target for relminmxid advancement.  The number of multixacts
-	 * we try to eliminate from the system is based on how far we are past
-	 * MULTIXACT_MEMBER_SAFE_THRESHOLD.
-	 */
-	fraction = (double) (members - MULTIXACT_MEMBER_SAFE_THRESHOLD) /
-		(MULTIXACT_MEMBER_DANGER_THRESHOLD - MULTIXACT_MEMBER_SAFE_THRESHOLD);
-	victim_multixacts = multixacts * fraction;
-
-	/* fraction could be > 1.0, but lowest possible freeze age is zero */
-	if (victim_multixacts > multixacts)
-		return 0;
-	result = multixacts - victim_multixacts;
-
-	/*
-	 * Clamp to autovacuum_multixact_freeze_max_age, so that we never make
-	 * autovacuum less aggressive than it would otherwise be.
-	 */
-	return Min(result, autovacuum_multixact_freeze_max_age);
-}
-
 typedef struct mxtruncinfo
 {
 	int64		earliestExistingPage;
@@ -3159,7 +3083,7 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 
 	elog(DEBUG1, "performing multixact truncation: "
 		 "offsets [%u, %u), offsets segments [%" PRIx64 ", %" PRIx64 "), "
-		 "members [%u, %u), members segments [%" PRIx64 ", %" PRIx64 ")",
+		 "members [%" PRIu64 ", %" PRIu64 "), members segments [%" PRIx64 ", %" PRIx64 ")",
 		 oldestMulti, newOldestMulti,
 		 MultiXactIdToOffsetSegment(oldestMulti),
 		 MultiXactIdToOffsetSegment(newOldestMulti),
@@ -3290,7 +3214,7 @@ MultiXactIdPrecedesOrEquals(MultiXactId multi1, MultiXactId multi2)
 static bool
 MultiXactOffsetPrecedes(MultiXactOffset offset1, MultiXactOffset offset2)
 {
-	int32		diff = (int32) (offset1 - offset2);
+	int64		diff = (int64) (offset1 - offset2);
 
 	return (diff < 0);
 }
@@ -3387,7 +3311,7 @@ multixact_redo(XLogReaderState *record)
 
 		elog(DEBUG1, "replaying multixact truncation: "
 			 "offsets [%u, %u), offsets segments [%" PRIx64 ", %" PRIx64 "), "
-			 "members [%u, %u), members segments [%" PRIx64 ", %" PRIx64 ")",
+			 "members [%" PRIu64 ", %" PRIu64 "), members segments [%" PRIx64 ", %" PRIx64 ")",
 			 xlrec.startTruncOff, xlrec.endTruncOff,
 			 MultiXactIdToOffsetSegment(xlrec.startTruncOff),
 			 MultiXactIdToOffsetSegment(xlrec.endTruncOff),
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 0a5ae5050c4..5a4b2c8b387 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -885,7 +885,7 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 							 U64FromFullTransactionId(checkPoint.nextXid),
 							 checkPoint.nextOid)));
 	ereport(DEBUG1,
-			(errmsg_internal("next MultiXactId: %u; next MultiXactOffset: %u",
+			(errmsg_internal("next MultiXactId: %u; next MultiXactOffset: %" PRIu64,
 							 checkPoint.nextMulti, checkPoint.nextMultiOffset)));
 	ereport(DEBUG1,
 			(errmsg_internal("oldest unfrozen transaction ID: %u, in database %u",
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index ed03e3bd50d..259ef60bd31 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -1147,7 +1147,7 @@ vacuum_get_cutoffs(Relation rel, const VacuumParams params,
 	 * normally autovacuum_multixact_freeze_max_age, but may be less if we are
 	 * short of multixact member space.
 	 */
-	effective_multixact_freeze_max_age = MultiXactMemberFreezeThreshold();
+	effective_multixact_freeze_max_age = autovacuum_multixact_freeze_max_age;
 
 	/*
 	 * Almost ready to set freeze output parameters; check if OldestXmin or
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 59ec45a4e96..212e6e3b13b 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -1151,7 +1151,7 @@ do_start_worker(void)
 
 	/* Also determine the oldest datminmxid we will consider. */
 	recentMulti = ReadNextMultiXactId();
-	multiForceLimit = recentMulti - MultiXactMemberFreezeThreshold();
+	multiForceLimit = recentMulti - autovacuum_multixact_freeze_max_age;
 	if (multiForceLimit < FirstMultiXactId)
 		multiForceLimit -= FirstMultiXactId;
 
@@ -1939,7 +1939,7 @@ do_autovacuum(void)
 	 * normally autovacuum_multixact_freeze_max_age, but may be less if we are
 	 * short of multixact member space.
 	 */
-	effective_multixact_freeze_max_age = MultiXactMemberFreezeThreshold();
+	effective_multixact_freeze_max_age = autovacuum_multixact_freeze_max_age;
 
 	/*
 	 * Find the pg_database entry and select the default freeze ages. We use
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index 10de058ce91..5295108ade3 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -264,7 +264,7 @@ main(int argc, char *argv[])
 		   ControlFile->checkPointCopy.nextOid);
 	printf(_("Latest checkpoint's NextMultiXactId:  %u\n"),
 		   ControlFile->checkPointCopy.nextMulti);
-	printf(_("Latest checkpoint's NextMultiOffset:  %u\n"),
+	printf(_("Latest checkpoint's NextMultiOffset:  %" PRIu64 "\n"),
 		   ControlFile->checkPointCopy.nextMultiOffset);
 	printf(_("Latest checkpoint's oldestXID:        %u\n"),
 		   ControlFile->checkPointCopy.oldestXid);
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index a89d72fc5cf..4e5eeced89d 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -267,7 +267,7 @@ main(int argc, char *argv[])
 
 			case 'O':
 				errno = 0;
-				set_mxoff = strtoul(optarg, &endptr, 0);
+				set_mxoff = strtou64(optarg, &endptr, 0);
 				if (endptr == optarg || *endptr != '\0' || errno != 0)
 				{
 					pg_log_error("invalid argument for option %s", "-O");
@@ -743,7 +743,7 @@ PrintControlValues(bool guessed)
 		   ControlFile.checkPointCopy.nextOid);
 	printf(_("Latest checkpoint's NextMultiXactId:  %u\n"),
 		   ControlFile.checkPointCopy.nextMulti);
-	printf(_("Latest checkpoint's NextMultiOffset:  %u\n"),
+	printf(_("Latest checkpoint's NextMultiOffset:  %" PRIu64 "\n"),
 		   ControlFile.checkPointCopy.nextMultiOffset);
 	printf(_("Latest checkpoint's oldestXID:        %u\n"),
 		   ControlFile.checkPointCopy.oldestXid);
@@ -817,7 +817,7 @@ PrintNewControlValues(void)
 
 	if (set_mxoff != -1)
 	{
-		printf(_("NextMultiOffset:                      %u\n"),
+		printf(_("NextMultiOffset:                      %" PRIu64 "\n"),
 			   ControlFile.checkPointCopy.nextMultiOffset);
 	}
 
diff --git a/src/bin/pg_resetwal/t/001_basic.pl b/src/bin/pg_resetwal/t/001_basic.pl
index d6bbbd0ceda..16b5a623900 100644
--- a/src/bin/pg_resetwal/t/001_basic.pl
+++ b/src/bin/pg_resetwal/t/001_basic.pl
@@ -213,7 +213,7 @@ push @cmd,
   sprintf("%d,%d", hex($files[0]) == 0 ? 3 : hex($files[0]), hex($files[-1]));
 
 @files = get_slru_files('pg_multixact/offsets');
-$mult = 32 * $blcksz / 4;
+$mult = 32 * ($blcksz - 8) / 4;
 # --multixact-ids argument is "new,old"
 push @cmd,
   '--multixact-ids' => sprintf("%d,%d",
diff --git a/src/include/access/multixact.h b/src/include/access/multixact.h
index 82e4bb90dd5..16a07723088 100644
--- a/src/include/access/multixact.h
+++ b/src/include/access/multixact.h
@@ -28,7 +28,7 @@
 
 #define MultiXactIdIsValid(multi) ((multi) != InvalidMultiXactId)
 
-#define MaxMultiXactOffset	((MultiXactOffset) 0xFFFFFFFF)
+#define MaxMultiXactOffset	UINT64CONST(0xFFFFFFFFFFFFFFFF)
 
 /*
  * Possible multixact lock modes ("status").  The first four modes are for
@@ -147,7 +147,6 @@ extern void MultiXactSetNextMXact(MultiXactId nextMulti,
 extern void MultiXactAdvanceNextMXact(MultiXactId minMulti,
 									  MultiXactOffset minMultiOffset);
 extern void MultiXactAdvanceOldest(MultiXactId oldestMulti, Oid oldestMultiDB);
-extern int	MultiXactMemberFreezeThreshold(void);
 
 extern void multixact_twophase_recover(FullTransactionId fxid, uint16 info,
 									   void *recdata, uint32 len);
diff --git a/src/include/c.h b/src/include/c.h
index 757dfff4782..bc92a6f4565 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -670,7 +670,7 @@ typedef uint32 SubTransactionId;
 /* MultiXactId must be equivalent to TransactionId, to fit in t_xmax */
 typedef TransactionId MultiXactId;
 
-typedef uint32 MultiXactOffset;
+typedef uint64 MultiXactOffset;
 
 typedef uint32 CommandId;
 
-- 
2.39.5 (Apple Git-154)
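
A note on the MultiXactOffsetPrecedes() hunk in the patch above: widening the cast from int32 to int64 keeps the same "modulo" comparison but moves the wraparound horizon from 2^31 to 2^63. Below is a minimal standalone sketch of that comparison. The name sketch_offset_precedes is hypothetical and not part of the patch. The cast of an out-of-range unsigned value to a signed type is assumed to behave as two's complement, which holds on the platforms PostgreSQL supports.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for MultiXactOffsetPrecedes() with 64-bit offsets.
 * Unsigned subtraction wraps modulo 2^64; reinterpreting the result as a
 * signed value tells us which operand is "behind" the other, as long as the
 * two offsets are less than 2^63 apart.
 */
static bool
sketch_offset_precedes(uint64_t offset1, uint64_t offset2)
{
	int64_t		diff = (int64_t) (offset1 - offset2);

	return diff < 0;
}
```

With this definition an offset just below UINT64_MAX still precedes a small offset such as 5, which is the behaviour the 32-bit version had around 0xFFFFFFFF.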

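The chunk helpers in the patch above (MultixactChunkEncode / MultixactChunkDecode) store eight offsets as one 64-bit base plus seven 3-byte deltas. An all-zero delta means "slot not set", and a flag bit in the delta marks a set slot. The following is only a sketch of that round trip, not the patch's code. The struct layout, the SKETCH_* names and both flag values (bit 63 for the base, bit 23 for a delta) are assumptions, and so is the handling of slot 0 on the encode side, because the start of the real encoder is not shown in this excerpt.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumed flag values; the patch defines its own BASE_HIGH_BIT and OFFSET_HIGH_BIT. */
#define SKETCH_BASE_HIGH_BIT	(UINT64_C(1) << 63)
#define SKETCH_OFFSET_HIGH_BIT	(UINT64_C(1) << 23)

/* Hypothetical stand-in for MultixactEncodedChunk. */
typedef struct SketchChunk
{
	uint64_t	base;
	uint8_t		offsets[7][3];
} SketchChunk;

static void
sketch_chunk_encode(const uint64_t values[8], SketchChunk *chunk)
{
	uint64_t	prev = 0;

	memset(chunk, 0, sizeof(*chunk));

	/* Assumption: a set slot 0 is stored in base, flagged with the high bit. */
	if (values[0] != 0)
	{
		prev = values[0];
		chunk->base = values[0] | SKETCH_BASE_HIGH_BIT;
	}

	for (int i = 1; i < 8; i++)
	{
		uint64_t	diff;

		if (values[i] == 0)
			continue;			/* unset slot: its three bytes stay zero */

		/* If slot 0 was unset, the first set value becomes the base. */
		if (chunk->base == 0)
			prev = chunk->base = values[i];

		diff = values[i] - prev;
		prev = values[i];

		assert(diff < SKETCH_OFFSET_HIGH_BIT);
		diff |= SKETCH_OFFSET_HIGH_BIT; /* marks the slot as set, even for delta 0 */

		chunk->offsets[i - 1][0] = diff & 0xFF;
		chunk->offsets[i - 1][1] = (diff >> 8) & 0xFF;
		chunk->offsets[i - 1][2] = (diff >> 16) & 0xFF;
	}
}

static void
sketch_chunk_decode(uint64_t values[8], const SketchChunk *chunk)
{
	uint64_t	prev;

	if (chunk->base & SKETCH_BASE_HIGH_BIT)
		prev = values[0] = chunk->base - SKETCH_BASE_HIGH_BIT;
	else
	{
		values[0] = 0;
		prev = chunk->base;
	}

	for (int i = 1; i < 8; i++)
	{
		uint64_t	diff = (uint64_t) chunk->offsets[i - 1][0] |
			((uint64_t) chunk->offsets[i - 1][1] << 8) |
			((uint64_t) chunk->offsets[i - 1][2] << 16);

		if (diff == 0)
			values[i] = 0;		/* slot was never written */
		else
		{
			values[i] = prev + (diff - SKETCH_OFFSET_HIGH_BIT);
			prev = values[i];
		}
	}
}
```

The sketch shares the limits implied by the asserts in the patch: set values within a chunk must be non-decreasing, neighbouring set values must differ by less than 2^23, and offsets must stay below 2^63 so that the base flag bit is free.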
v21-0002-Add-pg_upgarde-for-64-bit-multixact-offsets.patch (application/octet-stream)
From 8f5e88b2041e062a59ceaf692880821d2316dd0f Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Fri, 24 Oct 2025 10:58:37 +0300
Subject: [PATCH v21 2/3] Add pg_upgrade for 64 bit multixact offsets

Author: Maxim Orlov <orlovmg@gmail.com>
Author: Heikki Linnakangas <hlinnaka@iki.fi>
---
 src/backend/access/transam/multixact.c |  35 +--
 src/bin/pg_upgrade/Makefile            |   3 +
 src/bin/pg_upgrade/meson.build         |   3 +
 src/bin/pg_upgrade/multixact_new.c     | 253 +++++++++++++++++++++
 src/bin/pg_upgrade/multixact_new.h     |  31 +++
 src/bin/pg_upgrade/multixact_old.c     | 296 +++++++++++++++++++++++++
 src/bin/pg_upgrade/multixact_old.h     |  31 +++
 src/bin/pg_upgrade/pg_upgrade.c        | 108 ++++++++-
 src/bin/pg_upgrade/pg_upgrade.h        |   5 +
 src/bin/pg_upgrade/slru_io.c           | 240 ++++++++++++++++++++
 src/bin/pg_upgrade/slru_io.h           |  30 +++
 11 files changed, 1003 insertions(+), 32 deletions(-)
 create mode 100644 src/bin/pg_upgrade/multixact_new.c
 create mode 100644 src/bin/pg_upgrade/multixact_new.h
 create mode 100644 src/bin/pg_upgrade/multixact_old.c
 create mode 100644 src/bin/pg_upgrade/multixact_old.h
 create mode 100644 src/bin/pg_upgrade/slru_io.c
 create mode 100644 src/bin/pg_upgrade/slru_io.h

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 1f59587c42e..6a865ba2059 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1353,7 +1353,6 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
 	int			slotno;
 	MultiXactOffset offset;
 	int			length;
-	int			truelength;
 	MultiXactId oldestMXact;
 	MultiXactId nextMXact;
 	MultiXactId tmpMXact;
@@ -1452,15 +1451,6 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
 	 * we have just for this; the process in charge will signal the CV as soon
 	 * as it has finished writing the multixact offset.
 	 *
-	 * 3. Because GetNewMultiXactId increments offset zero to offset one to
-	 * handle case #2, there is an ambiguity near the point of offset
-	 * wraparound.  If we see next multixact's offset is one, is that our
-	 * multixact's actual endpoint, or did it end at zero with a subsequent
-	 * increment?  We handle this using the knowledge that if the zero'th
-	 * member slot wasn't filled, it'll contain zero, and zero isn't a valid
-	 * transaction ID so it can't be a multixact member.  Therefore, if we
-	 * read a zero from the members array, just ignore it.
-	 *
 	 * This is all pretty messy, but the mess occurs only in infrequent corner
 	 * cases, so it seems better than holding the MultiXactGenLock for a long
 	 * time on every multixact creation.
@@ -1544,6 +1534,9 @@ retry:
 	LWLockRelease(lock);
 	lock = NULL;
 
+	/* A multixid with zero members should not happen */
+	Assert(length > 0);
+
 	/*
 	 * If we slept above, clean up state; it's no longer needed.
 	 */
@@ -1552,7 +1545,6 @@ retry:
 
 	ptr = (MultiXactMember *) palloc(length * sizeof(MultiXactMember));
 
-	truelength = 0;
 	prev_pageno = -1;
 	for (int i = 0; i < length; i++, offset++)
 	{
@@ -1590,36 +1582,27 @@ retry:
 		xactptr = (TransactionId *)
 			(MultiXactMemberCtl->shared->page_buffer[slotno] + memberoff);
 
-		if (!TransactionIdIsValid(*xactptr))
-		{
-			/* Corner case 3: we must be looking at unused slot zero */
-			Assert(offset == 0);
-			continue;
-		}
+		Assert(TransactionIdIsValid(*xactptr));
 
 		flagsoff = MXOffsetToFlagsOffset(offset);
 		bshift = MXOffsetToFlagsBitShift(offset);
 		flagsptr = (uint32 *) (MultiXactMemberCtl->shared->page_buffer[slotno] + flagsoff);
 
-		ptr[truelength].xid = *xactptr;
-		ptr[truelength].status = (*flagsptr >> bshift) & MXACT_MEMBER_XACT_BITMASK;
-		truelength++;
+		ptr[i].xid = *xactptr;
+		ptr[i].status = (*flagsptr >> bshift) & MXACT_MEMBER_XACT_BITMASK;
 	}
 
 	LWLockRelease(lock);
 
-	/* A multixid with zero members should not happen */
-	Assert(truelength > 0);
-
 	/*
 	 * Copy the result into the local cache.
 	 */
-	mXactCachePut(multi, truelength, ptr);
+	mXactCachePut(multi, length, ptr);
 
 	debug_elog3(DEBUG2, "GetMembers: no cache for %s",
-				mxid_to_string(multi, truelength, ptr));
+				mxid_to_string(multi, length, ptr));
 	*members = ptr;
-	return truelength;
+	return length;
 }
 
 /*
diff --git a/src/bin/pg_upgrade/Makefile b/src/bin/pg_upgrade/Makefile
index 69fcf593cae..42995d53b0b 100644
--- a/src/bin/pg_upgrade/Makefile
+++ b/src/bin/pg_upgrade/Makefile
@@ -18,11 +18,14 @@ OBJS = \
 	file.o \
 	function.o \
 	info.o \
+	multixact_new.o \
+	multixact_old.o \
 	option.o \
 	parallel.o \
 	pg_upgrade.o \
 	relfilenumber.o \
 	server.o \
+	slru_io.o \
 	tablespace.o \
 	task.o \
 	util.o \
diff --git a/src/bin/pg_upgrade/meson.build b/src/bin/pg_upgrade/meson.build
index ac992f0d14b..3e46c4512cf 100644
--- a/src/bin/pg_upgrade/meson.build
+++ b/src/bin/pg_upgrade/meson.build
@@ -8,11 +8,14 @@ pg_upgrade_sources = files(
   'file.c',
   'function.c',
   'info.c',
+  'multixact_new.c',
+  'multixact_old.c',
   'option.c',
   'parallel.c',
   'pg_upgrade.c',
   'relfilenumber.c',
   'server.c',
+  'slru_io.c',
   'tablespace.c',
   'task.c',
   'util.c',
diff --git a/src/bin/pg_upgrade/multixact_new.c b/src/bin/pg_upgrade/multixact_new.c
new file mode 100644
index 00000000000..d7a58a75de1
--- /dev/null
+++ b/src/bin/pg_upgrade/multixact_new.c
@@ -0,0 +1,253 @@
+/*
+ * multixact_new.c
+ *
+ * Rewrite pre-v19 multixacts to new format with 64-bit MultiXactOffsets
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ * src/bin/pg_upgrade/multixact_new.c
+ */
+
+#include "multixact_new.h"
+
+/*
+ * NOTE: Below are a bunch of definitions and simple inline functions that are
+ * copy-pasted from multixact.c
+ */
+typedef int32 ShortMultiXactOffset;
+
+/* We need four bytes per offset, 8 bytes for the base */
+#define MULTIXACT_OFFSETS_PER_PAGE		\
+	((BLCKSZ - sizeof(MultiXactOffset)) / sizeof(ShortMultiXactOffset))
+
+static inline int64
+MultiXactIdToOffsetPage(MultiXactId multi)
+{
+	return multi / MULTIXACT_OFFSETS_PER_PAGE;
+}
+
+static inline int
+MultiXactIdToOffsetEntry(MultiXactId multi)
+{
+	return multi % MULTIXACT_OFFSETS_PER_PAGE;
+}
+
+/* We need eight bits per xact, so one xact fits in a byte */
+#define MXACT_MEMBER_BITS_PER_XACT			8
+#define MXACT_MEMBER_FLAGS_PER_BYTE			1
+#define MXACT_MEMBER_XACT_BITMASK	((1 << MXACT_MEMBER_BITS_PER_XACT) - 1)
+
+/* how many full bytes of flags are there in a group? */
+#define MULTIXACT_FLAGBYTES_PER_GROUP		4
+#define MULTIXACT_MEMBERS_PER_MEMBERGROUP	\
+	(MULTIXACT_FLAGBYTES_PER_GROUP * MXACT_MEMBER_FLAGS_PER_BYTE)
+/* size in bytes of a complete group */
+#define MULTIXACT_MEMBERGROUP_SIZE \
+	(sizeof(TransactionId) * MULTIXACT_MEMBERS_PER_MEMBERGROUP + MULTIXACT_FLAGBYTES_PER_GROUP)
+#define MULTIXACT_MEMBERGROUPS_PER_PAGE (BLCKSZ / MULTIXACT_MEMBERGROUP_SIZE)
+#define MULTIXACT_MEMBERS_PER_PAGE	\
+	(MULTIXACT_MEMBERGROUPS_PER_PAGE * MULTIXACT_MEMBERS_PER_MEMBERGROUP)
+
+/*
+ * Because the number of items per page is not a divisor of the last item
+ * number (member 0xFFFFFFFF), the last segment does not use the maximum number
+ * of pages, and moreover the last used page therein does not use the same
+ * number of items as previous pages.  (Another way to say it is that the
+ * 0xFFFFFFFF member is somewhere in the middle of the last page, so the page
+ * has some empty space after that item.)
+ *
+ * This constant is the number of members in the last page of the last segment.
+ */
+#define MAX_MEMBERS_IN_LAST_MEMBERS_PAGE \
+		((uint32) ((0xFFFFFFFF % MULTIXACT_MEMBERS_PER_PAGE) + 1))
+
+/* page in which a member is to be found */
+static inline int64
+MXOffsetToMemberPage(MultiXactOffset offset)
+{
+	return offset / MULTIXACT_MEMBERS_PER_PAGE;
+}
+
+/* Location (byte offset within page) of flag word for a given member */
+static inline int
+MXOffsetToFlagsOffset(MultiXactOffset offset)
+{
+	MultiXactOffset group = offset / MULTIXACT_MEMBERS_PER_MEMBERGROUP;
+	int			grouponpg = group % MULTIXACT_MEMBERGROUPS_PER_PAGE;
+	int			byteoff = grouponpg * MULTIXACT_MEMBERGROUP_SIZE;
+
+	return byteoff;
+}
+
+static inline int
+MXOffsetToFlagsBitShift(MultiXactOffset offset)
+{
+	int			member_in_group = offset % MULTIXACT_MEMBERS_PER_MEMBERGROUP;
+	int			bshift = member_in_group * MXACT_MEMBER_BITS_PER_XACT;
+
+	return bshift;
+}
+
+/* Location (byte offset within page) of TransactionId of given member */
+static inline int
+MXOffsetToMemberOffset(MultiXactOffset offset)
+{
+	int			member_in_group = offset % MULTIXACT_MEMBERS_PER_MEMBERGROUP;
+
+	return MXOffsetToFlagsOffset(offset) +
+		MULTIXACT_FLAGBYTES_PER_GROUP +
+		member_in_group * sizeof(TransactionId);
+}
+
+/*
+ * To avoid diving deep into address arithmetic, we declare an auxiliary
+ * structure that accesses the MultiXactOffset SLRU page.
+ */
+typedef struct MultiXactOffsetSLRUPage
+{
+	MultiXactOffset			base;
+	ShortMultiXactOffset	offset[FLEXIBLE_ARRAY_MEMBER];
+} MultiXactOffsetSLRUPage;
+
+static inline void
+MXOffsetWrite(char *buf, int entryno, MultiXactOffset offset)
+{
+	MultiXactOffsetSLRUPage *page = (MultiXactOffsetSLRUPage *) buf;
+
+	if (page->base != 0)
+		page->offset[entryno] = offset - page->base;
+	else
+	{
+		page->base = offset;
+		page->offset[entryno] = 0;
+	}
+
+	/*
+	 * We need to distinguish a written delta of zero from a slot that has not
+	 * been written yet.  See case 2 in GetMultiXactIdMembers.
+	 *
+	 * So, mark this offset as initialized.
+	 */
+	page->offset[entryno] ^= 0x80000000;
+}
+
+MultiXactWriter *
+AllocMultiXactWrite(char *pgdata, MultiXactId firstMulti,
+					MultiXactOffset firstOffset)
+{
+	MultiXactWriter *state = pg_malloc(sizeof(*state));
+	char				dir[MAXPGPATH] = {0};
+
+	state->nextMXact = firstMulti;
+	state->nextOffset = firstOffset;
+
+	pg_sprintf(dir, "%s/pg_multixact/offsets", pgdata);
+	state->offset = AllocSlruWrite(dir, false);
+
+	pg_sprintf(dir, "%s/pg_multixact/members", pgdata);
+	state->members = AllocSlruWrite(dir, true /* use long segment names */);
+
+	return state;
+}
+
+/*
+ * Simplified copy of the corresponding server function
+ */
+MultiXactId
+GetNewMultiXactId(MultiXactWriter *state, int nmembers, MultiXactOffset *offset)
+{
+	MultiXactId		result;
+
+	/* Handle wraparound of the nextMXact counter */
+	if (state->nextMXact < FirstMultiXactId)
+		state->nextMXact = FirstMultiXactId;
+
+	/* Assign the MXID */
+	result = state->nextMXact;
+
+	/* Reserve the members space, similarly to above. */
+	*offset = state->nextOffset;
+
+	/*
+	 * Advance counters.  As in GetNewTransactionId(), this must not happen
+	 * until after file extension has succeeded!
+	 *
+	 * We don't care about MultiXactId wraparound here; it will be handled by
+	 * the next iteration.  But note that nextMXact may be InvalidMultiXactId
+	 * or the first value on a segment-beginning page after this routine
+	 * exits, so anyone else looking at the variable must be prepared to deal
+	 * with either case.  Similarly, nextOffset may be zero, but we won't use
+	 * that as the actual start offset of the next multixact.
+	 */
+	(state->nextMXact)++;
+
+	state->nextOffset += nmembers;
+
+	return result;
+}
+
+/*
+ * Write a new multixact with members.
+ *
+ * Simplified version of the corresponding server function, hence the name.
+ */
+void
+RecordNewMultiXact(MultiXactWriter *state, MultiXactOffset offset,
+				   MultiXactId multi, int nmembers, MultiXactMember *members)
+{
+	int64		pageno;
+	int64		prev_pageno;
+	int			entryno,
+				i;
+	char	   *buf;
+
+	pageno = MultiXactIdToOffsetPage(multi);
+	entryno = MultiXactIdToOffsetEntry(multi);
+
+	buf = SlruWriteSwitchPage(state->offset, pageno);
+	MXOffsetWrite(buf, entryno, offset);
+
+	prev_pageno = -1;
+
+	for (i = 0; i < nmembers; i++, offset++)
+	{
+		TransactionId *memberptr;
+		uint32	   *flagsptr;
+		uint32		flagsval;
+		int			bshift;
+		int			flagsoff;
+		int			memberoff;
+
+		Assert(members[i].status <= MultiXactStatusUpdate);
+
+		pageno = MXOffsetToMemberPage(offset);
+		memberoff = MXOffsetToMemberOffset(offset);
+		flagsoff = MXOffsetToFlagsOffset(offset);
+		bshift = MXOffsetToFlagsBitShift(offset);
+
+		if (pageno != prev_pageno)
+		{
+			buf = SlruWriteSwitchPage(state->members, pageno);
+			prev_pageno = pageno;
+		}
+
+		memberptr = (TransactionId *) (buf + memberoff);
+
+		*memberptr = members[i].xid;
+
+		flagsptr = (uint32 *) (buf + flagsoff);
+
+		flagsval = *flagsptr;
+		flagsval &= ~(((1 << MXACT_MEMBER_BITS_PER_XACT) - 1) << bshift);
+		flagsval |= (members[i].status << bshift);
+		*flagsptr = flagsval;
+	}
+}
+
+void
+FreeMultiXactWrite(MultiXactWriter *state)
+{
+	FreeSlruWrite(state->offset);
+	FreeSlruWrite(state->members);
+
+	pfree(state);
+}
diff --git a/src/bin/pg_upgrade/multixact_new.h b/src/bin/pg_upgrade/multixact_new.h
new file mode 100644
index 00000000000..33d5d1b8222
--- /dev/null
+++ b/src/bin/pg_upgrade/multixact_new.h
@@ -0,0 +1,31 @@
+/*
+ * multixact_new.h
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ * src/bin/pg_upgrade/multixact_new.h
+ */
+
+#include "postgres_fe.h"
+
+#include "access/multixact.h"
+
+#include "slru_io.h"
+
+typedef struct MultiXactWriter
+{
+	MultiXactId			nextMXact;
+	MultiXactOffset		nextOffset;
+
+	SlruSegState	   *offset;
+	SlruSegState	   *members;
+} MultiXactWriter;
+
+extern MultiXactWriter *AllocMultiXactWrite(char *pgdata,
+											MultiXactId firstMulti,
+											MultiXactOffset firstOffset);
+extern MultiXactId GetNewMultiXactId(MultiXactWriter *state, int nmembers,
+									 MultiXactOffset *offset);
+extern void RecordNewMultiXact(MultiXactWriter *state, MultiXactOffset offset,
+							   MultiXactId multi, int nmembers,
+							   MultiXactMember *members);
+extern void FreeMultiXactWrite(MultiXactWriter *writer);
diff --git a/src/bin/pg_upgrade/multixact_old.c b/src/bin/pg_upgrade/multixact_old.c
new file mode 100644
index 00000000000..6cc384d2cf2
--- /dev/null
+++ b/src/bin/pg_upgrade/multixact_old.c
@@ -0,0 +1,296 @@
+/*
+ * multixact_old.c
+ *
+ * Read pre-v19 multixacts, which use 32-bit MultiXactOffsets
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ * src/bin/pg_upgrade/multixact_old.c
+ */
+
+#include "multixact_old.h"
+
+#include "pg_upgrade.h"
+
+/*
+ * NOTE: below are a bunch of definitions and simple static inline functions
+ * that are copy-pasted from multixact.c from version 18.  The only difference
+ * is that we use the OldMultiXactOffset type equal to uint32 instead of
+ * MultiXactOffset which became uint64.
+ */
+
+/* We need four bytes per offset; the old format has no per-page base. */
+#define MULTIXACT_OFFSETS_PER_PAGE (BLCKSZ / sizeof(OldMultiXactOffset))
+
+static inline int64
+MultiXactIdToOffsetPage(MultiXactId multi)
+{
+	return multi / MULTIXACT_OFFSETS_PER_PAGE;
+}
+
+static inline int
+MultiXactIdToOffsetEntry(MultiXactId multi)
+{
+	return multi % MULTIXACT_OFFSETS_PER_PAGE;
+}
+
+/*
+ * The situation for members is a bit more complex: we store one byte of
+ * additional flag bits for each TransactionId.  To do this without getting
+ * into alignment issues, we store four bytes of flags, and then the
+ * corresponding 4 Xids.  Each such 5-word (20-byte) set we call a "group", and
+ * are stored as a whole in pages.  Thus, with 8kB BLCKSZ, we keep 409 groups
+ * per page.  This wastes 12 bytes per page, but that's OK -- simplicity (and
+ * performance) trumps space efficiency here.
+ *
+ * Note that the "offset" macros work with byte offset, not array indexes, so
+ * arithmetic must be done using "char *" pointers.
+ */
+/* We need eight bits per xact, so one xact fits in a byte */
+#define MXACT_MEMBER_BITS_PER_XACT			8
+#define MXACT_MEMBER_FLAGS_PER_BYTE			1
+#define MXACT_MEMBER_XACT_BITMASK	((1 << MXACT_MEMBER_BITS_PER_XACT) - 1)
+
+/* how many full bytes of flags are there in a group? */
+#define MULTIXACT_FLAGBYTES_PER_GROUP		4
+#define MULTIXACT_MEMBERS_PER_MEMBERGROUP	\
+	(MULTIXACT_FLAGBYTES_PER_GROUP * MXACT_MEMBER_FLAGS_PER_BYTE)
+/* size in bytes of a complete group */
+#define MULTIXACT_MEMBERGROUP_SIZE \
+	(sizeof(TransactionId) * MULTIXACT_MEMBERS_PER_MEMBERGROUP + MULTIXACT_FLAGBYTES_PER_GROUP)
+#define MULTIXACT_MEMBERGROUPS_PER_PAGE (BLCKSZ / MULTIXACT_MEMBERGROUP_SIZE)
+#define MULTIXACT_MEMBERS_PER_PAGE	\
+	(MULTIXACT_MEMBERGROUPS_PER_PAGE * MULTIXACT_MEMBERS_PER_MEMBERGROUP)
+
+/* page in which a member is to be found */
+static inline int64
+MXOffsetToMemberPage(OldMultiXactOffset offset)
+{
+	return offset / MULTIXACT_MEMBERS_PER_PAGE;
+}
+
+/* Location (byte offset within page) of flag word for a given member */
+static inline int
+MXOffsetToFlagsOffset(OldMultiXactOffset offset)
+{
+	OldMultiXactOffset group = offset / MULTIXACT_MEMBERS_PER_MEMBERGROUP;
+	int			grouponpg = group % MULTIXACT_MEMBERGROUPS_PER_PAGE;
+	int			byteoff = grouponpg * MULTIXACT_MEMBERGROUP_SIZE;
+
+	return byteoff;
+}
+
+/* Location (byte offset within page) of TransactionId of given member */
+static inline int
+MXOffsetToMemberOffset(OldMultiXactOffset offset)
+{
+	int			member_in_group = offset % MULTIXACT_MEMBERS_PER_MEMBERGROUP;
+
+	return MXOffsetToFlagsOffset(offset) +
+		MULTIXACT_FLAGBYTES_PER_GROUP +
+		member_in_group * sizeof(TransactionId);
+}
+
+static inline int
+MXOffsetToFlagsBitShift(OldMultiXactOffset offset)
+{
+	int			member_in_group = offset % MULTIXACT_MEMBERS_PER_MEMBERGROUP;
+	int			bshift = member_in_group * MXACT_MEMBER_BITS_PER_XACT;
+
+	return bshift;
+}
+
+/*
+ * Construct reader of old multixacts.
+ *
+ * Returns the malloced memory used by all other calls in this module.
+ */
+OldMultiXactReader *
+AllocOldMultiXactRead(char *pgdata, MultiXactId nextMulti,
+					  OldMultiXactOffset nextOffset)
+{
+	OldMultiXactReader *state = pg_malloc(sizeof(*state));
+	char				dir[MAXPGPATH] = {0};
+
+	state->nextMXact = nextMulti;
+	state->nextOffset = nextOffset;
+
+	pg_sprintf(dir, "%s/pg_multixact/offsets", pgdata);
+	state->offset = AllocSlruRead(dir);
+
+	pg_sprintf(dir, "%s/pg_multixact/members", pgdata);
+	state->members = AllocSlruRead(dir);
+
+	return state;
+}
+
+/*
+ * This is a simplified version of the GetMultiXactIdMembers() server function.
+ *
+ * - Only return the updating member, if any. Upgrade only cares about the
+ *   updaters. If there is no updating member, return the first locking-only
+ *   member. We don't have any way to represent "no members", but we also don't
+ *   need to preserve all the locking members.
+ *
+ * - We don't need to worry about locking and some corner cases because there's
+ *   no concurrent activity.
+ */
+void
+GetOldMultiXactIdSingleMember(OldMultiXactReader *state, MultiXactId multi,
+							  TransactionId *result, MultiXactStatus *status)
+{
+	MultiXactId		nextMXact,
+					tmpMXact;
+	OldMultiXactOffset nextOffset;
+	int64			pageno,
+					prev_pageno;
+	int				entryno,
+					length;
+	char		   *buf;
+	OldMultiXactOffset *offptr,
+						offset;
+	TransactionId	result_xid = InvalidTransactionId;
+	bool			result_isupdate = false;
+
+	nextMXact = state->nextMXact;
+	nextOffset = state->nextOffset;
+
+	/*
+	 * See GetMultiXactIdMembers in multixact.c
+	 *
+	 * Find out the offset at which we need to start reading MultiXactMembers
+	 * and the number of members in the multixact.  We determine the latter as
+	 * the difference between this multixact's starting offset and the next
+	 * one's.  However, there are some corner cases to worry about:
+	 *
+	 * 1. This multixact may be the latest one created, in which case there is
+	 * no next one to look at.  In this case the nextOffset value we just
+	 * saved is the correct endpoint.
+	 *
+	 * 2. The next multixact may still be in process of being filled in...
+	 * This cannot happen during upgrade.
+	 *
+	 * 3. Because GetNewMultiXactId increments offset zero to offset one to
+	 * handle case #2, there is an ambiguity near the point of offset
+	 * wraparound.  If we see next multixact's offset is one, is that our
+	 * multixact's actual endpoint, or did it end at zero with a subsequent
+	 * increment?  We handle this using the knowledge that if the zero'th
+	 * member slot wasn't filled, it'll contain zero, and zero isn't a valid
+	 * transaction ID so it can't be a multixact member.  Therefore, if we
+	 * read a zero from the members array, just ignore it.
+	 */
+
+	pageno = MultiXactIdToOffsetPage(multi);
+	entryno = MultiXactIdToOffsetEntry(multi);
+
+	buf = SlruReadSwitchPage(state->offset, pageno);
+	offptr = (OldMultiXactOffset *) buf;
+	offptr += entryno;
+	offset = *offptr;
+
+	Assert(offset != 0);
+
+	/*
+	 * Use the same increment rule as GetNewMultiXactId(), that is, don't
+	 * handle wraparound explicitly until needed.
+	 */
+	tmpMXact = multi + 1;
+
+	if (nextMXact == tmpMXact)
+	{
+		/* Corner case 1: there is no next multixact */
+		length = nextOffset - offset;
+	}
+	else
+	{
+		OldMultiXactOffset nextMXOffset;
+
+		/* handle wraparound if needed */
+		if (tmpMXact < FirstMultiXactId)
+			tmpMXact = FirstMultiXactId;
+
+		prev_pageno = pageno;
+
+		pageno = MultiXactIdToOffsetPage(tmpMXact);
+		entryno = MultiXactIdToOffsetEntry(tmpMXact);
+
+		if (pageno != prev_pageno)
+			buf = SlruReadSwitchPage(state->offset, pageno);
+
+		offptr = (OldMultiXactOffset *) buf;
+		offptr += entryno;
+		nextMXOffset = *offptr;
+
+		/*
+		 * Corner case 2: next multixact is still being filled in, this must
+		 * not happen during upgrade.
+		 */
+		Assert(nextMXOffset != 0);
+
+		length = nextMXOffset - offset;
+	}
+
+	prev_pageno = -1;
+	for (int i = 0; i < length; i++, offset++)
+	{
+		TransactionId *xactptr;
+		uint32	   *flagsptr;
+		int			flagsoff;
+		int			bshift;
+		int			memberoff;
+		MultiXactStatus st;
+
+		pageno = MXOffsetToMemberPage(offset);
+		memberoff = MXOffsetToMemberOffset(offset);
+
+		if (pageno != prev_pageno)
+		{
+			buf = SlruReadSwitchPage(state->members, pageno);
+			prev_pageno = pageno;
+		}
+
+		xactptr = (TransactionId *) (buf + memberoff);
+		if (!TransactionIdIsValid(*xactptr))
+		{
+			/* Corner case 3: we must be looking at unused slot zero */
+			Assert(offset == 0);
+			continue;
+		}
+
+		flagsoff = MXOffsetToFlagsOffset(offset);
+		bshift = MXOffsetToFlagsBitShift(offset);
+		flagsptr = (uint32 *) (buf + flagsoff);
+
+		st = (*flagsptr >> bshift) & MXACT_MEMBER_XACT_BITMASK;
+
+		/* Verify that there is a single update Xid among the given members. */
+		if (ISUPDATE_from_mxstatus(st))
+		{
+			if (result_isupdate)
+				pg_fatal("multixact %u has more than one updating member",
+						 multi);
+			result_xid = *xactptr;
+			result_isupdate = true;
+		}
+		else if (!TransactionIdIsValid(result_xid))
+			result_xid = *xactptr;
+	}
+
+	/* A multixid with zero members should not happen */
+	Assert(TransactionIdIsValid(result_xid));
+
+	*result = result_xid;
+	*status = result_isupdate ? MultiXactStatusUpdate :
+								MultiXactStatusForKeyShare;
+}
+
+/*
+ * Frees the malloced reader.
+ */
+void
+FreeOldMultiXactReader(OldMultiXactReader *state)
+{
+	FreeSlruRead(state->offset);
+	FreeSlruRead(state->members);
+
+	pfree(state);
+}
diff --git a/src/bin/pg_upgrade/multixact_old.h b/src/bin/pg_upgrade/multixact_old.h
new file mode 100644
index 00000000000..8d4659ba6a0
--- /dev/null
+++ b/src/bin/pg_upgrade/multixact_old.h
@@ -0,0 +1,31 @@
+/*
+ * multixact_old.h
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ * src/bin/pg_upgrade/multixact_old.h
+ */
+
+#include "postgres_fe.h"
+
+#include "access/multixact.h"
+#include "slru_io.h"
+
+typedef uint32 OldMultiXactOffset;
+
+typedef struct OldMultiXactReader
+{
+	MultiXactId			nextMXact;
+	OldMultiXactOffset	nextOffset;
+
+	SlruSegState	   *offset;
+	SlruSegState	   *members;
+} OldMultiXactReader;
+
+extern OldMultiXactReader *AllocOldMultiXactRead(char *pgdata,
+												 MultiXactId nextMulti,
+												 OldMultiXactOffset nextOffset);
+extern void GetOldMultiXactIdSingleMember(OldMultiXactReader *state,
+										  MultiXactId multi,
+										  TransactionId *result,
+										  MultiXactStatus *status);
+extern void FreeOldMultiXactReader(OldMultiXactReader *reader);
diff --git a/src/bin/pg_upgrade/pg_upgrade.c b/src/bin/pg_upgrade/pg_upgrade.c
index 490e98fa26f..5432c03a2b0 100644
--- a/src/bin/pg_upgrade/pg_upgrade.c
+++ b/src/bin/pg_upgrade/pg_upgrade.c
@@ -49,6 +49,8 @@
 #include "common/restricted_token.h"
 #include "fe_utils/string_utils.h"
 #include "pg_upgrade.h"
+#include "multixact_old.h"
+#include "multixact_new.h"
 
 /*
  * Maximum number of pg_restore actions (TOC entries) to process within one
@@ -769,6 +771,82 @@ copy_subdir_files(const char *old_subdir, const char *new_subdir)
 	check_ok();
 }
 
+/*
+ * Convert pg_multixact/offsets and /members to new format with 64-bit offsets.
+ */
+static void
+convert_multixacts(MultiXactId *new_nxtmulti, MultiXactOffset *new_nxtmxoff)
+{
+	MultiXactId			oldest_multi,
+						next_multi;
+	OldMultiXactReader *old_reader;
+	MultiXactWriter	   *new_writer;
+
+	old_reader = AllocOldMultiXactRead(old_cluster.pgdata,
+									   old_cluster.controldata.chkpnt_nxtmulti,
+									   old_cluster.controldata.chkpnt_nxtmxoff);
+	new_writer = AllocMultiXactWrite(new_cluster.pgdata,
+									 old_cluster.controldata.chkpnt_oldstMulti,
+									 1 /* see below */);
+
+	oldest_multi = old_cluster.controldata.chkpnt_oldstMulti;
+	next_multi = old_cluster.controldata.chkpnt_nxtmulti;
+
+	/* handle wraparound */
+	if (next_multi < FirstMultiXactId)
+		next_multi = FirstMultiXactId;
+
+	/*
+	 * Read multixids from old files one by one, and write them back in the new
+	 * format.
+	 *
+	 * The locking-only XIDs that may be part of multi-xids don't matter after
+	 * upgrade, as there can be no transactions running across upgrade.  So as
+	 * a little optimization, we only read one member from each multixid: the
+	 * updating one, or if there was no update, arbitrarily the first
+	 * locking xid.
+	 */
+	for (MultiXactId multi = oldest_multi; multi != next_multi;)
+	{
+		TransactionId		xid;
+		MultiXactStatus		status;
+		MultiXactMember		member;
+		MultiXactId			new_multi PG_USED_FOR_ASSERTS_ONLY;
+		MultiXactOffset		offset;
+
+		/* Read the old multixid */
+		GetOldMultiXactIdSingleMember(old_reader, multi, &xid, &status);
+
+		/* Write it out in new format */
+		member.xid = xid;
+		member.status = status;
+		new_multi = GetNewMultiXactId(new_writer, 1, &offset);
+
+		Assert(new_multi == multi);
+
+		RecordNewMultiXact(new_writer, offset, multi, 1, &member);
+
+		multi++;
+		/* handle wraparound */
+		if (multi < FirstMultiXactId)
+			multi = FirstMultiXactId;
+	}
+
+	/*
+	 * Update the nextMXact/Offset values in the control file to match what we
+	 * wrote.  The nextMXact should be unchanged, but because we ignored the
+	 * locking-only members, the nextOffset will be different.
+	 */
+	Assert(new_writer->nextMXact == next_multi);
+
+	*new_nxtmulti = next_multi;
+	*new_nxtmxoff = new_writer->nextOffset;
+
+	/* Release resources */
+	FreeMultiXactWrite(new_writer);
+	FreeOldMultiXactReader(old_reader);
+}
+
 static void
 copy_xact_xlog_xid(void)
 {
@@ -816,8 +894,28 @@ copy_xact_xlog_xid(void)
 	if (old_cluster.controldata.cat_ver >= MULTIXACT_FORMATCHANGE_CAT_VER &&
 		new_cluster.controldata.cat_ver >= MULTIXACT_FORMATCHANGE_CAT_VER)
 	{
-		copy_subdir_files("pg_multixact/offsets", "pg_multixact/offsets");
-		copy_subdir_files("pg_multixact/members", "pg_multixact/members");
+		MultiXactId		new_nxtmulti = old_cluster.controldata.chkpnt_nxtmulti;
+		MultiXactOffset new_nxtmxoff = old_cluster.controldata.chkpnt_nxtmxoff;
+
+		/*
+		 * If the old server is before the MULTIXACTOFFSET_FORMATCHANGE_CAT_VER
+		 * it must have 32-bit multixid offsets, thus it should be converted.
+		 */
+		if (old_cluster.controldata.cat_ver < MULTIXACTOFFSET_FORMATCHANGE_CAT_VER &&
+			new_cluster.controldata.cat_ver >= MULTIXACTOFFSET_FORMATCHANGE_CAT_VER)
+		{
+			remove_new_subdir("pg_multixact/members", false);
+			remove_new_subdir("pg_multixact/offsets", false);
+
+			prep_status("Converting pg_multixact/offsets to 64-bit");
+			convert_multixacts(&new_nxtmulti, &new_nxtmxoff);
+			check_ok();
+		}
+		else
+		{
+			copy_subdir_files("pg_multixact/offsets", "pg_multixact/offsets");
+			copy_subdir_files("pg_multixact/members", "pg_multixact/members");
+		}
 
 		prep_status("Setting next multixact ID and offset for new cluster");
 
@@ -826,10 +924,8 @@ copy_xact_xlog_xid(void)
 		 * counters here and the oldest multi present on system.
 		 */
 		exec_prog(UTILITY_LOG_FILE, NULL, true, true,
-				  "\"%s/pg_resetwal\" -O %u -m %u,%u \"%s\"",
-				  new_cluster.bindir,
-				  old_cluster.controldata.chkpnt_nxtmxoff,
-				  old_cluster.controldata.chkpnt_nxtmulti,
+				  "\"%s/pg_resetwal\" -O %" PRIu64 " -m %u,%u \"%s\"",
+				  new_cluster.bindir, new_nxtmxoff, new_nxtmulti,
 				  old_cluster.controldata.chkpnt_oldstMulti,
 				  new_cluster.pgdata);
 		check_ok();
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index e86336f4be9..127b2cb00fa 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -114,6 +114,11 @@ extern char *output_files[];
  */
 #define MULTIXACT_FORMATCHANGE_CAT_VER 201301231
 
+/*
+ * Switch from 32-bit to 64-bit for multixid offsets.
+ */
+#define MULTIXACTOFFSET_FORMATCHANGE_CAT_VER 999999999
+
 /*
  * large object chunk size added to pg_controldata,
  * commit 5f93c37805e7485488480916b4585e098d3cc883
diff --git a/src/bin/pg_upgrade/slru_io.c b/src/bin/pg_upgrade/slru_io.c
new file mode 100644
index 00000000000..4e823199303
--- /dev/null
+++ b/src/bin/pg_upgrade/slru_io.c
@@ -0,0 +1,240 @@
+/*
+ * slru_io.c
+ *
+ * Routines for reading and writing SLRU files during upgrade.
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ * src/bin/pg_upgrade/slru_io.c
+ */
+
+#include "postgres_fe.h"
+
+#include <fcntl.h>
+
+#include "pg_upgrade.h"
+#include "slru_io.h"
+
+#include "common/fe_memutils.h"
+#include "common/file_perm.h"
+#include "common/file_utils.h"
+#include "port/pg_iovec.h"
+
+/*
+ * State for reading or writing an SLRU, with a one page buffer.
+ */
+typedef struct SlruSegState
+{
+	bool		writing;
+	bool		long_segment_names;
+
+	char	   *dir;
+	char	   *fn;
+	int			fd;
+	int64		segno;
+	uint64		pageno;
+
+	PGAlignedBlock buf;
+} SlruSegState;
+
+static inline SlruSegState *
+AllocSlruSegState(char *dir)
+{
+	SlruSegState *state = pg_malloc(sizeof(*state));
+
+	state->segno = -1;
+	state->pageno = 0;
+	state->dir = pstrdup(dir);
+	state->fd = -1;
+	state->fn = NULL;
+
+	return state;
+}
+
+static inline void
+SlruFlush(SlruSegState *state)
+{
+	struct iovec	iovec = {
+		.iov_base = &state->buf,
+		.iov_len = BLCKSZ,
+	};
+	off_t			offset;
+
+	if (state->segno == -1)
+		return;
+
+	offset = (state->pageno % SLRU_PAGES_PER_SEGMENT) * BLCKSZ;
+
+	if (pg_pwritev_with_retry(state->fd, &iovec, 1, offset) < 0)
+		pg_fatal("could not write file \"%s\": %m", state->fn);
+}
+
+/*
+ * Create slru reader for dir.
+ *
+ * Returns the malloced memory used by all other read calls in this module.
+ */
+SlruSegState *
+AllocSlruRead(char *dir)
+{
+	SlruSegState *state = AllocSlruSegState(dir);
+
+	state->writing = false;
+
+	return state;
+}
+
+/*
+ * Open given page for reading.
+ *
+ * Reading can be done in random order.
+ */
+char *
+SlruReadSwitchPage(SlruSegState *state, uint64 pageno)
+{
+	int64 segno;
+
+	Assert(!state->writing);	/* read only mode */
+
+	if (state->segno != -1 && pageno == state->pageno)
+		return state->buf.data;
+
+	segno = pageno / SLRU_PAGES_PER_SEGMENT;
+	if (segno != state->segno)
+	{
+		if (state->segno != -1)
+		{
+			close(state->fd);
+			state->fd = -1;
+
+			pg_free(state->fn);
+			state->fn = NULL;
+		}
+
+		/* Open new segment */
+		state->fn = psprintf("%s/%04X", state->dir, (unsigned int) segno);
+		if ((state->fd = open(state->fn, O_RDONLY | PG_BINARY, 0)) < 0)
+			pg_fatal("could not open file \"%s\": %m", state->fn);
+	}
+
+	state->segno = segno;
+
+	{
+		struct iovec	iovec = {
+			.iov_base = &state->buf,
+			.iov_len = BLCKSZ,
+		};
+		off_t			offset = (pageno % SLRU_PAGES_PER_SEGMENT) * BLCKSZ;
+
+		if (pg_preadv(state->fd, &iovec, 1, offset) < 0)
+			pg_fatal("could not read file \"%s\": %m", state->fn);
+
+		state->pageno = pageno;
+	}
+
+	return state->buf.data;
+}
+
+/*
+ * Frees the malloced reader.
+ */
+void
+FreeSlruRead(SlruSegState *state)
+{
+	Assert(!state->writing);	/* read only mode */
+
+	close(state->fd);
+	pg_free(state);
+}
+
+/*
+ * Open the given page for writing.
+ *
+ * NOTE: This uses O_EXCL when stepping to a new segment, so this assumes that
+ * each segment is written in full before moving on to next one.  This
+ * limitation would be easy to lift if needed, but it fits the usage pattern of
+ * current callers.
+ */
+char *
+SlruWriteSwitchPage(SlruSegState *state, uint64 pageno)
+{
+	int64	segno;
+	off_t	offset;
+
+	if (state->segno != -1 && pageno == state->pageno)
+		return state->buf.data;
+
+	segno = pageno / SLRU_PAGES_PER_SEGMENT;
+	offset = (pageno % SLRU_PAGES_PER_SEGMENT) * BLCKSZ;
+
+	SlruFlush(state);
+	memset(state->buf.data, 0, BLCKSZ);
+
+	if (segno != state->segno)
+	{
+		if (state->segno != -1)
+		{
+			close(state->fd);
+			state->fd = -1;
+
+			pg_free(state->fn);
+			state->fn = NULL;
+		}
+
+		/* Create the segment */
+		if (state->long_segment_names)
+		{
+			Assert(segno >= 0 && segno <= INT64CONST(0xFFFFFFFFFFFFFFF));
+			state->fn = psprintf("%s/%015" PRIX64, state->dir, segno);
+		}
+		else
+		{
+			Assert(segno >= 0 && segno <= INT64CONST(0xFFFFFF));
+			state->fn = psprintf("%s/%04X", state->dir, (unsigned int) segno);
+		}
+
+		if ((state->fd = open(state->fn, O_RDWR | O_CREAT | O_EXCL | PG_BINARY,
+							  pg_file_create_mode)) < 0)
+		{
+			pg_fatal("could not create file \"%s\": %m", state->fn);
+		}
+
+		state->segno = segno;
+
+		if (offset > 0 && pg_pwrite_zeros(state->fd, offset, 0) < 0)
+			pg_fatal("could not write file \"%s\": %m", state->fn);
+	}
+
+	state->pageno = pageno;
+
+	return state->buf.data;
+}
+
+/*
+ * Create slru writer for dir.
+ *
+ * Returns the malloced memory used by all other write calls in this module.
+ */
+SlruSegState *
+AllocSlruWrite(char *dir, bool long_segment_names)
+{
+	SlruSegState *state = AllocSlruSegState(dir);
+
+	state->writing = true;
+	state->long_segment_names = long_segment_names;
+
+	return state;
+}
+
+/*
+ * Frees the malloced writer.
+ */
+void
+FreeSlruWrite(SlruSegState *state)
+{
+	Assert(state->writing);
+
+	SlruFlush(state);
+
+	close(state->fd);
+	pg_free(state);
+}
diff --git a/src/bin/pg_upgrade/slru_io.h b/src/bin/pg_upgrade/slru_io.h
new file mode 100644
index 00000000000..920b8ae82e2
--- /dev/null
+++ b/src/bin/pg_upgrade/slru_io.h
@@ -0,0 +1,30 @@
+/*
+ * slru_io.h
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ * src/bin/pg_upgrade/slru_io.h
+ */
+
+/*
+ * Some kind of iterator associated with a particular SLRU segment.  The idea is
+ * to specify the segment and page number and then move through the pages.
+ */
+
+#include "postgres_fe.h"
+
+/*
+ * See access/slru.h
+ *
+ * Copied here, since slru.h cannot be included in frontend code.
+ */
+#define SLRU_PAGES_PER_SEGMENT 32
+
+typedef struct SlruSegState SlruSegState;
+
+extern SlruSegState *AllocSlruRead(char *dir);
+extern char *SlruReadSwitchPage(SlruSegState *state, uint64 pageno);
+extern void FreeSlruRead(SlruSegState *state);
+
+extern SlruSegState *AllocSlruWrite(char *dir, bool long_segment_names);
+extern char *SlruWriteSwitchPage(SlruSegState *state, uint64 pageno);
+extern void FreeSlruWrite(SlruSegState *state);
-- 
2.39.5 (Apple Git-154)
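
For anyone checking the members-page macros copied into multixact_new.c and multixact_old.c, the arithmetic can be tried in isolation. The following is only a sketch and not part of the patch: it assumes the default 8kB BLCKSZ, and the names SKETCH_BLCKSZ, MemberLocation and locate_member are made up for illustration. It mirrors MXOffsetToMemberPage, MXOffsetToFlagsOffset, MXOffsetToMemberOffset and MXOffsetToFlagsBitShift, and derives the segment number as pageno / SLRU_PAGES_PER_SEGMENT.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: default 8kB block size; the patch itself uses BLCKSZ. */
#define SKETCH_BLCKSZ				8192
#define SKETCH_PAGES_PER_SEGMENT	32	/* SLRU_PAGES_PER_SEGMENT in slru_io.h */

#define FLAGBYTES_PER_GROUP		4	/* one flag byte per member */
#define MEMBERS_PER_GROUP		4
#define XID_SIZE				4	/* sizeof(TransactionId) */
#define GROUP_SIZE	(XID_SIZE * MEMBERS_PER_GROUP + FLAGBYTES_PER_GROUP)	/* 20 */
#define GROUPS_PER_PAGE		(SKETCH_BLCKSZ / GROUP_SIZE)					/* 409 */
#define MEMBERS_PER_PAGE	(GROUPS_PER_PAGE * MEMBERS_PER_GROUP)			/* 1636 */

/* Hypothetical result type, for illustration only. */
typedef struct MemberLocation
{
	uint64_t	segno;		/* members segment file number */
	uint64_t	pageno;		/* SLRU page holding the member */
	int			flagsoff;	/* byte offset of the group's flag word */
	int			memberoff;	/* byte offset of the member's xid */
	int			bshift;		/* bit shift of the member's flags in the word */
} MemberLocation;

static MemberLocation
locate_member(uint64_t offset)
{
	MemberLocation loc;
	uint64_t	group = offset / MEMBERS_PER_GROUP;
	int			member_in_group = (int) (offset % MEMBERS_PER_GROUP);

	loc.pageno = offset / MEMBERS_PER_PAGE;
	loc.segno = loc.pageno / SKETCH_PAGES_PER_SEGMENT;
	loc.flagsoff = (int) (group % GROUPS_PER_PAGE) * GROUP_SIZE;
	loc.memberoff = loc.flagsoff + FLAGBYTES_PER_GROUP +
		member_in_group * XID_SIZE;
	loc.bshift = member_in_group * 8;

	/* The xid must lie entirely within the page. */
	assert(loc.memberoff + XID_SIZE <= SKETCH_BLCKSZ);

	return loc;
}
```

Under these assumptions, locate_member(0xFFFFFFFF) gives page 2625285 in segment 0x14078, with the xid at byte 5176 and a flag shift of 24. That is member slot 1036 of the page, which agrees with MAX_MEMBERS_IN_LAST_MEMBERS_PAGE for an 8kB block.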

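The offsets page layout used by MXOffsetWrite() in multixact_new.c (a 64-bit base followed by 32-bit entries with the high bit marking "written") may be easier to follow as a round trip. This is a sketch under assumptions, not the patch's code: it assumes an 8kB block, uses unsigned 32-bit entries rather than the patch's int32 ShortMultiXactOffset, and the reader sketch_offset_read() is hypothetical, since the quoted pg_upgrade code only writes this format. It also assumes every delta on a page fits in 31 bits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Assumption: default 8kB block size; the patch itself uses BLCKSZ. */
#define SKETCH_BLCKSZ		8192
#define SKETCH_INITED_BIT	UINT32_C(0x80000000)
#define SKETCH_OFFSETS_PER_PAGE \
	((SKETCH_BLCKSZ - sizeof(uint64_t)) / sizeof(uint32_t))	/* 2046 */

/* Illustrative stand-in for MultiXactOffsetSLRUPage. */
typedef struct SketchOffsetPage
{
	uint64_t	base;							/* first offset written */
	uint32_t	delta[SKETCH_OFFSETS_PER_PAGE];	/* (offset - base) ^ flag */
} SketchOffsetPage;

/* Same steps as MXOffsetWrite(): set the base on first use, store a delta. */
static void
sketch_offset_write(SketchOffsetPage *page, int entryno, uint64_t offset)
{
	uint32_t	delta;

	assert(entryno >= 0 && (size_t) entryno < SKETCH_OFFSETS_PER_PAGE);

	if (page->base != 0)
		delta = (uint32_t) (offset - page->base);
	else
	{
		page->base = offset;
		delta = 0;
	}

	/* Flip the high bit so a stored delta of zero differs from "unwritten". */
	page->delta[entryno] = delta ^ SKETCH_INITED_BIT;
}

/* Hypothetical reader: returns false for a slot that was never written. */
static bool
sketch_offset_read(const SketchOffsetPage *page, int entryno, uint64_t *offset)
{
	uint32_t	raw = page->delta[entryno];

	if (raw == 0)
		return false;

	*offset = page->base + (raw ^ SKETCH_INITED_BIT);
	return true;
}
```

On a zeroed page, writing 0x100000005 to entry 0 and 0x100000009 to entry 1 sets the base to 0x100000005 and stores deltas 0 and 4 with the flag bit set. Reading both entries back returns the original 64-bit values, and entry 2 still reads as unwritten.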
v21-0003-Add-test-for-64-bit-mxoff-in-pg_resetwal.patch (application/octet-stream)
From 57a160f37e29b4e66b801d01fe442c1c0924a4ef Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Tue, 28 Oct 2025 19:08:26 +0300
Subject: [PATCH v21 3/3] Add test for 64-bit mxoff in pg_resetwal

---
 src/bin/pg_resetwal/meson.build    |   1 +
 src/bin/pg_resetwal/t/003_mxoff.pl | 170 +++++++++++++++++++++++++++++
 2 files changed, 171 insertions(+)
 create mode 100644 src/bin/pg_resetwal/t/003_mxoff.pl

diff --git a/src/bin/pg_resetwal/meson.build b/src/bin/pg_resetwal/meson.build
index 290832b2299..1e2dfb38a5b 100644
--- a/src/bin/pg_resetwal/meson.build
+++ b/src/bin/pg_resetwal/meson.build
@@ -25,6 +25,7 @@ tests += {
     'tests': [
       't/001_basic.pl',
       't/002_corrupted.pl',
+      't/003_mxoff.pl',
     ],
   },
 }
diff --git a/src/bin/pg_resetwal/t/003_mxoff.pl b/src/bin/pg_resetwal/t/003_mxoff.pl
new file mode 100644
index 00000000000..3c1b7fa1d33
--- /dev/null
+++ b/src/bin/pg_resetwal/t/003_mxoff.pl
@@ -0,0 +1,170 @@
+
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+use strict;
+use warnings FATAL => 'all';
+
+use Math::BigInt;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+sub mxact_eater
+{
+	my $node = shift;
+	my $tbl = shift;
+
+	$node->start;
+	$node->safe_psql('postgres',
+		"CREATE TABLE ${tbl} (I INT PRIMARY KEY, N_UPDATED INT) " .
+		"       WITH (AUTOVACUUM_ENABLED=FALSE);" .
+		"INSERT INTO ${tbl} SELECT G, 0 FROM GENERATE_SERIES(1, 50) G;");
+
+	# consume around 10k multixact offsets
+	my $nclients = 10;
+	my $update_every = 75;
+	my @connections = ();
+
+	for (1 .. $nclients)
+	{
+		my $conn = $node->background_psql('postgres');
+		$conn->query_safe("BEGIN");
+
+		push(@connections, $conn);
+	}
+
+	for (my $i = 0; $i < 1000; $i++)
+	{
+		my $conn = $connections[$i % $nclients];
+
+		$conn->query_safe("COMMIT;");
+		$conn->query_safe("BEGIN");
+
+		if ($i % $update_every == 0)
+		{
+			$conn->query_safe(
+				"UPDATE ${tbl} SET " .
+				"N_UPDATED = N_UPDATED + 1 " .
+				"WHERE I = ${i} % 50");
+		}
+		else
+		{
+			$conn->query_safe(
+				"SELECT * FROM ${tbl} FOR KEY SHARE");
+		}
+	}
+
+	for my $conn (@connections)
+	{
+		$conn->quit();
+	}
+
+	$node->stop;
+}
+
+sub next_mxoff
+{
+	my $node = shift;
+	my ($stdout, $stderr) =
+	  run_command([ 'pg_controldata', $node->data_dir ]);
+	my @control_data = split("\n", $stdout);
+	my $next_mxoff = undef;
+
+	foreach (@control_data)
+	{
+		if ($_ =~ /^Latest checkpoint's NextMultiOffset:\s*(.*)$/mg)
+		{
+			$next_mxoff = $1;
+			last;
+		}
+	}
+	die "NextMultiOffset not found in control file\n"
+		unless defined($next_mxoff);
+
+	return $next_mxoff;
+}
+
+sub reset_mxoff
+{
+	my $node = shift;
+	my $offset = shift;
+	$offset = Math::BigInt->new($offset);
+
+	# Get block size
+	my $out = (run_command([ 'pg_resetwal', '--dry-run', $node->data_dir ]))[0];
+	$out =~ /^Database block size: *(\d+)$/m or die;
+	my $blcksz = $1;
+
+	# Reset to new offset
+	my @cmd = ('pg_resetwal', '--pgdata' => $node->data_dir);
+	push @cmd, '--multixact-offset' => $offset->as_hex();
+	command_ok(\@cmd, 'set oldest multixact-offset');
+
+	# Fill empty pg_multixact/members segment
+	my $mult = 32 * int($blcksz / 20) * 4;
+	my $segname = sprintf "%015X", $offset / $mult;
+
+	my @dd = ('dd');
+	push @dd, "if=/dev/zero";
+	push @dd, "of=" . $node->data_dir . "/pg_multixact/members/" . $segname;
+	push @dd, "bs=$blcksz";
+	push @dd, "count=32";
+	command_ok(\@dd, 'fill empty multixact-members');
+}
+
+my ($off1, $off2);
+
+# start from defaults
+my $node1 = PostgreSQL::Test::Cluster->new('node1');
+$node1->init;
+$off1 = next_mxoff($node1);
+mxact_eater($node1, "FOO");
+$off2 = next_mxoff($node1);
+note "> start from $off1, finished at $off2\n";
+
+# start from before 32-bit wraparound
+my $node2 = PostgreSQL::Test::Cluster->new('node2');
+$node2->init;
+reset_mxoff($node2, 0xFFFF0000);
+$off1 = next_mxoff($node2);
+mxact_eater($node2, "FOO");
+$off2 = next_mxoff($node2);
+note "> start from $off1, finished at $off2\n";
+
+# start near 32-bit wraparound
+my $node3 = PostgreSQL::Test::Cluster->new('node3');
+$node3->init;
+reset_mxoff($node3, 0xFFFFEC77);
+$off1 = next_mxoff($node3);
+mxact_eater($node3, "FOO");
+$off2 = next_mxoff($node3);
+note "> start from $off1, finished at $off2\n";
+
+# start over 32-bit wraparound
+my $node4 = PostgreSQL::Test::Cluster->new('node4');
+$node4->init;
+reset_mxoff($node4, '0xFFFFFFFF0000');
+$off1 = next_mxoff($node4);
+mxact_eater($node4, "FOO");
+$off2 = next_mxoff($node4);
+note "> start from $off1, finished at $off2\n";
+
+# check invariant
+$node1->start;
+$node2->start;
+$node3->start;
+$node4->start;
+
+my $var1 = $node1->safe_psql('postgres', 'TABLE FOO');
+my $var2 = $node2->safe_psql('postgres', 'TABLE FOO');
+my $var3 = $node3->safe_psql('postgres', 'TABLE FOO');
+my $var4 = $node4->safe_psql('postgres', 'TABLE FOO');
+ok($var1 eq $var2 && $var2 eq $var3 && $var3 eq $var4,
+	'check table invariant in all nodes');
+
+$node4->stop;
+$node3->stop;
+$node2->stop;
+$node1->stop;
+
+done_testing();
-- 
2.39.5 (Apple Git-154)

#54Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Maxim Orlov (#52)
5 attachment(s)
Re: POC: make mxidoff 64 bits

On 30/10/2025 18:17, Maxim Orlov wrote:

PFA v20 returns to using actual 64-bit offsets for on-disk SLRU
segments.

Fortunately, now that I've separated reading and writing offsets into
different functions, switching from one implementation to another is
easy to do.

Here's a quick overview of the current state of the patch:
1) Access to the offset is moved into separate calls:
   MXOffsetWrite/MXOffsetRead.
2) I abandoned byte juggling in pg_upgrade and moved to using logic that
   replicates the work with offsets in multixact.c
3) As a result, the update issue came down to the correct implementation
   of functions MXOffsetWrite/MXOffsetRead.
4) The only question that remains is the question of disk representation
   of 64-bit offsets in SLRU segments.

Here's another round of review and cleanup of this. I made a bunch of
small changes, but haven't found any major problems. Looking pretty good.

Notable changes since v20:

- Changed MULTIXACT_MEMBER_AUTOVAC_THRESHOLD to 4000000000 instead of
0xFFFFFFFF. The 2^32 mark doesn't have any particular significance,
and using a round number makes that clearer. We should
possibly expose that as a separate GUC, but that can be done in a
followup patch.

- Removed the MXOffsetRead/Write functions again. They certainly make
sense if we make them more complicated with some kind of a compression
scheme, but I preferred to keep the code closer to 'master' for now.

- Removed more remnants of offset wraparound handling. There were still
a few places that checked for wraparound and tried to deal with it,
while other places just assumed that it doesn't happen. I added a check
in GetNewMultiXactId() to error out if it would wrap around. It really
should not happen in the real world, one reason being that we would run
out of WAL before running out of 64-bit mxoffsets, but a sanity check
won't hurt just in case someone e.g. abuses pg_resetwal to get into that
situation.

- Removed MaybeExtendOffsetSlru(). It was only used to deal with binary
upgrade from version 9.2 and below. Now that pg_upgrade rewrites the
files, it's not needed anymore.

- Modified PerformMembersTruncation() to use SimpleLruTruncate()
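
To illustrate the wraparound sanity check and the autovacuum threshold
from the list above, here is a minimal standalone sketch. It is not the
patch code: the helper names offset_would_wrap() and
members_need_autovacuum() are hypothetical, and plain uint64_t stands in
for MultiXactOffset. Only the threshold value and the shape of the two
comparisons are taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Same value as MULTIXACT_MEMBER_AUTOVAC_THRESHOLD in v22. */
#define MEMBER_AUTOVAC_THRESHOLD UINT64_C(4000000000)

/*
 * Hypothetical helper: true if reserving nmembers slots starting at
 * next_offset would overflow a 64-bit offset.  Unsigned addition wraps
 * modulo 2^64, so a wrapped sum compares less than the starting value.
 */
static bool
offset_would_wrap(uint64_t next_offset, int nmembers)
{
	assert(nmembers >= 0);
	return next_offset + (uint64_t) nmembers < next_offset;
}

/*
 * Hypothetical helper: true if the distance between the oldest and the
 * next member offset exceeds the threshold, i.e. autovacuum should be
 * asked to free space in the members SLRU.  An unknown oldest offset is
 * treated as "yes", as SetOffsetVacuumLimit() does.
 */
static bool
members_need_autovacuum(bool oldest_known, uint64_t oldest_offset,
						uint64_t next_offset)
{
	return !oldest_known ||
		(next_offset - oldest_offset > MEMBER_AUTOVAC_THRESHOLD);
}
```

The wraparound check needs no extra state: it relies only on the defined
overflow behaviour of unsigned arithmetic, which is why it is cheap
enough to keep as a sanity check.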

Changes in pg_upgrade:

- Removed the nextMXact/nextOffset fields from MultiXactWriter. They
were redundant with the next_multi and next_offset local variables in
the caller.

Remaining issues:

- There's one more refactoring I'd like to do before merging this: Move
the definitions that are now duplicated between
src/bin/pg_upgrade/multixact_new.c and
src/backend/access/transam/multixact.c into a new header file,
multixact_internal.h. One complication with that is that it needs
SLRU_PAGES_PER_SEGMENT from access/slru.h, but slru.h cannot currently
be included in FRONTEND code. Perhaps we should move
SLRU_PAGES_PER_SEGMENT to pg_config_manual.h, or if that feels too
global, to a separate slru_config.h file.

- I saw Alexander's proposal for a new compression scheme but didn't
incorporate that here. It might be a good idea, but I think we can do
that as a followup patch before the release, if it seems worth it. I
don't feel too bad about just making pg_multixact/offsets 2x larger either.

- Have you done any performance testing of the pg_upgrade code? How long
does the conversion take if you have e.g. 1 billion multixids?

- Is the !oldestOffsetKnown case in the code still reachable? I left one
FIXME comment about that. Needs a comment update at least.

- The new pg_upgrade test fails on my system with this error in the log:

# Running: pg_dump --no-sync --restrict-key test -d port=22462 host=/tmp/5KdMvth1jk dbname='postgres' -f /home/heikki/git-sandbox/postgresql/build/testrun/pg_upgrade/007_mxoff/data/tmp_test_CINS/newnode_1_dump.sql
pg_dump: error: aborting because of server version mismatch
pg_dump: detail: server version: 19devel; pg_dump version: 17.6
could not read "/home/heikki/git-sandbox/postgresql/build/testrun/pg_upgrade/007_mxoff/data/tmp_test_CINS/newnode_1_dump.sql": No such file or directory at /home/heikki/git-sandbox/postgresql/src/bin/pg_upgrade/t/007_mxoff.pl line 242.

This turns out to be an issue with IPC::Run. Setting the
IPCRUNDEBUG=basic env variable reveals that it has a built-in command cache:

IPC::Run 0004 [#19(109223)]: ****** harnessing *****
IPC::Run 0004 [#19(109223)]: parsing [ pg_dump --no-sync --restrict-key test -d 'port=20999 host=/tmp/NsJKldN1Ie dbname='postgres'' -f '/home/heikki/git-sandbox/postgresql/build/testrun/pg_upgrade/007_mxoff/data/tmp_test_urgw/newnode_1_dump.sql' ]
IPC::Run 0004 [#19(109223)]: ** starting
IPC::Run 0004 [#19(109223)]: 'pg_dump' found in cache: '/home/heikki/pgsql.17stable/bin/pg_dump'
IPC::Run 0004 [#19(111432) pg_dump]: execing /home/heikki/pgsql.17stable/bin/pg_dump --no-sync --restrict-key test -d 'port=20999 host=/tmp/NsJKldN1Ie dbname='postgres'' -f /home/heikki/git-sandbox/postgresql/build/testrun/pg_upgrade/007_mxoff/data/tmp_test_urgw/newnode_1_dump.sql
IPC::Run 0004 [#19(109223)]: ** finishing
pg_dump: error: aborting because of server version mismatch
pg_dump: detail: server version: 19devel; pg_dump version: 17.6

The test calls pg_dump twice: first with the old version, then with the
new version. But thanks to IPC::Run's command cache, the invocation of
the new pg_dump version actually also calls the old version. I'm not
sure how to fix that, but I was able to work around it by reversing the
pg_dump calls so that the new version is called first. That way we use
the new pg_dump against both server versions, which works.

- The new pg_upgrade test is very slow. I would love to include that
test permanently in the test suite, but it's too slow for that currently.
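
On the point about pg_multixact/offsets becoming 2x larger: the
arithmetic can be sketched as below. This is an illustration only,
assuming the default 8 kB block size. The names SKETCH_BLCKSZ,
offsets_per_page() and offsets_bytes() are made up for this example and
do not appear in the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed default block size; a real build takes this from BLCKSZ. */
#define SKETCH_BLCKSZ UINT64_C(8192)

/* Number of offset entries that fit on one page for a given entry width. */
static uint64_t
offsets_per_page(uint64_t entry_size)
{
	assert(entry_size > 0);
	return SKETCH_BLCKSZ / entry_size;
}

/*
 * Bytes of offsets SLRU needed to cover nmultis multixids, rounded up
 * to whole pages.
 */
static uint64_t
offsets_bytes(uint64_t nmultis, uint64_t entry_size)
{
	uint64_t	per_page = offsets_per_page(entry_size);
	uint64_t	pages = (nmultis + per_page - 1) / per_page;

	return pages * SKETCH_BLCKSZ;
}
```

With 4-byte entries a page holds 2048 offsets, with 8-byte entries 1024,
so under these assumptions 1 billion multixids need roughly 4 GB of
offsets before the change and roughly 8 GB after it.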

- Heikki

Attachments:

v22-0001-Use-64-bit-multixact-offsets.patch (text/x-patch; charset=UTF-8)
From 11a352f40dbc0e53504f86eb61da019e8938e93f Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Wed, 7 Aug 2024 16:35:22 +0300
Subject: [PATCH v22 1/5] Use 64-bit multixact offsets

Switching to 64-bit multitransaction offsets removes wraparound and the
2^32 limit on their total number.

Author: Maxim Orlov <orlovmg@gmail.com>
Discussion: FIXME
---
 src/backend/access/rmgrdesc/mxactdesc.c   |   4 +-
 src/backend/access/rmgrdesc/xlogdesc.c    |   2 +-
 src/backend/access/transam/multixact.c    | 412 ++++------------------
 src/backend/access/transam/xlog.c         |   2 +-
 src/backend/access/transam/xlogrecovery.c |   2 +-
 src/backend/commands/vacuum.c             |   2 +-
 src/backend/postmaster/autovacuum.c       |   4 +-
 src/bin/pg_controldata/pg_controldata.c   |   2 +-
 src/bin/pg_resetwal/pg_resetwal.c         |   6 +-
 src/bin/pg_resetwal/t/001_basic.pl        |   2 +-
 src/include/access/multixact.h            |   3 -
 src/include/c.h                           |   2 +-
 12 files changed, 73 insertions(+), 370 deletions(-)

diff --git a/src/backend/access/rmgrdesc/mxactdesc.c b/src/backend/access/rmgrdesc/mxactdesc.c
index 3ca0582db36..052dd0a4ce5 100644
--- a/src/backend/access/rmgrdesc/mxactdesc.c
+++ b/src/backend/access/rmgrdesc/mxactdesc.c
@@ -65,7 +65,7 @@ multixact_desc(StringInfo buf, XLogReaderState *record)
 		xl_multixact_create *xlrec = (xl_multixact_create *) rec;
 		int			i;
 
-		appendStringInfo(buf, "%u offset %u nmembers %d: ", xlrec->mid,
+		appendStringInfo(buf, "%u offset %" PRIu64 " nmembers %d: ", xlrec->mid,
 						 xlrec->moff, xlrec->nmembers);
 		for (i = 0; i < xlrec->nmembers; i++)
 			out_member(buf, &xlrec->members[i]);
@@ -74,7 +74,7 @@ multixact_desc(StringInfo buf, XLogReaderState *record)
 	{
 		xl_multixact_truncate *xlrec = (xl_multixact_truncate *) rec;
 
-		appendStringInfo(buf, "offsets [%u, %u), members [%u, %u)",
+		appendStringInfo(buf, "offsets [%u, %u), members [%" PRIu64 ", %" PRIu64 ")",
 						 xlrec->startTruncOff, xlrec->endTruncOff,
 						 xlrec->startTruncMemb, xlrec->endTruncMemb);
 	}
diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c
index cd6c2a2f650..441034f5929 100644
--- a/src/backend/access/rmgrdesc/xlogdesc.c
+++ b/src/backend/access/rmgrdesc/xlogdesc.c
@@ -66,7 +66,7 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 		CheckPoint *checkpoint = (CheckPoint *) rec;
 
 		appendStringInfo(buf, "redo %X/%08X; "
-						 "tli %u; prev tli %u; fpw %s; wal_level %s; xid %u:%u; oid %u; multi %u; offset %u; "
+						 "tli %u; prev tli %u; fpw %s; wal_level %s; xid %u:%u; oid %u; multi %u; offset %" PRIu64 "; "
 						 "oldest xid %u in DB %u; oldest multi %u in DB %u; "
 						 "oldest/newest commit timestamp xid: %u/%u; "
 						 "oldest running xid %u; %s",
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 9d5f130af7e..68fd3441816 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -92,17 +92,9 @@
 /*
  * Defines for MultiXactOffset page sizes.  A page is the same BLCKSZ as is
  * used everywhere else in Postgres.
- *
- * Note: because MultiXactOffsets are 32 bits and wrap around at 0xFFFFFFFF,
- * MultiXact page numbering also wraps around at
- * 0xFFFFFFFF/MULTIXACT_OFFSETS_PER_PAGE, and segment numbering at
- * 0xFFFFFFFF/MULTIXACT_OFFSETS_PER_PAGE/SLRU_PAGES_PER_SEGMENT.  We need
- * take no explicit notice of that fact in this module, except when comparing
- * segment and page numbers in TruncateMultiXact (see
- * MultiXactOffsetPagePrecedes).
  */
 
-/* We need four bytes per offset */
+/* We need 8 bytes per offset */
 #define MULTIXACT_OFFSETS_PER_PAGE (BLCKSZ / sizeof(MultiXactOffset))
 
 static inline int64
@@ -151,19 +143,6 @@ MultiXactIdToOffsetSegment(MultiXactId multi)
 #define MULTIXACT_MEMBERS_PER_PAGE	\
 	(MULTIXACT_MEMBERGROUPS_PER_PAGE * MULTIXACT_MEMBERS_PER_MEMBERGROUP)
 
-/*
- * Because the number of items per page is not a divisor of the last item
- * number (member 0xFFFFFFFF), the last segment does not use the maximum number
- * of pages, and moreover the last used page therein does not use the same
- * number of items as previous pages.  (Another way to say it is that the
- * 0xFFFFFFFF member is somewhere in the middle of the last page, so the page
- * has some empty space after that item.)
- *
- * This constant is the number of members in the last page of the last segment.
- */
-#define MAX_MEMBERS_IN_LAST_MEMBERS_PAGE \
-		((uint32) ((0xFFFFFFFF % MULTIXACT_MEMBERS_PER_PAGE) + 1))
-
 /* page in which a member is to be found */
 static inline int64
 MXOffsetToMemberPage(MultiXactOffset offset)
@@ -208,10 +187,14 @@ MXOffsetToMemberOffset(MultiXactOffset offset)
 		member_in_group * sizeof(TransactionId);
 }
 
-/* Multixact members wraparound thresholds. */
-#define MULTIXACT_MEMBER_SAFE_THRESHOLD		(MaxMultiXactOffset / 2)
-#define MULTIXACT_MEMBER_DANGER_THRESHOLD	\
-	(MaxMultiXactOffset - MaxMultiXactOffset / 4)
+/*
+ * Multixact members warning threshold.
+ *
+ * If the difference between nextOffset and oldestOffset exceeds this value,
+ * we trigger autovacuum in order to release disk space consumed by the
+ * members SLRU.
+ */
+#define MULTIXACT_MEMBER_AUTOVAC_THRESHOLD		UINT64CONST(4000000000)
 
 static inline MultiXactId
 PreviousMultiXactId(MultiXactId multi)
@@ -268,9 +251,6 @@ typedef struct MultiXactStateData
 	MultiXactId multiStopLimit;
 	MultiXactId multiWrapLimit;
 
-	/* support for members anti-wraparound measures */
-	MultiXactOffset offsetStopLimit;	/* known if oldestOffsetKnown */
-
 	/*
 	 * This is used to sleep until a multixact offset is written when we want
 	 * to create the next one.
@@ -401,8 +381,6 @@ static bool MultiXactOffsetPrecedes(MultiXactOffset offset1,
 									MultiXactOffset offset2);
 static void ExtendMultiXactOffset(MultiXactId multi);
 static void ExtendMultiXactMember(MultiXactOffset offset, int nmembers);
-static bool MultiXactOffsetWouldWrap(MultiXactOffset boundary,
-									 MultiXactOffset start, uint32 distance);
 static bool SetOffsetVacuumLimit(bool is_startup);
 static bool find_multixact_start(MultiXactId multi, MultiXactOffset *result);
 static void WriteMTruncateXlogRec(Oid oldestMultiDB,
@@ -1142,90 +1120,22 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 	ExtendMultiXactOffset(result);
 
 	/*
-	 * Reserve the members space, similarly to above.  Also, be careful not to
-	 * return zero as the starting offset for any multixact. See
-	 * GetMultiXactIdMembers() for motivation.
+	 * Reserve the members space, similarly to above.
 	 */
 	nextOffset = MultiXactState->nextOffset;
-	if (nextOffset == 0)
-	{
-		*offset = 1;
-		nmembers++;				/* allocate member slot 0 too */
-	}
-	else
-		*offset = nextOffset;
-
-	/*----------
-	 * Protect against overrun of the members space as well, with the
-	 * following rules:
-	 *
-	 * If we're past offsetStopLimit, refuse to generate more multis.
-	 * If we're close to offsetStopLimit, emit a warning.
-	 *
-	 * Arbitrarily, we start emitting warnings when we're 20 segments or less
-	 * from offsetStopLimit.
-	 *
-	 * Note we haven't updated the shared state yet, so if we fail at this
-	 * point, the multixact ID we grabbed can still be used by the next guy.
-	 *
-	 * Note that there is no point in forcing autovacuum runs here: the
-	 * multixact freeze settings would have to be reduced for that to have any
-	 * effect.
-	 *----------
-	 */
-#define OFFSET_WARN_SEGMENTS	20
-	if (MultiXactState->oldestOffsetKnown &&
-		MultiXactOffsetWouldWrap(MultiXactState->offsetStopLimit, nextOffset,
-								 nmembers))
-	{
-		/* see comment in the corresponding offsets wraparound case */
-		SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER);
-
-		ereport(ERROR,
-				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg("multixact \"members\" limit exceeded"),
-				 errdetail_plural("This command would create a multixact with %u members, but the remaining space is only enough for %u member.",
-								  "This command would create a multixact with %u members, but the remaining space is only enough for %u members.",
-								  MultiXactState->offsetStopLimit - nextOffset - 1,
-								  nmembers,
-								  MultiXactState->offsetStopLimit - nextOffset - 1),
-				 errhint("Execute a database-wide VACUUM in database with OID %u with reduced \"vacuum_multixact_freeze_min_age\" and \"vacuum_multixact_freeze_table_age\" settings.",
-						 MultiXactState->oldestMultiXactDB)));
-	}
 
 	/*
-	 * Check whether we should kick autovacuum into action, to prevent members
-	 * wraparound. NB we use a much larger window to trigger autovacuum than
-	 * just the warning limit. The warning is just a measure of last resort -
-	 * this is in line with GetNewTransactionId's behaviour.
+	 * Offsets are 64-bit integers and will never wrap around.  Firstly, it
+	 * would take an unrealistic amount of time and resources to consume 2^64
+	 * offsets.  Secondly, multixid creation is WAL-logged, so you would run
+	 * out of LSNs before reaching offset wraparound.  Nevertheless, check for
+	 * wraparound as a sanity check.
 	 */
-	if (!MultiXactState->oldestOffsetKnown ||
-		(MultiXactState->nextOffset - MultiXactState->oldestOffset
-		 > MULTIXACT_MEMBER_SAFE_THRESHOLD))
-	{
-		/*
-		 * To avoid swamping the postmaster with signals, we issue the autovac
-		 * request only when crossing a segment boundary. With default
-		 * compilation settings that's roughly after 50k members.  This still
-		 * gives plenty of chances before we get into real trouble.
-		 */
-		if ((MXOffsetToMemberPage(nextOffset) / SLRU_PAGES_PER_SEGMENT) !=
-			(MXOffsetToMemberPage(nextOffset + nmembers) / SLRU_PAGES_PER_SEGMENT))
-			SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER);
-	}
-
-	if (MultiXactState->oldestOffsetKnown &&
-		MultiXactOffsetWouldWrap(MultiXactState->offsetStopLimit,
-								 nextOffset,
-								 nmembers + MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT * OFFSET_WARN_SEGMENTS))
-		ereport(WARNING,
+	if (nextOffset + nmembers < nextOffset)
+		ereport(ERROR,
 				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg_plural("database with OID %u must be vacuumed before %d more multixact member is used",
-							   "database with OID %u must be vacuumed before %d more multixact members are used",
-							   MultiXactState->offsetStopLimit - nextOffset + nmembers,
-							   MultiXactState->oldestMultiXactDB,
-							   MultiXactState->offsetStopLimit - nextOffset + nmembers),
-				 errhint("Execute a database-wide VACUUM in that database with reduced \"vacuum_multixact_freeze_min_age\" and \"vacuum_multixact_freeze_table_age\" settings.")));
+				 errmsg_internal("MultiXact members would wrap around")));
+	*offset = nextOffset;
 
 	ExtendMultiXactMember(nextOffset, nmembers);
 
@@ -1246,8 +1156,7 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 	 * the next iteration.  But note that nextMXact may be InvalidMultiXactId
 	 * or the first value on a segment-beginning page after this routine
 	 * exits, so anyone else looking at the variable must be prepared to deal
-	 * with either case.  Similarly, nextOffset may be zero, but we won't use
-	 * that as the actual start offset of the next multixact.
+	 * with either case.
 	 */
 	(MultiXactState->nextMXact)++;
 
@@ -1255,7 +1164,8 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 
 	LWLockRelease(MultiXactGenLock);
 
-	debug_elog4(DEBUG2, "GetNew: returning %u offset %u", result, *offset);
+	debug_elog4(DEBUG2, "GetNew: returning %u offset %" PRIu64, result,
+				*offset);
 	return result;
 }
 
@@ -1297,7 +1207,6 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
 	MultiXactOffset *offptr;
 	MultiXactOffset offset;
 	int			length;
-	int			truelength;
 	MultiXactId oldestMXact;
 	MultiXactId nextMXact;
 	MultiXactId tmpMXact;
@@ -1396,16 +1305,7 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
 	 * we have just for this; the process in charge will signal the CV as soon
 	 * as it has finished writing the multixact offset.
 	 *
-	 * 3. Because GetNewMultiXactId increments offset zero to offset one to
-	 * handle case #2, there is an ambiguity near the point of offset
-	 * wraparound.  If we see next multixact's offset is one, is that our
-	 * multixact's actual endpoint, or did it end at zero with a subsequent
-	 * increment?  We handle this using the knowledge that if the zero'th
-	 * member slot wasn't filled, it'll contain zero, and zero isn't a valid
-	 * transaction ID so it can't be a multixact member.  Therefore, if we
-	 * read a zero from the members array, just ignore it.
-	 *
-	 * This is all pretty messy, but the mess occurs only in infrequent corner
+	 * This is a little messy, but the mess occurs only in infrequent corner
 	 * cases, so it seems better than holding the MultiXactGenLock for a long
 	 * time on every multixact creation.
 	 */
@@ -1491,6 +1391,9 @@ retry:
 	LWLockRelease(lock);
 	lock = NULL;
 
+	/* A multixid with zero members should not happen */
+	Assert(length > 0);
+
 	/*
 	 * If we slept above, clean up state; it's no longer needed.
 	 */
@@ -1499,7 +1402,6 @@ retry:
 
 	ptr = (MultiXactMember *) palloc(length * sizeof(MultiXactMember));
 
-	truelength = 0;
 	prev_pageno = -1;
 	for (int i = 0; i < length; i++, offset++)
 	{
@@ -1536,37 +1438,27 @@ retry:
 
 		xactptr = (TransactionId *)
 			(MultiXactMemberCtl->shared->page_buffer[slotno] + memberoff);
-
-		if (!TransactionIdIsValid(*xactptr))
-		{
-			/* Corner case 3: we must be looking at unused slot zero */
-			Assert(offset == 0);
-			continue;
-		}
+		Assert(TransactionIdIsValid(*xactptr));
 
 		flagsoff = MXOffsetToFlagsOffset(offset);
 		bshift = MXOffsetToFlagsBitShift(offset);
 		flagsptr = (uint32 *) (MultiXactMemberCtl->shared->page_buffer[slotno] + flagsoff);
 
-		ptr[truelength].xid = *xactptr;
-		ptr[truelength].status = (*flagsptr >> bshift) & MXACT_MEMBER_XACT_BITMASK;
-		truelength++;
+		ptr[i].xid = *xactptr;
+		ptr[i].status = (*flagsptr >> bshift) & MXACT_MEMBER_XACT_BITMASK;
 	}
 
 	LWLockRelease(lock);
 
-	/* A multixid with zero members should not happen */
-	Assert(truelength > 0);
-
 	/*
 	 * Copy the result into the local cache.
 	 */
-	mXactCachePut(multi, truelength, ptr);
+	mXactCachePut(multi, length, ptr);
 
 	debug_elog3(DEBUG2, "GetMembers: no cache for %s",
-				mxid_to_string(multi, truelength, ptr));
+				mxid_to_string(multi, length, ptr));
 	*members = ptr;
-	return truelength;
+	return length;
 }
 
 /*
@@ -1973,7 +1865,7 @@ MultiXactShmemInit(void)
 				  "pg_multixact/members", LWTRANCHE_MULTIXACTMEMBER_BUFFER,
 				  LWTRANCHE_MULTIXACTMEMBER_SLRU,
 				  SYNC_HANDLER_MULTIXACT_MEMBER,
-				  false);
+				  true);
 	/* doesn't call SimpleLruTruncate() or meet criteria for unit tests */
 
 	/* Initialize our shared state struct */
@@ -2150,7 +2042,6 @@ TrimMultiXact(void)
 		slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, nextMXact);
 		offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
 		offptr += entryno;
-
 		MemSet(offptr, 0, BLCKSZ - (entryno * sizeof(MultiXactOffset)));
 
 		MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
@@ -2223,7 +2114,7 @@ MultiXactGetCheckptMulti(bool is_shutdown,
 	LWLockRelease(MultiXactGenLock);
 
 	debug_elog6(DEBUG2,
-				"MultiXact: checkpoint is nextMulti %u, nextOffset %u, oldestMulti %u in DB %u",
+				"MultiXact: checkpoint is nextMulti %u, nextOffset %" PRIu64 ", oldestMulti %u in DB %u",
 				*nextMulti, *nextMultiOffset, *oldestMulti, *oldestMultiDB);
 }
 
@@ -2258,7 +2149,7 @@ void
 MultiXactSetNextMXact(MultiXactId nextMulti,
 					  MultiXactOffset nextMultiOffset)
 {
-	debug_elog4(DEBUG2, "MultiXact: setting next multi to %u offset %u",
+	debug_elog4(DEBUG2, "MultiXact: setting next multi to %u offset %" PRIu64,
 				nextMulti, nextMultiOffset);
 	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
 	MultiXactState->nextMXact = nextMulti;
@@ -2449,7 +2340,7 @@ MultiXactAdvanceNextMXact(MultiXactId minMulti,
 	}
 	if (MultiXactOffsetPrecedes(MultiXactState->nextOffset, minMultiOffset))
 	{
-		debug_elog3(DEBUG2, "MultiXact: setting next offset to %u",
+		debug_elog3(DEBUG2, "MultiXact: setting next offset to %" PRIu64,
 					minMultiOffset);
 		MultiXactState->nextOffset = minMultiOffset;
 	}
@@ -2551,23 +2442,8 @@ ExtendMultiXactMember(MultiXactOffset offset, int nmembers)
 			LWLockRelease(lock);
 		}
 
-		/*
-		 * Compute the number of items till end of current page.  Careful: if
-		 * addition of unsigned ints wraps around, we're at the last page of
-		 * the last segment; since that page holds a different number of items
-		 * than other pages, we need to do it differently.
-		 */
-		if (offset + MAX_MEMBERS_IN_LAST_MEMBERS_PAGE < offset)
-		{
-			/*
-			 * This is the last page of the last segment; we can compute the
-			 * number of items left to allocate in it without modulo
-			 * arithmetic.
-			 */
-			difference = MaxMultiXactOffset - offset + 1;
-		}
-		else
-			difference = MULTIXACT_MEMBERS_PER_PAGE - offset % MULTIXACT_MEMBERS_PER_PAGE;
+		/* Compute the number of items till end of current page. */
+		difference = MULTIXACT_MEMBERS_PER_PAGE - offset % MULTIXACT_MEMBERS_PER_PAGE;
 
 		/*
 		 * Advance to next page, taking care to properly handle the wraparound
@@ -2633,15 +2509,14 @@ GetOldestMultiXactId(void)
 }
 
 /*
- * Determine how aggressively we need to vacuum in order to prevent member
- * wraparound.
+ * Determine if we need to vacuum to keep the size of the members SLRU in
+ * check.
  *
  * To do so determine what's the oldest member offset and install the limit
  * info in MultiXactState, where it can be used to prevent overrun of old data
  * in the members SLRU area.
  *
- * The return value is true if emergency autovacuum is required and false
- * otherwise.
+ * The return value is true if autovacuum is required and false otherwise.
  */
 static bool
 SetOffsetVacuumLimit(bool is_startup)
@@ -2653,8 +2528,6 @@ SetOffsetVacuumLimit(bool is_startup)
 	MultiXactOffset nextOffset;
 	bool		oldestOffsetKnown = false;
 	bool		prevOldestOffsetKnown;
-	MultiXactOffset offsetStopLimit = 0;
-	MultiXactOffset prevOffsetStopLimit;
 
 	/*
 	 * NB: Have to prevent concurrent truncation, we might otherwise try to
@@ -2669,7 +2542,6 @@ SetOffsetVacuumLimit(bool is_startup)
 	nextOffset = MultiXactState->nextOffset;
 	prevOldestOffsetKnown = MultiXactState->oldestOffsetKnown;
 	prevOldestOffset = MultiXactState->oldestOffset;
-	prevOffsetStopLimit = MultiXactState->offsetStopLimit;
 	Assert(MultiXactState->finishedStartup);
 	LWLockRelease(MultiXactGenLock);
 
@@ -2700,13 +2572,9 @@ SetOffsetVacuumLimit(bool is_startup)
 		oldestOffsetKnown =
 			find_multixact_start(oldestMultiXactId, &oldestOffset);
 
-		if (oldestOffsetKnown)
-			ereport(DEBUG1,
-					(errmsg_internal("oldest MultiXactId member is at offset %u",
-									 oldestOffset)));
-		else
+		if (!oldestOffsetKnown)
 			ereport(LOG,
-					(errmsg("MultiXact member wraparound protections are disabled because oldest checkpointed MultiXact %u does not exist on disk",
+					(errmsg("oldest checkpointed MultiXact %u does not exist on disk",
 							oldestMultiXactId)));
 	}
 
@@ -2716,97 +2584,32 @@ SetOffsetVacuumLimit(bool is_startup)
 	 * If we can, compute limits (and install them MultiXactState) to prevent
 	 * overrun of old data in the members SLRU area. We can only do so if the
 	 * oldest offset is known though.
+	 *
+	 * FIXME: Is !oldestOffsetKnown possible anymore? At least update the comment:
+	 * we won't overrun members anymore.
 	 */
-	if (oldestOffsetKnown)
-	{
-		/* move back to start of the corresponding segment */
-		offsetStopLimit = oldestOffset - (oldestOffset %
-										  (MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT));
-
-		/* always leave one segment before the wraparound point */
-		offsetStopLimit -= (MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT);
-
-		if (!prevOldestOffsetKnown && !is_startup)
-			ereport(LOG,
-					(errmsg("MultiXact member wraparound protections are now enabled")));
-
-		ereport(DEBUG1,
-				(errmsg_internal("MultiXact member stop limit is now %u based on MultiXact %u",
-								 offsetStopLimit, oldestMultiXactId)));
-	}
-	else if (prevOldestOffsetKnown)
+	if (prevOldestOffsetKnown)
 	{
 		/*
 		 * If we failed to get the oldest offset this time, but we have a
 		 * value from a previous pass through this function, use the old
-		 * values rather than automatically forcing an emergency autovacuum
-		 * cycle again.
+		 * values rather than automatically forcing an autovacuum cycle again.
 		 */
 		oldestOffset = prevOldestOffset;
 		oldestOffsetKnown = true;
-		offsetStopLimit = prevOffsetStopLimit;
 	}
 
 	/* Install the computed values */
 	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
 	MultiXactState->oldestOffset = oldestOffset;
 	MultiXactState->oldestOffsetKnown = oldestOffsetKnown;
-	MultiXactState->offsetStopLimit = offsetStopLimit;
 	LWLockRelease(MultiXactGenLock);
 
 	/*
-	 * Do we need an emergency autovacuum?	If we're not sure, assume yes.
+	 * Do we need autovacuum?	If we're not sure, assume yes.
 	 */
 	return !oldestOffsetKnown ||
-		(nextOffset - oldestOffset > MULTIXACT_MEMBER_SAFE_THRESHOLD);
-}
-
-/*
- * Return whether adding "distance" to "start" would move past "boundary".
- *
- * We use this to determine whether the addition is "wrapping around" the
- * boundary point, hence the name.  The reason we don't want to use the regular
- * 2^31-modulo arithmetic here is that we want to be able to use the whole of
- * the 2^32-1 space here, allowing for more multixacts than would fit
- * otherwise.
- */
-static bool
-MultiXactOffsetWouldWrap(MultiXactOffset boundary, MultiXactOffset start,
-						 uint32 distance)
-{
-	MultiXactOffset finish;
-
-	/*
-	 * Note that offset number 0 is not used (see GetMultiXactIdMembers), so
-	 * if the addition wraps around the UINT_MAX boundary, skip that value.
-	 */
-	finish = start + distance;
-	if (finish < start)
-		finish++;
-
-	/*-----------------------------------------------------------------------
-	 * When the boundary is numerically greater than the starting point, any
-	 * value numerically between the two is not wrapped:
-	 *
-	 *	<----S----B---->
-	 *	[---)			 = F wrapped past B (and UINT_MAX)
-	 *		 [---)		 = F not wrapped
-	 *			  [----] = F wrapped past B
-	 *
-	 * When the boundary is numerically less than the starting point (i.e. the
-	 * UINT_MAX wraparound occurs somewhere in between) then all values in
-	 * between are wrapped:
-	 *
-	 *	<----B----S---->
-	 *	[---)			 = F not wrapped past B (but wrapped past UINT_MAX)
-	 *		 [---)		 = F wrapped past B (and UINT_MAX)
-	 *			  [----] = F not wrapped
-	 *-----------------------------------------------------------------------
-	 */
-	if (start < boundary)
-		return finish >= boundary || finish < start;
-	else
-		return finish >= boundary && finish < start;
+		(nextOffset - oldestOffset > MULTIXACT_MEMBER_AUTOVAC_THRESHOLD);
 }
 
 /*
@@ -2846,6 +2649,7 @@ find_multixact_start(MultiXactId multi, MultiXactOffset *result)
 	offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
 	offptr += entryno;
 	offset = *offptr;
+
 	LWLockRelease(SimpleLruGetBankLock(MultiXactOffsetCtl, pageno));
 
 	*result = offset;
@@ -2893,73 +2697,6 @@ GetMultiXactInfo(uint32 *multixacts, MultiXactOffset *members,
 	return true;
 }
 
-/*
- * Multixact members can be removed once the multixacts that refer to them
- * are older than every datminmxid.  autovacuum_multixact_freeze_max_age and
- * vacuum_multixact_freeze_table_age work together to make sure we never have
- * too many multixacts; we hope that, at least under normal circumstances,
- * this will also be sufficient to keep us from using too many offsets.
- * However, if the average multixact has many members, we might exhaust the
- * members space while still using few enough members that these limits fail
- * to trigger relminmxid advancement by VACUUM.  At that point, we'd have no
- * choice but to start failing multixact-creating operations with an error.
- *
- * To prevent that, if more than a threshold portion of the members space is
- * used, we effectively reduce autovacuum_multixact_freeze_max_age and
- * to a value just less than the number of multixacts in use.  We hope that
- * this will quickly trigger autovacuuming on the table or tables with the
- * oldest relminmxid, thus allowing datminmxid values to advance and removing
- * some members.
- *
- * As the fraction of the member space currently in use grows, we become
- * more aggressive in clamping this value.  That not only causes autovacuum
- * to ramp up, but also makes any manual vacuums the user issues more
- * aggressive.  This happens because vacuum_get_cutoffs() will clamp the
- * freeze table and the minimum freeze age cutoffs based on the effective
- * autovacuum_multixact_freeze_max_age this function returns.  In the worst
- * case, we'll claim the freeze_max_age to zero, and every vacuum of any
- * table will freeze every multixact.
- */
-int
-MultiXactMemberFreezeThreshold(void)
-{
-	MultiXactOffset members;
-	uint32		multixacts;
-	uint32		victim_multixacts;
-	double		fraction;
-	int			result;
-	MultiXactId oldestMultiXactId;
-	MultiXactOffset oldestOffset;
-
-	/* If we can't determine member space utilization, assume the worst. */
-	if (!GetMultiXactInfo(&multixacts, &members, &oldestMultiXactId, &oldestOffset))
-		return 0;
-
-	/* If member space utilization is low, no special action is required. */
-	if (members <= MULTIXACT_MEMBER_SAFE_THRESHOLD)
-		return autovacuum_multixact_freeze_max_age;
-
-	/*
-	 * Compute a target for relminmxid advancement.  The number of multixacts
-	 * we try to eliminate from the system is based on how far we are past
-	 * MULTIXACT_MEMBER_SAFE_THRESHOLD.
-	 */
-	fraction = (double) (members - MULTIXACT_MEMBER_SAFE_THRESHOLD) /
-		(MULTIXACT_MEMBER_DANGER_THRESHOLD - MULTIXACT_MEMBER_SAFE_THRESHOLD);
-	victim_multixacts = multixacts * fraction;
-
-	/* fraction could be > 1.0, but lowest possible freeze age is zero */
-	if (victim_multixacts > multixacts)
-		return 0;
-	result = multixacts - victim_multixacts;
-
-	/*
-	 * Clamp to autovacuum_multixact_freeze_max_age, so that we never make
-	 * autovacuum less aggressive than it would otherwise be.
-	 */
-	return Min(result, autovacuum_multixact_freeze_max_age);
-}
-
 typedef struct mxtruncinfo
 {
 	int64		earliestExistingPage;
@@ -2986,36 +2723,12 @@ SlruScanDirCbFindEarliest(SlruCtl ctl, char *filename, int64 segpage, void *data
 
 /*
  * Delete members segments [oldest, newOldest)
- *
- * The members SLRU can, in contrast to the offsets one, be filled to almost
- * the full range at once. This means SimpleLruTruncate() can't trivially be
- * used - instead the to-be-deleted range is computed using the offsets
- * SLRU. C.f. TruncateMultiXact().
  */
 static void
 PerformMembersTruncation(MultiXactOffset oldestOffset, MultiXactOffset newOldestOffset)
 {
-	const int64 maxsegment = MXOffsetToMemberSegment(MaxMultiXactOffset);
-	int64		startsegment = MXOffsetToMemberSegment(oldestOffset);
-	int64		endsegment = MXOffsetToMemberSegment(newOldestOffset);
-	int64		segment = startsegment;
-
-	/*
-	 * Delete all the segments but the last one. The last segment can still
-	 * contain, possibly partially, valid data.
-	 */
-	while (segment != endsegment)
-	{
-		elog(DEBUG2, "truncating multixact members segment %" PRIx64,
-			 segment);
-		SlruDeleteSegment(MultiXactMemberCtl, segment);
-
-		/* move to next segment, handling wraparound correctly */
-		if (segment == maxsegment)
-			segment = 0;
-		else
-			segment += 1;
-	}
+	SimpleLruTruncate(MultiXactMemberCtl,
+					  MXOffsetToMemberPage(newOldestOffset));
 }
 
 /*
@@ -3159,7 +2872,7 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 
 	elog(DEBUG1, "performing multixact truncation: "
 		 "offsets [%u, %u), offsets segments [%" PRIx64 ", %" PRIx64 "), "
-		 "members [%u, %u), members segments [%" PRIx64 ", %" PRIx64 ")",
+		 "members [%" PRIu64 ", %" PRIu64 "), members segments [%" PRIx64 ", %" PRIx64 ")",
 		 oldestMulti, newOldestMulti,
 		 MultiXactIdToOffsetSegment(oldestMulti),
 		 MultiXactIdToOffsetSegment(newOldestMulti),
@@ -3239,20 +2952,13 @@ MultiXactOffsetPagePrecedes(int64 page1, int64 page2)
 
 /*
  * Decide whether a MultiXactMember page number is "older" for truncation
- * purposes.  There is no "invalid offset number" so use the numbers verbatim.
+ * purposes.  There is no "invalid offset number" and members never wrap
+ * around, so use the numbers verbatim.
  */
 static bool
 MultiXactMemberPagePrecedes(int64 page1, int64 page2)
 {
-	MultiXactOffset offset1;
-	MultiXactOffset offset2;
-
-	offset1 = ((MultiXactOffset) page1) * MULTIXACT_MEMBERS_PER_PAGE;
-	offset2 = ((MultiXactOffset) page2) * MULTIXACT_MEMBERS_PER_PAGE;
-
-	return (MultiXactOffsetPrecedes(offset1, offset2) &&
-			MultiXactOffsetPrecedes(offset1,
-									offset2 + MULTIXACT_MEMBERS_PER_PAGE - 1));
+	return page1 < page2;
 }
 
 /*
@@ -3290,7 +2996,7 @@ MultiXactIdPrecedesOrEquals(MultiXactId multi1, MultiXactId multi2)
 static bool
 MultiXactOffsetPrecedes(MultiXactOffset offset1, MultiXactOffset offset2)
 {
-	int32		diff = (int32) (offset1 - offset2);
+	int64		diff = (int64) (offset1 - offset2);
 
 	return (diff < 0);
 }
@@ -3387,7 +3093,7 @@ multixact_redo(XLogReaderState *record)
 
 		elog(DEBUG1, "replaying multixact truncation: "
 			 "offsets [%u, %u), offsets segments [%" PRIx64 ", %" PRIx64 "), "
-			 "members [%u, %u), members segments [%" PRIx64 ", %" PRIx64 ")",
+			 "members [%" PRIu64 ", %" PRIu64 "), members segments [%" PRIx64 ", %" PRIx64 ")",
 			 xlrec.startTruncOff, xlrec.endTruncOff,
 			 MultiXactIdToOffsetSegment(xlrec.startTruncOff),
 			 MultiXactIdToOffsetSegment(xlrec.endTruncOff),
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 7c959051e11..2ade0b4a042 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5128,7 +5128,7 @@ BootStrapXLOG(uint32 data_checksum_version)
 		FullTransactionIdFromEpochAndXid(0, FirstNormalTransactionId);
 	checkPoint.nextOid = FirstGenbkiObjectId;
 	checkPoint.nextMulti = FirstMultiXactId;
-	checkPoint.nextMultiOffset = 0;
+	checkPoint.nextMultiOffset = 1;
 	checkPoint.oldestXid = FirstNormalTransactionId;
 	checkPoint.oldestXidDB = Template1DbOid;
 	checkPoint.oldestMulti = FirstMultiXactId;
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 550de6e4a59..66c2364aa9b 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -886,7 +886,7 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 							 U64FromFullTransactionId(checkPoint.nextXid),
 							 checkPoint.nextOid)));
 	ereport(DEBUG1,
-			(errmsg_internal("next MultiXactId: %u; next MultiXactOffset: %u",
+			(errmsg_internal("next MultiXactId: %u; next MultiXactOffset: %" PRIu64,
 							 checkPoint.nextMulti, checkPoint.nextMultiOffset)));
 	ereport(DEBUG1,
 			(errmsg_internal("oldest unfrozen transaction ID: %u, in database %u",
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index ed03e3bd50d..259ef60bd31 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -1147,7 +1147,7 @@ vacuum_get_cutoffs(Relation rel, const VacuumParams params,
 	 * normally autovacuum_multixact_freeze_max_age, but may be less if we are
 	 * short of multixact member space.
 	 */
-	effective_multixact_freeze_max_age = MultiXactMemberFreezeThreshold();
+	effective_multixact_freeze_max_age = autovacuum_multixact_freeze_max_age;
 
 	/*
 	 * Almost ready to set freeze output parameters; check if OldestXmin or
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index ed19c74bb19..34909ee54ff 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -1151,7 +1151,7 @@ do_start_worker(void)
 
 	/* Also determine the oldest datminmxid we will consider. */
 	recentMulti = ReadNextMultiXactId();
-	multiForceLimit = recentMulti - MultiXactMemberFreezeThreshold();
+	multiForceLimit = recentMulti - autovacuum_multixact_freeze_max_age;
 	if (multiForceLimit < FirstMultiXactId)
 		multiForceLimit -= FirstMultiXactId;
 
@@ -1939,7 +1939,7 @@ do_autovacuum(void)
 	 * normally autovacuum_multixact_freeze_max_age, but may be less if we are
 	 * short of multixact member space.
 	 */
-	effective_multixact_freeze_max_age = MultiXactMemberFreezeThreshold();
+	effective_multixact_freeze_max_age = autovacuum_multixact_freeze_max_age;
 
 	/*
 	 * Find the pg_database entry and select the default freeze ages. We use
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index 10de058ce91..5295108ade3 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -264,7 +264,7 @@ main(int argc, char *argv[])
 		   ControlFile->checkPointCopy.nextOid);
 	printf(_("Latest checkpoint's NextMultiXactId:  %u\n"),
 		   ControlFile->checkPointCopy.nextMulti);
-	printf(_("Latest checkpoint's NextMultiOffset:  %u\n"),
+	printf(_("Latest checkpoint's NextMultiOffset:  %" PRIu64 "\n"),
 		   ControlFile->checkPointCopy.nextMultiOffset);
 	printf(_("Latest checkpoint's oldestXID:        %u\n"),
 		   ControlFile->checkPointCopy.oldestXid);
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index a89d72fc5cf..4e5eeced89d 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -267,7 +267,7 @@ main(int argc, char *argv[])
 
 			case 'O':
 				errno = 0;
-				set_mxoff = strtoul(optarg, &endptr, 0);
+				set_mxoff = strtou64(optarg, &endptr, 0);
 				if (endptr == optarg || *endptr != '\0' || errno != 0)
 				{
 					pg_log_error("invalid argument for option %s", "-O");
@@ -743,7 +743,7 @@ PrintControlValues(bool guessed)
 		   ControlFile.checkPointCopy.nextOid);
 	printf(_("Latest checkpoint's NextMultiXactId:  %u\n"),
 		   ControlFile.checkPointCopy.nextMulti);
-	printf(_("Latest checkpoint's NextMultiOffset:  %u\n"),
+	printf(_("Latest checkpoint's NextMultiOffset:  %" PRIu64 "\n"),
 		   ControlFile.checkPointCopy.nextMultiOffset);
 	printf(_("Latest checkpoint's oldestXID:        %u\n"),
 		   ControlFile.checkPointCopy.oldestXid);
@@ -817,7 +817,7 @@ PrintNewControlValues(void)
 
 	if (set_mxoff != -1)
 	{
-		printf(_("NextMultiOffset:                      %u\n"),
+		printf(_("NextMultiOffset:                      %" PRIu64 "\n"),
 			   ControlFile.checkPointCopy.nextMultiOffset);
 	}
 
diff --git a/src/bin/pg_resetwal/t/001_basic.pl b/src/bin/pg_resetwal/t/001_basic.pl
index d6bbbd0ceda..cc89e0764ae 100644
--- a/src/bin/pg_resetwal/t/001_basic.pl
+++ b/src/bin/pg_resetwal/t/001_basic.pl
@@ -213,7 +213,7 @@ push @cmd,
   sprintf("%d,%d", hex($files[0]) == 0 ? 3 : hex($files[0]), hex($files[-1]));
 
 @files = get_slru_files('pg_multixact/offsets');
-$mult = 32 * $blcksz / 4;
+$mult = 32 * $blcksz / 8;
 # --multixact-ids argument is "new,old"
 push @cmd,
   '--multixact-ids' => sprintf("%d,%d",
diff --git a/src/include/access/multixact.h b/src/include/access/multixact.h
index 82e4bb90dd5..7d98fe0fe32 100644
--- a/src/include/access/multixact.h
+++ b/src/include/access/multixact.h
@@ -28,8 +28,6 @@
 
 #define MultiXactIdIsValid(multi) ((multi) != InvalidMultiXactId)
 
-#define MaxMultiXactOffset	((MultiXactOffset) 0xFFFFFFFF)
-
 /*
  * Possible multixact lock modes ("status").  The first four modes are for
  * tuple locks (FOR KEY SHARE, FOR SHARE, FOR NO KEY UPDATE, FOR UPDATE); the
@@ -147,7 +145,6 @@ extern void MultiXactSetNextMXact(MultiXactId nextMulti,
 extern void MultiXactAdvanceNextMXact(MultiXactId minMulti,
 									  MultiXactOffset minMultiOffset);
 extern void MultiXactAdvanceOldest(MultiXactId oldestMulti, Oid oldestMultiDB);
-extern int	MultiXactMemberFreezeThreshold(void);
 
 extern void multixact_twophase_recover(FullTransactionId fxid, uint16 info,
 									   void *recdata, uint32 len);
diff --git a/src/include/c.h b/src/include/c.h
index 757dfff4782..bc92a6f4565 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -670,7 +670,7 @@ typedef uint32 SubTransactionId;
 /* MultiXactId must be equivalent to TransactionId, to fit in t_xmax */
 typedef TransactionId MultiXactId;
 
-typedef uint32 MultiXactOffset;
+typedef uint64 MultiXactOffset;
 
 typedef uint32 CommandId;
 
-- 
2.47.3
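
A note on the 001_basic.pl change above (`$mult = 32 * $blcksz / 8`): widening MultiXactOffset from 4 to 8 bytes halves the number of offset entries that fit in one pg_multixact/offsets segment. A minimal sketch of that arithmetic follows. It assumes 32 pages per SLRU segment, as the test's formula implies; the `sketch_` and `SKETCH_` names are hypothetical and do not appear in the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 32 pages per SLRU segment, matching "32 * $blcksz" in the test. */
#define SKETCH_PAGES_PER_SEGMENT 32

/*
 * Number of offset entries that fit in one offsets segment, given the block
 * size and the on-disk size of one entry in bytes.
 */
static inline uint64_t
sketch_offsets_per_segment(uint32_t blcksz, uint32_t entry_size)
{
	assert(blcksz > 0 && entry_size > 0);
	return (uint64_t) SKETCH_PAGES_PER_SEGMENT * blcksz / entry_size;
}
```

With an 8 kB block size this gives 65536 entries per segment for 4-byte offsets and 32768 for 8-byte offsets, which is why the test's multiplier changes.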

v22-0002-Add-pg_upgarde-for-64-bit-multixact-offsets.patch (text/x-patch; charset=UTF-8)
From c23ffcebfc4a807d3090bfcf63998abdb7368eb3 Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Fri, 24 Oct 2025 10:58:37 +0300
Subject: [PATCH v22 2/5] Add pg_upgrade for 64-bit multixact offsets

Author: Maxim Orlov <orlovmg@gmail.com>
Author: Heikki Linnakangas <hlinnaka@iki.fi>
---
 src/backend/access/transam/multixact.c |  56 -----
 src/bin/pg_upgrade/Makefile            |   3 +
 src/bin/pg_upgrade/meson.build         |   4 +
 src/bin/pg_upgrade/multixact_new.c     | 174 +++++++++++++++
 src/bin/pg_upgrade/multixact_new.h     |  23 ++
 src/bin/pg_upgrade/multixact_old.c     | 297 +++++++++++++++++++++++++
 src/bin/pg_upgrade/multixact_old.h     |  29 +++
 src/bin/pg_upgrade/pg_upgrade.c        | 108 ++++++++-
 src/bin/pg_upgrade/pg_upgrade.h        |   5 +
 src/bin/pg_upgrade/slru_io.c           | 239 ++++++++++++++++++++
 src/bin/pg_upgrade/slru_io.h           |  23 ++
 src/tools/pgindent/typedefs.list       |   3 +
 12 files changed, 902 insertions(+), 62 deletions(-)
 create mode 100644 src/bin/pg_upgrade/multixact_new.c
 create mode 100644 src/bin/pg_upgrade/multixact_new.h
 create mode 100644 src/bin/pg_upgrade/multixact_old.c
 create mode 100644 src/bin/pg_upgrade/multixact_old.h
 create mode 100644 src/bin/pg_upgrade/slru_io.c
 create mode 100644 src/bin/pg_upgrade/slru_io.h

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 68fd3441816..eb5a1d37ce6 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1921,48 +1921,6 @@ BootStrapMultiXact(void)
 	SimpleLruZeroAndWritePage(MultiXactMemberCtl, 0);
 }
 
-/*
- * MaybeExtendOffsetSlru
- *		Extend the offsets SLRU area, if necessary
- *
- * After a binary upgrade from <= 9.2, the pg_multixact/offsets SLRU area might
- * contain files that are shorter than necessary; this would occur if the old
- * installation had used multixacts beyond the first page (files cannot be
- * copied, because the on-disk representation is different).  pg_upgrade would
- * update pg_control to set the next offset value to be at that position, so
- * that tuples marked as locked by such MultiXacts would be seen as visible
- * without having to consult multixact.  However, trying to create and use a
- * new MultiXactId would result in an error because the page on which the new
- * value would reside does not exist.  This routine is in charge of creating
- * such pages.
- */
-static void
-MaybeExtendOffsetSlru(void)
-{
-	int64		pageno;
-	LWLock	   *lock;
-
-	pageno = MultiXactIdToOffsetPage(MultiXactState->nextMXact);
-	lock = SimpleLruGetBankLock(MultiXactOffsetCtl, pageno);
-
-	LWLockAcquire(lock, LW_EXCLUSIVE);
-
-	if (!SimpleLruDoesPhysicalPageExist(MultiXactOffsetCtl, pageno))
-	{
-		int			slotno;
-
-		/*
-		 * Fortunately for us, SimpleLruWritePage is already prepared to deal
-		 * with creating a new segment file even if the page we're writing is
-		 * not the first in it, so this is enough.
-		 */
-		slotno = SimpleLruZeroPage(MultiXactOffsetCtl, pageno);
-		SimpleLruWritePage(MultiXactOffsetCtl, slotno);
-	}
-
-	LWLockRelease(lock);
-}
-
 /*
  * This must be called ONCE during postmaster or standalone-backend startup.
  *
@@ -2155,20 +2113,6 @@ MultiXactSetNextMXact(MultiXactId nextMulti,
 	MultiXactState->nextMXact = nextMulti;
 	MultiXactState->nextOffset = nextMultiOffset;
 	LWLockRelease(MultiXactGenLock);
-
-	/*
-	 * During a binary upgrade, make sure that the offsets SLRU is large
-	 * enough to contain the next value that would be created.
-	 *
-	 * We need to do this pretty early during the first startup in binary
-	 * upgrade mode: before StartupMultiXact() in fact, because this routine
-	 * is called even before that by StartupXLOG().  And we can't do it
-	 * earlier than at this point, because during that first call of this
-	 * routine we determine the MultiXactState->nextMXact value that
-	 * MaybeExtendOffsetSlru needs.
-	 */
-	if (IsBinaryUpgrade)
-		MaybeExtendOffsetSlru();
 }
 
 /*
diff --git a/src/bin/pg_upgrade/Makefile b/src/bin/pg_upgrade/Makefile
index 69fcf593cae..42995d53b0b 100644
--- a/src/bin/pg_upgrade/Makefile
+++ b/src/bin/pg_upgrade/Makefile
@@ -18,11 +18,14 @@ OBJS = \
 	file.o \
 	function.o \
 	info.o \
+	multixact_new.o \
+	multixact_old.o \
 	option.o \
 	parallel.o \
 	pg_upgrade.o \
 	relfilenumber.o \
 	server.o \
+	slru_io.o \
 	tablespace.o \
 	task.o \
 	util.o \
diff --git a/src/bin/pg_upgrade/meson.build b/src/bin/pg_upgrade/meson.build
index ac992f0d14b..952bcbe7435 100644
--- a/src/bin/pg_upgrade/meson.build
+++ b/src/bin/pg_upgrade/meson.build
@@ -8,11 +8,14 @@ pg_upgrade_sources = files(
   'file.c',
   'function.c',
   'info.c',
+  'multixact_new.c',
+  'multixact_old.c',
   'option.c',
   'parallel.c',
   'pg_upgrade.c',
   'relfilenumber.c',
   'server.c',
+  'slru_io.c',
   'tablespace.c',
   'task.c',
   'util.c',
@@ -47,6 +50,7 @@ tests += {
       't/004_subscription.pl',
       't/005_char_signedness.pl',
       't/006_transfer_modes.pl',
+      't/007_mxoff.pl',
     ],
     'test_kwargs': {'priority': 40}, # pg_upgrade tests are slow
   },
diff --git a/src/bin/pg_upgrade/multixact_new.c b/src/bin/pg_upgrade/multixact_new.c
new file mode 100644
index 00000000000..5db7af5b12d
--- /dev/null
+++ b/src/bin/pg_upgrade/multixact_new.c
@@ -0,0 +1,174 @@
+/*
+ * multixact_new.c
+ *
+ * Functions to write multixacts in the v19 format with 64-bit
+ * MultiXactOffsets
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ * src/bin/pg_upgrade/multixact_new.c
+ */
+
+#include "postgres_fe.h"
+
+#include "multixact_new.h"
+
+/*
+ * NOTE: Below are a bunch of definitions and simple inline functions that are
+ * copy-pasted from multixact.c
+ */
+
+/* We need eight bytes per offset */
+#define MULTIXACT_OFFSETS_PER_PAGE (BLCKSZ / sizeof(MultiXactOffset))
+
+static inline int64
+MultiXactIdToOffsetPage(MultiXactId multi)
+{
+	return multi / MULTIXACT_OFFSETS_PER_PAGE;
+}
+
+static inline int
+MultiXactIdToOffsetEntry(MultiXactId multi)
+{
+	return multi % MULTIXACT_OFFSETS_PER_PAGE;
+}
+
+/* We need eight bits per xact, so one xact fits in a byte */
+#define MXACT_MEMBER_BITS_PER_XACT			8
+#define MXACT_MEMBER_FLAGS_PER_BYTE			1
+#define MXACT_MEMBER_XACT_BITMASK	((1 << MXACT_MEMBER_BITS_PER_XACT) - 1)
+
+/* how many full bytes of flags are there in a group? */
+#define MULTIXACT_FLAGBYTES_PER_GROUP		4
+#define MULTIXACT_MEMBERS_PER_MEMBERGROUP	\
+	(MULTIXACT_FLAGBYTES_PER_GROUP * MXACT_MEMBER_FLAGS_PER_BYTE)
+/* size in bytes of a complete group */
+#define MULTIXACT_MEMBERGROUP_SIZE \
+	(sizeof(TransactionId) * MULTIXACT_MEMBERS_PER_MEMBERGROUP + MULTIXACT_FLAGBYTES_PER_GROUP)
+#define MULTIXACT_MEMBERGROUPS_PER_PAGE (BLCKSZ / MULTIXACT_MEMBERGROUP_SIZE)
+#define MULTIXACT_MEMBERS_PER_PAGE	\
+	(MULTIXACT_MEMBERGROUPS_PER_PAGE * MULTIXACT_MEMBERS_PER_MEMBERGROUP)
+
+/* page in which a member is to be found */
+static inline int64
+MXOffsetToMemberPage(MultiXactOffset offset)
+{
+	return offset / MULTIXACT_MEMBERS_PER_PAGE;
+}
+
+/* Location (byte offset within page) of flag word for a given member */
+static inline int
+MXOffsetToFlagsOffset(MultiXactOffset offset)
+{
+	MultiXactOffset group = offset / MULTIXACT_MEMBERS_PER_MEMBERGROUP;
+	int			grouponpg = group % MULTIXACT_MEMBERGROUPS_PER_PAGE;
+	int			byteoff = grouponpg * MULTIXACT_MEMBERGROUP_SIZE;
+
+	return byteoff;
+}
+
+static inline int
+MXOffsetToFlagsBitShift(MultiXactOffset offset)
+{
+	int			member_in_group = offset % MULTIXACT_MEMBERS_PER_MEMBERGROUP;
+	int			bshift = member_in_group * MXACT_MEMBER_BITS_PER_XACT;
+
+	return bshift;
+}
+
+/* Location (byte offset within page) of TransactionId of given member */
+static inline int
+MXOffsetToMemberOffset(MultiXactOffset offset)
+{
+	int			member_in_group = offset % MULTIXACT_MEMBERS_PER_MEMBERGROUP;
+
+	return MXOffsetToFlagsOffset(offset) +
+		MULTIXACT_FLAGBYTES_PER_GROUP +
+		member_in_group * sizeof(TransactionId);
+}
+
+MultiXactWriter *
+AllocMultiXactWrite(const char *pgdata, MultiXactId firstMulti,
+					MultiXactOffset firstOffset)
+{
+	MultiXactWriter *state = pg_malloc(sizeof(*state));
+	char		dir[MAXPGPATH] = {0};
+
+	pg_sprintf(dir, "%s/pg_multixact/offsets", pgdata);
+	state->offset = AllocSlruWrite(dir, false);
+	SlruWriteSwitchPage(state->offset, MultiXactIdToOffsetPage(firstMulti));
+
+	pg_sprintf(dir, "%s/pg_multixact/members", pgdata);
+	state->members = AllocSlruWrite(dir, true /* use long segment names */ );
+	SlruWriteSwitchPage(state->members, MXOffsetToMemberPage(firstOffset));
+
+	return state;
+}
+
+/*
+ * Write a new multixact with members.
+ *
+ * Simplified version of the corresponding server function, hence the name.
+ */
+void
+RecordNewMultiXact(MultiXactWriter *state, MultiXactOffset offset,
+				   MultiXactId multi, int nmembers, MultiXactMember *members)
+{
+	int64		pageno;
+	int64		prev_pageno;
+	int			entryno;
+	char	   *buf;
+	MultiXactOffset *offptr;
+
+	pageno = MultiXactIdToOffsetPage(multi);
+	entryno = MultiXactIdToOffsetEntry(multi);
+
+	/* Store the offset */
+	buf = SlruWriteSwitchPage(state->offset, pageno);
+	offptr = (MultiXactOffset *) buf;
+	offptr[entryno] = offset;
+
+	/* Store the members */
+	prev_pageno = -1;
+	for (int i = 0; i < nmembers; i++, offset++)
+	{
+		TransactionId *memberptr;
+		uint32	   *flagsptr;
+		uint32		flagsval;
+		int			bshift;
+		int			flagsoff;
+		int			memberoff;
+
+		Assert(members[i].status <= MultiXactStatusUpdate);
+
+		pageno = MXOffsetToMemberPage(offset);
+		memberoff = MXOffsetToMemberOffset(offset);
+		flagsoff = MXOffsetToFlagsOffset(offset);
+		bshift = MXOffsetToFlagsBitShift(offset);
+
+		if (pageno != prev_pageno)
+		{
+			buf = SlruWriteSwitchPage(state->members, pageno);
+			prev_pageno = pageno;
+		}
+
+		memberptr = (TransactionId *) (buf + memberoff);
+
+		*memberptr = members[i].xid;
+
+		flagsptr = (uint32 *) (buf + flagsoff);
+
+		flagsval = *flagsptr;
+		flagsval &= ~(((1 << MXACT_MEMBER_BITS_PER_XACT) - 1) << bshift);
+		flagsval |= (members[i].status << bshift);
+		*flagsptr = flagsval;
+	}
+}
+
+void
+FreeMultiXactWrite(MultiXactWriter *state)
+{
+	FreeSlruWrite(state->offset);
+	FreeSlruWrite(state->members);
+
+	pfree(state);
+}
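
For readers following the member-page helpers in multixact_new.c above, here is a standalone sketch of the same layout arithmetic: each group holds 4 flag bytes followed by 4 TransactionIds (20 bytes), and a 64-bit member offset maps to a page, a flag-word position, a bit shift within that flag word, and an xid position. It assumes the default 8192-byte BLCKSZ and a 4-byte TransactionId; the `Sketch`/`SKETCH_` names are hypothetical and are not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_BLCKSZ            8192	/* assumed default block size */
#define SKETCH_FLAGBYTES         4		/* flag bytes per group, one per member */
#define SKETCH_MEMBERS_PER_GROUP 4
#define SKETCH_XID_SIZE          4		/* assumed sizeof(TransactionId) */
#define SKETCH_GROUP_SIZE \
	(SKETCH_XID_SIZE * SKETCH_MEMBERS_PER_GROUP + SKETCH_FLAGBYTES)
#define SKETCH_GROUPS_PER_PAGE   (SKETCH_BLCKSZ / SKETCH_GROUP_SIZE)
#define SKETCH_MEMBERS_PER_PAGE \
	(SKETCH_GROUPS_PER_PAGE * SKETCH_MEMBERS_PER_GROUP)

typedef struct SketchMemberLoc
{
	int64_t		pageno;		/* members page holding this offset */
	int			flagsoff;	/* byte offset of the group's flag word in the page */
	int			bshift;		/* bit shift of this member's flags in that word */
	int			memberoff;	/* byte offset of this member's xid in the page */
} SketchMemberLoc;

/* Map a 64-bit member offset to its on-page location. */
static inline SketchMemberLoc
sketch_locate_member(uint64_t offset)
{
	SketchMemberLoc loc;
	uint64_t	group = offset / SKETCH_MEMBERS_PER_GROUP;
	int			member_in_group = (int) (offset % SKETCH_MEMBERS_PER_GROUP);

	loc.pageno = (int64_t) (offset / SKETCH_MEMBERS_PER_PAGE);
	loc.flagsoff = (int) (group % SKETCH_GROUPS_PER_PAGE) * SKETCH_GROUP_SIZE;
	loc.bshift = member_in_group * 8;
	loc.memberoff = loc.flagsoff + SKETCH_FLAGBYTES +
		member_in_group * SKETCH_XID_SIZE;

	/* The xid must lie entirely within the page. */
	assert(loc.memberoff + SKETCH_XID_SIZE <= SKETCH_BLCKSZ);
	return loc;
}
```

Under these assumptions a page holds 409 groups, i.e. 1636 members, so offset 1637 is the second member of the first group on page 1. The page number is computed in 64-bit arithmetic, so offsets above 2^32 simply map to higher page numbers instead of wrapping.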
diff --git a/src/bin/pg_upgrade/multixact_new.h b/src/bin/pg_upgrade/multixact_new.h
new file mode 100644
index 00000000000..f66e6af7e45
--- /dev/null
+++ b/src/bin/pg_upgrade/multixact_new.h
@@ -0,0 +1,23 @@
+/*
+ * multixact_new.h
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ * src/bin/pg_upgrade/multixact_new.h
+ */
+#include "access/multixact.h"
+
+#include "slru_io.h"
+
+typedef struct MultiXactWriter
+{
+	SlruSegState *offset;
+	SlruSegState *members;
+} MultiXactWriter;
+
+extern MultiXactWriter *AllocMultiXactWrite(const char *pgdata,
+											MultiXactId firstMulti,
+											MultiXactOffset firstOffset);
+extern void RecordNewMultiXact(MultiXactWriter *state, MultiXactOffset offset,
+							   MultiXactId multi, int nmembers,
+							   MultiXactMember *members);
+extern void FreeMultiXactWrite(MultiXactWriter *writer);
diff --git a/src/bin/pg_upgrade/multixact_old.c b/src/bin/pg_upgrade/multixact_old.c
new file mode 100644
index 00000000000..70ae88d97f4
--- /dev/null
+++ b/src/bin/pg_upgrade/multixact_old.c
@@ -0,0 +1,297 @@
+/*
+ * multixact_old.c
+ *
+ * Functions to read pre-v19 multixacts
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ * src/bin/pg_upgrade/multixact_old.c
+ */
+
+#include "postgres_fe.h"
+
+#include "multixact_old.h"
+#include "pg_upgrade.h"
+
+/*
+ * NOTE: below are a bunch of definitions and simple static inline functions
+ * that are copy-pasted from multixact.c from version 18.  The only difference
+ * is that we use the OldMultiXactOffset type equal to uint32 instead of
+ * MultiXactOffset which became uint64.
+ */
+
+/* We need four bytes per offset */
+#define MULTIXACT_OFFSETS_PER_PAGE (BLCKSZ / sizeof(OldMultiXactOffset))
+
+static inline int64
+MultiXactIdToOffsetPage(MultiXactId multi)
+{
+	return multi / MULTIXACT_OFFSETS_PER_PAGE;
+}
+
+static inline int
+MultiXactIdToOffsetEntry(MultiXactId multi)
+{
+	return multi % MULTIXACT_OFFSETS_PER_PAGE;
+}
+
+/*
+ * The situation for members is a bit more complex: we store one byte of
+ * additional flag bits for each TransactionId.  To do this without getting
+ * into alignment issues, we store four bytes of flags, and then the
+ * corresponding 4 Xids.  We call each such 5-word (20-byte) set a "group";
+ * groups are stored whole in pages.  Thus, with 8kB BLCKSZ, we keep 409 groups
+ * per page.  This wastes 12 bytes per page, but that's OK -- simplicity (and
+ * performance) trumps space efficiency here.
+ *
+ * Note that the "offset" macros work with byte offset, not array indexes, so
+ * arithmetic must be done using "char *" pointers.
+ */
+/* We need eight bits per xact, so one xact fits in a byte */
+#define MXACT_MEMBER_BITS_PER_XACT			8
+#define MXACT_MEMBER_FLAGS_PER_BYTE			1
+#define MXACT_MEMBER_XACT_BITMASK	((1 << MXACT_MEMBER_BITS_PER_XACT) - 1)
+
+/* how many full bytes of flags are there in a group? */
+#define MULTIXACT_FLAGBYTES_PER_GROUP		4
+#define MULTIXACT_MEMBERS_PER_MEMBERGROUP	\
+	(MULTIXACT_FLAGBYTES_PER_GROUP * MXACT_MEMBER_FLAGS_PER_BYTE)
+/* size in bytes of a complete group */
+#define MULTIXACT_MEMBERGROUP_SIZE \
+	(sizeof(TransactionId) * MULTIXACT_MEMBERS_PER_MEMBERGROUP + MULTIXACT_FLAGBYTES_PER_GROUP)
+#define MULTIXACT_MEMBERGROUPS_PER_PAGE (BLCKSZ / MULTIXACT_MEMBERGROUP_SIZE)
+#define MULTIXACT_MEMBERS_PER_PAGE	\
+	(MULTIXACT_MEMBERGROUPS_PER_PAGE * MULTIXACT_MEMBERS_PER_MEMBERGROUP)
+
+/* page in which a member is to be found */
+static inline int64
+MXOffsetToMemberPage(OldMultiXactOffset offset)
+{
+	return offset / MULTIXACT_MEMBERS_PER_PAGE;
+}
+
+/* Location (byte offset within page) of flag word for a given member */
+static inline int
+MXOffsetToFlagsOffset(OldMultiXactOffset offset)
+{
+	OldMultiXactOffset group = offset / MULTIXACT_MEMBERS_PER_MEMBERGROUP;
+	int			grouponpg = group % MULTIXACT_MEMBERGROUPS_PER_PAGE;
+	int			byteoff = grouponpg * MULTIXACT_MEMBERGROUP_SIZE;
+
+	return byteoff;
+}
+
+/* Location (byte offset within page) of TransactionId of given member */
+static inline int
+MXOffsetToMemberOffset(OldMultiXactOffset offset)
+{
+	int			member_in_group = offset % MULTIXACT_MEMBERS_PER_MEMBERGROUP;
+
+	return MXOffsetToFlagsOffset(offset) +
+		MULTIXACT_FLAGBYTES_PER_GROUP +
+		member_in_group * sizeof(TransactionId);
+}
+
+static inline int
+MXOffsetToFlagsBitShift(OldMultiXactOffset offset)
+{
+	int			member_in_group = offset % MULTIXACT_MEMBERS_PER_MEMBERGROUP;
+	int			bshift = member_in_group * MXACT_MEMBER_BITS_PER_XACT;
+
+	return bshift;
+}
+
+/*
+ * Construct reader of old multixacts.
+ *
+ * Returns the malloced memory used by all the other calls in this module.
+ */
+OldMultiXactReader *
+AllocOldMultiXactRead(char *pgdata, MultiXactId nextMulti,
+					  OldMultiXactOffset nextOffset)
+{
+	OldMultiXactReader *state = pg_malloc(sizeof(*state));
+	char		dir[MAXPGPATH] = {0};
+
+	state->nextMXact = nextMulti;
+	state->nextOffset = nextOffset;
+
+	pg_sprintf(dir, "%s/pg_multixact/offsets", pgdata);
+	state->offset = AllocSlruRead(dir);
+
+	pg_sprintf(dir, "%s/pg_multixact/members", pgdata);
+	state->members = AllocSlruRead(dir);
+
+	return state;
+}
+
+/*
+ * This is a simplified version of the GetMultiXactIdMembers() server function.
+ *
+ * - Only return the updating member, if any. Upgrade only cares about the
+ *   updaters. If there is no updating member, return the first locking-only
+ *   member. We don't have any way to represent "no members", but we also don't
+ *   need to preserve all the locking members.
+ *
+ * - We don't need to worry about locking and some corner cases because there's
+ *   no concurrent activity.
+ */
+void
+GetOldMultiXactIdSingleMember(OldMultiXactReader *state, MultiXactId multi,
+							  TransactionId *result, MultiXactStatus *status)
+{
+	MultiXactId nextMXact,
+				tmpMXact;
+	OldMultiXactOffset nextOffset;
+	int64		pageno,
+				prev_pageno;
+	int			entryno,
+				length;
+	char	   *buf;
+	OldMultiXactOffset *offptr,
+				offset;
+	TransactionId result_xid = InvalidTransactionId;
+	bool		result_isupdate = false;
+
+	nextMXact = state->nextMXact;
+	nextOffset = state->nextOffset;
+
+	/*
+	 * See GetMultiXactIdMembers in multixact.c
+	 *
+	 * Find out the offset at which we need to start reading MultiXactMembers
+	 * and the number of members in the multixact.  We determine the latter as
+	 * the difference between this multixact's starting offset and the next
+	 * one's.  However, there are some corner cases to worry about:
+	 *
+	 * 1. This multixact may be the latest one created, in which case there is
+	 * no next one to look at.  In this case the nextOffset value we just
+	 * saved is the correct endpoint.
+	 *
+	 * 2. The next multixact may still be in process of being filled in...
+	 * This cannot happen during upgrade.
+	 *
+	 * 3. Because GetNewMultiXactId increments offset zero to offset one to
+	 * handle case #2, there is an ambiguity near the point of offset
+	 * wraparound.  If we see next multixact's offset is one, is that our
+	 * multixact's actual endpoint, or did it end at zero with a subsequent
+	 * increment?  We handle this using the knowledge that if the zero'th
+	 * member slot wasn't filled, it'll contain zero, and zero isn't a valid
+	 * transaction ID so it can't be a multixact member.  Therefore, if we
+	 * read a zero from the members array, just ignore it.
+	 */
+
+	pageno = MultiXactIdToOffsetPage(multi);
+	entryno = MultiXactIdToOffsetEntry(multi);
+
+	buf = SlruReadSwitchPage(state->offset, pageno);
+	offptr = (OldMultiXactOffset *) buf;
+	offptr += entryno;
+	offset = *offptr;
+
+	Assert(offset != 0);
+
+	/*
+	 * Use the same increment rule as GetNewMultiXactId(), that is, don't
+	 * handle wraparound explicitly until needed.
+	 */
+	tmpMXact = multi + 1;
+
+	if (nextMXact == tmpMXact)
+	{
+		/* Corner case 1: there is no next multixact */
+		length = nextOffset - offset;
+	}
+	else
+	{
+		OldMultiXactOffset nextMXOffset;
+
+		/* handle wraparound if needed */
+		if (tmpMXact < FirstMultiXactId)
+			tmpMXact = FirstMultiXactId;
+
+		prev_pageno = pageno;
+
+		pageno = MultiXactIdToOffsetPage(tmpMXact);
+		entryno = MultiXactIdToOffsetEntry(tmpMXact);
+
+		if (pageno != prev_pageno)
+			buf = SlruReadSwitchPage(state->offset, pageno);
+
+		offptr = (OldMultiXactOffset *) buf;
+		offptr += entryno;
+		nextMXOffset = *offptr;
+
+		/*
+		 * Corner case 2: next multixact is still being filled in, this must
+		 * not happen during upgrade.
+		 */
+		Assert(nextMXOffset != 0);
+
+		length = nextMXOffset - offset;
+	}
+
+	prev_pageno = -1;
+	for (int i = 0; i < length; i++, offset++)
+	{
+		TransactionId *xactptr;
+		uint32	   *flagsptr;
+		int			flagsoff;
+		int			bshift;
+		int			memberoff;
+		MultiXactStatus st;
+
+		pageno = MXOffsetToMemberPage(offset);
+		memberoff = MXOffsetToMemberOffset(offset);
+
+		if (pageno != prev_pageno)
+		{
+			buf = SlruReadSwitchPage(state->members, pageno);
+			prev_pageno = pageno;
+		}
+
+		xactptr = (TransactionId *) (buf + memberoff);
+		if (!TransactionIdIsValid(*xactptr))
+		{
+			/* Corner case 3: we must be looking at unused slot zero */
+			Assert(offset == 0);
+			continue;
+		}
+
+		flagsoff = MXOffsetToFlagsOffset(offset);
+		bshift = MXOffsetToFlagsBitShift(offset);
+		flagsptr = (uint32 *) (buf + flagsoff);
+
+		st = (*flagsptr >> bshift) & MXACT_MEMBER_XACT_BITMASK;
+
+		/* Verify that there is a single update Xid among the given members. */
+		if (ISUPDATE_from_mxstatus(st))
+		{
+			if (result_isupdate)
+				pg_fatal("multixact %u has more than one updating member",
+						 multi);
+			result_xid = *xactptr;
+			result_isupdate = true;
+		}
+		else if (!TransactionIdIsValid(result_xid))
+			result_xid = *xactptr;
+	}
+
+	/* A multixid with zero members should not happen */
+	Assert(TransactionIdIsValid(result_xid));
+
+	*result = result_xid;
+	*status = result_isupdate ? MultiXactStatusUpdate :
+		MultiXactStatusForKeyShare;
+}
+
+/*
+ * Frees the malloced reader.
+ */
+void
+FreeOldMultiXactReader(OldMultiXactReader *state)
+{
+	FreeSlruRead(state->offset);
+	FreeSlruRead(state->members);
+
+	pfree(state);
+}
diff --git a/src/bin/pg_upgrade/multixact_old.h b/src/bin/pg_upgrade/multixact_old.h
new file mode 100644
index 00000000000..8eb5af2ccaf
--- /dev/null
+++ b/src/bin/pg_upgrade/multixact_old.h
@@ -0,0 +1,29 @@
+/*
+ * multixact_old.h
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ * src/bin/pg_upgrade/multixact_old.h
+ */
+
+#include "access/multixact.h"
+#include "slru_io.h"
+
+typedef uint32 OldMultiXactOffset;
+
+typedef struct OldMultiXactReader
+{
+	MultiXactId nextMXact;
+	OldMultiXactOffset nextOffset;
+
+	SlruSegState *offset;
+	SlruSegState *members;
+} OldMultiXactReader;
+
+extern OldMultiXactReader *AllocOldMultiXactRead(char *pgdata,
+												 MultiXactId nextMulti,
+												 OldMultiXactOffset nextOffset);
+extern void GetOldMultiXactIdSingleMember(OldMultiXactReader *state,
+										  MultiXactId multi,
+										  TransactionId *result,
+										  MultiXactStatus *status);
+extern void FreeOldMultiXactReader(OldMultiXactReader *reader);
diff --git a/src/bin/pg_upgrade/pg_upgrade.c b/src/bin/pg_upgrade/pg_upgrade.c
index 490e98fa26f..0fdd05c127c 100644
--- a/src/bin/pg_upgrade/pg_upgrade.c
+++ b/src/bin/pg_upgrade/pg_upgrade.c
@@ -48,6 +48,8 @@
 #include "common/logging.h"
 #include "common/restricted_token.h"
 #include "fe_utils/string_utils.h"
+#include "multixact_old.h"
+#include "multixact_new.h"
 #include "pg_upgrade.h"
 
 /*
@@ -769,6 +771,81 @@ copy_subdir_files(const char *old_subdir, const char *new_subdir)
 	check_ok();
 }
 
+/*
+ * Convert pg_multixact/offset and /members to new format with 64-bit offsets.
+ */
+static void
+convert_multixacts(MultiXactId *new_nxtmulti, MultiXactOffset *new_nxtmxoff)
+{
+	MultiXactId oldest_multi,
+				next_multi;
+	OldMultiXactReader *old_reader;
+	MultiXactWriter *new_writer;
+	MultiXactOffset next_offset;
+
+	/*
+	 * The range of valid multi XIDs is unchanged by the conversion (they are
+	 * referenced from the heap tables), but the members SLRU is rewritten to
+	 * start from offset 1.
+	 */
+	oldest_multi = old_cluster.controldata.chkpnt_oldstMulti;
+	next_multi = old_cluster.controldata.chkpnt_nxtmulti;
+	next_offset = 1;
+
+	old_reader = AllocOldMultiXactRead(old_cluster.pgdata,
+									   old_cluster.controldata.chkpnt_nxtmulti,
+									   old_cluster.controldata.chkpnt_nxtmxoff);
+	new_writer = AllocMultiXactWrite(new_cluster.pgdata,
+									 oldest_multi, next_offset);
+
+	/* handle wraparound */
+	if (next_multi < FirstMultiXactId)
+		next_multi = FirstMultiXactId;
+
+	/*
+	 * Read multixids from old files one by one, and write them back in the
+	 * new format.
+	 */
+	for (MultiXactId multi = oldest_multi; multi != next_multi;)
+	{
+		TransactionId xid;
+		MultiXactStatus status;
+		MultiXactMember member;
+
+		/*
+		 * Read the old multixid.  The locking-only XIDs that may be part of
+		 * multi-xids don't matter after upgrade, as there can be no
+		 * transactions running across upgrade.  So as a little optimization,
+		 * we only read one member from each multixid: the one updating one,
+		 * or if there was no update, arbitrarily the first locking xid.
+		 */
+		GetOldMultiXactIdSingleMember(old_reader, multi, &xid, &status);
+
+		/* Write it out in new format */
+		member.xid = xid;
+		member.status = status;
+		RecordNewMultiXact(new_writer, next_offset, multi, 1, &member);
+
+		next_offset += 1;
+		multi++;
+		/* handle wraparound */
+		if (multi < FirstMultiXactId)
+			multi = FirstMultiXactId;
+	}
+
+	/*
+	 * Update the nextMXact/Offset values in the control file to match what we
+	 * wrote.  The nextMXact is unchanged, but nextOffset will be different.
+	 */
+	Assert(next_multi == old_cluster.controldata.chkpnt_nxtmulti);
+	*new_nxtmulti = next_multi;
+	*new_nxtmxoff = next_offset;
+
+	/* Release resources */
+	FreeMultiXactWrite(new_writer);
+	FreeOldMultiXactReader(old_reader);
+}
+
 static void
 copy_xact_xlog_xid(void)
 {
@@ -816,8 +893,29 @@ copy_xact_xlog_xid(void)
 	if (old_cluster.controldata.cat_ver >= MULTIXACT_FORMATCHANGE_CAT_VER &&
 		new_cluster.controldata.cat_ver >= MULTIXACT_FORMATCHANGE_CAT_VER)
 	{
-		copy_subdir_files("pg_multixact/offsets", "pg_multixact/offsets");
-		copy_subdir_files("pg_multixact/members", "pg_multixact/members");
+		MultiXactId new_nxtmulti = old_cluster.controldata.chkpnt_nxtmulti;
+		MultiXactOffset new_nxtmxoff = old_cluster.controldata.chkpnt_nxtmxoff;
+
+		/*
+		 * If the old server is before the
+		 * MULTIXACTOFFSET_FORMATCHANGE_CAT_VER it must have 32-bit multixid
+		 * offsets, thus it should be converted.
+		 */
+		if (old_cluster.controldata.cat_ver < MULTIXACTOFFSET_FORMATCHANGE_CAT_VER &&
+			new_cluster.controldata.cat_ver >= MULTIXACTOFFSET_FORMATCHANGE_CAT_VER)
+		{
+			remove_new_subdir("pg_multixact/members", false);
+			remove_new_subdir("pg_multixact/offsets", false);
+
+			prep_status("Converting pg_multixact/offsets to 64-bit");
+			convert_multixacts(&new_nxtmulti, &new_nxtmxoff);
+			check_ok();
+		}
+		else
+		{
+			copy_subdir_files("pg_multixact/offsets", "pg_multixact/offsets");
+			copy_subdir_files("pg_multixact/members", "pg_multixact/members");
+		}
 
 		prep_status("Setting next multixact ID and offset for new cluster");
 
@@ -826,10 +924,8 @@ copy_xact_xlog_xid(void)
 		 * counters here and the oldest multi present on system.
 		 */
 		exec_prog(UTILITY_LOG_FILE, NULL, true, true,
-				  "\"%s/pg_resetwal\" -O %u -m %u,%u \"%s\"",
-				  new_cluster.bindir,
-				  old_cluster.controldata.chkpnt_nxtmxoff,
-				  old_cluster.controldata.chkpnt_nxtmulti,
+				  "\"%s/pg_resetwal\" -O %" PRIu64 " -m %u,%u \"%s\"",
+				  new_cluster.bindir, new_nxtmxoff, new_nxtmulti,
 				  old_cluster.controldata.chkpnt_oldstMulti,
 				  new_cluster.pgdata);
 		check_ok();
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index e86336f4be9..127b2cb00fa 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -114,6 +114,11 @@ extern char *output_files[];
  */
 #define MULTIXACT_FORMATCHANGE_CAT_VER 201301231
 
+/*
+ * Switch from 32-bit to 64-bit for multixid offsets.
+ */
+#define MULTIXACTOFFSET_FORMATCHANGE_CAT_VER 999999999
+
 /*
  * large object chunk size added to pg_controldata,
  * commit 5f93c37805e7485488480916b4585e098d3cc883
diff --git a/src/bin/pg_upgrade/slru_io.c b/src/bin/pg_upgrade/slru_io.c
new file mode 100644
index 00000000000..2a0624ea8b8
--- /dev/null
+++ b/src/bin/pg_upgrade/slru_io.c
@@ -0,0 +1,239 @@
+/*
+ * slru_io.c
+ *
+ * Routines for reading and writing SLRU files during upgrade.
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ * src/bin/pg_upgrade/slru_io.c
+ */
+
+#include "postgres_fe.h"
+
+#include <fcntl.h>
+
+#include "common/fe_memutils.h"
+#include "common/file_perm.h"
+#include "common/file_utils.h"
+#include "port/pg_iovec.h"
+#include "pg_upgrade.h"
+#include "slru_io.h"
+
+/*
+ * State for reading or writing an SLRU, with a one page buffer.
+ */
+typedef struct SlruSegState
+{
+	bool		writing;
+	bool		long_segment_names;
+
+	char	   *dir;
+	char	   *fn;
+	int			fd;
+	int64		segno;
+	uint64		pageno;
+
+	PGAlignedBlock buf;
+} SlruSegState;
+
+static inline SlruSegState *
+AllocSlruSegState(char *dir)
+{
+	SlruSegState *state = pg_malloc(sizeof(*state));
+
+	state->segno = -1;
+	state->pageno = 0;
+	state->dir = pstrdup(dir);
+	state->fd = -1;
+	state->fn = NULL;
+
+	return state;
+}
+
+static inline void
+SlruFlush(SlruSegState *state)
+{
+	struct iovec iovec = {
+		.iov_base = &state->buf,
+		.iov_len = BLCKSZ,
+	};
+	off_t		offset;
+
+	if (state->segno == -1)
+		return;
+
+	offset = (state->pageno % SLRU_PAGES_PER_SEGMENT) * BLCKSZ;
+
+	if (pg_pwritev_with_retry(state->fd, &iovec, 1, offset) < 0)
+		pg_fatal("could not write file \"%s\": %m", state->fn);
+}
+
+/*
+ * Create slru reader for dir.
+ *
+ * Returns the malloced memory used by all the other read calls in this module.
+ */
+SlruSegState *
+AllocSlruRead(char *dir)
+{
+	SlruSegState *state = AllocSlruSegState(dir);
+
+	state->writing = false;
+
+	return state;
+}
+
+/*
+ * Open given page for reading.
+ *
+ * Reading can be done in random order.
+ */
+char *
+SlruReadSwitchPage(SlruSegState *state, uint64 pageno)
+{
+	int64		segno;
+
+	Assert(!state->writing);	/* read only mode */
+
+	if (state->segno != -1 && pageno == state->pageno)
+		return state->buf.data;
+
+	segno = pageno / SLRU_PAGES_PER_SEGMENT;
+	if (segno != state->segno)
+	{
+		if (state->segno != -1)
+		{
+			close(state->fd);
+			state->fd = -1;
+
+			pg_free(state->fn);
+			state->fn = NULL;
+		}
+
+		/* Open new segment */
+		state->fn = psprintf("%s/%04X", state->dir, (unsigned int) segno);
+		if ((state->fd = open(state->fn, O_RDONLY | PG_BINARY, 0)) < 0)
+			pg_fatal("could not open file \"%s\": %m", state->fn);
+	}
+
+	state->segno = segno;
+
+	{
+		struct iovec iovec = {
+			.iov_base = &state->buf,
+			.iov_len = BLCKSZ,
+		};
+		off_t		offset = (pageno % SLRU_PAGES_PER_SEGMENT) * BLCKSZ;
+
+		if (pg_preadv(state->fd, &iovec, 1, offset) < 0)
+			pg_fatal("could not read file \"%s\": %m", state->fn);
+
+		state->pageno = pageno;
+	}
+
+	return state->buf.data;
+}
+
+/*
+ * Frees the malloced reader.
+ */
+void
+FreeSlruRead(SlruSegState *state)
+{
+	Assert(!state->writing);	/* read only mode */
+
+	close(state->fd);
+	pg_free(state);
+}
+
+/*
+ * Open the given page for writing.
+ *
+ * NOTE: This uses O_EXCL when stepping to a new segment, so this assumes that
+ * each segment is written in full before moving on to next one.  This
+ * limitation would be easy to lift if needed, but it fits the usage pattern of
+ * current callers.
+ */
+char *
+SlruWriteSwitchPage(SlruSegState *state, uint64 pageno)
+{
+	int64		segno = pageno / SLRU_PAGES_PER_SEGMENT;
+	off_t		offset;
+
+	if (state->segno != -1 && pageno == state->pageno)
+		return state->buf.data;
+
+	segno = pageno / SLRU_PAGES_PER_SEGMENT;
+	offset = (pageno % SLRU_PAGES_PER_SEGMENT) * BLCKSZ;
+
+	SlruFlush(state);
+	memset(state->buf.data, 0, BLCKSZ);
+
+	if (segno != state->segno)
+	{
+		if (state->segno != -1)
+		{
+			close(state->fd);
+			state->fd = -1;
+
+			pg_free(state->fn);
+			state->fn = NULL;
+		}
+
+		/* Create the segment */
+		if (state->long_segment_names)
+		{
+			Assert(segno >= 0 && segno <= INT64CONST(0xFFFFFFFFFFFFFFF));
+			state->fn = psprintf("%s/%015" PRIX64, state->dir, segno);
+		}
+		else
+		{
+			Assert(segno >= 0 && segno <= INT64CONST(0xFFFFFF));
+			state->fn = psprintf("%s/%04X", state->dir, (unsigned int) segno);
+		}
+
+		if ((state->fd = open(state->fn, O_RDWR | O_CREAT | O_EXCL | PG_BINARY,
+							  pg_file_create_mode)) < 0)
+		{
+			pg_fatal("could not create file \"%s\": %m", state->fn);
+		}
+
+		state->segno = segno;
+
+		if (offset > 0 && pg_pwrite_zeros(state->fd, offset, 0) < 0)
+			pg_fatal("could not write file \"%s\": %m", state->fn);
+	}
+
+	state->pageno = pageno;
+
+	return state->buf.data;
+}
+
+/*
+ * Create slru writer for dir.
+ *
+ * Returns the malloced memory used by all the other write calls in this module.
+ */
+SlruSegState *
+AllocSlruWrite(char *dir, bool long_segment_names)
+{
+	SlruSegState *state = AllocSlruSegState(dir);
+
+	state->writing = true;
+	state->long_segment_names = long_segment_names;
+
+	return state;
+}
+
+/*
+ * Frees the malloced writer.
+ */
+void
+FreeSlruWrite(SlruSegState *state)
+{
+	Assert(state->writing);
+
+	SlruFlush(state);
+
+	close(state->fd);
+	pg_free(state);
+}
diff --git a/src/bin/pg_upgrade/slru_io.h b/src/bin/pg_upgrade/slru_io.h
new file mode 100644
index 00000000000..295fd0bebc4
--- /dev/null
+++ b/src/bin/pg_upgrade/slru_io.h
@@ -0,0 +1,23 @@
+/*
+ * slru_io.h
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ * src/bin/pg_upgrade/slru_io.h
+ */
+
+/*
+ * See access/slru.h
+ *
+ * Copy here, since slru.h could not be included in fe code.
+ */
+#define SLRU_PAGES_PER_SEGMENT 32
+
+typedef struct SlruSegState SlruSegState;
+
+extern SlruSegState *AllocSlruRead(char *dir);
+extern char *SlruReadSwitchPage(SlruSegState *state, uint64 pageno);
+extern void FreeSlruRead(SlruSegState *state);
+
+extern SlruSegState *AllocSlruWrite(char *dir, bool long_segment_names);
+extern char *SlruWriteSwitchPage(SlruSegState *state, uint64 pageno);
+extern void FreeSlruWrite(SlruSegState *state);
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 432509277c9..9392bb729b9 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1725,6 +1725,7 @@ MultiXactMember
 MultiXactOffset
 MultiXactStateData
 MultiXactStatus
+MultiXactWriter
 MultirangeIOData
 MultirangeParseState
 MultirangeType
@@ -1808,6 +1809,7 @@ OffsetVarNodes_context
 Oid
 OidOptions
 OkeysState
+OldMultiXactReader
 OldToNewMapping
 OldToNewMappingData
 OnCommitAction
@@ -2804,6 +2806,7 @@ SlruCtlData
 SlruErrorCause
 SlruPageStatus
 SlruScanCallback
+SlruSegState
 SlruShared
 SlruSharedData
 SlruWriteAll
-- 
2.47.3
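
The old-format reader in multixact_old.c relies on unsigned 32-bit arithmetic to get a member count that stays correct across offset wraparound. A minimal standalone sketch of that arithmetic follows. The function names are hypothetical and not part of the patch, and the sketch skips the slot at offset 0 directly, whereas the patch detects it by finding an invalid XID in the members page.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Number of member slots between two old-format (32-bit) offsets.
 * Unsigned subtraction wraps modulo 2^32, so the result is still right
 * when "next" has wrapped past zero.
 */
static uint32_t
old_member_span(uint32_t start, uint32_t next)
{
	return (uint32_t) (next - start);
}

/*
 * Walk the span the way the reader's loop does and count the slots that
 * can hold a member.  The slot at offset 0 is left unused in the old
 * format, so it is skipped (simplification: the patch checks the XID).
 */
static uint32_t
old_member_count(uint32_t start, uint32_t next)
{
	uint32_t	length = old_member_span(start, next);
	uint32_t	offset = start;
	uint32_t	count = 0;

	for (uint32_t i = 0; i < length; i++, offset++)
	{
		if (offset == 0)
			continue;
		count++;
	}

	assert(count <= length);
	return count;
}
```

For example, a multixact starting at offset 0xFFFFFFFE whose successor starts at offset 3 spans five slots, of which four can hold members.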

v22-0003-Add-test-for-64-bit-mxoff-in-pg_resetwal.patch (text/x-patch; charset=UTF-8)
From 3fb87adc5f7c4d94060b4813127635af646b46b6 Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Tue, 28 Oct 2025 19:08:26 +0300
Subject: [PATCH v22 3/5] Add test for 64-bit mxoff in pg_resetwal

---
 src/bin/pg_resetwal/meson.build    |   1 +
 src/bin/pg_resetwal/t/003_mxoff.pl | 170 +++++++++++++++++++++++++++++
 2 files changed, 171 insertions(+)
 create mode 100644 src/bin/pg_resetwal/t/003_mxoff.pl

diff --git a/src/bin/pg_resetwal/meson.build b/src/bin/pg_resetwal/meson.build
index 290832b2299..1e2dfb38a5b 100644
--- a/src/bin/pg_resetwal/meson.build
+++ b/src/bin/pg_resetwal/meson.build
@@ -25,6 +25,7 @@ tests += {
     'tests': [
       't/001_basic.pl',
       't/002_corrupted.pl',
+      't/003_mxoff.pl',
     ],
   },
 }
diff --git a/src/bin/pg_resetwal/t/003_mxoff.pl b/src/bin/pg_resetwal/t/003_mxoff.pl
new file mode 100644
index 00000000000..3c1b7fa1d33
--- /dev/null
+++ b/src/bin/pg_resetwal/t/003_mxoff.pl
@@ -0,0 +1,170 @@
+
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+use strict;
+use warnings FATAL => 'all';
+
+use Math::BigInt;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+sub mxact_eater
+{
+	my $node = shift;
+	my $tbl = shift;
+
+	$node->start;
+	$node->safe_psql('postgres',
+		"CREATE TABLE ${tbl} (I INT PRIMARY KEY, N_UPDATED INT) " .
+		"       WITH (AUTOVACUUM_ENABLED=FALSE);" .
+		"INSERT INTO ${tbl} SELECT G, 0 FROM GENERATE_SERIES(1, 50) G;");
+
+	# consume around 10k multixact-offsets
+	my $nclients = 10;
+	my $update_every = 75;
+	my @connections = ();
+
+	for (1 .. $nclients)
+	{
+		my $conn = $node->background_psql('postgres');
+		$conn->query_safe("BEGIN");
+
+		push(@connections, $conn);
+	}
+
+	for (my $i = 0; $i < 1000; $i++)
+	{
+		my $conn = $connections[$i % $nclients];
+
+		$conn->query_safe("COMMIT;");
+		$conn->query_safe("BEGIN");
+
+		if ($i % $update_every == 0)
+		{
+			$conn->query_safe(
+				"UPDATE ${tbl} SET " .
+				"N_UPDATED = N_UPDATED + 1 " .
+				"WHERE I = ${i} % 50");
+		}
+		else
+		{
+			$conn->query_safe(
+				"SELECT * FROM ${tbl} FOR KEY SHARE");
+		}
+	}
+
+	for my $conn (@connections)
+	{
+		$conn->quit();
+	}
+
+	$node->stop;
+}
+
+sub next_mxoff
+{
+	my $node = shift;
+	my ($stdout, $stderr) =
+	  run_command([ 'pg_controldata', $node->data_dir ]);
+	my @control_data = split("\n", $stdout);
+	my $next_mxoff = undef;
+
+	foreach (@control_data)
+	{
+		if ($_ =~ /^Latest checkpoint's NextMultiOffset:\s*(.*)$/mg)
+		{
+			$next_mxoff = $1;
+			last;
+		}
+	}
+	die "NextMultiOffset not found in control file\n"
+		unless defined($next_mxoff);
+
+	return $next_mxoff;
+}
+
+sub reset_mxoff
+{
+	my $node = shift;
+	my $offset = shift;
+		$offset = Math::BigInt->new($offset);
+
+	# Get block size
+	my $out = (run_command([ 'pg_resetwal', '--dry-run', $node->data_dir ]))[0];
+		$out =~ /^Database block size: *(\d+)$/m or die;
+	my $blcksz = $1;
+
+	# Reset to new offset
+	my @cmd = ('pg_resetwal', '--pgdata' => $node->data_dir);
+	push @cmd, '--multixact-offset' => $offset->as_hex();
+	command_ok(\@cmd, 'set next multixact-offset');
+
+	# Fill empty pg_multixact/members segment
+	my $mult = 32 * int($blcksz / 20) * 4;
+	my $segname = sprintf "%015X", $offset / $mult;
+
+	my @dd = ('dd');
+	push @dd, "if=/dev/zero";
+	push @dd, "of=" . $node->data_dir . "/pg_multixact/members/" . $segname;
+	push @dd, "bs=$blcksz";
+	push @dd, "count=32";
+	command_ok(\@dd, 'fill empty multixact-members');
+}
+
+my ($off1, $off2);
+
+# start from defaults
+my $node1 = PostgreSQL::Test::Cluster->new('node1');
+$node1->init;
+$off1 = next_mxoff($node1);
+mxact_eater($node1, "FOO");
+$off2 = next_mxoff($node1);
+note "> start from $off1, finished at $off2\n";
+
+# start from before 32-bit wraparound
+my $node2 = PostgreSQL::Test::Cluster->new('node2');
+$node2->init;
+reset_mxoff($node2, 0xFFFF0000);
+$off1 = next_mxoff($node2);
+mxact_eater($node2, "FOO");
+$off2 = next_mxoff($node2);
+note "> start from $off1, finished at $off2\n";
+
+# start near 32-bit wraparound
+my $node3 = PostgreSQL::Test::Cluster->new('node3');
+$node3->init;
+reset_mxoff($node3, 0xFFFFEC77);
+$off1 = next_mxoff($node3);
+mxact_eater($node3, "FOO");
+$off2 = next_mxoff($node3);
+note "> start from $off1, finished at $off2\n";
+
+# start over 32-bit wraparound
+my $node4 = PostgreSQL::Test::Cluster->new('node4');
+$node4->init;
+reset_mxoff($node4, '0xFFFFFFFF0000');
+$off1 = next_mxoff($node4);
+mxact_eater($node4, "FOO");
+$off2 = next_mxoff($node4);
+note "> start from $off1, finished at $off2\n";
+
+# check invariant
+$node1->start;
+$node2->start;
+$node3->start;
+$node4->start;
+
+my $var1 = $node1->safe_psql('postgres', 'TABLE FOO');
+my $var2 = $node2->safe_psql('postgres', 'TABLE FOO');
+my $var3 = $node3->safe_psql('postgres', 'TABLE FOO');
+my $var4 = $node4->safe_psql('postgres', 'TABLE FOO');
+ok($var1 eq $var2 && $var2 eq $var3 && $var3 eq $var4,
+	'check table invariant in all nodes');
+
+$node4->stop;
+$node3->stop;
+$node2->stop;
+$node1->stop;
+
+done_testing();
-- 
2.47.3
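
The reset_mxoff helper in the test above derives the members segment file name from the offset with `32 * int($blcksz / 20) * 4`. The same arithmetic is shown below as a standalone C sketch, assuming the usual members layout in which a 20-byte group holds 4 flag bytes plus four 4-byte XIDs. The macro and function names are illustrative only and do not come from the patch.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define PAGES_PER_SEGMENT	32	/* mirrors SLRU_PAGES_PER_SEGMENT */
#define MEMBERGROUP_SIZE	20	/* 4 flag bytes + 4 * 4-byte XIDs */
#define MEMBERS_PER_GROUP	4

/* How many multixact members fit into one members segment file. */
static uint64_t
members_per_segment(uint32_t blcksz)
{
	assert(blcksz >= MEMBERGROUP_SIZE);
	return (uint64_t) PAGES_PER_SEGMENT *
		(blcksz / MEMBERGROUP_SIZE) * MEMBERS_PER_GROUP;
}

/* Segment number that contains the given member offset. */
static uint64_t
members_segno(uint64_t offset, uint32_t blcksz)
{
	return offset / members_per_segment(blcksz);
}

/* Format the long (15 hex digit) segment file name, as the test does. */
static void
members_segment_name(uint64_t offset, uint32_t blcksz,
					 char *buf, size_t buflen)
{
	snprintf(buf, buflen, "%015" PRIX64, members_segno(offset, blcksz));
}
```

With the default 8192-byte block size this gives 52352 members per segment, so offset 0xFFFF0000 falls into segment 000000000014076. The pg_upgrade test uses the same arithmetic with the short "%04X" name, because it resets an old-format cluster.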

v22-0004-TEST-bump-catversion.patch (text/x-patch; charset=UTF-8)
From fb77dd55c361514c3419abe47a8fc3c0c3729813 Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Fri, 24 Oct 2025 11:47:50 +0300
Subject: [PATCH v22 4/5] TEST: bump catversion

To avoid constant CF-bot complaints, bump catversion in a separate
commit.

NOTE: keep it in sync with MULTIXACTOFFSET_FORMATCHANGE_CAT_VER
---
 src/include/catalog/catversion.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h
index 593aed7fe21..6a13fa3cdb0 100644
--- a/src/include/catalog/catversion.h
+++ b/src/include/catalog/catversion.h
@@ -57,6 +57,6 @@
  */
 
 /*							yyyymmddN */
-#define CATALOG_VERSION_NO	202511051
+#define CATALOG_VERSION_NO	999999999
 
 #endif
-- 
2.47.3

v22-0005-TEST-Add-test-for-64-bit-mxoff-in-pg_upgrade.patch (text/x-patch; charset=UTF-8)
From 734427cec2c117fd6616aa237bfeccd30a1ec777 Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Wed, 29 Oct 2025 14:19:56 +0300
Subject: [PATCH v22 5/5] TEST: Add test for 64-bit mxoff in pg_upgrade

---
 src/bin/pg_upgrade/t/007_mxoff.pl | 461 ++++++++++++++++++++++++++++++
 1 file changed, 461 insertions(+)
 create mode 100644 src/bin/pg_upgrade/t/007_mxoff.pl

diff --git a/src/bin/pg_upgrade/t/007_mxoff.pl b/src/bin/pg_upgrade/t/007_mxoff.pl
new file mode 100644
index 00000000000..26fd6e9c5d0
--- /dev/null
+++ b/src/bin/pg_upgrade/t/007_mxoff.pl
@@ -0,0 +1,461 @@
+
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+use strict;
+use warnings FATAL => 'all';
+
+use Math::BigInt;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use PostgreSQL::Test::AdjustDump;
+use PostgreSQL::Test::AdjustUpgrade;
+use Test::More;
+
+# This test involves different multitransaction states, similarly to that of
+# 002_pg_upgrade.pl.
+
+unless (defined($ENV{oldinstall}))
+{
+	plan skip_all => 'set the oldinstall environment variable to a pre-64-bit-mxoff installation to run this test';
+}
+
+# Temp dir for dumps.
+my $tempdir = PostgreSQL::Test::Utils::tempdir;
+
+# Can be changed to test the other modes.
+my $mode = $ENV{PG_TEST_PG_UPGRADE_MODE} || '--copy';
+
+# Get NextMultiOffset.
+sub next_mxoff
+{
+	my $node = shift;
+
+	my $pg_controldata_path =
+		defined($node->install_path) ?
+			$node->install_path . '/bin/pg_controldata' :
+			'pg_controldata';
+	my ($stdout, $stderr) = run_command([ $pg_controldata_path,
+											$node->data_dir ]);
+	my @control_data = split("\n", $stdout);
+	my $next_mxoff = undef;
+
+	foreach (@control_data)
+	{
+		if ($_ =~ /^Latest checkpoint's NextMultiOffset:\s*(.*)$/mg)
+		{
+			$next_mxoff = $1;
+			last;
+		}
+	}
+	die "NextMultiOffset not found in control file\n"
+		unless defined($next_mxoff);
+
+	return $next_mxoff;
+}
+
+# Consume around 10k of mxoffsets.
+sub mxact_eater
+{
+	my $node = shift;
+	my $tbl = 'FOO';
+
+	my ($mxoff1, $mxoff2);
+
+	$mxoff1 = next_mxoff($node);
+	$node->start;
+	$node->safe_psql('postgres',
+		"CREATE TABLE ${tbl} (I INT PRIMARY KEY, N_UPDATED INT) " .
+		"       WITH (AUTOVACUUM_ENABLED=FALSE);" .
+		"INSERT INTO ${tbl} SELECT G, 0 FROM GENERATE_SERIES(1, 50) G;");
+
+	# consume around 10k mxoff
+	my $nclients = 10;
+	my $update_every = 75;
+	my @connections = ();
+
+	for (1 .. $nclients)
+	{
+		my $conn = $node->background_psql('postgres');
+		$conn->query_safe("BEGIN");
+
+		push(@connections, $conn);
+	}
+
+	for (my $i = 0; $i < 1000; $i++)
+	{
+		my $conn = $connections[$i % $nclients];
+
+		$conn->query_safe("COMMIT;");
+		$conn->query_safe("BEGIN");
+
+		if ($i % $update_every == 0)
+		{
+			$conn->query_safe(
+				"UPDATE ${tbl} SET " .
+				"N_UPDATED = N_UPDATED + 1 " .
+				"WHERE I = ${i} % 50");
+		}
+		else
+		{
+			$conn->query_safe(
+				"SELECT * FROM ${tbl} FOR KEY SHARE");
+		}
+	}
+
+	for my $conn (@connections)
+	{
+		$conn->quit();
+	}
+
+	$node->stop;
+	$mxoff2 = next_mxoff($node);
+
+	return $mxoff1, $mxoff2;
+}
+
+# Consume around 2M of mxoffsets.
+sub mxact_huge_eater
+{
+	my $node = shift;
+	my $tbl = 'FOO';
+
+	my ($mxoff1, $mxoff2);
+
+	$mxoff1 = next_mxoff($node);
+	$node->start;
+	$node->safe_psql('postgres',
+		"CREATE TABLE ${tbl} (I INT PRIMARY KEY, N_UPDATED INT) " .
+		"       WITH (AUTOVACUUM_ENABLED=FALSE);" .
+		"INSERT INTO ${tbl} SELECT G, 0 FROM GENERATE_SERIES(1, 50) G;");
+
+	# consume around 1M mxoff
+	my $nclients = 10;
+	my $update_every = 95;
+	my @connections = ();
+	my $timeout = 10 * $PostgreSQL::Test::Utils::timeout_default;
+
+	for (1 .. $nclients)
+	{
+		my $conn = $node->background_psql('postgres',
+										  timeout => $timeout);
+		$conn->query_safe("BEGIN");
+
+		push(@connections, $conn);
+	}
+
+	# It's a long process, better to tell about progress.
+	my $n_steps = 200_000;
+	my $step = int($n_steps / 10);
+
+	diag "\nstart to consume mxoffsets ...\n";
+	for (my $i = 0; $i < $n_steps; $i++)
+	{
+		my $conn = $connections[$i % $nclients];
+
+		$conn->query_safe("COMMIT;");
+		$conn->query_safe("BEGIN");
+
+		if ($i % $update_every == 0)
+		{
+			# Perform some non-key UPDATEs too, to exercise different multixact
+			# member statuses.
+			$conn->query_safe(
+				"UPDATE ${tbl} SET " .
+				"N_UPDATED = N_UPDATED + 1 " .
+				"WHERE I = ${i} % 50");
+		}
+		else
+		{
+			$conn->query_safe(
+				"SELECT * FROM ${tbl} " .
+				"TABLESAMPLE SYSTEM (85) " .
+				"FOR KEY SHARE");
+		}
+
+		if ($i % $step == 0)
+		{
+			my $done = int(($i / $n_steps) * 100);
+			diag "$done% done...";
+		}
+	}
+
+	for my $conn (@connections)
+	{
+		$conn->quit();
+	}
+
+	$node->stop;
+	$mxoff2 = next_mxoff($node);
+
+	return $mxoff1, $mxoff2;
+}
+
+# Set next multixact-offset
+sub reset_mxoff
+{
+	my $node = shift;
+	my $offset = shift;
+
+	my $pg_resetwal_path = $node->install_path . '/bin/pg_resetwal';
+
+	# Get block size
+	my $out = (run_command([ $pg_resetwal_path, '--dry-run',
+							 $node->data_dir ]))[0];
+		$out =~ /^Database block size: *(\d+)$/m or die;
+	my $blcksz = $1;
+
+	# Reset to new offset
+	my @cmd = ($pg_resetwal_path, '--pgdata' => $node->data_dir);
+	push @cmd, '--multixact-offset' => $offset;
+	command_ok(\@cmd, 'set next multixact-offset');
+
+	# Fill empty pg_multixact/members segment
+	my $mult = 32 * int($blcksz / 20) * 4;
+	my $segname = sprintf "%04X", $offset / $mult;
+
+	my @dd = ('dd');
+	push @dd, "if=/dev/zero";
+	push @dd, "of=" . $node->data_dir . "/pg_multixact/members/" . $segname;
+	push @dd, "bs=$blcksz";
+	push @dd, "count=32";
+	command_ok(\@dd, 'fill empty multixact-members');
+}
+
+sub get_dump_for_comparison
+{
+	my ($node, $db, $file_prefix, $adjust_child_columns) = @_;
+
+	my $dumpfile = $tempdir . '/' . $file_prefix . '.sql';
+	my $dump_adjusted = "${dumpfile}_adjusted";
+
+	open(my $dh, '>', $dump_adjusted)
+	  || die "could not open $dump_adjusted for writing $!";
+
+	$node->run_log(
+		[
+			'pg_dump', '--no-sync',
+			'--restrict-key' => 'test',
+			'-d' => $node->connstr($db),
+			'-f' => $dumpfile
+		]);
+
+	print $dh adjust_regress_dumpfile(slurp_file($dumpfile),
+		$adjust_child_columns);
+	close($dh);
+
+	return $dump_adjusted;
+}
+
+# Main test workhorse routine.
+# Run pg_upgrade, dump the data and compare it.
+sub run_test
+{
+	my $tag = shift;
+	my $oldnode = shift;
+	my $newnode = shift;
+
+	command_ok(
+		[
+			'pg_upgrade', '--no-sync',
+			'--old-datadir' => $oldnode->data_dir,
+			'--new-datadir' => $newnode->data_dir,
+			'--old-bindir' => $oldnode->config_data('--bindir'),
+			'--new-bindir' => $newnode->config_data('--bindir'),
+			'--socketdir' => $newnode->host,
+			'--old-port' => $oldnode->port,
+			'--new-port' => $newnode->port,
+			$mode,
+		],
+		'run of pg_upgrade for new instance');
+	ok( !-d $newnode->data_dir . "/pg_upgrade_output.d",
+		"pg_upgrade_output.d/ removed after pg_upgrade success");
+
+	$oldnode->start;
+	my $src_dump =
+		get_dump_for_comparison($oldnode, 'postgres',
+								"oldnode_${tag}_dump", 0);
+	$oldnode->stop;
+
+	$newnode->start;
+	my $dst_dump =
+		get_dump_for_comparison($newnode, 'postgres',
+								"newnode_${tag}_dump", 0);
+	$newnode->stop;
+
+	compare_files($src_dump, $dst_dump,
+		'dump outputs from original and restored regression databases match');
+}
+
+sub to_hex
+{
+	my $arg = shift;
+
+	$arg = Math::BigInt->new($arg);
+	$arg = $arg->as_hex();
+
+	return $arg;
+}
+
+# case #1: start old node from defaults
+{
+	my $tag = 1;
+	my $old =
+		PostgreSQL::Test::Cluster->new("oldnode${tag}",
+									   install_path => $ENV{oldinstall});
+	$old->init(extra => ['-k']);
+
+	my ($start_mxoff, $finish_mxoff) = mxact_eater($old);
+
+	my $new = PostgreSQL::Test::Cluster->new("newnode${tag}");
+	$new->init;
+
+	run_test($tag, $old, $new);
+
+	$start_mxoff = to_hex($start_mxoff);
+	$finish_mxoff = to_hex($finish_mxoff);
+
+	my $next_mxoff = to_hex(next_mxoff($new));
+
+	note ">>> case #${tag}\n" .
+		 " oldnode mxoff from ${start_mxoff} to ${finish_mxoff}\n" .
+		 " newnode mxoff ${next_mxoff}\n";
+}
+
+# case #2: start old node from before 32-bit wraparound
+{
+	my $tag = 2;
+	my $old =
+		PostgreSQL::Test::Cluster->new("oldnode${tag}",
+									   install_path => $ENV{oldinstall});
+
+	$old->init(extra => ['-k']);
+	reset_mxoff($old, 0xFFFF0000);
+
+	my ($start_mxoff, $finish_mxoff) = mxact_eater($old);
+
+	my $new = PostgreSQL::Test::Cluster->new("newnode${tag}");
+	$new->init;
+
+	run_test($tag, $old, $new);
+
+	$start_mxoff = to_hex($start_mxoff);
+	$finish_mxoff = to_hex($finish_mxoff);
+
+	my $next_mxoff = to_hex(next_mxoff($new));
+
+	note ">>> case #${tag}\n" .
+		 " oldnode mxoff from ${start_mxoff} to ${finish_mxoff}\n" .
+		 " newnode mxoff ${next_mxoff}\n";
+}
+
+# case #3: start old node near 32-bit wraparound and reach wraparound state.
+{
+	my $tag = 3;
+	my $old =
+		PostgreSQL::Test::Cluster->new("oldnode${tag}",
+									   install_path => $ENV{oldinstall});
+
+	$old->init(extra => ['-k']);
+
+	reset_mxoff($old, 0xFFFFEC77);
+	my ($start_mxoff, $finish_mxoff) = mxact_eater($old);
+
+	my $new = PostgreSQL::Test::Cluster->new("newnode${tag}");
+	$new->init;
+
+	run_test($tag, $old, $new);
+
+	$start_mxoff = to_hex($start_mxoff);
+	$finish_mxoff = to_hex($finish_mxoff);
+
+	my $next_mxoff = to_hex(next_mxoff($new));
+
+	note ">>> case #${tag}\n" .
+		 " oldnode mxoff from ${start_mxoff} to ${finish_mxoff}\n" .
+		 " newnode mxoff ${next_mxoff}\n";
+}
+
+# case #4: start old node from defaults
+{
+	my $tag = 4;
+	my $old =
+		PostgreSQL::Test::Cluster->new("oldnode${tag}",
+									   install_path => $ENV{oldinstall});
+
+	$old->init(extra => ['-k']);
+
+	diag "test #${tag} for multiple mxoff segments";
+	my ($start_mxoff, $finish_mxoff) = mxact_huge_eater($old);
+
+	my $new = PostgreSQL::Test::Cluster->new("newnode${tag}");
+	$new->init;
+
+	run_test($tag, $old, $new);
+
+	$start_mxoff = to_hex($start_mxoff);
+	$finish_mxoff = to_hex($finish_mxoff);
+
+	my $next_mxoff = to_hex(next_mxoff($new));
+
+	note ">>> case #${tag}\n" .
+		 " oldnode mxoff from ${start_mxoff} to ${finish_mxoff}\n" .
+		 " newnode mxoff ${next_mxoff}\n";
+}
+
+# case #5: start old node from before 32-bit wraparound
+{
+	my $tag = 5;
+	my $old =
+		PostgreSQL::Test::Cluster->new("oldnode${tag}",
+									   install_path => $ENV{oldinstall});
+
+	$old->init(extra => ['-k']);
+	reset_mxoff($old, 0xFFFF0000);
+
+	diag "test #${tag} for multiple mxoff segments";
+	my ($start_mxoff, $finish_mxoff) = mxact_huge_eater($old);
+
+	my $new = PostgreSQL::Test::Cluster->new("newnode${tag}");
+	$new->init;
+
+	run_test($tag, $old, $new);
+
+	$start_mxoff = to_hex($start_mxoff);
+	$finish_mxoff = to_hex($finish_mxoff);
+
+	my $next_mxoff = to_hex(next_mxoff($new));
+
+	note ">>> case #${tag}\n" .
+		 " oldnode mxoff from ${start_mxoff} to ${finish_mxoff}\n" .
+		 " newnode mxoff ${next_mxoff}\n";
+}
+
+# case #6: start old node near 32-bit wraparound and reach wraparound state.
+{
+	my $tag = 6;
+	my $old =
+		PostgreSQL::Test::Cluster->new("oldnode${tag}",
+									   install_path => $ENV{oldinstall});
+
+	$old->init(extra => ['-k']);
+
+	reset_mxoff($old, 0xFFFFFFFF - 1_000_000);
+	my ($start_mxoff, $finish_mxoff) = mxact_huge_eater($old);
+
+	diag "test #${tag} for multiple mxoff segments";
+	my $new = PostgreSQL::Test::Cluster->new("newnode${tag}");
+	$new->init;
+
+	run_test($tag, $old, $new);
+
+	$start_mxoff = to_hex($start_mxoff);
+	$finish_mxoff = to_hex($finish_mxoff);
+
+	my $next_mxoff = to_hex(next_mxoff($new));
+
+	note ">>> case #${tag}\n" .
+		 " oldnode mxoff from ${start_mxoff} to ${finish_mxoff}\n" .
+		 " newnode mxoff ${next_mxoff}\n";
+}
+
+done_testing();
-- 
2.47.3

#55Maxim Orlov
orlovmg@gmail.com
In reply to: Heikki Linnakangas (#54)
7 attachment(s)
Re: POC: make mxidoff 64 bits

On Wed, 5 Nov 2025 at 18:38, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Remaining issues:

- There's one more refactoring I'd like to do before merging this: Move
the definitions that are now duplicated between
src/bin/pg_upgrade/multixact_new.c and
src/backend/access/transam/multixact.c into a new header file,
multixact_internal.h. One complication with that is that it needs
SLRU_PAGES_PER_SEGMENT ...

Done. I also put SLRU_PAGES_PER_SEGMENT in pg_config_manual.h.
In my opinion, this constant aligns perfectly with the description in
that file's header. In any case, feel free to move it anywhere you like.

- Have you done any performance testing of the pg_upgrade code? How long
does the conversion take if you have e.g. 1 billion multixids?

Unfortunately, not yet. I'd like to do this soon. Currently, the
bulk of the testing time is spent generating multi-transactions.

- Is the !oldestOffsetKnown case in the code still reachable? I left one
FIXME comment about that. Needs a comment update at least.

Yep, no longer needed. A separate commit has been added.

- The new pg_upgrade test fails on my system with this error in the log:

Unfortunately, I don't face this issue. I think this can be fixed by
providing an explicit path to the utility.

- The new pg_upgrade test is very slow. I would love to include that
test permanently in the test suite, but it's too slow for that currently.

Yes, unfortunately. The majority of the time is spent on tests that
produce multiple segments. These are cases 4 and higher.
If we remove these, the testing should be relatively fast.
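
To give a feel for why the multi-segment cases are slow, here is a rough sketch of how many members a test has to generate before it crosses a segment boundary. The constants are assumptions based on a default build (8kB BLCKSZ, 32 pages per SLRU segment, 4-byte TransactionId, groups of 4 members plus 4 flag bytes); the `sketch_` names are hypothetical and do not appear in the patch set.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed default-build values; the real ones come from the PostgreSQL headers. */
#define SKETCH_BLCKSZ					8192
#define SKETCH_SLRU_PAGES_PER_SEGMENT	32
#define SKETCH_XID_SIZE					4	/* sizeof(TransactionId) */
#define SKETCH_FLAGBYTES_PER_GROUP		4
#define SKETCH_MEMBERS_PER_GROUP		4

/* Members that fit in one page of pg_multixact/members. */
static inline uint64_t
sketch_members_per_page(void)
{
	/* one group = 4 xids + 4 flag bytes = 20 bytes */
	uint64_t	group_size = SKETCH_XID_SIZE * SKETCH_MEMBERS_PER_GROUP +
		SKETCH_FLAGBYTES_PER_GROUP;
	uint64_t	groups_per_page = SKETCH_BLCKSZ / group_size;

	return groups_per_page * SKETCH_MEMBERS_PER_GROUP;
}

/* Members that fit in one segment file. */
static inline uint64_t
sketch_members_per_segment(void)
{
	return sketch_members_per_page() * SKETCH_SLRU_PAGES_PER_SEGMENT;
}

/* Number of segment files touched by nmembers members starting at start. */
static inline uint64_t
sketch_segments_spanned(uint64_t start, uint64_t nmembers)
{
	uint64_t	per_seg = sketch_members_per_segment();

	assert(nmembers > 0);
	return (start + nmembers - 1) / per_seg - start / per_seg + 1;
}
```

Under these assumptions a page holds 1636 members and a segment holds 52352, so a test must create tens of thousands of members before a second segment file appears.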

I also added a commit, "Handle wraparound of next new multi in pg_upgrade",
per BUG #18863 and BUG #18865.

The issue is that pg_upgrade neglects to handle the wraparound of
mxact/mxoff.

We'll obviously resolve the mxoff wraparound issue by moving to
64 bits. And the mxact bug can be easily solved with two lines of code,
or five if you count indents and comments. A test is also provided.

This commit is totally optional. If you think it deserves to be treated
as a different issue, feel free to discard it.
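
As a small illustration of the mxoff wraparound discussed above, the sketch below contrasts 32-bit and 64-bit offset arithmetic. The helpers are hypothetical (prefixed `sketch_`) and are not taken from the patch set; only the `next + nmembers < next` comparison mirrors the sanity check that v23-0001 adds to GetNewMultiXactId().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Old behaviour: a 32-bit offset wraps once it passes 0xFFFFFFFF. */
static inline bool
sketch_offset32_wraps(uint32_t next_offset, uint32_t nmembers)
{
	return (uint32_t) (next_offset + nmembers) < next_offset;
}

/* Same comparison on 64-bit offsets; expected to be false in practice. */
static inline bool
sketch_offset64_wraps(uint64_t next_offset, uint64_t nmembers)
{
	return next_offset + nmembers < next_offset;
}

/* Reserve nmembers slots and return the next free 64-bit offset. */
static inline uint64_t
sketch_reserve_members(uint64_t next_offset, uint64_t nmembers)
{
	assert(!sketch_offset64_wraps(next_offset, nmembers));
	return next_offset + nmembers;
}
```

For example, starting at 0xFFFFFFFF - 1000000 (the value used in test case #6) and adding two million members wraps a 32-bit offset, while the same 64-bit arithmetic simply continues past 0x100000000.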

--
Best regards,
Maxim Orlov.

Attachments:

v23-0006-TEST-bump-catversion.patch.txt (text/plain; charset=US-ASCII)
From 69997cfd277ce2377d3599d7426fb8c83a78cf40 Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Fri, 24 Oct 2025 11:47:50 +0300
Subject: [PATCH v23 6/7] TEST: bump catversion

To avoid constant CF-bot complaints, make the catversion bump a separate
commit.

NOTE: keep it in sync with MULTIXACTOFFSET_FORMATCHANGE_CAT_VER
---
 src/include/catalog/catversion.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h
index 593aed7fe21..6a13fa3cdb0 100644
--- a/src/include/catalog/catversion.h
+++ b/src/include/catalog/catversion.h
@@ -57,6 +57,6 @@
  */
 
 /*							yyyymmddN */
-#define CATALOG_VERSION_NO	202511051
+#define CATALOG_VERSION_NO	999999999
 
 #endif
-- 
2.51.0

v23-0001-Use-64-bit-multixact-offsets.patch (application/octet-stream)
From 132bf103728c3eb1605ac04b488b41c1edc90852 Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Wed, 7 Aug 2024 16:35:22 +0300
Subject: [PATCH v23 1/7] Use 64-bit multixact offsets

Switching to 64-bit multitransaction offsets removes wraparound and the
2^32 limit on their total number.

Author: Maxim Orlov <orlovmg@gmail.com>
Discussion: FIXME
---
 src/backend/access/rmgrdesc/mxactdesc.c   |   4 +-
 src/backend/access/rmgrdesc/xlogdesc.c    |   2 +-
 src/backend/access/transam/multixact.c    | 500 +++-------------------
 src/backend/access/transam/xlog.c         |   2 +-
 src/backend/access/transam/xlogrecovery.c |   2 +-
 src/backend/commands/vacuum.c             |   2 +-
 src/backend/postmaster/autovacuum.c       |   4 +-
 src/bin/pg_controldata/pg_controldata.c   |   2 +-
 src/bin/pg_resetwal/pg_resetwal.c         |   6 +-
 src/bin/pg_resetwal/t/001_basic.pl        |   2 +-
 src/include/access/multixact.h            |   3 -
 src/include/access/multixact_internal.h   | 109 +++++
 src/include/access/slru.h                 |  15 -
 src/include/c.h                           |   2 +-
 src/include/pg_config_manual.h            |  15 +
 15 files changed, 195 insertions(+), 475 deletions(-)
 create mode 100644 src/include/access/multixact_internal.h

diff --git a/src/backend/access/rmgrdesc/mxactdesc.c b/src/backend/access/rmgrdesc/mxactdesc.c
index 3ca0582db36..052dd0a4ce5 100644
--- a/src/backend/access/rmgrdesc/mxactdesc.c
+++ b/src/backend/access/rmgrdesc/mxactdesc.c
@@ -65,7 +65,7 @@ multixact_desc(StringInfo buf, XLogReaderState *record)
 		xl_multixact_create *xlrec = (xl_multixact_create *) rec;
 		int			i;
 
-		appendStringInfo(buf, "%u offset %u nmembers %d: ", xlrec->mid,
+		appendStringInfo(buf, "%u offset %" PRIu64 " nmembers %d: ", xlrec->mid,
 						 xlrec->moff, xlrec->nmembers);
 		for (i = 0; i < xlrec->nmembers; i++)
 			out_member(buf, &xlrec->members[i]);
@@ -74,7 +74,7 @@ multixact_desc(StringInfo buf, XLogReaderState *record)
 	{
 		xl_multixact_truncate *xlrec = (xl_multixact_truncate *) rec;
 
-		appendStringInfo(buf, "offsets [%u, %u), members [%u, %u)",
+		appendStringInfo(buf, "offsets [%u, %u), members [%" PRIu64 ", %" PRIu64 ")",
 						 xlrec->startTruncOff, xlrec->endTruncOff,
 						 xlrec->startTruncMemb, xlrec->endTruncMemb);
 	}
diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c
index cd6c2a2f650..441034f5929 100644
--- a/src/backend/access/rmgrdesc/xlogdesc.c
+++ b/src/backend/access/rmgrdesc/xlogdesc.c
@@ -66,7 +66,7 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 		CheckPoint *checkpoint = (CheckPoint *) rec;
 
 		appendStringInfo(buf, "redo %X/%08X; "
-						 "tli %u; prev tli %u; fpw %s; wal_level %s; xid %u:%u; oid %u; multi %u; offset %u; "
+						 "tli %u; prev tli %u; fpw %s; wal_level %s; xid %u:%u; oid %u; multi %u; offset %" PRIu64 "; "
 						 "oldest xid %u in DB %u; oldest multi %u in DB %u; "
 						 "oldest/newest commit timestamp xid: %u/%u; "
 						 "oldest running xid %u; %s",
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 9d5f130af7e..732f048df5e 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -69,6 +69,7 @@
 #include "postgres.h"
 
 #include "access/multixact.h"
+#include "access/multixact_internal.h"
 #include "access/slru.h"
 #include "access/twophase.h"
 #include "access/twophase_rmgr.h"
@@ -88,35 +89,6 @@
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 
-
-/*
- * Defines for MultiXactOffset page sizes.  A page is the same BLCKSZ as is
- * used everywhere else in Postgres.
- *
- * Note: because MultiXactOffsets are 32 bits and wrap around at 0xFFFFFFFF,
- * MultiXact page numbering also wraps around at
- * 0xFFFFFFFF/MULTIXACT_OFFSETS_PER_PAGE, and segment numbering at
- * 0xFFFFFFFF/MULTIXACT_OFFSETS_PER_PAGE/SLRU_PAGES_PER_SEGMENT.  We need
- * take no explicit notice of that fact in this module, except when comparing
- * segment and page numbers in TruncateMultiXact (see
- * MultiXactOffsetPagePrecedes).
- */
-
-/* We need four bytes per offset */
-#define MULTIXACT_OFFSETS_PER_PAGE (BLCKSZ / sizeof(MultiXactOffset))
-
-static inline int64
-MultiXactIdToOffsetPage(MultiXactId multi)
-{
-	return multi / MULTIXACT_OFFSETS_PER_PAGE;
-}
-
-static inline int
-MultiXactIdToOffsetEntry(MultiXactId multi)
-{
-	return multi % MULTIXACT_OFFSETS_PER_PAGE;
-}
-
 static inline int64
 MultiXactIdToOffsetSegment(MultiXactId multi)
 {
@@ -124,94 +96,13 @@ MultiXactIdToOffsetSegment(MultiXactId multi)
 }
 
 /*
- * The situation for members is a bit more complex: we store one byte of
- * additional flag bits for each TransactionId.  To do this without getting
- * into alignment issues, we store four bytes of flags, and then the
- * corresponding 4 Xids.  Each such 5-word (20-byte) set we call a "group", and
- * are stored as a whole in pages.  Thus, with 8kB BLCKSZ, we keep 409 groups
- * per page.  This wastes 12 bytes per page, but that's OK -- simplicity (and
- * performance) trumps space efficiency here.
- *
- * Note that the "offset" macros work with byte offset, not array indexes, so
- * arithmetic must be done using "char *" pointers.
- */
-/* We need eight bits per xact, so one xact fits in a byte */
-#define MXACT_MEMBER_BITS_PER_XACT			8
-#define MXACT_MEMBER_FLAGS_PER_BYTE			1
-#define MXACT_MEMBER_XACT_BITMASK	((1 << MXACT_MEMBER_BITS_PER_XACT) - 1)
-
-/* how many full bytes of flags are there in a group? */
-#define MULTIXACT_FLAGBYTES_PER_GROUP		4
-#define MULTIXACT_MEMBERS_PER_MEMBERGROUP	\
-	(MULTIXACT_FLAGBYTES_PER_GROUP * MXACT_MEMBER_FLAGS_PER_BYTE)
-/* size in bytes of a complete group */
-#define MULTIXACT_MEMBERGROUP_SIZE \
-	(sizeof(TransactionId) * MULTIXACT_MEMBERS_PER_MEMBERGROUP + MULTIXACT_FLAGBYTES_PER_GROUP)
-#define MULTIXACT_MEMBERGROUPS_PER_PAGE (BLCKSZ / MULTIXACT_MEMBERGROUP_SIZE)
-#define MULTIXACT_MEMBERS_PER_PAGE	\
-	(MULTIXACT_MEMBERGROUPS_PER_PAGE * MULTIXACT_MEMBERS_PER_MEMBERGROUP)
-
-/*
- * Because the number of items per page is not a divisor of the last item
- * number (member 0xFFFFFFFF), the last segment does not use the maximum number
- * of pages, and moreover the last used page therein does not use the same
- * number of items as previous pages.  (Another way to say it is that the
- * 0xFFFFFFFF member is somewhere in the middle of the last page, so the page
- * has some empty space after that item.)
- *
- * This constant is the number of members in the last page of the last segment.
+ * Multixact members warning threshold.
+ *
+ * If the difference between nextOffset and oldestOffset exceeds this value,
+ * we trigger autovacuum in order to release disk space consumed by the
+ * members SLRU.
  */
-#define MAX_MEMBERS_IN_LAST_MEMBERS_PAGE \
-		((uint32) ((0xFFFFFFFF % MULTIXACT_MEMBERS_PER_PAGE) + 1))
-
-/* page in which a member is to be found */
-static inline int64
-MXOffsetToMemberPage(MultiXactOffset offset)
-{
-	return offset / MULTIXACT_MEMBERS_PER_PAGE;
-}
-
-static inline int64
-MXOffsetToMemberSegment(MultiXactOffset offset)
-{
-	return MXOffsetToMemberPage(offset) / SLRU_PAGES_PER_SEGMENT;
-}
-
-/* Location (byte offset within page) of flag word for a given member */
-static inline int
-MXOffsetToFlagsOffset(MultiXactOffset offset)
-{
-	MultiXactOffset group = offset / MULTIXACT_MEMBERS_PER_MEMBERGROUP;
-	int			grouponpg = group % MULTIXACT_MEMBERGROUPS_PER_PAGE;
-	int			byteoff = grouponpg * MULTIXACT_MEMBERGROUP_SIZE;
-
-	return byteoff;
-}
-
-static inline int
-MXOffsetToFlagsBitShift(MultiXactOffset offset)
-{
-	int			member_in_group = offset % MULTIXACT_MEMBERS_PER_MEMBERGROUP;
-	int			bshift = member_in_group * MXACT_MEMBER_BITS_PER_XACT;
-
-	return bshift;
-}
-
-/* Location (byte offset within page) of TransactionId of given member */
-static inline int
-MXOffsetToMemberOffset(MultiXactOffset offset)
-{
-	int			member_in_group = offset % MULTIXACT_MEMBERS_PER_MEMBERGROUP;
-
-	return MXOffsetToFlagsOffset(offset) +
-		MULTIXACT_FLAGBYTES_PER_GROUP +
-		member_in_group * sizeof(TransactionId);
-}
-
-/* Multixact members wraparound thresholds. */
-#define MULTIXACT_MEMBER_SAFE_THRESHOLD		(MaxMultiXactOffset / 2)
-#define MULTIXACT_MEMBER_DANGER_THRESHOLD	\
-	(MaxMultiXactOffset - MaxMultiXactOffset / 4)
+#define MULTIXACT_MEMBER_AUTOVAC_THRESHOLD		UINT64CONST(4000000000)
 
 static inline MultiXactId
 PreviousMultiXactId(MultiXactId multi)
@@ -268,9 +159,6 @@ typedef struct MultiXactStateData
 	MultiXactId multiStopLimit;
 	MultiXactId multiWrapLimit;
 
-	/* support for members anti-wraparound measures */
-	MultiXactOffset offsetStopLimit;	/* known if oldestOffsetKnown */
-
 	/*
 	 * This is used to sleep until a multixact offset is written when we want
 	 * to create the next one.
@@ -401,8 +289,6 @@ static bool MultiXactOffsetPrecedes(MultiXactOffset offset1,
 									MultiXactOffset offset2);
 static void ExtendMultiXactOffset(MultiXactId multi);
 static void ExtendMultiXactMember(MultiXactOffset offset, int nmembers);
-static bool MultiXactOffsetWouldWrap(MultiXactOffset boundary,
-									 MultiXactOffset start, uint32 distance);
 static bool SetOffsetVacuumLimit(bool is_startup);
 static bool find_multixact_start(MultiXactId multi, MultiXactOffset *result);
 static void WriteMTruncateXlogRec(Oid oldestMultiDB,
@@ -1142,90 +1028,22 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 	ExtendMultiXactOffset(result);
 
 	/*
-	 * Reserve the members space, similarly to above.  Also, be careful not to
-	 * return zero as the starting offset for any multixact. See
-	 * GetMultiXactIdMembers() for motivation.
+	 * Reserve the members space, similarly to above.
 	 */
 	nextOffset = MultiXactState->nextOffset;
-	if (nextOffset == 0)
-	{
-		*offset = 1;
-		nmembers++;				/* allocate member slot 0 too */
-	}
-	else
-		*offset = nextOffset;
-
-	/*----------
-	 * Protect against overrun of the members space as well, with the
-	 * following rules:
-	 *
-	 * If we're past offsetStopLimit, refuse to generate more multis.
-	 * If we're close to offsetStopLimit, emit a warning.
-	 *
-	 * Arbitrarily, we start emitting warnings when we're 20 segments or less
-	 * from offsetStopLimit.
-	 *
-	 * Note we haven't updated the shared state yet, so if we fail at this
-	 * point, the multixact ID we grabbed can still be used by the next guy.
-	 *
-	 * Note that there is no point in forcing autovacuum runs here: the
-	 * multixact freeze settings would have to be reduced for that to have any
-	 * effect.
-	 *----------
-	 */
-#define OFFSET_WARN_SEGMENTS	20
-	if (MultiXactState->oldestOffsetKnown &&
-		MultiXactOffsetWouldWrap(MultiXactState->offsetStopLimit, nextOffset,
-								 nmembers))
-	{
-		/* see comment in the corresponding offsets wraparound case */
-		SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER);
-
-		ereport(ERROR,
-				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg("multixact \"members\" limit exceeded"),
-				 errdetail_plural("This command would create a multixact with %u members, but the remaining space is only enough for %u member.",
-								  "This command would create a multixact with %u members, but the remaining space is only enough for %u members.",
-								  MultiXactState->offsetStopLimit - nextOffset - 1,
-								  nmembers,
-								  MultiXactState->offsetStopLimit - nextOffset - 1),
-				 errhint("Execute a database-wide VACUUM in database with OID %u with reduced \"vacuum_multixact_freeze_min_age\" and \"vacuum_multixact_freeze_table_age\" settings.",
-						 MultiXactState->oldestMultiXactDB)));
-	}
 
 	/*
-	 * Check whether we should kick autovacuum into action, to prevent members
-	 * wraparound. NB we use a much larger window to trigger autovacuum than
-	 * just the warning limit. The warning is just a measure of last resort -
-	 * this is in line with GetNewTransactionId's behaviour.
+	 * Offsets are 64-bit integers and will never wrap around.  Firstly, it
+	 * would take an unrealistic amount of time and resources to consume 2^64
+	 * offsets.  Secondly, multixid creation is WAL-logged, so you would run
+	 * out of LSNs before reaching offset wraparound.  Nevertheless, check for
+	 * wraparound as a sanity check.
 	 */
-	if (!MultiXactState->oldestOffsetKnown ||
-		(MultiXactState->nextOffset - MultiXactState->oldestOffset
-		 > MULTIXACT_MEMBER_SAFE_THRESHOLD))
-	{
-		/*
-		 * To avoid swamping the postmaster with signals, we issue the autovac
-		 * request only when crossing a segment boundary. With default
-		 * compilation settings that's roughly after 50k members.  This still
-		 * gives plenty of chances before we get into real trouble.
-		 */
-		if ((MXOffsetToMemberPage(nextOffset) / SLRU_PAGES_PER_SEGMENT) !=
-			(MXOffsetToMemberPage(nextOffset + nmembers) / SLRU_PAGES_PER_SEGMENT))
-			SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER);
-	}
-
-	if (MultiXactState->oldestOffsetKnown &&
-		MultiXactOffsetWouldWrap(MultiXactState->offsetStopLimit,
-								 nextOffset,
-								 nmembers + MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT * OFFSET_WARN_SEGMENTS))
-		ereport(WARNING,
+	if (nextOffset + nmembers < nextOffset)
+		ereport(ERROR,
 				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg_plural("database with OID %u must be vacuumed before %d more multixact member is used",
-							   "database with OID %u must be vacuumed before %d more multixact members are used",
-							   MultiXactState->offsetStopLimit - nextOffset + nmembers,
-							   MultiXactState->oldestMultiXactDB,
-							   MultiXactState->offsetStopLimit - nextOffset + nmembers),
-				 errhint("Execute a database-wide VACUUM in that database with reduced \"vacuum_multixact_freeze_min_age\" and \"vacuum_multixact_freeze_table_age\" settings.")));
+				 errmsg_internal("MultiXact members would wrap around")));
+	*offset = nextOffset;
 
 	ExtendMultiXactMember(nextOffset, nmembers);
 
@@ -1246,8 +1064,7 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 	 * the next iteration.  But note that nextMXact may be InvalidMultiXactId
 	 * or the first value on a segment-beginning page after this routine
 	 * exits, so anyone else looking at the variable must be prepared to deal
-	 * with either case.  Similarly, nextOffset may be zero, but we won't use
-	 * that as the actual start offset of the next multixact.
+	 * with either case.
 	 */
 	(MultiXactState->nextMXact)++;
 
@@ -1255,7 +1072,8 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 
 	LWLockRelease(MultiXactGenLock);
 
-	debug_elog4(DEBUG2, "GetNew: returning %u offset %u", result, *offset);
+	debug_elog4(DEBUG2, "GetNew: returning %u offset %" PRIu64, result,
+				*offset);
 	return result;
 }
 
@@ -1297,7 +1115,6 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
 	MultiXactOffset *offptr;
 	MultiXactOffset offset;
 	int			length;
-	int			truelength;
 	MultiXactId oldestMXact;
 	MultiXactId nextMXact;
 	MultiXactId tmpMXact;
@@ -1396,16 +1213,7 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
 	 * we have just for this; the process in charge will signal the CV as soon
 	 * as it has finished writing the multixact offset.
 	 *
-	 * 3. Because GetNewMultiXactId increments offset zero to offset one to
-	 * handle case #2, there is an ambiguity near the point of offset
-	 * wraparound.  If we see next multixact's offset is one, is that our
-	 * multixact's actual endpoint, or did it end at zero with a subsequent
-	 * increment?  We handle this using the knowledge that if the zero'th
-	 * member slot wasn't filled, it'll contain zero, and zero isn't a valid
-	 * transaction ID so it can't be a multixact member.  Therefore, if we
-	 * read a zero from the members array, just ignore it.
-	 *
-	 * This is all pretty messy, but the mess occurs only in infrequent corner
+	 * This is a little messy, but the mess occurs only in infrequent corner
 	 * cases, so it seems better than holding the MultiXactGenLock for a long
 	 * time on every multixact creation.
 	 */
@@ -1491,6 +1299,9 @@ retry:
 	LWLockRelease(lock);
 	lock = NULL;
 
+	/* A multixid with zero members should not happen */
+	Assert(length > 0);
+
 	/*
 	 * If we slept above, clean up state; it's no longer needed.
 	 */
@@ -1499,7 +1310,6 @@ retry:
 
 	ptr = (MultiXactMember *) palloc(length * sizeof(MultiXactMember));
 
-	truelength = 0;
 	prev_pageno = -1;
 	for (int i = 0; i < length; i++, offset++)
 	{
@@ -1536,37 +1346,27 @@ retry:
 
 		xactptr = (TransactionId *)
 			(MultiXactMemberCtl->shared->page_buffer[slotno] + memberoff);
-
-		if (!TransactionIdIsValid(*xactptr))
-		{
-			/* Corner case 3: we must be looking at unused slot zero */
-			Assert(offset == 0);
-			continue;
-		}
+		Assert(TransactionIdIsValid(*xactptr));
 
 		flagsoff = MXOffsetToFlagsOffset(offset);
 		bshift = MXOffsetToFlagsBitShift(offset);
 		flagsptr = (uint32 *) (MultiXactMemberCtl->shared->page_buffer[slotno] + flagsoff);
 
-		ptr[truelength].xid = *xactptr;
-		ptr[truelength].status = (*flagsptr >> bshift) & MXACT_MEMBER_XACT_BITMASK;
-		truelength++;
+		ptr[i].xid = *xactptr;
+		ptr[i].status = (*flagsptr >> bshift) & MXACT_MEMBER_XACT_BITMASK;
 	}
 
 	LWLockRelease(lock);
 
-	/* A multixid with zero members should not happen */
-	Assert(truelength > 0);
-
 	/*
 	 * Copy the result into the local cache.
 	 */
-	mXactCachePut(multi, truelength, ptr);
+	mXactCachePut(multi, length, ptr);
 
 	debug_elog3(DEBUG2, "GetMembers: no cache for %s",
-				mxid_to_string(multi, truelength, ptr));
+				mxid_to_string(multi, length, ptr));
 	*members = ptr;
-	return truelength;
+	return length;
 }
 
 /*
@@ -1973,7 +1773,7 @@ MultiXactShmemInit(void)
 				  "pg_multixact/members", LWTRANCHE_MULTIXACTMEMBER_BUFFER,
 				  LWTRANCHE_MULTIXACTMEMBER_SLRU,
 				  SYNC_HANDLER_MULTIXACT_MEMBER,
-				  false);
+				  true);
 	/* doesn't call SimpleLruTruncate() or meet criteria for unit tests */
 
 	/* Initialize our shared state struct */
@@ -2150,7 +1950,6 @@ TrimMultiXact(void)
 		slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, nextMXact);
 		offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
 		offptr += entryno;
-
 		MemSet(offptr, 0, BLCKSZ - (entryno * sizeof(MultiXactOffset)));
 
 		MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
@@ -2223,7 +2022,7 @@ MultiXactGetCheckptMulti(bool is_shutdown,
 	LWLockRelease(MultiXactGenLock);
 
 	debug_elog6(DEBUG2,
-				"MultiXact: checkpoint is nextMulti %u, nextOffset %u, oldestMulti %u in DB %u",
+				"MultiXact: checkpoint is nextMulti %u, nextOffset %" PRIu64 ", oldestMulti %u in DB %u",
 				*nextMulti, *nextMultiOffset, *oldestMulti, *oldestMultiDB);
 }
 
@@ -2258,7 +2057,7 @@ void
 MultiXactSetNextMXact(MultiXactId nextMulti,
 					  MultiXactOffset nextMultiOffset)
 {
-	debug_elog4(DEBUG2, "MultiXact: setting next multi to %u offset %u",
+	debug_elog4(DEBUG2, "MultiXact: setting next multi to %u offset %" PRIu64,
 				nextMulti, nextMultiOffset);
 	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
 	MultiXactState->nextMXact = nextMulti;
@@ -2449,7 +2248,7 @@ MultiXactAdvanceNextMXact(MultiXactId minMulti,
 	}
 	if (MultiXactOffsetPrecedes(MultiXactState->nextOffset, minMultiOffset))
 	{
-		debug_elog3(DEBUG2, "MultiXact: setting next offset to %u",
+		debug_elog3(DEBUG2, "MultiXact: setting next offset to %" PRIu64,
 					minMultiOffset);
 		MultiXactState->nextOffset = minMultiOffset;
 	}
@@ -2551,23 +2350,8 @@ ExtendMultiXactMember(MultiXactOffset offset, int nmembers)
 			LWLockRelease(lock);
 		}
 
-		/*
-		 * Compute the number of items till end of current page.  Careful: if
-		 * addition of unsigned ints wraps around, we're at the last page of
-		 * the last segment; since that page holds a different number of items
-		 * than other pages, we need to do it differently.
-		 */
-		if (offset + MAX_MEMBERS_IN_LAST_MEMBERS_PAGE < offset)
-		{
-			/*
-			 * This is the last page of the last segment; we can compute the
-			 * number of items left to allocate in it without modulo
-			 * arithmetic.
-			 */
-			difference = MaxMultiXactOffset - offset + 1;
-		}
-		else
-			difference = MULTIXACT_MEMBERS_PER_PAGE - offset % MULTIXACT_MEMBERS_PER_PAGE;
+		/* Compute the number of items till end of current page. */
+		difference = MULTIXACT_MEMBERS_PER_PAGE - offset % MULTIXACT_MEMBERS_PER_PAGE;
 
 		/*
 		 * Advance to next page, taking care to properly handle the wraparound
@@ -2633,15 +2417,14 @@ GetOldestMultiXactId(void)
 }
 
 /*
- * Determine how aggressively we need to vacuum in order to prevent member
- * wraparound.
+ * Determine if we need to vacuum to keep the size of the members SLRU in
+ * check.
  *
  * To do so determine what's the oldest member offset and install the limit
  * info in MultiXactState, where it can be used to prevent overrun of old data
  * in the members SLRU area.
  *
- * The return value is true if emergency autovacuum is required and false
- * otherwise.
+ * The return value is true if autovacuum is required and false otherwise.
  */
 static bool
 SetOffsetVacuumLimit(bool is_startup)
@@ -2653,8 +2436,6 @@ SetOffsetVacuumLimit(bool is_startup)
 	MultiXactOffset nextOffset;
 	bool		oldestOffsetKnown = false;
 	bool		prevOldestOffsetKnown;
-	MultiXactOffset offsetStopLimit = 0;
-	MultiXactOffset prevOffsetStopLimit;
 
 	/*
 	 * NB: Have to prevent concurrent truncation, we might otherwise try to
@@ -2669,7 +2450,6 @@ SetOffsetVacuumLimit(bool is_startup)
 	nextOffset = MultiXactState->nextOffset;
 	prevOldestOffsetKnown = MultiXactState->oldestOffsetKnown;
 	prevOldestOffset = MultiXactState->oldestOffset;
-	prevOffsetStopLimit = MultiXactState->offsetStopLimit;
 	Assert(MultiXactState->finishedStartup);
 	LWLockRelease(MultiXactGenLock);
 
@@ -2700,13 +2480,9 @@ SetOffsetVacuumLimit(bool is_startup)
 		oldestOffsetKnown =
 			find_multixact_start(oldestMultiXactId, &oldestOffset);
 
-		if (oldestOffsetKnown)
-			ereport(DEBUG1,
-					(errmsg_internal("oldest MultiXactId member is at offset %u",
-									 oldestOffset)));
-		else
+		if (!oldestOffsetKnown)
 			ereport(LOG,
-					(errmsg("MultiXact member wraparound protections are disabled because oldest checkpointed MultiXact %u does not exist on disk",
+					(errmsg("oldest checkpointed MultiXact %u does not exist on disk",
 							oldestMultiXactId)));
 	}
 
@@ -2716,97 +2492,32 @@ SetOffsetVacuumLimit(bool is_startup)
 	 * If we can, compute limits (and install them MultiXactState) to prevent
 	 * overrun of old data in the members SLRU area. We can only do so if the
 	 * oldest offset is known though.
+	 *
+	 * FIXME: Is !oldestOffsetKnown possible anymore? At least update the comment:
+	 * we won't overrun members anymore.
 	 */
-	if (oldestOffsetKnown)
-	{
-		/* move back to start of the corresponding segment */
-		offsetStopLimit = oldestOffset - (oldestOffset %
-										  (MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT));
-
-		/* always leave one segment before the wraparound point */
-		offsetStopLimit -= (MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT);
-
-		if (!prevOldestOffsetKnown && !is_startup)
-			ereport(LOG,
-					(errmsg("MultiXact member wraparound protections are now enabled")));
-
-		ereport(DEBUG1,
-				(errmsg_internal("MultiXact member stop limit is now %u based on MultiXact %u",
-								 offsetStopLimit, oldestMultiXactId)));
-	}
-	else if (prevOldestOffsetKnown)
+	if (prevOldestOffsetKnown)
 	{
 		/*
 		 * If we failed to get the oldest offset this time, but we have a
 		 * value from a previous pass through this function, use the old
-		 * values rather than automatically forcing an emergency autovacuum
-		 * cycle again.
+		 * values rather than automatically forcing an autovacuum cycle again.
 		 */
 		oldestOffset = prevOldestOffset;
 		oldestOffsetKnown = true;
-		offsetStopLimit = prevOffsetStopLimit;
 	}
 
 	/* Install the computed values */
 	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
 	MultiXactState->oldestOffset = oldestOffset;
 	MultiXactState->oldestOffsetKnown = oldestOffsetKnown;
-	MultiXactState->offsetStopLimit = offsetStopLimit;
 	LWLockRelease(MultiXactGenLock);
 
 	/*
-	 * Do we need an emergency autovacuum?	If we're not sure, assume yes.
+	 * Do we need autovacuum?	If we're not sure, assume yes.
 	 */
 	return !oldestOffsetKnown ||
-		(nextOffset - oldestOffset > MULTIXACT_MEMBER_SAFE_THRESHOLD);
-}
-
-/*
- * Return whether adding "distance" to "start" would move past "boundary".
- *
- * We use this to determine whether the addition is "wrapping around" the
- * boundary point, hence the name.  The reason we don't want to use the regular
- * 2^31-modulo arithmetic here is that we want to be able to use the whole of
- * the 2^32-1 space here, allowing for more multixacts than would fit
- * otherwise.
- */
-static bool
-MultiXactOffsetWouldWrap(MultiXactOffset boundary, MultiXactOffset start,
-						 uint32 distance)
-{
-	MultiXactOffset finish;
-
-	/*
-	 * Note that offset number 0 is not used (see GetMultiXactIdMembers), so
-	 * if the addition wraps around the UINT_MAX boundary, skip that value.
-	 */
-	finish = start + distance;
-	if (finish < start)
-		finish++;
-
-	/*-----------------------------------------------------------------------
-	 * When the boundary is numerically greater than the starting point, any
-	 * value numerically between the two is not wrapped:
-	 *
-	 *	<----S----B---->
-	 *	[---)			 = F wrapped past B (and UINT_MAX)
-	 *		 [---)		 = F not wrapped
-	 *			  [----] = F wrapped past B
-	 *
-	 * When the boundary is numerically less than the starting point (i.e. the
-	 * UINT_MAX wraparound occurs somewhere in between) then all values in
-	 * between are wrapped:
-	 *
-	 *	<----B----S---->
-	 *	[---)			 = F not wrapped past B (but wrapped past UINT_MAX)
-	 *		 [---)		 = F wrapped past B (and UINT_MAX)
-	 *			  [----] = F not wrapped
-	 *-----------------------------------------------------------------------
-	 */
-	if (start < boundary)
-		return finish >= boundary || finish < start;
-	else
-		return finish >= boundary && finish < start;
+		(nextOffset - oldestOffset > MULTIXACT_MEMBER_AUTOVAC_THRESHOLD);
 }
 
 /*
@@ -2846,6 +2557,7 @@ find_multixact_start(MultiXactId multi, MultiXactOffset *result)
 	offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
 	offptr += entryno;
 	offset = *offptr;
+
 	LWLockRelease(SimpleLruGetBankLock(MultiXactOffsetCtl, pageno));
 
 	*result = offset;
@@ -2893,73 +2605,6 @@ GetMultiXactInfo(uint32 *multixacts, MultiXactOffset *members,
 	return true;
 }
 
-/*
- * Multixact members can be removed once the multixacts that refer to them
- * are older than every datminmxid.  autovacuum_multixact_freeze_max_age and
- * vacuum_multixact_freeze_table_age work together to make sure we never have
- * too many multixacts; we hope that, at least under normal circumstances,
- * this will also be sufficient to keep us from using too many offsets.
- * However, if the average multixact has many members, we might exhaust the
- * members space while still using few enough members that these limits fail
- * to trigger relminmxid advancement by VACUUM.  At that point, we'd have no
- * choice but to start failing multixact-creating operations with an error.
- *
- * To prevent that, if more than a threshold portion of the members space is
- * used, we effectively reduce autovacuum_multixact_freeze_max_age and
- * to a value just less than the number of multixacts in use.  We hope that
- * this will quickly trigger autovacuuming on the table or tables with the
- * oldest relminmxid, thus allowing datminmxid values to advance and removing
- * some members.
- *
- * As the fraction of the member space currently in use grows, we become
- * more aggressive in clamping this value.  That not only causes autovacuum
- * to ramp up, but also makes any manual vacuums the user issues more
- * aggressive.  This happens because vacuum_get_cutoffs() will clamp the
- * freeze table and the minimum freeze age cutoffs based on the effective
- * autovacuum_multixact_freeze_max_age this function returns.  In the worst
- * case, we'll claim the freeze_max_age to zero, and every vacuum of any
- * table will freeze every multixact.
- */
-int
-MultiXactMemberFreezeThreshold(void)
-{
-	MultiXactOffset members;
-	uint32		multixacts;
-	uint32		victim_multixacts;
-	double		fraction;
-	int			result;
-	MultiXactId oldestMultiXactId;
-	MultiXactOffset oldestOffset;
-
-	/* If we can't determine member space utilization, assume the worst. */
-	if (!GetMultiXactInfo(&multixacts, &members, &oldestMultiXactId, &oldestOffset))
-		return 0;
-
-	/* If member space utilization is low, no special action is required. */
-	if (members <= MULTIXACT_MEMBER_SAFE_THRESHOLD)
-		return autovacuum_multixact_freeze_max_age;
-
-	/*
-	 * Compute a target for relminmxid advancement.  The number of multixacts
-	 * we try to eliminate from the system is based on how far we are past
-	 * MULTIXACT_MEMBER_SAFE_THRESHOLD.
-	 */
-	fraction = (double) (members - MULTIXACT_MEMBER_SAFE_THRESHOLD) /
-		(MULTIXACT_MEMBER_DANGER_THRESHOLD - MULTIXACT_MEMBER_SAFE_THRESHOLD);
-	victim_multixacts = multixacts * fraction;
-
-	/* fraction could be > 1.0, but lowest possible freeze age is zero */
-	if (victim_multixacts > multixacts)
-		return 0;
-	result = multixacts - victim_multixacts;
-
-	/*
-	 * Clamp to autovacuum_multixact_freeze_max_age, so that we never make
-	 * autovacuum less aggressive than it would otherwise be.
-	 */
-	return Min(result, autovacuum_multixact_freeze_max_age);
-}
-
 typedef struct mxtruncinfo
 {
 	int64		earliestExistingPage;
@@ -2986,36 +2631,12 @@ SlruScanDirCbFindEarliest(SlruCtl ctl, char *filename, int64 segpage, void *data
 
 /*
  * Delete members segments [oldest, newOldest)
- *
- * The members SLRU can, in contrast to the offsets one, be filled to almost
- * the full range at once. This means SimpleLruTruncate() can't trivially be
- * used - instead the to-be-deleted range is computed using the offsets
- * SLRU. C.f. TruncateMultiXact().
  */
 static void
 PerformMembersTruncation(MultiXactOffset oldestOffset, MultiXactOffset newOldestOffset)
 {
-	const int64 maxsegment = MXOffsetToMemberSegment(MaxMultiXactOffset);
-	int64		startsegment = MXOffsetToMemberSegment(oldestOffset);
-	int64		endsegment = MXOffsetToMemberSegment(newOldestOffset);
-	int64		segment = startsegment;
-
-	/*
-	 * Delete all the segments but the last one. The last segment can still
-	 * contain, possibly partially, valid data.
-	 */
-	while (segment != endsegment)
-	{
-		elog(DEBUG2, "truncating multixact members segment %" PRIx64,
-			 segment);
-		SlruDeleteSegment(MultiXactMemberCtl, segment);
-
-		/* move to next segment, handling wraparound correctly */
-		if (segment == maxsegment)
-			segment = 0;
-		else
-			segment += 1;
-	}
+	SimpleLruTruncate(MultiXactMemberCtl,
+					  MXOffsetToMemberPage(newOldestOffset));
 }
 
 /*
@@ -3159,7 +2780,7 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 
 	elog(DEBUG1, "performing multixact truncation: "
 		 "offsets [%u, %u), offsets segments [%" PRIx64 ", %" PRIx64 "), "
-		 "members [%u, %u), members segments [%" PRIx64 ", %" PRIx64 ")",
+		 "members [%" PRIu64 ", %" PRIu64 "), members segments [%" PRIx64 ", %" PRIx64 ")",
 		 oldestMulti, newOldestMulti,
 		 MultiXactIdToOffsetSegment(oldestMulti),
 		 MultiXactIdToOffsetSegment(newOldestMulti),
@@ -3239,20 +2860,13 @@ MultiXactOffsetPagePrecedes(int64 page1, int64 page2)
 
 /*
  * Decide whether a MultiXactMember page number is "older" for truncation
- * purposes.  There is no "invalid offset number" so use the numbers verbatim.
+ * purposes.  There is no "invalid offset number" and members never wrap
+ * around, so use the numbers verbatim.
  */
 static bool
 MultiXactMemberPagePrecedes(int64 page1, int64 page2)
 {
-	MultiXactOffset offset1;
-	MultiXactOffset offset2;
-
-	offset1 = ((MultiXactOffset) page1) * MULTIXACT_MEMBERS_PER_PAGE;
-	offset2 = ((MultiXactOffset) page2) * MULTIXACT_MEMBERS_PER_PAGE;
-
-	return (MultiXactOffsetPrecedes(offset1, offset2) &&
-			MultiXactOffsetPrecedes(offset1,
-									offset2 + MULTIXACT_MEMBERS_PER_PAGE - 1));
+	return page1 < page2;
 }
 
 /*
@@ -3290,7 +2904,7 @@ MultiXactIdPrecedesOrEquals(MultiXactId multi1, MultiXactId multi2)
 static bool
 MultiXactOffsetPrecedes(MultiXactOffset offset1, MultiXactOffset offset2)
 {
-	int32		diff = (int32) (offset1 - offset2);
+	int64		diff = (int64) (offset1 - offset2);
 
 	return (diff < 0);
 }
@@ -3387,7 +3001,7 @@ multixact_redo(XLogReaderState *record)
 
 		elog(DEBUG1, "replaying multixact truncation: "
 			 "offsets [%u, %u), offsets segments [%" PRIx64 ", %" PRIx64 "), "
-			 "members [%u, %u), members segments [%" PRIx64 ", %" PRIx64 ")",
+			 "members [%" PRIu64 ", %" PRIu64 "), members segments [%" PRIx64 ", %" PRIx64 ")",
 			 xlrec.startTruncOff, xlrec.endTruncOff,
 			 MultiXactIdToOffsetSegment(xlrec.startTruncOff),
 			 MultiXactIdToOffsetSegment(xlrec.endTruncOff),
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 7c959051e11..2ade0b4a042 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5128,7 +5128,7 @@ BootStrapXLOG(uint32 data_checksum_version)
 		FullTransactionIdFromEpochAndXid(0, FirstNormalTransactionId);
 	checkPoint.nextOid = FirstGenbkiObjectId;
 	checkPoint.nextMulti = FirstMultiXactId;
-	checkPoint.nextMultiOffset = 0;
+	checkPoint.nextMultiOffset = 1;
 	checkPoint.oldestXid = FirstNormalTransactionId;
 	checkPoint.oldestXidDB = Template1DbOid;
 	checkPoint.oldestMulti = FirstMultiXactId;
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 550de6e4a59..66c2364aa9b 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -886,7 +886,7 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 							 U64FromFullTransactionId(checkPoint.nextXid),
 							 checkPoint.nextOid)));
 	ereport(DEBUG1,
-			(errmsg_internal("next MultiXactId: %u; next MultiXactOffset: %u",
+			(errmsg_internal("next MultiXactId: %u; next MultiXactOffset: %" PRIu64,
 							 checkPoint.nextMulti, checkPoint.nextMultiOffset)));
 	ereport(DEBUG1,
 			(errmsg_internal("oldest unfrozen transaction ID: %u, in database %u",
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index ed03e3bd50d..259ef60bd31 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -1147,7 +1147,7 @@ vacuum_get_cutoffs(Relation rel, const VacuumParams params,
 	 * normally autovacuum_multixact_freeze_max_age, but may be less if we are
 	 * short of multixact member space.
 	 */
-	effective_multixact_freeze_max_age = MultiXactMemberFreezeThreshold();
+	effective_multixact_freeze_max_age = autovacuum_multixact_freeze_max_age;
 
 	/*
 	 * Almost ready to set freeze output parameters; check if OldestXmin or
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index ed19c74bb19..34909ee54ff 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -1151,7 +1151,7 @@ do_start_worker(void)
 
 	/* Also determine the oldest datminmxid we will consider. */
 	recentMulti = ReadNextMultiXactId();
-	multiForceLimit = recentMulti - MultiXactMemberFreezeThreshold();
+	multiForceLimit = recentMulti - autovacuum_multixact_freeze_max_age;
 	if (multiForceLimit < FirstMultiXactId)
 		multiForceLimit -= FirstMultiXactId;
 
@@ -1939,7 +1939,7 @@ do_autovacuum(void)
 	 * normally autovacuum_multixact_freeze_max_age, but may be less if we are
 	 * short of multixact member space.
 	 */
-	effective_multixact_freeze_max_age = MultiXactMemberFreezeThreshold();
+	effective_multixact_freeze_max_age = autovacuum_multixact_freeze_max_age;
 
 	/*
 	 * Find the pg_database entry and select the default freeze ages. We use
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index 10de058ce91..5295108ade3 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -264,7 +264,7 @@ main(int argc, char *argv[])
 		   ControlFile->checkPointCopy.nextOid);
 	printf(_("Latest checkpoint's NextMultiXactId:  %u\n"),
 		   ControlFile->checkPointCopy.nextMulti);
-	printf(_("Latest checkpoint's NextMultiOffset:  %u\n"),
+	printf(_("Latest checkpoint's NextMultiOffset:  %" PRIu64 "\n"),
 		   ControlFile->checkPointCopy.nextMultiOffset);
 	printf(_("Latest checkpoint's oldestXID:        %u\n"),
 		   ControlFile->checkPointCopy.oldestXid);
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index a89d72fc5cf..4e5eeced89d 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -267,7 +267,7 @@ main(int argc, char *argv[])
 
 			case 'O':
 				errno = 0;
-				set_mxoff = strtoul(optarg, &endptr, 0);
+				set_mxoff = strtou64(optarg, &endptr, 0);
 				if (endptr == optarg || *endptr != '\0' || errno != 0)
 				{
 					pg_log_error("invalid argument for option %s", "-O");
@@ -743,7 +743,7 @@ PrintControlValues(bool guessed)
 		   ControlFile.checkPointCopy.nextOid);
 	printf(_("Latest checkpoint's NextMultiXactId:  %u\n"),
 		   ControlFile.checkPointCopy.nextMulti);
-	printf(_("Latest checkpoint's NextMultiOffset:  %u\n"),
+	printf(_("Latest checkpoint's NextMultiOffset:  %" PRIu64 "\n"),
 		   ControlFile.checkPointCopy.nextMultiOffset);
 	printf(_("Latest checkpoint's oldestXID:        %u\n"),
 		   ControlFile.checkPointCopy.oldestXid);
@@ -817,7 +817,7 @@ PrintNewControlValues(void)
 
 	if (set_mxoff != -1)
 	{
-		printf(_("NextMultiOffset:                      %u\n"),
+		printf(_("NextMultiOffset:                      %" PRIu64 "\n"),
 			   ControlFile.checkPointCopy.nextMultiOffset);
 	}
 
diff --git a/src/bin/pg_resetwal/t/001_basic.pl b/src/bin/pg_resetwal/t/001_basic.pl
index d6bbbd0ceda..cc89e0764ae 100644
--- a/src/bin/pg_resetwal/t/001_basic.pl
+++ b/src/bin/pg_resetwal/t/001_basic.pl
@@ -213,7 +213,7 @@ push @cmd,
   sprintf("%d,%d", hex($files[0]) == 0 ? 3 : hex($files[0]), hex($files[-1]));
 
 @files = get_slru_files('pg_multixact/offsets');
-$mult = 32 * $blcksz / 4;
+$mult = 32 * $blcksz / 8;
 # --multixact-ids argument is "new,old"
 push @cmd,
   '--multixact-ids' => sprintf("%d,%d",
diff --git a/src/include/access/multixact.h b/src/include/access/multixact.h
index 82e4bb90dd5..7d98fe0fe32 100644
--- a/src/include/access/multixact.h
+++ b/src/include/access/multixact.h
@@ -28,8 +28,6 @@
 
 #define MultiXactIdIsValid(multi) ((multi) != InvalidMultiXactId)
 
-#define MaxMultiXactOffset	((MultiXactOffset) 0xFFFFFFFF)
-
 /*
  * Possible multixact lock modes ("status").  The first four modes are for
  * tuple locks (FOR KEY SHARE, FOR SHARE, FOR NO KEY UPDATE, FOR UPDATE); the
@@ -147,7 +145,6 @@ extern void MultiXactSetNextMXact(MultiXactId nextMulti,
 extern void MultiXactAdvanceNextMXact(MultiXactId minMulti,
 									  MultiXactOffset minMultiOffset);
 extern void MultiXactAdvanceOldest(MultiXactId oldestMulti, Oid oldestMultiDB);
-extern int	MultiXactMemberFreezeThreshold(void);
 
 extern void multixact_twophase_recover(FullTransactionId fxid, uint16 info,
 									   void *recdata, uint32 len);
diff --git a/src/include/access/multixact_internal.h b/src/include/access/multixact_internal.h
new file mode 100644
index 00000000000..73fb3e998fb
--- /dev/null
+++ b/src/include/access/multixact_internal.h
@@ -0,0 +1,109 @@
+/*
+ * multixact_internal.h
+ *
+ * Defines and helper functions for the PostgreSQL multi-transaction-log manager
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/access/multixact_internal.h
+ */
+#ifndef MULTIXACT_INTERNAL_H
+#define MULTIXACT_INTERNAL_H
+
+#include "postgres.h"
+
+#include "access/multixact.h"
+
+/*
+ * Defines for MultiXactOffset page sizes.  A page is the same BLCKSZ as is
+ * used everywhere else in Postgres.
+ */
+
+/* We need 8 bytes per offset */
+#define MULTIXACT_OFFSETS_PER_PAGE (BLCKSZ / sizeof(MultiXactOffset))
+
+static inline int64
+MultiXactIdToOffsetPage(MultiXactId multi)
+{
+	return multi / MULTIXACT_OFFSETS_PER_PAGE;
+}
+
+static inline int
+MultiXactIdToOffsetEntry(MultiXactId multi)
+{
+	return multi % MULTIXACT_OFFSETS_PER_PAGE;
+}
+
+/*
+ * The situation for members is a bit more complex: we store one byte of
+ * additional flag bits for each TransactionId.  To do this without getting
+ * into alignment issues, we store four bytes of flags, and then the
+ * corresponding 4 Xids.  Each such 5-word (20-byte) set we call a "group", and
+ * are stored as a whole in pages.  Thus, with 8kB BLCKSZ, we keep 409 groups
+ * per page.  This wastes 12 bytes per page, but that's OK -- simplicity (and
+ * performance) trumps space efficiency here.
+ *
+ * Note that the "offset" macros work with byte offset, not array indexes, so
+ * arithmetic must be done using "char *" pointers.
+ */
+/* We need eight bits per xact, so one xact fits in a byte */
+#define MXACT_MEMBER_BITS_PER_XACT			8
+#define MXACT_MEMBER_FLAGS_PER_BYTE			1
+#define MXACT_MEMBER_XACT_BITMASK	((1 << MXACT_MEMBER_BITS_PER_XACT) - 1)
+
+/* how many full bytes of flags are there in a group? */
+#define MULTIXACT_FLAGBYTES_PER_GROUP		4
+#define MULTIXACT_MEMBERS_PER_MEMBERGROUP	\
+	(MULTIXACT_FLAGBYTES_PER_GROUP * MXACT_MEMBER_FLAGS_PER_BYTE)
+/* size in bytes of a complete group */
+#define MULTIXACT_MEMBERGROUP_SIZE \
+	(sizeof(TransactionId) * MULTIXACT_MEMBERS_PER_MEMBERGROUP + MULTIXACT_FLAGBYTES_PER_GROUP)
+#define MULTIXACT_MEMBERGROUPS_PER_PAGE (BLCKSZ / MULTIXACT_MEMBERGROUP_SIZE)
+#define MULTIXACT_MEMBERS_PER_PAGE	\
+	(MULTIXACT_MEMBERGROUPS_PER_PAGE * MULTIXACT_MEMBERS_PER_MEMBERGROUP)
+
+/* page in which a member is to be found */
+static inline int64
+MXOffsetToMemberPage(MultiXactOffset offset)
+{
+	return offset / MULTIXACT_MEMBERS_PER_PAGE;
+}
+
+static inline int64
+MXOffsetToMemberSegment(MultiXactOffset offset)
+{
+	return MXOffsetToMemberPage(offset) / SLRU_PAGES_PER_SEGMENT;
+}
+
+/* Location (byte offset within page) of flag word for a given member */
+static inline int
+MXOffsetToFlagsOffset(MultiXactOffset offset)
+{
+	MultiXactOffset group = offset / MULTIXACT_MEMBERS_PER_MEMBERGROUP;
+	int			grouponpg = group % MULTIXACT_MEMBERGROUPS_PER_PAGE;
+	int			byteoff = grouponpg * MULTIXACT_MEMBERGROUP_SIZE;
+
+	return byteoff;
+}
+
+static inline int
+MXOffsetToFlagsBitShift(MultiXactOffset offset)
+{
+	int			member_in_group = offset % MULTIXACT_MEMBERS_PER_MEMBERGROUP;
+	int			bshift = member_in_group * MXACT_MEMBER_BITS_PER_XACT;
+
+	return bshift;
+}
+
+/* Location (byte offset within page) of TransactionId of given member */
+static inline int
+MXOffsetToMemberOffset(MultiXactOffset offset)
+{
+	int			member_in_group = offset % MULTIXACT_MEMBERS_PER_MEMBERGROUP;
+
+	return MXOffsetToFlagsOffset(offset) +
+		MULTIXACT_FLAGBYTES_PER_GROUP +
+		member_in_group * sizeof(TransactionId);
+}
+
+#endif							/* MULTIXACT_INTERNAL_H */
diff --git a/src/include/access/slru.h b/src/include/access/slru.h
index 8d57753ed01..8576649b15e 100644
--- a/src/include/access/slru.h
+++ b/src/include/access/slru.h
@@ -23,21 +23,6 @@
  */
 #define SLRU_MAX_ALLOWED_BUFFERS ((1024 * 1024 * 1024) / BLCKSZ)
 
-/*
- * Define SLRU segment size.  A page is the same BLCKSZ as is used everywhere
- * else in Postgres.  The segment size can be chosen somewhat arbitrarily;
- * we make it 32 pages by default, or 256Kb, i.e. 1M transactions for CLOG
- * or 64K transactions for SUBTRANS.
- *
- * Note: because TransactionIds are 32 bits and wrap around at 0xFFFFFFFF,
- * page numbering also wraps around at 0xFFFFFFFF/xxxx_XACTS_PER_PAGE (where
- * xxxx is CLOG or SUBTRANS, respectively), and segment numbering at
- * 0xFFFFFFFF/xxxx_XACTS_PER_PAGE/SLRU_PAGES_PER_SEGMENT.  We need
- * take no explicit notice of that fact in slru.c, except when comparing
- * segment and page numbers in SimpleLruTruncate (see PagePrecedes()).
- */
-#define SLRU_PAGES_PER_SEGMENT	32
-
 /*
  * Page status codes.  Note that these do not include the "dirty" bit.
  * page_dirty can be true only in the VALID or WRITE_IN_PROGRESS states;
diff --git a/src/include/c.h b/src/include/c.h
index 757dfff4782..bc92a6f4565 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -670,7 +670,7 @@ typedef uint32 SubTransactionId;
 /* MultiXactId must be equivalent to TransactionId, to fit in t_xmax */
 typedef TransactionId MultiXactId;
 
-typedef uint32 MultiXactOffset;
+typedef uint64 MultiXactOffset;
 
 typedef uint32 CommandId;
 
diff --git a/src/include/pg_config_manual.h b/src/include/pg_config_manual.h
index 7e1aa422332..8556ce40cbf 100644
--- a/src/include/pg_config_manual.h
+++ b/src/include/pg_config_manual.h
@@ -356,3 +356,18 @@
  * Enable tracing of syncscan operations (see also the trace_syncscan GUC var).
  */
 /* #define TRACE_SYNCSCAN */
+
+/*
+ * Define SLRU segment size.  A page is the same BLCKSZ as is used everywhere
+ * else in Postgres.  The segment size can be chosen somewhat arbitrarily;
+ * we make it 32 pages by default, or 256Kb, i.e. 1M transactions for CLOG
+ * or 64K transactions for SUBTRANS.
+ *
+ * Note: because TransactionIds are 32 bits and wrap around at 0xFFFFFFFF,
+ * page numbering also wraps around at 0xFFFFFFFF/xxxx_XACTS_PER_PAGE (where
+ * xxxx is CLOG or SUBTRANS, respectively), and segment numbering at
+ * 0xFFFFFFFF/xxxx_XACTS_PER_PAGE/SLRU_PAGES_PER_SEGMENT.  We need
+ * take no explicit notice of that fact in slru.c, except when comparing
+ * segment and page numbers in SimpleLruTruncate (see PagePrecedes()).
+ */
+#define SLRU_PAGES_PER_SEGMENT	32
-- 
2.51.0
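
Reader's note on the page-layout comment in multixact_internal.h above: the arithmetic can be checked with a small standalone sketch. It is not part of the patch. It assumes the default 8kB block size, and every `SKETCH_`/`sketch_` name is a hypothetical stand-in for the corresponding macro or inline function in the header.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: default 8kB block size; a real build takes BLCKSZ from configure. */
#define SKETCH_BLCKSZ 8192

typedef uint32_t SketchXid;		/* stand-in for TransactionId */
typedef uint64_t SketchOffset;	/* stand-in for the 64-bit MultiXactOffset */

/* Four flag bytes followed by four xids make one 20-byte group. */
#define SKETCH_FLAGBYTES_PER_GROUP	4
#define SKETCH_MEMBERS_PER_GROUP	4
#define SKETCH_GROUP_SIZE \
	((int) sizeof(SketchXid) * SKETCH_MEMBERS_PER_GROUP + SKETCH_FLAGBYTES_PER_GROUP)
#define SKETCH_GROUPS_PER_PAGE	(SKETCH_BLCKSZ / SKETCH_GROUP_SIZE)
#define SKETCH_MEMBERS_PER_PAGE	(SKETCH_GROUPS_PER_PAGE * SKETCH_MEMBERS_PER_GROUP)
#define SKETCH_OFFSETS_PER_PAGE	(SKETCH_BLCKSZ / (int) sizeof(SketchOffset))

/* Page in which a member is to be found. */
static inline int64_t
sketch_member_page(SketchOffset offset)
{
	return (int64_t) (offset / SKETCH_MEMBERS_PER_PAGE);
}

/* Byte offset within the page of the flag word for a given member. */
static inline int
sketch_flags_offset(SketchOffset offset)
{
	SketchOffset group = offset / SKETCH_MEMBERS_PER_GROUP;
	int			grouponpg = (int) (group % SKETCH_GROUPS_PER_PAGE);

	return grouponpg * SKETCH_GROUP_SIZE;
}

/* Bit shift of a member's flag byte inside its group's flag word. */
static inline int
sketch_flags_bitshift(SketchOffset offset)
{
	return (int) (offset % SKETCH_MEMBERS_PER_GROUP) * 8;
}

/* Byte offset within the page of the xid for a given member. */
static inline int
sketch_member_offset(SketchOffset offset)
{
	int			member_in_group = (int) (offset % SKETCH_MEMBERS_PER_GROUP);
	int			byteoff = sketch_flags_offset(offset) +
		SKETCH_FLAGBYTES_PER_GROUP +
		member_in_group * (int) sizeof(SketchXid);

	/* The xid must lie entirely inside the page. */
	assert(byteoff + (int) sizeof(SketchXid) <= SKETCH_BLCKSZ);
	return byteoff;
}
```

Under these assumptions a page holds 409 groups (1636 members) and leaves 12 bytes unused, matching the comment, while an offsets page holds 1024 eight-byte entries. Member offset 5, for example, falls in the second group of page 0: flag word at byte 20, bit shift 8, xid at byte 28.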

v23-0005-Handle-wraparound-of-next-new-multi-in-pg_upgrad.patch (application/octet-stream)
From 06dda16dbe232fa304c9ce418d322cf13bac1901 Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Thu, 6 Nov 2025 16:54:33 +0300
Subject: [PATCH v23 5/7] Handle wraparound of next new multi in pg_upgrade

Per BUG #18863 and BUG #18865
---
 src/bin/pg_upgrade/meson.build         |   1 +
 src/bin/pg_upgrade/pg_upgrade.c        |   5 +
 src/bin/pg_upgrade/t/007_multi_wrap.pl | 176 +++++++++++++++++++++++++
 3 files changed, 182 insertions(+)
 create mode 100644 src/bin/pg_upgrade/t/007_multi_wrap.pl

diff --git a/src/bin/pg_upgrade/meson.build b/src/bin/pg_upgrade/meson.build
index 3e46c4512cf..ca87ae221ce 100644
--- a/src/bin/pg_upgrade/meson.build
+++ b/src/bin/pg_upgrade/meson.build
@@ -50,6 +50,7 @@ tests += {
       't/004_subscription.pl',
       't/005_char_signedness.pl',
       't/006_transfer_modes.pl',
+      't/007_multi_wrap.pl',
     ],
     'test_kwargs': {'priority': 40}, # pg_upgrade tests are slow
   },
diff --git a/src/bin/pg_upgrade/pg_upgrade.c b/src/bin/pg_upgrade/pg_upgrade.c
index 0fdd05c127c..eb87052c4ad 100644
--- a/src/bin/pg_upgrade/pg_upgrade.c
+++ b/src/bin/pg_upgrade/pg_upgrade.c
@@ -908,11 +908,16 @@ copy_xact_xlog_xid(void)
 			remove_new_subdir("pg_multixact/offsets", false);
 
 			prep_status("Converting pg_multixact/offsets to 64-bit");
+			/* convert_multixacts handles new_nxtmulti wraparound */
 			convert_multixacts(&new_nxtmulti, &new_nxtmxoff);
 			check_ok();
 		}
 		else
 		{
+			/* handle wraparound */
+			if (new_nxtmulti < FirstMultiXactId)
+				new_nxtmulti = FirstMultiXactId;
+
 			copy_subdir_files("pg_multixact/offsets", "pg_multixact/offsets");
 			copy_subdir_files("pg_multixact/members", "pg_multixact/members");
 		}
diff --git a/src/bin/pg_upgrade/t/007_multi_wrap.pl b/src/bin/pg_upgrade/t/007_multi_wrap.pl
new file mode 100644
index 00000000000..0ad8fd59906
--- /dev/null
+++ b/src/bin/pg_upgrade/t/007_multi_wrap.pl
@@ -0,0 +1,176 @@
+
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+use strict;
+use warnings FATAL => 'all';
+
+use Math::BigInt;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use PostgreSQL::Test::AdjustDump;
+use PostgreSQL::Test::AdjustUpgrade;
+use Test::More;
+
+# Temp dir for dumps.
+my $tempdir = PostgreSQL::Test::Utils::tempdir;
+
+# Can be changed to test the other modes.
+my $mode = $ENV{PG_TEST_PG_UPGRADE_MODE} || '--copy';
+
+# Handy pg_resetwal wrapper
+sub reset_mxoff
+{
+	my %args = @_;
+
+	my $node = $args{node};
+	my $offset = $args{offset};
+	my $multi = $args{multi};
+	my $blcksz = sub # Get block size
+	{
+		my $out = (run_command([ 'pg_resetwal', '--dry-run',
+								 $node->data_dir ]))[0];
+		$out =~ /^Database block size: *(\d+)$/m or die;
+		return $1;
+	}->();
+
+	my @cmd;
+
+	# Reset cluster
+	@cmd = ('pg_resetwal', '--pgdata' => $node->data_dir);
+	if (defined($offset))
+	{
+		push @cmd, '--multixact-offset' => $offset;
+	}
+	if (defined($multi))
+	{
+		push @cmd, "--multixact-ids=$multi,$multi";
+	}
+	command_ok(\@cmd, 'reset multi/offset');
+
+	my $n_items;
+	my $segname;
+
+	# Fill empty pg_multixact segments
+	if (defined($offset))
+	{
+		$n_items = 32 * int($blcksz / 20) * 4;
+		$segname = sprintf "%015X", ($offset / $n_items);
+		$segname = $node->data_dir . "/pg_multixact/members/" . $segname;
+
+		@cmd = ('dd');
+		push @cmd, "if=/dev/zero";
+		push @cmd, "of=" . $segname;
+		push @cmd, "bs=$blcksz";
+		push @cmd, "count=32";
+		command_ok(\@cmd, 'fill empty multixact-members');
+	}
+
+	if (defined($multi))
+	{
+		$n_items = 32 * int($blcksz / 8);
+		$segname = sprintf "%04X", $multi / $n_items;
+		$segname = $node->data_dir . "/pg_multixact/offsets/" . $segname;
+
+		@cmd = ('dd');
+		push @cmd, "if=/dev/zero";
+		push @cmd, "of=" . $segname;
+		push @cmd, "bs=$blcksz";
+		push @cmd, "count=32";
+		command_ok(\@cmd, 'fill empty multixact-offsets');
+	}
+}
+
+sub get_dump_for_comparison
+{
+	my ($node, $db, $file_prefix, $adjust_child_columns) = @_;
+
+	my $dumpfile = $tempdir . '/' . $file_prefix . '.sql';
+	my $dump_adjusted = "${dumpfile}_adjusted";
+
+	open(my $dh, '>', $dump_adjusted)
+	  || die "could not open $dump_adjusted for writing $!";
+
+	$node->run_log(
+		[
+			'pg_dump', '--no-sync',
+			'--restrict-key' => 'test',
+			'-d' => $node->connstr($db),
+			'-f' => $dumpfile
+		]);
+
+	print $dh adjust_regress_dumpfile(slurp_file($dumpfile),
+		$adjust_child_columns);
+	close($dh);
+
+	return $dump_adjusted;
+}
+
+# Create old node
+my $old = PostgreSQL::Test::Cluster->new("old");
+$old->init;
+reset_mxoff(node => $old, multi => 4294967295, offset => 429496729);
+
+$old->start;
+$old->safe_psql('postgres',
+qq(
+	CREATE TABLE test_table (id integer NOT NULL PRIMARY KEY, val text);
+	INSERT INTO test_table VALUES (1, 'a');
+));
+
+my $conn1 = $old->background_psql('postgres');
+my $conn2 = $old->background_psql('postgres');
+
+$conn1->query_safe(qq(
+	BEGIN;
+	SELECT * FROM test_table WHERE id = 1 FOR SHARE;
+));
+$conn2->query_safe(qq(
+	BEGIN;
+	SELECT * FROM test_table WHERE id = 1 FOR SHARE;
+));
+
+$conn1->query_safe(qq(COMMIT;));
+$conn2->query_safe(qq(COMMIT;));
+
+$conn1->quit;
+$conn2->quit;
+
+$old->stop;
+
+# Create new node
+my $new = PostgreSQL::Test::Cluster->new("new");
+$new->init;
+
+# Run pg_upgrade
+command_ok(
+	[
+		'pg_upgrade', '--no-sync',
+		'--old-datadir' => $old->data_dir,
+		'--new-datadir' => $new->data_dir,
+		'--old-bindir' => $old->config_data('--bindir'),
+		'--new-bindir' => $new->config_data('--bindir'),
+		'--socketdir' => $new->host,
+		'--old-port' => $old->port,
+		'--new-port' => $new->port,
+		$mode,
+	],
+	'run of pg_upgrade for new instance');
+ok( !-d $new->data_dir . "/pg_upgrade_output.d",
+	"pg_upgrade_output.d/ removed after pg_upgrade success");
+
+$old->start;
+my $src_dump =
+	get_dump_for_comparison($old, 'postgres',
+							"oldnode_1_dump", 0);
+$old->stop;
+
+$new->start;
+my $dst_dump =
+	get_dump_for_comparison($new, 'postgres',
+							"newnode_1_dump", 0);
+$new->stop;
+
+compare_files($src_dump, $dst_dump,
+	'dump outputs from original and restored regression databases match');
+
+done_testing();
-- 
2.51.0
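
Reader's note on the wraparound case exercised by 007_multi_wrap.pl above: the test starts the old cluster at multixact 4294967295, so the next value wraps to 0, which is InvalidMultiXactId, and has to be bumped to FirstMultiXactId (1). The sketch below is a standalone illustration of that clamp and of the offsets segment arithmetic the test script uses. It is not code from the patch. It assumes 8kB blocks, 32 pages per segment and 8-byte offsets, and all `SKETCH_`/`sketch_` names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t SketchMultiXactId;	/* stand-in for MultiXactId */

#define SKETCH_INVALID_MULTI	((SketchMultiXactId) 0)
#define SKETCH_FIRST_MULTI		((SketchMultiXactId) 1)

/* Assumptions: 8kB blocks, 32 pages per SLRU segment, 8-byte offsets. */
#define SKETCH_BLCKSZ				8192
#define SKETCH_PAGES_PER_SEGMENT	32
#define SKETCH_OFFSETS_PER_SEGMENT \
	(SKETCH_PAGES_PER_SEGMENT * (SKETCH_BLCKSZ / 8))

/* Same shape as the check added to copy_xact_xlog_xid(): skip the invalid id. */
static inline SketchMultiXactId
sketch_clamp_next_multi(SketchMultiXactId next)
{
	if (next < SKETCH_FIRST_MULTI)
		next = SKETCH_FIRST_MULTI;
	return next;
}

/* Next multixact id after a valid one, wrapping past 0xFFFFFFFF to 1. */
static inline SketchMultiXactId
sketch_next_multi_after(SketchMultiXactId multi)
{
	assert(multi != SKETCH_INVALID_MULTI);
	return sketch_clamp_next_multi((SketchMultiXactId) (multi + 1u));
}

/* Offsets segment that holds a given multixact id. */
static inline int64_t
sketch_offsets_segment(SketchMultiXactId multi)
{
	return (int64_t) (multi / SKETCH_OFFSETS_PER_SEGMENT);
}
```

With these assumptions, multixact 4294967295 lives in offsets segment 0x1FFFF, which is the file the test zero-fills with dd, and the id that follows it is 1 rather than 0.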

v23-0002-Add-pg_upgarde-for-64-bit-multixact-offsets.patch (application/octet-stream)
From 96f343db5e75a3b8e74364816fbf496bbb43584f Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Fri, 24 Oct 2025 10:58:37 +0300
Subject: [PATCH v23 2/7] Add pg_upgrade for 64 bit multixact offsets

Author: Maxim Orlov <orlovmg@gmail.com>
Author: Heikki Linnakangas <hlinnaka@iki.fi>
---
 src/backend/access/transam/multixact.c |  56 -----
 src/bin/pg_upgrade/Makefile            |   3 +
 src/bin/pg_upgrade/meson.build         |   3 +
 src/bin/pg_upgrade/multixact_new.c     | 103 +++++++++
 src/bin/pg_upgrade/multixact_new.h     |  23 ++
 src/bin/pg_upgrade/multixact_old.c     | 297 +++++++++++++++++++++++++
 src/bin/pg_upgrade/multixact_old.h     |  29 +++
 src/bin/pg_upgrade/pg_upgrade.c        | 108 ++++++++-
 src/bin/pg_upgrade/pg_upgrade.h        |   5 +
 src/bin/pg_upgrade/slru_io.c           | 239 ++++++++++++++++++++
 src/bin/pg_upgrade/slru_io.h           |  23 ++
 src/tools/pgindent/typedefs.list       |   3 +
 12 files changed, 830 insertions(+), 62 deletions(-)
 create mode 100644 src/bin/pg_upgrade/multixact_new.c
 create mode 100644 src/bin/pg_upgrade/multixact_new.h
 create mode 100644 src/bin/pg_upgrade/multixact_old.c
 create mode 100644 src/bin/pg_upgrade/multixact_old.h
 create mode 100644 src/bin/pg_upgrade/slru_io.c
 create mode 100644 src/bin/pg_upgrade/slru_io.h

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 732f048df5e..9c365689080 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1829,48 +1829,6 @@ BootStrapMultiXact(void)
 	SimpleLruZeroAndWritePage(MultiXactMemberCtl, 0);
 }
 
-/*
- * MaybeExtendOffsetSlru
- *		Extend the offsets SLRU area, if necessary
- *
- * After a binary upgrade from <= 9.2, the pg_multixact/offsets SLRU area might
- * contain files that are shorter than necessary; this would occur if the old
- * installation had used multixacts beyond the first page (files cannot be
- * copied, because the on-disk representation is different).  pg_upgrade would
- * update pg_control to set the next offset value to be at that position, so
- * that tuples marked as locked by such MultiXacts would be seen as visible
- * without having to consult multixact.  However, trying to create and use a
- * new MultiXactId would result in an error because the page on which the new
- * value would reside does not exist.  This routine is in charge of creating
- * such pages.
- */
-static void
-MaybeExtendOffsetSlru(void)
-{
-	int64		pageno;
-	LWLock	   *lock;
-
-	pageno = MultiXactIdToOffsetPage(MultiXactState->nextMXact);
-	lock = SimpleLruGetBankLock(MultiXactOffsetCtl, pageno);
-
-	LWLockAcquire(lock, LW_EXCLUSIVE);
-
-	if (!SimpleLruDoesPhysicalPageExist(MultiXactOffsetCtl, pageno))
-	{
-		int			slotno;
-
-		/*
-		 * Fortunately for us, SimpleLruWritePage is already prepared to deal
-		 * with creating a new segment file even if the page we're writing is
-		 * not the first in it, so this is enough.
-		 */
-		slotno = SimpleLruZeroPage(MultiXactOffsetCtl, pageno);
-		SimpleLruWritePage(MultiXactOffsetCtl, slotno);
-	}
-
-	LWLockRelease(lock);
-}
-
 /*
  * This must be called ONCE during postmaster or standalone-backend startup.
  *
@@ -2063,20 +2021,6 @@ MultiXactSetNextMXact(MultiXactId nextMulti,
 	MultiXactState->nextMXact = nextMulti;
 	MultiXactState->nextOffset = nextMultiOffset;
 	LWLockRelease(MultiXactGenLock);
-
-	/*
-	 * During a binary upgrade, make sure that the offsets SLRU is large
-	 * enough to contain the next value that would be created.
-	 *
-	 * We need to do this pretty early during the first startup in binary
-	 * upgrade mode: before StartupMultiXact() in fact, because this routine
-	 * is called even before that by StartupXLOG().  And we can't do it
-	 * earlier than at this point, because during that first call of this
-	 * routine we determine the MultiXactState->nextMXact value that
-	 * MaybeExtendOffsetSlru needs.
-	 */
-	if (IsBinaryUpgrade)
-		MaybeExtendOffsetSlru();
 }
 
 /*
diff --git a/src/bin/pg_upgrade/Makefile b/src/bin/pg_upgrade/Makefile
index 69fcf593cae..42995d53b0b 100644
--- a/src/bin/pg_upgrade/Makefile
+++ b/src/bin/pg_upgrade/Makefile
@@ -18,11 +18,14 @@ OBJS = \
 	file.o \
 	function.o \
 	info.o \
+	multixact_new.o \
+	multixact_old.o \
 	option.o \
 	parallel.o \
 	pg_upgrade.o \
 	relfilenumber.o \
 	server.o \
+	slru_io.o \
 	tablespace.o \
 	task.o \
 	util.o \
diff --git a/src/bin/pg_upgrade/meson.build b/src/bin/pg_upgrade/meson.build
index ac992f0d14b..3e46c4512cf 100644
--- a/src/bin/pg_upgrade/meson.build
+++ b/src/bin/pg_upgrade/meson.build
@@ -8,11 +8,14 @@ pg_upgrade_sources = files(
   'file.c',
   'function.c',
   'info.c',
+  'multixact_new.c',
+  'multixact_old.c',
   'option.c',
   'parallel.c',
   'pg_upgrade.c',
   'relfilenumber.c',
   'server.c',
+  'slru_io.c',
   'tablespace.c',
   'task.c',
   'util.c',
diff --git a/src/bin/pg_upgrade/multixact_new.c b/src/bin/pg_upgrade/multixact_new.c
new file mode 100644
index 00000000000..f565a378254
--- /dev/null
+++ b/src/bin/pg_upgrade/multixact_new.c
@@ -0,0 +1,103 @@
+/*
+ * multixact_new.c
+ *
+ * Functions to write multixacts in the v19 format with 64-bit
+ * MultiXactOffsets
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ * src/bin/pg_upgrade/multixact_new.c
+ */
+
+#include "postgres_fe.h"
+
+#include "access/multixact.h"
+#include "access/multixact_internal.h"
+
+#include "multixact_new.h"
+
+MultiXactWriter *
+AllocMultiXactWrite(const char *pgdata, MultiXactId firstMulti,
+					MultiXactOffset firstOffset)
+{
+	MultiXactWriter *state = pg_malloc(sizeof(*state));
+	char		dir[MAXPGPATH] = {0};
+
+	pg_sprintf(dir, "%s/pg_multixact/offsets", pgdata);
+	state->offset = AllocSlruWrite(dir, false);
+	SlruWriteSwitchPage(state->offset, MultiXactIdToOffsetPage(firstMulti));
+
+	pg_sprintf(dir, "%s/pg_multixact/members", pgdata);
+	state->members = AllocSlruWrite(dir, true /* use long segment names */ );
+	SlruWriteSwitchPage(state->members, MXOffsetToMemberPage(firstOffset));
+
+	return state;
+}
+
+/*
+ * Write a new multixact with members.
+ *
+ * Simplified version of the corresponding server function, hence the name.
+ */
+void
+RecordNewMultiXact(MultiXactWriter *state, MultiXactOffset offset,
+				   MultiXactId multi, int nmembers, MultiXactMember *members)
+{
+	int64		pageno;
+	int64		prev_pageno;
+	int			entryno;
+	char	   *buf;
+	MultiXactOffset *offptr;
+
+	pageno = MultiXactIdToOffsetPage(multi);
+	entryno = MultiXactIdToOffsetEntry(multi);
+
+	/* Store the offset */
+	buf = SlruWriteSwitchPage(state->offset, pageno);
+	offptr = (MultiXactOffset *) buf;
+	offptr[entryno] = offset;
+
+	/* Store the members */
+	prev_pageno = -1;
+	for (int i = 0; i < nmembers; i++, offset++)
+	{
+		TransactionId *memberptr;
+		uint32	   *flagsptr;
+		uint32		flagsval;
+		int			bshift;
+		int			flagsoff;
+		int			memberoff;
+
+		Assert(members[i].status <= MultiXactStatusUpdate);
+
+		pageno = MXOffsetToMemberPage(offset);
+		memberoff = MXOffsetToMemberOffset(offset);
+		flagsoff = MXOffsetToFlagsOffset(offset);
+		bshift = MXOffsetToFlagsBitShift(offset);
+
+		if (pageno != prev_pageno)
+		{
+			buf = SlruWriteSwitchPage(state->members, pageno);
+			prev_pageno = pageno;
+		}
+
+		memberptr = (TransactionId *) (buf + memberoff);
+
+		*memberptr = members[i].xid;
+
+		flagsptr = (uint32 *) (buf + flagsoff);
+
+		flagsval = *flagsptr;
+		flagsval &= ~(((1 << MXACT_MEMBER_BITS_PER_XACT) - 1) << bshift);
+		flagsval |= (members[i].status << bshift);
+		*flagsptr = flagsval;
+	}
+}
+
+void
+FreeMultiXactWrite(MultiXactWriter *state)
+{
+	FreeSlruWrite(state->offset);
+	FreeSlruWrite(state->members);
+
+	pfree(state);
+}
diff --git a/src/bin/pg_upgrade/multixact_new.h b/src/bin/pg_upgrade/multixact_new.h
new file mode 100644
index 00000000000..f66e6af7e45
--- /dev/null
+++ b/src/bin/pg_upgrade/multixact_new.h
@@ -0,0 +1,23 @@
+/*
+ * multixact_new.h
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ * src/bin/pg_upgrade/multixact_new.h
+ */
+#include "access/multixact.h"
+
+#include "slru_io.h"
+
+typedef struct MultiXactWriter
+{
+	SlruSegState *offset;
+	SlruSegState *members;
+} MultiXactWriter;
+
+extern MultiXactWriter *AllocMultiXactWrite(const char *pgdata,
+											MultiXactId firstMulti,
+											MultiXactOffset firstOffset);
+extern void RecordNewMultiXact(MultiXactWriter *state, MultiXactOffset offset,
+							   MultiXactId multi, int nmembers,
+							   MultiXactMember *members);
+extern void FreeMultiXactWrite(MultiXactWriter *writer);
diff --git a/src/bin/pg_upgrade/multixact_old.c b/src/bin/pg_upgrade/multixact_old.c
new file mode 100644
index 00000000000..70ae88d97f4
--- /dev/null
+++ b/src/bin/pg_upgrade/multixact_old.c
@@ -0,0 +1,297 @@
+/*
+ * multixact_old.c
+ *
+ * Functions to read pre-v19 multixacts
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ * src/bin/pg_upgrade/multixact_old.c
+ */
+
+#include "postgres_fe.h"
+
+#include "multixact_old.h"
+#include "pg_upgrade.h"
+
+/*
+ * NOTE: below are a bunch of definitions and simple static inline functions
+ * that are copy-pasted from multixact.c from version 18.  The only difference
+ * is that we use the OldMultiXactOffset type equal to uint32 instead of
+ * MultiXactOffset which became uint64.
+ */
+
+/* We need four bytes per offset */
+#define MULTIXACT_OFFSETS_PER_PAGE (BLCKSZ / sizeof(OldMultiXactOffset))
+
+static inline int64
+MultiXactIdToOffsetPage(MultiXactId multi)
+{
+	return multi / MULTIXACT_OFFSETS_PER_PAGE;
+}
+
+static inline int
+MultiXactIdToOffsetEntry(MultiXactId multi)
+{
+	return multi % MULTIXACT_OFFSETS_PER_PAGE;
+}
+
+/*
+ * The situation for members is a bit more complex: we store one byte of
+ * additional flag bits for each TransactionId.  To do this without getting
+ * into alignment issues, we store four bytes of flags, and then the
+ * corresponding 4 Xids.  Each such 5-word (20-byte) set we call a "group", and
+ * are stored as a whole in pages.  Thus, with 8kB BLCKSZ, we keep 409 groups
+ * per page.  This wastes 12 bytes per page, but that's OK -- simplicity (and
+ * performance) trumps space efficiency here.
+ *
+ * Note that the "offset" macros work with byte offset, not array indexes, so
+ * arithmetic must be done using "char *" pointers.
+ */
+/* We need eight bits per xact, so one xact fits in a byte */
+#define MXACT_MEMBER_BITS_PER_XACT			8
+#define MXACT_MEMBER_FLAGS_PER_BYTE			1
+#define MXACT_MEMBER_XACT_BITMASK	((1 << MXACT_MEMBER_BITS_PER_XACT) - 1)
+
+/* how many full bytes of flags are there in a group? */
+#define MULTIXACT_FLAGBYTES_PER_GROUP		4
+#define MULTIXACT_MEMBERS_PER_MEMBERGROUP	\
+	(MULTIXACT_FLAGBYTES_PER_GROUP * MXACT_MEMBER_FLAGS_PER_BYTE)
+/* size in bytes of a complete group */
+#define MULTIXACT_MEMBERGROUP_SIZE \
+	(sizeof(TransactionId) * MULTIXACT_MEMBERS_PER_MEMBERGROUP + MULTIXACT_FLAGBYTES_PER_GROUP)
+#define MULTIXACT_MEMBERGROUPS_PER_PAGE (BLCKSZ / MULTIXACT_MEMBERGROUP_SIZE)
+#define MULTIXACT_MEMBERS_PER_PAGE	\
+	(MULTIXACT_MEMBERGROUPS_PER_PAGE * MULTIXACT_MEMBERS_PER_MEMBERGROUP)
+
+/* page in which a member is to be found */
+static inline int64
+MXOffsetToMemberPage(OldMultiXactOffset offset)
+{
+	return offset / MULTIXACT_MEMBERS_PER_PAGE;
+}
+
+/* Location (byte offset within page) of flag word for a given member */
+static inline int
+MXOffsetToFlagsOffset(OldMultiXactOffset offset)
+{
+	OldMultiXactOffset group = offset / MULTIXACT_MEMBERS_PER_MEMBERGROUP;
+	int			grouponpg = group % MULTIXACT_MEMBERGROUPS_PER_PAGE;
+	int			byteoff = grouponpg * MULTIXACT_MEMBERGROUP_SIZE;
+
+	return byteoff;
+}
+
+/* Location (byte offset within page) of TransactionId of given member */
+static inline int
+MXOffsetToMemberOffset(OldMultiXactOffset offset)
+{
+	int			member_in_group = offset % MULTIXACT_MEMBERS_PER_MEMBERGROUP;
+
+	return MXOffsetToFlagsOffset(offset) +
+		MULTIXACT_FLAGBYTES_PER_GROUP +
+		member_in_group * sizeof(TransactionId);
+}
+
+static inline int
+MXOffsetToFlagsBitShift(OldMultiXactOffset offset)
+{
+	int			member_in_group = offset % MULTIXACT_MEMBERS_PER_MEMBERGROUP;
+	int			bshift = member_in_group * MXACT_MEMBER_BITS_PER_XACT;
+
+	return bshift;
+}
+
+/*
+ * Construct reader of old multixacts.
+ *
+ * Returns the malloced memory used by all the other calls in this module.
+ */
+OldMultiXactReader *
+AllocOldMultiXactRead(char *pgdata, MultiXactId nextMulti,
+					  OldMultiXactOffset nextOffset)
+{
+	OldMultiXactReader *state = pg_malloc(sizeof(*state));
+	char		dir[MAXPGPATH] = {0};
+
+	state->nextMXact = nextMulti;
+	state->nextOffset = nextOffset;
+
+	pg_sprintf(dir, "%s/pg_multixact/offsets", pgdata);
+	state->offset = AllocSlruRead(dir);
+
+	pg_sprintf(dir, "%s/pg_multixact/members", pgdata);
+	state->members = AllocSlruRead(dir);
+
+	return state;
+}
+
+/*
+ * This is a simplified version of the GetMultiXactIdMembers() server function.
+ *
+ * - Only return the updating member, if any. Upgrade only cares about the
+ *   updaters. If there is no updating member, return the first locking-only
+ *   member. We don't have any way to represent "no members", but we also don't
+ *   need to preserve all the locking members.
+ *
+ * - We don't need to worry about locking and some corner cases because there's
+ *   no concurrent activity.
+ */
+void
+GetOldMultiXactIdSingleMember(OldMultiXactReader *state, MultiXactId multi,
+							  TransactionId *result, MultiXactStatus *status)
+{
+	MultiXactId nextMXact,
+				tmpMXact;
+	OldMultiXactOffset nextOffset;
+	int64		pageno,
+				prev_pageno;
+	int			entryno,
+				length;
+	char	   *buf;
+	OldMultiXactOffset *offptr,
+				offset;
+	TransactionId result_xid = InvalidTransactionId;
+	bool		result_isupdate = false;
+
+	nextMXact = state->nextMXact;
+	nextOffset = state->nextOffset;
+
+	/*
+	 * See GetMultiXactIdMembers in multixact.c
+	 *
+	 * Find out the offset at which we need to start reading MultiXactMembers
+	 * and the number of members in the multixact.  We determine the latter as
+	 * the difference between this multixact's starting offset and the next
+	 * one's.  However, there are some corner cases to worry about:
+	 *
+	 * 1. This multixact may be the latest one created, in which case there is
+	 * no next one to look at.  In this case the nextOffset value we just
+	 * saved is the correct endpoint.
+	 *
+	 * 2. The next multixact may still be in process of being filled in...
+	 * This cannot happen during upgrade.
+	 *
+	 * 3. Because GetNewMultiXactId increments offset zero to offset one to
+	 * handle case #2, there is an ambiguity near the point of offset
+	 * wraparound.  If we see next multixact's offset is one, is that our
+	 * multixact's actual endpoint, or did it end at zero with a subsequent
+	 * increment?  We handle this using the knowledge that if the zero'th
+	 * member slot wasn't filled, it'll contain zero, and zero isn't a valid
+	 * transaction ID so it can't be a multixact member.  Therefore, if we
+	 * read a zero from the members array, just ignore it.
+	 */
+
+	pageno = MultiXactIdToOffsetPage(multi);
+	entryno = MultiXactIdToOffsetEntry(multi);
+
+	buf = SlruReadSwitchPage(state->offset, pageno);
+	offptr = (OldMultiXactOffset *) buf;
+	offptr += entryno;
+	offset = *offptr;
+
+	Assert(offset != 0);
+
+	/*
+	 * Use the same increment rule as GetNewMultiXactId(), that is, don't
+	 * handle wraparound explicitly until needed.
+	 */
+	tmpMXact = multi + 1;
+
+	if (nextMXact == tmpMXact)
+	{
+		/* Corner case 1: there is no next multixact */
+		length = nextOffset - offset;
+	}
+	else
+	{
+		OldMultiXactOffset nextMXOffset;
+
+		/* handle wraparound if needed */
+		if (tmpMXact < FirstMultiXactId)
+			tmpMXact = FirstMultiXactId;
+
+		prev_pageno = pageno;
+
+		pageno = MultiXactIdToOffsetPage(tmpMXact);
+		entryno = MultiXactIdToOffsetEntry(tmpMXact);
+
+		if (pageno != prev_pageno)
+			buf = SlruReadSwitchPage(state->offset, pageno);
+
+		offptr = (OldMultiXactOffset *) buf;
+		offptr += entryno;
+		nextMXOffset = *offptr;
+
+		/*
+		 * Corner case 2: next multixact is still being filled in, this must
+		 * not happen during upgrade.
+		 */
+		Assert(nextMXOffset != 0);
+
+		length = nextMXOffset - offset;
+	}
+
+	prev_pageno = -1;
+	for (int i = 0; i < length; i++, offset++)
+	{
+		TransactionId *xactptr;
+		uint32	   *flagsptr;
+		int			flagsoff;
+		int			bshift;
+		int			memberoff;
+		MultiXactStatus st;
+
+		pageno = MXOffsetToMemberPage(offset);
+		memberoff = MXOffsetToMemberOffset(offset);
+
+		if (pageno != prev_pageno)
+		{
+			buf = SlruReadSwitchPage(state->members, pageno);
+			prev_pageno = pageno;
+		}
+
+		xactptr = (TransactionId *) (buf + memberoff);
+		if (!TransactionIdIsValid(*xactptr))
+		{
+			/* Corner case 3: we must be looking at unused slot zero */
+			Assert(offset == 0);
+			continue;
+		}
+
+		flagsoff = MXOffsetToFlagsOffset(offset);
+		bshift = MXOffsetToFlagsBitShift(offset);
+		flagsptr = (uint32 *) (buf + flagsoff);
+
+		st = (*flagsptr >> bshift) & MXACT_MEMBER_XACT_BITMASK;
+
+		/* Verify that there is a single update Xid among the given members. */
+		if (ISUPDATE_from_mxstatus(st))
+		{
+			if (result_isupdate)
+				pg_fatal("multixact %u has more than one updating member",
+						 multi);
+			result_xid = *xactptr;
+			result_isupdate = true;
+		}
+		else if (!TransactionIdIsValid(result_xid))
+			result_xid = *xactptr;
+	}
+
+	/* A multixid with zero members should not happen */
+	Assert(TransactionIdIsValid(result_xid));
+
+	*result = result_xid;
+	*status = result_isupdate ? MultiXactStatusUpdate :
+		MultiXactStatusForKeyShare;
+}
+
+/*
+ * Frees the malloced reader.
+ */
+void
+FreeOldMultiXactReader(OldMultiXactReader *state)
+{
+	FreeSlruRead(state->offset);
+	FreeSlruRead(state->members);
+
+	pfree(state);
+}
diff --git a/src/bin/pg_upgrade/multixact_old.h b/src/bin/pg_upgrade/multixact_old.h
new file mode 100644
index 00000000000..8eb5af2ccaf
--- /dev/null
+++ b/src/bin/pg_upgrade/multixact_old.h
@@ -0,0 +1,29 @@
+/*
+ * multixact_old.h
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ * src/bin/pg_upgrade/multixact_old.h
+ */
+
+#include "access/multixact.h"
+#include "slru_io.h"
+
+typedef uint32 OldMultiXactOffset;
+
+typedef struct OldMultiXactReader
+{
+	MultiXactId nextMXact;
+	OldMultiXactOffset nextOffset;
+
+	SlruSegState *offset;
+	SlruSegState *members;
+} OldMultiXactReader;
+
+extern OldMultiXactReader *AllocOldMultiXactRead(char *pgdata,
+												 MultiXactId nextMulti,
+												 OldMultiXactOffset nextOffset);
+extern void GetOldMultiXactIdSingleMember(OldMultiXactReader *state,
+										  MultiXactId multi,
+										  TransactionId *result,
+										  MultiXactStatus *status);
+extern void FreeOldMultiXactReader(OldMultiXactReader *reader);
diff --git a/src/bin/pg_upgrade/pg_upgrade.c b/src/bin/pg_upgrade/pg_upgrade.c
index 490e98fa26f..0fdd05c127c 100644
--- a/src/bin/pg_upgrade/pg_upgrade.c
+++ b/src/bin/pg_upgrade/pg_upgrade.c
@@ -48,6 +48,8 @@
 #include "common/logging.h"
 #include "common/restricted_token.h"
 #include "fe_utils/string_utils.h"
+#include "multixact_old.h"
+#include "multixact_new.h"
 #include "pg_upgrade.h"
 
 /*
@@ -769,6 +771,81 @@ copy_subdir_files(const char *old_subdir, const char *new_subdir)
 	check_ok();
 }
 
+/*
+ * Convert pg_multixact/offsets and /members to new format with 64-bit offsets.
+ */
+static void
+convert_multixacts(MultiXactId *new_nxtmulti, MultiXactOffset *new_nxtmxoff)
+{
+	MultiXactId oldest_multi,
+				next_multi;
+	OldMultiXactReader *old_reader;
+	MultiXactWriter *new_writer;
+	MultiXactOffset next_offset;
+
+	/*
+	 * The range of valid multi XIDs is unchanged by the conversion (they are
+	 * referenced from the heap tables), but the members SLRU is rewritten to
+	 * start from offset 1.
+	 */
+	oldest_multi = old_cluster.controldata.chkpnt_oldstMulti;
+	next_multi = old_cluster.controldata.chkpnt_nxtmulti;
+	next_offset = 1;
+
+	old_reader = AllocOldMultiXactRead(old_cluster.pgdata,
+									   old_cluster.controldata.chkpnt_nxtmulti,
+									   old_cluster.controldata.chkpnt_nxtmxoff);
+	new_writer = AllocMultiXactWrite(new_cluster.pgdata,
+									 oldest_multi, next_offset);
+
+	/* handle wraparound */
+	if (next_multi < FirstMultiXactId)
+		next_multi = FirstMultiXactId;
+
+	/*
+	 * Read multixids from old files one by one, and write them back in the
+	 * new format.
+	 */
+	for (MultiXactId multi = oldest_multi; multi != next_multi;)
+	{
+		TransactionId xid;
+		MultiXactStatus status;
+		MultiXactMember member;
+
+		/*
+		 * Read the old multixid.  The locking-only XIDs that may be part of
+		 * multi-xids don't matter after upgrade, as there can be no
+		 * transactions running across upgrade.  So as a little optimization,
+		 * we only read one member from each multixid: the updating one,
+		 * or if there was no update, arbitrarily the first locking xid.
+		 */
+		GetOldMultiXactIdSingleMember(old_reader, multi, &xid, &status);
+
+		/* Write it out in new format */
+		member.xid = xid;
+		member.status = status;
+		RecordNewMultiXact(new_writer, next_offset, multi, 1, &member);
+
+		next_offset += 1;
+		multi++;
+		/* handle wraparound */
+		if (multi < FirstMultiXactId)
+			multi = FirstMultiXactId;
+	}
+
+	/*
+	 * Update the nextMXact/Offset values in the control file to match what we
+	 * wrote.  The nextMXact is unchanged, but nextOffset will be different.
+	 */
+	Assert(next_multi == old_cluster.controldata.chkpnt_nxtmulti);
+	*new_nxtmulti = next_multi;
+	*new_nxtmxoff = next_offset;
+
+	/* Release resources */
+	FreeMultiXactWrite(new_writer);
+	FreeOldMultiXactReader(old_reader);
+}
+
 static void
 copy_xact_xlog_xid(void)
 {
@@ -816,8 +893,29 @@ copy_xact_xlog_xid(void)
 	if (old_cluster.controldata.cat_ver >= MULTIXACT_FORMATCHANGE_CAT_VER &&
 		new_cluster.controldata.cat_ver >= MULTIXACT_FORMATCHANGE_CAT_VER)
 	{
-		copy_subdir_files("pg_multixact/offsets", "pg_multixact/offsets");
-		copy_subdir_files("pg_multixact/members", "pg_multixact/members");
+		MultiXactId new_nxtmulti = old_cluster.controldata.chkpnt_nxtmulti;
+		MultiXactOffset new_nxtmxoff = old_cluster.controldata.chkpnt_nxtmxoff;
+
+		/*
+		 * If the old server is before the
+		 * MULTIXACTOFFSET_FORMATCHANGE_CAT_VER it must have 32-bit multixid
+		 * offsets, thus it should be converted.
+		 */
+		if (old_cluster.controldata.cat_ver < MULTIXACTOFFSET_FORMATCHANGE_CAT_VER &&
+			new_cluster.controldata.cat_ver >= MULTIXACTOFFSET_FORMATCHANGE_CAT_VER)
+		{
+			remove_new_subdir("pg_multixact/members", false);
+			remove_new_subdir("pg_multixact/offsets", false);
+
+			prep_status("Converting pg_multixact/offsets to 64-bit");
+			convert_multixacts(&new_nxtmulti, &new_nxtmxoff);
+			check_ok();
+		}
+		else
+		{
+			copy_subdir_files("pg_multixact/offsets", "pg_multixact/offsets");
+			copy_subdir_files("pg_multixact/members", "pg_multixact/members");
+		}
 
 		prep_status("Setting next multixact ID and offset for new cluster");
 
@@ -826,10 +924,8 @@ copy_xact_xlog_xid(void)
 		 * counters here and the oldest multi present on system.
 		 */
 		exec_prog(UTILITY_LOG_FILE, NULL, true, true,
-				  "\"%s/pg_resetwal\" -O %u -m %u,%u \"%s\"",
-				  new_cluster.bindir,
-				  old_cluster.controldata.chkpnt_nxtmxoff,
-				  old_cluster.controldata.chkpnt_nxtmulti,
+				  "\"%s/pg_resetwal\" -O %" PRIu64 " -m %u,%u \"%s\"",
+				  new_cluster.bindir, new_nxtmxoff, new_nxtmulti,
 				  old_cluster.controldata.chkpnt_oldstMulti,
 				  new_cluster.pgdata);
 		check_ok();
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index e86336f4be9..127b2cb00fa 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -114,6 +114,11 @@ extern char *output_files[];
  */
 #define MULTIXACT_FORMATCHANGE_CAT_VER 201301231
 
+/*
+ * Switch from 32-bit to 64-bit for multixid offsets.
+ */
+#define MULTIXACTOFFSET_FORMATCHANGE_CAT_VER 999999999
+
 /*
  * large object chunk size added to pg_controldata,
  * commit 5f93c37805e7485488480916b4585e098d3cc883
diff --git a/src/bin/pg_upgrade/slru_io.c b/src/bin/pg_upgrade/slru_io.c
new file mode 100644
index 00000000000..2a0624ea8b8
--- /dev/null
+++ b/src/bin/pg_upgrade/slru_io.c
@@ -0,0 +1,239 @@
+/*
+ * slru_io.c
+ *
+ * Routines for reading and writing SLRU files during upgrade.
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ * src/bin/pg_upgrade/slru_io.c
+ */
+
+#include "postgres_fe.h"
+
+#include <fcntl.h>
+
+#include "common/fe_memutils.h"
+#include "common/file_perm.h"
+#include "common/file_utils.h"
+#include "port/pg_iovec.h"
+#include "pg_upgrade.h"
+#include "slru_io.h"
+
+/*
+ * State for reading or writing an SLRU, with a one page buffer.
+ */
+typedef struct SlruSegState
+{
+	bool		writing;
+	bool		long_segment_names;
+
+	char	   *dir;
+	char	   *fn;
+	int			fd;
+	int64		segno;
+	uint64		pageno;
+
+	PGAlignedBlock buf;
+} SlruSegState;
+
+static inline SlruSegState *
+AllocSlruSegState(char *dir)
+{
+	SlruSegState *state = pg_malloc(sizeof(*state));
+
+	state->segno = -1;
+	state->pageno = 0;
+	state->dir = pstrdup(dir);
+	state->fd = -1;
+	state->fn = NULL;
+
+	return state;
+}
+
+static inline void
+SlruFlush(SlruSegState *state)
+{
+	struct iovec iovec = {
+		.iov_base = &state->buf,
+		.iov_len = BLCKSZ,
+	};
+	off_t		offset;
+
+	if (state->segno == -1)
+		return;
+
+	offset = (state->pageno % SLRU_PAGES_PER_SEGMENT) * BLCKSZ;
+
+	if (pg_pwritev_with_retry(state->fd, &iovec, 1, offset) < 0)
+		pg_fatal("could not write file \"%s\": %m", state->fn);
+}
+
+/*
+ * Create slru reader for dir.
+ *
+ * Returns the malloced memory used by all the other read calls in this module.
+ */
+SlruSegState *
+AllocSlruRead(char *dir)
+{
+	SlruSegState *state = AllocSlruSegState(dir);
+
+	state->writing = false;
+
+	return state;
+}
+
+/*
+ * Open given page for reading.
+ *
+ * Reading can be done in random order.
+ */
+char *
+SlruReadSwitchPage(SlruSegState *state, uint64 pageno)
+{
+	int64		segno;
+
+	Assert(!state->writing);	/* read only mode */
+
+	if (state->segno != -1 && pageno == state->pageno)
+		return state->buf.data;
+
+	segno = pageno / SLRU_PAGES_PER_SEGMENT;
+	if (segno != state->segno)
+	{
+		if (state->segno != -1)
+		{
+			close(state->fd);
+			state->fd = -1;
+
+			pg_free(state->fn);
+			state->fn = NULL;
+		}
+
+		/* Open new segment */
+		state->fn = psprintf("%s/%04X", state->dir, (unsigned int) segno);
+		if ((state->fd = open(state->fn, O_RDONLY | PG_BINARY, 0)) < 0)
+			pg_fatal("could not open file \"%s\": %m", state->fn);
+	}
+
+	state->segno = segno;
+
+	{
+		struct iovec iovec = {
+			.iov_base = &state->buf,
+			.iov_len = BLCKSZ,
+		};
+		off_t		offset = (pageno % SLRU_PAGES_PER_SEGMENT) * BLCKSZ;
+
+		if (pg_preadv(state->fd, &iovec, 1, offset) < 0)
+			pg_fatal("could not read file \"%s\": %m", state->fn);
+
+		state->pageno = pageno;
+	}
+
+	return state->buf.data;
+}
+
+/*
+ * Frees the malloced reader.
+ */
+void
+FreeSlruRead(SlruSegState *state)
+{
+	Assert(!state->writing);	/* read only mode */
+
+	close(state->fd);
+	pg_free(state);
+}
+
+/*
+ * Open the given page for writing.
+ *
+ * NOTE: This uses O_EXCL when stepping to a new segment, so this assumes that
+ * each segment is written in full before moving on to next one.  This
+ * limitation would be easy to lift if needed, but it fits the usage pattern of
+ * current callers.
+ */
+char *
+SlruWriteSwitchPage(SlruSegState *state, uint64 pageno)
+{
+	int64		segno;
+	off_t		offset;
+
+	if (state->segno != -1 && pageno == state->pageno)
+		return state->buf.data;
+
+	segno = pageno / SLRU_PAGES_PER_SEGMENT;
+	offset = (pageno % SLRU_PAGES_PER_SEGMENT) * BLCKSZ;
+
+	SlruFlush(state);
+	memset(state->buf.data, 0, BLCKSZ);
+
+	if (segno != state->segno)
+	{
+		if (state->segno != -1)
+		{
+			close(state->fd);
+			state->fd = -1;
+
+			pg_free(state->fn);
+			state->fn = NULL;
+		}
+
+		/* Create the segment */
+		if (state->long_segment_names)
+		{
+			Assert(segno >= 0 && segno <= INT64CONST(0xFFFFFFFFFFFFFFF));
+			state->fn = psprintf("%s/%015" PRIX64, state->dir, segno);
+		}
+		else
+		{
+			Assert(segno >= 0 && segno <= INT64CONST(0xFFFFFF));
+			state->fn = psprintf("%s/%04X", state->dir, (unsigned int) segno);
+		}
+
+		if ((state->fd = open(state->fn, O_RDWR | O_CREAT | O_EXCL | PG_BINARY,
+							  pg_file_create_mode)) < 0)
+		{
+			pg_fatal("could not create file \"%s\": %m", state->fn);
+		}
+
+		state->segno = segno;
+
+		if (offset > 0 && pg_pwrite_zeros(state->fd, offset, 0) < 0)
+			pg_fatal("could not write file \"%s\": %m", state->fn);
+	}
+
+	state->pageno = pageno;
+
+	return state->buf.data;
+}
+
+/*
+ * Create slru writer for dir.
+ *
+ * Returns the malloced memory used by all the other write calls in this module.
+ */
+SlruSegState *
+AllocSlruWrite(char *dir, bool long_segment_names)
+{
+	SlruSegState *state = AllocSlruSegState(dir);
+
+	state->writing = true;
+	state->long_segment_names = long_segment_names;
+
+	return state;
+}
+
+/*
+ * Frees the malloced writer.
+ */
+void
+FreeSlruWrite(SlruSegState *state)
+{
+	Assert(state->writing);
+
+	SlruFlush(state);
+
+	close(state->fd);
+	pg_free(state);
+}
diff --git a/src/bin/pg_upgrade/slru_io.h b/src/bin/pg_upgrade/slru_io.h
new file mode 100644
index 00000000000..295fd0bebc4
--- /dev/null
+++ b/src/bin/pg_upgrade/slru_io.h
@@ -0,0 +1,23 @@
+/*
+ * slru_io.h
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ * src/bin/pg_upgrade/slru_io.h
+ */
+
+/*
+ * See access/slru.h
+ *
+ * Copied here, since slru.h cannot be included in frontend code.
+ */
+#define SLRU_PAGES_PER_SEGMENT 32
+
+typedef struct SlruSegState SlruSegState;
+
+extern SlruSegState *AllocSlruRead(char *dir);
+extern char *SlruReadSwitchPage(SlruSegState *state, uint64 pageno);
+extern void FreeSlruRead(SlruSegState *state);
+
+extern SlruSegState *AllocSlruWrite(char *dir, bool long_segment_names);
+extern char *SlruWriteSwitchPage(SlruSegState *state, uint64 pageno);
+extern void FreeSlruWrite(SlruSegState *state);
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 432509277c9..9392bb729b9 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1725,6 +1725,7 @@ MultiXactMember
 MultiXactOffset
 MultiXactStateData
 MultiXactStatus
+MultiXactWriter
 MultirangeIOData
 MultirangeParseState
 MultirangeType
@@ -1808,6 +1809,7 @@ OffsetVarNodes_context
 Oid
 OidOptions
 OkeysState
+OldMultiXactReader
 OldToNewMapping
 OldToNewMappingData
 OnCommitAction
@@ -2804,6 +2806,7 @@ SlruCtlData
 SlruErrorCause
 SlruPageStatus
 SlruScanCallback
+SlruSegState
 SlruShared
 SlruSharedData
 SlruWriteAll
-- 
2.51.0
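
The pre-v19 member layout that multixact_old.c reads (four flag bytes followed by four XIDs per 20-byte group) and the segment naming used in slru_io.c can be illustrated with a small standalone sketch. This is not code from the patch: the names MemberLocation, locate_old_member and segment_file_name are hypothetical, an 8kB BLCKSZ is assumed, and the arithmetic only mirrors the helper functions shown in the diff above.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Assumption: default 8kB block size; a real build takes this from configure. */
#define BLCKSZ 8192
#define SLRU_PAGES_PER_SEGMENT 32

/* One flag byte per member, four members per group, as in the old format. */
#define FLAGBYTES_PER_GROUP 4
#define MEMBERS_PER_GROUP 4
#define GROUP_SIZE (MEMBERS_PER_GROUP * sizeof(uint32_t) + FLAGBYTES_PER_GROUP)
#define GROUPS_PER_PAGE (BLCKSZ / GROUP_SIZE)
#define MEMBERS_PER_PAGE (GROUPS_PER_PAGE * MEMBERS_PER_GROUP)

/* Hypothetical helper type: where one member lives inside the members SLRU. */
typedef struct MemberLocation
{
	int64_t		pageno;		/* SLRU page holding the member */
	int			flagsoff;	/* byte offset of the group's flag word */
	int			memberoff;	/* byte offset of the member's XID */
	int			bshift;		/* bit shift of the member's flags in the word */
} MemberLocation;

/* Map a 32-bit (old format) member offset to its on-page location. */
static MemberLocation
locate_old_member(uint32_t offset)
{
	MemberLocation loc;
	uint32_t	group = offset / MEMBERS_PER_GROUP;
	int			in_group = (int) (offset % MEMBERS_PER_GROUP);

	loc.pageno = (int64_t) (offset / MEMBERS_PER_PAGE);
	loc.flagsoff = (int) ((group % GROUPS_PER_PAGE) * GROUP_SIZE);
	loc.memberoff = loc.flagsoff + FLAGBYTES_PER_GROUP +
		in_group * (int) sizeof(uint32_t);
	loc.bshift = in_group * 8;
	return loc;
}

/*
 * Build the segment file name for a page: four hex digits for short names,
 * fifteen for long names, following the format strings used in slru_io.c.
 */
static void
segment_file_name(char *dst, size_t dstlen, int64_t pageno, int long_names)
{
	int64_t		segno = pageno / SLRU_PAGES_PER_SEGMENT;

	assert(segno >= 0);
	if (long_names)
		snprintf(dst, dstlen, "%015llX", (unsigned long long) segno);
	else
		snprintf(dst, dstlen, "%04X", (unsigned int) segno);
}
```

With these assumptions a page holds 409 groups (1636 members) and 12 bytes are left unused, which matches the comment in multixact_old.c. Calling locate_old_member(5) lands in the second group of page 0, and segment_file_name() for page 33 yields segment 1 in either naming scheme.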

Attachment: v23-0003-Remove-oldestOffset-oldestOffsetKnown-from-multi.patch (application/octet-stream)
From 32813e65225d8832673481de8a280cea24f0bfcf Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Thu, 6 Nov 2025 16:20:18 +0300
Subject: [PATCH v23 3/7] Remove oldestOffset/oldestOffsetKnown from multixact

Since we rewrite all multitransactions during pg_upgrade, the oldest
offset for a new cluster will no longer be missing on disc.
---
 src/backend/access/transam/multixact.c | 101 ++-----------------------
 src/include/access/multixact.h         |   3 -
 2 files changed, 5 insertions(+), 99 deletions(-)

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 9c365689080..26f8a10c377 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -145,14 +145,6 @@ typedef struct MultiXactStateData
 	MultiXactId oldestMultiXactId;
 	Oid			oldestMultiXactDB;
 
-	/*
-	 * Oldest multixact offset that is potentially referenced by a multixact
-	 * referenced by a relation.  We don't always know this value, so there's
-	 * a flag here to indicate whether or not we currently do.
-	 */
-	MultiXactOffset oldestOffset;
-	bool		oldestOffsetKnown;
-
 	/* support for anti-wraparound measures */
 	MultiXactId multiVacLimit;
 	MultiXactId multiWarnLimit;
@@ -2376,10 +2368,7 @@ SetOffsetVacuumLimit(bool is_startup)
 	MultiXactId oldestMultiXactId;
 	MultiXactId nextMXact;
 	MultiXactOffset oldestOffset = 0;	/* placate compiler */
-	MultiXactOffset prevOldestOffset;
 	MultiXactOffset nextOffset;
-	bool		oldestOffsetKnown = false;
-	bool		prevOldestOffsetKnown;
 
 	/*
 	 * NB: Have to prevent concurrent truncation, we might otherwise try to
@@ -2392,8 +2381,6 @@ SetOffsetVacuumLimit(bool is_startup)
 	oldestMultiXactId = MultiXactState->oldestMultiXactId;
 	nextMXact = MultiXactState->nextMXact;
 	nextOffset = MultiXactState->nextOffset;
-	prevOldestOffsetKnown = MultiXactState->oldestOffsetKnown;
-	prevOldestOffset = MultiXactState->oldestOffset;
 	Assert(MultiXactState->finishedStartup);
 	LWLockRelease(MultiXactGenLock);
 
@@ -2411,57 +2398,20 @@ SetOffsetVacuumLimit(bool is_startup)
 		 * offset.
 		 */
 		oldestOffset = nextOffset;
-		oldestOffsetKnown = true;
 	}
-	else
+	else if (!find_multixact_start(oldestMultiXactId, &oldestOffset))
 	{
-		/*
-		 * Figure out where the oldest existing multixact's offsets are
-		 * stored. Due to bugs in early release of PostgreSQL 9.3.X and 9.4.X,
-		 * the supposedly-earliest multixact might not really exist.  We are
-		 * careful not to fail in that case.
-		 */
-		oldestOffsetKnown =
-			find_multixact_start(oldestMultiXactId, &oldestOffset);
-
-		if (!oldestOffsetKnown)
-			ereport(LOG,
-					(errmsg("oldest checkpointed MultiXact %u does not exist on disk",
-							oldestMultiXactId)));
+		ereport(LOG,
+				(errmsg("oldest checkpointed MultiXact %u does not exist on disk",
+						oldestMultiXactId)));
 	}
 
 	LWLockRelease(MultiXactTruncationLock);
 
-	/*
-	 * If we can, compute limits (and install them MultiXactState) to prevent
-	 * overrun of old data in the members SLRU area. We can only do so if the
-	 * oldest offset is known though.
-	 *
-	 * FIXME: Is !oldestOffsetKnown possible anymore? At least update the comment:
-	 * we won't overrun members anymore.
-	 */
-	if (prevOldestOffsetKnown)
-	{
-		/*
-		 * If we failed to get the oldest offset this time, but we have a
-		 * value from a previous pass through this function, use the old
-		 * values rather than automatically forcing an autovacuum cycle again.
-		 */
-		oldestOffset = prevOldestOffset;
-		oldestOffsetKnown = true;
-	}
-
-	/* Install the computed values */
-	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
-	MultiXactState->oldestOffset = oldestOffset;
-	MultiXactState->oldestOffsetKnown = oldestOffsetKnown;
-	LWLockRelease(MultiXactGenLock);
-
 	/*
 	 * Do we need autovacuum?	If we're not sure, assume yes.
 	 */
-	return !oldestOffsetKnown ||
-		(nextOffset - oldestOffset > MULTIXACT_MEMBER_AUTOVAC_THRESHOLD);
+	return nextOffset - oldestOffset > MULTIXACT_MEMBER_AUTOVAC_THRESHOLD;
 }
 
 /*
@@ -2508,47 +2458,6 @@ find_multixact_start(MultiXactId multi, MultiXactOffset *result)
 	return true;
 }
 
-/*
- * GetMultiXactInfo
- *
- * Returns information about the current MultiXact state, as of:
- * multixacts: Number of MultiXacts (nextMultiXactId - oldestMultiXactId)
- * members: Number of member entries (nextOffset - oldestOffset)
- * oldestMultiXactId: Oldest MultiXact ID still in use
- * oldestOffset: Oldest offset still in use
- *
- * Returns false if unable to determine, the oldest offset being unknown.
- */
-bool
-GetMultiXactInfo(uint32 *multixacts, MultiXactOffset *members,
-				 MultiXactId *oldestMultiXactId, MultiXactOffset *oldestOffset)
-{
-	MultiXactOffset nextOffset;
-	MultiXactId nextMultiXactId;
-	bool		oldestOffsetKnown;
-
-	LWLockAcquire(MultiXactGenLock, LW_SHARED);
-	nextOffset = MultiXactState->nextOffset;
-	*oldestMultiXactId = MultiXactState->oldestMultiXactId;
-	nextMultiXactId = MultiXactState->nextMXact;
-	*oldestOffset = MultiXactState->oldestOffset;
-	oldestOffsetKnown = MultiXactState->oldestOffsetKnown;
-	LWLockRelease(MultiXactGenLock);
-
-	if (!oldestOffsetKnown)
-	{
-		*members = 0;
-		*multixacts = 0;
-		*oldestMultiXactId = InvalidMultiXactId;
-		*oldestOffset = 0;
-		return false;
-	}
-
-	*members = nextOffset - *oldestOffset;
-	*multixacts = nextMultiXactId - *oldestMultiXactId;
-	return true;
-}
-
 typedef struct mxtruncinfo
 {
 	int64		earliestExistingPage;
diff --git a/src/include/access/multixact.h b/src/include/access/multixact.h
index 7d98fe0fe32..d688b547c54 100644
--- a/src/include/access/multixact.h
+++ b/src/include/access/multixact.h
@@ -109,9 +109,6 @@ extern bool MultiXactIdIsRunning(MultiXactId multi, bool isLockOnly);
 extern void MultiXactIdSetOldestMember(void);
 extern int	GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
 								  bool from_pgupgrade, bool isLockOnly);
-extern bool GetMultiXactInfo(uint32 *multixacts, MultiXactOffset *members,
-							 MultiXactId *oldestMultiXactId,
-							 MultiXactOffset *oldestOffset);
 extern bool MultiXactIdPrecedes(MultiXactId multi1, MultiXactId multi2);
 extern bool MultiXactIdPrecedesOrEquals(MultiXactId multi1,
 										MultiXactId multi2);
-- 
2.51.0
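
The simplified return value in SetOffsetVacuumLimit() relies on a plain subtraction of member offsets. A minimal sketch of why that is simpler with 64-bit offsets, under stated assumptions: the function names below are hypothetical, the threshold is passed as a parameter because the value of MULTIXACT_MEMBER_AUTOVAC_THRESHOLD is not shown in this patch, and 64-bit offsets are assumed never to wrap in practice.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Old format: offsets are uint32 and may wrap, so the distance between the
 * oldest and the next offset is only meaningful modulo 2^32.
 */
static uint32_t
member_span32(uint32_t next, uint32_t oldest)
{
	return next - oldest;		/* unsigned arithmetic wraps by definition */
}

/*
 * New format: with 64-bit offsets the next offset is assumed to be at or
 * after the oldest one, so the distance is an ordinary difference.
 */
static uint64_t
member_span64(uint64_t next, uint64_t oldest)
{
	assert(next >= oldest);
	return next - oldest;
}

/* Hypothetical stand-in for the autovacuum decision made by the patch. */
static bool
needs_member_autovacuum(uint64_t next, uint64_t oldest, uint64_t threshold)
{
	return member_span64(next, oldest) > threshold;
}
```

For example, a 32-bit counter that wrapped from 0xFFFFFFF0 to 5 and a 64-bit counter that advanced from 0xFFFFFFF0 to 0x100000005 both cover the same number of members, but only the 32-bit case depends on modular arithmetic to get there.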

Attachment: v23-0004-Add-test-for-64-bit-mxoff-in-pg_resetwal.patch (application/octet-stream)
From 65ab6ad948b34740d7145c506324fba536fe63a2 Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Tue, 28 Oct 2025 19:08:26 +0300
Subject: [PATCH v23 4/7] Add test for 64-bit mxoff in pg_resetwal

---
 src/bin/pg_resetwal/meson.build    |   1 +
 src/bin/pg_resetwal/t/003_mxoff.pl | 170 +++++++++++++++++++++++++++++
 2 files changed, 171 insertions(+)
 create mode 100644 src/bin/pg_resetwal/t/003_mxoff.pl

diff --git a/src/bin/pg_resetwal/meson.build b/src/bin/pg_resetwal/meson.build
index 290832b2299..1e2dfb38a5b 100644
--- a/src/bin/pg_resetwal/meson.build
+++ b/src/bin/pg_resetwal/meson.build
@@ -25,6 +25,7 @@ tests += {
     'tests': [
       't/001_basic.pl',
       't/002_corrupted.pl',
+      't/003_mxoff.pl',
     ],
   },
 }
diff --git a/src/bin/pg_resetwal/t/003_mxoff.pl b/src/bin/pg_resetwal/t/003_mxoff.pl
new file mode 100644
index 00000000000..3c1b7fa1d33
--- /dev/null
+++ b/src/bin/pg_resetwal/t/003_mxoff.pl
@@ -0,0 +1,170 @@
+
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+use strict;
+use warnings FATAL => 'all';
+
+use Math::BigInt;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+sub mxact_eater
+{
+	my $node = shift;
+	my $tbl = shift;
+
+	$node->start;
+	$node->safe_psql('postgres',
+		"CREATE TABLE ${tbl} (I INT PRIMARY KEY, N_UPDATED INT) " .
+		"       WITH (AUTOVACUUM_ENABLED=FALSE);" .
+		"INSERT INTO ${tbl} SELECT G, 0 FROM GENERATE_SERIES(1, 50) G;");
+
+	# consume around 10k multixact-offsets
+	my $nclients = 10;
+	my $update_every = 75;
+	my @connections = ();
+
+	for (1 .. $nclients)
+	{
+		my $conn = $node->background_psql('postgres');
+		$conn->query_safe("BEGIN");
+
+		push(@connections, $conn);
+	}
+
+	for (my $i = 0; $i < 1000; $i++)
+	{
+		my $conn = $connections[$i % $nclients];
+
+		$conn->query_safe("COMMIT;");
+		$conn->query_safe("BEGIN");
+
+		if ($i % $update_every == 0)
+		{
+			$conn->query_safe(
+				"UPDATE ${tbl} SET " .
+				"N_UPDATED = N_UPDATED + 1 " .
+				"WHERE I = ${i} % 50");
+		}
+		else
+		{
+			$conn->query_safe(
+				"SELECT * FROM ${tbl} FOR KEY SHARE");
+		}
+	}
+
+	for my $conn (@connections)
+	{
+		$conn->quit();
+	}
+
+	$node->stop;
+}
+
+sub next_mxoff
+{
+	my $node = shift;
+	my ($stdout, $stderr) =
+	  run_command([ 'pg_controldata', $node->data_dir ]);
+	my @control_data = split("\n", $stdout);
+	my $next_mxoff = undef;
+
+	foreach (@control_data)
+	{
+		if ($_ =~ /^Latest checkpoint's NextMultiOffset:\s*(.*)$/mg)
+		{
+			$next_mxoff = $1;
+			last;
+		}
+	}
+	die "NextMultiOffset not found in control file\n"
+		unless defined($next_mxoff);
+
+	return $next_mxoff;
+}
+
+sub reset_mxoff
+{
+	my $node = shift;
+	my $offset = shift;
+		$offset = Math::BigInt->new($offset);
+
+	# Get block size
+	my $out = (run_command([ 'pg_resetwal', '--dry-run', $node->data_dir ]))[0];
+		$out =~ /^Database block size: *(\d+)$/m or die;
+	my $blcksz = $1;
+
+	# Reset to new offset
+	my @cmd = ('pg_resetwal', '--pgdata' => $node->data_dir);
+	push @cmd, '--multixact-offset' => $offset->as_hex();
+	command_ok(\@cmd, 'set oldest multixact-offset');
+
+	# Fill empty pg_multixact/members segment
+	my $mult = 32 * int($blcksz / 20) * 4;
+	my $segname = sprintf "%015X", $offset / $mult;
+
+	my @dd = ('dd');
+	push @dd, "if=/dev/zero";
+	push @dd, "of=" . $node->data_dir . "/pg_multixact/members/" . $segname;
+	push @dd, "bs=$blcksz";
+	push @dd, "count=32";
+	command_ok(\@dd, 'fill empty multixact-members');
+}
+
+my ($off1, $off2);
+
+# start from defaults
+my $node1 = PostgreSQL::Test::Cluster->new('node1');
+$node1->init;
+$off1 = next_mxoff($node1);
+mxact_eater($node1, "FOO");
+$off2 = next_mxoff($node1);
+note "> start from $off1, finished at $off2\n";
+
+# start from before 32-bit wraparound
+my $node2 = PostgreSQL::Test::Cluster->new('node2');
+$node2->init;
+reset_mxoff($node2, 0xFFFF0000);
+$off1 = next_mxoff($node2);
+mxact_eater($node2, "FOO");
+$off2 = next_mxoff($node2);
+note "> start from $off1, finished at $off2\n";
+
+# start near 32-bit wraparound
+my $node3 = PostgreSQL::Test::Cluster->new('node3');
+$node3->init;
+reset_mxoff($node3, 0xFFFFEC77);
+$off1 = next_mxoff($node3);
+mxact_eater($node3, "FOO");
+$off2 = next_mxoff($node3);
+note "> start from $off1, finished at $off2\n";
+
+# start over 32-bit wraparound
+my $node4 = PostgreSQL::Test::Cluster->new('node4');
+$node4->init;
+reset_mxoff($node4, '0xFFFFFFFF0000');
+$off1 = next_mxoff($node4);
+mxact_eater($node4, "FOO");
+$off2 = next_mxoff($node4);
+note "> start from $off1, finished at $off2\n";
+
+# check invariant
+$node1->start;
+$node2->start;
+$node3->start;
+$node4->start;
+
+my $var1 = $node1->safe_psql('postgres', 'TABLE FOO');
+my $var2 = $node2->safe_psql('postgres', 'TABLE FOO');
+my $var3 = $node3->safe_psql('postgres', 'TABLE FOO');
+my $var4 = $node4->safe_psql('postgres', 'TABLE FOO');
+ok($var1 eq $var2 && $var2 eq $var3 && $var3 eq $var4,
+	'check table invariant in all nodes');
+
+$node4->stop;
+$node3->stop;
+$node2->stop;
+$node1->stop;
+
+done_testing();
-- 
2.51.0

v23-0007-TEST-Add-test-for-64-bit-mxoff-in-pg_upgrade.patch.txt
From 2670eee7a1b4ee7826976eba9adc5d734f655ff4 Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Wed, 29 Oct 2025 14:19:56 +0300
Subject: [PATCH v23 7/7] TEST: Add test for 64-bit mxoff in pg_upgrade

---
 src/bin/pg_upgrade/t/008_mxoff.pl | 463 ++++++++++++++++++++++++++++++
 1 file changed, 463 insertions(+)
 create mode 100644 src/bin/pg_upgrade/t/008_mxoff.pl

diff --git a/src/bin/pg_upgrade/t/008_mxoff.pl b/src/bin/pg_upgrade/t/008_mxoff.pl
new file mode 100644
index 00000000000..7204325f873
--- /dev/null
+++ b/src/bin/pg_upgrade/t/008_mxoff.pl
@@ -0,0 +1,463 @@
+
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+use strict;
+use warnings FATAL => 'all';
+
+use Math::BigInt;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use PostgreSQL::Test::AdjustDump;
+use PostgreSQL::Test::AdjustUpgrade;
+use Test::More;
+
+# This test exercises different multixact states, in a similar way to
+# 002_pg_upgrade.pl.
+
+unless (defined($ENV{oldinstall}))
+{
+	plan skip_all =>
+		'to run test set oldinstall environment variable to the pre 64-bit mxoff cluster';
+}
+
+# Temp dir for dumps.
+my $tempdir = PostgreSQL::Test::Utils::tempdir;
+
+# Can be changed to test the other modes.
+my $mode = $ENV{PG_TEST_PG_UPGRADE_MODE} || '--copy';
+
+sub utility_path
+{
+	my $node = shift;
+	my $name = shift;
+
+	my $bin_path = defined($node->install_path) ?
+		$node->install_path . "/bin/$name" : $name;
+
+	return $bin_path;
+}
+
+# Get NextMultiOffset.
+sub next_mxoff
+{
+	my $node = shift;
+
+	my $pg_controldata_path = utility_path($node, 'pg_controldata');
+	my ($stdout, $stderr) = run_command([ $pg_controldata_path,
+											$node->data_dir ]);
+	my @control_data = split("\n", $stdout);
+	my $next_mxoff = undef;
+
+	foreach (@control_data)
+	{
+		if ($_ =~ /^Latest checkpoint's NextMultiOffset:\s*(.*)$/mg)
+		{
+			$next_mxoff = $1;
+			last;
+		}
+	}
+	die "NextMultiOffset not found in control file\n"
+		unless defined($next_mxoff);
+
+	return $next_mxoff;
+}
+
+# Consume around 10k of mxoffsets.
+sub mxact_eater
+{
+	my $node = shift;
+	my $tbl = 'FOO';
+
+	my ($mxoff1, $mxoff2);
+
+	$mxoff1 = next_mxoff($node);
+	$node->start;
+	$node->safe_psql('postgres',
+		"CREATE TABLE ${tbl} (I INT PRIMARY KEY, N_UPDATED INT) " .
+		"       WITH (AUTOVACUUM_ENABLED=FALSE);" .
+		"INSERT INTO ${tbl} SELECT G, 0 FROM GENERATE_SERIES(1, 50) G;");
+
+	# consume around 10k mxoff
+	my $nclients = 10;
+	my $update_every = 75;
+	my @connections = ();
+
+	for (0..$nclients)
+	{
+		my $conn = $node->background_psql('postgres');
+		$conn->query_safe("BEGIN");
+
+		push(@connections, $conn);
+	}
+
+	for (my $i = 0; $i < 1000; $i++)
+	{
+		my $conn = $connections[$i % $nclients];
+
+		$conn->query_safe("COMMIT;");
+		$conn->query_safe("BEGIN");
+
+		if ($i % $update_every == 0)
+		{
+			$conn->query_safe(
+				"UPDATE ${tbl} SET " .
+				"N_UPDATED = N_UPDATED + 1 " .
+				"WHERE I = ${i} % 50");
+		}
+		else
+		{
+			$conn->query_safe(
+				"SELECT * FROM ${tbl} FOR KEY SHARE");
+		}
+	}
+
+	for my $conn (@connections)
+	{
+		$conn->quit();
+	}
+
+	$node->stop;
+	$mxoff2 = next_mxoff($node);
+
+	return $mxoff1, $mxoff2;
+}
+
+# Consume around 1M of mxoffsets.
+sub mxact_huge_eater
+{
+	my $node = shift;
+	my $tbl = 'FOO';
+
+	my ($mxoff1, $mxoff2);
+
+	$mxoff1 = next_mxoff($node);
+	$node->start;
+	$node->safe_psql('postgres',
+		"CREATE TABLE ${tbl} (I INT PRIMARY KEY, N_UPDATED INT) " .
+		"       WITH (AUTOVACUUM_ENABLED=FALSE);" .
+		"INSERT INTO ${tbl} SELECT G, 0 FROM GENERATE_SERIES(1, 4) G;");
+
+	my $nclients = 100;
+	my @connections = ();
+	my $timeout = 10 * $PostgreSQL::Test::Utils::timeout_default;
+
+	for (0..$nclients)
+	{
+		my $conn = $node->background_psql('postgres',
+										  timeout => $timeout);
+		$conn->query_safe("BEGIN");
+
+		push(@connections, $conn);
+	}
+
+	# It's a long process, better to tell about progress.
+	my $n_steps = 100_000;
+	my $step = int($n_steps / 10);
+
+	diag "\nstart to consume mxoffsets ...\n";
+	for (my $i = 0; $i < $n_steps; $i++)
+	{
+		my $conn = $connections[$i % $nclients];
+
+		$conn->query_safe("COMMIT;");
+		$conn->query_safe("BEGIN");
+
+		{
+			$conn->query_safe(
+				"SELECT * FROM ${tbl} " .
+				"FOR KEY SHARE");
+		}
+
+		if ($i % $step == 0)
+		{
+			my $done = int(($i / $n_steps) * 100);
+			diag "$done% done...";
+		}
+	}
+
+	for my $conn (@connections)
+	{
+		$conn->quit();
+	}
+
+	$node->stop;
+	$mxoff2 = next_mxoff($node);
+
+	return $mxoff1, $mxoff2;
+}
+
+# Set next multixact offset
+sub reset_mxoff
+{
+	my $node = shift;
+	my $offset = shift;
+
+	my $pg_resetwal_path = utility_path($node, 'pg_resetwal');
+	# Get block size
+	my $out = (run_command([ $pg_resetwal_path, '--dry-run',
+							 $node->data_dir ]))[0];
+		$out =~ /^Database block size: *(\d+)$/m or die;
+	my $blcksz = $1;
+
+	# Reset to new offset
+	my @cmd = ($pg_resetwal_path, '--pgdata' => $node->data_dir);
+	push @cmd, '--multixact-offset' => $offset;
+	command_ok(\@cmd, 'set oldest multixact-offset');
+
+	# Fill empty pg_multixact/members segment
+	my $mult = 32 * int($blcksz / 20) * 4;
+	my $segname = sprintf "%04X", $offset / $mult;
+
+	my @dd = ('dd');
+	push @dd, "if=/dev/zero";
+	push @dd, "of=" . $node->data_dir . "/pg_multixact/members/" . $segname;
+	push @dd, "bs=$blcksz";
+	push @dd, "count=32";
+	command_ok(\@dd, 'fill empty multixact-members');
+}
+
+sub get_dump_for_comparison
+{
+	my ($node, $db, $file_prefix, $adjust_child_columns) = @_;
+
+	my $dumpfile = $tempdir . '/' . $file_prefix . '.sql';
+	my $dump_adjusted = "${dumpfile}_adjusted";
+
+	open(my $dh, '>', $dump_adjusted)
+	  || die "could not open $dump_adjusted for writing $!";
+
+	my $pg_dump_path = utility_path($node, 'pg_dump');
+
+	$node->run_log(
+		[
+			$pg_dump_path, '--no-sync',
+			'--restrict-key' => 'test',
+			'-d' => $node->connstr($db),
+			'-f' => $dumpfile
+		]);
+
+	print $dh adjust_regress_dumpfile(slurp_file($dumpfile),
+		$adjust_child_columns);
+	close($dh);
+
+	return $dump_adjusted;
+}
+
+# Main test workhorse routine.
+# Run pg_upgrade, then dump both clusters and compare the dumps.
+sub run_test
+{
+	my $tag = shift;
+	my $oldnode = shift;
+	my $newnode = shift;
+
+	my $pg_upgrade_path = utility_path($newnode, 'pg_upgrade');
+
+	command_ok(
+		[
+			$pg_upgrade_path, '--no-sync',
+			'--old-datadir' => $oldnode->data_dir,
+			'--new-datadir' => $newnode->data_dir,
+			'--old-bindir' => $oldnode->config_data('--bindir'),
+			'--new-bindir' => $newnode->config_data('--bindir'),
+			'--socketdir' => $newnode->host,
+			'--old-port' => $oldnode->port,
+			'--new-port' => $newnode->port,
+			$mode,
+		],
+		'run of pg_upgrade for new instance');
+	ok( !-d $newnode->data_dir . "/pg_upgrade_output.d",
+		"pg_upgrade_output.d/ removed after pg_upgrade success");
+
+	$oldnode->start;
+	my $src_dump =
+		get_dump_for_comparison($oldnode, 'postgres',
+								"oldnode_${tag}_dump", 0);
+	$oldnode->stop;
+
+	$newnode->start;
+	my $dst_dump =
+		get_dump_for_comparison($newnode, 'postgres',
+								"newnode_${tag}_dump", 0);
+	$newnode->stop;
+
+	compare_files($src_dump, $dst_dump,
+		'dump outputs from original and restored regression databases match');
+}
+
+sub to_hex
+{
+	my $arg = shift;
+
+	$arg = Math::BigInt->new($arg);
+	$arg = $arg->as_hex();
+
+	return $arg;
+}
+
+# case #1: start old node from defaults
+{
+	my $tag = 1;
+	my $old =
+		PostgreSQL::Test::Cluster->new("oldnode${tag}",
+									   install_path => $ENV{oldinstall});
+	$old->init(extra => ['-k']);
+
+	my ($start_mxoff, $finish_mxoff) = mxact_eater($old);
+
+	my $new = PostgreSQL::Test::Cluster->new("newnode${tag}");
+	$new->init;
+
+	run_test($tag, $old, $new);
+
+	$start_mxoff = to_hex($start_mxoff);
+	$finish_mxoff = to_hex($finish_mxoff);
+
+	my $next_mxoff = to_hex(next_mxoff($new));
+
+	note ">>> case #${tag}\n" .
+		 " oldnode mxoff from ${start_mxoff} to ${finish_mxoff}\n" .
+		 " newnode mxoff ${next_mxoff}\n";
+}
+
+# case #2: start old node from before 32-bit wraparound
+{
+	my $tag = 2;
+	my $old =
+		PostgreSQL::Test::Cluster->new("oldnode${tag}",
+									   install_path => $ENV{oldinstall});
+
+	$old->init(extra => ['-k']);
+	reset_mxoff($old, 0xFFFF0000);
+
+	my ($start_mxoff, $finish_mxoff) = mxact_eater($old);
+
+	my $new = PostgreSQL::Test::Cluster->new("newnode${tag}");
+	$new->init;
+
+	run_test($tag, $old, $new);
+
+	$start_mxoff = to_hex($start_mxoff);
+	$finish_mxoff = to_hex($finish_mxoff);
+
+	my $next_mxoff = to_hex(next_mxoff($new));
+
+	note ">>> case #${tag}\n" .
+		 " oldnode mxoff from ${start_mxoff} to ${finish_mxoff}\n" .
+		 " newnode mxoff ${next_mxoff}\n";
+}
+
+# case #3: start old node near 32-bit wraparound and reach wraparound state.
+{
+	my $tag = 3;
+	my $old =
+		PostgreSQL::Test::Cluster->new("oldnode${tag}",
+									   install_path => $ENV{oldinstall});
+
+	$old->init(extra => ['-k']);
+
+	reset_mxoff($old, 0xFFFFEC77);
+	my ($start_mxoff, $finish_mxoff) = mxact_eater($old);
+
+	my $new = PostgreSQL::Test::Cluster->new("newnode${tag}");
+	$new->init;
+
+	run_test($tag, $old, $new);
+
+	$start_mxoff = to_hex($start_mxoff);
+	$finish_mxoff = to_hex($finish_mxoff);
+
+	my $next_mxoff = to_hex(next_mxoff($new));
+
+	note ">>> case #${tag}\n" .
+		 " oldnode mxoff from ${start_mxoff} to ${finish_mxoff}\n" .
+		 " newnode mxoff ${next_mxoff}\n";
+}
+
+# case #4: start old node from defaults
+{
+	my $tag = 4;
+	my $old =
+		PostgreSQL::Test::Cluster->new("oldnode${tag}",
+									   install_path => $ENV{oldinstall});
+
+	$old->init(extra => ['-k']);
+	$old->append_conf("postgresql.conf", "max_connections = 128");
+
+	diag "test #${tag} for multiple mxoff segments";
+	my ($start_mxoff, $finish_mxoff) = mxact_huge_eater($old);
+
+	my $new = PostgreSQL::Test::Cluster->new("newnode${tag}");
+	$new->init;
+
+	run_test($tag, $old, $new);
+
+	$start_mxoff = to_hex($start_mxoff);
+	$finish_mxoff = to_hex($finish_mxoff);
+
+	my $next_mxoff = to_hex(next_mxoff($new));
+
+	note ">>> case #${tag}\n" .
+		 " oldnode mxoff from ${start_mxoff} to ${finish_mxoff}\n" .
+		 " newnode mxoff ${next_mxoff}\n";
+}
+
+# case #5: start old node from before 32-bit wraparound
+{
+	my $tag = 5;
+	my $old =
+		PostgreSQL::Test::Cluster->new("oldnode${tag}",
+									   install_path => $ENV{oldinstall});
+
+	$old->init(extra => ['-k']);
+	$old->append_conf("postgresql.conf", "max_connections = 128");
+	reset_mxoff($old, 0xFF000000);
+
+	diag "test #${tag} for multiple mxoff segments";
+	my ($start_mxoff, $finish_mxoff) = mxact_huge_eater($old);
+
+	my $new = PostgreSQL::Test::Cluster->new("newnode${tag}");
+	$new->init;
+
+	run_test($tag, $old, $new);
+
+	$start_mxoff = to_hex($start_mxoff);
+	$finish_mxoff = to_hex($finish_mxoff);
+
+	my $next_mxoff = to_hex(next_mxoff($new));
+
+	note ">>> case #${tag}\n" .
+		 " oldnode mxoff from ${start_mxoff} to ${finish_mxoff}\n" .
+		 " newnode mxoff ${next_mxoff}\n";
+}
+
+# case #6: start old node near 32-bit wraparound and reach wraparound state.
+{
+	my $tag = 6;
+	my $old =
+		PostgreSQL::Test::Cluster->new("oldnode${tag}",
+									   install_path => $ENV{oldinstall});
+
+	$old->init(extra => ['-k']);
+
+	reset_mxoff($old, 0xFFFFFFFF - 500_000);
+	$old->append_conf("postgresql.conf", "max_connections = 128");
+	my ($start_mxoff, $finish_mxoff) = mxact_huge_eater($old);
+
+	diag "test #${tag} for multiple mxoff segments";
+	my $new = PostgreSQL::Test::Cluster->new("newnode${tag}");
+	$new->init;
+
+	run_test($tag, $old, $new);
+
+	$start_mxoff = to_hex($start_mxoff);
+	$finish_mxoff = to_hex($finish_mxoff);
+
+	my $next_mxoff = to_hex(next_mxoff($new));
+
+	note ">>> case #${tag}\n" .
+		 " oldnode mxoff from ${start_mxoff} to ${finish_mxoff}\n" .
+		 " newnode mxoff ${next_mxoff}\n";
+}
+
+done_testing();
-- 
2.51.0

#56Maxim Orlov
orlovmg@gmail.com
In reply to: Maxim Orlov (#55)
Re: POC: make mxidoff 64 bits

I noticed one minor issue after I had already sent the
previous email.

--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1034,7 +1034,7 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
    if (nextOffset + nmembers < nextOffset)
        ereport(ERROR,
                (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-                "MultiXact members would wrap around"));
+                errmsg("MultiXact members would wrap around")));
    *offset = nextOffset;
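
For readers unfamiliar with the idiom, the test `nextOffset + nmembers < nextOffset` in the hunk above detects wraparound because unsigned addition in C is modular. A minimal stand-alone sketch, not the multixact.c code; the typedef and the function name are made up for illustration, and the offset is assumed to be an unsigned 64-bit integer as this patch set proposes:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a 64-bit MultiXactOffset. */
typedef uint64_t mx_offset;

/*
 * Return true if advancing 'next' by 'nmembers' would wrap around.
 * Unsigned arithmetic is modular, so a sum that wrapped is smaller
 * than the value we started from.
 */
static bool
members_would_wrap(mx_offset next, uint32_t nmembers)
{
	return next + nmembers < next;
}
```

With a 64-bit offset the branch is practically unreachable, but it still has to compile and report a proper message, hence the errmsg() fix.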

$ $PGBINOLD/pg_controldata -D pgdata
pg_control version number: 1800
Catalog version number: 202510221
...
Latest checkpoint's NextMultiXactId: 10000000
Latest checkpoint's NextMultiOffset: 999995050
Latest checkpoint's oldestXID: 748
...

I tried to find out how long it would take to convert a large number of
segments. Unfortunately, I only have access to a very old machine right
now. It took me 7 hours to generate this much data on my old
Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz with 16 GB of RAM.
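
To put NextMultiOffset in terms of files: the member layout described in this thread (20-byte groups of four members, 32 pages per segment, the same arithmetic reset_mxoff uses in the tests) gives the segment that holds a given offset. A rough sketch; the function and macro names are made up here, and the default 8 kB block size is an assumption about the cluster above:

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGES_PER_SEGMENT	32	/* mirrors SLRU_PAGES_PER_SEGMENT */
#define SKETCH_MEMBERS_PER_GROUP	4	/* four xids share one flag word */
#define SKETCH_GROUP_SIZE			20	/* 4 flag bytes + 4 * 4-byte xids */

/* Number of multixact members that fit on one page of 'blcksz' bytes. */
static uint64_t
members_per_page(uint32_t blcksz)
{
	return (uint64_t) (blcksz / SKETCH_GROUP_SIZE) * SKETCH_MEMBERS_PER_GROUP;
}

/* Segment number of the members file that holds 'offset'. */
static uint64_t
member_segment(uint64_t offset, uint32_t blcksz)
{
	return offset / (members_per_page(blcksz) * SKETCH_PAGES_PER_SEGMENT);
}
```

With an 8 kB block size, member_segment(999995050, 8192) evaluates to 19101 (0x4A9D), so, if nothing had been truncated, the members SLRU would span roughly 19 thousand segment files of 256 kB each.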

Here are my rough measurements:

HDD
$ sudo sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
$ time pg_upgrade
...
real 4m59.459s
user 0m19.974s
sys 0m13.640s

SSD
$ sudo sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
$ time pg_upgrade
...
real 4m52.958s
user 0m19.826s
sys 0m13.624s

I plan to get access to more modern hardware and repeat the measurements there.

--
Best regards,
Maxim Orlov.

#57Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Maxim Orlov (#56)
9 attachment(s)
Re: POC: make mxidoff 64 bits

On 07/11/2025 18:03, Maxim Orlov wrote:

I tried finding out how long it would take to convert a big number of
segments. Unfortunately, I only have access to a very old machine right
now. It took me 7 hours to generate this much data on my old
Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz with 16 Gb of RAM.

Here are my rough measurements:

HDD
$ sudo sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
$ time pg_upgrade
...
real    4m59.459s
user    0m19.974s
sys     0m13.640s

SSD
$ sudo sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
$ time pg_upgrade
...
real    4m52.958s
user    0m19.826s
sys     0m13.624s

I aim to get access to more modern stuff and check it all out there.

Thanks, I also did some perf testing on my laptop. I wrote a little
helper function to consume multixids, and used it to create a v17
cluster with 100 million multixids. See attached
consume-mxids.patch.txt. I then ran pg_upgrade on that, and measured how
long the pg_multixact conversion part of pg_upgrade took. It took about
1.2 s on my laptop. Extrapolating linearly from that, converting 1 billion
multixids would take about 12 s. These were very simple multixacts with just
one member each, though; realistic multixacts with more members would
presumably take a little longer.

In any case, I think we're in an acceptable ballpark here.

There's some very low-hanging fruit though: Profiling with 'linux-perf'
suggested that a lot of CPU time was spent simply on the function call
overhead of GetOldMultiXactIdSingleMember, SlruReadSwitchPage,
RecordNewMultiXact, SlruWriteSwitchPage for each multixact. I added an
inlined fast path to SlruReadSwitchPage and SlruWriteSwitchPage to
eliminate the function call overhead of those in the common case that no
page switch is needed. With that, the 100 million mxid test case I used
went from 1.2 s to 0.9 s. We could optimize this further but I think
this is good enough.
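
The shape of that fast path, as a simplified stand-alone sketch. This is not the code from the patch: the struct, the function names and the in-memory "read" are hypothetical, and only the idea of checking the current page inline before calling out is taken from the description above.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_BLCKSZ 8192

/* Hypothetical reader state: one buffered page at a time. */
typedef struct PageReader
{
	int64_t		cur_pageno;		/* page currently in 'buf', -1 if none */
	long		slow_calls;		/* how many times the slow path ran */
	char		buf[SKETCH_BLCKSZ];
} PageReader;

/* Slow path, kept out of line. Real code would read the page from disk. */
static char *
reader_switch_page_slow(PageReader *r, int64_t pageno)
{
	memset(r->buf, 0, sizeof(r->buf));
	r->cur_pageno = pageno;
	r->slow_calls++;
	return r->buf;
}

/*
 * Fast path, inlined into the caller: when the requested page is already
 * buffered, return it without paying for a function call.
 */
static inline char *
reader_switch_page(PageReader *r, int64_t pageno)
{
	if (r->cur_pageno == pageno)
		return r->buf;
	return reader_switch_page_slow(r, pageno);
}
```

Since consecutive multixacts mostly land on the same page, nearly every call in a conversion loop takes the inlined branch.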

Some other changes since patch set v23:

- Rebased. I committed the wraparound bug fixes.

- I added an SlruFileName() helper function to slru_io.c, and support
for reading SLRUs with long_segment_names==true. It's not needed
currently, but it seemed like a weird omission. AllocSlruRead() actually
left 'long_segment_names' uninitialized, which is error-prone. We
could've just documented it, but it seems just as easy to support it.

- I split out the multixact_internal.h header into a separate commit, to make
it clearer which changes are related to 64-bit offsets.
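
For reference, the two naming schemes differ only in the width of the hexadecimal segment number: four digits for the short form and fifteen for long_segment_names, which is what the '%04X' and '%015X' formats in the test scripts earlier in this thread rely on. A sketch with a made-up function name, not the SlruFileName() from the patch:

```c
#include <assert.h>
#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Build the path of an SLRU segment file under 'dir'.
 * Returns what snprintf() returns: the length the full path needs.
 */
static int
sketch_slru_file_name(char *path, size_t len, const char *dir,
					  int64_t segno, bool long_segment_names)
{
	if (long_segment_names)
		return snprintf(path, len, "%s/%015" PRIX64, dir, (uint64_t) segno);

	/* Short names are only meant for small segment numbers. */
	return snprintf(path, len, "%s/%04X", dir, (unsigned int) segno);
}
```

For segment 0x4A9D the short form ends in "4A9D", while the long form pads the same digits with zeros to fifteen characters.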

I kept all the new test cases for now. We need to decide which ones are
worth keeping, and polish and speed up the ones we decide to keep.

I'm getting one failure from the pg_upgrade/008_mxoff test:

[14:43:38.422](0.530s) not ok 26 - dump outputs from original and restored regression databases match
[14:43:38.422](0.000s) #   Failed test 'dump outputs from original and restored regression databases match'
#   at /home/heikki/git-sandbox/postgresql/src/test/perl/PostgreSQL/Test/Utils.pm line 801.
[14:43:38.422](0.000s) #          got: '1'
#     expected: '0'
=== diff of /home/heikki/git-sandbox/postgresql/build/testrun/pg_upgrade/008_mxoff/data/tmp_test_AC6A/oldnode_6_dump.sql_adjusted and /home/heikki/git-sandbox/postgresql/build/testrun/pg_upgrade/008_mxoff/data/tmp_test_AC6A/newnode_6_dump.sql_adjusted
=== stdout ===
--- /home/heikki/git-sandbox/postgresql/build/testrun/pg_upgrade/008_mxoff/data/tmp_test_AC6A/oldnode_6_dump.sql_adjusted       2025-11-12 14:43:38.030399957 +0200
+++ /home/heikki/git-sandbox/postgresql/build/testrun/pg_upgrade/008_mxoff/data/tmp_test_AC6A/newnode_6_dump.sql_adjusted       2025-11-12 14:43:38.314399819 +0200
@@ -2,8 +2,8 @@
 -- PostgreSQL database dump
 --
 \restrict test
--- Dumped from database version 17.6
--- Dumped by pg_dump version 17.6
+-- Dumped from database version 19devel
+-- Dumped by pg_dump version 19devel
 SET statement_timeout = 0;
 SET lock_timeout = 0;
 SET idle_in_transaction_session_timeout = 0;
=== stderr ===
=== EOF ===
[14:43:38.425](0.004s) # >>> case #6

I ran the test with:

(rm -rf build/testrun/ build/tmp_install/; \
 olddump=/tmp/olddump-regress.sql oldinstall=/home/heikki/pgsql.17stable/ \
 meson test -C build --suite setup --suite pg_upgrade)

- Heikki

Attachments:

consume-mxids.patch.txt
diff --git a/src/test/modules/xid_wraparound/xid_wraparound--1.0.sql b/src/test/modules/xid_wraparound/xid_wraparound--1.0.sql
index 51d25fc4c63..c24164c480c 100644
--- a/src/test/modules/xid_wraparound/xid_wraparound--1.0.sql
+++ b/src/test/modules/xid_wraparound/xid_wraparound--1.0.sql
@@ -10,3 +10,7 @@ AS 'MODULE_PATHNAME' LANGUAGE C;
 CREATE FUNCTION consume_xids_until(targetxid xid8)
 RETURNS xid8 IMMUTABLE PARALLEL SAFE STRICT
 AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION consume_mxids(nmxids bigint)
+RETURNS xid8 IMMUTABLE PARALLEL SAFE STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/xid_wraparound/xid_wraparound.c b/src/test/modules/xid_wraparound/xid_wraparound.c
index dce81c0c6d6..935a770a683 100644
--- a/src/test/modules/xid_wraparound/xid_wraparound.c
+++ b/src/test/modules/xid_wraparound/xid_wraparound.c
@@ -14,6 +14,7 @@
  */
 #include "postgres.h"
 
+#include "access/multixact.h"
 #include "access/xact.h"
 #include "miscadmin.h"
 #include "storage/proc.h"
@@ -24,6 +25,8 @@ PG_MODULE_MAGIC;
 static int64 consume_xids_shortcut(void);
 static FullTransactionId consume_xids_common(FullTransactionId untilxid, uint64 nxids);
 
+static MultiXactId  consume_multixids_common(uint64 nmxids);
+
 /*
  * Consume the specified number of XIDs.
  */
@@ -151,6 +154,7 @@ consume_xids_common(FullTransactionId untilxid, uint64 nxids)
 	}
 
 	return lastxid;
+#undef REPORT_INTERVAL
 }
 
 /*
@@ -217,3 +221,89 @@ consume_xids_shortcut(void)
 
 	return consumed;
 }
+
+/*
+ * Consume the specified number of multitransaction IDs.
+ */
+PG_FUNCTION_INFO_V1(consume_mxids);
+Datum
+consume_mxids(PG_FUNCTION_ARGS)
+{
+	int64		nmxids = PG_GETARG_INT64(0);
+	MultiXactId lastmxid;
+
+	if (nmxids < 0)
+		elog(ERROR, "invalid nmxids argument: %lld", (long long) nmxids);
+
+	if (nmxids == 0)
+		lastmxid = ReadNextMultiXactId();
+	else
+		lastmxid = consume_multixids_common((uint64) nmxids);
+
+	PG_RETURN_TRANSACTIONID(lastmxid);
+}
+
+
+/*
+ * Workhorse for consume_mxids(): create 'nmxids' single-member multixacts.
+ */
+static MultiXactId
+consume_multixids_common(uint64 nmxids)
+{
+	MultiXactId lastmxid;
+	uint64		last_reported_at = 0;
+	uint64		consumed = 0;
+	MultiXactMember member;
+	TransactionId xids[256];
+
+	/* Print a NOTICE every REPORT_INTERVAL mxids */
+#define REPORT_INTERVAL (10 * 1000000 /  10)
+
+	/* initialize 'lastmxid' with the system's current next multixact ID */
+	lastmxid = ReadNextMultiXactId();
+
+	xids[0] = GetTopTransactionId();
+	for (int i = 1; i < Min(256, nmxids); i++)
+	{
+		xids[i] = XidFromFullTransactionId(GetNewTransactionId(true));
+	}
+
+	for (;;)
+	{
+		//uint64		mxids_left;
+
+		CHECK_FOR_INTERRUPTS();
+
+		/* How many mxids do we have left to consume? */
+		if (nmxids > 0)
+		{
+			if (consumed >= nmxids)
+				break;
+			//mxids_left = nmxids - consumed;
+		}
+
+		/* (no fast path) */
+
+		/* Slow path: create a new single-member multixact. */
+
+		member = (MultiXactMember) {
+			.xid = xids[consumed % 256],
+			.status = ((consumed / 256) % 2) ? MultiXactStatusForUpdate : MultiXactStatusForKeyShare,
+		};
+
+		lastmxid = MultiXactIdCreateFromMembers(1, &member);
+		consumed++;
+
+		/* Report progress */
+		if (consumed - last_reported_at >= REPORT_INTERVAL)
+		{
+			elog(NOTICE, "consumed %llu / %llu mxids, latest %u",
+				 (unsigned long long) consumed, (unsigned long long) nmxids,
+				 lastmxid);
+			last_reported_at = consumed;
+		}
+	}
+
+	return lastmxid;
+#undef REPORT_INTERVAL
+}
v24-0001-Move-pg_multixact-SLRU-page-format-definitions-t.patch
From f277ce5b07a92d86e673e523a6981cab73ec7a76 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Wed, 12 Nov 2025 14:19:32 +0200
Subject: [PATCH v24 1/8] Move pg_multixact SLRU page format definitions to
 separate header

---
 src/backend/access/transam/multixact.c  | 119 --------------------
 src/include/access/multixact_internal.h | 140 ++++++++++++++++++++++++
 2 files changed, 140 insertions(+), 119 deletions(-)
 create mode 100644 src/include/access/multixact_internal.h

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 9d5f130af7e..acb2a6788f9 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -89,125 +89,6 @@
 #include "utils/memutils.h"
 
 
-/*
- * Defines for MultiXactOffset page sizes.  A page is the same BLCKSZ as is
- * used everywhere else in Postgres.
- *
- * Note: because MultiXactOffsets are 32 bits and wrap around at 0xFFFFFFFF,
- * MultiXact page numbering also wraps around at
- * 0xFFFFFFFF/MULTIXACT_OFFSETS_PER_PAGE, and segment numbering at
- * 0xFFFFFFFF/MULTIXACT_OFFSETS_PER_PAGE/SLRU_PAGES_PER_SEGMENT.  We need
- * take no explicit notice of that fact in this module, except when comparing
- * segment and page numbers in TruncateMultiXact (see
- * MultiXactOffsetPagePrecedes).
- */
-
-/* We need four bytes per offset */
-#define MULTIXACT_OFFSETS_PER_PAGE (BLCKSZ / sizeof(MultiXactOffset))
-
-static inline int64
-MultiXactIdToOffsetPage(MultiXactId multi)
-{
-	return multi / MULTIXACT_OFFSETS_PER_PAGE;
-}
-
-static inline int
-MultiXactIdToOffsetEntry(MultiXactId multi)
-{
-	return multi % MULTIXACT_OFFSETS_PER_PAGE;
-}
-
-static inline int64
-MultiXactIdToOffsetSegment(MultiXactId multi)
-{
-	return MultiXactIdToOffsetPage(multi) / SLRU_PAGES_PER_SEGMENT;
-}
-
-/*
- * The situation for members is a bit more complex: we store one byte of
- * additional flag bits for each TransactionId.  To do this without getting
- * into alignment issues, we store four bytes of flags, and then the
- * corresponding 4 Xids.  Each such 5-word (20-byte) set we call a "group", and
- * are stored as a whole in pages.  Thus, with 8kB BLCKSZ, we keep 409 groups
- * per page.  This wastes 12 bytes per page, but that's OK -- simplicity (and
- * performance) trumps space efficiency here.
- *
- * Note that the "offset" macros work with byte offset, not array indexes, so
- * arithmetic must be done using "char *" pointers.
- */
-/* We need eight bits per xact, so one xact fits in a byte */
-#define MXACT_MEMBER_BITS_PER_XACT			8
-#define MXACT_MEMBER_FLAGS_PER_BYTE			1
-#define MXACT_MEMBER_XACT_BITMASK	((1 << MXACT_MEMBER_BITS_PER_XACT) - 1)
-
-/* how many full bytes of flags are there in a group? */
-#define MULTIXACT_FLAGBYTES_PER_GROUP		4
-#define MULTIXACT_MEMBERS_PER_MEMBERGROUP	\
-	(MULTIXACT_FLAGBYTES_PER_GROUP * MXACT_MEMBER_FLAGS_PER_BYTE)
-/* size in bytes of a complete group */
-#define MULTIXACT_MEMBERGROUP_SIZE \
-	(sizeof(TransactionId) * MULTIXACT_MEMBERS_PER_MEMBERGROUP + MULTIXACT_FLAGBYTES_PER_GROUP)
-#define MULTIXACT_MEMBERGROUPS_PER_PAGE (BLCKSZ / MULTIXACT_MEMBERGROUP_SIZE)
-#define MULTIXACT_MEMBERS_PER_PAGE	\
-	(MULTIXACT_MEMBERGROUPS_PER_PAGE * MULTIXACT_MEMBERS_PER_MEMBERGROUP)
-
-/*
- * Because the number of items per page is not a divisor of the last item
- * number (member 0xFFFFFFFF), the last segment does not use the maximum number
- * of pages, and moreover the last used page therein does not use the same
- * number of items as previous pages.  (Another way to say it is that the
- * 0xFFFFFFFF member is somewhere in the middle of the last page, so the page
- * has some empty space after that item.)
- *
- * This constant is the number of members in the last page of the last segment.
- */
-#define MAX_MEMBERS_IN_LAST_MEMBERS_PAGE \
-		((uint32) ((0xFFFFFFFF % MULTIXACT_MEMBERS_PER_PAGE) + 1))
-
-/* page in which a member is to be found */
-static inline int64
-MXOffsetToMemberPage(MultiXactOffset offset)
-{
-	return offset / MULTIXACT_MEMBERS_PER_PAGE;
-}
-
-static inline int64
-MXOffsetToMemberSegment(MultiXactOffset offset)
-{
-	return MXOffsetToMemberPage(offset) / SLRU_PAGES_PER_SEGMENT;
-}
-
-/* Location (byte offset within page) of flag word for a given member */
-static inline int
-MXOffsetToFlagsOffset(MultiXactOffset offset)
-{
-	MultiXactOffset group = offset / MULTIXACT_MEMBERS_PER_MEMBERGROUP;
-	int			grouponpg = group % MULTIXACT_MEMBERGROUPS_PER_PAGE;
-	int			byteoff = grouponpg * MULTIXACT_MEMBERGROUP_SIZE;
-
-	return byteoff;
-}
-
-static inline int
-MXOffsetToFlagsBitShift(MultiXactOffset offset)
-{
-	int			member_in_group = offset % MULTIXACT_MEMBERS_PER_MEMBERGROUP;
-	int			bshift = member_in_group * MXACT_MEMBER_BITS_PER_XACT;
-
-	return bshift;
-}
-
-/* Location (byte offset within page) of TransactionId of given member */
-static inline int
-MXOffsetToMemberOffset(MultiXactOffset offset)
-{
-	int			member_in_group = offset % MULTIXACT_MEMBERS_PER_MEMBERGROUP;
-
-	return MXOffsetToFlagsOffset(offset) +
-		MULTIXACT_FLAGBYTES_PER_GROUP +
-		member_in_group * sizeof(TransactionId);
-}
-
 /* Multixact members wraparound thresholds. */
 #define MULTIXACT_MEMBER_SAFE_THRESHOLD		(MaxMultiXactOffset / 2)
 #define MULTIXACT_MEMBER_DANGER_THRESHOLD	\
diff --git a/src/include/access/multixact_internal.h b/src/include/access/multixact_internal.h
new file mode 100644
index 00000000000..9b56deaef31
--- /dev/null
+++ b/src/include/access/multixact_internal.h
@@ -0,0 +1,140 @@
+/*
+ * multixact_internal.h
+ *
+ * PostgreSQL multi-transaction-log manager internal declarations
+ *
+ * These functions and definitions are for dealing with pg_multixact pages.
+ * They are internal to multixact.c, but they are exported here to allow
+ * pg_upgrade to write pg_multixact files directly.
+ *
+ * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/multixact_internal.h
+ */
+#ifndef MULTIXACT_INTERNAL_H
+#define MULTIXACT_INTERNAL_H
+
+#include "access/multixact.h"
+
+
+/*
+ * Defines for MultiXactOffset page sizes.  A page is the same BLCKSZ as is
+ * used everywhere else in Postgres.
+ *
+ * Note: because MultiXactOffsets are 32 bits and wrap around at 0xFFFFFFFF,
+ * MultiXact page numbering also wraps around at
+ * 0xFFFFFFFF/MULTIXACT_OFFSETS_PER_PAGE, and segment numbering at
+ * 0xFFFFFFFF/MULTIXACT_OFFSETS_PER_PAGE/SLRU_PAGES_PER_SEGMENT.  We need
+ * take no explicit notice of that fact in this module, except when comparing
+ * segment and page numbers in TruncateMultiXact (see
+ * MultiXactOffsetPagePrecedes).
+ */
+
+/* We need four bytes per offset */
+#define MULTIXACT_OFFSETS_PER_PAGE (BLCKSZ / sizeof(MultiXactOffset))
+
+static inline int64
+MultiXactIdToOffsetPage(MultiXactId multi)
+{
+	return multi / MULTIXACT_OFFSETS_PER_PAGE;
+}
+
+static inline int
+MultiXactIdToOffsetEntry(MultiXactId multi)
+{
+	return multi % MULTIXACT_OFFSETS_PER_PAGE;
+}
+
+static inline int64
+MultiXactIdToOffsetSegment(MultiXactId multi)
+{
+	return MultiXactIdToOffsetPage(multi) / SLRU_PAGES_PER_SEGMENT;
+}
+
+/*
+ * The situation for members is a bit more complex: we store one byte of
+ * additional flag bits for each TransactionId.  To do this without getting
+ * into alignment issues, we store four bytes of flags, and then the
+ * corresponding 4 Xids.  Each such 5-word (20-byte) set we call a "group", and
+ * are stored as a whole in pages.  Thus, with 8kB BLCKSZ, we keep 409 groups
+ * per page.  This wastes 12 bytes per page, but that's OK -- simplicity (and
+ * performance) trumps space efficiency here.
+ *
+ * Note that the "offset" macros work with byte offset, not array indexes, so
+ * arithmetic must be done using "char *" pointers.
+ */
+/* We need eight bits per xact, so one xact fits in a byte */
+#define MXACT_MEMBER_BITS_PER_XACT			8
+#define MXACT_MEMBER_FLAGS_PER_BYTE			1
+#define MXACT_MEMBER_XACT_BITMASK	((1 << MXACT_MEMBER_BITS_PER_XACT) - 1)
+
+/* how many full bytes of flags are there in a group? */
+#define MULTIXACT_FLAGBYTES_PER_GROUP		4
+#define MULTIXACT_MEMBERS_PER_MEMBERGROUP	\
+	(MULTIXACT_FLAGBYTES_PER_GROUP * MXACT_MEMBER_FLAGS_PER_BYTE)
+/* size in bytes of a complete group */
+#define MULTIXACT_MEMBERGROUP_SIZE \
+	(sizeof(TransactionId) * MULTIXACT_MEMBERS_PER_MEMBERGROUP + MULTIXACT_FLAGBYTES_PER_GROUP)
+#define MULTIXACT_MEMBERGROUPS_PER_PAGE (BLCKSZ / MULTIXACT_MEMBERGROUP_SIZE)
+#define MULTIXACT_MEMBERS_PER_PAGE	\
+	(MULTIXACT_MEMBERGROUPS_PER_PAGE * MULTIXACT_MEMBERS_PER_MEMBERGROUP)
+
+/*
+ * Because the number of items per page is not a divisor of the last item
+ * number (member 0xFFFFFFFF), the last segment does not use the maximum number
+ * of pages, and moreover the last used page therein does not use the same
+ * number of items as previous pages.  (Another way to say it is that the
+ * 0xFFFFFFFF member is somewhere in the middle of the last page, so the page
+ * has some empty space after that item.)
+ *
+ * This constant is the number of members in the last page of the last segment.
+ */
+#define MAX_MEMBERS_IN_LAST_MEMBERS_PAGE \
+		((uint32) ((0xFFFFFFFF % MULTIXACT_MEMBERS_PER_PAGE) + 1))
+
+/* page in which a member is to be found */
+static inline int64
+MXOffsetToMemberPage(MultiXactOffset offset)
+{
+	return offset / MULTIXACT_MEMBERS_PER_PAGE;
+}
+
+static inline int64
+MXOffsetToMemberSegment(MultiXactOffset offset)
+{
+	return MXOffsetToMemberPage(offset) / SLRU_PAGES_PER_SEGMENT;
+}
+
+/* Location (byte offset within page) of flag word for a given member */
+static inline int
+MXOffsetToFlagsOffset(MultiXactOffset offset)
+{
+	MultiXactOffset group = offset / MULTIXACT_MEMBERS_PER_MEMBERGROUP;
+	int			grouponpg = group % MULTIXACT_MEMBERGROUPS_PER_PAGE;
+	int			byteoff = grouponpg * MULTIXACT_MEMBERGROUP_SIZE;
+
+	return byteoff;
+}
+
+static inline int
+MXOffsetToFlagsBitShift(MultiXactOffset offset)
+{
+	int			member_in_group = offset % MULTIXACT_MEMBERS_PER_MEMBERGROUP;
+	int			bshift = member_in_group * MXACT_MEMBER_BITS_PER_XACT;
+
+	return bshift;
+}
+
+/* Location (byte offset within page) of TransactionId of given member */
+static inline int
+MXOffsetToMemberOffset(MultiXactOffset offset)
+{
+	int			member_in_group = offset % MULTIXACT_MEMBERS_PER_MEMBERGROUP;
+
+	return MXOffsetToFlagsOffset(offset) +
+		MULTIXACT_FLAGBYTES_PER_GROUP +
+		member_in_group * sizeof(TransactionId);
+}
+
+#endif							/* MULTIXACT_INTERNAL_H */
-- 
2.47.3
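
The member-page layout described in the multixact_internal.h comment above (four flag bytes followed by four Xids per group) can be checked with a little arithmetic. What follows is only a minimal sketch, not code from the patch: it assumes the default 8kB block size and a 4-byte TransactionId, and every name prefixed with SKETCH_/sketch_ is hypothetical, chosen to mirror the header's macros and inline functions.

```c
#include <assert.h>
#include <stdint.h>

/* Assumptions for this sketch: default 8kB BLCKSZ, 4-byte TransactionId. */
#define SKETCH_BLCKSZ 8192
typedef uint32_t SketchTransactionId;

/* Mirrors of the header's macros (hypothetical names). */
#define SKETCH_FLAGBYTES_PER_GROUP 4
#define SKETCH_MEMBERS_PER_GROUP   4
#define SKETCH_GROUP_SIZE \
	(sizeof(SketchTransactionId) * SKETCH_MEMBERS_PER_GROUP + SKETCH_FLAGBYTES_PER_GROUP)
#define SKETCH_GROUPS_PER_PAGE  (SKETCH_BLCKSZ / SKETCH_GROUP_SIZE)
#define SKETCH_MEMBERS_PER_PAGE (SKETCH_GROUPS_PER_PAGE * SKETCH_MEMBERS_PER_GROUP)

/* The numbers quoted in the comment, checked at compile time. */
_Static_assert(SKETCH_GROUP_SIZE == 20, "a group is 5 words (20 bytes)");
_Static_assert(SKETCH_GROUPS_PER_PAGE == 409, "409 groups per 8kB page");
_Static_assert(SKETCH_MEMBERS_PER_PAGE == 1636, "4 members per group");
_Static_assert(SKETCH_BLCKSZ - SKETCH_GROUPS_PER_PAGE * SKETCH_GROUP_SIZE == 12,
			   "12 bytes per page are left unused");

/* Page in which a member is to be found. */
static inline int64_t
sketch_member_page(uint64_t offset)
{
	return (int64_t) (offset / SKETCH_MEMBERS_PER_PAGE);
}

/* Byte offset within the page of the flag word for a given member. */
static inline int
sketch_flags_offset(uint64_t offset)
{
	uint64_t	group = offset / SKETCH_MEMBERS_PER_GROUP;
	int			grouponpg = (int) (group % SKETCH_GROUPS_PER_PAGE);

	return grouponpg * (int) SKETCH_GROUP_SIZE;
}

/* Byte offset within the page of the TransactionId of a given member. */
static inline int
sketch_member_offset(uint64_t offset)
{
	int			member_in_group = (int) (offset % SKETCH_MEMBERS_PER_GROUP);

	return sketch_flags_offset(offset) +
		SKETCH_FLAGBYTES_PER_GROUP +
		member_in_group * (int) sizeof(SketchTransactionId);
}
```

With these definitions, member 5 is the second member of group 1, so its flag word starts at byte 20 and its Xid at byte 28. Member 1636 is the first member of the next page.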

v24-0002-Use-64-bit-multixact-offsets.patch (text/x-patch; charset=UTF-8)
From 6e0b8e3620e5ada1850dc39b4db47c99fa684c8c Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Wed, 7 Aug 2024 16:35:22 +0300
Subject: [PATCH v24 2/8] Use 64-bit multixact offsets

Switching to 64-bit multitransaction offsets removes wraparound and the
2^32 limit on their total number.

Author: Maxim Orlov <orlovmg@gmail.com>
Discussion: FIXME
---
 src/backend/access/rmgrdesc/mxactdesc.c   |   4 +-
 src/backend/access/rmgrdesc/xlogdesc.c    |   2 +-
 src/backend/access/transam/multixact.c    | 390 ++++------------------
 src/backend/access/transam/xlog.c         |   2 +-
 src/backend/access/transam/xlogrecovery.c |   2 +-
 src/backend/commands/vacuum.c             |   2 +-
 src/backend/postmaster/autovacuum.c       |   4 +-
 src/bin/pg_controldata/pg_controldata.c   |   2 +-
 src/bin/pg_resetwal/pg_resetwal.c         |  30 +-
 src/bin/pg_resetwal/t/001_basic.pl        |   4 +-
 src/include/access/multixact.h            |   3 -
 src/include/access/multixact_internal.h   |  24 +-
 src/include/c.h                           |   2 +-
 13 files changed, 95 insertions(+), 376 deletions(-)
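
Reviewer note (not part of the patch): with 64-bit offsets, the members checks in this patch reduce to two plain unsigned comparisons. The sketch below is only an illustration under stated assumptions. The names prefixed with sketch_/SKETCH_ are hypothetical, the offset type is assumed to be an unsigned 64-bit integer, and the threshold value is copied from MULTIXACT_MEMBER_AUTOVAC_THRESHOLD in the diff that follows.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: the offset type becomes an unsigned 64-bit integer. */
typedef uint64_t SketchMultiXactOffset;

/* Same value as MULTIXACT_MEMBER_AUTOVAC_THRESHOLD in the patch. */
#define SKETCH_AUTOVAC_THRESHOLD UINT64_C(4000000000)

/*
 * Sanity check analogous to the one in GetNewMultiXactId(): unsigned
 * addition wraps modulo 2^64, so a sum smaller than the starting value
 * means the members space would wrap around.
 */
static inline bool
sketch_members_would_wrap(SketchMultiXactOffset next_offset, uint32_t nmembers)
{
	return next_offset + nmembers < next_offset;
}

/*
 * Decision analogous to the return value of SetOffsetVacuumLimit():
 * request autovacuum if the oldest offset is unknown, or if more than the
 * threshold number of members is currently in use.
 */
static inline bool
sketch_needs_autovacuum(bool oldest_offset_known,
						SketchMultiXactOffset next_offset,
						SketchMultiXactOffset oldest_offset)
{
	return !oldest_offset_known ||
		(next_offset - oldest_offset > SKETCH_AUTOVAC_THRESHOLD);
}
```

For example, sketch_members_would_wrap(UINT64_MAX - 1, 2) is true, while sketch_needs_autovacuum(true, 4000000001, 1) is false because the difference does not exceed the threshold.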

diff --git a/src/backend/access/rmgrdesc/mxactdesc.c b/src/backend/access/rmgrdesc/mxactdesc.c
index 3ca0582db36..052dd0a4ce5 100644
--- a/src/backend/access/rmgrdesc/mxactdesc.c
+++ b/src/backend/access/rmgrdesc/mxactdesc.c
@@ -65,7 +65,7 @@ multixact_desc(StringInfo buf, XLogReaderState *record)
 		xl_multixact_create *xlrec = (xl_multixact_create *) rec;
 		int			i;
 
-		appendStringInfo(buf, "%u offset %u nmembers %d: ", xlrec->mid,
+		appendStringInfo(buf, "%u offset %" PRIu64 " nmembers %d: ", xlrec->mid,
 						 xlrec->moff, xlrec->nmembers);
 		for (i = 0; i < xlrec->nmembers; i++)
 			out_member(buf, &xlrec->members[i]);
@@ -74,7 +74,7 @@ multixact_desc(StringInfo buf, XLogReaderState *record)
 	{
 		xl_multixact_truncate *xlrec = (xl_multixact_truncate *) rec;
 
-		appendStringInfo(buf, "offsets [%u, %u), members [%u, %u)",
+		appendStringInfo(buf, "offsets [%u, %u), members [%" PRIu64 ", %" PRIu64 ")",
 						 xlrec->startTruncOff, xlrec->endTruncOff,
 						 xlrec->startTruncMemb, xlrec->endTruncMemb);
 	}
diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c
index cd6c2a2f650..441034f5929 100644
--- a/src/backend/access/rmgrdesc/xlogdesc.c
+++ b/src/backend/access/rmgrdesc/xlogdesc.c
@@ -66,7 +66,7 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 		CheckPoint *checkpoint = (CheckPoint *) rec;
 
 		appendStringInfo(buf, "redo %X/%08X; "
-						 "tli %u; prev tli %u; fpw %s; wal_level %s; xid %u:%u; oid %u; multi %u; offset %u; "
+						 "tli %u; prev tli %u; fpw %s; wal_level %s; xid %u:%u; oid %u; multi %u; offset %" PRIu64 "; "
 						 "oldest xid %u in DB %u; oldest multi %u in DB %u; "
 						 "oldest/newest commit timestamp xid: %u/%u; "
 						 "oldest running xid %u; %s",
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index acb2a6788f9..34a745c07be 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -69,6 +69,7 @@
 #include "postgres.h"
 
 #include "access/multixact.h"
+#include "access/multixact_internal.h"
 #include "access/slru.h"
 #include "access/twophase.h"
 #include "access/twophase_rmgr.h"
@@ -89,10 +90,14 @@
 #include "utils/memutils.h"
 
 
-/* Multixact members wraparound thresholds. */
-#define MULTIXACT_MEMBER_SAFE_THRESHOLD		(MaxMultiXactOffset / 2)
-#define MULTIXACT_MEMBER_DANGER_THRESHOLD	\
-	(MaxMultiXactOffset - MaxMultiXactOffset / 4)
+/*
+ * Multixact members warning threshold.
+ *
+ * If the difference between nextOffset and oldestOffset exceeds this value,
+ * we trigger autovacuum in order to release disk space consumed by the
+ * members SLRU.
+ */
+#define MULTIXACT_MEMBER_AUTOVAC_THRESHOLD		UINT64CONST(4000000000)
 
 static inline MultiXactId
 PreviousMultiXactId(MultiXactId multi)
@@ -149,9 +154,6 @@ typedef struct MultiXactStateData
 	MultiXactId multiStopLimit;
 	MultiXactId multiWrapLimit;
 
-	/* support for members anti-wraparound measures */
-	MultiXactOffset offsetStopLimit;	/* known if oldestOffsetKnown */
-
 	/*
 	 * This is used to sleep until a multixact offset is written when we want
 	 * to create the next one.
@@ -282,8 +284,6 @@ static bool MultiXactOffsetPrecedes(MultiXactOffset offset1,
 									MultiXactOffset offset2);
 static void ExtendMultiXactOffset(MultiXactId multi);
 static void ExtendMultiXactMember(MultiXactOffset offset, int nmembers);
-static bool MultiXactOffsetWouldWrap(MultiXactOffset boundary,
-									 MultiXactOffset start, uint32 distance);
 static bool SetOffsetVacuumLimit(bool is_startup);
 static bool find_multixact_start(MultiXactId multi, MultiXactOffset *result);
 static void WriteMTruncateXlogRec(Oid oldestMultiDB,
@@ -1023,90 +1023,22 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 	ExtendMultiXactOffset(result);
 
 	/*
-	 * Reserve the members space, similarly to above.  Also, be careful not to
-	 * return zero as the starting offset for any multixact. See
-	 * GetMultiXactIdMembers() for motivation.
+	 * Reserve the members space, similarly to above.
 	 */
 	nextOffset = MultiXactState->nextOffset;
-	if (nextOffset == 0)
-	{
-		*offset = 1;
-		nmembers++;				/* allocate member slot 0 too */
-	}
-	else
-		*offset = nextOffset;
-
-	/*----------
-	 * Protect against overrun of the members space as well, with the
-	 * following rules:
-	 *
-	 * If we're past offsetStopLimit, refuse to generate more multis.
-	 * If we're close to offsetStopLimit, emit a warning.
-	 *
-	 * Arbitrarily, we start emitting warnings when we're 20 segments or less
-	 * from offsetStopLimit.
-	 *
-	 * Note we haven't updated the shared state yet, so if we fail at this
-	 * point, the multixact ID we grabbed can still be used by the next guy.
-	 *
-	 * Note that there is no point in forcing autovacuum runs here: the
-	 * multixact freeze settings would have to be reduced for that to have any
-	 * effect.
-	 *----------
-	 */
-#define OFFSET_WARN_SEGMENTS	20
-	if (MultiXactState->oldestOffsetKnown &&
-		MultiXactOffsetWouldWrap(MultiXactState->offsetStopLimit, nextOffset,
-								 nmembers))
-	{
-		/* see comment in the corresponding offsets wraparound case */
-		SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER);
-
-		ereport(ERROR,
-				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg("multixact \"members\" limit exceeded"),
-				 errdetail_plural("This command would create a multixact with %u members, but the remaining space is only enough for %u member.",
-								  "This command would create a multixact with %u members, but the remaining space is only enough for %u members.",
-								  MultiXactState->offsetStopLimit - nextOffset - 1,
-								  nmembers,
-								  MultiXactState->offsetStopLimit - nextOffset - 1),
-				 errhint("Execute a database-wide VACUUM in database with OID %u with reduced \"vacuum_multixact_freeze_min_age\" and \"vacuum_multixact_freeze_table_age\" settings.",
-						 MultiXactState->oldestMultiXactDB)));
-	}
 
 	/*
-	 * Check whether we should kick autovacuum into action, to prevent members
-	 * wraparound. NB we use a much larger window to trigger autovacuum than
-	 * just the warning limit. The warning is just a measure of last resort -
-	 * this is in line with GetNewTransactionId's behaviour.
+	 * Offsets are 64-bit integers and will never wrap around.  Firstly, it
+	 * would take an unrealistic amount of time and resources to consume 2^64
+	 * offsets.  Secondly, multixid creation is WAL-logged, so you would run
+	 * out of LSNs before reaching offset wraparound.  Nevertheless, check for
+	 * wraparound as a sanity check.
 	 */
-	if (!MultiXactState->oldestOffsetKnown ||
-		(MultiXactState->nextOffset - MultiXactState->oldestOffset
-		 > MULTIXACT_MEMBER_SAFE_THRESHOLD))
-	{
-		/*
-		 * To avoid swamping the postmaster with signals, we issue the autovac
-		 * request only when crossing a segment boundary. With default
-		 * compilation settings that's roughly after 50k members.  This still
-		 * gives plenty of chances before we get into real trouble.
-		 */
-		if ((MXOffsetToMemberPage(nextOffset) / SLRU_PAGES_PER_SEGMENT) !=
-			(MXOffsetToMemberPage(nextOffset + nmembers) / SLRU_PAGES_PER_SEGMENT))
-			SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER);
-	}
-
-	if (MultiXactState->oldestOffsetKnown &&
-		MultiXactOffsetWouldWrap(MultiXactState->offsetStopLimit,
-								 nextOffset,
-								 nmembers + MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT * OFFSET_WARN_SEGMENTS))
-		ereport(WARNING,
+	if (nextOffset + nmembers < nextOffset)
+		ereport(ERROR,
 				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg_plural("database with OID %u must be vacuumed before %d more multixact member is used",
-							   "database with OID %u must be vacuumed before %d more multixact members are used",
-							   MultiXactState->offsetStopLimit - nextOffset + nmembers,
-							   MultiXactState->oldestMultiXactDB,
-							   MultiXactState->offsetStopLimit - nextOffset + nmembers),
-				 errhint("Execute a database-wide VACUUM in that database with reduced \"vacuum_multixact_freeze_min_age\" and \"vacuum_multixact_freeze_table_age\" settings.")));
+				 errmsg("MultiXact members would wrap around")));
+	*offset = nextOffset;
 
 	ExtendMultiXactMember(nextOffset, nmembers);
 
@@ -1127,8 +1059,7 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 	 * the next iteration.  But note that nextMXact may be InvalidMultiXactId
 	 * or the first value on a segment-beginning page after this routine
 	 * exits, so anyone else looking at the variable must be prepared to deal
-	 * with either case.  Similarly, nextOffset may be zero, but we won't use
-	 * that as the actual start offset of the next multixact.
+	 * with either case.
 	 */
 	(MultiXactState->nextMXact)++;
 
@@ -1136,7 +1067,8 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 
 	LWLockRelease(MultiXactGenLock);
 
-	debug_elog4(DEBUG2, "GetNew: returning %u offset %u", result, *offset);
+	debug_elog4(DEBUG2, "GetNew: returning %u offset %" PRIu64, result,
+				*offset);
 	return result;
 }
 
@@ -1178,7 +1110,6 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
 	MultiXactOffset *offptr;
 	MultiXactOffset offset;
 	int			length;
-	int			truelength;
 	MultiXactId oldestMXact;
 	MultiXactId nextMXact;
 	MultiXactId tmpMXact;
@@ -1277,16 +1208,7 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
 	 * we have just for this; the process in charge will signal the CV as soon
 	 * as it has finished writing the multixact offset.
 	 *
-	 * 3. Because GetNewMultiXactId increments offset zero to offset one to
-	 * handle case #2, there is an ambiguity near the point of offset
-	 * wraparound.  If we see next multixact's offset is one, is that our
-	 * multixact's actual endpoint, or did it end at zero with a subsequent
-	 * increment?  We handle this using the knowledge that if the zero'th
-	 * member slot wasn't filled, it'll contain zero, and zero isn't a valid
-	 * transaction ID so it can't be a multixact member.  Therefore, if we
-	 * read a zero from the members array, just ignore it.
-	 *
-	 * This is all pretty messy, but the mess occurs only in infrequent corner
+	 * This is a little messy, but the mess occurs only in infrequent corner
 	 * cases, so it seems better than holding the MultiXactGenLock for a long
 	 * time on every multixact creation.
 	 */
@@ -1372,6 +1294,9 @@ retry:
 	LWLockRelease(lock);
 	lock = NULL;
 
+	/* A multixid with zero members should not happen */
+	Assert(length > 0);
+
 	/*
 	 * If we slept above, clean up state; it's no longer needed.
 	 */
@@ -1380,7 +1305,6 @@ retry:
 
 	ptr = (MultiXactMember *) palloc(length * sizeof(MultiXactMember));
 
-	truelength = 0;
 	prev_pageno = -1;
 	for (int i = 0; i < length; i++, offset++)
 	{
@@ -1417,37 +1341,27 @@ retry:
 
 		xactptr = (TransactionId *)
 			(MultiXactMemberCtl->shared->page_buffer[slotno] + memberoff);
-
-		if (!TransactionIdIsValid(*xactptr))
-		{
-			/* Corner case 3: we must be looking at unused slot zero */
-			Assert(offset == 0);
-			continue;
-		}
+		Assert(TransactionIdIsValid(*xactptr));
 
 		flagsoff = MXOffsetToFlagsOffset(offset);
 		bshift = MXOffsetToFlagsBitShift(offset);
 		flagsptr = (uint32 *) (MultiXactMemberCtl->shared->page_buffer[slotno] + flagsoff);
 
-		ptr[truelength].xid = *xactptr;
-		ptr[truelength].status = (*flagsptr >> bshift) & MXACT_MEMBER_XACT_BITMASK;
-		truelength++;
+		ptr[i].xid = *xactptr;
+		ptr[i].status = (*flagsptr >> bshift) & MXACT_MEMBER_XACT_BITMASK;
 	}
 
 	LWLockRelease(lock);
 
-	/* A multixid with zero members should not happen */
-	Assert(truelength > 0);
-
 	/*
 	 * Copy the result into the local cache.
 	 */
-	mXactCachePut(multi, truelength, ptr);
+	mXactCachePut(multi, length, ptr);
 
 	debug_elog3(DEBUG2, "GetMembers: no cache for %s",
-				mxid_to_string(multi, truelength, ptr));
+				mxid_to_string(multi, length, ptr));
 	*members = ptr;
-	return truelength;
+	return length;
 }
 
 /*
@@ -1854,7 +1768,7 @@ MultiXactShmemInit(void)
 				  "pg_multixact/members", LWTRANCHE_MULTIXACTMEMBER_BUFFER,
 				  LWTRANCHE_MULTIXACTMEMBER_SLRU,
 				  SYNC_HANDLER_MULTIXACT_MEMBER,
-				  false);
+				  true);
 	/* doesn't call SimpleLruTruncate() or meet criteria for unit tests */
 
 	/* Initialize our shared state struct */
@@ -2031,7 +1945,6 @@ TrimMultiXact(void)
 		slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, nextMXact);
 		offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
 		offptr += entryno;
-
 		MemSet(offptr, 0, BLCKSZ - (entryno * sizeof(MultiXactOffset)));
 
 		MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
@@ -2104,7 +2017,7 @@ MultiXactGetCheckptMulti(bool is_shutdown,
 	LWLockRelease(MultiXactGenLock);
 
 	debug_elog6(DEBUG2,
-				"MultiXact: checkpoint is nextMulti %u, nextOffset %u, oldestMulti %u in DB %u",
+				"MultiXact: checkpoint is nextMulti %u, nextOffset %" PRIu64 ", oldestMulti %u in DB %u",
 				*nextMulti, *nextMultiOffset, *oldestMulti, *oldestMultiDB);
 }
 
@@ -2139,7 +2052,7 @@ void
 MultiXactSetNextMXact(MultiXactId nextMulti,
 					  MultiXactOffset nextMultiOffset)
 {
-	debug_elog4(DEBUG2, "MultiXact: setting next multi to %u offset %u",
+	debug_elog4(DEBUG2, "MultiXact: setting next multi to %u offset %" PRIu64,
 				nextMulti, nextMultiOffset);
 	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
 	MultiXactState->nextMXact = nextMulti;
@@ -2330,7 +2243,7 @@ MultiXactAdvanceNextMXact(MultiXactId minMulti,
 	}
 	if (MultiXactOffsetPrecedes(MultiXactState->nextOffset, minMultiOffset))
 	{
-		debug_elog3(DEBUG2, "MultiXact: setting next offset to %u",
+		debug_elog3(DEBUG2, "MultiXact: setting next offset to %" PRIu64,
 					minMultiOffset);
 		MultiXactState->nextOffset = minMultiOffset;
 	}
@@ -2432,23 +2345,8 @@ ExtendMultiXactMember(MultiXactOffset offset, int nmembers)
 			LWLockRelease(lock);
 		}
 
-		/*
-		 * Compute the number of items till end of current page.  Careful: if
-		 * addition of unsigned ints wraps around, we're at the last page of
-		 * the last segment; since that page holds a different number of items
-		 * than other pages, we need to do it differently.
-		 */
-		if (offset + MAX_MEMBERS_IN_LAST_MEMBERS_PAGE < offset)
-		{
-			/*
-			 * This is the last page of the last segment; we can compute the
-			 * number of items left to allocate in it without modulo
-			 * arithmetic.
-			 */
-			difference = MaxMultiXactOffset - offset + 1;
-		}
-		else
-			difference = MULTIXACT_MEMBERS_PER_PAGE - offset % MULTIXACT_MEMBERS_PER_PAGE;
+		/* Compute the number of items till end of current page. */
+		difference = MULTIXACT_MEMBERS_PER_PAGE - offset % MULTIXACT_MEMBERS_PER_PAGE;
 
 		/*
 		 * Advance to next page, taking care to properly handle the wraparound
@@ -2514,15 +2412,14 @@ GetOldestMultiXactId(void)
 }
 
 /*
- * Determine how aggressively we need to vacuum in order to prevent member
- * wraparound.
+ * Determine if we need to vacuum to keep the size of the members SLRU in
+ * check.
  *
  * To do so determine what's the oldest member offset and install the limit
  * info in MultiXactState, where it can be used to prevent overrun of old data
  * in the members SLRU area.
  *
- * The return value is true if emergency autovacuum is required and false
- * otherwise.
+ * The return value is true if autovacuum is required and false otherwise.
  */
 static bool
 SetOffsetVacuumLimit(bool is_startup)
@@ -2534,8 +2431,6 @@ SetOffsetVacuumLimit(bool is_startup)
 	MultiXactOffset nextOffset;
 	bool		oldestOffsetKnown = false;
 	bool		prevOldestOffsetKnown;
-	MultiXactOffset offsetStopLimit = 0;
-	MultiXactOffset prevOffsetStopLimit;
 
 	/*
 	 * NB: Have to prevent concurrent truncation, we might otherwise try to
@@ -2550,7 +2445,6 @@ SetOffsetVacuumLimit(bool is_startup)
 	nextOffset = MultiXactState->nextOffset;
 	prevOldestOffsetKnown = MultiXactState->oldestOffsetKnown;
 	prevOldestOffset = MultiXactState->oldestOffset;
-	prevOffsetStopLimit = MultiXactState->offsetStopLimit;
 	Assert(MultiXactState->finishedStartup);
 	LWLockRelease(MultiXactGenLock);
 
@@ -2581,13 +2475,9 @@ SetOffsetVacuumLimit(bool is_startup)
 		oldestOffsetKnown =
 			find_multixact_start(oldestMultiXactId, &oldestOffset);
 
-		if (oldestOffsetKnown)
-			ereport(DEBUG1,
-					(errmsg_internal("oldest MultiXactId member is at offset %u",
-									 oldestOffset)));
-		else
+		if (!oldestOffsetKnown)
 			ereport(LOG,
-					(errmsg("MultiXact member wraparound protections are disabled because oldest checkpointed MultiXact %u does not exist on disk",
+					(errmsg("oldest checkpointed MultiXact %u does not exist on disk",
 							oldestMultiXactId)));
 	}
 
@@ -2597,97 +2487,32 @@ SetOffsetVacuumLimit(bool is_startup)
 	 * If we can, compute limits (and install them MultiXactState) to prevent
 	 * overrun of old data in the members SLRU area. We can only do so if the
 	 * oldest offset is known though.
+	 *
+	 * FIXME: Is !oldestOffsetKnown possible anymore? At least update the comment:
+	 * we won't overrun members anymore.
 	 */
-	if (oldestOffsetKnown)
-	{
-		/* move back to start of the corresponding segment */
-		offsetStopLimit = oldestOffset - (oldestOffset %
-										  (MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT));
-
-		/* always leave one segment before the wraparound point */
-		offsetStopLimit -= (MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT);
-
-		if (!prevOldestOffsetKnown && !is_startup)
-			ereport(LOG,
-					(errmsg("MultiXact member wraparound protections are now enabled")));
-
-		ereport(DEBUG1,
-				(errmsg_internal("MultiXact member stop limit is now %u based on MultiXact %u",
-								 offsetStopLimit, oldestMultiXactId)));
-	}
-	else if (prevOldestOffsetKnown)
+	if (prevOldestOffsetKnown)
 	{
 		/*
 		 * If we failed to get the oldest offset this time, but we have a
 		 * value from a previous pass through this function, use the old
-		 * values rather than automatically forcing an emergency autovacuum
-		 * cycle again.
+		 * values rather than automatically forcing an autovacuum cycle again.
 		 */
 		oldestOffset = prevOldestOffset;
 		oldestOffsetKnown = true;
-		offsetStopLimit = prevOffsetStopLimit;
 	}
 
 	/* Install the computed values */
 	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
 	MultiXactState->oldestOffset = oldestOffset;
 	MultiXactState->oldestOffsetKnown = oldestOffsetKnown;
-	MultiXactState->offsetStopLimit = offsetStopLimit;
 	LWLockRelease(MultiXactGenLock);
 
 	/*
-	 * Do we need an emergency autovacuum?	If we're not sure, assume yes.
+	 * Do we need autovacuum?	If we're not sure, assume yes.
 	 */
 	return !oldestOffsetKnown ||
-		(nextOffset - oldestOffset > MULTIXACT_MEMBER_SAFE_THRESHOLD);
-}
-
-/*
- * Return whether adding "distance" to "start" would move past "boundary".
- *
- * We use this to determine whether the addition is "wrapping around" the
- * boundary point, hence the name.  The reason we don't want to use the regular
- * 2^31-modulo arithmetic here is that we want to be able to use the whole of
- * the 2^32-1 space here, allowing for more multixacts than would fit
- * otherwise.
- */
-static bool
-MultiXactOffsetWouldWrap(MultiXactOffset boundary, MultiXactOffset start,
-						 uint32 distance)
-{
-	MultiXactOffset finish;
-
-	/*
-	 * Note that offset number 0 is not used (see GetMultiXactIdMembers), so
-	 * if the addition wraps around the UINT_MAX boundary, skip that value.
-	 */
-	finish = start + distance;
-	if (finish < start)
-		finish++;
-
-	/*-----------------------------------------------------------------------
-	 * When the boundary is numerically greater than the starting point, any
-	 * value numerically between the two is not wrapped:
-	 *
-	 *	<----S----B---->
-	 *	[---)			 = F wrapped past B (and UINT_MAX)
-	 *		 [---)		 = F not wrapped
-	 *			  [----] = F wrapped past B
-	 *
-	 * When the boundary is numerically less than the starting point (i.e. the
-	 * UINT_MAX wraparound occurs somewhere in between) then all values in
-	 * between are wrapped:
-	 *
-	 *	<----B----S---->
-	 *	[---)			 = F not wrapped past B (but wrapped past UINT_MAX)
-	 *		 [---)		 = F wrapped past B (and UINT_MAX)
-	 *			  [----] = F not wrapped
-	 *-----------------------------------------------------------------------
-	 */
-	if (start < boundary)
-		return finish >= boundary || finish < start;
-	else
-		return finish >= boundary && finish < start;
+		(nextOffset - oldestOffset > MULTIXACT_MEMBER_AUTOVAC_THRESHOLD);
 }
 
 /*
@@ -2727,6 +2552,7 @@ find_multixact_start(MultiXactId multi, MultiXactOffset *result)
 	offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
 	offptr += entryno;
 	offset = *offptr;
+
 	LWLockRelease(SimpleLruGetBankLock(MultiXactOffsetCtl, pageno));
 
 	*result = offset;
@@ -2774,73 +2600,6 @@ GetMultiXactInfo(uint32 *multixacts, MultiXactOffset *members,
 	return true;
 }
 
-/*
- * Multixact members can be removed once the multixacts that refer to them
- * are older than every datminmxid.  autovacuum_multixact_freeze_max_age and
- * vacuum_multixact_freeze_table_age work together to make sure we never have
- * too many multixacts; we hope that, at least under normal circumstances,
- * this will also be sufficient to keep us from using too many offsets.
- * However, if the average multixact has many members, we might exhaust the
- * members space while still using few enough members that these limits fail
- * to trigger relminmxid advancement by VACUUM.  At that point, we'd have no
- * choice but to start failing multixact-creating operations with an error.
- *
- * To prevent that, if more than a threshold portion of the members space is
- * used, we effectively reduce autovacuum_multixact_freeze_max_age and
- * to a value just less than the number of multixacts in use.  We hope that
- * this will quickly trigger autovacuuming on the table or tables with the
- * oldest relminmxid, thus allowing datminmxid values to advance and removing
- * some members.
- *
- * As the fraction of the member space currently in use grows, we become
- * more aggressive in clamping this value.  That not only causes autovacuum
- * to ramp up, but also makes any manual vacuums the user issues more
- * aggressive.  This happens because vacuum_get_cutoffs() will clamp the
- * freeze table and the minimum freeze age cutoffs based on the effective
- * autovacuum_multixact_freeze_max_age this function returns.  In the worst
- * case, we'll claim the freeze_max_age to zero, and every vacuum of any
- * table will freeze every multixact.
- */
-int
-MultiXactMemberFreezeThreshold(void)
-{
-	MultiXactOffset members;
-	uint32		multixacts;
-	uint32		victim_multixacts;
-	double		fraction;
-	int			result;
-	MultiXactId oldestMultiXactId;
-	MultiXactOffset oldestOffset;
-
-	/* If we can't determine member space utilization, assume the worst. */
-	if (!GetMultiXactInfo(&multixacts, &members, &oldestMultiXactId, &oldestOffset))
-		return 0;
-
-	/* If member space utilization is low, no special action is required. */
-	if (members <= MULTIXACT_MEMBER_SAFE_THRESHOLD)
-		return autovacuum_multixact_freeze_max_age;
-
-	/*
-	 * Compute a target for relminmxid advancement.  The number of multixacts
-	 * we try to eliminate from the system is based on how far we are past
-	 * MULTIXACT_MEMBER_SAFE_THRESHOLD.
-	 */
-	fraction = (double) (members - MULTIXACT_MEMBER_SAFE_THRESHOLD) /
-		(MULTIXACT_MEMBER_DANGER_THRESHOLD - MULTIXACT_MEMBER_SAFE_THRESHOLD);
-	victim_multixacts = multixacts * fraction;
-
-	/* fraction could be > 1.0, but lowest possible freeze age is zero */
-	if (victim_multixacts > multixacts)
-		return 0;
-	result = multixacts - victim_multixacts;
-
-	/*
-	 * Clamp to autovacuum_multixact_freeze_max_age, so that we never make
-	 * autovacuum less aggressive than it would otherwise be.
-	 */
-	return Min(result, autovacuum_multixact_freeze_max_age);
-}
-
 typedef struct mxtruncinfo
 {
 	int64		earliestExistingPage;
@@ -2867,36 +2626,12 @@ SlruScanDirCbFindEarliest(SlruCtl ctl, char *filename, int64 segpage, void *data
 
 /*
  * Delete members segments [oldest, newOldest)
- *
- * The members SLRU can, in contrast to the offsets one, be filled to almost
- * the full range at once. This means SimpleLruTruncate() can't trivially be
- * used - instead the to-be-deleted range is computed using the offsets
- * SLRU. C.f. TruncateMultiXact().
  */
 static void
 PerformMembersTruncation(MultiXactOffset oldestOffset, MultiXactOffset newOldestOffset)
 {
-	const int64 maxsegment = MXOffsetToMemberSegment(MaxMultiXactOffset);
-	int64		startsegment = MXOffsetToMemberSegment(oldestOffset);
-	int64		endsegment = MXOffsetToMemberSegment(newOldestOffset);
-	int64		segment = startsegment;
-
-	/*
-	 * Delete all the segments but the last one. The last segment can still
-	 * contain, possibly partially, valid data.
-	 */
-	while (segment != endsegment)
-	{
-		elog(DEBUG2, "truncating multixact members segment %" PRIx64,
-			 segment);
-		SlruDeleteSegment(MultiXactMemberCtl, segment);
-
-		/* move to next segment, handling wraparound correctly */
-		if (segment == maxsegment)
-			segment = 0;
-		else
-			segment += 1;
-	}
+	SimpleLruTruncate(MultiXactMemberCtl,
+					  MXOffsetToMemberPage(newOldestOffset));
 }
 
 /*
@@ -3040,7 +2775,7 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 
 	elog(DEBUG1, "performing multixact truncation: "
 		 "offsets [%u, %u), offsets segments [%" PRIx64 ", %" PRIx64 "), "
-		 "members [%u, %u), members segments [%" PRIx64 ", %" PRIx64 ")",
+		 "members [%" PRIu64 ", %" PRIu64 "), members segments [%" PRIx64 ", %" PRIx64 ")",
 		 oldestMulti, newOldestMulti,
 		 MultiXactIdToOffsetSegment(oldestMulti),
 		 MultiXactIdToOffsetSegment(newOldestMulti),
@@ -3120,20 +2855,13 @@ MultiXactOffsetPagePrecedes(int64 page1, int64 page2)
 
 /*
  * Decide whether a MultiXactMember page number is "older" for truncation
- * purposes.  There is no "invalid offset number" so use the numbers verbatim.
+ * purposes.  There is no "invalid offset number" and members never wrap
+ * around, so use the numbers verbatim.
  */
 static bool
 MultiXactMemberPagePrecedes(int64 page1, int64 page2)
 {
-	MultiXactOffset offset1;
-	MultiXactOffset offset2;
-
-	offset1 = ((MultiXactOffset) page1) * MULTIXACT_MEMBERS_PER_PAGE;
-	offset2 = ((MultiXactOffset) page2) * MULTIXACT_MEMBERS_PER_PAGE;
-
-	return (MultiXactOffsetPrecedes(offset1, offset2) &&
-			MultiXactOffsetPrecedes(offset1,
-									offset2 + MULTIXACT_MEMBERS_PER_PAGE - 1));
+	return page1 < page2;
 }
 
 /*
@@ -3171,7 +2899,7 @@ MultiXactIdPrecedesOrEquals(MultiXactId multi1, MultiXactId multi2)
 static bool
 MultiXactOffsetPrecedes(MultiXactOffset offset1, MultiXactOffset offset2)
 {
-	int32		diff = (int32) (offset1 - offset2);
+	int64		diff = (int64) (offset1 - offset2);
 
 	return (diff < 0);
 }
@@ -3268,7 +2996,7 @@ multixact_redo(XLogReaderState *record)
 
 		elog(DEBUG1, "replaying multixact truncation: "
 			 "offsets [%u, %u), offsets segments [%" PRIx64 ", %" PRIx64 "), "
-			 "members [%u, %u), members segments [%" PRIx64 ", %" PRIx64 ")",
+			 "members [%" PRIu64 ", %" PRIu64 "), members segments [%" PRIx64 ", %" PRIx64 ")",
 			 xlrec.startTruncOff, xlrec.endTruncOff,
 			 MultiXactIdToOffsetSegment(xlrec.startTruncOff),
 			 MultiXactIdToOffsetSegment(xlrec.endTruncOff),
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 22d0a2e8c3a..ef405d66b3b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5139,7 +5139,7 @@ BootStrapXLOG(uint32 data_checksum_version)
 		FullTransactionIdFromEpochAndXid(0, FirstNormalTransactionId);
 	checkPoint.nextOid = FirstGenbkiObjectId;
 	checkPoint.nextMulti = FirstMultiXactId;
-	checkPoint.nextMultiOffset = 0;
+	checkPoint.nextMultiOffset = 1;
 	checkPoint.oldestXid = FirstNormalTransactionId;
 	checkPoint.oldestXidDB = Template1DbOid;
 	checkPoint.oldestMulti = FirstMultiXactId;
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index eddc22fc5ad..5dd25cf2dfc 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -886,7 +886,7 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 							 U64FromFullTransactionId(checkPoint.nextXid),
 							 checkPoint.nextOid)));
 	ereport(DEBUG1,
-			(errmsg_internal("next MultiXactId: %u; next MultiXactOffset: %u",
+			(errmsg_internal("next MultiXactId: %u; next MultiXactOffset: %" PRIu64,
 							 checkPoint.nextMulti, checkPoint.nextMultiOffset)));
 	ereport(DEBUG1,
 			(errmsg_internal("oldest unfrozen transaction ID: %u, in database %u",
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index ed03e3bd50d..259ef60bd31 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -1147,7 +1147,7 @@ vacuum_get_cutoffs(Relation rel, const VacuumParams params,
 	 * normally autovacuum_multixact_freeze_max_age, but may be less if we are
 	 * short of multixact member space.
 	 */
-	effective_multixact_freeze_max_age = MultiXactMemberFreezeThreshold();
+	effective_multixact_freeze_max_age = autovacuum_multixact_freeze_max_age;
 
 	/*
 	 * Almost ready to set freeze output parameters; check if OldestXmin or
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index ed19c74bb19..34909ee54ff 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -1151,7 +1151,7 @@ do_start_worker(void)
 
 	/* Also determine the oldest datminmxid we will consider. */
 	recentMulti = ReadNextMultiXactId();
-	multiForceLimit = recentMulti - MultiXactMemberFreezeThreshold();
+	multiForceLimit = recentMulti - autovacuum_multixact_freeze_max_age;
 	if (multiForceLimit < FirstMultiXactId)
 		multiForceLimit -= FirstMultiXactId;
 
@@ -1939,7 +1939,7 @@ do_autovacuum(void)
 	 * normally autovacuum_multixact_freeze_max_age, but may be less if we are
 	 * short of multixact member space.
 	 */
-	effective_multixact_freeze_max_age = MultiXactMemberFreezeThreshold();
+	effective_multixact_freeze_max_age = autovacuum_multixact_freeze_max_age;
 
 	/*
 	 * Find the pg_database entry and select the default freeze ages. We use
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index 30ad46912e1..a4060309ae0 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -271,7 +271,7 @@ main(int argc, char *argv[])
 		   ControlFile->checkPointCopy.nextOid);
 	printf(_("Latest checkpoint's NextMultiXactId:  %u\n"),
 		   ControlFile->checkPointCopy.nextMulti);
-	printf(_("Latest checkpoint's NextMultiOffset:  %u\n"),
+	printf(_("Latest checkpoint's NextMultiOffset:  %" PRIu64 "\n"),
 		   ControlFile->checkPointCopy.nextMultiOffset);
 	printf(_("Latest checkpoint's oldestXID:        %u\n"),
 		   ControlFile->checkPointCopy.oldestXid);
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index a31e7643cf0..7c6c2741a17 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -92,6 +92,7 @@ static void KillExistingArchiveStatus(void);
 static void KillExistingWALSummaries(void);
 static void WriteEmptyXLOG(void);
 static void usage(void);
+static uint64 strtou64_strict(const char *s, char **endptr, int base);
 
 
 int
@@ -120,7 +121,6 @@ main(int argc, char *argv[])
 	MultiXactId set_oldestmxid = 0;
 	char	   *endptr;
 	char	   *endptr2;
-	int64		tmpi64;
 	char	   *DataDir = NULL;
 	char	   *log_fname = NULL;
 	int			fd;
@@ -269,17 +269,14 @@ main(int argc, char *argv[])
 
 			case 'O':
 				errno = 0;
-				tmpi64 = strtoi64(optarg, &endptr, 0);
+				set_mxoff = strtou64_strict(optarg, &endptr, 0);
 				if (endptr == optarg || *endptr != '\0' || errno != 0)
 				{
 					pg_log_error("invalid argument for option %s", "-O");
 					pg_log_error_hint("Try \"%s --help\" for more information.", progname);
 					exit(1);
 				}
-				if (tmpi64 < 0 || tmpi64 > (int64) MaxMultiXactOffset)
-					pg_fatal("multitransaction offset (-O) must be between 0 and %u", MaxMultiXactOffset);
 
-				set_mxoff = (MultiXactOffset) tmpi64;
 				mxoff_given = true;
 				break;
 
@@ -749,7 +746,7 @@ PrintControlValues(bool guessed)
 		   ControlFile.checkPointCopy.nextOid);
 	printf(_("Latest checkpoint's NextMultiXactId:  %u\n"),
 		   ControlFile.checkPointCopy.nextMulti);
-	printf(_("Latest checkpoint's NextMultiOffset:  %u\n"),
+	printf(_("Latest checkpoint's NextMultiOffset:  %" PRIu64 "\n"),
 		   ControlFile.checkPointCopy.nextMultiOffset);
 	printf(_("Latest checkpoint's oldestXID:        %u\n"),
 		   ControlFile.checkPointCopy.oldestXid);
@@ -825,7 +822,7 @@ PrintNewControlValues(void)
 
 	if (mxoff_given)
 	{
-		printf(_("NextMultiOffset:                      %u\n"),
+		printf(_("NextMultiOffset:                      %" PRIu64 "\n"),
 			   ControlFile.checkPointCopy.nextMultiOffset);
 	}
 
@@ -1210,3 +1207,22 @@ usage(void)
 	printf(_("\nReport bugs to <%s>.\n"), PACKAGE_BUGREPORT);
 	printf(_("%s home page: <%s>\n"), PACKAGE_NAME, PACKAGE_URL);
 }
+
+/* Like strtou64(), but negative values are not accepted. */
+static uint64
+strtou64_strict(const char *s, char **endptr, int base)
+{
+	/* skip leading whitespace */
+	while (isspace((unsigned char) *s))
+		s++;
+
+	/* reject negative values */
+	if (*s == '-')
+	{
+		*endptr = (char *) s;
+		errno = ERANGE;
+		return UINT64_MAX;
+	}
+
+	return strtou64(s, endptr, base);
+}
diff --git a/src/bin/pg_resetwal/t/001_basic.pl b/src/bin/pg_resetwal/t/001_basic.pl
index 90ecb8afe18..5a175e285d1 100644
--- a/src/bin/pg_resetwal/t/001_basic.pl
+++ b/src/bin/pg_resetwal/t/001_basic.pl
@@ -145,7 +145,7 @@ command_fails_like(
 	'fails with incorrect -O option');
 command_fails_like(
 	[ 'pg_resetwal', '-O' => '-1', $node->data_dir ],
-	qr/must be between 0 and 4294967295/,
+	qr/error: invalid argument for option -O/,
 	'fails with -O value -1');
 # --wal-segsize
 command_fails_like(
@@ -215,7 +215,7 @@ push @cmd,
   sprintf("%d,%d", hex($files[0]) == 0 ? 3 : hex($files[0]), hex($files[-1]));
 
 @files = get_slru_files('pg_multixact/offsets');
-$mult = 32 * $blcksz / 4;
+$mult = 32 * $blcksz / 8;
 # --multixact-ids argument is "new,old"
 push @cmd,
   '--multixact-ids' => sprintf("%d,%d",
diff --git a/src/include/access/multixact.h b/src/include/access/multixact.h
index 82e4bb90dd5..7d98fe0fe32 100644
--- a/src/include/access/multixact.h
+++ b/src/include/access/multixact.h
@@ -28,8 +28,6 @@
 
 #define MultiXactIdIsValid(multi) ((multi) != InvalidMultiXactId)
 
-#define MaxMultiXactOffset	((MultiXactOffset) 0xFFFFFFFF)
-
 /*
  * Possible multixact lock modes ("status").  The first four modes are for
  * tuple locks (FOR KEY SHARE, FOR SHARE, FOR NO KEY UPDATE, FOR UPDATE); the
@@ -147,7 +145,6 @@ extern void MultiXactSetNextMXact(MultiXactId nextMulti,
 extern void MultiXactAdvanceNextMXact(MultiXactId minMulti,
 									  MultiXactOffset minMultiOffset);
 extern void MultiXactAdvanceOldest(MultiXactId oldestMulti, Oid oldestMultiDB);
-extern int	MultiXactMemberFreezeThreshold(void);
 
 extern void multixact_twophase_recover(FullTransactionId fxid, uint16 info,
 									   void *recdata, uint32 len);
diff --git a/src/include/access/multixact_internal.h b/src/include/access/multixact_internal.h
index 9b56deaef31..b0227759e39 100644
--- a/src/include/access/multixact_internal.h
+++ b/src/include/access/multixact_internal.h
@@ -17,21 +17,12 @@
 
 #include "access/multixact.h"
 
-
 /*
  * Defines for MultiXactOffset page sizes.  A page is the same BLCKSZ as is
  * used everywhere else in Postgres.
- *
- * Note: because MultiXactOffsets are 32 bits and wrap around at 0xFFFFFFFF,
- * MultiXact page numbering also wraps around at
- * 0xFFFFFFFF/MULTIXACT_OFFSETS_PER_PAGE, and segment numbering at
- * 0xFFFFFFFF/MULTIXACT_OFFSETS_PER_PAGE/SLRU_PAGES_PER_SEGMENT.  We need
- * take no explicit notice of that fact in this module, except when comparing
- * segment and page numbers in TruncateMultiXact (see
- * MultiXactOffsetPagePrecedes).
  */
 
-/* We need four bytes per offset */
+/* We need 8 bytes per offset */
 #define MULTIXACT_OFFSETS_PER_PAGE (BLCKSZ / sizeof(MultiXactOffset))
 
 static inline int64
@@ -80,19 +71,6 @@ MultiXactIdToOffsetSegment(MultiXactId multi)
 #define MULTIXACT_MEMBERS_PER_PAGE	\
 	(MULTIXACT_MEMBERGROUPS_PER_PAGE * MULTIXACT_MEMBERS_PER_MEMBERGROUP)
 
-/*
- * Because the number of items per page is not a divisor of the last item
- * number (member 0xFFFFFFFF), the last segment does not use the maximum number
- * of pages, and moreover the last used page therein does not use the same
- * number of items as previous pages.  (Another way to say it is that the
- * 0xFFFFFFFF member is somewhere in the middle of the last page, so the page
- * has some empty space after that item.)
- *
- * This constant is the number of members in the last page of the last segment.
- */
-#define MAX_MEMBERS_IN_LAST_MEMBERS_PAGE \
-		((uint32) ((0xFFFFFFFF % MULTIXACT_MEMBERS_PER_PAGE) + 1))
-
 /* page in which a member is to be found */
 static inline int64
 MXOffsetToMemberPage(MultiXactOffset offset)
diff --git a/src/include/c.h b/src/include/c.h
index 757dfff4782..bc92a6f4565 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -670,7 +670,7 @@ typedef uint32 SubTransactionId;
 /* MultiXactId must be equivalent to TransactionId, to fit in t_xmax */
 typedef TransactionId MultiXactId;
 
-typedef uint32 MultiXactOffset;
+typedef uint64 MultiXactOffset;
 
 typedef uint32 CommandId;
 
-- 
2.47.3
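
A note on the -O parsing change in pg_resetwal above. The standard C strtoull() accepts a leading minus sign and negates the result in the unsigned type, so "-1" parses to the maximum value without reporting any error. That is presumably why the patch adds strtou64_strict(). Below is a minimal sketch of the same idea, assuming plain strtoull() in place of PostgreSQL's strtou64(); the names sketch_strtou64_strict and sketch_parse_offset are hypothetical and not part of the patch.

```c
#include <ctype.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical stand-in for the patch's strtou64_strict(), built on the
 * standard strtoull() rather than PostgreSQL's strtou64().
 */
static uint64_t
sketch_strtou64_strict(const char *s, char **endptr, int base)
{
	/* skip leading whitespace, as strtoull() itself would */
	while (isspace((unsigned char) *s))
		s++;

	/* reject negative input before strtoull() can silently wrap it */
	if (*s == '-')
	{
		*endptr = (char *) s;
		errno = ERANGE;
		return UINT64_MAX;
	}

	return (uint64_t) strtoull(s, endptr, base);
}

/*
 * Hypothetical caller using the same checks as the -O option handling:
 * returns 1 and stores the value on success, 0 on any parse error.
 */
static int
sketch_parse_offset(const char *arg, uint64_t *out)
{
	char	   *end;
	uint64_t	v;

	errno = 0;
	v = sketch_strtou64_strict(arg, &end, 0);
	if (end == arg || *end != '\0' || errno != 0)
		return 0;

	*out = v;
	return 1;
}
```

With these checks, "42" and "0x10" are accepted (base 0 autodetects hex), while "-1", " -1" and the empty string are rejected. This matches the updated TAP test, which now expects "invalid argument for option -O" instead of a range message.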

v24-0003-Add-pg_upgrade-for-64-bit-multixact-offsets.patch (text/x-patch; charset=UTF-8)
From 55d50e721e696a915902676e0d5ca42769b4b793 Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Fri, 24 Oct 2025 10:58:37 +0300
Subject: [PATCH v24 3/8] Add pg_upgrade for 64 bit multixact offsets

Author: Maxim Orlov <orlovmg@gmail.com>
Author: Heikki Linnakangas <hlinnaka@iki.fi>
---
 src/backend/access/transam/multixact.c |  56 -----
 src/bin/pg_upgrade/Makefile            |   3 +
 src/bin/pg_upgrade/meson.build         |   3 +
 src/bin/pg_upgrade/multixact_new.c     | 174 +++++++++++++++
 src/bin/pg_upgrade/multixact_new.h     |  23 ++
 src/bin/pg_upgrade/multixact_old.c     | 297 +++++++++++++++++++++++++
 src/bin/pg_upgrade/multixact_old.h     |  29 +++
 src/bin/pg_upgrade/pg_upgrade.c        | 108 ++++++++-
 src/bin/pg_upgrade/pg_upgrade.h        |   5 +
 src/bin/pg_upgrade/slru_io.c           | 242 ++++++++++++++++++++
 src/bin/pg_upgrade/slru_io.h           |  59 +++++
 src/tools/pgindent/typedefs.list       |   3 +
 12 files changed, 940 insertions(+), 62 deletions(-)
 create mode 100644 src/bin/pg_upgrade/multixact_new.c
 create mode 100644 src/bin/pg_upgrade/multixact_new.h
 create mode 100644 src/bin/pg_upgrade/multixact_old.c
 create mode 100644 src/bin/pg_upgrade/multixact_old.h
 create mode 100644 src/bin/pg_upgrade/slru_io.c
 create mode 100644 src/bin/pg_upgrade/slru_io.h
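
For readers skimming this patch: convert_multixacts() below walks the multixact IDs from the oldest to the next one, keeps exactly one member per multixact, and hands out new member offsets starting from 1. The sketch here models only that counting and the wraparound handling, under the assumption that the first valid multixact ID is 1 (as FirstMultiXactId is in the server). The names SKETCH_FIRST_MULTI and sketch_next_offset_after_convert are hypothetical; no SLRU I/O is modelled.

```c
#include <stdint.h>

/* Mirrors FirstMultiXactId; 0 is the invalid multixact ID. */
#define SKETCH_FIRST_MULTI 1u

/*
 * Hypothetical helper: returns the next-offset value left behind after
 * converting every multixact in [oldest_multi, next_multi), where the
 * range may wrap around the 32-bit ID space.
 */
static uint64_t
sketch_next_offset_after_convert(uint32_t oldest_multi, uint32_t next_multi)
{
	uint64_t	next_offset = 1;	/* member slot 0 is never used */

	/* handle wraparound of the end point */
	if (next_multi < SKETCH_FIRST_MULTI)
		next_multi = SKETCH_FIRST_MULTI;

	for (uint32_t multi = oldest_multi; multi != next_multi;)
	{
		/* exactly one member is written per converted multixact */
		next_offset += 1;

		multi++;
		/* skip the invalid ID 0 when the counter wraps */
		if (multi < SKETCH_FIRST_MULTI)
			multi = SKETCH_FIRST_MULTI;
	}

	return next_offset;
}
```

For example, converting the range from 10 up to (not including) 15 visits five multixacts and leaves the next offset at 6. A range that wraps from 0xFFFFFFFE to 3 visits 0xFFFFFFFE, 0xFFFFFFFF, 1 and 2, leaving it at 5. Because the new offsets are dense and restart at 1, the converted members SLRU can be much smaller than the old one.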

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 34a745c07be..e0323ec1014 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1824,48 +1824,6 @@ BootStrapMultiXact(void)
 	SimpleLruZeroAndWritePage(MultiXactMemberCtl, 0);
 }
 
-/*
- * MaybeExtendOffsetSlru
- *		Extend the offsets SLRU area, if necessary
- *
- * After a binary upgrade from <= 9.2, the pg_multixact/offsets SLRU area might
- * contain files that are shorter than necessary; this would occur if the old
- * installation had used multixacts beyond the first page (files cannot be
- * copied, because the on-disk representation is different).  pg_upgrade would
- * update pg_control to set the next offset value to be at that position, so
- * that tuples marked as locked by such MultiXacts would be seen as visible
- * without having to consult multixact.  However, trying to create and use a
- * new MultiXactId would result in an error because the page on which the new
- * value would reside does not exist.  This routine is in charge of creating
- * such pages.
- */
-static void
-MaybeExtendOffsetSlru(void)
-{
-	int64		pageno;
-	LWLock	   *lock;
-
-	pageno = MultiXactIdToOffsetPage(MultiXactState->nextMXact);
-	lock = SimpleLruGetBankLock(MultiXactOffsetCtl, pageno);
-
-	LWLockAcquire(lock, LW_EXCLUSIVE);
-
-	if (!SimpleLruDoesPhysicalPageExist(MultiXactOffsetCtl, pageno))
-	{
-		int			slotno;
-
-		/*
-		 * Fortunately for us, SimpleLruWritePage is already prepared to deal
-		 * with creating a new segment file even if the page we're writing is
-		 * not the first in it, so this is enough.
-		 */
-		slotno = SimpleLruZeroPage(MultiXactOffsetCtl, pageno);
-		SimpleLruWritePage(MultiXactOffsetCtl, slotno);
-	}
-
-	LWLockRelease(lock);
-}
-
 /*
  * This must be called ONCE during postmaster or standalone-backend startup.
  *
@@ -2058,20 +2016,6 @@ MultiXactSetNextMXact(MultiXactId nextMulti,
 	MultiXactState->nextMXact = nextMulti;
 	MultiXactState->nextOffset = nextMultiOffset;
 	LWLockRelease(MultiXactGenLock);
-
-	/*
-	 * During a binary upgrade, make sure that the offsets SLRU is large
-	 * enough to contain the next value that would be created.
-	 *
-	 * We need to do this pretty early during the first startup in binary
-	 * upgrade mode: before StartupMultiXact() in fact, because this routine
-	 * is called even before that by StartupXLOG().  And we can't do it
-	 * earlier than at this point, because during that first call of this
-	 * routine we determine the MultiXactState->nextMXact value that
-	 * MaybeExtendOffsetSlru needs.
-	 */
-	if (IsBinaryUpgrade)
-		MaybeExtendOffsetSlru();
 }
 
 /*
diff --git a/src/bin/pg_upgrade/Makefile b/src/bin/pg_upgrade/Makefile
index 69fcf593cae..42995d53b0b 100644
--- a/src/bin/pg_upgrade/Makefile
+++ b/src/bin/pg_upgrade/Makefile
@@ -18,11 +18,14 @@ OBJS = \
 	file.o \
 	function.o \
 	info.o \
+	multixact_new.o \
+	multixact_old.o \
 	option.o \
 	parallel.o \
 	pg_upgrade.o \
 	relfilenumber.o \
 	server.o \
+	slru_io.o \
 	tablespace.o \
 	task.o \
 	util.o \
diff --git a/src/bin/pg_upgrade/meson.build b/src/bin/pg_upgrade/meson.build
index ac992f0d14b..3e46c4512cf 100644
--- a/src/bin/pg_upgrade/meson.build
+++ b/src/bin/pg_upgrade/meson.build
@@ -8,11 +8,14 @@ pg_upgrade_sources = files(
   'file.c',
   'function.c',
   'info.c',
+  'multixact_new.c',
+  'multixact_old.c',
   'option.c',
   'parallel.c',
   'pg_upgrade.c',
   'relfilenumber.c',
   'server.c',
+  'slru_io.c',
   'tablespace.c',
   'task.c',
   'util.c',
diff --git a/src/bin/pg_upgrade/multixact_new.c b/src/bin/pg_upgrade/multixact_new.c
new file mode 100644
index 00000000000..5db7af5b12d
--- /dev/null
+++ b/src/bin/pg_upgrade/multixact_new.c
@@ -0,0 +1,174 @@
+/*
+ * multixact_new.c
+ *
+ * Functions to write multixacts in the v19 format with 64-bit
+ * MultiXactOffsets
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ * src/bin/pg_upgrade/multixact_new.c
+ */
+
+#include "postgres_fe.h"
+
+#include "multixact_new.h"
+
+/*
+ * NOTE: Below are a bunch of definitions and simple inline functions that are
+ * copy-pasted from multixact.c
+ */
+
+/* We need 8 bytes per offset */
+#define MULTIXACT_OFFSETS_PER_PAGE (BLCKSZ / sizeof(MultiXactOffset))
+
+static inline int64
+MultiXactIdToOffsetPage(MultiXactId multi)
+{
+	return multi / MULTIXACT_OFFSETS_PER_PAGE;
+}
+
+static inline int
+MultiXactIdToOffsetEntry(MultiXactId multi)
+{
+	return multi % MULTIXACT_OFFSETS_PER_PAGE;
+}
+
+/* We need eight bits per xact, so one xact fits in a byte */
+#define MXACT_MEMBER_BITS_PER_XACT			8
+#define MXACT_MEMBER_FLAGS_PER_BYTE			1
+#define MXACT_MEMBER_XACT_BITMASK	((1 << MXACT_MEMBER_BITS_PER_XACT) - 1)
+
+/* how many full bytes of flags are there in a group? */
+#define MULTIXACT_FLAGBYTES_PER_GROUP		4
+#define MULTIXACT_MEMBERS_PER_MEMBERGROUP	\
+	(MULTIXACT_FLAGBYTES_PER_GROUP * MXACT_MEMBER_FLAGS_PER_BYTE)
+/* size in bytes of a complete group */
+#define MULTIXACT_MEMBERGROUP_SIZE \
+	(sizeof(TransactionId) * MULTIXACT_MEMBERS_PER_MEMBERGROUP + MULTIXACT_FLAGBYTES_PER_GROUP)
+#define MULTIXACT_MEMBERGROUPS_PER_PAGE (BLCKSZ / MULTIXACT_MEMBERGROUP_SIZE)
+#define MULTIXACT_MEMBERS_PER_PAGE	\
+	(MULTIXACT_MEMBERGROUPS_PER_PAGE * MULTIXACT_MEMBERS_PER_MEMBERGROUP)
+
+/* page in which a member is to be found */
+static inline int64
+MXOffsetToMemberPage(MultiXactOffset offset)
+{
+	return offset / MULTIXACT_MEMBERS_PER_PAGE;
+}
+
+/* Location (byte offset within page) of flag word for a given member */
+static inline int
+MXOffsetToFlagsOffset(MultiXactOffset offset)
+{
+	MultiXactOffset group = offset / MULTIXACT_MEMBERS_PER_MEMBERGROUP;
+	int			grouponpg = group % MULTIXACT_MEMBERGROUPS_PER_PAGE;
+	int			byteoff = grouponpg * MULTIXACT_MEMBERGROUP_SIZE;
+
+	return byteoff;
+}
+
+static inline int
+MXOffsetToFlagsBitShift(MultiXactOffset offset)
+{
+	int			member_in_group = offset % MULTIXACT_MEMBERS_PER_MEMBERGROUP;
+	int			bshift = member_in_group * MXACT_MEMBER_BITS_PER_XACT;
+
+	return bshift;
+}
+
+/* Location (byte offset within page) of TransactionId of given member */
+static inline int
+MXOffsetToMemberOffset(MultiXactOffset offset)
+{
+	int			member_in_group = offset % MULTIXACT_MEMBERS_PER_MEMBERGROUP;
+
+	return MXOffsetToFlagsOffset(offset) +
+		MULTIXACT_FLAGBYTES_PER_GROUP +
+		member_in_group * sizeof(TransactionId);
+}
+
+MultiXactWriter *
+AllocMultiXactWrite(const char *pgdata, MultiXactId firstMulti,
+					MultiXactOffset firstOffset)
+{
+	MultiXactWriter *state = pg_malloc(sizeof(*state));
+	char		dir[MAXPGPATH] = {0};
+
+	pg_sprintf(dir, "%s/pg_multixact/offsets", pgdata);
+	state->offset = AllocSlruWrite(dir, false);
+	SlruWriteSwitchPage(state->offset, MultiXactIdToOffsetPage(firstMulti));
+
+	pg_sprintf(dir, "%s/pg_multixact/members", pgdata);
+	state->members = AllocSlruWrite(dir, true /* use long segment names */ );
+	SlruWriteSwitchPage(state->members, MXOffsetToMemberPage(firstOffset));
+
+	return state;
+}
+
+/*
+ * Write a new multixact with members.
+ *
+ * Simplified version of the corresponding server function, hence the name.
+ */
+void
+RecordNewMultiXact(MultiXactWriter *state, MultiXactOffset offset,
+				   MultiXactId multi, int nmembers, MultiXactMember *members)
+{
+	int64		pageno;
+	int64		prev_pageno;
+	int			entryno;
+	char	   *buf;
+	MultiXactOffset *offptr;
+
+	pageno = MultiXactIdToOffsetPage(multi);
+	entryno = MultiXactIdToOffsetEntry(multi);
+
+	/* Store the offset */
+	buf = SlruWriteSwitchPage(state->offset, pageno);
+	offptr = (MultiXactOffset *) buf;
+	offptr[entryno] = offset;
+
+	/* Store the members */
+	prev_pageno = -1;
+	for (int i = 0; i < nmembers; i++, offset++)
+	{
+		TransactionId *memberptr;
+		uint32	   *flagsptr;
+		uint32		flagsval;
+		int			bshift;
+		int			flagsoff;
+		int			memberoff;
+
+		Assert(members[i].status <= MultiXactStatusUpdate);
+
+		pageno = MXOffsetToMemberPage(offset);
+		memberoff = MXOffsetToMemberOffset(offset);
+		flagsoff = MXOffsetToFlagsOffset(offset);
+		bshift = MXOffsetToFlagsBitShift(offset);
+
+		if (pageno != prev_pageno)
+		{
+			buf = SlruWriteSwitchPage(state->members, pageno);
+			prev_pageno = pageno;
+		}
+
+		memberptr = (TransactionId *) (buf + memberoff);
+
+		*memberptr = members[i].xid;
+
+		flagsptr = (uint32 *) (buf + flagsoff);
+
+		flagsval = *flagsptr;
+		flagsval &= ~(((1 << MXACT_MEMBER_BITS_PER_XACT) - 1) << bshift);
+		flagsval |= (members[i].status << bshift);
+		*flagsptr = flagsval;
+	}
+}
+
+void
+FreeMultiXactWrite(MultiXactWriter *state)
+{
+	FreeSlruWrite(state->offset);
+	FreeSlruWrite(state->members);
+
+	pfree(state);
+}
diff --git a/src/bin/pg_upgrade/multixact_new.h b/src/bin/pg_upgrade/multixact_new.h
new file mode 100644
index 00000000000..f66e6af7e45
--- /dev/null
+++ b/src/bin/pg_upgrade/multixact_new.h
@@ -0,0 +1,23 @@
+/*
+ * multixact_new.h
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ * src/bin/pg_upgrade/multixact_new.h
+ */
+#include "access/multixact.h"
+
+#include "slru_io.h"
+
+typedef struct MultiXactWriter
+{
+	SlruSegState *offset;
+	SlruSegState *members;
+} MultiXactWriter;
+
+extern MultiXactWriter *AllocMultiXactWrite(const char *pgdata,
+											MultiXactId firstMulti,
+											MultiXactOffset firstOffset);
+extern void RecordNewMultiXact(MultiXactWriter *state, MultiXactOffset offset,
+							   MultiXactId multi, int nmembers,
+							   MultiXactMember *members);
+extern void FreeMultiXactWrite(MultiXactWriter *writer);
diff --git a/src/bin/pg_upgrade/multixact_old.c b/src/bin/pg_upgrade/multixact_old.c
new file mode 100644
index 00000000000..f05f8e0a1f2
--- /dev/null
+++ b/src/bin/pg_upgrade/multixact_old.c
@@ -0,0 +1,297 @@
+/*
+ * multixact_old.c
+ *
+ * Functions to read pre-v19 multixacts
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ * src/bin/pg_upgrade/multixact_old.c
+ */
+
+#include "postgres_fe.h"
+
+#include "multixact_old.h"
+#include "pg_upgrade.h"
+
+/*
+ * NOTE: below are a bunch of definitions and simple static inline functions
+ * that are copy-pasted from multixact.c from version 18.  The only difference
+ * is that we use the OldMultiXactOffset type equal to uint32 instead of
+ * MultiXactOffset which became uint64.
+ */
+
+/* We need four bytes per offset */
+#define MULTIXACT_OFFSETS_PER_PAGE (BLCKSZ / sizeof(OldMultiXactOffset))
+
+static inline int64
+MultiXactIdToOffsetPage(MultiXactId multi)
+{
+	return multi / MULTIXACT_OFFSETS_PER_PAGE;
+}
+
+static inline int
+MultiXactIdToOffsetEntry(MultiXactId multi)
+{
+	return multi % MULTIXACT_OFFSETS_PER_PAGE;
+}
+
+/*
+ * The situation for members is a bit more complex: we store one byte of
+ * additional flag bits for each TransactionId.  To do this without getting
+ * into alignment issues, we store four bytes of flags, and then the
+ * corresponding 4 Xids.  Each such 5-word (20-byte) set we call a "group", and
+ * are stored as a whole in pages.  Thus, with 8kB BLCKSZ, we keep 409 groups
+ * per page.  This wastes 12 bytes per page, but that's OK -- simplicity (and
+ * performance) trumps space efficiency here.
+ *
+ * Note that the "offset" macros work with byte offset, not array indexes, so
+ * arithmetic must be done using "char *" pointers.
+ */
+/* We need eight bits per xact, so one xact fits in a byte */
+#define MXACT_MEMBER_BITS_PER_XACT			8
+#define MXACT_MEMBER_FLAGS_PER_BYTE			1
+#define MXACT_MEMBER_XACT_BITMASK	((1 << MXACT_MEMBER_BITS_PER_XACT) - 1)
+
+/* how many full bytes of flags are there in a group? */
+#define MULTIXACT_FLAGBYTES_PER_GROUP		4
+#define MULTIXACT_MEMBERS_PER_MEMBERGROUP	\
+	(MULTIXACT_FLAGBYTES_PER_GROUP * MXACT_MEMBER_FLAGS_PER_BYTE)
+/* size in bytes of a complete group */
+#define MULTIXACT_MEMBERGROUP_SIZE \
+	(sizeof(TransactionId) * MULTIXACT_MEMBERS_PER_MEMBERGROUP + MULTIXACT_FLAGBYTES_PER_GROUP)
+#define MULTIXACT_MEMBERGROUPS_PER_PAGE (BLCKSZ / MULTIXACT_MEMBERGROUP_SIZE)
+#define MULTIXACT_MEMBERS_PER_PAGE	\
+	(MULTIXACT_MEMBERGROUPS_PER_PAGE * MULTIXACT_MEMBERS_PER_MEMBERGROUP)
+
+/* page in which a member is to be found */
+static inline int64
+MXOffsetToMemberPage(OldMultiXactOffset offset)
+{
+	return offset / MULTIXACT_MEMBERS_PER_PAGE;
+}
+
+/* Location (byte offset within page) of flag word for a given member */
+static inline int
+MXOffsetToFlagsOffset(OldMultiXactOffset offset)
+{
+	OldMultiXactOffset group = offset / MULTIXACT_MEMBERS_PER_MEMBERGROUP;
+	int			grouponpg = group % MULTIXACT_MEMBERGROUPS_PER_PAGE;
+	int			byteoff = grouponpg * MULTIXACT_MEMBERGROUP_SIZE;
+
+	return byteoff;
+}
+
+/* Location (byte offset within page) of TransactionId of given member */
+static inline int
+MXOffsetToMemberOffset(OldMultiXactOffset offset)
+{
+	int			member_in_group = offset % MULTIXACT_MEMBERS_PER_MEMBERGROUP;
+
+	return MXOffsetToFlagsOffset(offset) +
+		MULTIXACT_FLAGBYTES_PER_GROUP +
+		member_in_group * sizeof(TransactionId);
+}
+
+static inline int
+MXOffsetToFlagsBitShift(OldMultiXactOffset offset)
+{
+	int			member_in_group = offset % MULTIXACT_MEMBERS_PER_MEMBERGROUP;
+	int			bshift = member_in_group * MXACT_MEMBER_BITS_PER_XACT;
+
+	return bshift;
+}
+
+/*
+ * Construct reader of old multixacts.
+ *
+ * Returns the malloced memory used by all the other calls in this module.
+ */
+OldMultiXactReader *
+AllocOldMultiXactRead(char *pgdata, MultiXactId nextMulti,
+					  OldMultiXactOffset nextOffset)
+{
+	OldMultiXactReader *state = pg_malloc(sizeof(*state));
+	char		dir[MAXPGPATH] = {0};
+
+	state->nextMXact = nextMulti;
+	state->nextOffset = nextOffset;
+
+	pg_sprintf(dir, "%s/pg_multixact/offsets", pgdata);
+	state->offset = AllocSlruRead(dir, false);
+
+	pg_sprintf(dir, "%s/pg_multixact/members", pgdata);
+	state->members = AllocSlruRead(dir, false);
+
+	return state;
+}
+
+/*
+ * This is a simplified version of the GetMultiXactIdMembers() server function.
+ *
+ * - Only return the updating member, if any. Upgrade only cares about the
+ *   updaters. If there is no updating member, return the first locking-only
+ *   member. We don't have any way to represent "no members", but we also don't
+ *   need to preserve all the locking members.
+ *
+ * - We don't need to worry about locking and some corner cases because there's
+ *   no concurrent activity.
+ */
+void
+GetOldMultiXactIdSingleMember(OldMultiXactReader *state, MultiXactId multi,
+							  TransactionId *result, MultiXactStatus *status)
+{
+	MultiXactId nextMXact,
+				tmpMXact;
+	OldMultiXactOffset nextOffset;
+	int64		pageno,
+				prev_pageno;
+	int			entryno,
+				length;
+	char	   *buf;
+	OldMultiXactOffset *offptr,
+				offset;
+	TransactionId result_xid = InvalidTransactionId;
+	bool		result_isupdate = false;
+
+	nextMXact = state->nextMXact;
+	nextOffset = state->nextOffset;
+
+	/*
+	 * See GetMultiXactIdMembers in multixact.c
+	 *
+	 * Find out the offset at which we need to start reading MultiXactMembers
+	 * and the number of members in the multixact.  We determine the latter as
+	 * the difference between this multixact's starting offset and the next
+	 * one's.  However, there are some corner cases to worry about:
+	 *
+	 * 1. This multixact may be the latest one created, in which case there is
+	 * no next one to look at.  In this case the nextOffset value we just
+	 * saved is the correct endpoint.
+	 *
+	 * 2. The next multixact may still be in process of being filled in...
+	 * This cannot happen during upgrade.
+	 *
+	 * 3. Because GetNewMultiXactId increments offset zero to offset one to
+	 * handle case #2, there is an ambiguity near the point of offset
+	 * wraparound.  If we see next multixact's offset is one, is that our
+	 * multixact's actual endpoint, or did it end at zero with a subsequent
+	 * increment?  We handle this using the knowledge that if the zero'th
+	 * member slot wasn't filled, it'll contain zero, and zero isn't a valid
+	 * transaction ID so it can't be a multixact member.  Therefore, if we
+	 * read a zero from the members array, just ignore it.
+	 */
+
+	pageno = MultiXactIdToOffsetPage(multi);
+	entryno = MultiXactIdToOffsetEntry(multi);
+
+	buf = SlruReadSwitchPage(state->offset, pageno);
+	offptr = (OldMultiXactOffset *) buf;
+	offptr += entryno;
+	offset = *offptr;
+
+	Assert(offset != 0);
+
+	/*
+	 * Use the same increment rule as GetNewMultiXactId(), that is, don't
+	 * handle wraparound explicitly until needed.
+	 */
+	tmpMXact = multi + 1;
+
+	if (nextMXact == tmpMXact)
+	{
+		/* Corner case 1: there is no next multixact */
+		length = nextOffset - offset;
+	}
+	else
+	{
+		OldMultiXactOffset nextMXOffset;
+
+		/* handle wraparound if needed */
+		if (tmpMXact < FirstMultiXactId)
+			tmpMXact = FirstMultiXactId;
+
+		prev_pageno = pageno;
+
+		pageno = MultiXactIdToOffsetPage(tmpMXact);
+		entryno = MultiXactIdToOffsetEntry(tmpMXact);
+
+		if (pageno != prev_pageno)
+			buf = SlruReadSwitchPage(state->offset, pageno);
+
+		offptr = (OldMultiXactOffset *) buf;
+		offptr += entryno;
+		nextMXOffset = *offptr;
+
+		/*
+		 * Corner case 2: next multixact is still being filled in, this must
+		 * not happen during upgrade.
+		 */
+		Assert(nextMXOffset != 0);
+
+		length = nextMXOffset - offset;
+	}
+
+	prev_pageno = -1;
+	for (int i = 0; i < length; i++, offset++)
+	{
+		TransactionId *xactptr;
+		uint32	   *flagsptr;
+		int			flagsoff;
+		int			bshift;
+		int			memberoff;
+		MultiXactStatus st;
+
+		pageno = MXOffsetToMemberPage(offset);
+		memberoff = MXOffsetToMemberOffset(offset);
+
+		if (pageno != prev_pageno)
+		{
+			buf = SlruReadSwitchPage(state->members, pageno);
+			prev_pageno = pageno;
+		}
+
+		xactptr = (TransactionId *) (buf + memberoff);
+		if (!TransactionIdIsValid(*xactptr))
+		{
+			/* Corner case 3: we must be looking at unused slot zero */
+			Assert(offset == 0);
+			continue;
+		}
+
+		flagsoff = MXOffsetToFlagsOffset(offset);
+		bshift = MXOffsetToFlagsBitShift(offset);
+		flagsptr = (uint32 *) (buf + flagsoff);
+
+		st = (*flagsptr >> bshift) & MXACT_MEMBER_XACT_BITMASK;
+
+		/* Verify that there is a single update Xid among the given members. */
+		if (ISUPDATE_from_mxstatus(st))
+		{
+			if (result_isupdate)
+				pg_fatal("multixact %u has more than one updating member",
+						 multi);
+			result_xid = *xactptr;
+			result_isupdate = true;
+		}
+		else if (!TransactionIdIsValid(result_xid))
+			result_xid = *xactptr;
+	}
+
+	/* A multixid with zero members should not happen */
+	Assert(TransactionIdIsValid(result_xid));
+
+	*result = result_xid;
+	*status = result_isupdate ? MultiXactStatusUpdate :
+		MultiXactStatusForKeyShare;
+}
+
+/*
+ * Frees the malloced reader.
+ */
+void
+FreeOldMultiXactReader(OldMultiXactReader *state)
+{
+	FreeSlruRead(state->offset);
+	FreeSlruRead(state->members);
+
+	pfree(state);
+}
diff --git a/src/bin/pg_upgrade/multixact_old.h b/src/bin/pg_upgrade/multixact_old.h
new file mode 100644
index 00000000000..8eb5af2ccaf
--- /dev/null
+++ b/src/bin/pg_upgrade/multixact_old.h
@@ -0,0 +1,29 @@
+/*
+ * multixact_old.h
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ * src/bin/pg_upgrade/multixact_old.h
+ */
+
+#include "access/multixact.h"
+#include "slru_io.h"
+
+typedef uint32 OldMultiXactOffset;
+
+typedef struct OldMultiXactReader
+{
+	MultiXactId nextMXact;
+	OldMultiXactOffset nextOffset;
+
+	SlruSegState *offset;
+	SlruSegState *members;
+} OldMultiXactReader;
+
+extern OldMultiXactReader *AllocOldMultiXactRead(char *pgdata,
+												 MultiXactId nextMulti,
+												 OldMultiXactOffset nextOffset);
+extern void GetOldMultiXactIdSingleMember(OldMultiXactReader *state,
+										  MultiXactId multi,
+										  TransactionId *result,
+										  MultiXactStatus *status);
+extern void FreeOldMultiXactReader(OldMultiXactReader *reader);
diff --git a/src/bin/pg_upgrade/pg_upgrade.c b/src/bin/pg_upgrade/pg_upgrade.c
index 490e98fa26f..0fdd05c127c 100644
--- a/src/bin/pg_upgrade/pg_upgrade.c
+++ b/src/bin/pg_upgrade/pg_upgrade.c
@@ -48,6 +48,8 @@
 #include "common/logging.h"
 #include "common/restricted_token.h"
 #include "fe_utils/string_utils.h"
+#include "multixact_old.h"
+#include "multixact_new.h"
 #include "pg_upgrade.h"
 
 /*
@@ -769,6 +771,81 @@ copy_subdir_files(const char *old_subdir, const char *new_subdir)
 	check_ok();
 }
 
+/*
+ * Convert pg_multixact/offsets and /members to new format with 64-bit offsets.
+ */
+static void
+convert_multixacts(MultiXactId *new_nxtmulti, MultiXactOffset *new_nxtmxoff)
+{
+	MultiXactId oldest_multi,
+				next_multi;
+	OldMultiXactReader *old_reader;
+	MultiXactWriter *new_writer;
+	MultiXactOffset next_offset;
+
+	/*
+	 * The range of valid multi XIDs is unchanged by the conversion (they are
+	 * referenced from the heap tables), but the members SLRU is rewritten to
+	 * start from offset 1.
+	 */
+	oldest_multi = old_cluster.controldata.chkpnt_oldstMulti;
+	next_multi = old_cluster.controldata.chkpnt_nxtmulti;
+	next_offset = 1;
+
+	old_reader = AllocOldMultiXactRead(old_cluster.pgdata,
+									   old_cluster.controldata.chkpnt_nxtmulti,
+									   old_cluster.controldata.chkpnt_nxtmxoff);
+	new_writer = AllocMultiXactWrite(new_cluster.pgdata,
+									 oldest_multi, next_offset);
+
+	/* handle wraparound */
+	if (next_multi < FirstMultiXactId)
+		next_multi = FirstMultiXactId;
+
+	/*
+	 * Read multixids from old files one by one, and write them back in the
+	 * new format.
+	 */
+	for (MultiXactId multi = oldest_multi; multi != next_multi;)
+	{
+		TransactionId xid;
+		MultiXactStatus status;
+		MultiXactMember member;
+
+		/*
+		 * Read the old multixid.  The locking-only XIDs that may be part of
+		 * multi-xids don't matter after upgrade, as there can be no
+		 * transactions running across upgrade.  So as a little optimization,
+		 * we only read one member from each multixid: the updating one,
+		 * or if there was no update, arbitrarily the first locking xid.
+		 */
+		GetOldMultiXactIdSingleMember(old_reader, multi, &xid, &status);
+
+		/* Write it out in new format */
+		member.xid = xid;
+		member.status = status;
+		RecordNewMultiXact(new_writer, next_offset, multi, 1, &member);
+
+		next_offset += 1;
+		multi++;
+		/* handle wraparound */
+		if (multi < FirstMultiXactId)
+			multi = FirstMultiXactId;
+	}
+
+	/*
+	 * Update the nextMXact/Offset values in the control file to match what we
+	 * wrote.  The nextMXact is unchanged, but nextOffset will be different.
+	 */
+	Assert(next_multi == old_cluster.controldata.chkpnt_nxtmulti);
+	*new_nxtmulti = next_multi;
+	*new_nxtmxoff = next_offset;
+
+	/* Release resources */
+	FreeMultiXactWrite(new_writer);
+	FreeOldMultiXactReader(old_reader);
+}
+
 static void
 copy_xact_xlog_xid(void)
 {
@@ -816,8 +893,29 @@ copy_xact_xlog_xid(void)
 	if (old_cluster.controldata.cat_ver >= MULTIXACT_FORMATCHANGE_CAT_VER &&
 		new_cluster.controldata.cat_ver >= MULTIXACT_FORMATCHANGE_CAT_VER)
 	{
-		copy_subdir_files("pg_multixact/offsets", "pg_multixact/offsets");
-		copy_subdir_files("pg_multixact/members", "pg_multixact/members");
+		MultiXactId new_nxtmulti = old_cluster.controldata.chkpnt_nxtmulti;
+		MultiXactOffset new_nxtmxoff = old_cluster.controldata.chkpnt_nxtmxoff;
+
+		/*
+		 * If the old cluster's catalog version is older than
+		 * MULTIXACTOFFSET_FORMATCHANGE_CAT_VER, it has 32-bit multixid
+		 * offsets, which must be converted to the 64-bit format.
+		 */
+		if (old_cluster.controldata.cat_ver < MULTIXACTOFFSET_FORMATCHANGE_CAT_VER &&
+			new_cluster.controldata.cat_ver >= MULTIXACTOFFSET_FORMATCHANGE_CAT_VER)
+		{
+			remove_new_subdir("pg_multixact/members", false);
+			remove_new_subdir("pg_multixact/offsets", false);
+
+			prep_status("Converting pg_multixact/offsets to 64-bit");
+			convert_multixacts(&new_nxtmulti, &new_nxtmxoff);
+			check_ok();
+		}
+		else
+		{
+			copy_subdir_files("pg_multixact/offsets", "pg_multixact/offsets");
+			copy_subdir_files("pg_multixact/members", "pg_multixact/members");
+		}
 
 		prep_status("Setting next multixact ID and offset for new cluster");
 
@@ -826,10 +924,8 @@ copy_xact_xlog_xid(void)
 		 * counters here and the oldest multi present on system.
 		 */
 		exec_prog(UTILITY_LOG_FILE, NULL, true, true,
-				  "\"%s/pg_resetwal\" -O %u -m %u,%u \"%s\"",
-				  new_cluster.bindir,
-				  old_cluster.controldata.chkpnt_nxtmxoff,
-				  old_cluster.controldata.chkpnt_nxtmulti,
+				  "\"%s/pg_resetwal\" -O %" PRIu64 " -m %u,%u \"%s\"",
+				  new_cluster.bindir, new_nxtmxoff, new_nxtmulti,
 				  old_cluster.controldata.chkpnt_oldstMulti,
 				  new_cluster.pgdata);
 		check_ok();
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index e86336f4be9..127b2cb00fa 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -114,6 +114,11 @@ extern char *output_files[];
  */
 #define MULTIXACT_FORMATCHANGE_CAT_VER 201301231
 
+/*
+ * Switch from 32-bit to 64-bit for multixid offsets.
+ */
+#define MULTIXACTOFFSET_FORMATCHANGE_CAT_VER 999999999
+
 /*
  * large object chunk size added to pg_controldata,
  * commit 5f93c37805e7485488480916b4585e098d3cc883
diff --git a/src/bin/pg_upgrade/slru_io.c b/src/bin/pg_upgrade/slru_io.c
new file mode 100644
index 00000000000..010094184be
--- /dev/null
+++ b/src/bin/pg_upgrade/slru_io.c
@@ -0,0 +1,242 @@
+/*
+ * slru_io.c
+ *
+ * Routines for reading and writing SLRU files during upgrade.
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ * src/bin/pg_upgrade/slru_io.c
+ */
+
+#include "postgres_fe.h"
+
+#include <fcntl.h>
+
+#include "common/fe_memutils.h"
+#include "common/file_perm.h"
+#include "common/file_utils.h"
+#include "port/pg_iovec.h"
+#include "pg_upgrade.h"
+#include "slru_io.h"
+
+static SlruSegState *AllocSlruSegState(const char *dir);
+static char *SlruFileName(SlruSegState *state, int64 segno);
+static void SlruFlush(SlruSegState *state);
+
+static SlruSegState *
+AllocSlruSegState(const char *dir)
+{
+	SlruSegState *state = pg_malloc(sizeof(*state));
+
+	state->dir = pstrdup(dir);
+	state->fn = NULL;
+	state->fd = -1;
+	state->segno = -1;
+	state->pageno = 0;
+
+	return state;
+}
+
+/* similar to the backend function with the same name */
+static char *
+SlruFileName(SlruSegState *state, int64 segno)
+{
+	if (state->long_segment_names)
+	{
+		Assert(segno >= 0 && segno <= INT64CONST(0xFFFFFFFFFFFFFFF));
+		return psprintf("%s/%015" PRIX64, state->dir, segno);
+	}
+	else
+	{
+		Assert(segno >= 0 && segno <= INT64CONST(0xFFFFFF));
+		return psprintf("%s/%04X", state->dir, (unsigned int) segno);
+	}
+}
+
+/*
+ * Create slru reader for dir.
+ *
+ * Returns the malloced memory used by all the other read calls in this module.
+ */
+SlruSegState *
+AllocSlruRead(const char *dir, bool long_segment_names)
+{
+	SlruSegState *state = AllocSlruSegState(dir);
+
+	state->writing = false;
+	state->long_segment_names = long_segment_names;
+
+	return state;
+}
+
+/*
+ * Open given page for reading.
+ *
+ * Reading can be done in random order.
+ */
+char *
+SlruReadSwitchPageSlow(SlruSegState *state, uint64 pageno)
+{
+	int64		segno;
+
+	Assert(!state->writing);	/* read only mode */
+
+	if (state->segno != -1 && pageno == state->pageno)
+		return state->buf.data;
+
+	segno = pageno / SLRU_PAGES_PER_SEGMENT;
+	if (segno != state->segno)
+	{
+		if (state->segno != -1)
+		{
+			close(state->fd);
+			state->fd = -1;
+
+			pg_free(state->fn);
+			state->fn = NULL;
+
+			state->segno = -1;
+		}
+
+		/* Open new segment */
+		state->fn = SlruFileName(state, segno);
+		if ((state->fd = open(state->fn, O_RDONLY | PG_BINARY, 0)) < 0)
+			pg_fatal("could not open file \"%s\": %m", state->fn);
+	}
+
+	state->segno = segno;
+
+	{
+		struct iovec iovec = {
+			.iov_base = &state->buf,
+			.iov_len = BLCKSZ,
+		};
+		off_t		offset = (pageno % SLRU_PAGES_PER_SEGMENT) * BLCKSZ;
+
+		if (pg_preadv(state->fd, &iovec, 1, offset) < 0)
+			pg_fatal("could not read file \"%s\": %m", state->fn);
+
+		state->pageno = pageno;
+	}
+
+	return state->buf.data;
+}
+
+/*
+ * Frees the malloced reader.
+ */
+void
+FreeSlruRead(SlruSegState *state)
+{
+	Assert(!state->writing);	/* read only mode */
+
+	if (state->fd != -1)
+		close(state->fd);
+	pg_free(state);
+}
+
+/*
+ * Create slru writer for dir.
+ *
+ * Returns the malloced memory used by all the other write calls in this module.
+ */
+SlruSegState *
+AllocSlruWrite(const char *dir, bool long_segment_names)
+{
+	SlruSegState *state = AllocSlruSegState(dir);
+
+	state->writing = true;
+	state->long_segment_names = long_segment_names;
+
+	return state;
+}
+
+/*
+ * Open the given page for writing.
+ *
+ * NOTE: This uses O_EXCL when stepping to a new segment, so this assumes that
+ * each segment is written in full before moving on to next one.  This
+ * limitation would be easy to lift if needed, but it fits the usage pattern of
+ * current callers.
+ */
+char *
+SlruWriteSwitchPageSlow(SlruSegState *state, uint64 pageno)
+{
+	int64		segno;
+	off_t		offset;
+
+	if (state->segno != -1 && pageno == state->pageno)
+		return state->buf.data;
+
+	segno = pageno / SLRU_PAGES_PER_SEGMENT;
+	offset = (pageno % SLRU_PAGES_PER_SEGMENT) * BLCKSZ;
+
+	SlruFlush(state);
+	memset(state->buf.data, 0, BLCKSZ);
+
+	if (segno != state->segno)
+	{
+		if (state->segno != -1)
+		{
+			close(state->fd);
+			state->fd = -1;
+
+			pg_free(state->fn);
+			state->fn = NULL;
+
+			state->segno = -1;
+		}
+
+		/* Create the segment */
+		state->fn = SlruFileName(state, segno);
+		if ((state->fd = open(state->fn, O_RDWR | O_CREAT | O_EXCL | PG_BINARY,
+							  pg_file_create_mode)) < 0)
+		{
+			pg_fatal("could not create file \"%s\": %m", state->fn);
+		}
+
+		state->segno = segno;
+
+		if (offset > 0)
+		{
+			if (pg_pwrite_zeros(state->fd, offset, 0) < 0)
+				pg_fatal("could not write file \"%s\": %m", state->fn);
+		}
+	}
+
+	state->pageno = pageno;
+
+	return state->buf.data;
+}
+
+static void
+SlruFlush(SlruSegState *state)
+{
+	struct iovec iovec = {
+		.iov_base = &state->buf,
+		.iov_len = BLCKSZ,
+	};
+	off_t		offset;
+
+	if (state->segno == -1)
+		return;
+
+	offset = (state->pageno % SLRU_PAGES_PER_SEGMENT) * BLCKSZ;
+
+	if (pg_pwritev_with_retry(state->fd, &iovec, 1, offset) < 0)
+		pg_fatal("could not write file \"%s\": %m", state->fn);
+}
+
+/*
+ * Frees the malloced writer.
+ */
+void
+FreeSlruWrite(SlruSegState *state)
+{
+	Assert(state->writing);
+
+	SlruFlush(state);
+
+	if (state->fd != -1)
+		close(state->fd);
+	pg_free(state);
+}
diff --git a/src/bin/pg_upgrade/slru_io.h b/src/bin/pg_upgrade/slru_io.h
new file mode 100644
index 00000000000..77b8830b000
--- /dev/null
+++ b/src/bin/pg_upgrade/slru_io.h
@@ -0,0 +1,59 @@
+/*
+ * slru_io.h
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ * src/bin/pg_upgrade/slru_io.h
+ */
+
+#ifndef SLRU_IO_H
+#define SLRU_IO_H
+
+/*
+ * See access/slru.h
+ *
+ * Copied here, since slru.h cannot be included in frontend code.
+ */
+#define SLRU_PAGES_PER_SEGMENT 32
+
+/*
+ * State for reading or writing an SLRU, with a one page buffer.
+ */
+typedef struct SlruSegState
+{
+	bool		writing;
+	bool		long_segment_names;
+
+	char	   *dir;
+	char	   *fn;
+	int			fd;
+	int64		segno;
+	uint64		pageno;
+
+	PGAlignedBlock buf;
+} SlruSegState;
+
+extern SlruSegState *AllocSlruRead(const char *dir, bool long_segment_names);
+extern char *SlruReadSwitchPageSlow(SlruSegState *state, uint64 pageno);
+extern void FreeSlruRead(SlruSegState *state);
+
+static inline char *
+SlruReadSwitchPage(SlruSegState *state, uint64 pageno)
+{
+	if (state->segno != -1 && pageno == state->pageno)
+		return state->buf.data;
+	return SlruReadSwitchPageSlow(state, pageno);
+}
+
+extern SlruSegState *AllocSlruWrite(const char *dir, bool long_segment_names);
+extern char *SlruWriteSwitchPageSlow(SlruSegState *state, uint64 pageno);
+extern void FreeSlruWrite(SlruSegState *state);
+
+static inline char *
+SlruWriteSwitchPage(SlruSegState *state, uint64 pageno)
+{
+	if (state->segno != -1 && pageno == state->pageno)
+		return state->buf.data;
+	return SlruWriteSwitchPageSlow(state, pageno);
+}
+
+#endif							/* SLRU_IO_H */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 432509277c9..9392bb729b9 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1725,6 +1725,7 @@ MultiXactMember
 MultiXactOffset
 MultiXactStateData
 MultiXactStatus
+MultiXactWriter
 MultirangeIOData
 MultirangeParseState
 MultirangeType
@@ -1808,6 +1809,7 @@ OffsetVarNodes_context
 Oid
 OidOptions
 OkeysState
+OldMultiXactReader
 OldToNewMapping
 OldToNewMappingData
 OnCommitAction
@@ -2804,6 +2806,7 @@ SlruCtlData
 SlruErrorCause
 SlruPageStatus
 SlruScanCallback
+SlruSegState
 SlruShared
 SlruSharedData
 SlruWriteAll
-- 
2.47.3
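
For readers following slru_io.c above: the page-to-segment arithmetic and the two segment file name formats can be sketched in isolation as below. This is only an illustrative sketch and not part of the patch set. The sketch_* names are hypothetical, and the 8192-byte block size is an assumed default (the patch itself uses BLCKSZ).

```c
#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define SKETCH_PAGES_PER_SEGMENT 32	/* mirrors SLRU_PAGES_PER_SEGMENT */
#define SKETCH_BLCKSZ 8192			/* assumption: default block size */

/* Segment number that holds the given SLRU page. */
static int64_t
sketch_segno(uint64_t pageno)
{
	return (int64_t) (pageno / SKETCH_PAGES_PER_SEGMENT);
}

/* Byte offset of the page inside its segment file. */
static int64_t
sketch_page_offset(uint64_t pageno)
{
	return (int64_t) (pageno % SKETCH_PAGES_PER_SEGMENT) * SKETCH_BLCKSZ;
}

/*
 * Format the segment file name: 15 hex digits when long segment names
 * are in use, 4 hex digits otherwise (same formats as SlruFileName()).
 */
static void
sketch_segment_name(char *buf, size_t buflen, int64_t segno, bool long_names)
{
	if (long_names)
		snprintf(buf, buflen, "%015" PRIX64, (uint64_t) segno);
	else
		snprintf(buf, buflen, "%04X", (unsigned int) segno);
}
```

Under these assumptions, page 100 falls into segment 3 at byte offset 4 * 8192, and the segment is named "0003" with short names or "000000000000003" with long names.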

v24-0004-Remove-oldestOffset-oldestOffsetKnown-from-multi.patch (text/x-patch; charset=UTF-8)
From fb6c57ac151de98d12831db626cac5a3f34985a9 Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Thu, 6 Nov 2025 16:20:18 +0300
Subject: [PATCH v24 4/8] Remove oldestOffset/oldestOffsetKnown from multixact

Since we rewrite all multixacts during pg_upgrade, the oldest offset
in a new cluster can no longer be missing on disk.
---
 src/backend/access/transam/multixact.c | 101 ++-----------------------
 src/include/access/multixact.h         |   3 -
 2 files changed, 5 insertions(+), 99 deletions(-)

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index e0323ec1014..78ba6d72a92 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -140,14 +140,6 @@ typedef struct MultiXactStateData
 	MultiXactId oldestMultiXactId;
 	Oid			oldestMultiXactDB;
 
-	/*
-	 * Oldest multixact offset that is potentially referenced by a multixact
-	 * referenced by a relation.  We don't always know this value, so there's
-	 * a flag here to indicate whether or not we currently do.
-	 */
-	MultiXactOffset oldestOffset;
-	bool		oldestOffsetKnown;
-
 	/* support for anti-wraparound measures */
 	MultiXactId multiVacLimit;
 	MultiXactId multiWarnLimit;
@@ -2371,10 +2363,7 @@ SetOffsetVacuumLimit(bool is_startup)
 	MultiXactId oldestMultiXactId;
 	MultiXactId nextMXact;
 	MultiXactOffset oldestOffset = 0;	/* placate compiler */
-	MultiXactOffset prevOldestOffset;
 	MultiXactOffset nextOffset;
-	bool		oldestOffsetKnown = false;
-	bool		prevOldestOffsetKnown;
 
 	/*
 	 * NB: Have to prevent concurrent truncation, we might otherwise try to
@@ -2387,8 +2376,6 @@ SetOffsetVacuumLimit(bool is_startup)
 	oldestMultiXactId = MultiXactState->oldestMultiXactId;
 	nextMXact = MultiXactState->nextMXact;
 	nextOffset = MultiXactState->nextOffset;
-	prevOldestOffsetKnown = MultiXactState->oldestOffsetKnown;
-	prevOldestOffset = MultiXactState->oldestOffset;
 	Assert(MultiXactState->finishedStartup);
 	LWLockRelease(MultiXactGenLock);
 
@@ -2406,57 +2393,20 @@ SetOffsetVacuumLimit(bool is_startup)
 		 * offset.
 		 */
 		oldestOffset = nextOffset;
-		oldestOffsetKnown = true;
 	}
-	else
+	else if (!find_multixact_start(oldestMultiXactId, &oldestOffset))
 	{
-		/*
-		 * Figure out where the oldest existing multixact's offsets are
-		 * stored. Due to bugs in early release of PostgreSQL 9.3.X and 9.4.X,
-		 * the supposedly-earliest multixact might not really exist.  We are
-		 * careful not to fail in that case.
-		 */
-		oldestOffsetKnown =
-			find_multixact_start(oldestMultiXactId, &oldestOffset);
-
-		if (!oldestOffsetKnown)
-			ereport(LOG,
-					(errmsg("oldest checkpointed MultiXact %u does not exist on disk",
-							oldestMultiXactId)));
+		ereport(LOG,
+				(errmsg("oldest checkpointed MultiXact %u does not exist on disk",
+						oldestMultiXactId)));
 	}
 
 	LWLockRelease(MultiXactTruncationLock);
 
-	/*
-	 * If we can, compute limits (and install them MultiXactState) to prevent
-	 * overrun of old data in the members SLRU area. We can only do so if the
-	 * oldest offset is known though.
-	 *
-	 * FIXME: Is !oldestOffsetKnown possible anymore? At least update the comment:
-	 * we won't overrun members anymore.
-	 */
-	if (prevOldestOffsetKnown)
-	{
-		/*
-		 * If we failed to get the oldest offset this time, but we have a
-		 * value from a previous pass through this function, use the old
-		 * values rather than automatically forcing an autovacuum cycle again.
-		 */
-		oldestOffset = prevOldestOffset;
-		oldestOffsetKnown = true;
-	}
-
-	/* Install the computed values */
-	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
-	MultiXactState->oldestOffset = oldestOffset;
-	MultiXactState->oldestOffsetKnown = oldestOffsetKnown;
-	LWLockRelease(MultiXactGenLock);
-
 	/*
 	 * Do we need autovacuum?	If we're not sure, assume yes.
 	 */
-	return !oldestOffsetKnown ||
-		(nextOffset - oldestOffset > MULTIXACT_MEMBER_AUTOVAC_THRESHOLD);
+	return nextOffset - oldestOffset > MULTIXACT_MEMBER_AUTOVAC_THRESHOLD;
 }
 
 /*
@@ -2503,47 +2453,6 @@ find_multixact_start(MultiXactId multi, MultiXactOffset *result)
 	return true;
 }
 
-/*
- * GetMultiXactInfo
- *
- * Returns information about the current MultiXact state, as of:
- * multixacts: Number of MultiXacts (nextMultiXactId - oldestMultiXactId)
- * members: Number of member entries (nextOffset - oldestOffset)
- * oldestMultiXactId: Oldest MultiXact ID still in use
- * oldestOffset: Oldest offset still in use
- *
- * Returns false if unable to determine, the oldest offset being unknown.
- */
-bool
-GetMultiXactInfo(uint32 *multixacts, MultiXactOffset *members,
-				 MultiXactId *oldestMultiXactId, MultiXactOffset *oldestOffset)
-{
-	MultiXactOffset nextOffset;
-	MultiXactId nextMultiXactId;
-	bool		oldestOffsetKnown;
-
-	LWLockAcquire(MultiXactGenLock, LW_SHARED);
-	nextOffset = MultiXactState->nextOffset;
-	*oldestMultiXactId = MultiXactState->oldestMultiXactId;
-	nextMultiXactId = MultiXactState->nextMXact;
-	*oldestOffset = MultiXactState->oldestOffset;
-	oldestOffsetKnown = MultiXactState->oldestOffsetKnown;
-	LWLockRelease(MultiXactGenLock);
-
-	if (!oldestOffsetKnown)
-	{
-		*members = 0;
-		*multixacts = 0;
-		*oldestMultiXactId = InvalidMultiXactId;
-		*oldestOffset = 0;
-		return false;
-	}
-
-	*members = nextOffset - *oldestOffset;
-	*multixacts = nextMultiXactId - *oldestMultiXactId;
-	return true;
-}
-
 typedef struct mxtruncinfo
 {
 	int64		earliestExistingPage;
diff --git a/src/include/access/multixact.h b/src/include/access/multixact.h
index 7d98fe0fe32..d688b547c54 100644
--- a/src/include/access/multixact.h
+++ b/src/include/access/multixact.h
@@ -109,9 +109,6 @@ extern bool MultiXactIdIsRunning(MultiXactId multi, bool isLockOnly);
 extern void MultiXactIdSetOldestMember(void);
 extern int	GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
 								  bool from_pgupgrade, bool isLockOnly);
-extern bool GetMultiXactInfo(uint32 *multixacts, MultiXactOffset *members,
-							 MultiXactId *oldestMultiXactId,
-							 MultiXactOffset *oldestOffset);
 extern bool MultiXactIdPrecedes(MultiXactId multi1, MultiXactId multi2);
 extern bool MultiXactIdPrecedesOrEquals(MultiXactId multi1,
 										MultiXactId multi2);
-- 
2.47.3

v24-0005-TEST-bump-catversion.patch (text/x-patch; charset=UTF-8)
From f2b9d137e0f6fa1336883d55f42b2fecf5f2fcd9 Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Fri, 24 Oct 2025 11:47:50 +0300
Subject: [PATCH v24 5/8] TEST: bump catversion

To avoid constant CF-bot complaints, bump catversion in a separate
commit.

NOTE: keep it in sync with MULTIXACTOFFSET_FORMATCHANGE_CAT_VER
---
 src/include/catalog/catversion.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h
index 7eefca1ae42..b0162c2bf63 100644
--- a/src/include/catalog/catversion.h
+++ b/src/include/catalog/catversion.h
@@ -57,6 +57,7 @@
  */
 
 /*							yyyymmddN */
-#define CATALOG_VERSION_NO	202511101
+/* FIXME: bump it */
+#define CATALOG_VERSION_NO	999999999
 
 #endif
-- 
2.47.3

v24-0006-TEST-Add-test-for-64-bit-mxoff-in-pg_resetwal.patch (text/x-patch; charset=UTF-8)
From c28747c8586fbe9ceb7ad212654a09266fb404a4 Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Tue, 28 Oct 2025 19:08:26 +0300
Subject: [PATCH v24 6/8] TEST: Add test for 64-bit mxoff in pg_resetwal

---
 src/bin/pg_resetwal/meson.build    |   1 +
 src/bin/pg_resetwal/t/003_mxoff.pl | 170 +++++++++++++++++++++++++++++
 2 files changed, 171 insertions(+)
 create mode 100644 src/bin/pg_resetwal/t/003_mxoff.pl

diff --git a/src/bin/pg_resetwal/meson.build b/src/bin/pg_resetwal/meson.build
index 290832b2299..1e2dfb38a5b 100644
--- a/src/bin/pg_resetwal/meson.build
+++ b/src/bin/pg_resetwal/meson.build
@@ -25,6 +25,7 @@ tests += {
     'tests': [
       't/001_basic.pl',
       't/002_corrupted.pl',
+      't/003_mxoff.pl',
     ],
   },
 }
diff --git a/src/bin/pg_resetwal/t/003_mxoff.pl b/src/bin/pg_resetwal/t/003_mxoff.pl
new file mode 100644
index 00000000000..3c1b7fa1d33
--- /dev/null
+++ b/src/bin/pg_resetwal/t/003_mxoff.pl
@@ -0,0 +1,170 @@
+
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+use strict;
+use warnings FATAL => 'all';
+
+use Math::BigInt;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+sub mxact_eater
+{
+	my $node = shift;
+	my $tbl = shift;
+
+	$node->start;
+	$node->safe_psql('postgres',
+		"CREATE TABLE ${tbl} (I INT PRIMARY KEY, N_UPDATED INT) " .
+		"       WITH (AUTOVACUUM_ENABLED=FALSE);" .
+		"INSERT INTO ${tbl} SELECT G, 0 FROM GENERATE_SERIES(1, 50) G;");
+
+	# consume around 10k multixact offsets
+	my $nclients = 10;
+	my $update_every = 75;
+	my @connections = ();
+
+	for (1..$nclients)
+	{
+		my $conn = $node->background_psql('postgres');
+		$conn->query_safe("BEGIN");
+
+		push(@connections, $conn);
+	}
+
+	for (my $i = 0; $i < 1000; $i++)
+	{
+		my $conn = $connections[$i % $nclients];
+
+		$conn->query_safe("COMMIT;");
+		$conn->query_safe("BEGIN");
+
+		if ($i % $update_every == 0)
+		{
+			$conn->query_safe(
+				"UPDATE ${tbl} SET " .
+				"N_UPDATED = N_UPDATED + 1 " .
+				"WHERE I = ${i} % 50");
+		}
+		else
+		{
+			$conn->query_safe(
+				"SELECT * FROM ${tbl} FOR KEY SHARE");
+		}
+	}
+
+	for my $conn (@connections)
+	{
+		$conn->quit();
+	}
+
+	$node->stop;
+}
+
+sub next_mxoff
+{
+	my $node = shift;
+	my ($stdout, $stderr) =
+	  run_command([ 'pg_controldata', $node->data_dir ]);
+	my @control_data = split("\n", $stdout);
+	my $next_mxoff = undef;
+
+	foreach (@control_data)
+	{
+		if ($_ =~ /^Latest checkpoint's NextMultiOffset:\s*(.*)$/mg)
+		{
+			$next_mxoff = $1;
+			last;
+		}
+	}
+	die "NextMultiOffset not found in control file\n"
+		unless defined($next_mxoff);
+
+	return $next_mxoff;
+}
+
+sub reset_mxoff
+{
+	my $node = shift;
+	my $offset = shift;
+	$offset = Math::BigInt->new($offset);
+
+	# Get block size
+	my $out = (run_command([ 'pg_resetwal', '--dry-run', $node->data_dir ]))[0];
+	$out =~ /^Database block size: *(\d+)$/m or die;
+	my $blcksz = $1;
+
+	# Reset to new offset
+	my @cmd = ('pg_resetwal', '--pgdata' => $node->data_dir);
+	push @cmd, '--multixact-offset' => $offset->as_hex();
+	command_ok(\@cmd, 'set next multixact-offset');
+
+	# Fill empty pg_multixact/members segment
+	my $mult = 32 * int($blcksz / 20) * 4;
+	my $segname = sprintf "%015X", $offset / $mult;
+
+	my @dd = ('dd');
+	push @dd, "if=/dev/zero";
+	push @dd, "of=" . $node->data_dir . "/pg_multixact/members/" . $segname;
+	push @dd, "bs=$blcksz";
+	push @dd, "count=32";
+	command_ok(\@dd, 'fill empty multixact-members');
+}
+
+my ($off1, $off2);
+
+# start from defaults
+my $node1 = PostgreSQL::Test::Cluster->new('node1');
+$node1->init;
+$off1 = next_mxoff($node1);
+mxact_eater($node1, "FOO");
+$off2 = next_mxoff($node1);
+note "> start from $off1, finished at $off2\n";
+
+# start from before 32-bit wraparound
+my $node2 = PostgreSQL::Test::Cluster->new('node2');
+$node2->init;
+reset_mxoff($node2, 0xFFFF0000);
+$off1 = next_mxoff($node2);
+mxact_eater($node2, "FOO");
+$off2 = next_mxoff($node2);
+note "> start from $off1, finished at $off2\n";
+
+# start near 32-bit wraparound
+my $node3 = PostgreSQL::Test::Cluster->new('node3');
+$node3->init;
+reset_mxoff($node3, 0xFFFFEC77);
+$off1 = next_mxoff($node3);
+mxact_eater($node3, "FOO");
+$off2 = next_mxoff($node3);
+note "> start from $off1, finished at $off2\n";
+
+# start over 32-bit wraparound
+my $node4 = PostgreSQL::Test::Cluster->new('node4');
+$node4->init;
+reset_mxoff($node4, '0xFFFFFFFF0000');
+$off1 = next_mxoff($node4);
+mxact_eater($node4, "FOO");
+$off2 = next_mxoff($node4);
+note "> start from $off1, finished at $off2\n";
+
+# check invariant
+$node1->start;
+$node2->start;
+$node3->start;
+$node4->start;
+
+my $var1 = $node1->safe_psql('postgres', 'TABLE FOO');
+my $var2 = $node2->safe_psql('postgres', 'TABLE FOO');
+my $var3 = $node3->safe_psql('postgres', 'TABLE FOO');
+my $var4 = $node4->safe_psql('postgres', 'TABLE FOO');
+ok($var1 eq $var2 && $var2 eq $var3 && $var3 eq $var4,
+	'check table invariant in all nodes');
+
+$node4->stop;
+$node3->stop;
+$node2->stop;
+$node1->stop;
+
+done_testing();
-- 
2.47.3
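
A note on the `32 * int($blcksz / 20) * 4` expression in reset_mxoff above: it is the number of multixact members stored in one members segment, which the test uses to pick the segment file to pre-fill. The arithmetic can be sketched as below. This is a hedged illustration only. The sketch_* names are hypothetical, and the layout constants (4 members per group, 20 bytes per group, 32 pages per segment) are taken from the expression in the test rather than from any header included here.

```c
#include <stdint.h>

#define SKETCH_PAGES_PER_SEGMENT 32	/* pages in one SLRU segment */
#define SKETCH_MEMBERS_PER_GROUP 4	/* members stored in one group */
#define SKETCH_GROUP_SIZE 20		/* 4 flag bytes + 4 * 4-byte XIDs */

/* Number of members that fit into one pg_multixact/members segment. */
static uint64_t
sketch_members_per_segment(uint32_t blcksz)
{
	uint64_t	groups_per_page = blcksz / SKETCH_GROUP_SIZE;

	return SKETCH_PAGES_PER_SEGMENT * groups_per_page *
		SKETCH_MEMBERS_PER_GROUP;
}

/* Segment number that contains the given member offset. */
static uint64_t
sketch_members_segno(uint64_t offset, uint32_t blcksz)
{
	return offset / sketch_members_per_segment(blcksz);
}
```

With 8192-byte blocks this gives 52352 members per segment, so an offset of 0xFFFF0000 falls into segment 82038 (0x14076).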

v24-0007-TEST-Add-test-for-wraparound-of-next-new-multi-i.patch (text/x-patch; charset=UTF-8)
From 0164a86224716ad7f7986163b1927574526d739c Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Wed, 12 Nov 2025 13:36:09 +0200
Subject: [PATCH v24 7/8] TEST: Add test for wraparound of next new multi in
 pg_upgrade

Related to BUG #18863 and BUG #18865
---
 src/bin/pg_upgrade/meson.build         |   1 +
 src/bin/pg_upgrade/t/007_multi_wrap.pl | 176 +++++++++++++++++++++++++
 2 files changed, 177 insertions(+)
 create mode 100644 src/bin/pg_upgrade/t/007_multi_wrap.pl

diff --git a/src/bin/pg_upgrade/meson.build b/src/bin/pg_upgrade/meson.build
index 3e46c4512cf..ca87ae221ce 100644
--- a/src/bin/pg_upgrade/meson.build
+++ b/src/bin/pg_upgrade/meson.build
@@ -50,6 +50,7 @@ tests += {
       't/004_subscription.pl',
       't/005_char_signedness.pl',
       't/006_transfer_modes.pl',
+      't/007_multi_wrap.pl',
     ],
     'test_kwargs': {'priority': 40}, # pg_upgrade tests are slow
   },
diff --git a/src/bin/pg_upgrade/t/007_multi_wrap.pl b/src/bin/pg_upgrade/t/007_multi_wrap.pl
new file mode 100644
index 00000000000..0ad8fd59906
--- /dev/null
+++ b/src/bin/pg_upgrade/t/007_multi_wrap.pl
@@ -0,0 +1,176 @@
+
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+use strict;
+use warnings FATAL => 'all';
+
+use Math::BigInt;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use PostgreSQL::Test::AdjustDump;
+use PostgreSQL::Test::AdjustUpgrade;
+use Test::More;
+
+# Temp dir for dumps.
+my $tempdir = PostgreSQL::Test::Utils::tempdir;
+
+# Can be changed to test the other modes.
+my $mode = $ENV{PG_TEST_PG_UPGRADE_MODE} || '--copy';
+
+# Handy pg_resetwal wrapper
+sub reset_mxoff
+{
+	my %args = @_;
+
+	my $node = $args{node};
+	my $offset = $args{offset};
+	my $multi = $args{multi};
+	my $blcksz = sub # Get block size
+	{
+		my $out = (run_command([ 'pg_resetwal', '--dry-run',
+								 $node->data_dir ]))[0];
+		$out =~ /^Database block size: *(\d+)$/m or die;
+		return $1;
+	}->();
+
+	my @cmd;
+
+	# Reset cluster
+	@cmd = ('pg_resetwal', '--pgdata' => $node->data_dir);
+	if (defined($offset))
+	{
+		push @cmd, '--multixact-offset' => $offset;
+	}
+	if (defined($multi))
+	{
+		push @cmd, "--multixact-ids=$multi,$multi";
+	}
+	command_ok(\@cmd, 'reset multi/offset');
+
+	my $n_items;
+	my $segname;
+
+	# Fill empty pg_multixact segments
+	if (defined($offset))
+	{
+		$n_items = 32 * int($blcksz / 20) * 4;
+		$segname = sprintf "%015X", ($offset / $n_items);
+		$segname = $node->data_dir . "/pg_multixact/members/" . $segname;
+
+		@cmd = ('dd');
+		push @cmd, "if=/dev/zero";
+		push @cmd, "of=" . $segname;
+		push @cmd, "bs=$blcksz";
+		push @cmd, "count=32";
+		command_ok(\@cmd, 'fill empty multixact-members');
+	}
+
+	if (defined($multi))
+	{
+		$n_items = 32 * int($blcksz / 8);
+		$segname = sprintf "%04X", $multi / $n_items;
+		$segname = $node->data_dir . "/pg_multixact/offsets/" . $segname;
+
+		@cmd = ('dd');
+		push @cmd, "if=/dev/zero";
+		push @cmd, "of=" . $segname;
+		push @cmd, "bs=$blcksz";
+		push @cmd, "count=32";
+		command_ok(\@cmd, 'fill empty multixact-offsets');
+	}
+}
+
+sub get_dump_for_comparison
+{
+	my ($node, $db, $file_prefix, $adjust_child_columns) = @_;
+
+	my $dumpfile = $tempdir . '/' . $file_prefix . '.sql';
+	my $dump_adjusted = "${dumpfile}_adjusted";
+
+	open(my $dh, '>', $dump_adjusted)
+	  || die "could not open $dump_adjusted for writing $!";
+
+	$node->run_log(
+		[
+			'pg_dump', '--no-sync',
+			'--restrict-key' => 'test',
+			'-d' => $node->connstr($db),
+			'-f' => $dumpfile
+		]);
+
+	print $dh adjust_regress_dumpfile(slurp_file($dumpfile),
+		$adjust_child_columns);
+	close($dh);
+
+	return $dump_adjusted;
+}
+
+# Create old node
+my $old = PostgreSQL::Test::Cluster->new("old");
+$old->init;
+reset_mxoff(node => $old, multi => 4294967295, offset => 429496729);
+
+$old->start;
+$old->safe_psql('postgres',
+qq(
+	CREATE TABLE test_table (id integer NOT NULL PRIMARY KEY, val text);
+	INSERT INTO test_table VALUES (1, 'a');
+));
+
+my $conn1 = $old->background_psql('postgres');
+my $conn2 = $old->background_psql('postgres');
+
+$conn1->query_safe(qq(
+	BEGIN;
+	SELECT * FROM test_table WHERE id = 1 FOR SHARE;
+));
+$conn2->query_safe(qq(
+	BEGIN;
+	SELECT * FROM test_table WHERE id = 1 FOR SHARE;
+));
+
+$conn1->query_safe(qq(COMMIT;));
+$conn2->query_safe(qq(COMMIT;));
+
+$conn1->quit;
+$conn2->quit;
+
+$old->stop;
+
+# Create new node
+my $new = PostgreSQL::Test::Cluster->new("new");
+$new->init;
+
+# Run pg_upgrade
+command_ok(
+	[
+		'pg_upgrade', '--no-sync',
+		'--old-datadir' => $old->data_dir,
+		'--new-datadir' => $new->data_dir,
+		'--old-bindir' => $old->config_data('--bindir'),
+		'--new-bindir' => $new->config_data('--bindir'),
+		'--socketdir' => $new->host,
+		'--old-port' => $old->port,
+		'--new-port' => $new->port,
+		$mode,
+	],
+	'run of pg_upgrade for new instance');
+ok( !-d $new->data_dir . "/pg_upgrade_output.d",
+	"pg_upgrade_output.d/ removed after pg_upgrade success");
+
+$old->start;
+my $src_dump =
+	get_dump_for_comparison($old, 'postgres',
+							"oldnode_1_dump", 0);
+$old->stop;
+
+$new->start;
+my $dst_dump =
+	get_dump_for_comparison($new, 'postgres',
+							"newnode_1_dump", 0);
+$new->stop;
+
+compare_files($src_dump, $dst_dump,
+	'dump outputs from original and restored regression databases match');
+
+done_testing();
-- 
2.47.3

v24-0008-TEST-Add-test-for-64-bit-mxoff-in-pg_upgrade.patch (text/x-patch; charset=UTF-8)
From 8ee659426acf73d3f0aef295000eac2531bfaaec Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Wed, 29 Oct 2025 14:19:56 +0300
Subject: [PATCH v24 8/8] TEST: Add test for 64-bit mxoff in pg_upgrade

---
 src/bin/pg_upgrade/meson.build    |   1 +
 src/bin/pg_upgrade/t/008_mxoff.pl | 463 ++++++++++++++++++++++++++++++
 2 files changed, 464 insertions(+)
 create mode 100644 src/bin/pg_upgrade/t/008_mxoff.pl

diff --git a/src/bin/pg_upgrade/meson.build b/src/bin/pg_upgrade/meson.build
index ca87ae221ce..7f14e5c463c 100644
--- a/src/bin/pg_upgrade/meson.build
+++ b/src/bin/pg_upgrade/meson.build
@@ -51,6 +51,7 @@ tests += {
       't/005_char_signedness.pl',
       't/006_transfer_modes.pl',
       't/007_multi_wrap.pl',
+      't/008_mxoff.pl',
     ],
     'test_kwargs': {'priority': 40}, # pg_upgrade tests are slow
   },
diff --git a/src/bin/pg_upgrade/t/008_mxoff.pl b/src/bin/pg_upgrade/t/008_mxoff.pl
new file mode 100644
index 00000000000..7204325f873
--- /dev/null
+++ b/src/bin/pg_upgrade/t/008_mxoff.pl
@@ -0,0 +1,463 @@
+
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+use strict;
+use warnings FATAL => 'all';
+
+use Math::BigInt;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use PostgreSQL::Test::AdjustDump;
+use PostgreSQL::Test::AdjustUpgrade;
+use Test::More;
+
+# This test exercises different multixact states, in a manner similar to
+# 002_pg_upgrade.pl.
+
+unless (defined($ENV{oldinstall}))
+{
+	plan skip_all =>
+		'set the oldinstall environment variable to a pre-64-bit-mxoff installation to run this test';
+}
+
+# Temp dir for dumps.
+my $tempdir = PostgreSQL::Test::Utils::tempdir;
+
+# Can be changed to test the other modes.
+my $mode = $ENV{PG_TEST_PG_UPGRADE_MODE} || '--copy';
+
+sub utility_path
+{
+	my $node = shift;
+	my $name = shift;
+
+	my $bin_path = defined($node->install_path) ?
+		$node->install_path . "/bin/$name" : $name;
+
+	return $bin_path;
+}
+
+# Get NextMultiOffset.
+sub next_mxoff
+{
+	my $node = shift;
+
+	my $pg_controldata_path = utility_path($node, 'pg_controldata');
+	my ($stdout, $stderr) = run_command([ $pg_controldata_path,
+											$node->data_dir ]);
+	my @control_data = split("\n", $stdout);
+	my $next_mxoff = undef;
+
+	foreach (@control_data)
+	{
+		if ($_ =~ /^Latest checkpoint's NextMultiOffset:\s*(.*)$/mg)
+		{
+			$next_mxoff = $1;
+			last;
+		}
+	}
+	die "NextMultiOffset not found in control file\n"
+		unless defined($next_mxoff);
+
+	return $next_mxoff;
+}
+
+# Consume around 10k of mxoffsets.
+sub mxact_eater
+{
+	my $node = shift;
+	my $tbl = 'FOO';
+
+	my ($mxoff1, $mxoff2);
+
+	$mxoff1 = next_mxoff($node);
+	$node->start;
+	$node->safe_psql('postgres',
+		"CREATE TABLE ${tbl} (I INT PRIMARY KEY, N_UPDATED INT) " .
+		"       WITH (AUTOVACUUM_ENABLED=FALSE);" .
+		"INSERT INTO ${tbl} SELECT G, 0 FROM GENERATE_SERIES(1, 50) G;");
+
+	# consume around 10k mxoff
+	my $nclients = 10;
+	my $update_every = 75;
+	my @connections = ();
+
+	for (1..$nclients)
+	{
+		my $conn = $node->background_psql('postgres');
+		$conn->query_safe("BEGIN");
+
+		push(@connections, $conn);
+	}
+
+	for (my $i = 0; $i < 1000; $i++)
+	{
+		my $conn = $connections[$i % $nclients];
+
+		$conn->query_safe("COMMIT;");
+		$conn->query_safe("BEGIN");
+
+		if ($i % $update_every == 0)
+		{
+			$conn->query_safe(
+				"UPDATE ${tbl} SET " .
+				"N_UPDATED = N_UPDATED + 1 " .
+				"WHERE I = ${i} % 50");
+		}
+		else
+		{
+			$conn->query_safe(
+				"SELECT * FROM ${tbl} FOR KEY SHARE");
+		}
+	}
+
+	for my $conn (@connections)
+	{
+		$conn->quit();
+	}
+
+	$node->stop;
+	$mxoff2 = next_mxoff($node);
+
+	return $mxoff1, $mxoff2;
+}
+
+# Consume around 1M of mxoffsets.
+sub mxact_huge_eater
+{
+	my $node = shift;
+	my $tbl = 'FOO';
+
+	my ($mxoff1, $mxoff2);
+
+	$mxoff1 = next_mxoff($node);
+	$node->start;
+	$node->safe_psql('postgres',
+		"CREATE TABLE ${tbl} (I INT PRIMARY KEY, N_UPDATED INT) " .
+		"       WITH (AUTOVACUUM_ENABLED=FALSE);" .
+		"INSERT INTO ${tbl} SELECT G, 0 FROM GENERATE_SERIES(1, 4) G;");
+
+	my $nclients = 100;
+	my @connections = ();
+	my $timeout = 10 * $PostgreSQL::Test::Utils::timeout_default;
+
+	for (1..$nclients)
+	{
+		my $conn = $node->background_psql('postgres',
+										  timeout => $timeout);
+		$conn->query_safe("BEGIN");
+
+		push(@connections, $conn);
+	}
+
+	# It's a long process, better to tell about progress.
+	my $n_steps = 100_000;
+	my $step = int($n_steps / 10);
+
+	diag "\nstart to consume mxoffsets ...\n";
+	for (my $i = 0; $i < $n_steps; $i++)
+	{
+		my $conn = $connections[$i % $nclients];
+
+		$conn->query_safe("COMMIT;");
+		$conn->query_safe("BEGIN");
+
+		{
+			$conn->query_safe(
+				"SELECT * FROM ${tbl} " .
+				"FOR KEY SHARE");
+		}
+
+		if ($i % $step == 0)
+		{
+			my $done = int(($i / $n_steps) * 100);
+			diag "$done% done...";
+		}
+	}
+
+	for my $conn (@connections)
+	{
+		$conn->quit();
+	}
+
+	$node->stop;
+	$mxoff2 = next_mxoff($node);
+
+	return $mxoff1, $mxoff2;
+}
+
+# Set oldest multixact-offset
+sub reset_mxoff
+{
+	my $node = shift;
+	my $offset = shift;
+
+	my $pg_resetwal_path = utility_path($node, 'pg_resetwal');
+	# Get block size
+	my $out = (run_command([ $pg_resetwal_path, '--dry-run',
+							 $node->data_dir ]))[0];
+	$out =~ /^Database block size: *(\d+)$/m or die;
+	my $blcksz = $1;
+
+	# Reset to new offset
+	my @cmd = ($pg_resetwal_path, '--pgdata' => $node->data_dir);
+	push @cmd, '--multixact-offset' => $offset;
+	command_ok(\@cmd, 'set oldest multixact-offset');
+
+	# Fill empty pg_multixact/members segment
+	my $mult = 32 * int($blcksz / 20) * 4;
+	my $segname = sprintf "%04X", $offset / $mult;
+
+	my @dd = ('dd');
+	push @dd, "if=/dev/zero";
+	push @dd, "of=" . $node->data_dir . "/pg_multixact/members/" . $segname;
+	push @dd, "bs=$blcksz";
+	push @dd, "count=32";
+	command_ok(\@dd, 'fill empty multixact-members');
+}
+
+sub get_dump_for_comparison
+{
+	my ($node, $db, $file_prefix, $adjust_child_columns) = @_;
+
+	my $dumpfile = $tempdir . '/' . $file_prefix . '.sql';
+	my $dump_adjusted = "${dumpfile}_adjusted";
+
+	open(my $dh, '>', $dump_adjusted)
+	  || die "could not open $dump_adjusted for writing $!";
+
+	my $pg_dump_path = utility_path($node, 'pg_dump');
+
+	$node->run_log(
+		[
+			$pg_dump_path, '--no-sync',
+			'--restrict-key' => 'test',
+			'-d' => $node->connstr($db),
+			'-f' => $dumpfile
+		]);
+
+	print $dh adjust_regress_dumpfile(slurp_file($dumpfile),
+		$adjust_child_columns);
+	close($dh);
+
+	return $dump_adjusted;
+}
+
+# Main test workhorse routine.
+# Run pg_upgrade, dump data from both nodes and compare the dumps.
+sub run_test
+{
+	my $tag = shift;
+	my $oldnode = shift;
+	my $newnode = shift;
+
+	my $pg_upgrade_path = utility_path($newnode, 'pg_upgrade');
+
+	command_ok(
+		[
+			$pg_upgrade_path, '--no-sync',
+			'--old-datadir' => $oldnode->data_dir,
+			'--new-datadir' => $newnode->data_dir,
+			'--old-bindir' => $oldnode->config_data('--bindir'),
+			'--new-bindir' => $newnode->config_data('--bindir'),
+			'--socketdir' => $newnode->host,
+			'--old-port' => $oldnode->port,
+			'--new-port' => $newnode->port,
+			$mode,
+		],
+		'run of pg_upgrade for new instance');
+	ok( !-d $newnode->data_dir . "/pg_upgrade_output.d",
+		"pg_upgrade_output.d/ removed after pg_upgrade success");
+
+	$oldnode->start;
+	my $src_dump =
+		get_dump_for_comparison($oldnode, 'postgres',
+								"oldnode_${tag}_dump", 0);
+	$oldnode->stop;
+
+	$newnode->start;
+	my $dst_dump =
+		get_dump_for_comparison($newnode, 'postgres',
+								"newnode_${tag}_dump", 0);
+	$newnode->stop;
+
+	compare_files($src_dump, $dst_dump,
+		'dump outputs from original and restored regression databases match');
+}
+
+sub to_hex
+{
+	my $arg = shift;
+
+	$arg = Math::BigInt->new($arg);
+	$arg = $arg->as_hex();
+
+	return $arg;
+}
+
+# case #1: start old node from defaults
+{
+	my $tag = 1;
+	my $old =
+		PostgreSQL::Test::Cluster->new("oldnode${tag}",
+									   install_path => $ENV{oldinstall});
+	$old->init(extra => ['-k']);
+
+	my ($start_mxoff, $finish_mxoff) = mxact_eater($old);
+
+	my $new = PostgreSQL::Test::Cluster->new("newnode${tag}");
+	$new->init;
+
+	run_test($tag, $old, $new);
+
+	$start_mxoff = to_hex($start_mxoff);
+	$finish_mxoff = to_hex($finish_mxoff);
+
+	my $next_mxoff = to_hex(next_mxoff($new));
+
+	note ">>> case #${tag}\n" .
+		 " oldnode mxoff from ${start_mxoff} to ${finish_mxoff}\n" .
+		 " newnode mxoff ${next_mxoff}\n";
+}
+
+# case #2: start old node from before 32-bit wraparound
+{
+	my $tag = 2;
+	my $old =
+		PostgreSQL::Test::Cluster->new("oldnode${tag}",
+									   install_path => $ENV{oldinstall});
+
+	$old->init(extra => ['-k']);
+	reset_mxoff($old, 0xFFFF0000);
+
+	my ($start_mxoff, $finish_mxoff) = mxact_eater($old);
+
+	my $new = PostgreSQL::Test::Cluster->new("newnode${tag}");
+	$new->init;
+
+	run_test($tag, $old, $new);
+
+	$start_mxoff = to_hex($start_mxoff);
+	$finish_mxoff = to_hex($finish_mxoff);
+
+	my $next_mxoff = to_hex(next_mxoff($new));
+
+	note ">>> case #${tag}\n" .
+		 " oldnode mxoff from ${start_mxoff} to ${finish_mxoff}\n" .
+		 " newnode mxoff ${next_mxoff}\n";
+}
+
+# case #3: start old node near 32-bit wraparound and reach wraparound state.
+{
+	my $tag = 3;
+	my $old =
+		PostgreSQL::Test::Cluster->new("oldnode${tag}",
+									   install_path => $ENV{oldinstall});
+
+	$old->init(extra => ['-k']);
+
+	reset_mxoff($old, 0xFFFFEC77);
+	my ($start_mxoff, $finish_mxoff) = mxact_eater($old);
+
+	my $new = PostgreSQL::Test::Cluster->new("newnode${tag}");
+	$new->init;
+
+	run_test($tag, $old, $new);
+
+	$start_mxoff = to_hex($start_mxoff);
+	$finish_mxoff = to_hex($finish_mxoff);
+
+	my $next_mxoff = to_hex(next_mxoff($new));
+
+	note ">>> case #${tag}\n" .
+		 " oldnode mxoff from ${start_mxoff} to ${finish_mxoff}\n" .
+		 " newnode mxoff ${next_mxoff}\n";
+}
+
+# case #4: start old node from defaults
+{
+	my $tag = 4;
+	my $old =
+		PostgreSQL::Test::Cluster->new("oldnode${tag}",
+									   install_path => $ENV{oldinstall});
+
+	$old->init(extra => ['-k']);
+	$old->append_conf("postgresql.conf", "max_connections = 128");
+
+	diag "test #${tag} for multiple mxoff segments";
+	my ($start_mxoff, $finish_mxoff) = mxact_huge_eater($old);
+
+	my $new = PostgreSQL::Test::Cluster->new("newnode${tag}");
+	$new->init;
+
+	run_test($tag, $old, $new);
+
+	$start_mxoff = to_hex($start_mxoff);
+	$finish_mxoff = to_hex($finish_mxoff);
+
+	my $next_mxoff = to_hex(next_mxoff($new));
+
+	note ">>> case #${tag}\n" .
+		 " oldnode mxoff from ${start_mxoff} to ${finish_mxoff}\n" .
+		 " newnode mxoff ${next_mxoff}\n";
+}
+
+# case #5: start old node from before 32-bit wraparound
+{
+	my $tag = 5;
+	my $old =
+		PostgreSQL::Test::Cluster->new("oldnode${tag}",
+									   install_path => $ENV{oldinstall});
+
+	$old->init(extra => ['-k']);
+	$old->append_conf("postgresql.conf", "max_connections = 128");
+	reset_mxoff($old, 0xFF000000);
+
+	diag "test #${tag} for multiple mxoff segments";
+	my ($start_mxoff, $finish_mxoff) = mxact_huge_eater($old);
+
+	my $new = PostgreSQL::Test::Cluster->new("newnode${tag}");
+	$new->init;
+
+	run_test($tag, $old, $new);
+
+	$start_mxoff = to_hex($start_mxoff);
+	$finish_mxoff = to_hex($finish_mxoff);
+
+	my $next_mxoff = to_hex(next_mxoff($new));
+
+	note ">>> case #${tag}\n" .
+		 " oldnode mxoff from ${start_mxoff} to ${finish_mxoff}\n" .
+		 " newnode mxoff ${next_mxoff}\n";
+}
+
+# case #6: start old node near 32-bit wraparound and reach wraparound state.
+{
+	my $tag = 6;
+	my $old =
+		PostgreSQL::Test::Cluster->new("oldnode${tag}",
+									   install_path => $ENV{oldinstall});
+
+	$old->init(extra => ['-k']);
+
+	reset_mxoff($old, 0xFFFFFFFF - 500_000);
+	$old->append_conf("postgresql.conf", "max_connections = 128");
+	my ($start_mxoff, $finish_mxoff) = mxact_huge_eater($old);
+
+	diag "test #${tag} for multiple mxoff segments";
+	my $new = PostgreSQL::Test::Cluster->new("newnode${tag}");
+	$new->init;
+
+	run_test($tag, $old, $new);
+
+	$start_mxoff = to_hex($start_mxoff);
+	$finish_mxoff = to_hex($finish_mxoff);
+
+	my $next_mxoff = to_hex(next_mxoff($new));
+
+	note ">>> case #${tag}\n" .
+		 " oldnode mxoff from ${start_mxoff} to ${finish_mxoff}\n" .
+		 " newnode mxoff ${next_mxoff}\n";
+}
+
+done_testing();
-- 
2.47.3

#58 Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Heikki Linnakangas (#35)
10 attachment(s)
Re: POC: make mxidoff 64 bits

I realized that this issue was still outstanding:

On 01/04/2025 21:25, Heikki Linnakangas wrote:

Thanks! I did some manual testing of this. I created a little helper
function to consume multixids, to test the autovacuum behavior, and
found one issue:

If you consume a lot of multixid members space, by creating lots of
multixids with a huge number of members in each, you can end up with a
very bloated members SLRU, and autovacuum is in no hurry to clean it up.
Here's what I did:

1. Installed attached test module
2. Ran "select consume_multixids(10000, 100000);" many times
3. Ran:

$ du -h data/pg_multixact/members/
26G    data/pg_multixact/members/

When I run "vacuum freeze; select * from pg_database;", I can see that
'datminmxid' for the current database is advanced. However, autovacuum
is in no hurry to vacuum 'template0' and 'template1', so
pg_multixact/members/ does not get truncated. Eventually, when
autovacuum_multixact_freeze_max_age is reached, it presumably will, but
you will run out of disk space before that.
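
To put the 26G figure in perspective, here is a rough back-of-the-envelope
sketch. It assumes the default 8 kB block size, 32 pages per SLRU segment,
and the 20-byte member group (4 flag bytes plus 4 xids) described in
multixact_internal.h. The helper names are made up for illustration, and
"du -h" rounds, so the result is only approximate.

```c
#include <assert.h>
#include <stdint.h>

#define BLCKSZ                  8192    /* assumed default block size */
#define SLRU_PAGES_PER_SEGMENT  32
#define MEMBERGROUP_SIZE        20      /* 4 flag bytes + 4 * 4-byte xids */
#define MEMBERS_PER_GROUP       4

/* Members that fit on one page: whole groups only, the tail stays unused. */
static uint64_t
members_per_page(void)
{
	return (uint64_t) (BLCKSZ / MEMBERGROUP_SIZE) * MEMBERS_PER_GROUP;
}

/* Members that fit in one segment file. */
static uint64_t
members_per_segment(void)
{
	return members_per_page() * SLRU_PAGES_PER_SEGMENT;
}

/* Approximate number of member slots held in 'bytes' of full segments. */
static uint64_t
members_in_bytes(uint64_t bytes)
{
	uint64_t	segment_bytes = (uint64_t) BLCKSZ * SLRU_PAGES_PER_SEGMENT;

	assert(segment_bytes > 0);
	return (bytes / segment_bytes) * members_per_segment();
}
```

Under these assumptions, members_in_bytes(26ULL << 30) comes out at about
5.6 billion member slots, which is already more than a 32-bit offset could
address.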

There is this check for members size at the end of SetOffsetVacuumLimit():

    /*
     * Do we need autovacuum?    If we're not sure, assume yes.
     */
    return !oldestOffsetKnown ||
        (nextOffset - oldestOffset > MULTIXACT_MEMBER_AUTOVAC_THRESHOLD);

And the caller (SetMultiXactIdLimit()) will in fact signal the
autovacuum launcher after "vacuum freeze" because of that. But the
autovacuum launcher will look at the datminmxid / relminmxid values, see
that they are well within autovacuum_multixact_freeze_max_age, and do
nothing.

This is a very extreme case, but clearly the code to signal autovacuum
launcher, and the freeze age cutoff that autovacuum then uses, are not
in sync.
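
The mismatch can be illustrated with a small standalone sketch. The first
function mirrors the members-size check quoted above, using the
4000000000 threshold from the patch. The second is a deliberately
simplified stand-in for the launcher's multixid age test, not the real
autovacuum code. Both function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Value of MULTIXACT_MEMBER_AUTOVAC_THRESHOLD in the patch. */
#define MEMBER_AUTOVAC_THRESHOLD	UINT64_C(4000000000)

/* Same shape as the check at the end of SetOffsetVacuumLimit(). */
static bool
members_need_autovacuum(bool oldest_known, uint64_t next_off,
						uint64_t oldest_off)
{
	assert(!oldest_known || next_off >= oldest_off);
	return !oldest_known ||
		(next_off - oldest_off > MEMBER_AUTOVAC_THRESHOLD);
}

/*
 * Simplified stand-in (an assumption, not the real launcher logic): force
 * a vacuum only once the multixid age exceeds the freeze max age.
 */
static bool
launcher_forces_vacuum(uint32_t next_multi, uint32_t datminmxid,
					   uint32_t freeze_max_age)
{
	return (uint32_t) (next_multi - datminmxid) > freeze_max_age;
}
```

With 5 billion members in use but only a million multixids consumed, the
first check fires while the second, with a freeze max age of 400 million,
does not, so the launcher is woken up and then finds nothing to do.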

This patch removed MultiXactMemberFreezeThreshold(), per my suggestion,
but we threw the baby out with the bathwater. We discussed that in this
thread, but didn't come up with any solution. But ISTM we still need
something like MultiXactMemberFreezeThreshold() to trigger autovacuum
freezing if the members have grown too large.

Here's a new patch version that addresses the above issue. I resurrected
MultiXactMemberFreezeThreshold(), using the same logic as before, just
using pretty arbitrary thresholds of 1 and 2 billion offsets instead of
the safe/danger thresholds derived from MaxMultiOffset. That gives
roughly the same behavior wrt. calculating effective freeze age as before.
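
As a rough illustration of that calculation, here is a standalone sketch.
It is not the code from the attached patch: the function name, the macro
names, and the parameter list are assumptions made for the example, and
only the 1 and 2 billion thresholds come from the description above.

```c
#include <assert.h>
#include <stdint.h>

/* Thresholds from the description: 1 and 2 billion member offsets. */
#define MEMBER_LOW_THRESHOLD	UINT64_C(1000000000)
#define MEMBER_HIGH_THRESHOLD	UINT64_C(2000000000)

/*
 * Effective multixact freeze age, given the number of multixids and
 * members currently in use and the configured freeze max age.
 */
static int
member_freeze_threshold(uint32_t multixacts, uint64_t members,
						int freeze_max_age)
{
	double		fraction;
	uint32_t	victims;
	uint32_t	result;

	assert(freeze_max_age >= 0);

	/* Low member usage: no special action is required. */
	if (members <= MEMBER_LOW_THRESHOLD)
		return freeze_max_age;

	/* How far past the low threshold are we, relative to the high one? */
	fraction = (double) (members - MEMBER_LOW_THRESHOLD) /
		(double) (MEMBER_HIGH_THRESHOLD - MEMBER_LOW_THRESHOLD);

	/* At or past the high threshold the lowest possible age is zero. */
	if (fraction >= 1.0)
		return 0;

	/* Try to get rid of that fraction of the existing multixids. */
	victims = (uint32_t) (multixacts * fraction);
	result = multixacts - victims;

	/* Never make autovacuum less aggressive than the configured age. */
	return result < (uint32_t) freeze_max_age ? (int) result : freeze_max_age;
}
```

For example, with one million multixids and 1.5 billion members the
fraction is 0.5, so the effective freeze age drops to 500000 regardless of
a much larger configured maximum.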

Another change is that I removed the offset-based emergency vacuum
triggering. With 64-bit offsets, we never need to shut down the system
to prevent offset wraparound, so even if the offsets SLRU grows large,
it's not an "emergency" the same way that wraparound is. Consuming lots
of disk space could be a problem, of course, but we can let autovacuum
deal with that at the normal pace, like it deals with bloated tables.
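
The main patch still keeps a cheap sanity check for 64-bit overflow in
GetNewMultiXactId(). A minimal standalone sketch of that unsigned
wraparound test follows. The function name and the boolean return
convention are assumptions for the example, where the patch raises an
ERROR instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Reserve 'nmembers' slots starting at 'next_offset'.  Returns false if
 * the addition would wrap around the 64-bit offset space, otherwise
 * stores the starting offset in *start.
 */
static bool
reserve_members(uint64_t next_offset, int nmembers, uint64_t *start)
{
	assert(nmembers >= 0);

	/* Unsigned addition wraps, so a smaller sum means overflow. */
	if (next_offset + (uint64_t) nmembers < next_offset)
		return false;

	*start = next_offset;
	return true;
}
```

Unlike the 32-bit code, nothing here needs to skip offset zero or compare
against a stop limit. The only failure mode left is the overflow itself.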

The heuristics could surely be made better and/or more configurable, but
I think this is good enough for now.

I included these changes as a separate patch for review purposes, but it
ought to be squashed with the main patch before committing.

- Heikki

Attachments:

v25-0001-Move-pg_multixact-SLRU-page-format-definitions-t.patch (text/x-patch; charset=UTF-8)
From 44cc3b4ade03a5b4a26dac9f7daf53627f4d6ccb Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Wed, 12 Nov 2025 14:19:32 +0200
Subject: [PATCH v25 01/10] Move pg_multixact SLRU page format definitions to
 separate header

---
 src/backend/access/transam/multixact.c  | 119 --------------------
 src/include/access/multixact_internal.h | 140 ++++++++++++++++++++++++
 2 files changed, 140 insertions(+), 119 deletions(-)
 create mode 100644 src/include/access/multixact_internal.h

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 9d5f130af7e..acb2a6788f9 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -89,125 +89,6 @@
 #include "utils/memutils.h"
 
 
-/*
- * Defines for MultiXactOffset page sizes.  A page is the same BLCKSZ as is
- * used everywhere else in Postgres.
- *
- * Note: because MultiXactOffsets are 32 bits and wrap around at 0xFFFFFFFF,
- * MultiXact page numbering also wraps around at
- * 0xFFFFFFFF/MULTIXACT_OFFSETS_PER_PAGE, and segment numbering at
- * 0xFFFFFFFF/MULTIXACT_OFFSETS_PER_PAGE/SLRU_PAGES_PER_SEGMENT.  We need
- * take no explicit notice of that fact in this module, except when comparing
- * segment and page numbers in TruncateMultiXact (see
- * MultiXactOffsetPagePrecedes).
- */
-
-/* We need four bytes per offset */
-#define MULTIXACT_OFFSETS_PER_PAGE (BLCKSZ / sizeof(MultiXactOffset))
-
-static inline int64
-MultiXactIdToOffsetPage(MultiXactId multi)
-{
-	return multi / MULTIXACT_OFFSETS_PER_PAGE;
-}
-
-static inline int
-MultiXactIdToOffsetEntry(MultiXactId multi)
-{
-	return multi % MULTIXACT_OFFSETS_PER_PAGE;
-}
-
-static inline int64
-MultiXactIdToOffsetSegment(MultiXactId multi)
-{
-	return MultiXactIdToOffsetPage(multi) / SLRU_PAGES_PER_SEGMENT;
-}
-
-/*
- * The situation for members is a bit more complex: we store one byte of
- * additional flag bits for each TransactionId.  To do this without getting
- * into alignment issues, we store four bytes of flags, and then the
- * corresponding 4 Xids.  Each such 5-word (20-byte) set we call a "group", and
- * are stored as a whole in pages.  Thus, with 8kB BLCKSZ, we keep 409 groups
- * per page.  This wastes 12 bytes per page, but that's OK -- simplicity (and
- * performance) trumps space efficiency here.
- *
- * Note that the "offset" macros work with byte offset, not array indexes, so
- * arithmetic must be done using "char *" pointers.
- */
-/* We need eight bits per xact, so one xact fits in a byte */
-#define MXACT_MEMBER_BITS_PER_XACT			8
-#define MXACT_MEMBER_FLAGS_PER_BYTE			1
-#define MXACT_MEMBER_XACT_BITMASK	((1 << MXACT_MEMBER_BITS_PER_XACT) - 1)
-
-/* how many full bytes of flags are there in a group? */
-#define MULTIXACT_FLAGBYTES_PER_GROUP		4
-#define MULTIXACT_MEMBERS_PER_MEMBERGROUP	\
-	(MULTIXACT_FLAGBYTES_PER_GROUP * MXACT_MEMBER_FLAGS_PER_BYTE)
-/* size in bytes of a complete group */
-#define MULTIXACT_MEMBERGROUP_SIZE \
-	(sizeof(TransactionId) * MULTIXACT_MEMBERS_PER_MEMBERGROUP + MULTIXACT_FLAGBYTES_PER_GROUP)
-#define MULTIXACT_MEMBERGROUPS_PER_PAGE (BLCKSZ / MULTIXACT_MEMBERGROUP_SIZE)
-#define MULTIXACT_MEMBERS_PER_PAGE	\
-	(MULTIXACT_MEMBERGROUPS_PER_PAGE * MULTIXACT_MEMBERS_PER_MEMBERGROUP)
-
-/*
- * Because the number of items per page is not a divisor of the last item
- * number (member 0xFFFFFFFF), the last segment does not use the maximum number
- * of pages, and moreover the last used page therein does not use the same
- * number of items as previous pages.  (Another way to say it is that the
- * 0xFFFFFFFF member is somewhere in the middle of the last page, so the page
- * has some empty space after that item.)
- *
- * This constant is the number of members in the last page of the last segment.
- */
-#define MAX_MEMBERS_IN_LAST_MEMBERS_PAGE \
-		((uint32) ((0xFFFFFFFF % MULTIXACT_MEMBERS_PER_PAGE) + 1))
-
-/* page in which a member is to be found */
-static inline int64
-MXOffsetToMemberPage(MultiXactOffset offset)
-{
-	return offset / MULTIXACT_MEMBERS_PER_PAGE;
-}
-
-static inline int64
-MXOffsetToMemberSegment(MultiXactOffset offset)
-{
-	return MXOffsetToMemberPage(offset) / SLRU_PAGES_PER_SEGMENT;
-}
-
-/* Location (byte offset within page) of flag word for a given member */
-static inline int
-MXOffsetToFlagsOffset(MultiXactOffset offset)
-{
-	MultiXactOffset group = offset / MULTIXACT_MEMBERS_PER_MEMBERGROUP;
-	int			grouponpg = group % MULTIXACT_MEMBERGROUPS_PER_PAGE;
-	int			byteoff = grouponpg * MULTIXACT_MEMBERGROUP_SIZE;
-
-	return byteoff;
-}
-
-static inline int
-MXOffsetToFlagsBitShift(MultiXactOffset offset)
-{
-	int			member_in_group = offset % MULTIXACT_MEMBERS_PER_MEMBERGROUP;
-	int			bshift = member_in_group * MXACT_MEMBER_BITS_PER_XACT;
-
-	return bshift;
-}
-
-/* Location (byte offset within page) of TransactionId of given member */
-static inline int
-MXOffsetToMemberOffset(MultiXactOffset offset)
-{
-	int			member_in_group = offset % MULTIXACT_MEMBERS_PER_MEMBERGROUP;
-
-	return MXOffsetToFlagsOffset(offset) +
-		MULTIXACT_FLAGBYTES_PER_GROUP +
-		member_in_group * sizeof(TransactionId);
-}
-
 /* Multixact members wraparound thresholds. */
 #define MULTIXACT_MEMBER_SAFE_THRESHOLD		(MaxMultiXactOffset / 2)
 #define MULTIXACT_MEMBER_DANGER_THRESHOLD	\
diff --git a/src/include/access/multixact_internal.h b/src/include/access/multixact_internal.h
new file mode 100644
index 00000000000..9b56deaef31
--- /dev/null
+++ b/src/include/access/multixact_internal.h
@@ -0,0 +1,140 @@
+/*
+ * multixact_internal.h
+ *
+ * PostgreSQL multi-transaction-log manager internal declarations
+ *
+ * These functions and definitions are for dealing with pg_multixact pages.
+ * They are internal to multixact.c, but they are exported here to allow
+ * pg_upgrade to write pg_multixact files directly.
+ *
+ * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/multixact_internal.h
+ */
+#ifndef MULTIXACT_INTERNAL_H
+#define MULTIXACT_INTERNAL_H
+
+#include "access/multixact.h"
+
+
+/*
+ * Defines for MultiXactOffset page sizes.  A page is the same BLCKSZ as is
+ * used everywhere else in Postgres.
+ *
+ * Note: because MultiXactOffsets are 32 bits and wrap around at 0xFFFFFFFF,
+ * MultiXact page numbering also wraps around at
+ * 0xFFFFFFFF/MULTIXACT_OFFSETS_PER_PAGE, and segment numbering at
+ * 0xFFFFFFFF/MULTIXACT_OFFSETS_PER_PAGE/SLRU_PAGES_PER_SEGMENT.  We need
+ * take no explicit notice of that fact in this module, except when comparing
+ * segment and page numbers in TruncateMultiXact (see
+ * MultiXactOffsetPagePrecedes).
+ */
+
+/* We need four bytes per offset */
+#define MULTIXACT_OFFSETS_PER_PAGE (BLCKSZ / sizeof(MultiXactOffset))
+
+static inline int64
+MultiXactIdToOffsetPage(MultiXactId multi)
+{
+	return multi / MULTIXACT_OFFSETS_PER_PAGE;
+}
+
+static inline int
+MultiXactIdToOffsetEntry(MultiXactId multi)
+{
+	return multi % MULTIXACT_OFFSETS_PER_PAGE;
+}
+
+static inline int64
+MultiXactIdToOffsetSegment(MultiXactId multi)
+{
+	return MultiXactIdToOffsetPage(multi) / SLRU_PAGES_PER_SEGMENT;
+}
+
+/*
+ * The situation for members is a bit more complex: we store one byte of
+ * additional flag bits for each TransactionId.  To do this without getting
+ * into alignment issues, we store four bytes of flags, and then the
+ * corresponding 4 Xids.  Each such 5-word (20-byte) set we call a "group", and
+ * are stored as a whole in pages.  Thus, with 8kB BLCKSZ, we keep 409 groups
+ * per page.  This wastes 12 bytes per page, but that's OK -- simplicity (and
+ * performance) trumps space efficiency here.
+ *
+ * Note that the "offset" macros work with byte offset, not array indexes, so
+ * arithmetic must be done using "char *" pointers.
+ */
+/* We need eight bits per xact, so one xact fits in a byte */
+#define MXACT_MEMBER_BITS_PER_XACT			8
+#define MXACT_MEMBER_FLAGS_PER_BYTE			1
+#define MXACT_MEMBER_XACT_BITMASK	((1 << MXACT_MEMBER_BITS_PER_XACT) - 1)
+
+/* how many full bytes of flags are there in a group? */
+#define MULTIXACT_FLAGBYTES_PER_GROUP		4
+#define MULTIXACT_MEMBERS_PER_MEMBERGROUP	\
+	(MULTIXACT_FLAGBYTES_PER_GROUP * MXACT_MEMBER_FLAGS_PER_BYTE)
+/* size in bytes of a complete group */
+#define MULTIXACT_MEMBERGROUP_SIZE \
+	(sizeof(TransactionId) * MULTIXACT_MEMBERS_PER_MEMBERGROUP + MULTIXACT_FLAGBYTES_PER_GROUP)
+#define MULTIXACT_MEMBERGROUPS_PER_PAGE (BLCKSZ / MULTIXACT_MEMBERGROUP_SIZE)
+#define MULTIXACT_MEMBERS_PER_PAGE	\
+	(MULTIXACT_MEMBERGROUPS_PER_PAGE * MULTIXACT_MEMBERS_PER_MEMBERGROUP)
+
+/*
+ * Because the number of items per page is not a divisor of the last item
+ * number (member 0xFFFFFFFF), the last segment does not use the maximum number
+ * of pages, and moreover the last used page therein does not use the same
+ * number of items as previous pages.  (Another way to say it is that the
+ * 0xFFFFFFFF member is somewhere in the middle of the last page, so the page
+ * has some empty space after that item.)
+ *
+ * This constant is the number of members in the last page of the last segment.
+ */
+#define MAX_MEMBERS_IN_LAST_MEMBERS_PAGE \
+		((uint32) ((0xFFFFFFFF % MULTIXACT_MEMBERS_PER_PAGE) + 1))
+
+/* page in which a member is to be found */
+static inline int64
+MXOffsetToMemberPage(MultiXactOffset offset)
+{
+	return offset / MULTIXACT_MEMBERS_PER_PAGE;
+}
+
+static inline int64
+MXOffsetToMemberSegment(MultiXactOffset offset)
+{
+	return MXOffsetToMemberPage(offset) / SLRU_PAGES_PER_SEGMENT;
+}
+
+/* Location (byte offset within page) of flag word for a given member */
+static inline int
+MXOffsetToFlagsOffset(MultiXactOffset offset)
+{
+	MultiXactOffset group = offset / MULTIXACT_MEMBERS_PER_MEMBERGROUP;
+	int			grouponpg = group % MULTIXACT_MEMBERGROUPS_PER_PAGE;
+	int			byteoff = grouponpg * MULTIXACT_MEMBERGROUP_SIZE;
+
+	return byteoff;
+}
+
+static inline int
+MXOffsetToFlagsBitShift(MultiXactOffset offset)
+{
+	int			member_in_group = offset % MULTIXACT_MEMBERS_PER_MEMBERGROUP;
+	int			bshift = member_in_group * MXACT_MEMBER_BITS_PER_XACT;
+
+	return bshift;
+}
+
+/* Location (byte offset within page) of TransactionId of given member */
+static inline int
+MXOffsetToMemberOffset(MultiXactOffset offset)
+{
+	int			member_in_group = offset % MULTIXACT_MEMBERS_PER_MEMBERGROUP;
+
+	return MXOffsetToFlagsOffset(offset) +
+		MULTIXACT_FLAGBYTES_PER_GROUP +
+		member_in_group * sizeof(TransactionId);
+}
+
+#endif							/* MULTIXACT_INTERNAL_H */
-- 
2.47.3

v25-0002-Use-64-bit-multixact-offsets.patch (text/x-patch; charset=UTF-8)
From 70ba1219e7d6fde789fcfb50baeb284e7c313c06 Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Wed, 7 Aug 2024 16:35:22 +0300
Subject: [PATCH v25 02/10] Use 64-bit multixact offsets

Switching to 64-bit multitransaction offsets removes wraparound and the
2^32 limit on their total number.

Author: Maxim Orlov <orlovmg@gmail.com>
Discussion: FIXME
---
 src/backend/access/rmgrdesc/mxactdesc.c   |   4 +-
 src/backend/access/rmgrdesc/xlogdesc.c    |   2 +-
 src/backend/access/transam/multixact.c    | 390 ++++------------------
 src/backend/access/transam/xlog.c         |   2 +-
 src/backend/access/transam/xlogrecovery.c |   2 +-
 src/backend/commands/vacuum.c             |   2 +-
 src/backend/postmaster/autovacuum.c       |   4 +-
 src/bin/pg_controldata/pg_controldata.c   |   2 +-
 src/bin/pg_resetwal/pg_resetwal.c         |  30 +-
 src/bin/pg_resetwal/t/001_basic.pl        |   4 +-
 src/include/access/multixact.h            |   3 -
 src/include/access/multixact_internal.h   |  24 +-
 src/include/c.h                           |   2 +-
 13 files changed, 95 insertions(+), 376 deletions(-)

diff --git a/src/backend/access/rmgrdesc/mxactdesc.c b/src/backend/access/rmgrdesc/mxactdesc.c
index 3ca0582db36..052dd0a4ce5 100644
--- a/src/backend/access/rmgrdesc/mxactdesc.c
+++ b/src/backend/access/rmgrdesc/mxactdesc.c
@@ -65,7 +65,7 @@ multixact_desc(StringInfo buf, XLogReaderState *record)
 		xl_multixact_create *xlrec = (xl_multixact_create *) rec;
 		int			i;
 
-		appendStringInfo(buf, "%u offset %u nmembers %d: ", xlrec->mid,
+		appendStringInfo(buf, "%u offset %" PRIu64 " nmembers %d: ", xlrec->mid,
 						 xlrec->moff, xlrec->nmembers);
 		for (i = 0; i < xlrec->nmembers; i++)
 			out_member(buf, &xlrec->members[i]);
@@ -74,7 +74,7 @@ multixact_desc(StringInfo buf, XLogReaderState *record)
 	{
 		xl_multixact_truncate *xlrec = (xl_multixact_truncate *) rec;
 
-		appendStringInfo(buf, "offsets [%u, %u), members [%u, %u)",
+		appendStringInfo(buf, "offsets [%u, %u), members [%" PRIu64 ", %" PRIu64 ")",
 						 xlrec->startTruncOff, xlrec->endTruncOff,
 						 xlrec->startTruncMemb, xlrec->endTruncMemb);
 	}
diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c
index cd6c2a2f650..441034f5929 100644
--- a/src/backend/access/rmgrdesc/xlogdesc.c
+++ b/src/backend/access/rmgrdesc/xlogdesc.c
@@ -66,7 +66,7 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 		CheckPoint *checkpoint = (CheckPoint *) rec;
 
 		appendStringInfo(buf, "redo %X/%08X; "
-						 "tli %u; prev tli %u; fpw %s; wal_level %s; xid %u:%u; oid %u; multi %u; offset %u; "
+						 "tli %u; prev tli %u; fpw %s; wal_level %s; xid %u:%u; oid %u; multi %u; offset %" PRIu64 "; "
 						 "oldest xid %u in DB %u; oldest multi %u in DB %u; "
 						 "oldest/newest commit timestamp xid: %u/%u; "
 						 "oldest running xid %u; %s",
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index acb2a6788f9..34a745c07be 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -69,6 +69,7 @@
 #include "postgres.h"
 
 #include "access/multixact.h"
+#include "access/multixact_internal.h"
 #include "access/slru.h"
 #include "access/twophase.h"
 #include "access/twophase_rmgr.h"
@@ -89,10 +90,14 @@
 #include "utils/memutils.h"
 
 
-/* Multixact members wraparound thresholds. */
-#define MULTIXACT_MEMBER_SAFE_THRESHOLD		(MaxMultiXactOffset / 2)
-#define MULTIXACT_MEMBER_DANGER_THRESHOLD	\
-	(MaxMultiXactOffset - MaxMultiXactOffset / 4)
+/*
+ * Multixact members warning threshold.
+ *
+ * If the difference between nextOffset and oldestOffset exceeds this value,
+ * we trigger autovacuum in order to release disk space consumed by the
+ * members SLRU.
+ */
+#define MULTIXACT_MEMBER_AUTOVAC_THRESHOLD		UINT64CONST(4000000000)
 
 static inline MultiXactId
 PreviousMultiXactId(MultiXactId multi)
@@ -149,9 +154,6 @@ typedef struct MultiXactStateData
 	MultiXactId multiStopLimit;
 	MultiXactId multiWrapLimit;
 
-	/* support for members anti-wraparound measures */
-	MultiXactOffset offsetStopLimit;	/* known if oldestOffsetKnown */
-
 	/*
 	 * This is used to sleep until a multixact offset is written when we want
 	 * to create the next one.
@@ -282,8 +284,6 @@ static bool MultiXactOffsetPrecedes(MultiXactOffset offset1,
 									MultiXactOffset offset2);
 static void ExtendMultiXactOffset(MultiXactId multi);
 static void ExtendMultiXactMember(MultiXactOffset offset, int nmembers);
-static bool MultiXactOffsetWouldWrap(MultiXactOffset boundary,
-									 MultiXactOffset start, uint32 distance);
 static bool SetOffsetVacuumLimit(bool is_startup);
 static bool find_multixact_start(MultiXactId multi, MultiXactOffset *result);
 static void WriteMTruncateXlogRec(Oid oldestMultiDB,
@@ -1023,90 +1023,22 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 	ExtendMultiXactOffset(result);
 
 	/*
-	 * Reserve the members space, similarly to above.  Also, be careful not to
-	 * return zero as the starting offset for any multixact. See
-	 * GetMultiXactIdMembers() for motivation.
+	 * Reserve the members space, similarly to above.
 	 */
 	nextOffset = MultiXactState->nextOffset;
-	if (nextOffset == 0)
-	{
-		*offset = 1;
-		nmembers++;				/* allocate member slot 0 too */
-	}
-	else
-		*offset = nextOffset;
-
-	/*----------
-	 * Protect against overrun of the members space as well, with the
-	 * following rules:
-	 *
-	 * If we're past offsetStopLimit, refuse to generate more multis.
-	 * If we're close to offsetStopLimit, emit a warning.
-	 *
-	 * Arbitrarily, we start emitting warnings when we're 20 segments or less
-	 * from offsetStopLimit.
-	 *
-	 * Note we haven't updated the shared state yet, so if we fail at this
-	 * point, the multixact ID we grabbed can still be used by the next guy.
-	 *
-	 * Note that there is no point in forcing autovacuum runs here: the
-	 * multixact freeze settings would have to be reduced for that to have any
-	 * effect.
-	 *----------
-	 */
-#define OFFSET_WARN_SEGMENTS	20
-	if (MultiXactState->oldestOffsetKnown &&
-		MultiXactOffsetWouldWrap(MultiXactState->offsetStopLimit, nextOffset,
-								 nmembers))
-	{
-		/* see comment in the corresponding offsets wraparound case */
-		SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER);
-
-		ereport(ERROR,
-				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg("multixact \"members\" limit exceeded"),
-				 errdetail_plural("This command would create a multixact with %u members, but the remaining space is only enough for %u member.",
-								  "This command would create a multixact with %u members, but the remaining space is only enough for %u members.",
-								  MultiXactState->offsetStopLimit - nextOffset - 1,
-								  nmembers,
-								  MultiXactState->offsetStopLimit - nextOffset - 1),
-				 errhint("Execute a database-wide VACUUM in database with OID %u with reduced \"vacuum_multixact_freeze_min_age\" and \"vacuum_multixact_freeze_table_age\" settings.",
-						 MultiXactState->oldestMultiXactDB)));
-	}
 
 	/*
-	 * Check whether we should kick autovacuum into action, to prevent members
-	 * wraparound. NB we use a much larger window to trigger autovacuum than
-	 * just the warning limit. The warning is just a measure of last resort -
-	 * this is in line with GetNewTransactionId's behaviour.
+	 * Offsets are 64-bit integers and will never wrap around.  Firstly, it
+	 * would take an unrealistic amount of time and resources to consume 2^64
+	 * offsets.  Secondly, multixid creation is WAL-logged, so you would run
+	 * out of LSNs before reaching offset wraparound.  Nevertheless, check for
+	 * wraparound as a sanity check.
 	 */
-	if (!MultiXactState->oldestOffsetKnown ||
-		(MultiXactState->nextOffset - MultiXactState->oldestOffset
-		 > MULTIXACT_MEMBER_SAFE_THRESHOLD))
-	{
-		/*
-		 * To avoid swamping the postmaster with signals, we issue the autovac
-		 * request only when crossing a segment boundary. With default
-		 * compilation settings that's roughly after 50k members.  This still
-		 * gives plenty of chances before we get into real trouble.
-		 */
-		if ((MXOffsetToMemberPage(nextOffset) / SLRU_PAGES_PER_SEGMENT) !=
-			(MXOffsetToMemberPage(nextOffset + nmembers) / SLRU_PAGES_PER_SEGMENT))
-			SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER);
-	}
-
-	if (MultiXactState->oldestOffsetKnown &&
-		MultiXactOffsetWouldWrap(MultiXactState->offsetStopLimit,
-								 nextOffset,
-								 nmembers + MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT * OFFSET_WARN_SEGMENTS))
-		ereport(WARNING,
+	if (nextOffset + nmembers < nextOffset)
+		ereport(ERROR,
 				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg_plural("database with OID %u must be vacuumed before %d more multixact member is used",
-							   "database with OID %u must be vacuumed before %d more multixact members are used",
-							   MultiXactState->offsetStopLimit - nextOffset + nmembers,
-							   MultiXactState->oldestMultiXactDB,
-							   MultiXactState->offsetStopLimit - nextOffset + nmembers),
-				 errhint("Execute a database-wide VACUUM in that database with reduced \"vacuum_multixact_freeze_min_age\" and \"vacuum_multixact_freeze_table_age\" settings.")));
+				 errmsg("MultiXact members would wrap around")));
+	*offset = nextOffset;
 
 	ExtendMultiXactMember(nextOffset, nmembers);
 
@@ -1127,8 +1059,7 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 	 * the next iteration.  But note that nextMXact may be InvalidMultiXactId
 	 * or the first value on a segment-beginning page after this routine
 	 * exits, so anyone else looking at the variable must be prepared to deal
-	 * with either case.  Similarly, nextOffset may be zero, but we won't use
-	 * that as the actual start offset of the next multixact.
+	 * with either case.
 	 */
 	(MultiXactState->nextMXact)++;
 
@@ -1136,7 +1067,8 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 
 	LWLockRelease(MultiXactGenLock);
 
-	debug_elog4(DEBUG2, "GetNew: returning %u offset %u", result, *offset);
+	debug_elog4(DEBUG2, "GetNew: returning %u offset %" PRIu64, result,
+				*offset);
 	return result;
 }
 
@@ -1178,7 +1110,6 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
 	MultiXactOffset *offptr;
 	MultiXactOffset offset;
 	int			length;
-	int			truelength;
 	MultiXactId oldestMXact;
 	MultiXactId nextMXact;
 	MultiXactId tmpMXact;
@@ -1277,16 +1208,7 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
 	 * we have just for this; the process in charge will signal the CV as soon
 	 * as it has finished writing the multixact offset.
 	 *
-	 * 3. Because GetNewMultiXactId increments offset zero to offset one to
-	 * handle case #2, there is an ambiguity near the point of offset
-	 * wraparound.  If we see next multixact's offset is one, is that our
-	 * multixact's actual endpoint, or did it end at zero with a subsequent
-	 * increment?  We handle this using the knowledge that if the zero'th
-	 * member slot wasn't filled, it'll contain zero, and zero isn't a valid
-	 * transaction ID so it can't be a multixact member.  Therefore, if we
-	 * read a zero from the members array, just ignore it.
-	 *
-	 * This is all pretty messy, but the mess occurs only in infrequent corner
+	 * This is a little messy, but the mess occurs only in infrequent corner
 	 * cases, so it seems better than holding the MultiXactGenLock for a long
 	 * time on every multixact creation.
 	 */
@@ -1372,6 +1294,9 @@ retry:
 	LWLockRelease(lock);
 	lock = NULL;
 
+	/* A multixid with zero members should not happen */
+	Assert(length > 0);
+
 	/*
 	 * If we slept above, clean up state; it's no longer needed.
 	 */
@@ -1380,7 +1305,6 @@ retry:
 
 	ptr = (MultiXactMember *) palloc(length * sizeof(MultiXactMember));
 
-	truelength = 0;
 	prev_pageno = -1;
 	for (int i = 0; i < length; i++, offset++)
 	{
@@ -1417,37 +1341,27 @@ retry:
 
 		xactptr = (TransactionId *)
 			(MultiXactMemberCtl->shared->page_buffer[slotno] + memberoff);
-
-		if (!TransactionIdIsValid(*xactptr))
-		{
-			/* Corner case 3: we must be looking at unused slot zero */
-			Assert(offset == 0);
-			continue;
-		}
+		Assert(TransactionIdIsValid(*xactptr));
 
 		flagsoff = MXOffsetToFlagsOffset(offset);
 		bshift = MXOffsetToFlagsBitShift(offset);
 		flagsptr = (uint32 *) (MultiXactMemberCtl->shared->page_buffer[slotno] + flagsoff);
 
-		ptr[truelength].xid = *xactptr;
-		ptr[truelength].status = (*flagsptr >> bshift) & MXACT_MEMBER_XACT_BITMASK;
-		truelength++;
+		ptr[i].xid = *xactptr;
+		ptr[i].status = (*flagsptr >> bshift) & MXACT_MEMBER_XACT_BITMASK;
 	}
 
 	LWLockRelease(lock);
 
-	/* A multixid with zero members should not happen */
-	Assert(truelength > 0);
-
 	/*
 	 * Copy the result into the local cache.
 	 */
-	mXactCachePut(multi, truelength, ptr);
+	mXactCachePut(multi, length, ptr);
 
 	debug_elog3(DEBUG2, "GetMembers: no cache for %s",
-				mxid_to_string(multi, truelength, ptr));
+				mxid_to_string(multi, length, ptr));
 	*members = ptr;
-	return truelength;
+	return length;
 }
 
 /*
@@ -1854,7 +1768,7 @@ MultiXactShmemInit(void)
 				  "pg_multixact/members", LWTRANCHE_MULTIXACTMEMBER_BUFFER,
 				  LWTRANCHE_MULTIXACTMEMBER_SLRU,
 				  SYNC_HANDLER_MULTIXACT_MEMBER,
-				  false);
+				  true);
 	/* doesn't call SimpleLruTruncate() or meet criteria for unit tests */
 
 	/* Initialize our shared state struct */
@@ -2031,7 +1945,6 @@ TrimMultiXact(void)
 		slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, nextMXact);
 		offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
 		offptr += entryno;
-
 		MemSet(offptr, 0, BLCKSZ - (entryno * sizeof(MultiXactOffset)));
 
 		MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
@@ -2104,7 +2017,7 @@ MultiXactGetCheckptMulti(bool is_shutdown,
 	LWLockRelease(MultiXactGenLock);
 
 	debug_elog6(DEBUG2,
-				"MultiXact: checkpoint is nextMulti %u, nextOffset %u, oldestMulti %u in DB %u",
+				"MultiXact: checkpoint is nextMulti %u, nextOffset %" PRIu64 ", oldestMulti %u in DB %u",
 				*nextMulti, *nextMultiOffset, *oldestMulti, *oldestMultiDB);
 }
 
@@ -2139,7 +2052,7 @@ void
 MultiXactSetNextMXact(MultiXactId nextMulti,
 					  MultiXactOffset nextMultiOffset)
 {
-	debug_elog4(DEBUG2, "MultiXact: setting next multi to %u offset %u",
+	debug_elog4(DEBUG2, "MultiXact: setting next multi to %u offset %" PRIu64,
 				nextMulti, nextMultiOffset);
 	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
 	MultiXactState->nextMXact = nextMulti;
@@ -2330,7 +2243,7 @@ MultiXactAdvanceNextMXact(MultiXactId minMulti,
 	}
 	if (MultiXactOffsetPrecedes(MultiXactState->nextOffset, minMultiOffset))
 	{
-		debug_elog3(DEBUG2, "MultiXact: setting next offset to %u",
+		debug_elog3(DEBUG2, "MultiXact: setting next offset to %" PRIu64,
 					minMultiOffset);
 		MultiXactState->nextOffset = minMultiOffset;
 	}
@@ -2432,23 +2345,8 @@ ExtendMultiXactMember(MultiXactOffset offset, int nmembers)
 			LWLockRelease(lock);
 		}
 
-		/*
-		 * Compute the number of items till end of current page.  Careful: if
-		 * addition of unsigned ints wraps around, we're at the last page of
-		 * the last segment; since that page holds a different number of items
-		 * than other pages, we need to do it differently.
-		 */
-		if (offset + MAX_MEMBERS_IN_LAST_MEMBERS_PAGE < offset)
-		{
-			/*
-			 * This is the last page of the last segment; we can compute the
-			 * number of items left to allocate in it without modulo
-			 * arithmetic.
-			 */
-			difference = MaxMultiXactOffset - offset + 1;
-		}
-		else
-			difference = MULTIXACT_MEMBERS_PER_PAGE - offset % MULTIXACT_MEMBERS_PER_PAGE;
+		/* Compute the number of items till end of current page. */
+		difference = MULTIXACT_MEMBERS_PER_PAGE - offset % MULTIXACT_MEMBERS_PER_PAGE;
 
 		/*
 		 * Advance to next page, taking care to properly handle the wraparound
@@ -2514,15 +2412,14 @@ GetOldestMultiXactId(void)
 }
 
 /*
- * Determine how aggressively we need to vacuum in order to prevent member
- * wraparound.
+ * Determine if we need to vacuum to keep the size of the members SLRU in
+ * check.
  *
  * To do so determine what's the oldest member offset and install the limit
  * info in MultiXactState, where it can be used to prevent overrun of old data
  * in the members SLRU area.
  *
- * The return value is true if emergency autovacuum is required and false
- * otherwise.
+ * The return value is true if autovacuum is required and false otherwise.
  */
 static bool
 SetOffsetVacuumLimit(bool is_startup)
@@ -2534,8 +2431,6 @@ SetOffsetVacuumLimit(bool is_startup)
 	MultiXactOffset nextOffset;
 	bool		oldestOffsetKnown = false;
 	bool		prevOldestOffsetKnown;
-	MultiXactOffset offsetStopLimit = 0;
-	MultiXactOffset prevOffsetStopLimit;
 
 	/*
 	 * NB: Have to prevent concurrent truncation, we might otherwise try to
@@ -2550,7 +2445,6 @@ SetOffsetVacuumLimit(bool is_startup)
 	nextOffset = MultiXactState->nextOffset;
 	prevOldestOffsetKnown = MultiXactState->oldestOffsetKnown;
 	prevOldestOffset = MultiXactState->oldestOffset;
-	prevOffsetStopLimit = MultiXactState->offsetStopLimit;
 	Assert(MultiXactState->finishedStartup);
 	LWLockRelease(MultiXactGenLock);
 
@@ -2581,13 +2475,9 @@ SetOffsetVacuumLimit(bool is_startup)
 		oldestOffsetKnown =
 			find_multixact_start(oldestMultiXactId, &oldestOffset);
 
-		if (oldestOffsetKnown)
-			ereport(DEBUG1,
-					(errmsg_internal("oldest MultiXactId member is at offset %u",
-									 oldestOffset)));
-		else
+		if (!oldestOffsetKnown)
 			ereport(LOG,
-					(errmsg("MultiXact member wraparound protections are disabled because oldest checkpointed MultiXact %u does not exist on disk",
+					(errmsg("oldest checkpointed MultiXact %u does not exist on disk",
 							oldestMultiXactId)));
 	}
 
@@ -2597,97 +2487,32 @@ SetOffsetVacuumLimit(bool is_startup)
 	 * If we can, compute limits (and install them MultiXactState) to prevent
 	 * overrun of old data in the members SLRU area. We can only do so if the
 	 * oldest offset is known though.
+	 *
+	 * FIXME: Is !oldestOffsetKnown possible anymore? At least update the comment:
+	 * we won't overrun members anymore.
 	 */
-	if (oldestOffsetKnown)
-	{
-		/* move back to start of the corresponding segment */
-		offsetStopLimit = oldestOffset - (oldestOffset %
-										  (MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT));
-
-		/* always leave one segment before the wraparound point */
-		offsetStopLimit -= (MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT);
-
-		if (!prevOldestOffsetKnown && !is_startup)
-			ereport(LOG,
-					(errmsg("MultiXact member wraparound protections are now enabled")));
-
-		ereport(DEBUG1,
-				(errmsg_internal("MultiXact member stop limit is now %u based on MultiXact %u",
-								 offsetStopLimit, oldestMultiXactId)));
-	}
-	else if (prevOldestOffsetKnown)
+	if (prevOldestOffsetKnown)
 	{
 		/*
 		 * If we failed to get the oldest offset this time, but we have a
 		 * value from a previous pass through this function, use the old
-		 * values rather than automatically forcing an emergency autovacuum
-		 * cycle again.
+		 * values rather than automatically forcing an autovacuum cycle again.
 		 */
 		oldestOffset = prevOldestOffset;
 		oldestOffsetKnown = true;
-		offsetStopLimit = prevOffsetStopLimit;
 	}
 
 	/* Install the computed values */
 	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
 	MultiXactState->oldestOffset = oldestOffset;
 	MultiXactState->oldestOffsetKnown = oldestOffsetKnown;
-	MultiXactState->offsetStopLimit = offsetStopLimit;
 	LWLockRelease(MultiXactGenLock);
 
 	/*
-	 * Do we need an emergency autovacuum?	If we're not sure, assume yes.
+	 * Do we need autovacuum?  If we're not sure, assume yes.
 	 */
 	return !oldestOffsetKnown ||
-		(nextOffset - oldestOffset > MULTIXACT_MEMBER_SAFE_THRESHOLD);
-}
-
-/*
- * Return whether adding "distance" to "start" would move past "boundary".
- *
- * We use this to determine whether the addition is "wrapping around" the
- * boundary point, hence the name.  The reason we don't want to use the regular
- * 2^31-modulo arithmetic here is that we want to be able to use the whole of
- * the 2^32-1 space here, allowing for more multixacts than would fit
- * otherwise.
- */
-static bool
-MultiXactOffsetWouldWrap(MultiXactOffset boundary, MultiXactOffset start,
-						 uint32 distance)
-{
-	MultiXactOffset finish;
-
-	/*
-	 * Note that offset number 0 is not used (see GetMultiXactIdMembers), so
-	 * if the addition wraps around the UINT_MAX boundary, skip that value.
-	 */
-	finish = start + distance;
-	if (finish < start)
-		finish++;
-
-	/*-----------------------------------------------------------------------
-	 * When the boundary is numerically greater than the starting point, any
-	 * value numerically between the two is not wrapped:
-	 *
-	 *	<----S----B---->
-	 *	[---)			 = F wrapped past B (and UINT_MAX)
-	 *		 [---)		 = F not wrapped
-	 *			  [----] = F wrapped past B
-	 *
-	 * When the boundary is numerically less than the starting point (i.e. the
-	 * UINT_MAX wraparound occurs somewhere in between) then all values in
-	 * between are wrapped:
-	 *
-	 *	<----B----S---->
-	 *	[---)			 = F not wrapped past B (but wrapped past UINT_MAX)
-	 *		 [---)		 = F wrapped past B (and UINT_MAX)
-	 *			  [----] = F not wrapped
-	 *-----------------------------------------------------------------------
-	 */
-	if (start < boundary)
-		return finish >= boundary || finish < start;
-	else
-		return finish >= boundary && finish < start;
+		(nextOffset - oldestOffset > MULTIXACT_MEMBER_AUTOVAC_THRESHOLD);
 }
 
 /*
@@ -2727,6 +2552,7 @@ find_multixact_start(MultiXactId multi, MultiXactOffset *result)
 	offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
 	offptr += entryno;
 	offset = *offptr;
+
 	LWLockRelease(SimpleLruGetBankLock(MultiXactOffsetCtl, pageno));
 
 	*result = offset;
@@ -2774,73 +2600,6 @@ GetMultiXactInfo(uint32 *multixacts, MultiXactOffset *members,
 	return true;
 }
 
-/*
- * Multixact members can be removed once the multixacts that refer to them
- * are older than every datminmxid.  autovacuum_multixact_freeze_max_age and
- * vacuum_multixact_freeze_table_age work together to make sure we never have
- * too many multixacts; we hope that, at least under normal circumstances,
- * this will also be sufficient to keep us from using too many offsets.
- * However, if the average multixact has many members, we might exhaust the
- * members space while still using few enough members that these limits fail
- * to trigger relminmxid advancement by VACUUM.  At that point, we'd have no
- * choice but to start failing multixact-creating operations with an error.
- *
- * To prevent that, if more than a threshold portion of the members space is
- * used, we effectively reduce autovacuum_multixact_freeze_max_age and
- * to a value just less than the number of multixacts in use.  We hope that
- * this will quickly trigger autovacuuming on the table or tables with the
- * oldest relminmxid, thus allowing datminmxid values to advance and removing
- * some members.
- *
- * As the fraction of the member space currently in use grows, we become
- * more aggressive in clamping this value.  That not only causes autovacuum
- * to ramp up, but also makes any manual vacuums the user issues more
- * aggressive.  This happens because vacuum_get_cutoffs() will clamp the
- * freeze table and the minimum freeze age cutoffs based on the effective
- * autovacuum_multixact_freeze_max_age this function returns.  In the worst
- * case, we'll claim the freeze_max_age to zero, and every vacuum of any
- * table will freeze every multixact.
- */
-int
-MultiXactMemberFreezeThreshold(void)
-{
-	MultiXactOffset members;
-	uint32		multixacts;
-	uint32		victim_multixacts;
-	double		fraction;
-	int			result;
-	MultiXactId oldestMultiXactId;
-	MultiXactOffset oldestOffset;
-
-	/* If we can't determine member space utilization, assume the worst. */
-	if (!GetMultiXactInfo(&multixacts, &members, &oldestMultiXactId, &oldestOffset))
-		return 0;
-
-	/* If member space utilization is low, no special action is required. */
-	if (members <= MULTIXACT_MEMBER_SAFE_THRESHOLD)
-		return autovacuum_multixact_freeze_max_age;
-
-	/*
-	 * Compute a target for relminmxid advancement.  The number of multixacts
-	 * we try to eliminate from the system is based on how far we are past
-	 * MULTIXACT_MEMBER_SAFE_THRESHOLD.
-	 */
-	fraction = (double) (members - MULTIXACT_MEMBER_SAFE_THRESHOLD) /
-		(MULTIXACT_MEMBER_DANGER_THRESHOLD - MULTIXACT_MEMBER_SAFE_THRESHOLD);
-	victim_multixacts = multixacts * fraction;
-
-	/* fraction could be > 1.0, but lowest possible freeze age is zero */
-	if (victim_multixacts > multixacts)
-		return 0;
-	result = multixacts - victim_multixacts;
-
-	/*
-	 * Clamp to autovacuum_multixact_freeze_max_age, so that we never make
-	 * autovacuum less aggressive than it would otherwise be.
-	 */
-	return Min(result, autovacuum_multixact_freeze_max_age);
-}
-
 typedef struct mxtruncinfo
 {
 	int64		earliestExistingPage;
@@ -2867,36 +2626,12 @@ SlruScanDirCbFindEarliest(SlruCtl ctl, char *filename, int64 segpage, void *data
 
 /*
  * Delete members segments [oldest, newOldest)
- *
- * The members SLRU can, in contrast to the offsets one, be filled to almost
- * the full range at once. This means SimpleLruTruncate() can't trivially be
- * used - instead the to-be-deleted range is computed using the offsets
- * SLRU. C.f. TruncateMultiXact().
  */
 static void
 PerformMembersTruncation(MultiXactOffset oldestOffset, MultiXactOffset newOldestOffset)
 {
-	const int64 maxsegment = MXOffsetToMemberSegment(MaxMultiXactOffset);
-	int64		startsegment = MXOffsetToMemberSegment(oldestOffset);
-	int64		endsegment = MXOffsetToMemberSegment(newOldestOffset);
-	int64		segment = startsegment;
-
-	/*
-	 * Delete all the segments but the last one. The last segment can still
-	 * contain, possibly partially, valid data.
-	 */
-	while (segment != endsegment)
-	{
-		elog(DEBUG2, "truncating multixact members segment %" PRIx64,
-			 segment);
-		SlruDeleteSegment(MultiXactMemberCtl, segment);
-
-		/* move to next segment, handling wraparound correctly */
-		if (segment == maxsegment)
-			segment = 0;
-		else
-			segment += 1;
-	}
+	SimpleLruTruncate(MultiXactMemberCtl,
+					  MXOffsetToMemberPage(newOldestOffset));
 }
 
 /*
@@ -3040,7 +2775,7 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 
 	elog(DEBUG1, "performing multixact truncation: "
 		 "offsets [%u, %u), offsets segments [%" PRIx64 ", %" PRIx64 "), "
-		 "members [%u, %u), members segments [%" PRIx64 ", %" PRIx64 ")",
+		 "members [%" PRIu64 ", %" PRIu64 "), members segments [%" PRIx64 ", %" PRIx64 ")",
 		 oldestMulti, newOldestMulti,
 		 MultiXactIdToOffsetSegment(oldestMulti),
 		 MultiXactIdToOffsetSegment(newOldestMulti),
@@ -3120,20 +2855,13 @@ MultiXactOffsetPagePrecedes(int64 page1, int64 page2)
 
 /*
  * Decide whether a MultiXactMember page number is "older" for truncation
- * purposes.  There is no "invalid offset number" so use the numbers verbatim.
+ * purposes.  There is no "invalid offset number" and members never wrap
+ * around, so use the numbers verbatim.
  */
 static bool
 MultiXactMemberPagePrecedes(int64 page1, int64 page2)
 {
-	MultiXactOffset offset1;
-	MultiXactOffset offset2;
-
-	offset1 = ((MultiXactOffset) page1) * MULTIXACT_MEMBERS_PER_PAGE;
-	offset2 = ((MultiXactOffset) page2) * MULTIXACT_MEMBERS_PER_PAGE;
-
-	return (MultiXactOffsetPrecedes(offset1, offset2) &&
-			MultiXactOffsetPrecedes(offset1,
-									offset2 + MULTIXACT_MEMBERS_PER_PAGE - 1));
+	return page1 < page2;
 }
 
 /*
@@ -3171,7 +2899,7 @@ MultiXactIdPrecedesOrEquals(MultiXactId multi1, MultiXactId multi2)
 static bool
 MultiXactOffsetPrecedes(MultiXactOffset offset1, MultiXactOffset offset2)
 {
-	int32		diff = (int32) (offset1 - offset2);
+	int64		diff = (int64) (offset1 - offset2);
 
 	return (diff < 0);
 }
@@ -3268,7 +2996,7 @@ multixact_redo(XLogReaderState *record)
 
 		elog(DEBUG1, "replaying multixact truncation: "
 			 "offsets [%u, %u), offsets segments [%" PRIx64 ", %" PRIx64 "), "
-			 "members [%u, %u), members segments [%" PRIx64 ", %" PRIx64 ")",
+			 "members [%" PRIu64 ", %" PRIu64 "), members segments [%" PRIx64 ", %" PRIx64 ")",
 			 xlrec.startTruncOff, xlrec.endTruncOff,
 			 MultiXactIdToOffsetSegment(xlrec.startTruncOff),
 			 MultiXactIdToOffsetSegment(xlrec.endTruncOff),
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 22d0a2e8c3a..ef405d66b3b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5139,7 +5139,7 @@ BootStrapXLOG(uint32 data_checksum_version)
 		FullTransactionIdFromEpochAndXid(0, FirstNormalTransactionId);
 	checkPoint.nextOid = FirstGenbkiObjectId;
 	checkPoint.nextMulti = FirstMultiXactId;
-	checkPoint.nextMultiOffset = 0;
+	checkPoint.nextMultiOffset = 1;
 	checkPoint.oldestXid = FirstNormalTransactionId;
 	checkPoint.oldestXidDB = Template1DbOid;
 	checkPoint.oldestMulti = FirstMultiXactId;
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 21b8f179ba0..51dea342a4d 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -886,7 +886,7 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 							 U64FromFullTransactionId(checkPoint.nextXid),
 							 checkPoint.nextOid)));
 	ereport(DEBUG1,
-			(errmsg_internal("next MultiXactId: %u; next MultiXactOffset: %u",
+			(errmsg_internal("next MultiXactId: %u; next MultiXactOffset: %" PRIu64,
 							 checkPoint.nextMulti, checkPoint.nextMultiOffset)));
 	ereport(DEBUG1,
 			(errmsg_internal("oldest unfrozen transaction ID: %u, in database %u",
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index e785dd55ce5..100e1a72c22 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -1148,7 +1148,7 @@ vacuum_get_cutoffs(Relation rel, const VacuumParams params,
 	 * normally autovacuum_multixact_freeze_max_age, but may be less if we are
 	 * short of multixact member space.
 	 */
-	effective_multixact_freeze_max_age = MultiXactMemberFreezeThreshold();
+	effective_multixact_freeze_max_age = autovacuum_multixact_freeze_max_age;
 
 	/*
 	 * Almost ready to set freeze output parameters; check if OldestXmin or
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 1c38488f2cb..bf66f494e3a 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -1151,7 +1151,7 @@ do_start_worker(void)
 
 	/* Also determine the oldest datminmxid we will consider. */
 	recentMulti = ReadNextMultiXactId();
-	multiForceLimit = recentMulti - MultiXactMemberFreezeThreshold();
+	multiForceLimit = recentMulti - autovacuum_multixact_freeze_max_age;
 	if (multiForceLimit < FirstMultiXactId)
 		multiForceLimit -= FirstMultiXactId;
 
@@ -1939,7 +1939,7 @@ do_autovacuum(void)
 	 * normally autovacuum_multixact_freeze_max_age, but may be less if we are
 	 * short of multixact member space.
 	 */
-	effective_multixact_freeze_max_age = MultiXactMemberFreezeThreshold();
+	effective_multixact_freeze_max_age = autovacuum_multixact_freeze_max_age;
 
 	/*
 	 * Find the pg_database entry and select the default freeze ages. We use
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index 30ad46912e1..a4060309ae0 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -271,7 +271,7 @@ main(int argc, char *argv[])
 		   ControlFile->checkPointCopy.nextOid);
 	printf(_("Latest checkpoint's NextMultiXactId:  %u\n"),
 		   ControlFile->checkPointCopy.nextMulti);
-	printf(_("Latest checkpoint's NextMultiOffset:  %u\n"),
+	printf(_("Latest checkpoint's NextMultiOffset:  %" PRIu64 "\n"),
 		   ControlFile->checkPointCopy.nextMultiOffset);
 	printf(_("Latest checkpoint's oldestXID:        %u\n"),
 		   ControlFile->checkPointCopy.oldestXid);
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index a31e7643cf0..7c6c2741a17 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -92,6 +92,7 @@ static void KillExistingArchiveStatus(void);
 static void KillExistingWALSummaries(void);
 static void WriteEmptyXLOG(void);
 static void usage(void);
+static uint64 strtou64_strict(const char *s, char **endptr, int base);
 
 
 int
@@ -120,7 +121,6 @@ main(int argc, char *argv[])
 	MultiXactId set_oldestmxid = 0;
 	char	   *endptr;
 	char	   *endptr2;
-	int64		tmpi64;
 	char	   *DataDir = NULL;
 	char	   *log_fname = NULL;
 	int			fd;
@@ -269,17 +269,14 @@ main(int argc, char *argv[])
 
 			case 'O':
 				errno = 0;
-				tmpi64 = strtoi64(optarg, &endptr, 0);
+				set_mxoff = strtou64_strict(optarg, &endptr, 0);
 				if (endptr == optarg || *endptr != '\0' || errno != 0)
 				{
 					pg_log_error("invalid argument for option %s", "-O");
 					pg_log_error_hint("Try \"%s --help\" for more information.", progname);
 					exit(1);
 				}
-				if (tmpi64 < 0 || tmpi64 > (int64) MaxMultiXactOffset)
-					pg_fatal("multitransaction offset (-O) must be between 0 and %u", MaxMultiXactOffset);
 
-				set_mxoff = (MultiXactOffset) tmpi64;
 				mxoff_given = true;
 				break;
 
@@ -749,7 +746,7 @@ PrintControlValues(bool guessed)
 		   ControlFile.checkPointCopy.nextOid);
 	printf(_("Latest checkpoint's NextMultiXactId:  %u\n"),
 		   ControlFile.checkPointCopy.nextMulti);
-	printf(_("Latest checkpoint's NextMultiOffset:  %u\n"),
+	printf(_("Latest checkpoint's NextMultiOffset:  %" PRIu64 "\n"),
 		   ControlFile.checkPointCopy.nextMultiOffset);
 	printf(_("Latest checkpoint's oldestXID:        %u\n"),
 		   ControlFile.checkPointCopy.oldestXid);
@@ -825,7 +822,7 @@ PrintNewControlValues(void)
 
 	if (mxoff_given)
 	{
-		printf(_("NextMultiOffset:                      %u\n"),
+		printf(_("NextMultiOffset:                      %" PRIu64 "\n"),
 			   ControlFile.checkPointCopy.nextMultiOffset);
 	}
 
@@ -1210,3 +1207,22 @@ usage(void)
 	printf(_("\nReport bugs to <%s>.\n"), PACKAGE_BUGREPORT);
 	printf(_("%s home page: <%s>\n"), PACKAGE_NAME, PACKAGE_URL);
 }
+
+/* Like strtou64(), but negative values are not accepted. */
+static uint64
+strtou64_strict(const char *s, char **endptr, int base)
+{
+	/* skip leading whitespace */
+	while (isspace((unsigned char) *s))
+		s++;
+
+	/* reject negative values */
+	if (*s == '-')
+	{
+		*endptr = (char *) s;
+		errno = ERANGE;
+		return UINT64_MAX;
+	}
+
+	return strtou64(s, endptr, base);
+}
diff --git a/src/bin/pg_resetwal/t/001_basic.pl b/src/bin/pg_resetwal/t/001_basic.pl
index 90ecb8afe18..5a175e285d1 100644
--- a/src/bin/pg_resetwal/t/001_basic.pl
+++ b/src/bin/pg_resetwal/t/001_basic.pl
@@ -145,7 +145,7 @@ command_fails_like(
 	'fails with incorrect -O option');
 command_fails_like(
 	[ 'pg_resetwal', '-O' => '-1', $node->data_dir ],
-	qr/must be between 0 and 4294967295/,
+	qr/error: invalid argument for option -O/,
 	'fails with -O value -1');
 # --wal-segsize
 command_fails_like(
@@ -215,7 +215,7 @@ push @cmd,
   sprintf("%d,%d", hex($files[0]) == 0 ? 3 : hex($files[0]), hex($files[-1]));
 
 @files = get_slru_files('pg_multixact/offsets');
-$mult = 32 * $blcksz / 4;
+$mult = 32 * $blcksz / 8;
 # --multixact-ids argument is "new,old"
 push @cmd,
   '--multixact-ids' => sprintf("%d,%d",
diff --git a/src/include/access/multixact.h b/src/include/access/multixact.h
index 82e4bb90dd5..7d98fe0fe32 100644
--- a/src/include/access/multixact.h
+++ b/src/include/access/multixact.h
@@ -28,8 +28,6 @@
 
 #define MultiXactIdIsValid(multi) ((multi) != InvalidMultiXactId)
 
-#define MaxMultiXactOffset	((MultiXactOffset) 0xFFFFFFFF)
-
 /*
  * Possible multixact lock modes ("status").  The first four modes are for
  * tuple locks (FOR KEY SHARE, FOR SHARE, FOR NO KEY UPDATE, FOR UPDATE); the
@@ -147,7 +145,6 @@ extern void MultiXactSetNextMXact(MultiXactId nextMulti,
 extern void MultiXactAdvanceNextMXact(MultiXactId minMulti,
 									  MultiXactOffset minMultiOffset);
 extern void MultiXactAdvanceOldest(MultiXactId oldestMulti, Oid oldestMultiDB);
-extern int	MultiXactMemberFreezeThreshold(void);
 
 extern void multixact_twophase_recover(FullTransactionId fxid, uint16 info,
 									   void *recdata, uint32 len);
diff --git a/src/include/access/multixact_internal.h b/src/include/access/multixact_internal.h
index 9b56deaef31..b0227759e39 100644
--- a/src/include/access/multixact_internal.h
+++ b/src/include/access/multixact_internal.h
@@ -17,21 +17,12 @@
 
 #include "access/multixact.h"
 
-
 /*
  * Defines for MultiXactOffset page sizes.  A page is the same BLCKSZ as is
  * used everywhere else in Postgres.
- *
- * Note: because MultiXactOffsets are 32 bits and wrap around at 0xFFFFFFFF,
- * MultiXact page numbering also wraps around at
- * 0xFFFFFFFF/MULTIXACT_OFFSETS_PER_PAGE, and segment numbering at
- * 0xFFFFFFFF/MULTIXACT_OFFSETS_PER_PAGE/SLRU_PAGES_PER_SEGMENT.  We need
- * take no explicit notice of that fact in this module, except when comparing
- * segment and page numbers in TruncateMultiXact (see
- * MultiXactOffsetPagePrecedes).
  */
 
-/* We need four bytes per offset */
+/* We need 8 bytes per offset */
 #define MULTIXACT_OFFSETS_PER_PAGE (BLCKSZ / sizeof(MultiXactOffset))
 
 static inline int64
@@ -80,19 +71,6 @@ MultiXactIdToOffsetSegment(MultiXactId multi)
 #define MULTIXACT_MEMBERS_PER_PAGE	\
 	(MULTIXACT_MEMBERGROUPS_PER_PAGE * MULTIXACT_MEMBERS_PER_MEMBERGROUP)
 
-/*
- * Because the number of items per page is not a divisor of the last item
- * number (member 0xFFFFFFFF), the last segment does not use the maximum number
- * of pages, and moreover the last used page therein does not use the same
- * number of items as previous pages.  (Another way to say it is that the
- * 0xFFFFFFFF member is somewhere in the middle of the last page, so the page
- * has some empty space after that item.)
- *
- * This constant is the number of members in the last page of the last segment.
- */
-#define MAX_MEMBERS_IN_LAST_MEMBERS_PAGE \
-		((uint32) ((0xFFFFFFFF % MULTIXACT_MEMBERS_PER_PAGE) + 1))
-
 /* page in which a member is to be found */
 static inline int64
 MXOffsetToMemberPage(MultiXactOffset offset)
diff --git a/src/include/c.h b/src/include/c.h
index 757dfff4782..bc92a6f4565 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -670,7 +670,7 @@ typedef uint32 SubTransactionId;
 /* MultiXactId must be equivalent to TransactionId, to fit in t_xmax */
 typedef TransactionId MultiXactId;
 
-typedef uint32 MultiXactOffset;
+typedef uint64 MultiXactOffset;
 
 typedef uint32 CommandId;
 
-- 
2.47.3
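
For readers following the thread, here is a minimal standalone sketch of the two arithmetic idioms the 0002 patch relies on once offsets are 64 bits wide: the "nextOffset + nmembers < nextOffset" sanity check in GetNewMultiXactId, and the signed-difference comparison in MultiXactOffsetPrecedes. It is not code from the patch. DemoOffset, demo_offset_would_wrap, demo_offset_precedes and demo_reserve are hypothetical names, nmembers is taken as uint32_t for simplicity (the patch passes an int), and the sketch returns false where the backend would ereport(ERROR).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the patch's 64-bit MultiXactOffset typedef. */
typedef uint64_t DemoOffset;

/*
 * True if start + nmembers does not fit in 64 bits.  Unsigned addition
 * wraps modulo 2^64 in C, so a wrapped sum is smaller than its operand.
 */
static bool
demo_offset_would_wrap(DemoOffset start, uint32_t nmembers)
{
	return start + nmembers < start;
}

/*
 * Same shape as the patched MultiXactOffsetPrecedes: compare through a
 * signed 64-bit difference.  Assumes a two's complement target, as all
 * supported platforms are.
 */
static bool
demo_offset_precedes(DemoOffset a, DemoOffset b)
{
	int64_t		diff = (int64_t) (a - b);

	return diff < 0;
}

/*
 * Reserve nmembers consecutive slots starting at *next.  Returns false
 * and leaves *next untouched if the reservation would wrap around.
 */
static bool
demo_reserve(DemoOffset *next, uint32_t nmembers, DemoOffset *start)
{
	assert(nmembers > 0);		/* a multixact always has members */

	if (demo_offset_would_wrap(*next, nmembers))
		return false;

	*start = *next;
	*next += nmembers;
	return true;
}
```

With the counter starting at 1, as BootStrapXLOG now sets it, reserving three members yields start offset 1 and leaves the counter at 4. A reservation attempted with the counter at UINT64_MAX is refused, which is the only case the new check in GetNewMultiXactId guards against.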

v25-0003-Add-pg_upgrade-for-64-bit-multixact-offsets.patch (text/x-patch; charset=UTF-8)
From 02a1863f9badcce4a7d977149053334afff4b51a Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Wed, 12 Nov 2025 21:47:44 +0200
Subject: [PATCH v25 03/10] Add pg_upgrade for 64 bit multixact offsets

Author: Maxim Orlov <orlovmg@gmail.com>
Author: Heikki Linnakangas <hlinnaka@iki.fi>
---
 src/backend/access/transam/multixact.c |  56 -----
 src/bin/pg_upgrade/Makefile            |   3 +
 src/bin/pg_upgrade/meson.build         |   3 +
 src/bin/pg_upgrade/multixact_new.c     | 101 +++++++++
 src/bin/pg_upgrade/multixact_new.h     |  23 ++
 src/bin/pg_upgrade/multixact_old.c     | 297 +++++++++++++++++++++++++
 src/bin/pg_upgrade/multixact_old.h     |  29 +++
 src/bin/pg_upgrade/pg_upgrade.c        | 108 ++++++++-
 src/bin/pg_upgrade/pg_upgrade.h        |   5 +
 src/bin/pg_upgrade/slru_io.c           | 242 ++++++++++++++++++++
 src/bin/pg_upgrade/slru_io.h           |  52 +++++
 src/tools/pgindent/typedefs.list       |   3 +
 12 files changed, 860 insertions(+), 62 deletions(-)
 create mode 100644 src/bin/pg_upgrade/multixact_new.c
 create mode 100644 src/bin/pg_upgrade/multixact_new.h
 create mode 100644 src/bin/pg_upgrade/multixact_old.c
 create mode 100644 src/bin/pg_upgrade/multixact_old.h
 create mode 100644 src/bin/pg_upgrade/slru_io.c
 create mode 100644 src/bin/pg_upgrade/slru_io.h

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 34a745c07be..e0323ec1014 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1824,48 +1824,6 @@ BootStrapMultiXact(void)
 	SimpleLruZeroAndWritePage(MultiXactMemberCtl, 0);
 }
 
-/*
- * MaybeExtendOffsetSlru
- *		Extend the offsets SLRU area, if necessary
- *
- * After a binary upgrade from <= 9.2, the pg_multixact/offsets SLRU area might
- * contain files that are shorter than necessary; this would occur if the old
- * installation had used multixacts beyond the first page (files cannot be
- * copied, because the on-disk representation is different).  pg_upgrade would
- * update pg_control to set the next offset value to be at that position, so
- * that tuples marked as locked by such MultiXacts would be seen as visible
- * without having to consult multixact.  However, trying to create and use a
- * new MultiXactId would result in an error because the page on which the new
- * value would reside does not exist.  This routine is in charge of creating
- * such pages.
- */
-static void
-MaybeExtendOffsetSlru(void)
-{
-	int64		pageno;
-	LWLock	   *lock;
-
-	pageno = MultiXactIdToOffsetPage(MultiXactState->nextMXact);
-	lock = SimpleLruGetBankLock(MultiXactOffsetCtl, pageno);
-
-	LWLockAcquire(lock, LW_EXCLUSIVE);
-
-	if (!SimpleLruDoesPhysicalPageExist(MultiXactOffsetCtl, pageno))
-	{
-		int			slotno;
-
-		/*
-		 * Fortunately for us, SimpleLruWritePage is already prepared to deal
-		 * with creating a new segment file even if the page we're writing is
-		 * not the first in it, so this is enough.
-		 */
-		slotno = SimpleLruZeroPage(MultiXactOffsetCtl, pageno);
-		SimpleLruWritePage(MultiXactOffsetCtl, slotno);
-	}
-
-	LWLockRelease(lock);
-}
-
 /*
  * This must be called ONCE during postmaster or standalone-backend startup.
  *
@@ -2058,20 +2016,6 @@ MultiXactSetNextMXact(MultiXactId nextMulti,
 	MultiXactState->nextMXact = nextMulti;
 	MultiXactState->nextOffset = nextMultiOffset;
 	LWLockRelease(MultiXactGenLock);
-
-	/*
-	 * During a binary upgrade, make sure that the offsets SLRU is large
-	 * enough to contain the next value that would be created.
-	 *
-	 * We need to do this pretty early during the first startup in binary
-	 * upgrade mode: before StartupMultiXact() in fact, because this routine
-	 * is called even before that by StartupXLOG().  And we can't do it
-	 * earlier than at this point, because during that first call of this
-	 * routine we determine the MultiXactState->nextMXact value that
-	 * MaybeExtendOffsetSlru needs.
-	 */
-	if (IsBinaryUpgrade)
-		MaybeExtendOffsetSlru();
 }
 
 /*
diff --git a/src/bin/pg_upgrade/Makefile b/src/bin/pg_upgrade/Makefile
index 69fcf593cae..42995d53b0b 100644
--- a/src/bin/pg_upgrade/Makefile
+++ b/src/bin/pg_upgrade/Makefile
@@ -18,11 +18,14 @@ OBJS = \
 	file.o \
 	function.o \
 	info.o \
+	multixact_new.o \
+	multixact_old.o \
 	option.o \
 	parallel.o \
 	pg_upgrade.o \
 	relfilenumber.o \
 	server.o \
+	slru_io.o \
 	tablespace.o \
 	task.o \
 	util.o \
diff --git a/src/bin/pg_upgrade/meson.build b/src/bin/pg_upgrade/meson.build
index ac992f0d14b..3e46c4512cf 100644
--- a/src/bin/pg_upgrade/meson.build
+++ b/src/bin/pg_upgrade/meson.build
@@ -8,11 +8,14 @@ pg_upgrade_sources = files(
   'file.c',
   'function.c',
   'info.c',
+  'multixact_new.c',
+  'multixact_old.c',
   'option.c',
   'parallel.c',
   'pg_upgrade.c',
   'relfilenumber.c',
   'server.c',
+  'slru_io.c',
   'tablespace.c',
   'task.c',
   'util.c',
diff --git a/src/bin/pg_upgrade/multixact_new.c b/src/bin/pg_upgrade/multixact_new.c
new file mode 100644
index 00000000000..8284a2015fc
--- /dev/null
+++ b/src/bin/pg_upgrade/multixact_new.c
@@ -0,0 +1,101 @@
+/*
+ * multixact_new.c
+ *
+ * Functions to write multixacts in the v19 format with 64-bit
+ * MultiXactOffsets
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ * src/bin/pg_upgrade/multixact_new.c
+ */
+
+#include "postgres_fe.h"
+
+#include "access/multixact_internal.h"
+#include "multixact_new.h"
+
+MultiXactWriter *
+AllocMultiXactWrite(const char *pgdata, MultiXactId firstMulti,
+					MultiXactOffset firstOffset)
+{
+	MultiXactWriter *state = pg_malloc(sizeof(*state));
+	char		dir[MAXPGPATH] = {0};
+
+	pg_sprintf(dir, "%s/pg_multixact/offsets", pgdata);
+	state->offset = AllocSlruWrite(dir, false);
+	SlruWriteSwitchPage(state->offset, MultiXactIdToOffsetPage(firstMulti));
+
+	pg_sprintf(dir, "%s/pg_multixact/members", pgdata);
+	state->members = AllocSlruWrite(dir, true /* use long segment names */ );
+	SlruWriteSwitchPage(state->members, MXOffsetToMemberPage(firstOffset));
+
+	return state;
+}
+
+/*
+ * Write a new multixact with members.
+ *
+ * Simplified version of the corresponding server function, hence the name.
+ */
+void
+RecordNewMultiXact(MultiXactWriter *state, MultiXactOffset offset,
+				   MultiXactId multi, int nmembers, MultiXactMember *members)
+{
+	int64		pageno;
+	int64		prev_pageno;
+	int			entryno;
+	char	   *buf;
+	MultiXactOffset *offptr;
+
+	pageno = MultiXactIdToOffsetPage(multi);
+	entryno = MultiXactIdToOffsetEntry(multi);
+
+	/* Store the offset */
+	buf = SlruWriteSwitchPage(state->offset, pageno);
+	offptr = (MultiXactOffset *) buf;
+	offptr[entryno] = offset;
+
+	/* Store the members */
+	prev_pageno = -1;
+	for (int i = 0; i < nmembers; i++, offset++)
+	{
+		TransactionId *memberptr;
+		uint32	   *flagsptr;
+		uint32		flagsval;
+		int			bshift;
+		int			flagsoff;
+		int			memberoff;
+
+		Assert(members[i].status <= MultiXactStatusUpdate);
+
+		pageno = MXOffsetToMemberPage(offset);
+		memberoff = MXOffsetToMemberOffset(offset);
+		flagsoff = MXOffsetToFlagsOffset(offset);
+		bshift = MXOffsetToFlagsBitShift(offset);
+
+		if (pageno != prev_pageno)
+		{
+			buf = SlruWriteSwitchPage(state->members, pageno);
+			prev_pageno = pageno;
+		}
+
+		memberptr = (TransactionId *) (buf + memberoff);
+
+		*memberptr = members[i].xid;
+
+		flagsptr = (uint32 *) (buf + flagsoff);
+
+		flagsval = *flagsptr;
+		flagsval &= ~(((1 << MXACT_MEMBER_BITS_PER_XACT) - 1) << bshift);
+		flagsval |= (members[i].status << bshift);
+		*flagsptr = flagsval;
+	}
+}
+
+void
+FreeMultiXactWrite(MultiXactWriter *state)
+{
+	FreeSlruWrite(state->offset);
+	FreeSlruWrite(state->members);
+
+	pfree(state);
+}
diff --git a/src/bin/pg_upgrade/multixact_new.h b/src/bin/pg_upgrade/multixact_new.h
new file mode 100644
index 00000000000..f66e6af7e45
--- /dev/null
+++ b/src/bin/pg_upgrade/multixact_new.h
@@ -0,0 +1,23 @@
+/*
+ * multixact_new.h
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ * src/bin/pg_upgrade/multixact_new.h
+ */
+#include "access/multixact.h"
+
+#include "slru_io.h"
+
+typedef struct MultiXactWriter
+{
+	SlruSegState *offset;
+	SlruSegState *members;
+} MultiXactWriter;
+
+extern MultiXactWriter *AllocMultiXactWrite(const char *pgdata,
+											MultiXactId firstMulti,
+											MultiXactOffset firstOffset);
+extern void RecordNewMultiXact(MultiXactWriter *state, MultiXactOffset offset,
+							   MultiXactId multi, int nmembers,
+							   MultiXactMember *members);
+extern void FreeMultiXactWrite(MultiXactWriter *writer);
diff --git a/src/bin/pg_upgrade/multixact_old.c b/src/bin/pg_upgrade/multixact_old.c
new file mode 100644
index 00000000000..7bf7db4b009
--- /dev/null
+++ b/src/bin/pg_upgrade/multixact_old.c
@@ -0,0 +1,297 @@
+/*
+ * multixact_old.c
+ *
+ * Functions to read pre-v19 multixacts
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ * src/bin/pg_upgrade/multixact_old.c
+ */
+
+#include "postgres_fe.h"
+
+#include "multixact_old.h"
+#include "pg_upgrade.h"
+
+/*
+ * NOTE: below are a bunch of definitions that are copy-pasted from
+ * multixact.c from version 18.  The only difference is that we use the
+ * OldMultiXactOffset type equal to uint32 instead of MultiXactOffset which
+ * became uint64.
+ */
+
+/* We need four bytes per offset */
+#define MULTIXACT_OFFSETS_PER_PAGE (BLCKSZ / sizeof(OldMultiXactOffset))
+
+static inline int64
+MultiXactIdToOffsetPage(MultiXactId multi)
+{
+	return multi / MULTIXACT_OFFSETS_PER_PAGE;
+}
+
+static inline int
+MultiXactIdToOffsetEntry(MultiXactId multi)
+{
+	return multi % MULTIXACT_OFFSETS_PER_PAGE;
+}
+
+/*
+ * The situation for members is a bit more complex: we store one byte of
+ * additional flag bits for each TransactionId.  To do this without getting
+ * into alignment issues, we store four bytes of flags, and then the
+ * corresponding 4 Xids.  Each such 5-word (20-byte) set we call a "group", and
+ * are stored as a whole in pages.  Thus, with 8kB BLCKSZ, we keep 409 groups
+ * per page.  This wastes 12 bytes per page, but that's OK -- simplicity (and
+ * performance) trumps space efficiency here.
+ *
+ * Note that the "offset" macros work with byte offset, not array indexes, so
+ * arithmetic must be done using "char *" pointers.
+ */
+/* We need eight bits per xact, so one xact fits in a byte */
+#define MXACT_MEMBER_BITS_PER_XACT			8
+#define MXACT_MEMBER_FLAGS_PER_BYTE			1
+#define MXACT_MEMBER_XACT_BITMASK	((1 << MXACT_MEMBER_BITS_PER_XACT) - 1)
+
+/* how many full bytes of flags are there in a group? */
+#define MULTIXACT_FLAGBYTES_PER_GROUP		4
+#define MULTIXACT_MEMBERS_PER_MEMBERGROUP	\
+	(MULTIXACT_FLAGBYTES_PER_GROUP * MXACT_MEMBER_FLAGS_PER_BYTE)
+/* size in bytes of a complete group */
+#define MULTIXACT_MEMBERGROUP_SIZE \
+	(sizeof(TransactionId) * MULTIXACT_MEMBERS_PER_MEMBERGROUP + MULTIXACT_FLAGBYTES_PER_GROUP)
+#define MULTIXACT_MEMBERGROUPS_PER_PAGE (BLCKSZ / MULTIXACT_MEMBERGROUP_SIZE)
+#define MULTIXACT_MEMBERS_PER_PAGE	\
+	(MULTIXACT_MEMBERGROUPS_PER_PAGE * MULTIXACT_MEMBERS_PER_MEMBERGROUP)
+
+/* page in which a member is to be found */
+static inline int64
+MXOffsetToMemberPage(OldMultiXactOffset offset)
+{
+	return offset / MULTIXACT_MEMBERS_PER_PAGE;
+}
+
+/* Location (byte offset within page) of flag word for a given member */
+static inline int
+MXOffsetToFlagsOffset(OldMultiXactOffset offset)
+{
+	OldMultiXactOffset group = offset / MULTIXACT_MEMBERS_PER_MEMBERGROUP;
+	int			grouponpg = group % MULTIXACT_MEMBERGROUPS_PER_PAGE;
+	int			byteoff = grouponpg * MULTIXACT_MEMBERGROUP_SIZE;
+
+	return byteoff;
+}
+
+/* Location (byte offset within page) of TransactionId of given member */
+static inline int
+MXOffsetToMemberOffset(OldMultiXactOffset offset)
+{
+	int			member_in_group = offset % MULTIXACT_MEMBERS_PER_MEMBERGROUP;
+
+	return MXOffsetToFlagsOffset(offset) +
+		MULTIXACT_FLAGBYTES_PER_GROUP +
+		member_in_group * sizeof(TransactionId);
+}
+
+static inline int
+MXOffsetToFlagsBitShift(OldMultiXactOffset offset)
+{
+	int			member_in_group = offset % MULTIXACT_MEMBERS_PER_MEMBERGROUP;
+	int			bshift = member_in_group * MXACT_MEMBER_BITS_PER_XACT;
+
+	return bshift;
+}
+
+/*
+ * Construct reader of old multixacts.
+ *
+ * Returns the malloced memory used by all the other calls in this module.
+ */
+OldMultiXactReader *
+AllocOldMultiXactRead(char *pgdata, MultiXactId nextMulti,
+					  OldMultiXactOffset nextOffset)
+{
+	OldMultiXactReader *state = pg_malloc(sizeof(*state));
+	char		dir[MAXPGPATH] = {0};
+
+	state->nextMXact = nextMulti;
+	state->nextOffset = nextOffset;
+
+	pg_sprintf(dir, "%s/pg_multixact/offsets", pgdata);
+	state->offset = AllocSlruRead(dir, false);
+
+	pg_sprintf(dir, "%s/pg_multixact/members", pgdata);
+	state->members = AllocSlruRead(dir, false);
+
+	return state;
+}
+
+/*
+ * This is a simplified version of the GetMultiXactIdMembers() server function.
+ *
+ * - Only return the updating member, if any. Upgrade only cares about the
+ *   updaters. If there is no updating member, return the first locking-only
+ *   member. We don't have any way to represent "no members", but we also don't
+ *   need to preserve all the locking members.
+ *
+ * - We don't need to worry about locking and some corner cases because there's
+ *   no concurrent activity.
+ */
+void
+GetOldMultiXactIdSingleMember(OldMultiXactReader *state, MultiXactId multi,
+							  TransactionId *result, MultiXactStatus *status)
+{
+	MultiXactId nextMXact,
+				tmpMXact;
+	OldMultiXactOffset nextOffset;
+	int64		pageno,
+				prev_pageno;
+	int			entryno,
+				length;
+	char	   *buf;
+	OldMultiXactOffset *offptr,
+				offset;
+	TransactionId result_xid = InvalidTransactionId;
+	bool		result_isupdate = false;
+
+	nextMXact = state->nextMXact;
+	nextOffset = state->nextOffset;
+
+	/*
+	 * See GetMultiXactIdMembers in multixact.c
+	 *
+	 * Find out the offset at which we need to start reading MultiXactMembers
+	 * and the number of members in the multixact.  We determine the latter as
+	 * the difference between this multixact's starting offset and the next
+	 * one's.  However, there are some corner cases to worry about:
+	 *
+	 * 1. This multixact may be the latest one created, in which case there is
+	 * no next one to look at.  In this case the nextOffset value we just
+	 * saved is the correct endpoint.
+	 *
+	 * 2. The next multixact may still be in process of being filled in...
+	 * This cannot happen during upgrade.
+	 *
+	 * 3. Because GetNewMultiXactId increments offset zero to offset one to
+	 * handle case #2, there is an ambiguity near the point of offset
+	 * wraparound.  If we see next multixact's offset is one, is that our
+	 * multixact's actual endpoint, or did it end at zero with a subsequent
+	 * increment?  We handle this using the knowledge that if the zero'th
+	 * member slot wasn't filled, it'll contain zero, and zero isn't a valid
+	 * transaction ID so it can't be a multixact member.  Therefore, if we
+	 * read a zero from the members array, just ignore it.
+	 */
+
+	pageno = MultiXactIdToOffsetPage(multi);
+	entryno = MultiXactIdToOffsetEntry(multi);
+
+	buf = SlruReadSwitchPage(state->offset, pageno);
+	offptr = (OldMultiXactOffset *) buf;
+	offptr += entryno;
+	offset = *offptr;
+
+	Assert(offset != 0);
+
+	/*
+	 * Use the same increment rule as GetNewMultiXactId(), that is, don't
+	 * handle wraparound explicitly until needed.
+	 */
+	tmpMXact = multi + 1;
+
+	if (nextMXact == tmpMXact)
+	{
+		/* Corner case 1: there is no next multixact */
+		length = nextOffset - offset;
+	}
+	else
+	{
+		OldMultiXactOffset nextMXOffset;
+
+		/* handle wraparound if needed */
+		if (tmpMXact < FirstMultiXactId)
+			tmpMXact = FirstMultiXactId;
+
+		prev_pageno = pageno;
+
+		pageno = MultiXactIdToOffsetPage(tmpMXact);
+		entryno = MultiXactIdToOffsetEntry(tmpMXact);
+
+		if (pageno != prev_pageno)
+			buf = SlruReadSwitchPage(state->offset, pageno);
+
+		offptr = (OldMultiXactOffset *) buf;
+		offptr += entryno;
+		nextMXOffset = *offptr;
+
+		/*
+		 * Corner case 2: next multixact is still being filled in, this must
+		 * not happen during upgrade.
+		 */
+		Assert(nextMXOffset != 0);
+
+		length = nextMXOffset - offset;
+	}
+
+	prev_pageno = -1;
+	for (int i = 0; i < length; i++, offset++)
+	{
+		TransactionId *xactptr;
+		uint32	   *flagsptr;
+		int			flagsoff;
+		int			bshift;
+		int			memberoff;
+		MultiXactStatus st;
+
+		pageno = MXOffsetToMemberPage(offset);
+		memberoff = MXOffsetToMemberOffset(offset);
+
+		if (pageno != prev_pageno)
+		{
+			buf = SlruReadSwitchPage(state->members, pageno);
+			prev_pageno = pageno;
+		}
+
+		xactptr = (TransactionId *) (buf + memberoff);
+		if (!TransactionIdIsValid(*xactptr))
+		{
+			/* Corner case 3: we must be looking at unused slot zero */
+			Assert(offset == 0);
+			continue;
+		}
+
+		flagsoff = MXOffsetToFlagsOffset(offset);
+		bshift = MXOffsetToFlagsBitShift(offset);
+		flagsptr = (uint32 *) (buf + flagsoff);
+
+		st = (*flagsptr >> bshift) & MXACT_MEMBER_XACT_BITMASK;
+
+		/* Verify that there is a single update Xid among the given members. */
+		if (ISUPDATE_from_mxstatus(st))
+		{
+			if (result_isupdate)
+				pg_fatal("multixact %u has more than one updating member",
+						 multi);
+			result_xid = *xactptr;
+			result_isupdate = true;
+		}
+		else if (!TransactionIdIsValid(result_xid))
+			result_xid = *xactptr;
+	}
+
+	/* A multixid with zero members should not happen */
+	Assert(TransactionIdIsValid(result_xid));
+
+	*result = result_xid;
+	*status = result_isupdate ? MultiXactStatusUpdate :
+		MultiXactStatusForKeyShare;
+}
+
+/*
+ * Frees the malloced reader.
+ */
+void
+FreeOldMultiXactReader(OldMultiXactReader *state)
+{
+	FreeSlruRead(state->offset);
+	FreeSlruRead(state->members);
+
+	pfree(state);
+}
diff --git a/src/bin/pg_upgrade/multixact_old.h b/src/bin/pg_upgrade/multixact_old.h
new file mode 100644
index 00000000000..8eb5af2ccaf
--- /dev/null
+++ b/src/bin/pg_upgrade/multixact_old.h
@@ -0,0 +1,29 @@
+/*
+ * multixact_old.h
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ * src/bin/pg_upgrade/multixact_old.h
+ */
+
+#include "access/multixact.h"
+#include "slru_io.h"
+
+typedef uint32 OldMultiXactOffset;
+
+typedef struct OldMultiXactReader
+{
+	MultiXactId nextMXact;
+	OldMultiXactOffset nextOffset;
+
+	SlruSegState *offset;
+	SlruSegState *members;
+} OldMultiXactReader;
+
+extern OldMultiXactReader *AllocOldMultiXactRead(char *pgdata,
+												 MultiXactId nextMulti,
+												 OldMultiXactOffset nextOffset);
+extern void GetOldMultiXactIdSingleMember(OldMultiXactReader *state,
+										  MultiXactId multi,
+										  TransactionId *result,
+										  MultiXactStatus *status);
+extern void FreeOldMultiXactReader(OldMultiXactReader *reader);
diff --git a/src/bin/pg_upgrade/pg_upgrade.c b/src/bin/pg_upgrade/pg_upgrade.c
index 490e98fa26f..0fdd05c127c 100644
--- a/src/bin/pg_upgrade/pg_upgrade.c
+++ b/src/bin/pg_upgrade/pg_upgrade.c
@@ -48,6 +48,8 @@
 #include "common/logging.h"
 #include "common/restricted_token.h"
 #include "fe_utils/string_utils.h"
+#include "multixact_old.h"
+#include "multixact_new.h"
 #include "pg_upgrade.h"
 
 /*
@@ -769,6 +771,81 @@ copy_subdir_files(const char *old_subdir, const char *new_subdir)
 	check_ok();
 }
 
+/*
+ * Convert pg_multixact/offsets and /members to new format with 64-bit offsets.
+ */
+static void
+convert_multixacts(MultiXactId *new_nxtmulti, MultiXactOffset *new_nxtmxoff)
+{
+	MultiXactId oldest_multi,
+				next_multi;
+	OldMultiXactReader *old_reader;
+	MultiXactWriter *new_writer;
+	MultiXactOffset next_offset;
+
+	/*
+	 * The range of valid multi XIDs is unchanged by the conversion (they are
+	 * referenced from the heap tables), but the members SLRU is rewritten to
+	 * start from offset 1.
+	 */
+	oldest_multi = old_cluster.controldata.chkpnt_oldstMulti;
+	next_multi = old_cluster.controldata.chkpnt_nxtmulti;
+	next_offset = 1;
+
+	old_reader = AllocOldMultiXactRead(old_cluster.pgdata,
+									   old_cluster.controldata.chkpnt_nxtmulti,
+									   old_cluster.controldata.chkpnt_nxtmxoff);
+	new_writer = AllocMultiXactWrite(new_cluster.pgdata,
+									 oldest_multi, next_offset);
+
+	/* handle wraparound */
+	if (next_multi < FirstMultiXactId)
+		next_multi = FirstMultiXactId;
+
+	/*
+	 * Read multixids from old files one by one, and write them back in the
+	 * new format.
+	 */
+	for (MultiXactId multi = oldest_multi; multi != next_multi;)
+	{
+		TransactionId xid;
+		MultiXactStatus status;
+		MultiXactMember member;
+
+		/*
+		 * Read the old multixid.  The locking-only XIDs that may be part of
+		 * multi-xids don't matter after upgrade, as there can be no
+		 * transactions running across upgrade.  So as a little optimization,
+		 * we only read one member from each multixid: the updating one,
+		 * or if there was no update, arbitrarily the first locking xid.
+		 */
+		GetOldMultiXactIdSingleMember(old_reader, multi, &xid, &status);
+
+		/* Write it out in new format */
+		member.xid = xid;
+		member.status = status;
+		RecordNewMultiXact(new_writer, next_offset, multi, 1, &member);
+
+		next_offset += 1;
+		multi++;
+		/* handle wraparound */
+		if (multi < FirstMultiXactId)
+			multi = FirstMultiXactId;
+	}
+
+	/*
+	 * Update the nextMXact/Offset values in the control file to match what we
+	 * wrote.  The nextMXact is unchanged, but nextOffset will be different.
+	 */
+	Assert(next_multi == old_cluster.controldata.chkpnt_nxtmulti);
+	*new_nxtmulti = next_multi;
+	*new_nxtmxoff = next_offset;
+
+	/* Release resources */
+	FreeMultiXactWrite(new_writer);
+	FreeOldMultiXactReader(old_reader);
+}
+
 static void
 copy_xact_xlog_xid(void)
 {
@@ -816,8 +893,29 @@ copy_xact_xlog_xid(void)
 	if (old_cluster.controldata.cat_ver >= MULTIXACT_FORMATCHANGE_CAT_VER &&
 		new_cluster.controldata.cat_ver >= MULTIXACT_FORMATCHANGE_CAT_VER)
 	{
-		copy_subdir_files("pg_multixact/offsets", "pg_multixact/offsets");
-		copy_subdir_files("pg_multixact/members", "pg_multixact/members");
+		MultiXactId new_nxtmulti = old_cluster.controldata.chkpnt_nxtmulti;
+		MultiXactOffset new_nxtmxoff = old_cluster.controldata.chkpnt_nxtmxoff;
+
+		/*
+		 * If the old server is before the
+		 * MULTIXACTOFFSET_FORMATCHANGE_CAT_VER it must have 32-bit multixid
+		 * offsets, thus it should be converted.
+		 */
+		if (old_cluster.controldata.cat_ver < MULTIXACTOFFSET_FORMATCHANGE_CAT_VER &&
+			new_cluster.controldata.cat_ver >= MULTIXACTOFFSET_FORMATCHANGE_CAT_VER)
+		{
+			remove_new_subdir("pg_multixact/members", false);
+			remove_new_subdir("pg_multixact/offsets", false);
+
+			prep_status("Converting pg_multixact/offsets to 64-bit");
+			convert_multixacts(&new_nxtmulti, &new_nxtmxoff);
+			check_ok();
+		}
+		else
+		{
+			copy_subdir_files("pg_multixact/offsets", "pg_multixact/offsets");
+			copy_subdir_files("pg_multixact/members", "pg_multixact/members");
+		}
 
 		prep_status("Setting next multixact ID and offset for new cluster");
 
@@ -826,10 +924,8 @@ copy_xact_xlog_xid(void)
 		 * counters here and the oldest multi present on system.
 		 */
 		exec_prog(UTILITY_LOG_FILE, NULL, true, true,
-				  "\"%s/pg_resetwal\" -O %u -m %u,%u \"%s\"",
-				  new_cluster.bindir,
-				  old_cluster.controldata.chkpnt_nxtmxoff,
-				  old_cluster.controldata.chkpnt_nxtmulti,
+				  "\"%s/pg_resetwal\" -O %" PRIu64 " -m %u,%u \"%s\"",
+				  new_cluster.bindir, new_nxtmxoff, new_nxtmulti,
 				  old_cluster.controldata.chkpnt_oldstMulti,
 				  new_cluster.pgdata);
 		check_ok();
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index e86336f4be9..127b2cb00fa 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -114,6 +114,11 @@ extern char *output_files[];
  */
 #define MULTIXACT_FORMATCHANGE_CAT_VER 201301231
 
+/*
+ * Switch from 32-bit to 64-bit for multixid offsets.
+ */
+#define MULTIXACTOFFSET_FORMATCHANGE_CAT_VER 999999999
+
 /*
  * large object chunk size added to pg_controldata,
  * commit 5f93c37805e7485488480916b4585e098d3cc883
diff --git a/src/bin/pg_upgrade/slru_io.c b/src/bin/pg_upgrade/slru_io.c
new file mode 100644
index 00000000000..010094184be
--- /dev/null
+++ b/src/bin/pg_upgrade/slru_io.c
@@ -0,0 +1,242 @@
+/*
+ * slru_io.c
+ *
+ * Routines for reading and writing SLRU files during upgrade.
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ * src/bin/pg_upgrade/slru_io.c
+ */
+
+#include "postgres_fe.h"
+
+#include <fcntl.h>
+
+#include "common/fe_memutils.h"
+#include "common/file_perm.h"
+#include "common/file_utils.h"
+#include "port/pg_iovec.h"
+#include "pg_upgrade.h"
+#include "slru_io.h"
+
+static SlruSegState *AllocSlruSegState(const char *dir);
+static char *SlruFileName(SlruSegState *state, int64 segno);
+static void SlruFlush(SlruSegState *state);
+
+static SlruSegState *
+AllocSlruSegState(const char *dir)
+{
+	SlruSegState *state = pg_malloc(sizeof(*state));
+
+	state->dir = pstrdup(dir);
+	state->fn = NULL;
+	state->fd = -1;
+	state->segno = -1;
+	state->pageno = 0;
+
+	return state;
+}
+
+/* similar to the backend function with the same name */
+static char *
+SlruFileName(SlruSegState *state, int64 segno)
+{
+	if (state->long_segment_names)
+	{
+		Assert(segno >= 0 && segno <= INT64CONST(0xFFFFFFFFFFFFFFF));
+		return psprintf("%s/%015" PRIX64, state->dir, segno);
+	}
+	else
+	{
+		Assert(segno >= 0 && segno <= INT64CONST(0xFFFFFF));
+		return psprintf("%s/%04X", state->dir, (unsigned int) segno);
+	}
+}
+
+/*
+ * Create slru reader for dir.
+ *
+ * Returns the malloced memory used by all the other read calls in this module.
+ */
+SlruSegState *
+AllocSlruRead(const char *dir, bool long_segment_names)
+{
+	SlruSegState *state = AllocSlruSegState(dir);
+
+	state->writing = false;
+	state->long_segment_names = long_segment_names;
+
+	return state;
+}
+
+/*
+ * Open given page for reading.
+ *
+ * Reading can be done in random order.
+ */
+char *
+SlruReadSwitchPageSlow(SlruSegState *state, uint64 pageno)
+{
+	int64		segno;
+
+	Assert(!state->writing);	/* read only mode */
+
+	if (state->segno != -1 && pageno == state->pageno)
+		return state->buf.data;
+
+	segno = pageno / SLRU_PAGES_PER_SEGMENT;
+	if (segno != state->segno)
+	{
+		if (state->segno != -1)
+		{
+			close(state->fd);
+			state->fd = -1;
+
+			pg_free(state->fn);
+			state->fn = NULL;
+
+			state->segno = -1;
+		}
+
+		/* Open new segment */
+		state->fn = SlruFileName(state, segno);
+		if ((state->fd = open(state->fn, O_RDONLY | PG_BINARY, 0)) < 0)
+			pg_fatal("could not open file \"%s\": %m", state->fn);
+	}
+
+	state->segno = segno;
+
+	{
+		struct iovec iovec = {
+			.iov_base = &state->buf,
+			.iov_len = BLCKSZ,
+		};
+		off_t		offset = (pageno % SLRU_PAGES_PER_SEGMENT) * BLCKSZ;
+
+		if (pg_preadv(state->fd, &iovec, 1, offset) < 0)
+			pg_fatal("could not read file \"%s\": %m", state->fn);
+
+		state->pageno = pageno;
+	}
+
+	return state->buf.data;
+}
+
+/*
+ * Frees the malloced reader.
+ */
+void
+FreeSlruRead(SlruSegState *state)
+{
+	Assert(!state->writing);	/* read only mode */
+
+	if (state->fd != -1)
+		close(state->fd);
+	pg_free(state);
+}
+
+/*
+ * Create slru writer for dir.
+ *
+ * Returns the malloced memory used by all the other write calls in this module.
+ */
+SlruSegState *
+AllocSlruWrite(const char *dir, bool long_segment_names)
+{
+	SlruSegState *state = AllocSlruSegState(dir);
+
+	state->writing = true;
+	state->long_segment_names = long_segment_names;
+
+	return state;
+}
+
+/*
+ * Open the given page for writing.
+ *
+ * NOTE: This uses O_EXCL when stepping to a new segment, so this assumes that
+ * each segment is written in full before moving on to next one.  This
+ * limitation would be easy to lift if needed, but it fits the usage pattern of
+ * current callers.
+ */
+char *
+SlruWriteSwitchPageSlow(SlruSegState *state, uint64 pageno)
+{
+	int64		segno;
+	off_t		offset;
+
+	if (state->segno != -1 && pageno == state->pageno)
+		return state->buf.data;
+
+	segno = pageno / SLRU_PAGES_PER_SEGMENT;
+	offset = (pageno % SLRU_PAGES_PER_SEGMENT) * BLCKSZ;
+
+	SlruFlush(state);
+	memset(state->buf.data, 0, BLCKSZ);
+
+	if (segno != state->segno)
+	{
+		if (state->segno != -1)
+		{
+			close(state->fd);
+			state->fd = -1;
+
+			pg_free(state->fn);
+			state->fn = NULL;
+
+			state->segno = -1;
+		}
+
+		/* Create the segment */
+		state->fn = SlruFileName(state, segno);
+		if ((state->fd = open(state->fn, O_RDWR | O_CREAT | O_EXCL | PG_BINARY,
+							  pg_file_create_mode)) < 0)
+		{
+			pg_fatal("could not create file \"%s\": %m", state->fn);
+		}
+
+		state->segno = segno;
+
+		if (offset > 0)
+		{
+			if (pg_pwrite_zeros(state->fd, offset, 0) < 0)
+				pg_fatal("could not write file \"%s\": %m", state->fn);
+		}
+	}
+
+	state->pageno = pageno;
+
+	return state->buf.data;
+}
+
+static void
+SlruFlush(SlruSegState *state)
+{
+	struct iovec iovec = {
+		.iov_base = &state->buf,
+		.iov_len = BLCKSZ,
+	};
+	off_t		offset;
+
+	if (state->segno == -1)
+		return;
+
+	offset = (state->pageno % SLRU_PAGES_PER_SEGMENT) * BLCKSZ;
+
+	if (pg_pwritev_with_retry(state->fd, &iovec, 1, offset) < 0)
+		pg_fatal("could not write file \"%s\": %m", state->fn);
+}
+
+/*
+ * Frees the malloced writer.
+ */
+void
+FreeSlruWrite(SlruSegState *state)
+{
+	Assert(state->writing);
+
+	SlruFlush(state);
+
+	if (state->fd != -1)
+		close(state->fd);
+	pg_free(state);
+}
diff --git a/src/bin/pg_upgrade/slru_io.h b/src/bin/pg_upgrade/slru_io.h
new file mode 100644
index 00000000000..5c80a679b4d
--- /dev/null
+++ b/src/bin/pg_upgrade/slru_io.h
@@ -0,0 +1,52 @@
+/*
+ * slru_io.h
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ * src/bin/pg_upgrade/slru_io.h
+ */
+
+#ifndef SLRU_IO_H
+#define SLRU_IO_H
+
+/*
+ * State for reading or writing an SLRU, with a one page buffer.
+ */
+typedef struct SlruSegState
+{
+	bool		writing;
+	bool		long_segment_names;
+
+	char	   *dir;
+	char	   *fn;
+	int			fd;
+	int64		segno;
+	uint64		pageno;
+
+	PGAlignedBlock buf;
+} SlruSegState;
+
+extern SlruSegState *AllocSlruRead(const char *dir, bool long_segment_names);
+extern char *SlruReadSwitchPageSlow(SlruSegState *state, uint64 pageno);
+extern void FreeSlruRead(SlruSegState *state);
+
+static inline char *
+SlruReadSwitchPage(SlruSegState *state, uint64 pageno)
+{
+	if (state->segno != -1 && pageno == state->pageno)
+		return state->buf.data;
+	return SlruReadSwitchPageSlow(state, pageno);
+}
+
+extern SlruSegState *AllocSlruWrite(const char *dir, bool long_segment_names);
+extern char *SlruWriteSwitchPageSlow(SlruSegState *state, uint64 pageno);
+extern void FreeSlruWrite(SlruSegState *state);
+
+static inline char *
+SlruWriteSwitchPage(SlruSegState *state, uint64 pageno)
+{
+	if (state->segno != -1 && pageno == state->pageno)
+		return state->buf.data;
+	return SlruWriteSwitchPageSlow(state, pageno);
+}
+
+#endif							/* SLRU_IO_H */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 23bce72ae64..f9ddd06ec1d 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1725,6 +1725,7 @@ MultiXactMember
 MultiXactOffset
 MultiXactStateData
 MultiXactStatus
+MultiXactWriter
 MultirangeIOData
 MultirangeParseState
 MultirangeType
@@ -1808,6 +1809,7 @@ OffsetVarNodes_context
 Oid
 OidOptions
 OkeysState
+OldMultiXactReader
 OldToNewMapping
 OldToNewMappingData
 OnCommitAction
@@ -2804,6 +2806,7 @@ SlruCtlData
 SlruErrorCause
 SlruPageStatus
 SlruScanCallback
+SlruSegState
 SlruShared
 SlruSharedData
 SlruWriteAll
-- 
2.47.3
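
For readers following the reader code in multixact_old.c above, the pre-v19 member addressing (four flag bytes followed by four XIDs per 20-byte group) can be sketched on its own. This is a minimal illustration, assuming the default 8 kB block size and a 4-byte TransactionId; the sketch_ names are hypothetical stand-ins for the MXOffsetTo...() helpers, not code from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Assumptions for this sketch: default 8 kB block, 4-byte TransactionId. */
#define SKETCH_BLCKSZ				8192
#define SKETCH_XID_SIZE				4
#define SKETCH_BITS_PER_XACT		8
#define SKETCH_FLAGBYTES_PER_GROUP	4
#define SKETCH_MEMBERS_PER_GROUP	4
#define SKETCH_GROUP_SIZE \
	(SKETCH_XID_SIZE * SKETCH_MEMBERS_PER_GROUP + SKETCH_FLAGBYTES_PER_GROUP)
#define SKETCH_GROUPS_PER_PAGE	(SKETCH_BLCKSZ / SKETCH_GROUP_SIZE)
#define SKETCH_MEMBERS_PER_PAGE \
	(SKETCH_GROUPS_PER_PAGE * SKETCH_MEMBERS_PER_GROUP)

/* Page on which a member lives. */
static int64_t
sketch_member_page(uint32_t offset)
{
	return offset / SKETCH_MEMBERS_PER_PAGE;
}

/* Byte offset within the page of the group's flag word. */
static int
sketch_flags_offset(uint32_t offset)
{
	uint32_t	group = offset / SKETCH_MEMBERS_PER_GROUP;
	int			grouponpg = group % SKETCH_GROUPS_PER_PAGE;

	return grouponpg * SKETCH_GROUP_SIZE;
}

/* Byte offset within the page of the member's XID. */
static int
sketch_member_offset(uint32_t offset)
{
	int			member_in_group = offset % SKETCH_MEMBERS_PER_GROUP;

	return sketch_flags_offset(offset) +
		SKETCH_FLAGBYTES_PER_GROUP +
		member_in_group * SKETCH_XID_SIZE;
}

/* Bit shift of the member's status byte inside the flag word. */
static int
sketch_flags_bitshift(uint32_t offset)
{
	return (offset % SKETCH_MEMBERS_PER_GROUP) * SKETCH_BITS_PER_XACT;
}
```

For example, under these assumptions member offset 5 sits on page 0, in the second group: its flag word starts at byte 20, its XID at byte 28, and its status byte is shifted by 8 bits within the flag word.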

v25-0004-Remove-oldestOffset-oldestOffsetKnown-from-multi.patch (text/x-patch; charset=UTF-8)
From b511667577101768f243acaf8e1f382a1e9d4fe4 Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Thu, 6 Nov 2025 16:20:18 +0300
Subject: [PATCH v25 04/10] Remove oldestOffset/oldestOffsetKnown from
 multixact

Since pg_upgrade now rewrites all multixacts, the oldest offset in a new
cluster can no longer be missing on disk.
---
 src/backend/access/transam/multixact.c | 101 ++-----------------------
 src/include/access/multixact.h         |   3 -
 2 files changed, 5 insertions(+), 99 deletions(-)

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index e0323ec1014..78ba6d72a92 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -140,14 +140,6 @@ typedef struct MultiXactStateData
 	MultiXactId oldestMultiXactId;
 	Oid			oldestMultiXactDB;
 
-	/*
-	 * Oldest multixact offset that is potentially referenced by a multixact
-	 * referenced by a relation.  We don't always know this value, so there's
-	 * a flag here to indicate whether or not we currently do.
-	 */
-	MultiXactOffset oldestOffset;
-	bool		oldestOffsetKnown;
-
 	/* support for anti-wraparound measures */
 	MultiXactId multiVacLimit;
 	MultiXactId multiWarnLimit;
@@ -2371,10 +2363,7 @@ SetOffsetVacuumLimit(bool is_startup)
 	MultiXactId oldestMultiXactId;
 	MultiXactId nextMXact;
 	MultiXactOffset oldestOffset = 0;	/* placate compiler */
-	MultiXactOffset prevOldestOffset;
 	MultiXactOffset nextOffset;
-	bool		oldestOffsetKnown = false;
-	bool		prevOldestOffsetKnown;
 
 	/*
 	 * NB: Have to prevent concurrent truncation, we might otherwise try to
@@ -2387,8 +2376,6 @@ SetOffsetVacuumLimit(bool is_startup)
 	oldestMultiXactId = MultiXactState->oldestMultiXactId;
 	nextMXact = MultiXactState->nextMXact;
 	nextOffset = MultiXactState->nextOffset;
-	prevOldestOffsetKnown = MultiXactState->oldestOffsetKnown;
-	prevOldestOffset = MultiXactState->oldestOffset;
 	Assert(MultiXactState->finishedStartup);
 	LWLockRelease(MultiXactGenLock);
 
@@ -2406,57 +2393,20 @@ SetOffsetVacuumLimit(bool is_startup)
 		 * offset.
 		 */
 		oldestOffset = nextOffset;
-		oldestOffsetKnown = true;
 	}
-	else
+	else if (!find_multixact_start(oldestMultiXactId, &oldestOffset))
 	{
-		/*
-		 * Figure out where the oldest existing multixact's offsets are
-		 * stored. Due to bugs in early release of PostgreSQL 9.3.X and 9.4.X,
-		 * the supposedly-earliest multixact might not really exist.  We are
-		 * careful not to fail in that case.
-		 */
-		oldestOffsetKnown =
-			find_multixact_start(oldestMultiXactId, &oldestOffset);
-
-		if (!oldestOffsetKnown)
-			ereport(LOG,
-					(errmsg("oldest checkpointed MultiXact %u does not exist on disk",
-							oldestMultiXactId)));
+		ereport(LOG,
+				(errmsg("oldest checkpointed MultiXact %u does not exist on disk",
+						oldestMultiXactId)));
 	}
 
 	LWLockRelease(MultiXactTruncationLock);
 
-	/*
-	 * If we can, compute limits (and install them MultiXactState) to prevent
-	 * overrun of old data in the members SLRU area. We can only do so if the
-	 * oldest offset is known though.
-	 *
-	 * FIXME: Is !oldestOffsetKnown possible anymore? At least update the comment:
-	 * we won't overrun members anymore.
-	 */
-	if (prevOldestOffsetKnown)
-	{
-		/*
-		 * If we failed to get the oldest offset this time, but we have a
-		 * value from a previous pass through this function, use the old
-		 * values rather than automatically forcing an autovacuum cycle again.
-		 */
-		oldestOffset = prevOldestOffset;
-		oldestOffsetKnown = true;
-	}
-
-	/* Install the computed values */
-	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
-	MultiXactState->oldestOffset = oldestOffset;
-	MultiXactState->oldestOffsetKnown = oldestOffsetKnown;
-	LWLockRelease(MultiXactGenLock);
-
 	/*
 	 * Do we need autovacuum?	If we're not sure, assume yes.
 	 */
-	return !oldestOffsetKnown ||
-		(nextOffset - oldestOffset > MULTIXACT_MEMBER_AUTOVAC_THRESHOLD);
+	return nextOffset - oldestOffset > MULTIXACT_MEMBER_AUTOVAC_THRESHOLD;
 }
 
 /*
@@ -2503,47 +2453,6 @@ find_multixact_start(MultiXactId multi, MultiXactOffset *result)
 	return true;
 }
 
-/*
- * GetMultiXactInfo
- *
- * Returns information about the current MultiXact state, as of:
- * multixacts: Number of MultiXacts (nextMultiXactId - oldestMultiXactId)
- * members: Number of member entries (nextOffset - oldestOffset)
- * oldestMultiXactId: Oldest MultiXact ID still in use
- * oldestOffset: Oldest offset still in use
- *
- * Returns false if unable to determine, the oldest offset being unknown.
- */
-bool
-GetMultiXactInfo(uint32 *multixacts, MultiXactOffset *members,
-				 MultiXactId *oldestMultiXactId, MultiXactOffset *oldestOffset)
-{
-	MultiXactOffset nextOffset;
-	MultiXactId nextMultiXactId;
-	bool		oldestOffsetKnown;
-
-	LWLockAcquire(MultiXactGenLock, LW_SHARED);
-	nextOffset = MultiXactState->nextOffset;
-	*oldestMultiXactId = MultiXactState->oldestMultiXactId;
-	nextMultiXactId = MultiXactState->nextMXact;
-	*oldestOffset = MultiXactState->oldestOffset;
-	oldestOffsetKnown = MultiXactState->oldestOffsetKnown;
-	LWLockRelease(MultiXactGenLock);
-
-	if (!oldestOffsetKnown)
-	{
-		*members = 0;
-		*multixacts = 0;
-		*oldestMultiXactId = InvalidMultiXactId;
-		*oldestOffset = 0;
-		return false;
-	}
-
-	*members = nextOffset - *oldestOffset;
-	*multixacts = nextMultiXactId - *oldestMultiXactId;
-	return true;
-}
-
 typedef struct mxtruncinfo
 {
 	int64		earliestExistingPage;
diff --git a/src/include/access/multixact.h b/src/include/access/multixact.h
index 7d98fe0fe32..d688b547c54 100644
--- a/src/include/access/multixact.h
+++ b/src/include/access/multixact.h
@@ -109,9 +109,6 @@ extern bool MultiXactIdIsRunning(MultiXactId multi, bool isLockOnly);
 extern void MultiXactIdSetOldestMember(void);
 extern int	GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
 								  bool from_pgupgrade, bool isLockOnly);
-extern bool GetMultiXactInfo(uint32 *multixacts, MultiXactOffset *members,
-							 MultiXactId *oldestMultiXactId,
-							 MultiXactOffset *oldestOffset);
 extern bool MultiXactIdPrecedes(MultiXactId multi1, MultiXactId multi2);
 extern bool MultiXactIdPrecedesOrEquals(MultiXactId multi1,
 										MultiXactId multi2);
-- 
2.47.3

v25-0005-Reintroduce-MultiXactMemberFreezeThreshold.patch (text/x-patch; charset=UTF-8)
From 8211ba8be8f8d2da4fc3237c817b411ad9ebe728 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Thu, 13 Nov 2025 12:38:41 +0200
Subject: [PATCH v25 05/10] Reintroduce MultiXactMemberFreezeThreshold

---
 src/backend/access/transam/multixact.c | 202 ++++++++++++++++++++-----
 src/backend/access/transam/xlog.c      |   4 +-
 src/backend/commands/vacuum.c          |   6 +-
 src/backend/postmaster/autovacuum.c    |   4 +-
 src/include/access/multixact.h         |   4 +-
 5 files changed, 170 insertions(+), 50 deletions(-)

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 78ba6d72a92..c72b2cd7090 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -91,13 +91,13 @@
 
 
 /*
- * Multixact members warning threshold.
- *
- * If the difference between nextOffset and oldestOffset exceeds this value,
- * we trigger autovacuum in order to release disk space consumed by the
- * members SLRU.
+ * Thresholds used to keep members disk usage in check when multixids have a
+ * lot of members.  When MULTIXACT_MEMBER_LOW_THRESHOLD is reached, vacuum
+ * starts freezing multixids more aggressively, even if the normal multixid
+ * age limits haven't been reached yet.
  */
-#define MULTIXACT_MEMBER_AUTOVAC_THRESHOLD		UINT64CONST(4000000000)
+#define MULTIXACT_MEMBER_LOW_THRESHOLD		UINT64CONST(2000000000)
+#define MULTIXACT_MEMBER_HIGH_THRESHOLD		UINT64CONST(4000000000)
 
 static inline MultiXactId
 PreviousMultiXactId(MultiXactId multi)
@@ -140,6 +140,12 @@ typedef struct MultiXactStateData
 	MultiXactId oldestMultiXactId;
 	Oid			oldestMultiXactDB;
 
+	/*
+	 * Oldest multixact offset that is potentially referenced by a multixact
+	 * referenced by a relation.
+	 */
+	MultiXactOffset oldestOffset;
+
 	/* support for anti-wraparound measures */
 	MultiXactId multiVacLimit;
 	MultiXactId multiWarnLimit;
@@ -276,7 +282,7 @@ static bool MultiXactOffsetPrecedes(MultiXactOffset offset1,
 									MultiXactOffset offset2);
 static void ExtendMultiXactOffset(MultiXactId multi);
 static void ExtendMultiXactMember(MultiXactOffset offset, int nmembers);
-static bool SetOffsetVacuumLimit(bool is_startup);
+static void SetOffsetVacuumLimit(void);
 static bool find_multixact_start(MultiXactId multi, MultiXactOffset *result);
 static void WriteMTruncateXlogRec(Oid oldestMultiDB,
 								  MultiXactId startTruncOff,
@@ -1945,8 +1951,8 @@ TrimMultiXact(void)
 	MultiXactState->finishedStartup = true;
 	LWLockRelease(MultiXactGenLock);
 
-	/* Now compute how far away the next members wraparound is. */
-	SetMultiXactIdLimit(oldestMXact, oldestMXactDB, true);
+	/* Now compute how far away the next multixid wraparound is. */
+	SetMultiXactIdLimit(oldestMXact, oldestMXactDB);
 }
 
 /*
@@ -2015,28 +2021,24 @@ MultiXactSetNextMXact(MultiXactId nextMulti,
  * datminmxid (ie, the oldest MultiXactId that might exist in any database
  * of our cluster), and the OID of the (or a) database with that value.
  *
- * is_startup is true when we are just starting the cluster, false when we
- * are updating state in a running cluster.  This only affects log messages.
+ * This also updates MultiXactState->oldestOffset, by looking up the offset of
+ * MultiXactState->oldestMultiXactId.
  */
 void
-SetMultiXactIdLimit(MultiXactId oldest_datminmxid, Oid oldest_datoid,
-					bool is_startup)
+SetMultiXactIdLimit(MultiXactId oldest_datminmxid, Oid oldest_datoid)
 {
 	MultiXactId multiVacLimit;
 	MultiXactId multiWarnLimit;
 	MultiXactId multiStopLimit;
 	MultiXactId multiWrapLimit;
 	MultiXactId curMulti;
-	bool		needs_offset_vacuum;
 
 	Assert(MultiXactIdIsValid(oldest_datminmxid));
 
 	/*
 	 * We pretend that a wrap will happen halfway through the multixact ID
 	 * space, but that's not really true, because multixacts wrap differently
-	 * from transaction IDs.  Note that, separately from any concern about
-	 * multixact IDs wrapping, we must ensure that multixact members do not
-	 * wrap.  Limits for that are set in SetOffsetVacuumLimit, not here.
+	 * from transaction IDs.
 	 */
 	multiWrapLimit = oldest_datminmxid + (MaxMultiXactId >> 1);
 	if (multiWrapLimit < FirstMultiXactId)
@@ -2104,8 +2106,13 @@ SetMultiXactIdLimit(MultiXactId oldest_datminmxid, Oid oldest_datoid,
 
 	Assert(!InRecovery);
 
-	/* Set limits for offset vacuum. */
-	needs_offset_vacuum = SetOffsetVacuumLimit(is_startup);
+	/*
+	 * Offsets are 64 bits wide and never wrap around, so we don't need to
+	 * consider them for emergency autovacuum purposes.  But now that we're in
+	 * a consistent state, determine MultiXactState->oldestOffset, to be used
+	 * to calculate freezing cutoff to keep the offsets disk usage in check.
+	 */
+	SetOffsetVacuumLimit();
 
 	/*
 	 * If past the autovacuum force point, immediately signal an autovac
@@ -2114,8 +2121,7 @@ SetMultiXactIdLimit(MultiXactId oldest_datminmxid, Oid oldest_datoid,
 	 * database, it'll call here, and we'll signal the postmaster to start
 	 * another iteration immediately if there are still any old databases.
 	 */
-	if ((MultiXactIdPrecedes(multiVacLimit, curMulti) ||
-		 needs_offset_vacuum) && IsUnderPostmaster)
+	if (MultiXactIdPrecedes(multiVacLimit, curMulti) && IsUnderPostmaster)
 		SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER);
 
 	/* Give an immediate warning if past the wrap warn point */
@@ -2198,7 +2204,7 @@ MultiXactAdvanceOldest(MultiXactId oldestMulti, Oid oldestMultiDB)
 	Assert(InRecovery);
 
 	if (MultiXactIdPrecedes(MultiXactState->oldestMultiXactId, oldestMulti))
-		SetMultiXactIdLimit(oldestMulti, oldestMultiDB, false);
+		SetMultiXactIdLimit(oldestMulti, oldestMultiDB);
 }
 
 /*
@@ -2348,22 +2354,17 @@ GetOldestMultiXactId(void)
 }
 
 /*
- * Determine if we need to vacuum to keep the size of the members SLRU in
- * check.
- *
- * To do so determine what's the oldest member offset and install the limit
- * info in MultiXactState, where it can be used to prevent overrun of old data
- * in the members SLRU area.
- *
- * The return value is true if autovacuum is required and false otherwise.
+ * Determine what's the oldest member offset and install it in MultiXactState,
+ * where it can be used to adjust multixid freezing cutoffs.
  */
-static bool
-SetOffsetVacuumLimit(bool is_startup)
+static void
+SetOffsetVacuumLimit(void)
 {
 	MultiXactId oldestMultiXactId;
 	MultiXactId nextMXact;
 	MultiXactOffset oldestOffset = 0;	/* placate compiler */
 	MultiXactOffset nextOffset;
+	bool		oldestOffsetKnown = false;
 
 	/*
 	 * NB: Have to prevent concurrent truncation, we might otherwise try to
@@ -2393,20 +2394,37 @@ SetOffsetVacuumLimit(bool is_startup)
 		 * offset.
 		 */
 		oldestOffset = nextOffset;
+		oldestOffsetKnown = true;
 	}
-	else if (!find_multixact_start(oldestMultiXactId, &oldestOffset))
+	else
 	{
-		ereport(LOG,
-				(errmsg("oldest checkpointed MultiXact %u does not exist on disk",
-						oldestMultiXactId)));
+		/*
+		 * Figure out the offset at which the oldest existing multixact's
+		 * members are stored.  If we cannot find it, be careful not to fail.
+		 * (Due to bugs in early releases of PostgreSQL 9.3.X and 9.4.X, the
+		 * supposedly-earliest multixact might not really exist.  Those should
+		 * be long gone by now, but let's nevertheless be careful not to fail
+		 * in that case.)
+		 */
+		oldestOffsetKnown =
+			find_multixact_start(oldestMultiXactId, &oldestOffset);
+
+		if (!oldestOffsetKnown)
+			ereport(LOG,
+					(errmsg("oldest checkpointed MultiXact %u does not exist on disk",
+							oldestMultiXactId)));
+		return;
 	}
 
 	LWLockRelease(MultiXactTruncationLock);
 
-	/*
-	 * Do we need autovacuum?	If we're not sure, assume yes.
-	 */
-	return nextOffset - oldestOffset > MULTIXACT_MEMBER_AUTOVAC_THRESHOLD;
+	/* Install the computed value */
+	if (oldestOffsetKnown)
+	{
+		LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
+		MultiXactState->oldestOffset = oldestOffset;
+		LWLockRelease(MultiXactGenLock);
+	}
 }
 
 /*
@@ -2453,6 +2471,107 @@ find_multixact_start(MultiXactId multi, MultiXactOffset *result)
 	return true;
 }
 
+/*
+ * Determine how many multixacts, and how many multixact members, currently
+ * exist.
+ */
+static void
+ReadMultiXactCounts(uint32 *multixacts, MultiXactOffset *members)
+{
+	MultiXactOffset nextOffset;
+	MultiXactOffset oldestOffset;
+	MultiXactId oldestMultiXactId;
+	MultiXactId nextMultiXactId;
+
+	LWLockAcquire(MultiXactGenLock, LW_SHARED);
+	nextOffset = MultiXactState->nextOffset;
+	oldestMultiXactId = MultiXactState->oldestMultiXactId;
+	nextMultiXactId = MultiXactState->nextMXact;
+	oldestOffset = MultiXactState->oldestOffset;
+	LWLockRelease(MultiXactGenLock);
+
+	*members = nextOffset - oldestOffset;
+	*multixacts = nextMultiXactId - oldestMultiXactId;
+}
+
+/*
+ * Multixact members can be removed once the multixacts that refer to them are
+ * older than every datminmxid.  autovacuum_multixact_freeze_max_age and
+ * vacuum_multixact_freeze_table_age work together to make sure we never have
+ * too many multixacts; we hope that, at least under normal circumstances,
+ * this will also be sufficient to keep us from using too many offsets.
+ * However, if the average multixact has many members, we might accumulate a
+ * huge amount of members, consuming disk space, while still using few enough
+ * multixids that the multixid limits fail to trigger relminmxid advancement
+ * by VACUUM.
+ *
+ * To prevent that, if more than a certain amount of members space is used
+ * (MULTIXACT_MEMBER_LOW_THRESHOLD), we effectively reduce
+ * autovacuum_multixact_freeze_max_age to a value just less than the number of
+ * multixacts in use.  We hope that this will quickly trigger autovacuuming on
+ * the table or tables with the oldest relminmxid, thus allowing datminmxid
+ * values to advance and removing some members.
+ *
+ * As the amount of the member space in use grows, we become more aggressive
+ * in clamping this value.  That not only causes autovacuum to ramp up, but
+ * also makes any manual vacuums the user issues more aggressive.  This
+ * happens because vacuum_get_cutoffs() will clamp the freeze table and the
+ * minimum freeze age cutoffs based on the effective
+ * autovacuum_multixact_freeze_max_age this function returns.  At the extreme,
+ * when the members usage reaches MULTIXACT_MEMBER_HIGH_THRESHOLD, we'll clamp
+ * freeze_max_age to zero, and every vacuum of any table will freeze every
+ * multixact.
+ */
+int
+MultiXactMemberFreezeThreshold(void)
+{
+	MultiXactOffset members;
+	uint32		multixacts;
+	uint32		victim_multixacts;
+	double		fraction;
+	int			result;
+
+	/*
+	 * Read the current offsets and members usage.
+	 *
+	 * Note: In the case that we have been unable to calculate oldestOffset,
+	 * because we failed to find the offset of the oldest multixid, we assume
+	 * the worst because oldestOffset will be left to zero in that case.
+	 */
+	ReadMultiXactCounts(&multixacts, &members);
+
+	/* If member space utilization is low, no special action is required. */
+	if (members <= MULTIXACT_MEMBER_LOW_THRESHOLD)
+		return autovacuum_multixact_freeze_max_age;
+
+	/*
+	 * Compute a target for relminmxid advancement.  The number of multixacts
+	 * we try to eliminate from the system is based on how far we are past
+	 * MULTIXACT_MEMBER_LOW_THRESHOLD.
+	 *
+	 * The way this formula works is that when members is exactly at the low
+	 * threshold, fraction == 0.0, and we set freeze_max_age equal to
+	 * mxid_age(oldestMultiXactId).  As members grows further, towards the
+	 * high threshold, fraction grows linearly from 0.0 to 1.0, and the result
+	 * shrinks from mxid_age(oldestMultiXactId) to 0.  Beyond the high
+	 * threshold, fraction > 1.0 and the result is clamped to 0.
+	 */
+	fraction = (double) (members - MULTIXACT_MEMBER_LOW_THRESHOLD) /
+		(MULTIXACT_MEMBER_HIGH_THRESHOLD - MULTIXACT_MEMBER_LOW_THRESHOLD);
+	victim_multixacts = multixacts * fraction;
+
+	/* fraction could be > 1.0, but lowest possible freeze age is zero */
+	if (victim_multixacts > multixacts)
+		return 0;
+	result = multixacts - victim_multixacts;
+
+	/*
+	 * Clamp to autovacuum_multixact_freeze_max_age, so that we never make
+	 * autovacuum less aggressive than it would otherwise be.
+	 */
+	return Min(result, autovacuum_multixact_freeze_max_age);
+}
+
 typedef struct mxtruncinfo
 {
 	int64		earliestExistingPage;
@@ -2669,6 +2788,7 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
 	MultiXactState->oldestMultiXactId = newOldestMulti;
 	MultiXactState->oldestMultiXactDB = newOldestMultiDB;
+	MultiXactState->oldestOffset = newOldestOffset;
 	LWLockRelease(MultiXactGenLock);
 
 	/* First truncate members */
@@ -2864,7 +2984,7 @@ multixact_redo(XLogReaderState *record)
 		 * Advance the horizon values, so they're current at the end of
 		 * recovery.
 		 */
-		SetMultiXactIdLimit(xlrec.endTruncOff, xlrec.oldestMultiDB, false);
+		SetMultiXactIdLimit(xlrec.endTruncOff, xlrec.oldestMultiDB);
 
 		PerformMembersTruncation(xlrec.startTruncMemb, xlrec.endTruncMemb);
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index ef405d66b3b..a000b8bd509 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5155,7 +5155,7 @@ BootStrapXLOG(uint32 data_checksum_version)
 	MultiXactSetNextMXact(checkPoint.nextMulti, checkPoint.nextMultiOffset);
 	AdvanceOldestClogXid(checkPoint.oldestXid);
 	SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
-	SetMultiXactIdLimit(checkPoint.oldestMulti, checkPoint.oldestMultiDB, true);
+	SetMultiXactIdLimit(checkPoint.oldestMulti, checkPoint.oldestMultiDB);
 	SetCommitTsLimit(InvalidTransactionId, InvalidTransactionId);
 
 	/* Set up the XLOG page header */
@@ -5636,7 +5636,7 @@ StartupXLOG(void)
 	MultiXactSetNextMXact(checkPoint.nextMulti, checkPoint.nextMultiOffset);
 	AdvanceOldestClogXid(checkPoint.oldestXid);
 	SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
-	SetMultiXactIdLimit(checkPoint.oldestMulti, checkPoint.oldestMultiDB, true);
+	SetMultiXactIdLimit(checkPoint.oldestMulti, checkPoint.oldestMultiDB);
 	SetCommitTsLimit(checkPoint.oldestCommitTsXid,
 					 checkPoint.newestCommitTsXid);
 
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 100e1a72c22..bd4278cd250 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -1146,9 +1146,9 @@ vacuum_get_cutoffs(Relation rel, const VacuumParams params,
 	/*
 	 * Also compute the multixact age for which freezing is urgent.  This is
 	 * normally autovacuum_multixact_freeze_max_age, but may be less if we are
-	 * short of multixact member space.
+	 * short of multixact member space. XXX update comment
 	 */
-	effective_multixact_freeze_max_age = autovacuum_multixact_freeze_max_age;
+	effective_multixact_freeze_max_age = MultiXactMemberFreezeThreshold();
 
 	/*
 	 * Almost ready to set freeze output parameters; check if OldestXmin or
@@ -1971,7 +1971,7 @@ vac_truncate_clog(TransactionId frozenXID,
 	 * signaling twice?
 	 */
 	SetTransactionIdLimit(frozenXID, oldestxid_datoid);
-	SetMultiXactIdLimit(minMulti, minmulti_datoid, false);
+	SetMultiXactIdLimit(minMulti, minmulti_datoid);
 
 	LWLockRelease(WrapLimitsVacuumLock);
 }
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index bf66f494e3a..1c38488f2cb 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -1151,7 +1151,7 @@ do_start_worker(void)
 
 	/* Also determine the oldest datminmxid we will consider. */
 	recentMulti = ReadNextMultiXactId();
-	multiForceLimit = recentMulti - autovacuum_multixact_freeze_max_age;
+	multiForceLimit = recentMulti - MultiXactMemberFreezeThreshold();
 	if (multiForceLimit < FirstMultiXactId)
 		multiForceLimit -= FirstMultiXactId;
 
@@ -1939,7 +1939,7 @@ do_autovacuum(void)
 	 * normally autovacuum_multixact_freeze_max_age, but may be less if we are
 	 * short of multixact member space.
 	 */
-	effective_multixact_freeze_max_age = autovacuum_multixact_freeze_max_age;
+	effective_multixact_freeze_max_age = MultiXactMemberFreezeThreshold();
 
 	/*
 	 * Find the pg_database entry and select the default freeze ages. We use
diff --git a/src/include/access/multixact.h b/src/include/access/multixact.h
index d688b547c54..cfff86f655f 100644
--- a/src/include/access/multixact.h
+++ b/src/include/access/multixact.h
@@ -126,8 +126,7 @@ extern void BootStrapMultiXact(void);
 extern void StartupMultiXact(void);
 extern void TrimMultiXact(void);
 extern void SetMultiXactIdLimit(MultiXactId oldest_datminmxid,
-								Oid oldest_datoid,
-								bool is_startup);
+								Oid oldest_datoid);
 extern void MultiXactGetCheckptMulti(bool is_shutdown,
 									 MultiXactId *nextMulti,
 									 MultiXactOffset *nextMultiOffset,
@@ -142,6 +141,7 @@ extern void MultiXactSetNextMXact(MultiXactId nextMulti,
 extern void MultiXactAdvanceNextMXact(MultiXactId minMulti,
 									  MultiXactOffset minMultiOffset);
 extern void MultiXactAdvanceOldest(MultiXactId oldestMulti, Oid oldestMultiDB);
+extern int	MultiXactMemberFreezeThreshold(void);
 
 extern void multixact_twophase_recover(FullTransactionId fxid, uint16 info,
 									   void *recdata, uint32 len);
-- 
2.47.3
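
For readers following 0005: the interpolation done by MultiXactMemberFreezeThreshold() can be tried outside the server with a small standalone sketch. The helper name member_freeze_threshold() and its parameters are made up for illustration; only the two thresholds are copied from the patch. The early return for fraction >= 1.0 is a simplification of mine to avoid converting an out-of-range double to an integer; the patch instead compares victim_multixacts against multixacts.

```c
#include <assert.h>
#include <stdint.h>

/* Thresholds as defined in the 0005 patch. */
#define MEMBER_LOW_THRESHOLD	UINT64_C(2000000000)
#define MEMBER_HIGH_THRESHOLD	UINT64_C(4000000000)

/*
 * Hypothetical standalone version of the clamping logic.
 *
 * members:        member entries in use (nextOffset - oldestOffset)
 * multixacts:     multixacts in use (nextMXact - oldestMultiXactId)
 * freeze_max_age: stands in for autovacuum_multixact_freeze_max_age,
 *                 assumed to be non-negative
 */
static int
member_freeze_threshold(uint64_t members, uint32_t multixacts,
						int freeze_max_age)
{
	double		fraction;
	uint32_t	victim_multixacts;
	uint32_t	remaining;

	/* Low member usage: leave the configured limit alone. */
	if (members <= MEMBER_LOW_THRESHOLD)
		return freeze_max_age;

	/* Grows linearly from 0.0 at the low threshold to 1.0 at the high one. */
	fraction = (double) (members - MEMBER_LOW_THRESHOLD) /
		(double) (MEMBER_HIGH_THRESHOLD - MEMBER_LOW_THRESHOLD);

	/* At or beyond the high threshold every multixact gets frozen. */
	if (fraction >= 1.0)
		return 0;

	victim_multixacts = (uint32_t) (multixacts * fraction);
	remaining = multixacts - victim_multixacts;

	/* Never make autovacuum less aggressive than the configured limit. */
	if (remaining > (uint32_t) freeze_max_age)
		return freeze_max_age;
	return (int) remaining;
}
```

With freeze_max_age at 400 million, 3 billion members and one million live multixacts, the sketch returns 500000: usage is halfway between the two thresholds, so half of the live multixacts become freezing candidates.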

v25-0006-TEST-bump-catversion.patch (text/x-patch; charset=UTF-8)
From b0d3aae5088a42ad6bc714be473dbc6fcae4847d Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Fri, 24 Oct 2025 11:47:50 +0300
Subject: [PATCH v25 06/10] TEST: bump catversion

To avoid constant CF-bot complaints, make the catversion bump in a separate
commit.

NOTE: keep it in sync with MULTIXACTOFFSET_FORMATCHANGE_CAT_VER
---
 src/include/catalog/catversion.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h
index 7eefca1ae42..b0162c2bf63 100644
--- a/src/include/catalog/catversion.h
+++ b/src/include/catalog/catversion.h
@@ -57,6 +57,7 @@
  */
 
 /*							yyyymmddN */
-#define CATALOG_VERSION_NO	202511101
+// FIXME: bump it
+#define CATALOG_VERSION_NO	999999999
 
 #endif
-- 
2.47.3

v25-0007-TEST-Add-test-for-64-bit-mxoff-in-pg_resetwal.patch (text/x-patch; charset=UTF-8)
From d3fdf8da398df15a6e44d52364b65783f89c360d Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Tue, 28 Oct 2025 19:08:26 +0300
Subject: [PATCH v25 07/10] TEST: Add test for 64-bit mxoff in pg_resetwal

---
 src/bin/pg_resetwal/meson.build    |   1 +
 src/bin/pg_resetwal/t/003_mxoff.pl | 170 +++++++++++++++++++++++++++++
 2 files changed, 171 insertions(+)
 create mode 100644 src/bin/pg_resetwal/t/003_mxoff.pl

diff --git a/src/bin/pg_resetwal/meson.build b/src/bin/pg_resetwal/meson.build
index 290832b2299..1e2dfb38a5b 100644
--- a/src/bin/pg_resetwal/meson.build
+++ b/src/bin/pg_resetwal/meson.build
@@ -25,6 +25,7 @@ tests += {
     'tests': [
       't/001_basic.pl',
       't/002_corrupted.pl',
+      't/003_mxoff.pl',
     ],
   },
 }
diff --git a/src/bin/pg_resetwal/t/003_mxoff.pl b/src/bin/pg_resetwal/t/003_mxoff.pl
new file mode 100644
index 00000000000..3c1b7fa1d33
--- /dev/null
+++ b/src/bin/pg_resetwal/t/003_mxoff.pl
@@ -0,0 +1,170 @@
+
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+use strict;
+use warnings FATAL => 'all';
+
+use Math::BigInt;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+sub mxact_eater
+{
+	my $node = shift;
+	my $tbl = shift;
+
+	$node->start;
+	$node->safe_psql('postgres',
+		"CREATE TABLE ${tbl} (I INT PRIMARY KEY, N_UPDATED INT) " .
+		"       WITH (AUTOVACUUM_ENABLED=FALSE);" .
+		"INSERT INTO ${tbl} SELECT G, 0 FROM GENERATE_SERIES(1, 50) G;");
+
+	# consume around 10k multixact offsets
+	my $nclients = 10;
+	my $update_every = 75;
+	my @connections = ();
+
+	for (1..$nclients)
+	{
+		my $conn = $node->background_psql('postgres');
+		$conn->query_safe("BEGIN");
+
+		push(@connections, $conn);
+	}
+
+	for (my $i = 0; $i < 1000; $i++)
+	{
+		my $conn = $connections[$i % $nclients];
+
+		$conn->query_safe("COMMIT;");
+		$conn->query_safe("BEGIN");
+
+		if ($i % $update_every == 0)
+		{
+			$conn->query_safe(
+				"UPDATE ${tbl} SET " .
+				"N_UPDATED = N_UPDATED + 1 " .
+				"WHERE I = ${i} % 50");
+		}
+		else
+		{
+			$conn->query_safe(
+				"SELECT * FROM ${tbl} FOR KEY SHARE");
+		}
+	}
+
+	for my $conn (@connections)
+	{
+		$conn->quit();
+	}
+
+	$node->stop;
+}
+
+sub next_mxoff
+{
+	my $node = shift;
+	my ($stdout, $stderr) =
+	  run_command([ 'pg_controldata', $node->data_dir ]);
+	my @control_data = split("\n", $stdout);
+	my $next_mxoff = undef;
+
+	foreach (@control_data)
+	{
+		if ($_ =~ /^Latest checkpoint's NextMultiOffset:\s*(.*)$/mg)
+		{
+			$next_mxoff = $1;
+			last;
+		}
+	}
+	die "NextMultiOffset not found in control file\n"
+		unless defined($next_mxoff);
+
+	return $next_mxoff;
+}
+
+sub reset_mxoff
+{
+	my $node = shift;
+	my $offset = shift;
+	$offset = Math::BigInt->new($offset);
+
+	# Get block size
+	my $out = (run_command([ 'pg_resetwal', '--dry-run', $node->data_dir ]))[0];
+	$out =~ /^Database block size: *(\d+)$/m or die;
+	my $blcksz = $1;
+
+	# Reset to new offset
+	my @cmd = ('pg_resetwal', '--pgdata' => $node->data_dir);
+	push @cmd, '--multixact-offset' => $offset->as_hex();
+	command_ok(\@cmd, 'set next multixact-offset');
+
+	# Fill empty pg_multixact/members segment
+	my $mult = 32 * int($blcksz / 20) * 4;
+	my $segname = sprintf "%015X", $offset / $mult;
+
+	my @dd = ('dd');
+	push @dd, "if=/dev/zero";
+	push @dd, "of=" . $node->data_dir . "/pg_multixact/members/" . $segname;
+	push @dd, "bs=$blcksz";
+	push @dd, "count=32";
+	command_ok(\@dd, 'fill empty multixact-members');
+}
+
+my ($off1, $off2);
+
+# start from defaults
+my $node1 = PostgreSQL::Test::Cluster->new('node1');
+$node1->init;
+$off1 = next_mxoff($node1);
+mxact_eater($node1, "FOO");
+$off2 = next_mxoff($node1);
+note "> start from $off1, finished at $off2\n";
+
+# start from before 32-bit wraparound
+my $node2 = PostgreSQL::Test::Cluster->new('node2');
+$node2->init;
+reset_mxoff($node2, 0xFFFF0000);
+$off1 = next_mxoff($node2);
+mxact_eater($node2, "FOO");
+$off2 = next_mxoff($node2);
+note "> start from $off1, finished at $off2\n";
+
+# start near 32-bit wraparound
+my $node3 = PostgreSQL::Test::Cluster->new('node3');
+$node3->init;
+reset_mxoff($node3, 0xFFFFEC77);
+$off1 = next_mxoff($node3);
+mxact_eater($node3, "FOO");
+$off2 = next_mxoff($node3);
+note "> start from $off1, finished at $off2\n";
+
+# start over 32-bit wraparound
+my $node4 = PostgreSQL::Test::Cluster->new('node4');
+$node4->init;
+reset_mxoff($node4, '0xFFFFFFFF0000');
+$off1 = next_mxoff($node4);
+mxact_eater($node4, "FOO");
+$off2 = next_mxoff($node4);
+note "> start from $off1, finished at $off2\n";
+
+# check invariant
+$node1->start;
+$node2->start;
+$node3->start;
+$node4->start;
+
+my $var1 = $node1->safe_psql('postgres', 'TABLE FOO');
+my $var2 = $node2->safe_psql('postgres', 'TABLE FOO');
+my $var3 = $node3->safe_psql('postgres', 'TABLE FOO');
+my $var4 = $node4->safe_psql('postgres', 'TABLE FOO');
+ok($var1 eq $var2 && $var2 eq $var3 && $var3 eq $var4,
+	'check table invariant in all nodes');
+
+$node4->stop;
+$node3->stop;
+$node2->stop;
+$node1->stop;
+
+done_testing();
-- 
2.47.3
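
A note on the dd step in reset_mxoff() above: the segment file it zero-fills is picked by dividing the offset by the number of members per segment, 32 * int(blcksz / 20) * 4, and printing the quotient as 15 uppercase hex digits. Below is a minimal C sketch of that same arithmetic. The function and macro names are hypothetical; the constants 32, 20 and 4 are taken from the test script, not looked up in the server headers.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Constants as used by the Perl test (assumed to match the server layout). */
#define PAGES_PER_SEGMENT	32
#define MEMBER_GROUP_BYTES	20
#define MEMBERS_PER_GROUP	4

/* Number of multixact members that fit into one members segment file. */
static uint64_t
members_per_segment(uint32_t blcksz)
{
	return (uint64_t) PAGES_PER_SEGMENT *
		(blcksz / MEMBER_GROUP_BYTES) * MEMBERS_PER_GROUP;
}

/*
 * Write the name of the segment file holding 'offset' into buf, using the
 * same "%015X" format as the test.  buf needs room for 16 bytes.
 */
static void
members_segment_name(uint64_t offset, uint32_t blcksz,
					 char *buf, size_t buflen)
{
	snprintf(buf, buflen, "%015" PRIX64,
			 offset / members_per_segment(blcksz));
}
```

For an 8 kB block size this gives 52352 members per segment, so the 0xFFFF0000 case of the test lands in segment 000000000014076.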

v25-0008-TEST-Add-test-for-wraparound-of-next-new-multi-i.patch (text/x-patch; charset=UTF-8)
From 02b6f6df2602b7eeba71cac39fbc53f606b0d652 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Wed, 12 Nov 2025 13:36:09 +0200
Subject: [PATCH v25 08/10] TEST: Add test for wraparound of next new multi in
 pg_upgrade

Related to BUG #18863 and BUG #18865
---
 src/bin/pg_upgrade/meson.build         |   1 +
 src/bin/pg_upgrade/t/007_multi_wrap.pl | 176 +++++++++++++++++++++++++
 2 files changed, 177 insertions(+)
 create mode 100644 src/bin/pg_upgrade/t/007_multi_wrap.pl

diff --git a/src/bin/pg_upgrade/meson.build b/src/bin/pg_upgrade/meson.build
index 3e46c4512cf..ca87ae221ce 100644
--- a/src/bin/pg_upgrade/meson.build
+++ b/src/bin/pg_upgrade/meson.build
@@ -50,6 +50,7 @@ tests += {
       't/004_subscription.pl',
       't/005_char_signedness.pl',
       't/006_transfer_modes.pl',
+      't/007_multi_wrap.pl',
     ],
     'test_kwargs': {'priority': 40}, # pg_upgrade tests are slow
   },
diff --git a/src/bin/pg_upgrade/t/007_multi_wrap.pl b/src/bin/pg_upgrade/t/007_multi_wrap.pl
new file mode 100644
index 00000000000..0ad8fd59906
--- /dev/null
+++ b/src/bin/pg_upgrade/t/007_multi_wrap.pl
@@ -0,0 +1,176 @@
+
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+use strict;
+use warnings FATAL => 'all';
+
+use Math::BigInt;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use PostgreSQL::Test::AdjustDump;
+use PostgreSQL::Test::AdjustUpgrade;
+use Test::More;
+
+# Temp dir for dumps.
+my $tempdir = PostgreSQL::Test::Utils::tempdir;
+
+# Can be changed to test the other modes.
+my $mode = $ENV{PG_TEST_PG_UPGRADE_MODE} || '--copy';
+
+# Handy pg_resetwal wrapper
+sub reset_mxoff
+{
+	my %args = @_;
+
+	my $node = $args{node};
+	my $offset = $args{offset};
+	my $multi = $args{multi};
+	my $blcksz = sub # Get block size
+	{
+		my $out = (run_command([ 'pg_resetwal', '--dry-run',
+								 $node->data_dir ]))[0];
+		$out =~ /^Database block size: *(\d+)$/m or die;
+		return $1;
+	}->();
+
+	my @cmd;
+
+	# Reset cluster
+	@cmd = ('pg_resetwal', '--pgdata' => $node->data_dir);
+	if (defined($offset))
+	{
+		push @cmd, '--multixact-offset' => $offset;
+	}
+	if (defined($multi))
+	{
+		push @cmd, "--multixact-ids=$multi,$multi";
+	}
+	command_ok(\@cmd, 'reset multi/offset');
+
+	my $n_items;
+	my $segname;
+
+	# Fill empty pg_multixact segments
+	if (defined($offset))
+	{
+		$n_items = 32 * int($blcksz / 20) * 4;
+		$segname = sprintf "%015X", ($offset / $n_items);
+		$segname = $node->data_dir . "/pg_multixact/members/" . $segname;
+
+		@cmd = ('dd');
+		push @cmd, "if=/dev/zero";
+		push @cmd, "of=" . $segname;
+		push @cmd, "bs=$blcksz";
+		push @cmd, "count=32";
+		command_ok(\@cmd, 'fill empty multixact-members');
+	}
+
+	if (defined($multi))
+	{
+		$n_items = 32 * int($blcksz / 8);
+		$segname = sprintf "%04X", $multi / $n_items;
+		$segname = $node->data_dir . "/pg_multixact/offsets/" . $segname;
+
+		@cmd = ('dd');
+		push @cmd, "if=/dev/zero";
+		push @cmd, "of=" . $segname;
+		push @cmd, "bs=$blcksz";
+		push @cmd, "count=32";
+		command_ok(\@cmd, 'fill empty multixact-offsets');
+	}
+}
+
+sub get_dump_for_comparison
+{
+	my ($node, $db, $file_prefix, $adjust_child_columns) = @_;
+
+	my $dumpfile = $tempdir . '/' . $file_prefix . '.sql';
+	my $dump_adjusted = "${dumpfile}_adjusted";
+
+	open(my $dh, '>', $dump_adjusted)
+	  || die "could not open $dump_adjusted for writing $!";
+
+	$node->run_log(
+		[
+			'pg_dump', '--no-sync',
+			'--restrict-key' => 'test',
+			'-d' => $node->connstr($db),
+			'-f' => $dumpfile
+		]);
+
+	print $dh adjust_regress_dumpfile(slurp_file($dumpfile),
+		$adjust_child_columns);
+	close($dh);
+
+	return $dump_adjusted;
+}
+
+# Create old node
+my $old = PostgreSQL::Test::Cluster->new("old");
+$old->init;
+reset_mxoff(node => $old, multi => 4294967295, offset => 429496729);
+
+$old->start;
+$old->safe_psql('postgres',
+qq(
+	CREATE TABLE test_table (id integer NOT NULL PRIMARY KEY, val text);
+	INSERT INTO test_table VALUES (1, 'a');
+));
+
+my $conn1 = $old->background_psql('postgres');
+my $conn2 = $old->background_psql('postgres');
+
+$conn1->query_safe(qq(
+	BEGIN;
+	SELECT * FROM test_table WHERE id = 1 FOR SHARE;
+));
+$conn2->query_safe(qq(
+	BEGIN;
+	SELECT * FROM test_table WHERE id = 1 FOR SHARE;
+));
+
+$conn1->query_safe(qq(COMMIT;));
+$conn2->query_safe(qq(COMMIT;));
+
+$conn1->quit;
+$conn2->quit;
+
+$old->stop;
+
+# Create new node
+my $new = PostgreSQL::Test::Cluster->new("new");
+$new->init;
+
+# Run pg_upgrade
+command_ok(
+	[
+		'pg_upgrade', '--no-sync',
+		'--old-datadir' => $old->data_dir,
+		'--new-datadir' => $new->data_dir,
+		'--old-bindir' => $old->config_data('--bindir'),
+		'--new-bindir' => $new->config_data('--bindir'),
+		'--socketdir' => $new->host,
+		'--old-port' => $old->port,
+		'--new-port' => $new->port,
+		$mode,
+	],
+	'run of pg_upgrade for new instance');
+ok( !-d $new->data_dir . "/pg_upgrade_output.d",
+	"pg_upgrade_output.d/ removed after pg_upgrade success");
+
+$old->start;
+my $src_dump =
+	get_dump_for_comparison($old, 'postgres',
+							"oldnode_1_dump", 0);
+$old->stop;
+
+$new->start;
+my $dst_dump =
+	get_dump_for_comparison($new, 'postgres',
+							"newnode_1_dump", 0);
+$new->stop;
+
+compare_files($src_dump, $dst_dump,
+	'dump outputs from original and restored regression databases match');
+
+done_testing();
-- 
2.47.3

v25-0009-TEST-Add-test-for-64-bit-mxoff-in-pg_upgrade.patch (text/x-patch; charset=UTF-8)
From eedf4c80d19c14a7bbb109ed936a9c5edd6e3367 Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Wed, 29 Oct 2025 14:19:56 +0300
Subject: [PATCH v25 09/10] TEST: Add test for 64-bit mxoff in pg_upgrade

---
 src/bin/pg_upgrade/meson.build    |   1 +
 src/bin/pg_upgrade/t/008_mxoff.pl | 463 ++++++++++++++++++++++++++++++
 2 files changed, 464 insertions(+)
 create mode 100644 src/bin/pg_upgrade/t/008_mxoff.pl

diff --git a/src/bin/pg_upgrade/meson.build b/src/bin/pg_upgrade/meson.build
index ca87ae221ce..7f14e5c463c 100644
--- a/src/bin/pg_upgrade/meson.build
+++ b/src/bin/pg_upgrade/meson.build
@@ -51,6 +51,7 @@ tests += {
       't/005_char_signedness.pl',
       't/006_transfer_modes.pl',
       't/007_multi_wrap.pl',
+      't/008_mxoff.pl',
     ],
     'test_kwargs': {'priority': 40}, # pg_upgrade tests are slow
   },
diff --git a/src/bin/pg_upgrade/t/008_mxoff.pl b/src/bin/pg_upgrade/t/008_mxoff.pl
new file mode 100644
index 00000000000..7204325f873
--- /dev/null
+++ b/src/bin/pg_upgrade/t/008_mxoff.pl
@@ -0,0 +1,463 @@
+
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+use strict;
+use warnings FATAL => 'all';
+
+use Math::BigInt;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use PostgreSQL::Test::AdjustDump;
+use PostgreSQL::Test::AdjustUpgrade;
+use Test::More;
+
+# This test involves different multitransaction states, similarly to that of
+# 002_pg_upgrade.pl.
+
+unless (defined($ENV{oldinstall}))
+{
+	plan skip_all =>
+		'set the oldinstall environment variable to a pre-64-bit-mxoff installation to run this test';
+}
+
+# Temp dir for dumps.
+my $tempdir = PostgreSQL::Test::Utils::tempdir;
+
+# Can be changed to test the other modes.
+my $mode = $ENV{PG_TEST_PG_UPGRADE_MODE} || '--copy';
+
+sub utility_path
+{
+	my $node = shift;
+	my $name = shift;
+
+	my $bin_path = defined($node->install_path) ?
+		$node->install_path . "/bin/$name" : $name;
+
+	return $bin_path;
+}
+
+# Get NextMultiOffset.
+sub next_mxoff
+{
+	my $node = shift;
+
+	my $pg_controldata_path = utility_path($node, 'pg_controldata');
+	my ($stdout, $stderr) = run_command([ $pg_controldata_path,
+											$node->data_dir ]);
+	my @control_data = split("\n", $stdout);
+	my $next_mxoff = undef;
+
+	foreach (@control_data)
+	{
+		if ($_ =~ /^Latest checkpoint's NextMultiOffset:\s*(.*)$/mg)
+		{
+			$next_mxoff = $1;
+			last;
+		}
+	}
+	die "NextMultiOffset not found in control file\n"
+		unless defined($next_mxoff);
+
+	return $next_mxoff;
+}
+
+# Consume around 10k of mxoffsets.
+sub mxact_eater
+{
+	my $node = shift;
+	my $tbl = 'FOO';
+
+	my ($mxoff1, $mxoff2);
+
+	$mxoff1 = next_mxoff($node);
+	$node->start;
+	$node->safe_psql('postgres',
+		"CREATE TABLE ${tbl} (I INT PRIMARY KEY, N_UPDATED INT) " .
+		"       WITH (AUTOVACUUM_ENABLED=FALSE);" .
+		"INSERT INTO ${tbl} SELECT G, 0 FROM GENERATE_SERIES(1, 50) G;");
+
+	# consume around 10k mxoff
+	my $nclients = 10;
+	my $update_every = 75;
+	my @connections = ();
+
+	for (1..$nclients)
+	{
+		my $conn = $node->background_psql('postgres');
+		$conn->query_safe("BEGIN");
+
+		push(@connections, $conn);
+	}
+
+	for (my $i = 0; $i < 1000; $i++)
+	{
+		my $conn = $connections[$i % $nclients];
+
+		$conn->query_safe("COMMIT;");
+		$conn->query_safe("BEGIN");
+
+		if ($i % $update_every == 0)
+		{
+			$conn->query_safe(
+				"UPDATE ${tbl} SET " .
+				"N_UPDATED = N_UPDATED + 1 " .
+				"WHERE I = ${i} % 50");
+		}
+		else
+		{
+			$conn->query_safe(
+				"SELECT * FROM ${tbl} FOR KEY SHARE");
+		}
+	}
+
+	for my $conn (@connections)
+	{
+		$conn->quit();
+	}
+
+	$node->stop;
+	$mxoff2 = next_mxoff($node);
+
+	return $mxoff1, $mxoff2;
+}
+
+# Consume around 1M of mxoffsets.
+sub mxact_huge_eater
+{
+	my $node = shift;
+	my $tbl = 'FOO';
+
+	my ($mxoff1, $mxoff2);
+
+	$mxoff1 = next_mxoff($node);
+	$node->start;
+	$node->safe_psql('postgres',
+		"CREATE TABLE ${tbl} (I INT PRIMARY KEY, N_UPDATED INT) " .
+		"       WITH (AUTOVACUUM_ENABLED=FALSE);" .
+		"INSERT INTO ${tbl} SELECT G, 0 FROM GENERATE_SERIES(1, 4) G;");
+
+	my $nclients = 100;
+	my @connections = ();
+	my $timeout = 10 * $PostgreSQL::Test::Utils::timeout_default;
+
+	for (1..$nclients)
+	{
+		my $conn = $node->background_psql('postgres',
+										  timeout => $timeout);
+		$conn->query_safe("BEGIN");
+
+		push(@connections, $conn);
+	}
+
+	# It's a long process, better to tell about progress.
+	my $n_steps = 100_000;
+	my $step = int($n_steps / 10);
+
+	diag "\nstart to consume mxoffsets ...\n";
+	for (my $i = 0; $i < $n_steps; $i++)
+	{
+		my $conn = $connections[$i % $nclients];
+
+		$conn->query_safe("COMMIT;");
+		$conn->query_safe("BEGIN");
+
+		{
+			$conn->query_safe(
+				"SELECT * FROM ${tbl} " .
+				"FOR KEY SHARE");
+		}
+
+		if ($i % $step == 0)
+		{
+			my $done = int(($i / $n_steps) * 100);
+			diag "$done% done...";
+		}
+	}
+
+	for my $conn (@connections)
+	{
+		$conn->quit();
+	}
+
+	$node->stop;
+	$mxoff2 = next_mxoff($node);
+
+	return $mxoff1, $mxoff2;
+}
+
+# Set the next multixact offset with pg_resetwal
+sub reset_mxoff
+{
+	my $node = shift;
+	my $offset = shift;
+
+	my $pg_resetwal_path = utility_path($node, 'pg_resetwal');
+	# Get block size
+	my $out = (run_command([ $pg_resetwal_path, '--dry-run',
+							 $node->data_dir ]))[0];
+	$out =~ /^Database block size: *(\d+)$/m or die;
+	my $blcksz = $1;
+
+	# Reset to new offset
+	my @cmd = ($pg_resetwal_path, '--pgdata' => $node->data_dir);
+	push @cmd, '--multixact-offset' => $offset;
+	command_ok(\@cmd, 'set next multixact-offset');
+
+	# Fill empty pg_multixact/members segment
+	my $mult = 32 * int($blcksz / 20) * 4;
+	my $segname = sprintf "%04X", $offset / $mult;
+
+	my @dd = ('dd');
+	push @dd, "if=/dev/zero";
+	push @dd, "of=" . $node->data_dir . "/pg_multixact/members/" . $segname;
+	push @dd, "bs=$blcksz";
+	push @dd, "count=32";
+	command_ok(\@dd, 'fill empty multixact-members');
+}
+
+sub get_dump_for_comparison
+{
+	my ($node, $db, $file_prefix, $adjust_child_columns) = @_;
+
+	my $dumpfile = $tempdir . '/' . $file_prefix . '.sql';
+	my $dump_adjusted = "${dumpfile}_adjusted";
+
+	open(my $dh, '>', $dump_adjusted)
+	  || die "could not open $dump_adjusted for writing $!";
+
+	my $pg_dump_path = utility_path($node, 'pg_dump');
+
+	$node->run_log(
+		[
+			$pg_dump_path, '--no-sync',
+			'--restrict-key' => 'test',
+			'-d' => $node->connstr($db),
+			'-f' => $dumpfile
+		]);
+
+	print $dh adjust_regress_dumpfile(slurp_file($dumpfile),
+		$adjust_child_columns);
+	close($dh);
+
+	return $dump_adjusted;
+}
+
+# Main test workhorse routine.
+# Run pg_upgrade, dump both clusters and compare the dumps.
+sub run_test
+{
+	my $tag = shift;
+	my $oldnode = shift;
+	my $newnode = shift;
+
+	my $pg_upgrade_path = utility_path($newnode, 'pg_upgrade');
+
+	command_ok(
+		[
+			$pg_upgrade_path, '--no-sync',
+			'--old-datadir' => $oldnode->data_dir,
+			'--new-datadir' => $newnode->data_dir,
+			'--old-bindir' => $oldnode->config_data('--bindir'),
+			'--new-bindir' => $newnode->config_data('--bindir'),
+			'--socketdir' => $newnode->host,
+			'--old-port' => $oldnode->port,
+			'--new-port' => $newnode->port,
+			$mode,
+		],
+		'run of pg_upgrade for new instance');
+	ok( !-d $newnode->data_dir . "/pg_upgrade_output.d",
+		"pg_upgrade_output.d/ removed after pg_upgrade success");
+
+	$oldnode->start;
+	my $src_dump =
+		get_dump_for_comparison($oldnode, 'postgres',
+								"oldnode_${tag}_dump", 0);
+	$oldnode->stop;
+
+	$newnode->start;
+	my $dst_dump =
+		get_dump_for_comparison($newnode, 'postgres',
+								"newnode_${tag}_dump", 0);
+	$newnode->stop;
+
+	compare_files($src_dump, $dst_dump,
+		'dump outputs from original and restored regression databases match');
+}
+
+sub to_hex
+{
+	my $arg = shift;
+
+	$arg = Math::BigInt->new($arg);
+	$arg = $arg->as_hex();
+
+	return $arg;
+}
+
+# case #1: start old node from defaults
+{
+	my $tag = 1;
+	my $old =
+		PostgreSQL::Test::Cluster->new("oldnode${tag}",
+									   install_path => $ENV{oldinstall});
+	$old->init(extra => ['-k']);
+
+	my ($start_mxoff, $finish_mxoff) = mxact_eater($old);
+
+	my $new = PostgreSQL::Test::Cluster->new("newnode${tag}");
+	$new->init;
+
+	run_test($tag, $old, $new);
+
+	$start_mxoff = to_hex($start_mxoff);
+	$finish_mxoff = to_hex($finish_mxoff);
+
+	my $next_mxoff = to_hex(next_mxoff($new));
+
+	note ">>> case #${tag}\n" .
+		 " oldnode mxoff from ${start_mxoff} to ${finish_mxoff}\n" .
+		 " newnode mxoff ${next_mxoff}\n";
+}
+
+# case #2: start old node from before 32-bit wraparound
+{
+	my $tag = 2;
+	my $old =
+		PostgreSQL::Test::Cluster->new("oldnode${tag}",
+									   install_path => $ENV{oldinstall});
+
+	$old->init(extra => ['-k']);
+	reset_mxoff($old, 0xFFFF0000);
+
+	my ($start_mxoff, $finish_mxoff) = mxact_eater($old);
+
+	my $new = PostgreSQL::Test::Cluster->new("newnode${tag}");
+	$new->init;
+
+	run_test($tag, $old, $new);
+
+	$start_mxoff = to_hex($start_mxoff);
+	$finish_mxoff = to_hex($finish_mxoff);
+
+	my $next_mxoff = to_hex(next_mxoff($new));
+
+	note ">>> case #${tag}\n" .
+		 " oldnode mxoff from ${start_mxoff} to ${finish_mxoff}\n" .
+		 " newnode mxoff ${next_mxoff}\n";
+}
+
+# case #3: start old node near 32-bit wraparound and reach wraparound state.
+{
+	my $tag = 3;
+	my $old =
+		PostgreSQL::Test::Cluster->new("oldnode${tag}",
+									   install_path => $ENV{oldinstall});
+
+	$old->init(extra => ['-k']);
+
+	reset_mxoff($old, 0xFFFFEC77);
+	my ($start_mxoff, $finish_mxoff) = mxact_eater($old);
+
+	my $new = PostgreSQL::Test::Cluster->new("newnode${tag}");
+	$new->init;
+
+	run_test($tag, $old, $new);
+
+	$start_mxoff = to_hex($start_mxoff);
+	$finish_mxoff = to_hex($finish_mxoff);
+
+	my $next_mxoff = to_hex(next_mxoff($new));
+
+	note ">>> case #${tag}\n" .
+		 " oldnode mxoff from ${start_mxoff} to ${finish_mxoff}\n" .
+		 " newnode mxoff ${next_mxoff}\n";
+}
+
+# case #4: start old node from defaults
+{
+	my $tag = 4;
+	my $old =
+		PostgreSQL::Test::Cluster->new("oldnode${tag}",
+									   install_path => $ENV{oldinstall});
+
+	$old->init(extra => ['-k']);
+	$old->append_conf("postgresql.conf", "max_connections = 128");
+
+	diag "test #${tag} for multiple mxoff segments";
+	my ($start_mxoff, $finish_mxoff) = mxact_huge_eater($old);
+
+	my $new = PostgreSQL::Test::Cluster->new("newnode${tag}");
+	$new->init;
+
+	run_test($tag, $old, $new);
+
+	$start_mxoff = to_hex($start_mxoff);
+	$finish_mxoff = to_hex($finish_mxoff);
+
+	my $next_mxoff = to_hex(next_mxoff($new));
+
+	note ">>> case #${tag}\n" .
+		 " oldnode mxoff from ${start_mxoff} to ${finish_mxoff}\n" .
+		 " newnode mxoff ${next_mxoff}\n";
+}
+
+# case #5: start old node from before 32-bit wraparound
+{
+	my $tag = 5;
+	my $old =
+		PostgreSQL::Test::Cluster->new("oldnode${tag}",
+									   install_path => $ENV{oldinstall});
+
+	$old->init(extra => ['-k']);
+	$old->append_conf("postgresql.conf", "max_connections = 128");
+	reset_mxoff($old, 0xFF000000);
+
+	diag "test #${tag} for multiple mxoff segments";
+	my ($start_mxoff, $finish_mxoff) = mxact_huge_eater($old);
+
+	my $new = PostgreSQL::Test::Cluster->new("newnode${tag}");
+	$new->init;
+
+	run_test($tag, $old, $new);
+
+	$start_mxoff = to_hex($start_mxoff);
+	$finish_mxoff = to_hex($finish_mxoff);
+
+	my $next_mxoff = to_hex(next_mxoff($new));
+
+	note ">>> case #${tag}\n" .
+		 " oldnode mxoff from ${start_mxoff} to ${finish_mxoff}\n" .
+		 " newnode mxoff ${next_mxoff}\n";
+}
+
+# case #6: start old node near 32-bit wraparound and reach wraparound state.
+{
+	my $tag = 6;
+	my $old =
+		PostgreSQL::Test::Cluster->new("oldnode${tag}",
+									   install_path => $ENV{oldinstall});
+
+	$old->init(extra => ['-k']);
+
+	reset_mxoff($old, 0xFFFFFFFF - 500_000);
+	$old->append_conf("postgresql.conf", "max_connections = 128");
+
+	diag "test #${tag} for multiple mxoff segments";
+	my ($start_mxoff, $finish_mxoff) = mxact_huge_eater($old);
+	my $new = PostgreSQL::Test::Cluster->new("newnode${tag}");
+	$new->init;
+
+	run_test($tag, $old, $new);
+
+	$start_mxoff = to_hex($start_mxoff);
+	$finish_mxoff = to_hex($finish_mxoff);
+
+	my $next_mxoff = to_hex(next_mxoff($new));
+
+	note ">>> case #${tag}\n" .
+		 " oldnode mxoff from ${start_mxoff} to ${finish_mxoff}\n" .
+		 " newnode mxoff ${next_mxoff}\n";
+}
+
+done_testing();
-- 
2.47.3

v25-0010-TEST-add-consume_multixids-function.patch (text/x-patch; charset=UTF-8)
From b35b3de296f0dd9c8777be09f35d8288a3ebfbab Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Tue, 1 Apr 2025 21:01:07 +0300
Subject: [PATCH v25 10/10] TEST: add consume_multixids function

---
 src/test/modules/xid_wraparound/Makefile      |  1 +
 src/test/modules/xid_wraparound/meson.build   |  1 +
 .../xid_wraparound/multixid_wraparound.c      | 96 +++++++++++++++++++
 .../xid_wraparound/xid_wraparound--1.0.sql    |  4 +
 4 files changed, 102 insertions(+)
 create mode 100644 src/test/modules/xid_wraparound/multixid_wraparound.c

diff --git a/src/test/modules/xid_wraparound/Makefile b/src/test/modules/xid_wraparound/Makefile
index 7a6e0f66762..ebb3d8fcb3e 100644
--- a/src/test/modules/xid_wraparound/Makefile
+++ b/src/test/modules/xid_wraparound/Makefile
@@ -3,6 +3,7 @@
 MODULE_big = xid_wraparound
 OBJS = \
 	$(WIN32RES) \
+	multixid_wraparound.o \
 	xid_wraparound.o
 PGFILEDESC = "xid_wraparound - tests for XID wraparound"
 
diff --git a/src/test/modules/xid_wraparound/meson.build b/src/test/modules/xid_wraparound/meson.build
index 3aec430df8c..ce4ac468830 100644
--- a/src/test/modules/xid_wraparound/meson.build
+++ b/src/test/modules/xid_wraparound/meson.build
@@ -1,6 +1,7 @@
 # Copyright (c) 2023-2025, PostgreSQL Global Development Group
 
 xid_wraparound_sources = files(
+  'multixid_wraparound.c',
   'xid_wraparound.c',
 )
 
diff --git a/src/test/modules/xid_wraparound/multixid_wraparound.c b/src/test/modules/xid_wraparound/multixid_wraparound.c
new file mode 100644
index 00000000000..af567c6e541
--- /dev/null
+++ b/src/test/modules/xid_wraparound/multixid_wraparound.c
@@ -0,0 +1,96 @@
+/*--------------------------------------------------------------------------
+ *
+ * multixid_wraparound.c
+ *		Utilities for testing multixids
+ *
+ *
+ * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *		src/test/modules/xid_wraparound/multixid_wraparound.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/multixact.h"
+#include "access/xact.h"
+#include "miscadmin.h"
+#include "storage/proc.h"
+#include "utils/xid8.h"
+
+static int mxactMemberComparator(const void *arg1, const void *arg2);
+
+/*
+ * Consume the specified number of multi-XIDs, with specified number of
+ * members each.
+ */
+PG_FUNCTION_INFO_V1(consume_multixids);
+Datum
+consume_multixids(PG_FUNCTION_ARGS)
+{
+	int64		nmultis = PG_GETARG_INT64(0);
+	int32		nmembers = PG_GETARG_INT32(1);
+	MultiXactMember *members;
+	MultiXactId	lastmxid = InvalidMultiXactId;
+
+	if (nmultis < 0)
+		elog(ERROR, "invalid nmultis argument: %" PRId64, nmultis);
+	if (nmembers < 1)
+		elog(ERROR, "invalid nmembers argument: %d", nmembers);
+
+	/*
+	 * We consume XIDs by calling GetNewTransactionId(true), which marks the
+	 * consumed XIDs as subtransactions of the current top-level transaction.
+	 * For that to work, this transaction must have a top-level XID.
+	 *
+	 * GetNewTransactionId registers them in the subxid cache in PGPROC, until
+	 * the cache overflows, but beyond that, we don't keep track of the
+	 * consumed XIDs.
+	 */
+	(void) GetTopTransactionId();
+
+	members = palloc((nmultis + nmembers) * sizeof(MultiXactMember));
+	for (int64 i = 0; i < nmultis + nmembers; i++)
+	{
+		FullTransactionId xid;
+
+		xid = GetNewTransactionId(true);
+		members[i].xid = XidFromFullTransactionId(xid);
+		members[i].status = MultiXactStatusForKeyShare;
+	}
+	/*
+	 * pre-sort the array like mXactCacheGetBySet does, so that the qsort call
+	 * in mXactCacheGetBySet() is cheaper.
+	 */
+	qsort(members, nmultis + nmembers, sizeof(MultiXactMember), mxactMemberComparator);
+
+	for (int64 i = 0; i < nmultis; i++)
+	{
+		lastmxid = MultiXactIdCreateFromMembers(nmembers, &members[i]);
+		CHECK_FOR_INTERRUPTS();
+	}
+
+	pfree(members);
+
+	PG_RETURN_TRANSACTIONID(lastmxid);
+}
+
+/* copied from multixact.c */
+static int
+mxactMemberComparator(const void *arg1, const void *arg2)
+{
+	MultiXactMember member1 = *(const MultiXactMember *) arg1;
+	MultiXactMember member2 = *(const MultiXactMember *) arg2;
+
+	if (member1.xid > member2.xid)
+		return 1;
+	if (member1.xid < member2.xid)
+		return -1;
+	if (member1.status > member2.status)
+		return 1;
+	if (member1.status < member2.status)
+		return -1;
+	return 0;
+}
diff --git a/src/test/modules/xid_wraparound/xid_wraparound--1.0.sql b/src/test/modules/xid_wraparound/xid_wraparound--1.0.sql
index 96356b4b974..ed7520c3d86 100644
--- a/src/test/modules/xid_wraparound/xid_wraparound--1.0.sql
+++ b/src/test/modules/xid_wraparound/xid_wraparound--1.0.sql
@@ -10,3 +10,7 @@ AS 'MODULE_PATHNAME' LANGUAGE C;
 CREATE FUNCTION consume_xids_until(targetxid xid8)
 RETURNS xid8 VOLATILE PARALLEL UNSAFE STRICT
 AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION consume_multixids(nmultis bigint, nmembers int4)
+RETURNS bigint VOLATILE PARALLEL UNSAFE STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
-- 
2.47.3

#59Maxim Orlov
orlovmg@gmail.com
In reply to: Heikki Linnakangas (#58)
Re: POC: make mxidoff 64 bits

On Wed, 12 Nov 2025 at 16:00, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

I added an
inlined fast path to SlruReadSwitchPage and SlruWriteSwitchPage to
eliminate the function call overhead of those in the common case that no
page switch is needed. With that, the 100 million mxid test case I used
went from 1.2 s to 0.9 s. We could optimize this further but I think
this is good enough.

I agree with you.
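
For readers following along, the fast-path idea can be sketched roughly as below. This is only an illustration of the general pattern (an inlined "same page?" check in front of an out-of-line slow path); the struct and function names are hypothetical and are not the ones used in the patch.

```c
#include <stdint.h>

/* Hypothetical reader state, invented for this sketch. */
typedef struct SketchSlruReader
{
	int64_t		cur_pageno;		/* page currently held in buf, or -1 */
	int			slow_calls;		/* how many times the slow path ran */
	char		buf[8192];		/* page buffer */
} SketchSlruReader;

/* Slow path: a real implementation would read the page from disk here. */
static char *
sketch_switch_page_slow(SketchSlruReader *r, int64_t pageno)
{
	r->cur_pageno = pageno;
	r->slow_calls++;
	return r->buf;
}

/* Fast path: inlined check, no function call when the page is unchanged. */
static inline char *
sketch_switch_page(SketchSlruReader *r, int64_t pageno)
{
	if (r->cur_pageno == pageno)
		return r->buf;
	return sketch_switch_page_slow(r, pageno);
}
```

Since consecutive multixids usually land on the same page, most calls would take the inlined branch and never reach the slow function.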

- I added an SlruFileName() helper function to slru_io.c, and support
for reading SLRUs with long_segment_names==true. It's not needed
currently, but it seemed like a weird omission. AllocSlruRead() actually
left 'long_segment_names' uninitialized which is error-prone. We
could've just documented it, but it seems just as easy to support it.

Yeah, I didn't particularly like that place either. But then I decided it
was overkill to do it for the sake of symmetry and would raise questions.
It turned out much better this way.
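
As a rough illustration of what supporting both naming schemes involves, here is a minimal sketch. It assumes the 4-digit and 15-digit hexadecimal widths that the test scripts earlier in this thread use when they build segment file names; the function name and signature are hypothetical and are not the SlruFileName() from the patch.

```c
#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper: format the path of an SLRU segment file.
 * Short names use 4 hex digits, long names use 15 hex digits.
 */
static int
sketch_slru_file_name(char *path, size_t len, const char *dir,
					  int64_t segno, bool long_segment_names)
{
	if (long_segment_names)
		return snprintf(path, len, "%s/%015" PRIX64, dir, (uint64_t) segno);

	return snprintf(path, len, "%s/%04X", dir, (unsigned int) segno);
}
```

Taking the flag as an explicit argument means a caller cannot forget to initialize it, which is the hazard described in the quoted text.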

I kept all the new test cases for now. We need to decide which ones are
worth keeping, and polish and speed up the ones we decide to keep.

I think of two cases here.
A) Upgrade from "new cluster":
* create a cluster with mxoff just below the 32-bit overflow
* consume around 2k mxacts (1k before the 32-bit overflow
and 1k after)
* run pg_upgrade
* check that the upgraded cluster is working
* check data invariant
B) Same as A), but for an "old cluster" with oldinstall env.

On Thu, 13 Nov 2025 at 19:04, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Here's a new patch version that addresses the above issue. I resurrected
MultiXactMemberFreezeThreshold(), using the same logic as before, just
using pretty arbitrary thresholds of 1 and 2 billion offsets instead of
the safe/danger thresholds derived from MaxMultiOffset. That gives
roughly the same behavior wrt. calculating effective freeze age as before.

Yes, I think it's okay for now. This reflects the existing logic well.
I wonder what an alternative solution might be. Could we make
"vacuum freeze" also truncate pg_multixact segments?
In any case, this can be discussed later.
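
To make the quoted description concrete, here is a minimal standalone sketch of that kind of threshold logic. It assumes the 1 and 2 billion figures mentioned above and follows the shape of the pre-existing function as I remember it; the names and the signature are hypothetical, and this is not the code from the patch.

```c
#include <stdint.h>

/* Assumed thresholds, taken from the figures quoted above. */
#define SKETCH_MEMBER_SAFE_THRESHOLD	UINT64_C(1000000000)
#define SKETCH_MEMBER_DANGER_THRESHOLD	UINT64_C(2000000000)

/*
 * Hypothetical helper: given the number of live multixacts and member
 * offsets in use, return the effective multixact freeze age.
 */
static int
sketch_member_freeze_threshold(uint32_t multixacts, uint64_t members,
							   int freeze_max_age)
{
	double		fraction;
	uint32_t	victim_multixacts;
	uint32_t	result;

	/* Low member-space usage: no special action needed. */
	if (members <= SKETCH_MEMBER_SAFE_THRESHOLD)
		return freeze_max_age;

	/* How far past the safe threshold are we, relative to the danger one? */
	fraction = (double) (members - SKETCH_MEMBER_SAFE_THRESHOLD) /
		(double) (SKETCH_MEMBER_DANGER_THRESHOLD - SKETCH_MEMBER_SAFE_THRESHOLD);

	/* At or past the danger threshold: freeze as aggressively as possible. */
	if (fraction >= 1.0)
		return 0;

	/* Try to get rid of a proportional share of the multixacts. */
	victim_multixacts = (uint32_t) (multixacts * fraction);
	result = multixacts - victim_multixacts;

	/* Never be less aggressive than the configured maximum age. */
	if (result < (uint32_t) freeze_max_age)
		return (int) result;
	return freeze_max_age;
}
```

With these assumed thresholds, nothing changes below 1 billion members, and the effective freeze age shrinks linearly to zero as usage approaches 2 billion.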

--
Best regards,
Maxim Orlov.

#60Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Maxim Orlov (#59)
10 attachment(s)
Re: POC: make mxidoff 64 bits

On 14/11/2025 17:40, Maxim Orlov wrote:

On Wed, 12 Nov 2025 at 16:00, Heikki Linnakangas <hlinnaka@iki.fi>

I kept all the new test cases for now. We need to decide which
ones are worth keeping, and polish and speed up the ones we decide
to keep.

Attached is a new patch version, with more work on the tests. The
pg_upgrade patch
(v26-0004-Add-pg_upgrade-for-64-bit-multixact-offsets.patch) now
includes a test case. I'm proposing to commit that test along with these
patches. It's a heavily-modified version of the test cases you wrote.

I tested that test using old installations, all the way down to version
9.4. That required a bunch of changes to the test perl modules, to make
them work with such old versions. Without any extra changes, the test
works down to v11.

Later patches in the patch set add more tests, labelled with the TEST:
prefix. Those are the tests you posted earlier, with little to no
modifications. I'm just carrying those around, so that I can easily run
them now during development. But I don't think they're adding much value
and I plan to leave them out of the final commit.

I think of two cases here.
A) Upgrade from "new cluster":
    * create a cluster with mxoff just below the 32-bit overflow
    * consume around 2k mxacts (1k before the 32-bit overflow
      and 1k after)
    * run pg_upgrade
    * check that the upgraded cluster is working
    * check data invariant
B)  Same as A), but for an "old cluster" with oldinstall env.

Makes sense.

The 007_multixact_conversion.pl test in the attached patches includes
two test scenarios: "basic" and "wraparound" test. In the basic
scenario there's no overflow or wraparound involved, but it can be run
without an old installation, i.e. in a "new -> new upgrade". The
"wraparound" scenario is the same, but the old cluster is reset with
pg_resetwal so that the mxoff wraps around. The "wraparound" scenario
requires a pre-19 old installation, because the pg_resetwal logic
requires the pre-v19 layout.

If we enhance the reset_mxoff() perl function in the test so that it
also works with v19, we could run the "wraparound" scenario in new->new
upgrades too. That would essentially be case A) that you listed above.
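
For reference, the segment arithmetic that reset_mxoff() would have to handle for both layouts can be sketched as follows. The constants (32 pages per segment, 20-byte member groups holding 4 members each) come from the header comments and test scripts in this thread; the 4-byte and 8-byte offset entry sizes for the old and new layouts are assumptions read off those same scripts, and all function names are hypothetical.

```c
#include <stdint.h>

#define SKETCH_PAGES_PER_SEGMENT	32	/* SLRU pages per segment file */
#define SKETCH_MEMBERGROUP_SIZE		20	/* 4 flag bytes + 4 four-byte xids */
#define SKETCH_MEMBERS_PER_GROUP	4

/* Number of multixact members stored in one members segment. */
static uint64_t
sketch_members_per_segment(uint32_t blcksz)
{
	return (uint64_t) SKETCH_PAGES_PER_SEGMENT *
		(blcksz / SKETCH_MEMBERGROUP_SIZE) * SKETCH_MEMBERS_PER_GROUP;
}

/* Members segment number that contains the given member offset. */
static uint64_t
sketch_member_segment(uint64_t offset, uint32_t blcksz)
{
	return offset / sketch_members_per_segment(blcksz);
}

/*
 * Offsets segment number that contains the given multixact id.
 * entry_size is assumed to be 4 for the old layout and 8 for the new one.
 */
static uint64_t
sketch_offsets_segment(uint64_t multi, uint32_t blcksz, uint32_t entry_size)
{
	uint64_t	per_segment =
		(uint64_t) SKETCH_PAGES_PER_SEGMENT * (blcksz / entry_size);

	return multi / per_segment;
}
```

Only the entry size and the width of the file name differ between the two layouts, so a version-aware reset_mxoff() would mostly need to pick those two values based on the old cluster's version.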

I think it's already pretty good as it is though. I don't expect the
point where we cross offset 2^32 in the new version to be very
interesting now that we use 64-bit offsets everywhere.

- Heikki

Attachments:

v26-0001-Move-pg_multixact-SLRU-page-format-definitions-t.patch (text/x-patch; charset=UTF-8)
From 930f94d3ae66ed054ddc2aaea5b247a37c6b3ba3 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Wed, 12 Nov 2025 14:19:32 +0200
Subject: [PATCH v26 01/10] Move pg_multixact SLRU page format definitions to
 separate header

---
 src/backend/access/transam/multixact.c  | 119 --------------------
 src/include/access/multixact_internal.h | 140 ++++++++++++++++++++++++
 2 files changed, 140 insertions(+), 119 deletions(-)
 create mode 100644 src/include/access/multixact_internal.h

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 9d5f130af7e..acb2a6788f9 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -89,125 +89,6 @@
 #include "utils/memutils.h"
 
 
-/*
- * Defines for MultiXactOffset page sizes.  A page is the same BLCKSZ as is
- * used everywhere else in Postgres.
- *
- * Note: because MultiXactOffsets are 32 bits and wrap around at 0xFFFFFFFF,
- * MultiXact page numbering also wraps around at
- * 0xFFFFFFFF/MULTIXACT_OFFSETS_PER_PAGE, and segment numbering at
- * 0xFFFFFFFF/MULTIXACT_OFFSETS_PER_PAGE/SLRU_PAGES_PER_SEGMENT.  We need
- * take no explicit notice of that fact in this module, except when comparing
- * segment and page numbers in TruncateMultiXact (see
- * MultiXactOffsetPagePrecedes).
- */
-
-/* We need four bytes per offset */
-#define MULTIXACT_OFFSETS_PER_PAGE (BLCKSZ / sizeof(MultiXactOffset))
-
-static inline int64
-MultiXactIdToOffsetPage(MultiXactId multi)
-{
-	return multi / MULTIXACT_OFFSETS_PER_PAGE;
-}
-
-static inline int
-MultiXactIdToOffsetEntry(MultiXactId multi)
-{
-	return multi % MULTIXACT_OFFSETS_PER_PAGE;
-}
-
-static inline int64
-MultiXactIdToOffsetSegment(MultiXactId multi)
-{
-	return MultiXactIdToOffsetPage(multi) / SLRU_PAGES_PER_SEGMENT;
-}
-
-/*
- * The situation for members is a bit more complex: we store one byte of
- * additional flag bits for each TransactionId.  To do this without getting
- * into alignment issues, we store four bytes of flags, and then the
- * corresponding 4 Xids.  Each such 5-word (20-byte) set we call a "group", and
- * are stored as a whole in pages.  Thus, with 8kB BLCKSZ, we keep 409 groups
- * per page.  This wastes 12 bytes per page, but that's OK -- simplicity (and
- * performance) trumps space efficiency here.
- *
- * Note that the "offset" macros work with byte offset, not array indexes, so
- * arithmetic must be done using "char *" pointers.
- */
-/* We need eight bits per xact, so one xact fits in a byte */
-#define MXACT_MEMBER_BITS_PER_XACT			8
-#define MXACT_MEMBER_FLAGS_PER_BYTE			1
-#define MXACT_MEMBER_XACT_BITMASK	((1 << MXACT_MEMBER_BITS_PER_XACT) - 1)
-
-/* how many full bytes of flags are there in a group? */
-#define MULTIXACT_FLAGBYTES_PER_GROUP		4
-#define MULTIXACT_MEMBERS_PER_MEMBERGROUP	\
-	(MULTIXACT_FLAGBYTES_PER_GROUP * MXACT_MEMBER_FLAGS_PER_BYTE)
-/* size in bytes of a complete group */
-#define MULTIXACT_MEMBERGROUP_SIZE \
-	(sizeof(TransactionId) * MULTIXACT_MEMBERS_PER_MEMBERGROUP + MULTIXACT_FLAGBYTES_PER_GROUP)
-#define MULTIXACT_MEMBERGROUPS_PER_PAGE (BLCKSZ / MULTIXACT_MEMBERGROUP_SIZE)
-#define MULTIXACT_MEMBERS_PER_PAGE	\
-	(MULTIXACT_MEMBERGROUPS_PER_PAGE * MULTIXACT_MEMBERS_PER_MEMBERGROUP)
-
-/*
- * Because the number of items per page is not a divisor of the last item
- * number (member 0xFFFFFFFF), the last segment does not use the maximum number
- * of pages, and moreover the last used page therein does not use the same
- * number of items as previous pages.  (Another way to say it is that the
- * 0xFFFFFFFF member is somewhere in the middle of the last page, so the page
- * has some empty space after that item.)
- *
- * This constant is the number of members in the last page of the last segment.
- */
-#define MAX_MEMBERS_IN_LAST_MEMBERS_PAGE \
-		((uint32) ((0xFFFFFFFF % MULTIXACT_MEMBERS_PER_PAGE) + 1))
-
-/* page in which a member is to be found */
-static inline int64
-MXOffsetToMemberPage(MultiXactOffset offset)
-{
-	return offset / MULTIXACT_MEMBERS_PER_PAGE;
-}
-
-static inline int64
-MXOffsetToMemberSegment(MultiXactOffset offset)
-{
-	return MXOffsetToMemberPage(offset) / SLRU_PAGES_PER_SEGMENT;
-}
-
-/* Location (byte offset within page) of flag word for a given member */
-static inline int
-MXOffsetToFlagsOffset(MultiXactOffset offset)
-{
-	MultiXactOffset group = offset / MULTIXACT_MEMBERS_PER_MEMBERGROUP;
-	int			grouponpg = group % MULTIXACT_MEMBERGROUPS_PER_PAGE;
-	int			byteoff = grouponpg * MULTIXACT_MEMBERGROUP_SIZE;
-
-	return byteoff;
-}
-
-static inline int
-MXOffsetToFlagsBitShift(MultiXactOffset offset)
-{
-	int			member_in_group = offset % MULTIXACT_MEMBERS_PER_MEMBERGROUP;
-	int			bshift = member_in_group * MXACT_MEMBER_BITS_PER_XACT;
-
-	return bshift;
-}
-
-/* Location (byte offset within page) of TransactionId of given member */
-static inline int
-MXOffsetToMemberOffset(MultiXactOffset offset)
-{
-	int			member_in_group = offset % MULTIXACT_MEMBERS_PER_MEMBERGROUP;
-
-	return MXOffsetToFlagsOffset(offset) +
-		MULTIXACT_FLAGBYTES_PER_GROUP +
-		member_in_group * sizeof(TransactionId);
-}
-
 /* Multixact members wraparound thresholds. */
 #define MULTIXACT_MEMBER_SAFE_THRESHOLD		(MaxMultiXactOffset / 2)
 #define MULTIXACT_MEMBER_DANGER_THRESHOLD	\
diff --git a/src/include/access/multixact_internal.h b/src/include/access/multixact_internal.h
new file mode 100644
index 00000000000..9b56deaef31
--- /dev/null
+++ b/src/include/access/multixact_internal.h
@@ -0,0 +1,140 @@
+/*
+ * multixact_internal.h
+ *
+ * PostgreSQL multi-transaction-log manager internal declarations
+ *
+ * These functions and definitions are for dealing with pg_multixact pages.
+ * They are internal to multixact.c, but they are exported here to allow
+ * pg_upgrade to write pg_multixact files directly.
+ *
+ * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/multixact_internal.h
+ */
+#ifndef MULTIXACT_INTERNAL_H
+#define MULTIXACT_INTERNAL_H
+
+#include "access/multixact.h"
+
+
+/*
+ * Defines for MultiXactOffset page sizes.  A page is the same BLCKSZ as is
+ * used everywhere else in Postgres.
+ *
+ * Note: because MultiXactOffsets are 32 bits and wrap around at 0xFFFFFFFF,
+ * MultiXact page numbering also wraps around at
+ * 0xFFFFFFFF/MULTIXACT_OFFSETS_PER_PAGE, and segment numbering at
+ * 0xFFFFFFFF/MULTIXACT_OFFSETS_PER_PAGE/SLRU_PAGES_PER_SEGMENT.  We need
+ * take no explicit notice of that fact in this module, except when comparing
+ * segment and page numbers in TruncateMultiXact (see
+ * MultiXactOffsetPagePrecedes).
+ */
+
+/* We need four bytes per offset */
+#define MULTIXACT_OFFSETS_PER_PAGE (BLCKSZ / sizeof(MultiXactOffset))
+
+static inline int64
+MultiXactIdToOffsetPage(MultiXactId multi)
+{
+	return multi / MULTIXACT_OFFSETS_PER_PAGE;
+}
+
+static inline int
+MultiXactIdToOffsetEntry(MultiXactId multi)
+{
+	return multi % MULTIXACT_OFFSETS_PER_PAGE;
+}
+
+static inline int64
+MultiXactIdToOffsetSegment(MultiXactId multi)
+{
+	return MultiXactIdToOffsetPage(multi) / SLRU_PAGES_PER_SEGMENT;
+}
+
+/*
+ * The situation for members is a bit more complex: we store one byte of
+ * additional flag bits for each TransactionId.  To do this without getting
+ * into alignment issues, we store four bytes of flags, and then the
+ * corresponding 4 Xids.  Each such 5-word (20-byte) set we call a "group", and
+ * are stored as a whole in pages.  Thus, with 8kB BLCKSZ, we keep 409 groups
+ * per page.  This wastes 12 bytes per page, but that's OK -- simplicity (and
+ * performance) trumps space efficiency here.
+ *
+ * Note that the "offset" macros work with byte offset, not array indexes, so
+ * arithmetic must be done using "char *" pointers.
+ */
+/* We need eight bits per xact, so one xact fits in a byte */
+#define MXACT_MEMBER_BITS_PER_XACT			8
+#define MXACT_MEMBER_FLAGS_PER_BYTE			1
+#define MXACT_MEMBER_XACT_BITMASK	((1 << MXACT_MEMBER_BITS_PER_XACT) - 1)
+
+/* how many full bytes of flags are there in a group? */
+#define MULTIXACT_FLAGBYTES_PER_GROUP		4
+#define MULTIXACT_MEMBERS_PER_MEMBERGROUP	\
+	(MULTIXACT_FLAGBYTES_PER_GROUP * MXACT_MEMBER_FLAGS_PER_BYTE)
+/* size in bytes of a complete group */
+#define MULTIXACT_MEMBERGROUP_SIZE \
+	(sizeof(TransactionId) * MULTIXACT_MEMBERS_PER_MEMBERGROUP + MULTIXACT_FLAGBYTES_PER_GROUP)
+#define MULTIXACT_MEMBERGROUPS_PER_PAGE (BLCKSZ / MULTIXACT_MEMBERGROUP_SIZE)
+#define MULTIXACT_MEMBERS_PER_PAGE	\
+	(MULTIXACT_MEMBERGROUPS_PER_PAGE * MULTIXACT_MEMBERS_PER_MEMBERGROUP)
+
+/*
+ * Because the number of items per page is not a divisor of the last item
+ * number (member 0xFFFFFFFF), the last segment does not use the maximum number
+ * of pages, and moreover the last used page therein does not use the same
+ * number of items as previous pages.  (Another way to say it is that the
+ * 0xFFFFFFFF member is somewhere in the middle of the last page, so the page
+ * has some empty space after that item.)
+ *
+ * This constant is the number of members in the last page of the last segment.
+ */
+#define MAX_MEMBERS_IN_LAST_MEMBERS_PAGE \
+		((uint32) ((0xFFFFFFFF % MULTIXACT_MEMBERS_PER_PAGE) + 1))
+
+/* page in which a member is to be found */
+static inline int64
+MXOffsetToMemberPage(MultiXactOffset offset)
+{
+	return offset / MULTIXACT_MEMBERS_PER_PAGE;
+}
+
+static inline int64
+MXOffsetToMemberSegment(MultiXactOffset offset)
+{
+	return MXOffsetToMemberPage(offset) / SLRU_PAGES_PER_SEGMENT;
+}
+
+/* Location (byte offset within page) of flag word for a given member */
+static inline int
+MXOffsetToFlagsOffset(MultiXactOffset offset)
+{
+	MultiXactOffset group = offset / MULTIXACT_MEMBERS_PER_MEMBERGROUP;
+	int			grouponpg = group % MULTIXACT_MEMBERGROUPS_PER_PAGE;
+	int			byteoff = grouponpg * MULTIXACT_MEMBERGROUP_SIZE;
+
+	return byteoff;
+}
+
+static inline int
+MXOffsetToFlagsBitShift(MultiXactOffset offset)
+{
+	int			member_in_group = offset % MULTIXACT_MEMBERS_PER_MEMBERGROUP;
+	int			bshift = member_in_group * MXACT_MEMBER_BITS_PER_XACT;
+
+	return bshift;
+}
+
+/* Location (byte offset within page) of TransactionId of given member */
+static inline int
+MXOffsetToMemberOffset(MultiXactOffset offset)
+{
+	int			member_in_group = offset % MULTIXACT_MEMBERS_PER_MEMBERGROUP;
+
+	return MXOffsetToFlagsOffset(offset) +
+		MULTIXACT_FLAGBYTES_PER_GROUP +
+		member_in_group * sizeof(TransactionId);
+}
+
+#endif							/* MULTIXACT_INTERNAL_H */
-- 
2.47.3
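
Reviewer note on the member-page layout documented in multixact_internal.h above: the arithmetic behind "409 groups per page" and "wastes 12 bytes" can be checked with a small standalone sketch. It is not part of the patch. All SKETCH_* names and sketch_* functions are hypothetical stand-ins for the real macros and inline functions, and it assumes the default 8kB BLCKSZ and a 4-byte TransactionId. A 64-bit offset is used so the same arithmetic covers both the 32-bit and the 64-bit MultiXactOffset.

```c
#include <assert.h>
#include <stdint.h>

/* Assumptions (not taken from a real build): 8kB pages, 4-byte xids. */
#define SKETCH_BLCKSZ               8192
#define SKETCH_XID_SIZE             4
#define SKETCH_FLAGBYTES_PER_GROUP  4
#define SKETCH_MEMBERS_PER_GROUP    4	/* one flag byte per member */

/* 4 flag bytes + 4 xids of 4 bytes each = 20 bytes per group */
#define SKETCH_GROUP_SIZE \
	(SKETCH_XID_SIZE * SKETCH_MEMBERS_PER_GROUP + SKETCH_FLAGBYTES_PER_GROUP)
/* 8192 / 20 = 409 whole groups; the remaining 12 bytes stay unused */
#define SKETCH_GROUPS_PER_PAGE   (SKETCH_BLCKSZ / SKETCH_GROUP_SIZE)
/* 409 * 4 = 1636 members per page */
#define SKETCH_MEMBERS_PER_PAGE  (SKETCH_GROUPS_PER_PAGE * SKETCH_MEMBERS_PER_GROUP)

/* Page holding a given member offset. */
static inline int64_t
sketch_member_page(uint64_t offset)
{
	return (int64_t) (offset / SKETCH_MEMBERS_PER_PAGE);
}

/* Byte position, within the page, of the flag word of the member's group. */
static inline int
sketch_flags_offset(uint64_t offset)
{
	uint64_t	group = offset / SKETCH_MEMBERS_PER_GROUP;
	int			grouponpg = (int) (group % SKETCH_GROUPS_PER_PAGE);

	return grouponpg * SKETCH_GROUP_SIZE;
}

/* Bit shift of the member's flag byte inside the 32-bit flag word. */
static inline int
sketch_flags_bitshift(uint64_t offset)
{
	return (int) (offset % SKETCH_MEMBERS_PER_GROUP) * 8;
}

/* Byte position, within the page, of the member's TransactionId. */
static inline int
sketch_member_offset(uint64_t offset)
{
	int			member_in_group = (int) (offset % SKETCH_MEMBERS_PER_GROUP);

	return sketch_flags_offset(offset) +
		SKETCH_FLAGBYTES_PER_GROUP +
		member_in_group * SKETCH_XID_SIZE;
}
```

For example, under these assumptions member offset 5 lands on page 0, in the second group (flag word at byte 20), with its xid at byte 28. Offset 1635 is the last member of page 0 and its xid occupies bytes 8176..8179, which leaves the 12 trailing bytes of the page unused.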

Attachment: v26-0002-Use-64-bit-multixact-offsets.patch (text/x-patch; charset=UTF-8)
From 6938b28313addc62118d238b3c768360eb8f5e17 Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Wed, 7 Aug 2024 16:35:22 +0300
Subject: [PATCH v26 02/10] Use 64-bit multixact offsets

Switching to 64-bit multixact offsets removes member offset wraparound
and the 2^32 limit on the total number of multixact members.

Author: Maxim Orlov <orlovmg@gmail.com>
Discussion: FIXME
---
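
Note (not part of the commit): with 64-bit offsets, the three checks this patch relies on reduce to plain unsigned arithmetic. The sketch below is a standalone illustration under stated assumptions. SketchOffset and the sketch_* functions are hypothetical names, not the real multixact.c symbols, and the threshold constant is copied from MULTIXACT_MEMBER_AUTOVAC_THRESHOLD in this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the 64-bit MultiXactOffset. */
typedef uint64_t SketchOffset;

/* Value copied from MULTIXACT_MEMBER_AUTOVAC_THRESHOLD in the patch. */
#define SKETCH_AUTOVAC_THRESHOLD UINT64_C(4000000000)

/*
 * Sanity check in the style of GetNewMultiXactId(): unsigned addition wraps
 * modulo 2^64, so a sum smaller than its left operand means overflow.
 */
static inline bool
sketch_would_wrap(SketchOffset next_offset, int nmembers)
{
	return next_offset + (SketchOffset) nmembers < next_offset;
}

/*
 * Ordering in the style of MultiXactOffsetPrecedes(): the difference is
 * reinterpreted as signed 64-bit, as the patched function does.
 */
static inline bool
sketch_offset_precedes(SketchOffset a, SketchOffset b)
{
	int64_t		diff = (int64_t) (a - b);

	return diff < 0;
}

/*
 * Autovacuum trigger in the style of SetOffsetVacuumLimit(): request a
 * vacuum when the oldest offset is unknown or too many members are live.
 */
static inline bool
sketch_needs_autovac(bool oldest_known, SketchOffset next_offset,
					 SketchOffset oldest_offset)
{
	return !oldest_known ||
		(next_offset - oldest_offset > SKETCH_AUTOVAC_THRESHOLD);
}
```

The threshold comparison is strict, so exactly 4000000000 live members does not yet request autovacuum, while one more does.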
 src/backend/access/rmgrdesc/mxactdesc.c   |   4 +-
 src/backend/access/rmgrdesc/xlogdesc.c    |   2 +-
 src/backend/access/transam/multixact.c    | 390 ++++------------------
 src/backend/access/transam/xlog.c         |   2 +-
 src/backend/access/transam/xlogrecovery.c |   2 +-
 src/backend/commands/vacuum.c             |   2 +-
 src/backend/postmaster/autovacuum.c       |   4 +-
 src/bin/pg_controldata/pg_controldata.c   |   2 +-
 src/bin/pg_resetwal/pg_resetwal.c         |  30 +-
 src/bin/pg_resetwal/t/001_basic.pl        |   4 +-
 src/include/access/multixact.h            |   3 -
 src/include/access/multixact_internal.h   |  24 +-
 src/include/c.h                           |   2 +-
 13 files changed, 95 insertions(+), 376 deletions(-)

diff --git a/src/backend/access/rmgrdesc/mxactdesc.c b/src/backend/access/rmgrdesc/mxactdesc.c
index 3ca0582db36..052dd0a4ce5 100644
--- a/src/backend/access/rmgrdesc/mxactdesc.c
+++ b/src/backend/access/rmgrdesc/mxactdesc.c
@@ -65,7 +65,7 @@ multixact_desc(StringInfo buf, XLogReaderState *record)
 		xl_multixact_create *xlrec = (xl_multixact_create *) rec;
 		int			i;
 
-		appendStringInfo(buf, "%u offset %u nmembers %d: ", xlrec->mid,
+		appendStringInfo(buf, "%u offset %" PRIu64 " nmembers %d: ", xlrec->mid,
 						 xlrec->moff, xlrec->nmembers);
 		for (i = 0; i < xlrec->nmembers; i++)
 			out_member(buf, &xlrec->members[i]);
@@ -74,7 +74,7 @@ multixact_desc(StringInfo buf, XLogReaderState *record)
 	{
 		xl_multixact_truncate *xlrec = (xl_multixact_truncate *) rec;
 
-		appendStringInfo(buf, "offsets [%u, %u), members [%u, %u)",
+		appendStringInfo(buf, "offsets [%u, %u), members [%" PRIu64 ", %" PRIu64 ")",
 						 xlrec->startTruncOff, xlrec->endTruncOff,
 						 xlrec->startTruncMemb, xlrec->endTruncMemb);
 	}
diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c
index cd6c2a2f650..441034f5929 100644
--- a/src/backend/access/rmgrdesc/xlogdesc.c
+++ b/src/backend/access/rmgrdesc/xlogdesc.c
@@ -66,7 +66,7 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 		CheckPoint *checkpoint = (CheckPoint *) rec;
 
 		appendStringInfo(buf, "redo %X/%08X; "
-						 "tli %u; prev tli %u; fpw %s; wal_level %s; xid %u:%u; oid %u; multi %u; offset %u; "
+						 "tli %u; prev tli %u; fpw %s; wal_level %s; xid %u:%u; oid %u; multi %u; offset %" PRIu64 "; "
 						 "oldest xid %u in DB %u; oldest multi %u in DB %u; "
 						 "oldest/newest commit timestamp xid: %u/%u; "
 						 "oldest running xid %u; %s",
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index acb2a6788f9..34a745c07be 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -69,6 +69,7 @@
 #include "postgres.h"
 
 #include "access/multixact.h"
+#include "access/multixact_internal.h"
 #include "access/slru.h"
 #include "access/twophase.h"
 #include "access/twophase_rmgr.h"
@@ -89,10 +90,14 @@
 #include "utils/memutils.h"
 
 
-/* Multixact members wraparound thresholds. */
-#define MULTIXACT_MEMBER_SAFE_THRESHOLD		(MaxMultiXactOffset / 2)
-#define MULTIXACT_MEMBER_DANGER_THRESHOLD	\
-	(MaxMultiXactOffset - MaxMultiXactOffset / 4)
+/*
+ * Multixact members warning threshold.
+ *
+ * If the difference between nextOffset and oldestOffset exceeds this value,
+ * we trigger autovacuum in order to release disk space consumed by the
+ * members SLRU.
+ */
+#define MULTIXACT_MEMBER_AUTOVAC_THRESHOLD		UINT64CONST(4000000000)
 
 static inline MultiXactId
 PreviousMultiXactId(MultiXactId multi)
@@ -149,9 +154,6 @@ typedef struct MultiXactStateData
 	MultiXactId multiStopLimit;
 	MultiXactId multiWrapLimit;
 
-	/* support for members anti-wraparound measures */
-	MultiXactOffset offsetStopLimit;	/* known if oldestOffsetKnown */
-
 	/*
 	 * This is used to sleep until a multixact offset is written when we want
 	 * to create the next one.
@@ -282,8 +284,6 @@ static bool MultiXactOffsetPrecedes(MultiXactOffset offset1,
 									MultiXactOffset offset2);
 static void ExtendMultiXactOffset(MultiXactId multi);
 static void ExtendMultiXactMember(MultiXactOffset offset, int nmembers);
-static bool MultiXactOffsetWouldWrap(MultiXactOffset boundary,
-									 MultiXactOffset start, uint32 distance);
 static bool SetOffsetVacuumLimit(bool is_startup);
 static bool find_multixact_start(MultiXactId multi, MultiXactOffset *result);
 static void WriteMTruncateXlogRec(Oid oldestMultiDB,
@@ -1023,90 +1023,22 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 	ExtendMultiXactOffset(result);
 
 	/*
-	 * Reserve the members space, similarly to above.  Also, be careful not to
-	 * return zero as the starting offset for any multixact. See
-	 * GetMultiXactIdMembers() for motivation.
+	 * Reserve the members space, similarly to above.
 	 */
 	nextOffset = MultiXactState->nextOffset;
-	if (nextOffset == 0)
-	{
-		*offset = 1;
-		nmembers++;				/* allocate member slot 0 too */
-	}
-	else
-		*offset = nextOffset;
-
-	/*----------
-	 * Protect against overrun of the members space as well, with the
-	 * following rules:
-	 *
-	 * If we're past offsetStopLimit, refuse to generate more multis.
-	 * If we're close to offsetStopLimit, emit a warning.
-	 *
-	 * Arbitrarily, we start emitting warnings when we're 20 segments or less
-	 * from offsetStopLimit.
-	 *
-	 * Note we haven't updated the shared state yet, so if we fail at this
-	 * point, the multixact ID we grabbed can still be used by the next guy.
-	 *
-	 * Note that there is no point in forcing autovacuum runs here: the
-	 * multixact freeze settings would have to be reduced for that to have any
-	 * effect.
-	 *----------
-	 */
-#define OFFSET_WARN_SEGMENTS	20
-	if (MultiXactState->oldestOffsetKnown &&
-		MultiXactOffsetWouldWrap(MultiXactState->offsetStopLimit, nextOffset,
-								 nmembers))
-	{
-		/* see comment in the corresponding offsets wraparound case */
-		SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER);
-
-		ereport(ERROR,
-				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg("multixact \"members\" limit exceeded"),
-				 errdetail_plural("This command would create a multixact with %u members, but the remaining space is only enough for %u member.",
-								  "This command would create a multixact with %u members, but the remaining space is only enough for %u members.",
-								  MultiXactState->offsetStopLimit - nextOffset - 1,
-								  nmembers,
-								  MultiXactState->offsetStopLimit - nextOffset - 1),
-				 errhint("Execute a database-wide VACUUM in database with OID %u with reduced \"vacuum_multixact_freeze_min_age\" and \"vacuum_multixact_freeze_table_age\" settings.",
-						 MultiXactState->oldestMultiXactDB)));
-	}
 
 	/*
-	 * Check whether we should kick autovacuum into action, to prevent members
-	 * wraparound. NB we use a much larger window to trigger autovacuum than
-	 * just the warning limit. The warning is just a measure of last resort -
-	 * this is in line with GetNewTransactionId's behaviour.
+	 * Offsets are 64-bit integers and will never wrap around.  Firstly, it
+	 * would take an unrealistic amount of time and resources to consume 2^64
+	 * offsets.  Secondly, multixid creation is WAL-logged, so you would run
+	 * out of LSNs before reaching offset wraparound.  Nevertheless, check for
+	 * wraparound as a sanity check.
 	 */
-	if (!MultiXactState->oldestOffsetKnown ||
-		(MultiXactState->nextOffset - MultiXactState->oldestOffset
-		 > MULTIXACT_MEMBER_SAFE_THRESHOLD))
-	{
-		/*
-		 * To avoid swamping the postmaster with signals, we issue the autovac
-		 * request only when crossing a segment boundary. With default
-		 * compilation settings that's roughly after 50k members.  This still
-		 * gives plenty of chances before we get into real trouble.
-		 */
-		if ((MXOffsetToMemberPage(nextOffset) / SLRU_PAGES_PER_SEGMENT) !=
-			(MXOffsetToMemberPage(nextOffset + nmembers) / SLRU_PAGES_PER_SEGMENT))
-			SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER);
-	}
-
-	if (MultiXactState->oldestOffsetKnown &&
-		MultiXactOffsetWouldWrap(MultiXactState->offsetStopLimit,
-								 nextOffset,
-								 nmembers + MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT * OFFSET_WARN_SEGMENTS))
-		ereport(WARNING,
+	if (nextOffset + nmembers < nextOffset)
+		ereport(ERROR,
 				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg_plural("database with OID %u must be vacuumed before %d more multixact member is used",
-							   "database with OID %u must be vacuumed before %d more multixact members are used",
-							   MultiXactState->offsetStopLimit - nextOffset + nmembers,
-							   MultiXactState->oldestMultiXactDB,
-							   MultiXactState->offsetStopLimit - nextOffset + nmembers),
-				 errhint("Execute a database-wide VACUUM in that database with reduced \"vacuum_multixact_freeze_min_age\" and \"vacuum_multixact_freeze_table_age\" settings.")));
+				 errmsg("MultiXact members would wrap around")));
+	*offset = nextOffset;
 
 	ExtendMultiXactMember(nextOffset, nmembers);
 
@@ -1127,8 +1059,7 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 	 * the next iteration.  But note that nextMXact may be InvalidMultiXactId
 	 * or the first value on a segment-beginning page after this routine
 	 * exits, so anyone else looking at the variable must be prepared to deal
-	 * with either case.  Similarly, nextOffset may be zero, but we won't use
-	 * that as the actual start offset of the next multixact.
+	 * with either case.
 	 */
 	(MultiXactState->nextMXact)++;
 
@@ -1136,7 +1067,8 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 
 	LWLockRelease(MultiXactGenLock);
 
-	debug_elog4(DEBUG2, "GetNew: returning %u offset %u", result, *offset);
+	debug_elog4(DEBUG2, "GetNew: returning %u offset %" PRIu64, result,
+				*offset);
 	return result;
 }
 
@@ -1178,7 +1110,6 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
 	MultiXactOffset *offptr;
 	MultiXactOffset offset;
 	int			length;
-	int			truelength;
 	MultiXactId oldestMXact;
 	MultiXactId nextMXact;
 	MultiXactId tmpMXact;
@@ -1277,16 +1208,7 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
 	 * we have just for this; the process in charge will signal the CV as soon
 	 * as it has finished writing the multixact offset.
 	 *
-	 * 3. Because GetNewMultiXactId increments offset zero to offset one to
-	 * handle case #2, there is an ambiguity near the point of offset
-	 * wraparound.  If we see next multixact's offset is one, is that our
-	 * multixact's actual endpoint, or did it end at zero with a subsequent
-	 * increment?  We handle this using the knowledge that if the zero'th
-	 * member slot wasn't filled, it'll contain zero, and zero isn't a valid
-	 * transaction ID so it can't be a multixact member.  Therefore, if we
-	 * read a zero from the members array, just ignore it.
-	 *
-	 * This is all pretty messy, but the mess occurs only in infrequent corner
+	 * This is a little messy, but the mess occurs only in infrequent corner
 	 * cases, so it seems better than holding the MultiXactGenLock for a long
 	 * time on every multixact creation.
 	 */
@@ -1372,6 +1294,9 @@ retry:
 	LWLockRelease(lock);
 	lock = NULL;
 
+	/* A multixid with zero members should not happen */
+	Assert(length > 0);
+
 	/*
 	 * If we slept above, clean up state; it's no longer needed.
 	 */
@@ -1380,7 +1305,6 @@ retry:
 
 	ptr = (MultiXactMember *) palloc(length * sizeof(MultiXactMember));
 
-	truelength = 0;
 	prev_pageno = -1;
 	for (int i = 0; i < length; i++, offset++)
 	{
@@ -1417,37 +1341,27 @@ retry:
 
 		xactptr = (TransactionId *)
 			(MultiXactMemberCtl->shared->page_buffer[slotno] + memberoff);
-
-		if (!TransactionIdIsValid(*xactptr))
-		{
-			/* Corner case 3: we must be looking at unused slot zero */
-			Assert(offset == 0);
-			continue;
-		}
+		Assert(TransactionIdIsValid(*xactptr));
 
 		flagsoff = MXOffsetToFlagsOffset(offset);
 		bshift = MXOffsetToFlagsBitShift(offset);
 		flagsptr = (uint32 *) (MultiXactMemberCtl->shared->page_buffer[slotno] + flagsoff);
 
-		ptr[truelength].xid = *xactptr;
-		ptr[truelength].status = (*flagsptr >> bshift) & MXACT_MEMBER_XACT_BITMASK;
-		truelength++;
+		ptr[i].xid = *xactptr;
+		ptr[i].status = (*flagsptr >> bshift) & MXACT_MEMBER_XACT_BITMASK;
 	}
 
 	LWLockRelease(lock);
 
-	/* A multixid with zero members should not happen */
-	Assert(truelength > 0);
-
 	/*
 	 * Copy the result into the local cache.
 	 */
-	mXactCachePut(multi, truelength, ptr);
+	mXactCachePut(multi, length, ptr);
 
 	debug_elog3(DEBUG2, "GetMembers: no cache for %s",
-				mxid_to_string(multi, truelength, ptr));
+				mxid_to_string(multi, length, ptr));
 	*members = ptr;
-	return truelength;
+	return length;
 }
 
 /*
@@ -1854,7 +1768,7 @@ MultiXactShmemInit(void)
 				  "pg_multixact/members", LWTRANCHE_MULTIXACTMEMBER_BUFFER,
 				  LWTRANCHE_MULTIXACTMEMBER_SLRU,
 				  SYNC_HANDLER_MULTIXACT_MEMBER,
-				  false);
+				  true);
 	/* doesn't call SimpleLruTruncate() or meet criteria for unit tests */
 
 	/* Initialize our shared state struct */
@@ -2031,7 +1945,6 @@ TrimMultiXact(void)
 		slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, nextMXact);
 		offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
 		offptr += entryno;
-
 		MemSet(offptr, 0, BLCKSZ - (entryno * sizeof(MultiXactOffset)));
 
 		MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
@@ -2104,7 +2017,7 @@ MultiXactGetCheckptMulti(bool is_shutdown,
 	LWLockRelease(MultiXactGenLock);
 
 	debug_elog6(DEBUG2,
-				"MultiXact: checkpoint is nextMulti %u, nextOffset %u, oldestMulti %u in DB %u",
+				"MultiXact: checkpoint is nextMulti %u, nextOffset %" PRIu64 ", oldestMulti %u in DB %u",
 				*nextMulti, *nextMultiOffset, *oldestMulti, *oldestMultiDB);
 }
 
@@ -2139,7 +2052,7 @@ void
 MultiXactSetNextMXact(MultiXactId nextMulti,
 					  MultiXactOffset nextMultiOffset)
 {
-	debug_elog4(DEBUG2, "MultiXact: setting next multi to %u offset %u",
+	debug_elog4(DEBUG2, "MultiXact: setting next multi to %u offset %" PRIu64,
 				nextMulti, nextMultiOffset);
 	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
 	MultiXactState->nextMXact = nextMulti;
@@ -2330,7 +2243,7 @@ MultiXactAdvanceNextMXact(MultiXactId minMulti,
 	}
 	if (MultiXactOffsetPrecedes(MultiXactState->nextOffset, minMultiOffset))
 	{
-		debug_elog3(DEBUG2, "MultiXact: setting next offset to %u",
+		debug_elog3(DEBUG2, "MultiXact: setting next offset to %" PRIu64,
 					minMultiOffset);
 		MultiXactState->nextOffset = minMultiOffset;
 	}
@@ -2432,23 +2345,8 @@ ExtendMultiXactMember(MultiXactOffset offset, int nmembers)
 			LWLockRelease(lock);
 		}
 
-		/*
-		 * Compute the number of items till end of current page.  Careful: if
-		 * addition of unsigned ints wraps around, we're at the last page of
-		 * the last segment; since that page holds a different number of items
-		 * than other pages, we need to do it differently.
-		 */
-		if (offset + MAX_MEMBERS_IN_LAST_MEMBERS_PAGE < offset)
-		{
-			/*
-			 * This is the last page of the last segment; we can compute the
-			 * number of items left to allocate in it without modulo
-			 * arithmetic.
-			 */
-			difference = MaxMultiXactOffset - offset + 1;
-		}
-		else
-			difference = MULTIXACT_MEMBERS_PER_PAGE - offset % MULTIXACT_MEMBERS_PER_PAGE;
+		/* Compute the number of items till end of current page. */
+		difference = MULTIXACT_MEMBERS_PER_PAGE - offset % MULTIXACT_MEMBERS_PER_PAGE;
 
 		/*
 		 * Advance to next page, taking care to properly handle the wraparound
@@ -2514,15 +2412,14 @@ GetOldestMultiXactId(void)
 }
 
 /*
- * Determine how aggressively we need to vacuum in order to prevent member
- * wraparound.
+ * Determine if we need to vacuum to keep the size of the members SLRU in
+ * check.
  *
  * To do so determine what's the oldest member offset and install the limit
  * info in MultiXactState, where it can be used to prevent overrun of old data
  * in the members SLRU area.
  *
- * The return value is true if emergency autovacuum is required and false
- * otherwise.
+ * The return value is true if autovacuum is required and false otherwise.
  */
 static bool
 SetOffsetVacuumLimit(bool is_startup)
@@ -2534,8 +2431,6 @@ SetOffsetVacuumLimit(bool is_startup)
 	MultiXactOffset nextOffset;
 	bool		oldestOffsetKnown = false;
 	bool		prevOldestOffsetKnown;
-	MultiXactOffset offsetStopLimit = 0;
-	MultiXactOffset prevOffsetStopLimit;
 
 	/*
 	 * NB: Have to prevent concurrent truncation, we might otherwise try to
@@ -2550,7 +2445,6 @@ SetOffsetVacuumLimit(bool is_startup)
 	nextOffset = MultiXactState->nextOffset;
 	prevOldestOffsetKnown = MultiXactState->oldestOffsetKnown;
 	prevOldestOffset = MultiXactState->oldestOffset;
-	prevOffsetStopLimit = MultiXactState->offsetStopLimit;
 	Assert(MultiXactState->finishedStartup);
 	LWLockRelease(MultiXactGenLock);
 
@@ -2581,13 +2475,9 @@ SetOffsetVacuumLimit(bool is_startup)
 		oldestOffsetKnown =
 			find_multixact_start(oldestMultiXactId, &oldestOffset);
 
-		if (oldestOffsetKnown)
-			ereport(DEBUG1,
-					(errmsg_internal("oldest MultiXactId member is at offset %u",
-									 oldestOffset)));
-		else
+		if (!oldestOffsetKnown)
 			ereport(LOG,
-					(errmsg("MultiXact member wraparound protections are disabled because oldest checkpointed MultiXact %u does not exist on disk",
+					(errmsg("oldest checkpointed MultiXact %u does not exist on disk",
 							oldestMultiXactId)));
 	}
 
@@ -2597,97 +2487,32 @@ SetOffsetVacuumLimit(bool is_startup)
 	 * If we can, compute limits (and install them MultiXactState) to prevent
 	 * overrun of old data in the members SLRU area. We can only do so if the
 	 * oldest offset is known though.
+	 *
+	 * FIXME: Is !oldestOffsetKnown possible anymore? At least update the comment:
+	 * we won't overrun members anymore.
 	 */
-	if (oldestOffsetKnown)
-	{
-		/* move back to start of the corresponding segment */
-		offsetStopLimit = oldestOffset - (oldestOffset %
-										  (MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT));
-
-		/* always leave one segment before the wraparound point */
-		offsetStopLimit -= (MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT);
-
-		if (!prevOldestOffsetKnown && !is_startup)
-			ereport(LOG,
-					(errmsg("MultiXact member wraparound protections are now enabled")));
-
-		ereport(DEBUG1,
-				(errmsg_internal("MultiXact member stop limit is now %u based on MultiXact %u",
-								 offsetStopLimit, oldestMultiXactId)));
-	}
-	else if (prevOldestOffsetKnown)
+	if (prevOldestOffsetKnown)
 	{
 		/*
 		 * If we failed to get the oldest offset this time, but we have a
 		 * value from a previous pass through this function, use the old
-		 * values rather than automatically forcing an emergency autovacuum
-		 * cycle again.
+		 * values rather than automatically forcing an autovacuum cycle again.
 		 */
 		oldestOffset = prevOldestOffset;
 		oldestOffsetKnown = true;
-		offsetStopLimit = prevOffsetStopLimit;
 	}
 
 	/* Install the computed values */
 	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
 	MultiXactState->oldestOffset = oldestOffset;
 	MultiXactState->oldestOffsetKnown = oldestOffsetKnown;
-	MultiXactState->offsetStopLimit = offsetStopLimit;
 	LWLockRelease(MultiXactGenLock);
 
 	/*
-	 * Do we need an emergency autovacuum?	If we're not sure, assume yes.
+	 * Do we need autovacuum?	If we're not sure, assume yes.
 	 */
 	return !oldestOffsetKnown ||
-		(nextOffset - oldestOffset > MULTIXACT_MEMBER_SAFE_THRESHOLD);
-}
-
-/*
- * Return whether adding "distance" to "start" would move past "boundary".
- *
- * We use this to determine whether the addition is "wrapping around" the
- * boundary point, hence the name.  The reason we don't want to use the regular
- * 2^31-modulo arithmetic here is that we want to be able to use the whole of
- * the 2^32-1 space here, allowing for more multixacts than would fit
- * otherwise.
- */
-static bool
-MultiXactOffsetWouldWrap(MultiXactOffset boundary, MultiXactOffset start,
-						 uint32 distance)
-{
-	MultiXactOffset finish;
-
-	/*
-	 * Note that offset number 0 is not used (see GetMultiXactIdMembers), so
-	 * if the addition wraps around the UINT_MAX boundary, skip that value.
-	 */
-	finish = start + distance;
-	if (finish < start)
-		finish++;
-
-	/*-----------------------------------------------------------------------
-	 * When the boundary is numerically greater than the starting point, any
-	 * value numerically between the two is not wrapped:
-	 *
-	 *	<----S----B---->
-	 *	[---)			 = F wrapped past B (and UINT_MAX)
-	 *		 [---)		 = F not wrapped
-	 *			  [----] = F wrapped past B
-	 *
-	 * When the boundary is numerically less than the starting point (i.e. the
-	 * UINT_MAX wraparound occurs somewhere in between) then all values in
-	 * between are wrapped:
-	 *
-	 *	<----B----S---->
-	 *	[---)			 = F not wrapped past B (but wrapped past UINT_MAX)
-	 *		 [---)		 = F wrapped past B (and UINT_MAX)
-	 *			  [----] = F not wrapped
-	 *-----------------------------------------------------------------------
-	 */
-	if (start < boundary)
-		return finish >= boundary || finish < start;
-	else
-		return finish >= boundary && finish < start;
+		(nextOffset - oldestOffset > MULTIXACT_MEMBER_AUTOVAC_THRESHOLD);
 }
 
 /*
@@ -2727,6 +2552,7 @@ find_multixact_start(MultiXactId multi, MultiXactOffset *result)
 	offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
 	offptr += entryno;
 	offset = *offptr;
+
 	LWLockRelease(SimpleLruGetBankLock(MultiXactOffsetCtl, pageno));
 
 	*result = offset;
@@ -2774,73 +2600,6 @@ GetMultiXactInfo(uint32 *multixacts, MultiXactOffset *members,
 	return true;
 }
 
-/*
- * Multixact members can be removed once the multixacts that refer to them
- * are older than every datminmxid.  autovacuum_multixact_freeze_max_age and
- * vacuum_multixact_freeze_table_age work together to make sure we never have
- * too many multixacts; we hope that, at least under normal circumstances,
- * this will also be sufficient to keep us from using too many offsets.
- * However, if the average multixact has many members, we might exhaust the
- * members space while still using few enough members that these limits fail
- * to trigger relminmxid advancement by VACUUM.  At that point, we'd have no
- * choice but to start failing multixact-creating operations with an error.
- *
- * To prevent that, if more than a threshold portion of the members space is
- * used, we effectively reduce autovacuum_multixact_freeze_max_age and
- * to a value just less than the number of multixacts in use.  We hope that
- * this will quickly trigger autovacuuming on the table or tables with the
- * oldest relminmxid, thus allowing datminmxid values to advance and removing
- * some members.
- *
- * As the fraction of the member space currently in use grows, we become
- * more aggressive in clamping this value.  That not only causes autovacuum
- * to ramp up, but also makes any manual vacuums the user issues more
- * aggressive.  This happens because vacuum_get_cutoffs() will clamp the
- * freeze table and the minimum freeze age cutoffs based on the effective
- * autovacuum_multixact_freeze_max_age this function returns.  In the worst
- * case, we'll claim the freeze_max_age to zero, and every vacuum of any
- * table will freeze every multixact.
- */
-int
-MultiXactMemberFreezeThreshold(void)
-{
-	MultiXactOffset members;
-	uint32		multixacts;
-	uint32		victim_multixacts;
-	double		fraction;
-	int			result;
-	MultiXactId oldestMultiXactId;
-	MultiXactOffset oldestOffset;
-
-	/* If we can't determine member space utilization, assume the worst. */
-	if (!GetMultiXactInfo(&multixacts, &members, &oldestMultiXactId, &oldestOffset))
-		return 0;
-
-	/* If member space utilization is low, no special action is required. */
-	if (members <= MULTIXACT_MEMBER_SAFE_THRESHOLD)
-		return autovacuum_multixact_freeze_max_age;
-
-	/*
-	 * Compute a target for relminmxid advancement.  The number of multixacts
-	 * we try to eliminate from the system is based on how far we are past
-	 * MULTIXACT_MEMBER_SAFE_THRESHOLD.
-	 */
-	fraction = (double) (members - MULTIXACT_MEMBER_SAFE_THRESHOLD) /
-		(MULTIXACT_MEMBER_DANGER_THRESHOLD - MULTIXACT_MEMBER_SAFE_THRESHOLD);
-	victim_multixacts = multixacts * fraction;
-
-	/* fraction could be > 1.0, but lowest possible freeze age is zero */
-	if (victim_multixacts > multixacts)
-		return 0;
-	result = multixacts - victim_multixacts;
-
-	/*
-	 * Clamp to autovacuum_multixact_freeze_max_age, so that we never make
-	 * autovacuum less aggressive than it would otherwise be.
-	 */
-	return Min(result, autovacuum_multixact_freeze_max_age);
-}
-
 typedef struct mxtruncinfo
 {
 	int64		earliestExistingPage;
@@ -2867,36 +2626,12 @@ SlruScanDirCbFindEarliest(SlruCtl ctl, char *filename, int64 segpage, void *data
 
 /*
  * Delete members segments [oldest, newOldest)
- *
- * The members SLRU can, in contrast to the offsets one, be filled to almost
- * the full range at once. This means SimpleLruTruncate() can't trivially be
- * used - instead the to-be-deleted range is computed using the offsets
- * SLRU. C.f. TruncateMultiXact().
  */
 static void
 PerformMembersTruncation(MultiXactOffset oldestOffset, MultiXactOffset newOldestOffset)
 {
-	const int64 maxsegment = MXOffsetToMemberSegment(MaxMultiXactOffset);
-	int64		startsegment = MXOffsetToMemberSegment(oldestOffset);
-	int64		endsegment = MXOffsetToMemberSegment(newOldestOffset);
-	int64		segment = startsegment;
-
-	/*
-	 * Delete all the segments but the last one. The last segment can still
-	 * contain, possibly partially, valid data.
-	 */
-	while (segment != endsegment)
-	{
-		elog(DEBUG2, "truncating multixact members segment %" PRIx64,
-			 segment);
-		SlruDeleteSegment(MultiXactMemberCtl, segment);
-
-		/* move to next segment, handling wraparound correctly */
-		if (segment == maxsegment)
-			segment = 0;
-		else
-			segment += 1;
-	}
+	SimpleLruTruncate(MultiXactMemberCtl,
+					  MXOffsetToMemberPage(newOldestOffset));
 }
 
 /*
@@ -3040,7 +2775,7 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 
 	elog(DEBUG1, "performing multixact truncation: "
 		 "offsets [%u, %u), offsets segments [%" PRIx64 ", %" PRIx64 "), "
-		 "members [%u, %u), members segments [%" PRIx64 ", %" PRIx64 ")",
+		 "members [%" PRIu64 ", %" PRIu64 "), members segments [%" PRIx64 ", %" PRIx64 ")",
 		 oldestMulti, newOldestMulti,
 		 MultiXactIdToOffsetSegment(oldestMulti),
 		 MultiXactIdToOffsetSegment(newOldestMulti),
@@ -3120,20 +2855,13 @@ MultiXactOffsetPagePrecedes(int64 page1, int64 page2)
 
 /*
  * Decide whether a MultiXactMember page number is "older" for truncation
- * purposes.  There is no "invalid offset number" so use the numbers verbatim.
+ * purposes.  There is no "invalid offset number" and members never wrap
+ * around, so use the numbers verbatim.
  */
 static bool
 MultiXactMemberPagePrecedes(int64 page1, int64 page2)
 {
-	MultiXactOffset offset1;
-	MultiXactOffset offset2;
-
-	offset1 = ((MultiXactOffset) page1) * MULTIXACT_MEMBERS_PER_PAGE;
-	offset2 = ((MultiXactOffset) page2) * MULTIXACT_MEMBERS_PER_PAGE;
-
-	return (MultiXactOffsetPrecedes(offset1, offset2) &&
-			MultiXactOffsetPrecedes(offset1,
-									offset2 + MULTIXACT_MEMBERS_PER_PAGE - 1));
+	return page1 < page2;
 }
 
 /*
@@ -3171,7 +2899,7 @@ MultiXactIdPrecedesOrEquals(MultiXactId multi1, MultiXactId multi2)
 static bool
 MultiXactOffsetPrecedes(MultiXactOffset offset1, MultiXactOffset offset2)
 {
-	int32		diff = (int32) (offset1 - offset2);
+	int64		diff = (int64) (offset1 - offset2);
 
 	return (diff < 0);
 }
@@ -3268,7 +2996,7 @@ multixact_redo(XLogReaderState *record)
 
 		elog(DEBUG1, "replaying multixact truncation: "
 			 "offsets [%u, %u), offsets segments [%" PRIx64 ", %" PRIx64 "), "
-			 "members [%u, %u), members segments [%" PRIx64 ", %" PRIx64 ")",
+			 "members [%" PRIu64 ", %" PRIu64 "), members segments [%" PRIx64 ", %" PRIx64 ")",
 			 xlrec.startTruncOff, xlrec.endTruncOff,
 			 MultiXactIdToOffsetSegment(xlrec.startTruncOff),
 			 MultiXactIdToOffsetSegment(xlrec.endTruncOff),
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 22d0a2e8c3a..ef405d66b3b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5139,7 +5139,7 @@ BootStrapXLOG(uint32 data_checksum_version)
 		FullTransactionIdFromEpochAndXid(0, FirstNormalTransactionId);
 	checkPoint.nextOid = FirstGenbkiObjectId;
 	checkPoint.nextMulti = FirstMultiXactId;
-	checkPoint.nextMultiOffset = 0;
+	checkPoint.nextMultiOffset = 1;
 	checkPoint.oldestXid = FirstNormalTransactionId;
 	checkPoint.oldestXidDB = Template1DbOid;
 	checkPoint.oldestMulti = FirstMultiXactId;
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 21b8f179ba0..51dea342a4d 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -886,7 +886,7 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 							 U64FromFullTransactionId(checkPoint.nextXid),
 							 checkPoint.nextOid)));
 	ereport(DEBUG1,
-			(errmsg_internal("next MultiXactId: %u; next MultiXactOffset: %u",
+			(errmsg_internal("next MultiXactId: %u; next MultiXactOffset: %" PRIu64,
 							 checkPoint.nextMulti, checkPoint.nextMultiOffset)));
 	ereport(DEBUG1,
 			(errmsg_internal("oldest unfrozen transaction ID: %u, in database %u",
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index e785dd55ce5..100e1a72c22 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -1148,7 +1148,7 @@ vacuum_get_cutoffs(Relation rel, const VacuumParams params,
 	 * normally autovacuum_multixact_freeze_max_age, but may be less if we are
 	 * short of multixact member space.
 	 */
-	effective_multixact_freeze_max_age = MultiXactMemberFreezeThreshold();
+	effective_multixact_freeze_max_age = autovacuum_multixact_freeze_max_age;
 
 	/*
 	 * Almost ready to set freeze output parameters; check if OldestXmin or
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 1c38488f2cb..bf66f494e3a 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -1151,7 +1151,7 @@ do_start_worker(void)
 
 	/* Also determine the oldest datminmxid we will consider. */
 	recentMulti = ReadNextMultiXactId();
-	multiForceLimit = recentMulti - MultiXactMemberFreezeThreshold();
+	multiForceLimit = recentMulti - autovacuum_multixact_freeze_max_age;
 	if (multiForceLimit < FirstMultiXactId)
 		multiForceLimit -= FirstMultiXactId;
 
@@ -1939,7 +1939,7 @@ do_autovacuum(void)
 	 * normally autovacuum_multixact_freeze_max_age, but may be less if we are
 	 * short of multixact member space.
 	 */
-	effective_multixact_freeze_max_age = MultiXactMemberFreezeThreshold();
+	effective_multixact_freeze_max_age = autovacuum_multixact_freeze_max_age;
 
 	/*
 	 * Find the pg_database entry and select the default freeze ages. We use
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index 30ad46912e1..a4060309ae0 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -271,7 +271,7 @@ main(int argc, char *argv[])
 		   ControlFile->checkPointCopy.nextOid);
 	printf(_("Latest checkpoint's NextMultiXactId:  %u\n"),
 		   ControlFile->checkPointCopy.nextMulti);
-	printf(_("Latest checkpoint's NextMultiOffset:  %u\n"),
+	printf(_("Latest checkpoint's NextMultiOffset:  %" PRIu64 "\n"),
 		   ControlFile->checkPointCopy.nextMultiOffset);
 	printf(_("Latest checkpoint's oldestXID:        %u\n"),
 		   ControlFile->checkPointCopy.oldestXid);
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index a31e7643cf0..7c6c2741a17 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -92,6 +92,7 @@ static void KillExistingArchiveStatus(void);
 static void KillExistingWALSummaries(void);
 static void WriteEmptyXLOG(void);
 static void usage(void);
+static uint64 strtou64_strict(const char *s, char **endptr, int base);
 
 
 int
@@ -120,7 +121,6 @@ main(int argc, char *argv[])
 	MultiXactId set_oldestmxid = 0;
 	char	   *endptr;
 	char	   *endptr2;
-	int64		tmpi64;
 	char	   *DataDir = NULL;
 	char	   *log_fname = NULL;
 	int			fd;
@@ -269,17 +269,14 @@ main(int argc, char *argv[])
 
 			case 'O':
 				errno = 0;
-				tmpi64 = strtoi64(optarg, &endptr, 0);
+				set_mxoff = strtou64_strict(optarg, &endptr, 0);
 				if (endptr == optarg || *endptr != '\0' || errno != 0)
 				{
 					pg_log_error("invalid argument for option %s", "-O");
 					pg_log_error_hint("Try \"%s --help\" for more information.", progname);
 					exit(1);
 				}
-				if (tmpi64 < 0 || tmpi64 > (int64) MaxMultiXactOffset)
-					pg_fatal("multitransaction offset (-O) must be between 0 and %u", MaxMultiXactOffset);
 
-				set_mxoff = (MultiXactOffset) tmpi64;
 				mxoff_given = true;
 				break;
 
@@ -749,7 +746,7 @@ PrintControlValues(bool guessed)
 		   ControlFile.checkPointCopy.nextOid);
 	printf(_("Latest checkpoint's NextMultiXactId:  %u\n"),
 		   ControlFile.checkPointCopy.nextMulti);
-	printf(_("Latest checkpoint's NextMultiOffset:  %u\n"),
+	printf(_("Latest checkpoint's NextMultiOffset:  %" PRIu64 "\n"),
 		   ControlFile.checkPointCopy.nextMultiOffset);
 	printf(_("Latest checkpoint's oldestXID:        %u\n"),
 		   ControlFile.checkPointCopy.oldestXid);
@@ -825,7 +822,7 @@ PrintNewControlValues(void)
 
 	if (mxoff_given)
 	{
-		printf(_("NextMultiOffset:                      %u\n"),
+		printf(_("NextMultiOffset:                      %" PRIu64 "\n"),
 			   ControlFile.checkPointCopy.nextMultiOffset);
 	}
 
@@ -1210,3 +1207,22 @@ usage(void)
 	printf(_("\nReport bugs to <%s>.\n"), PACKAGE_BUGREPORT);
 	printf(_("%s home page: <%s>\n"), PACKAGE_NAME, PACKAGE_URL);
 }
+
+/* Like strtou64(), but negative values are not accepted. */
+static uint64
+strtou64_strict(const char *s, char **endptr, int base)
+{
+	/* skip leading whitespace */
+	while (isspace((unsigned char) *s))
+		s++;
+
+	/* reject negative values */
+	if (*s == '-')
+	{
+		*endptr = (char *) s;
+		errno = ERANGE;
+		return UINT64_MAX;
+	}
+
+	return strtou64(s, endptr, base);
+}
diff --git a/src/bin/pg_resetwal/t/001_basic.pl b/src/bin/pg_resetwal/t/001_basic.pl
index 90ecb8afe18..5a175e285d1 100644
--- a/src/bin/pg_resetwal/t/001_basic.pl
+++ b/src/bin/pg_resetwal/t/001_basic.pl
@@ -145,7 +145,7 @@ command_fails_like(
 	'fails with incorrect -O option');
 command_fails_like(
 	[ 'pg_resetwal', '-O' => '-1', $node->data_dir ],
-	qr/must be between 0 and 4294967295/,
+	qr/error: invalid argument for option -O/,
 	'fails with -O value -1');
 # --wal-segsize
 command_fails_like(
@@ -215,7 +215,7 @@ push @cmd,
   sprintf("%d,%d", hex($files[0]) == 0 ? 3 : hex($files[0]), hex($files[-1]));
 
 @files = get_slru_files('pg_multixact/offsets');
-$mult = 32 * $blcksz / 4;
+$mult = 32 * $blcksz / 8;
 # --multixact-ids argument is "new,old"
 push @cmd,
   '--multixact-ids' => sprintf("%d,%d",
diff --git a/src/include/access/multixact.h b/src/include/access/multixact.h
index 82e4bb90dd5..7d98fe0fe32 100644
--- a/src/include/access/multixact.h
+++ b/src/include/access/multixact.h
@@ -28,8 +28,6 @@
 
 #define MultiXactIdIsValid(multi) ((multi) != InvalidMultiXactId)
 
-#define MaxMultiXactOffset	((MultiXactOffset) 0xFFFFFFFF)
-
 /*
  * Possible multixact lock modes ("status").  The first four modes are for
  * tuple locks (FOR KEY SHARE, FOR SHARE, FOR NO KEY UPDATE, FOR UPDATE); the
@@ -147,7 +145,6 @@ extern void MultiXactSetNextMXact(MultiXactId nextMulti,
 extern void MultiXactAdvanceNextMXact(MultiXactId minMulti,
 									  MultiXactOffset minMultiOffset);
 extern void MultiXactAdvanceOldest(MultiXactId oldestMulti, Oid oldestMultiDB);
-extern int	MultiXactMemberFreezeThreshold(void);
 
 extern void multixact_twophase_recover(FullTransactionId fxid, uint16 info,
 									   void *recdata, uint32 len);
diff --git a/src/include/access/multixact_internal.h b/src/include/access/multixact_internal.h
index 9b56deaef31..b0227759e39 100644
--- a/src/include/access/multixact_internal.h
+++ b/src/include/access/multixact_internal.h
@@ -17,21 +17,12 @@
 
 #include "access/multixact.h"
 
-
 /*
  * Defines for MultiXactOffset page sizes.  A page is the same BLCKSZ as is
  * used everywhere else in Postgres.
- *
- * Note: because MultiXactOffsets are 32 bits and wrap around at 0xFFFFFFFF,
- * MultiXact page numbering also wraps around at
- * 0xFFFFFFFF/MULTIXACT_OFFSETS_PER_PAGE, and segment numbering at
- * 0xFFFFFFFF/MULTIXACT_OFFSETS_PER_PAGE/SLRU_PAGES_PER_SEGMENT.  We need
- * take no explicit notice of that fact in this module, except when comparing
- * segment and page numbers in TruncateMultiXact (see
- * MultiXactOffsetPagePrecedes).
  */
 
-/* We need four bytes per offset */
+/* We need 8 bytes per offset */
 #define MULTIXACT_OFFSETS_PER_PAGE (BLCKSZ / sizeof(MultiXactOffset))
 
 static inline int64
@@ -80,19 +71,6 @@ MultiXactIdToOffsetSegment(MultiXactId multi)
 #define MULTIXACT_MEMBERS_PER_PAGE	\
 	(MULTIXACT_MEMBERGROUPS_PER_PAGE * MULTIXACT_MEMBERS_PER_MEMBERGROUP)
 
-/*
- * Because the number of items per page is not a divisor of the last item
- * number (member 0xFFFFFFFF), the last segment does not use the maximum number
- * of pages, and moreover the last used page therein does not use the same
- * number of items as previous pages.  (Another way to say it is that the
- * 0xFFFFFFFF member is somewhere in the middle of the last page, so the page
- * has some empty space after that item.)
- *
- * This constant is the number of members in the last page of the last segment.
- */
-#define MAX_MEMBERS_IN_LAST_MEMBERS_PAGE \
-		((uint32) ((0xFFFFFFFF % MULTIXACT_MEMBERS_PER_PAGE) + 1))
-
 /* page in which a member is to be found */
 static inline int64
 MXOffsetToMemberPage(MultiXactOffset offset)
diff --git a/src/include/c.h b/src/include/c.h
index 757dfff4782..bc92a6f4565 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -670,7 +670,7 @@ typedef uint32 SubTransactionId;
 /* MultiXactId must be equivalent to TransactionId, to fit in t_xmax */
 typedef TransactionId MultiXactId;
 
-typedef uint32 MultiXactOffset;
+typedef uint64 MultiXactOffset;
 
 typedef uint32 CommandId;
 
-- 
2.47.3

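A note on the pg_resetwal change in the patch above: the helper exists because the C library's unsigned parsers accept a leading minus sign and silently negate the value, so "-1" would otherwise come back as the maximum 64-bit value with errno left untouched. Below is a minimal standalone sketch of the same idea. It is not the patch's code: it assumes standard strtoull() in place of PostgreSQL's strtou64(), and the names parse_u64_strict, parse_offset_arg and run_examples are hypothetical, chosen only for this illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical stand-in for the patch's strtou64_strict().  Uses standard
 * strtoull() (an assumption made so the sketch compiles outside the
 * PostgreSQL tree) and rejects any input whose first non-blank character
 * is a minus sign.
 */
uint64_t
parse_u64_strict(const char *s, char **endptr, int base)
{
	/* skip leading whitespace; the cast avoids UB for negative char values */
	while (isspace((unsigned char) *s))
		s++;

	/* reject negative values instead of letting strtoull() negate them */
	if (*s == '-')
	{
		*endptr = (char *) s;
		errno = ERANGE;
		return UINT64_MAX;
	}

	return (uint64_t) strtoull(s, endptr, base);
}

/*
 * Same validity check the -O option handler applies after parsing.
 * Returns 1 and stores the value in *out on success, 0 otherwise.
 */
int
parse_offset_arg(const char *arg, uint64_t *out)
{
	char	   *endptr;

	errno = 0;
	*out = parse_u64_strict(arg, &endptr, 0);
	if (endptr == arg || *endptr != '\0' || errno != 0)
		return 0;
	return 1;
}

/* A few illustrative calls. */
void
run_examples(void)
{
	uint64_t	v;

	assert(parse_offset_arg("42", &v) && v == 42);
	assert(parse_offset_arg("0x10", &v) && v == 16);	/* base 0 accepts hex */
	assert(parse_offset_arg("18446744073709551615", &v) && v == UINT64_MAX);
	assert(!parse_offset_arg("-1", &v));	/* negative: rejected */
	assert(!parse_offset_arg("12abc", &v)); /* trailing junk: rejected */
	assert(!parse_offset_arg("", &v));	/* empty: rejected */
	printf("last parsed value: %" PRIu64 "\n", v);
}
```

With this shape the caller needs no separate range check, which is consistent with the updated 001_basic.pl test expecting the generic "invalid argument for option -O" message for -1 rather than the old "must be between 0 and 4294967295" one.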
v26-0003-TEST-bump-catversion.patch (text/x-patch; charset=UTF-8)
From da5eaad183d4e27a1915d0594645c54ef5d820c8 Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Fri, 24 Oct 2025 11:47:50 +0300
Subject: [PATCH v26 03/10] TEST: bump catversion

To avoid constant CF-bot complaints, do the catversion bump in a separate
commit.

NOTE: keep it in sync with MULTIXACTOFFSET_FORMATCHANGE_CAT_VER
---
 src/include/catalog/catversion.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h
index 7eefca1ae42..b0162c2bf63 100644
--- a/src/include/catalog/catversion.h
+++ b/src/include/catalog/catversion.h
@@ -57,6 +57,7 @@
  */
 
 /*							yyyymmddN */
-#define CATALOG_VERSION_NO	202511101
+// FIXME: bump it
+#define CATALOG_VERSION_NO	999999999
 
 #endif
-- 
2.47.3

v26-0004-Add-pg_upgrade-for-64-bit-multixact-offsets.patch (text/x-patch; charset=UTF-8)
From a9073083671689f64ae25b95b4ded8083d870de2 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Wed, 12 Nov 2025 21:47:44 +0200
Subject: [PATCH v26 04/10] Add pg_upgrade for 64 bit multixact offsets

Author: Maxim Orlov <orlovmg@gmail.com>
Author: Heikki Linnakangas <hlinnaka@iki.fi>
---
 src/backend/access/transam/multixact.c        |  56 ---
 src/bin/pg_upgrade/Makefile                   |   3 +
 src/bin/pg_upgrade/meson.build                |   4 +
 src/bin/pg_upgrade/multixact_new.c            | 101 ++++++
 src/bin/pg_upgrade/multixact_new.h            |  23 ++
 src/bin/pg_upgrade/multixact_old.c            | 297 ++++++++++++++++
 src/bin/pg_upgrade/multixact_old.h            |  29 ++
 src/bin/pg_upgrade/pg_upgrade.c               | 108 +++++-
 src/bin/pg_upgrade/pg_upgrade.h               |   5 +
 src/bin/pg_upgrade/slru_io.c                  | 242 +++++++++++++
 src/bin/pg_upgrade/slru_io.h                  |  52 +++
 .../pg_upgrade/t/007_multixact_conversion.pl  | 329 ++++++++++++++++++
 src/test/perl/PostgreSQL/Test/Cluster.pm      |  21 +-
 src/tools/pgindent/typedefs.list              |   3 +
 14 files changed, 1204 insertions(+), 69 deletions(-)
 create mode 100644 src/bin/pg_upgrade/multixact_new.c
 create mode 100644 src/bin/pg_upgrade/multixact_new.h
 create mode 100644 src/bin/pg_upgrade/multixact_old.c
 create mode 100644 src/bin/pg_upgrade/multixact_old.h
 create mode 100644 src/bin/pg_upgrade/slru_io.c
 create mode 100644 src/bin/pg_upgrade/slru_io.h
 create mode 100644 src/bin/pg_upgrade/t/007_multixact_conversion.pl

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 34a745c07be..e0323ec1014 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1824,48 +1824,6 @@ BootStrapMultiXact(void)
 	SimpleLruZeroAndWritePage(MultiXactMemberCtl, 0);
 }
 
-/*
- * MaybeExtendOffsetSlru
- *		Extend the offsets SLRU area, if necessary
- *
- * After a binary upgrade from <= 9.2, the pg_multixact/offsets SLRU area might
- * contain files that are shorter than necessary; this would occur if the old
- * installation had used multixacts beyond the first page (files cannot be
- * copied, because the on-disk representation is different).  pg_upgrade would
- * update pg_control to set the next offset value to be at that position, so
- * that tuples marked as locked by such MultiXacts would be seen as visible
- * without having to consult multixact.  However, trying to create and use a
- * new MultiXactId would result in an error because the page on which the new
- * value would reside does not exist.  This routine is in charge of creating
- * such pages.
- */
-static void
-MaybeExtendOffsetSlru(void)
-{
-	int64		pageno;
-	LWLock	   *lock;
-
-	pageno = MultiXactIdToOffsetPage(MultiXactState->nextMXact);
-	lock = SimpleLruGetBankLock(MultiXactOffsetCtl, pageno);
-
-	LWLockAcquire(lock, LW_EXCLUSIVE);
-
-	if (!SimpleLruDoesPhysicalPageExist(MultiXactOffsetCtl, pageno))
-	{
-		int			slotno;
-
-		/*
-		 * Fortunately for us, SimpleLruWritePage is already prepared to deal
-		 * with creating a new segment file even if the page we're writing is
-		 * not the first in it, so this is enough.
-		 */
-		slotno = SimpleLruZeroPage(MultiXactOffsetCtl, pageno);
-		SimpleLruWritePage(MultiXactOffsetCtl, slotno);
-	}
-
-	LWLockRelease(lock);
-}
-
 /*
  * This must be called ONCE during postmaster or standalone-backend startup.
  *
@@ -2058,20 +2016,6 @@ MultiXactSetNextMXact(MultiXactId nextMulti,
 	MultiXactState->nextMXact = nextMulti;
 	MultiXactState->nextOffset = nextMultiOffset;
 	LWLockRelease(MultiXactGenLock);
-
-	/*
-	 * During a binary upgrade, make sure that the offsets SLRU is large
-	 * enough to contain the next value that would be created.
-	 *
-	 * We need to do this pretty early during the first startup in binary
-	 * upgrade mode: before StartupMultiXact() in fact, because this routine
-	 * is called even before that by StartupXLOG().  And we can't do it
-	 * earlier than at this point, because during that first call of this
-	 * routine we determine the MultiXactState->nextMXact value that
-	 * MaybeExtendOffsetSlru needs.
-	 */
-	if (IsBinaryUpgrade)
-		MaybeExtendOffsetSlru();
 }
 
 /*
diff --git a/src/bin/pg_upgrade/Makefile b/src/bin/pg_upgrade/Makefile
index 69fcf593cae..42995d53b0b 100644
--- a/src/bin/pg_upgrade/Makefile
+++ b/src/bin/pg_upgrade/Makefile
@@ -18,11 +18,14 @@ OBJS = \
 	file.o \
 	function.o \
 	info.o \
+	multixact_new.o \
+	multixact_old.o \
 	option.o \
 	parallel.o \
 	pg_upgrade.o \
 	relfilenumber.o \
 	server.o \
+	slru_io.o \
 	tablespace.o \
 	task.o \
 	util.o \
diff --git a/src/bin/pg_upgrade/meson.build b/src/bin/pg_upgrade/meson.build
index ac992f0d14b..fff0db3b560 100644
--- a/src/bin/pg_upgrade/meson.build
+++ b/src/bin/pg_upgrade/meson.build
@@ -8,11 +8,14 @@ pg_upgrade_sources = files(
   'file.c',
   'function.c',
   'info.c',
+  'multixact_new.c',
+  'multixact_old.c',
   'option.c',
   'parallel.c',
   'pg_upgrade.c',
   'relfilenumber.c',
   'server.c',
+  'slru_io.c',
   'tablespace.c',
   'task.c',
   'util.c',
@@ -47,6 +50,7 @@ tests += {
       't/004_subscription.pl',
       't/005_char_signedness.pl',
       't/006_transfer_modes.pl',
+      't/007_multixact_conversion.pl',
     ],
     'test_kwargs': {'priority': 40}, # pg_upgrade tests are slow
   },
diff --git a/src/bin/pg_upgrade/multixact_new.c b/src/bin/pg_upgrade/multixact_new.c
new file mode 100644
index 00000000000..8284a2015fc
--- /dev/null
+++ b/src/bin/pg_upgrade/multixact_new.c
@@ -0,0 +1,101 @@
+/*
+ * multixact_new.c
+ *
+ * Functions to write multixacts in the v19 format with 64-bit
+ * MultiXactOffsets
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ * src/bin/pg_upgrade/multixact_new.c
+ */
+
+#include "postgres_fe.h"
+
+#include "access/multixact_internal.h"
+#include "multixact_new.h"
+
+MultiXactWriter *
+AllocMultiXactWrite(const char *pgdata, MultiXactId firstMulti,
+					MultiXactOffset firstOffset)
+{
+	MultiXactWriter *state = pg_malloc(sizeof(*state));
+	char		dir[MAXPGPATH] = {0};
+
+	pg_sprintf(dir, "%s/pg_multixact/offsets", pgdata);
+	state->offset = AllocSlruWrite(dir, false);
+	SlruWriteSwitchPage(state->offset, MultiXactIdToOffsetPage(firstMulti));
+
+	pg_sprintf(dir, "%s/pg_multixact/members", pgdata);
+	state->members = AllocSlruWrite(dir, true /* use long segment names */ );
+	SlruWriteSwitchPage(state->members, MXOffsetToMemberPage(firstOffset));
+
+	return state;
+}
+
+/*
+ * Write a new multixact with members.
+ *
+ * Simplified version of the corresponding server function, hence the name.
+ */
+void
+RecordNewMultiXact(MultiXactWriter *state, MultiXactOffset offset,
+				   MultiXactId multi, int nmembers, MultiXactMember *members)
+{
+	int64		pageno;
+	int64		prev_pageno;
+	int			entryno;
+	char	   *buf;
+	MultiXactOffset *offptr;
+
+	pageno = MultiXactIdToOffsetPage(multi);
+	entryno = MultiXactIdToOffsetEntry(multi);
+
+	/* Store the offset */
+	buf = SlruWriteSwitchPage(state->offset, pageno);
+	offptr = (MultiXactOffset *) buf;
+	offptr[entryno] = offset;
+
+	/* Store the members */
+	prev_pageno = -1;
+	for (int i = 0; i < nmembers; i++, offset++)
+	{
+		TransactionId *memberptr;
+		uint32	   *flagsptr;
+		uint32		flagsval;
+		int			bshift;
+		int			flagsoff;
+		int			memberoff;
+
+		Assert(members[i].status <= MultiXactStatusUpdate);
+
+		pageno = MXOffsetToMemberPage(offset);
+		memberoff = MXOffsetToMemberOffset(offset);
+		flagsoff = MXOffsetToFlagsOffset(offset);
+		bshift = MXOffsetToFlagsBitShift(offset);
+
+		if (pageno != prev_pageno)
+		{
+			buf = SlruWriteSwitchPage(state->members, pageno);
+			prev_pageno = pageno;
+		}
+
+		memberptr = (TransactionId *) (buf + memberoff);
+
+		*memberptr = members[i].xid;
+
+		flagsptr = (uint32 *) (buf + flagsoff);
+
+		flagsval = *flagsptr;
+		flagsval &= ~(((1 << MXACT_MEMBER_BITS_PER_XACT) - 1) << bshift);
+		flagsval |= (members[i].status << bshift);
+		*flagsptr = flagsval;
+	}
+}
+
+void
+FreeMultiXactWrite(MultiXactWriter *state)
+{
+	FreeSlruWrite(state->offset);
+	FreeSlruWrite(state->members);
+
+	pfree(state);
+}
diff --git a/src/bin/pg_upgrade/multixact_new.h b/src/bin/pg_upgrade/multixact_new.h
new file mode 100644
index 00000000000..f66e6af7e45
--- /dev/null
+++ b/src/bin/pg_upgrade/multixact_new.h
@@ -0,0 +1,23 @@
+/*
+ * multixact_new.h
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ * src/bin/pg_upgrade/multixact_new.h
+ */
+#include "access/multixact.h"
+
+#include "slru_io.h"
+
+typedef struct MultiXactWriter
+{
+	SlruSegState *offset;
+	SlruSegState *members;
+} MultiXactWriter;
+
+extern MultiXactWriter *AllocMultiXactWrite(const char *pgdata,
+											MultiXactId firstMulti,
+											MultiXactOffset firstOffset);
+extern void RecordNewMultiXact(MultiXactWriter *state, MultiXactOffset offset,
+							   MultiXactId multi, int nmembers,
+							   MultiXactMember *members);
+extern void FreeMultiXactWrite(MultiXactWriter *writer);
diff --git a/src/bin/pg_upgrade/multixact_old.c b/src/bin/pg_upgrade/multixact_old.c
new file mode 100644
index 00000000000..7bf7db4b009
--- /dev/null
+++ b/src/bin/pg_upgrade/multixact_old.c
@@ -0,0 +1,297 @@
+/*
+ * multixact_old.c
+ *
+ * Functions to read pre-v19 multixacts
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ * src/bin/pg_upgrade/multixact_old.c
+ */
+
+#include "postgres_fe.h"
+
+#include "multixact_old.h"
+#include "pg_upgrade.h"
+
+/*
+ * NOTE: below are a bunch of definitions that are copy-pasted from
+ * multixact.c from version 18.  The only difference is that we use the
+ * OldMultiXactOffset type equal to uint32 instead of MultiXactOffset which
+ * became uint64.
+ */
+
+/* We need four bytes per offset */
+#define MULTIXACT_OFFSETS_PER_PAGE (BLCKSZ / sizeof(OldMultiXactOffset))
+
+static inline int64
+MultiXactIdToOffsetPage(MultiXactId multi)
+{
+	return multi / MULTIXACT_OFFSETS_PER_PAGE;
+}
+
+static inline int
+MultiXactIdToOffsetEntry(MultiXactId multi)
+{
+	return multi % MULTIXACT_OFFSETS_PER_PAGE;
+}
+
+/*
+ * The situation for members is a bit more complex: we store one byte of
+ * additional flag bits for each TransactionId.  To do this without getting
+ * into alignment issues, we store four bytes of flags, and then the
+ * corresponding 4 Xids.  Each such 5-word (20-byte) set we call a "group", and
+ * are stored as a whole in pages.  Thus, with 8kB BLCKSZ, we keep 409 groups
+ * per page.  This wastes 12 bytes per page, but that's OK -- simplicity (and
+ * performance) trumps space efficiency here.
+ *
+ * Note that the "offset" macros work with byte offset, not array indexes, so
+ * arithmetic must be done using "char *" pointers.
+ */
+/* We need eight bits per xact, so one xact fits in a byte */
+#define MXACT_MEMBER_BITS_PER_XACT			8
+#define MXACT_MEMBER_FLAGS_PER_BYTE			1
+#define MXACT_MEMBER_XACT_BITMASK	((1 << MXACT_MEMBER_BITS_PER_XACT) - 1)
+
+/* how many full bytes of flags are there in a group? */
+#define MULTIXACT_FLAGBYTES_PER_GROUP		4
+#define MULTIXACT_MEMBERS_PER_MEMBERGROUP	\
+	(MULTIXACT_FLAGBYTES_PER_GROUP * MXACT_MEMBER_FLAGS_PER_BYTE)
+/* size in bytes of a complete group */
+#define MULTIXACT_MEMBERGROUP_SIZE \
+	(sizeof(TransactionId) * MULTIXACT_MEMBERS_PER_MEMBERGROUP + MULTIXACT_FLAGBYTES_PER_GROUP)
+#define MULTIXACT_MEMBERGROUPS_PER_PAGE (BLCKSZ / MULTIXACT_MEMBERGROUP_SIZE)
+#define MULTIXACT_MEMBERS_PER_PAGE	\
+	(MULTIXACT_MEMBERGROUPS_PER_PAGE * MULTIXACT_MEMBERS_PER_MEMBERGROUP)
+
+/* page in which a member is to be found */
+static inline int64
+MXOffsetToMemberPage(OldMultiXactOffset offset)
+{
+	return offset / MULTIXACT_MEMBERS_PER_PAGE;
+}
+
+/* Location (byte offset within page) of flag word for a given member */
+static inline int
+MXOffsetToFlagsOffset(OldMultiXactOffset offset)
+{
+	OldMultiXactOffset group = offset / MULTIXACT_MEMBERS_PER_MEMBERGROUP;
+	int			grouponpg = group % MULTIXACT_MEMBERGROUPS_PER_PAGE;
+	int			byteoff = grouponpg * MULTIXACT_MEMBERGROUP_SIZE;
+
+	return byteoff;
+}
+
+/* Location (byte offset within page) of TransactionId of given member */
+static inline int
+MXOffsetToMemberOffset(OldMultiXactOffset offset)
+{
+	int			member_in_group = offset % MULTIXACT_MEMBERS_PER_MEMBERGROUP;
+
+	return MXOffsetToFlagsOffset(offset) +
+		MULTIXACT_FLAGBYTES_PER_GROUP +
+		member_in_group * sizeof(TransactionId);
+}
+
+static inline int
+MXOffsetToFlagsBitShift(OldMultiXactOffset offset)
+{
+	int			member_in_group = offset % MULTIXACT_MEMBERS_PER_MEMBERGROUP;
+	int			bshift = member_in_group * MXACT_MEMBER_BITS_PER_XACT;
+
+	return bshift;
+}
+
+/*
+ * Construct reader of old multixacts.
+ *
+ * Returns malloc'd memory that is used by all the other calls in this module.
+ */
+OldMultiXactReader *
+AllocOldMultiXactRead(char *pgdata, MultiXactId nextMulti,
+					  OldMultiXactOffset nextOffset)
+{
+	OldMultiXactReader *state = pg_malloc(sizeof(*state));
+	char		dir[MAXPGPATH] = {0};
+
+	state->nextMXact = nextMulti;
+	state->nextOffset = nextOffset;
+
+	pg_sprintf(dir, "%s/pg_multixact/offsets", pgdata);
+	state->offset = AllocSlruRead(dir, false);
+
+	pg_sprintf(dir, "%s/pg_multixact/members", pgdata);
+	state->members = AllocSlruRead(dir, false);
+
+	return state;
+}
+
+/*
+ * This is a simplified version of the GetMultiXactIdMembers() server function.
+ *
+ * - Only return the updating member, if any. Upgrade only cares about the
+ *   updaters. If there is no updating member, return the first locking-only
+ *   member. We don't have any way to represent "no members", but we also don't
+ *   need to preserve all the locking members.
+ *
+ * - We don't need to worry about locking and some corner cases because there's
+ *   no concurrent activity.
+ */
+void
+GetOldMultiXactIdSingleMember(OldMultiXactReader *state, MultiXactId multi,
+							  TransactionId *result, MultiXactStatus *status)
+{
+	MultiXactId nextMXact,
+				nextOffset,
+				tmpMXact;
+	int64		pageno,
+				prev_pageno;
+	int			entryno,
+				length;
+	char	   *buf;
+	OldMultiXactOffset *offptr,
+				offset;
+	TransactionId result_xid = InvalidTransactionId;
+	bool		result_isupdate = false;
+
+	nextMXact = state->nextMXact;
+	nextOffset = state->nextOffset;
+
+	/*
+	 * See GetMultiXactIdMembers in multixact.c
+	 *
+	 * Find out the offset at which we need to start reading MultiXactMembers
+	 * and the number of members in the multixact.  We determine the latter as
+	 * the difference between this multixact's starting offset and the next
+	 * one's.  However, there are some corner cases to worry about:
+	 *
+	 * 1. This multixact may be the latest one created, in which case there is
+	 * no next one to look at.  In this case the nextOffset value we just
+	 * saved is the correct endpoint.
+	 *
+	 * 2. The next multixact may still be in process of being filled in...
+	 * This cannot happen during upgrade.
+	 *
+	 * 3. Because GetNewMultiXactId increments offset zero to offset one to
+	 * handle case #2, there is an ambiguity near the point of offset
+	 * wraparound.  If we see next multixact's offset is one, is that our
+	 * multixact's actual endpoint, or did it end at zero with a subsequent
+	 * increment?  We handle this using the knowledge that if the zero'th
+	 * member slot wasn't filled, it'll contain zero, and zero isn't a valid
+	 * transaction ID so it can't be a multixact member.  Therefore, if we
+	 * read a zero from the members array, just ignore it.
+	 */
+
+	pageno = MultiXactIdToOffsetPage(multi);
+	entryno = MultiXactIdToOffsetEntry(multi);
+
+	buf = SlruReadSwitchPage(state->offset, pageno);
+	offptr = (OldMultiXactOffset *) buf;
+	offptr += entryno;
+	offset = *offptr;
+
+	Assert(offset != 0);
+
+	/*
+	 * Use the same increment rule as GetNewMultiXactId(), that is, don't
+	 * handle wraparound explicitly until needed.
+	 */
+	tmpMXact = multi + 1;
+
+	if (nextMXact == tmpMXact)
+	{
+		/* Corner case 1: there is no next multixact */
+		length = nextOffset - offset;
+	}
+	else
+	{
+		OldMultiXactOffset nextMXOffset;
+
+		/* handle wraparound if needed */
+		if (tmpMXact < FirstMultiXactId)
+			tmpMXact = FirstMultiXactId;
+
+		prev_pageno = pageno;
+
+		pageno = MultiXactIdToOffsetPage(tmpMXact);
+		entryno = MultiXactIdToOffsetEntry(tmpMXact);
+
+		if (pageno != prev_pageno)
+			buf = SlruReadSwitchPage(state->offset, pageno);
+
+		offptr = (OldMultiXactOffset *) buf;
+		offptr += entryno;
+		nextMXOffset = *offptr;
+
+		/*
+		 * Corner case 2: next multixact is still being filled in, this must
+		 * not happen during upgrade.
+		 */
+		Assert(nextMXOffset != 0);
+
+		length = nextMXOffset - offset;
+	}
+
+	prev_pageno = -1;
+	for (int i = 0; i < length; i++, offset++)
+	{
+		TransactionId *xactptr;
+		uint32	   *flagsptr;
+		int			flagsoff;
+		int			bshift;
+		int			memberoff;
+		MultiXactStatus st;
+
+		pageno = MXOffsetToMemberPage(offset);
+		memberoff = MXOffsetToMemberOffset(offset);
+
+		if (pageno != prev_pageno)
+		{
+			buf = SlruReadSwitchPage(state->members, pageno);
+			prev_pageno = pageno;
+		}
+
+		xactptr = (TransactionId *) (buf + memberoff);
+		if (!TransactionIdIsValid(*xactptr))
+		{
+			/* Corner case 3: we must be looking at unused slot zero */
+			Assert(offset == 0);
+			continue;
+		}
+
+		flagsoff = MXOffsetToFlagsOffset(offset);
+		bshift = MXOffsetToFlagsBitShift(offset);
+		flagsptr = (uint32 *) (buf + flagsoff);
+
+		st = (*flagsptr >> bshift) & MXACT_MEMBER_XACT_BITMASK;
+
+		/* Verify that there is a single update Xid among the given members. */
+		if (ISUPDATE_from_mxstatus(st))
+		{
+			if (result_isupdate)
+				pg_fatal("multixact %u has more than one updating member",
+						 multi);
+			result_xid = *xactptr;
+			result_isupdate = true;
+		}
+		else if (!TransactionIdIsValid(result_xid))
+			result_xid = *xactptr;
+	}
+
+	/* A multixid with zero members should not happen */
+	Assert(TransactionIdIsValid(result_xid));
+
+	*result = result_xid;
+	*status = result_isupdate ? MultiXactStatusUpdate :
+		MultiXactStatusForKeyShare;
+}
+
+/*
+ * Frees the malloced reader.
+ */
+void
+FreeOldMultiXactReader(OldMultiXactReader *state)
+{
+	FreeSlruRead(state->offset);
+	FreeSlruRead(state->members);
+
+	pfree(state);
+}
diff --git a/src/bin/pg_upgrade/multixact_old.h b/src/bin/pg_upgrade/multixact_old.h
new file mode 100644
index 00000000000..8eb5af2ccaf
--- /dev/null
+++ b/src/bin/pg_upgrade/multixact_old.h
@@ -0,0 +1,29 @@
+/*
+ * multixact_old.h
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ * src/bin/pg_upgrade/multixact_old.h
+ */
+
+#include "access/multixact.h"
+#include "slru_io.h"
+
+typedef uint32 OldMultiXactOffset;
+
+typedef struct OldMultiXactReader
+{
+	MultiXactId nextMXact;
+	OldMultiXactOffset nextOffset;
+
+	SlruSegState *offset;
+	SlruSegState *members;
+} OldMultiXactReader;
+
+extern OldMultiXactReader *AllocOldMultiXactRead(char *pgdata,
+												 MultiXactId nextMulti,
+												 OldMultiXactOffset nextOffset);
+extern void GetOldMultiXactIdSingleMember(OldMultiXactReader *state,
+										  MultiXactId multi,
+										  TransactionId *result,
+										  MultiXactStatus *status);
+extern void FreeOldMultiXactReader(OldMultiXactReader *reader);
diff --git a/src/bin/pg_upgrade/pg_upgrade.c b/src/bin/pg_upgrade/pg_upgrade.c
index 490e98fa26f..0fdd05c127c 100644
--- a/src/bin/pg_upgrade/pg_upgrade.c
+++ b/src/bin/pg_upgrade/pg_upgrade.c
@@ -48,6 +48,8 @@
 #include "common/logging.h"
 #include "common/restricted_token.h"
 #include "fe_utils/string_utils.h"
+#include "multixact_old.h"
+#include "multixact_new.h"
 #include "pg_upgrade.h"
 
 /*
@@ -769,6 +771,81 @@ copy_subdir_files(const char *old_subdir, const char *new_subdir)
 	check_ok();
 }
 
+/*
+ * Convert pg_multixact/offsets and /members to new format with 64-bit offsets.
+ */
+static void
+convert_multixacts(MultiXactId *new_nxtmulti, MultiXactOffset *new_nxtmxoff)
+{
+	MultiXactId oldest_multi,
+				next_multi;
+	OldMultiXactReader *old_reader;
+	MultiXactWriter *new_writer;
+	MultiXactOffset next_offset;
+
+	/*
+	 * The range of valid multi XIDs is unchanged by the conversion (they are
+	 * referenced from the heap tables), but the members SLRU is rewritten to
+	 * start from offset 1.
+	 */
+	oldest_multi = old_cluster.controldata.chkpnt_oldstMulti;
+	next_multi = old_cluster.controldata.chkpnt_nxtmulti;
+	next_offset = 1;
+
+	old_reader = AllocOldMultiXactRead(old_cluster.pgdata,
+									   old_cluster.controldata.chkpnt_nxtmulti,
+									   old_cluster.controldata.chkpnt_nxtmxoff);
+	new_writer = AllocMultiXactWrite(new_cluster.pgdata,
+									 oldest_multi, next_offset);
+
+	/* handle wraparound */
+	if (next_multi < FirstMultiXactId)
+		next_multi = FirstMultiXactId;
+
+	/*
+	 * Read multixids from old files one by one, and write them back in the
+	 * new format.
+	 */
+	for (MultiXactId multi = oldest_multi; multi != next_multi;)
+	{
+		TransactionId xid;
+		MultiXactStatus status;
+		MultiXactMember member;
+
+		/*
+		 * Read the old multixid.  The locking-only XIDs that may be part of
+		 * multi-xids don't matter after upgrade, as there can be no
+		 * transactions running across upgrade.  So as a little optimization,
+		 * we only read one member from each multixid: the updating one,
+		 * or if there was no update, arbitrarily the first locking xid.
+		 */
+		GetOldMultiXactIdSingleMember(old_reader, multi, &xid, &status);
+
+		/* Write it out in new format */
+		member.xid = xid;
+		member.status = status;
+		RecordNewMultiXact(new_writer, next_offset, multi, 1, &member);
+
+		next_offset += 1;
+		multi++;
+		/* handle wraparound */
+		if (multi < FirstMultiXactId)
+			multi = FirstMultiXactId;
+	}
+
+	/*
+	 * Update the nextMXact/Offset values in the control file to match what we
+	 * wrote.  The nextMXact is unchanged, but nextOffset will be different.
+	 */
+	Assert(next_multi == old_cluster.controldata.chkpnt_nxtmulti);
+	*new_nxtmulti = next_multi;
+	*new_nxtmxoff = next_offset;
+
+	/* Release resources */
+	FreeMultiXactWrite(new_writer);
+	FreeOldMultiXactReader(old_reader);
+}
+
 static void
 copy_xact_xlog_xid(void)
 {
@@ -816,8 +893,29 @@ copy_xact_xlog_xid(void)
 	if (old_cluster.controldata.cat_ver >= MULTIXACT_FORMATCHANGE_CAT_VER &&
 		new_cluster.controldata.cat_ver >= MULTIXACT_FORMATCHANGE_CAT_VER)
 	{
-		copy_subdir_files("pg_multixact/offsets", "pg_multixact/offsets");
-		copy_subdir_files("pg_multixact/members", "pg_multixact/members");
+		MultiXactId new_nxtmulti = old_cluster.controldata.chkpnt_nxtmulti;
+		MultiXactOffset new_nxtmxoff = old_cluster.controldata.chkpnt_nxtmxoff;
+
+		/*
+		 * If the old server is before the
+		 * MULTIXACTOFFSET_FORMATCHANGE_CAT_VER it must have 32-bit multixid
+		 * offsets, thus it should be converted.
+		 */
+		if (old_cluster.controldata.cat_ver < MULTIXACTOFFSET_FORMATCHANGE_CAT_VER &&
+			new_cluster.controldata.cat_ver >= MULTIXACTOFFSET_FORMATCHANGE_CAT_VER)
+		{
+			remove_new_subdir("pg_multixact/members", false);
+			remove_new_subdir("pg_multixact/offsets", false);
+
+			prep_status("Converting pg_multixact/offsets to 64-bit");
+			convert_multixacts(&new_nxtmulti, &new_nxtmxoff);
+			check_ok();
+		}
+		else
+		{
+			copy_subdir_files("pg_multixact/offsets", "pg_multixact/offsets");
+			copy_subdir_files("pg_multixact/members", "pg_multixact/members");
+		}
 
 		prep_status("Setting next multixact ID and offset for new cluster");
 
@@ -826,10 +924,8 @@ copy_xact_xlog_xid(void)
 		 * counters here and the oldest multi present on system.
 		 */
 		exec_prog(UTILITY_LOG_FILE, NULL, true, true,
-				  "\"%s/pg_resetwal\" -O %u -m %u,%u \"%s\"",
-				  new_cluster.bindir,
-				  old_cluster.controldata.chkpnt_nxtmxoff,
-				  old_cluster.controldata.chkpnt_nxtmulti,
+				  "\"%s/pg_resetwal\" -O %" PRIu64 " -m %u,%u \"%s\"",
+				  new_cluster.bindir, new_nxtmxoff, new_nxtmulti,
 				  old_cluster.controldata.chkpnt_oldstMulti,
 				  new_cluster.pgdata);
 		check_ok();
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index e86336f4be9..127b2cb00fa 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -114,6 +114,11 @@ extern char *output_files[];
  */
 #define MULTIXACT_FORMATCHANGE_CAT_VER 201301231
 
+/*
+ * Switch from 32-bit to 64-bit for multixid offsets.
+ */
+#define MULTIXACTOFFSET_FORMATCHANGE_CAT_VER 999999999
+
 /*
  * large object chunk size added to pg_controldata,
  * commit 5f93c37805e7485488480916b4585e098d3cc883
diff --git a/src/bin/pg_upgrade/slru_io.c b/src/bin/pg_upgrade/slru_io.c
new file mode 100644
index 00000000000..010094184be
--- /dev/null
+++ b/src/bin/pg_upgrade/slru_io.c
@@ -0,0 +1,242 @@
+/*
+ * slru_io.c
+ *
+ * Routines for reading and writing SLRU files during upgrade.
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ * src/bin/pg_upgrade/slru_io.c
+ */
+
+#include "postgres_fe.h"
+
+#include <fcntl.h>
+
+#include "common/fe_memutils.h"
+#include "common/file_perm.h"
+#include "common/file_utils.h"
+#include "port/pg_iovec.h"
+#include "pg_upgrade.h"
+#include "slru_io.h"
+
+static SlruSegState *AllocSlruSegState(const char *dir);
+static char *SlruFileName(SlruSegState *state, int64 segno);
+static void SlruFlush(SlruSegState *state);
+
+static SlruSegState *
+AllocSlruSegState(const char *dir)
+{
+	SlruSegState *state = pg_malloc(sizeof(*state));
+
+	state->dir = pstrdup(dir);
+	state->fn = NULL;
+	state->fd = -1;
+	state->segno = -1;
+	state->pageno = 0;
+
+	return state;
+}
+
+/* similar to the backend function with the same name */
+static char *
+SlruFileName(SlruSegState *state, int64 segno)
+{
+	if (state->long_segment_names)
+	{
+		Assert(segno >= 0 && segno <= INT64CONST(0xFFFFFFFFFFFFFFF));
+		return psprintf("%s/%015" PRIX64, state->dir, segno);
+	}
+	else
+	{
+		Assert(segno >= 0 && segno <= INT64CONST(0xFFFFFF));
+		return psprintf("%s/%04X", state->dir, (unsigned int) segno);
+	}
+}
+
+/*
+ * Create slru reader for dir.
+ *
+ * Returns the malloced memory used by all the other read calls in this module.
+ */
+SlruSegState *
+AllocSlruRead(const char *dir, bool long_segment_names)
+{
+	SlruSegState *state = AllocSlruSegState(dir);
+
+	state->writing = false;
+	state->long_segment_names = long_segment_names;
+
+	return state;
+}
+
+/*
+ * Open given page for reading.
+ *
+ * Reading can be done in random order.
+ */
+char *
+SlruReadSwitchPageSlow(SlruSegState *state, uint64 pageno)
+{
+	int64		segno;
+
+	Assert(!state->writing);	/* read only mode */
+
+	if (state->segno != -1 && pageno == state->pageno)
+		return state->buf.data;
+
+	segno = pageno / SLRU_PAGES_PER_SEGMENT;
+	if (segno != state->segno)
+	{
+		if (state->segno != -1)
+		{
+			close(state->fd);
+			state->fd = -1;
+
+			pg_free(state->fn);
+			state->fn = NULL;
+
+			state->segno = -1;
+		}
+
+		/* Open new segment */
+		state->fn = SlruFileName(state, segno);
+		if ((state->fd = open(state->fn, O_RDONLY | PG_BINARY, 0)) < 0)
+			pg_fatal("could not open file \"%s\": %m", state->fn);
+	}
+
+	state->segno = segno;
+
+	{
+		struct iovec iovec = {
+			.iov_base = &state->buf,
+			.iov_len = BLCKSZ,
+		};
+		off_t		offset = (pageno % SLRU_PAGES_PER_SEGMENT) * BLCKSZ;
+
+		if (pg_preadv(state->fd, &iovec, 1, offset) < 0)
+			pg_fatal("could not read file \"%s\": %m", state->fn);
+
+		state->pageno = pageno;
+	}
+
+	return state->buf.data;
+}
+
+/*
+ * Frees the malloced reader.
+ */
+void
+FreeSlruRead(SlruSegState *state)
+{
+	Assert(!state->writing);	/* read only mode */
+
+	if (state->fd != -1)
+		close(state->fd);
+	pg_free(state);
+}
+
+/*
+ * Create slru writer for dir.
+ *
+ * Returns the malloced memory used by all the other write calls in this module.
+ */
+SlruSegState *
+AllocSlruWrite(const char *dir, bool long_segment_names)
+{
+	SlruSegState *state = AllocSlruSegState(dir);
+
+	state->writing = true;
+	state->long_segment_names = long_segment_names;
+
+	return state;
+}
+
+/*
+ * Open the given page for writing.
+ *
+ * NOTE: This uses O_EXCL when stepping to a new segment, so this assumes that
+ * each segment is written in full before moving on to next one.  This
+ * limitation would be easy to lift if needed, but it fits the usage pattern of
+ * current callers.
+ */
+char *
+SlruWriteSwitchPageSlow(SlruSegState *state, uint64 pageno)
+{
+	int64		segno;
+	off_t		offset;
+
+	if (state->segno != -1 && pageno == state->pageno)
+		return state->buf.data;
+
+	segno = pageno / SLRU_PAGES_PER_SEGMENT;
+	offset = (pageno % SLRU_PAGES_PER_SEGMENT) * BLCKSZ;
+
+	SlruFlush(state);
+	memset(state->buf.data, 0, BLCKSZ);
+
+	if (segno != state->segno)
+	{
+		if (state->segno != -1)
+		{
+			close(state->fd);
+			state->fd = -1;
+
+			pg_free(state->fn);
+			state->fn = NULL;
+
+			state->segno = -1;
+		}
+
+		/* Create the segment */
+		state->fn = SlruFileName(state, segno);
+		if ((state->fd = open(state->fn, O_RDWR | O_CREAT | O_EXCL | PG_BINARY,
+							  pg_file_create_mode)) < 0)
+		{
+			pg_fatal("could not create file \"%s\": %m", state->fn);
+		}
+
+		state->segno = segno;
+
+		if (offset > 0)
+		{
+			if (pg_pwrite_zeros(state->fd, offset, 0) < 0)
+				pg_fatal("could not write file \"%s\": %m", state->fn);
+		}
+	}
+
+	state->pageno = pageno;
+
+	return state->buf.data;
+}
+
+static void
+SlruFlush(SlruSegState *state)
+{
+	struct iovec iovec = {
+		.iov_base = &state->buf,
+		.iov_len = BLCKSZ,
+	};
+	off_t		offset;
+
+	if (state->segno == -1)
+		return;
+
+	offset = (state->pageno % SLRU_PAGES_PER_SEGMENT) * BLCKSZ;
+
+	if (pg_pwritev_with_retry(state->fd, &iovec, 1, offset) < 0)
+		pg_fatal("could not write file \"%s\": %m", state->fn);
+}
+
+/*
+ * Frees the malloced writer.
+ */
+void
+FreeSlruWrite(SlruSegState *state)
+{
+	Assert(state->writing);
+
+	SlruFlush(state);
+
+	if (state->fd != -1)
+		close(state->fd);
+	pg_free(state);
+}
diff --git a/src/bin/pg_upgrade/slru_io.h b/src/bin/pg_upgrade/slru_io.h
new file mode 100644
index 00000000000..5c80a679b4d
--- /dev/null
+++ b/src/bin/pg_upgrade/slru_io.h
@@ -0,0 +1,52 @@
+/*
+ * slru_io.h
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ * src/bin/pg_upgrade/slru_io.h
+ */
+
+#ifndef SLRU_IO_H
+#define SLRU_IO_H
+
+/*
+ * State for reading or writing an SLRU, with a one page buffer.
+ */
+typedef struct SlruSegState
+{
+	bool		writing;
+	bool		long_segment_names;
+
+	char	   *dir;
+	char	   *fn;
+	int			fd;
+	int64		segno;
+	uint64		pageno;
+
+	PGAlignedBlock buf;
+} SlruSegState;
+
+extern SlruSegState *AllocSlruRead(const char *dir, bool long_segment_names);
+extern char *SlruReadSwitchPageSlow(SlruSegState *state, uint64 pageno);
+extern void FreeSlruRead(SlruSegState *state);
+
+static inline char *
+SlruReadSwitchPage(SlruSegState *state, uint64 pageno)
+{
+	if (state->segno != -1 && pageno == state->pageno)
+		return state->buf.data;
+	return SlruReadSwitchPageSlow(state, pageno);
+}
+
+extern SlruSegState *AllocSlruWrite(const char *dir, bool long_segment_names);
+extern char *SlruWriteSwitchPageSlow(SlruSegState *state, uint64 pageno);
+extern void FreeSlruWrite(SlruSegState *state);
+
+static inline char *
+SlruWriteSwitchPage(SlruSegState *state, uint64 pageno)
+{
+	if (state->segno != -1 && pageno == state->pageno)
+		return state->buf.data;
+	return SlruWriteSwitchPageSlow(state, pageno);
+}
+
+#endif							/* SLRU_IO_H */
diff --git a/src/bin/pg_upgrade/t/007_multixact_conversion.pl b/src/bin/pg_upgrade/t/007_multixact_conversion.pl
new file mode 100644
index 00000000000..fe8da9aded2
--- /dev/null
+++ b/src/bin/pg_upgrade/t/007_multixact_conversion.pl
@@ -0,0 +1,329 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+# Version 19 expanded MultiXactOffset from 32 to 64 bits. Upgrading
+# across that requires rewriting the SLRU files to the new format.
+# This file contains tests for the conversion.
+#
+# To run, set 'oldinstall' ENV variable to point to a pre-v19
+# installation. If it's not set, or if it points to a v19 or above
+# installation, this still performs a very basic test, upgrading a
+# cluster with some multixacts. It's not very interesting, however,
+# because there's no conversion involved in that case.
+
+use strict;
+use warnings FATAL => 'all';
+
+use Math::BigInt;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Temp dir for dumps.
+my $tempdir = PostgreSQL::Test::Utils::tempdir;
+
+# A workload that consumes multixids. The purpose of this is to
+# generate some multixids in the old cluster, so that we can test
+# upgrading them. The workload is a mix of KEY SHARE locking queries
+# and UPDATEs, and commits and aborts. It consumes around 3000
+# multixids with 30000 members. That's enough to span more than one
+# multixids 'offsets' page, and more than one 'members' segment.
+#
+# The workload leaves behind a table called 'mxofftest' containing a
+# small number of rows referencing some of the generated multixids.
+#
+# Because this function is used to generate test data on the old
+# installation, it needs to work with older PostgreSQL server
+# versions.
+#
+# The first argument is the cluster to connect to, the second argument
+# is a cluster using the new version. We need the 'psql' binary from
+# the new version, the new cluster is otherwise unused. (We need to
+# use the new 'psql' because some of the more advanced background psql
+# perl module features depend on a fairly recent psql version.)
+sub mxact_workload
+{
+	my $node = shift;       # Cluster to connect to
+	my $binnode = shift;    # Use the psql binary from this cluster
+
+	my $connstr = $node->connstr('postgres');
+
+	$node->start;
+	$node->safe_psql('postgres', qq[
+		CREATE TABLE mxofftest (id INT PRIMARY KEY, n_updated INT)
+		  WITH (AUTOVACUUM_ENABLED=FALSE);
+		INSERT INTO mxofftest SELECT G, 0 FROM GENERATE_SERIES(1, 50) G;
+	]);
+
+	my $nclients = 20;
+	my $update_every = 13;
+	my $abort_every = 11;
+	my @connections = ();
+
+	# Open multiple connections to the database. Start a transaction
+	# in each connection.
+	for (1 .. $nclients)
+	{
+		# Use the psql binary from the new installation. The
+		# BackgroundPsql functionality doesn't work with older psql
+		# versions.
+		my $conn = $binnode->background_psql('',
+			connstr => $node->connstr('postgres'));
+		$conn->query_safe("SET enable_seqscan=off");
+		$conn->query_safe("BEGIN");
+
+		push(@connections, $conn);
+	}
+
+	# Run queries, cycling through the connections in a
+	# round-robin fashion. We keep a transaction open in each
+	# connection at all times, and lock/update the rows. With
+	# $nclients connections, each SELECT FOR KEY SHARE query can
+	# generate a new multixid containing the XIDs of the transactions
+	# holding locks at the time.
+	for (my $i = 0; $i < 3000; $i++)
+	{
+		my $conn = $connections[ $i % $nclients ];
+
+		my $sql;
+		if ($i % $abort_every == 0)
+		{
+			$sql = "ABORT; ";
+		}
+		else
+		{
+			$sql = "COMMIT; ";
+		}
+		$sql .= "BEGIN; ";
+
+		if ($i % $update_every == 0)
+		{
+			$sql .= qq[
+			  UPDATE mxofftest SET n_updated = n_updated + 1 WHERE id = ${i} % 50;
+			];
+		}
+		else
+		{
+			my $threshold = int($i / 3000 * 50);
+			$sql .= qq[
+			  select count(*) from (
+				SELECT * FROM mxofftest WHERE id >= $threshold FOR KEY SHARE
+			  ) as x
+			];
+		}
+		$conn->query_safe($sql);
+	}
+
+	for my $conn (@connections)
+	{
+		$conn->quit();
+	}
+
+	$node->stop;
+	return;
+}
+
+# Read NextMultiOffset from the control file
+#
+# Note: This is used on both the old and the new installation, so the
+# command arguments and the output parsing used here must work with
+# all PostgreSQL versions supported by the test.
+sub read_next_mxoff
+{
+	my $node = shift;
+
+	my $pg_controldata_path = $node->installed_command('pg_controldata');
+	my ($stdout, $stderr) =
+	  run_command([ $pg_controldata_path, $node->data_dir ]);
+	$stdout =~ /^Latest checkpoint's NextMultiOffset:\s*(.*)$/m
+	  or die "could not read NextMultiOffset from pg_controldata";
+	return $1;
+}
+
+# Reset a cluster's next multixact offset to the given offset.
+#
+# Note: This is used on both the old and the new installation, so the
+# command arguments and the output parsing used here must work with
+# all PostgreSQL versions supported by the test.
+sub reset_mxoff_pre_v19
+{
+	my $node = shift;
+	my $offset = shift;
+
+	my $pg_resetwal_path = $node->installed_command('pg_resetwal');
+	# Get block size
+	my ($out, $err) =
+	  run_command([ $pg_resetwal_path, '--dry-run', $node->data_dir ]);
+	$out =~ /^Database block size: *(\d+)$/m or die;
+	my $blcksz = $1;
+	# SLRU_PAGES_PER_SEGMENT is always 32 on pre-19 versions
+	my $slru_pages_per_segment = 32;
+
+	# Verify that no multixids are currently in use. Resetting would
+	# destroy them. (A freshly initialized cluster has no multixids.)
+	$out =~ /^Latest checkpoint's NextMultiXactId: *(\d+)$/m or die;
+	my $next_mxid = $1;
+	$out =~ /^Latest checkpoint's oldestMultiXid: *(\d+)$/m or die;
+	my $oldest_mxid = $1;
+	die "cluster has some multixids in use" unless $next_mxid == $oldest_mxid;
+
+	# Reset to new offset using pg_resetwal
+	my @cmd = (
+		$pg_resetwal_path,
+		'--pgdata' => $node->data_dir,
+		'--multixact-offset' => $offset);
+	command_ok(\@cmd, 'set next multixact-offset');
+
+	# pg_resetwal just updates the control file. The cluster will
+	# refuse to start up, if the SLRU segment corresponding to the
+	# offset does not exist. Create a dummy segment that covers the
+	# given offset, filled with zeros. But first remove any old
+	# segments.
+	unlink glob $node->data_dir . "/pg_multixact/members/*";
+
+	my $mult = $slru_pages_per_segment * int($blcksz / 20) * 4;
+	my $segname = sprintf "%04X", int($offset / $mult);
+
+	my $path = $node->data_dir . "/pg_multixact/members/" . $segname;
+
+	my $null_block = "\x00" x $blcksz;
+	open(my $dh, '>', $path)
+	  || die "could not open $path for writing $!";
+	for (1 .. $slru_pages_per_segment)
+	{
+		print $dh $null_block;
+	}
+	close($dh);
+}
+
+# Dump contents of the 'mxofftest' table, created by mxact_workload
+sub get_dump_for_comparison
+{
+	my ($node, $file_prefix) = @_;
+
+	my $contents = $node->safe_psql('postgres',
+		"SELECT ctid, xmin, xmax, * FROM mxofftest");
+
+	my $dumpfile = $tempdir . '/' . $file_prefix . '.sql';
+	open(my $dh, '>', $dumpfile)
+	  || die "could not open $dumpfile for writing $!";
+	print $dh $contents;
+	close($dh);
+
+	return $dumpfile;
+}
+
+# Main test workhorse routine.
+# Dump data on old version, run pg_upgrade, compare data after upgrade.
+sub upgrade_and_compare
+{
+	my $tag = shift;
+	my $oldnode = shift;
+	my $newnode = shift;
+
+	command_ok(
+		[
+			'pg_upgrade', '--no-sync',
+			'--old-datadir' => $oldnode->data_dir,
+			'--new-datadir' => $newnode->data_dir,
+			'--old-bindir' => $oldnode->config_data('--bindir'),
+			'--new-bindir' => $newnode->config_data('--bindir'),
+			'--socketdir' => $newnode->host,
+			'--old-port' => $oldnode->port,
+			'--new-port' => $newnode->port,
+		],
+		'run of pg_upgrade for new instance');
+
+	# Note: we do this *after* running pg_upgrade, to ensure that we
+	# don't set all the hint bits before upgrade by doing the SELECT
+	# on the table.
+	$oldnode->start;
+	my $old_dump = get_dump_for_comparison($oldnode, "oldnode_${tag}_dump");
+	$oldnode->stop;
+
+	$newnode->start;
+	my $new_dump = get_dump_for_comparison($newnode, "newnode_${tag}_dump");
+	$newnode->stop;
+
+	compare_files($old_dump, $new_dump,
+		'dump outputs from original and restored regression databases match');
+}
+
+my $old_version;
+
+# Basic scenario: Create a cluster using old installation, run
+# multixid-creating workload on it, then upgrade.
+#
+# This works even if the old and new versions are the same,
+# although it's not very interesting as the conversion routines only
+# run when upgrading from a pre-v19 cluster.
+{
+	my $tag = 'basic';
+	my $old =
+	  PostgreSQL::Test::Cluster->new("${tag}_oldnode",
+		install_path => $ENV{oldinstall});
+	my $new = PostgreSQL::Test::Cluster->new("${tag}_newnode");
+
+	$old->init(extra => ['-k']);
+
+	$old_version = $old->pg_version;
+	note "old installation is version $old_version\n";
+
+	# Run the workload
+	my $start_mxoff = read_next_mxoff($old);
+	mxact_workload($old, $new);
+	my $finish_mxoff = read_next_mxoff($old);
+
+	$new->init;
+	upgrade_and_compare($tag, $old, $new);
+
+	my $new_next_mxoff = read_next_mxoff($new);
+
+	note ">>> case #${tag}\n"
+	  . " oldnode mxoff from ${start_mxoff} to ${finish_mxoff}\n"
+	  . " newnode mxoff ${new_next_mxoff}\n";
+}
+
+# Wraparound scenario: This is the same as the basic scenario, but the
+# old cluster goes through mxoffset wraparound.
+#
+# This requires the old installation to be older than version 19,
+# because the hacks we use to reset the old cluster to a state just
+# before the wraparound rely on the pre-v19 file format. In version
+# 19, offsets no longer wrap around anyway.
+SKIP:
+{
+	skip
+	  "skipping mxoffset conversion tests because upgrading from the old version does not require conversion"
+	  if ($old_version >= '19devel');
+
+	my $tag = 'wraparound';
+	my $old =
+	  PostgreSQL::Test::Cluster->new("${tag}_oldnode",
+		install_path => $ENV{oldinstall});
+	my $new = PostgreSQL::Test::Cluster->new("${tag}_newnode");
+
+	$old->init(extra => ['-k']);
+
+	# Reset the NextMultiOffset value in the old cluster to just before 32-bit wraparound.
+	reset_mxoff_pre_v19($old, 0xFFFFEC77);
+
+	# Run the workload. This crosses the wraparound.
+	my $start_mxoff = read_next_mxoff($old);
+	mxact_workload($old, $new);
+	my $finish_mxoff = read_next_mxoff($old);
+
+	# Verify that wraparound happened.
+	cmp_ok($finish_mxoff, '<', $start_mxoff,
+		"mxoff wrapped around in old cluster");
+
+	$new->init;
+	upgrade_and_compare($tag, $old, $new);
+
+	my $new_next_mxoff = read_next_mxoff($new);
+
+	note ">>> case #${tag}\n"
+	  . " oldnode mxoff from ${start_mxoff} to ${finish_mxoff}\n"
+	  . " newnode mxoff ${new_next_mxoff}\n";
+}
+
+done_testing();
diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index 35413f14019..34f07d52cd8 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -1793,13 +1793,20 @@ sub _get_env
 	return (%inst_env);
 }
 
-# Private routine to get an installation path qualified command.
-#
-# IPC::Run maintains a cache, %cmd_cache, mapping commands to paths.  Tests
-# which use nodes spanning more than one postgres installation path need to
-# avoid confusing which installation's binaries get run.  Setting $ENV{PATH} is
-# insufficient, as IPC::Run does not check to see if the path has changed since
-# caching a command.
+=pod
+
+=item $node->installed_command(cmd)
+
+Get an installation path qualified command.
+
+IPC::Run maintains a cache, %cmd_cache, mapping commands to paths.  Tests
+which use nodes spanning more than one postgres installation path need to
+avoid confusing which installation's binaries get run.  Setting $ENV{PATH} is
+insufficient, as IPC::Run does not check to see if the path has changed since
+caching a command.
+
+=cut
+
 sub installed_command
 {
 	my ($self, $cmd) = @_;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 23bce72ae64..f9ddd06ec1d 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1725,6 +1725,7 @@ MultiXactMember
 MultiXactOffset
 MultiXactStateData
 MultiXactStatus
+MultiXactWriter
 MultirangeIOData
 MultirangeParseState
 MultirangeType
@@ -1808,6 +1809,7 @@ OffsetVarNodes_context
 Oid
 OidOptions
 OkeysState
+OldMultiXactReader
 OldToNewMapping
 OldToNewMappingData
 OnCommitAction
@@ -2804,6 +2806,7 @@ SlruCtlData
 SlruErrorCause
 SlruPageStatus
 SlruScanCallback
+SlruSegState
 SlruShared
 SlruSharedData
 SlruWriteAll
-- 
2.47.3
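
The conversion loop in convert_multixacts() above can be pictured with a small standalone sketch. This is only an illustration under stated assumptions, not the patch's code: sketch_convert_range and SKETCH_FIRST_MULTI are hypothetical names, the SLRU reads and writes are left out, and the only thing modelled is how the loop walks the 32-bit multixid range (skipping the invalid ID 0 on wraparound) while handing out one member offset per multixid, starting from 1.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for FirstMultiXactId; ID 0 is the invalid multixid. */
#define SKETCH_FIRST_MULTI 1u

/*
 * Walk multixids from oldest_multi (inclusive) to next_multi (exclusive),
 * wrapping past the invalid ID 0, and return the next free member offset
 * after assigning exactly one member per multixid, starting at offset 1.
 */
static uint64_t
sketch_convert_range(uint32_t oldest_multi, uint32_t next_multi)
{
	uint64_t	next_offset = 1;

	assert(oldest_multi >= SKETCH_FIRST_MULTI);

	/* handle wraparound of the end point */
	if (next_multi < SKETCH_FIRST_MULTI)
		next_multi = SKETCH_FIRST_MULTI;

	for (uint32_t multi = oldest_multi; multi != next_multi;)
	{
		/* the real code reads one old member here and writes it out */
		next_offset += 1;

		multi++;
		/* handle wraparound of the iterator */
		if (multi < SKETCH_FIRST_MULTI)
			multi = SKETCH_FIRST_MULTI;
	}

	return next_offset;
}
```

Because each converted multixid keeps a single member, the returned value is always the number of multixids in the range plus one, which is why the new nextOffset generally differs from the old cluster's value.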

v26-0005-Remove-oldestOffset-oldestOffsetKnown-from-multi.patch (text/x-patch; charset=UTF-8)
From dbc1a7595403e26fa89e614da77590f208c26755 Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Thu, 6 Nov 2025 16:20:18 +0300
Subject: [PATCH v26 05/10] Remove oldestOffset/oldestOffsetKnown from
 multixact

Since we rewrite all multitransactions during pg_upgrade, the oldest
offset for a new cluster will no longer be missing on disc.
---
 src/backend/access/transam/multixact.c | 101 ++-----------------------
 src/include/access/multixact.h         |   3 -
 2 files changed, 5 insertions(+), 99 deletions(-)

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index e0323ec1014..78ba6d72a92 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -140,14 +140,6 @@ typedef struct MultiXactStateData
 	MultiXactId oldestMultiXactId;
 	Oid			oldestMultiXactDB;
 
-	/*
-	 * Oldest multixact offset that is potentially referenced by a multixact
-	 * referenced by a relation.  We don't always know this value, so there's
-	 * a flag here to indicate whether or not we currently do.
-	 */
-	MultiXactOffset oldestOffset;
-	bool		oldestOffsetKnown;
-
 	/* support for anti-wraparound measures */
 	MultiXactId multiVacLimit;
 	MultiXactId multiWarnLimit;
@@ -2371,10 +2363,7 @@ SetOffsetVacuumLimit(bool is_startup)
 	MultiXactId oldestMultiXactId;
 	MultiXactId nextMXact;
 	MultiXactOffset oldestOffset = 0;	/* placate compiler */
-	MultiXactOffset prevOldestOffset;
 	MultiXactOffset nextOffset;
-	bool		oldestOffsetKnown = false;
-	bool		prevOldestOffsetKnown;
 
 	/*
 	 * NB: Have to prevent concurrent truncation, we might otherwise try to
@@ -2387,8 +2376,6 @@ SetOffsetVacuumLimit(bool is_startup)
 	oldestMultiXactId = MultiXactState->oldestMultiXactId;
 	nextMXact = MultiXactState->nextMXact;
 	nextOffset = MultiXactState->nextOffset;
-	prevOldestOffsetKnown = MultiXactState->oldestOffsetKnown;
-	prevOldestOffset = MultiXactState->oldestOffset;
 	Assert(MultiXactState->finishedStartup);
 	LWLockRelease(MultiXactGenLock);
 
@@ -2406,57 +2393,20 @@ SetOffsetVacuumLimit(bool is_startup)
 		 * offset.
 		 */
 		oldestOffset = nextOffset;
-		oldestOffsetKnown = true;
 	}
-	else
+	else if (!find_multixact_start(oldestMultiXactId, &oldestOffset))
 	{
-		/*
-		 * Figure out where the oldest existing multixact's offsets are
-		 * stored. Due to bugs in early release of PostgreSQL 9.3.X and 9.4.X,
-		 * the supposedly-earliest multixact might not really exist.  We are
-		 * careful not to fail in that case.
-		 */
-		oldestOffsetKnown =
-			find_multixact_start(oldestMultiXactId, &oldestOffset);
-
-		if (!oldestOffsetKnown)
-			ereport(LOG,
-					(errmsg("oldest checkpointed MultiXact %u does not exist on disk",
-							oldestMultiXactId)));
+		ereport(LOG,
+				(errmsg("oldest checkpointed MultiXact %u does not exist on disk",
+						oldestMultiXactId)));
 	}
 
 	LWLockRelease(MultiXactTruncationLock);
 
-	/*
-	 * If we can, compute limits (and install them MultiXactState) to prevent
-	 * overrun of old data in the members SLRU area. We can only do so if the
-	 * oldest offset is known though.
-	 *
-	 * FIXME: Is !oldestOffsetKnown possible anymore? At least update the comment:
-	 * we won't overrun members anymore.
-	 */
-	if (prevOldestOffsetKnown)
-	{
-		/*
-		 * If we failed to get the oldest offset this time, but we have a
-		 * value from a previous pass through this function, use the old
-		 * values rather than automatically forcing an autovacuum cycle again.
-		 */
-		oldestOffset = prevOldestOffset;
-		oldestOffsetKnown = true;
-	}
-
-	/* Install the computed values */
-	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
-	MultiXactState->oldestOffset = oldestOffset;
-	MultiXactState->oldestOffsetKnown = oldestOffsetKnown;
-	LWLockRelease(MultiXactGenLock);
-
 	/*
 	 * Do we need autovacuum?	If we're not sure, assume yes.
 	 */
-	return !oldestOffsetKnown ||
-		(nextOffset - oldestOffset > MULTIXACT_MEMBER_AUTOVAC_THRESHOLD);
+	return nextOffset - oldestOffset > MULTIXACT_MEMBER_AUTOVAC_THRESHOLD;
 }
 
 /*
@@ -2503,47 +2453,6 @@ find_multixact_start(MultiXactId multi, MultiXactOffset *result)
 	return true;
 }
 
-/*
- * GetMultiXactInfo
- *
- * Returns information about the current MultiXact state, as of:
- * multixacts: Number of MultiXacts (nextMultiXactId - oldestMultiXactId)
- * members: Number of member entries (nextOffset - oldestOffset)
- * oldestMultiXactId: Oldest MultiXact ID still in use
- * oldestOffset: Oldest offset still in use
- *
- * Returns false if unable to determine, the oldest offset being unknown.
- */
-bool
-GetMultiXactInfo(uint32 *multixacts, MultiXactOffset *members,
-				 MultiXactId *oldestMultiXactId, MultiXactOffset *oldestOffset)
-{
-	MultiXactOffset nextOffset;
-	MultiXactId nextMultiXactId;
-	bool		oldestOffsetKnown;
-
-	LWLockAcquire(MultiXactGenLock, LW_SHARED);
-	nextOffset = MultiXactState->nextOffset;
-	*oldestMultiXactId = MultiXactState->oldestMultiXactId;
-	nextMultiXactId = MultiXactState->nextMXact;
-	*oldestOffset = MultiXactState->oldestOffset;
-	oldestOffsetKnown = MultiXactState->oldestOffsetKnown;
-	LWLockRelease(MultiXactGenLock);
-
-	if (!oldestOffsetKnown)
-	{
-		*members = 0;
-		*multixacts = 0;
-		*oldestMultiXactId = InvalidMultiXactId;
-		*oldestOffset = 0;
-		return false;
-	}
-
-	*members = nextOffset - *oldestOffset;
-	*multixacts = nextMultiXactId - *oldestMultiXactId;
-	return true;
-}
-
 typedef struct mxtruncinfo
 {
 	int64		earliestExistingPage;
diff --git a/src/include/access/multixact.h b/src/include/access/multixact.h
index 7d98fe0fe32..d688b547c54 100644
--- a/src/include/access/multixact.h
+++ b/src/include/access/multixact.h
@@ -109,9 +109,6 @@ extern bool MultiXactIdIsRunning(MultiXactId multi, bool isLockOnly);
 extern void MultiXactIdSetOldestMember(void);
 extern int	GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
 								  bool from_pgupgrade, bool isLockOnly);
-extern bool GetMultiXactInfo(uint32 *multixacts, MultiXactOffset *members,
-							 MultiXactId *oldestMultiXactId,
-							 MultiXactOffset *oldestOffset);
 extern bool MultiXactIdPrecedes(MultiXactId multi1, MultiXactId multi2);
 extern bool MultiXactIdPrecedesOrEquals(MultiXactId multi1,
 										MultiXactId multi2);
-- 
2.47.3
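
As a rough illustration of what SetOffsetVacuumLimit() reduces to after this patch, here is a minimal sketch, assuming 64-bit offsets that never wrap. The names sketch_needs_member_autovacuum and SKETCH_MEMBER_AUTOVAC_THRESHOLD are hypothetical; the threshold value mirrors MULTIXACT_MEMBER_AUTOVAC_THRESHOLD from the diff context.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mirrors MULTIXACT_MEMBER_AUTOVAC_THRESHOLD from the diff (assumed here). */
#define SKETCH_MEMBER_AUTOVAC_THRESHOLD UINT64_C(4000000000)

/*
 * Decide whether autovacuum is needed to reclaim members SLRU space.
 * With 64-bit offsets there is no wraparound, so a plain unsigned
 * subtraction gives the number of members currently in use.
 */
static bool
sketch_needs_member_autovacuum(uint64_t next_offset, uint64_t oldest_offset)
{
	assert(next_offset >= oldest_offset);

	return next_offset - oldest_offset > SKETCH_MEMBER_AUTOVAC_THRESHOLD;
}
```

If the lookup of the oldest multixid fails, the patched function leaves oldestOffset at its initial value of 0, so in this sketch the comparison would then be against next_offset alone.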

v26-0006-Reintroduce-MultiXactMemberFreezeThreshold.patch (text/x-patch; charset=UTF-8)
From 4d309c8240773c29a2c7d222db80608f82868b47 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Thu, 13 Nov 2025 12:38:41 +0200
Subject: [PATCH v26 06/10] Reintroduce MultiXactMemberFreezeThreshold

---
 src/backend/access/transam/multixact.c | 202 ++++++++++++++++++++-----
 src/backend/access/transam/xlog.c      |   4 +-
 src/backend/commands/vacuum.c          |   6 +-
 src/backend/postmaster/autovacuum.c    |   4 +-
 src/include/access/multixact.h         |   4 +-
 5 files changed, 170 insertions(+), 50 deletions(-)

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 78ba6d72a92..c72b2cd7090 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -91,13 +91,13 @@
 
 
 /*
- * Multixact members warning threshold.
- *
- * If the difference between nextOffset and oldestOffset exceeds this value,
- * we trigger autovacuum in order to release disk space consumed by the
- * members SLRU.
+ * Thresholds used to keep members disk usage in check when multixids have a
+ * lot of members.  When MULTIXACT_MEMBER_LOW_THRESHOLD is reached, vacuum
+ * starts freezing multixids more aggressively, even if the normal multixid
+ * age limits haven't been reached yet.
  */
-#define MULTIXACT_MEMBER_AUTOVAC_THRESHOLD		UINT64CONST(4000000000)
+#define MULTIXACT_MEMBER_LOW_THRESHOLD		UINT64CONST(2000000000)
+#define MULTIXACT_MEMBER_HIGH_THRESHOLD		UINT64CONST(4000000000)
 
 static inline MultiXactId
 PreviousMultiXactId(MultiXactId multi)
@@ -140,6 +140,12 @@ typedef struct MultiXactStateData
 	MultiXactId oldestMultiXactId;
 	Oid			oldestMultiXactDB;
 
+	/*
+	 * Oldest multixact offset that is potentially referenced by a multixact
+	 * referenced by a relation.
+	 */
+	MultiXactOffset oldestOffset;
+
 	/* support for anti-wraparound measures */
 	MultiXactId multiVacLimit;
 	MultiXactId multiWarnLimit;
@@ -276,7 +282,7 @@ static bool MultiXactOffsetPrecedes(MultiXactOffset offset1,
 									MultiXactOffset offset2);
 static void ExtendMultiXactOffset(MultiXactId multi);
 static void ExtendMultiXactMember(MultiXactOffset offset, int nmembers);
-static bool SetOffsetVacuumLimit(bool is_startup);
+static void SetOffsetVacuumLimit(void);
 static bool find_multixact_start(MultiXactId multi, MultiXactOffset *result);
 static void WriteMTruncateXlogRec(Oid oldestMultiDB,
 								  MultiXactId startTruncOff,
@@ -1945,8 +1951,8 @@ TrimMultiXact(void)
 	MultiXactState->finishedStartup = true;
 	LWLockRelease(MultiXactGenLock);
 
-	/* Now compute how far away the next members wraparound is. */
-	SetMultiXactIdLimit(oldestMXact, oldestMXactDB, true);
+	/* Now compute how far away the next multixid wraparound is. */
+	SetMultiXactIdLimit(oldestMXact, oldestMXactDB);
 }
 
 /*
@@ -2015,28 +2021,24 @@ MultiXactSetNextMXact(MultiXactId nextMulti,
  * datminmxid (ie, the oldest MultiXactId that might exist in any database
  * of our cluster), and the OID of the (or a) database with that value.
  *
- * is_startup is true when we are just starting the cluster, false when we
- * are updating state in a running cluster.  This only affects log messages.
+ * This also updates MultiXactState->oldestOffset, by looking up the offset of
+ * MultiXactState->oldestMultiXactId.
  */
 void
-SetMultiXactIdLimit(MultiXactId oldest_datminmxid, Oid oldest_datoid,
-					bool is_startup)
+SetMultiXactIdLimit(MultiXactId oldest_datminmxid, Oid oldest_datoid)
 {
 	MultiXactId multiVacLimit;
 	MultiXactId multiWarnLimit;
 	MultiXactId multiStopLimit;
 	MultiXactId multiWrapLimit;
 	MultiXactId curMulti;
-	bool		needs_offset_vacuum;
 
 	Assert(MultiXactIdIsValid(oldest_datminmxid));
 
 	/*
 	 * We pretend that a wrap will happen halfway through the multixact ID
 	 * space, but that's not really true, because multixacts wrap differently
-	 * from transaction IDs.  Note that, separately from any concern about
-	 * multixact IDs wrapping, we must ensure that multixact members do not
-	 * wrap.  Limits for that are set in SetOffsetVacuumLimit, not here.
+	 * from transaction IDs.
 	 */
 	multiWrapLimit = oldest_datminmxid + (MaxMultiXactId >> 1);
 	if (multiWrapLimit < FirstMultiXactId)
@@ -2104,8 +2106,13 @@ SetMultiXactIdLimit(MultiXactId oldest_datminmxid, Oid oldest_datoid,
 
 	Assert(!InRecovery);
 
-	/* Set limits for offset vacuum. */
-	needs_offset_vacuum = SetOffsetVacuumLimit(is_startup);
+	/*
+	 * Offsets are 64 bits wide and never wrap around, so we don't need to
+	 * consider them for emergency autovacuum purposes.  But now that we're in
+	 * a consistent state, determine MultiXactState->oldestOffset, to be used
+	 * to calculate freezing cutoff to keep the offsets disk usage in check.
+	 */
+	SetOffsetVacuumLimit();
 
 	/*
 	 * If past the autovacuum force point, immediately signal an autovac
@@ -2114,8 +2121,7 @@ SetMultiXactIdLimit(MultiXactId oldest_datminmxid, Oid oldest_datoid,
 	 * database, it'll call here, and we'll signal the postmaster to start
 	 * another iteration immediately if there are still any old databases.
 	 */
-	if ((MultiXactIdPrecedes(multiVacLimit, curMulti) ||
-		 needs_offset_vacuum) && IsUnderPostmaster)
+	if (MultiXactIdPrecedes(multiVacLimit, curMulti) && IsUnderPostmaster)
 		SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER);
 
 	/* Give an immediate warning if past the wrap warn point */
@@ -2198,7 +2204,7 @@ MultiXactAdvanceOldest(MultiXactId oldestMulti, Oid oldestMultiDB)
 	Assert(InRecovery);
 
 	if (MultiXactIdPrecedes(MultiXactState->oldestMultiXactId, oldestMulti))
-		SetMultiXactIdLimit(oldestMulti, oldestMultiDB, false);
+		SetMultiXactIdLimit(oldestMulti, oldestMultiDB);
 }
 
 /*
@@ -2348,22 +2354,17 @@ GetOldestMultiXactId(void)
 }
 
 /*
- * Determine if we need to vacuum to keep the size of the members SLRU in
- * check.
- *
- * To do so determine what's the oldest member offset and install the limit
- * info in MultiXactState, where it can be used to prevent overrun of old data
- * in the members SLRU area.
- *
- * The return value is true if autovacuum is required and false otherwise.
+ * Determine what's the oldest member offset and install it in MultiXactState,
+ * where it can be used to adjust multixid freezing cutoffs.
  */
-static bool
-SetOffsetVacuumLimit(bool is_startup)
+static void
+SetOffsetVacuumLimit(void)
 {
 	MultiXactId oldestMultiXactId;
 	MultiXactId nextMXact;
 	MultiXactOffset oldestOffset = 0;	/* placate compiler */
 	MultiXactOffset nextOffset;
+	bool		oldestOffsetKnown = false;
 
 	/*
 	 * NB: Have to prevent concurrent truncation, we might otherwise try to
@@ -2393,20 +2394,37 @@ SetOffsetVacuumLimit(bool is_startup)
 		 * offset.
 		 */
 		oldestOffset = nextOffset;
+		oldestOffsetKnown = true;
 	}
-	else if (!find_multixact_start(oldestMultiXactId, &oldestOffset))
+	else
 	{
-		ereport(LOG,
-				(errmsg("oldest checkpointed MultiXact %u does not exist on disk",
-						oldestMultiXactId)));
+		/*
+		 * Figure out the offset at which the oldest existing multixact's
+		 * members are stored.  If we cannot find it, be careful not to fail.
+		 * (Due to bugs in early releases of PostgreSQL 9.3.X and 9.4.X, the
+		 * supposedly-earliest multixact might not really exist.  Those should
+		 * be long gone by now, but let's nevertheless be careful not to fail
+		 * in that case.)
+		 */
+		oldestOffsetKnown =
+			find_multixact_start(oldestMultiXactId, &oldestOffset);
+
+		if (!oldestOffsetKnown)
+			ereport(LOG,
+					(errmsg("oldest checkpointed MultiXact %u does not exist on disk",
+							oldestMultiXactId)));
+		/* no early return here: MultiXactTruncationLock is released below */
 	}
 
 	LWLockRelease(MultiXactTruncationLock);
 
-	/*
-	 * Do we need autovacuum?	If we're not sure, assume yes.
-	 */
-	return nextOffset - oldestOffset > MULTIXACT_MEMBER_AUTOVAC_THRESHOLD;
+	/* Install the computed value */
+	if (oldestOffsetKnown)
+	{
+		LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
+		MultiXactState->oldestOffset = oldestOffset;
+		LWLockRelease(MultiXactGenLock);
+	}
 }
 
 /*
@@ -2453,6 +2471,107 @@ find_multixact_start(MultiXactId multi, MultiXactOffset *result)
 	return true;
 }
 
+/*
+ * Determine how many multixacts, and how many multixact members, currently
+ * exist.
+ */
+static void
+ReadMultiXactCounts(uint32 *multixacts, MultiXactOffset *members)
+{
+	MultiXactOffset nextOffset;
+	MultiXactOffset oldestOffset;
+	MultiXactId oldestMultiXactId;
+	MultiXactId nextMultiXactId;
+
+	LWLockAcquire(MultiXactGenLock, LW_SHARED);
+	nextOffset = MultiXactState->nextOffset;
+	oldestMultiXactId = MultiXactState->oldestMultiXactId;
+	nextMultiXactId = MultiXactState->nextMXact;
+	oldestOffset = MultiXactState->oldestOffset;
+	LWLockRelease(MultiXactGenLock);
+
+	*members = nextOffset - oldestOffset;
+	*multixacts = nextMultiXactId - oldestMultiXactId;
+}
+
+/*
+ * Multixact members can be removed once the multixacts that refer to them are
+ * older than every datminmxid.  autovacuum_multixact_freeze_max_age and
+ * vacuum_multixact_freeze_table_age work together to make sure we never have
+ * too many multixacts; we hope that, at least under normal circumstances,
+ * this will also be sufficient to keep us from using too many offsets.
+ * However, if the average multixact has many members, we might accumulate a
+ * huge amount of members, consuming disk space, while still using few enough
+ * multixids that the multixid limits fail to trigger relminmxid advancement
+ * by VACUUM.
+ *
+ * To prevent that, if more than a certain amount of members space is used
+ * (MULTIXACT_MEMBER_LOW_THRESHOLD), we effectively reduce
+ * autovacuum_multixact_freeze_max_age to a value just less than the number of
+ * multixacts in use.  We hope that this will quickly trigger autovacuuming on
+ * the table or tables with the oldest relminmxid, thus allowing datminmxid
+ * values to advance and removing some members.
+ *
+ * As the amount of the member space in use grows, we become more aggressive
+ * in clamping this value.  That not only causes autovacuum to ramp up, but
+ * also makes any manual vacuums the user issues more aggressive.  This
+ * happens because vacuum_get_cutoffs() will clamp the freeze table and the
+ * minimum freeze age cutoffs based on the effective
+ * autovacuum_multixact_freeze_max_age this function returns.  At the extreme,
+ * when the members usage reaches MULTIXACT_MEMBER_HIGH_THRESHOLD, we'll clamp
+ * freeze_max_age to zero, and every vacuum of any table will freeze every
+ * multixact.
+ */
+int
+MultiXactMemberFreezeThreshold(void)
+{
+	MultiXactOffset members;
+	uint32		multixacts;
+	uint32		victim_multixacts;
+	double		fraction;
+	int			result;
+
+	/*
+	 * Read the current offsets and members usage.
+	 *
+	 * Note: In the case that we have been unable to calculate oldestOffset,
+	 * because we failed to find the offset of the oldest multixid, we assume
+	 * the worst because oldestOffset will be left at zero in that case.
+	 */
+	ReadMultiXactCounts(&multixacts, &members);
+
+	/* If member space utilization is low, no special action is required. */
+	if (members <= MULTIXACT_MEMBER_LOW_THRESHOLD)
+		return autovacuum_multixact_freeze_max_age;
+
+	/*
+	 * Compute a target for relminmxid advancement.  The number of multixacts
+	 * we try to eliminate from the system is based on how far we are past
+	 * MULTIXACT_MEMBER_LOW_THRESHOLD.
+	 *
+	 * The way this formula works is that when members is exactly at the low
+	 * threshold, fraction == 0.0, and we set freeze_max_age equal to
+	 * mxid_age(oldestMultiXactId).  As members grows further, towards the
+	 * high threshold, fraction grows linearly from 0.0 to 1.0, and the result
+	 * shrinks from mxid_age(oldestMultiXactId) to 0.  Beyond the high
+	 * threshold, fraction > 1.0 and the result is clamped to 0.
+	 */
+	fraction = (double) (members - MULTIXACT_MEMBER_LOW_THRESHOLD) /
+		(MULTIXACT_MEMBER_HIGH_THRESHOLD - MULTIXACT_MEMBER_LOW_THRESHOLD);
+	victim_multixacts = multixacts * fraction;
+
+	/* fraction could be > 1.0, but lowest possible freeze age is zero */
+	if (victim_multixacts > multixacts)
+		return 0;
+	result = multixacts - victim_multixacts;
+
+	/*
+	 * Clamp to autovacuum_multixact_freeze_max_age, so that we never make
+	 * autovacuum less aggressive than it would otherwise be.
+	 */
+	return Min(result, autovacuum_multixact_freeze_max_age);
+}
+
 typedef struct mxtruncinfo
 {
 	int64		earliestExistingPage;
@@ -2669,6 +2788,7 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
 	MultiXactState->oldestMultiXactId = newOldestMulti;
 	MultiXactState->oldestMultiXactDB = newOldestMultiDB;
+	MultiXactState->oldestOffset = newOldestOffset;
 	LWLockRelease(MultiXactGenLock);
 
 	/* First truncate members */
@@ -2864,7 +2984,7 @@ multixact_redo(XLogReaderState *record)
 		 * Advance the horizon values, so they're current at the end of
 		 * recovery.
 		 */
-		SetMultiXactIdLimit(xlrec.endTruncOff, xlrec.oldestMultiDB, false);
+		SetMultiXactIdLimit(xlrec.endTruncOff, xlrec.oldestMultiDB);
 
 		PerformMembersTruncation(xlrec.startTruncMemb, xlrec.endTruncMemb);
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index ef405d66b3b..a000b8bd509 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5155,7 +5155,7 @@ BootStrapXLOG(uint32 data_checksum_version)
 	MultiXactSetNextMXact(checkPoint.nextMulti, checkPoint.nextMultiOffset);
 	AdvanceOldestClogXid(checkPoint.oldestXid);
 	SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
-	SetMultiXactIdLimit(checkPoint.oldestMulti, checkPoint.oldestMultiDB, true);
+	SetMultiXactIdLimit(checkPoint.oldestMulti, checkPoint.oldestMultiDB);
 	SetCommitTsLimit(InvalidTransactionId, InvalidTransactionId);
 
 	/* Set up the XLOG page header */
@@ -5636,7 +5636,7 @@ StartupXLOG(void)
 	MultiXactSetNextMXact(checkPoint.nextMulti, checkPoint.nextMultiOffset);
 	AdvanceOldestClogXid(checkPoint.oldestXid);
 	SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
-	SetMultiXactIdLimit(checkPoint.oldestMulti, checkPoint.oldestMultiDB, true);
+	SetMultiXactIdLimit(checkPoint.oldestMulti, checkPoint.oldestMultiDB);
 	SetCommitTsLimit(checkPoint.oldestCommitTsXid,
 					 checkPoint.newestCommitTsXid);
 
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 100e1a72c22..bd4278cd250 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -1146,9 +1146,9 @@ vacuum_get_cutoffs(Relation rel, const VacuumParams params,
 	/*
 	 * Also compute the multixact age for which freezing is urgent.  This is
 	 * normally autovacuum_multixact_freeze_max_age, but may be less if we are
-	 * short of multixact member space.
+	 * short of multixact member space. XXX update comment
 	 */
-	effective_multixact_freeze_max_age = autovacuum_multixact_freeze_max_age;
+	effective_multixact_freeze_max_age = MultiXactMemberFreezeThreshold();
 
 	/*
 	 * Almost ready to set freeze output parameters; check if OldestXmin or
@@ -1971,7 +1971,7 @@ vac_truncate_clog(TransactionId frozenXID,
 	 * signaling twice?
 	 */
 	SetTransactionIdLimit(frozenXID, oldestxid_datoid);
-	SetMultiXactIdLimit(minMulti, minmulti_datoid, false);
+	SetMultiXactIdLimit(minMulti, minmulti_datoid);
 
 	LWLockRelease(WrapLimitsVacuumLock);
 }
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index bf66f494e3a..1c38488f2cb 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -1151,7 +1151,7 @@ do_start_worker(void)
 
 	/* Also determine the oldest datminmxid we will consider. */
 	recentMulti = ReadNextMultiXactId();
-	multiForceLimit = recentMulti - autovacuum_multixact_freeze_max_age;
+	multiForceLimit = recentMulti - MultiXactMemberFreezeThreshold();
 	if (multiForceLimit < FirstMultiXactId)
 		multiForceLimit -= FirstMultiXactId;
 
@@ -1939,7 +1939,7 @@ do_autovacuum(void)
 	 * normally autovacuum_multixact_freeze_max_age, but may be less if we are
 	 * short of multixact member space.
 	 */
-	effective_multixact_freeze_max_age = autovacuum_multixact_freeze_max_age;
+	effective_multixact_freeze_max_age = MultiXactMemberFreezeThreshold();
 
 	/*
 	 * Find the pg_database entry and select the default freeze ages. We use
diff --git a/src/include/access/multixact.h b/src/include/access/multixact.h
index d688b547c54..cfff86f655f 100644
--- a/src/include/access/multixact.h
+++ b/src/include/access/multixact.h
@@ -126,8 +126,7 @@ extern void BootStrapMultiXact(void);
 extern void StartupMultiXact(void);
 extern void TrimMultiXact(void);
 extern void SetMultiXactIdLimit(MultiXactId oldest_datminmxid,
-								Oid oldest_datoid,
-								bool is_startup);
+								Oid oldest_datoid);
 extern void MultiXactGetCheckptMulti(bool is_shutdown,
 									 MultiXactId *nextMulti,
 									 MultiXactOffset *nextMultiOffset,
@@ -142,6 +141,7 @@ extern void MultiXactSetNextMXact(MultiXactId nextMulti,
 extern void MultiXactAdvanceNextMXact(MultiXactId minMulti,
 									  MultiXactOffset minMultiOffset);
 extern void MultiXactAdvanceOldest(MultiXactId oldestMulti, Oid oldestMultiDB);
+extern int	MultiXactMemberFreezeThreshold(void);
 
 extern void multixact_twophase_recover(FullTransactionId fxid, uint16 info,
 									   void *recdata, uint32 len);
-- 
2.47.3
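
For readers following the arithmetic in MultiXactMemberFreezeThreshold()
above, here is a minimal stand-alone sketch of the same linear interpolation.
It is not the patch code: member_freeze_threshold() and its parameters are
hypothetical names, the two constants are copied from the patch, and the
fraction >= 1.0 check is done before the multiplication, whereas the patch
compares victim_multixacts against multixacts afterwards.

```c
#include <assert.h>
#include <stdint.h>

/* Values copied from the patch; the names are shortened for this sketch. */
#define MEMBER_LOW_THRESHOLD	UINT64_C(2000000000)
#define MEMBER_HIGH_THRESHOLD	UINT64_C(4000000000)

/*
 * Hypothetical stand-alone version of the clamping arithmetic.
 *
 * members        - number of member entries currently in use
 * multixacts     - number of multixacts currently in use
 * freeze_max_age - stands in for autovacuum_multixact_freeze_max_age
 */
static int
member_freeze_threshold(uint64_t members, uint32_t multixacts,
						int freeze_max_age)
{
	double		fraction;
	uint32_t	victim_multixacts;
	int64_t		result;

	assert(freeze_max_age >= 0);

	/* Below the low threshold, no special action is required. */
	if (members <= MEMBER_LOW_THRESHOLD)
		return freeze_max_age;

	/* Grows linearly from 0.0 at the low threshold to 1.0 at the high one. */
	fraction = (double) (members - MEMBER_LOW_THRESHOLD) /
		(double) (MEMBER_HIGH_THRESHOLD - MEMBER_LOW_THRESHOLD);

	/* At or beyond the high threshold: freeze everything. */
	if (fraction >= 1.0)
		return 0;

	victim_multixacts = (uint32_t) (multixacts * fraction);
	result = (int64_t) multixacts - victim_multixacts;

	/* Never be less aggressive than the configured setting. */
	return result < freeze_max_age ? (int) result : freeze_max_age;
}
```

With 3 billion members in use (halfway between the two thresholds),
1,000,000 multixacts and a freeze_max_age of 400 million, the sketch yields
500000: half of the existing multixacts become freezing targets.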

v26-0007-TEST-Add-test-for-64-bit-mxoff-in-pg_resetwal.patch (text/x-patch; charset=UTF-8)
From 945f02abac316f4fbbd4187679e9b1d2a294e08e Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Tue, 28 Oct 2025 19:08:26 +0300
Subject: [PATCH v26 07/10] TEST: Add test for 64-bit mxoff in pg_resetwal

---
 src/bin/pg_resetwal/meson.build    |   1 +
 src/bin/pg_resetwal/t/003_mxoff.pl | 170 +++++++++++++++++++++++++++++
 2 files changed, 171 insertions(+)
 create mode 100644 src/bin/pg_resetwal/t/003_mxoff.pl

diff --git a/src/bin/pg_resetwal/meson.build b/src/bin/pg_resetwal/meson.build
index 290832b2299..1e2dfb38a5b 100644
--- a/src/bin/pg_resetwal/meson.build
+++ b/src/bin/pg_resetwal/meson.build
@@ -25,6 +25,7 @@ tests += {
     'tests': [
       't/001_basic.pl',
       't/002_corrupted.pl',
+      't/003_mxoff.pl',
     ],
   },
 }
diff --git a/src/bin/pg_resetwal/t/003_mxoff.pl b/src/bin/pg_resetwal/t/003_mxoff.pl
new file mode 100644
index 00000000000..3c1b7fa1d33
--- /dev/null
+++ b/src/bin/pg_resetwal/t/003_mxoff.pl
@@ -0,0 +1,170 @@
+
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+use strict;
+use warnings FATAL => 'all';
+
+use Math::BigInt;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+sub mxact_eater
+{
+	my $node = shift;
+	my $tbl = shift;
+
+	$node->start;
+	$node->safe_psql('postgres',
+		"CREATE TABLE ${tbl} (I INT PRIMARY KEY, N_UPDATED INT) " .
+		"       WITH (AUTOVACUUM_ENABLED=FALSE);" .
+		"INSERT INTO ${tbl} SELECT G, 0 FROM GENERATE_SERIES(1, 50) G;");
+
+	# consume around 10k multixact-offsets
+	my $nclients = 10;
+	my $update_every = 75;
+	my @connections = ();
+
+	for (1..$nclients)
+	{
+		my $conn = $node->background_psql('postgres');
+		$conn->query_safe("BEGIN");
+
+		push(@connections, $conn);
+	}
+
+	for (my $i = 0; $i < 1000; $i++)
+	{
+		my $conn = $connections[$i % $nclients];
+
+		$conn->query_safe("COMMIT;");
+		$conn->query_safe("BEGIN");
+
+		if ($i % $update_every == 0)
+		{
+			$conn->query_safe(
+				"UPDATE ${tbl} SET " .
+				"N_UPDATED = N_UPDATED + 1 " .
+				"WHERE I = ${i} % 50");
+		}
+		else
+		{
+			$conn->query_safe(
+				"SELECT * FROM ${tbl} FOR KEY SHARE");
+		}
+	}
+
+	for my $conn (@connections)
+	{
+		$conn->quit();
+	}
+
+	$node->stop;
+}
+
+sub next_mxoff
+{
+	my $node = shift;
+	my ($stdout, $stderr) =
+	  run_command([ 'pg_controldata', $node->data_dir ]);
+	my @control_data = split("\n", $stdout);
+	my $next_mxoff = undef;
+
+	foreach (@control_data)
+	{
+		if ($_ =~ /^Latest checkpoint's NextMultiOffset:\s*(.*)$/mg)
+		{
+			$next_mxoff = $1;
+			last;
+		}
+	}
+	die "NextMultiOffset not found in control file\n"
+		unless defined($next_mxoff);
+
+	return $next_mxoff;
+}
+
+sub reset_mxoff
+{
+	my $node = shift;
+	my $offset = shift;
+	$offset = Math::BigInt->new($offset);
+
+	# Get block size
+	my $out = (run_command([ 'pg_resetwal', '--dry-run', $node->data_dir ]))[0];
+	$out =~ /^Database block size: *(\d+)$/m or die;
+	my $blcksz = $1;
+
+	# Reset to new offset
+	my @cmd = ('pg_resetwal', '--pgdata' => $node->data_dir);
+	push @cmd, '--multixact-offset' => $offset->as_hex();
+	command_ok(\@cmd, 'set oldest multixact-offset');
+
+	# Fill empty pg_multixact/members segment
+	my $mult = 32 * int($blcksz / 20) * 4;
+	my $segname = sprintf "%015X", $offset / $mult;
+
+	my @dd = ('dd');
+	push @dd, "if=/dev/zero";
+	push @dd, "of=" . $node->data_dir . "/pg_multixact/members/" . $segname;
+	push @dd, "bs=$blcksz";
+	push @dd, "count=32";
+	command_ok(\@dd, 'fill empty multixact-members');
+}
+
+my ($off1, $off2);
+
+# start from defaults
+my $node1 = PostgreSQL::Test::Cluster->new('node1');
+$node1->init;
+$off1 = next_mxoff($node1);
+mxact_eater($node1, "FOO");
+$off2 = next_mxoff($node1);
+note "> start from $off1, finished at $off2\n";
+
+# start from before 32-bit wraparound
+my $node2 = PostgreSQL::Test::Cluster->new('node2');
+$node2->init;
+reset_mxoff($node2, 0xFFFF0000);
+$off1 = next_mxoff($node2);
+mxact_eater($node2, "FOO");
+$off2 = next_mxoff($node2);
+note "> start from $off1, finished at $off2\n";
+
+# start near 32-bit wraparound
+my $node3 = PostgreSQL::Test::Cluster->new('node3');
+$node3->init;
+reset_mxoff($node3, 0xFFFFEC77);
+$off1 = next_mxoff($node3);
+mxact_eater($node3, "FOO");
+$off2 = next_mxoff($node3);
+note "> start from $off1, finished at $off2\n";
+
+# start over 32-bit wraparound
+my $node4 = PostgreSQL::Test::Cluster->new('node4');
+$node4->init;
+reset_mxoff($node4, '0xFFFFFFFF0000');
+$off1 = next_mxoff($node4);
+mxact_eater($node4, "FOO");
+$off2 = next_mxoff($node4);
+note "> start from $off1, finished at $off2\n";
+
+# check invariant
+$node1->start;
+$node2->start;
+$node3->start;
+$node4->start;
+
+my $var1 = $node1->safe_psql('postgres', 'TABLE FOO');
+my $var2 = $node2->safe_psql('postgres', 'TABLE FOO');
+my $var3 = $node3->safe_psql('postgres', 'TABLE FOO');
+my $var4 = $node4->safe_psql('postgres', 'TABLE FOO');
+ok($var1 eq $var2 && $var2 eq $var3 && $var3 eq $var4,
+	'check table invariant in all nodes');
+
+$node4->stop;
+$node3->stop;
+$node2->stop;
+$node1->stop;
+
+done_testing();
-- 
2.47.3
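
The reset_mxoff() helper in the test above derives the members segment file
name from the offset. A minimal C sketch of that calculation follows,
assuming only what the Perl code shows: 32 pages per segment, 20-byte groups
holding 4 members each, and a 15-digit upper-case hex file name. The function
names are hypothetical and are not part of PostgreSQL.

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Segment number holding the given member offset (mirrors the Perl test). */
static uint64_t
members_segment_number(uint64_t offset, uint32_t blcksz)
{
	uint64_t	members_per_segment;

	assert(blcksz >= 20);

	/* 32 pages per segment, blcksz / 20 groups per page, 4 members each */
	members_per_segment = (uint64_t) 32 * (blcksz / 20) * 4;

	return offset / members_per_segment;
}

/* Format the segment file name the way the test does: "%015X". */
static int
members_segment_name(uint64_t offset, uint32_t blcksz,
					 char *buf, size_t buflen)
{
	return snprintf(buf, buflen, "%015" PRIX64,
					members_segment_number(offset, blcksz));
}
```

For an 8 kB block size, offset 0xFFFF0000 falls into segment 0x14076, so the
file that the test fills with zeros is named 000000000014076.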

v26-0008-TEST-Add-test-for-wraparound-of-next-new-multi-i.patch (text/x-patch; charset=UTF-8)
From e6377d7e82a04bfa8007d6bf0353ca414e848434 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Wed, 12 Nov 2025 13:36:09 +0200
Subject: [PATCH v26 08/10] TEST: Add test for wraparound of next new multi in
 pg_upgrade

Related to BUG #18863 and BUG #18865
---
 src/bin/pg_upgrade/meson.build         |   1 +
 src/bin/pg_upgrade/t/008_multi_wrap.pl | 176 +++++++++++++++++++++++++
 2 files changed, 177 insertions(+)
 create mode 100644 src/bin/pg_upgrade/t/008_multi_wrap.pl

diff --git a/src/bin/pg_upgrade/meson.build b/src/bin/pg_upgrade/meson.build
index fff0db3b560..28cd29d666e 100644
--- a/src/bin/pg_upgrade/meson.build
+++ b/src/bin/pg_upgrade/meson.build
@@ -51,6 +51,7 @@ tests += {
       't/005_char_signedness.pl',
       't/006_transfer_modes.pl',
       't/007_multixact_conversion.pl',
+      't/008_multi_wrap.pl',
     ],
     'test_kwargs': {'priority': 40}, # pg_upgrade tests are slow
   },
diff --git a/src/bin/pg_upgrade/t/008_multi_wrap.pl b/src/bin/pg_upgrade/t/008_multi_wrap.pl
new file mode 100644
index 00000000000..0ad8fd59906
--- /dev/null
+++ b/src/bin/pg_upgrade/t/008_multi_wrap.pl
@@ -0,0 +1,176 @@
+
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+use strict;
+use warnings FATAL => 'all';
+
+use Math::BigInt;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use PostgreSQL::Test::AdjustDump;
+use PostgreSQL::Test::AdjustUpgrade;
+use Test::More;
+
+# Temp dir for dumps.
+my $tempdir = PostgreSQL::Test::Utils::tempdir;
+
+# Can be changed to test the other modes.
+my $mode = $ENV{PG_TEST_PG_UPGRADE_MODE} || '--copy';
+
+# Handy pg_resetwal wrapper
+sub reset_mxoff
+{
+	my %args = @_;
+
+	my $node = $args{node};
+	my $offset = $args{offset};
+	my $multi = $args{multi};
+	my $blcksz = sub # Get block size
+	{
+		my $out = (run_command([ 'pg_resetwal', '--dry-run',
+								 $node->data_dir ]))[0];
+		$out =~ /^Database block size: *(\d+)$/m or die;
+		return $1;
+	}->();
+
+	my @cmd;
+
+	# Reset cluster
+	@cmd = ('pg_resetwal', '--pgdata' => $node->data_dir);
+	if (defined($offset))
+	{
+		push @cmd, '--multixact-offset' => $offset;
+	}
+	if (defined($multi))
+	{
+		push @cmd, "--multixact-ids=$multi,$multi";
+	}
+	command_ok(\@cmd, 'reset multi/offset');
+
+	my $n_items;
+	my $segname;
+
+	# Fill empty pg_multixact segments
+	if (defined($offset))
+	{
+		$n_items = 32 * int($blcksz / 20) * 4;
+		$segname = sprintf "%015X", ($offset / $n_items);
+		$segname = $node->data_dir . "/pg_multixact/members/" . $segname;
+
+		@cmd = ('dd');
+		push @cmd, "if=/dev/zero";
+		push @cmd, "of=" . $segname;
+		push @cmd, "bs=$blcksz";
+		push @cmd, "count=32";
+		command_ok(\@cmd, 'fill empty multixact-members');
+	}
+
+	if (defined($multi))
+	{
+		$n_items = 32 * int($blcksz / 8);
+		$segname = sprintf "%04X", $multi / $n_items;
+		$segname = $node->data_dir . "/pg_multixact/offsets/" . $segname;
+
+		@cmd = ('dd');
+		push @cmd, "if=/dev/zero";
+		push @cmd, "of=" . $segname;
+		push @cmd, "bs=$blcksz";
+		push @cmd, "count=32";
+		command_ok(\@cmd, 'fill empty multixact-offsets');
+	}
+}
+
+sub get_dump_for_comparison
+{
+	my ($node, $db, $file_prefix, $adjust_child_columns) = @_;
+
+	my $dumpfile = $tempdir . '/' . $file_prefix . '.sql';
+	my $dump_adjusted = "${dumpfile}_adjusted";
+
+	open(my $dh, '>', $dump_adjusted)
+	  || die "could not open $dump_adjusted for writing $!";
+
+	$node->run_log(
+		[
+			'pg_dump', '--no-sync',
+			'--restrict-key' => 'test',
+			'-d' => $node->connstr($db),
+			'-f' => $dumpfile
+		]);
+
+	print $dh adjust_regress_dumpfile(slurp_file($dumpfile),
+		$adjust_child_columns);
+	close($dh);
+
+	return $dump_adjusted;
+}
+
+# Create old node
+my $old = PostgreSQL::Test::Cluster->new("old");
+$old->init;
+reset_mxoff(node => $old, multi => 4294967295, offset => 429496729);
+
+$old->start;
+$old->safe_psql('postgres',
+qq(
+	CREATE TABLE test_table (id integer NOT NULL PRIMARY KEY, val text);
+	INSERT INTO test_table VALUES (1, 'a');
+));
+
+my $conn1 = $old->background_psql('postgres');
+my $conn2 = $old->background_psql('postgres');
+
+$conn1->query_safe(qq(
+	BEGIN;
+	SELECT * FROM test_table WHERE id = 1 FOR SHARE;
+));
+$conn2->query_safe(qq(
+	BEGIN;
+	SELECT * FROM test_table WHERE id = 1 FOR SHARE;
+));
+
+$conn1->query_safe(qq(COMMIT;));
+$conn2->query_safe(qq(COMMIT;));
+
+$conn1->quit;
+$conn2->quit;
+
+$old->stop;
+
+# Create new node
+my $new = PostgreSQL::Test::Cluster->new("new");
+$new->init;
+
+# Run pg_upgrade
+command_ok(
+	[
+		'pg_upgrade', '--no-sync',
+		'--old-datadir' => $old->data_dir,
+		'--new-datadir' => $new->data_dir,
+		'--old-bindir' => $old->config_data('--bindir'),
+		'--new-bindir' => $new->config_data('--bindir'),
+		'--socketdir' => $new->host,
+		'--old-port' => $old->port,
+		'--new-port' => $new->port,
+		$mode,
+	],
+	'run of pg_upgrade for new instance');
+ok( !-d $new->data_dir . "/pg_upgrade_output.d",
+	"pg_upgrade_output.d/ removed after pg_upgrade success");
+
+$old->start;
+my $src_dump =
+	get_dump_for_comparison($old, 'postgres',
+							"oldnode_1_dump", 0);
+$old->stop;
+
+$new->start;
+my $dst_dump =
+	get_dump_for_comparison($new, 'postgres',
+							"newnode_1_dump", 0);
+$new->stop;
+
+compare_files($src_dump, $dst_dump,
+	'dump outputs from original and restored regression databases match');
+
+done_testing();
-- 
2.47.3
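
The test above starts the old cluster at multixact ID 4294967295, so the next
new multixact has to wrap around. The sketch below illustrates the two pieces
of arithmetic involved, under stated assumptions: the offsets segment number
follows the Perl helper (32 pages of blcksz / 8 entries each), and
next_multi() assumes that ID 0 is reserved as invalid, so the counter
continues at 1 after wrapping. Both function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Offsets segment number for a multixact ID (mirrors the Perl helper). */
static uint32_t
offsets_segment_number(uint32_t multi, uint32_t blcksz)
{
	uint32_t	entries_per_segment;

	assert(blcksz >= 8);

	/* 32 pages per segment, one 8-byte offset entry per multixact */
	entries_per_segment = 32 * (blcksz / 8);

	return multi / entries_per_segment;
}

/* Next multixact ID, skipping 0 (assumed to be the invalid ID) on wrap. */
static uint32_t
next_multi(uint32_t multi)
{
	multi++;					/* unsigned arithmetic wraps modulo 2^32 */
	if (multi == 0)
		multi = 1;
	return multi;
}
```

With an 8 kB block size, multixact 4294967295 lives in segment 0x1FFFF, which
the "%04X" format used by the test prints as the five-digit name 1FFFF.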

v26-0009-TEST-add-consume_multixids-function.patch (text/x-patch; charset=UTF-8)
From e021a6aa69f93ff9c8bc9feec37d272c63dbe495 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Tue, 1 Apr 2025 21:01:07 +0300
Subject: [PATCH v26 09/10] TEST: add consume_multixids function

---
 src/test/modules/xid_wraparound/Makefile      |  1 +
 src/test/modules/xid_wraparound/meson.build   |  1 +
 .../xid_wraparound/multixid_wraparound.c      | 96 +++++++++++++++++++
 .../xid_wraparound/xid_wraparound--1.0.sql    |  4 +
 4 files changed, 102 insertions(+)
 create mode 100644 src/test/modules/xid_wraparound/multixid_wraparound.c

diff --git a/src/test/modules/xid_wraparound/Makefile b/src/test/modules/xid_wraparound/Makefile
index 7a6e0f66762..ebb3d8fcb3e 100644
--- a/src/test/modules/xid_wraparound/Makefile
+++ b/src/test/modules/xid_wraparound/Makefile
@@ -3,6 +3,7 @@
 MODULE_big = xid_wraparound
 OBJS = \
 	$(WIN32RES) \
+	multixid_wraparound.o \
 	xid_wraparound.o
 PGFILEDESC = "xid_wraparound - tests for XID wraparound"
 
diff --git a/src/test/modules/xid_wraparound/meson.build b/src/test/modules/xid_wraparound/meson.build
index 3aec430df8c..ce4ac468830 100644
--- a/src/test/modules/xid_wraparound/meson.build
+++ b/src/test/modules/xid_wraparound/meson.build
@@ -1,6 +1,7 @@
 # Copyright (c) 2023-2025, PostgreSQL Global Development Group
 
 xid_wraparound_sources = files(
+  'multixid_wraparound.c',
   'xid_wraparound.c',
 )
 
diff --git a/src/test/modules/xid_wraparound/multixid_wraparound.c b/src/test/modules/xid_wraparound/multixid_wraparound.c
new file mode 100644
index 00000000000..af567c6e541
--- /dev/null
+++ b/src/test/modules/xid_wraparound/multixid_wraparound.c
@@ -0,0 +1,96 @@
+/*--------------------------------------------------------------------------
+ *
+ * multixid_wraparound.c
+ *		Utilities for testing multixids
+ *
+ *
+ * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *		src/test/modules/xid_wraparound/multixid_wraparound.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/multixact.h"
+#include "access/xact.h"
+#include "miscadmin.h"
+#include "storage/proc.h"
+#include "utils/xid8.h"
+
+static int mxactMemberComparator(const void *arg1, const void *arg2);
+
+/*
+ * Consume the specified number of multi-XIDs, with specified number of
+ * members each.
+ */
+PG_FUNCTION_INFO_V1(consume_multixids);
+Datum
+consume_multixids(PG_FUNCTION_ARGS)
+{
+	int64		nmultis = PG_GETARG_INT64(0);
+	int32		nmembers = PG_GETARG_INT32(1);
+	MultiXactMember *members;
+	MultiXactId	lastmxid = InvalidMultiXactId;
+
+	if (nmultis < 0)
+		elog(ERROR, "invalid nmultis argument: %" PRId64, nmultis);
+	if (nmembers < 1)
+		elog(ERROR, "invalid nmembers argument: %d", nmembers);
+
+	/*
+	 * We consume XIDs by calling GetNewTransactionId(true), which marks the
+	 * consumed XIDs as subtransactions of the current top-level transaction.
+	 * For that to work, this transaction must have a top-level XID.
+	 *
+	 * GetNewTransactionId registers them in the subxid cache in PGPROC, until
+	 * the cache overflows, but beyond that, we don't keep track of the
+	 * consumed XIDs.
+	 */
+	(void) GetTopTransactionId();
+
+	members = palloc((nmultis + nmembers) * sizeof(MultiXactMember));
+	for (int64 i = 0; i < nmultis + nmembers; i++)
+	{
+		FullTransactionId xid;
+
+		xid = GetNewTransactionId(true);
+		members[i].xid = XidFromFullTransactionId(xid);
+		members[i].status = MultiXactStatusForKeyShare;
+	}
+	/*
+	 * pre-sort the array like mXactCacheGetBySet does, so that the qsort call
+	 * in mXactCacheGetBySet() is cheaper.
+	 */
+	qsort(members, nmultis + nmembers, sizeof(MultiXactMember), mxactMemberComparator);
+
+	for (int64 i = 0; i < nmultis; i++)
+	{
+		lastmxid = MultiXactIdCreateFromMembers(nmembers, &members[i]);
+		CHECK_FOR_INTERRUPTS();
+	}
+
+	pfree(members);
+
+	PG_RETURN_INT64((int64) lastmxid);
+}
+
+/* copied from multixact.c */
+static int
+mxactMemberComparator(const void *arg1, const void *arg2)
+{
+	MultiXactMember member1 = *(const MultiXactMember *) arg1;
+	MultiXactMember member2 = *(const MultiXactMember *) arg2;
+
+	if (member1.xid > member2.xid)
+		return 1;
+	if (member1.xid < member2.xid)
+		return -1;
+	if (member1.status > member2.status)
+		return 1;
+	if (member1.status < member2.status)
+		return -1;
+	return 0;
+}
diff --git a/src/test/modules/xid_wraparound/xid_wraparound--1.0.sql b/src/test/modules/xid_wraparound/xid_wraparound--1.0.sql
index 96356b4b974..ed7520c3d86 100644
--- a/src/test/modules/xid_wraparound/xid_wraparound--1.0.sql
+++ b/src/test/modules/xid_wraparound/xid_wraparound--1.0.sql
@@ -10,3 +10,7 @@ AS 'MODULE_PATHNAME' LANGUAGE C;
 CREATE FUNCTION consume_xids_until(targetxid xid8)
 RETURNS xid8 VOLATILE PARALLEL UNSAFE STRICT
 AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION consume_multixids(nmultis bigint, nmembers int4)
+RETURNS bigint VOLATILE PARALLEL UNSAFE STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
-- 
2.47.3
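
consume_multixids() above builds each multixact from a window of nmembers
consecutive entries in one sorted pool of XIDs, advancing the window by one
entry per multixact. The sketch below shows only that indexing scheme, with
plain integers in place of MultiXactMember. All names are hypothetical; the
observation that distinct windows give distinct member sets, presumably so
that no call is satisfied from a previously created multixact, is a reading
of the patch rather than something it states.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pool size the patch allocates for a request. */
static size_t
pool_size(size_t nmultis, size_t nmembers)
{
	return nmultis + nmembers;
}

/* Index of the last pool entry used by window i. */
static size_t
window_last(size_t i, size_t nmembers)
{
	assert(nmembers >= 1);
	return i + nmembers - 1;
}

/*
 * Return 1 if windows i and j select different member lists from a pool of
 * strictly increasing values, 0 if they are identical.
 */
static int
windows_differ(const uint32_t *pool, size_t i, size_t j, size_t nmembers)
{
	for (size_t k = 0; k < nmembers; k++)
	{
		if (pool[i + k] != pool[j + k])
			return 1;
	}
	return 0;
}
```

The last window, i = nmultis - 1, ends at index nmultis + nmembers - 2, so
the pool of nmultis + nmembers entries is always large enough.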

v26-0010-TEST-Original-pg_upgrade-test-case.patch (text/x-patch; charset=UTF-8)
From c111239a2316fe5d10e236e424ff153090ae9434 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Fri, 14 Nov 2025 00:31:37 +0200
Subject: [PATCH v26 10/10] TEST: Original pg_upgrade test case

---
 src/bin/pg_upgrade/meson.build         |   1 +
 src/bin/pg_upgrade/t/009_mxoff_orig.pl | 463 +++++++++++++++++++++++++
 2 files changed, 464 insertions(+)
 create mode 100644 src/bin/pg_upgrade/t/009_mxoff_orig.pl

diff --git a/src/bin/pg_upgrade/meson.build b/src/bin/pg_upgrade/meson.build
index 28cd29d666e..dbe2ce9de9e 100644
--- a/src/bin/pg_upgrade/meson.build
+++ b/src/bin/pg_upgrade/meson.build
@@ -52,6 +52,7 @@ tests += {
       't/006_transfer_modes.pl',
       't/007_multixact_conversion.pl',
       't/008_multi_wrap.pl',
+      't/009_mxoff_orig.pl',
     ],
     'test_kwargs': {'priority': 40}, # pg_upgrade tests are slow
   },
diff --git a/src/bin/pg_upgrade/t/009_mxoff_orig.pl b/src/bin/pg_upgrade/t/009_mxoff_orig.pl
new file mode 100644
index 00000000000..7204325f873
--- /dev/null
+++ b/src/bin/pg_upgrade/t/009_mxoff_orig.pl
@@ -0,0 +1,463 @@
+
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+use strict;
+use warnings FATAL => 'all';
+
+use Math::BigInt;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use PostgreSQL::Test::AdjustDump;
+use PostgreSQL::Test::AdjustUpgrade;
+use Test::More;
+
+# This test exercises several multitransaction states, in the same manner as
+# 002_pg_upgrade.pl.
+
+unless (defined($ENV{oldinstall}))
+{
+	plan skip_all =>
+		'to run test set oldinstall environment variable to the pre 64-bit mxoff cluster';
+}
+
+# Temp dir for dumps.
+my $tempdir = PostgreSQL::Test::Utils::tempdir;
+
+# Can be changed to test the other modes.
+my $mode = $ENV{PG_TEST_PG_UPGRADE_MODE} || '--copy';
+
+sub utility_path
+{
+	my $node = shift;
+	my $name = shift;
+
+	my $bin_path = defined($node->install_path) ?
+		$node->install_path . "/bin/$name" : $name;
+
+	return $bin_path;
+}
+
+# Get NextMultiOffset.
+sub next_mxoff
+{
+	my $node = shift;
+
+	my $pg_controldata_path = utility_path($node, 'pg_controldata');
+	my ($stdout, $stderr) = run_command([ $pg_controldata_path,
+											$node->data_dir ]);
+	my @control_data = split("\n", $stdout);
+	my $next_mxoff = undef;
+
+	foreach (@control_data)
+	{
+		if ($_ =~ /^Latest checkpoint's NextMultiOffset:\s*(.*)$/mg)
+		{
+			$next_mxoff = $1;
+			last;
+		}
+	}
+	die "NextMultiOffset not found in control file\n"
+		unless defined($next_mxoff);
+
+	return $next_mxoff;
+}
+
+# Consume around 10k of mxoffsets.
+sub mxact_eater
+{
+	my $node = shift;
+	my $tbl = 'FOO';
+
+	my ($mxoff1, $mxoff2);
+
+	$mxoff1 = next_mxoff($node);
+	$node->start;
+	$node->safe_psql('postgres',
+		"CREATE TABLE ${tbl} (I INT PRIMARY KEY, N_UPDATED INT) " .
+		"       WITH (AUTOVACUUM_ENABLED=FALSE);" .
+		"INSERT INTO ${tbl} SELECT G, 0 FROM GENERATE_SERIES(1, 50) G;");
+
+	# consume around 10k mxoff
+	my $nclients = 10;
+	my $update_every = 75;
+	my @connections = ();
+
+	for (0..$nclients)
+	{
+		my $conn = $node->background_psql('postgres');
+		$conn->query_safe("BEGIN");
+
+		push(@connections, $conn);
+	}
+
+	for (my $i = 0; $i < 1000; $i++)
+	{
+		my $conn = $connections[$i % $nclients];
+
+		$conn->query_safe("COMMIT;");
+		$conn->query_safe("BEGIN");
+
+		if ($i % $update_every == 0)
+		{
+			$conn->query_safe(
+				"UPDATE ${tbl} SET " .
+				"N_UPDATED = N_UPDATED + 1 " .
+				"WHERE I = ${i} % 50");
+		}
+		else
+		{
+			$conn->query_safe(
+				"SELECT * FROM ${tbl} FOR KEY SHARE");
+		}
+	}
+
+	for my $conn (@connections)
+	{
+		$conn->quit();
+	}
+
+	$node->stop;
+	$mxoff2 = next_mxoff($node);
+
+	return $mxoff1, $mxoff2;
+}
+
+# Consume around 1M of mxoffsets.
+sub mxact_huge_eater
+{
+	my $node = shift;
+	my $tbl = 'FOO';
+
+	my ($mxoff1, $mxoff2);
+
+	$mxoff1 = next_mxoff($node);
+	$node->start;
+	$node->safe_psql('postgres',
+		"CREATE TABLE ${tbl} (I INT PRIMARY KEY, N_UPDATED INT) " .
+		"       WITH (AUTOVACUUM_ENABLED=FALSE);" .
+		"INSERT INTO ${tbl} SELECT G, 0 FROM GENERATE_SERIES(1, 4) G;");
+
+	my $nclients = 100;
+	my @connections = ();
+	my $timeout = 10 * $PostgreSQL::Test::Utils::timeout_default;
+
+	for (0..$nclients)
+	{
+		my $conn = $node->background_psql('postgres',
+										  timeout => $timeout);
+		$conn->query_safe("BEGIN");
+
+		push(@connections, $conn);
+	}
+
+	# This is a long process, so report progress along the way.
+	my $n_steps = 100_000;
+	my $step = int($n_steps / 10);
+
+	diag "\nstart to consume mxoffsets ...\n";
+	for (my $i = 0; $i < $n_steps; $i++)
+	{
+		my $conn = $connections[$i % $nclients];
+
+		$conn->query_safe("COMMIT;");
+		$conn->query_safe("BEGIN");
+
+		{
+			$conn->query_safe(
+				"SELECT * FROM ${tbl} " .
+				"FOR KEY SHARE");
+		}
+
+		if ($i % $step == 0)
+		{
+			my $done = int(($i / $n_steps) * 100);
+			diag "$done% done...";
+		}
+	}
+
+	for my $conn (@connections)
+	{
+		$conn->quit();
+	}
+
+	$node->stop;
+	$mxoff2 = next_mxoff($node);
+
+	return $mxoff1, $mxoff2;
+}
+
+# Set oldest multixact-offset
+sub reset_mxoff
+{
+	my $node = shift;
+	my $offset = shift;
+
+	my $pg_resetwal_path = utility_path($node, 'pg_resetwal');
+	# Get block size
+	my $out = (run_command([ $pg_resetwal_path, '--dry-run',
+							 $node->data_dir ]))[0];
+	$out =~ /^Database block size: *(\d+)$/m or die;
+	my $blcksz = $1;
+
+	# Reset to new offset
+	my @cmd = ($pg_resetwal_path, '--pgdata' => $node->data_dir);
+	push @cmd, '--multixact-offset' => $offset;
+	command_ok(\@cmd, 'set oldest multixact-offset');
+
+	# Fill empty pg_multixact/members segment
+	my $mult = 32 * int($blcksz / 20) * 4;
+	my $segname = sprintf "%04X", $offset / $mult;
+
+	my @dd = ('dd');
+	push @dd, "if=/dev/zero";
+	push @dd, "of=" . $node->data_dir . "/pg_multixact/members/" . $segname;
+	push @dd, "bs=$blcksz";
+	push @dd, "count=32";
+	command_ok(\@dd, 'fill empty multixact-members');
+}
+
+sub get_dump_for_comparison
+{
+	my ($node, $db, $file_prefix, $adjust_child_columns) = @_;
+
+	my $dumpfile = $tempdir . '/' . $file_prefix . '.sql';
+	my $dump_adjusted = "${dumpfile}_adjusted";
+
+	open(my $dh, '>', $dump_adjusted)
+	  || die "could not open $dump_adjusted for writing $!";
+
+	my $pg_dump_path = utility_path($node, 'pg_dump');
+
+	$node->run_log(
+		[
+			$pg_dump_path, '--no-sync',
+			'--restrict-key' => 'test',
+			'-d' => $node->connstr($db),
+			'-f' => $dumpfile
+		]);
+
+	print $dh adjust_regress_dumpfile(slurp_file($dumpfile),
+		$adjust_child_columns);
+	close($dh);
+
+	return $dump_adjusted;
+}
+
+# Main test workhorse routine.
+# Run pg_upgrade, dump the data and compare the dumps.
+sub run_test
+{
+	my $tag = shift;
+	my $oldnode = shift;
+	my $newnode = shift;
+
+	my $pg_upgrade_path = utility_path($newnode, 'pg_upgrade');
+
+	command_ok(
+		[
+			$pg_upgrade_path, '--no-sync',
+			'--old-datadir' => $oldnode->data_dir,
+			'--new-datadir' => $newnode->data_dir,
+			'--old-bindir' => $oldnode->config_data('--bindir'),
+			'--new-bindir' => $newnode->config_data('--bindir'),
+			'--socketdir' => $newnode->host,
+			'--old-port' => $oldnode->port,
+			'--new-port' => $newnode->port,
+			$mode,
+		],
+		'run of pg_upgrade for new instance');
+	ok( !-d $newnode->data_dir . "/pg_upgrade_output.d",
+		"pg_upgrade_output.d/ removed after pg_upgrade success");
+
+	$oldnode->start;
+	my $src_dump =
+		get_dump_for_comparison($oldnode, 'postgres',
+								"oldnode_${tag}_dump", 0);
+	$oldnode->stop;
+
+	$newnode->start;
+	my $dst_dump =
+		get_dump_for_comparison($newnode, 'postgres',
+								"newnode_${tag}_dump", 0);
+	$newnode->stop;
+
+	compare_files($src_dump, $dst_dump,
+		'dump outputs from original and restored regression databases match');
+}
+
+sub to_hex
+{
+	my $arg = shift;
+
+	$arg = Math::BigInt->new($arg);
+	$arg = $arg->as_hex();
+
+	return $arg;
+}
+
+# case #1: start old node from defaults
+{
+	my $tag = 1;
+	my $old =
+		PostgreSQL::Test::Cluster->new("oldnode${tag}",
+									   install_path => $ENV{oldinstall});
+	$old->init(extra => ['-k']);
+
+	my ($start_mxoff, $finish_mxoff) = mxact_eater($old);
+
+	my $new = PostgreSQL::Test::Cluster->new("newnode${tag}");
+	$new->init;
+
+	run_test($tag, $old, $new);
+
+	$start_mxoff = to_hex($start_mxoff);
+	$finish_mxoff = to_hex($finish_mxoff);
+
+	my $next_mxoff = to_hex(next_mxoff($new));
+
+	note ">>> case #${tag}\n" .
+		 " oldnode mxoff from ${start_mxoff} to ${finish_mxoff}\n" .
+		 " newnode mxoff ${next_mxoff}\n";
+}
+
+# case #2: start old node from before 32-bit wraparound
+{
+	my $tag = 2;
+	my $old =
+		PostgreSQL::Test::Cluster->new("oldnode${tag}",
+									   install_path => $ENV{oldinstall});
+
+	$old->init(extra => ['-k']);
+	reset_mxoff($old, 0xFFFF0000);
+
+	my ($start_mxoff, $finish_mxoff) = mxact_eater($old);
+
+	my $new = PostgreSQL::Test::Cluster->new("newnode${tag}");
+	$new->init;
+
+	run_test($tag, $old, $new);
+
+	$start_mxoff = to_hex($start_mxoff);
+	$finish_mxoff = to_hex($finish_mxoff);
+
+	my $next_mxoff = to_hex(next_mxoff($new));
+
+	note ">>> case #${tag}\n" .
+		 " oldnode mxoff from ${start_mxoff} to ${finish_mxoff}\n" .
+		 " newnode mxoff ${next_mxoff}\n";
+}
+
+# case #3: start old node near 32-bit wraparound and reach wraparound state.
+{
+	my $tag = 3;
+	my $old =
+		PostgreSQL::Test::Cluster->new("oldnode${tag}",
+									   install_path => $ENV{oldinstall});
+
+	$old->init(extra => ['-k']);
+
+	reset_mxoff($old, 0xFFFFEC77);
+	my ($start_mxoff, $finish_mxoff) = mxact_eater($old);
+
+	my $new = PostgreSQL::Test::Cluster->new("newnode${tag}");
+	$new->init;
+
+	run_test($tag, $old, $new);
+
+	$start_mxoff = to_hex($start_mxoff);
+	$finish_mxoff = to_hex($finish_mxoff);
+
+	my $next_mxoff = to_hex(next_mxoff($new));
+
+	note ">>> case #${tag}\n" .
+		 " oldnode mxoff from ${start_mxoff} to ${finish_mxoff}\n" .
+		 " newnode mxoff ${next_mxoff}\n";
+}
+
+# case #4: start old node from defaults
+{
+	my $tag = 4;
+	my $old =
+		PostgreSQL::Test::Cluster->new("oldnode${tag}",
+									   install_path => $ENV{oldinstall});
+
+	$old->init(extra => ['-k']);
+	$old->append_conf("postgresql.conf", "max_connections = 128");
+
+	diag "test #${tag} for multiple mxoff segments";
+	my ($start_mxoff, $finish_mxoff) = mxact_huge_eater($old);
+
+	my $new = PostgreSQL::Test::Cluster->new("newnode${tag}");
+	$new->init;
+
+	run_test($tag, $old, $new);
+
+	$start_mxoff = to_hex($start_mxoff);
+	$finish_mxoff = to_hex($finish_mxoff);
+
+	my $next_mxoff = to_hex(next_mxoff($new));
+
+	note ">>> case #${tag}\n" .
+		 " oldnode mxoff from ${start_mxoff} to ${finish_mxoff}\n" .
+		 " newnode mxoff ${next_mxoff}\n";
+}
+
+# case #5: start old node from before 32-bit wraparound
+{
+	my $tag = 5;
+	my $old =
+		PostgreSQL::Test::Cluster->new("oldnode${tag}",
+									   install_path => $ENV{oldinstall});
+
+	$old->init(extra => ['-k']);
+	$old->append_conf("postgresql.conf", "max_connections = 128");
+	reset_mxoff($old, 0xFF000000);
+
+	diag "test #${tag} for multiple mxoff segments";
+	my ($start_mxoff, $finish_mxoff) = mxact_huge_eater($old);
+
+	my $new = PostgreSQL::Test::Cluster->new("newnode${tag}");
+	$new->init;
+
+	run_test($tag, $old, $new);
+
+	$start_mxoff = to_hex($start_mxoff);
+	$finish_mxoff = to_hex($finish_mxoff);
+
+	my $next_mxoff = to_hex(next_mxoff($new));
+
+	note ">>> case #${tag}\n" .
+		 " oldnode mxoff from ${start_mxoff} to ${finish_mxoff}\n" .
+		 " newnode mxoff ${next_mxoff}\n";
+}
+
+# case #6: start old node near 32-bit wraparound and reach wraparound state.
+{
+	my $tag = 6;
+	my $old =
+		PostgreSQL::Test::Cluster->new("oldnode${tag}",
+									   install_path => $ENV{oldinstall});
+
+	$old->init(extra => ['-k']);
+
+	reset_mxoff($old, 0xFFFFFFFF - 500_000);
+	$old->append_conf("postgresql.conf", "max_connections = 128");
+	my ($start_mxoff, $finish_mxoff) = mxact_huge_eater($old);
+
+	diag "test #${tag} for multiple mxoff segments";
+	my $new = PostgreSQL::Test::Cluster->new("newnode${tag}");
+	$new->init;
+
+	run_test($tag, $old, $new);
+
+	$start_mxoff = to_hex($start_mxoff);
+	$finish_mxoff = to_hex($finish_mxoff);
+
+	my $next_mxoff = to_hex(next_mxoff($new));
+
+	note ">>> case #${tag}\n" .
+		 " oldnode mxoff from ${start_mxoff} to ${finish_mxoff}\n" .
+		 " newnode mxoff ${next_mxoff}\n";
+}
+
+done_testing();
-- 
2.47.3
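
The segment-name arithmetic in reset_mxoff above can be checked by hand. The following C sketch reproduces it for the pre-64-bit members layout (20-byte groups holding 4 members each, 32 pages per segment, as the test's own formula assumes). The function names are illustrative only and do not appear in the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define SLRU_PAGES_PER_SEGMENT 32	/* the "32 *" factor in the test */
#define MEMBERGROUP_SIZE       20	/* 4 flag bytes + 4 four-byte xids */
#define MEMBERS_PER_GROUP      4

/* Number of members covered by one pg_multixact/members segment file. */
uint32_t
members_per_segment(uint32_t blcksz)
{
	assert(blcksz >= MEMBERGROUP_SIZE);
	return SLRU_PAGES_PER_SEGMENT * (blcksz / MEMBERGROUP_SIZE) * MEMBERS_PER_GROUP;
}

/* Segment number that contains the given 32-bit member offset. */
uint32_t
members_segment_no(uint32_t offset, uint32_t blcksz)
{
	return offset / members_per_segment(blcksz);
}

/* Format the segment file name the same way the test does ("%04X"). */
void
members_segment_name(uint32_t offset, uint32_t blcksz, char *buf, size_t buflen)
{
	snprintf(buf, buflen, "%04X", (unsigned) members_segment_no(offset, blcksz));
}
```

With an 8 kB block size one segment covers 52352 members, so the offset 0xFFFF0000 used in case #2 falls into segment file 14076.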

#61 Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Heikki Linnakangas (#60)
3 attachment(s)
Re: POC: make mxidoff 64 bits

Here's yet another patch version. I spent the day reviewing this in
detail and doing little cleanups here and there. I squashed the commits
and wrote a proper commit message.

One noteworthy refactoring is in pg_upgrade.c, to make it more clear (to
me at least) how upgrade from version 9.2 and below now works. It was
actually broken when I tested it. Not sure if I had broken it earlier or
if it never worked, but in any case it works now.

I also tested upgrading a cluster from an old minor version, < 9.3.5,
where the control file has a bogus oldestMultiXid==1 value (see commit
b6a3444fa6). As expected, you get a "could not open file" error:

Performing Upgrade
------------------
Setting locale and encoding for new cluster ok
...
Deleting files from new pg_multixact/members ok
Deleting files from new pg_multixact/offsets ok
Converting pg_multixact files
could not open file "/home/heikki/pgsql.93stable/data/pg_multixact/offsets/0000": No such file or directory
Failure, exiting

I don't think we need to support that case. I hope there are no clusters
in that state still in the wild, and you can work around it by upgrading
to 9.3.5 or above and letting autovacuum run. But I wonder if a
pre-upgrade check with a better error message would still be worthwhile.
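
For illustration only, such a pre-upgrade check might look roughly like the sketch below. It is not part of the attached patches. It assumes the old on-disk layout (4-byte offsets, 32 pages per segment) and simply probes with stat() for the offsets segment that should contain oldestMultiXid; the helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/stat.h>

#define SLRU_PAGES_PER_SEGMENT 32
#define OLD_OFFSET_SIZE        4	/* old format: 32-bit offsets */

/* Segment number of pg_multixact/offsets holding the given multixid. */
uint32_t
old_offsets_segno(uint32_t multi, uint32_t blcksz)
{
	assert(blcksz >= OLD_OFFSET_SIZE);
	return multi / (blcksz / OLD_OFFSET_SIZE) / SLRU_PAGES_PER_SEGMENT;
}

/*
 * Hypothetical check: returns true if the segment file that should hold
 * oldest_multi exists in the old cluster.  The probed path is written to
 * 'path' so that the caller can mention it in an error message.
 */
bool
old_offsets_segment_exists(const char *old_pgdata, uint32_t oldest_multi,
						   uint32_t blcksz, char *path, size_t pathlen)
{
	struct stat st;

	snprintf(path, pathlen, "%s/pg_multixact/offsets/%04X", old_pgdata,
			 (unsigned) old_offsets_segno(oldest_multi, blcksz));
	return stat(path, &st) == 0;
}
```

A caller could report something like "oldest multixact segment is missing; upgrade to 9.3.5 or later and let autovacuum run first" instead of the bare "could not open file" error shown above.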

Ashutosh, you were interested in reviewing this earlier. Would you have
a chance to review this now, before I commit it? Alexander, Alvaro,
would you have a chance to take a final look too, please?

- Heikki

Attachments:

v27-0001-FIXME-bump-catversion.patch (text/x-patch; charset=UTF-8)
From ba993966de2e2e193b51974cd21ec9704fcd5c60 Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Fri, 24 Oct 2025 11:47:50 +0300
Subject: [PATCH v27 1/3] FIXME: bump catversion

To avoid constant CF-bot complaints, make the catversion bump in a
separate commit.

This is to be squashed with the main commit before pushing.

NOTE: keep it in sync with MULTIXACTOFFSET_FORMATCHANGE_CAT_VER
---
 src/include/catalog/catversion.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h
index 7eefca1ae42..b0162c2bf63 100644
--- a/src/include/catalog/catversion.h
+++ b/src/include/catalog/catversion.h
@@ -57,6 +57,7 @@
  */
 
 /*							yyyymmddN */
-#define CATALOG_VERSION_NO	202511101
+// FIXME: bump it
+#define CATALOG_VERSION_NO	999999999
 
 #endif
-- 
2.47.3

v27-0002-Move-pg_multixact-SLRU-page-format-definitions-t.patch (text/x-patch; charset=UTF-8)
From 26572661dda2e4c26e43bd1a494ee4719fb0f5cd Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Wed, 12 Nov 2025 14:19:32 +0200
Subject: [PATCH v27 2/3] Move pg_multixact SLRU page format definitions to
 separate header

This makes them accessible from pg_upgrade, needed by the next commit.
I'm doing this mechanical move as a separate commit, to make the next
commit's changes to these definitions more obvious.

Author: Maxim Orlov <orlovmg@gmail.com>
Discussion: https://www.postgresql.org/message-id/CACG%3DezbZo_3_fnx%3DS5BfepwRftzrpJ%2B7WET4EkTU6wnjDTsnjg@mail.gmail.com
---
 src/backend/access/transam/multixact.c  | 119 --------------------
 src/include/access/multixact_internal.h | 140 ++++++++++++++++++++++++
 2 files changed, 140 insertions(+), 119 deletions(-)
 create mode 100644 src/include/access/multixact_internal.h

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 9d5f130af7e..acb2a6788f9 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -89,125 +89,6 @@
 #include "utils/memutils.h"
 
 
-/*
- * Defines for MultiXactOffset page sizes.  A page is the same BLCKSZ as is
- * used everywhere else in Postgres.
- *
- * Note: because MultiXactOffsets are 32 bits and wrap around at 0xFFFFFFFF,
- * MultiXact page numbering also wraps around at
- * 0xFFFFFFFF/MULTIXACT_OFFSETS_PER_PAGE, and segment numbering at
- * 0xFFFFFFFF/MULTIXACT_OFFSETS_PER_PAGE/SLRU_PAGES_PER_SEGMENT.  We need
- * take no explicit notice of that fact in this module, except when comparing
- * segment and page numbers in TruncateMultiXact (see
- * MultiXactOffsetPagePrecedes).
- */
-
-/* We need four bytes per offset */
-#define MULTIXACT_OFFSETS_PER_PAGE (BLCKSZ / sizeof(MultiXactOffset))
-
-static inline int64
-MultiXactIdToOffsetPage(MultiXactId multi)
-{
-	return multi / MULTIXACT_OFFSETS_PER_PAGE;
-}
-
-static inline int
-MultiXactIdToOffsetEntry(MultiXactId multi)
-{
-	return multi % MULTIXACT_OFFSETS_PER_PAGE;
-}
-
-static inline int64
-MultiXactIdToOffsetSegment(MultiXactId multi)
-{
-	return MultiXactIdToOffsetPage(multi) / SLRU_PAGES_PER_SEGMENT;
-}
-
-/*
- * The situation for members is a bit more complex: we store one byte of
- * additional flag bits for each TransactionId.  To do this without getting
- * into alignment issues, we store four bytes of flags, and then the
- * corresponding 4 Xids.  Each such 5-word (20-byte) set we call a "group", and
- * are stored as a whole in pages.  Thus, with 8kB BLCKSZ, we keep 409 groups
- * per page.  This wastes 12 bytes per page, but that's OK -- simplicity (and
- * performance) trumps space efficiency here.
- *
- * Note that the "offset" macros work with byte offset, not array indexes, so
- * arithmetic must be done using "char *" pointers.
- */
-/* We need eight bits per xact, so one xact fits in a byte */
-#define MXACT_MEMBER_BITS_PER_XACT			8
-#define MXACT_MEMBER_FLAGS_PER_BYTE			1
-#define MXACT_MEMBER_XACT_BITMASK	((1 << MXACT_MEMBER_BITS_PER_XACT) - 1)
-
-/* how many full bytes of flags are there in a group? */
-#define MULTIXACT_FLAGBYTES_PER_GROUP		4
-#define MULTIXACT_MEMBERS_PER_MEMBERGROUP	\
-	(MULTIXACT_FLAGBYTES_PER_GROUP * MXACT_MEMBER_FLAGS_PER_BYTE)
-/* size in bytes of a complete group */
-#define MULTIXACT_MEMBERGROUP_SIZE \
-	(sizeof(TransactionId) * MULTIXACT_MEMBERS_PER_MEMBERGROUP + MULTIXACT_FLAGBYTES_PER_GROUP)
-#define MULTIXACT_MEMBERGROUPS_PER_PAGE (BLCKSZ / MULTIXACT_MEMBERGROUP_SIZE)
-#define MULTIXACT_MEMBERS_PER_PAGE	\
-	(MULTIXACT_MEMBERGROUPS_PER_PAGE * MULTIXACT_MEMBERS_PER_MEMBERGROUP)
-
-/*
- * Because the number of items per page is not a divisor of the last item
- * number (member 0xFFFFFFFF), the last segment does not use the maximum number
- * of pages, and moreover the last used page therein does not use the same
- * number of items as previous pages.  (Another way to say it is that the
- * 0xFFFFFFFF member is somewhere in the middle of the last page, so the page
- * has some empty space after that item.)
- *
- * This constant is the number of members in the last page of the last segment.
- */
-#define MAX_MEMBERS_IN_LAST_MEMBERS_PAGE \
-		((uint32) ((0xFFFFFFFF % MULTIXACT_MEMBERS_PER_PAGE) + 1))
-
-/* page in which a member is to be found */
-static inline int64
-MXOffsetToMemberPage(MultiXactOffset offset)
-{
-	return offset / MULTIXACT_MEMBERS_PER_PAGE;
-}
-
-static inline int64
-MXOffsetToMemberSegment(MultiXactOffset offset)
-{
-	return MXOffsetToMemberPage(offset) / SLRU_PAGES_PER_SEGMENT;
-}
-
-/* Location (byte offset within page) of flag word for a given member */
-static inline int
-MXOffsetToFlagsOffset(MultiXactOffset offset)
-{
-	MultiXactOffset group = offset / MULTIXACT_MEMBERS_PER_MEMBERGROUP;
-	int			grouponpg = group % MULTIXACT_MEMBERGROUPS_PER_PAGE;
-	int			byteoff = grouponpg * MULTIXACT_MEMBERGROUP_SIZE;
-
-	return byteoff;
-}
-
-static inline int
-MXOffsetToFlagsBitShift(MultiXactOffset offset)
-{
-	int			member_in_group = offset % MULTIXACT_MEMBERS_PER_MEMBERGROUP;
-	int			bshift = member_in_group * MXACT_MEMBER_BITS_PER_XACT;
-
-	return bshift;
-}
-
-/* Location (byte offset within page) of TransactionId of given member */
-static inline int
-MXOffsetToMemberOffset(MultiXactOffset offset)
-{
-	int			member_in_group = offset % MULTIXACT_MEMBERS_PER_MEMBERGROUP;
-
-	return MXOffsetToFlagsOffset(offset) +
-		MULTIXACT_FLAGBYTES_PER_GROUP +
-		member_in_group * sizeof(TransactionId);
-}
-
 /* Multixact members wraparound thresholds. */
 #define MULTIXACT_MEMBER_SAFE_THRESHOLD		(MaxMultiXactOffset / 2)
 #define MULTIXACT_MEMBER_DANGER_THRESHOLD	\
diff --git a/src/include/access/multixact_internal.h b/src/include/access/multixact_internal.h
new file mode 100644
index 00000000000..9b56deaef31
--- /dev/null
+++ b/src/include/access/multixact_internal.h
@@ -0,0 +1,140 @@
+/*
+ * multixact_internal.h
+ *
+ * PostgreSQL multi-transaction-log manager internal declarations
+ *
+ * These functions and definitions are for dealing with pg_multixact pages.
+ * They are internal to multixact.c, but they are exported here to allow
+ * pg_upgrade to write pg_multixact files directly.
+ *
+ * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/multixact_internal.h
+ */
+#ifndef MULTIXACT_INTERNAL_H
+#define MULTIXACT_INTERNAL_H
+
+#include "access/multixact.h"
+
+
+/*
+ * Defines for MultiXactOffset page sizes.  A page is the same BLCKSZ as is
+ * used everywhere else in Postgres.
+ *
+ * Note: because MultiXactOffsets are 32 bits and wrap around at 0xFFFFFFFF,
+ * MultiXact page numbering also wraps around at
+ * 0xFFFFFFFF/MULTIXACT_OFFSETS_PER_PAGE, and segment numbering at
+ * 0xFFFFFFFF/MULTIXACT_OFFSETS_PER_PAGE/SLRU_PAGES_PER_SEGMENT.  We need
+ * take no explicit notice of that fact in this module, except when comparing
+ * segment and page numbers in TruncateMultiXact (see
+ * MultiXactOffsetPagePrecedes).
+ */
+
+/* We need four bytes per offset */
+#define MULTIXACT_OFFSETS_PER_PAGE (BLCKSZ / sizeof(MultiXactOffset))
+
+static inline int64
+MultiXactIdToOffsetPage(MultiXactId multi)
+{
+	return multi / MULTIXACT_OFFSETS_PER_PAGE;
+}
+
+static inline int
+MultiXactIdToOffsetEntry(MultiXactId multi)
+{
+	return multi % MULTIXACT_OFFSETS_PER_PAGE;
+}
+
+static inline int64
+MultiXactIdToOffsetSegment(MultiXactId multi)
+{
+	return MultiXactIdToOffsetPage(multi) / SLRU_PAGES_PER_SEGMENT;
+}
+
+/*
+ * The situation for members is a bit more complex: we store one byte of
+ * additional flag bits for each TransactionId.  To do this without getting
+ * into alignment issues, we store four bytes of flags, and then the
+ * corresponding 4 Xids.  Each such 5-word (20-byte) set we call a "group", and
+ * are stored as a whole in pages.  Thus, with 8kB BLCKSZ, we keep 409 groups
+ * per page.  This wastes 12 bytes per page, but that's OK -- simplicity (and
+ * performance) trumps space efficiency here.
+ *
+ * Note that the "offset" macros work with byte offset, not array indexes, so
+ * arithmetic must be done using "char *" pointers.
+ */
+/* We need eight bits per xact, so one xact fits in a byte */
+#define MXACT_MEMBER_BITS_PER_XACT			8
+#define MXACT_MEMBER_FLAGS_PER_BYTE			1
+#define MXACT_MEMBER_XACT_BITMASK	((1 << MXACT_MEMBER_BITS_PER_XACT) - 1)
+
+/* how many full bytes of flags are there in a group? */
+#define MULTIXACT_FLAGBYTES_PER_GROUP		4
+#define MULTIXACT_MEMBERS_PER_MEMBERGROUP	\
+	(MULTIXACT_FLAGBYTES_PER_GROUP * MXACT_MEMBER_FLAGS_PER_BYTE)
+/* size in bytes of a complete group */
+#define MULTIXACT_MEMBERGROUP_SIZE \
+	(sizeof(TransactionId) * MULTIXACT_MEMBERS_PER_MEMBERGROUP + MULTIXACT_FLAGBYTES_PER_GROUP)
+#define MULTIXACT_MEMBERGROUPS_PER_PAGE (BLCKSZ / MULTIXACT_MEMBERGROUP_SIZE)
+#define MULTIXACT_MEMBERS_PER_PAGE	\
+	(MULTIXACT_MEMBERGROUPS_PER_PAGE * MULTIXACT_MEMBERS_PER_MEMBERGROUP)
+
+/*
+ * Because the number of items per page is not a divisor of the last item
+ * number (member 0xFFFFFFFF), the last segment does not use the maximum number
+ * of pages, and moreover the last used page therein does not use the same
+ * number of items as previous pages.  (Another way to say it is that the
+ * 0xFFFFFFFF member is somewhere in the middle of the last page, so the page
+ * has some empty space after that item.)
+ *
+ * This constant is the number of members in the last page of the last segment.
+ */
+#define MAX_MEMBERS_IN_LAST_MEMBERS_PAGE \
+		((uint32) ((0xFFFFFFFF % MULTIXACT_MEMBERS_PER_PAGE) + 1))
+
+/* page in which a member is to be found */
+static inline int64
+MXOffsetToMemberPage(MultiXactOffset offset)
+{
+	return offset / MULTIXACT_MEMBERS_PER_PAGE;
+}
+
+static inline int64
+MXOffsetToMemberSegment(MultiXactOffset offset)
+{
+	return MXOffsetToMemberPage(offset) / SLRU_PAGES_PER_SEGMENT;
+}
+
+/* Location (byte offset within page) of flag word for a given member */
+static inline int
+MXOffsetToFlagsOffset(MultiXactOffset offset)
+{
+	MultiXactOffset group = offset / MULTIXACT_MEMBERS_PER_MEMBERGROUP;
+	int			grouponpg = group % MULTIXACT_MEMBERGROUPS_PER_PAGE;
+	int			byteoff = grouponpg * MULTIXACT_MEMBERGROUP_SIZE;
+
+	return byteoff;
+}
+
+static inline int
+MXOffsetToFlagsBitShift(MultiXactOffset offset)
+{
+	int			member_in_group = offset % MULTIXACT_MEMBERS_PER_MEMBERGROUP;
+	int			bshift = member_in_group * MXACT_MEMBER_BITS_PER_XACT;
+
+	return bshift;
+}
+
+/* Location (byte offset within page) of TransactionId of given member */
+static inline int
+MXOffsetToMemberOffset(MultiXactOffset offset)
+{
+	int			member_in_group = offset % MULTIXACT_MEMBERS_PER_MEMBERGROUP;
+
+	return MXOffsetToFlagsOffset(offset) +
+		MULTIXACT_FLAGBYTES_PER_GROUP +
+		member_in_group * sizeof(TransactionId);
+}
+
+#endif							/* MULTIXACT_INTERNAL_H */
-- 
2.47.3
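
The members page layout described in the header comment above can be worked through with concrete numbers. The sketch below is a standalone restatement of the pre-widening helpers, assuming the default 8 kB BLCKSZ and 4-byte TransactionIds. The shortened names are illustrative, not the ones in multixact_internal.h.

```c
#include <assert.h>
#include <stdint.h>

#define BLCKSZ              8192
#define FLAGBYTES_PER_GROUP 4
#define MEMBERS_PER_GROUP   4
#define XID_SIZE            4
/* 4 flag bytes followed by 4 xids: 20 bytes per group */
#define GROUP_SIZE          (XID_SIZE * MEMBERS_PER_GROUP + FLAGBYTES_PER_GROUP)
#define GROUPS_PER_PAGE     (BLCKSZ / GROUP_SIZE)	/* 409 */
#define MEMBERS_PER_PAGE    (GROUPS_PER_PAGE * MEMBERS_PER_GROUP)	/* 1636 */

/* Bytes left unused at the end of every members page. */
int
wasted_bytes_per_page(void)
{
	return BLCKSZ - GROUPS_PER_PAGE * GROUP_SIZE;
}

/* Byte offset within the page of the flag word for a member. */
int
flags_offset(uint32_t offset)
{
	return (int) ((offset / MEMBERS_PER_GROUP) % GROUPS_PER_PAGE) * GROUP_SIZE;
}

/* Bit shift of the member's flag byte inside that flag word. */
int
flags_bitshift(uint32_t offset)
{
	return (int) (offset % MEMBERS_PER_GROUP) * 8;
}

/* Byte offset within the page of the member's TransactionId. */
int
member_offset(uint32_t offset)
{
	int			off = flags_offset(offset) + FLAGBYTES_PER_GROUP +
		(int) (offset % MEMBERS_PER_GROUP) * XID_SIZE;

	assert(off + XID_SIZE <= BLCKSZ);
	return off;
}
```

For example, member offset 5 is the second member of the second group on its page: its flag word starts at byte 20, its flag byte is shifted by 8 bits, and its xid is stored at byte 28. Offset 1636 is the first member of the next page, so its xid is at byte 4 again.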

v27-0003-Widen-MultiXactOffset-to-64-bits.patch (text/x-patch; charset=UTF-8)
From c7b7202bab379e57f54f95e946f782adde3da20f Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Mon, 17 Nov 2025 16:30:37 +0200
Subject: [PATCH v27 3/3] Widen MultiXactOffset to 64 bits

This eliminates offset wraparound and the 2^32 limit on the total
number of multixid members. Multixids are still limited to 2^31, but
this is a nice improvement because 'members' can grow much faster than
the number of multixids. On such systems, you can now run longer
before hitting hard limits or triggering anti-wraparound vacuums.

Not having to deal with offset wraparound also simplifies the code and
removes some gnarly corner cases.

We no longer need to perform emergency anti-wraparound freezing
because of running out of 'members' space, so the offset stop limit is
gone. But you might still not want 'members' to consume huge amounts
of disk space. For that reason, I kept the logic for lowering vacuum's
multixid freezing cutoff if a large amount of 'members' space is
used. The thresholds for that are roughly the same as the "safe" and
"danger" thresholds used before, 2 billion transactions and 4 billion
transactions. This keeps the behavior for the freeze cutoff roughly
the same as before. It might make sense to make this smarter or
configurable, now that the threshold is only needed to manage disk
usage, but that's left for the future.

Add code to pg_upgrade to convert multitransactions from the old to
the new format. Because pg_upgrade now rewrites the files in the new
format, we can get rid of some hacks we had put in place to deal with
old bugs and upgraded clusters.

Author: Maxim Orlov <orlovmg@gmail.com>
Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>
Reviewed-by: wenhui qiu <qiuwenhuifx@gmail.com>
Discussion: https://www.postgresql.org/message-id/CACG%3DezaWg7_nt-8ey4aKv2w9LcuLthHknwCawmBgEeTnJrJTcw@mail.gmail.com
---
 src/backend/access/rmgrdesc/mxactdesc.c       |   4 +-
 src/backend/access/rmgrdesc/xlogdesc.c        |   2 +-
 src/backend/access/transam/multixact.c        | 547 ++++--------------
 src/backend/access/transam/xlog.c             |   6 +-
 src/backend/access/transam/xlogrecovery.c     |   2 +-
 src/backend/commands/vacuum.c                 |   6 +-
 src/backend/postmaster/autovacuum.c           |   4 +-
 src/bin/pg_controldata/pg_controldata.c       |   2 +-
 src/bin/pg_resetwal/pg_resetwal.c             |  30 +-
 src/bin/pg_resetwal/t/001_basic.pl            |   4 +-
 src/bin/pg_upgrade/Makefile                   |   3 +
 src/bin/pg_upgrade/meson.build                |   4 +
 src/bin/pg_upgrade/multixact_new.c            | 101 ++++
 src/bin/pg_upgrade/multixact_new.h            |  28 +
 src/bin/pg_upgrade/multixact_old.c            | 302 ++++++++++
 src/bin/pg_upgrade/multixact_old.h            |  38 ++
 src/bin/pg_upgrade/pg_upgrade.c               | 165 +++++-
 src/bin/pg_upgrade/pg_upgrade.h               |   7 +
 src/bin/pg_upgrade/slru_io.c                  | 242 ++++++++
 src/bin/pg_upgrade/slru_io.h                  |  52 ++
 .../pg_upgrade/t/007_multixact_conversion.pl  | 329 +++++++++++
 src/include/access/multixact.h                |   7 +-
 src/include/access/multixact_internal.h       |  23 +-
 src/include/c.h                               |   2 +-
 src/test/perl/PostgreSQL/Test/Cluster.pm      |  21 +-
 src/tools/pgindent/typedefs.list              |   3 +
 26 files changed, 1424 insertions(+), 510 deletions(-)
 create mode 100644 src/bin/pg_upgrade/multixact_new.c
 create mode 100644 src/bin/pg_upgrade/multixact_new.h
 create mode 100644 src/bin/pg_upgrade/multixact_old.c
 create mode 100644 src/bin/pg_upgrade/multixact_old.h
 create mode 100644 src/bin/pg_upgrade/slru_io.c
 create mode 100644 src/bin/pg_upgrade/slru_io.h
 create mode 100644 src/bin/pg_upgrade/t/007_multixact_conversion.pl

diff --git a/src/backend/access/rmgrdesc/mxactdesc.c b/src/backend/access/rmgrdesc/mxactdesc.c
index 3ca0582db36..052dd0a4ce5 100644
--- a/src/backend/access/rmgrdesc/mxactdesc.c
+++ b/src/backend/access/rmgrdesc/mxactdesc.c
@@ -65,7 +65,7 @@ multixact_desc(StringInfo buf, XLogReaderState *record)
 		xl_multixact_create *xlrec = (xl_multixact_create *) rec;
 		int			i;
 
-		appendStringInfo(buf, "%u offset %u nmembers %d: ", xlrec->mid,
+		appendStringInfo(buf, "%u offset %" PRIu64 " nmembers %d: ", xlrec->mid,
 						 xlrec->moff, xlrec->nmembers);
 		for (i = 0; i < xlrec->nmembers; i++)
 			out_member(buf, &xlrec->members[i]);
@@ -74,7 +74,7 @@ multixact_desc(StringInfo buf, XLogReaderState *record)
 	{
 		xl_multixact_truncate *xlrec = (xl_multixact_truncate *) rec;
 
-		appendStringInfo(buf, "offsets [%u, %u), members [%u, %u)",
+		appendStringInfo(buf, "offsets [%u, %u), members [%" PRIu64 ", %" PRIu64 ")",
 						 xlrec->startTruncOff, xlrec->endTruncOff,
 						 xlrec->startTruncMemb, xlrec->endTruncMemb);
 	}
diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c
index cd6c2a2f650..441034f5929 100644
--- a/src/backend/access/rmgrdesc/xlogdesc.c
+++ b/src/backend/access/rmgrdesc/xlogdesc.c
@@ -66,7 +66,7 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 		CheckPoint *checkpoint = (CheckPoint *) rec;
 
 		appendStringInfo(buf, "redo %X/%08X; "
-						 "tli %u; prev tli %u; fpw %s; wal_level %s; xid %u:%u; oid %u; multi %u; offset %u; "
+						 "tli %u; prev tli %u; fpw %s; wal_level %s; xid %u:%u; oid %u; multi %u; offset %" PRIu64 "; "
 						 "oldest xid %u in DB %u; oldest multi %u in DB %u; "
 						 "oldest/newest commit timestamp xid: %u/%u; "
 						 "oldest running xid %u; %s",
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index acb2a6788f9..d2ceb5040db 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -69,6 +69,7 @@
 #include "postgres.h"
 
 #include "access/multixact.h"
+#include "access/multixact_internal.h"
 #include "access/slru.h"
 #include "access/twophase.h"
 #include "access/twophase_rmgr.h"
@@ -89,10 +90,14 @@
 #include "utils/memutils.h"
 
 
-/* Multixact members wraparound thresholds. */
-#define MULTIXACT_MEMBER_SAFE_THRESHOLD		(MaxMultiXactOffset / 2)
-#define MULTIXACT_MEMBER_DANGER_THRESHOLD	\
-	(MaxMultiXactOffset - MaxMultiXactOffset / 4)
+/*
+ * Thresholds used to keep members disk usage in check when multixids have a
+ * lot of members.  When MULTIXACT_MEMBER_LOW_THRESHOLD is reached, vacuum
+ * starts freezing multixids more aggressively, even if the normal multixid
+ * age limits haven't been reached yet.
+ */
+#define MULTIXACT_MEMBER_LOW_THRESHOLD		UINT64CONST(2000000000)
+#define MULTIXACT_MEMBER_HIGH_THRESHOLD		UINT64CONST(4000000000)
 
 static inline MultiXactId
 PreviousMultiXactId(MultiXactId multi)
@@ -137,11 +142,9 @@ typedef struct MultiXactStateData
 
 	/*
 	 * Oldest multixact offset that is potentially referenced by a multixact
-	 * referenced by a relation.  We don't always know this value, so there's
-	 * a flag here to indicate whether or not we currently do.
+	 * referenced by a relation.
 	 */
 	MultiXactOffset oldestOffset;
-	bool		oldestOffsetKnown;
 
 	/* support for anti-wraparound measures */
 	MultiXactId multiVacLimit;
@@ -149,9 +152,6 @@ typedef struct MultiXactStateData
 	MultiXactId multiStopLimit;
 	MultiXactId multiWrapLimit;
 
-	/* support for members anti-wraparound measures */
-	MultiXactOffset offsetStopLimit;	/* known if oldestOffsetKnown */
-
 	/*
 	 * This is used to sleep until a multixact offset is written when we want
 	 * to create the next one.
@@ -278,13 +278,9 @@ static void mXactCachePut(MultiXactId multi, int nmembers,
 /* management of SLRU infrastructure */
 static bool MultiXactOffsetPagePrecedes(int64 page1, int64 page2);
 static bool MultiXactMemberPagePrecedes(int64 page1, int64 page2);
-static bool MultiXactOffsetPrecedes(MultiXactOffset offset1,
-									MultiXactOffset offset2);
 static void ExtendMultiXactOffset(MultiXactId multi);
 static void ExtendMultiXactMember(MultiXactOffset offset, int nmembers);
-static bool MultiXactOffsetWouldWrap(MultiXactOffset boundary,
-									 MultiXactOffset start, uint32 distance);
-static bool SetOffsetVacuumLimit(bool is_startup);
+static void SetOffsetVacuumLimit(void);
 static bool find_multixact_start(MultiXactId multi, MultiXactOffset *result);
 static void WriteMTruncateXlogRec(Oid oldestMultiDB,
 								  MultiXactId startTruncOff,
@@ -1023,90 +1019,22 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 	ExtendMultiXactOffset(result);
 
 	/*
-	 * Reserve the members space, similarly to above.  Also, be careful not to
-	 * return zero as the starting offset for any multixact. See
-	 * GetMultiXactIdMembers() for motivation.
+	 * Reserve the members space, similarly to above.
 	 */
 	nextOffset = MultiXactState->nextOffset;
-	if (nextOffset == 0)
-	{
-		*offset = 1;
-		nmembers++;				/* allocate member slot 0 too */
-	}
-	else
-		*offset = nextOffset;
-
-	/*----------
-	 * Protect against overrun of the members space as well, with the
-	 * following rules:
-	 *
-	 * If we're past offsetStopLimit, refuse to generate more multis.
-	 * If we're close to offsetStopLimit, emit a warning.
-	 *
-	 * Arbitrarily, we start emitting warnings when we're 20 segments or less
-	 * from offsetStopLimit.
-	 *
-	 * Note we haven't updated the shared state yet, so if we fail at this
-	 * point, the multixact ID we grabbed can still be used by the next guy.
-	 *
-	 * Note that there is no point in forcing autovacuum runs here: the
-	 * multixact freeze settings would have to be reduced for that to have any
-	 * effect.
-	 *----------
-	 */
-#define OFFSET_WARN_SEGMENTS	20
-	if (MultiXactState->oldestOffsetKnown &&
-		MultiXactOffsetWouldWrap(MultiXactState->offsetStopLimit, nextOffset,
-								 nmembers))
-	{
-		/* see comment in the corresponding offsets wraparound case */
-		SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER);
-
-		ereport(ERROR,
-				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg("multixact \"members\" limit exceeded"),
-				 errdetail_plural("This command would create a multixact with %u members, but the remaining space is only enough for %u member.",
-								  "This command would create a multixact with %u members, but the remaining space is only enough for %u members.",
-								  MultiXactState->offsetStopLimit - nextOffset - 1,
-								  nmembers,
-								  MultiXactState->offsetStopLimit - nextOffset - 1),
-				 errhint("Execute a database-wide VACUUM in database with OID %u with reduced \"vacuum_multixact_freeze_min_age\" and \"vacuum_multixact_freeze_table_age\" settings.",
-						 MultiXactState->oldestMultiXactDB)));
-	}
 
 	/*
-	 * Check whether we should kick autovacuum into action, to prevent members
-	 * wraparound. NB we use a much larger window to trigger autovacuum than
-	 * just the warning limit. The warning is just a measure of last resort -
-	 * this is in line with GetNewTransactionId's behaviour.
+	 * Offsets are 64-bit integers and will never wrap around.  Firstly, it
+	 * would take an unrealistic amount of time and resources to consume 2^64
+	 * offsets.  Secondly, multixid creation is WAL-logged, so you would run
+	 * out of LSNs before reaching offset wraparound.  Nevertheless, check for
+	 * wraparound as a sanity check.
 	 */
-	if (!MultiXactState->oldestOffsetKnown ||
-		(MultiXactState->nextOffset - MultiXactState->oldestOffset
-		 > MULTIXACT_MEMBER_SAFE_THRESHOLD))
-	{
-		/*
-		 * To avoid swamping the postmaster with signals, we issue the autovac
-		 * request only when crossing a segment boundary. With default
-		 * compilation settings that's roughly after 50k members.  This still
-		 * gives plenty of chances before we get into real trouble.
-		 */
-		if ((MXOffsetToMemberPage(nextOffset) / SLRU_PAGES_PER_SEGMENT) !=
-			(MXOffsetToMemberPage(nextOffset + nmembers) / SLRU_PAGES_PER_SEGMENT))
-			SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER);
-	}
-
-	if (MultiXactState->oldestOffsetKnown &&
-		MultiXactOffsetWouldWrap(MultiXactState->offsetStopLimit,
-								 nextOffset,
-								 nmembers + MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT * OFFSET_WARN_SEGMENTS))
-		ereport(WARNING,
+	if (nextOffset + nmembers < nextOffset)
+		ereport(ERROR,
 				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg_plural("database with OID %u must be vacuumed before %d more multixact member is used",
-							   "database with OID %u must be vacuumed before %d more multixact members are used",
-							   MultiXactState->offsetStopLimit - nextOffset + nmembers,
-							   MultiXactState->oldestMultiXactDB,
-							   MultiXactState->offsetStopLimit - nextOffset + nmembers),
-				 errhint("Execute a database-wide VACUUM in that database with reduced \"vacuum_multixact_freeze_min_age\" and \"vacuum_multixact_freeze_table_age\" settings.")));
+				 errmsg("MultiXact members would wrap around")));
+	*offset = nextOffset;
 
 	ExtendMultiXactMember(nextOffset, nmembers);
 
@@ -1127,8 +1055,7 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 	 * the next iteration.  But note that nextMXact may be InvalidMultiXactId
 	 * or the first value on a segment-beginning page after this routine
 	 * exits, so anyone else looking at the variable must be prepared to deal
-	 * with either case.  Similarly, nextOffset may be zero, but we won't use
-	 * that as the actual start offset of the next multixact.
+	 * with either case.
 	 */
 	(MultiXactState->nextMXact)++;
 
@@ -1136,7 +1063,8 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 
 	LWLockRelease(MultiXactGenLock);
 
-	debug_elog4(DEBUG2, "GetNew: returning %u offset %u", result, *offset);
+	debug_elog4(DEBUG2, "GetNew: returning %u offset %" PRIu64,
+				result, *offset);
 	return result;
 }
 
@@ -1178,7 +1106,6 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
 	MultiXactOffset *offptr;
 	MultiXactOffset offset;
 	int			length;
-	int			truelength;
 	MultiXactId oldestMXact;
 	MultiXactId nextMXact;
 	MultiXactId tmpMXact;
@@ -1277,16 +1204,7 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
 	 * we have just for this; the process in charge will signal the CV as soon
 	 * as it has finished writing the multixact offset.
 	 *
-	 * 3. Because GetNewMultiXactId increments offset zero to offset one to
-	 * handle case #2, there is an ambiguity near the point of offset
-	 * wraparound.  If we see next multixact's offset is one, is that our
-	 * multixact's actual endpoint, or did it end at zero with a subsequent
-	 * increment?  We handle this using the knowledge that if the zero'th
-	 * member slot wasn't filled, it'll contain zero, and zero isn't a valid
-	 * transaction ID so it can't be a multixact member.  Therefore, if we
-	 * read a zero from the members array, just ignore it.
-	 *
-	 * This is all pretty messy, but the mess occurs only in infrequent corner
+	 * This is a little messy, but the mess occurs only in infrequent corner
 	 * cases, so it seems better than holding the MultiXactGenLock for a long
 	 * time on every multixact creation.
 	 */
@@ -1372,6 +1290,9 @@ retry:
 	LWLockRelease(lock);
 	lock = NULL;
 
+	/* A multixid with zero members should not happen */
+	Assert(length > 0);
+
 	/*
 	 * If we slept above, clean up state; it's no longer needed.
 	 */
@@ -1380,7 +1301,6 @@ retry:
 
 	ptr = (MultiXactMember *) palloc(length * sizeof(MultiXactMember));
 
-	truelength = 0;
 	prev_pageno = -1;
 	for (int i = 0; i < length; i++, offset++)
 	{
@@ -1417,37 +1337,27 @@ retry:
 
 		xactptr = (TransactionId *)
 			(MultiXactMemberCtl->shared->page_buffer[slotno] + memberoff);
-
-		if (!TransactionIdIsValid(*xactptr))
-		{
-			/* Corner case 3: we must be looking at unused slot zero */
-			Assert(offset == 0);
-			continue;
-		}
+		Assert(TransactionIdIsValid(*xactptr));
 
 		flagsoff = MXOffsetToFlagsOffset(offset);
 		bshift = MXOffsetToFlagsBitShift(offset);
 		flagsptr = (uint32 *) (MultiXactMemberCtl->shared->page_buffer[slotno] + flagsoff);
 
-		ptr[truelength].xid = *xactptr;
-		ptr[truelength].status = (*flagsptr >> bshift) & MXACT_MEMBER_XACT_BITMASK;
-		truelength++;
+		ptr[i].xid = *xactptr;
+		ptr[i].status = (*flagsptr >> bshift) & MXACT_MEMBER_XACT_BITMASK;
 	}
 
 	LWLockRelease(lock);
 
-	/* A multixid with zero members should not happen */
-	Assert(truelength > 0);
-
 	/*
 	 * Copy the result into the local cache.
 	 */
-	mXactCachePut(multi, truelength, ptr);
+	mXactCachePut(multi, length, ptr);
 
 	debug_elog3(DEBUG2, "GetMembers: no cache for %s",
-				mxid_to_string(multi, truelength, ptr));
+				mxid_to_string(multi, length, ptr));
 	*members = ptr;
-	return truelength;
+	return length;
 }
 
 /*
@@ -1854,7 +1764,7 @@ MultiXactShmemInit(void)
 				  "pg_multixact/members", LWTRANCHE_MULTIXACTMEMBER_BUFFER,
 				  LWTRANCHE_MULTIXACTMEMBER_SLRU,
 				  SYNC_HANDLER_MULTIXACT_MEMBER,
-				  false);
+				  true);
 	/* doesn't call SimpleLruTruncate() or meet criteria for unit tests */
 
 	/* Initialize our shared state struct */
@@ -1910,48 +1820,6 @@ BootStrapMultiXact(void)
 	SimpleLruZeroAndWritePage(MultiXactMemberCtl, 0);
 }
 
-/*
- * MaybeExtendOffsetSlru
- *		Extend the offsets SLRU area, if necessary
- *
- * After a binary upgrade from <= 9.2, the pg_multixact/offsets SLRU area might
- * contain files that are shorter than necessary; this would occur if the old
- * installation had used multixacts beyond the first page (files cannot be
- * copied, because the on-disk representation is different).  pg_upgrade would
- * update pg_control to set the next offset value to be at that position, so
- * that tuples marked as locked by such MultiXacts would be seen as visible
- * without having to consult multixact.  However, trying to create and use a
- * new MultiXactId would result in an error because the page on which the new
- * value would reside does not exist.  This routine is in charge of creating
- * such pages.
- */
-static void
-MaybeExtendOffsetSlru(void)
-{
-	int64		pageno;
-	LWLock	   *lock;
-
-	pageno = MultiXactIdToOffsetPage(MultiXactState->nextMXact);
-	lock = SimpleLruGetBankLock(MultiXactOffsetCtl, pageno);
-
-	LWLockAcquire(lock, LW_EXCLUSIVE);
-
-	if (!SimpleLruDoesPhysicalPageExist(MultiXactOffsetCtl, pageno))
-	{
-		int			slotno;
-
-		/*
-		 * Fortunately for us, SimpleLruWritePage is already prepared to deal
-		 * with creating a new segment file even if the page we're writing is
-		 * not the first in it, so this is enough.
-		 */
-		slotno = SimpleLruZeroPage(MultiXactOffsetCtl, pageno);
-		SimpleLruWritePage(MultiXactOffsetCtl, slotno);
-	}
-
-	LWLockRelease(lock);
-}
-
 /*
  * This must be called ONCE during postmaster or standalone-backend startup.
  *
@@ -2082,8 +1950,8 @@ TrimMultiXact(void)
 	MultiXactState->finishedStartup = true;
 	LWLockRelease(MultiXactGenLock);
 
-	/* Now compute how far away the next members wraparound is. */
-	SetMultiXactIdLimit(oldestMXact, oldestMXactDB, true);
+	/* Now compute how far away the next multixid wraparound is. */
+	SetMultiXactIdLimit(oldestMXact, oldestMXactDB);
 }
 
 /*
@@ -2104,7 +1972,7 @@ MultiXactGetCheckptMulti(bool is_shutdown,
 	LWLockRelease(MultiXactGenLock);
 
 	debug_elog6(DEBUG2,
-				"MultiXact: checkpoint is nextMulti %u, nextOffset %u, oldestMulti %u in DB %u",
+				"MultiXact: checkpoint is nextMulti %u, nextOffset %" PRIu64 ", oldestMulti %u in DB %u",
 				*nextMulti, *nextMultiOffset, *oldestMulti, *oldestMultiDB);
 }
 
@@ -2139,26 +2007,12 @@ void
 MultiXactSetNextMXact(MultiXactId nextMulti,
 					  MultiXactOffset nextMultiOffset)
 {
-	debug_elog4(DEBUG2, "MultiXact: setting next multi to %u offset %u",
+	debug_elog4(DEBUG2, "MultiXact: setting next multi to %u offset %" PRIu64,
 				nextMulti, nextMultiOffset);
 	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
 	MultiXactState->nextMXact = nextMulti;
 	MultiXactState->nextOffset = nextMultiOffset;
 	LWLockRelease(MultiXactGenLock);
-
-	/*
-	 * During a binary upgrade, make sure that the offsets SLRU is large
-	 * enough to contain the next value that would be created.
-	 *
-	 * We need to do this pretty early during the first startup in binary
-	 * upgrade mode: before StartupMultiXact() in fact, because this routine
-	 * is called even before that by StartupXLOG().  And we can't do it
-	 * earlier than at this point, because during that first call of this
-	 * routine we determine the MultiXactState->nextMXact value that
-	 * MaybeExtendOffsetSlru needs.
-	 */
-	if (IsBinaryUpgrade)
-		MaybeExtendOffsetSlru();
 }
 
 /*
@@ -2166,28 +2020,24 @@ MultiXactSetNextMXact(MultiXactId nextMulti,
  * datminmxid (ie, the oldest MultiXactId that might exist in any database
  * of our cluster), and the OID of the (or a) database with that value.
  *
- * is_startup is true when we are just starting the cluster, false when we
- * are updating state in a running cluster.  This only affects log messages.
+ * This also updates MultiXactState->oldestOffset, by looking up the offset of
+ * MultiXactState->oldestMultiXactId.
  */
 void
-SetMultiXactIdLimit(MultiXactId oldest_datminmxid, Oid oldest_datoid,
-					bool is_startup)
+SetMultiXactIdLimit(MultiXactId oldest_datminmxid, Oid oldest_datoid)
 {
 	MultiXactId multiVacLimit;
 	MultiXactId multiWarnLimit;
 	MultiXactId multiStopLimit;
 	MultiXactId multiWrapLimit;
 	MultiXactId curMulti;
-	bool		needs_offset_vacuum;
 
 	Assert(MultiXactIdIsValid(oldest_datminmxid));
 
 	/*
 	 * We pretend that a wrap will happen halfway through the multixact ID
 	 * space, but that's not really true, because multixacts wrap differently
-	 * from transaction IDs.  Note that, separately from any concern about
-	 * multixact IDs wrapping, we must ensure that multixact members do not
-	 * wrap.  Limits for that are set in SetOffsetVacuumLimit, not here.
+	 * from transaction IDs.
 	 */
 	multiWrapLimit = oldest_datminmxid + (MaxMultiXactId >> 1);
 	if (multiWrapLimit < FirstMultiXactId)
@@ -2255,8 +2105,13 @@ SetMultiXactIdLimit(MultiXactId oldest_datminmxid, Oid oldest_datoid,
 
 	Assert(!InRecovery);
 
-	/* Set limits for offset vacuum. */
-	needs_offset_vacuum = SetOffsetVacuumLimit(is_startup);
+	/*
+	 * Offsets are 64 bits wide and never wrap around, so we don't need to
+	 * consider them for emergency autovacuum purposes.  But now that we're in
+	 * a consistent state, determine MultiXactState->oldestOffset, to be used
+	 * to calculate freezing cutoff to keep the members disk usage in check.
+	 */
+	SetOffsetVacuumLimit();
 
 	/*
 	 * If past the autovacuum force point, immediately signal an autovac
@@ -2265,8 +2120,7 @@ SetMultiXactIdLimit(MultiXactId oldest_datminmxid, Oid oldest_datoid,
 	 * database, it'll call here, and we'll signal the postmaster to start
 	 * another iteration immediately if there are still any old databases.
 	 */
-	if ((MultiXactIdPrecedes(multiVacLimit, curMulti) ||
-		 needs_offset_vacuum) && IsUnderPostmaster)
+	if (MultiXactIdPrecedes(multiVacLimit, curMulti) && IsUnderPostmaster)
 		SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER);
 
 	/* Give an immediate warning if past the wrap warn point */
@@ -2328,9 +2182,9 @@ MultiXactAdvanceNextMXact(MultiXactId minMulti,
 		debug_elog3(DEBUG2, "MultiXact: setting next multi to %u", minMulti);
 		MultiXactState->nextMXact = minMulti;
 	}
-	if (MultiXactOffsetPrecedes(MultiXactState->nextOffset, minMultiOffset))
+	if (MultiXactState->nextOffset < minMultiOffset)
 	{
-		debug_elog3(DEBUG2, "MultiXact: setting next offset to %u",
+		debug_elog3(DEBUG2, "MultiXact: setting next offset to %" PRIu64,
 					minMultiOffset);
 		MultiXactState->nextOffset = minMultiOffset;
 	}
@@ -2349,7 +2203,7 @@ MultiXactAdvanceOldest(MultiXactId oldestMulti, Oid oldestMultiDB)
 	Assert(InRecovery);
 
 	if (MultiXactIdPrecedes(MultiXactState->oldestMultiXactId, oldestMulti))
-		SetMultiXactIdLimit(oldestMulti, oldestMultiDB, false);
+		SetMultiXactIdLimit(oldestMulti, oldestMultiDB);
 }
 
 /*
@@ -2432,23 +2286,8 @@ ExtendMultiXactMember(MultiXactOffset offset, int nmembers)
 			LWLockRelease(lock);
 		}
 
-		/*
-		 * Compute the number of items till end of current page.  Careful: if
-		 * addition of unsigned ints wraps around, we're at the last page of
-		 * the last segment; since that page holds a different number of items
-		 * than other pages, we need to do it differently.
-		 */
-		if (offset + MAX_MEMBERS_IN_LAST_MEMBERS_PAGE < offset)
-		{
-			/*
-			 * This is the last page of the last segment; we can compute the
-			 * number of items left to allocate in it without modulo
-			 * arithmetic.
-			 */
-			difference = MaxMultiXactOffset - offset + 1;
-		}
-		else
-			difference = MULTIXACT_MEMBERS_PER_PAGE - offset % MULTIXACT_MEMBERS_PER_PAGE;
+		/* Compute the number of items till end of current page. */
+		difference = MULTIXACT_MEMBERS_PER_PAGE - offset % MULTIXACT_MEMBERS_PER_PAGE;
 
 		/*
 		 * Advance to next page, taking care to properly handle the wraparound
@@ -2514,28 +2353,17 @@ GetOldestMultiXactId(void)
 }
 
 /*
- * Determine how aggressively we need to vacuum in order to prevent member
- * wraparound.
- *
- * To do so determine what's the oldest member offset and install the limit
- * info in MultiXactState, where it can be used to prevent overrun of old data
- * in the members SLRU area.
- *
- * The return value is true if emergency autovacuum is required and false
- * otherwise.
+ * Calculate the oldest member offset and install it in MultiXactState, where
+ * it can be used to adjust multixid freezing cutoffs.
  */
-static bool
-SetOffsetVacuumLimit(bool is_startup)
+static void
+SetOffsetVacuumLimit(void)
 {
 	MultiXactId oldestMultiXactId;
 	MultiXactId nextMXact;
 	MultiXactOffset oldestOffset = 0;	/* placate compiler */
-	MultiXactOffset prevOldestOffset;
 	MultiXactOffset nextOffset;
 	bool		oldestOffsetKnown = false;
-	bool		prevOldestOffsetKnown;
-	MultiXactOffset offsetStopLimit = 0;
-	MultiXactOffset prevOffsetStopLimit;
 
 	/*
 	 * NB: Have to prevent concurrent truncation, we might otherwise try to
@@ -2548,9 +2376,6 @@ SetOffsetVacuumLimit(bool is_startup)
 	oldestMultiXactId = MultiXactState->oldestMultiXactId;
 	nextMXact = MultiXactState->nextMXact;
 	nextOffset = MultiXactState->nextOffset;
-	prevOldestOffsetKnown = MultiXactState->oldestOffsetKnown;
-	prevOldestOffset = MultiXactState->oldestOffset;
-	prevOffsetStopLimit = MultiXactState->offsetStopLimit;
 	Assert(MultiXactState->finishedStartup);
 	LWLockRelease(MultiXactGenLock);
 
@@ -2573,121 +2398,39 @@ SetOffsetVacuumLimit(bool is_startup)
 	else
 	{
 		/*
-		 * Figure out where the oldest existing multixact's offsets are
-		 * stored. Due to bugs in early release of PostgreSQL 9.3.X and 9.4.X,
-		 * the supposedly-earliest multixact might not really exist.  We are
-		 * careful not to fail in that case.
+		 * Look up the offset at which the oldest existing multixact's members
+		 * are stored.  If we cannot find it, be careful not to fail, and
+		 * leave oldestOffset unchanged.  oldestOffset is initialized to zero
+		 * at system startup, which prevents truncating members until a proper
+		 * value is calculated.
+		 *
+		 * (We had bugs in early releases of PostgreSQL 9.3.X and 9.4.X where
+		 * the supposedly-earliest multixact might not really exist.  Those
+		 * should be long gone by now, so this should not fail, but let's
+		 * still be defensive.)
 		 */
 		oldestOffsetKnown =
 			find_multixact_start(oldestMultiXactId, &oldestOffset);
 
 		if (oldestOffsetKnown)
 			ereport(DEBUG1,
-					(errmsg_internal("oldest MultiXactId member is at offset %u",
+					(errmsg_internal("oldest MultiXactId member is at offset %" PRIu64,
 									 oldestOffset)));
 		else
 			ereport(LOG,
-					(errmsg("MultiXact member wraparound protections are disabled because oldest checkpointed MultiXact %u does not exist on disk",
+					(errmsg("oldest checkpointed MultiXact %u does not exist on disk",
 							oldestMultiXactId)));
 	}
 
 	LWLockRelease(MultiXactTruncationLock);
 
-	/*
-	 * If we can, compute limits (and install them MultiXactState) to prevent
-	 * overrun of old data in the members SLRU area. We can only do so if the
-	 * oldest offset is known though.
-	 */
+	/* Install the computed value */
 	if (oldestOffsetKnown)
 	{
-		/* move back to start of the corresponding segment */
-		offsetStopLimit = oldestOffset - (oldestOffset %
-										  (MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT));
-
-		/* always leave one segment before the wraparound point */
-		offsetStopLimit -= (MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT);
-
-		if (!prevOldestOffsetKnown && !is_startup)
-			ereport(LOG,
-					(errmsg("MultiXact member wraparound protections are now enabled")));
-
-		ereport(DEBUG1,
-				(errmsg_internal("MultiXact member stop limit is now %u based on MultiXact %u",
-								 offsetStopLimit, oldestMultiXactId)));
-	}
-	else if (prevOldestOffsetKnown)
-	{
-		/*
-		 * If we failed to get the oldest offset this time, but we have a
-		 * value from a previous pass through this function, use the old
-		 * values rather than automatically forcing an emergency autovacuum
-		 * cycle again.
-		 */
-		oldestOffset = prevOldestOffset;
-		oldestOffsetKnown = true;
-		offsetStopLimit = prevOffsetStopLimit;
+		LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
+		MultiXactState->oldestOffset = oldestOffset;
+		LWLockRelease(MultiXactGenLock);
 	}
-
-	/* Install the computed values */
-	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
-	MultiXactState->oldestOffset = oldestOffset;
-	MultiXactState->oldestOffsetKnown = oldestOffsetKnown;
-	MultiXactState->offsetStopLimit = offsetStopLimit;
-	LWLockRelease(MultiXactGenLock);
-
-	/*
-	 * Do we need an emergency autovacuum?	If we're not sure, assume yes.
-	 */
-	return !oldestOffsetKnown ||
-		(nextOffset - oldestOffset > MULTIXACT_MEMBER_SAFE_THRESHOLD);
-}
-
-/*
- * Return whether adding "distance" to "start" would move past "boundary".
- *
- * We use this to determine whether the addition is "wrapping around" the
- * boundary point, hence the name.  The reason we don't want to use the regular
- * 2^31-modulo arithmetic here is that we want to be able to use the whole of
- * the 2^32-1 space here, allowing for more multixacts than would fit
- * otherwise.
- */
-static bool
-MultiXactOffsetWouldWrap(MultiXactOffset boundary, MultiXactOffset start,
-						 uint32 distance)
-{
-	MultiXactOffset finish;
-
-	/*
-	 * Note that offset number 0 is not used (see GetMultiXactIdMembers), so
-	 * if the addition wraps around the UINT_MAX boundary, skip that value.
-	 */
-	finish = start + distance;
-	if (finish < start)
-		finish++;
-
-	/*-----------------------------------------------------------------------
-	 * When the boundary is numerically greater than the starting point, any
-	 * value numerically between the two is not wrapped:
-	 *
-	 *	<----S----B---->
-	 *	[---)			 = F wrapped past B (and UINT_MAX)
-	 *		 [---)		 = F not wrapped
-	 *			  [----] = F wrapped past B
-	 *
-	 * When the boundary is numerically less than the starting point (i.e. the
-	 * UINT_MAX wraparound occurs somewhere in between) then all values in
-	 * between are wrapped:
-	 *
-	 *	<----B----S---->
-	 *	[---)			 = F not wrapped past B (but wrapped past UINT_MAX)
-	 *		 [---)		 = F wrapped past B (and UINT_MAX)
-	 *			  [----] = F not wrapped
-	 *-----------------------------------------------------------------------
-	 */
-	if (start < boundary)
-		return finish >= boundary || finish < start;
-	else
-		return finish >= boundary && finish < start;
 }
 
 /*
@@ -2741,37 +2484,23 @@ find_multixact_start(MultiXactId multi, MultiXactOffset *result)
  * members: Number of member entries (nextOffset - oldestOffset)
  * oldestMultiXactId: Oldest MultiXact ID still in use
  * oldestOffset: Oldest offset still in use
- *
- * Returns false if unable to determine, the oldest offset being unknown.
  */
-bool
+void
 GetMultiXactInfo(uint32 *multixacts, MultiXactOffset *members,
 				 MultiXactId *oldestMultiXactId, MultiXactOffset *oldestOffset)
 {
 	MultiXactOffset nextOffset;
 	MultiXactId nextMultiXactId;
-	bool		oldestOffsetKnown;
 
 	LWLockAcquire(MultiXactGenLock, LW_SHARED);
 	nextOffset = MultiXactState->nextOffset;
 	*oldestMultiXactId = MultiXactState->oldestMultiXactId;
 	nextMultiXactId = MultiXactState->nextMXact;
 	*oldestOffset = MultiXactState->oldestOffset;
-	oldestOffsetKnown = MultiXactState->oldestOffsetKnown;
 	LWLockRelease(MultiXactGenLock);
 
-	if (!oldestOffsetKnown)
-	{
-		*members = 0;
-		*multixacts = 0;
-		*oldestMultiXactId = InvalidMultiXactId;
-		*oldestOffset = 0;
-		return false;
-	}
-
 	*members = nextOffset - *oldestOffset;
 	*multixacts = nextMultiXactId - *oldestMultiXactId;
-	return true;
 }
 
 /*
@@ -2780,26 +2509,27 @@ GetMultiXactInfo(uint32 *multixacts, MultiXactOffset *members,
  * vacuum_multixact_freeze_table_age work together to make sure we never have
  * too many multixacts; we hope that, at least under normal circumstances,
  * this will also be sufficient to keep us from using too many offsets.
- * However, if the average multixact has many members, we might exhaust the
- * members space while still using few enough members that these limits fail
- * to trigger relminmxid advancement by VACUUM.  At that point, we'd have no
- * choice but to start failing multixact-creating operations with an error.
- *
- * To prevent that, if more than a threshold portion of the members space is
- * used, we effectively reduce autovacuum_multixact_freeze_max_age and
- * to a value just less than the number of multixacts in use.  We hope that
- * this will quickly trigger autovacuuming on the table or tables with the
- * oldest relminmxid, thus allowing datminmxid values to advance and removing
- * some members.
- *
- * As the fraction of the member space currently in use grows, we become
- * more aggressive in clamping this value.  That not only causes autovacuum
- * to ramp up, but also makes any manual vacuums the user issues more
- * aggressive.  This happens because vacuum_get_cutoffs() will clamp the
- * freeze table and the minimum freeze age cutoffs based on the effective
- * autovacuum_multixact_freeze_max_age this function returns.  In the worst
- * case, we'll claim the freeze_max_age to zero, and every vacuum of any
- * table will freeze every multixact.
+ * However, if the average multixact has many members, we might accumulate a
+ * large amount of members, consuming disk space, while still using few enough
+ * multixids that the multixid limits fail to trigger relminmxid advancement
+ * by VACUUM.
+ *
+ * To prevent that, if the members space usage exceeds a threshold
+ * (MULTIXACT_MEMBER_LOW_THRESHOLD), we effectively reduce
+ * autovacuum_multixact_freeze_max_age to a value just less than the number of
+ * multixacts in use.  We hope that this will quickly trigger autovacuuming on
+ * the table or tables with the oldest relminmxid, thus allowing datminmxid
+ * values to advance and removing some members.
+ *
+ * As the amount of the member space in use grows, we become more aggressive
+ * in clamping this value.  That not only causes autovacuum to ramp up, but
+ * also makes any manual vacuums the user issues more aggressive.  This
+ * happens because vacuum_get_cutoffs() will clamp the freeze table and the
+ * minimum freeze age cutoffs based on the effective
+ * autovacuum_multixact_freeze_max_age this function returns.  At the extreme,
+ * when the members usage reaches MULTIXACT_MEMBER_HIGH_THRESHOLD, we clamp
+ * freeze_max_age to zero, and every vacuum of any table will freeze every
+ * multixact.
  */
 int
 MultiXactMemberFreezeThreshold(void)
@@ -2812,21 +2542,27 @@ MultiXactMemberFreezeThreshold(void)
 	MultiXactId oldestMultiXactId;
 	MultiXactOffset oldestOffset;
 
-	/* If we can't determine member space utilization, assume the worst. */
-	if (!GetMultiXactInfo(&multixacts, &members, &oldestMultiXactId, &oldestOffset))
-		return 0;
+	/* Read the current offsets and members usage. */
+	GetMultiXactInfo(&multixacts, &members, &oldestMultiXactId, &oldestOffset);
 
 	/* If member space utilization is low, no special action is required. */
-	if (members <= MULTIXACT_MEMBER_SAFE_THRESHOLD)
+	if (members <= MULTIXACT_MEMBER_LOW_THRESHOLD)
 		return autovacuum_multixact_freeze_max_age;
 
 	/*
 	 * Compute a target for relminmxid advancement.  The number of multixacts
 	 * we try to eliminate from the system is based on how far we are past
-	 * MULTIXACT_MEMBER_SAFE_THRESHOLD.
-	 */
-	fraction = (double) (members - MULTIXACT_MEMBER_SAFE_THRESHOLD) /
-		(MULTIXACT_MEMBER_DANGER_THRESHOLD - MULTIXACT_MEMBER_SAFE_THRESHOLD);
+	 * MULTIXACT_MEMBER_LOW_THRESHOLD.
+	 *
+	 * The way this formula works is that when members is exactly at the low
+	 * threshold, fraction = 0.0, and we set freeze_max_age equal to
+	 * mxid_age(oldestMultiXactId).  As members grows further, towards the
+	 * high threshold, fraction grows linearly from 0.0 to 1.0, and the result
+	 * shrinks from mxid_age(oldestMultiXactId) to 0.  Beyond the high
+	 * threshold, fraction > 1.0 and the result is clamped to 0.
+	 */
+	fraction = (double) (members - MULTIXACT_MEMBER_LOW_THRESHOLD) /
+		(MULTIXACT_MEMBER_HIGH_THRESHOLD - MULTIXACT_MEMBER_LOW_THRESHOLD);
 	victim_multixacts = multixacts * fraction;
 
 	/* fraction could be > 1.0, but lowest possible freeze age is zero */
@@ -2867,36 +2603,12 @@ SlruScanDirCbFindEarliest(SlruCtl ctl, char *filename, int64 segpage, void *data
 
 /*
  * Delete members segments [oldest, newOldest)
- *
- * The members SLRU can, in contrast to the offsets one, be filled to almost
- * the full range at once. This means SimpleLruTruncate() can't trivially be
- * used - instead the to-be-deleted range is computed using the offsets
- * SLRU. C.f. TruncateMultiXact().
  */
 static void
 PerformMembersTruncation(MultiXactOffset oldestOffset, MultiXactOffset newOldestOffset)
 {
-	const int64 maxsegment = MXOffsetToMemberSegment(MaxMultiXactOffset);
-	int64		startsegment = MXOffsetToMemberSegment(oldestOffset);
-	int64		endsegment = MXOffsetToMemberSegment(newOldestOffset);
-	int64		segment = startsegment;
-
-	/*
-	 * Delete all the segments but the last one. The last segment can still
-	 * contain, possibly partially, valid data.
-	 */
-	while (segment != endsegment)
-	{
-		elog(DEBUG2, "truncating multixact members segment %" PRIx64,
-			 segment);
-		SlruDeleteSegment(MultiXactMemberCtl, segment);
-
-		/* move to next segment, handling wraparound correctly */
-		if (segment == maxsegment)
-			segment = 0;
-		else
-			segment += 1;
-	}
+	SimpleLruTruncate(MultiXactMemberCtl,
+					  MXOffsetToMemberPage(newOldestOffset));
 }
 
 /*
@@ -3040,7 +2752,7 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 
 	elog(DEBUG1, "performing multixact truncation: "
 		 "offsets [%u, %u), offsets segments [%" PRIx64 ", %" PRIx64 "), "
-		 "members [%u, %u), members segments [%" PRIx64 ", %" PRIx64 ")",
+		 "members [%" PRIu64 ", %" PRIu64 "), members segments [%" PRIx64 ", %" PRIx64 ")",
 		 oldestMulti, newOldestMulti,
 		 MultiXactIdToOffsetSegment(oldestMulti),
 		 MultiXactIdToOffsetSegment(newOldestMulti),
@@ -3081,6 +2793,7 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
 	MultiXactState->oldestMultiXactId = newOldestMulti;
 	MultiXactState->oldestMultiXactDB = newOldestMultiDB;
+	MultiXactState->oldestOffset = newOldestOffset;
 	LWLockRelease(MultiXactGenLock);
 
 	/* First truncate members */
@@ -3120,20 +2833,13 @@ MultiXactOffsetPagePrecedes(int64 page1, int64 page2)
 
 /*
  * Decide whether a MultiXactMember page number is "older" for truncation
- * purposes.  There is no "invalid offset number" so use the numbers verbatim.
+ * purposes.  There is no "invalid offset number" and members never wrap
+ * around, so use the numbers verbatim.
  */
 static bool
 MultiXactMemberPagePrecedes(int64 page1, int64 page2)
 {
-	MultiXactOffset offset1;
-	MultiXactOffset offset2;
-
-	offset1 = ((MultiXactOffset) page1) * MULTIXACT_MEMBERS_PER_PAGE;
-	offset2 = ((MultiXactOffset) page2) * MULTIXACT_MEMBERS_PER_PAGE;
-
-	return (MultiXactOffsetPrecedes(offset1, offset2) &&
-			MultiXactOffsetPrecedes(offset1,
-									offset2 + MULTIXACT_MEMBERS_PER_PAGE - 1));
+	return page1 < page2;
 }
 
 /*
@@ -3165,17 +2871,6 @@ MultiXactIdPrecedesOrEquals(MultiXactId multi1, MultiXactId multi2)
 }
 
 
-/*
- * Decide which of two offsets is earlier.
- */
-static bool
-MultiXactOffsetPrecedes(MultiXactOffset offset1, MultiXactOffset offset2)
-{
-	int32		diff = (int32) (offset1 - offset2);
-
-	return (diff < 0);
-}
-
 /*
  * Write a TRUNCATE xlog record
  *
@@ -3268,7 +2963,7 @@ multixact_redo(XLogReaderState *record)
 
 		elog(DEBUG1, "replaying multixact truncation: "
 			 "offsets [%u, %u), offsets segments [%" PRIx64 ", %" PRIx64 "), "
-			 "members [%u, %u), members segments [%" PRIx64 ", %" PRIx64 ")",
+			 "members [%" PRIu64 ", %" PRIu64 "), members segments [%" PRIx64 ", %" PRIx64 ")",
 			 xlrec.startTruncOff, xlrec.endTruncOff,
 			 MultiXactIdToOffsetSegment(xlrec.startTruncOff),
 			 MultiXactIdToOffsetSegment(xlrec.endTruncOff),
@@ -3283,7 +2978,7 @@ multixact_redo(XLogReaderState *record)
 		 * Advance the horizon values, so they're current at the end of
 		 * recovery.
 		 */
-		SetMultiXactIdLimit(xlrec.endTruncOff, xlrec.oldestMultiDB, false);
+		SetMultiXactIdLimit(xlrec.endTruncOff, xlrec.oldestMultiDB);
 
 		PerformMembersTruncation(xlrec.startTruncMemb, xlrec.endTruncMemb);
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 22d0a2e8c3a..a000b8bd509 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5139,7 +5139,7 @@ BootStrapXLOG(uint32 data_checksum_version)
 		FullTransactionIdFromEpochAndXid(0, FirstNormalTransactionId);
 	checkPoint.nextOid = FirstGenbkiObjectId;
 	checkPoint.nextMulti = FirstMultiXactId;
-	checkPoint.nextMultiOffset = 0;
+	checkPoint.nextMultiOffset = 1;
 	checkPoint.oldestXid = FirstNormalTransactionId;
 	checkPoint.oldestXidDB = Template1DbOid;
 	checkPoint.oldestMulti = FirstMultiXactId;
@@ -5155,7 +5155,7 @@ BootStrapXLOG(uint32 data_checksum_version)
 	MultiXactSetNextMXact(checkPoint.nextMulti, checkPoint.nextMultiOffset);
 	AdvanceOldestClogXid(checkPoint.oldestXid);
 	SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
-	SetMultiXactIdLimit(checkPoint.oldestMulti, checkPoint.oldestMultiDB, true);
+	SetMultiXactIdLimit(checkPoint.oldestMulti, checkPoint.oldestMultiDB);
 	SetCommitTsLimit(InvalidTransactionId, InvalidTransactionId);
 
 	/* Set up the XLOG page header */
@@ -5636,7 +5636,7 @@ StartupXLOG(void)
 	MultiXactSetNextMXact(checkPoint.nextMulti, checkPoint.nextMultiOffset);
 	AdvanceOldestClogXid(checkPoint.oldestXid);
 	SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
-	SetMultiXactIdLimit(checkPoint.oldestMulti, checkPoint.oldestMultiDB, true);
+	SetMultiXactIdLimit(checkPoint.oldestMulti, checkPoint.oldestMultiDB);
 	SetCommitTsLimit(checkPoint.oldestCommitTsXid,
 					 checkPoint.newestCommitTsXid);
 
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 21b8f179ba0..51dea342a4d 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -886,7 +886,7 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 							 U64FromFullTransactionId(checkPoint.nextXid),
 							 checkPoint.nextOid)));
 	ereport(DEBUG1,
-			(errmsg_internal("next MultiXactId: %u; next MultiXactOffset: %u",
+			(errmsg_internal("next MultiXactId: %u; next MultiXactOffset: %" PRIu64,
 							 checkPoint.nextMulti, checkPoint.nextMultiOffset)));
 	ereport(DEBUG1,
 			(errmsg_internal("oldest unfrozen transaction ID: %u, in database %u",
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index e785dd55ce5..7780ea6eae3 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -1145,8 +1145,8 @@ vacuum_get_cutoffs(Relation rel, const VacuumParams params,
 
 	/*
 	 * Also compute the multixact age for which freezing is urgent.  This is
-	 * normally autovacuum_multixact_freeze_max_age, but may be less if we are
-	 * short of multixact member space.
+	 * normally autovacuum_multixact_freeze_max_age, but may be less if
+	 * multixact members are bloated.
 	 */
 	effective_multixact_freeze_max_age = MultiXactMemberFreezeThreshold();
 
@@ -1971,7 +1971,7 @@ vac_truncate_clog(TransactionId frozenXID,
 	 * signaling twice?
 	 */
 	SetTransactionIdLimit(frozenXID, oldestxid_datoid);
-	SetMultiXactIdLimit(minMulti, minmulti_datoid, false);
+	SetMultiXactIdLimit(minMulti, minmulti_datoid);
 
 	LWLockRelease(WrapLimitsVacuumLock);
 }
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 1c38488f2cb..f4830f896f3 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -1936,8 +1936,8 @@ do_autovacuum(void)
 
 	/*
 	 * Compute the multixact age for which freezing is urgent.  This is
-	 * normally autovacuum_multixact_freeze_max_age, but may be less if we are
-	 * short of multixact member space.
+	 * normally autovacuum_multixact_freeze_max_age, but may be less if
+	 * multixact members are bloated.
 	 */
 	effective_multixact_freeze_max_age = MultiXactMemberFreezeThreshold();
 
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index 30ad46912e1..a4060309ae0 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -271,7 +271,7 @@ main(int argc, char *argv[])
 		   ControlFile->checkPointCopy.nextOid);
 	printf(_("Latest checkpoint's NextMultiXactId:  %u\n"),
 		   ControlFile->checkPointCopy.nextMulti);
-	printf(_("Latest checkpoint's NextMultiOffset:  %u\n"),
+	printf(_("Latest checkpoint's NextMultiOffset:  %" PRIu64 "\n"),
 		   ControlFile->checkPointCopy.nextMultiOffset);
 	printf(_("Latest checkpoint's oldestXID:        %u\n"),
 		   ControlFile->checkPointCopy.oldestXid);
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index a31e7643cf0..7c6c2741a17 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -92,6 +92,7 @@ static void KillExistingArchiveStatus(void);
 static void KillExistingWALSummaries(void);
 static void WriteEmptyXLOG(void);
 static void usage(void);
+static uint64 strtou64_strict(const char *s, char **endptr, int base);
 
 
 int
@@ -120,7 +121,6 @@ main(int argc, char *argv[])
 	MultiXactId set_oldestmxid = 0;
 	char	   *endptr;
 	char	   *endptr2;
-	int64		tmpi64;
 	char	   *DataDir = NULL;
 	char	   *log_fname = NULL;
 	int			fd;
@@ -269,17 +269,14 @@ main(int argc, char *argv[])
 
 			case 'O':
 				errno = 0;
-				tmpi64 = strtoi64(optarg, &endptr, 0);
+				set_mxoff = strtou64_strict(optarg, &endptr, 0);
 				if (endptr == optarg || *endptr != '\0' || errno != 0)
 				{
 					pg_log_error("invalid argument for option %s", "-O");
 					pg_log_error_hint("Try \"%s --help\" for more information.", progname);
 					exit(1);
 				}
-				if (tmpi64 < 0 || tmpi64 > (int64) MaxMultiXactOffset)
-					pg_fatal("multitransaction offset (-O) must be between 0 and %u", MaxMultiXactOffset);
 
-				set_mxoff = (MultiXactOffset) tmpi64;
 				mxoff_given = true;
 				break;
 
@@ -749,7 +746,7 @@ PrintControlValues(bool guessed)
 		   ControlFile.checkPointCopy.nextOid);
 	printf(_("Latest checkpoint's NextMultiXactId:  %u\n"),
 		   ControlFile.checkPointCopy.nextMulti);
-	printf(_("Latest checkpoint's NextMultiOffset:  %u\n"),
+	printf(_("Latest checkpoint's NextMultiOffset:  %" PRIu64 "\n"),
 		   ControlFile.checkPointCopy.nextMultiOffset);
 	printf(_("Latest checkpoint's oldestXID:        %u\n"),
 		   ControlFile.checkPointCopy.oldestXid);
@@ -825,7 +822,7 @@ PrintNewControlValues(void)
 
 	if (mxoff_given)
 	{
-		printf(_("NextMultiOffset:                      %u\n"),
+		printf(_("NextMultiOffset:                      %" PRIu64 "\n"),
 			   ControlFile.checkPointCopy.nextMultiOffset);
 	}
 
@@ -1210,3 +1207,22 @@ usage(void)
 	printf(_("\nReport bugs to <%s>.\n"), PACKAGE_BUGREPORT);
 	printf(_("%s home page: <%s>\n"), PACKAGE_NAME, PACKAGE_URL);
 }
+
+/* Like strtou64(), but negative values are not accepted. */
+static uint64
+strtou64_strict(const char *s, char **endptr, int base)
+{
+	/* skip leading whitespace */
+	while (isspace((unsigned char) *s))
+		s++;
+
+	/* reject negative values */
+	if (*s == '-')
+	{
+		*endptr = (char *) s;
+		errno = ERANGE;
+		return UINT64_MAX;
+	}
+
+	return strtou64(s, endptr, base);
+}
diff --git a/src/bin/pg_resetwal/t/001_basic.pl b/src/bin/pg_resetwal/t/001_basic.pl
index 90ecb8afe18..5a175e285d1 100644
--- a/src/bin/pg_resetwal/t/001_basic.pl
+++ b/src/bin/pg_resetwal/t/001_basic.pl
@@ -145,7 +145,7 @@ command_fails_like(
 	'fails with incorrect -O option');
 command_fails_like(
 	[ 'pg_resetwal', '-O' => '-1', $node->data_dir ],
-	qr/must be between 0 and 4294967295/,
+	qr/error: invalid argument for option -O/,
 	'fails with -O value -1');
 # --wal-segsize
 command_fails_like(
@@ -215,7 +215,7 @@ push @cmd,
   sprintf("%d,%d", hex($files[0]) == 0 ? 3 : hex($files[0]), hex($files[-1]));
 
 @files = get_slru_files('pg_multixact/offsets');
-$mult = 32 * $blcksz / 4;
+$mult = 32 * $blcksz / 8;
 # --multixact-ids argument is "new,old"
 push @cmd,
   '--multixact-ids' => sprintf("%d,%d",
diff --git a/src/bin/pg_upgrade/Makefile b/src/bin/pg_upgrade/Makefile
index 69fcf593cae..42995d53b0b 100644
--- a/src/bin/pg_upgrade/Makefile
+++ b/src/bin/pg_upgrade/Makefile
@@ -18,11 +18,14 @@ OBJS = \
 	file.o \
 	function.o \
 	info.o \
+	multixact_new.o \
+	multixact_old.o \
 	option.o \
 	parallel.o \
 	pg_upgrade.o \
 	relfilenumber.o \
 	server.o \
+	slru_io.o \
 	tablespace.o \
 	task.o \
 	util.o \
diff --git a/src/bin/pg_upgrade/meson.build b/src/bin/pg_upgrade/meson.build
index ac992f0d14b..fff0db3b560 100644
--- a/src/bin/pg_upgrade/meson.build
+++ b/src/bin/pg_upgrade/meson.build
@@ -8,11 +8,14 @@ pg_upgrade_sources = files(
   'file.c',
   'function.c',
   'info.c',
+  'multixact_new.c',
+  'multixact_old.c',
   'option.c',
   'parallel.c',
   'pg_upgrade.c',
   'relfilenumber.c',
   'server.c',
+  'slru_io.c',
   'tablespace.c',
   'task.c',
   'util.c',
@@ -47,6 +50,7 @@ tests += {
       't/004_subscription.pl',
       't/005_char_signedness.pl',
       't/006_transfer_modes.pl',
+      't/007_multixact_conversion.pl',
     ],
     'test_kwargs': {'priority': 40}, # pg_upgrade tests are slow
   },
diff --git a/src/bin/pg_upgrade/multixact_new.c b/src/bin/pg_upgrade/multixact_new.c
new file mode 100644
index 00000000000..0ee08f07b07
--- /dev/null
+++ b/src/bin/pg_upgrade/multixact_new.c
@@ -0,0 +1,101 @@
+/*
+ * multixact_new.c
+ *
+ * Functions to write multixact SLRUs in the current format with 64-bit
+ * MultiXactOffsets, used since PostgreSQL version 19.
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ * src/bin/pg_upgrade/multixact_new.c
+ */
+
+#include "postgres_fe.h"
+
+#include "access/multixact_internal.h"
+#include "multixact_new.h"
+
+MultiXactWriter *
+AllocMultiXactWrite(const char *pgdata, MultiXactId firstMulti,
+					MultiXactOffset firstOffset)
+{
+	MultiXactWriter *state = pg_malloc(sizeof(*state));
+	char		dir[MAXPGPATH] = {0};
+
+	pg_sprintf(dir, "%s/pg_multixact/offsets", pgdata);
+	state->offset = AllocSlruWrite(dir, false);
+	SlruWriteSwitchPage(state->offset, MultiXactIdToOffsetPage(firstMulti));
+
+	pg_sprintf(dir, "%s/pg_multixact/members", pgdata);
+	state->members = AllocSlruWrite(dir, true /* use long segment names */ );
+	SlruWriteSwitchPage(state->members, MXOffsetToMemberPage(firstOffset));
+
+	return state;
+}
+
+/*
+ * Write a new multixact with members.
+ *
+ * Simplified version of the corresponding server function, hence the name.
+ */
+void
+RecordNewMultiXact(MultiXactWriter *state, MultiXactOffset offset,
+				   MultiXactId multi, int nmembers, MultiXactMember *members)
+{
+	int64		pageno;
+	int64		prev_pageno;
+	int			entryno;
+	char	   *buf;
+	MultiXactOffset *offptr;
+
+	pageno = MultiXactIdToOffsetPage(multi);
+	entryno = MultiXactIdToOffsetEntry(multi);
+
+	/* Store the offset */
+	buf = SlruWriteSwitchPage(state->offset, pageno);
+	offptr = (MultiXactOffset *) buf;
+	offptr[entryno] = offset;
+
+	/* Store the members */
+	prev_pageno = -1;
+	for (int i = 0; i < nmembers; i++, offset++)
+	{
+		TransactionId *memberptr;
+		uint32	   *flagsptr;
+		uint32		flagsval;
+		int			bshift;
+		int			flagsoff;
+		int			memberoff;
+
+		Assert(members[i].status <= MultiXactStatusUpdate);
+
+		pageno = MXOffsetToMemberPage(offset);
+		memberoff = MXOffsetToMemberOffset(offset);
+		flagsoff = MXOffsetToFlagsOffset(offset);
+		bshift = MXOffsetToFlagsBitShift(offset);
+
+		if (pageno != prev_pageno)
+		{
+			buf = SlruWriteSwitchPage(state->members, pageno);
+			prev_pageno = pageno;
+		}
+
+		memberptr = (TransactionId *) (buf + memberoff);
+
+		*memberptr = members[i].xid;
+
+		flagsptr = (uint32 *) (buf + flagsoff);
+
+		flagsval = *flagsptr;
+		flagsval &= ~(((1 << MXACT_MEMBER_BITS_PER_XACT) - 1) << bshift);
+		flagsval |= (members[i].status << bshift);
+		*flagsptr = flagsval;
+	}
+}
+
+void
+FreeMultiXactWrite(MultiXactWriter *state)
+{
+	FreeSlruWrite(state->offset);
+	FreeSlruWrite(state->members);
+
+	pg_free(state);
+}
diff --git a/src/bin/pg_upgrade/multixact_new.h b/src/bin/pg_upgrade/multixact_new.h
new file mode 100644
index 00000000000..28fe761b0f6
--- /dev/null
+++ b/src/bin/pg_upgrade/multixact_new.h
@@ -0,0 +1,28 @@
+/*
+ * multixact_new.h
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ * src/bin/pg_upgrade/multixact_new.h
+ */
+#ifndef MULTIXACT_NEW_H
+#define MULTIXACT_NEW_H
+
+#include "access/multixact.h"
+
+#include "slru_io.h"
+
+typedef struct MultiXactWriter
+{
+	SlruSegState *offset;
+	SlruSegState *members;
+} MultiXactWriter;
+
+extern MultiXactWriter *AllocMultiXactWrite(const char *pgdata,
+											MultiXactId firstMulti,
+											MultiXactOffset firstOffset);
+extern void RecordNewMultiXact(MultiXactWriter *state, MultiXactOffset offset,
+							   MultiXactId multi, int nmembers,
+							   MultiXactMember *members);
+extern void FreeMultiXactWrite(MultiXactWriter *writer);
+
+#endif							/* MULTIXACT_NEW_H */
diff --git a/src/bin/pg_upgrade/multixact_old.c b/src/bin/pg_upgrade/multixact_old.c
new file mode 100644
index 00000000000..529eeeb93b6
--- /dev/null
+++ b/src/bin/pg_upgrade/multixact_old.c
@@ -0,0 +1,302 @@
+/*
+ * multixact_old.c
+ *
+ * Functions to read pre-v19 multixact SLRUs.
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ * src/bin/pg_upgrade/multixact_old.c
+ */
+
+#include "postgres_fe.h"
+
+#include "multixact_old.h"
+#include "pg_upgrade.h"
+
+/*
+ * NOTE: below are a bunch of definitions that are copy-pasted from
+ * multixact.c from version 18.  The only difference is that we use the 32-bit
+ * OldMultiXactOffset type instead of MultiXactOffset, which became 64 bits
+ * wide in version 19.
+ */
+
+/* We need four bytes per offset */
+#define MULTIXACT_OFFSETS_PER_PAGE (BLCKSZ / sizeof(OldMultiXactOffset))
+
+static inline int64
+MultiXactIdToOffsetPage(MultiXactId multi)
+{
+	return multi / MULTIXACT_OFFSETS_PER_PAGE;
+}
+
+static inline int
+MultiXactIdToOffsetEntry(MultiXactId multi)
+{
+	return multi % MULTIXACT_OFFSETS_PER_PAGE;
+}
+
+/*
+ * The situation for members is a bit more complex: we store one byte of
+ * additional flag bits for each TransactionId.  To do this without getting
+ * into alignment issues, we store four bytes of flags, and then the
+ * corresponding 4 Xids.  Each such 5-word (20-byte) set we call a "group", and
+ * are stored as a whole in pages.  Thus, with 8kB BLCKSZ, we keep 409 groups
+ * per page.  This wastes 12 bytes per page, but that's OK -- simplicity (and
+ * performance) trumps space efficiency here.
+ *
+ * Note that the "offset" macros work with byte offset, not array indexes, so
+ * arithmetic must be done using "char *" pointers.
+ */
+/* We need eight bits per xact, so one xact fits in a byte */
+#define MXACT_MEMBER_BITS_PER_XACT			8
+#define MXACT_MEMBER_FLAGS_PER_BYTE			1
+#define MXACT_MEMBER_XACT_BITMASK	((1 << MXACT_MEMBER_BITS_PER_XACT) - 1)
+
+/* how many full bytes of flags are there in a group? */
+#define MULTIXACT_FLAGBYTES_PER_GROUP		4
+#define MULTIXACT_MEMBERS_PER_MEMBERGROUP	\
+	(MULTIXACT_FLAGBYTES_PER_GROUP * MXACT_MEMBER_FLAGS_PER_BYTE)
+/* size in bytes of a complete group */
+#define MULTIXACT_MEMBERGROUP_SIZE \
+	(sizeof(TransactionId) * MULTIXACT_MEMBERS_PER_MEMBERGROUP + MULTIXACT_FLAGBYTES_PER_GROUP)
+#define MULTIXACT_MEMBERGROUPS_PER_PAGE (BLCKSZ / MULTIXACT_MEMBERGROUP_SIZE)
+#define MULTIXACT_MEMBERS_PER_PAGE	\
+	(MULTIXACT_MEMBERGROUPS_PER_PAGE * MULTIXACT_MEMBERS_PER_MEMBERGROUP)
+
+/* page in which a member is to be found */
+static inline int64
+MXOffsetToMemberPage(OldMultiXactOffset offset)
+{
+	return offset / MULTIXACT_MEMBERS_PER_PAGE;
+}
+
+/* Location (byte offset within page) of flag word for a given member */
+static inline int
+MXOffsetToFlagsOffset(OldMultiXactOffset offset)
+{
+	OldMultiXactOffset group = offset / MULTIXACT_MEMBERS_PER_MEMBERGROUP;
+	int			grouponpg = group % MULTIXACT_MEMBERGROUPS_PER_PAGE;
+	int			byteoff = grouponpg * MULTIXACT_MEMBERGROUP_SIZE;
+
+	return byteoff;
+}
+
+/* Location (byte offset within page) of TransactionId of given member */
+static inline int
+MXOffsetToMemberOffset(OldMultiXactOffset offset)
+{
+	int			member_in_group = offset % MULTIXACT_MEMBERS_PER_MEMBERGROUP;
+
+	return MXOffsetToFlagsOffset(offset) +
+		MULTIXACT_FLAGBYTES_PER_GROUP +
+		member_in_group * sizeof(TransactionId);
+}
+
+static inline int
+MXOffsetToFlagsBitShift(OldMultiXactOffset offset)
+{
+	int			member_in_group = offset % MULTIXACT_MEMBERS_PER_MEMBERGROUP;
+	int			bshift = member_in_group * MXACT_MEMBER_BITS_PER_XACT;
+
+	return bshift;
+}
+
+/*
+ * Construct reader of old multixacts.
+ *
+ * Returns the malloced state used by all the other calls in this module.
+ */
+OldMultiXactReader *
+AllocOldMultiXactRead(char *pgdata, MultiXactId nextMulti,
+					  OldMultiXactOffset nextOffset)
+{
+	OldMultiXactReader *state = pg_malloc(sizeof(*state));
+	char		dir[MAXPGPATH] = {0};
+
+	state->nextMXact = nextMulti;
+	state->nextOffset = nextOffset;
+
+	pg_sprintf(dir, "%s/pg_multixact/offsets", pgdata);
+	state->offset = AllocSlruRead(dir, false);
+
+	pg_sprintf(dir, "%s/pg_multixact/members", pgdata);
+	state->members = AllocSlruRead(dir, false);
+
+	return state;
+}
+
+/*
+ * This is a simplified version of the GetMultiXactIdMembers() server
+ * function:
+ *
+ * - Only return the updating member, if any.  Upgrade only cares about the
+ *   updaters.  If there is no updating member, return somewhat arbitrarily
+ *   the first locking-only member, because we don't have any way to represent
+ *   "no members".
+ *
+ * - Because there's no concurrent activity, we don't need to worry about
+ *   locking and some corner cases.
+ */
+void
+GetOldMultiXactIdSingleMember(OldMultiXactReader *state, MultiXactId multi,
+							  TransactionId *result, MultiXactStatus *status)
+{
+	MultiXactId nextMXact,
+				tmpMXact;
+	OldMultiXactOffset nextOffset;
+	int64		pageno,
+				prev_pageno;
+	int			entryno,
+				length;
+	char	   *buf;
+	OldMultiXactOffset *offptr,
+				offset;
+	TransactionId result_xid = InvalidTransactionId;
+	bool		result_isupdate = false;
+
+	nextMXact = state->nextMXact;
+	nextOffset = state->nextOffset;
+
+	/*
+	 * See GetMultiXactIdMembers in PostgreSQL v18 multixact.c.
+	 *
+	 * Find out the offset at which we need to start reading MultiXactMembers
+	 * and the number of members in the multixact.  We determine the latter as
+	 * the difference between this multixact's starting offset and the next
+	 * one's.  However, there are some corner cases to worry about:
+	 *
+	 * 1. This multixact may be the latest one created, in which case there is
+	 * no next one to look at.  In this case the nextOffset value we just
+	 * saved is the correct endpoint.
+	 *
+	 * 2. (The next multixact may still be in process of being filled in.)
+	 * This cannot happen during upgrade.
+	 *
+	 * 3. Because GetNewMultiXactId increments offset zero to offset one to
+	 * handle case #2, there is an ambiguity near the point of offset
+	 * wraparound.  If we see next multixact's offset is one, is that our
+	 * multixact's actual endpoint, or did it end at zero with a subsequent
+	 * increment?  We handle this using the knowledge that if the zero'th
+	 * member slot wasn't filled, it'll contain zero, and zero isn't a valid
+	 * transaction ID so it can't be a multixact member.  Therefore, if we
+	 * read a zero from the members array, just ignore it.
+	 */
+
+	pageno = MultiXactIdToOffsetPage(multi);
+	entryno = MultiXactIdToOffsetEntry(multi);
+
+	buf = SlruReadSwitchPage(state->offset, pageno);
+	offptr = (OldMultiXactOffset *) buf;
+	offptr += entryno;
+	offset = *offptr;
+
+	Assert(offset != 0);
+
+	/*
+	 * Use the same increment rule as GetNewMultiXactId(), that is, don't
+	 * handle wraparound explicitly until needed.
+	 */
+	tmpMXact = multi + 1;
+
+	if (nextMXact == tmpMXact)
+	{
+		/* Corner case 1: there is no next multixact */
+		length = nextOffset - offset;
+	}
+	else
+	{
+		OldMultiXactOffset nextMXOffset;
+
+		/* handle wraparound if needed */
+		if (tmpMXact < FirstMultiXactId)
+			tmpMXact = FirstMultiXactId;
+
+		prev_pageno = pageno;
+
+		pageno = MultiXactIdToOffsetPage(tmpMXact);
+		entryno = MultiXactIdToOffsetEntry(tmpMXact);
+
+		if (pageno != prev_pageno)
+			buf = SlruReadSwitchPage(state->offset, pageno);
+
+		offptr = (OldMultiXactOffset *) buf;
+		offptr += entryno;
+		nextMXOffset = *offptr;
+
+		/*
+		 * Corner case 2: next multixact is still being filled in, this cannot
+		 * happen during upgrade.
+		 */
+		Assert(nextMXOffset != 0);
+
+		length = nextMXOffset - offset;
+	}
+
+	prev_pageno = -1;
+	for (int i = 0; i < length; i++, offset++)
+	{
+		TransactionId *xactptr;
+		uint32	   *flagsptr;
+		int			flagsoff;
+		int			bshift;
+		int			memberoff;
+		MultiXactStatus status;
+
+		pageno = MXOffsetToMemberPage(offset);
+		memberoff = MXOffsetToMemberOffset(offset);
+
+		if (pageno != prev_pageno)
+		{
+			buf = SlruReadSwitchPage(state->members, pageno);
+			prev_pageno = pageno;
+		}
+
+		xactptr = (TransactionId *) (buf + memberoff);
+		if (!TransactionIdIsValid(*xactptr))
+		{
+			/* Corner case 3: we must be looking at unused slot zero */
+			Assert(offset == 0);
+			continue;
+		}
+
+		flagsoff = MXOffsetToFlagsOffset(offset);
+		bshift = MXOffsetToFlagsBitShift(offset);
+		flagsptr = (uint32 *) (buf + flagsoff);
+
+		status = (*flagsptr >> bshift) & MXACT_MEMBER_XACT_BITMASK;
+
+		/*
+		 * Remember the updating XID among the members, or first locking XID
+		 * if no updating XID.
+		 */
+		if (ISUPDATE_from_mxstatus(status))
+		{
+			/* sanity check */
+			if (result_isupdate)
+				pg_fatal("multixact %u has more than one updating member",
+						 multi);
+			result_xid = *xactptr;
+			result_isupdate = true;
+		}
+		else if (!TransactionIdIsValid(result_xid))
+			result_xid = *xactptr;
+	}
+
+	/* A multixid with zero members should not happen */
+	Assert(TransactionIdIsValid(result_xid));
+
+	*result = result_xid;
+	*status = result_isupdate ? MultiXactStatusUpdate :
+		MultiXactStatusForKeyShare;
+}
+
+/*
+ * Frees the malloced reader.
+ */
+void
+FreeOldMultiXactReader(OldMultiXactReader *state)
+{
+	FreeSlruRead(state->offset);
+	FreeSlruRead(state->members);
+
+	pg_free(state);
+}
diff --git a/src/bin/pg_upgrade/multixact_old.h b/src/bin/pg_upgrade/multixact_old.h
new file mode 100644
index 00000000000..4f9e086a1fb
--- /dev/null
+++ b/src/bin/pg_upgrade/multixact_old.h
@@ -0,0 +1,38 @@
+/*
+ * multixact_old.h
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ * src/bin/pg_upgrade/multixact_old.h
+ */
+#ifndef MULTIXACT_OLD_H
+#define MULTIXACT_OLD_H
+
+#include "access/multixact.h"
+#include "slru_io.h"
+
+/*
+ * MultiXactOffset changed from uint32 to uint64 between versions 18 and 19.
+ * OldMultiXactOffset is used to represent a 32-bit offset from the old
+ * cluster.
+ */
+typedef uint32 OldMultiXactOffset;
+
+typedef struct OldMultiXactReader
+{
+	MultiXactId nextMXact;
+	OldMultiXactOffset nextOffset;
+
+	SlruSegState *offset;
+	SlruSegState *members;
+} OldMultiXactReader;
+
+extern OldMultiXactReader *AllocOldMultiXactRead(char *pgdata,
+												 MultiXactId nextMulti,
+												 OldMultiXactOffset nextOffset);
+extern void GetOldMultiXactIdSingleMember(OldMultiXactReader *state,
+										  MultiXactId multi,
+										  TransactionId *result,
+										  MultiXactStatus *status);
+extern void FreeOldMultiXactReader(OldMultiXactReader *reader);
+
+#endif							/* MULTIXACT_OLD_H */
diff --git a/src/bin/pg_upgrade/pg_upgrade.c b/src/bin/pg_upgrade/pg_upgrade.c
index 490e98fa26f..ff937b9e104 100644
--- a/src/bin/pg_upgrade/pg_upgrade.c
+++ b/src/bin/pg_upgrade/pg_upgrade.c
@@ -48,6 +48,8 @@
 #include "common/logging.h"
 #include "common/restricted_token.h"
 #include "fe_utils/string_utils.h"
+#include "multixact_old.h"
+#include "multixact_new.h"
 #include "pg_upgrade.h"
 
 /*
@@ -62,6 +64,7 @@ static void set_locale_and_encoding(void);
 static void prepare_new_cluster(void);
 static void prepare_new_globals(void);
 static void create_new_objects(void);
+static MultiXactOffset convert_multixacts(MultiXactId from_multi, MultiXactId to_multi);
 static void copy_xact_xlog_xid(void);
 static void set_frozenxids(bool minmxid_only);
 static void make_outputdirs(char *pgdata);
@@ -769,6 +772,88 @@ copy_subdir_files(const char *old_subdir, const char *new_subdir)
 	check_ok();
 }
 
+/*
+ * Convert pg_multixact/offsets and /members from the old format with 32-bit
+ * offsets.
+ *
+ * Multixids in the range [from_multi, to_multi) are read from the old pre-v19
+ * cluster, and written in the new format.  An important edge case is that if
+ * from_multi == to_multi, this initializes the new files in the new format
+ * without trying to open any old files.  (We rely on that when upgrading from
+ * PostgreSQL version 9.2 or below.)
+ *
+ * Returns the new nextOffset value; the caller should set it in the new
+ * control file.  The new members always start from offset 1, regardless of
+ * the offset range used in the old cluster.
+ */
+static MultiXactOffset
+convert_multixacts(MultiXactId from_multi, MultiXactId to_multi)
+{
+	MultiXactId oldest_multi,
+				next_multi;
+	MultiXactWriter *new_writer;
+	MultiXactOffset next_offset;
+
+	/*
+	 * The range of valid multi XIDs is unchanged by the conversion (they are
+	 * referenced from the heap tables), but the members SLRU is rewritten to
+	 * start from offset 1.
+	 */
+	oldest_multi = from_multi;
+	next_multi = to_multi;
+	next_offset = 1;
+
+	new_writer = AllocMultiXactWrite(new_cluster.pgdata,
+									 oldest_multi, next_offset);
+
+	/*
+	 * Convert old multixids, if needed, by reading them one-by-one from the
+	 * old cluster.
+	 */
+	if (to_multi != from_multi)
+	{
+		OldMultiXactReader *old_reader;
+
+		old_reader = AllocOldMultiXactRead(old_cluster.pgdata,
+										   old_cluster.controldata.chkpnt_nxtmulti,
+										   old_cluster.controldata.chkpnt_nxtmxoff);
+
+		for (MultiXactId multi = oldest_multi; multi != next_multi;)
+		{
+			TransactionId xid;
+			MultiXactStatus status;
+			MultiXactMember member;
+
+			/*
+			 * Read this multixid's members.  Locking-only XIDs that may be
+			 * part of multi-xids don't matter after upgrade, as there can be
+			 * no transactions running across upgrade.  So as a small
+			 * optimization, we only read one member from each multixid: the
+			 * updating one, or if there was no update, arbitrarily the
+			 * first locking xid.
+			 */
+			GetOldMultiXactIdSingleMember(old_reader, multi, &xid, &status);
+
+			/* Write it out in new format */
+			member.xid = xid;
+			member.status = status;
+			RecordNewMultiXact(new_writer, next_offset, multi, 1, &member);
+
+			next_offset += 1;
+			multi++;
+			/* handle wraparound */
+			if (multi < FirstMultiXactId)
+				multi = FirstMultiXactId;
+		}
+		FreeOldMultiXactReader(old_reader);
+	}
+
+	/* Release resources */
+	FreeMultiXactWrite(new_writer);
+
+	return next_offset;
+}
+
 static void
 copy_xact_xlog_xid(void)
 {
@@ -807,15 +892,15 @@ copy_xact_xlog_xid(void)
 			  new_cluster.pgdata);
 	check_ok();
 
-	/*
-	 * If the old server is before the MULTIXACT_FORMATCHANGE_CAT_VER change
-	 * (see pg_upgrade.h) and the new server is after, then we don't copy
-	 * pg_multixact files, but we need to reset pg_control so that the new
-	 * server doesn't attempt to read multis older than the cutoff value.
-	 */
-	if (old_cluster.controldata.cat_ver >= MULTIXACT_FORMATCHANGE_CAT_VER &&
-		new_cluster.controldata.cat_ver >= MULTIXACT_FORMATCHANGE_CAT_VER)
+	/* Copy or convert pg_multixact files */
+	Assert(new_cluster.controldata.cat_ver >= MULTIXACT_FORMATCHANGE_CAT_VER);
+	Assert(new_cluster.controldata.cat_ver >= MULTIXACTOFFSET_FORMATCHANGE_CAT_VER);
+	if (old_cluster.controldata.cat_ver >= MULTIXACTOFFSET_FORMATCHANGE_CAT_VER)
 	{
+		/* No change in multixact format, just copy the files */
+		MultiXactId new_nxtmulti = old_cluster.controldata.chkpnt_nxtmulti;
+		MultiXactOffset new_nxtmxoff = old_cluster.controldata.chkpnt_nxtmxoff;
+
 		copy_subdir_files("pg_multixact/offsets", "pg_multixact/offsets");
 		copy_subdir_files("pg_multixact/members", "pg_multixact/members");
 
@@ -826,38 +911,64 @@ copy_xact_xlog_xid(void)
 		 * counters here and the oldest multi present on system.
 		 */
 		exec_prog(UTILITY_LOG_FILE, NULL, true, true,
-				  "\"%s/pg_resetwal\" -O %u -m %u,%u \"%s\"",
-				  new_cluster.bindir,
-				  old_cluster.controldata.chkpnt_nxtmxoff,
-				  old_cluster.controldata.chkpnt_nxtmulti,
+				  "\"%s/pg_resetwal\" -O %" PRIu64 " -m %u,%u \"%s\"",
+				  new_cluster.bindir, new_nxtmxoff, new_nxtmulti,
 				  old_cluster.controldata.chkpnt_oldstMulti,
 				  new_cluster.pgdata);
 		check_ok();
 	}
-	else if (new_cluster.controldata.cat_ver >= MULTIXACT_FORMATCHANGE_CAT_VER)
+	else
 	{
+		/* Conversion is needed */
+		MultiXactId nxtmulti;
+		MultiXactId oldstMulti;
+		MultiXactOffset nxtmxoff;
+
 		/*
-		 * Remove offsets/0000 file created by initdb that no longer matches
-		 * the new multi-xid value.  "members" starts at zero so no need to
-		 * remove it.
+		 * Determine the range of multixacts to convert.
 		 */
-		remove_new_subdir("pg_multixact/offsets", false);
+		nxtmulti = old_cluster.controldata.chkpnt_nxtmulti;
+		if (old_cluster.controldata.cat_ver >= MULTIXACT_FORMATCHANGE_CAT_VER)
+			oldstMulti = old_cluster.controldata.chkpnt_oldstMulti;
+		else
+		{
+			/*
+			 * In PostgreSQL 9.2 and below, multitransactions were only used
+			 * for row locking, and as such don't need to be preserved during
+			 * upgrade.  In that case, we utilize convert_multixacts() just to
+			 * initialize new, empty files in the new format.
+			 *
+			 * It's important that the oldest multi is set to the latest value
+			 * used by the old system, so that multixact.c returns the empty
+			 * set for multis that might be present on disk.
+			 */
+			oldstMulti = nxtmulti;
+		}
+		/* handle wraparound */
+		if (nxtmulti < FirstMultiXactId)
+			nxtmulti = FirstMultiXactId;
+		if (oldstMulti < FirstMultiXactId)
+			oldstMulti = FirstMultiXactId;
 
-		prep_status("Setting oldest multixact ID in new cluster");
+		/*
+		 * Remove the files created by initdb in the new cluster.
+		 * convert_multixacts() will create new ones.
+		 */
+		remove_new_subdir("pg_multixact/members", false);
+		remove_new_subdir("pg_multixact/offsets", false);
 
 		/*
-		 * We don't preserve files in this case, but it's important that the
-		 * oldest multi is set to the latest value used by the old system, so
-		 * that multixact.c returns the empty set for multis that might be
-		 * present on disk.  We set next multi to the value following that; it
-		 * might end up wrapped around (i.e. 0) if the old cluster had
-		 * next=MaxMultiXactId, but multixact.c can cope with that just fine.
+		 * Create new pg_multixact files, converting old ones if needed.
 		 */
+		prep_status("Converting pg_multixact files");
+		nxtmxoff = convert_multixacts(oldstMulti, nxtmulti);
+		check_ok();
+
+		prep_status("Setting next multixact ID and offset for new cluster");
 		exec_prog(UTILITY_LOG_FILE, NULL, true, true,
-				  "\"%s/pg_resetwal\" -m %u,%u \"%s\"",
+				  "\"%s/pg_resetwal\" -O %" PRIu64 " -m %u,%u \"%s\"",
 				  new_cluster.bindir,
-				  old_cluster.controldata.chkpnt_nxtmulti + 1,
-				  old_cluster.controldata.chkpnt_nxtmulti,
+				  nxtmxoff, nxtmulti, oldstMulti,
 				  new_cluster.pgdata);
 		check_ok();
 	}
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index e86336f4be9..d5fc446d1ad 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -114,6 +114,13 @@ extern char *output_files[];
  */
 #define MULTIXACT_FORMATCHANGE_CAT_VER 201301231
 
+/*
+ * MultiXactOffset was changed from 32-bit to 64-bit in version 19, at this
+ * catalog version.  pg_multixact files need to be converted when upgrading
+ * across this version.
+ */
+#define MULTIXACTOFFSET_FORMATCHANGE_CAT_VER 999999999
+
 /*
  * large object chunk size added to pg_controldata,
  * commit 5f93c37805e7485488480916b4585e098d3cc883
diff --git a/src/bin/pg_upgrade/slru_io.c b/src/bin/pg_upgrade/slru_io.c
new file mode 100644
index 00000000000..010094184be
--- /dev/null
+++ b/src/bin/pg_upgrade/slru_io.c
@@ -0,0 +1,242 @@
+/*
+ * slru_io.c
+ *
+ * Routines for reading and writing SLRU files during upgrade.
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ * src/bin/pg_upgrade/slru_io.c
+ */
+
+#include "postgres_fe.h"
+
+#include <fcntl.h>
+
+#include "common/fe_memutils.h"
+#include "common/file_perm.h"
+#include "common/file_utils.h"
+#include "port/pg_iovec.h"
+#include "pg_upgrade.h"
+#include "slru_io.h"
+
+static SlruSegState *AllocSlruSegState(const char *dir);
+static char *SlruFileName(SlruSegState *state, int64 segno);
+static void SlruFlush(SlruSegState *state);
+
+static SlruSegState *
+AllocSlruSegState(const char *dir)
+{
+	SlruSegState *state = pg_malloc(sizeof(*state));
+
+	state->dir = pstrdup(dir);
+	state->fn = NULL;
+	state->fd = -1;
+	state->segno = -1;
+	state->pageno = 0;
+
+	return state;
+}
+
+/* similar to the backend function with the same name */
+static char *
+SlruFileName(SlruSegState *state, int64 segno)
+{
+	if (state->long_segment_names)
+	{
+		Assert(segno >= 0 && segno <= INT64CONST(0xFFFFFFFFFFFFFFF));
+		return psprintf("%s/%015" PRIX64, state->dir, segno);
+	}
+	else
+	{
+		Assert(segno >= 0 && segno <= INT64CONST(0xFFFFFF));
+		return psprintf("%s/%04X", state->dir, (unsigned int) segno);
+	}
+}
+
+/*
+ * Create slru reader for dir.
+ *
+ * Returns the malloced memory used by all other read calls in this module.
+ */
+SlruSegState *
+AllocSlruRead(const char *dir, bool long_segment_names)
+{
+	SlruSegState *state = AllocSlruSegState(dir);
+
+	state->writing = false;
+	state->long_segment_names = long_segment_names;
+
+	return state;
+}
+
+/*
+ * Open given page for reading.
+ *
+ * Reading can be done in random order.
+ */
+char *
+SlruReadSwitchPageSlow(SlruSegState *state, uint64 pageno)
+{
+	int64		segno;
+
+	Assert(!state->writing);	/* read only mode */
+
+	if (state->segno != -1 && pageno == state->pageno)
+		return state->buf.data;
+
+	segno = pageno / SLRU_PAGES_PER_SEGMENT;
+	if (segno != state->segno)
+	{
+		if (state->segno != -1)
+		{
+			close(state->fd);
+			state->fd = -1;
+
+			pg_free(state->fn);
+			state->fn = NULL;
+
+			state->segno = -1;
+		}
+
+		/* Open new segment */
+		state->fn = SlruFileName(state, segno);
+		if ((state->fd = open(state->fn, O_RDONLY | PG_BINARY, 0)) < 0)
+			pg_fatal("could not open file \"%s\": %m", state->fn);
+	}
+
+	state->segno = segno;
+
+	{
+		struct iovec iovec = {
+			.iov_base = &state->buf,
+			.iov_len = BLCKSZ,
+		};
+		off_t		offset = (pageno % SLRU_PAGES_PER_SEGMENT) * BLCKSZ;
+
+		if (pg_preadv(state->fd, &iovec, 1, offset) < 0)
+			pg_fatal("could not read file \"%s\": %m", state->fn);
+
+		state->pageno = pageno;
+	}
+
+	return state->buf.data;
+}
+
+/*
+ * Frees the malloced reader.
+ */
+void
+FreeSlruRead(SlruSegState *state)
+{
+	Assert(!state->writing);	/* read only mode */
+
+	if (state->fd != -1)
+		close(state->fd);
+	pg_free(state);
+}
+
+/*
+ * Create slru writer for dir.
+ *
+ * Returns the malloced memory used by all other write calls in this module.
+ */
+SlruSegState *
+AllocSlruWrite(const char *dir, bool long_segment_names)
+{
+	SlruSegState *state = AllocSlruSegState(dir);
+
+	state->writing = true;
+	state->long_segment_names = long_segment_names;
+
+	return state;
+}
+
+/*
+ * Open the given page for writing.
+ *
+ * NOTE: This uses O_EXCL when stepping to a new segment, so this assumes that
+ * each segment is written in full before moving on to next one.  This
+ * limitation would be easy to lift if needed, but it fits the usage pattern of
+ * current callers.
+ */
+char *
+SlruWriteSwitchPageSlow(SlruSegState *state, uint64 pageno)
+{
+	int64		segno;
+	off_t		offset;
+
+	if (state->segno != -1 && pageno == state->pageno)
+		return state->buf.data;
+
+	segno = pageno / SLRU_PAGES_PER_SEGMENT;
+	offset = (pageno % SLRU_PAGES_PER_SEGMENT) * BLCKSZ;
+
+	SlruFlush(state);
+	memset(state->buf.data, 0, BLCKSZ);
+
+	if (segno != state->segno)
+	{
+		if (state->segno != -1)
+		{
+			close(state->fd);
+			state->fd = -1;
+
+			pg_free(state->fn);
+			state->fn = NULL;
+
+			state->segno = -1;
+		}
+
+		/* Create the segment */
+		state->fn = SlruFileName(state, segno);
+		if ((state->fd = open(state->fn, O_RDWR | O_CREAT | O_EXCL | PG_BINARY,
+							  pg_file_create_mode)) < 0)
+		{
+			pg_fatal("could not create file \"%s\": %m", state->fn);
+		}
+
+		state->segno = segno;
+
+		if (offset > 0)
+		{
+			if (pg_pwrite_zeros(state->fd, offset, 0) < 0)
+				pg_fatal("could not write file \"%s\": %m", state->fn);
+		}
+	}
+
+	state->pageno = pageno;
+
+	return state->buf.data;
+}
+
+static void
+SlruFlush(SlruSegState *state)
+{
+	struct iovec iovec = {
+		.iov_base = &state->buf,
+		.iov_len = BLCKSZ,
+	};
+	off_t		offset;
+
+	if (state->segno == -1)
+		return;
+
+	offset = (state->pageno % SLRU_PAGES_PER_SEGMENT) * BLCKSZ;
+
+	if (pg_pwritev_with_retry(state->fd, &iovec, 1, offset) < 0)
+		pg_fatal("could not write file \"%s\": %m", state->fn);
+}
+
+/*
+ * Frees the malloced writer.
+ */
+void
+FreeSlruWrite(SlruSegState *state)
+{
+	Assert(state->writing);
+
+	SlruFlush(state);
+
+	if (state->fd != -1)
+		close(state->fd);
+	pg_free(state);
+}
diff --git a/src/bin/pg_upgrade/slru_io.h b/src/bin/pg_upgrade/slru_io.h
new file mode 100644
index 00000000000..5c80a679b4d
--- /dev/null
+++ b/src/bin/pg_upgrade/slru_io.h
@@ -0,0 +1,52 @@
+/*
+ * slru_io.h
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ * src/bin/pg_upgrade/slru_io.h
+ */
+
+#ifndef SLRU_IO_H
+#define SLRU_IO_H
+
+/*
+ * State for reading or writing an SLRU, with a one page buffer.
+ */
+typedef struct SlruSegState
+{
+	bool		writing;
+	bool		long_segment_names;
+
+	char	   *dir;
+	char	   *fn;
+	int			fd;
+	int64		segno;
+	uint64		pageno;
+
+	PGAlignedBlock buf;
+} SlruSegState;
+
+extern SlruSegState *AllocSlruRead(const char *dir, bool long_segment_names);
+extern char *SlruReadSwitchPageSlow(SlruSegState *state, uint64 pageno);
+extern void FreeSlruRead(SlruSegState *state);
+
+static inline char *
+SlruReadSwitchPage(SlruSegState *state, uint64 pageno)
+{
+	if (state->segno != -1 && pageno == state->pageno)
+		return state->buf.data;
+	return SlruReadSwitchPageSlow(state, pageno);
+}
+
+extern SlruSegState *AllocSlruWrite(const char *dir, bool long_segment_names);
+extern char *SlruWriteSwitchPageSlow(SlruSegState *state, uint64 pageno);
+extern void FreeSlruWrite(SlruSegState *state);
+
+static inline char *
+SlruWriteSwitchPage(SlruSegState *state, uint64 pageno)
+{
+	if (state->segno != -1 && pageno == state->pageno)
+		return state->buf.data;
+	return SlruWriteSwitchPageSlow(state, pageno);
+}
+
+#endif							/* SLRU_IO_H */
diff --git a/src/bin/pg_upgrade/t/007_multixact_conversion.pl b/src/bin/pg_upgrade/t/007_multixact_conversion.pl
new file mode 100644
index 00000000000..fe8da9aded2
--- /dev/null
+++ b/src/bin/pg_upgrade/t/007_multixact_conversion.pl
@@ -0,0 +1,329 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+# Version 19 expanded MultiXactOffset from 32 to 64 bits. Upgrading
+# across that requires rewriting the SLRU files to the new format.
+# This file contains tests for the conversion.
+#
+# To run, set 'oldinstall' ENV variable to point to a pre-v19
+# installation. If it's not set, or if it points to a v19 or above
+# installation, this still performs a very basic test, upgrading a
+# cluster with some multixacts. It's not very interesting, however,
+# because there's no conversion involved in that case.
+
+use strict;
+use warnings FATAL => 'all';
+
+use Math::BigInt;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Temp dir for dumps.
+my $tempdir = PostgreSQL::Test::Utils::tempdir;
+
+# A workload that consumes multixids. The purpose of this is to
+# generate some multixids in the old cluster, so that we can test
+# upgrading them. The workload is a mix of KEY SHARE locking queries
+# and UPDATEs, and commits and aborts. It consumes around 3000
+# multixids with 30000 members. That's enough to span more than one
+# multixids 'offsets' page, and more than one 'members' segment.
+#
+# The workload leaves behind a table called 'mxofftest' containing a
+# small number of rows referencing some of the generated multixids.
+#
+# Because this function is used to generate test data on the old
+# installation, it needs to work with older PostgreSQL server
+# versions.
+#
+# The first argument is the cluster to connect to, the second argument
+# is a cluster using the new version. We need the 'psql' binary from
+# the new version, the new cluster is otherwise unused. (We need to
+# use the new 'psql' because some of the more advanced background psql
+# perl module features depend on a fairly recent psql version.)
+sub mxact_workload
+{
+	my $node = shift;       # Cluster to connect to
+	my $binnode = shift;    # Use the psql binary from this cluster
+
+	my $connstr = $node->connstr('postgres');
+
+	$node->start;
+	$node->safe_psql('postgres', qq[
+		CREATE TABLE mxofftest (id INT PRIMARY KEY, n_updated INT)
+		  WITH (AUTOVACUUM_ENABLED=FALSE);
+		INSERT INTO mxofftest SELECT G, 0 FROM GENERATE_SERIES(1, 50) G;
+	]);
+
+	my $nclients = 20;
+	my $update_every = 13;
+	my $abort_every = 11;
+	my @connections = ();
+
+	# Open multiple connections to the database. Start a transaction
+	# in each connection.
+	for (0 .. $nclients)
+	{
+		# Use the psql binary from the new installation. The
+		# BackgroundPsql functionality doesn't work with older psql
+		# versions.
+		my $conn = $binnode->background_psql('',
+			connstr => $node->connstr('postgres'));
+		$conn->query_safe("SET enable_seqscan=off");
+		$conn->query_safe("BEGIN");
+
+		push(@connections, $conn);
+	}
+
+	# Run queries, cycling through the connections in a
+	# round-robin fashion. We keep a transaction open in each
+	# connection at all times, and lock/update the rows. With 10
+	# connections, each SELECT FOR KEY SHARE query generates a new
+	# multixid, containing the 10 XIDs of all the transactions running
+	# at the time.
+	for (my $i = 0; $i < 3000; $i++)
+	{
+		my $conn = $connections[ $i % $nclients ];
+
+		my $sql;
+		if ($i % $abort_every == 0)
+		{
+			$sql = "ABORT; ";
+		}
+		else
+		{
+			$sql = "COMMIT; ";
+		}
+		$sql .= "BEGIN; ";
+
+		if ($i % $update_every == 0)
+		{
+			$sql .= qq[
+			  UPDATE mxofftest SET n_updated = n_updated + 1 WHERE id = ${i} % 50;
+			];
+		}
+		else
+		{
+			my $threshold = int($i / 3000 * 50);
+			$sql .= qq[
+			  select count(*) from (
+				SELECT * FROM mxofftest WHERE id >= $threshold FOR KEY SHARE
+			  ) as x
+			];
+		}
+		$conn->query_safe($sql);
+	}
+
+	for my $conn (@connections)
+	{
+		$conn->quit();
+	}
+
+	$node->stop;
+	return;
+}
+
+# Read NextMultiOffset from the control file
+#
+# Note: This is used on both the old and the new installation, so the
+# command arguments and the output parsing used here must work with
+# all PostgreSQL versions supported by the test.
+sub read_next_mxoff
+{
+	my $node = shift;
+
+	my $pg_controldata_path = $node->installed_command('pg_controldata');
+	my ($stdout, $stderr) =
+	  run_command([ $pg_controldata_path, $node->data_dir ]);
+	$stdout =~ /^Latest checkpoint's NextMultiOffset:\s*(.*)$/m
+	  or die "could not read NextMultiOffset from pg_controldata";
+	return $1;
+}
+
+# Reset a cluster's oldest multixact-offset to given offset.
+#
+# Note: This is used on both the old and the new installation, so the
+# command arguments and the output parsing used here must work with
+# all PostgreSQL versions supported by the test.
+sub reset_mxoff_pre_v19
+{
+	my $node = shift;
+	my $offset = shift;
+
+	my $pg_resetwal_path = $node->installed_command('pg_resetwal');
+	# Get block size
+	my ($out, $err) =
+	  run_command([ $pg_resetwal_path, '--dry-run', $node->data_dir ]);
+	$out =~ /^Database block size: *(\d+)$/m or die;
+	my $blcksz = $1;
+	# SLRU_PAGES_PER_SEGMENT is always 32 on pre-19 version
+	my $slru_pages_per_segment = 32;
+
+	# Verify that no multixids are currently in use. Resetting would
+	# destroy them. (A freshly initialized cluster has no multixids.)
+	$out =~ /^Latest checkpoint's NextMultiXactId: *(\d+)$/m or die;
+	my $next_mxid = $1;
+	$out =~ /^Latest checkpoint's oldestMultiXid: *(\d+)$/m or die;
+	my $oldest_mxid = $1;
+	die "cluster has some multixids in use" unless $next_mxid == $oldest_mxid;
+
+	# Reset to new offset using pg_resetwal
+	my @cmd = (
+		$pg_resetwal_path,
+		'--pgdata' => $node->data_dir,
+		'--multixact-offset' => $offset);
+	command_ok(\@cmd, 'set oldest multixact-offset');
+
+	# pg_resetwal just updates the control file. The cluster will
+	# refuse to start up, if the SLRU segment corresponding to the
+	# offset does not exist. Create a dummy segment that covers the
+	# given offset, filled with zeros. But first remove any old
+	# segments.
+	unlink glob $node->data_dir . "/pg_multixact/members/*";
+
+	my $mult = 32 * int($blcksz / 20) * 4;
+	my $segname = sprintf "%04X", $offset / $mult;
+
+	my $path = $node->data_dir . "/pg_multixact/members/" . $segname;
+
+	my $null_block = "\x00" x $blcksz;
+	open(my $dh, '>', $path)
+	  || die "could not open $path for writing $!";
+	for (1 .. $slru_pages_per_segment)
+	{
+		print $dh $null_block;
+	}
+	close($dh);
+}
+
+# Dump contents of the 'mxofftest' table, created by mxact_workload
+sub get_dump_for_comparison
+{
+	my ($node, $file_prefix) = @_;
+
+	my $contents = $node->safe_psql('postgres',
+		"SELECT ctid, xmin, xmax, * FROM mxofftest");
+
+	my $dumpfile = $tempdir . '/' . $file_prefix . '.sql';
+	open(my $dh, '>', $dumpfile)
+	  || die "could not open $dumpfile for writing $!";
+	print $dh $contents;
+	close($dh);
+
+	return $dumpfile;
+}
+
+# Main test workhorse routine.
+# Dump data on old version, run pg_upgrade, compare data after upgrade.
+sub upgrade_and_compare
+{
+	my $tag = shift;
+	my $oldnode = shift;
+	my $newnode = shift;
+
+	command_ok(
+		[
+			'pg_upgrade', '--no-sync',
+			'--old-datadir' => $oldnode->data_dir,
+			'--new-datadir' => $newnode->data_dir,
+			'--old-bindir' => $oldnode->config_data('--bindir'),
+			'--new-bindir' => $newnode->config_data('--bindir'),
+			'--socketdir' => $newnode->host,
+			'--old-port' => $oldnode->port,
+			'--new-port' => $newnode->port,
+		],
+		'run of pg_upgrade for new instance');
+
+	# Note: we do this *after* running pg_upgrade, to ensure that we
+	# don't set all the hint bits before upgrade by doing the SELECT
+	# on the table.
+	$oldnode->start;
+	my $old_dump = get_dump_for_comparison($oldnode, "oldnode_${tag}_dump");
+	$oldnode->stop;
+
+	$newnode->start;
+	my $new_dump = get_dump_for_comparison($newnode, "newnode_${tag}_dump");
+	$newnode->stop;
+
+	compare_files($old_dump, $new_dump,
+		'dump outputs from original and restored regression databases match');
+}
+
+my $old_version;
+
+# Basic scenario: Create a cluster using old installation, run
+# multixid-creating workload on it, then upgrade.
+#
+# This works even if the old and new version is the same,
+# although it's not very interesting as the conversion routines only
+# run when upgrading from a pre-v19 cluster.
+{
+	my $tag = 'basic';
+	my $old =
+	  PostgreSQL::Test::Cluster->new("${tag}_oldnode",
+		install_path => $ENV{oldinstall});
+	my $new = PostgreSQL::Test::Cluster->new("${tag}_newnode");
+
+	$old->init(extra => ['-k']);
+
+	$old_version = $old->pg_version;
+	note "old installation is version $old_version\n";
+
+	# Run the workload
+	my $start_mxoff = read_next_mxoff($old);
+	mxact_workload($old, $new);
+	my $finish_mxoff = read_next_mxoff($old);
+
+	$new->init;
+	upgrade_and_compare($tag, $old, $new);
+
+	my $new_next_mxoff = read_next_mxoff($new);
+
+	note ">>> case #${tag}\n"
+	  . " oldnode mxoff from ${start_mxoff} to ${finish_mxoff}\n"
+	  . " newnode mxoff ${new_next_mxoff}\n";
+}
+
+# Wraparound scenario: This is the same as the basic scenario, but the
+# old cluster goes through mxoffset wraparound.
+#
+# This requires the old installation to be version 18 or older,
+# because the hacks we use to reset the old cluster to a state just
+# before the wraparound rely on the pre-v19 file format. In version
+# 19, offsets no longer wrap around anyway.
+SKIP:
+{
+	skip
+	  "skipping mxoffset conversion tests because upgrading from the old version does not require conversion"
+	  if ($old_version >= '19devel');
+
+	my $tag = 'wraparound';
+	my $old =
+	  PostgreSQL::Test::Cluster->new("${tag}_oldnode",
+		install_path => $ENV{oldinstall});
+	my $new = PostgreSQL::Test::Cluster->new("${tag}_newnode");
+
+	$old->init(extra => ['-k']);
+
+	# Reset the NextMultiOffset value in the old cluster to just before 32-bit wraparound.
+	reset_mxoff_pre_v19($old, 0xFFFFEC77);
+
+	# Run the workload. This crosses the wraparound.
+	my $start_mxoff = read_next_mxoff($old);
+	mxact_workload($old, $new);
+	my $finish_mxoff = read_next_mxoff($old);
+
+	# Verify that wraparound happened.
+	cmp_ok($finish_mxoff, '<', $start_mxoff,
+		"mxoff wrapped around in old cluster");
+
+	$new->init;
+	upgrade_and_compare($tag, $old, $new);
+
+	my $new_next_mxoff = read_next_mxoff($new);
+
+	note ">>> case #${tag}\n"
+	  . " oldnode mxoff from ${start_mxoff} to ${finish_mxoff}\n"
+	  . " newnode mxoff ${new_next_mxoff}\n";
+}
+
+done_testing();
diff --git a/src/include/access/multixact.h b/src/include/access/multixact.h
index 82e4bb90dd5..6433fe16364 100644
--- a/src/include/access/multixact.h
+++ b/src/include/access/multixact.h
@@ -28,8 +28,6 @@
 
 #define MultiXactIdIsValid(multi) ((multi) != InvalidMultiXactId)
 
-#define MaxMultiXactOffset	((MultiXactOffset) 0xFFFFFFFF)
-
 /*
  * Possible multixact lock modes ("status").  The first four modes are for
  * tuple locks (FOR KEY SHARE, FOR SHARE, FOR NO KEY UPDATE, FOR UPDATE); the
@@ -111,7 +109,7 @@ extern bool MultiXactIdIsRunning(MultiXactId multi, bool isLockOnly);
 extern void MultiXactIdSetOldestMember(void);
 extern int	GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
 								  bool from_pgupgrade, bool isLockOnly);
-extern bool GetMultiXactInfo(uint32 *multixacts, MultiXactOffset *members,
+extern void GetMultiXactInfo(uint32 *multixacts, MultiXactOffset *members,
 							 MultiXactId *oldestMultiXactId,
 							 MultiXactOffset *oldestOffset);
 extern bool MultiXactIdPrecedes(MultiXactId multi1, MultiXactId multi2);
@@ -131,8 +129,7 @@ extern void BootStrapMultiXact(void);
 extern void StartupMultiXact(void);
 extern void TrimMultiXact(void);
 extern void SetMultiXactIdLimit(MultiXactId oldest_datminmxid,
-								Oid oldest_datoid,
-								bool is_startup);
+								Oid oldest_datoid);
 extern void MultiXactGetCheckptMulti(bool is_shutdown,
 									 MultiXactId *nextMulti,
 									 MultiXactOffset *nextMultiOffset,
diff --git a/src/include/access/multixact_internal.h b/src/include/access/multixact_internal.h
index 9b56deaef31..c4dd1aa044f 100644
--- a/src/include/access/multixact_internal.h
+++ b/src/include/access/multixact_internal.h
@@ -21,17 +21,9 @@
 /*
  * Defines for MultiXactOffset page sizes.  A page is the same BLCKSZ as is
  * used everywhere else in Postgres.
- *
- * Note: because MultiXactOffsets are 32 bits and wrap around at 0xFFFFFFFF,
- * MultiXact page numbering also wraps around at
- * 0xFFFFFFFF/MULTIXACT_OFFSETS_PER_PAGE, and segment numbering at
- * 0xFFFFFFFF/MULTIXACT_OFFSETS_PER_PAGE/SLRU_PAGES_PER_SEGMENT.  We need
- * take no explicit notice of that fact in this module, except when comparing
- * segment and page numbers in TruncateMultiXact (see
- * MultiXactOffsetPagePrecedes).
  */
 
-/* We need four bytes per offset */
+/* We need 8 bytes per offset */
 #define MULTIXACT_OFFSETS_PER_PAGE (BLCKSZ / sizeof(MultiXactOffset))
 
 static inline int64
@@ -80,19 +72,6 @@ MultiXactIdToOffsetSegment(MultiXactId multi)
 #define MULTIXACT_MEMBERS_PER_PAGE	\
 	(MULTIXACT_MEMBERGROUPS_PER_PAGE * MULTIXACT_MEMBERS_PER_MEMBERGROUP)
 
-/*
- * Because the number of items per page is not a divisor of the last item
- * number (member 0xFFFFFFFF), the last segment does not use the maximum number
- * of pages, and moreover the last used page therein does not use the same
- * number of items as previous pages.  (Another way to say it is that the
- * 0xFFFFFFFF member is somewhere in the middle of the last page, so the page
- * has some empty space after that item.)
- *
- * This constant is the number of members in the last page of the last segment.
- */
-#define MAX_MEMBERS_IN_LAST_MEMBERS_PAGE \
-		((uint32) ((0xFFFFFFFF % MULTIXACT_MEMBERS_PER_PAGE) + 1))
-
 /* page in which a member is to be found */
 static inline int64
 MXOffsetToMemberPage(MultiXactOffset offset)
diff --git a/src/include/c.h b/src/include/c.h
index cb8a38669be..11b83a343ce 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -670,7 +670,7 @@ typedef uint32 SubTransactionId;
 /* MultiXactId must be equivalent to TransactionId, to fit in t_xmax */
 typedef TransactionId MultiXactId;
 
-typedef uint32 MultiXactOffset;
+typedef uint64 MultiXactOffset;
 
 typedef uint32 CommandId;
 
diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index 35413f14019..34f07d52cd8 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -1793,13 +1793,20 @@ sub _get_env
 	return (%inst_env);
 }
 
-# Private routine to get an installation path qualified command.
-#
-# IPC::Run maintains a cache, %cmd_cache, mapping commands to paths.  Tests
-# which use nodes spanning more than one postgres installation path need to
-# avoid confusing which installation's binaries get run.  Setting $ENV{PATH} is
-# insufficient, as IPC::Run does not check to see if the path has changed since
-# caching a command.
+=pod
+
+=item $node->installed_command(cmd)
+
+Get an installation path qualified command.
+
+IPC::Run maintains a cache, %cmd_cache, mapping commands to paths.  Tests
+which use nodes spanning more than one postgres installation path need to
+avoid confusing which installation's binaries get run.  Setting $ENV{PATH} is
+insufficient, as IPC::Run does not check to see if the path has changed since
+caching a command.
+
+=cut
+
 sub installed_command
 {
 	my ($self, $cmd) = @_;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 23bce72ae64..f9ddd06ec1d 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1725,6 +1725,7 @@ MultiXactMember
 MultiXactOffset
 MultiXactStateData
 MultiXactStatus
+MultiXactWriter
 MultirangeIOData
 MultirangeParseState
 MultirangeType
@@ -1808,6 +1809,7 @@ OffsetVarNodes_context
 Oid
 OidOptions
 OkeysState
+OldMultiXactReader
 OldToNewMapping
 OldToNewMappingData
 OnCommitAction
@@ -2804,6 +2806,7 @@ SlruCtlData
 SlruErrorCause
 SlruPageStatus
 SlruScanCallback
+SlruSegState
 SlruShared
 SlruSharedData
 SlruWriteAll
-- 
2.47.3

#62wenhui qiu
qiuwenhuifx@gmail.com
In reply to: Heikki Linnakangas (#61)
Re: POC: make mxidoff 64 bits

Hi Heikki

I don't think we need to support that case. I hope there are no clusters
in that state still in the wild, and you can work around it by upgrading
to 9.3.5 or above and letting autovacuum run. But I wonder if a
pre-upgrade check with a better error message would still be worthwhile.

I think it is now highly unlikely to find instances of version 9.3 still
in use; all users are advised to upgrade to the latest version first.

Thanks

On Tue, Nov 18, 2025 at 12:35 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Here's yet another patch version. I spent the day reviewing this in
detail and doing little cleanups here and there. I squashed the commits
and wrote a proper commit message.

One noteworthy refactoring is in pg_upgrade.c, to make it more clear (to
me at least) how upgrade from version 9.2 and below now works. It was
actually broken when I tested it. Not sure if I had broken it earlier or
if it never worked, but in any case it works now.

I also tested upgrading a cluster from an old minor version, < 9.3.5,
where the control file has a bogus oldestMultiXid==1 value (see commit
b6a3444fa6). As expected, you get a "could not open file" error:

Performing Upgrade
------------------
Setting locale and encoding for new cluster ok
...
Deleting files from new pg_multixact/members ok
Deleting files from new pg_multixact/offsets ok
Converting pg_multixact files
could not open file "/home/heikki/pgsql.93stable/data/pg_multixact/offsets/0000": No such file or directory

Failure, exiting

I don't think we need to support that case. I hope there are no clusters
in that state still in the wild, and you can work around it by upgrading
to 9.3.5 or above and letting autovacuum run. But I wonder if a
pre-upgrade check with a better error message would still be worthwhile.

Ashutosh, you were interested in reviewing this earlier. Would you have
a chance to review this now, before I commit it? Alexander, Alvaro,
would you have a chance to take a final look too, please?

- Heikki

#63Heikki Linnakangas
hlinnaka@iki.fi
In reply to: wenhui qiu (#62)
Re: POC: make mxidoff 64 bits

One more small issue: The docs for pg_resetwal contain recipes for how
to determine safe values to use:

-m mxid,mxid
--multixact-ids=mxid,mxid
Manually set the next and oldest multitransaction ID.

A safe value for the next multitransaction ID (first part) can be
determined by looking for the numerically largest file name in the
directory pg_multixact/offsets under the data directory, adding one,
and then multiplying by 65536 (0x10000). Conversely, a safe value
for the oldest multitransaction ID (second part of -m) can be
determined by looking for the numerically smallest file name in the
same directory and multiplying by 65536. The file names are in
hexadecimal, so the easiest way to do this is to specify the option
value in hexadecimal and append four zeroes.

-O mxoff
--multixact-offset=mxoff

Manually set the next multitransaction offset.

A safe value can be determined by looking for the numerically
largest file name in the directory pg_multixact/members under the
data directory, adding one, and then multiplying by 52352 (0xCC80).
The file names are in hexadecimal. There is no simple recipe such as
the ones for other options of appending zeroes.

I think those recipes need to be adjusted for 64-bit offsets.
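
For reference, the 52352 multiplier in the -O recipe can be derived from
the on-disk layout of pg_multixact/members. The following is only a
minimal sketch, assuming 32 pages per SLRU segment and member groups made
of four flag bytes followed by four 4-byte XIDs; the macro and function
names are illustrative and are not taken from the PostgreSQL headers.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout constants; illustrative names, not PostgreSQL's. */
#define PAGES_PER_SEGMENT	32	/* assumed SLRU_PAGES_PER_SEGMENT */
#define MEMBERS_PER_GROUP	4	/* four flag bytes, then four XIDs */
#define XID_SIZE			4	/* assumed sizeof(TransactionId) */

/* Number of members that fit on one page of the given size. */
static uint64_t
members_per_page(uint32_t blcksz)
{
	/* one flag byte plus one XID per member: 4 * (1 + 4) = 20 bytes */
	uint32_t	group_size = MEMBERS_PER_GROUP * (1 + XID_SIZE);

	assert(blcksz >= group_size);
	return (uint64_t) (blcksz / group_size) * MEMBERS_PER_GROUP;
}

/* Number of members covered by one segment file. */
static uint64_t
members_per_segment(uint32_t blcksz)
{
	return members_per_page(blcksz) * PAGES_PER_SEGMENT;
}
```

With an 8192-byte block this gives 1636 members per page and 52352
(0xCC80) members per segment, which matches the multiplier quoted in the
docs.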

- Heikki

#64Maxim Orlov
orlovmg@gmail.com
In reply to: Heikki Linnakangas (#63)
Re: POC: make mxidoff 64 bits

On Wed, 19 Nov 2025 at 19:20, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

I think those recipes need to be adjusted for 64-bit offsets.

Yes, we need to do it.

Sorry if this is too obvious, but with 32-bit offsets, we get:
SLRU_PAGES_PER_SEGMENT * BLCKSZ / sizeof(MXOff) =
32 * 8192 / 4 = 65,536 mxoff per segment.

Now, with 64-bit offsets, we should have half as many:
32 * 8192 / 8 = 32,768 mxoff per segment.
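
The same arithmetic as a small sketch, in case it helps when rewording the
-m recipe. It assumes 32 pages per segment and that the only thing that
changes is the size of one offset entry; the names are illustrative, not
PostgreSQL's.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGES_PER_SEGMENT 32	/* assumed SLRU_PAGES_PER_SEGMENT */

/* Offsets stored in one segment file of pg_multixact/offsets. */
static uint64_t
offsets_per_segment(uint32_t blcksz, size_t offset_size)
{
	assert(offset_size > 0);
	return (uint64_t) PAGES_PER_SEGMENT * (blcksz / offset_size);
}
```

For an 8192-byte block, offsets_per_segment(8192, 4) is 65536 (0x10000)
and offsets_per_segment(8192, 8) is 32768 (0x8000), so under these
assumptions the multiplier in the -m recipe would be halved.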

--
Best regards,
Maxim Orlov.

#65Ashutosh Bapat
ashutosh.bapat.oss@gmail.com
In reply to: Heikki Linnakangas (#61)
Re: POC: make mxidoff 64 bits

On Mon, Nov 17, 2025 at 10:05 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Ashutosh, you were interested in reviewing this earlier. Would you have
a chance to review this now, before I commit it?

0002 seems to be fine. It's just moving code from one file to another.
However, the name multixact_internal.h seems to be misusing the term
"internal". I would expect an "internal.h" to expose things for use
within the module and not "externally" in other modules. But this
isn't the first time: buf_internals.h, toast_internals.h,
bgworker_internals.h and so on have their own meaning of what
"internal" means to them. But it might be better to use something more
meaningful than "internal" in this case. The file seems to contain
declarations related to how the pg_multixact SLRUs are structured. To
that effect multixact_slru.h or multixact_layout.h may be more
appropriate.

That file deals with two kinds of offsets: the offset within
pg_multixact/offsets, and the MultiXactOffset and member offsets (flag
offset and xid offset) within pg_multixact/members. It's quite easy to
get confused between those when reading that code. For example, it's not
clear which offset the MultiXactIdToOffset* functions are about. These
functions calculate the page, the entry (within the page) and the
segment (of pages) in pg_multixact/offsets where the MultiXactOffset of
the first member of a given mxid is found. Thus they return an offset
within offsets. I feel they should have been named differently
when the code was written. But now that we are moving this code to a
separate file, we also have an opportunity to rename them. I
think the MXOffsetToMember* functions have better names. Using a similar
convention we could use MultiXactIdToOffsetOffset*, but that might
increase confusion. How about MultiXactIdToOffsetPos*? A separate .h
file also looks like a good place to document how offsets is laid out,
what it contains, and how members is laid out. The latter is somewhat
documented in terms of macros and static functions. The former is
not documented well, I feel. This refactoring seems to be a good
opportunity to do that. If we do so, I think there will be
some value in committing the .h file as a separate commit.
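
To make the "offset within offsets" point concrete, here is a minimal
sketch of the kind of mapping those functions perform, assuming 8-byte
offsets, 8192-byte pages and 32 pages per segment. All names with a
sketch_ or SKETCH_ prefix are hypothetical and chosen only for this
illustration; they are not the names used in the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout constants for this illustration. */
#define SKETCH_BLCKSZ 8192
#define SKETCH_SLRU_PAGES_PER_SEGMENT 32
/* With 8-byte offsets, 8192 / 8 = 1024 entries fit on one page. */
#define SKETCH_OFFSETS_PER_PAGE (SKETCH_BLCKSZ / sizeof(uint64_t))

/* Page of pg_multixact/offsets that holds the entry for this multixid. */
static int64_t
sketch_mxid_to_offsets_page(uint32_t multi)
{
	return (int64_t) (multi / SKETCH_OFFSETS_PER_PAGE);
}

/* Slot within that page. */
static int
sketch_mxid_to_offsets_entry(uint32_t multi)
{
	int			entry = (int) (multi % SKETCH_OFFSETS_PER_PAGE);

	assert(entry >= 0);
	return entry;
}

/* Segment file that contains the page. */
static int64_t
sketch_mxid_to_offsets_segment(uint32_t multi)
{
	return sketch_mxid_to_offsets_page(multi) / SKETCH_SLRU_PAGES_PER_SEGMENT;
}
```

None of these return a MultiXactOffset; they only say where in
pg_multixact/offsets the MultiXactOffset for a given mxid is stored, which
is why a "Pos"-style name may read better.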

The reason why this eliminates the need for wraparound is mentioned
somewhere in GetNewMultiXactId(), but probably it should be mentioned
at a more prominent place and also in the commit message. I expected
it to be in the prologue of GetNewMultiXactId(), and then a reference
to prologue from where the comment is right now.

ereport(ERROR,
(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
errmsg("MultiXact members would wrap around")));

If a server ever reaches this point, is there no way but to create
another cluster, if the application requires multi-xact ids? We
should also provide this as an errhint().

if (nextOffset + nmembers < nextOffset)

:). I spent a few seconds trying to understand this. But I don't know
how to make it easy to understand.
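
One way to read that check, as a standalone sketch: unsigned addition in C
wraps modulo 2^64, so the sum can only be smaller than the starting value
if the true result did not fit. The function names and the parameter types
(uint64_t for the offset, uint32_t for the member count) are assumptions
made for this example, not the declarations used in the patch. The second
function is an equivalent formulation that avoids performing the
overflowing addition, which some readers may find easier to follow.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Same shape as the check quoted above: the sum wraps modulo 2^64, so it
 * ends up below next_offset exactly when the addition overflowed.
 */
static bool
sketch_members_would_wrap(uint64_t next_offset, uint32_t nmembers)
{
	return next_offset + nmembers < next_offset;
}

/* Equivalent test: compare against the room left before UINT64_MAX. */
static bool
sketch_members_would_wrap_alt(uint64_t next_offset, uint32_t nmembers)
{
	return nmembers > UINT64_MAX - next_offset;
}
```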

In ExtendMultiXactMember() the last comment mentions a wraparound
/*
* Advance to next page, taking care to properly handle the wraparound
* case. OK if nmembers goes negative.
*/
I thought this wraparound is about offset wraparound, which can not
happen now. Since you have left the comment intact, it's something
else. Is the wraparound of offset within the page? Maybe requires a
bit more clarification?

PerformMembersTruncation(MultiXactOffset oldestOffset, MultiXactOffset
newOldestOffset)
{
... snip ...
- segment += 1;
- }
+ SimpleLruTruncate(MultiXactMemberCtl,
+ MXOffsetToMemberPage(newOldestOffset));
}

Most of the callers of SimpleLruTruncate() call it directly. Why do we
want to keep this static wrapper? PerformOffsetsTruncation() has a
comment which seems to need the wrapper. But
PerformMembersTruncation() doesn't even have that.

MultiXactState->oldestMultiXactId = newOldestMulti;
MultiXactState->oldestMultiXactDB = newOldestMultiDB;
+ MultiXactState->oldestOffset = newOldestOffset;
LWLockRelease(MultiXactGenLock);

Is this something we are missing in the current code? I can not
understand the connection between this change and the fact that offset
wraparound is not possible with wider multi xact offsets. Maybe I
missed some previous discussion.

I have reviewed patch 0002 and multixact.c changes in 0003. So far I
have only these comments. I will review the pg_upgrade.c changes next.

--
Best Wishes,
Ashutosh Bapat

#66Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Ashutosh Bapat (#65)
Re: POC: make mxidoff 64 bits

On 21/11/2025 14:15, Ashutosh Bapat wrote:

On Mon, Nov 17, 2025 at 10:05 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Ashutosh, you were interested in reviewing this earlier. Would you have
a chance to review this now, before I commit it?

0002 seems to be fine. It's just moving code from one file to another.
However, the name multixact_internals.h seems to be misusing the term
"internal". I would expect an "internal.h" to expose things for use
within the module and not "externally" in other modules. But this
isn't the first time: buf_internal.h, toast_internal.h,
bgworker_internal.h and so on have their own meaning of what
"internal" means to them. But it might be better to use something more
meaningful than "internal" in this case. The file seems to contain
declarations related to how pg_multixact SLRU is structured. To that
effect multixact_slru.h or multixact_layout.h may be more appropriate.

Yeah, I went with multixact_internal.h because of the precedent. It's
not great, but IMHO it's better to be consistent than invent a new
naming scheme.

That file deals with two kinds of offsets: the offset within
pg_multixact/offsets, and the MultiXactOffset and member offsets (flag
offset and xid offset) within pg_multixact/members. It's quite easy to
get confused between those when reading that code.

Agreed, those are confusing. I'll think about that a little more.

The reason why this eliminates the need for wraparound is mentioned
somewhere in GetNewMultiXactId(), but probably it should be mentioned
at a more prominent place and also in the commit message. I expected
it to be in the prologue of GetNewMultiXactId(), and then a reference
to prologue from where the comment is right now.

ereport(ERROR,
(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
errmsg("MultiXact members would wrap around")));

If a server ever reaches this point, is there no way but to create
another cluster, if the application requires multi-xact ids?

Pretty much. You can also vacuum freeze everything, so that no multixids
are in use, and then use pg_resetwal to reset nextOffset to a lower value.

That obviously sounds bad, but this is a "can't happen" situation. For
comparison, we don't provide any better way to recover from running out
of LSNs either.

We should also provide this as an errhint().

I think no. You cannot run into this "organically" by just running the
system. If you do manage to hit it, it's a sign of some other trouble,
and I don't want to guess what that might be, or what might be the
appropriate way to recover.

In ExtendMultiXactMember() the last comment mentions a wraparound
/*
* Advance to next page, taking care to properly handle the wraparound
* case. OK if nmembers goes negative.
*/
I thought this wraparound is about offset wraparound, which can not
happen now. Since you have left the comment intact, it's something
else. Is the wraparound of offset within the page? Maybe requires a
bit more clarification?

It was indeed about offset wraparound. I'll remove it.

PerformMembersTruncation(MultiXactOffset oldestOffset, MultiXactOffset
newOldestOffset)
{
... snip ...
- segment += 1;
- }
+ SimpleLruTruncate(MultiXactMemberCtl,
+ MXOffsetToMemberPage(newOldestOffset));
}

Most of the callers of SimpleLruTruncate() call it directly. Why do we
want to keep this static wrapper? PerformOffsetsTruncation() has a
comment which seems to need the wrapper. But
PerformMembersTruncation() doesn't even have that.

Hmm, yeah those wrappers are a bit vestigial now. I'm inclined to keep
them, because as you said, PerformOffsetsTruncation() provides a place
for the comment. And given that, it seems best to keep
PerformMembersTruncation(), for symmetry.

MultiXactState->oldestMultiXactId = newOldestMulti;
MultiXactState->oldestMultiXactDB = newOldestMultiDB;
+ MultiXactState->oldestOffset = newOldestOffset;
LWLockRelease(MultiXactGenLock);

Is this something we are missing in the current code? I can not
understand the connection between this change and the fact that offset
wraparound is not possible with wider multi xact offsets. Maybe I
missed some previous discussion.

Good question. At first I intended to extract that to a separate commit,
before the main patch, because it seems like a nice improvement: We have
just calculated 'oldestOffset', so why not update the value in shared
memory while we have it? But looking closer, I'm not sure if it'd be
sane without the 64-bit offsets. Currently, oldestOffset is only updated
by SetOffsetVacuumLimit(), which also updates offsetStopLimit. We could
get into a state where oldestOffset is set, but offsetStopLimit is not.
With 64-bit offsets that's no longer a concern because it removes
offsetStopLimit altogether.

I have reviewed patch 0002 and multixact.c changes in 0003. So far I
have only these comments. I will review the pg_upgrade.c changes next.

Thanks for the review so far!

- Heikki

#67Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Heikki Linnakangas (#66)
Re: POC: make mxidoff 64 bits

Looking at the upgrade code, in light of the "IPC/MultixactCreation on
the Standby server" thread [1], I think we need to make it more
tolerant. It's possible that there are 0 offsets in
pg_multixact/offsets. That might or might not be a problem: it's OK as
long as those multixids don't appear in any heap table, or you might
actually have lost those multixids, which is bad but the damage has
already been done and upgrade should not get stuck on it.
GetOldMultiXactIdSingleMember() currently asserts that the offset is
never zero, but it should try to do something sensible in that case
instead of just failing.

[1]: /messages/by-id/172e5723-d65f-4eec-b512-14beacb326ce@yandex.ru

- Heikki

#68Maxim Orlov
orlovmg@gmail.com
In reply to: Heikki Linnakangas (#67)
Re: POC: make mxidoff 64 bits

On Tue, 25 Nov 2025 at 13:07, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

GetOldMultiXactIdSingleMember() currently asserts that the offset is
never zero, but it should try to do something sensible in that case
instead of just failing.

Correct me if I'm wrong, but we added the assertion that offsets are
never 0, based on the idea that case #2 will never take place during an
update. If this isn't the case, this assertion could be removed.
The rest of the function appears to work correctly.

I even think that, as an experiment, we could randomly reset some of the
offsets to zero and nothing would happen, except that some data would
be lost.

The most sensible thing we can do is give the user a warning, right?
Something like, "During the update, we encountered some weird offset
that shouldn't have been there, but there's nothing we can do about it,
just take note."
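
A rough sketch of what "warn instead of assert" could look like, purely
as an illustration. The function name, its signature, the warning text and
the counter are all hypothetical; what the caller should do with such a
multixid afterwards is deliberately left open here, and the real code
would presumably go through pg_upgrade's own logging rather than stderr.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical check: returns true if the old offset looks usable.  A zero
 * offset is reported and counted instead of tripping an assertion.
 */
static bool
sketch_check_old_offset(uint32_t multi, uint32_t old_offset,
						unsigned *nwarnings)
{
	if (old_offset != 0)
		return true;

	fprintf(stderr,
			"warning: multixid %u has offset 0 in the old cluster\n",
			(unsigned) multi);
	(*nwarnings)++;
	return false;
}
```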

--
Best regards,
Maxim Orlov.

#69Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Maxim Orlov (#68)
Re: POC: make mxidoff 64 bits

On 26/11/2025 17:23, Maxim Orlov wrote:

On Tue, 25 Nov 2025 at 13:07, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

GetOldMultiXactIdSingleMember() currently asserts that the offset is
never zero, but it should try to do something sensible in that case
instead of just failing.

Correct me if I'm wrong, but we added the assertion that offsets are
never 0, based on the idea that case #2 will never take place during an
update. If this isn't the case, this assertion could be removed.
The rest of the function appears to work correctly.

I even think that, as an experiment, we could randomly reset some of the
offsets to zero and nothing would happen, except that some data would
be lost.

+1

The most sensible thing we can do is give the user a warning, right?
Something like, "During the update, we encountered some weird offset
that shouldn't have been there, but there's nothing we can do about it,
just take note."

Yep, makes sense.

- Heikki

#70Ashutosh Bapat
ashutosh.bapat.oss@gmail.com
In reply to: Heikki Linnakangas (#66)
1 attachment(s)
Re: POC: make mxidoff 64 bits

On Fri, Nov 21, 2025 at 7:26 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

I have reviewed patch 0002 and multixact.c changes in 0003. So far I
have only these comments. I will review the pg_upgrade.c changes next.

007_multixact_conversion.pl fires thousands of queries through
BackgroundPsql which prints debug output for each of the queries. When
running this file with oldinstall set,
2.2M regress_log_007_multixact_conversion (size of file)
77874 regress_log_007_multixact_conversion (wc -l output)

Since this output is also copied in testlog.txt, the effect is two-fold.

Most, if not all, of this output is useless. It also makes it hard to
find the output we are looking for. PFA patch which reduces this
output. The patch adds a flag verbose to query_safe() and query() to
toggle this output. With the patch the sizes are
27K regress_log_007_multixact_conversion
588 regress_log_007_multixact_conversion

And it makes the test faster by about a second or two on my laptop.
Something on those lines or other is required to reduce the output
from query_safe().

Some more comments
+++ b/src/bin/pg_upgrade/multixact_old.c

We may need to introduce new _new and then _old will become _older.
Should we rename the files to have pre19 and post19 or some similar
suffixes which make it clear what is meant by old and new?

+
+static inline int64
+MultiXactIdToOffsetPage(MultiXactId multi)

The prologue mentions that the definitions are copy-pasted from
multixact.c from version 18, but they share the names with functions
in the current version. I think that's going to be a good source of
confusion especially in a file which is a few hundred lines long. Can
we rename them to have "Old" prefix or something similar?

+
+# Dump contents of the 'mxofftest' table, created by mxact_workload
+sub get_dump_for_comparison

This local function shares its name with a local function in
002_pg_upgrade.pl. Better to use a separate name. Also it's not
"dumping" data using "pg_dump", so "dump" in the name can be
misleading.

+ $newnode->start;
+ my $new_dump = get_dump_for_comparison($newnode, "newnode_${tag}_dump");
+ $newnode->stop;

There is no code which actually looks at the multixact offsets here to
make sure that the conversion happened correctly. I guess the test
relies on visibility checks for that. Anyway, we need a comment
explaining why just comparing the contents of the table is enough to
ensure correct conversion. Better if we can add an explicit test that
the offsets were converted correctly. I don't have any idea of how to
do that right now, though. Maybe use pg_get_multixact_members()
somehow in the query to extract data out of the table?

+
+ compare_files($old_dump, $new_dump,
+ 'dump outputs from original and restored regression databases match');

A shared test name too :); but there is no regression database here.

+
+ note ">>> case #${tag}\n"
+   . " oldnode mxoff from ${start_mxoff} to ${finish_mxoff}\n"
+   . " newnode mxoff ${new_next_mxoff}\n";

Should we check that some condition holds between finish_mxoff and
new_next_mxoff?

I will continue reviewing it further.

--
Best Wishes,
Ashutosh Bapat

Attachments:

reduce_testoutput.patch.nocibot (application/octet-stream)
diff --git a/src/bin/pg_upgrade/t/007_multixact_conversion.pl b/src/bin/pg_upgrade/t/007_multixact_conversion.pl
index fe8da9aded2..14133eb3c7a 100644
--- a/src/bin/pg_upgrade/t/007_multixact_conversion.pl
+++ b/src/bin/pg_upgrade/t/007_multixact_conversion.pl
@@ -68,8 +68,8 @@ sub mxact_workload
 		# versions.
 		my $conn = $binnode->background_psql('',
 			connstr => $node->connstr('postgres'));
-		$conn->query_safe("SET enable_seqscan=off");
-		$conn->query_safe("BEGIN");
+		$conn->query_safe("SET enable_seqscan=off", 0);
+		$conn->query_safe("BEGIN", 0);
 
 		push(@connections, $conn);
 	}
@@ -110,7 +110,7 @@ sub mxact_workload
 			  ) as x
 			];
 		}
-		$conn->query_safe($sql);
+		$conn->query_safe($sql, 0);
 	}
 
 	for my $conn (@connections)
diff --git a/src/test/perl/PostgreSQL/Test/BackgroundPsql.pm b/src/test/perl/PostgreSQL/Test/BackgroundPsql.pm
index 60bbd5dd445..9d62c7a00c0 100644
--- a/src/test/perl/PostgreSQL/Test/BackgroundPsql.pm
+++ b/src/test/perl/PostgreSQL/Test/BackgroundPsql.pm
@@ -228,20 +228,24 @@ sub reconnect_and_clear
 
 Executes a query in the current session and returns the output in scalar
 context and (output, error) in list context where error is 1 in case there
-was output generated on stderr when executing the query.
+was output generated on stderr when executing the query. If C<verbose> is
+true (default) the query and its results are printed to the test output.
 
 =cut
 
 sub query
 {
-	my ($self, $query) = @_;
+	my ($self, $query, $verbose) = @_;
 	my $ret;
 	my $output;
 	my $query_cnt = $self->{query_cnt}++;
 
+	# Set $verbose to true if not passed
+	$verbose = 1 unless defined($verbose);
+
 	local $Test::Builder::Level = $Test::Builder::Level + 1;
 
-	note "issuing query $query_cnt via background psql: $query";
+	note "issuing query $query_cnt via background psql: $query" unless !$verbose;
 
 	$self->{timeout}->start() if (defined($self->{query_timer_restart}));
 
@@ -280,7 +284,7 @@ sub query
 	  explain {
 		stdout => $self->{stdout},
 		stderr => $self->{stderr},
-	  };
+	  } unless !$verbose;
 
 	# Remove banner from stdout and stderr, our caller doesn't care.  The
 	# first newline is optional, as there would not be one if consuming an
@@ -308,9 +312,9 @@ Query failure is determined by it producing output on stderr.
 
 sub query_safe
 {
-	my ($self, $query) = @_;
+	my ($self, $query, $verbose) = @_;
 
-	my $ret = $self->query($query);
+	my $ret = $self->query($query, $verbose);
 
 	if ($self->{stderr} ne "")
 	{
#71Ashutosh Bapat
ashutosh.bapat.oss@gmail.com
In reply to: Ashutosh Bapat (#70)
Re: POC: make mxidoff 64 bits

On Fri, Nov 28, 2025 at 6:35 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:

On Fri, Nov 21, 2025 at 7:26 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

I have reviewed patch 0002 and multixact.c changes in 0003. So far I
have only these comments. I will review the pg_upgrade.c changes next.

007_multixact_conversion.pl fires thousands of queries through
BackgroundPsql which prints debug output for each of the queries. When
running this file with oldinstall set,
2.2M regress_log_007_multixact_conversion (size of file)
77874 regress_log_007_multixact_conversion (wc -l output)

Since this output is also copied in testlog.txt, the effect is two-fold.

Most, if not all, of this output is useless. It also makes it hard to
find the output we are looking for. PFA patch which reduces this
output. The patch adds a flag verbose to query_safe() and query() to
toggle this output. With the patch the sizes are
27K regress_log_007_multixact_conversion
588 regress_log_007_multixact_conversion

And it makes the test faster by about a second or two on my laptop.
Something on those lines or other is required to reduce the output
from query_safe().

Some more comments
+++ b/src/bin/pg_upgrade/multixact_old.c

We may need to introduce new _new and then _old will become _older.
Should we rename the files to have pre19 and post19 or some similar
suffixes which make it clear what is meant by old and new?

+
+static inline int64
+MultiXactIdToOffsetPage(MultiXactId multi)

The prologue mentions that the definitions are copy-pasted from
multixact.c from version 18, but they share the names with functions
in the current version. I think that's going to be a good source of
confusion especially in a file which is a few hundred lines long. Can
we rename them to have "Old" prefix or something similar?

+
+# Dump contents of the 'mxofftest' table, created by mxact_workload
+sub get_dump_for_comparison

This local function shares its name with a local function in
002_pg_upgrade.pl. Better to use a separate name. Also it's not
"dumping" data using "pg_dump", so "dump" in the name can be
misleading.

+ $newnode->start;
+ my $new_dump = get_dump_for_comparison($newnode, "newnode_${tag}_dump");
+ $newnode->stop;

There is no code which actually looks at the multixact offsets here to
make sure that the conversion happened correctly. I guess the test
relies on visibility checks for that. Anyway, we need a comment
explaining why just comparing the contents of the table is enough to
ensure correct conversion. Better if we can add an explicit test that
the offsets were converted correctly. I don't have any idea of how to
do that right now, though. Maybe use pg_get_multixact_members()
somehow in the query to extract data out of the table?

+
+ compare_files($old_dump, $new_dump,
+ 'dump outputs from original and restored regression databases match');

A shared test name too :); but there is no regression database here.

+
+ note ">>> case #${tag}\n"
+   . " oldnode mxoff from ${start_mxoff} to ${finish_mxoff}\n"
+   . " newnode mxoff ${new_next_mxoff}\n";

Should we check that some condition holds between finish_mxoff and
new_next_mxoff?

I will continue reviewing it further.

One more thing,
An UPDATE waits for FOR SHARE query to finish, and vice versa. In my
experiments I didn't see an UPDATE creating a multi-xact. Why do we
have UPDATEs in the load created by the test? Am I missing something?
--
Best Wishes,
Ashutosh Bapat

#72Maxim Orlov
orlovmg@gmail.com
In reply to: Ashutosh Bapat (#71)
Re: POC: make mxidoff 64 bits

On Fri, 28 Nov 2025 at 16:17, Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
wrote:

One more thing,
An UPDATE waits for FOR SHARE query to finish, and vice versa. In my
experiments I didn't see an UPDATE creating a multi-xact. Why do we
have UPDATEs in the load created by the test? Am I missing something?

As far as I remember, this was done on purpose to create different
multixact members statuses randomly.

--
Best regards,
Maxim Orlov.

#73Ashutosh Bapat
ashutosh.bapat.oss@gmail.com
In reply to: Maxim Orlov (#72)
Re: POC: make mxidoff 64 bits

On Mon, Dec 1, 2025 at 2:23 PM Maxim Orlov <orlovmg@gmail.com> wrote:

On Fri, 28 Nov 2025 at 16:17, Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:

One more thing,
An UPDATE waits for FOR SHARE query to finish, and vice versa. In my
experiments I didn't see an UPDATE creating a multi-xact. Why do we
have UPDATEs in the load created by the test? Am I missing something?

As far as I remember, this was done on purpose to create different
multixact members statuses randomly.

In that case, better to include that in the comments.

--
Best Wishes,
Ashutosh Bapat

#74Alexander Korotkov
aekorotkov@gmail.com
In reply to: Heikki Linnakangas (#67)
Re: POC: make mxidoff 64 bits

Hi, Heikki!

On Tue, Nov 25, 2025 at 12:07 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Looking at the upgrade code, in light of the "IPC/MultixactCreation on
the Standby server" thread [1], I think we need to make it more
tolerant. It's possible that there are 0 offsets in
pg_multixact/offsets. That might or might not be a problem: it's OK as
long as those multixids don't appear in any heap table, or you might
actually have lost those multixids, which is bad but the damage has
already been done and upgrade should not get stuck on it.
GetOldMultiXactIdSingleMember() currently asserts that the offset is
never zero, but it should try to do something sensible in that case
instead of just failing.

Thank you for your work on this subject. It's very much appreciated.

I'd like to raise the question about compression again. You have
fairly criticized non-deterministic compression, but what do you think
about the deterministic one that I've proposed [1]? I understand that
multixact offsets are subject to growth and their limit is not
removed. However, it's still several extra gigabytes for multixact
offsets, which we could save.

Links.
1. /messages/by-id/CAPpHfduDFLXATvBkUiOjyvZUBZXhK_pj5zjVpxvrJzkRVq+8Lw@mail.gmail.com

------
Regards,
Alexander Korotkov
Supabase

#75Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Alexander Korotkov (#74)
Re: POC: make mxidoff 64 bits

On 02/12/2025 16:11, Alexander Korotkov wrote:

I'd like to raise the question about compression again. You have
fairly criticized non-deterministic compression, but what do you think
about the deterministic one that I've proposed [1]? I understand that
multixact offsets are subject to growth and their limit is not
removed. However, it's still several extra gigabytes for multixact
offsets, which we could save.

It felt overly complicated to my taste. And decoding/encoding the whole
chunk on every access seems expensive. Maybe it's cheap enough that it
doesn't matter in practice, but some performance testing would at least
be in order. But I'd love to find a simpler scheme to begin with.

Storing one "base" offset per page, as Maxim did in [1], feels about
right to me. Except for the non-deterministic nature of how it gets set
in that patch, and what I referred to as a "frighteningly clever
encoding scheme".

Perhaps we could set the base offset in ExtendMultiXactOffset() already?
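
To illustrate the general idea of one base offset per page, here is a
deliberately simplified sketch. It is not the encoding from the patch
referenced in [1]; the struct, the function names and the assumption that
every offset on a page lies within 2^32 of the page's base are all
hypothetical and made only for this example.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_BLCKSZ 8192
/* Room left for 4-byte deltas after one 8-byte base: (8192 - 8) / 4. */
#define SKETCH_ENTRIES_PER_PAGE ((SKETCH_BLCKSZ - 8) / 4)

typedef struct SketchOffsetsPage
{
	uint64_t	base;			/* one 64-bit base offset per page */
	uint32_t	delta[SKETCH_ENTRIES_PER_PAGE]; /* distance from base */
} SketchOffsetsPage;

/* Store a 64-bit offset as a 32-bit delta from the page's base. */
static void
sketch_set_offset(SketchOffsetsPage *page, int entry, uint64_t offset)
{
	assert(entry >= 0 && entry < SKETCH_ENTRIES_PER_PAGE);
	/* Assumption of this sketch: the offset is within 2^32 of the base. */
	assert(offset >= page->base && offset - page->base <= UINT32_MAX);
	page->delta[entry] = (uint32_t) (offset - page->base);
}

/* Reconstruct the full 64-bit offset. */
static uint64_t
sketch_get_offset(const SketchOffsetsPage *page, int entry)
{
	assert(entry >= 0 && entry < SKETCH_ENTRIES_PER_PAGE);
	return page->base + page->delta[entry];
}
```

In this simplified form the base has to be known before the first entry
on the page is written, which is one reason setting it at page-extension
time looks attractive.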

[1]: /messages/by-id/CACG=ezbPUASDL1eJ+c-ZkJMwRPukvp3EL0q1vSUa1h+fnX8y3g@mail.gmail.com

- Heikki

#76Maxim Orlov
orlovmg@gmail.com
In reply to: Heikki Linnakangas (#75)
Re: POC: make mxidoff 64 bits

The biggest problem with compression, in my opinion, is that losing
even one byte causes the loss of the entire compressed block in the
worst case scenario. After all, we still don't have checksums for the
SLRU's, which is a shame by itself.

Again, I'm not against the idea of compression, but the risks need to
be considered.

As a software developer, I definitely want to implement compression and
save a few gigabytes. However, given my previous experience using
Postgres in real-world applications, reliability at the cost of several
gigabytes would not have caused me any trouble. Just saying.

--
Best regards,
Maxim Orlov.

#77Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Maxim Orlov (#76)
Re: POC: make mxidoff 64 bits

On 03/12/2025 11:54, Maxim Orlov wrote:

The biggest problem with compression, in my opinion, is that losing
even one byte causes the loss of the entire compressed block in the
worst case scenario. After all, we still don't have checksums for the
SLRU's, which is a shame by itself.

Again, I'm not against the idea of compression, but the risks need to
be considered.

There are plenty of such critical bytes in the system where a single bit
flip renders the whole block unreadable. Actually, if we had checksums
on SLRU pages, a single bit flip anywhere in the page would make the
checksum fail and render the block unreadable.

If things go really bad and you need to open a hex editor and try to fix
the data manually, it shouldn't be too hard to deduce the correct base
offset from surrounding data.

As a software developer, I definitely want to implement compression and
save a few gigabytes. However, given my previous experience using
Postgres in real-world applications, reliability at the cost of several
gigabytes would not have caused me any trouble. Just saying.

+1. If we decide to do some kind of compression here, I want it to be
very simple. Otherwise it's just not worth the code complexity and risk.

Let's do the math of how much disk space we'd save. Let's assume the
worst case that every multixid consists of only one transaction ID.
Currently, every such multixid takes up 4 bytes in the offsets SLRU, and
5 bytes in the members SLRU (one flag byte and 4 bytes for the XID). So
that's 9 bytes. With 64-bit offsets, it becomes 13 bytes. With the
compression, we're back to 9 bytes again (ignoring the one base offset
per page). So in an extreme case that you have 1 billion multixids, with
only one XID per multixid, the difference is between 9 GB and 13 GB.
That seems acceptable.

And having just one XID per multixid is a rare corner case. Much more
commonly, you have at least two XIDs. With two XIDs per multixid, the
difference is between 14 bytes and 18 bytes.
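
The per-multixid numbers above follow from a simple formula, sketched
here in C. The helper name and the SKETCH_ constant are made up for
illustration, and the sketch ignores page headers, padding and the
per-page base offset, just as the estimate does.

```c
#include <assert.h>
#include <stddef.h>

/* One flag byte plus a 4-byte XID per member, as in the estimate above. */
#define SKETCH_BYTES_PER_MEMBER 5

/* Bytes used per multixid: one offsets entry plus its members. */
static size_t
sketch_bytes_per_multixid(size_t offset_width, size_t nmembers)
{
	assert(offset_width == 4 || offset_width == 8);
	return offset_width + nmembers * SKETCH_BYTES_PER_MEMBER;
}
```

With one member this gives 9 bytes for 4-byte offsets and 13 bytes for
8-byte offsets; with two members, 14 and 18 bytes. Multiplying by one
billion multixids gives the 9 GB versus 13 GB figures.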

And having a billion multixids is pretty extreme. Your database is
likely very large too if you reach that point, and a few gigabytes won't
matter.

One could argue that the memory needed for the SLRU cache matters more
than the disk space. That's perhaps true, but I think this is totally
acceptable from that point of view, too.

- Heikki

#78Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Ashutosh Bapat (#73)
Re: POC: make mxidoff 64 bits

On 01/12/2025 14:35, Ashutosh Bapat wrote:

On Mon, Dec 1, 2025 at 2:23 PM Maxim Orlov <orlovmg@gmail.com> wrote:

On Fri, 28 Nov 2025 at 16:17, Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:

An UPDATE waits for FOR SHARE query to finish, and vice versa. In my
experiments I didn't see an UPDATE creating a multi-xact. Why do we
have UPDATEs in the load created by the test? Am I missing something?

As far as I remember, this was done on purpose to create different
multixact members statuses randomly.

In that case, better to include that in the comments.

I think that was indeed the purpose, but the test should use FOR KEY
SHARE rather than FOR SHARE. Otherwise the UPDATEs don't generate multixids.

- Heikki

#79wenhui qiu
qiuwenhuifx@gmail.com
In reply to: Maxim Orlov (#76)
Re: POC: make mxidoff 64 bits

Hi

As a software developer, I definitely want to implement compression and
save a few gigabytes. However, given my previous experience using
Postgres in real-world applications, reliability at the cost of several
gigabytes would not have caused me any trouble. Just saying.

Agreed, +1. If this had been done twenty years ago, the cost might have
been unacceptable. But today's hardware is far more capable: disk random
and sequential I/O performance has improved by hundreds of thousands of
times, and memory capacity has increased by several hundred times. It's
almost unimaginable that we now have single 256-GB DIMMs. So this kind of
overhead is negligible for modern hardware.

Thanks

On Wed, 3 Dec 2025 at 17:54, Maxim Orlov <orlovmg@gmail.com> wrote:

The biggest problem with compression, in my opinion, is that losing
even one byte causes the loss of the entire compressed block in the
worst case scenario. After all, we still don't have checksums for the
SLRU's, which is a shame by itself.

Again, I'm not against the idea of compression, but the risks need to
be considered.

As a software developer, I definitely want to implement compression and
save a few gigabytes. However, given my previous experience using
Postgres in real-world applications, reliability at the cost of several
gigabytes would not have caused me any trouble. Just saying.

--
Best regards,
Maxim Orlov.

#80Maxim Orlov
orlovmg@gmail.com
In reply to: Heikki Linnakangas (#77)
Re: POC: make mxidoff 64 bits

On Wed, 3 Dec 2025 at 15:04, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

There are plenty of such critical bytes in the system where a single bit
flip renders the whole block unreadable. Actually, if we had checksums
on SLRU pages, a single bit flip anywhere in the page would make the
checksum fail and render the block unreadable.

Correct. However, my concern about the lack of checksums for SLRU wasn't
about data loss and the impossibility of recovering it, but about the
impossibility of detecting the error. But it is how it is for now.

--
Best regards,
Maxim Orlov.

#81Ashutosh Bapat
ashutosh.bapat.oss@gmail.com
In reply to: Heikki Linnakangas (#77)
Re: POC: make mxidoff 64 bits

On Wed, Dec 3, 2025 at 5:34 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 03/12/2025 11:54, Maxim Orlov wrote:

The biggest problem with compression, in my opinion, is that losing
even one byte causes the loss of the entire compressed block in the
worst case scenario. After all, we still don't have checksums for the
SLRU's, which is a shame by itself.

Again, I'm not against the idea of compression, but the risks need to
be considered.

There are plenty of such critical bytes in the system where a single bit
flip renders the whole block unreadable. Actually, if we had checksums
on SLRU pages, a single bit flip anywhere in the page would make the
checksum fail and render the block unreadable.

If things go really bad and you need to open a hex editor and try to fix
the data manually, it shouldn't be too hard to deduce the correct base
offset from surrounding data.

As a software developer, I definitely want to implement compression and
save a few gigabytes. However, given my previous experience using
Postgres in real-world applications, reliability at the cost of several
gigabytes would not have caused me any trouble. Just saying.

+1. If we decide to do some kind of compression here, I want it to be
very simple. Otherwise it's just not worth the code complexity and risk.

Let's do the math of how much disk space we'd save. Let's assume the
worst case that every multixid consists of only one transaction ID.
Currently, every such multixid takes up 4 bytes in the offsets SLRU, and
5 bytes in the members SLRU (one flag byte and 4 bytes for the XID). So
that's 9 bytes. With 64-bit offsets, it becomes 13 bytes. With the
compression, we're back to 9 bytes again (ignoring the one base offset
per page). So in an extreme case that you have 1 billion multixids, with
only one XID per multixid, the difference is between 9 GB and 13 GB.
That seems acceptable.

And having just one XID per multixid is a rare corner case. Much more
commonly, you have at least two XIDs. With two XIDs per multixid, the
difference is between 14 bytes and 18 bytes.

And having a billion multixids is pretty extreme. Your database is
likely very large too if you reach that point, and a few gigabytes won't
matter.

I am in favour of keeping things simpler than using a complex compression.

One could argue that the memory needed for the SLRU cache matters more
than the disk space. That's perhaps true, but I think this is totally
acceptable from that point of view, too.

This brings up an interesting point. Since the offsets are twice as large,
the SLRU will hold half as many entries as before. Have we measured
the performance impact of this? Do we need to provide some guidance about
increasing the SLRU size, or increase the default SLRU size?

--
Best Wishes,
Ashutosh Bapat

#82Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Heikki Linnakangas (#69)
Re: POC: make mxidoff 64 bits

On 26/11/2025 17:50, Heikki Linnakangas wrote:

On 26/11/2025 17:23, Maxim Orlov wrote:

On Tue, 25 Nov 2025 at 13:07, Heikki Linnakangas <hlinnaka@iki.fi
<mailto:hlinnaka@iki.fi>> wrote:

GetOldMultiXactIdSingleMember() currently asserts that the offset is
never zero, but it should try to do something sensible in that case
instead of just failing.

Correct me if I'm wrong, but we added the assertion that offsets are
never 0, based on the idea that case #2 will never take place during an
update. If this isn't the case, this assertion could be removed.
The rest of the function appears to work correctly.

I even think that, as an experiment, we could randomly reset some of the
offsets to zero and nothing would happen, except that some data would
be lost.

+1

The most sensible thing we can do is give the user a warning, right?
Something like, "During the update, we encountered some weird offset
that shouldn't have been there, but there's nothing we can do about it,
just take note."

Yep, makes sense.

I read through the SLRU reading codepath, looking for all the things
that could go wrong (not sure I got them all):

1. An SLRU file does not exist
2. An SLRU file is too short, i.e. a page does not exist
3. The offset in 'offsets' page is 0
4. The offset in 'offsets' page looks invalid, i.e. it's greater than
nextOffset or smaller than oldestOffset.
5. The offset is out of order compared to its neighbors
6. The multixid has no members
7. The multixid has an invalid (0) member
8. A multixid has more than one updating member

Some of those situations are theoretically possible if there was a
crash. We don't follow the WAL-before-data rule for these SLRUs.
Instead, we piggyback on the WAL-before-data of the heap page that would
reference the multixid. In other words, we rely on the fact that if a
multixid write is missed or torn because of a crash, that multixid will
not be referenced from anywhere and will never be read.

However, that doesn't hold for pg_upgrade. pg_upgrade will try to read
all the multixids. So we need to make the multixact reading code
tolerant of the situations that could be present after a crash. I think
the right philosophy here is that we try to read all the old multixids,
and do our best to interpret them the same way that the old server
would. For those situations that can legitimately be present if the old
server crashed at some point, be silent. For cases that should not
happen, even if there was a crash, print a warning. For example, I think
an SLRU file should never be missing (1) or truncated (2). But the zero
offset (3), and (6) can happen.

Perhaps we should check that all the files exist and have the correct
sizes in the pre-check stage, and abort the upgrade early if anything is
missing. That would be pretty cheap to check.

- Heikki

#83Maxim Orlov
orlovmg@gmail.com
In reply to: Heikki Linnakangas (#82)
2 attachment(s)
Re: POC: make mxidoff 64 bits

On Thu, 4 Dec 2025 at 13:39, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

However, that doesn't hold for pg_upgrade. pg_upgrade will try to read
all the multixids. So we need to make the multixact reading code
tolerant of the situations that could be present after a crash. I think
the right philosophy here is that we try to read all the old multixids,
and do our best to interpret them the same way that the old server
would.

Something like attached?

Now the previous upgrade scheme, with its byte juggling, is starting to
look not so bad. Just a funny thought that came to my mind.

Perhaps we should check that all the files exist and have the correct
sizes in the pre-check stage

Not sure about it. Because SLRU does not support "holes", simply
checking if the first and last multixacts exist will be enough. But
we'll do it anyway in a real conversion.

PFA to start a conversation.

--
Best regards,
Maxim Orlov.

Attachments:

0001-rough-draft-of-skipping-bogus-offsets.patch.txt (text/plain; charset=US-ASCII)
From c2ccb107bef898420e6417c37d56c6b30578d28f Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Thu, 4 Dec 2025 17:02:35 +0300
Subject: [PATCH] rough draft of skipping bogus offsets

---
 src/bin/pg_upgrade/multixact_old.c | 38 ++++++++++++++++++++++++------
 src/bin/pg_upgrade/multixact_old.h |  2 +-
 src/bin/pg_upgrade/pg_upgrade.c    | 15 ++++++------
 3 files changed, 40 insertions(+), 15 deletions(-)

diff --git a/src/bin/pg_upgrade/multixact_old.c b/src/bin/pg_upgrade/multixact_old.c
index 529eeeb93b6..685bfaeff82 100644
--- a/src/bin/pg_upgrade/multixact_old.c
+++ b/src/bin/pg_upgrade/multixact_old.c
@@ -136,7 +136,7 @@ AllocOldMultiXactRead(char *pgdata, MultiXactId nextMulti,
  * - Because there's no concurrent activity, We don't need to worry about
  *   locking and some corner cases.
  */
-void
+bool
 GetOldMultiXactIdSingleMember(OldMultiXactReader *state, MultiXactId multi,
 							  TransactionId *result, MultiXactStatus *status)
 {
@@ -189,7 +189,18 @@ GetOldMultiXactIdSingleMember(OldMultiXactReader *state, MultiXactId multi,
 	offptr += entryno;
 	offset = *offptr;
 
-	Assert(offset != 0);
+	if (offset == 0)
+	{
+		pg_log(PG_WARNING, "multixact %u, offset is empty", multi);
+		return false;
+	}
+#if 0
+	if ( <more checks> )
+	{
+		pg_log(PG_WARNING, "multixact %u, offset is bogus", multi);
+		return false;
+	}
+#endif
 
 	/*
 	 * Use the same increment rule as GetNewMultiXactId(), that is, don't
@@ -224,9 +235,13 @@ GetOldMultiXactIdSingleMember(OldMultiXactReader *state, MultiXactId multi,
 
 		/*
 		 * Corner case 2: next multixact is still being filled in, this cannot
-		 * happen during upgrade.
+		 * happen during upgrade, but if it does, complain.
 		 */
-		Assert(nextMXOffset != 0);
+		if (nextMXOffset == 0)
+		{
+			pg_log(PG_WARNING, "multixact next to %u is empty", multi);
+			return false;
+		}
 
 		length = nextMXOffset - offset;
 	}
@@ -272,8 +287,11 @@ GetOldMultiXactIdSingleMember(OldMultiXactReader *state, MultiXactId multi,
 		{
 			/* sanity check */
 			if (result_isupdate)
-				pg_fatal("multixact %u has more than one updating member",
-						 multi);
+			{
+				pg_log(PG_WARNING,
+					   "multixact %u has more than one updating member", multi);
+				return false;
+			}
 			result_xid = *xactptr;
 			result_isupdate = true;
 		}
@@ -282,11 +300,17 @@ GetOldMultiXactIdSingleMember(OldMultiXactReader *state, MultiXactId multi,
 	}
 
 	/* A multixid with zero members should not happen */
-	Assert(TransactionIdIsValid(result_xid));
+	if (!TransactionIdIsValid(result_xid))
+	{
+		pg_log(PG_WARNING, "multixact %u have zero members", multi);
+		return false;
+	}
 
 	*result = result_xid;
 	*status = result_isupdate ? MultiXactStatusUpdate :
 		MultiXactStatusForKeyShare;
+
+	return true;
 }
 
 /*
diff --git a/src/bin/pg_upgrade/multixact_old.h b/src/bin/pg_upgrade/multixact_old.h
index 4f9e086a1fb..b7352159d83 100644
--- a/src/bin/pg_upgrade/multixact_old.h
+++ b/src/bin/pg_upgrade/multixact_old.h
@@ -29,7 +29,7 @@ typedef struct OldMultiXactReader
 extern OldMultiXactReader *AllocOldMultiXactRead(char *pgdata,
 												 MultiXactId nextMulti,
 												 OldMultiXactOffset nextOffset);
-extern void GetOldMultiXactIdSingleMember(OldMultiXactReader *state,
+extern bool GetOldMultiXactIdSingleMember(OldMultiXactReader *state,
 										  MultiXactId multi,
 										  TransactionId *result,
 										  MultiXactStatus *status);
diff --git a/src/bin/pg_upgrade/pg_upgrade.c b/src/bin/pg_upgrade/pg_upgrade.c
index ff937b9e104..c5da56fe785 100644
--- a/src/bin/pg_upgrade/pg_upgrade.c
+++ b/src/bin/pg_upgrade/pg_upgrade.c
@@ -832,14 +832,15 @@ convert_multixacts(MultiXactId from_multi, MultiXactId to_multi)
 			 * one updating one, or if there was no update, arbitrarily the
 			 * first locking xid.
 			 */
-			GetOldMultiXactIdSingleMember(old_reader, multi, &xid, &status);
+			if (GetOldMultiXactIdSingleMember(old_reader, multi, &xid, &status))
+			{
+				/* Write it out in new format */
+				member.xid = xid;
+				member.status = status;
+				RecordNewMultiXact(new_writer, next_offset, multi, 1, &member);
+				next_offset += 1;
+			}
 
-			/* Write it out in new format */
-			member.xid = xid;
-			member.status = status;
-			RecordNewMultiXact(new_writer, next_offset, multi, 1, &member);
-
-			next_offset += 1;
 			multi++;
 			/* handle wraparound */
 			if (multi < FirstMultiXactId)
-- 
2.43.0

0002-Check-is-first-and-last-multis-exists.patch.txt (text/plain; charset=US-ASCII)
From 0ac8eb292c21a06da31215aa41adb53ec1f90872 Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Thu, 4 Dec 2025 18:31:37 +0300
Subject: [PATCH 2/2] Check is first and last multis exists

---
 src/bin/pg_upgrade/multixact_old.c | 10 ++++++++++
 src/bin/pg_upgrade/multixact_old.h |  2 ++
 src/bin/pg_upgrade/pg_upgrade.c    | 23 +++++++++++++++++++++++
 3 files changed, 35 insertions(+)

diff --git a/src/bin/pg_upgrade/multixact_old.c b/src/bin/pg_upgrade/multixact_old.c
index 685bfaeff82..ffd06ad908f 100644
--- a/src/bin/pg_upgrade/multixact_old.c
+++ b/src/bin/pg_upgrade/multixact_old.c
@@ -324,3 +324,13 @@ FreeOldMultiXactReader(OldMultiXactReader *state)
 
 	pfree(state);
 }
+
+void
+CheckOldMultiXactIdExist(OldMultiXactReader *state, MultiXactId multi)
+{
+	int64 pageno = MultiXactIdToOffsetPage(multi);
+	char *buf = SlruReadSwitchPage(state->offset, pageno);
+
+	if (!buf)
+		pg_fatal("could not read multixact %u", multi);
+}
diff --git a/src/bin/pg_upgrade/multixact_old.h b/src/bin/pg_upgrade/multixact_old.h
index b7352159d83..86141ac392f 100644
--- a/src/bin/pg_upgrade/multixact_old.h
+++ b/src/bin/pg_upgrade/multixact_old.h
@@ -34,5 +34,7 @@ extern bool GetOldMultiXactIdSingleMember(OldMultiXactReader *state,
 										  TransactionId *result,
 										  MultiXactStatus *status);
 extern void FreeOldMultiXactReader(OldMultiXactReader *reader);
+extern void CheckOldMultiXactIdExist(OldMultiXactReader *state,
+									 MultiXactId multi);
 
 #endif							/* MULTIXACT_OLD_H */
diff --git a/src/bin/pg_upgrade/pg_upgrade.c b/src/bin/pg_upgrade/pg_upgrade.c
index c5da56fe785..647e05f350a 100644
--- a/src/bin/pg_upgrade/pg_upgrade.c
+++ b/src/bin/pg_upgrade/pg_upgrade.c
@@ -772,6 +772,21 @@ copy_subdir_files(const char *old_subdir, const char *new_subdir)
 	check_ok();
 }
 
+static void
+check_multixacts(MultiXactId from_multi, MultiXactId to_multi)
+{
+	OldMultiXactReader *reader;
+
+	reader = AllocOldMultiXactRead(old_cluster.pgdata,
+								   old_cluster.controldata.chkpnt_nxtmulti,
+								   old_cluster.controldata.chkpnt_nxtmxoff);
+
+	CheckOldMultiXactIdExist(reader, from_multi);
+	CheckOldMultiXactIdExist(reader, to_multi);
+
+	FreeOldMultiXactReader(reader);
+}
+
 /*
  * Convert pg_multixact/offset and /members from the old format with 32-bit
  * offsets.
@@ -958,6 +973,14 @@ copy_xact_xlog_xid(void)
 		remove_new_subdir("pg_multixact/members", false);
 		remove_new_subdir("pg_multixact/offsets", false);
 
+		/*
+		 * Before the actual conversion do sanity check.
+		 * XXX: place it properly, it should be better place for this
+		 */
+		prep_status("Sanity check pg_multixact files");
+		check_multixacts(oldstMulti, nxtmulti);
+		check_ok();
+
 		/*
 		 * Create new pg_multixact files, converting old ones if needed.
 		 */
-- 
2.43.0

#84Ashutosh Bapat
ashutosh.bapat.oss@gmail.com
In reply to: Ashutosh Bapat (#70)
Re: POC: make mxidoff 64 bits

On Fri, Nov 28, 2025 at 6:35 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:

I will continue reviewing it further.

There is duplication of code/functionality between the server and
pg_upgrade. With it we carry all the risks that come with
code/functionality duplication, like the copies going out of sync.
There may be a valid reason to do that, but it's not documented in the
comments. At least multixact_new.c and slru_io.c are not as well
commented as their server counterparts. I understand that the SLRU
code in the server deals with shared memory, which is not needed in
pg_upgrade; pg_upgrade will not need more than one buffer in memory,
and pg_upgrade code neither needs to nor can deal with locks. That
means the code required by pg_upgrade is much simpler than that on the
server. But there's also non-trivial code which is required in both
cases. Will it be possible to extract the parts of slru.c which deal
with IO into slru_io.c, make it part of the core, and then use it in
pg_upgrade as well as slru.c? Or is it possible to make SLRU use local
memory? By throwing some FRONTEND magic into the mix, we may be able
to avoid duplication. Have we tried this or something else to avoid
duplication? Sorry if this has been discussed earlier. Please point me
to the relevant discussion if so.

--
Best Wishes,
Ashutosh Bapat

#85Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Ashutosh Bapat (#84)
6 attachment(s)
Re: POC: make mxidoff 64 bits

After a little detour to the "IPC/MultixactCreation on the Standby
server" issue [1], I'm back to working on this. New patch version
attached, addressing your comments and Maxim's.

On 05/12/2025 15:42, Ashutosh Bapat wrote:

There is duplication of code/functionality between the server and
pg_upgrade. With it we carry all the risks that come with
code/functionality duplication, like the copies going out of sync.
There may be a valid reason to do that, but it's not documented in the
comments. At least multixact_new.c and slru_io.c are not as well
commented as their server counterparts. I understand that the SLRU
code in the server deals with shared memory, which is not needed in
pg_upgrade; pg_upgrade will not need more than one buffer in memory,
and pg_upgrade code neither needs to nor can deal with locks. That
means the code required by pg_upgrade is much simpler than that on the
server. But there's also non-trivial code which is required in both
cases. Will it be possible to extract the parts of slru.c which deal
with IO into slru_io.c, make it part of the core, and then use it in
pg_upgrade as well as slru.c? Or is it possible to make SLRU use local
memory? By throwing some FRONTEND magic into the mix, we may be able
to avoid duplication. Have we tried this or something else to avoid
duplication? Sorry if this has been discussed earlier. Please point me
to the relevant discussion if so.

That's a fair point, but I think it's better to have some code
duplication in this case, than trying to write code that works for both
the server and for pg_upgrade. The needs are so different.

007_multixact_conversion.pl fires thousands of queries through
BackgroundPsql which prints debug output for each of the queries. When
running this file with oldinstall set,
2.2M regress_log_007_multixact_conversion (size of file)
77874 regress_log_007_multixact_conversion (wc -l output)

Since this output is also copied in testlog.txt, the effect is two-fold.

Most, if not all, of this output is useless. It also makes it hard to
find the output we are looking for. PFA patch which reduces this
output. The patch adds a flag verbose to query_safe() and query() to
toggle this output. With the patch the sizes are
27K regress_log_007_multixact_conversion
588 regress_log_007_multixact_conversion

And it makes the test faster by about a second or two on my laptop.
Something along those lines is required to reduce the output
from query_safe().

Nice! That log bloat was the reason I bundled together the "COMMIT;
BEGIN; SELECT ...;" steps into one statement in the loop. Your solution
addresses it more directly.

I turned 'verbose' into a keyword parameter, for future extensibility of
those functions, so you now call it like "$node->query_safe("SELECT 1",
verbose => 0);". I also set "log_statement=none" in those connections,
to reduce the noise in the server log too.

Some more comments
+++ b/src/bin/pg_upgrade/multixact_old.c

We may need to introduce new _new and then _old will become _older.
Should we rename the files to have pre19 and post19 or some similar
suffixes which make it clear what is meant by old and new?

+1. I renamed multixact_old.c to multixact_pre_v19.c. And
multixact_new.c to multixact_rewrite.c. I also moved the
"convert_multixact" function that drives the conversion to
multixact_rewrite.c. The idea is that if in the future we change the
format again, we will have:

multixact_pre_v19.c # for reading -v19 files
multixact_pre_v24.c # for reading v19-v23 files
multixact_rewrite.c # for writing new files

Hard to predict what a possible future format might look like and how
we'd want to organize the code then, though. This can be changed then if
needed, but it makes sense now.

+static inline int64
+MultiXactIdToOffsetPage(MultiXactId multi)

The prologue mentions that the definitions are copy-pasted from
multixact.c from version 18, but they share the names with functions
in the current version. I think that's going to be a good source of
confusion especially in a file which is a few hundred lines long. Can
we rename them to have "Old" prefix or something similar?

Fair. On the other hand, having the same names makes it easier to see
what the real differences with the server functions are. Not sure what's
best here..

As long as we use the same names, it's important that
multixact_pre_v19.c doesn't #include the new definitions. I added some
comments on that, and also this safeguard:

#define MultiXactOffset should_not_be_used

That actually caught one (harmless) instance in the file where we had
not renamed MultiXactOffset to OldMultiXactOffset.

I'm not entirely happy with the "Old" prefix here, because as you
pointed out, we might end up needing "older" or "oldold" in the future.
I couldn't come up with anything better though. "PreV19MultiXactOffset"
is quite a mouthful.

+# Dump contents of the 'mxofftest' table, created by mxact_workload
+sub get_dump_for_comparison

This local function shares its name with a local function in
002_pg_upgrade.pl. Better to use a separate name. Also it's not
"dumping" data using "pg_dump", so "dump" in the name can be
misleading.

Renamed to "get_test_table_contents"

+ $newnode->start;
+ my $new_dump = get_dump_for_comparison($newnode, "newnode_${tag}_dump");
+ $newnode->stop;

There is no code which actually looks at the multixact offsets here to
make sure that the conversion happened correctly. I guess the test
relies on visibility checks for that. Anyway, we need a comment
explaining why just comparing the contents of the table is enough to
ensure correct conversion. Better if we can add an explicit test that
the offsets were converted correctly. I don't have any idea of how to
do that right now, though. Maybe use pg_get_multixact_members()
somehow in the query to extract data out of the table?

Agreed, the verification here is quite weak. I didn't realize that
pg_get_multixact_members() exists! That might indeed be handy here, but
I'm not sure how exactly to construct the test. A direct C function like
test_create_multixact() in test_multixact.c would be handy here, but
we'd need to compile and run that in the old cluster, which seems
difficult.

+ compare_files($old_dump, $new_dump,
+ 'dump outputs from original and restored regression databases match');

A shared test name too :); but there is no regression database here.

Fixed :-)

+
+ note ">>> case #${tag}\n"
+   . " oldnode mxoff from ${start_mxoff} to ${finish_mxoff}\n"
+   . " newnode mxoff ${new_next_mxoff}\n";

Should we check that some condition holds between finish_mxoff and
new_next_mxoff?

Got something in mind that we could check?

On 04/12/2025 17:33, Maxim Orlov wrote:

On Thu, 4 Dec 2025 at 13:39, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:

However, that doesn't hold for pg_upgrade. pg_upgrade will try to read
all the multixids. So we need to make the multixact reading code
tolerant of the situations that could be present after a crash. I think
the right philosophy here is that we try to read all the old multixids,
and do our best to interpret them the same way that the old server
would.

Something like attached?

+1

Now the previous upgrade scheme, with its byte juggling, is starting to
look not so bad. Just a funny thought that came to my mind.

:-)

Perhaps we should check that all the files exist and have the correct
sizes in the pre-check stage

Not sure about it. Because SLRU does not support "holes", simply
checking if the first and last multixacts exist will be enough. But
we'll do it anyway in a real conversion.

Yeah, the point would be to complain if there are holes when there
shouldn't be. As a sanity check.

There's a reason to not do that though: if you use pg_resetwal to skip
over some multixids, you end up with holes. We shouldn't encourage
people to use pg_resetwal, but it seems reasonable to tolerate it if you
have done it.

I worked some more on this. One notable change is that in light of the
"IPC/MultixactCreation on the Standby server" changes [1], we need to
always write the next multixid's offset, even if the next multixid
itself is invalid. Because otherwise the previous multixid is unreadable
too.

The SlruReadSwitchPageSlow() function didn't handle short reads
properly. As a result, if an SLRU file was shorter than expected, the
buffer kept its old contents when switching to read the missing page.
Fixed that so that the missing part is read as all-zeros instead.

I removed the warnings from some of the invalid multixid cases, like if
the offset or some of the members are zeros. Those cases can
legitimately happen after crash and restart, so we shouldn't complain
about them. If a multixid has more than one updating member, I kept that
as a fatal error. That really should not happen.

To summarize, the behavior now is that if an old SLRU file does not
exist, you get an error. If an SLRU file is too short, you get warnings
and the missing pages are read as all-zeros, i.e. all the multixids on
the missing pages are considered invalid. If an individual multixid is
invalid, because the offset is zero, it's silently written as invalid in
the new file too.

I'm still not 100% sure what the desired behavior for missing files is.
For now, I didn't include the pre-checks for the first and last files in
this version. You can end up with missing files if you skip over many
multixids with pg_resetwal. Or it could be a sign of lost data. If it's
lost data, would you prefer for the upgrade to fail, or continue
upgrading the data that you have? The conversion shouldn't make things
worse, if the data was already lost, but then again, if something really
bad has happened, all bets are off and perhaps it would be best to abort
and complain loudly.

[1]: /messages/by-id/172e5723-d65f-4eec-b512-14beacb326ce@yandex.ru

- Heikki

Attachments:

v28-0001-pg_resetwal-Reject-negative-and-out-of-range-arg.patch (text/x-patch; charset=UTF-8)
From 55af5ce25a7a3f464faeeea8d8bf5ab215c73f34 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Wed, 19 Nov 2025 16:36:00 +0200
Subject: [PATCH v28 1/6] pg_resetwal: Reject negative and out of range
 arguments

The strtoul() function that we used to parse many of the options
accepts negative values, and silently wraps them to the equivalent
unsigned values. For example, -1 becomes 0xFFFFFFFF, on platforms
where unsigned long is 32 bits wide. Also, on platforms where
"unsigned long" is 64 bits wide, we silently casted values larger than
UINT32_MAX to the equivalent 32-bit value. Both of those behaviors
seem undesireable, so tighten up the parsing to reject negative and
too large values.
---
 src/bin/pg_resetwal/pg_resetwal.c  | 64 ++++++++++++++++++++++++------
 src/bin/pg_resetwal/t/001_basic.pl | 19 ++++++++-
 2 files changed, 68 insertions(+), 15 deletions(-)

diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index 8d5d9805279..8ca8dad01a0 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -92,6 +92,7 @@ static void KillExistingArchiveStatus(void);
 static void KillExistingWALSummaries(void);
 static void WriteEmptyXLOG(void);
 static void usage(void);
+static uint32 strtouint32_strict(const char *restrict s, char **restrict endptr, int base);
 
 
 int
@@ -120,7 +121,6 @@ main(int argc, char *argv[])
 	MultiXactId set_oldestmxid = 0;
 	char	   *endptr;
 	char	   *endptr2;
-	int64		tmpi64;
 	char	   *DataDir = NULL;
 	char	   *log_fname = NULL;
 	int			fd;
@@ -162,7 +162,7 @@ main(int argc, char *argv[])
 
 			case 'e':
 				errno = 0;
-				set_xid_epoch = strtoul(optarg, &endptr, 0);
+				set_xid_epoch = strtouint32_strict(optarg, &endptr, 0);
 				if (endptr == optarg || *endptr != '\0' || errno != 0)
 				{
 					/*------
@@ -177,7 +177,7 @@ main(int argc, char *argv[])
 
 			case 'u':
 				errno = 0;
-				set_oldest_xid = strtoul(optarg, &endptr, 0);
+				set_oldest_xid = strtouint32_strict(optarg, &endptr, 0);
 				if (endptr == optarg || *endptr != '\0' || errno != 0)
 				{
 					pg_log_error("invalid argument for option %s", "-u");
@@ -190,7 +190,7 @@ main(int argc, char *argv[])
 
 			case 'x':
 				errno = 0;
-				set_xid = strtoul(optarg, &endptr, 0);
+				set_xid = strtouint32_strict(optarg, &endptr, 0);
 				if (endptr == optarg || *endptr != '\0' || errno != 0)
 				{
 					pg_log_error("invalid argument for option %s", "-x");
@@ -203,7 +203,7 @@ main(int argc, char *argv[])
 
 			case 'c':
 				errno = 0;
-				set_oldest_commit_ts_xid = strtoul(optarg, &endptr, 0);
+				set_oldest_commit_ts_xid = strtouint32_strict(optarg, &endptr, 0);
 				if (endptr == optarg || *endptr != ',' || errno != 0)
 				{
 					pg_log_error("invalid argument for option %s", "-c");
@@ -229,7 +229,7 @@ main(int argc, char *argv[])
 
 			case 'o':
 				errno = 0;
-				set_oid = strtoul(optarg, &endptr, 0);
+				set_oid = strtouint32_strict(optarg, &endptr, 0);
 				if (endptr == optarg || *endptr != '\0' || errno != 0)
 				{
 					pg_log_error("invalid argument for option %s", "-o");
@@ -242,7 +242,7 @@ main(int argc, char *argv[])
 
 			case 'm':
 				errno = 0;
-				set_mxid = strtoul(optarg, &endptr, 0);
+				set_mxid = strtouint32_strict(optarg, &endptr, 0);
 				if (endptr == optarg || *endptr != ',' || errno != 0)
 				{
 					pg_log_error("invalid argument for option %s", "-m");
@@ -250,7 +250,7 @@ main(int argc, char *argv[])
 					exit(1);
 				}
 
-				set_oldestmxid = strtoul(endptr + 1, &endptr2, 0);
+				set_oldestmxid = strtouint32_strict(endptr + 1, &endptr2, 0);
 				if (endptr2 == endptr + 1 || *endptr2 != '\0' || errno != 0)
 				{
 					pg_log_error("invalid argument for option %s", "-m");
@@ -269,17 +269,13 @@ main(int argc, char *argv[])
 
 			case 'O':
 				errno = 0;
-				tmpi64 = strtoi64(optarg, &endptr, 0);
+				set_mxoff = strtouint32_strict(optarg, &endptr, 0);
 				if (endptr == optarg || *endptr != '\0' || errno != 0)
 				{
 					pg_log_error("invalid argument for option %s", "-O");
 					pg_log_error_hint("Try \"%s --help\" for more information.", progname);
 					exit(1);
 				}
-				if (tmpi64 < 0 || tmpi64 > (int64) MaxMultiXactOffset)
-					pg_fatal("multitransaction offset (-O) must be between 0 and %u", MaxMultiXactOffset);
-
-				set_mxoff = (MultiXactOffset) tmpi64;
 				mxoff_given = true;
 				break;
 
@@ -1214,3 +1210,45 @@ usage(void)
 	printf(_("\nReport bugs to <%s>.\n"), PACKAGE_BUGREPORT);
 	printf(_("%s home page: <%s>\n"), PACKAGE_NAME, PACKAGE_URL);
 }
+
+/*
+ * strtouint32_strict -- like strtoul(), but returns uint32 and doesn't accept
+ * negative values
+ */
+static uint32
+strtouint32_strict(const char *restrict s, char **restrict endptr, int base)
+{
+	unsigned long val;
+	bool		is_neg;
+
+	/* skip leading whitespace */
+	while (isspace(*s))
+		s++;
+
+	/*
+	 * Is it negative?  We still call strtoul() if it was, to set 'endptr'.
+	 * (The current callers don't care though.)
+	 */
+	is_neg = (*s == '-');
+
+	val = strtoul(s, endptr, base);
+
+	/* reject if it was negative */
+	if (errno == 0 && is_neg)
+	{
+		errno = ERANGE;
+		val = 0;
+	}
+
+	/*
+	 * reject values larger than UINT32_MAX on platforms where long is 64 bits
+	 * wide.
+	 */
+	if (errno == 0 && val != (uint32) val)
+	{
+		errno = ERANGE;
+		val = UINT32_MAX;
+	}
+
+	return (uint32) val;
+}
diff --git a/src/bin/pg_resetwal/t/001_basic.pl b/src/bin/pg_resetwal/t/001_basic.pl
index 90ecb8afe18..e9780dbe2a6 100644
--- a/src/bin/pg_resetwal/t/001_basic.pl
+++ b/src/bin/pg_resetwal/t/001_basic.pl
@@ -103,7 +103,7 @@ command_fails_like(
 	'fails with incorrect -e option');
 command_fails_like(
 	[ 'pg_resetwal', '-e' => '-1', $node->data_dir ],
-	qr/must not be -1/,
+	qr/error: invalid argument for option -e/,
 	'fails with -e value -1');
 # -l
 command_fails_like(
@@ -145,7 +145,7 @@ command_fails_like(
 	'fails with incorrect -O option');
 command_fails_like(
 	[ 'pg_resetwal', '-O' => '-1', $node->data_dir ],
-	qr/must be between 0 and 4294967295/,
+	qr/error: invalid argument for option -O/,
 	'fails with -O value -1');
 # --wal-segsize
 command_fails_like(
@@ -175,6 +175,21 @@ command_fails_like(
 	qr/must be greater than/,
 	'fails with -x value too small');
 
+# Check out of range values with -x. These are forbidden for all other
+# 32-bit values too, but we use just -x to exercise the parsing.
+command_fails_like(
+	[ 'pg_resetwal', '-x' => '-1', $node->data_dir ],
+	qr/error: invalid argument for option -x/,
+	'fails with -x value -1');
+command_fails_like(
+	[ 'pg_resetwal', '-x' => '-100', $node->data_dir ],
+	qr/error: invalid argument for option -x/,
+	'fails with negative -x value');
+command_fails_like(
+	[ 'pg_resetwal', '-x' => '10000000000', $node->data_dir ],
+	qr/error: invalid argument for option -x/,
+	'fails with -x value too large');
+
 # --char-signedness
 command_fails_like(
 	[ 'pg_resetwal', '--char-signedness', 'foo', $node->data_dir ],
-- 
2.47.3

v28-0002-pg_resetwal-Use-separate-flags-for-whether-an-op.patch (text/x-patch; charset=UTF-8)
From afb3703742e2ef69b19420891d7d87e81ee9e0c5 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Wed, 3 Dec 2025 20:48:48 +0200
Subject: [PATCH v28 2/6] pg_resetwal: Use separate flags for whether an option
 is given

Currently, we use special values that are otherwise invalid for each
option to indicate "option was not given". Replace that with separate
boolean variables for each option. It seems more clear to be explicit.

We were already doing that for the -m option, because there were no
invalid values for nextMulti that we could use (since commit
94939c5f3a).
---
 src/bin/pg_resetwal/pg_resetwal.c | 166 +++++++++++++++++-------------
 1 file changed, 95 insertions(+), 71 deletions(-)

diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index 8ca8dad01a0..c667a11cb6a 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -64,21 +64,43 @@ static ControlFileData ControlFile; /* pg_control values */
 static XLogSegNo newXlogSegNo;	/* new XLOG segment # */
 static bool guessed = false;	/* T if we had to guess at any values */
 static const char *progname;
-static uint32 set_xid_epoch = (uint32) -1;
-static TransactionId set_oldest_xid = 0;
-static TransactionId set_xid = 0;
-static TransactionId set_oldest_commit_ts_xid = 0;
-static TransactionId set_newest_commit_ts_xid = 0;
-static Oid	set_oid = 0;
-static bool mxid_given = false;
-static MultiXactId set_mxid = 0;
-static bool mxoff_given = false;
-static MultiXactOffset set_mxoff = 0;
+
+/*
+ * New values given on the command-line
+ */
+static bool next_xid_epoch_given = false;
+static uint32 next_xid_epoch_val;
+
+static bool oldest_xid_given = false;
+static TransactionId oldest_xid_val;
+
+static bool next_xid_given = false;
+static TransactionId next_xid_val;
+
+static bool commit_ts_xids_given = false;
+static TransactionId oldest_commit_ts_xid_val;
+static TransactionId newest_commit_ts_xid_val;
+
+static bool next_oid_given = false;
+static Oid	next_oid_val;
+
+static bool mxids_given = false;
+static MultiXactId next_mxid_val;
+static MultiXactId oldest_mxid_val = 0;
+
+static bool next_mxoff_given = false;
+static MultiXactOffset next_mxoff_val;
+
+static bool wal_segsize_given = false;
+static int	wal_segsize_val;
+
+static bool char_signedness_given = false;
+static bool char_signedness_val;
+
+
 static TimeLineID minXlogTli = 0;
 static XLogSegNo minXlogSegNo = 0;
 static int	WalSegSz;
-static int	set_wal_segsize;
-static int	set_char_signedness = -1;
 
 static void CheckDataVersion(void);
 static bool read_controlfile(void);
@@ -118,7 +140,6 @@ main(int argc, char *argv[])
 	int			c;
 	bool		force = false;
 	bool		noupdate = false;
-	MultiXactId set_oldestmxid = 0;
 	char	   *endptr;
 	char	   *endptr2;
 	char	   *DataDir = NULL;
@@ -162,7 +183,7 @@ main(int argc, char *argv[])
 
 			case 'e':
 				errno = 0;
-				set_xid_epoch = strtouint32_strict(optarg, &endptr, 0);
+				next_xid_epoch_val = strtouint32_strict(optarg, &endptr, 0);
 				if (endptr == optarg || *endptr != '\0' || errno != 0)
 				{
 					/*------
@@ -171,46 +192,47 @@ main(int argc, char *argv[])
 					pg_log_error_hint("Try \"%s --help\" for more information.", progname);
 					exit(1);
 				}
-				if (set_xid_epoch == -1)
-					pg_fatal("transaction ID epoch (-e) must not be -1");
+				next_xid_epoch_given = true;
 				break;
 
 			case 'u':
 				errno = 0;
-				set_oldest_xid = strtouint32_strict(optarg, &endptr, 0);
+				oldest_xid_val = strtouint32_strict(optarg, &endptr, 0);
 				if (endptr == optarg || *endptr != '\0' || errno != 0)
 				{
 					pg_log_error("invalid argument for option %s", "-u");
 					pg_log_error_hint("Try \"%s --help\" for more information.", progname);
 					exit(1);
 				}
-				if (!TransactionIdIsNormal(set_oldest_xid))
+				if (!TransactionIdIsNormal(oldest_xid_val))
 					pg_fatal("oldest transaction ID (-u) must be greater than or equal to %u", FirstNormalTransactionId);
+				oldest_xid_given = true;
 				break;
 
 			case 'x':
 				errno = 0;
-				set_xid = strtouint32_strict(optarg, &endptr, 0);
+				next_xid_val = strtouint32_strict(optarg, &endptr, 0);
 				if (endptr == optarg || *endptr != '\0' || errno != 0)
 				{
 					pg_log_error("invalid argument for option %s", "-x");
 					pg_log_error_hint("Try \"%s --help\" for more information.", progname);
 					exit(1);
 				}
-				if (!TransactionIdIsNormal(set_xid))
+				if (!TransactionIdIsNormal(next_xid_val))
 					pg_fatal("transaction ID (-x) must be greater than or equal to %u", FirstNormalTransactionId);
+				next_xid_given = true;
 				break;
 
 			case 'c':
 				errno = 0;
-				set_oldest_commit_ts_xid = strtouint32_strict(optarg, &endptr, 0);
+				oldest_commit_ts_xid_val = strtouint32_strict(optarg, &endptr, 0);
 				if (endptr == optarg || *endptr != ',' || errno != 0)
 				{
 					pg_log_error("invalid argument for option %s", "-c");
 					pg_log_error_hint("Try \"%s --help\" for more information.", progname);
 					exit(1);
 				}
-				set_newest_commit_ts_xid = strtoul(endptr + 1, &endptr2, 0);
+				newest_commit_ts_xid_val = strtoul(endptr + 1, &endptr2, 0);
 				if (endptr2 == endptr + 1 || *endptr2 != '\0' || errno != 0)
 				{
 					pg_log_error("invalid argument for option %s", "-c");
@@ -218,31 +240,33 @@ main(int argc, char *argv[])
 					exit(1);
 				}
 
-				if (set_oldest_commit_ts_xid < FirstNormalTransactionId &&
-					set_oldest_commit_ts_xid != InvalidTransactionId)
+				if (oldest_commit_ts_xid_val < FirstNormalTransactionId &&
+					oldest_commit_ts_xid_val != InvalidTransactionId)
 					pg_fatal("transaction ID (-c) must be either %u or greater than or equal to %u", InvalidTransactionId, FirstNormalTransactionId);
 
-				if (set_newest_commit_ts_xid < FirstNormalTransactionId &&
-					set_newest_commit_ts_xid != InvalidTransactionId)
+				if (newest_commit_ts_xid_val < FirstNormalTransactionId &&
+					newest_commit_ts_xid_val != InvalidTransactionId)
 					pg_fatal("transaction ID (-c) must be either %u or greater than or equal to %u", InvalidTransactionId, FirstNormalTransactionId);
+				commit_ts_xids_given = true;
 				break;
 
 			case 'o':
 				errno = 0;
-				set_oid = strtouint32_strict(optarg, &endptr, 0);
+				next_oid_val = strtouint32_strict(optarg, &endptr, 0);
 				if (endptr == optarg || *endptr != '\0' || errno != 0)
 				{
 					pg_log_error("invalid argument for option %s", "-o");
 					pg_log_error_hint("Try \"%s --help\" for more information.", progname);
 					exit(1);
 				}
-				if (set_oid == 0)
+				if (next_oid_val == 0)
 					pg_fatal("OID (-o) must not be 0");
+				next_oid_given = true;
 				break;
 
 			case 'm':
 				errno = 0;
-				set_mxid = strtouint32_strict(optarg, &endptr, 0);
+				next_mxid_val = strtouint32_strict(optarg, &endptr, 0);
 				if (endptr == optarg || *endptr != ',' || errno != 0)
 				{
 					pg_log_error("invalid argument for option %s", "-m");
@@ -250,7 +274,7 @@ main(int argc, char *argv[])
 					exit(1);
 				}
 
-				set_oldestmxid = strtouint32_strict(endptr + 1, &endptr2, 0);
+				oldest_mxid_val = strtouint32_strict(endptr + 1, &endptr2, 0);
 				if (endptr2 == endptr + 1 || *endptr2 != '\0' || errno != 0)
 				{
 					pg_log_error("invalid argument for option %s", "-m");
@@ -262,21 +286,21 @@ main(int argc, char *argv[])
 				 * XXX It'd be nice to have more sanity checks here, e.g. so
 				 * that oldest is not wrapped around w.r.t. nextMulti.
 				 */
-				if (set_oldestmxid == 0)
+				if (oldest_mxid_val == 0)
 					pg_fatal("oldest multitransaction ID (-m) must not be 0");
-				mxid_given = true;
+				mxids_given = true;
 				break;
 
 			case 'O':
 				errno = 0;
-				set_mxoff = strtouint32_strict(optarg, &endptr, 0);
+				next_mxoff_val = strtouint32_strict(optarg, &endptr, 0);
 				if (endptr == optarg || *endptr != '\0' || errno != 0)
 				{
 					pg_log_error("invalid argument for option %s", "-O");
 					pg_log_error_hint("Try \"%s --help\" for more information.", progname);
 					exit(1);
 				}
-				mxoff_given = true;
+				next_mxoff_given = true;
 				break;
 
 			case 'l':
@@ -300,9 +324,10 @@ main(int argc, char *argv[])
 
 					if (!option_parse_int(optarg, "--wal-segsize", 1, 1024, &wal_segsize_mb))
 						exit(1);
-					set_wal_segsize = wal_segsize_mb * 1024 * 1024;
-					if (!IsValidWalSegSize(set_wal_segsize))
+					wal_segsize_val = wal_segsize_mb * 1024 * 1024;
+					if (!IsValidWalSegSize(wal_segsize_val))
 						pg_fatal("argument of %s must be a power of two between 1 and 1024", "--wal-segsize");
+					wal_segsize_given = true;
 					break;
 				}
 
@@ -311,15 +336,16 @@ main(int argc, char *argv[])
 					errno = 0;
 
 					if (pg_strcasecmp(optarg, "signed") == 0)
-						set_char_signedness = 1;
+						char_signedness_val = true;
 					else if (pg_strcasecmp(optarg, "unsigned") == 0)
-						set_char_signedness = 0;
+						char_signedness_val = false;
 					else
 					{
 						pg_log_error("invalid argument for option %s", "--char-signedness");
 						pg_log_error_hint("Try \"%s --help\" for more information.", progname);
 						exit(1);
 					}
+					char_signedness_given = true;
 					break;
 				}
 
@@ -407,8 +433,8 @@ main(int argc, char *argv[])
 	/*
 	 * If no new WAL segment size was specified, use the control file value.
 	 */
-	if (set_wal_segsize != 0)
-		WalSegSz = set_wal_segsize;
+	if (wal_segsize_given)
+		WalSegSz = wal_segsize_val;
 	else
 		WalSegSz = ControlFile.xlog_seg_size;
 
@@ -431,42 +457,43 @@ main(int argc, char *argv[])
 	 * Adjust fields if required by switches.  (Do this now so that printout,
 	 * if any, includes these values.)
 	 */
-	if (set_xid_epoch != -1)
+	if (next_xid_epoch_given)
 		ControlFile.checkPointCopy.nextXid =
-			FullTransactionIdFromEpochAndXid(set_xid_epoch,
+			FullTransactionIdFromEpochAndXid(next_xid_epoch_val,
 											 XidFromFullTransactionId(ControlFile.checkPointCopy.nextXid));
 
-	if (set_oldest_xid != 0)
+	if (oldest_xid_given)
 	{
-		ControlFile.checkPointCopy.oldestXid = set_oldest_xid;
+		ControlFile.checkPointCopy.oldestXid = oldest_xid_val;
 		ControlFile.checkPointCopy.oldestXidDB = InvalidOid;
 	}
 
-	if (set_xid != 0)
+	if (next_xid_given)
 		ControlFile.checkPointCopy.nextXid =
 			FullTransactionIdFromEpochAndXid(EpochFromFullTransactionId(ControlFile.checkPointCopy.nextXid),
-											 set_xid);
+											 next_xid_val);
 
-	if (set_oldest_commit_ts_xid != 0)
-		ControlFile.checkPointCopy.oldestCommitTsXid = set_oldest_commit_ts_xid;
-	if (set_newest_commit_ts_xid != 0)
-		ControlFile.checkPointCopy.newestCommitTsXid = set_newest_commit_ts_xid;
+	if (commit_ts_xids_given)
+	{
+		ControlFile.checkPointCopy.oldestCommitTsXid = oldest_commit_ts_xid_val;
+		ControlFile.checkPointCopy.newestCommitTsXid = newest_commit_ts_xid_val;
+	}
 
-	if (set_oid != 0)
-		ControlFile.checkPointCopy.nextOid = set_oid;
+	if (next_oid_given)
+		ControlFile.checkPointCopy.nextOid = next_oid_val;
 
-	if (mxid_given)
+	if (mxids_given)
 	{
-		ControlFile.checkPointCopy.nextMulti = set_mxid;
+		ControlFile.checkPointCopy.nextMulti = next_mxid_val;
 
-		ControlFile.checkPointCopy.oldestMulti = set_oldestmxid;
+		ControlFile.checkPointCopy.oldestMulti = oldest_mxid_val;
 		if (ControlFile.checkPointCopy.oldestMulti < FirstMultiXactId)
 			ControlFile.checkPointCopy.oldestMulti += FirstMultiXactId;
 		ControlFile.checkPointCopy.oldestMultiDB = InvalidOid;
 	}
 
-	if (mxoff_given)
-		ControlFile.checkPointCopy.nextMultiOffset = set_mxoff;
+	if (next_mxoff_given)
+		ControlFile.checkPointCopy.nextMultiOffset = next_mxoff_val;
 
 	if (minXlogTli > ControlFile.checkPointCopy.ThisTimeLineID)
 	{
@@ -474,11 +501,11 @@ main(int argc, char *argv[])
 		ControlFile.checkPointCopy.PrevTimeLineID = minXlogTli;
 	}
 
-	if (set_wal_segsize != 0)
+	if (wal_segsize_given)
 		ControlFile.xlog_seg_size = WalSegSz;
 
-	if (set_char_signedness != -1)
-		ControlFile.default_char_signedness = (set_char_signedness == 1);
+	if (char_signedness_given)
+		ControlFile.default_char_signedness = char_signedness_val;
 
 	if (minXlogSegNo > newXlogSegNo)
 		newXlogSegNo = minXlogSegNo;
@@ -809,7 +836,7 @@ PrintNewControlValues(void)
 				 newXlogSegNo, WalSegSz);
 	printf(_("First log segment after reset:        %s\n"), fname);
 
-	if (mxid_given)
+	if (mxids_given)
 	{
 		printf(_("NextMultiXactId:                      %u\n"),
 			   ControlFile.checkPointCopy.nextMulti);
@@ -819,25 +846,25 @@ PrintNewControlValues(void)
 			   ControlFile.checkPointCopy.oldestMultiDB);
 	}
 
-	if (mxoff_given)
+	if (next_mxoff_given)
 	{
 		printf(_("NextMultiOffset:                      %u\n"),
 			   ControlFile.checkPointCopy.nextMultiOffset);
 	}
 
-	if (set_oid != 0)
+	if (next_oid_given)
 	{
 		printf(_("NextOID:                              %u\n"),
 			   ControlFile.checkPointCopy.nextOid);
 	}
 
-	if (set_xid != 0)
+	if (next_xid_given)
 	{
 		printf(_("NextXID:                              %u\n"),
 			   XidFromFullTransactionId(ControlFile.checkPointCopy.nextXid));
 	}
 
-	if (set_oldest_xid != 0)
+	if (oldest_xid_given)
 	{
 		printf(_("OldestXID:                            %u\n"),
 			   ControlFile.checkPointCopy.oldestXid);
@@ -845,24 +872,21 @@ PrintNewControlValues(void)
 			   ControlFile.checkPointCopy.oldestXidDB);
 	}
 
-	if (set_xid_epoch != -1)
+	if (next_xid_epoch_given)
 	{
 		printf(_("NextXID epoch:                        %u\n"),
 			   EpochFromFullTransactionId(ControlFile.checkPointCopy.nextXid));
 	}
 
-	if (set_oldest_commit_ts_xid != 0)
+	if (commit_ts_xids_given)
 	{
 		printf(_("oldestCommitTsXid:                    %u\n"),
 			   ControlFile.checkPointCopy.oldestCommitTsXid);
-	}
-	if (set_newest_commit_ts_xid != 0)
-	{
 		printf(_("newestCommitTsXid:                    %u\n"),
 			   ControlFile.checkPointCopy.newestCommitTsXid);
 	}
 
-	if (set_wal_segsize != 0)
+	if (wal_segsize_given)
 	{
 		printf(_("Bytes per WAL segment:                %u\n"),
 			   ControlFile.xlog_seg_size);
-- 
2.47.3
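
The idea of the patch above, tracking "was the option given" separately from the value, can be shown in isolation. This is a minimal sketch with hypothetical names (EpochOption, epoch_option_set, epoch_option_apply); pg_resetwal itself uses plain static variables rather than a struct. With the old sentinel (uint32) -1, an epoch of 4294967295 had to be rejected; with a separate flag every uint32 value is usable.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical option holder: presence is tracked separately from the value */
typedef struct EpochOption
{
	bool		given;
	uint32_t	val;
} EpochOption;

/* Record a value parsed from the command line; any uint32 is acceptable */
static void
epoch_option_set(EpochOption *opt, uint32_t val)
{
	opt->val = val;
	opt->given = true;
}

/* Return the override if one was given, otherwise keep the current value */
static uint32_t
epoch_option_apply(const EpochOption *opt, uint32_t current)
{
	return opt->given ? opt->val : current;
}
```

An option that was never set leaves the current value alone, and setting it to 0 or to UINT32_MAX is no longer confused with "not given".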

Attachment: v28-0003-Move-pg_multixact-SLRU-page-format-definitions-t.patch (text/x-patch; charset=UTF-8)
From 5c0bc31242a9da122df5ffbeac1c0e3262b07d46 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Wed, 3 Dec 2025 20:07:47 +0200
Subject: [PATCH v28 3/6] Move pg_multixact SLRU page format definitions to a
 separate header

This makes them accessible from pg_upgrade, needed by the next commit.
I'm doing this mechanical move as a separate commit to make the next
commit's changes to these definitions more obvious.

Author: Maxim Orlov <orlovmg@gmail.com>
Discussion: https://www.postgresql.org/message-id/CACG%3DezbZo_3_fnx%3DS5BfepwRftzrpJ%2B7WET4EkTU6wnjDTsnjg@mail.gmail.com
---
 src/backend/access/transam/multixact.c  | 120 +-------------------
 src/include/access/multixact_internal.h | 140 ++++++++++++++++++++++++
 2 files changed, 141 insertions(+), 119 deletions(-)
 create mode 100644 src/include/access/multixact_internal.h

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 8ed3fd9d071..14d46fb761b 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -69,6 +69,7 @@
 #include "postgres.h"
 
 #include "access/multixact.h"
+#include "access/multixact_internal.h"
 #include "access/slru.h"
 #include "access/twophase.h"
 #include "access/twophase_rmgr.h"
@@ -88,125 +89,6 @@
 #include "utils/memutils.h"
 
 
-/*
- * Defines for MultiXactOffset page sizes.  A page is the same BLCKSZ as is
- * used everywhere else in Postgres.
- *
- * Note: because MultiXactOffsets are 32 bits and wrap around at 0xFFFFFFFF,
- * MultiXact page numbering also wraps around at
- * 0xFFFFFFFF/MULTIXACT_OFFSETS_PER_PAGE, and segment numbering at
- * 0xFFFFFFFF/MULTIXACT_OFFSETS_PER_PAGE/SLRU_PAGES_PER_SEGMENT.  We need
- * take no explicit notice of that fact in this module, except when comparing
- * segment and page numbers in TruncateMultiXact (see
- * MultiXactOffsetPagePrecedes).
- */
-
-/* We need four bytes per offset */
-#define MULTIXACT_OFFSETS_PER_PAGE (BLCKSZ / sizeof(MultiXactOffset))
-
-static inline int64
-MultiXactIdToOffsetPage(MultiXactId multi)
-{
-	return multi / MULTIXACT_OFFSETS_PER_PAGE;
-}
-
-static inline int
-MultiXactIdToOffsetEntry(MultiXactId multi)
-{
-	return multi % MULTIXACT_OFFSETS_PER_PAGE;
-}
-
-static inline int64
-MultiXactIdToOffsetSegment(MultiXactId multi)
-{
-	return MultiXactIdToOffsetPage(multi) / SLRU_PAGES_PER_SEGMENT;
-}
-
-/*
- * The situation for members is a bit more complex: we store one byte of
- * additional flag bits for each TransactionId.  To do this without getting
- * into alignment issues, we store four bytes of flags, and then the
- * corresponding 4 Xids.  Each such 5-word (20-byte) set we call a "group", and
- * are stored as a whole in pages.  Thus, with 8kB BLCKSZ, we keep 409 groups
- * per page.  This wastes 12 bytes per page, but that's OK -- simplicity (and
- * performance) trumps space efficiency here.
- *
- * Note that the "offset" macros work with byte offset, not array indexes, so
- * arithmetic must be done using "char *" pointers.
- */
-/* We need eight bits per xact, so one xact fits in a byte */
-#define MXACT_MEMBER_BITS_PER_XACT			8
-#define MXACT_MEMBER_FLAGS_PER_BYTE			1
-#define MXACT_MEMBER_XACT_BITMASK	((1 << MXACT_MEMBER_BITS_PER_XACT) - 1)
-
-/* how many full bytes of flags are there in a group? */
-#define MULTIXACT_FLAGBYTES_PER_GROUP		4
-#define MULTIXACT_MEMBERS_PER_MEMBERGROUP	\
-	(MULTIXACT_FLAGBYTES_PER_GROUP * MXACT_MEMBER_FLAGS_PER_BYTE)
-/* size in bytes of a complete group */
-#define MULTIXACT_MEMBERGROUP_SIZE \
-	(sizeof(TransactionId) * MULTIXACT_MEMBERS_PER_MEMBERGROUP + MULTIXACT_FLAGBYTES_PER_GROUP)
-#define MULTIXACT_MEMBERGROUPS_PER_PAGE (BLCKSZ / MULTIXACT_MEMBERGROUP_SIZE)
-#define MULTIXACT_MEMBERS_PER_PAGE	\
-	(MULTIXACT_MEMBERGROUPS_PER_PAGE * MULTIXACT_MEMBERS_PER_MEMBERGROUP)
-
-/*
- * Because the number of items per page is not a divisor of the last item
- * number (member 0xFFFFFFFF), the last segment does not use the maximum number
- * of pages, and moreover the last used page therein does not use the same
- * number of items as previous pages.  (Another way to say it is that the
- * 0xFFFFFFFF member is somewhere in the middle of the last page, so the page
- * has some empty space after that item.)
- *
- * This constant is the number of members in the last page of the last segment.
- */
-#define MAX_MEMBERS_IN_LAST_MEMBERS_PAGE \
-		((uint32) ((0xFFFFFFFF % MULTIXACT_MEMBERS_PER_PAGE) + 1))
-
-/* page in which a member is to be found */
-static inline int64
-MXOffsetToMemberPage(MultiXactOffset offset)
-{
-	return offset / MULTIXACT_MEMBERS_PER_PAGE;
-}
-
-static inline int64
-MXOffsetToMemberSegment(MultiXactOffset offset)
-{
-	return MXOffsetToMemberPage(offset) / SLRU_PAGES_PER_SEGMENT;
-}
-
-/* Location (byte offset within page) of flag word for a given member */
-static inline int
-MXOffsetToFlagsOffset(MultiXactOffset offset)
-{
-	MultiXactOffset group = offset / MULTIXACT_MEMBERS_PER_MEMBERGROUP;
-	int			grouponpg = group % MULTIXACT_MEMBERGROUPS_PER_PAGE;
-	int			byteoff = grouponpg * MULTIXACT_MEMBERGROUP_SIZE;
-
-	return byteoff;
-}
-
-static inline int
-MXOffsetToFlagsBitShift(MultiXactOffset offset)
-{
-	int			member_in_group = offset % MULTIXACT_MEMBERS_PER_MEMBERGROUP;
-	int			bshift = member_in_group * MXACT_MEMBER_BITS_PER_XACT;
-
-	return bshift;
-}
-
-/* Location (byte offset within page) of TransactionId of given member */
-static inline int
-MXOffsetToMemberOffset(MultiXactOffset offset)
-{
-	int			member_in_group = offset % MULTIXACT_MEMBERS_PER_MEMBERGROUP;
-
-	return MXOffsetToFlagsOffset(offset) +
-		MULTIXACT_FLAGBYTES_PER_GROUP +
-		member_in_group * sizeof(TransactionId);
-}
-
 /* Multixact members wraparound thresholds. */
 #define MULTIXACT_MEMBER_SAFE_THRESHOLD		(MaxMultiXactOffset / 2)
 #define MULTIXACT_MEMBER_DANGER_THRESHOLD	\
diff --git a/src/include/access/multixact_internal.h b/src/include/access/multixact_internal.h
new file mode 100644
index 00000000000..9b56deaef31
--- /dev/null
+++ b/src/include/access/multixact_internal.h
@@ -0,0 +1,140 @@
+/*
+ * multixact_internal.h
+ *
+ * PostgreSQL multi-transaction-log manager internal declarations
+ *
+ * These functions and definitions are for dealing with pg_multixact pages.
+ * They are internal to multixact.c, but they are exported here to allow
+ * pg_upgrade to write pg_multixact files directly.
+ *
+ * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/multixact_internal.h
+ */
+#ifndef MULTIXACT_INTERNAL_H
+#define MULTIXACT_INTERNAL_H
+
+#include "access/multixact.h"
+
+
+/*
+ * Defines for MultiXactOffset page sizes.  A page is the same BLCKSZ as is
+ * used everywhere else in Postgres.
+ *
+ * Note: because MultiXactOffsets are 32 bits and wrap around at 0xFFFFFFFF,
+ * MultiXact page numbering also wraps around at
+ * 0xFFFFFFFF/MULTIXACT_OFFSETS_PER_PAGE, and segment numbering at
+ * 0xFFFFFFFF/MULTIXACT_OFFSETS_PER_PAGE/SLRU_PAGES_PER_SEGMENT.  We need
+ * take no explicit notice of that fact in this module, except when comparing
+ * segment and page numbers in TruncateMultiXact (see
+ * MultiXactOffsetPagePrecedes).
+ */
+
+/* We need four bytes per offset */
+#define MULTIXACT_OFFSETS_PER_PAGE (BLCKSZ / sizeof(MultiXactOffset))
+
+static inline int64
+MultiXactIdToOffsetPage(MultiXactId multi)
+{
+	return multi / MULTIXACT_OFFSETS_PER_PAGE;
+}
+
+static inline int
+MultiXactIdToOffsetEntry(MultiXactId multi)
+{
+	return multi % MULTIXACT_OFFSETS_PER_PAGE;
+}
+
+static inline int64
+MultiXactIdToOffsetSegment(MultiXactId multi)
+{
+	return MultiXactIdToOffsetPage(multi) / SLRU_PAGES_PER_SEGMENT;
+}
+
+/*
+ * The situation for members is a bit more complex: we store one byte of
+ * additional flag bits for each TransactionId.  To do this without getting
+ * into alignment issues, we store four bytes of flags, and then the
+ * corresponding 4 Xids.  Each such 5-word (20-byte) set we call a "group", and
+ * are stored as a whole in pages.  Thus, with 8kB BLCKSZ, we keep 409 groups
+ * per page.  This wastes 12 bytes per page, but that's OK -- simplicity (and
+ * performance) trumps space efficiency here.
+ *
+ * Note that the "offset" macros work with byte offset, not array indexes, so
+ * arithmetic must be done using "char *" pointers.
+ */
+/* We need eight bits per xact, so one xact fits in a byte */
+#define MXACT_MEMBER_BITS_PER_XACT			8
+#define MXACT_MEMBER_FLAGS_PER_BYTE			1
+#define MXACT_MEMBER_XACT_BITMASK	((1 << MXACT_MEMBER_BITS_PER_XACT) - 1)
+
+/* how many full bytes of flags are there in a group? */
+#define MULTIXACT_FLAGBYTES_PER_GROUP		4
+#define MULTIXACT_MEMBERS_PER_MEMBERGROUP	\
+	(MULTIXACT_FLAGBYTES_PER_GROUP * MXACT_MEMBER_FLAGS_PER_BYTE)
+/* size in bytes of a complete group */
+#define MULTIXACT_MEMBERGROUP_SIZE \
+	(sizeof(TransactionId) * MULTIXACT_MEMBERS_PER_MEMBERGROUP + MULTIXACT_FLAGBYTES_PER_GROUP)
+#define MULTIXACT_MEMBERGROUPS_PER_PAGE (BLCKSZ / MULTIXACT_MEMBERGROUP_SIZE)
+#define MULTIXACT_MEMBERS_PER_PAGE	\
+	(MULTIXACT_MEMBERGROUPS_PER_PAGE * MULTIXACT_MEMBERS_PER_MEMBERGROUP)
+
+/*
+ * Because the number of items per page is not a divisor of the last item
+ * number (member 0xFFFFFFFF), the last segment does not use the maximum number
+ * of pages, and moreover the last used page therein does not use the same
+ * number of items as previous pages.  (Another way to say it is that the
+ * 0xFFFFFFFF member is somewhere in the middle of the last page, so the page
+ * has some empty space after that item.)
+ *
+ * This constant is the number of members in the last page of the last segment.
+ */
+#define MAX_MEMBERS_IN_LAST_MEMBERS_PAGE \
+		((uint32) ((0xFFFFFFFF % MULTIXACT_MEMBERS_PER_PAGE) + 1))
+
+/* page in which a member is to be found */
+static inline int64
+MXOffsetToMemberPage(MultiXactOffset offset)
+{
+	return offset / MULTIXACT_MEMBERS_PER_PAGE;
+}
+
+static inline int64
+MXOffsetToMemberSegment(MultiXactOffset offset)
+{
+	return MXOffsetToMemberPage(offset) / SLRU_PAGES_PER_SEGMENT;
+}
+
+/* Location (byte offset within page) of flag word for a given member */
+static inline int
+MXOffsetToFlagsOffset(MultiXactOffset offset)
+{
+	MultiXactOffset group = offset / MULTIXACT_MEMBERS_PER_MEMBERGROUP;
+	int			grouponpg = group % MULTIXACT_MEMBERGROUPS_PER_PAGE;
+	int			byteoff = grouponpg * MULTIXACT_MEMBERGROUP_SIZE;
+
+	return byteoff;
+}
+
+static inline int
+MXOffsetToFlagsBitShift(MultiXactOffset offset)
+{
+	int			member_in_group = offset % MULTIXACT_MEMBERS_PER_MEMBERGROUP;
+	int			bshift = member_in_group * MXACT_MEMBER_BITS_PER_XACT;
+
+	return bshift;
+}
+
+/* Location (byte offset within page) of TransactionId of given member */
+static inline int
+MXOffsetToMemberOffset(MultiXactOffset offset)
+{
+	int			member_in_group = offset % MULTIXACT_MEMBERS_PER_MEMBERGROUP;
+
+	return MXOffsetToFlagsOffset(offset) +
+		MULTIXACT_FLAGBYTES_PER_GROUP +
+		member_in_group * sizeof(TransactionId);
+}
+
+#endif							/* MULTIXACT_INTERNAL_H */
-- 
2.47.3
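
The member page layout described in the header comment above (four flag bytes followed by four XIDs per 20-byte group, 409 groups per 8kB page, 12 bytes wasted) can be checked with a standalone sketch. The DEMO_ names are hypothetical copies of the header's macros, and the 8192-byte block size and 4-byte TransactionId are assumptions matching a default build.

```c
#include <stdint.h>

#define DEMO_BLCKSZ				8192	/* assumption: default block size */
#define DEMO_XID_SIZE			4		/* assumption: sizeof(TransactionId) */
#define DEMO_FLAGBYTES_PER_GROUP	4
#define DEMO_MEMBERS_PER_GROUP		4
/* four XIDs plus four flag bytes: 20 bytes */
#define DEMO_GROUP_SIZE \
	(DEMO_XID_SIZE * DEMO_MEMBERS_PER_GROUP + DEMO_FLAGBYTES_PER_GROUP)
#define DEMO_GROUPS_PER_PAGE	(DEMO_BLCKSZ / DEMO_GROUP_SIZE)
#define DEMO_MEMBERS_PER_PAGE	(DEMO_GROUPS_PER_PAGE * DEMO_MEMBERS_PER_GROUP)

/* page on which a member offset lives */
static int64_t
demo_member_page(uint32_t offset)
{
	return offset / DEMO_MEMBERS_PER_PAGE;
}

/* byte position within the page of the group's flag word */
static int
demo_flags_offset(uint32_t offset)
{
	uint32_t	group = offset / DEMO_MEMBERS_PER_GROUP;

	return (int) (group % DEMO_GROUPS_PER_PAGE) * DEMO_GROUP_SIZE;
}

/* bit shift of this member's flag byte inside the flag word */
static int
demo_flags_bitshift(uint32_t offset)
{
	return (int) (offset % DEMO_MEMBERS_PER_GROUP) * 8;
}

/* byte position within the page of the member's TransactionId */
static int
demo_member_offset(uint32_t offset)
{
	return demo_flags_offset(offset) + DEMO_FLAGBYTES_PER_GROUP +
		(int) (offset % DEMO_MEMBERS_PER_GROUP) * DEMO_XID_SIZE;
}
```

For example, member offset 5 is the second member of the second group on page 0, so its flag word starts at byte 20 and its XID at byte 28.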

Attachment: v28-0004-FIXME-bump-catversion.patch (text/x-patch; charset=UTF-8)
From c7ecdaea65d13114f32c1c2304fafd8500c2b365 Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Fri, 24 Oct 2025 11:47:50 +0300
Subject: [PATCH v28 4/6] FIXME: bump catversion

To avoid constant CF-bot complaints, make the catversion bump a separate
commit.

This is to be squashed with the main commit before pushing.

NOTE: keep it in sync with MULTIXACTOFFSET_FORMATCHANGE_CAT_VER
---
 src/include/catalog/catversion.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h
index d13ed62af46..b0162c2bf63 100644
--- a/src/include/catalog/catversion.h
+++ b/src/include/catalog/catversion.h
@@ -57,6 +57,7 @@
  */
 
 /*							yyyymmddN */
-#define CATALOG_VERSION_NO	202512051
+// FIXME: bump it
+#define CATALOG_VERSION_NO	999999999
 
 #endif
-- 
2.47.3

Attachment: v28-0005-Widen-MultiXactOffset-to-64-bits.patch (text/x-patch; charset=UTF-8)
From 7e72f43fbe9a5b37588475b91d07c1960c541147 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Wed, 3 Dec 2025 20:11:50 +0200
Subject: [PATCH v28 5/6] Widen MultiXactOffset to 64 bits

This eliminates offset wraparound and the 2^32 limit on the total
number of multixid members. Multixids are still limited to 2^31, but
this is a nice improvement because 'members' can grow much faster than
the number of multixids. On such systems, you can now run longer
before hitting hard limits or triggering anti-wraparound vacuums.

Not having to deal with offset wraparound also simplifies the code and
removes some gnarly corner cases.

We no longer need to perform emergency anti-wraparound freezing
because of running out of 'members' space, so the offset stop limit is
gone. But you might still not want 'members' to consume huge amounts
of disk space. For that reason, I kept the logic for lowering vacuum's
multixid freezing cutoff if a large amount of 'members' space is
used. The thresholds for that are roughly the same as the "safe" and
"danger" thresholds used before, 2 billion transactions and 4 billion
transactions. This keeps the behavior for the freeze cutoff roughly
the same as before. It might make sense to make this smarter or
configurable, now that the threshold is only needed to manage disk
usage, but that's left for the future.

Add code to pg_upgrade to convert multitransactions from the old to
the new format. Because pg_upgrade now rewrites the files in the new
format, we can get rid of some hacks we had put in place to deal with
old bugs and upgraded clusters.

Author: Maxim Orlov <orlovmg@gmail.com>
Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>
Reviewed-by: wenhui qiu <qiuwenhuifx@gmail.com>
Discussion: https://www.postgresql.org/message-id/CACG%3DezaWg7_nt-8ey4aKv2w9LcuLthHknwCawmBgEeTnJrJTcw@mail.gmail.com
---
 doc/src/sgml/ref/pg_resetwal.sgml             |  13 +-
 src/backend/access/rmgrdesc/mxactdesc.c       |   4 +-
 src/backend/access/rmgrdesc/xlogdesc.c        |   2 +-
 src/backend/access/transam/multixact.c        | 550 ++++--------------
 src/backend/access/transam/xlog.c             |   6 +-
 src/backend/access/transam/xlogrecovery.c     |   2 +-
 src/backend/commands/vacuum.c                 |   6 +-
 src/backend/postmaster/autovacuum.c           |   4 +-
 src/bin/pg_controldata/pg_controldata.c       |   2 +-
 src/bin/pg_resetwal/pg_resetwal.c             |  38 +-
 src/bin/pg_resetwal/t/001_basic.pl            |   2 +-
 src/bin/pg_upgrade/Makefile                   |   3 +
 src/bin/pg_upgrade/meson.build                |   4 +
 src/bin/pg_upgrade/multixact_read_v18.c       | 337 +++++++++++
 src/bin/pg_upgrade/multixact_read_v18.h       |  37 ++
 src/bin/pg_upgrade/multixact_rewrite.c        | 195 +++++++
 src/bin/pg_upgrade/pg_upgrade.c               |  81 ++-
 src/bin/pg_upgrade/pg_upgrade.h               |  12 +-
 src/bin/pg_upgrade/slru_io.c                  | 258 ++++++++
 src/bin/pg_upgrade/slru_io.h                  |  52 ++
 .../pg_upgrade/t/007_multixact_conversion.pl  | 339 +++++++++++
 src/include/access/multixact.h                |   7 +-
 src/include/access/multixact_internal.h       |  23 +-
 src/include/c.h                               |   2 +-
 .../test_slru/t/002_multixact_wraparound.pl   |   2 +-
 .../perl/PostgreSQL/Test/BackgroundPsql.pm    |  15 +-
 src/test/perl/PostgreSQL/Test/Cluster.pm      |  21 +-
 src/tools/pgindent/typedefs.list              |   3 +
 28 files changed, 1500 insertions(+), 520 deletions(-)
 create mode 100644 src/bin/pg_upgrade/multixact_read_v18.c
 create mode 100644 src/bin/pg_upgrade/multixact_read_v18.h
 create mode 100644 src/bin/pg_upgrade/multixact_rewrite.c
 create mode 100644 src/bin/pg_upgrade/slru_io.c
 create mode 100644 src/bin/pg_upgrade/slru_io.h
 create mode 100644 src/bin/pg_upgrade/t/007_multixact_conversion.pl
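
The pg_resetwal.sgml hunk in this patch changes the per-segment multiplier from 65536 to 32768 because each offsets entry grows from 4 to 8 bytes. The arithmetic is sketched below under the assumption of a default build (BLCKSZ of 8192 and 32 SLRU pages per segment); the function names are hypothetical and no overflow or validity checks are attempted.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_BLCKSZ					8192	/* assumption: default block size */
#define DEMO_SLRU_PAGES_PER_SEGMENT	32		/* assumption: default build */

/* number of multixact IDs covered by one pg_multixact/offsets segment file */
static uint64_t
demo_multixids_per_segment(size_t offset_size)
{
	return (uint64_t) DEMO_SLRU_PAGES_PER_SEGMENT * (DEMO_BLCKSZ / offset_size);
}

/* safe next multixact ID: (largest segment number + 1) * IDs per segment */
static uint64_t
demo_safe_next_multi(uint32_t largest_segno, size_t offset_size)
{
	return ((uint64_t) largest_segno + 1) *
		demo_multixids_per_segment(offset_size);
}

/* safe oldest multixact ID: smallest segment number * IDs per segment */
static uint64_t
demo_safe_oldest_multi(uint32_t smallest_segno, size_t offset_size)
{
	return (uint64_t) smallest_segno *
		demo_multixids_per_segment(offset_size);
}
```

With 8-byte offsets and segment files 000F and 0007 as the largest and smallest, this gives 0x80000 and 0x38000, matching the documentation's example of -m 0x80000,0x38000.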

diff --git a/doc/src/sgml/ref/pg_resetwal.sgml b/doc/src/sgml/ref/pg_resetwal.sgml
index 2c019c2aac6..41f2b1d480c 100644
--- a/doc/src/sgml/ref/pg_resetwal.sgml
+++ b/doc/src/sgml/ref/pg_resetwal.sgml
@@ -267,14 +267,17 @@ PostgreSQL documentation
       A safe value for the next multitransaction ID (first part) can be
       determined by looking for the numerically largest file name in the
       directory <filename>pg_multixact/offsets</filename> under the data directory,
-      adding one, and then multiplying by 65536 (0x10000).  Conversely, a safe
+      adding one, and then multiplying by 32768 (0x8000).  Conversely, a safe
       value for the oldest multitransaction ID (second part of
       <option>-m</option>) can be determined by looking for the numerically smallest
-      file name in the same directory and multiplying by 65536.  The file
-      names are in hexadecimal, so the easiest way to do this is to specify
-      the option value in hexadecimal and append four zeroes.
+      file name in the same directory and multiplying by 32768 (0x8000).
+      Note that the file names are in hexadecimal.  It is usually easiest
+      to specify the option value in hexadecimal too.  For example, if
+      <filename>000F</filename> and <filename>0007</filename> are the greatest and
+      smallest entries in <filename>pg_multixact/offsets</filename>,
+      <literal>-m 0x80000,0x38000</literal> will work.
      </para>
-     <!-- 65536 = SLRU_PAGES_PER_SEGMENT * BLCKSZ / sizeof(MultiXactOffset) -->
+     <!-- 32768 = SLRU_PAGES_PER_SEGMENT * BLCKSZ / sizeof(MultiXactOffset) -->
     </listitem>
    </varlistentry>
 
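A worked sketch of the pg_resetwal arithmetic documented in the hunk above, assuming the default build settings (32 SLRU pages per segment, 8 kB blocks) and the 8-byte MultiXactOffset this patch introduces. The macro and function names below are hypothetical and exist only for illustration; they are not part of the patch or of PostgreSQL.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumed defaults; a custom build may use a different block size. */
#define SKETCH_PAGES_PER_SEGMENT	32
#define SKETCH_BLCKSZ				8192
#define SKETCH_OFFSET_SIZE			8	/* sizeof(MultiXactOffset) after this patch */
#define SKETCH_MULTIS_PER_SEGMENT \
	(SKETCH_PAGES_PER_SEGMENT * SKETCH_BLCKSZ / SKETCH_OFFSET_SIZE)	/* 32768 */

/*
 * Hypothetical helper: first part of "-m", computed from the numerically
 * largest file name (hexadecimal) in pg_multixact/offsets.
 */
static uint32_t
sketch_safe_next_multi(const char *largest_segment)
{
	assert(largest_segment != NULL);
	return (uint32_t) ((strtoul(largest_segment, NULL, 16) + 1) *
					   SKETCH_MULTIS_PER_SEGMENT);
}

/*
 * Hypothetical helper: second part of "-m", computed from the numerically
 * smallest file name (hexadecimal) in the same directory.
 */
static uint32_t
sketch_safe_oldest_multi(const char *smallest_segment)
{
	assert(smallest_segment != NULL);
	return (uint32_t) (strtoul(smallest_segment, NULL, 16) *
					   SKETCH_MULTIS_PER_SEGMENT);
}
```

With the file names from the documentation example, 000F and 0007, the two helpers give 0x80000 and 0x38000, which matches the "-m 0x80000,0x38000" value quoted in the new text.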
diff --git a/src/backend/access/rmgrdesc/mxactdesc.c b/src/backend/access/rmgrdesc/mxactdesc.c
index 3ca0582db36..052dd0a4ce5 100644
--- a/src/backend/access/rmgrdesc/mxactdesc.c
+++ b/src/backend/access/rmgrdesc/mxactdesc.c
@@ -65,7 +65,7 @@ multixact_desc(StringInfo buf, XLogReaderState *record)
 		xl_multixact_create *xlrec = (xl_multixact_create *) rec;
 		int			i;
 
-		appendStringInfo(buf, "%u offset %u nmembers %d: ", xlrec->mid,
+		appendStringInfo(buf, "%u offset %" PRIu64 " nmembers %d: ", xlrec->mid,
 						 xlrec->moff, xlrec->nmembers);
 		for (i = 0; i < xlrec->nmembers; i++)
 			out_member(buf, &xlrec->members[i]);
@@ -74,7 +74,7 @@ multixact_desc(StringInfo buf, XLogReaderState *record)
 	{
 		xl_multixact_truncate *xlrec = (xl_multixact_truncate *) rec;
 
-		appendStringInfo(buf, "offsets [%u, %u), members [%u, %u)",
+		appendStringInfo(buf, "offsets [%u, %u), members [%" PRIu64 ", %" PRIu64 ")",
 						 xlrec->startTruncOff, xlrec->endTruncOff,
 						 xlrec->startTruncMemb, xlrec->endTruncMemb);
 	}
diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c
index cd6c2a2f650..441034f5929 100644
--- a/src/backend/access/rmgrdesc/xlogdesc.c
+++ b/src/backend/access/rmgrdesc/xlogdesc.c
@@ -66,7 +66,7 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 		CheckPoint *checkpoint = (CheckPoint *) rec;
 
 		appendStringInfo(buf, "redo %X/%08X; "
-						 "tli %u; prev tli %u; fpw %s; wal_level %s; xid %u:%u; oid %u; multi %u; offset %u; "
+						 "tli %u; prev tli %u; fpw %s; wal_level %s; xid %u:%u; oid %u; multi %u; offset %" PRIu64 "; "
 						 "oldest xid %u in DB %u; oldest multi %u in DB %u; "
 						 "oldest/newest commit timestamp xid: %u/%u; "
 						 "oldest running xid %u; %s",
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 14d46fb761b..dffa0c8e7d4 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -89,10 +89,14 @@
 #include "utils/memutils.h"
 
 
-/* Multixact members wraparound thresholds. */
-#define MULTIXACT_MEMBER_SAFE_THRESHOLD		(MaxMultiXactOffset / 2)
-#define MULTIXACT_MEMBER_DANGER_THRESHOLD	\
-	(MaxMultiXactOffset - MaxMultiXactOffset / 4)
+/*
+ * Thresholds used to keep members disk usage in check when multixids have a
+ * lot of members.  When MULTIXACT_MEMBER_LOW_THRESHOLD is reached, vacuum
+ * starts freezing multixids more aggressively, even if the normal multixid
+ * age limits haven't been reached yet.
+ */
+#define MULTIXACT_MEMBER_LOW_THRESHOLD		UINT64CONST(2000000000)
+#define MULTIXACT_MEMBER_HIGH_THRESHOLD		UINT64CONST(4000000000)
 
 static inline MultiXactId
 PreviousMultiXactId(MultiXactId multi)
@@ -137,11 +141,9 @@ typedef struct MultiXactStateData
 
 	/*
 	 * Oldest multixact offset that is potentially referenced by a multixact
-	 * referenced by a relation.  We don't always know this value, so there's
-	 * a flag here to indicate whether or not we currently do.
+	 * referenced by a relation.
 	 */
 	MultiXactOffset oldestOffset;
-	bool		oldestOffsetKnown;
 
 	/* support for anti-wraparound measures */
 	MultiXactId multiVacLimit;
@@ -149,9 +151,6 @@ typedef struct MultiXactStateData
 	MultiXactId multiStopLimit;
 	MultiXactId multiWrapLimit;
 
-	/* support for members anti-wraparound measures */
-	MultiXactOffset offsetStopLimit;	/* known if oldestOffsetKnown */
-
 	/*
 	 * Per-backend data starts here.  We have two arrays stored in the area
 	 * immediately following the MultiXactStateData struct. Each is indexed by
@@ -272,13 +271,9 @@ static void mXactCachePut(MultiXactId multi, int nmembers,
 /* management of SLRU infrastructure */
 static bool MultiXactOffsetPagePrecedes(int64 page1, int64 page2);
 static bool MultiXactMemberPagePrecedes(int64 page1, int64 page2);
-static bool MultiXactOffsetPrecedes(MultiXactOffset offset1,
-									MultiXactOffset offset2);
 static void ExtendMultiXactOffset(MultiXactId multi);
 static void ExtendMultiXactMember(MultiXactOffset offset, int nmembers);
-static bool MultiXactOffsetWouldWrap(MultiXactOffset boundary,
-									 MultiXactOffset start, uint32 distance);
-static bool SetOffsetVacuumLimit(bool is_startup);
+static void SetOffsetVacuumLimit(void);
 static bool find_multixact_start(MultiXactId multi, MultiXactOffset *result);
 static void WriteMTruncateXlogRec(Oid oldestMultiDB,
 								  MultiXactId startTruncOff,
@@ -1073,90 +1068,22 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 	ExtendMultiXactOffset(result + 1);
 
 	/*
-	 * Reserve the members space, similarly to above.  Also, be careful not to
-	 * return zero as the starting offset for any multixact. See
-	 * GetMultiXactIdMembers() for motivation.
+	 * Reserve the members space, similarly to above.
 	 */
 	nextOffset = MultiXactState->nextOffset;
-	if (nextOffset == 0)
-	{
-		*offset = 1;
-		nmembers++;				/* allocate member slot 0 too */
-	}
-	else
-		*offset = nextOffset;
-
-	/*----------
-	 * Protect against overrun of the members space as well, with the
-	 * following rules:
-	 *
-	 * If we're past offsetStopLimit, refuse to generate more multis.
-	 * If we're close to offsetStopLimit, emit a warning.
-	 *
-	 * Arbitrarily, we start emitting warnings when we're 20 segments or less
-	 * from offsetStopLimit.
-	 *
-	 * Note we haven't updated the shared state yet, so if we fail at this
-	 * point, the multixact ID we grabbed can still be used by the next guy.
-	 *
-	 * Note that there is no point in forcing autovacuum runs here: the
-	 * multixact freeze settings would have to be reduced for that to have any
-	 * effect.
-	 *----------
-	 */
-#define OFFSET_WARN_SEGMENTS	20
-	if (MultiXactState->oldestOffsetKnown &&
-		MultiXactOffsetWouldWrap(MultiXactState->offsetStopLimit, nextOffset,
-								 nmembers))
-	{
-		/* see comment in the corresponding offsets wraparound case */
-		SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER);
-
-		ereport(ERROR,
-				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg("multixact \"members\" limit exceeded"),
-				 errdetail_plural("This command would create a multixact with %u members, but the remaining space is only enough for %u member.",
-								  "This command would create a multixact with %u members, but the remaining space is only enough for %u members.",
-								  MultiXactState->offsetStopLimit - nextOffset - 1,
-								  nmembers,
-								  MultiXactState->offsetStopLimit - nextOffset - 1),
-				 errhint("Execute a database-wide VACUUM in database with OID %u with reduced \"vacuum_multixact_freeze_min_age\" and \"vacuum_multixact_freeze_table_age\" settings.",
-						 MultiXactState->oldestMultiXactDB)));
-	}
 
 	/*
-	 * Check whether we should kick autovacuum into action, to prevent members
-	 * wraparound. NB we use a much larger window to trigger autovacuum than
-	 * just the warning limit. The warning is just a measure of last resort -
-	 * this is in line with GetNewTransactionId's behaviour.
+	 * Offsets are 64-bit integers and will never wrap around.  Firstly, it
+	 * would take an unrealistic amount of time and resources to consume 2^64
+	 * offsets.  Secondly, multixid creation is WAL-logged, so you would run
+	 * out of LSNs before reaching offset wraparound.  Nevertheless, check for
+	 * wraparound as a sanity check.
 	 */
-	if (!MultiXactState->oldestOffsetKnown ||
-		(MultiXactState->nextOffset - MultiXactState->oldestOffset
-		 > MULTIXACT_MEMBER_SAFE_THRESHOLD))
-	{
-		/*
-		 * To avoid swamping the postmaster with signals, we issue the autovac
-		 * request only when crossing a segment boundary. With default
-		 * compilation settings that's roughly after 50k members.  This still
-		 * gives plenty of chances before we get into real trouble.
-		 */
-		if ((MXOffsetToMemberPage(nextOffset) / SLRU_PAGES_PER_SEGMENT) !=
-			(MXOffsetToMemberPage(nextOffset + nmembers) / SLRU_PAGES_PER_SEGMENT))
-			SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER);
-	}
-
-	if (MultiXactState->oldestOffsetKnown &&
-		MultiXactOffsetWouldWrap(MultiXactState->offsetStopLimit,
-								 nextOffset,
-								 nmembers + MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT * OFFSET_WARN_SEGMENTS))
-		ereport(WARNING,
+	if (nextOffset + nmembers < nextOffset)
+		ereport(ERROR,
 				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg_plural("database with OID %u must be vacuumed before %d more multixact member is used",
-							   "database with OID %u must be vacuumed before %d more multixact members are used",
-							   MultiXactState->offsetStopLimit - nextOffset + nmembers,
-							   MultiXactState->oldestMultiXactDB,
-							   MultiXactState->offsetStopLimit - nextOffset + nmembers),
-				 errhint("Execute a database-wide VACUUM in that database with reduced \"vacuum_multixact_freeze_min_age\" and \"vacuum_multixact_freeze_table_age\" settings.")));
+				 errmsg("MultiXact members would wrap around")));
+	*offset = nextOffset;
 
 	ExtendMultiXactMember(nextOffset, nmembers);
 
@@ -1177,8 +1104,7 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 	 * the next iteration.  But note that nextMXact may be InvalidMultiXactId
 	 * or the first value on a segment-beginning page after this routine
 	 * exits, so anyone else looking at the variable must be prepared to deal
-	 * with either case.  Similarly, nextOffset may be zero, but we won't use
-	 * that as the actual start offset of the next multixact.
+	 * with either case.
 	 */
 	(MultiXactState->nextMXact)++;
 
@@ -1186,7 +1112,8 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 
 	LWLockRelease(MultiXactGenLock);
 
-	debug_elog4(DEBUG2, "GetNew: returning %u offset %u", result, *offset);
+	debug_elog4(DEBUG2, "GetNew: returning %u offset %" PRIu64,
+				result, *offset);
 	return result;
 }
 
@@ -1228,7 +1155,6 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
 	MultiXactOffset *offptr;
 	MultiXactOffset offset;
 	int			length;
-	int			truelength;
 	MultiXactId oldestMXact;
 	MultiXactId nextMXact;
 	MultiXactMember *ptr;
@@ -1304,16 +1230,7 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
 	 * Find out the offset at which we need to start reading MultiXactMembers
 	 * and the number of members in the multixact.  We determine the latter as
 	 * the difference between this multixact's starting offset and the next
-	 * one's.  However, there is one corner case to worry about:
-	 *
-	 * Because GetNewMultiXactId skips over offset zero, to reserve zero for
-	 * to mean "unset", there is an ambiguity near the point of offset
-	 * wraparound.  If we see next multixact's offset is one, is that our
-	 * multixact's actual endpoint, or did it end at zero with a subsequent
-	 * increment?  We handle this using the knowledge that if the zero'th
-	 * member slot wasn't filled, it'll contain zero, and zero isn't a valid
-	 * transaction ID so it can't be a multixact member.  Therefore, if we
-	 * read a zero from the members array, just ignore it.
+	 * one's.
 	 */
 	pageno = MultiXactIdToOffsetPage(multi);
 	entryno = MultiXactIdToOffsetEntry(multi);
@@ -1380,10 +1297,11 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
 	LWLockRelease(lock);
 	lock = NULL;
 
+	/* A multixid with zero members should not happen */
+	Assert(length > 0);
+
 	/* read the members */
 	ptr = (MultiXactMember *) palloc(length * sizeof(MultiXactMember));
-
-	truelength = 0;
 	prev_pageno = -1;
 	for (int i = 0; i < length; i++, offset++)
 	{
@@ -1420,37 +1338,27 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
 
 		xactptr = (TransactionId *)
 			(MultiXactMemberCtl->shared->page_buffer[slotno] + memberoff);
-
-		if (!TransactionIdIsValid(*xactptr))
-		{
-			/* Corner case: we must be looking at unused slot zero */
-			Assert(offset == 0);
-			continue;
-		}
+		Assert(TransactionIdIsValid(*xactptr));
 
 		flagsoff = MXOffsetToFlagsOffset(offset);
 		bshift = MXOffsetToFlagsBitShift(offset);
 		flagsptr = (uint32 *) (MultiXactMemberCtl->shared->page_buffer[slotno] + flagsoff);
 
-		ptr[truelength].xid = *xactptr;
-		ptr[truelength].status = (*flagsptr >> bshift) & MXACT_MEMBER_XACT_BITMASK;
-		truelength++;
+		ptr[i].xid = *xactptr;
+		ptr[i].status = (*flagsptr >> bshift) & MXACT_MEMBER_XACT_BITMASK;
 	}
 
 	LWLockRelease(lock);
 
-	/* A multixid with zero members should not happen */
-	Assert(truelength > 0);
-
 	/*
 	 * Copy the result into the local cache.
 	 */
-	mXactCachePut(multi, truelength, ptr);
+	mXactCachePut(multi, length, ptr);
 
 	debug_elog3(DEBUG2, "GetMembers: no cache for %s",
-				mxid_to_string(multi, truelength, ptr));
+				mxid_to_string(multi, length, ptr));
 	*members = ptr;
-	return truelength;
+	return length;
 }
 
 /*
@@ -1857,7 +1765,7 @@ MultiXactShmemInit(void)
 				  "pg_multixact/members", LWTRANCHE_MULTIXACTMEMBER_BUFFER,
 				  LWTRANCHE_MULTIXACTMEMBER_SLRU,
 				  SYNC_HANDLER_MULTIXACT_MEMBER,
-				  false);
+				  true);
 	/* doesn't call SimpleLruTruncate() or meet criteria for unit tests */
 
 	/* Initialize our shared state struct */
@@ -1912,48 +1820,6 @@ BootStrapMultiXact(void)
 	SimpleLruZeroAndWritePage(MultiXactMemberCtl, 0);
 }
 
-/*
- * MaybeExtendOffsetSlru
- *		Extend the offsets SLRU area, if necessary
- *
- * After a binary upgrade from <= 9.2, the pg_multixact/offsets SLRU area might
- * contain files that are shorter than necessary; this would occur if the old
- * installation had used multixacts beyond the first page (files cannot be
- * copied, because the on-disk representation is different).  pg_upgrade would
- * update pg_control to set the next offset value to be at that position, so
- * that tuples marked as locked by such MultiXacts would be seen as visible
- * without having to consult multixact.  However, trying to create and use a
- * new MultiXactId would result in an error because the page on which the new
- * value would reside does not exist.  This routine is in charge of creating
- * such pages.
- */
-static void
-MaybeExtendOffsetSlru(void)
-{
-	int64		pageno;
-	LWLock	   *lock;
-
-	pageno = MultiXactIdToOffsetPage(MultiXactState->nextMXact);
-	lock = SimpleLruGetBankLock(MultiXactOffsetCtl, pageno);
-
-	LWLockAcquire(lock, LW_EXCLUSIVE);
-
-	if (!SimpleLruDoesPhysicalPageExist(MultiXactOffsetCtl, pageno))
-	{
-		int			slotno;
-
-		/*
-		 * Fortunately for us, SimpleLruWritePage is already prepared to deal
-		 * with creating a new segment file even if the page we're writing is
-		 * not the first in it, so this is enough.
-		 */
-		slotno = SimpleLruZeroPage(MultiXactOffsetCtl, pageno);
-		SimpleLruWritePage(MultiXactOffsetCtl, slotno);
-	}
-
-	LWLockRelease(lock);
-}
-
 /*
  * This must be called ONCE during postmaster or standalone-backend startup.
  *
@@ -2092,8 +1958,8 @@ TrimMultiXact(void)
 	MultiXactState->finishedStartup = true;
 	LWLockRelease(MultiXactGenLock);
 
-	/* Now compute how far away the next members wraparound is. */
-	SetMultiXactIdLimit(oldestMXact, oldestMXactDB, true);
+	/* Now compute how far away the next multixid wraparound is. */
+	SetMultiXactIdLimit(oldestMXact, oldestMXactDB);
 }
 
 /*
@@ -2114,7 +1980,7 @@ MultiXactGetCheckptMulti(bool is_shutdown,
 	LWLockRelease(MultiXactGenLock);
 
 	debug_elog6(DEBUG2,
-				"MultiXact: checkpoint is nextMulti %u, nextOffset %u, oldestMulti %u in DB %u",
+				"MultiXact: checkpoint is nextMulti %u, nextOffset %" PRIu64 ", oldestMulti %u in DB %u",
 				*nextMulti, *nextMultiOffset, *oldestMulti, *oldestMultiDB);
 }
 
@@ -2149,26 +2015,12 @@ void
 MultiXactSetNextMXact(MultiXactId nextMulti,
 					  MultiXactOffset nextMultiOffset)
 {
-	debug_elog4(DEBUG2, "MultiXact: setting next multi to %u offset %u",
+	debug_elog4(DEBUG2, "MultiXact: setting next multi to %u offset %" PRIu64,
 				nextMulti, nextMultiOffset);
 	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
 	MultiXactState->nextMXact = nextMulti;
 	MultiXactState->nextOffset = nextMultiOffset;
 	LWLockRelease(MultiXactGenLock);
-
-	/*
-	 * During a binary upgrade, make sure that the offsets SLRU is large
-	 * enough to contain the next value that would be created.
-	 *
-	 * We need to do this pretty early during the first startup in binary
-	 * upgrade mode: before StartupMultiXact() in fact, because this routine
-	 * is called even before that by StartupXLOG().  And we can't do it
-	 * earlier than at this point, because during that first call of this
-	 * routine we determine the MultiXactState->nextMXact value that
-	 * MaybeExtendOffsetSlru needs.
-	 */
-	if (IsBinaryUpgrade)
-		MaybeExtendOffsetSlru();
 }
 
 /*
@@ -2176,28 +2028,24 @@ MultiXactSetNextMXact(MultiXactId nextMulti,
  * datminmxid (ie, the oldest MultiXactId that might exist in any database
  * of our cluster), and the OID of the (or a) database with that value.
  *
- * is_startup is true when we are just starting the cluster, false when we
- * are updating state in a running cluster.  This only affects log messages.
+ * This also updates MultiXactState->oldestOffset, by looking up the offset of
+ * MultiXactState->oldestMultiXactId.
  */
 void
-SetMultiXactIdLimit(MultiXactId oldest_datminmxid, Oid oldest_datoid,
-					bool is_startup)
+SetMultiXactIdLimit(MultiXactId oldest_datminmxid, Oid oldest_datoid)
 {
 	MultiXactId multiVacLimit;
 	MultiXactId multiWarnLimit;
 	MultiXactId multiStopLimit;
 	MultiXactId multiWrapLimit;
 	MultiXactId curMulti;
-	bool		needs_offset_vacuum;
 
 	Assert(MultiXactIdIsValid(oldest_datminmxid));
 
 	/*
 	 * We pretend that a wrap will happen halfway through the multixact ID
 	 * space, but that's not really true, because multixacts wrap differently
-	 * from transaction IDs.  Note that, separately from any concern about
-	 * multixact IDs wrapping, we must ensure that multixact members do not
-	 * wrap.  Limits for that are set in SetOffsetVacuumLimit, not here.
+	 * from transaction IDs.
 	 */
 	multiWrapLimit = oldest_datminmxid + (MaxMultiXactId >> 1);
 	if (multiWrapLimit < FirstMultiXactId)
@@ -2265,8 +2113,13 @@ SetMultiXactIdLimit(MultiXactId oldest_datminmxid, Oid oldest_datoid,
 
 	Assert(!InRecovery);
 
-	/* Set limits for offset vacuum. */
-	needs_offset_vacuum = SetOffsetVacuumLimit(is_startup);
+	/*
+	 * Offsets are 64 bits wide and never wrap around, so we don't need to
+	 * consider them for emergency autovacuum purposes.  But now that we're in
+	 * a consistent state, determine MultiXactState->oldestOffset, to be used
+	 * to calculate freezing cutoff to keep the offsets disk usage in check.
+	 */
+	SetOffsetVacuumLimit();
 
 	/*
 	 * If past the autovacuum force point, immediately signal an autovac
@@ -2275,8 +2128,7 @@ SetMultiXactIdLimit(MultiXactId oldest_datminmxid, Oid oldest_datoid,
 	 * database, it'll call here, and we'll signal the postmaster to start
 	 * another iteration immediately if there are still any old databases.
 	 */
-	if ((MultiXactIdPrecedes(multiVacLimit, curMulti) ||
-		 needs_offset_vacuum) && IsUnderPostmaster)
+	if (MultiXactIdPrecedes(multiVacLimit, curMulti) && IsUnderPostmaster)
 		SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER);
 
 	/* Give an immediate warning if past the wrap warn point */
@@ -2338,9 +2190,9 @@ MultiXactAdvanceNextMXact(MultiXactId minMulti,
 		debug_elog3(DEBUG2, "MultiXact: setting next multi to %u", minMulti);
 		MultiXactState->nextMXact = minMulti;
 	}
-	if (MultiXactOffsetPrecedes(MultiXactState->nextOffset, minMultiOffset))
+	if (MultiXactState->nextOffset < minMultiOffset)
 	{
-		debug_elog3(DEBUG2, "MultiXact: setting next offset to %u",
+		debug_elog3(DEBUG2, "MultiXact: setting next offset to %" PRIu64,
 					minMultiOffset);
 		MultiXactState->nextOffset = minMultiOffset;
 	}
@@ -2359,7 +2211,7 @@ MultiXactAdvanceOldest(MultiXactId oldestMulti, Oid oldestMultiDB)
 	Assert(InRecovery);
 
 	if (MultiXactIdPrecedes(MultiXactState->oldestMultiXactId, oldestMulti))
-		SetMultiXactIdLimit(oldestMulti, oldestMultiDB, false);
+		SetMultiXactIdLimit(oldestMulti, oldestMultiDB);
 }
 
 /*
@@ -2442,27 +2294,11 @@ ExtendMultiXactMember(MultiXactOffset offset, int nmembers)
 			LWLockRelease(lock);
 		}
 
-		/*
-		 * Compute the number of items till end of current page.  Careful: if
-		 * addition of unsigned ints wraps around, we're at the last page of
-		 * the last segment; since that page holds a different number of items
-		 * than other pages, we need to do it differently.
-		 */
-		if (offset + MAX_MEMBERS_IN_LAST_MEMBERS_PAGE < offset)
-		{
-			/*
-			 * This is the last page of the last segment; we can compute the
-			 * number of items left to allocate in it without modulo
-			 * arithmetic.
-			 */
-			difference = MaxMultiXactOffset - offset + 1;
-		}
-		else
-			difference = MULTIXACT_MEMBERS_PER_PAGE - offset % MULTIXACT_MEMBERS_PER_PAGE;
+		/* Compute the number of items till end of current page. */
+		difference = MULTIXACT_MEMBERS_PER_PAGE - offset % MULTIXACT_MEMBERS_PER_PAGE;
 
 		/*
-		 * Advance to next page, taking care to properly handle the wraparound
-		 * case.  OK if nmembers goes negative.
+		 * Advance to next page.  OK if nmembers goes negative.
 		 */
 		nmembers -= difference;
 		offset += difference;
@@ -2524,28 +2360,17 @@ GetOldestMultiXactId(void)
 }
 
 /*
- * Determine how aggressively we need to vacuum in order to prevent member
- * wraparound.
- *
- * To do so determine what's the oldest member offset and install the limit
- * info in MultiXactState, where it can be used to prevent overrun of old data
- * in the members SLRU area.
- *
- * The return value is true if emergency autovacuum is required and false
- * otherwise.
+ * Calculate the oldest member offset and install it in MultiXactState, where
+ * it can be used to adjust multixid freezing cutoffs.
  */
-static bool
-SetOffsetVacuumLimit(bool is_startup)
+static void
+SetOffsetVacuumLimit(void)
 {
 	MultiXactId oldestMultiXactId;
 	MultiXactId nextMXact;
 	MultiXactOffset oldestOffset = 0;	/* placate compiler */
-	MultiXactOffset prevOldestOffset;
 	MultiXactOffset nextOffset;
 	bool		oldestOffsetKnown = false;
-	bool		prevOldestOffsetKnown;
-	MultiXactOffset offsetStopLimit = 0;
-	MultiXactOffset prevOffsetStopLimit;
 
 	/*
 	 * NB: Have to prevent concurrent truncation, we might otherwise try to
@@ -2558,9 +2383,6 @@ SetOffsetVacuumLimit(bool is_startup)
 	oldestMultiXactId = MultiXactState->oldestMultiXactId;
 	nextMXact = MultiXactState->nextMXact;
 	nextOffset = MultiXactState->nextOffset;
-	prevOldestOffsetKnown = MultiXactState->oldestOffsetKnown;
-	prevOldestOffset = MultiXactState->oldestOffset;
-	prevOffsetStopLimit = MultiXactState->offsetStopLimit;
 	Assert(MultiXactState->finishedStartup);
 	LWLockRelease(MultiXactGenLock);
 
@@ -2583,121 +2405,39 @@ SetOffsetVacuumLimit(bool is_startup)
 	else
 	{
 		/*
-		 * Figure out where the oldest existing multixact's offsets are
-		 * stored. Due to bugs in early release of PostgreSQL 9.3.X and 9.4.X,
-		 * the supposedly-earliest multixact might not really exist.  We are
-		 * careful not to fail in that case.
+		 * Look up the offset at which the oldest existing multixact's members
+		 * are stored.  If we cannot find it, be careful not to fail, and
+		 * leave oldestOffset unchanged.  oldestOffset is initialized to zero
+		 * at system startup, which prevents truncating members until a proper
+		 * value is calculated.
+		 *
+		 * (We had bugs in early releases of PostgreSQL 9.3.X and 9.4.X where
+		 * the supposedly-earliest multixact might not really exist.  Those
+		 * should be long gone by now, so this should not fail, but let's
+		 * still be defensive.)
 		 */
 		oldestOffsetKnown =
 			find_multixact_start(oldestMultiXactId, &oldestOffset);
 
 		if (oldestOffsetKnown)
 			ereport(DEBUG1,
-					(errmsg_internal("oldest MultiXactId member is at offset %u",
+					(errmsg_internal("oldest MultiXactId member is at offset %" PRIu64,
 									 oldestOffset)));
 		else
 			ereport(LOG,
-					(errmsg("MultiXact member wraparound protections are disabled because oldest checkpointed MultiXact %u does not exist on disk",
+					(errmsg("oldest checkpointed MultiXact %u does not exist on disk",
 							oldestMultiXactId)));
 	}
 
 	LWLockRelease(MultiXactTruncationLock);
 
-	/*
-	 * If we can, compute limits (and install them MultiXactState) to prevent
-	 * overrun of old data in the members SLRU area. We can only do so if the
-	 * oldest offset is known though.
-	 */
+	/* Install the computed value */
 	if (oldestOffsetKnown)
 	{
-		/* move back to start of the corresponding segment */
-		offsetStopLimit = oldestOffset - (oldestOffset %
-										  (MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT));
-
-		/* always leave one segment before the wraparound point */
-		offsetStopLimit -= (MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT);
-
-		if (!prevOldestOffsetKnown && !is_startup)
-			ereport(LOG,
-					(errmsg("MultiXact member wraparound protections are now enabled")));
-
-		ereport(DEBUG1,
-				(errmsg_internal("MultiXact member stop limit is now %u based on MultiXact %u",
-								 offsetStopLimit, oldestMultiXactId)));
-	}
-	else if (prevOldestOffsetKnown)
-	{
-		/*
-		 * If we failed to get the oldest offset this time, but we have a
-		 * value from a previous pass through this function, use the old
-		 * values rather than automatically forcing an emergency autovacuum
-		 * cycle again.
-		 */
-		oldestOffset = prevOldestOffset;
-		oldestOffsetKnown = true;
-		offsetStopLimit = prevOffsetStopLimit;
+		LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
+		MultiXactState->oldestOffset = oldestOffset;
+		LWLockRelease(MultiXactGenLock);
 	}
-
-	/* Install the computed values */
-	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
-	MultiXactState->oldestOffset = oldestOffset;
-	MultiXactState->oldestOffsetKnown = oldestOffsetKnown;
-	MultiXactState->offsetStopLimit = offsetStopLimit;
-	LWLockRelease(MultiXactGenLock);
-
-	/*
-	 * Do we need an emergency autovacuum?	If we're not sure, assume yes.
-	 */
-	return !oldestOffsetKnown ||
-		(nextOffset - oldestOffset > MULTIXACT_MEMBER_SAFE_THRESHOLD);
-}
-
-/*
- * Return whether adding "distance" to "start" would move past "boundary".
- *
- * We use this to determine whether the addition is "wrapping around" the
- * boundary point, hence the name.  The reason we don't want to use the regular
- * 2^31-modulo arithmetic here is that we want to be able to use the whole of
- * the 2^32-1 space here, allowing for more multixacts than would fit
- * otherwise.
- */
-static bool
-MultiXactOffsetWouldWrap(MultiXactOffset boundary, MultiXactOffset start,
-						 uint32 distance)
-{
-	MultiXactOffset finish;
-
-	/*
-	 * Note that offset number 0 is not used (see GetMultiXactIdMembers), so
-	 * if the addition wraps around the UINT_MAX boundary, skip that value.
-	 */
-	finish = start + distance;
-	if (finish < start)
-		finish++;
-
-	/*-----------------------------------------------------------------------
-	 * When the boundary is numerically greater than the starting point, any
-	 * value numerically between the two is not wrapped:
-	 *
-	 *	<----S----B---->
-	 *	[---)			 = F wrapped past B (and UINT_MAX)
-	 *		 [---)		 = F not wrapped
-	 *			  [----] = F wrapped past B
-	 *
-	 * When the boundary is numerically less than the starting point (i.e. the
-	 * UINT_MAX wraparound occurs somewhere in between) then all values in
-	 * between are wrapped:
-	 *
-	 *	<----B----S---->
-	 *	[---)			 = F not wrapped past B (but wrapped past UINT_MAX)
-	 *		 [---)		 = F wrapped past B (and UINT_MAX)
-	 *			  [----] = F not wrapped
-	 *-----------------------------------------------------------------------
-	 */
-	if (start < boundary)
-		return finish >= boundary || finish < start;
-	else
-		return finish >= boundary && finish < start;
 }
 
 /*
@@ -2751,37 +2491,23 @@ find_multixact_start(MultiXactId multi, MultiXactOffset *result)
  * members: Number of member entries (nextOffset - oldestOffset)
  * oldestMultiXactId: Oldest MultiXact ID still in use
  * oldestOffset: Oldest offset still in use
- *
- * Returns false if unable to determine, the oldest offset being unknown.
  */
-bool
+void
 GetMultiXactInfo(uint32 *multixacts, MultiXactOffset *members,
 				 MultiXactId *oldestMultiXactId, MultiXactOffset *oldestOffset)
 {
 	MultiXactOffset nextOffset;
 	MultiXactId nextMultiXactId;
-	bool		oldestOffsetKnown;
 
 	LWLockAcquire(MultiXactGenLock, LW_SHARED);
 	nextOffset = MultiXactState->nextOffset;
 	*oldestMultiXactId = MultiXactState->oldestMultiXactId;
 	nextMultiXactId = MultiXactState->nextMXact;
 	*oldestOffset = MultiXactState->oldestOffset;
-	oldestOffsetKnown = MultiXactState->oldestOffsetKnown;
 	LWLockRelease(MultiXactGenLock);
 
-	if (!oldestOffsetKnown)
-	{
-		*members = 0;
-		*multixacts = 0;
-		*oldestMultiXactId = InvalidMultiXactId;
-		*oldestOffset = 0;
-		return false;
-	}
-
 	*members = nextOffset - *oldestOffset;
 	*multixacts = nextMultiXactId - *oldestMultiXactId;
-	return true;
 }
 
 /*
@@ -2790,26 +2516,27 @@ GetMultiXactInfo(uint32 *multixacts, MultiXactOffset *members,
  * vacuum_multixact_freeze_table_age work together to make sure we never have
  * too many multixacts; we hope that, at least under normal circumstances,
  * this will also be sufficient to keep us from using too many offsets.
- * However, if the average multixact has many members, we might exhaust the
- * members space while still using few enough members that these limits fail
- * to trigger relminmxid advancement by VACUUM.  At that point, we'd have no
- * choice but to start failing multixact-creating operations with an error.
- *
- * To prevent that, if more than a threshold portion of the members space is
- * used, we effectively reduce autovacuum_multixact_freeze_max_age and
- * to a value just less than the number of multixacts in use.  We hope that
- * this will quickly trigger autovacuuming on the table or tables with the
- * oldest relminmxid, thus allowing datminmxid values to advance and removing
- * some members.
- *
- * As the fraction of the member space currently in use grows, we become
- * more aggressive in clamping this value.  That not only causes autovacuum
- * to ramp up, but also makes any manual vacuums the user issues more
- * aggressive.  This happens because vacuum_get_cutoffs() will clamp the
- * freeze table and the minimum freeze age cutoffs based on the effective
- * autovacuum_multixact_freeze_max_age this function returns.  In the worst
- * case, we'll claim the freeze_max_age to zero, and every vacuum of any
- * table will freeze every multixact.
+ * However, if the average multixact has many members, we might accumulate a
+ * large amount of members, consuming disk space, while still using few enough
+ * multixids that the multixid limits fail to trigger relminmxid advancement
+ * by VACUUM.
+ *
+ * To prevent that, if the members space usage exceeds a threshold
+ * (MULTIXACT_MEMBER_LOW_THRESHOLD), we effectively reduce
+ * autovacuum_multixact_freeze_max_age to a value just less than the number of
+ * multixacts in use.  We hope that this will quickly trigger autovacuuming on
+ * the table or tables with the oldest relminmxid, thus allowing datminmxid
+ * values to advance and removing some members.
+ *
+ * As the amount of the member space in use grows, we become more aggressive
+ * in clamping this value.  That not only causes autovacuum to ramp up, but
+ * also makes any manual vacuums the user issues more aggressive.  This
+ * happens because vacuum_get_cutoffs() will clamp the freeze table and the
+ * minimum freeze age cutoffs based on the effective
+ * autovacuum_multixact_freeze_max_age this function returns.  At the extreme,
+ * when the members usage reaches MULTIXACT_MEMBER_HIGH_THRESHOLD, we clamp
+ * freeze_max_age to zero, and every vacuum of any table will freeze every
+ * multixact.
  */
 int
 MultiXactMemberFreezeThreshold(void)
@@ -2822,21 +2549,27 @@ MultiXactMemberFreezeThreshold(void)
 	MultiXactId oldestMultiXactId;
 	MultiXactOffset oldestOffset;
 
-	/* If we can't determine member space utilization, assume the worst. */
-	if (!GetMultiXactInfo(&multixacts, &members, &oldestMultiXactId, &oldestOffset))
-		return 0;
+	/* Read the current offsets and members usage. */
+	GetMultiXactInfo(&multixacts, &members, &oldestMultiXactId, &oldestOffset);
 
 	/* If member space utilization is low, no special action is required. */
-	if (members <= MULTIXACT_MEMBER_SAFE_THRESHOLD)
+	if (members <= MULTIXACT_MEMBER_LOW_THRESHOLD)
 		return autovacuum_multixact_freeze_max_age;
 
 	/*
 	 * Compute a target for relminmxid advancement.  The number of multixacts
 	 * we try to eliminate from the system is based on how far we are past
-	 * MULTIXACT_MEMBER_SAFE_THRESHOLD.
-	 */
-	fraction = (double) (members - MULTIXACT_MEMBER_SAFE_THRESHOLD) /
-		(MULTIXACT_MEMBER_DANGER_THRESHOLD - MULTIXACT_MEMBER_SAFE_THRESHOLD);
+	 * MULTIXACT_MEMBER_LOW_THRESHOLD.
+	 *
+	 * The way this formula works is that when members is exactly at the low
+	 * threshold, fraction = 0.0, and we set freeze_max_age equal to
+	 * mxid_age(oldestMultiXactId).  As members grows further, towards the
+	 * high threshold, fraction grows linearly from 0.0 to 1.0, and the result
+	 * shrinks from mxid_age(oldestMultiXactId) to 0.  Beyond the high
+	 * threshold, fraction > 1.0 and the result is clamped to 0.
+	 */
+	fraction = (double) (members - MULTIXACT_MEMBER_LOW_THRESHOLD) /
+		(MULTIXACT_MEMBER_HIGH_THRESHOLD - MULTIXACT_MEMBER_LOW_THRESHOLD);
 	victim_multixacts = multixacts * fraction;
 
 	/* fraction could be > 1.0, but lowest possible freeze age is zero */
@@ -2877,36 +2610,12 @@ SlruScanDirCbFindEarliest(SlruCtl ctl, char *filename, int64 segpage, void *data
 
 /*
  * Delete members segments [oldest, newOldest)
- *
- * The members SLRU can, in contrast to the offsets one, be filled to almost
- * the full range at once. This means SimpleLruTruncate() can't trivially be
- * used - instead the to-be-deleted range is computed using the offsets
- * SLRU. C.f. TruncateMultiXact().
  */
 static void
 PerformMembersTruncation(MultiXactOffset oldestOffset, MultiXactOffset newOldestOffset)
 {
-	const int64 maxsegment = MXOffsetToMemberSegment(MaxMultiXactOffset);
-	int64		startsegment = MXOffsetToMemberSegment(oldestOffset);
-	int64		endsegment = MXOffsetToMemberSegment(newOldestOffset);
-	int64		segment = startsegment;
-
-	/*
-	 * Delete all the segments but the last one. The last segment can still
-	 * contain, possibly partially, valid data.
-	 */
-	while (segment != endsegment)
-	{
-		elog(DEBUG2, "truncating multixact members segment %" PRIx64,
-			 segment);
-		SlruDeleteSegment(MultiXactMemberCtl, segment);
-
-		/* move to next segment, handling wraparound correctly */
-		if (segment == maxsegment)
-			segment = 0;
-		else
-			segment += 1;
-	}
+	SimpleLruTruncate(MultiXactMemberCtl,
+					  MXOffsetToMemberPage(newOldestOffset));
 }
 
 /*
@@ -3050,7 +2759,7 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 
 	elog(DEBUG1, "performing multixact truncation: "
 		 "offsets [%u, %u), offsets segments [%" PRIx64 ", %" PRIx64 "), "
-		 "members [%u, %u), members segments [%" PRIx64 ", %" PRIx64 ")",
+		 "members [%" PRIu64 ", %" PRIu64 "), members segments [%" PRIx64 ", %" PRIx64 ")",
 		 oldestMulti, newOldestMulti,
 		 MultiXactIdToOffsetSegment(oldestMulti),
 		 MultiXactIdToOffsetSegment(newOldestMulti),
@@ -3091,6 +2800,7 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
 	MultiXactState->oldestMultiXactId = newOldestMulti;
 	MultiXactState->oldestMultiXactDB = newOldestMultiDB;
+	MultiXactState->oldestOffset = newOldestOffset;
 	LWLockRelease(MultiXactGenLock);
 
 	/* First truncate members */
@@ -3130,20 +2840,13 @@ MultiXactOffsetPagePrecedes(int64 page1, int64 page2)
 
 /*
  * Decide whether a MultiXactMember page number is "older" for truncation
- * purposes.  There is no "invalid offset number" so use the numbers verbatim.
+ * purposes.  There is no "invalid offset number" and members never wrap
+ * around, so use the numbers verbatim.
  */
 static bool
 MultiXactMemberPagePrecedes(int64 page1, int64 page2)
 {
-	MultiXactOffset offset1;
-	MultiXactOffset offset2;
-
-	offset1 = ((MultiXactOffset) page1) * MULTIXACT_MEMBERS_PER_PAGE;
-	offset2 = ((MultiXactOffset) page2) * MULTIXACT_MEMBERS_PER_PAGE;
-
-	return (MultiXactOffsetPrecedes(offset1, offset2) &&
-			MultiXactOffsetPrecedes(offset1,
-									offset2 + MULTIXACT_MEMBERS_PER_PAGE - 1));
+	return page1 < page2;
 }
 
 /*
@@ -3175,17 +2878,6 @@ MultiXactIdPrecedesOrEquals(MultiXactId multi1, MultiXactId multi2)
 }
 
 
-/*
- * Decide which of two offsets is earlier.
- */
-static bool
-MultiXactOffsetPrecedes(MultiXactOffset offset1, MultiXactOffset offset2)
-{
-	int32		diff = (int32) (offset1 - offset2);
-
-	return (diff < 0);
-}
-
 /*
  * Write a TRUNCATE xlog record
  *
@@ -3278,7 +2970,7 @@ multixact_redo(XLogReaderState *record)
 
 		elog(DEBUG1, "replaying multixact truncation: "
 			 "offsets [%u, %u), offsets segments [%" PRIx64 ", %" PRIx64 "), "
-			 "members [%u, %u), members segments [%" PRIx64 ", %" PRIx64 ")",
+			 "members [%" PRIu64 ", %" PRIu64 "), members segments [%" PRIx64 ", %" PRIx64 ")",
 			 xlrec.startTruncOff, xlrec.endTruncOff,
 			 MultiXactIdToOffsetSegment(xlrec.startTruncOff),
 			 MultiXactIdToOffsetSegment(xlrec.endTruncOff),
@@ -3293,7 +2985,7 @@ multixact_redo(XLogReaderState *record)
 		 * Advance the horizon values, so they're current at the end of
 		 * recovery.
 		 */
-		SetMultiXactIdLimit(xlrec.endTruncOff, xlrec.oldestMultiDB, false);
+		SetMultiXactIdLimit(xlrec.endTruncOff, xlrec.oldestMultiDB);
 
 		PerformMembersTruncation(xlrec.startTruncMemb, xlrec.endTruncMemb);
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 22d0a2e8c3a..a000b8bd509 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5139,7 +5139,7 @@ BootStrapXLOG(uint32 data_checksum_version)
 		FullTransactionIdFromEpochAndXid(0, FirstNormalTransactionId);
 	checkPoint.nextOid = FirstGenbkiObjectId;
 	checkPoint.nextMulti = FirstMultiXactId;
-	checkPoint.nextMultiOffset = 0;
+	checkPoint.nextMultiOffset = 1;
 	checkPoint.oldestXid = FirstNormalTransactionId;
 	checkPoint.oldestXidDB = Template1DbOid;
 	checkPoint.oldestMulti = FirstMultiXactId;
@@ -5155,7 +5155,7 @@ BootStrapXLOG(uint32 data_checksum_version)
 	MultiXactSetNextMXact(checkPoint.nextMulti, checkPoint.nextMultiOffset);
 	AdvanceOldestClogXid(checkPoint.oldestXid);
 	SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
-	SetMultiXactIdLimit(checkPoint.oldestMulti, checkPoint.oldestMultiDB, true);
+	SetMultiXactIdLimit(checkPoint.oldestMulti, checkPoint.oldestMultiDB);
 	SetCommitTsLimit(InvalidTransactionId, InvalidTransactionId);
 
 	/* Set up the XLOG page header */
@@ -5636,7 +5636,7 @@ StartupXLOG(void)
 	MultiXactSetNextMXact(checkPoint.nextMulti, checkPoint.nextMultiOffset);
 	AdvanceOldestClogXid(checkPoint.oldestXid);
 	SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
-	SetMultiXactIdLimit(checkPoint.oldestMulti, checkPoint.oldestMultiDB, true);
+	SetMultiXactIdLimit(checkPoint.oldestMulti, checkPoint.oldestMultiDB);
 	SetCommitTsLimit(checkPoint.oldestCommitTsXid,
 					 checkPoint.newestCommitTsXid);
 
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 21b8f179ba0..51dea342a4d 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -886,7 +886,7 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 							 U64FromFullTransactionId(checkPoint.nextXid),
 							 checkPoint.nextOid)));
 	ereport(DEBUG1,
-			(errmsg_internal("next MultiXactId: %u; next MultiXactOffset: %u",
+			(errmsg_internal("next MultiXactId: %u; next MultiXactOffset: %" PRIu64,
 							 checkPoint.nextMulti, checkPoint.nextMultiOffset)));
 	ereport(DEBUG1,
 			(errmsg_internal("oldest unfrozen transaction ID: %u, in database %u",
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index e785dd55ce5..7780ea6eae3 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -1145,8 +1145,8 @@ vacuum_get_cutoffs(Relation rel, const VacuumParams params,
 
 	/*
 	 * Also compute the multixact age for which freezing is urgent.  This is
-	 * normally autovacuum_multixact_freeze_max_age, but may be less if we are
-	 * short of multixact member space.
+	 * normally autovacuum_multixact_freeze_max_age, but may be less if
+	 * multixact members are bloated.
 	 */
 	effective_multixact_freeze_max_age = MultiXactMemberFreezeThreshold();
 
@@ -1971,7 +1971,7 @@ vac_truncate_clog(TransactionId frozenXID,
 	 * signaling twice?
 	 */
 	SetTransactionIdLimit(frozenXID, oldestxid_datoid);
-	SetMultiXactIdLimit(minMulti, minmulti_datoid, false);
+	SetMultiXactIdLimit(minMulti, minmulti_datoid);
 
 	LWLockRelease(WrapLimitsVacuumLock);
 }
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 1c38488f2cb..f4830f896f3 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -1936,8 +1936,8 @@ do_autovacuum(void)
 
 	/*
 	 * Compute the multixact age for which freezing is urgent.  This is
-	 * normally autovacuum_multixact_freeze_max_age, but may be less if we are
-	 * short of multixact member space.
+	 * normally autovacuum_multixact_freeze_max_age, but may be less if
+	 * multixact members are bloated.
 	 */
 	effective_multixact_freeze_max_age = MultiXactMemberFreezeThreshold();
 
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index 30ad46912e1..a4060309ae0 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -271,7 +271,7 @@ main(int argc, char *argv[])
 		   ControlFile->checkPointCopy.nextOid);
 	printf(_("Latest checkpoint's NextMultiXactId:  %u\n"),
 		   ControlFile->checkPointCopy.nextMulti);
-	printf(_("Latest checkpoint's NextMultiOffset:  %u\n"),
+	printf(_("Latest checkpoint's NextMultiOffset:  %" PRIu64 "\n"),
 		   ControlFile->checkPointCopy.nextMultiOffset);
 	printf(_("Latest checkpoint's oldestXID:        %u\n"),
 		   ControlFile->checkPointCopy.oldestXid);
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index c667a11cb6a..d5de4a7171a 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -115,6 +115,7 @@ static void KillExistingWALSummaries(void);
 static void WriteEmptyXLOG(void);
 static void usage(void);
 static uint32 strtouint32_strict(const char *restrict s, char **restrict endptr, int base);
+static uint64 strtouint64_strict(const char *restrict s, char **restrict endptr, int base);
 
 
 int
@@ -293,7 +294,7 @@ main(int argc, char *argv[])
 
 			case 'O':
 				errno = 0;
-				next_mxoff_val = strtouint32_strict(optarg, &endptr, 0);
+				next_mxoff_val = strtouint64_strict(optarg, &endptr, 0);
 				if (endptr == optarg || *endptr != '\0' || errno != 0)
 				{
 					pg_log_error("invalid argument for option %s", "-O");
@@ -772,7 +773,7 @@ PrintControlValues(bool guessed)
 		   ControlFile.checkPointCopy.nextOid);
 	printf(_("Latest checkpoint's NextMultiXactId:  %u\n"),
 		   ControlFile.checkPointCopy.nextMulti);
-	printf(_("Latest checkpoint's NextMultiOffset:  %u\n"),
+	printf(_("Latest checkpoint's NextMultiOffset:  %" PRIu64 "\n"),
 		   ControlFile.checkPointCopy.nextMultiOffset);
 	printf(_("Latest checkpoint's oldestXID:        %u\n"),
 		   ControlFile.checkPointCopy.oldestXid);
@@ -848,7 +849,7 @@ PrintNewControlValues(void)
 
 	if (next_mxoff_given)
 	{
-		printf(_("NextMultiOffset:                      %u\n"),
+		printf(_("NextMultiOffset:                      %" PRIu64 "\n"),
 			   ControlFile.checkPointCopy.nextMultiOffset);
 	}
 
@@ -1276,3 +1277,34 @@ strtouint32_strict(const char *restrict s, char **restrict endptr, int base)
 
 	return (uint32) val;
 }
+
+/*
+ * strtouint64_strict -- like strtou64(), but doesn't accept negative values
+ */
+static uint64
+strtouint64_strict(const char *restrict s, char **restrict endptr, int base)
+{
+	uint64		val;
+	bool		is_neg;
+
+	/* skip leading whitespace */
+	while (isspace((unsigned char) *s))
+		s++;
+
+	/*
+	 * Is it negative?  We still call strtou64() if it was, to set 'endptr'.
+	 * (The current callers don't care though.)
+	 */
+	is_neg = (*s == '-');
+
+	val = strtou64(s, endptr, base);
+
+	/* reject if it was negative */
+	if (errno == 0 && is_neg)
+	{
+		errno = ERANGE;
+		val = 0;
+	}
+
+	return val;
+}
diff --git a/src/bin/pg_resetwal/t/001_basic.pl b/src/bin/pg_resetwal/t/001_basic.pl
index e9780dbe2a6..4ae51ee574e 100644
--- a/src/bin/pg_resetwal/t/001_basic.pl
+++ b/src/bin/pg_resetwal/t/001_basic.pl
@@ -230,7 +230,7 @@ push @cmd,
   sprintf("%d,%d", hex($files[0]) == 0 ? 3 : hex($files[0]), hex($files[-1]));
 
 @files = get_slru_files('pg_multixact/offsets');
-$mult = 32 * $blcksz / 4;
+$mult = 32 * $blcksz / 8;
 # --multixact-ids argument is "new,old"
 push @cmd,
   '--multixact-ids' => sprintf("%d,%d",
diff --git a/src/bin/pg_upgrade/Makefile b/src/bin/pg_upgrade/Makefile
index 69fcf593cae..12f747b2c59 100644
--- a/src/bin/pg_upgrade/Makefile
+++ b/src/bin/pg_upgrade/Makefile
@@ -18,11 +18,14 @@ OBJS = \
 	file.o \
 	function.o \
 	info.o \
+	multixact_rewrite.o \
+	multixact_read_v18.o \
 	option.o \
 	parallel.o \
 	pg_upgrade.o \
 	relfilenumber.o \
 	server.o \
+	slru_io.o \
 	tablespace.o \
 	task.o \
 	util.o \
diff --git a/src/bin/pg_upgrade/meson.build b/src/bin/pg_upgrade/meson.build
index ac992f0d14b..7bd7062b62f 100644
--- a/src/bin/pg_upgrade/meson.build
+++ b/src/bin/pg_upgrade/meson.build
@@ -8,11 +8,14 @@ pg_upgrade_sources = files(
   'file.c',
   'function.c',
   'info.c',
+  'multixact_rewrite.c',
+  'multixact_read_v18.c',
   'option.c',
   'parallel.c',
   'pg_upgrade.c',
   'relfilenumber.c',
   'server.c',
+  'slru_io.c',
   'tablespace.c',
   'task.c',
   'util.c',
@@ -47,6 +50,7 @@ tests += {
       't/004_subscription.pl',
       't/005_char_signedness.pl',
       't/006_transfer_modes.pl',
+      't/007_multixact_conversion.pl',
     ],
     'test_kwargs': {'priority': 40}, # pg_upgrade tests are slow
   },
diff --git a/src/bin/pg_upgrade/multixact_read_v18.c b/src/bin/pg_upgrade/multixact_read_v18.c
new file mode 100644
index 00000000000..fb537668a2c
--- /dev/null
+++ b/src/bin/pg_upgrade/multixact_read_v18.c
@@ -0,0 +1,337 @@
+/*
+ * multixact_read_v18.c
+ *
+ * Functions to read multixact SLRUs from a cluster of PostgreSQL version 18 and
+ * older. In version 19, the multixid offsets were expanded from 32 to 64
+ * bits.
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ * src/bin/pg_upgrade/multixact_read_v18.c
+ */
+
+#include "postgres_fe.h"
+
+#include "multixact_read_v18.h"
+#include "pg_upgrade.h"
+
+/*
+ * NOTE: below are a bunch of definitions that are copy-pasted from
+ * multixact.c from version 18.  It's important that this file doesn't
+ * #include the new definitions with same names from "multixact_internal.h"!
+ *
+ * To avoid confusion in the functions exposed outside this source file,
+ * though, we use OldMultiXactOffset to represent the old-style 32-bit
+ * multixid offsets. The new 64-bit MultiXactOffset should not be used
+ * anywhere in this file.
+ */
+#define MultiXactOffset should_not_be_used
+
+/* We need four bytes per offset in the old (pre-v19) format. */
+#define MULTIXACT_OFFSETS_PER_PAGE (BLCKSZ / sizeof(OldMultiXactOffset))
+
+static inline int64
+MultiXactIdToOffsetPage(MultiXactId multi)
+{
+	return multi / MULTIXACT_OFFSETS_PER_PAGE;
+}
+
+static inline int
+MultiXactIdToOffsetEntry(MultiXactId multi)
+{
+	return multi % MULTIXACT_OFFSETS_PER_PAGE;
+}
+
+/*
+ * The situation for members is a bit more complex: we store one byte of
+ * additional flag bits for each TransactionId.  To do this without getting
+ * into alignment issues, we store four bytes of flags, and then the
+ * corresponding 4 Xids.  Each such 5-word (20-byte) set we call a "group", and
+ * are stored as a whole in pages.  Thus, with 8kB BLCKSZ, we keep 409 groups
+ * per page.  This wastes 12 bytes per page, but that's OK -- simplicity (and
+ * performance) trumps space efficiency here.
+ *
+ * Note that the "offset" macros work with byte offset, not array indexes, so
+ * arithmetic must be done using "char *" pointers.
+ */
+/* We need eight bits per xact, so one xact fits in a byte */
+#define MXACT_MEMBER_BITS_PER_XACT			8
+#define MXACT_MEMBER_FLAGS_PER_BYTE			1
+#define MXACT_MEMBER_XACT_BITMASK	((1 << MXACT_MEMBER_BITS_PER_XACT) - 1)
+
+/* how many full bytes of flags are there in a group? */
+#define MULTIXACT_FLAGBYTES_PER_GROUP		4
+#define MULTIXACT_MEMBERS_PER_MEMBERGROUP	\
+	(MULTIXACT_FLAGBYTES_PER_GROUP * MXACT_MEMBER_FLAGS_PER_BYTE)
+/* size in bytes of a complete group */
+#define MULTIXACT_MEMBERGROUP_SIZE \
+	(sizeof(TransactionId) * MULTIXACT_MEMBERS_PER_MEMBERGROUP + MULTIXACT_FLAGBYTES_PER_GROUP)
+#define MULTIXACT_MEMBERGROUPS_PER_PAGE (BLCKSZ / MULTIXACT_MEMBERGROUP_SIZE)
+#define MULTIXACT_MEMBERS_PER_PAGE	\
+	(MULTIXACT_MEMBERGROUPS_PER_PAGE * MULTIXACT_MEMBERS_PER_MEMBERGROUP)
+
+/* page in which a member is to be found */
+static inline int64
+MXOffsetToMemberPage(OldMultiXactOffset offset)
+{
+	return offset / MULTIXACT_MEMBERS_PER_PAGE;
+}
+
+/* Location (byte offset within page) of flag word for a given member */
+static inline int
+MXOffsetToFlagsOffset(OldMultiXactOffset offset)
+{
+	OldMultiXactOffset group = offset / MULTIXACT_MEMBERS_PER_MEMBERGROUP;
+	int			grouponpg = group % MULTIXACT_MEMBERGROUPS_PER_PAGE;
+	int			byteoff = grouponpg * MULTIXACT_MEMBERGROUP_SIZE;
+
+	return byteoff;
+}
+
+/* Location (byte offset within page) of TransactionId of given member */
+static inline int
+MXOffsetToMemberOffset(OldMultiXactOffset offset)
+{
+	int			member_in_group = offset % MULTIXACT_MEMBERS_PER_MEMBERGROUP;
+
+	return MXOffsetToFlagsOffset(offset) +
+		MULTIXACT_FLAGBYTES_PER_GROUP +
+		member_in_group * sizeof(TransactionId);
+}
+
+static inline int
+MXOffsetToFlagsBitShift(OldMultiXactOffset offset)
+{
+	int			member_in_group = offset % MULTIXACT_MEMBERS_PER_MEMBERGROUP;
+	int			bshift = member_in_group * MXACT_MEMBER_BITS_PER_XACT;
+
+	return bshift;
+}
+
+/*
+ * Construct reader of old multixacts.
+ *
+ * Returns the malloced memory used by all the other calls in this module.
+ */
+OldMultiXactReader *
+AllocOldMultiXactRead(char *pgdata, MultiXactId nextMulti,
+					  OldMultiXactOffset nextOffset)
+{
+	OldMultiXactReader *state = pg_malloc(sizeof(*state));
+	char		dir[MAXPGPATH] = {0};
+
+	state->nextMXact = nextMulti;
+	state->nextOffset = nextOffset;
+
+	pg_sprintf(dir, "%s/pg_multixact/offsets", pgdata);
+	state->offset = AllocSlruRead(dir, false);
+
+	pg_sprintf(dir, "%s/pg_multixact/members", pgdata);
+	state->members = AllocSlruRead(dir, false);
+
+	return state;
+}
+
+/*
+ * This is a simplified version of the GetMultiXactIdMembers() server
+ * function:
+ *
+ * - Only return the updating member, if any.  Upgrade only cares about the
+ *   updaters.  If there is no updating member, return somewhat arbitrarily
+ *   the first locking-only member, because we don't have any way to represent
+ *   "no members".
+ *
+ * - Because there's no concurrent activity, we don't need to worry about
+ *   locking and some corner cases.
+ *
+ * - Don't bail out on invalid entries.  If the server crashes, it can leave
+ *   invalid or half-written entries on disk. Such multixids won't appear
+ *   anywhere else on disk, so the server will never try to read them.  During
+ *   upgrade, however, we scan through all multixids in order, and will
+ *   encounter such invalid but unreferenced multixids too.
+ *
+ * Returns true on success, false if the multixact was invalid.
+ */
+bool
+GetOldMultiXactIdSingleMember(OldMultiXactReader *state, MultiXactId multi,
+							  MultiXactMember *member)
+{
+	MultiXactId nextMXact,
+				tmpMXact;
+	OldMultiXactOffset nextOffset;
+	int64		pageno,
+				prev_pageno;
+	int			entryno,
+				length;
+	char	   *buf;
+	OldMultiXactOffset *offptr,
+				offset;
+	OldMultiXactOffset nextMXOffset;
+	TransactionId result_xid = InvalidTransactionId;
+	MultiXactStatus result_status = 0;
+
+	nextMXact = state->nextMXact;
+	nextOffset = state->nextOffset;
+
+	/*
+	 * Comment copied from GetMultiXactIdMembers in PostgreSQL v18
+	 * multixact.c:
+	 *
+	 * Find out the offset at which we need to start reading MultiXactMembers
+	 * and the number of members in the multixact.  We determine the latter as
+	 * the difference between this multixact's starting offset and the next
+	 * one's.  However, there are some corner cases to worry about:
+	 *
+	 * 1. This multixact may be the latest one created, in which case there is
+	 * no next one to look at.  The next multixact's offset should be set
+	 * already, as we set it in RecordNewMultiXact(), but we used to not do
+	 * that in older minor versions.  To cope with that case, if this
+	 * multixact is the latest one created, use the nextOffset value we read
+	 * above as the endpoint.
+	 *
+	 * 2. Because GetNewMultiXactId skips over offset zero, to reserve zero
+	 * to mean "unset", there is an ambiguity near the point of offset
+	 * wraparound.  If we see next multixact's offset is one, is that our
+	 * multixact's actual endpoint, or did it end at zero with a subsequent
+	 * increment?  We handle this using the knowledge that if the zero'th
+	 * member slot wasn't filled, it'll contain zero, and zero isn't a valid
+	 * transaction ID so it can't be a multixact member.  Therefore, if we
+	 * read a zero from the members array, just ignore it.
+	 */
+
+	pageno = MultiXactIdToOffsetPage(multi);
+	entryno = MultiXactIdToOffsetEntry(multi);
+
+	buf = SlruReadSwitchPage(state->offset, pageno);
+	offptr = (OldMultiXactOffset *) buf;
+	offptr += entryno;
+	offset = *offptr;
+
+	if (offset == 0)
+	{
+		/* Invalid entry */
+		return false;
+	}
+
+	/*
+	 * Use the same increment rule as GetNewMultiXactId(), that is, don't
+	 * handle wraparound explicitly until needed.
+	 */
+	tmpMXact = multi + 1;
+
+	if (nextMXact == tmpMXact)
+	{
+		/* Corner case 1: there is no next multixact */
+		nextMXOffset = nextOffset;
+	}
+	else
+	{
+		/* handle wraparound if needed */
+		if (tmpMXact < FirstMultiXactId)
+			tmpMXact = FirstMultiXactId;
+
+		prev_pageno = pageno;
+
+		pageno = MultiXactIdToOffsetPage(tmpMXact);
+		entryno = MultiXactIdToOffsetEntry(tmpMXact);
+
+		if (pageno != prev_pageno)
+			buf = SlruReadSwitchPage(state->offset, pageno);
+
+		offptr = (OldMultiXactOffset *) buf;
+		offptr += entryno;
+		nextMXOffset = *offptr;
+	}
+
+	if (nextMXOffset == 0)
+	{
+		/* Invalid entry */
+		return false;
+	}
+	length = nextMXOffset - offset;
+
+	/* read the members */
+	prev_pageno = -1;
+	for (int i = 0; i < length; i++, offset++)
+	{
+		TransactionId *xactptr;
+		uint32	   *flagsptr;
+		int			flagsoff;
+		int			bshift;
+		int			memberoff;
+		MultiXactStatus status;
+
+		pageno = MXOffsetToMemberPage(offset);
+		memberoff = MXOffsetToMemberOffset(offset);
+
+		if (pageno != prev_pageno)
+		{
+			buf = SlruReadSwitchPage(state->members, pageno);
+			prev_pageno = pageno;
+		}
+
+		xactptr = (TransactionId *) (buf + memberoff);
+		if (!TransactionIdIsValid(*xactptr))
+		{
+			/*
+			 * Corner case 2: we are looking at unused slot zero
+			 */
+			if (offset == 0)
+				continue;
+
+			/*
+			 * Otherwise this is an invalid entry that should not be
+			 * referenced from anywhere in the heap.  We could return 'false'
+			 * here, but we prefer to continue reading the members and
+			 * converting them the best we can, to preserve evidence in case
+			 * this is corruption that should not happen.
+			 */
+		}
+
+		flagsoff = MXOffsetToFlagsOffset(offset);
+		bshift = MXOffsetToFlagsBitShift(offset);
+		flagsptr = (uint32 *) (buf + flagsoff);
+
+		status = (*flagsptr >> bshift) & MXACT_MEMBER_XACT_BITMASK;
+
+		/*
+		 * Remember the updating XID among the members, or first locking XID
+		 * if no updating XID.
+		 */
+		if (ISUPDATE_from_mxstatus(status))
+		{
+			/* sanity check */
+			if (ISUPDATE_from_mxstatus(result_status))
+			{
+				/*
+				 * We don't expect to see more than one updating member, even
+				 * if the server had crashed.
+				 */
+				pg_fatal("multixact %u has more than one updating member",
+						 multi);
+			}
+			result_xid = *xactptr;
+			result_status = status;
+		}
+		else if (!TransactionIdIsValid(result_xid))
+		{
+			result_xid = *xactptr;
+			result_status = status;
+		}
+	}
+
+	member->xid = result_xid;
+	member->status = result_status;
+	return true;
+}
+
+/*
+ * Frees the malloced reader.
+ */
+void
+FreeOldMultiXactReader(OldMultiXactReader *state)
+{
+	FreeSlruRead(state->offset);
+	FreeSlruRead(state->members);
+
+	pfree(state);
+}
diff --git a/src/bin/pg_upgrade/multixact_read_v18.h b/src/bin/pg_upgrade/multixact_read_v18.h
new file mode 100644
index 00000000000..8ee82a14a46
--- /dev/null
+++ b/src/bin/pg_upgrade/multixact_read_v18.h
@@ -0,0 +1,37 @@
+/*
+ * multixact_read_v18.h
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ * src/bin/pg_upgrade/multixact_read_v18.h
+ */
+#ifndef MULTIXACT_READ_V18_H
+#define MULTIXACT_READ_V18_H
+
+#include "access/multixact.h"
+#include "slru_io.h"
+
+/*
+ * MultiXactOffset changed from uint32 to uint64 between versions 18 and 19.
+ * OldMultiXactOffset is used to represent a 32-bit offset from the old
+ * cluster.
+ */
+typedef uint32 OldMultiXactOffset;
+
+typedef struct OldMultiXactReader
+{
+	MultiXactId nextMXact;
+	OldMultiXactOffset nextOffset;
+
+	SlruSegState *offset;
+	SlruSegState *members;
+} OldMultiXactReader;
+
+extern OldMultiXactReader *AllocOldMultiXactRead(char *pgdata,
+												 MultiXactId nextMulti,
+												 OldMultiXactOffset nextOffset);
+extern bool GetOldMultiXactIdSingleMember(OldMultiXactReader *state,
+										  MultiXactId multi,
+										  MultiXactMember *member);
+extern void FreeOldMultiXactReader(OldMultiXactReader *reader);
+
+#endif							/* MULTIXACT_READ_V18_H */
diff --git a/src/bin/pg_upgrade/multixact_rewrite.c b/src/bin/pg_upgrade/multixact_rewrite.c
new file mode 100644
index 00000000000..d483b2ff31f
--- /dev/null
+++ b/src/bin/pg_upgrade/multixact_rewrite.c
@@ -0,0 +1,195 @@
+/*
+ * multixact_rewrite.c
+ *
+ * Functions to convert multixact SLRUs from the pre-v19 format to the current
+ * format with 64-bit MultiXactOffsets.
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ * src/bin/pg_upgrade/multixact_rewrite.c
+ */
+
+#include "postgres_fe.h"
+
+#include "access/multixact_internal.h"
+#include "multixact_read_v18.h"
+#include "pg_upgrade.h"
+
+static void RecordMultiXactOffset(SlruSegState *offsets_writer, MultiXactId multi,
+								  MultiXactOffset offset);
+static void RecordMultiXactMembers(SlruSegState *members_writer,
+								   MultiXactOffset offset,
+								   int nmembers, MultiXactMember *members);
+
+/*
+ * Convert pg_multixact/offsets and /members from the old pre-v19 format with
+ * 32-bit offsets to the current format.
+ *
+ * Multixids in the range [from_multi, to_multi) are read from the old
+ * cluster, and written in the new format.  An important edge case is that if
+ * from_multi == to_multi, this initializes the new pg_multixact files in the
+ * new format without trying to open any old files.  (We rely on that when
+ * upgrading from PostgreSQL version 9.2 or below.)
+ *
+ * Returns the new nextOffset value; the caller should set it in the new
+ * control file.  The new members always start from offset 1, regardless of
+ * the offset range used in the old cluster.
+ */
+MultiXactOffset
+rewrite_multixacts(MultiXactId from_multi, MultiXactId to_multi)
+{
+	MultiXactId oldest_multi,
+				next_multi;
+	MultiXactOffset next_offset;
+	SlruSegState *offsets_writer;
+	SlruSegState *members_writer;
+	char		dir[MAXPGPATH] = {0};
+	bool		prev_multixid_valid = false;
+
+	/*
+	 * The range of valid multi XIDs is unchanged by the conversion (they are
+	 * referenced from the heap tables), but the members SLRU is rewritten to
+	 * start from offset 1.
+	 */
+	oldest_multi = from_multi;
+	next_multi = to_multi;
+	next_offset = 1;
+
+	/* Prepare to write the new SLRU files */
+	pg_sprintf(dir, "%s/pg_multixact/offsets", new_cluster.pgdata);
+	offsets_writer = AllocSlruWrite(dir, false);
+	SlruWriteSwitchPage(offsets_writer, MultiXactIdToOffsetPage(from_multi));
+
+	pg_sprintf(dir, "%s/pg_multixact/members", new_cluster.pgdata);
+	members_writer = AllocSlruWrite(dir, true /* use long segment names */ );
+	SlruWriteSwitchPage(members_writer, MXOffsetToMemberPage(next_offset));
+
+	/*
+	 * Convert old multixids, if needed, by reading them one-by-one from the
+	 * old cluster.
+	 */
+	if (to_multi != from_multi)
+	{
+		OldMultiXactReader *old_reader;
+
+		old_reader = AllocOldMultiXactRead(old_cluster.pgdata,
+										   old_cluster.controldata.chkpnt_nxtmulti,
+										   old_cluster.controldata.chkpnt_nxtmxoff);
+
+		for (MultiXactId multi = oldest_multi; multi != next_multi;)
+		{
+			MultiXactMember member;
+			bool		multixid_valid;
+
+			/*
+			 * Read this multixid's members.
+			 *
+			 * Locking-only XIDs that may be part of multi-xids don't matter
+			 * after upgrade, as there can be no transactions running across
+			 * upgrade.  So as a small optimization, we only read one member
+			 * from each multixid: the updating one, or if there was no
+			 * update, arbitrarily the first locking xid.
+			 */
+			multixid_valid = GetOldMultiXactIdSingleMember(old_reader, multi, &member);
+
+			/*
+			 * Write the new offset to pg_multixact/offsets.
+			 *
+			 * If the old multixid was invalid, we still need to write this
+			 * offset if the *previous* multixid was valid.  That's because
+			 * when reading a multixid, the number of members is
+			 * calculated from the difference between the current and the next
+			 * multixid's offsets.
+			 */
+			RecordMultiXactOffset(offsets_writer, multi,
+								  (multixid_valid || prev_multixid_valid) ? next_offset : 0);
+
+			if (multixid_valid)
+			{
+				RecordMultiXactMembers(members_writer, next_offset, 1, &member);
+				next_offset += 1;
+			}
+
+			/* Advance to next multixid, handling wraparound */
+			multi++;
+			if (multi < FirstMultiXactId)
+				multi = FirstMultiXactId;
+			prev_multixid_valid = multixid_valid;
+		}
+
+		FreeOldMultiXactReader(old_reader);
+	}
+
+	/* write the final 'next' offset to the last SLRU page */
+	RecordMultiXactOffset(offsets_writer, next_multi,
+						  prev_multixid_valid ? next_offset : 0);
+
+	/* Release resources */
+	FreeSlruWrite(offsets_writer);
+	FreeSlruWrite(members_writer);
+
+	return next_offset;
+}
+
+
+/*
+ * Write one offset to the offset SLRU
+ */
+static void
+RecordMultiXactOffset(SlruSegState *offsets_writer, MultiXactId multi,
+					  MultiXactOffset offset)
+{
+	int64		pageno;
+	int			entryno;
+	char	   *buf;
+	MultiXactOffset *offptr;
+
+	pageno = MultiXactIdToOffsetPage(multi);
+	entryno = MultiXactIdToOffsetEntry(multi);
+
+	buf = SlruWriteSwitchPage(offsets_writer, pageno);
+	offptr = (MultiXactOffset *) buf;
+	offptr[entryno] = offset;
+}
+
+/*
+ * Write the members for one multixid in the members SLRU
+ *
+ * (Currently, this is only ever called with nmembers == 1)
+ */
+static void
+RecordMultiXactMembers(SlruSegState *members_writer,
+					   MultiXactOffset offset,
+					   int nmembers, MultiXactMember *members)
+{
+	for (int i = 0; i < nmembers; i++, offset++)
+	{
+		int64		pageno;
+		char	   *buf;
+		TransactionId *memberptr;
+		uint32	   *flagsptr;
+		uint32		flagsval;
+		int			bshift;
+		int			flagsoff;
+		int			memberoff;
+
+		Assert(members[i].status <= MultiXactStatusUpdate);
+
+		pageno = MXOffsetToMemberPage(offset);
+		memberoff = MXOffsetToMemberOffset(offset);
+		flagsoff = MXOffsetToFlagsOffset(offset);
+		bshift = MXOffsetToFlagsBitShift(offset);
+
+		buf = SlruWriteSwitchPage(members_writer, pageno);
+
+		memberptr = (TransactionId *) (buf + memberoff);
+
+		*memberptr = members[i].xid;
+
+		flagsptr = (uint32 *) (buf + flagsoff);
+
+		flagsval = *flagsptr;
+		flagsval &= ~(((1 << MXACT_MEMBER_BITS_PER_XACT) - 1) << bshift);
+		flagsval |= (members[i].status << bshift);
+		*flagsptr = flagsval;
+	}
+}
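
Reviewer note (not part of the patch): the comment in the rewrite loop says a reader derives the member count from the difference between a multixid's offset and the next multixid's offset. The standalone sketch below illustrates why the entry after a valid multixid must be written even when that next multixid is itself invalid. NMULTIS, fill_offsets() and member_count() are hypothetical names invented for this illustration, and the sketch assumes offsets start at 1 with 0 meaning "no entry".

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical size of the illustrated multixid range; not a PostgreSQL constant. */
#define NMULTIS 5

/*
 * Fill offsets[] using the same rule as the rewrite loop: an entry is written
 * if the multixid is valid, or if the previous one was valid.  Each valid
 * multixid keeps exactly one member.  Returns the next unused offset.
 */
static uint64_t
fill_offsets(const bool valid[NMULTIS], uint64_t offsets[NMULTIS + 1],
             uint64_t first_offset)
{
    uint64_t next_offset = first_offset;
    bool     prev_valid = false;

    for (int i = 0; i < NMULTIS; i++)
    {
        offsets[i] = (valid[i] || prev_valid) ? next_offset : 0;
        if (valid[i])
            next_offset += 1;
        prev_valid = valid[i];
    }

    /* the final 'next' entry, needed to size the last valid multixid */
    offsets[NMULTIS] = prev_valid ? next_offset : 0;
    return next_offset;
}

/*
 * Member count as a reader would derive it.  Only meaningful for a valid
 * multixid, whose own entry and successor entry are both nonzero.
 */
static uint64_t
member_count(const uint64_t offsets[NMULTIS + 1], int i)
{
    return offsets[i + 1] - offsets[i];
}
```

With the validity pattern valid, valid, invalid, invalid, valid and a first offset of 1, the entries become 1, 2, 3, 0, 3, 4. The second multixid's count (3 - 2 = 1) is only recoverable because the third entry holds 3 rather than 0.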
diff --git a/src/bin/pg_upgrade/pg_upgrade.c b/src/bin/pg_upgrade/pg_upgrade.c
index 490e98fa26f..b3405c22135 100644
--- a/src/bin/pg_upgrade/pg_upgrade.c
+++ b/src/bin/pg_upgrade/pg_upgrade.c
@@ -43,6 +43,7 @@
 
 #include <time.h>
 
+#include "access/multixact.h"
 #include "catalog/pg_class_d.h"
 #include "common/file_perm.h"
 #include "common/logging.h"
@@ -807,15 +808,15 @@ copy_xact_xlog_xid(void)
 			  new_cluster.pgdata);
 	check_ok();
 
-	/*
-	 * If the old server is before the MULTIXACT_FORMATCHANGE_CAT_VER change
-	 * (see pg_upgrade.h) and the new server is after, then we don't copy
-	 * pg_multixact files, but we need to reset pg_control so that the new
-	 * server doesn't attempt to read multis older than the cutoff value.
-	 */
-	if (old_cluster.controldata.cat_ver >= MULTIXACT_FORMATCHANGE_CAT_VER &&
-		new_cluster.controldata.cat_ver >= MULTIXACT_FORMATCHANGE_CAT_VER)
+	/* Copy or convert pg_multixact files */
+	Assert(new_cluster.controldata.cat_ver >= MULTIXACT_FORMATCHANGE_CAT_VER);
+	Assert(new_cluster.controldata.cat_ver >= MULTIXACTOFFSET_FORMATCHANGE_CAT_VER);
+	if (old_cluster.controldata.cat_ver >= MULTIXACTOFFSET_FORMATCHANGE_CAT_VER)
 	{
+		/* No change in multixact format, just copy the files */
+		MultiXactId new_nxtmulti = old_cluster.controldata.chkpnt_nxtmulti;
+		MultiXactOffset new_nxtmxoff = old_cluster.controldata.chkpnt_nxtmxoff;
+
 		copy_subdir_files("pg_multixact/offsets", "pg_multixact/offsets");
 		copy_subdir_files("pg_multixact/members", "pg_multixact/members");
 
@@ -826,38 +827,64 @@ copy_xact_xlog_xid(void)
 		 * counters here and the oldest multi present on system.
 		 */
 		exec_prog(UTILITY_LOG_FILE, NULL, true, true,
-				  "\"%s/pg_resetwal\" -O %u -m %u,%u \"%s\"",
-				  new_cluster.bindir,
-				  old_cluster.controldata.chkpnt_nxtmxoff,
-				  old_cluster.controldata.chkpnt_nxtmulti,
+				  "\"%s/pg_resetwal\" -O %" PRIu64 " -m %u,%u \"%s\"",
+				  new_cluster.bindir, new_nxtmxoff, new_nxtmulti,
 				  old_cluster.controldata.chkpnt_oldstMulti,
 				  new_cluster.pgdata);
 		check_ok();
 	}
-	else if (new_cluster.controldata.cat_ver >= MULTIXACT_FORMATCHANGE_CAT_VER)
+	else
 	{
+		/* Conversion is needed */
+		MultiXactId nxtmulti;
+		MultiXactId oldstMulti;
+		MultiXactOffset nxtmxoff;
+
 		/*
-		 * Remove offsets/0000 file created by initdb that no longer matches
-		 * the new multi-xid value.  "members" starts at zero so no need to
-		 * remove it.
+		 * Determine the range of multixacts to convert.
 		 */
-		remove_new_subdir("pg_multixact/offsets", false);
+		nxtmulti = old_cluster.controldata.chkpnt_nxtmulti;
+		if (old_cluster.controldata.cat_ver >= MULTIXACT_FORMATCHANGE_CAT_VER)
+			oldstMulti = old_cluster.controldata.chkpnt_oldstMulti;
+		else
+		{
+			/*
+			 * In PostgreSQL 9.2 and below, multitransactions were only used
+			 * for row locking, and as such don't need to be preserved during
+			 * upgrade.  In that case, we utilize rewrite_multixacts() just to
+			 * initialize new, empty files in the new format.
+			 *
+			 * It's important that the oldest multi is set to the latest value
+			 * used by the old system, so that multixact.c returns the empty
+			 * set for multis that might be present on disk.
+			 */
+			oldstMulti = nxtmulti;
+		}
+		/* handle wraparound */
+		if (nxtmulti < FirstMultiXactId)
+			nxtmulti = FirstMultiXactId;
+		if (oldstMulti < FirstMultiXactId)
+			oldstMulti = FirstMultiXactId;
 
-		prep_status("Setting oldest multixact ID in new cluster");
+		/*
+		 * Remove the files created by initdb in the new cluster.
+		 * rewrite_multixacts() will create new ones.
+		 */
+		remove_new_subdir("pg_multixact/members", false);
+		remove_new_subdir("pg_multixact/offsets", false);
 
 		/*
-		 * We don't preserve files in this case, but it's important that the
-		 * oldest multi is set to the latest value used by the old system, so
-		 * that multixact.c returns the empty set for multis that might be
-		 * present on disk.  We set next multi to the value following that; it
-		 * might end up wrapped around (i.e. 0) if the old cluster had
-		 * next=MaxMultiXactId, but multixact.c can cope with that just fine.
+		 * Create new pg_multixact files, converting old ones if needed.
 		 */
+		prep_status("Converting pg_multixact files");
+		nxtmxoff = rewrite_multixacts(oldstMulti, nxtmulti);
+		check_ok();
+
+		prep_status("Setting next multixact ID and offset for new cluster");
 		exec_prog(UTILITY_LOG_FILE, NULL, true, true,
-				  "\"%s/pg_resetwal\" -m %u,%u \"%s\"",
+				  "\"%s/pg_resetwal\" -O %" PRIu64 " -m %u,%u \"%s\"",
 				  new_cluster.bindir,
-				  old_cluster.controldata.chkpnt_nxtmulti + 1,
-				  old_cluster.controldata.chkpnt_nxtmulti,
+				  nxtmxoff, nxtmulti, oldstMulti,
 				  new_cluster.pgdata);
 		check_ok();
 	}
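
Reviewer note (not part of the patch): both the conversion loop and the range setup in copy_xact_xlog_xid() treat a multixid below FirstMultiXactId as wrapped around and clamp it. The sketch below shows that stepping rule in isolation. It re-declares MultiXactId and FirstMultiXactId locally so it compiles on its own; next_multi() and count_multis() are hypothetical helpers, not functions from the patch.

```c
#include <stdint.h>

/* Local stand-ins so the sketch is self-contained. */
typedef uint32_t MultiXactId;
#define FirstMultiXactId ((MultiXactId) 1)

/* Advance to the next multixid, skipping the invalid value 0 on wraparound. */
static MultiXactId
next_multi(MultiXactId multi)
{
    multi++;
    if (multi < FirstMultiXactId)
        multi = FirstMultiXactId;
    return multi;
}

/* Count the multixids visited when walking from 'from' (inclusive) to 'to' (exclusive). */
static uint64_t
count_multis(MultiXactId from, MultiXactId to)
{
    uint64_t    n = 0;

    for (MultiXactId m = from; m != to; m = next_multi(m))
        n++;
    return n;
}
```

Walking from 0xFFFFFFFE to 3 visits 0xFFFFFFFE, 0xFFFFFFFF, 1 and 2, so the range holds four multixids even though the end value is numerically smaller than the start.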
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index e86336f4be9..48f15dff5e0 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -114,6 +114,13 @@ extern char *output_files[];
  */
 #define MULTIXACT_FORMATCHANGE_CAT_VER 201301231
 
+/*
+ * MultiXactOffset was changed from 32-bit to 64-bit in version 19, at this
+ * catalog version.  pg_multixact files need to be converted when upgrading
+ * across this version.
+ */
+#define MULTIXACTOFFSET_FORMATCHANGE_CAT_VER 999999999
+
 /*
  * large object chunk size added to pg_controldata,
  * commit 5f93c37805e7485488480916b4585e098d3cc883
@@ -235,7 +242,7 @@ typedef struct
 	uint32		chkpnt_nxtepoch;
 	uint32		chkpnt_nxtoid;
 	uint32		chkpnt_nxtmulti;
-	uint32		chkpnt_nxtmxoff;
+	uint64		chkpnt_nxtmxoff;
 	uint32		chkpnt_oldstMulti;
 	uint32		chkpnt_oldstxid;
 	uint32		align;
@@ -499,6 +506,9 @@ void		old_9_6_invalidate_hash_indexes(ClusterInfo *cluster,
 
 void		report_extension_updates(ClusterInfo *cluster);
 
+/* multixact_rewrite.c */
+MultiXactOffset rewrite_multixacts(MultiXactId from_multi, MultiXactId to_multi);
+
 /* parallel.c */
 void		parallel_exec_prog(const char *log_file, const char *opt_log_file,
 							   const char *fmt,...) pg_attribute_printf(3, 4);
diff --git a/src/bin/pg_upgrade/slru_io.c b/src/bin/pg_upgrade/slru_io.c
new file mode 100644
index 00000000000..720445289b9
--- /dev/null
+++ b/src/bin/pg_upgrade/slru_io.c
@@ -0,0 +1,258 @@
+/*
+ * slru_io.c
+ *
+ * Routines for reading and writing SLRU files during upgrade.
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ * src/bin/pg_upgrade/slru_io.c
+ */
+
+#include "postgres_fe.h"
+
+#include <fcntl.h>
+
+#include "common/fe_memutils.h"
+#include "common/file_perm.h"
+#include "common/file_utils.h"
+#include "port/pg_iovec.h"
+#include "pg_upgrade.h"
+#include "slru_io.h"
+
+static SlruSegState *AllocSlruSegState(const char *dir);
+static char *SlruFileName(SlruSegState *state, int64 segno);
+static void SlruFlush(SlruSegState *state);
+
+static SlruSegState *
+AllocSlruSegState(const char *dir)
+{
+	SlruSegState *state = pg_malloc(sizeof(*state));
+
+	state->dir = pstrdup(dir);
+	state->fn = NULL;
+	state->fd = -1;
+	state->segno = -1;
+	state->pageno = 0;
+
+	return state;
+}
+
+/* similar to the backend function with the same name */
+static char *
+SlruFileName(SlruSegState *state, int64 segno)
+{
+	if (state->long_segment_names)
+	{
+		Assert(segno >= 0 && segno <= INT64CONST(0xFFFFFFFFFFFFFFF));
+		return psprintf("%s/%015" PRIX64, state->dir, segno);
+	}
+	else
+	{
+		Assert(segno >= 0 && segno <= INT64CONST(0xFFFFFF));
+		return psprintf("%s/%04X", state->dir, (unsigned int) segno);
+	}
+}
+
+/*
+ * Create slru reader for dir.
+ *
+ * Returns the malloced memory used by all the other read calls in this module.
+ */
+SlruSegState *
+AllocSlruRead(const char *dir, bool long_segment_names)
+{
+	SlruSegState *state = AllocSlruSegState(dir);
+
+	state->writing = false;
+	state->long_segment_names = long_segment_names;
+
+	return state;
+}
+
+/*
+ * Open given page for reading.
+ *
+ * Reading can be done in random order.
+ */
+char *
+SlruReadSwitchPageSlow(SlruSegState *state, uint64 pageno)
+{
+	int64		segno;
+	ssize_t		bytes_read;
+	off_t		offset;
+
+	Assert(!state->writing);	/* read only mode */
+
+	if (state->segno != -1 && pageno == state->pageno)
+		return state->buf.data;
+
+	segno = pageno / SLRU_PAGES_PER_SEGMENT;
+	if (segno != state->segno)
+	{
+		if (state->segno != -1)
+		{
+			close(state->fd);
+			state->fd = -1;
+
+			pg_free(state->fn);
+			state->fn = NULL;
+
+			state->segno = -1;
+		}
+
+		/* Open new segment */
+		state->fn = SlruFileName(state, segno);
+		if ((state->fd = open(state->fn, O_RDONLY | PG_BINARY, 0)) < 0)
+			pg_fatal("could not open file \"%s\": %m", state->fn);
+	}
+	state->segno = segno;
+
+	offset = (pageno % SLRU_PAGES_PER_SEGMENT) * BLCKSZ;
+	bytes_read = 0;
+	while (bytes_read < BLCKSZ)
+	{
+		ssize_t		rc;
+
+		rc = pg_pread(state->fd,
+					  state->buf.data + bytes_read,
+					  BLCKSZ - bytes_read,
+					  offset + bytes_read);
+		if (rc < 0)
+		{
+			if (errno == EINTR)
+				continue;
+			pg_fatal("could not read file \"%s\": %m", state->fn);
+		}
+		if (rc == 0)
+		{
+			/* unexpected EOF */
+			pg_log(PG_WARNING, "unexpected EOF reading file \"%s\" at offset %lld, reading as zeros", state->fn,
+				   (long long) (offset + bytes_read));
+			memset(state->buf.data + bytes_read, 0, BLCKSZ - bytes_read);
+			break;
+		}
+		bytes_read += rc;
+	}
+	state->pageno = pageno;
+
+	return state->buf.data;
+}
+
+/*
+ * Frees the malloced reader.
+ */
+void
+FreeSlruRead(SlruSegState *state)
+{
+	Assert(!state->writing);	/* read only mode */
+
+	if (state->fd != -1)
+		close(state->fd);
+	pg_free(state);
+}
+
+/*
+ * Create slru writer for dir.
+ *
+ * Returns the malloced memory used by all the other write calls in this module.
+ */
+SlruSegState *
+AllocSlruWrite(const char *dir, bool long_segment_names)
+{
+	SlruSegState *state = AllocSlruSegState(dir);
+
+	state->writing = true;
+	state->long_segment_names = long_segment_names;
+
+	return state;
+}
+
+/*
+ * Open the given page for writing.
+ *
+ * NOTE: This uses O_EXCL when stepping to a new segment, so this assumes that
+ * each segment is written in full before moving on to the next one.  This
+ * limitation would be easy to lift if needed, but it fits the usage pattern of
+ * current callers.
+ */
+char *
+SlruWriteSwitchPageSlow(SlruSegState *state, uint64 pageno)
+{
+	int64		segno;
+	off_t		offset;
+
+	if (state->segno != -1 && pageno == state->pageno)
+		return state->buf.data;
+
+	segno = pageno / SLRU_PAGES_PER_SEGMENT;
+	offset = (pageno % SLRU_PAGES_PER_SEGMENT) * BLCKSZ;
+
+	SlruFlush(state);
+	memset(state->buf.data, 0, BLCKSZ);
+
+	if (segno != state->segno)
+	{
+		if (state->segno != -1)
+		{
+			close(state->fd);
+			state->fd = -1;
+
+			pg_free(state->fn);
+			state->fn = NULL;
+
+			state->segno = -1;
+		}
+
+		/* Create the segment */
+		state->fn = SlruFileName(state, segno);
+		if ((state->fd = open(state->fn, O_RDWR | O_CREAT | O_EXCL | PG_BINARY,
+							  pg_file_create_mode)) < 0)
+		{
+			pg_fatal("could not create file \"%s\": %m", state->fn);
+		}
+
+		state->segno = segno;
+
+		if (offset > 0)
+		{
+			if (pg_pwrite_zeros(state->fd, offset, 0) < 0)
+				pg_fatal("could not write file \"%s\": %m", state->fn);
+		}
+	}
+
+	state->pageno = pageno;
+
+	return state->buf.data;
+}
+
+static void
+SlruFlush(SlruSegState *state)
+{
+	struct iovec iovec = {
+		.iov_base = &state->buf,
+		.iov_len = BLCKSZ,
+	};
+	off_t		offset;
+
+	if (state->segno == -1)
+		return;
+
+	offset = (state->pageno % SLRU_PAGES_PER_SEGMENT) * BLCKSZ;
+
+	if (pg_pwritev_with_retry(state->fd, &iovec, 1, offset) < 0)
+		pg_fatal("could not write file \"%s\": %m", state->fn);
+}
+
+/*
+ * Frees the malloced writer.
+ */
+void
+FreeSlruWrite(SlruSegState *state)
+{
+	Assert(state->writing);
+
+	SlruFlush(state);
+
+	if (state->fd != -1)
+		close(state->fd);
+	pg_free(state);
+}
diff --git a/src/bin/pg_upgrade/slru_io.h b/src/bin/pg_upgrade/slru_io.h
new file mode 100644
index 00000000000..5c80a679b4d
--- /dev/null
+++ b/src/bin/pg_upgrade/slru_io.h
@@ -0,0 +1,52 @@
+/*
+ * slru_io.h
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ * src/bin/pg_upgrade/slru_io.h
+ */
+
+#ifndef SLRU_IO_H
+#define SLRU_IO_H
+
+/*
+ * State for reading or writing an SLRU, with a one page buffer.
+ */
+typedef struct SlruSegState
+{
+	bool		writing;
+	bool		long_segment_names;
+
+	char	   *dir;
+	char	   *fn;
+	int			fd;
+	int64		segno;
+	uint64		pageno;
+
+	PGAlignedBlock buf;
+} SlruSegState;
+
+extern SlruSegState *AllocSlruRead(const char *dir, bool long_segment_names);
+extern char *SlruReadSwitchPageSlow(SlruSegState *state, uint64 pageno);
+extern void FreeSlruRead(SlruSegState *state);
+
+static inline char *
+SlruReadSwitchPage(SlruSegState *state, uint64 pageno)
+{
+	if (state->segno != -1 && pageno == state->pageno)
+		return state->buf.data;
+	return SlruReadSwitchPageSlow(state, pageno);
+}
+
+extern SlruSegState *AllocSlruWrite(const char *dir, bool long_segment_names);
+extern char *SlruWriteSwitchPageSlow(SlruSegState *state, uint64 pageno);
+extern void FreeSlruWrite(SlruSegState *state);
+
+static inline char *
+SlruWriteSwitchPage(SlruSegState *state, uint64 pageno)
+{
+	if (state->segno != -1 && pageno == state->pageno)
+		return state->buf.data;
+	return SlruWriteSwitchPageSlow(state, pageno);
+}
+
+#endif							/* SLRU_IO_H */
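
Reviewer note (not part of the patch): SlruReadSwitchPageSlow() and SlruWriteSwitchPageSlow() both map a page number to a segment file and a byte offset within it, and SlruFileName() picks between short and long file names. The sketch below reproduces that arithmetic on its own. SKETCH_BLCKSZ and SKETCH_PAGES_PER_SEGMENT are assumed values (8192 and 32) standing in for BLCKSZ and SLRU_PAGES_PER_SEGMENT, and slru_locate() is a hypothetical helper.

```c
#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed build-time values; a real build takes these from its configuration. */
#define SKETCH_BLCKSZ            8192
#define SKETCH_PAGES_PER_SEGMENT 32

/*
 * Compute the segment number and byte offset of a page, and format the
 * segment file name the same way SlruFileName() does.  Returns the length
 * of the formatted name.
 */
static int
slru_locate(uint64_t pageno, bool long_names, char *name, size_t namelen,
            int64_t *segno, int64_t *byte_offset)
{
    *segno = (int64_t) (pageno / SKETCH_PAGES_PER_SEGMENT);
    *byte_offset = (int64_t) (pageno % SKETCH_PAGES_PER_SEGMENT) * SKETCH_BLCKSZ;

    if (long_names)
        return snprintf(name, namelen, "%015" PRIX64, (uint64_t) *segno);
    return snprintf(name, namelen, "%04X", (unsigned int) *segno);
}
```

Under these assumptions page 70 lives in segment 2 at byte offset 49152; the file is named 0002 with short names and 000000000000002 with long names.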
diff --git a/src/bin/pg_upgrade/t/007_multixact_conversion.pl b/src/bin/pg_upgrade/t/007_multixact_conversion.pl
new file mode 100644
index 00000000000..f84bd5668bf
--- /dev/null
+++ b/src/bin/pg_upgrade/t/007_multixact_conversion.pl
@@ -0,0 +1,339 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+# Version 19 expanded MultiXactOffset from 32 to 64 bits. Upgrading
+# across that requires rewriting the SLRU files to the new format.
+# This file contains tests for the conversion.
+#
+# To run, set 'oldinstall' ENV variable to point to a pre-v19
+# installation. If it's not set, or if it points to a v19 or above
+# installation, this still performs a very basic test, upgrading a
+# cluster with some multixacts. It's not very interesting, however,
+# because there's no conversion involved in that case.
+
+use strict;
+use warnings FATAL => 'all';
+
+use Math::BigInt;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Temp dir for dumps.
+my $tempdir = PostgreSQL::Test::Utils::tempdir;
+
+# A workload that consumes multixids. The purpose of this is to
+# generate some multixids in the old cluster, so that we can test
+# upgrading them. The workload is a mix of KEY SHARE locking queries
+# and UPDATEs, and commits and aborts, to generate a mix of multixids
+# with different statuses. It consumes around 3000 multixids with
+# 30000 members. That's enough to span more than one multixid
+# 'offsets' page, and more than one 'members' segment.
+#
+# The workload leaves behind a table called 'mxofftest' containing a
+# small number of rows referencing some of the generated multixids.
+#
+# Because this function is used to generate test data on the old
+# installation, it needs to work with older PostgreSQL server
+# versions.
+#
+# The first argument is the cluster to connect to, the second argument
+# is a cluster using the new version. We need the 'psql' binary from
+# the new version, the new cluster is otherwise unused. (We need to
+# use the new 'psql' because some of the more advanced background psql
+# perl module features depend on a fairly recent psql version.)
+sub mxact_workload
+{
+	my $node = shift;       # Cluster to connect to
+	my $binnode = shift;    # Use the psql binary from this cluster
+
+	my $connstr = $node->connstr('postgres');
+
+	$node->start;
+	$node->safe_psql('postgres', qq[
+		CREATE TABLE mxofftest (id INT PRIMARY KEY, n_updated INT)
+		  WITH (AUTOVACUUM_ENABLED=FALSE);
+		INSERT INTO mxofftest SELECT G, 0 FROM GENERATE_SERIES(1, 50) G;
+	]);
+
+	my $nclients = 20;
+	my $update_every = 13;
+	my $abort_every = 11;
+	my @connections = ();
+
+	# Silence the logging of the statements we run to avoid
+	# unnecessarily bloating the test logs. This runs before the
+	# upgrade we're testing, so the details should not be very
+	# interesting for debugging. But if needed, you can make it more
+	# verbose by setting this.
+	my $verbose = 0;
+
+	# Open multiple connections to the database. Start a transaction
+	# in each connection.
+	for (1 .. $nclients)
+	{
+		# Use the psql binary from the new installation. The
+		# BackgroundPsql functionality doesn't work with older psql
+		# versions.
+		my $conn = $binnode->background_psql('',
+			connstr => $node->connstr('postgres'));
+
+		$conn->query_safe("SET log_statement=none", verbose => $verbose) unless $verbose;
+		$conn->query_safe("SET enable_seqscan=off", verbose => $verbose);
+		$conn->query_safe("BEGIN", verbose => $verbose);
+
+		push(@connections, $conn);
+	}
+
+	# Run queries, cycling through the connections in a round-robin
+	# fashion. We keep a transaction open in each connection at all
+	# times, and lock/update the rows. With $nclients connections,
+	# each SELECT FOR KEY SHARE query generates a new multixid,
+	# containing the XIDs of all the transactions running at the
+	# time.
+	for (my $i = 0; $i < 3000; $i++)
+	{
+		my $conn = $connections[ $i % $nclients ];
+
+		my $sql;
+		if ($i % $abort_every == 0)
+		{
+			$sql = "ABORT; ";
+		}
+		else
+		{
+			$sql = "COMMIT; ";
+		}
+		$sql .= "BEGIN; ";
+
+		if ($i % $update_every == 0)
+		{
+			$sql .= qq[
+			  UPDATE mxofftest SET n_updated = n_updated + 1 WHERE id = ${i} % 50;
+			];
+		}
+		else
+		{
+			my $threshold = int($i / 3000 * 50);
+			$sql .= qq[
+			  select count(*) from (
+				SELECT * FROM mxofftest WHERE id >= $threshold FOR KEY SHARE
+			  ) as x
+			];
+		}
+		$conn->query_safe($sql, verbose => $verbose);
+	}
+
+	for my $conn (@connections)
+	{
+		$conn->quit();
+	}
+
+	$node->stop;
+	return;
+}
+
+# Return contents of the 'mxofftest' table, created by mxact_workload
+sub get_test_table_contents
+{
+	my ($node, $file_prefix) = @_;
+
+	my $contents = $node->safe_psql('postgres',
+		"SELECT ctid, xmin, xmax, * FROM mxofftest");
+
+	my $dumpfile = $tempdir . '/' . $file_prefix . '.sql';
+	open(my $dh, '>', $dumpfile)
+	  || die "could not open $dumpfile for writing $!";
+	print $dh $contents;
+	close($dh);
+
+	return $dumpfile;
+}
+
+# Read NextMultiOffset from the control file
+#
+# Note: This is used on both the old and the new installation, so the
+# command arguments and the output parsing used here must work with
+# all PostgreSQL versions supported by the test.
+sub read_next_mxoff
+{
+	my $node = shift;
+
+	my $pg_controldata_path = $node->installed_command('pg_controldata');
+	my ($stdout, $stderr) =
+	  run_command([ $pg_controldata_path, $node->data_dir ]);
+	$stdout =~ /^Latest checkpoint's NextMultiOffset:\s*(.*)$/m
+	  or die "could not read NextMultiOffset from pg_controldata";
+	return $1;
+}
+
+# Reset a cluster's next multixact offset to the given offset.
+#
+# Note: This is used on both the old and the new installation, so the
+# command arguments and the output parsing used here must work with
+# all PostgreSQL versions supported by the test.
+sub reset_mxoff_pre_v19
+{
+	my $node = shift;
+	my $offset = shift;
+
+	my $pg_resetwal_path = $node->installed_command('pg_resetwal');
+	# Get block size
+	my ($out, $err) =
+	  run_command([ $pg_resetwal_path, '--dry-run', $node->data_dir ]);
+	$out =~ /^Database block size: *(\d+)$/m or die;
+	my $blcksz = $1;
+	# SLRU_PAGES_PER_SEGMENT is always 32 on pre-19 versions
+	my $slru_pages_per_segment = 32;
+
+	# Verify that no multixids are currently in use. Resetting would
+	# destroy them. (A freshly initialized cluster has no multixids.)
+	$out =~ /^Latest checkpoint's NextMultiXactId: *(\d+)$/m or die;
+	my $next_mxid = $1;
+	$out =~ /^Latest checkpoint's oldestMultiXid: *(\d+)$/m or die;
+	my $oldest_mxid = $1;
+	die "cluster has some multixids in use" unless $next_mxid == $oldest_mxid;
+
+	# Reset to new offset using pg_resetwal
+	my @cmd = (
+		$pg_resetwal_path,
+		'--pgdata' => $node->data_dir,
+		'--multixact-offset' => $offset);
+	command_ok(\@cmd, 'set oldest multixact-offset');
+
+	# pg_resetwal just updates the control file. The cluster will
+	# refuse to start up, if the SLRU segment corresponding to the
+	# offset does not exist. Create a dummy segment that covers the
+	# given offset, filled with zeros. But first remove any old
+	# segments.
+	unlink glob $node->data_dir . "/pg_multixact/members/*";
+
+	my $mult = 32 * int($blcksz / 20) * 4;
+	my $segname = sprintf "%04X", $offset / $mult;
+
+	my $path = $node->data_dir . "/pg_multixact/members/" . $segname;
+
+	my $null_block = "\x00" x $blcksz;
+	open(my $dh, '>', $path)
+	  || die "could not open $path for writing $!";
+	for (1 .. $slru_pages_per_segment)
+	{
+		print $dh $null_block;
+	}
+	close($dh);
+}
+
+# Main test workhorse routine.
+# Dump data on old version, run pg_upgrade, compare data after upgrade.
+sub upgrade_and_compare
+{
+	my $tag = shift;
+	my $oldnode = shift;
+	my $newnode = shift;
+
+	command_ok(
+		[
+			'pg_upgrade', '--no-sync',
+			'--old-datadir' => $oldnode->data_dir,
+			'--new-datadir' => $newnode->data_dir,
+			'--old-bindir' => $oldnode->config_data('--bindir'),
+			'--new-bindir' => $newnode->config_data('--bindir'),
+			'--socketdir' => $newnode->host,
+			'--old-port' => $oldnode->port,
+			'--new-port' => $newnode->port,
+		],
+		'run of pg_upgrade for new instance');
+
+	# Note: we do this *after* running pg_upgrade, to ensure that we
+	# don't set all the hint bits before upgrade by doing the SELECT
+	# on the table.
+	$oldnode->start;
+	my $old_dump = get_test_table_contents($oldnode, "oldnode_${tag}_dump");
+	$oldnode->stop;
+
+	$newnode->start;
+	my $new_dump = get_test_table_contents($newnode, "newnode_${tag}_dump");
+	$newnode->stop;
+
+	compare_files($old_dump, $new_dump,
+		'test table contents from original and upgraded databases match');
+}
+
+my $old_version;
+
+# Basic scenario: Create a cluster using old installation, run
+# multixid-creating workload on it, then upgrade.
+#
+# This works even if the old and new versions are the same,
+# although it's not very interesting as the conversion routines only
+# run when upgrading from a pre-v19 cluster.
+{
+	my $tag = 'basic';
+	my $old =
+	  PostgreSQL::Test::Cluster->new("${tag}_oldnode",
+		install_path => $ENV{oldinstall});
+	my $new = PostgreSQL::Test::Cluster->new("${tag}_newnode");
+
+	$old->init(extra => ['-k']);
+
+	$old_version = $old->pg_version;
+	note "old installation is version $old_version\n";
+
+	# Run the workload
+	my $start_mxoff = read_next_mxoff($old);
+	mxact_workload($old, $new);
+	my $finish_mxoff = read_next_mxoff($old);
+
+	$new->init;
+	upgrade_and_compare($tag, $old, $new);
+
+	my $new_next_mxoff = read_next_mxoff($new);
+
+	note ">>> case #${tag}\n"
+	  . " oldnode mxoff from ${start_mxoff} to ${finish_mxoff}\n"
+	  . " newnode mxoff ${new_next_mxoff}\n";
+}
+
+# Wraparound scenario: This is the same as the basic scenario, but the
+# old cluster goes through mxoffset wraparound.
+#
+# This requires the old installation to be version 18 or older,
+# because the hacks we use to reset the old cluster to a state just
+# before the wraparound rely on the pre-v19 file format. In version
+# 19, offsets no longer wrap around anyway.
+SKIP:
+{
+	skip
+	  "skipping mxoffset conversion tests because upgrading from the old version does not require conversion"
+	  if ($old_version >= '19devel');
+
+	my $tag = 'wraparound';
+	my $old =
+	  PostgreSQL::Test::Cluster->new("${tag}_oldnode",
+		install_path => $ENV{oldinstall});
+	my $new = PostgreSQL::Test::Cluster->new("${tag}_newnode");
+
+	$old->init(extra => ['-k']);
+
+	# Reset the NextMultiOffset value in the old cluster to just before 32-bit wraparound.
+	reset_mxoff_pre_v19($old, 0xFFFFEC77);
+
+	# Run the workload. This crosses the wraparound.
+	my $start_mxoff = read_next_mxoff($old);
+	mxact_workload($old, $new);
+	my $finish_mxoff = read_next_mxoff($old);
+
+	# Verify that wraparound happened.
+	cmp_ok($finish_mxoff, '<', $start_mxoff,
+		"mxoff wrapped around in old cluster");
+
+	$new->init;
+	upgrade_and_compare($tag, $old, $new);
+
+	my $new_next_mxoff = read_next_mxoff($new);
+
+	note ">>> case #${tag}\n"
+	  . " oldnode mxoff from ${start_mxoff} to ${finish_mxoff}\n"
+	  . " newnode mxoff ${new_next_mxoff}\n";
+}
+
+done_testing();
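
Reviewer note (not part of the patch): reset_mxoff_pre_v19() in the test above derives the pre-v19 'members' segment name from the offset using 32 pages per segment and 20-byte member groups holding 4 members each. The sketch below works through that arithmetic for the 0xFFFFEC77 value used in the wraparound scenario. The function names are hypothetical, and the block size passed in is an assumption of the caller.

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Members per pre-v19 'members' segment: 32 pages, each holding
 * (blcksz / 20) groups of 4 flag bytes plus four 4-byte XIDs.
 */
static uint32_t
members_per_segment(uint32_t blcksz)
{
    return 32 * (blcksz / 20) * 4;
}

/* Segment number that contains the given 32-bit member offset. */
static uint32_t
member_segment_number(uint32_t offset, uint32_t blcksz)
{
    return offset / members_per_segment(blcksz);
}

/* How many offsets can still be handed out before a 32-bit counter wraps. */
static uint64_t
offsets_until_wraparound(uint32_t offset)
{
    return (UINT64_C(1) << 32) - offset;
}

/* Format the segment name as the test does ("%04X"); returns its length. */
static int
member_segment_name(uint32_t offset, uint32_t blcksz, char *name, size_t namelen)
{
    return snprintf(name, namelen, "%04X",
                    (unsigned int) member_segment_number(offset, blcksz));
}
```

Assuming an 8192-byte block size, a segment holds 52352 members, offset 0xFFFFEC77 falls in segment 14078, and only 5001 offsets remain before the 32-bit counter wraps. The workload is described as consuming around 30000 members, which is why it crosses the wraparound.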
diff --git a/src/include/access/multixact.h b/src/include/access/multixact.h
index 82e4bb90dd5..6433fe16364 100644
--- a/src/include/access/multixact.h
+++ b/src/include/access/multixact.h
@@ -28,8 +28,6 @@
 
 #define MultiXactIdIsValid(multi) ((multi) != InvalidMultiXactId)
 
-#define MaxMultiXactOffset	((MultiXactOffset) 0xFFFFFFFF)
-
 /*
  * Possible multixact lock modes ("status").  The first four modes are for
  * tuple locks (FOR KEY SHARE, FOR SHARE, FOR NO KEY UPDATE, FOR UPDATE); the
@@ -111,7 +109,7 @@ extern bool MultiXactIdIsRunning(MultiXactId multi, bool isLockOnly);
 extern void MultiXactIdSetOldestMember(void);
 extern int	GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
 								  bool from_pgupgrade, bool isLockOnly);
-extern bool GetMultiXactInfo(uint32 *multixacts, MultiXactOffset *members,
+extern void GetMultiXactInfo(uint32 *multixacts, MultiXactOffset *members,
 							 MultiXactId *oldestMultiXactId,
 							 MultiXactOffset *oldestOffset);
 extern bool MultiXactIdPrecedes(MultiXactId multi1, MultiXactId multi2);
@@ -131,8 +129,7 @@ extern void BootStrapMultiXact(void);
 extern void StartupMultiXact(void);
 extern void TrimMultiXact(void);
 extern void SetMultiXactIdLimit(MultiXactId oldest_datminmxid,
-								Oid oldest_datoid,
-								bool is_startup);
+								Oid oldest_datoid);
 extern void MultiXactGetCheckptMulti(bool is_shutdown,
 									 MultiXactId *nextMulti,
 									 MultiXactOffset *nextMultiOffset,
diff --git a/src/include/access/multixact_internal.h b/src/include/access/multixact_internal.h
index 9b56deaef31..c4dd1aa044f 100644
--- a/src/include/access/multixact_internal.h
+++ b/src/include/access/multixact_internal.h
@@ -21,17 +21,9 @@
 /*
  * Defines for MultiXactOffset page sizes.  A page is the same BLCKSZ as is
  * used everywhere else in Postgres.
- *
- * Note: because MultiXactOffsets are 32 bits and wrap around at 0xFFFFFFFF,
- * MultiXact page numbering also wraps around at
- * 0xFFFFFFFF/MULTIXACT_OFFSETS_PER_PAGE, and segment numbering at
- * 0xFFFFFFFF/MULTIXACT_OFFSETS_PER_PAGE/SLRU_PAGES_PER_SEGMENT.  We need
- * take no explicit notice of that fact in this module, except when comparing
- * segment and page numbers in TruncateMultiXact (see
- * MultiXactOffsetPagePrecedes).
  */
 
-/* We need four bytes per offset */
+/* We need 8 bytes per offset */
 #define MULTIXACT_OFFSETS_PER_PAGE (BLCKSZ / sizeof(MultiXactOffset))
 
 static inline int64
@@ -80,19 +72,6 @@ MultiXactIdToOffsetSegment(MultiXactId multi)
 #define MULTIXACT_MEMBERS_PER_PAGE	\
 	(MULTIXACT_MEMBERGROUPS_PER_PAGE * MULTIXACT_MEMBERS_PER_MEMBERGROUP)
 
-/*
- * Because the number of items per page is not a divisor of the last item
- * number (member 0xFFFFFFFF), the last segment does not use the maximum number
- * of pages, and moreover the last used page therein does not use the same
- * number of items as previous pages.  (Another way to say it is that the
- * 0xFFFFFFFF member is somewhere in the middle of the last page, so the page
- * has some empty space after that item.)
- *
- * This constant is the number of members in the last page of the last segment.
- */
-#define MAX_MEMBERS_IN_LAST_MEMBERS_PAGE \
-		((uint32) ((0xFFFFFFFF % MULTIXACT_MEMBERS_PER_PAGE) + 1))
-
 /* page in which a member is to be found */
 static inline int64
 MXOffsetToMemberPage(MultiXactOffset offset)
diff --git a/src/include/c.h b/src/include/c.h
index ccd2b654d45..62cbf7a2eec 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -669,7 +669,7 @@ typedef uint32 SubTransactionId;
 /* MultiXactId must be equivalent to TransactionId, to fit in t_xmax */
 typedef TransactionId MultiXactId;
 
-typedef uint32 MultiXactOffset;
+typedef uint64 MultiXactOffset;
 
 typedef uint32 CommandId;
 
diff --git a/src/test/modules/test_slru/t/002_multixact_wraparound.pl b/src/test/modules/test_slru/t/002_multixact_wraparound.pl
index 169333fc564..272d8e6fb08 100644
--- a/src/test/modules/test_slru/t/002_multixact_wraparound.pl
+++ b/src/test/modules/test_slru/t/002_multixact_wraparound.pl
@@ -37,7 +37,7 @@ my $slru_pages_per_segment = $1;
 
 # initialize the 'offsets' SLRU file containing the new next multixid
 # with zeros
-my $multixact_offsets_per_page = $blcksz / 4;   # sizeof(MultiXactOffset) == 4
+my $multixact_offsets_per_page = $blcksz / 8;   # sizeof(MultiXactOffset) == 8
 my $segno =
   int(0xFFFFFFF8 / $multixact_offsets_per_page / $slru_pages_per_segment);
 my $slru_file = sprintf('%s/pg_multixact/offsets/%04X', $node_pgdata, $segno);
diff --git a/src/test/perl/PostgreSQL/Test/BackgroundPsql.pm b/src/test/perl/PostgreSQL/Test/BackgroundPsql.pm
index 60bbd5dd445..9825aaa9bb4 100644
--- a/src/test/perl/PostgreSQL/Test/BackgroundPsql.pm
+++ b/src/test/perl/PostgreSQL/Test/BackgroundPsql.pm
@@ -230,18 +230,23 @@ Executes a query in the current session and returns the output in scalar
 context and (output, error) in list context where error is 1 in case there
 was output generated on stderr when executing the query.
 
+By default, the query and its results are printed to the test output. This
+can be disabled by passing the keyword parameter verbose => 0.
+
 =cut
 
 sub query
 {
-	my ($self, $query) = @_;
+	my ($self, $query, %params) = @_;
 	my $ret;
 	my $output;
 	my $query_cnt = $self->{query_cnt}++;
 
+	$params{verbose} = 1 unless defined $params{verbose};
+
 	local $Test::Builder::Level = $Test::Builder::Level + 1;
 
-	note "issuing query $query_cnt via background psql: $query";
+	note "issuing query $query_cnt via background psql: $query" if $params{verbose};
 
 	$self->{timeout}->start() if (defined($self->{query_timer_restart}));
 
@@ -280,7 +285,7 @@ sub query
 	  explain {
 		stdout => $self->{stdout},
 		stderr => $self->{stderr},
-	  };
+	  } unless !$params{verbose};
 
 	# Remove banner from stdout and stderr, our caller doesn't care.  The
 	# first newline is optional, as there would not be one if consuming an
@@ -308,9 +313,9 @@ Query failure is determined by it producing output on stderr.
 
 sub query_safe
 {
-	my ($self, $query) = @_;
+	my ($self, $query, %params) = @_;
 
-	my $ret = $self->query($query);
+	my $ret = $self->query($query, %params);
 
 	if ($self->{stderr} ne "")
 	{
diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index 747528c4af1..295988b8b87 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -1793,13 +1793,20 @@ sub _get_env
 	return (%inst_env);
 }
 
-# Private routine to get an installation path qualified command.
-#
-# IPC::Run maintains a cache, %cmd_cache, mapping commands to paths.  Tests
-# which use nodes spanning more than one postgres installation path need to
-# avoid confusing which installation's binaries get run.  Setting $ENV{PATH} is
-# insufficient, as IPC::Run does not check to see if the path has changed since
-# caching a command.
+=pod
+
+=item $node->installed_command(cmd)
+
+Get an installation path qualified command.
+
+IPC::Run maintains a cache, %cmd_cache, mapping commands to paths.  Tests
+which use nodes spanning more than one postgres installation path need to
+avoid confusing which installation's binaries get run.  Setting $ENV{PATH} is
+insufficient, as IPC::Run does not check to see if the path has changed since
+caching a command.
+
+=cut
+
 sub installed_command
 {
 	my ($self, $cmd) = @_;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index c1ad80a418d..f69e68e6dbd 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1731,6 +1731,7 @@ MultiXactMember
 MultiXactOffset
 MultiXactStateData
 MultiXactStatus
+MultiXactWriter
 MultirangeIOData
 MultirangeParseState
 MultirangeType
@@ -1816,6 +1817,7 @@ OffsetVarNodes_context
 Oid
 OidOptions
 OkeysState
+OldMultiXactReader
 OldToNewMapping
 OldToNewMappingData
 OnCommitAction
@@ -2814,6 +2816,7 @@ SlruCtlData
 SlruErrorCause
 SlruPageStatus
 SlruScanCallback
+SlruSegState
 SlruShared
 SlruSharedData
 SlruWriteAll
-- 
2.47.3

v28-0006-Add-runtime-checks-for-bogus-multixact-offsets.patch (text/x-patch; charset=UTF-8)
From 706d0421ae23119382bbe41bf38570b1b4cb6edf Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Thu, 4 Dec 2025 15:31:39 +0200
Subject: [PATCH v28 6/6] Add runtime checks for bogus multixact offsets

These are not directly related to 64-bit offsets, but they make sense, I think.
---
 src/backend/access/transam/multixact.c | 33 ++++++++++++++++----------
 1 file changed, 21 insertions(+), 12 deletions(-)

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index dffa0c8e7d4..dc9c4257a98 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1154,6 +1154,7 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
 	int			slotno;
 	MultiXactOffset *offptr;
 	MultiXactOffset offset;
+	MultiXactOffset nextMXOffset;
 	int			length;
 	MultiXactId oldestMXact;
 	MultiXactId nextMXact;
@@ -1245,12 +1246,14 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
 	offptr += entryno;
 	offset = *offptr;
 
-	Assert(offset != 0);
+	if (offset == 0)
+		ereport(ERROR,
+				(errcode(ERRCODE_DATA_CORRUPTED),
+				 errmsg("MultiXact %u has invalid offset", multi)));
 
 	/* read next multi's offset */
 	{
 		MultiXactId tmpMXact;
-		MultiXactOffset nextMXOffset;
 
 		/* handle wraparound if needed */
 		tmpMXact = multi + 1;
@@ -1284,21 +1287,27 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
 		offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
 		offptr += entryno;
 		nextMXOffset = *offptr;
-
-		if (nextMXOffset == 0)
-			ereport(ERROR,
-					(errcode(ERRCODE_DATA_CORRUPTED),
-					 errmsg("MultiXact %u has invalid next offset",
-							multi)));
-
-		length = nextMXOffset - offset;
 	}
 
 	LWLockRelease(lock);
 	lock = NULL;
 
-	/* A multixid with zero members should not happen */
-	Assert(length > 0);
+	/* Sanity check the next offset */
+	if (nextMXOffset == 0)
+		ereport(ERROR,
+				(errcode(ERRCODE_DATA_CORRUPTED),
+				 errmsg("MultiXact %u has invalid next offset", multi)));
+	if (nextMXOffset < offset)
+		ereport(ERROR,
+				(errcode(ERRCODE_DATA_CORRUPTED),
+				 errmsg("MultiXact %u has offset (%" PRIu64 ") greater than its next offset (%" PRIu64 ")",
+						multi, offset, nextMXOffset)));
+	if (nextMXOffset - offset > INT32_MAX)
+		ereport(ERROR,
+				(errcode(ERRCODE_DATA_CORRUPTED),
+				 errmsg("MultiXact %u has too many members (%" PRIu64 ")",
+						multi, nextMXOffset - offset)));
+	length = nextMXOffset - offset;
 
 	/* read the members */
 	ptr = (MultiXactMember *) palloc(length * sizeof(MultiXactMember));
-- 
2.47.3

#86Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Heikki Linnakangas (#85)
6 attachment(s)
Re: POC: make mxidoff 64 bits

On 06/12/2025 01:36, Heikki Linnakangas wrote:

On 05/12/2025 15:42, Ashutosh Bapat wrote:

+ $newnode->start;
+ my $new_dump = get_dump_for_comparison($newnode, "newnode_${tag}_dump");
+ $newnode->stop;

There is no code which actually looks at the multixact offsets here to
make sure that the conversion happened correctly. I guess the test
relies on visibility checks for that. Anyway, we need a comment
explaining why just comparing the contents of the table is enough to
ensure correct conversion. Better if we can add an explicit test that
the offsets were converted correctly. I don't have any idea of how to
do that right now, though. Maybe use pg_get_multixact_members()
somehow in the query to extract data out of the table?

Agreed, the verification here is quite weak. I didn't realize that
pg_get_multixact_members() exists! That might indeed be handy here, but
I'm not sure how exactly to construct the test. A direct C function like
test_create_multixact() in test_multixact.c would be handy here, but
we'd need to compile and run that in the old cluster, which seems
difficult.

I added verification of all the multixids between oldest and next
multixid, using pg_get_multixact_members(). The test now calls
pg_get_multixact_members() for all updating multixids in the range,
before and after the upgrade, and compares the results.

The verification ignores locking-only multixids. Verifying their
correctness would need a little more code because they're not fully
preserved by the upgrade.

I also expanded the test to cover multixid wraparound. It only covered
mxoffset wraparound previously.
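
As an illustration of what "all the multixids between oldest and next" means once wraparound is involved, here is a minimal C sketch. It is not the test code (the test itself is written in Perl and calls pg_get_multixact_members()); the helper names are hypothetical, and it assumes both bounds are valid, nonzero multixids.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed stand-ins for the server definitions. */
typedef uint32_t MultiXactId;
#define FirstMultiXactId ((MultiXactId) 1)

/* Advance to the next multixid, skipping the invalid value 0 on wraparound. */
static MultiXactId
next_multixact_id(MultiXactId multi)
{
    multi++;
    if (multi < FirstMultiXactId)
        multi = FirstMultiXactId;
    return multi;
}

/*
 * Collect every multixid in [oldest, next), following wraparound.
 * Returns the number of IDs stored; stops early if 'out' is full.
 * Assumption: both bounds are valid (nonzero) multixids.
 */
static size_t
collect_multixact_range(MultiXactId oldest, MultiXactId next,
                        MultiXactId *out, size_t out_cap)
{
    size_t      n = 0;

    assert(out != NULL);
    assert(oldest >= FirstMultiXactId && next >= FirstMultiXactId);

    for (MultiXactId multi = oldest; multi != next;
         multi = next_multixact_id(multi))
    {
        if (n == out_cap)
            break;
        out[n++] = multi;
    }
    return n;
}
```

With oldest = 0xFFFFFFFE and next = 3, the range covers 0xFFFFFFFE, 0xFFFFFFFF, 1 and 2, which is the kind of range a wraparound test has to visit.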

New patch set attached. Only the tests have changed compared to patch set v28.

- Heikki

Attachments:

v29-0001-pg_resetwal-Reject-negative-and-out-of-range-arg.patch (text/x-patch; charset=UTF-8)
From cac4f465c897e9b823f27473b44d77c0dc1a9d7b Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Wed, 19 Nov 2025 16:36:00 +0200
Subject: [PATCH v29 1/6] pg_resetwal: Reject negative and out of range
 arguments

The strtoul() function that we used to parse many of the options
accepts negative values, and silently wraps them to the equivalent
unsigned values. For example, -1 becomes 0xFFFFFFFF, on platforms
where unsigned long is 32 bits wide. Also, on platforms where
"unsigned long" is 64 bits wide, we silently cast values larger than
UINT32_MAX to the equivalent 32-bit value. Both of those behaviors
seem undesirable, so tighten up the parsing to reject negative and
too large values.
---
 src/bin/pg_resetwal/pg_resetwal.c  | 64 ++++++++++++++++++++++++------
 src/bin/pg_resetwal/t/001_basic.pl | 19 ++++++++-
 2 files changed, 68 insertions(+), 15 deletions(-)

diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index 8d5d9805279..8ca8dad01a0 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -92,6 +92,7 @@ static void KillExistingArchiveStatus(void);
 static void KillExistingWALSummaries(void);
 static void WriteEmptyXLOG(void);
 static void usage(void);
+static uint32 strtouint32_strict(const char *restrict s, char **restrict endptr, int base);
 
 
 int
@@ -120,7 +121,6 @@ main(int argc, char *argv[])
 	MultiXactId set_oldestmxid = 0;
 	char	   *endptr;
 	char	   *endptr2;
-	int64		tmpi64;
 	char	   *DataDir = NULL;
 	char	   *log_fname = NULL;
 	int			fd;
@@ -162,7 +162,7 @@ main(int argc, char *argv[])
 
 			case 'e':
 				errno = 0;
-				set_xid_epoch = strtoul(optarg, &endptr, 0);
+				set_xid_epoch = strtouint32_strict(optarg, &endptr, 0);
 				if (endptr == optarg || *endptr != '\0' || errno != 0)
 				{
 					/*------
@@ -177,7 +177,7 @@ main(int argc, char *argv[])
 
 			case 'u':
 				errno = 0;
-				set_oldest_xid = strtoul(optarg, &endptr, 0);
+				set_oldest_xid = strtouint32_strict(optarg, &endptr, 0);
 				if (endptr == optarg || *endptr != '\0' || errno != 0)
 				{
 					pg_log_error("invalid argument for option %s", "-u");
@@ -190,7 +190,7 @@ main(int argc, char *argv[])
 
 			case 'x':
 				errno = 0;
-				set_xid = strtoul(optarg, &endptr, 0);
+				set_xid = strtouint32_strict(optarg, &endptr, 0);
 				if (endptr == optarg || *endptr != '\0' || errno != 0)
 				{
 					pg_log_error("invalid argument for option %s", "-x");
@@ -203,7 +203,7 @@ main(int argc, char *argv[])
 
 			case 'c':
 				errno = 0;
-				set_oldest_commit_ts_xid = strtoul(optarg, &endptr, 0);
+				set_oldest_commit_ts_xid = strtouint32_strict(optarg, &endptr, 0);
 				if (endptr == optarg || *endptr != ',' || errno != 0)
 				{
 					pg_log_error("invalid argument for option %s", "-c");
@@ -229,7 +229,7 @@ main(int argc, char *argv[])
 
 			case 'o':
 				errno = 0;
-				set_oid = strtoul(optarg, &endptr, 0);
+				set_oid = strtouint32_strict(optarg, &endptr, 0);
 				if (endptr == optarg || *endptr != '\0' || errno != 0)
 				{
 					pg_log_error("invalid argument for option %s", "-o");
@@ -242,7 +242,7 @@ main(int argc, char *argv[])
 
 			case 'm':
 				errno = 0;
-				set_mxid = strtoul(optarg, &endptr, 0);
+				set_mxid = strtouint32_strict(optarg, &endptr, 0);
 				if (endptr == optarg || *endptr != ',' || errno != 0)
 				{
 					pg_log_error("invalid argument for option %s", "-m");
@@ -250,7 +250,7 @@ main(int argc, char *argv[])
 					exit(1);
 				}
 
-				set_oldestmxid = strtoul(endptr + 1, &endptr2, 0);
+				set_oldestmxid = strtouint32_strict(endptr + 1, &endptr2, 0);
 				if (endptr2 == endptr + 1 || *endptr2 != '\0' || errno != 0)
 				{
 					pg_log_error("invalid argument for option %s", "-m");
@@ -269,17 +269,13 @@ main(int argc, char *argv[])
 
 			case 'O':
 				errno = 0;
-				tmpi64 = strtoi64(optarg, &endptr, 0);
+				set_mxoff = strtouint32_strict(optarg, &endptr, 0);
 				if (endptr == optarg || *endptr != '\0' || errno != 0)
 				{
 					pg_log_error("invalid argument for option %s", "-O");
 					pg_log_error_hint("Try \"%s --help\" for more information.", progname);
 					exit(1);
 				}
-				if (tmpi64 < 0 || tmpi64 > (int64) MaxMultiXactOffset)
-					pg_fatal("multitransaction offset (-O) must be between 0 and %u", MaxMultiXactOffset);
-
-				set_mxoff = (MultiXactOffset) tmpi64;
 				mxoff_given = true;
 				break;
 
@@ -1214,3 +1210,45 @@ usage(void)
 	printf(_("\nReport bugs to <%s>.\n"), PACKAGE_BUGREPORT);
 	printf(_("%s home page: <%s>\n"), PACKAGE_NAME, PACKAGE_URL);
 }
+
+/*
+ * strtouint32_strict -- like strtoul(), but returns uint32 and doesn't accept
+ * negative values
+ */
+static uint32
+strtouint32_strict(const char *restrict s, char **restrict endptr, int base)
+{
+	unsigned long val;
+	bool		is_neg;
+
+	/* skip leading whitespace */
+	while (isspace((unsigned char) *s))
+		s++;
+
+	/*
+	 * Is it negative?  We still call strtoul() if it was, to set 'endptr'.
+	 * (The current callers don't care though.)
+	 */
+	is_neg = (*s == '-');
+
+	val = strtoul(s, endptr, base);
+
+	/* reject if it was negative */
+	if (errno == 0 && is_neg)
+	{
+		errno = ERANGE;
+		val = 0;
+	}
+
+	/*
+	 * reject values larger than UINT32_MAX on platforms where long is 64 bits
+	 * wide.
+	 */
+	if (errno == 0 && val != (uint32) val)
+	{
+		errno = ERANGE;
+		val = UINT32_MAX;
+	}
+
+	return (uint32) val;
+}
diff --git a/src/bin/pg_resetwal/t/001_basic.pl b/src/bin/pg_resetwal/t/001_basic.pl
index 90ecb8afe18..e9780dbe2a6 100644
--- a/src/bin/pg_resetwal/t/001_basic.pl
+++ b/src/bin/pg_resetwal/t/001_basic.pl
@@ -103,7 +103,7 @@ command_fails_like(
 	'fails with incorrect -e option');
 command_fails_like(
 	[ 'pg_resetwal', '-e' => '-1', $node->data_dir ],
-	qr/must not be -1/,
+	qr/error: invalid argument for option -e/,
 	'fails with -e value -1');
 # -l
 command_fails_like(
@@ -145,7 +145,7 @@ command_fails_like(
 	'fails with incorrect -O option');
 command_fails_like(
 	[ 'pg_resetwal', '-O' => '-1', $node->data_dir ],
-	qr/must be between 0 and 4294967295/,
+	qr/error: invalid argument for option -O/,
 	'fails with -O value -1');
 # --wal-segsize
 command_fails_like(
@@ -175,6 +175,21 @@ command_fails_like(
 	qr/must be greater than/,
 	'fails with -x value too small');
 
+# Check out of range values with -x. These are forbidden for all other
+# 32-bit values too, but we use just -x to exercise the parsing.
+command_fails_like(
+	[ 'pg_resetwal', '-x' => '-1', $node->data_dir ],
+	qr/error: invalid argument for option -x/,
+	'fails with -x value -1');
+command_fails_like(
+	[ 'pg_resetwal', '-x' => '-100', $node->data_dir ],
+	qr/error: invalid argument for option -x/,
+	'fails with negative -x value');
+command_fails_like(
+	[ 'pg_resetwal', '-x' => '10000000000', $node->data_dir ],
+	qr/error: invalid argument for option -x/,
+	'fails with -x value too large');
+
 # --char-signedness
 command_fails_like(
 	[ 'pg_resetwal', '--char-signedness', 'foo', $node->data_dir ],
-- 
2.47.3
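
For readers who want to try the parsing behavior described in the commit message above outside of pg_resetwal, the following is a simplified, standalone variant. It is a sketch under assumptions, not the patch's strtouint32_strict(): the name parse_uint32_strict and the bool-returning interface are hypothetical, and it requires the whole string to be consumed rather than exposing endptr to the caller.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Parse an unsigned 32-bit value (base auto-detected, as with base 0 in
 * strtoul).  Unlike plain strtoul(), reject negative input and values
 * above UINT32_MAX instead of silently wrapping or truncating them.
 */
static bool
parse_uint32_strict(const char *s, uint32_t *result)
{
    char       *endptr;
    unsigned long val;

    assert(s != NULL && result != NULL);

    /* strtoul() skips leading whitespace itself; do it here to see the sign */
    while (isspace((unsigned char) *s))
        s++;
    if (*s == '-')
        return false;

    errno = 0;
    val = strtoul(s, &endptr, 0);
    if (errno != 0 || endptr == s || *endptr != '\0')
        return false;

    /* Only reachable where unsigned long is wider than 32 bits. */
    if (val > UINT32_MAX)
        return false;

    *result = (uint32_t) val;
    return true;
}
```

A caller would treat a false return the same way the patched option handling treats a nonzero errno: report "invalid argument" and exit.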

v29-0002-pg_resetwal-Use-separate-flags-for-whether-an-op.patch (text/x-patch; charset=UTF-8)
From 7c7e0ad12a1a575b7db0d0da37e463aaa77c2528 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Wed, 3 Dec 2025 20:48:48 +0200
Subject: [PATCH v29 2/6] pg_resetwal: Use separate flags for whether an option
 is given

Currently, we use special values that are otherwise invalid for each
option to indicate "option was not given". Replace that with separate
boolean variables for each option. Being explicit seems clearer.

We were already doing that for the -m option, because there were no
invalid values for nextMulti that we could use (since commit
94939c5f3a).
---
 src/bin/pg_resetwal/pg_resetwal.c | 166 +++++++++++++++++-------------
 1 file changed, 95 insertions(+), 71 deletions(-)

diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index 8ca8dad01a0..c667a11cb6a 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -64,21 +64,43 @@ static ControlFileData ControlFile; /* pg_control values */
 static XLogSegNo newXlogSegNo;	/* new XLOG segment # */
 static bool guessed = false;	/* T if we had to guess at any values */
 static const char *progname;
-static uint32 set_xid_epoch = (uint32) -1;
-static TransactionId set_oldest_xid = 0;
-static TransactionId set_xid = 0;
-static TransactionId set_oldest_commit_ts_xid = 0;
-static TransactionId set_newest_commit_ts_xid = 0;
-static Oid	set_oid = 0;
-static bool mxid_given = false;
-static MultiXactId set_mxid = 0;
-static bool mxoff_given = false;
-static MultiXactOffset set_mxoff = 0;
+
+/*
+ * New values given on the command-line
+ */
+static bool next_xid_epoch_given = false;
+static uint32 next_xid_epoch_val;
+
+static bool oldest_xid_given = false;
+static TransactionId oldest_xid_val;
+
+static bool next_xid_given = false;
+static TransactionId next_xid_val;
+
+static bool commit_ts_xids_given = false;
+static TransactionId oldest_commit_ts_xid_val;
+static TransactionId newest_commit_ts_xid_val;
+
+static bool next_oid_given = false;
+static Oid	next_oid_val;
+
+static bool mxids_given = false;
+static MultiXactId next_mxid_val;
+static MultiXactId oldest_mxid_val = 0;
+
+static bool next_mxoff_given = false;
+static MultiXactOffset next_mxoff_val;
+
+static bool wal_segsize_given = false;
+static int	wal_segsize_val;
+
+static bool char_signedness_given = false;
+static bool char_signedness_val;
+
+
 static TimeLineID minXlogTli = 0;
 static XLogSegNo minXlogSegNo = 0;
 static int	WalSegSz;
-static int	set_wal_segsize;
-static int	set_char_signedness = -1;
 
 static void CheckDataVersion(void);
 static bool read_controlfile(void);
@@ -118,7 +140,6 @@ main(int argc, char *argv[])
 	int			c;
 	bool		force = false;
 	bool		noupdate = false;
-	MultiXactId set_oldestmxid = 0;
 	char	   *endptr;
 	char	   *endptr2;
 	char	   *DataDir = NULL;
@@ -162,7 +183,7 @@ main(int argc, char *argv[])
 
 			case 'e':
 				errno = 0;
-				set_xid_epoch = strtouint32_strict(optarg, &endptr, 0);
+				next_xid_epoch_val = strtouint32_strict(optarg, &endptr, 0);
 				if (endptr == optarg || *endptr != '\0' || errno != 0)
 				{
 					/*------
@@ -171,46 +192,47 @@ main(int argc, char *argv[])
 					pg_log_error_hint("Try \"%s --help\" for more information.", progname);
 					exit(1);
 				}
-				if (set_xid_epoch == -1)
-					pg_fatal("transaction ID epoch (-e) must not be -1");
+				next_xid_epoch_given = true;
 				break;
 
 			case 'u':
 				errno = 0;
-				set_oldest_xid = strtouint32_strict(optarg, &endptr, 0);
+				oldest_xid_val = strtouint32_strict(optarg, &endptr, 0);
 				if (endptr == optarg || *endptr != '\0' || errno != 0)
 				{
 					pg_log_error("invalid argument for option %s", "-u");
 					pg_log_error_hint("Try \"%s --help\" for more information.", progname);
 					exit(1);
 				}
-				if (!TransactionIdIsNormal(set_oldest_xid))
+				if (!TransactionIdIsNormal(oldest_xid_val))
 					pg_fatal("oldest transaction ID (-u) must be greater than or equal to %u", FirstNormalTransactionId);
+				oldest_xid_given = true;
 				break;
 
 			case 'x':
 				errno = 0;
-				set_xid = strtouint32_strict(optarg, &endptr, 0);
+				next_xid_val = strtouint32_strict(optarg, &endptr, 0);
 				if (endptr == optarg || *endptr != '\0' || errno != 0)
 				{
 					pg_log_error("invalid argument for option %s", "-x");
 					pg_log_error_hint("Try \"%s --help\" for more information.", progname);
 					exit(1);
 				}
-				if (!TransactionIdIsNormal(set_xid))
+				if (!TransactionIdIsNormal(next_xid_val))
 					pg_fatal("transaction ID (-x) must be greater than or equal to %u", FirstNormalTransactionId);
+				next_xid_given = true;
 				break;
 
 			case 'c':
 				errno = 0;
-				set_oldest_commit_ts_xid = strtouint32_strict(optarg, &endptr, 0);
+				oldest_commit_ts_xid_val = strtouint32_strict(optarg, &endptr, 0);
 				if (endptr == optarg || *endptr != ',' || errno != 0)
 				{
 					pg_log_error("invalid argument for option %s", "-c");
 					pg_log_error_hint("Try \"%s --help\" for more information.", progname);
 					exit(1);
 				}
-				set_newest_commit_ts_xid = strtoul(endptr + 1, &endptr2, 0);
+				newest_commit_ts_xid_val = strtoul(endptr + 1, &endptr2, 0);
 				if (endptr2 == endptr + 1 || *endptr2 != '\0' || errno != 0)
 				{
 					pg_log_error("invalid argument for option %s", "-c");
@@ -218,31 +240,33 @@ main(int argc, char *argv[])
 					exit(1);
 				}
 
-				if (set_oldest_commit_ts_xid < FirstNormalTransactionId &&
-					set_oldest_commit_ts_xid != InvalidTransactionId)
+				if (oldest_commit_ts_xid_val < FirstNormalTransactionId &&
+					oldest_commit_ts_xid_val != InvalidTransactionId)
 					pg_fatal("transaction ID (-c) must be either %u or greater than or equal to %u", InvalidTransactionId, FirstNormalTransactionId);
 
-				if (set_newest_commit_ts_xid < FirstNormalTransactionId &&
-					set_newest_commit_ts_xid != InvalidTransactionId)
+				if (newest_commit_ts_xid_val < FirstNormalTransactionId &&
+					newest_commit_ts_xid_val != InvalidTransactionId)
 					pg_fatal("transaction ID (-c) must be either %u or greater than or equal to %u", InvalidTransactionId, FirstNormalTransactionId);
+				commit_ts_xids_given = true;
 				break;
 
 			case 'o':
 				errno = 0;
-				set_oid = strtouint32_strict(optarg, &endptr, 0);
+				next_oid_val = strtouint32_strict(optarg, &endptr, 0);
 				if (endptr == optarg || *endptr != '\0' || errno != 0)
 				{
 					pg_log_error("invalid argument for option %s", "-o");
 					pg_log_error_hint("Try \"%s --help\" for more information.", progname);
 					exit(1);
 				}
-				if (set_oid == 0)
+				if (next_oid_val == 0)
 					pg_fatal("OID (-o) must not be 0");
+				next_oid_given = true;
 				break;
 
 			case 'm':
 				errno = 0;
-				set_mxid = strtouint32_strict(optarg, &endptr, 0);
+				next_mxid_val = strtouint32_strict(optarg, &endptr, 0);
 				if (endptr == optarg || *endptr != ',' || errno != 0)
 				{
 					pg_log_error("invalid argument for option %s", "-m");
@@ -250,7 +274,7 @@ main(int argc, char *argv[])
 					exit(1);
 				}
 
-				set_oldestmxid = strtouint32_strict(endptr + 1, &endptr2, 0);
+				oldest_mxid_val = strtouint32_strict(endptr + 1, &endptr2, 0);
 				if (endptr2 == endptr + 1 || *endptr2 != '\0' || errno != 0)
 				{
 					pg_log_error("invalid argument for option %s", "-m");
@@ -262,21 +286,21 @@ main(int argc, char *argv[])
 				 * XXX It'd be nice to have more sanity checks here, e.g. so
 				 * that oldest is not wrapped around w.r.t. nextMulti.
 				 */
-				if (set_oldestmxid == 0)
+				if (oldest_mxid_val == 0)
 					pg_fatal("oldest multitransaction ID (-m) must not be 0");
-				mxid_given = true;
+				mxids_given = true;
 				break;
 
 			case 'O':
 				errno = 0;
-				set_mxoff = strtouint32_strict(optarg, &endptr, 0);
+				next_mxoff_val = strtouint32_strict(optarg, &endptr, 0);
 				if (endptr == optarg || *endptr != '\0' || errno != 0)
 				{
 					pg_log_error("invalid argument for option %s", "-O");
 					pg_log_error_hint("Try \"%s --help\" for more information.", progname);
 					exit(1);
 				}
-				mxoff_given = true;
+				next_mxoff_given = true;
 				break;
 
 			case 'l':
@@ -300,9 +324,10 @@ main(int argc, char *argv[])
 
 					if (!option_parse_int(optarg, "--wal-segsize", 1, 1024, &wal_segsize_mb))
 						exit(1);
-					set_wal_segsize = wal_segsize_mb * 1024 * 1024;
-					if (!IsValidWalSegSize(set_wal_segsize))
+					wal_segsize_val = wal_segsize_mb * 1024 * 1024;
+					if (!IsValidWalSegSize(wal_segsize_val))
 						pg_fatal("argument of %s must be a power of two between 1 and 1024", "--wal-segsize");
+					wal_segsize_given = true;
 					break;
 				}
 
@@ -311,15 +336,16 @@ main(int argc, char *argv[])
 					errno = 0;
 
 					if (pg_strcasecmp(optarg, "signed") == 0)
-						set_char_signedness = 1;
+						char_signedness_val = true;
 					else if (pg_strcasecmp(optarg, "unsigned") == 0)
-						set_char_signedness = 0;
+						char_signedness_val = false;
 					else
 					{
 						pg_log_error("invalid argument for option %s", "--char-signedness");
 						pg_log_error_hint("Try \"%s --help\" for more information.", progname);
 						exit(1);
 					}
+					char_signedness_given = true;
 					break;
 				}
 
@@ -407,8 +433,8 @@ main(int argc, char *argv[])
 	/*
 	 * If no new WAL segment size was specified, use the control file value.
 	 */
-	if (set_wal_segsize != 0)
-		WalSegSz = set_wal_segsize;
+	if (wal_segsize_given)
+		WalSegSz = wal_segsize_val;
 	else
 		WalSegSz = ControlFile.xlog_seg_size;
 
@@ -431,42 +457,43 @@ main(int argc, char *argv[])
 	 * Adjust fields if required by switches.  (Do this now so that printout,
 	 * if any, includes these values.)
 	 */
-	if (set_xid_epoch != -1)
+	if (next_xid_epoch_given)
 		ControlFile.checkPointCopy.nextXid =
-			FullTransactionIdFromEpochAndXid(set_xid_epoch,
+			FullTransactionIdFromEpochAndXid(next_xid_epoch_val,
 											 XidFromFullTransactionId(ControlFile.checkPointCopy.nextXid));
 
-	if (set_oldest_xid != 0)
+	if (oldest_xid_given)
 	{
-		ControlFile.checkPointCopy.oldestXid = set_oldest_xid;
+		ControlFile.checkPointCopy.oldestXid = oldest_xid_val;
 		ControlFile.checkPointCopy.oldestXidDB = InvalidOid;
 	}
 
-	if (set_xid != 0)
+	if (next_xid_given)
 		ControlFile.checkPointCopy.nextXid =
 			FullTransactionIdFromEpochAndXid(EpochFromFullTransactionId(ControlFile.checkPointCopy.nextXid),
-											 set_xid);
+											 next_xid_val);
 
-	if (set_oldest_commit_ts_xid != 0)
-		ControlFile.checkPointCopy.oldestCommitTsXid = set_oldest_commit_ts_xid;
-	if (set_newest_commit_ts_xid != 0)
-		ControlFile.checkPointCopy.newestCommitTsXid = set_newest_commit_ts_xid;
+	if (commit_ts_xids_given)
+	{
+		ControlFile.checkPointCopy.oldestCommitTsXid = oldest_commit_ts_xid_val;
+		ControlFile.checkPointCopy.newestCommitTsXid = newest_commit_ts_xid_val;
+	}
 
-	if (set_oid != 0)
-		ControlFile.checkPointCopy.nextOid = set_oid;
+	if (next_oid_given)
+		ControlFile.checkPointCopy.nextOid = next_oid_val;
 
-	if (mxid_given)
+	if (mxids_given)
 	{
-		ControlFile.checkPointCopy.nextMulti = set_mxid;
+		ControlFile.checkPointCopy.nextMulti = next_mxid_val;
 
-		ControlFile.checkPointCopy.oldestMulti = set_oldestmxid;
+		ControlFile.checkPointCopy.oldestMulti = oldest_mxid_val;
 		if (ControlFile.checkPointCopy.oldestMulti < FirstMultiXactId)
 			ControlFile.checkPointCopy.oldestMulti += FirstMultiXactId;
 		ControlFile.checkPointCopy.oldestMultiDB = InvalidOid;
 	}
 
-	if (mxoff_given)
-		ControlFile.checkPointCopy.nextMultiOffset = set_mxoff;
+	if (next_mxoff_given)
+		ControlFile.checkPointCopy.nextMultiOffset = next_mxoff_val;
 
 	if (minXlogTli > ControlFile.checkPointCopy.ThisTimeLineID)
 	{
@@ -474,11 +501,11 @@ main(int argc, char *argv[])
 		ControlFile.checkPointCopy.PrevTimeLineID = minXlogTli;
 	}
 
-	if (set_wal_segsize != 0)
+	if (wal_segsize_given)
 		ControlFile.xlog_seg_size = WalSegSz;
 
-	if (set_char_signedness != -1)
-		ControlFile.default_char_signedness = (set_char_signedness == 1);
+	if (char_signedness_given)
+		ControlFile.default_char_signedness = char_signedness_val;
 
 	if (minXlogSegNo > newXlogSegNo)
 		newXlogSegNo = minXlogSegNo;
@@ -809,7 +836,7 @@ PrintNewControlValues(void)
 				 newXlogSegNo, WalSegSz);
 	printf(_("First log segment after reset:        %s\n"), fname);
 
-	if (mxid_given)
+	if (mxids_given)
 	{
 		printf(_("NextMultiXactId:                      %u\n"),
 			   ControlFile.checkPointCopy.nextMulti);
@@ -819,25 +846,25 @@ PrintNewControlValues(void)
 			   ControlFile.checkPointCopy.oldestMultiDB);
 	}
 
-	if (mxoff_given)
+	if (next_mxoff_given)
 	{
 		printf(_("NextMultiOffset:                      %u\n"),
 			   ControlFile.checkPointCopy.nextMultiOffset);
 	}
 
-	if (set_oid != 0)
+	if (next_oid_given)
 	{
 		printf(_("NextOID:                              %u\n"),
 			   ControlFile.checkPointCopy.nextOid);
 	}
 
-	if (set_xid != 0)
+	if (next_xid_given)
 	{
 		printf(_("NextXID:                              %u\n"),
 			   XidFromFullTransactionId(ControlFile.checkPointCopy.nextXid));
 	}
 
-	if (set_oldest_xid != 0)
+	if (oldest_xid_given)
 	{
 		printf(_("OldestXID:                            %u\n"),
 			   ControlFile.checkPointCopy.oldestXid);
@@ -845,24 +872,21 @@ PrintNewControlValues(void)
 			   ControlFile.checkPointCopy.oldestXidDB);
 	}
 
-	if (set_xid_epoch != -1)
+	if (next_xid_epoch_given)
 	{
 		printf(_("NextXID epoch:                        %u\n"),
 			   EpochFromFullTransactionId(ControlFile.checkPointCopy.nextXid));
 	}
 
-	if (set_oldest_commit_ts_xid != 0)
+	if (commit_ts_xids_given)
 	{
 		printf(_("oldestCommitTsXid:                    %u\n"),
 			   ControlFile.checkPointCopy.oldestCommitTsXid);
-	}
-	if (set_newest_commit_ts_xid != 0)
-	{
 		printf(_("newestCommitTsXid:                    %u\n"),
 			   ControlFile.checkPointCopy.newestCommitTsXid);
 	}
 
-	if (set_wal_segsize != 0)
+	if (wal_segsize_given)
 	{
 		printf(_("Bytes per WAL segment:                %u\n"),
 			   ControlFile.xlog_seg_size);
-- 
2.47.3

v29-0003-Move-pg_multixact-SLRU-page-format-definitions-t.patch (text/x-patch; charset=UTF-8)
From 1f584eb98dca13bb992d199c6ffdd2b18a04bd29 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Wed, 3 Dec 2025 20:07:47 +0200
Subject: [PATCH v29 3/6] Move pg_multixact SLRU page format definitions to a
 separate header

This makes them accessible from pg_upgrade, needed by the next commit.
I'm doing this mechanical move as a separate commit to make the next
commit's changes to these definitions more obvious.

Author: Maxim Orlov <orlovmg@gmail.com>
Discussion: https://www.postgresql.org/message-id/CACG%3DezbZo_3_fnx%3DS5BfepwRftzrpJ%2B7WET4EkTU6wnjDTsnjg@mail.gmail.com
---
 src/backend/access/transam/multixact.c  | 120 +-------------------
 src/include/access/multixact_internal.h | 140 ++++++++++++++++++++++++
 2 files changed, 141 insertions(+), 119 deletions(-)
 create mode 100644 src/include/access/multixact_internal.h

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 8ed3fd9d071..14d46fb761b 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -69,6 +69,7 @@
 #include "postgres.h"
 
 #include "access/multixact.h"
+#include "access/multixact_internal.h"
 #include "access/slru.h"
 #include "access/twophase.h"
 #include "access/twophase_rmgr.h"
@@ -88,125 +89,6 @@
 #include "utils/memutils.h"
 
 
-/*
- * Defines for MultiXactOffset page sizes.  A page is the same BLCKSZ as is
- * used everywhere else in Postgres.
- *
- * Note: because MultiXactOffsets are 32 bits and wrap around at 0xFFFFFFFF,
- * MultiXact page numbering also wraps around at
- * 0xFFFFFFFF/MULTIXACT_OFFSETS_PER_PAGE, and segment numbering at
- * 0xFFFFFFFF/MULTIXACT_OFFSETS_PER_PAGE/SLRU_PAGES_PER_SEGMENT.  We need
- * take no explicit notice of that fact in this module, except when comparing
- * segment and page numbers in TruncateMultiXact (see
- * MultiXactOffsetPagePrecedes).
- */
-
-/* We need four bytes per offset */
-#define MULTIXACT_OFFSETS_PER_PAGE (BLCKSZ / sizeof(MultiXactOffset))
-
-static inline int64
-MultiXactIdToOffsetPage(MultiXactId multi)
-{
-	return multi / MULTIXACT_OFFSETS_PER_PAGE;
-}
-
-static inline int
-MultiXactIdToOffsetEntry(MultiXactId multi)
-{
-	return multi % MULTIXACT_OFFSETS_PER_PAGE;
-}
-
-static inline int64
-MultiXactIdToOffsetSegment(MultiXactId multi)
-{
-	return MultiXactIdToOffsetPage(multi) / SLRU_PAGES_PER_SEGMENT;
-}
-
-/*
- * The situation for members is a bit more complex: we store one byte of
- * additional flag bits for each TransactionId.  To do this without getting
- * into alignment issues, we store four bytes of flags, and then the
- * corresponding 4 Xids.  Each such 5-word (20-byte) set we call a "group", and
- * are stored as a whole in pages.  Thus, with 8kB BLCKSZ, we keep 409 groups
- * per page.  This wastes 12 bytes per page, but that's OK -- simplicity (and
- * performance) trumps space efficiency here.
- *
- * Note that the "offset" macros work with byte offset, not array indexes, so
- * arithmetic must be done using "char *" pointers.
- */
-/* We need eight bits per xact, so one xact fits in a byte */
-#define MXACT_MEMBER_BITS_PER_XACT			8
-#define MXACT_MEMBER_FLAGS_PER_BYTE			1
-#define MXACT_MEMBER_XACT_BITMASK	((1 << MXACT_MEMBER_BITS_PER_XACT) - 1)
-
-/* how many full bytes of flags are there in a group? */
-#define MULTIXACT_FLAGBYTES_PER_GROUP		4
-#define MULTIXACT_MEMBERS_PER_MEMBERGROUP	\
-	(MULTIXACT_FLAGBYTES_PER_GROUP * MXACT_MEMBER_FLAGS_PER_BYTE)
-/* size in bytes of a complete group */
-#define MULTIXACT_MEMBERGROUP_SIZE \
-	(sizeof(TransactionId) * MULTIXACT_MEMBERS_PER_MEMBERGROUP + MULTIXACT_FLAGBYTES_PER_GROUP)
-#define MULTIXACT_MEMBERGROUPS_PER_PAGE (BLCKSZ / MULTIXACT_MEMBERGROUP_SIZE)
-#define MULTIXACT_MEMBERS_PER_PAGE	\
-	(MULTIXACT_MEMBERGROUPS_PER_PAGE * MULTIXACT_MEMBERS_PER_MEMBERGROUP)
-
-/*
- * Because the number of items per page is not a divisor of the last item
- * number (member 0xFFFFFFFF), the last segment does not use the maximum number
- * of pages, and moreover the last used page therein does not use the same
- * number of items as previous pages.  (Another way to say it is that the
- * 0xFFFFFFFF member is somewhere in the middle of the last page, so the page
- * has some empty space after that item.)
- *
- * This constant is the number of members in the last page of the last segment.
- */
-#define MAX_MEMBERS_IN_LAST_MEMBERS_PAGE \
-		((uint32) ((0xFFFFFFFF % MULTIXACT_MEMBERS_PER_PAGE) + 1))
-
-/* page in which a member is to be found */
-static inline int64
-MXOffsetToMemberPage(MultiXactOffset offset)
-{
-	return offset / MULTIXACT_MEMBERS_PER_PAGE;
-}
-
-static inline int64
-MXOffsetToMemberSegment(MultiXactOffset offset)
-{
-	return MXOffsetToMemberPage(offset) / SLRU_PAGES_PER_SEGMENT;
-}
-
-/* Location (byte offset within page) of flag word for a given member */
-static inline int
-MXOffsetToFlagsOffset(MultiXactOffset offset)
-{
-	MultiXactOffset group = offset / MULTIXACT_MEMBERS_PER_MEMBERGROUP;
-	int			grouponpg = group % MULTIXACT_MEMBERGROUPS_PER_PAGE;
-	int			byteoff = grouponpg * MULTIXACT_MEMBERGROUP_SIZE;
-
-	return byteoff;
-}
-
-static inline int
-MXOffsetToFlagsBitShift(MultiXactOffset offset)
-{
-	int			member_in_group = offset % MULTIXACT_MEMBERS_PER_MEMBERGROUP;
-	int			bshift = member_in_group * MXACT_MEMBER_BITS_PER_XACT;
-
-	return bshift;
-}
-
-/* Location (byte offset within page) of TransactionId of given member */
-static inline int
-MXOffsetToMemberOffset(MultiXactOffset offset)
-{
-	int			member_in_group = offset % MULTIXACT_MEMBERS_PER_MEMBERGROUP;
-
-	return MXOffsetToFlagsOffset(offset) +
-		MULTIXACT_FLAGBYTES_PER_GROUP +
-		member_in_group * sizeof(TransactionId);
-}
-
 /* Multixact members wraparound thresholds. */
 #define MULTIXACT_MEMBER_SAFE_THRESHOLD		(MaxMultiXactOffset / 2)
 #define MULTIXACT_MEMBER_DANGER_THRESHOLD	\
diff --git a/src/include/access/multixact_internal.h b/src/include/access/multixact_internal.h
new file mode 100644
index 00000000000..9b56deaef31
--- /dev/null
+++ b/src/include/access/multixact_internal.h
@@ -0,0 +1,140 @@
+/*
+ * multixact_internal.h
+ *
+ * PostgreSQL multi-transaction-log manager internal declarations
+ *
+ * These functions and definitions are for dealing with pg_multixact pages.
+ * They are internal to multixact.c, but they are exported here to allow
+ * pg_upgrade to write pg_multixact files directly.
+ *
+ * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/multixact_internal.h
+ */
+#ifndef MULTIXACT_INTERNAL_H
+#define MULTIXACT_INTERNAL_H
+
+#include "access/multixact.h"
+
+
+/*
+ * Defines for MultiXactOffset page sizes.  A page is the same BLCKSZ as is
+ * used everywhere else in Postgres.
+ *
+ * Note: because MultiXactOffsets are 32 bits and wrap around at 0xFFFFFFFF,
+ * MultiXact page numbering also wraps around at
+ * 0xFFFFFFFF/MULTIXACT_OFFSETS_PER_PAGE, and segment numbering at
+ * 0xFFFFFFFF/MULTIXACT_OFFSETS_PER_PAGE/SLRU_PAGES_PER_SEGMENT.  We need
+ * take no explicit notice of that fact in this module, except when comparing
+ * segment and page numbers in TruncateMultiXact (see
+ * MultiXactOffsetPagePrecedes).
+ */
+
+/* We need four bytes per offset */
+#define MULTIXACT_OFFSETS_PER_PAGE (BLCKSZ / sizeof(MultiXactOffset))
+
+static inline int64
+MultiXactIdToOffsetPage(MultiXactId multi)
+{
+	return multi / MULTIXACT_OFFSETS_PER_PAGE;
+}
+
+static inline int
+MultiXactIdToOffsetEntry(MultiXactId multi)
+{
+	return multi % MULTIXACT_OFFSETS_PER_PAGE;
+}
+
+static inline int64
+MultiXactIdToOffsetSegment(MultiXactId multi)
+{
+	return MultiXactIdToOffsetPage(multi) / SLRU_PAGES_PER_SEGMENT;
+}
+
+/*
+ * The situation for members is a bit more complex: we store one byte of
+ * additional flag bits for each TransactionId.  To do this without getting
+ * into alignment issues, we store four bytes of flags, and then the
+ * corresponding 4 Xids.  Each such 5-word (20-byte) set we call a "group", and
+ * are stored as a whole in pages.  Thus, with 8kB BLCKSZ, we keep 409 groups
+ * per page.  This wastes 12 bytes per page, but that's OK -- simplicity (and
+ * performance) trumps space efficiency here.
+ *
+ * Note that the "offset" macros work with byte offset, not array indexes, so
+ * arithmetic must be done using "char *" pointers.
+ */
+/* We need eight bits per xact, so one xact fits in a byte */
+#define MXACT_MEMBER_BITS_PER_XACT			8
+#define MXACT_MEMBER_FLAGS_PER_BYTE			1
+#define MXACT_MEMBER_XACT_BITMASK	((1 << MXACT_MEMBER_BITS_PER_XACT) - 1)
+
+/* how many full bytes of flags are there in a group? */
+#define MULTIXACT_FLAGBYTES_PER_GROUP		4
+#define MULTIXACT_MEMBERS_PER_MEMBERGROUP	\
+	(MULTIXACT_FLAGBYTES_PER_GROUP * MXACT_MEMBER_FLAGS_PER_BYTE)
+/* size in bytes of a complete group */
+#define MULTIXACT_MEMBERGROUP_SIZE \
+	(sizeof(TransactionId) * MULTIXACT_MEMBERS_PER_MEMBERGROUP + MULTIXACT_FLAGBYTES_PER_GROUP)
+#define MULTIXACT_MEMBERGROUPS_PER_PAGE (BLCKSZ / MULTIXACT_MEMBERGROUP_SIZE)
+#define MULTIXACT_MEMBERS_PER_PAGE	\
+	(MULTIXACT_MEMBERGROUPS_PER_PAGE * MULTIXACT_MEMBERS_PER_MEMBERGROUP)
+
+/*
+ * Because the number of items per page is not a divisor of the last item
+ * number (member 0xFFFFFFFF), the last segment does not use the maximum number
+ * of pages, and moreover the last used page therein does not use the same
+ * number of items as previous pages.  (Another way to say it is that the
+ * 0xFFFFFFFF member is somewhere in the middle of the last page, so the page
+ * has some empty space after that item.)
+ *
+ * This constant is the number of members in the last page of the last segment.
+ */
+#define MAX_MEMBERS_IN_LAST_MEMBERS_PAGE \
+		((uint32) ((0xFFFFFFFF % MULTIXACT_MEMBERS_PER_PAGE) + 1))
+
+/* page in which a member is to be found */
+static inline int64
+MXOffsetToMemberPage(MultiXactOffset offset)
+{
+	return offset / MULTIXACT_MEMBERS_PER_PAGE;
+}
+
+static inline int64
+MXOffsetToMemberSegment(MultiXactOffset offset)
+{
+	return MXOffsetToMemberPage(offset) / SLRU_PAGES_PER_SEGMENT;
+}
+
+/* Location (byte offset within page) of flag word for a given member */
+static inline int
+MXOffsetToFlagsOffset(MultiXactOffset offset)
+{
+	MultiXactOffset group = offset / MULTIXACT_MEMBERS_PER_MEMBERGROUP;
+	int			grouponpg = group % MULTIXACT_MEMBERGROUPS_PER_PAGE;
+	int			byteoff = grouponpg * MULTIXACT_MEMBERGROUP_SIZE;
+
+	return byteoff;
+}
+
+static inline int
+MXOffsetToFlagsBitShift(MultiXactOffset offset)
+{
+	int			member_in_group = offset % MULTIXACT_MEMBERS_PER_MEMBERGROUP;
+	int			bshift = member_in_group * MXACT_MEMBER_BITS_PER_XACT;
+
+	return bshift;
+}
+
+/* Location (byte offset within page) of TransactionId of given member */
+static inline int
+MXOffsetToMemberOffset(MultiXactOffset offset)
+{
+	int			member_in_group = offset % MULTIXACT_MEMBERS_PER_MEMBERGROUP;
+
+	return MXOffsetToFlagsOffset(offset) +
+		MULTIXACT_FLAGBYTES_PER_GROUP +
+		member_in_group * sizeof(TransactionId);
+}
+
+#endif							/* MULTIXACT_INTERNAL_H */
-- 
2.47.3
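
For anyone reading along, the member-page arithmetic that this patch moves into multixact_internal.h can be sketched standalone as below. This is only an illustration under stated assumptions: the default 8kB BLCKSZ, a 4-byte TransactionId, and an offset type already widened to 64 bits. All sketch_* and SKETCH_* names are hypothetical and are not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Assumptions for illustration only: 8kB pages and 4-byte xids. */
#define SKETCH_BLCKSZ				8192
#define SKETCH_XID_SIZE				4	/* assumed sizeof(TransactionId) */
#define SKETCH_FLAGBYTES_PER_GROUP	4	/* one flag byte per member */
#define SKETCH_MEMBERS_PER_GROUP	4
#define SKETCH_GROUP_SIZE \
	(SKETCH_XID_SIZE * SKETCH_MEMBERS_PER_GROUP + SKETCH_FLAGBYTES_PER_GROUP)
#define SKETCH_GROUPS_PER_PAGE		(SKETCH_BLCKSZ / SKETCH_GROUP_SIZE)
#define SKETCH_MEMBERS_PER_PAGE \
	(SKETCH_GROUPS_PER_PAGE * SKETCH_MEMBERS_PER_GROUP)

/* Page on which a given member offset lives. */
static inline int64_t
sketch_member_page(uint64_t offset)
{
	return (int64_t) (offset / SKETCH_MEMBERS_PER_PAGE);
}

/* Byte offset within the page of the group's flag word. */
static inline int
sketch_flags_offset(uint64_t offset)
{
	uint64_t	group = offset / SKETCH_MEMBERS_PER_GROUP;
	int			grouponpg = (int) (group % SKETCH_GROUPS_PER_PAGE);

	return grouponpg * SKETCH_GROUP_SIZE;
}

/* Bit shift of this member's flag byte inside the flag word. */
static inline int
sketch_flags_bitshift(uint64_t offset)
{
	return (int) (offset % SKETCH_MEMBERS_PER_GROUP) * 8;
}

/* Byte offset within the page of this member's xid. */
static inline int
sketch_member_offset(uint64_t offset)
{
	int			member_in_group = (int) (offset % SKETCH_MEMBERS_PER_GROUP);

	return sketch_flags_offset(offset) +
		SKETCH_FLAGBYTES_PER_GROUP +
		member_in_group * SKETCH_XID_SIZE;
}

/* Checks the figures quoted in the header comment: 20-byte groups, 409 per page. */
static inline void
sketch_self_check(void)
{
	assert(SKETCH_GROUP_SIZE == 20);
	assert(SKETCH_GROUPS_PER_PAGE == 409);
	assert(SKETCH_BLCKSZ - SKETCH_GROUPS_PER_PAGE * SKETCH_GROUP_SIZE == 12);
}
```

Under these assumptions a page holds 1636 members, so member offset 1637 is the second member of the first group on page 1.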

Attachment: v29-0004-FIXME-bump-catversion.patch (text/x-patch; charset=UTF-8)
From c741713bf6fa05030d91652da7cb54264ac25e47 Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Fri, 24 Oct 2025 11:47:50 +0300
Subject: [PATCH v29 4/6] FIXME: bump catversion

To avoid constant CF-bot complaints, bump catversion in a separate
commit.

This is to be squashed with the main commit before pushing.

NOTE: keep it in sync with MULTIXACTOFFSET_FORMATCHANGE_CAT_VER
---
 src/include/catalog/catversion.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h
index 2fa6c8c60f0..b0162c2bf63 100644
--- a/src/include/catalog/catversion.h
+++ b/src/include/catalog/catversion.h
@@ -57,6 +57,7 @@
  */
 
 /*							yyyymmddN */
-#define CATALOG_VERSION_NO	202512061
+// FIXME: bump it
+#define CATALOG_VERSION_NO	999999999
 
 #endif
-- 
2.47.3

Attachment: v29-0005-Widen-MultiXactOffset-to-64-bits.patch (text/x-patch; charset=UTF-8)
From ee1012298bfbc607022c81cb3e95912f542c3649 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Wed, 3 Dec 2025 20:11:50 +0200
Subject: [PATCH v29 5/6] Widen MultiXactOffset to 64 bits

This eliminates offset wraparound and the 2^32 limit on the total
number of multixid members. Multixids are still limited to 2^31, but
this is a nice improvement because 'members' can grow much faster than
the number of multixids. On such systems, you can now run longer
before hitting hard limits or triggering anti-wraparound vacuums.

Not having to deal with offset wraparound also simplifies the code and
removes some gnarly corner cases.

We no longer need to perform emergency anti-wraparound freezing
because of running out of 'members' space, so the offset stop limit is
gone. But you might still not want 'members' to consume huge amounts
of disk space. For that reason, I kept the logic for lowering vacuum's
multixid freezing cutoff if a large amount of 'members' space is
used. The thresholds for that are roughly the same as the "safe" and
"danger" thresholds used before, 2 billion members and 4 billion
members. This keeps the behavior for the freeze cutoff roughly
the same as before. It might make sense to make this smarter or
configurable, now that the threshold is only needed to manage disk
usage, but that's left for the future.

Add code to pg_upgrade to convert multitransactions from the old to
the new format. Because pg_upgrade now rewrites the files in the new
format, we can get rid of some hacks we had put in place to deal with
old bugs and upgraded clusters.

Author: Maxim Orlov <orlovmg@gmail.com>
Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>
Reviewed-by: wenhui qiu <qiuwenhuifx@gmail.com>
Discussion: https://www.postgresql.org/message-id/CACG%3DezaWg7_nt-8ey4aKv2w9LcuLthHknwCawmBgEeTnJrJTcw@mail.gmail.com
---
 doc/src/sgml/ref/pg_resetwal.sgml             |  13 +-
 src/backend/access/rmgrdesc/mxactdesc.c       |   4 +-
 src/backend/access/rmgrdesc/xlogdesc.c        |   2 +-
 src/backend/access/transam/multixact.c        | 550 ++++--------------
 src/backend/access/transam/xlog.c             |   6 +-
 src/backend/access/transam/xlogrecovery.c     |   2 +-
 src/backend/commands/vacuum.c                 |   6 +-
 src/backend/postmaster/autovacuum.c           |   4 +-
 src/bin/pg_controldata/pg_controldata.c       |   2 +-
 src/bin/pg_resetwal/pg_resetwal.c             |  38 +-
 src/bin/pg_resetwal/t/001_basic.pl            |   2 +-
 src/bin/pg_upgrade/Makefile                   |   3 +
 src/bin/pg_upgrade/meson.build                |   4 +
 src/bin/pg_upgrade/multixact_read_v18.c       | 337 +++++++++++
 src/bin/pg_upgrade/multixact_read_v18.h       |  37 ++
 src/bin/pg_upgrade/multixact_rewrite.c        | 195 +++++++
 src/bin/pg_upgrade/pg_upgrade.c               |  81 ++-
 src/bin/pg_upgrade/pg_upgrade.h               |  12 +-
 src/bin/pg_upgrade/slru_io.c                  | 258 ++++++++
 src/bin/pg_upgrade/slru_io.h                  |  52 ++
 .../pg_upgrade/t/007_multixact_conversion.pl  | 433 ++++++++++++++
 src/include/access/multixact.h                |   7 +-
 src/include/access/multixact_internal.h       |  23 +-
 src/include/c.h                               |   2 +-
 .../test_slru/t/002_multixact_wraparound.pl   |   2 +-
 .../perl/PostgreSQL/Test/BackgroundPsql.pm    |  15 +-
 src/test/perl/PostgreSQL/Test/Cluster.pm      |  21 +-
 src/tools/pgindent/typedefs.list              |   3 +
 28 files changed, 1594 insertions(+), 520 deletions(-)
 create mode 100644 src/bin/pg_upgrade/multixact_read_v18.c
 create mode 100644 src/bin/pg_upgrade/multixact_read_v18.h
 create mode 100644 src/bin/pg_upgrade/multixact_rewrite.c
 create mode 100644 src/bin/pg_upgrade/slru_io.c
 create mode 100644 src/bin/pg_upgrade/slru_io.h
 create mode 100644 src/bin/pg_upgrade/t/007_multixact_conversion.pl
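
Note for reviewers: the arithmetic behind the pg_resetwal documentation change below can be sketched as follows. This is a rough illustration rather than code from the patch. It assumes the default 8kB BLCKSZ, 32 pages per SLRU segment, and the 8-byte MultiXactOffset introduced here. The sketch_* and SKETCH_* names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumed defaults: 32 pages per segment, 8kB pages, 8-byte offsets. */
#define SKETCH_PAGES_PER_SEGMENT	32
#define SKETCH_BLCKSZ				8192
#define SKETCH_OFFSET_SIZE			8
#define SKETCH_MULTIS_PER_SEGMENT \
	(SKETCH_PAGES_PER_SEGMENT * SKETCH_BLCKSZ / SKETCH_OFFSET_SIZE)

/* Segment file names in pg_multixact/offsets are hexadecimal numbers. */
static inline uint64_t
sketch_segment_number(const char *filename)
{
	return (uint64_t) strtoull(filename, NULL, 16);
}

/* First part of -m: (largest file name + 1) * multixids per segment. */
static inline uint64_t
sketch_next_multi(const char *largest_filename)
{
	return (sketch_segment_number(largest_filename) + 1) *
		SKETCH_MULTIS_PER_SEGMENT;
}

/* Second part of -m: smallest file name * multixids per segment. */
static inline uint64_t
sketch_oldest_multi(const char *smallest_filename)
{
	return sketch_segment_number(smallest_filename) *
		SKETCH_MULTIS_PER_SEGMENT;
}
```

With 000F and 0007 as the greatest and smallest file names, this gives 0x80000 and 0x38000, matching the example in the documentation hunk.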

diff --git a/doc/src/sgml/ref/pg_resetwal.sgml b/doc/src/sgml/ref/pg_resetwal.sgml
index 2c019c2aac6..41f2b1d480c 100644
--- a/doc/src/sgml/ref/pg_resetwal.sgml
+++ b/doc/src/sgml/ref/pg_resetwal.sgml
@@ -267,14 +267,17 @@ PostgreSQL documentation
       A safe value for the next multitransaction ID (first part) can be
       determined by looking for the numerically largest file name in the
       directory <filename>pg_multixact/offsets</filename> under the data directory,
-      adding one, and then multiplying by 65536 (0x10000).  Conversely, a safe
+      adding one, and then multiplying by 32768 (0x8000).  Conversely, a safe
       value for the oldest multitransaction ID (second part of
       <option>-m</option>) can be determined by looking for the numerically smallest
-      file name in the same directory and multiplying by 65536.  The file
-      names are in hexadecimal, so the easiest way to do this is to specify
-      the option value in hexadecimal and append four zeroes.
+      file name in the same directory and multiplying by 32768 (0x8000).
+      Note that the file names are in hexadecimal.  It is usually easiest
+      to specify the option value in hexadecimal too.  For example, if
+      <filename>000F</filename> and <filename>0007</filename> are the greatest and
+      smallest entries in <filename>pg_multixact/offsets</filename>,
+      <literal>-m 0x80000,0x38000</literal> will work.
      </para>
-     <!-- 65536 = SLRU_PAGES_PER_SEGMENT * BLCKSZ / sizeof(MultiXactOffset) -->
+     <!-- 32768 = SLRU_PAGES_PER_SEGMENT * BLCKSZ / sizeof(MultiXactOffset) -->
     </listitem>
    </varlistentry>
 
diff --git a/src/backend/access/rmgrdesc/mxactdesc.c b/src/backend/access/rmgrdesc/mxactdesc.c
index 3ca0582db36..052dd0a4ce5 100644
--- a/src/backend/access/rmgrdesc/mxactdesc.c
+++ b/src/backend/access/rmgrdesc/mxactdesc.c
@@ -65,7 +65,7 @@ multixact_desc(StringInfo buf, XLogReaderState *record)
 		xl_multixact_create *xlrec = (xl_multixact_create *) rec;
 		int			i;
 
-		appendStringInfo(buf, "%u offset %u nmembers %d: ", xlrec->mid,
+		appendStringInfo(buf, "%u offset %" PRIu64 " nmembers %d: ", xlrec->mid,
 						 xlrec->moff, xlrec->nmembers);
 		for (i = 0; i < xlrec->nmembers; i++)
 			out_member(buf, &xlrec->members[i]);
@@ -74,7 +74,7 @@ multixact_desc(StringInfo buf, XLogReaderState *record)
 	{
 		xl_multixact_truncate *xlrec = (xl_multixact_truncate *) rec;
 
-		appendStringInfo(buf, "offsets [%u, %u), members [%u, %u)",
+		appendStringInfo(buf, "offsets [%u, %u), members [%" PRIu64 ", %" PRIu64 ")",
 						 xlrec->startTruncOff, xlrec->endTruncOff,
 						 xlrec->startTruncMemb, xlrec->endTruncMemb);
 	}
diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c
index cd6c2a2f650..441034f5929 100644
--- a/src/backend/access/rmgrdesc/xlogdesc.c
+++ b/src/backend/access/rmgrdesc/xlogdesc.c
@@ -66,7 +66,7 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 		CheckPoint *checkpoint = (CheckPoint *) rec;
 
 		appendStringInfo(buf, "redo %X/%08X; "
-						 "tli %u; prev tli %u; fpw %s; wal_level %s; xid %u:%u; oid %u; multi %u; offset %u; "
+						 "tli %u; prev tli %u; fpw %s; wal_level %s; xid %u:%u; oid %u; multi %u; offset %" PRIu64 "; "
 						 "oldest xid %u in DB %u; oldest multi %u in DB %u; "
 						 "oldest/newest commit timestamp xid: %u/%u; "
 						 "oldest running xid %u; %s",
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 14d46fb761b..dffa0c8e7d4 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -89,10 +89,14 @@
 #include "utils/memutils.h"
 
 
-/* Multixact members wraparound thresholds. */
-#define MULTIXACT_MEMBER_SAFE_THRESHOLD		(MaxMultiXactOffset / 2)
-#define MULTIXACT_MEMBER_DANGER_THRESHOLD	\
-	(MaxMultiXactOffset - MaxMultiXactOffset / 4)
+/*
+ * Thresholds used to keep members disk usage in check when multixids have a
+ * lot of members.  When MULTIXACT_MEMBER_LOW_THRESHOLD is reached, vacuum
+ * starts freezing multixids more aggressively, even if the normal multixid
+ * age limits haven't been reached yet.
+ */
+#define MULTIXACT_MEMBER_LOW_THRESHOLD		UINT64CONST(2000000000)
+#define MULTIXACT_MEMBER_HIGH_THRESHOLD		UINT64CONST(4000000000)
 
 static inline MultiXactId
 PreviousMultiXactId(MultiXactId multi)
@@ -137,11 +141,9 @@ typedef struct MultiXactStateData
 
 	/*
 	 * Oldest multixact offset that is potentially referenced by a multixact
-	 * referenced by a relation.  We don't always know this value, so there's
-	 * a flag here to indicate whether or not we currently do.
+	 * referenced by a relation.
 	 */
 	MultiXactOffset oldestOffset;
-	bool		oldestOffsetKnown;
 
 	/* support for anti-wraparound measures */
 	MultiXactId multiVacLimit;
@@ -149,9 +151,6 @@ typedef struct MultiXactStateData
 	MultiXactId multiStopLimit;
 	MultiXactId multiWrapLimit;
 
-	/* support for members anti-wraparound measures */
-	MultiXactOffset offsetStopLimit;	/* known if oldestOffsetKnown */
-
 	/*
 	 * Per-backend data starts here.  We have two arrays stored in the area
 	 * immediately following the MultiXactStateData struct. Each is indexed by
@@ -272,13 +271,9 @@ static void mXactCachePut(MultiXactId multi, int nmembers,
 /* management of SLRU infrastructure */
 static bool MultiXactOffsetPagePrecedes(int64 page1, int64 page2);
 static bool MultiXactMemberPagePrecedes(int64 page1, int64 page2);
-static bool MultiXactOffsetPrecedes(MultiXactOffset offset1,
-									MultiXactOffset offset2);
 static void ExtendMultiXactOffset(MultiXactId multi);
 static void ExtendMultiXactMember(MultiXactOffset offset, int nmembers);
-static bool MultiXactOffsetWouldWrap(MultiXactOffset boundary,
-									 MultiXactOffset start, uint32 distance);
-static bool SetOffsetVacuumLimit(bool is_startup);
+static void SetOffsetVacuumLimit(void);
 static bool find_multixact_start(MultiXactId multi, MultiXactOffset *result);
 static void WriteMTruncateXlogRec(Oid oldestMultiDB,
 								  MultiXactId startTruncOff,
@@ -1073,90 +1068,22 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 	ExtendMultiXactOffset(result + 1);
 
 	/*
-	 * Reserve the members space, similarly to above.  Also, be careful not to
-	 * return zero as the starting offset for any multixact. See
-	 * GetMultiXactIdMembers() for motivation.
+	 * Reserve the members space, similarly to above.
 	 */
 	nextOffset = MultiXactState->nextOffset;
-	if (nextOffset == 0)
-	{
-		*offset = 1;
-		nmembers++;				/* allocate member slot 0 too */
-	}
-	else
-		*offset = nextOffset;
-
-	/*----------
-	 * Protect against overrun of the members space as well, with the
-	 * following rules:
-	 *
-	 * If we're past offsetStopLimit, refuse to generate more multis.
-	 * If we're close to offsetStopLimit, emit a warning.
-	 *
-	 * Arbitrarily, we start emitting warnings when we're 20 segments or less
-	 * from offsetStopLimit.
-	 *
-	 * Note we haven't updated the shared state yet, so if we fail at this
-	 * point, the multixact ID we grabbed can still be used by the next guy.
-	 *
-	 * Note that there is no point in forcing autovacuum runs here: the
-	 * multixact freeze settings would have to be reduced for that to have any
-	 * effect.
-	 *----------
-	 */
-#define OFFSET_WARN_SEGMENTS	20
-	if (MultiXactState->oldestOffsetKnown &&
-		MultiXactOffsetWouldWrap(MultiXactState->offsetStopLimit, nextOffset,
-								 nmembers))
-	{
-		/* see comment in the corresponding offsets wraparound case */
-		SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER);
-
-		ereport(ERROR,
-				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg("multixact \"members\" limit exceeded"),
-				 errdetail_plural("This command would create a multixact with %u members, but the remaining space is only enough for %u member.",
-								  "This command would create a multixact with %u members, but the remaining space is only enough for %u members.",
-								  MultiXactState->offsetStopLimit - nextOffset - 1,
-								  nmembers,
-								  MultiXactState->offsetStopLimit - nextOffset - 1),
-				 errhint("Execute a database-wide VACUUM in database with OID %u with reduced \"vacuum_multixact_freeze_min_age\" and \"vacuum_multixact_freeze_table_age\" settings.",
-						 MultiXactState->oldestMultiXactDB)));
-	}
 
 	/*
-	 * Check whether we should kick autovacuum into action, to prevent members
-	 * wraparound. NB we use a much larger window to trigger autovacuum than
-	 * just the warning limit. The warning is just a measure of last resort -
-	 * this is in line with GetNewTransactionId's behaviour.
+	 * Offsets are 64-bit integers and will never wrap around.  Firstly, it
+	 * would take an unrealistic amount of time and resources to consume 2^64
+	 * offsets.  Secondly, multixid creation is WAL-logged, so you would run
+	 * out of LSNs before reaching offset wraparound.  Nevertheless, check for
+	 * wraparound as a sanity check.
 	 */
-	if (!MultiXactState->oldestOffsetKnown ||
-		(MultiXactState->nextOffset - MultiXactState->oldestOffset
-		 > MULTIXACT_MEMBER_SAFE_THRESHOLD))
-	{
-		/*
-		 * To avoid swamping the postmaster with signals, we issue the autovac
-		 * request only when crossing a segment boundary. With default
-		 * compilation settings that's roughly after 50k members.  This still
-		 * gives plenty of chances before we get into real trouble.
-		 */
-		if ((MXOffsetToMemberPage(nextOffset) / SLRU_PAGES_PER_SEGMENT) !=
-			(MXOffsetToMemberPage(nextOffset + nmembers) / SLRU_PAGES_PER_SEGMENT))
-			SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER);
-	}
-
-	if (MultiXactState->oldestOffsetKnown &&
-		MultiXactOffsetWouldWrap(MultiXactState->offsetStopLimit,
-								 nextOffset,
-								 nmembers + MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT * OFFSET_WARN_SEGMENTS))
-		ereport(WARNING,
+	if (nextOffset + nmembers < nextOffset)
+		ereport(ERROR,
 				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg_plural("database with OID %u must be vacuumed before %d more multixact member is used",
-							   "database with OID %u must be vacuumed before %d more multixact members are used",
-							   MultiXactState->offsetStopLimit - nextOffset + nmembers,
-							   MultiXactState->oldestMultiXactDB,
-							   MultiXactState->offsetStopLimit - nextOffset + nmembers),
-				 errhint("Execute a database-wide VACUUM in that database with reduced \"vacuum_multixact_freeze_min_age\" and \"vacuum_multixact_freeze_table_age\" settings.")));
+				 errmsg("MultiXact members would wrap around")));
+	*offset = nextOffset;
 
 	ExtendMultiXactMember(nextOffset, nmembers);
 
@@ -1177,8 +1104,7 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 	 * the next iteration.  But note that nextMXact may be InvalidMultiXactId
 	 * or the first value on a segment-beginning page after this routine
 	 * exits, so anyone else looking at the variable must be prepared to deal
-	 * with either case.  Similarly, nextOffset may be zero, but we won't use
-	 * that as the actual start offset of the next multixact.
+	 * with either case.
 	 */
 	(MultiXactState->nextMXact)++;
 
@@ -1186,7 +1112,8 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 
 	LWLockRelease(MultiXactGenLock);
 
-	debug_elog4(DEBUG2, "GetNew: returning %u offset %u", result, *offset);
+	debug_elog4(DEBUG2, "GetNew: returning %u offset %" PRIu64,
+				result, *offset);
 	return result;
 }
 
@@ -1228,7 +1155,6 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
 	MultiXactOffset *offptr;
 	MultiXactOffset offset;
 	int			length;
-	int			truelength;
 	MultiXactId oldestMXact;
 	MultiXactId nextMXact;
 	MultiXactMember *ptr;
@@ -1304,16 +1230,7 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
 	 * Find out the offset at which we need to start reading MultiXactMembers
 	 * and the number of members in the multixact.  We determine the latter as
 	 * the difference between this multixact's starting offset and the next
-	 * one's.  However, there is one corner case to worry about:
-	 *
-	 * Because GetNewMultiXactId skips over offset zero, to reserve zero for
-	 * to mean "unset", there is an ambiguity near the point of offset
-	 * wraparound.  If we see next multixact's offset is one, is that our
-	 * multixact's actual endpoint, or did it end at zero with a subsequent
-	 * increment?  We handle this using the knowledge that if the zero'th
-	 * member slot wasn't filled, it'll contain zero, and zero isn't a valid
-	 * transaction ID so it can't be a multixact member.  Therefore, if we
-	 * read a zero from the members array, just ignore it.
+	 * one's.
 	 */
 	pageno = MultiXactIdToOffsetPage(multi);
 	entryno = MultiXactIdToOffsetEntry(multi);
@@ -1380,10 +1297,11 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
 	LWLockRelease(lock);
 	lock = NULL;
 
+	/* A multixid with zero members should not happen */
+	Assert(length > 0);
+
 	/* read the members */
 	ptr = (MultiXactMember *) palloc(length * sizeof(MultiXactMember));
-
-	truelength = 0;
 	prev_pageno = -1;
 	for (int i = 0; i < length; i++, offset++)
 	{
@@ -1420,37 +1338,27 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
 
 		xactptr = (TransactionId *)
 			(MultiXactMemberCtl->shared->page_buffer[slotno] + memberoff);
-
-		if (!TransactionIdIsValid(*xactptr))
-		{
-			/* Corner case: we must be looking at unused slot zero */
-			Assert(offset == 0);
-			continue;
-		}
+		Assert(TransactionIdIsValid(*xactptr));
 
 		flagsoff = MXOffsetToFlagsOffset(offset);
 		bshift = MXOffsetToFlagsBitShift(offset);
 		flagsptr = (uint32 *) (MultiXactMemberCtl->shared->page_buffer[slotno] + flagsoff);
 
-		ptr[truelength].xid = *xactptr;
-		ptr[truelength].status = (*flagsptr >> bshift) & MXACT_MEMBER_XACT_BITMASK;
-		truelength++;
+		ptr[i].xid = *xactptr;
+		ptr[i].status = (*flagsptr >> bshift) & MXACT_MEMBER_XACT_BITMASK;
 	}
 
 	LWLockRelease(lock);
 
-	/* A multixid with zero members should not happen */
-	Assert(truelength > 0);
-
 	/*
 	 * Copy the result into the local cache.
 	 */
-	mXactCachePut(multi, truelength, ptr);
+	mXactCachePut(multi, length, ptr);
 
 	debug_elog3(DEBUG2, "GetMembers: no cache for %s",
-				mxid_to_string(multi, truelength, ptr));
+				mxid_to_string(multi, length, ptr));
 	*members = ptr;
-	return truelength;
+	return length;
 }
 
 /*
@@ -1857,7 +1765,7 @@ MultiXactShmemInit(void)
 				  "pg_multixact/members", LWTRANCHE_MULTIXACTMEMBER_BUFFER,
 				  LWTRANCHE_MULTIXACTMEMBER_SLRU,
 				  SYNC_HANDLER_MULTIXACT_MEMBER,
-				  false);
+				  true);
 	/* doesn't call SimpleLruTruncate() or meet criteria for unit tests */
 
 	/* Initialize our shared state struct */
@@ -1912,48 +1820,6 @@ BootStrapMultiXact(void)
 	SimpleLruZeroAndWritePage(MultiXactMemberCtl, 0);
 }
 
-/*
- * MaybeExtendOffsetSlru
- *		Extend the offsets SLRU area, if necessary
- *
- * After a binary upgrade from <= 9.2, the pg_multixact/offsets SLRU area might
- * contain files that are shorter than necessary; this would occur if the old
- * installation had used multixacts beyond the first page (files cannot be
- * copied, because the on-disk representation is different).  pg_upgrade would
- * update pg_control to set the next offset value to be at that position, so
- * that tuples marked as locked by such MultiXacts would be seen as visible
- * without having to consult multixact.  However, trying to create and use a
- * new MultiXactId would result in an error because the page on which the new
- * value would reside does not exist.  This routine is in charge of creating
- * such pages.
- */
-static void
-MaybeExtendOffsetSlru(void)
-{
-	int64		pageno;
-	LWLock	   *lock;
-
-	pageno = MultiXactIdToOffsetPage(MultiXactState->nextMXact);
-	lock = SimpleLruGetBankLock(MultiXactOffsetCtl, pageno);
-
-	LWLockAcquire(lock, LW_EXCLUSIVE);
-
-	if (!SimpleLruDoesPhysicalPageExist(MultiXactOffsetCtl, pageno))
-	{
-		int			slotno;
-
-		/*
-		 * Fortunately for us, SimpleLruWritePage is already prepared to deal
-		 * with creating a new segment file even if the page we're writing is
-		 * not the first in it, so this is enough.
-		 */
-		slotno = SimpleLruZeroPage(MultiXactOffsetCtl, pageno);
-		SimpleLruWritePage(MultiXactOffsetCtl, slotno);
-	}
-
-	LWLockRelease(lock);
-}
-
 /*
  * This must be called ONCE during postmaster or standalone-backend startup.
  *
@@ -2092,8 +1958,8 @@ TrimMultiXact(void)
 	MultiXactState->finishedStartup = true;
 	LWLockRelease(MultiXactGenLock);
 
-	/* Now compute how far away the next members wraparound is. */
-	SetMultiXactIdLimit(oldestMXact, oldestMXactDB, true);
+	/* Now compute how far away the next multixid wraparound is. */
+	SetMultiXactIdLimit(oldestMXact, oldestMXactDB);
 }
 
 /*
@@ -2114,7 +1980,7 @@ MultiXactGetCheckptMulti(bool is_shutdown,
 	LWLockRelease(MultiXactGenLock);
 
 	debug_elog6(DEBUG2,
-				"MultiXact: checkpoint is nextMulti %u, nextOffset %u, oldestMulti %u in DB %u",
+				"MultiXact: checkpoint is nextMulti %u, nextOffset %" PRIu64 ", oldestMulti %u in DB %u",
 				*nextMulti, *nextMultiOffset, *oldestMulti, *oldestMultiDB);
 }
 
@@ -2149,26 +2015,12 @@ void
 MultiXactSetNextMXact(MultiXactId nextMulti,
 					  MultiXactOffset nextMultiOffset)
 {
-	debug_elog4(DEBUG2, "MultiXact: setting next multi to %u offset %u",
+	debug_elog4(DEBUG2, "MultiXact: setting next multi to %u offset %" PRIu64,
 				nextMulti, nextMultiOffset);
 	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
 	MultiXactState->nextMXact = nextMulti;
 	MultiXactState->nextOffset = nextMultiOffset;
 	LWLockRelease(MultiXactGenLock);
-
-	/*
-	 * During a binary upgrade, make sure that the offsets SLRU is large
-	 * enough to contain the next value that would be created.
-	 *
-	 * We need to do this pretty early during the first startup in binary
-	 * upgrade mode: before StartupMultiXact() in fact, because this routine
-	 * is called even before that by StartupXLOG().  And we can't do it
-	 * earlier than at this point, because during that first call of this
-	 * routine we determine the MultiXactState->nextMXact value that
-	 * MaybeExtendOffsetSlru needs.
-	 */
-	if (IsBinaryUpgrade)
-		MaybeExtendOffsetSlru();
 }
 
 /*
@@ -2176,28 +2028,24 @@ MultiXactSetNextMXact(MultiXactId nextMulti,
  * datminmxid (ie, the oldest MultiXactId that might exist in any database
  * of our cluster), and the OID of the (or a) database with that value.
  *
- * is_startup is true when we are just starting the cluster, false when we
- * are updating state in a running cluster.  This only affects log messages.
+ * This also updates MultiXactState->oldestOffset, by looking up the offset of
+ * MultiXactState->oldestMultiXactId.
  */
 void
-SetMultiXactIdLimit(MultiXactId oldest_datminmxid, Oid oldest_datoid,
-					bool is_startup)
+SetMultiXactIdLimit(MultiXactId oldest_datminmxid, Oid oldest_datoid)
 {
 	MultiXactId multiVacLimit;
 	MultiXactId multiWarnLimit;
 	MultiXactId multiStopLimit;
 	MultiXactId multiWrapLimit;
 	MultiXactId curMulti;
-	bool		needs_offset_vacuum;
 
 	Assert(MultiXactIdIsValid(oldest_datminmxid));
 
 	/*
 	 * We pretend that a wrap will happen halfway through the multixact ID
 	 * space, but that's not really true, because multixacts wrap differently
-	 * from transaction IDs.  Note that, separately from any concern about
-	 * multixact IDs wrapping, we must ensure that multixact members do not
-	 * wrap.  Limits for that are set in SetOffsetVacuumLimit, not here.
+	 * from transaction IDs.
 	 */
 	multiWrapLimit = oldest_datminmxid + (MaxMultiXactId >> 1);
 	if (multiWrapLimit < FirstMultiXactId)
@@ -2265,8 +2113,13 @@ SetMultiXactIdLimit(MultiXactId oldest_datminmxid, Oid oldest_datoid,
 
 	Assert(!InRecovery);
 
-	/* Set limits for offset vacuum. */
-	needs_offset_vacuum = SetOffsetVacuumLimit(is_startup);
+	/*
+	 * Offsets are 64 bits wide and never wrap around, so we don't need to
+	 * consider them for emergency autovacuum purposes.  But now that we're in
+	 * a consistent state, determine MultiXactState->oldestOffset, to be used
+	 * to calculate freezing cutoff to keep the offsets disk usage in check.
+	 */
+	SetOffsetVacuumLimit();
 
 	/*
 	 * If past the autovacuum force point, immediately signal an autovac
@@ -2275,8 +2128,7 @@ SetMultiXactIdLimit(MultiXactId oldest_datminmxid, Oid oldest_datoid,
 	 * database, it'll call here, and we'll signal the postmaster to start
 	 * another iteration immediately if there are still any old databases.
 	 */
-	if ((MultiXactIdPrecedes(multiVacLimit, curMulti) ||
-		 needs_offset_vacuum) && IsUnderPostmaster)
+	if (MultiXactIdPrecedes(multiVacLimit, curMulti) && IsUnderPostmaster)
 		SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER);
 
 	/* Give an immediate warning if past the wrap warn point */
@@ -2338,9 +2190,9 @@ MultiXactAdvanceNextMXact(MultiXactId minMulti,
 		debug_elog3(DEBUG2, "MultiXact: setting next multi to %u", minMulti);
 		MultiXactState->nextMXact = minMulti;
 	}
-	if (MultiXactOffsetPrecedes(MultiXactState->nextOffset, minMultiOffset))
+	if (MultiXactState->nextOffset < minMultiOffset)
 	{
-		debug_elog3(DEBUG2, "MultiXact: setting next offset to %u",
+		debug_elog3(DEBUG2, "MultiXact: setting next offset to %" PRIu64,
 					minMultiOffset);
 		MultiXactState->nextOffset = minMultiOffset;
 	}
@@ -2359,7 +2211,7 @@ MultiXactAdvanceOldest(MultiXactId oldestMulti, Oid oldestMultiDB)
 	Assert(InRecovery);
 
 	if (MultiXactIdPrecedes(MultiXactState->oldestMultiXactId, oldestMulti))
-		SetMultiXactIdLimit(oldestMulti, oldestMultiDB, false);
+		SetMultiXactIdLimit(oldestMulti, oldestMultiDB);
 }
 
 /*
@@ -2442,27 +2294,11 @@ ExtendMultiXactMember(MultiXactOffset offset, int nmembers)
 			LWLockRelease(lock);
 		}
 
-		/*
-		 * Compute the number of items till end of current page.  Careful: if
-		 * addition of unsigned ints wraps around, we're at the last page of
-		 * the last segment; since that page holds a different number of items
-		 * than other pages, we need to do it differently.
-		 */
-		if (offset + MAX_MEMBERS_IN_LAST_MEMBERS_PAGE < offset)
-		{
-			/*
-			 * This is the last page of the last segment; we can compute the
-			 * number of items left to allocate in it without modulo
-			 * arithmetic.
-			 */
-			difference = MaxMultiXactOffset - offset + 1;
-		}
-		else
-			difference = MULTIXACT_MEMBERS_PER_PAGE - offset % MULTIXACT_MEMBERS_PER_PAGE;
+		/* Compute the number of items till end of current page. */
+		difference = MULTIXACT_MEMBERS_PER_PAGE - offset % MULTIXACT_MEMBERS_PER_PAGE;
 
 		/*
-		 * Advance to next page, taking care to properly handle the wraparound
-		 * case.  OK if nmembers goes negative.
+		 * Advance to next page.  OK if nmembers goes negative.
 		 */
 		nmembers -= difference;
 		offset += difference;
@@ -2524,28 +2360,17 @@ GetOldestMultiXactId(void)
 }
 
 /*
- * Determine how aggressively we need to vacuum in order to prevent member
- * wraparound.
- *
- * To do so determine what's the oldest member offset and install the limit
- * info in MultiXactState, where it can be used to prevent overrun of old data
- * in the members SLRU area.
- *
- * The return value is true if emergency autovacuum is required and false
- * otherwise.
+ * Calculate the oldest member offset and install it in MultiXactState, where
+ * it can be used to adjust multixid freezing cutoffs.
  */
-static bool
-SetOffsetVacuumLimit(bool is_startup)
+static void
+SetOffsetVacuumLimit(void)
 {
 	MultiXactId oldestMultiXactId;
 	MultiXactId nextMXact;
 	MultiXactOffset oldestOffset = 0;	/* placate compiler */
-	MultiXactOffset prevOldestOffset;
 	MultiXactOffset nextOffset;
 	bool		oldestOffsetKnown = false;
-	bool		prevOldestOffsetKnown;
-	MultiXactOffset offsetStopLimit = 0;
-	MultiXactOffset prevOffsetStopLimit;
 
 	/*
 	 * NB: Have to prevent concurrent truncation, we might otherwise try to
@@ -2558,9 +2383,6 @@ SetOffsetVacuumLimit(bool is_startup)
 	oldestMultiXactId = MultiXactState->oldestMultiXactId;
 	nextMXact = MultiXactState->nextMXact;
 	nextOffset = MultiXactState->nextOffset;
-	prevOldestOffsetKnown = MultiXactState->oldestOffsetKnown;
-	prevOldestOffset = MultiXactState->oldestOffset;
-	prevOffsetStopLimit = MultiXactState->offsetStopLimit;
 	Assert(MultiXactState->finishedStartup);
 	LWLockRelease(MultiXactGenLock);
 
@@ -2583,121 +2405,39 @@ SetOffsetVacuumLimit(bool is_startup)
 	else
 	{
 		/*
-		 * Figure out where the oldest existing multixact's offsets are
-		 * stored. Due to bugs in early release of PostgreSQL 9.3.X and 9.4.X,
-		 * the supposedly-earliest multixact might not really exist.  We are
-		 * careful not to fail in that case.
+		 * Look up the offset at which the oldest existing multixact's members
+		 * are stored.  If we cannot find it, be careful not to fail, and
+		 * leave oldestOffset unchanged.  oldestOffset is initialized to zero
+		 * at system startup, which prevents truncating members until a proper
+		 * value is calculated.
+		 *
+		 * (We had bugs in early releases of PostgreSQL 9.3.X and 9.4.X where
+		 * the supposedly-earliest multixact might not really exist.  Those
+		 * should be long gone by now, so this should not fail, but let's
+		 * still be defensive.)
 		 */
 		oldestOffsetKnown =
 			find_multixact_start(oldestMultiXactId, &oldestOffset);
 
 		if (oldestOffsetKnown)
 			ereport(DEBUG1,
-					(errmsg_internal("oldest MultiXactId member is at offset %u",
+					(errmsg_internal("oldest MultiXactId member is at offset %" PRIu64,
 									 oldestOffset)));
 		else
 			ereport(LOG,
-					(errmsg("MultiXact member wraparound protections are disabled because oldest checkpointed MultiXact %u does not exist on disk",
+					(errmsg("oldest checkpointed MultiXact %u does not exist on disk",
 							oldestMultiXactId)));
 	}
 
 	LWLockRelease(MultiXactTruncationLock);
 
-	/*
-	 * If we can, compute limits (and install them MultiXactState) to prevent
-	 * overrun of old data in the members SLRU area. We can only do so if the
-	 * oldest offset is known though.
-	 */
+	/* Install the computed value */
 	if (oldestOffsetKnown)
 	{
-		/* move back to start of the corresponding segment */
-		offsetStopLimit = oldestOffset - (oldestOffset %
-										  (MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT));
-
-		/* always leave one segment before the wraparound point */
-		offsetStopLimit -= (MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT);
-
-		if (!prevOldestOffsetKnown && !is_startup)
-			ereport(LOG,
-					(errmsg("MultiXact member wraparound protections are now enabled")));
-
-		ereport(DEBUG1,
-				(errmsg_internal("MultiXact member stop limit is now %u based on MultiXact %u",
-								 offsetStopLimit, oldestMultiXactId)));
-	}
-	else if (prevOldestOffsetKnown)
-	{
-		/*
-		 * If we failed to get the oldest offset this time, but we have a
-		 * value from a previous pass through this function, use the old
-		 * values rather than automatically forcing an emergency autovacuum
-		 * cycle again.
-		 */
-		oldestOffset = prevOldestOffset;
-		oldestOffsetKnown = true;
-		offsetStopLimit = prevOffsetStopLimit;
+		LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
+		MultiXactState->oldestOffset = oldestOffset;
+		LWLockRelease(MultiXactGenLock);
 	}
-
-	/* Install the computed values */
-	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
-	MultiXactState->oldestOffset = oldestOffset;
-	MultiXactState->oldestOffsetKnown = oldestOffsetKnown;
-	MultiXactState->offsetStopLimit = offsetStopLimit;
-	LWLockRelease(MultiXactGenLock);
-
-	/*
-	 * Do we need an emergency autovacuum?	If we're not sure, assume yes.
-	 */
-	return !oldestOffsetKnown ||
-		(nextOffset - oldestOffset > MULTIXACT_MEMBER_SAFE_THRESHOLD);
-}
-
-/*
- * Return whether adding "distance" to "start" would move past "boundary".
- *
- * We use this to determine whether the addition is "wrapping around" the
- * boundary point, hence the name.  The reason we don't want to use the regular
- * 2^31-modulo arithmetic here is that we want to be able to use the whole of
- * the 2^32-1 space here, allowing for more multixacts than would fit
- * otherwise.
- */
-static bool
-MultiXactOffsetWouldWrap(MultiXactOffset boundary, MultiXactOffset start,
-						 uint32 distance)
-{
-	MultiXactOffset finish;
-
-	/*
-	 * Note that offset number 0 is not used (see GetMultiXactIdMembers), so
-	 * if the addition wraps around the UINT_MAX boundary, skip that value.
-	 */
-	finish = start + distance;
-	if (finish < start)
-		finish++;
-
-	/*-----------------------------------------------------------------------
-	 * When the boundary is numerically greater than the starting point, any
-	 * value numerically between the two is not wrapped:
-	 *
-	 *	<----S----B---->
-	 *	[---)			 = F wrapped past B (and UINT_MAX)
-	 *		 [---)		 = F not wrapped
-	 *			  [----] = F wrapped past B
-	 *
-	 * When the boundary is numerically less than the starting point (i.e. the
-	 * UINT_MAX wraparound occurs somewhere in between) then all values in
-	 * between are wrapped:
-	 *
-	 *	<----B----S---->
-	 *	[---)			 = F not wrapped past B (but wrapped past UINT_MAX)
-	 *		 [---)		 = F wrapped past B (and UINT_MAX)
-	 *			  [----] = F not wrapped
-	 *-----------------------------------------------------------------------
-	 */
-	if (start < boundary)
-		return finish >= boundary || finish < start;
-	else
-		return finish >= boundary && finish < start;
 }
 
 /*
@@ -2751,37 +2491,23 @@ find_multixact_start(MultiXactId multi, MultiXactOffset *result)
  * members: Number of member entries (nextOffset - oldestOffset)
  * oldestMultiXactId: Oldest MultiXact ID still in use
  * oldestOffset: Oldest offset still in use
- *
- * Returns false if unable to determine, the oldest offset being unknown.
  */
-bool
+void
 GetMultiXactInfo(uint32 *multixacts, MultiXactOffset *members,
 				 MultiXactId *oldestMultiXactId, MultiXactOffset *oldestOffset)
 {
 	MultiXactOffset nextOffset;
 	MultiXactId nextMultiXactId;
-	bool		oldestOffsetKnown;
 
 	LWLockAcquire(MultiXactGenLock, LW_SHARED);
 	nextOffset = MultiXactState->nextOffset;
 	*oldestMultiXactId = MultiXactState->oldestMultiXactId;
 	nextMultiXactId = MultiXactState->nextMXact;
 	*oldestOffset = MultiXactState->oldestOffset;
-	oldestOffsetKnown = MultiXactState->oldestOffsetKnown;
 	LWLockRelease(MultiXactGenLock);
 
-	if (!oldestOffsetKnown)
-	{
-		*members = 0;
-		*multixacts = 0;
-		*oldestMultiXactId = InvalidMultiXactId;
-		*oldestOffset = 0;
-		return false;
-	}
-
 	*members = nextOffset - *oldestOffset;
 	*multixacts = nextMultiXactId - *oldestMultiXactId;
-	return true;
 }
 
 /*
@@ -2790,26 +2516,27 @@ GetMultiXactInfo(uint32 *multixacts, MultiXactOffset *members,
  * vacuum_multixact_freeze_table_age work together to make sure we never have
  * too many multixacts; we hope that, at least under normal circumstances,
  * this will also be sufficient to keep us from using too many offsets.
- * However, if the average multixact has many members, we might exhaust the
- * members space while still using few enough members that these limits fail
- * to trigger relminmxid advancement by VACUUM.  At that point, we'd have no
- * choice but to start failing multixact-creating operations with an error.
- *
- * To prevent that, if more than a threshold portion of the members space is
- * used, we effectively reduce autovacuum_multixact_freeze_max_age and
- * to a value just less than the number of multixacts in use.  We hope that
- * this will quickly trigger autovacuuming on the table or tables with the
- * oldest relminmxid, thus allowing datminmxid values to advance and removing
- * some members.
- *
- * As the fraction of the member space currently in use grows, we become
- * more aggressive in clamping this value.  That not only causes autovacuum
- * to ramp up, but also makes any manual vacuums the user issues more
- * aggressive.  This happens because vacuum_get_cutoffs() will clamp the
- * freeze table and the minimum freeze age cutoffs based on the effective
- * autovacuum_multixact_freeze_max_age this function returns.  In the worst
- * case, we'll claim the freeze_max_age to zero, and every vacuum of any
- * table will freeze every multixact.
+ * However, if the average multixact has many members, we might accumulate a
+ * large number of members, consuming disk space, while still using few enough
+ * multixids that the multixid limits fail to trigger relminmxid advancement
+ * by VACUUM.
+ *
+ * To prevent that, if the members space usage exceeds a threshold
+ * (MULTIXACT_MEMBER_LOW_THRESHOLD), we effectively reduce
+ * autovacuum_multixact_freeze_max_age to a value just less than the number of
+ * multixacts in use.  We hope that this will quickly trigger autovacuuming on
+ * the table or tables with the oldest relminmxid, thus allowing datminmxid
+ * values to advance and removing some members.
+ *
+ * As the amount of the member space in use grows, we become more aggressive
+ * in clamping this value.  That not only causes autovacuum to ramp up, but
+ * also makes any manual vacuums the user issues more aggressive.  This
+ * happens because vacuum_get_cutoffs() will clamp the freeze table and the
+ * minimum freeze age cutoffs based on the effective
+ * autovacuum_multixact_freeze_max_age this function returns.  At the extreme,
+ * when the members usage reaches MULTIXACT_MEMBER_HIGH_THRESHOLD, we clamp
+ * freeze_max_age to zero, and every vacuum of any table will freeze every
+ * multixact.
  */
 int
 MultiXactMemberFreezeThreshold(void)
@@ -2822,21 +2549,27 @@ MultiXactMemberFreezeThreshold(void)
 	MultiXactId oldestMultiXactId;
 	MultiXactOffset oldestOffset;
 
-	/* If we can't determine member space utilization, assume the worst. */
-	if (!GetMultiXactInfo(&multixacts, &members, &oldestMultiXactId, &oldestOffset))
-		return 0;
+	/* Read the current offsets and members usage. */
+	GetMultiXactInfo(&multixacts, &members, &oldestMultiXactId, &oldestOffset);
 
 	/* If member space utilization is low, no special action is required. */
-	if (members <= MULTIXACT_MEMBER_SAFE_THRESHOLD)
+	if (members <= MULTIXACT_MEMBER_LOW_THRESHOLD)
 		return autovacuum_multixact_freeze_max_age;
 
 	/*
 	 * Compute a target for relminmxid advancement.  The number of multixacts
 	 * we try to eliminate from the system is based on how far we are past
-	 * MULTIXACT_MEMBER_SAFE_THRESHOLD.
-	 */
-	fraction = (double) (members - MULTIXACT_MEMBER_SAFE_THRESHOLD) /
-		(MULTIXACT_MEMBER_DANGER_THRESHOLD - MULTIXACT_MEMBER_SAFE_THRESHOLD);
+	 * MULTIXACT_MEMBER_LOW_THRESHOLD.
+	 *
+	 * The way this formula works is that when members is exactly at the low
+	 * threshold, fraction = 0.0, and we set freeze_max_age equal to
+	 * mxid_age(oldestMultiXactId).  As members grows further, towards the
+	 * high threshold, fraction grows linearly from 0.0 to 1.0, and the result
+	 * shrinks from mxid_age(oldestMultiXactId) to 0.  Beyond the high
+	 * threshold, fraction > 1.0 and the result is clamped to 0.
+	 */
+	fraction = (double) (members - MULTIXACT_MEMBER_LOW_THRESHOLD) /
+		(MULTIXACT_MEMBER_HIGH_THRESHOLD - MULTIXACT_MEMBER_LOW_THRESHOLD);
 	victim_multixacts = multixacts * fraction;
 
 	/* fraction could be > 1.0, but lowest possible freeze age is zero */
@@ -2877,36 +2610,12 @@ SlruScanDirCbFindEarliest(SlruCtl ctl, char *filename, int64 segpage, void *data
 
 /*
  * Delete members segments [oldest, newOldest)
- *
- * The members SLRU can, in contrast to the offsets one, be filled to almost
- * the full range at once. This means SimpleLruTruncate() can't trivially be
- * used - instead the to-be-deleted range is computed using the offsets
- * SLRU. C.f. TruncateMultiXact().
  */
 static void
 PerformMembersTruncation(MultiXactOffset oldestOffset, MultiXactOffset newOldestOffset)
 {
-	const int64 maxsegment = MXOffsetToMemberSegment(MaxMultiXactOffset);
-	int64		startsegment = MXOffsetToMemberSegment(oldestOffset);
-	int64		endsegment = MXOffsetToMemberSegment(newOldestOffset);
-	int64		segment = startsegment;
-
-	/*
-	 * Delete all the segments but the last one. The last segment can still
-	 * contain, possibly partially, valid data.
-	 */
-	while (segment != endsegment)
-	{
-		elog(DEBUG2, "truncating multixact members segment %" PRIx64,
-			 segment);
-		SlruDeleteSegment(MultiXactMemberCtl, segment);
-
-		/* move to next segment, handling wraparound correctly */
-		if (segment == maxsegment)
-			segment = 0;
-		else
-			segment += 1;
-	}
+	SimpleLruTruncate(MultiXactMemberCtl,
+					  MXOffsetToMemberPage(newOldestOffset));
 }
 
 /*
@@ -3050,7 +2759,7 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 
 	elog(DEBUG1, "performing multixact truncation: "
 		 "offsets [%u, %u), offsets segments [%" PRIx64 ", %" PRIx64 "), "
-		 "members [%u, %u), members segments [%" PRIx64 ", %" PRIx64 ")",
+		 "members [%" PRIu64 ", %" PRIu64 "), members segments [%" PRIx64 ", %" PRIx64 ")",
 		 oldestMulti, newOldestMulti,
 		 MultiXactIdToOffsetSegment(oldestMulti),
 		 MultiXactIdToOffsetSegment(newOldestMulti),
@@ -3091,6 +2800,7 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
 	MultiXactState->oldestMultiXactId = newOldestMulti;
 	MultiXactState->oldestMultiXactDB = newOldestMultiDB;
+	MultiXactState->oldestOffset = newOldestOffset;
 	LWLockRelease(MultiXactGenLock);
 
 	/* First truncate members */
@@ -3130,20 +2840,13 @@ MultiXactOffsetPagePrecedes(int64 page1, int64 page2)
 
 /*
  * Decide whether a MultiXactMember page number is "older" for truncation
- * purposes.  There is no "invalid offset number" so use the numbers verbatim.
+ * purposes.  There is no "invalid offset number" and members never wrap
+ * around, so use the numbers verbatim.
  */
 static bool
 MultiXactMemberPagePrecedes(int64 page1, int64 page2)
 {
-	MultiXactOffset offset1;
-	MultiXactOffset offset2;
-
-	offset1 = ((MultiXactOffset) page1) * MULTIXACT_MEMBERS_PER_PAGE;
-	offset2 = ((MultiXactOffset) page2) * MULTIXACT_MEMBERS_PER_PAGE;
-
-	return (MultiXactOffsetPrecedes(offset1, offset2) &&
-			MultiXactOffsetPrecedes(offset1,
-									offset2 + MULTIXACT_MEMBERS_PER_PAGE - 1));
+	return page1 < page2;
 }
 
 /*
@@ -3175,17 +2878,6 @@ MultiXactIdPrecedesOrEquals(MultiXactId multi1, MultiXactId multi2)
 }
 
 
-/*
- * Decide which of two offsets is earlier.
- */
-static bool
-MultiXactOffsetPrecedes(MultiXactOffset offset1, MultiXactOffset offset2)
-{
-	int32		diff = (int32) (offset1 - offset2);
-
-	return (diff < 0);
-}
-
 /*
  * Write a TRUNCATE xlog record
  *
@@ -3278,7 +2970,7 @@ multixact_redo(XLogReaderState *record)
 
 		elog(DEBUG1, "replaying multixact truncation: "
 			 "offsets [%u, %u), offsets segments [%" PRIx64 ", %" PRIx64 "), "
-			 "members [%u, %u), members segments [%" PRIx64 ", %" PRIx64 ")",
+			 "members [%" PRIu64 ", %" PRIu64 "), members segments [%" PRIx64 ", %" PRIx64 ")",
 			 xlrec.startTruncOff, xlrec.endTruncOff,
 			 MultiXactIdToOffsetSegment(xlrec.startTruncOff),
 			 MultiXactIdToOffsetSegment(xlrec.endTruncOff),
@@ -3293,7 +2985,7 @@ multixact_redo(XLogReaderState *record)
 		 * Advance the horizon values, so they're current at the end of
 		 * recovery.
 		 */
-		SetMultiXactIdLimit(xlrec.endTruncOff, xlrec.oldestMultiDB, false);
+		SetMultiXactIdLimit(xlrec.endTruncOff, xlrec.oldestMultiDB);
 
 		PerformMembersTruncation(xlrec.startTruncMemb, xlrec.endTruncMemb);
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 22d0a2e8c3a..a000b8bd509 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5139,7 +5139,7 @@ BootStrapXLOG(uint32 data_checksum_version)
 		FullTransactionIdFromEpochAndXid(0, FirstNormalTransactionId);
 	checkPoint.nextOid = FirstGenbkiObjectId;
 	checkPoint.nextMulti = FirstMultiXactId;
-	checkPoint.nextMultiOffset = 0;
+	checkPoint.nextMultiOffset = 1;
 	checkPoint.oldestXid = FirstNormalTransactionId;
 	checkPoint.oldestXidDB = Template1DbOid;
 	checkPoint.oldestMulti = FirstMultiXactId;
@@ -5155,7 +5155,7 @@ BootStrapXLOG(uint32 data_checksum_version)
 	MultiXactSetNextMXact(checkPoint.nextMulti, checkPoint.nextMultiOffset);
 	AdvanceOldestClogXid(checkPoint.oldestXid);
 	SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
-	SetMultiXactIdLimit(checkPoint.oldestMulti, checkPoint.oldestMultiDB, true);
+	SetMultiXactIdLimit(checkPoint.oldestMulti, checkPoint.oldestMultiDB);
 	SetCommitTsLimit(InvalidTransactionId, InvalidTransactionId);
 
 	/* Set up the XLOG page header */
@@ -5636,7 +5636,7 @@ StartupXLOG(void)
 	MultiXactSetNextMXact(checkPoint.nextMulti, checkPoint.nextMultiOffset);
 	AdvanceOldestClogXid(checkPoint.oldestXid);
 	SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
-	SetMultiXactIdLimit(checkPoint.oldestMulti, checkPoint.oldestMultiDB, true);
+	SetMultiXactIdLimit(checkPoint.oldestMulti, checkPoint.oldestMultiDB);
 	SetCommitTsLimit(checkPoint.oldestCommitTsXid,
 					 checkPoint.newestCommitTsXid);
 
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 21b8f179ba0..51dea342a4d 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -886,7 +886,7 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 							 U64FromFullTransactionId(checkPoint.nextXid),
 							 checkPoint.nextOid)));
 	ereport(DEBUG1,
-			(errmsg_internal("next MultiXactId: %u; next MultiXactOffset: %u",
+			(errmsg_internal("next MultiXactId: %u; next MultiXactOffset: %" PRIu64,
 							 checkPoint.nextMulti, checkPoint.nextMultiOffset)));
 	ereport(DEBUG1,
 			(errmsg_internal("oldest unfrozen transaction ID: %u, in database %u",
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index e785dd55ce5..7780ea6eae3 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -1145,8 +1145,8 @@ vacuum_get_cutoffs(Relation rel, const VacuumParams params,
 
 	/*
 	 * Also compute the multixact age for which freezing is urgent.  This is
-	 * normally autovacuum_multixact_freeze_max_age, but may be less if we are
-	 * short of multixact member space.
+	 * normally autovacuum_multixact_freeze_max_age, but may be less if
+	 * multixact members are bloated.
 	 */
 	effective_multixact_freeze_max_age = MultiXactMemberFreezeThreshold();
 
@@ -1971,7 +1971,7 @@ vac_truncate_clog(TransactionId frozenXID,
 	 * signaling twice?
 	 */
 	SetTransactionIdLimit(frozenXID, oldestxid_datoid);
-	SetMultiXactIdLimit(minMulti, minmulti_datoid, false);
+	SetMultiXactIdLimit(minMulti, minmulti_datoid);
 
 	LWLockRelease(WrapLimitsVacuumLock);
 }
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 1c38488f2cb..f4830f896f3 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -1936,8 +1936,8 @@ do_autovacuum(void)
 
 	/*
 	 * Compute the multixact age for which freezing is urgent.  This is
-	 * normally autovacuum_multixact_freeze_max_age, but may be less if we are
-	 * short of multixact member space.
+	 * normally autovacuum_multixact_freeze_max_age, but may be less if
+	 * multixact members are bloated.
 	 */
 	effective_multixact_freeze_max_age = MultiXactMemberFreezeThreshold();
 
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index 30ad46912e1..a4060309ae0 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -271,7 +271,7 @@ main(int argc, char *argv[])
 		   ControlFile->checkPointCopy.nextOid);
 	printf(_("Latest checkpoint's NextMultiXactId:  %u\n"),
 		   ControlFile->checkPointCopy.nextMulti);
-	printf(_("Latest checkpoint's NextMultiOffset:  %u\n"),
+	printf(_("Latest checkpoint's NextMultiOffset:  %" PRIu64 "\n"),
 		   ControlFile->checkPointCopy.nextMultiOffset);
 	printf(_("Latest checkpoint's oldestXID:        %u\n"),
 		   ControlFile->checkPointCopy.oldestXid);
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index c667a11cb6a..d5de4a7171a 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -115,6 +115,7 @@ static void KillExistingWALSummaries(void);
 static void WriteEmptyXLOG(void);
 static void usage(void);
 static uint32 strtouint32_strict(const char *restrict s, char **restrict endptr, int base);
+static uint64 strtouint64_strict(const char *restrict s, char **restrict endptr, int base);
 
 
 int
@@ -293,7 +294,7 @@ main(int argc, char *argv[])
 
 			case 'O':
 				errno = 0;
-				next_mxoff_val = strtouint32_strict(optarg, &endptr, 0);
+				next_mxoff_val = strtouint64_strict(optarg, &endptr, 0);
 				if (endptr == optarg || *endptr != '\0' || errno != 0)
 				{
 					pg_log_error("invalid argument for option %s", "-O");
@@ -772,7 +773,7 @@ PrintControlValues(bool guessed)
 		   ControlFile.checkPointCopy.nextOid);
 	printf(_("Latest checkpoint's NextMultiXactId:  %u\n"),
 		   ControlFile.checkPointCopy.nextMulti);
-	printf(_("Latest checkpoint's NextMultiOffset:  %u\n"),
+	printf(_("Latest checkpoint's NextMultiOffset:  %" PRIu64 "\n"),
 		   ControlFile.checkPointCopy.nextMultiOffset);
 	printf(_("Latest checkpoint's oldestXID:        %u\n"),
 		   ControlFile.checkPointCopy.oldestXid);
@@ -848,7 +849,7 @@ PrintNewControlValues(void)
 
 	if (next_mxoff_given)
 	{
-		printf(_("NextMultiOffset:                      %u\n"),
+		printf(_("NextMultiOffset:                      %" PRIu64 "\n"),
 			   ControlFile.checkPointCopy.nextMultiOffset);
 	}
 
@@ -1276,3 +1277,34 @@ strtouint32_strict(const char *restrict s, char **restrict endptr, int base)
 
 	return (uint32) val;
 }
+
+/*
+ * strtouint64_strict -- like strtou64(), but doesn't accept negative values
+ */
+static uint64
+strtouint64_strict(const char *restrict s, char **restrict endptr, int base)
+{
+	uint64		val;
+	bool		is_neg;
+
+	/* skip leading whitespace */
+	while (isspace((unsigned char) *s))
+		s++;
+
+	/*
+	 * Is it negative?  We still call strtou64() if it was, to set 'endptr'.
+	 * (The current callers don't care though.)
+	 */
+	is_neg = (*s == '-');
+
+	val = strtou64(s, endptr, base);
+
+	/* reject if it was negative */
+	if (errno == 0 && is_neg)
+	{
+		errno = ERANGE;
+		val = 0;
+	}
+
+	return val;
+}
diff --git a/src/bin/pg_resetwal/t/001_basic.pl b/src/bin/pg_resetwal/t/001_basic.pl
index e9780dbe2a6..4ae51ee574e 100644
--- a/src/bin/pg_resetwal/t/001_basic.pl
+++ b/src/bin/pg_resetwal/t/001_basic.pl
@@ -230,7 +230,7 @@ push @cmd,
   sprintf("%d,%d", hex($files[0]) == 0 ? 3 : hex($files[0]), hex($files[-1]));
 
 @files = get_slru_files('pg_multixact/offsets');
-$mult = 32 * $blcksz / 4;
+$mult = 32 * $blcksz / 8;
 # --multixact-ids argument is "new,old"
 push @cmd,
   '--multixact-ids' => sprintf("%d,%d",
diff --git a/src/bin/pg_upgrade/Makefile b/src/bin/pg_upgrade/Makefile
index 69fcf593cae..12f747b2c59 100644
--- a/src/bin/pg_upgrade/Makefile
+++ b/src/bin/pg_upgrade/Makefile
@@ -18,11 +18,14 @@ OBJS = \
 	file.o \
 	function.o \
 	info.o \
+	multixact_rewrite.o \
+	multixact_read_v18.o \
 	option.o \
 	parallel.o \
 	pg_upgrade.o \
 	relfilenumber.o \
 	server.o \
+	slru_io.o \
 	tablespace.o \
 	task.o \
 	util.o \
diff --git a/src/bin/pg_upgrade/meson.build b/src/bin/pg_upgrade/meson.build
index ac992f0d14b..7bd7062b62f 100644
--- a/src/bin/pg_upgrade/meson.build
+++ b/src/bin/pg_upgrade/meson.build
@@ -8,11 +8,14 @@ pg_upgrade_sources = files(
   'file.c',
   'function.c',
   'info.c',
+  'multixact_rewrite.c',
+  'multixact_read_v18.c',
   'option.c',
   'parallel.c',
   'pg_upgrade.c',
   'relfilenumber.c',
   'server.c',
+  'slru_io.c',
   'tablespace.c',
   'task.c',
   'util.c',
@@ -47,6 +50,7 @@ tests += {
       't/004_subscription.pl',
       't/005_char_signedness.pl',
       't/006_transfer_modes.pl',
+      't/007_multixact_conversion.pl',
     ],
     'test_kwargs': {'priority': 40}, # pg_upgrade tests are slow
   },
diff --git a/src/bin/pg_upgrade/multixact_read_v18.c b/src/bin/pg_upgrade/multixact_read_v18.c
new file mode 100644
index 00000000000..fb537668a2c
--- /dev/null
+++ b/src/bin/pg_upgrade/multixact_read_v18.c
@@ -0,0 +1,337 @@
+/*
+ * multixact_read_v18.c
+ *
+ * Functions to read multixact SLRUs from a cluster of PostgreSQL version 18
+ * or older. In version 19, the multixid offsets were expanded from 32 to 64
+ * bits.
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ * src/bin/pg_upgrade/multixact_read_v18.c
+ */
+
+#include "postgres_fe.h"
+
+#include "multixact_read_v18.h"
+#include "pg_upgrade.h"
+
+/*
+ * NOTE: below are a bunch of definitions that are copy-pasted from
+ * multixact.c from version 18.  It's important that this file doesn't
+ * #include the new definitions with same names from "multixact_internal.h"!
+ *
+ * To avoid confusion in the functions exposed outside this source file,
+ * though, we use OldMultiXactOffset to represent the old-style 32-bit
+ * multixid offsets. The new 64-bit MultiXactOffset should not be used
+ * anywhere in this file.
+ */
+#define MultiXactOffset should_not_be_used
+
+/* We need four bytes per offset. */
+#define MULTIXACT_OFFSETS_PER_PAGE (BLCKSZ / sizeof(OldMultiXactOffset))
+
+static inline int64
+MultiXactIdToOffsetPage(MultiXactId multi)
+{
+	return multi / MULTIXACT_OFFSETS_PER_PAGE;
+}
+
+static inline int
+MultiXactIdToOffsetEntry(MultiXactId multi)
+{
+	return multi % MULTIXACT_OFFSETS_PER_PAGE;
+}
+
+/*
+ * The situation for members is a bit more complex: we store one byte of
+ * additional flag bits for each TransactionId.  To do this without getting
+ * into alignment issues, we store four bytes of flags, and then the
+ * corresponding 4 Xids.  Each such 5-word (20-byte) set we call a "group", and
+ * are stored as a whole in pages.  Thus, with 8kB BLCKSZ, we keep 409 groups
+ * per page.  This wastes 12 bytes per page, but that's OK -- simplicity (and
+ * performance) trumps space efficiency here.
+ *
+ * Note that the "offset" macros work with byte offset, not array indexes, so
+ * arithmetic must be done using "char *" pointers.
+ */
+/* We need eight bits per xact, so one xact fits in a byte */
+#define MXACT_MEMBER_BITS_PER_XACT			8
+#define MXACT_MEMBER_FLAGS_PER_BYTE			1
+#define MXACT_MEMBER_XACT_BITMASK	((1 << MXACT_MEMBER_BITS_PER_XACT) - 1)
+
+/* how many full bytes of flags are there in a group? */
+#define MULTIXACT_FLAGBYTES_PER_GROUP		4
+#define MULTIXACT_MEMBERS_PER_MEMBERGROUP	\
+	(MULTIXACT_FLAGBYTES_PER_GROUP * MXACT_MEMBER_FLAGS_PER_BYTE)
+/* size in bytes of a complete group */
+#define MULTIXACT_MEMBERGROUP_SIZE \
+	(sizeof(TransactionId) * MULTIXACT_MEMBERS_PER_MEMBERGROUP + MULTIXACT_FLAGBYTES_PER_GROUP)
+#define MULTIXACT_MEMBERGROUPS_PER_PAGE (BLCKSZ / MULTIXACT_MEMBERGROUP_SIZE)
+#define MULTIXACT_MEMBERS_PER_PAGE	\
+	(MULTIXACT_MEMBERGROUPS_PER_PAGE * MULTIXACT_MEMBERS_PER_MEMBERGROUP)
+
+/* page in which a member is to be found */
+static inline int64
+MXOffsetToMemberPage(OldMultiXactOffset offset)
+{
+	return offset / MULTIXACT_MEMBERS_PER_PAGE;
+}
+
+/* Location (byte offset within page) of flag word for a given member */
+static inline int
+MXOffsetToFlagsOffset(OldMultiXactOffset offset)
+{
+	OldMultiXactOffset group = offset / MULTIXACT_MEMBERS_PER_MEMBERGROUP;
+	int			grouponpg = group % MULTIXACT_MEMBERGROUPS_PER_PAGE;
+	int			byteoff = grouponpg * MULTIXACT_MEMBERGROUP_SIZE;
+
+	return byteoff;
+}
+
+/* Location (byte offset within page) of TransactionId of given member */
+static inline int
+MXOffsetToMemberOffset(OldMultiXactOffset offset)
+{
+	int			member_in_group = offset % MULTIXACT_MEMBERS_PER_MEMBERGROUP;
+
+	return MXOffsetToFlagsOffset(offset) +
+		MULTIXACT_FLAGBYTES_PER_GROUP +
+		member_in_group * sizeof(TransactionId);
+}
+
+static inline int
+MXOffsetToFlagsBitShift(OldMultiXactOffset offset)
+{
+	int			member_in_group = offset % MULTIXACT_MEMBERS_PER_MEMBERGROUP;
+	int			bshift = member_in_group * MXACT_MEMBER_BITS_PER_XACT;
+
+	return bshift;
+}
+
+/*
+ * Construct reader of old multixacts.
+ *
+ * Returns the malloc'd memory used by all the other calls in this module.
+ */
+OldMultiXactReader *
+AllocOldMultiXactRead(char *pgdata, MultiXactId nextMulti,
+					  OldMultiXactOffset nextOffset)
+{
+	OldMultiXactReader *state = pg_malloc(sizeof(*state));
+	char		dir[MAXPGPATH] = {0};
+
+	state->nextMXact = nextMulti;
+	state->nextOffset = nextOffset;
+
+	pg_sprintf(dir, "%s/pg_multixact/offsets", pgdata);
+	state->offset = AllocSlruRead(dir, false);
+
+	pg_sprintf(dir, "%s/pg_multixact/members", pgdata);
+	state->members = AllocSlruRead(dir, false);
+
+	return state;
+}
+
+/*
+ * This is a simplified version of the GetMultiXactIdMembers() server
+ * function:
+ *
+ * - Only return the updating member, if any.  Upgrade only cares about the
+ *   updaters.  If there is no updating member, return somewhat arbitrarily
+ *   the first locking-only member, because we don't have any way to represent
+ *   "no members".
+ *
+ * - Because there's no concurrent activity, we don't need to worry about
+ *   locking and some corner cases.
+ *
+ * - Don't bail out on invalid entries.  If the server crashes, it can leave
+ *   invalid or half-written entries on disk. Such multixids won't appear
+ *   anywhere else on disk, so the server will never try to read them.  During
+ *   upgrade, however, we scan through all multixids in order, and will
+ *   encounter such invalid but unreferenced multixids too.
+ *
+ * Returns true on success, false if the multixact was invalid.
+ */
+bool
+GetOldMultiXactIdSingleMember(OldMultiXactReader *state, MultiXactId multi,
+							  MultiXactMember *member)
+{
+	MultiXactId nextMXact,
+				tmpMXact;
+	OldMultiXactOffset nextOffset;
+	int64		pageno,
+				prev_pageno;
+	int			entryno,
+				length;
+	char	   *buf;
+	OldMultiXactOffset *offptr,
+				offset;
+	OldMultiXactOffset nextMXOffset;
+	TransactionId result_xid = InvalidTransactionId;
+	MultiXactStatus result_status = 0;
+
+	nextMXact = state->nextMXact;
+	nextOffset = state->nextOffset;
+
+	/*
+	 * Comment copied from GetMultiXactIdMembers in PostgreSQL v18
+	 * multixact.c:
+	 *
+	 * Find out the offset at which we need to start reading MultiXactMembers
+	 * and the number of members in the multixact.  We determine the latter as
+	 * the difference between this multixact's starting offset and the next
+	 * one's.  However, there are some corner cases to worry about:
+	 *
+	 * 1. This multixact may be the latest one created, in which case there is
+	 * no next one to look at.  The next multixact's offset should be set
+	 * already, as we set it in RecordNewMultiXact(), but we used to not do
+	 * that in older minor versions.  To cope with that case, if this
+	 * multixact is the latest one created, use the nextOffset value we read
+	 * above as the endpoint.
+	 *
+	 * 2. Because GetNewMultiXactId skips over offset zero, to reserve zero
+	 * to mean "unset", there is an ambiguity near the point of offset
+	 * wraparound.  If we see next multixact's offset is one, is that our
+	 * multixact's actual endpoint, or did it end at zero with a subsequent
+	 * increment?  We handle this using the knowledge that if the zero'th
+	 * member slot wasn't filled, it'll contain zero, and zero isn't a valid
+	 * transaction ID so it can't be a multixact member.  Therefore, if we
+	 * read a zero from the members array, just ignore it.
+	 */
+
+	pageno = MultiXactIdToOffsetPage(multi);
+	entryno = MultiXactIdToOffsetEntry(multi);
+
+	buf = SlruReadSwitchPage(state->offset, pageno);
+	offptr = (OldMultiXactOffset *) buf;
+	offptr += entryno;
+	offset = *offptr;
+
+	if (offset == 0)
+	{
+		/* Invalid entry */
+		return false;
+	}
+
+	/*
+	 * Use the same increment rule as GetNewMultiXactId(), that is, don't
+	 * handle wraparound explicitly until needed.
+	 */
+	tmpMXact = multi + 1;
+
+	if (nextMXact == tmpMXact)
+	{
+		/* Corner case 1: there is no next multixact */
+		nextMXOffset = nextOffset;
+	}
+	else
+	{
+		/* handle wraparound if needed */
+		if (tmpMXact < FirstMultiXactId)
+			tmpMXact = FirstMultiXactId;
+
+		prev_pageno = pageno;
+
+		pageno = MultiXactIdToOffsetPage(tmpMXact);
+		entryno = MultiXactIdToOffsetEntry(tmpMXact);
+
+		if (pageno != prev_pageno)
+			buf = SlruReadSwitchPage(state->offset, pageno);
+
+		offptr = (OldMultiXactOffset *) buf;
+		offptr += entryno;
+		nextMXOffset = *offptr;
+	}
+
+	if (nextMXOffset == 0)
+	{
+		/* Invalid entry */
+		return false;
+	}
+	length = nextMXOffset - offset;
+
+	/* read the members */
+	prev_pageno = -1;
+	for (int i = 0; i < length; i++, offset++)
+	{
+		TransactionId *xactptr;
+		uint32	   *flagsptr;
+		int			flagsoff;
+		int			bshift;
+		int			memberoff;
+		MultiXactStatus status;
+
+		pageno = MXOffsetToMemberPage(offset);
+		memberoff = MXOffsetToMemberOffset(offset);
+
+		if (pageno != prev_pageno)
+		{
+			buf = SlruReadSwitchPage(state->members, pageno);
+			prev_pageno = pageno;
+		}
+
+		xactptr = (TransactionId *) (buf + memberoff);
+		if (!TransactionIdIsValid(*xactptr))
+		{
+			/*
+			 * Corner case 2: we are looking at unused slot zero
+			 */
+			if (offset == 0)
+				continue;
+
+			/*
+			 * Otherwise this is an invalid entry that should not be
+			 * referenced from anywhere in the heap.  We could return 'false'
+			 * here, but we prefer to continue reading the members and
+			 * converting them the best we can, to preserve evidence in case
+			 * this is corruption that should not happen.
+			 */
+		}
+
+		flagsoff = MXOffsetToFlagsOffset(offset);
+		bshift = MXOffsetToFlagsBitShift(offset);
+		flagsptr = (uint32 *) (buf + flagsoff);
+
+		status = (*flagsptr >> bshift) & MXACT_MEMBER_XACT_BITMASK;
+
+		/*
+		 * Remember the updating XID among the members, or first locking XID
+		 * if no updating XID.
+		 */
+		if (ISUPDATE_from_mxstatus(status))
+		{
+			/* sanity check */
+			if (ISUPDATE_from_mxstatus(result_status))
+			{
+				/*
+				 * We don't expect to see more than one updating member, even
+				 * if the server had crashed.
+				 */
+				pg_fatal("multixact %u has more than one updating member",
+						 multi);
+			}
+			result_xid = *xactptr;
+			result_status = status;
+		}
+		else if (!TransactionIdIsValid(result_xid))
+		{
+			result_xid = *xactptr;
+			result_status = status;
+		}
+	}
+
+	member->xid = result_xid;
+	member->status = result_status;
+	return true;
+}
+
+/*
+ * Frees the malloced reader.
+ */
+void
+FreeOldMultiXactReader(OldMultiXactReader *state)
+{
+	FreeSlruRead(state->offset);
+	FreeSlruRead(state->members);
+
+	pg_free(state);
+}
diff --git a/src/bin/pg_upgrade/multixact_read_v18.h b/src/bin/pg_upgrade/multixact_read_v18.h
new file mode 100644
index 00000000000..8ee82a14a46
--- /dev/null
+++ b/src/bin/pg_upgrade/multixact_read_v18.h
@@ -0,0 +1,37 @@
+/*
+ * multixact_read_v18.h
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ * src/bin/pg_upgrade/multixact_read_v18.h
+ */
+#ifndef MULTIXACT_READ_V18_H
+#define MULTIXACT_READ_V18_H
+
+#include "access/multixact.h"
+#include "slru_io.h"
+
+/*
+ * MultiXactOffset changed from uint32 to uint64 between versions 18 and 19.
+ * OldMultiXactOffset is used to represent a 32-bit offset from the old
+ * cluster.
+ */
+typedef uint32 OldMultiXactOffset;
+
+typedef struct OldMultiXactReader
+{
+	MultiXactId nextMXact;
+	OldMultiXactOffset nextOffset;
+
+	SlruSegState *offset;
+	SlruSegState *members;
+} OldMultiXactReader;
+
+extern OldMultiXactReader *AllocOldMultiXactRead(char *pgdata,
+												 MultiXactId nextMulti,
+												 OldMultiXactOffset nextOffset);
+extern bool GetOldMultiXactIdSingleMember(OldMultiXactReader *state,
+										  MultiXactId multi,
+										  MultiXactMember *member);
+extern void FreeOldMultiXactReader(OldMultiXactReader *reader);
+
+#endif							/* MULTIXACT_READ_V18_H */
diff --git a/src/bin/pg_upgrade/multixact_rewrite.c b/src/bin/pg_upgrade/multixact_rewrite.c
new file mode 100644
index 00000000000..d483b2ff31f
--- /dev/null
+++ b/src/bin/pg_upgrade/multixact_rewrite.c
@@ -0,0 +1,195 @@
+/*
+ * multixact_rewrite.c
+ *
+ * Functions to convert multixact SLRUs from the pre-v19 format to the current
+ * format with 64-bit MultiXactOffsets.
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ * src/bin/pg_upgrade/multixact_rewrite.c
+ */
+
+#include "postgres_fe.h"
+
+#include "access/multixact_internal.h"
+#include "multixact_read_v18.h"
+#include "pg_upgrade.h"
+
+static void RecordMultiXactOffset(SlruSegState *offsets_writer, MultiXactId multi,
+								  MultiXactOffset offset);
+static void RecordMultiXactMembers(SlruSegState *members_writer,
+								   MultiXactOffset offset,
+								   int nmembers, MultiXactMember *members);
+
+/*
+ * Convert pg_multixact/offsets and /members from the old pre-v19 format with
+ * 32-bit offsets to the current format.
+ *
+ * Multixids in the range [from_multi, to_multi) are read from the old
+ * cluster, and written in the new format.  An important edge case is that if
+ * from_multi == to_multi, this initializes the new pg_multixact files in the
+ * new format without trying to open any old files.  (We rely on that when
+ * upgrading from PostgreSQL version 9.2 or below.)
+ *
+ * Returns the new nextOffset value; the caller should set it in the new
+ * control file.  The new members always start from offset 1, regardless of
+ * the offset range used in the old cluster.
+ */
+MultiXactOffset
+rewrite_multixacts(MultiXactId from_multi, MultiXactId to_multi)
+{
+	MultiXactId oldest_multi,
+				next_multi;
+	MultiXactOffset next_offset;
+	SlruSegState *offsets_writer;
+	SlruSegState *members_writer;
+	char		dir[MAXPGPATH] = {0};
+	bool		prev_multixid_valid = false;
+
+	/*
+	 * The range of valid multi XIDs is unchanged by the conversion (they are
+	 * referenced from the heap tables), but the members SLRU is rewritten to
+	 * start from offset 1.
+	 */
+	oldest_multi = from_multi;
+	next_multi = to_multi;
+	next_offset = 1;
+
+	/* Prepare to write the new SLRU files */
+	pg_sprintf(dir, "%s/pg_multixact/offsets", new_cluster.pgdata);
+	offsets_writer = AllocSlruWrite(dir, false);
+	SlruWriteSwitchPage(offsets_writer, MultiXactIdToOffsetPage(from_multi));
+
+	pg_sprintf(dir, "%s/pg_multixact/members", new_cluster.pgdata);
+	members_writer = AllocSlruWrite(dir, true /* use long segment names */ );
+	SlruWriteSwitchPage(members_writer, MXOffsetToMemberPage(next_offset));
+
+	/*
+	 * Convert old multixids, if needed, by reading them one-by-one from the
+	 * old cluster.
+	 */
+	if (to_multi != from_multi)
+	{
+		OldMultiXactReader *old_reader;
+
+		old_reader = AllocOldMultiXactRead(old_cluster.pgdata,
+										   old_cluster.controldata.chkpnt_nxtmulti,
+										   old_cluster.controldata.chkpnt_nxtmxoff);
+
+		for (MultiXactId multi = oldest_multi; multi != next_multi;)
+		{
+			MultiXactMember member;
+			bool		multixid_valid;
+
+			/*
+			 * Read this multixid's members.
+			 *
+			 * Locking-only XIDs that may be part of multi-xids don't matter
+			 * after upgrade, as there can be no transactions running across
+			 * upgrade.  So as a small optimization, we only read one member
+			 * from each multixid: the updating one, or if there was no
+			 * update, arbitrarily the first locking xid.
+			 */
+			multixid_valid = GetOldMultiXactIdSingleMember(old_reader, multi, &member);
+
+			/*
+			 * Write the new offset to pg_multixact/offsets.
+			 *
+			 * If the old multixid was invalid, we still need to write this
+			 * offset if the *previous* multixid was valid.  That's because
+			 * when reading a multixid, the number of members is
+			 * calculated from the difference between the current and the next
+			 * multixid's offsets.
+			 */
+			RecordMultiXactOffset(offsets_writer, multi,
+								  (multixid_valid || prev_multixid_valid) ? next_offset : 0);
+
+			if (multixid_valid)
+			{
+				RecordMultiXactMembers(members_writer, next_offset, 1, &member);
+				next_offset += 1;
+			}
+
+			/* Advance to next multixid, handling wraparound */
+			multi++;
+			if (multi < FirstMultiXactId)
+				multi = FirstMultiXactId;
+			prev_multixid_valid = multixid_valid;
+		}
+
+		FreeOldMultiXactReader(old_reader);
+	}
+
+	/* write the final 'next' offset to the last SLRU page */
+	RecordMultiXactOffset(offsets_writer, next_multi,
+						  prev_multixid_valid ? next_offset : 0);
+
+	/* Release resources */
+	FreeSlruWrite(offsets_writer);
+	FreeSlruWrite(members_writer);
+
+	return next_offset;
+}
+
+
+/*
+ * Write one offset to the offset SLRU
+ */
+static void
+RecordMultiXactOffset(SlruSegState *offsets_writer, MultiXactId multi,
+					  MultiXactOffset offset)
+{
+	int64		pageno;
+	int			entryno;
+	char	   *buf;
+	MultiXactOffset *offptr;
+
+	pageno = MultiXactIdToOffsetPage(multi);
+	entryno = MultiXactIdToOffsetEntry(multi);
+
+	buf = SlruWriteSwitchPage(offsets_writer, pageno);
+	offptr = (MultiXactOffset *) buf;
+	offptr[entryno] = offset;
+}
+
+/*
+ * Write the members for one multixid in the members SLRU
+ *
+ * (Currently, this is only ever called with nmembers == 1)
+ */
+static void
+RecordMultiXactMembers(SlruSegState *members_writer,
+					   MultiXactOffset offset,
+					   int nmembers, MultiXactMember *members)
+{
+	for (int i = 0; i < nmembers; i++, offset++)
+	{
+		int64		pageno;
+		char	   *buf;
+		TransactionId *memberptr;
+		uint32	   *flagsptr;
+		uint32		flagsval;
+		int			bshift;
+		int			flagsoff;
+		int			memberoff;
+
+		Assert(members[i].status <= MultiXactStatusUpdate);
+
+		pageno = MXOffsetToMemberPage(offset);
+		memberoff = MXOffsetToMemberOffset(offset);
+		flagsoff = MXOffsetToFlagsOffset(offset);
+		bshift = MXOffsetToFlagsBitShift(offset);
+
+		buf = SlruWriteSwitchPage(members_writer, pageno);
+
+		memberptr = (TransactionId *) (buf + memberoff);
+
+		*memberptr = members[i].xid;
+
+		flagsptr = (uint32 *) (buf + flagsoff);
+
+		flagsval = *flagsptr;
+		flagsval &= ~((((uint32) 1 << MXACT_MEMBER_BITS_PER_XACT) - 1) << bshift);
+		flagsval |= (members[i].status << bshift);
+		*flagsptr = flagsval;
+	}
+}
diff --git a/src/bin/pg_upgrade/pg_upgrade.c b/src/bin/pg_upgrade/pg_upgrade.c
index 490e98fa26f..b3405c22135 100644
--- a/src/bin/pg_upgrade/pg_upgrade.c
+++ b/src/bin/pg_upgrade/pg_upgrade.c
@@ -43,6 +43,7 @@
 
 #include <time.h>
 
+#include "access/multixact.h"
 #include "catalog/pg_class_d.h"
 #include "common/file_perm.h"
 #include "common/logging.h"
@@ -807,15 +808,15 @@ copy_xact_xlog_xid(void)
 			  new_cluster.pgdata);
 	check_ok();
 
-	/*
-	 * If the old server is before the MULTIXACT_FORMATCHANGE_CAT_VER change
-	 * (see pg_upgrade.h) and the new server is after, then we don't copy
-	 * pg_multixact files, but we need to reset pg_control so that the new
-	 * server doesn't attempt to read multis older than the cutoff value.
-	 */
-	if (old_cluster.controldata.cat_ver >= MULTIXACT_FORMATCHANGE_CAT_VER &&
-		new_cluster.controldata.cat_ver >= MULTIXACT_FORMATCHANGE_CAT_VER)
+	/* Copy or convert pg_multixact files */
+	Assert(new_cluster.controldata.cat_ver >= MULTIXACT_FORMATCHANGE_CAT_VER);
+	Assert(new_cluster.controldata.cat_ver >= MULTIXACTOFFSET_FORMATCHANGE_CAT_VER);
+	if (old_cluster.controldata.cat_ver >= MULTIXACTOFFSET_FORMATCHANGE_CAT_VER)
 	{
+		/* No change in multixact format, just copy the files */
+		MultiXactId new_nxtmulti = old_cluster.controldata.chkpnt_nxtmulti;
+		MultiXactOffset new_nxtmxoff = old_cluster.controldata.chkpnt_nxtmxoff;
+
 		copy_subdir_files("pg_multixact/offsets", "pg_multixact/offsets");
 		copy_subdir_files("pg_multixact/members", "pg_multixact/members");
 
@@ -826,38 +827,64 @@ copy_xact_xlog_xid(void)
 		 * counters here and the oldest multi present on system.
 		 */
 		exec_prog(UTILITY_LOG_FILE, NULL, true, true,
-				  "\"%s/pg_resetwal\" -O %u -m %u,%u \"%s\"",
-				  new_cluster.bindir,
-				  old_cluster.controldata.chkpnt_nxtmxoff,
-				  old_cluster.controldata.chkpnt_nxtmulti,
+				  "\"%s/pg_resetwal\" -O %" PRIu64 " -m %u,%u \"%s\"",
+				  new_cluster.bindir, new_nxtmxoff, new_nxtmulti,
 				  old_cluster.controldata.chkpnt_oldstMulti,
 				  new_cluster.pgdata);
 		check_ok();
 	}
-	else if (new_cluster.controldata.cat_ver >= MULTIXACT_FORMATCHANGE_CAT_VER)
+	else
 	{
+		/* Conversion is needed */
+		MultiXactId nxtmulti;
+		MultiXactId oldstMulti;
+		MultiXactOffset nxtmxoff;
+
 		/*
-		 * Remove offsets/0000 file created by initdb that no longer matches
-		 * the new multi-xid value.  "members" starts at zero so no need to
-		 * remove it.
+		 * Determine the range of multixacts to convert.
 		 */
-		remove_new_subdir("pg_multixact/offsets", false);
+		nxtmulti = old_cluster.controldata.chkpnt_nxtmulti;
+		if (old_cluster.controldata.cat_ver >= MULTIXACT_FORMATCHANGE_CAT_VER)
+			oldstMulti = old_cluster.controldata.chkpnt_oldstMulti;
+		else
+		{
+			/*
+			 * In PostgreSQL 9.2 and below, multitransactions were only used
+			 * for row locking, and as such don't need to be preserved during
+			 * upgrade.  In that case, we utilize rewrite_multixacts() just to
+			 * initialize new, empty files in the new format.
+			 *
+			 * It's important that the oldest multi is set to the latest value
+			 * used by the old system, so that multixact.c returns the empty
+			 * set for multis that might be present on disk.
+			 */
+			oldstMulti = nxtmulti;
+		}
+		/* handle wraparound */
+		if (nxtmulti < FirstMultiXactId)
+			nxtmulti = FirstMultiXactId;
+		if (oldstMulti < FirstMultiXactId)
+			oldstMulti = FirstMultiXactId;
 
-		prep_status("Setting oldest multixact ID in new cluster");
+		/*
+		 * Remove the files created by initdb in the new cluster.
+		 * rewrite_multixacts() will create new ones.
+		 */
+		remove_new_subdir("pg_multixact/members", false);
+		remove_new_subdir("pg_multixact/offsets", false);
 
 		/*
-		 * We don't preserve files in this case, but it's important that the
-		 * oldest multi is set to the latest value used by the old system, so
-		 * that multixact.c returns the empty set for multis that might be
-		 * present on disk.  We set next multi to the value following that; it
-		 * might end up wrapped around (i.e. 0) if the old cluster had
-		 * next=MaxMultiXactId, but multixact.c can cope with that just fine.
+		 * Create new pg_multixact files, converting old ones if needed.
 		 */
+		prep_status("Converting pg_multixact files");
+		nxtmxoff = rewrite_multixacts(oldstMulti, nxtmulti);
+		check_ok();
+
+		prep_status("Setting next multixact ID and offset for new cluster");
 		exec_prog(UTILITY_LOG_FILE, NULL, true, true,
-				  "\"%s/pg_resetwal\" -m %u,%u \"%s\"",
+				  "\"%s/pg_resetwal\" -O %" PRIu64 " -m %u,%u \"%s\"",
 				  new_cluster.bindir,
-				  old_cluster.controldata.chkpnt_nxtmulti + 1,
-				  old_cluster.controldata.chkpnt_nxtmulti,
+				  nxtmxoff, nxtmulti, oldstMulti,
 				  new_cluster.pgdata);
 		check_ok();
 	}
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index e86336f4be9..48f15dff5e0 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -114,6 +114,13 @@ extern char *output_files[];
  */
 #define MULTIXACT_FORMATCHANGE_CAT_VER 201301231
 
+/*
+ * MultiXactOffset was changed from 32-bit to 64-bit in version 19, at this
+ * catalog version.  pg_multixact files need to be converted when upgrading
+ * across this version.
+ */
+#define MULTIXACTOFFSET_FORMATCHANGE_CAT_VER 999999999
+
 /*
  * large object chunk size added to pg_controldata,
  * commit 5f93c37805e7485488480916b4585e098d3cc883
@@ -235,7 +242,7 @@ typedef struct
 	uint32		chkpnt_nxtepoch;
 	uint32		chkpnt_nxtoid;
 	uint32		chkpnt_nxtmulti;
-	uint32		chkpnt_nxtmxoff;
+	uint64		chkpnt_nxtmxoff;
 	uint32		chkpnt_oldstMulti;
 	uint32		chkpnt_oldstxid;
 	uint32		align;
@@ -499,6 +506,9 @@ void		old_9_6_invalidate_hash_indexes(ClusterInfo *cluster,
 
 void		report_extension_updates(ClusterInfo *cluster);
 
+/* multixact_rewrite.c */
+MultiXactOffset rewrite_multixacts(MultiXactId from_multi, MultiXactId to_multi);
+
 /* parallel.c */
 void		parallel_exec_prog(const char *log_file, const char *opt_log_file,
 							   const char *fmt,...) pg_attribute_printf(3, 4);
diff --git a/src/bin/pg_upgrade/slru_io.c b/src/bin/pg_upgrade/slru_io.c
new file mode 100644
index 00000000000..720445289b9
--- /dev/null
+++ b/src/bin/pg_upgrade/slru_io.c
@@ -0,0 +1,258 @@
+/*
+ * slru_io.c
+ *
+ * Routines for reading and writing SLRU files during upgrade.
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ * src/bin/pg_upgrade/slru_io.c
+ */
+
+#include "postgres_fe.h"
+
+#include <fcntl.h>
+
+#include "common/fe_memutils.h"
+#include "common/file_perm.h"
+#include "common/file_utils.h"
+#include "port/pg_iovec.h"
+#include "pg_upgrade.h"
+#include "slru_io.h"
+
+static SlruSegState *AllocSlruSegState(const char *dir);
+static char *SlruFileName(SlruSegState *state, int64 segno);
+static void SlruFlush(SlruSegState *state);
+
+static SlruSegState *
+AllocSlruSegState(const char *dir)
+{
+	SlruSegState *state = pg_malloc(sizeof(*state));
+
+	state->dir = pstrdup(dir);
+	state->fn = NULL;
+	state->fd = -1;
+	state->segno = -1;
+	state->pageno = 0;
+
+	return state;
+}
+
+/* similar to the backend function with the same name */
+static char *
+SlruFileName(SlruSegState *state, int64 segno)
+{
+	if (state->long_segment_names)
+	{
+		Assert(segno >= 0 && segno <= INT64CONST(0xFFFFFFFFFFFFFFF));
+		return psprintf("%s/%015" PRIX64, state->dir, segno);
+	}
+	else
+	{
+		Assert(segno >= 0 && segno <= INT64CONST(0xFFFFFF));
+		return psprintf("%s/%04X", state->dir, (unsigned int) segno);
+	}
+}
+
+/*
+ * Create slru reader for dir.
+ *
+ * Returns the malloced memory used by all the other read calls in this module.
+ */
+SlruSegState *
+AllocSlruRead(const char *dir, bool long_segment_names)
+{
+	SlruSegState *state = AllocSlruSegState(dir);
+
+	state->writing = false;
+	state->long_segment_names = long_segment_names;
+
+	return state;
+}
+
+/*
+ * Open given page for reading.
+ *
+ * Reading can be done in random order.
+ */
+char *
+SlruReadSwitchPageSlow(SlruSegState *state, uint64 pageno)
+{
+	int64		segno;
+	ssize_t		bytes_read;
+	off_t		offset;
+
+	Assert(!state->writing);	/* read only mode */
+
+	if (state->segno != -1 && pageno == state->pageno)
+		return state->buf.data;
+
+	segno = pageno / SLRU_PAGES_PER_SEGMENT;
+	if (segno != state->segno)
+	{
+		if (state->segno != -1)
+		{
+			close(state->fd);
+			state->fd = -1;
+
+			pg_free(state->fn);
+			state->fn = NULL;
+
+			state->segno = -1;
+		}
+
+		/* Open new segment */
+		state->fn = SlruFileName(state, segno);
+		if ((state->fd = open(state->fn, O_RDONLY | PG_BINARY, 0)) < 0)
+			pg_fatal("could not open file \"%s\": %m", state->fn);
+	}
+	state->segno = segno;
+
+	offset = (pageno % SLRU_PAGES_PER_SEGMENT) * BLCKSZ;
+	bytes_read = 0;
+	while (bytes_read < BLCKSZ)
+	{
+		ssize_t		rc;
+
+		rc = pg_pread(state->fd,
+					  state->buf.data + bytes_read,
+					  BLCKSZ - bytes_read,
+					  offset + bytes_read);
+		if (rc < 0)
+		{
+			if (errno == EINTR)
+				continue;
+			pg_fatal("could not read file \"%s\": %m", state->fn);
+		}
+		if (rc == 0)
+		{
+			/* unexpected EOF */
+			pg_log(PG_WARNING, "unexpected EOF reading file \"%s\" at offset %lld, reading as zeros", state->fn,
+				   (long long) (offset + bytes_read));
+			memset(state->buf.data + bytes_read, 0, BLCKSZ - bytes_read);
+			break;
+		}
+		bytes_read += rc;
+	}
+	state->pageno = pageno;
+
+	return state->buf.data;
+}
+
+/*
+ * Frees the malloced reader.
+ */
+void
+FreeSlruRead(SlruSegState *state)
+{
+	Assert(!state->writing);	/* read only mode */
+
+	if (state->fd != -1)
+		close(state->fd);
+	pg_free(state);
+}
+
+/*
+ * Create slru writer for dir.
+ *
+ * Returns the malloced memory used by all the other write calls in this module.
+ */
+SlruSegState *
+AllocSlruWrite(const char *dir, bool long_segment_names)
+{
+	SlruSegState *state = AllocSlruSegState(dir);
+
+	state->writing = true;
+	state->long_segment_names = long_segment_names;
+
+	return state;
+}
+
+/*
+ * Open the given page for writing.
+ *
+ * NOTE: This uses O_EXCL when stepping to a new segment, so this assumes that
+ * each segment is written in full before moving on to next one.  This
+ * limitation would be easy to lift if needed, but it fits the usage pattern of
+ * current callers.
+ */
+char *
+SlruWriteSwitchPageSlow(SlruSegState *state, uint64 pageno)
+{
+	int64		segno;
+	off_t		offset;
+
+	if (state->segno != -1 && pageno == state->pageno)
+		return state->buf.data;
+
+	segno = pageno / SLRU_PAGES_PER_SEGMENT;
+	offset = (pageno % SLRU_PAGES_PER_SEGMENT) * BLCKSZ;
+
+	SlruFlush(state);
+	memset(state->buf.data, 0, BLCKSZ);
+
+	if (segno != state->segno)
+	{
+		if (state->segno != -1)
+		{
+			close(state->fd);
+			state->fd = -1;
+
+			pg_free(state->fn);
+			state->fn = NULL;
+
+			state->segno = -1;
+		}
+
+		/* Create the segment */
+		state->fn = SlruFileName(state, segno);
+		if ((state->fd = open(state->fn, O_RDWR | O_CREAT | O_EXCL | PG_BINARY,
+							  pg_file_create_mode)) < 0)
+		{
+			pg_fatal("could not create file \"%s\": %m", state->fn);
+		}
+
+		state->segno = segno;
+
+		if (offset > 0)
+		{
+			if (pg_pwrite_zeros(state->fd, offset, 0) < 0)
+				pg_fatal("could not write file \"%s\": %m", state->fn);
+		}
+	}
+
+	state->pageno = pageno;
+
+	return state->buf.data;
+}
+
+static void
+SlruFlush(SlruSegState *state)
+{
+	struct iovec iovec = {
+		.iov_base = &state->buf,
+		.iov_len = BLCKSZ,
+	};
+	off_t		offset;
+
+	if (state->segno == -1)
+		return;
+
+	offset = (state->pageno % SLRU_PAGES_PER_SEGMENT) * BLCKSZ;
+
+	if (pg_pwritev_with_retry(state->fd, &iovec, 1, offset) < 0)
+		pg_fatal("could not write file \"%s\": %m", state->fn);
+}
+
+/*
+ * Frees the malloced writer.
+ */
+void
+FreeSlruWrite(SlruSegState *state)
+{
+	Assert(state->writing);
+
+	SlruFlush(state);
+
+	if (state->fd != -1)
+		close(state->fd);
+	pg_free(state);
+}
diff --git a/src/bin/pg_upgrade/slru_io.h b/src/bin/pg_upgrade/slru_io.h
new file mode 100644
index 00000000000..5c80a679b4d
--- /dev/null
+++ b/src/bin/pg_upgrade/slru_io.h
@@ -0,0 +1,52 @@
+/*
+ * slru_io.h
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ * src/bin/pg_upgrade/slru_io.h
+ */
+
+#ifndef SLRU_IO_H
+#define SLRU_IO_H
+
+/*
+ * State for reading or writing an SLRU, with a one page buffer.
+ */
+typedef struct SlruSegState
+{
+	bool		writing;
+	bool		long_segment_names;
+
+	char	   *dir;
+	char	   *fn;
+	int			fd;
+	int64		segno;
+	uint64		pageno;
+
+	PGAlignedBlock buf;
+} SlruSegState;
+
+extern SlruSegState *AllocSlruRead(const char *dir, bool long_segment_names);
+extern char *SlruReadSwitchPageSlow(SlruSegState *state, uint64 pageno);
+extern void FreeSlruRead(SlruSegState *state);
+
+static inline char *
+SlruReadSwitchPage(SlruSegState *state, uint64 pageno)
+{
+	if (state->segno != -1 && pageno == state->pageno)
+		return state->buf.data;
+	return SlruReadSwitchPageSlow(state, pageno);
+}
+
+extern SlruSegState *AllocSlruWrite(const char *dir, bool long_segment_names);
+extern char *SlruWriteSwitchPageSlow(SlruSegState *state, uint64 pageno);
+extern void FreeSlruWrite(SlruSegState *state);
+
+static inline char *
+SlruWriteSwitchPage(SlruSegState *state, uint64 pageno)
+{
+	if (state->segno != -1 && pageno == state->pageno)
+		return state->buf.data;
+	return SlruWriteSwitchPageSlow(state, pageno);
+}
+
+#endif							/* SLRU_IO_H */
diff --git a/src/bin/pg_upgrade/t/007_multixact_conversion.pl b/src/bin/pg_upgrade/t/007_multixact_conversion.pl
new file mode 100644
index 00000000000..3e9e2c29af5
--- /dev/null
+++ b/src/bin/pg_upgrade/t/007_multixact_conversion.pl
@@ -0,0 +1,433 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+# Version 19 expanded MultiXactOffset from 32 to 64 bits. Upgrading
+# across that requires rewriting the SLRU files to the new format.
+# This file contains tests for the conversion.
+#
+# To run, set 'oldinstall' ENV variable to point to a pre-v19
+# installation. If it's not set, or if it points to a v19 or above
+# installation, this still performs a very basic test, upgrading a
+# cluster with some multixacts. It's not very interesting, however,
+# because there's no conversion involved in that case.
+
+use strict;
+use warnings FATAL => 'all';
+
+use Math::BigInt;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Temp dir for dumps.
+my $tempdir = PostgreSQL::Test::Utils::tempdir;
+
+# A workload that consumes multixids. The purpose of this is to
+# generate some multixids in the old cluster, so that we can test
+# upgrading them. The workload is a mix of KEY SHARE locking queries
+# and UPDATEs, and commits and aborts, to generate a mix of multixids
+# with different statuses. It consumes around 3000 multixids with
+# 30000 members. That's enough to span more than one multixid
+# 'offsets' page, and more than one 'members' segment.
+#
+# The workload leaves behind a table called 'mxofftest' containing a
+# small number of rows referencing some of the generated multixids.
+#
+# Because this function is used to generate test data on the old
+# installation, it needs to work with older PostgreSQL server
+# versions.
+#
+# The first argument is the cluster to connect to, the second argument
+# is a cluster using the new version. We need the 'psql' binary from
+# the new version, the new cluster is otherwise unused. (We need to
+# use the new 'psql' because some of the more advanced background psql
+# perl module features depend on a fairly recent psql version.)
+sub mxact_workload
+{
+	my $node = shift;       # Cluster to connect to
+	my $binnode = shift;    # Use the psql binary from this cluster
+
+	my $connstr = $node->connstr('postgres');
+
+	$node->start;
+	$node->safe_psql(
+		'postgres', qq[
+		CREATE TABLE mxofftest (id INT PRIMARY KEY, n_updated INT)
+		  WITH (AUTOVACUUM_ENABLED=FALSE);
+		INSERT INTO mxofftest SELECT G, 0 FROM GENERATE_SERIES(1, 50) G;
+	]);
+
+	my $nclients = 20;
+	my $update_every = 13;
+	my $abort_every = 11;
+	my @connections = ();
+
+	# Silence the logging of the statements we run to avoid
+	# unnecessarily bloating the test logs. This runs before the
+	# upgrade we're testing, so the details should not be very
+	# interesting for debugging. But if needed, you can make it more
+	# verbose by setting this.
+	my $verbose = 0;
+
+	# Open multiple connections to the database. Start a transaction
+	# in each connection.
+	for (1 .. $nclients)
+	{
+		# Use the psql binary from the new installation. The
+		# BackgroundPsql functionality doesn't work with older psql
+		# versions.
+		my $conn = $binnode->background_psql('',
+			connstr => $connstr);
+
+		$conn->query_safe("SET log_statement=none", verbose => $verbose)
+		  unless $verbose;
+		$conn->query_safe("SET enable_seqscan=off", verbose => $verbose);
+		$conn->query_safe("BEGIN", verbose => $verbose);
+
+		push(@connections, $conn);
+	}
+
+	# Run queries, cycling through the connections in a
+	# round-robin fashion. We keep a transaction open in each
+	# connection at all times, and lock/update the rows. With
+	# $nclients connections, each SELECT FOR KEY SHARE query generates
+	# a new multixid, containing the XIDs of all the transactions
+	# running at the time.
+	for (my $i = 0; $i < 3000; $i++)
+	{
+		my $conn = $connections[ $i % $nclients ];
+
+		my $sql;
+		if ($i % $abort_every == 0)
+		{
+			$sql = "ABORT; ";
+		}
+		else
+		{
+			$sql = "COMMIT; ";
+		}
+		$sql .= "BEGIN; ";
+
+		if ($i % $update_every == 0)
+		{
+			$sql .= qq[
+			  UPDATE mxofftest SET n_updated = n_updated + 1 WHERE id = ${i} % 50;
+			];
+		}
+		else
+		{
+			my $threshold = int($i / 3000 * 50);
+			$sql .= qq[
+			  select count(*) from (
+				SELECT * FROM mxofftest WHERE id >= $threshold FOR KEY SHARE
+			  ) as x
+			];
+		}
+		$conn->query_safe($sql, verbose => $verbose);
+	}
+
+	for my $conn (@connections)
+	{
+		$conn->quit();
+	}
+
+	$node->stop;
+	return;
+}
+
+# Dump the 'mxofftest' table (created by mxact_workload) to a file; return its path
+sub get_test_table_contents
+{
+	my ($node, $filename) = @_;
+
+	my $contents = $node->safe_psql('postgres',
+		"SELECT ctid, xmin, xmax, * FROM mxofftest");
+
+	my $path = $tempdir . '/' . $filename;
+	open(my $fh, '>', $path)
+	  || die "could not open $path for writing: $!";
+	print $fh $contents;
+	close($fh);
+
+	return $path;
+}
+
+# Return the members of all updating multixids in the given range
+sub get_updating_multixact_members
+{
+	my ($node, $from, $to, $filename) = @_;
+
+	my $path = $tempdir . '/' . $filename;
+	open(my $fh, '>', $path)
+	  || die "could not open $path for writing: $!";
+
+	if ($to >= $from)
+	{
+		my $res = $node->safe_psql(
+			'postgres', qq[
+			SELECT multi, mode, xid
+			FROM generate_series($from, $to - 1) as multi,
+				 pg_get_multixact_members(multi::text::xid)
+			WHERE mode not in ('keysh', 'sh');
+		]);
+		print $fh $res;
+	}
+	else
+	{
+		# Multixids wrapped around. Split the query into two parts,
+		# before and after the wraparound.
+		my $res = $node->safe_psql(
+			'postgres', qq[
+			SELECT multi, mode, xid
+			FROM generate_series($from, 4294967295) as multi,
+				 pg_get_multixact_members(multi::text::xid)
+			WHERE mode not in ('keysh', 'sh');
+		]);
+		print $fh $res;
+		$res = $node->safe_psql(
+			'postgres', qq[
+			SELECT multi, mode, xid
+			FROM generate_series(1, $to - 1) as multi,
+				 pg_get_multixact_members(multi::text::xid)
+			WHERE mode not in ('keysh', 'sh');
+		]);
+		print $fh $res;
+	}
+
+	close($fh);
+	return $path;
+}
+
+# Read multixid related fields from the control file
+#
+# Note: This is used on both the old and the new installation, so the
+# command arguments and the output parsing used here must work with
+# all PostgreSQL versions supported by the test.
+sub read_multixid_fields
+{
+	my $node = shift;
+
+	my $pg_controldata_path = $node->installed_command('pg_controldata');
+	my ($stdout, $stderr) =
+	  run_command([ $pg_controldata_path, $node->data_dir ]);
+	$stdout =~ /^Latest checkpoint's oldestMultiXid:\s*(.*)$/m
+	  or die "could not read oldestMultiXid from pg_controldata";
+	my $oldest_multi_xid = $1;
+	$stdout =~ /^Latest checkpoint's NextMultiXactId:\s*(.*)$/m
+	  or die "could not read NextMultiXactId from pg_controldata";
+	my $next_multi_xid = $1;
+	$stdout =~ /^Latest checkpoint's NextMultiOffset:\s*(.*)$/m
+	  or die "could not read NextMultiOffset from pg_controldata";
+	my $next_multi_offset = $1;
+
+	return ($oldest_multi_xid, $next_multi_xid, $next_multi_offset);
+}
+
+# Reset a cluster's next multixid and mxoffset to given values.
+#
+# Note: This is used on the old installation, so the command arguments
+# and the output parsing used here must work with all pre-v19
+# PostgreSQL versions supported by the test.
+sub reset_mxid_mxoffset_pre_v19
+{
+	my $node = shift;
+	my $mxid = shift;
+	my $mxoffset = shift;
+
+	my $pg_resetwal_path = $node->installed_command('pg_resetwal');
+	# Get block size
+	my ($out, $err) =
+	  run_command([ $pg_resetwal_path, '--dry-run', $node->data_dir ]);
+	$out =~ /^Database block size: *(\d+)$/m or die;
+
+	# Verify that no multixids are currently in use. Resetting would
+	# destroy them. (A freshly initialized cluster has no multixids.)
+	$out =~ /^Latest checkpoint's NextMultiXactId: *(\d+)$/m or die;
+	my $next_mxid = $1;
+	$out =~ /^Latest checkpoint's oldestMultiXid: *(\d+)$/m or die;
+	my $oldest_mxid = $1;
+	die "cluster has some multixids in use" unless $next_mxid == $oldest_mxid;
+
+	# Extract a few other values from pg_resetwal --dry-run output
+	# that we need for the calculations below
+	$out =~ /^Database block size: *(\d+)$/m or die;
+	my $blcksz = $1;
+	# SLRU_PAGES_PER_SEGMENT is always 32 on pre-19 versions
+	my $slru_pages_per_segment = 32;
+
+	# Do the reset
+	my @cmd = (
+		$pg_resetwal_path,
+		'--pgdata' => $node->data_dir,
+		'--multixact-offset' => $mxoffset,
+		'--multixact-ids' => "$mxid,$mxid");
+	command_ok(\@cmd, 'reset multixids and offset');
+
+	# pg_resetwal just updates the control file. The cluster will
+	# refuse to start up if the SLRU segments corresponding to the
+	# next multixid and offset do not exist. Create segments that
+	# cover the given values, filled with zeros. But first remove any
+	# old segments.
+	unlink glob $node->data_dir . "/pg_multixact/offsets/*";
+	unlink glob $node->data_dir . "/pg_multixact/members/*";
+
+	# Initialize the 'offsets' SLRU file containing the new next multixid
+	# with zeros
+	#
+	# sizeof(MultiXactOffset) == 4 in PostgreSQL versions before 19
+	my $multixact_offsets_per_page = $blcksz / 4;
+	my $segno =
+	  int($mxid / $multixact_offsets_per_page / $slru_pages_per_segment);
+	my $path =
+	  sprintf('%s/pg_multixact/offsets/%04X', $node->data_dir, $segno);
+	open my $fh, ">", $path
+	  or die "could not open \"$path\": $!";
+	binmode $fh;
+	my $bytes_per_seg = $slru_pages_per_segment * $blcksz;
+	syswrite($fh, "\0" x $bytes_per_seg) == $bytes_per_seg
+	  or die "could not write to \"$path\": $!";
+	close $fh;
+
+	# Same for the 'members' SLRU
+	my $multixact_members_per_page = int($blcksz / 20) * 4;
+	$segno =
+	  int($mxoffset / $multixact_members_per_page / $slru_pages_per_segment);
+	$path = sprintf "%s/pg_multixact/members/%04X", $node->data_dir, $segno;
+	open $fh, ">", $path
+	  or die "could not open \"$path\": $!";
+	binmode $fh;
+	syswrite($fh, "\0" x $bytes_per_seg) == $bytes_per_seg
+	  or die "could not write to \"$path\": $!";
+	close($fh);
+}
+
+# Main test workhorse routine.
+# Dump data on old version, run pg_upgrade, compare data after upgrade.
+sub upgrade_and_compare
+{
+	my $tag = shift;
+	my $oldnode = shift;
+	my $newnode = shift;
+
+	command_ok(
+		[
+			'pg_upgrade', '--no-sync',
+			'--old-datadir' => $oldnode->data_dir,
+			'--new-datadir' => $newnode->data_dir,
+			'--old-bindir' => $oldnode->config_data('--bindir'),
+			'--new-bindir' => $newnode->config_data('--bindir'),
+			'--socketdir' => $newnode->host,
+			'--old-port' => $oldnode->port,
+			'--new-port' => $newnode->port,
+		],
+		'run of pg_upgrade for new instance');
+
+	# Dump contents of the test table, and the status of all updating
+	# multixids from the old cluster. (Locking-only multixids don't
+	# need to be preserved so we ignore those)
+	#
+	# Note: we do this *after* running pg_upgrade, to ensure that we
+	# don't set all the hint bits before upgrade by doing the SELECT
+	# on the table.
+	my ($multixids_start, $multixids_end, undef) =
+	  read_multixid_fields($oldnode);
+	$oldnode->start;
+	my $old_table_contents =
+	  get_test_table_contents($oldnode, "oldnode_${tag}_table_contents");
+	my $old_multixacts =
+	  get_updating_multixact_members($oldnode, $multixids_start,
+		$multixids_end, "oldnode_${tag}_multixacts");
+	$oldnode->stop;
+
+	# Compare them with upgraded cluster
+	$newnode->start;
+	my $new_table_contents =
+	  get_test_table_contents($newnode, "newnode_${tag}_table_contents");
+	my $new_multixacts =
+	  get_updating_multixact_members($newnode, $multixids_start,
+		$multixids_end, "newnode_${tag}_multixacts");
+	$newnode->stop;
+
+	compare_files($old_table_contents, $new_table_contents,
+		'test table contents from original and upgraded clusters match');
+	compare_files($old_multixacts, $new_multixacts,
+		'multixact members from original and upgraded clusters match');
+}
+
+my $old_version;
+
+# Basic scenario: Create a cluster using old installation, run
+# multixid-creating workload on it, then upgrade.
+#
+# This works even if the old and new version is the same,
+# although it's not very interesting as the conversion routines only
+# run when upgrading from a pre-v19 cluster.
+{
+	my $tag = 'basic';
+	my $old =
+	  PostgreSQL::Test::Cluster->new("${tag}_oldnode",
+		install_path => $ENV{oldinstall});
+	my $new = PostgreSQL::Test::Cluster->new("${tag}_newnode");
+
+	$old->init(extra => ['-k']);
+
+	$old_version = $old->pg_version;
+	note "old installation is version $old_version\n";
+
+	# Run the workload
+	my (undef, $start_mxid, $start_mxoff) = read_multixid_fields($old);
+	mxact_workload($old, $new);
+	my (undef, $finish_mxid, $finish_mxoff) = read_multixid_fields($old);
+
+	note "Testing upgrade, ${tag} scenario\n"
+	  . " mxid from ${start_mxid} to ${finish_mxid}\n"
+	  . " oldnode mxoff from ${start_mxoff} to ${finish_mxoff}\n";
+
+	$new->init;
+	upgrade_and_compare($tag, $old, $new);
+}
+
+# Wraparound scenario: This is the same as the basic scenario, but the
+# old cluster goes through multixid and offset wraparound.
+#
+# This requires the old installation to be version 18 or older,
+# because the hacks we use to reset the old cluster to a state just
+# before the wraparound rely on the pre-v19 file format. If the old
+# cluster is of v19 or above, multixact SLRU conversion is not needed
+# anyway.
+SKIP:
+{
+	skip
+	  "skipping mxoffset conversion tests because upgrading from the old version does not require conversion"
+	  if ($old_version >= '19devel');
+
+	my $tag = 'wraparound';
+	my $old =
+	  PostgreSQL::Test::Cluster->new("${tag}_oldnode",
+		install_path => $ENV{oldinstall});
+	my $new = PostgreSQL::Test::Cluster->new("${tag}_newnode");
+
+	$old->init(extra => ['-k']);
+
+	# Reset the old cluster to just before multixid and 32-bit offset wraparound.
+	reset_mxid_mxoffset_pre_v19($old, 0xFFFFFA00, 0xFFFFEC00);
+
+	# Run the workload. This crosses multixid and offset wraparound.
+	my (undef, $start_mxid, $start_mxoff) = read_multixid_fields($old);
+	mxact_workload($old, $new);
+	my (undef, $finish_mxid, $finish_mxoff) = read_multixid_fields($old);
+
+	note "Testing upgrade, ${tag} scenario\n"
+	  . " mxid from ${start_mxid} to ${finish_mxid}\n"
+	  . " oldnode mxoff from ${start_mxoff} to ${finish_mxoff}\n";
+
+	# Verify that wraparounds happened.
+	cmp_ok($finish_mxid, '<', $start_mxid,
+		"multixid wrapped around in old cluster");
+	cmp_ok($finish_mxoff, '<', $start_mxoff,
+		"mxoff wrapped around in old cluster");
+
+	$new->init;
+	upgrade_and_compare($tag, $old, $new);
+}
+
+done_testing();
diff --git a/src/include/access/multixact.h b/src/include/access/multixact.h
index 82e4bb90dd5..6433fe16364 100644
--- a/src/include/access/multixact.h
+++ b/src/include/access/multixact.h
@@ -28,8 +28,6 @@
 
 #define MultiXactIdIsValid(multi) ((multi) != InvalidMultiXactId)
 
-#define MaxMultiXactOffset	((MultiXactOffset) 0xFFFFFFFF)
-
 /*
  * Possible multixact lock modes ("status").  The first four modes are for
  * tuple locks (FOR KEY SHARE, FOR SHARE, FOR NO KEY UPDATE, FOR UPDATE); the
@@ -111,7 +109,7 @@ extern bool MultiXactIdIsRunning(MultiXactId multi, bool isLockOnly);
 extern void MultiXactIdSetOldestMember(void);
 extern int	GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
 								  bool from_pgupgrade, bool isLockOnly);
-extern bool GetMultiXactInfo(uint32 *multixacts, MultiXactOffset *members,
+extern void GetMultiXactInfo(uint32 *multixacts, MultiXactOffset *members,
 							 MultiXactId *oldestMultiXactId,
 							 MultiXactOffset *oldestOffset);
 extern bool MultiXactIdPrecedes(MultiXactId multi1, MultiXactId multi2);
@@ -131,8 +129,7 @@ extern void BootStrapMultiXact(void);
 extern void StartupMultiXact(void);
 extern void TrimMultiXact(void);
 extern void SetMultiXactIdLimit(MultiXactId oldest_datminmxid,
-								Oid oldest_datoid,
-								bool is_startup);
+								Oid oldest_datoid);
 extern void MultiXactGetCheckptMulti(bool is_shutdown,
 									 MultiXactId *nextMulti,
 									 MultiXactOffset *nextMultiOffset,
diff --git a/src/include/access/multixact_internal.h b/src/include/access/multixact_internal.h
index 9b56deaef31..c4dd1aa044f 100644
--- a/src/include/access/multixact_internal.h
+++ b/src/include/access/multixact_internal.h
@@ -21,17 +21,9 @@
 /*
  * Defines for MultiXactOffset page sizes.  A page is the same BLCKSZ as is
  * used everywhere else in Postgres.
- *
- * Note: because MultiXactOffsets are 32 bits and wrap around at 0xFFFFFFFF,
- * MultiXact page numbering also wraps around at
- * 0xFFFFFFFF/MULTIXACT_OFFSETS_PER_PAGE, and segment numbering at
- * 0xFFFFFFFF/MULTIXACT_OFFSETS_PER_PAGE/SLRU_PAGES_PER_SEGMENT.  We need
- * take no explicit notice of that fact in this module, except when comparing
- * segment and page numbers in TruncateMultiXact (see
- * MultiXactOffsetPagePrecedes).
  */
 
-/* We need four bytes per offset */
+/* We need 8 bytes per offset */
 #define MULTIXACT_OFFSETS_PER_PAGE (BLCKSZ / sizeof(MultiXactOffset))
 
 static inline int64
@@ -80,19 +72,6 @@ MultiXactIdToOffsetSegment(MultiXactId multi)
 #define MULTIXACT_MEMBERS_PER_PAGE	\
 	(MULTIXACT_MEMBERGROUPS_PER_PAGE * MULTIXACT_MEMBERS_PER_MEMBERGROUP)
 
-/*
- * Because the number of items per page is not a divisor of the last item
- * number (member 0xFFFFFFFF), the last segment does not use the maximum number
- * of pages, and moreover the last used page therein does not use the same
- * number of items as previous pages.  (Another way to say it is that the
- * 0xFFFFFFFF member is somewhere in the middle of the last page, so the page
- * has some empty space after that item.)
- *
- * This constant is the number of members in the last page of the last segment.
- */
-#define MAX_MEMBERS_IN_LAST_MEMBERS_PAGE \
-		((uint32) ((0xFFFFFFFF % MULTIXACT_MEMBERS_PER_PAGE) + 1))
-
 /* page in which a member is to be found */
 static inline int64
 MXOffsetToMemberPage(MultiXactOffset offset)
diff --git a/src/include/c.h b/src/include/c.h
index ccd2b654d45..62cbf7a2eec 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -669,7 +669,7 @@ typedef uint32 SubTransactionId;
 /* MultiXactId must be equivalent to TransactionId, to fit in t_xmax */
 typedef TransactionId MultiXactId;
 
-typedef uint32 MultiXactOffset;
+typedef uint64 MultiXactOffset;
 
 typedef uint32 CommandId;
 
diff --git a/src/test/modules/test_slru/t/002_multixact_wraparound.pl b/src/test/modules/test_slru/t/002_multixact_wraparound.pl
index 169333fc564..272d8e6fb08 100644
--- a/src/test/modules/test_slru/t/002_multixact_wraparound.pl
+++ b/src/test/modules/test_slru/t/002_multixact_wraparound.pl
@@ -37,7 +37,7 @@ my $slru_pages_per_segment = $1;
 
 # initialize the 'offsets' SLRU file containing the new next multixid
 # with zeros
-my $multixact_offsets_per_page = $blcksz / 4;   # sizeof(MultiXactOffset) == 4
+my $multixact_offsets_per_page = $blcksz / 8;   # sizeof(MultiXactOffset) == 8
 my $segno =
   int(0xFFFFFFF8 / $multixact_offsets_per_page / $slru_pages_per_segment);
 my $slru_file = sprintf('%s/pg_multixact/offsets/%04X', $node_pgdata, $segno);
diff --git a/src/test/perl/PostgreSQL/Test/BackgroundPsql.pm b/src/test/perl/PostgreSQL/Test/BackgroundPsql.pm
index 60bbd5dd445..9825aaa9bb4 100644
--- a/src/test/perl/PostgreSQL/Test/BackgroundPsql.pm
+++ b/src/test/perl/PostgreSQL/Test/BackgroundPsql.pm
@@ -230,18 +230,23 @@ Executes a query in the current session and returns the output in scalar
 context and (output, error) in list context where error is 1 in case there
 was output generated on stderr when executing the query.
 
+By default, the query and its results are printed to the test output. This
+can be disabled by passing the keyword parameter verbose => 0.
+
 =cut
 
 sub query
 {
-	my ($self, $query) = @_;
+	my ($self, $query, %params) = @_;
 	my $ret;
 	my $output;
 	my $query_cnt = $self->{query_cnt}++;
 
+	$params{verbose} = 1 unless defined $params{verbose};
+
 	local $Test::Builder::Level = $Test::Builder::Level + 1;
 
-	note "issuing query $query_cnt via background psql: $query";
+	note "issuing query $query_cnt via background psql: $query" if $params{verbose};
 
 	$self->{timeout}->start() if (defined($self->{query_timer_restart}));
 
@@ -280,7 +285,7 @@ sub query
 	  explain {
 		stdout => $self->{stdout},
 		stderr => $self->{stderr},
-	  };
+	  } if $params{verbose};
 
 	# Remove banner from stdout and stderr, our caller doesn't care.  The
 	# first newline is optional, as there would not be one if consuming an
@@ -308,9 +313,9 @@ Query failure is determined by it producing output on stderr.
 
 sub query_safe
 {
-	my ($self, $query) = @_;
+	my ($self, $query, %params) = @_;
 
-	my $ret = $self->query($query);
+	my $ret = $self->query($query, %params);
 
 	if ($self->{stderr} ne "")
 	{
diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index 747528c4af1..295988b8b87 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -1793,13 +1793,20 @@ sub _get_env
 	return (%inst_env);
 }
 
-# Private routine to get an installation path qualified command.
-#
-# IPC::Run maintains a cache, %cmd_cache, mapping commands to paths.  Tests
-# which use nodes spanning more than one postgres installation path need to
-# avoid confusing which installation's binaries get run.  Setting $ENV{PATH} is
-# insufficient, as IPC::Run does not check to see if the path has changed since
-# caching a command.
+=pod
+
+=item $node->installed_command(cmd)
+
+Get an installation path qualified command.
+
+IPC::Run maintains a cache, %cmd_cache, mapping commands to paths.  Tests
+which use nodes spanning more than one postgres installation path need to
+avoid confusing which installation's binaries get run.  Setting $ENV{PATH} is
+insufficient, as IPC::Run does not check to see if the path has changed since
+caching a command.
+
+=cut
+
 sub installed_command
 {
 	my ($self, $cmd) = @_;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 6e2ed0c8825..9dd65b10254 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1731,6 +1731,7 @@ MultiXactMember
 MultiXactOffset
 MultiXactStateData
 MultiXactStatus
+MultiXactWriter
 MultirangeIOData
 MultirangeParseState
 MultirangeType
@@ -1816,6 +1817,7 @@ OffsetVarNodes_context
 Oid
 OidOptions
 OkeysState
+OldMultiXactReader
 OldToNewMapping
 OldToNewMappingData
 OnCommitAction
@@ -2814,6 +2816,7 @@ SlruCtlData
 SlruErrorCause
 SlruPageStatus
 SlruScanCallback
+SlruSegState
 SlruShared
 SlruSharedData
 SlruWriteAll
-- 
2.47.3

v29-0006-Add-runtime-checks-for-bogus-multixact-offsets.patch (text/x-patch; charset=UTF-8)
From 01c9a950e1a41042e76c78cf5a6b8e7a10442f7d Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Thu, 4 Dec 2025 15:31:39 +0200
Subject: [PATCH v29 6/6] Add runtime checks for bogus multixact offsets

These are not directly related to 64-bit offsets, but they make sense,
I think.
---
 src/backend/access/transam/multixact.c | 33 ++++++++++++++++----------
 1 file changed, 21 insertions(+), 12 deletions(-)

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index dffa0c8e7d4..dc9c4257a98 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1154,6 +1154,7 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
 	int			slotno;
 	MultiXactOffset *offptr;
 	MultiXactOffset offset;
+	MultiXactOffset nextMXOffset;
 	int			length;
 	MultiXactId oldestMXact;
 	MultiXactId nextMXact;
@@ -1245,12 +1246,14 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
 	offptr += entryno;
 	offset = *offptr;
 
-	Assert(offset != 0);
+	if (offset == 0)
+		ereport(ERROR,
+				(errcode(ERRCODE_DATA_CORRUPTED),
+				 errmsg("MultiXact %u has invalid offset", multi)));
 
 	/* read next multi's offset */
 	{
 		MultiXactId tmpMXact;
-		MultiXactOffset nextMXOffset;
 
 		/* handle wraparound if needed */
 		tmpMXact = multi + 1;
@@ -1284,21 +1287,27 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
 		offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
 		offptr += entryno;
 		nextMXOffset = *offptr;
-
-		if (nextMXOffset == 0)
-			ereport(ERROR,
-					(errcode(ERRCODE_DATA_CORRUPTED),
-					 errmsg("MultiXact %u has invalid next offset",
-							multi)));
-
-		length = nextMXOffset - offset;
 	}
 
 	LWLockRelease(lock);
 	lock = NULL;
 
-	/* A multixid with zero members should not happen */
-	Assert(length > 0);
+	/* Sanity check the next offset */
+	if (nextMXOffset == 0)
+		ereport(ERROR,
+				(errcode(ERRCODE_DATA_CORRUPTED),
+				 errmsg("MultiXact %u has invalid next offset", multi)));
+	if (nextMXOffset < offset)
+		ereport(ERROR,
+				(errcode(ERRCODE_DATA_CORRUPTED),
+				 errmsg("MultiXact %u has offset (%" PRIu64 ") greater than its next offset (%" PRIu64 ")",
+						multi, offset, nextMXOffset)));
+	if (nextMXOffset - offset > INT32_MAX)
+		ereport(ERROR,
+				(errcode(ERRCODE_DATA_CORRUPTED),
+				 errmsg("MultiXact %u has too many members (%" PRIu64 ")",
+						multi, nextMXOffset - offset)));
+	length = nextMXOffset - offset;
 
 	/* read the members */
 	ptr = (MultiXactMember *) palloc(length * sizeof(MultiXactMember));
-- 
2.47.3

#87Ashutosh Bapat
ashutosh.bapat.oss@gmail.com
In reply to: Heikki Linnakangas (#85)
Re: POC: make mxidoff 64 bits

On Sat, Dec 6, 2025 at 5:06 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 05/12/2025 15:42, Ashutosh Bapat wrote:

007_multixact_conversion.pl fires thousands of queries through
BackgroundPsql which prints debug output for each of the queries. When
running this file with oldinstall set,
2.2M regress_log_007_multixact_conversion (size of file)
77874 regress_log_007_multixact_conversion (wc -l output)

Since this output is also copied in testlog.txt, the effect is two-fold.

Most, if not all, of this output is useless. It also makes it hard to
find the output we are looking for. PFA patch which reduces this
output. The patch adds a flag verbose to query_safe() and query() to
toggle this output. With the patch the sizes are
27K regress_log_007_multixact_conversion
588 regress_log_007_multixact_conversion

And it makes the test faster by about a second or two on my laptop.
Something along those lines is required to reduce the output
from query_safe().

Nice! That log bloat was the reason I bundled together the "COMMIT;
BEGIN; SELECT ...;" steps into one statement in the loop. Your solution
addresses it more directly.

Now we can call query_safe() separately on each of those. That will be
more readable and marginally less code.

I turned 'verbose' into a keyword parameter, for future extensibility of
those functions, so you now call it like "$node->query_safe("SELECT 1",
verbose => 0);". I also set "log_statement=none" in those connections,
to reduce the noise in the server log too.

Keyword parameter is better. Also +1 for log_statement.

Some more comments
+++ b/src/bin/pg_upgrade/multixact_old.c

We may later need to introduce a newer _new, and then _old will become _older.
Should we rename the files to have pre19 and post19 or some similar
suffixes which make it clear what is meant by old and new?

+1. I renamed multixact_old.c to multixact_pre_v19.c. And
multixact_new.c to multixact_rewrite.c. I also moved the
"convert_multixact" function that drives the conversion to
multixact_rewrite.c. The idea is that if in the future we change the
format again, we will have:

multixact_pre_v19.c # for reading -v19 files
multixact_pre_v24.c # for reading v19-v23 files
multixact_rewrite.c # for writing new files

Hard to predict what a possible future format might look like and how
we'd want to organize the code then, though. This can be changed then if
needed, but it makes sense now.

+1.

+static inline int64
+MultiXactIdToOffsetPage(MultiXactId multi)

The prologue mentions that the definitions are copy-pasted from
multixact.c from version 18, but they share the names with functions
in the current version. I think that's going to be a good source of
confusion especially in a file which is a few hundred lines long. Can
we rename them to have "Old" prefix or something similar?

Fair. On the other hand, having the same names makes it easier to see
what the real differences with the server functions are. Not sure what's
best here..

As long as we use the same names, it's important that
multixact_pre_v19.c doesn't #include the new definitions. I added some
comments on that, and also this safeguard:

#define MultiXactOffset should_not_be_used

That actually caught one (harmless) instance in the file where we had
not renamed MultiXactOffset to OldMultiXactOffset.

That looks useful, and has proved to be useful already.
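The safeguard discussed above can be sketched in isolation. This is only an illustration, not the code in multixact_pre_v19.c: the typedef and the helper old_offsets_per_page() are stand-ins invented here. The arithmetic follows the test script, which uses a 4-byte offset for pre-v19 clusters.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the pre-v19 on-disk offset type (assumed 32 bits wide). */
typedef uint32_t MultiXactOffset32;

static_assert(sizeof(MultiXactOffset32) == 4,
			  "pre-v19 multixact offsets are 4 bytes");

/*
 * Poison the new type name for the rest of this file. Any later use of
 * "MultiXactOffset" now expands to an undeclared identifier, so a forgotten
 * rename fails at compile time instead of silently using the 64-bit type.
 */
#define MultiXactOffset should_not_be_used

/* Hypothetical helper: number of old-format offsets stored on one page. */
static inline uint32_t
old_offsets_per_page(uint32_t blcksz)
{
	return blcksz / (uint32_t) sizeof(MultiXactOffset32);
}
```

With the default 8 kB block size, old_offsets_per_page(8192) yields 2048. Writing "MultiXactOffset x;" anywhere below the #define would stop the build.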

I'm not entirely happy with the "Old" prefix here, because as you
pointed out, we might end up needing "older" or "oldold" in the future.
I couldn't come up with anything better though. "PreV19MultiXactOffset"
is quite a mouthful.

How about MultiXactOffset32?

Thanks for addressing the rest of the comments.

+
+ note ">>> case #${tag}\n"
+   . " oldnode mxoff from ${start_mxoff} to ${finish_mxoff}\n"
+   . " newnode mxoff ${new_next_mxoff}\n";

Should we check that some condition holds between finish_mxoff and
new_next_mxoff?

Got something in mind that we could check?

I have always seen that finish_mxoff is very high compared to newnode
mxoff - given that we write only one member per mxid, is newnode mxoff
going to be always something like 4K or so? Then we can check that
value. But I will experiment more to see if I can come up with
something, if possible.

--
Best Wishes,
Ashutosh Bapat

#88Ashutosh Bapat
ashutosh.bapat.oss@gmail.com
In reply to: Heikki Linnakangas (#86)
Re: POC: make mxidoff 64 bits

On Mon, Dec 8, 2025 at 6:32 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 06/12/2025 01:36, Heikki Linnakangas wrote:

On 05/12/2025 15:42, Ashutosh Bapat wrote:

+ $newnode->start;
+ my $new_dump = get_dump_for_comparison($newnode, "newnode_${tag}_dump");
+ $newnode->stop;

There is no code which actually looks at the multixact offsets here to
make sure that the conversion happened correctly. I guess the test
relies on visibility checks for that. Anyway, we need a comment
explaining why just comparing the contents of the table is enough to
ensure correct conversion. Better if we can add an explicit test that
the offsets were converted correctly. I don't have any idea of how to
do that right now, though. Maybe use pg_get_multixact_members()
somehow in the query to extract data out of the table?

Agreed, the verification here is quite weak. I didn't realize that
pg_get_multixact_members() exists! That might indeed be handy here, but
I'm not sure how exactly to construct the test. A direct C function like
test_create_multixact() in test_multixact.c would be handy here, but
we'd need to compile and run that in the old cluster, which seems
difficult.

I added verification of all the multixids between oldest and next
multixid, using pg_get_multixact_members(). The test now calls
pg_get_multixact_members() for all updating multixids in the range,
before and after the upgrade, and compares the results.
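The range walked by that verification can wrap around, which is why the test issues two generate_series() queries in that case. A minimal C sketch of the same split follows, assuming a half-open range [from, to) as in the test. MxidRange and split_mxid_range() are hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t MultiXactId;

#define FirstMultiXactId	((MultiXactId) 1)
#define MaxMultiXactId		((MultiXactId) 0xFFFFFFFF)

/* Closed interval [lo, hi] of multixids (hypothetical helper type). */
typedef struct MxidRange
{
	MultiXactId lo;
	MultiXactId hi;
} MxidRange;

/*
 * Split the half-open multixid range [from, to) into at most two closed
 * intervals that never contain the invalid value 0. Returns the number of
 * intervals written to out[].
 */
static int
split_mxid_range(MultiXactId from, MultiXactId to, MxidRange out[2])
{
	int			n = 0;

	assert(from >= FirstMultiXactId);

	if (from == to)
		return 0;				/* empty range */

	if (to > from)
	{
		/* no wraparound: a single interval */
		out[n++] = (MxidRange) {from, to - 1};
		return n;
	}

	/* wrapped around: the part before the wrap ... */
	out[n++] = (MxidRange) {from, MaxMultiXactId};
	/* ... and the part after it, if it is not empty */
	if (to > FirstMultiXactId)
		out[n++] = (MxidRange) {FirstMultiXactId, to - 1};
	return n;
}
```

For the wraparound scenario's starting point, split_mxid_range(0xFFFFFA00, 50, r) produces the intervals [0xFFFFFA00, 0xFFFFFFFF] and [1, 49].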

I thought about adding pg_get_multixact_members() to
get_test_table_contents() itself, like SELECT ctid, xmin, xmax,
pg_get_multixact_members(xmin), pg_get_multixact_members(xmax), * FROM
mxofftest; but then I realized that the UPDATE would replace mxids by
actual transaction ids in the visible rows. So that can't be used.
What you have done doesn't have that drawback, but it's also not
checking whether the multixids in (invisible) rows are reachable in
offsets and members. But probably that's too hard to do and is covered
by visibility checks.

--
Best Wishes,
Ashutosh Bapat

#89Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Ashutosh Bapat (#87)
Re: POC: make mxidoff 64 bits

On 08/12/2025 17:43, Ashutosh Bapat wrote:

On Sat, Dec 6, 2025 at 5:06 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 05/12/2025 15:42, Ashutosh Bapat wrote:

And it makes the test faster by about a second or two on my laptop.
Something along those lines is required to reduce the output
from query_safe().

Nice! That log bloat was the reason I bundled together the "COMMIT;
BEGIN; SELECT ...;" steps into one statement in the loop. Your solution
addresses it more directly.

Now we can call query_safe() separately on each of those. That will be
more readable and marginally less code.

Done.

I'm not entirely happy with the "Old" prefix here, because as you
pointed out, we might end up needing "older" or "oldold" in the future.
I couldn't come up with anything better though. "PreV19MultiXactOffset"
is quite a mouthful.

How about MultiXactOffset32?

Ooh, I like that. It doesn't sound as nice for the other "old" prefixed
things though. So I changed OldMultiXactOffset to MultiXactOffset32, but
kept OldMultiXactReader, GetOldMultiXactIdSingleMember() et al. We can
live with that for now, and rename in the future if we'd need "oldold".

Committed with that and some other minor cleanups. Thanks everyone! This
patch has been brewing for a while :-).

There are some noncritical followups that I'd like to address, now that
we know that in v19 the pg_multixact files will be rewritten. That gives
us an opportunity to clean up some backwards-compatibility stuff. The
committed patch already cleaned up a bunch, but there's some more we
could do:

1. Currently, at multixid wraparound, MultiXactState->nextMXact goes to
0, which is invalid. All the readers must be prepared for that, and skip
over the 0. That's error-prone, we've already missed that a few times.
Let's change things so that the code that *writes*
MultiXactState->nextMXact skips over the zero already.

2. We currently don't persist 'oldestOffset' in the control file the
same way as 'oldestMultiXactId'. Instead, we look up the offset of the
oldestMultiXactId at startup, and keep that value in memory. Originally
that was because we missed the need for that and had to add the offset
wraparound protections in a minor release without changing the control
file format. But we could easily do it now.

With 64-bit offsets, it's actually less critical to persist the
oldestOffset. Previously, if we failed to look up the oldest offset
because the oldest multixid was invalid, it could lead to serious
trouble if the offsets then wrapped around and old offsets were
overwritten, but that won't happen anymore. Nevertheless, it leads to
unnecessarily aggressive vacuuming and some messages in the log.

At first I thought that the failure to look up the oldest offset should
no longer happen, because we don't need to support reading old 9.3 era
SLRUs anymore that were created before we added the offset wraparound
protection. But it's not so: it's still possible to have multixids with
invalid offsets in the 'offsets' SLRU on a crash. Such multixids won't
be referenced from anywhere in the heap, but I think they could later
become the oldest multixid, and we would fail to look up its offset.
Persisting the oldest offset doesn't fully fix that problem, because
advancing the oldest offset is done by looking up the oldest multixid's
offset anyway.

3. I think we should turn some of the assertions in
GetMultiXactIdMembers() into ereport(ERROR) calls. I included those
changes in my patch version 29 [1], as a separate patch, but I didn't
commit that yet.

4. Compressing the offsets, per discussion. It doesn't really seem worth
it to me and I don't intend to work on it, but if someone wants to do it,
now would be the time, so that we don't need to have upgrade code to
deal with yet another format.
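As a rough illustration of item 3, the checks from the v29-0006 patch can be written as a standalone function that reports a status instead of calling ereport(). This is a sketch only: the enum and the function name check_member_range() are invented here, and only the order and conditions of the checks follow the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t MultiXactOffset;

/* Hypothetical status codes standing in for ereport(ERROR, ...) calls. */
typedef enum MxOffsetCheck
{
	MXOFF_OK,
	MXOFF_INVALID_OFFSET,		/* offset == 0 */
	MXOFF_INVALID_NEXT_OFFSET,	/* next offset == 0 */
	MXOFF_NEXT_PRECEDES,		/* next offset < offset */
	MXOFF_TOO_MANY_MEMBERS		/* member count does not fit in an int */
} MxOffsetCheck;

/*
 * Validate a multixact's offset against the next multixact's offset and,
 * if both look sane, store the member count in *length.
 */
static MxOffsetCheck
check_member_range(MultiXactOffset offset, MultiXactOffset next_offset,
				   int *length)
{
	assert(length != NULL);

	if (offset == 0)
		return MXOFF_INVALID_OFFSET;
	if (next_offset == 0)
		return MXOFF_INVALID_NEXT_OFFSET;
	if (next_offset < offset)
		return MXOFF_NEXT_PRECEDES;
	if (next_offset - offset > INT32_MAX)
		return MXOFF_TOO_MANY_MEMBERS;

	*length = (int) (next_offset - offset);
	return MXOFF_OK;
}
```

Unlike an Assert(), these checks also run in production builds, so corrupted offsets are reported rather than producing a negative or huge allocation size.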

- Heikki

#90Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Heikki Linnakangas (#89)
1 attachment(s)
Re: POC: make mxidoff 64 bits

On 09/12/2025 14:00, Heikki Linnakangas wrote:

1. Currently, at multixid wraparound, MultiXactState->nextMXact goes to
0, which is invalid. All the readers must be prepared for that, and skip
over the 0. That's error-prone, we've already missed that a few times.
Let's change things so that the code that *writes*
MultiXactState->nextMXact skips over the zero already.

Here's a patch for that. Does anyone see a problem with this?
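The idea can be sketched outside the server as follows. The two stepping helpers mirror the ones in the patch; assign_multixact_id() is a hypothetical stand-in for the code that hands out a multixid and advances the counter, and it ignores locking and limits entirely.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t MultiXactId;

#define InvalidMultiXactId	((MultiXactId) 0)
#define FirstMultiXactId	((MultiXactId) 1)
#define MaxMultiXactId		((MultiXactId) 0xFFFFFFFF)

/* Step forward, wrapping from the maximum to 1 so that 0 is never produced. */
static inline MultiXactId
NextMultiXactId(MultiXactId multi)
{
	return multi == MaxMultiXactId ? FirstMultiXactId : multi + 1;
}

/* Step backward, wrapping from 1 to the maximum. */
static inline MultiXactId
PreviousMultiXactId(MultiXactId multi)
{
	return multi == FirstMultiXactId ? MaxMultiXactId : multi - 1;
}

/*
 * Hypothetical writer: return the multixid to assign and advance the
 * counter. Because the writer skips 0, readers of *next_mxact never
 * need to handle the invalid value.
 */
static MultiXactId
assign_multixact_id(MultiXactId *next_mxact)
{
	MultiXactId result = *next_mxact;

	assert(result != InvalidMultiXactId);
	*next_mxact = NextMultiXactId(result);
	return result;
}
```

Starting from a counter of MaxMultiXactId, one call returns 0xFFFFFFFF and leaves the counter at 1 rather than 0.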

- Heikki

Attachments:

v1-0001-refactor-Never-store-0-as-the-nextMXact.patch (text/x-patch; charset=UTF-8)
From fb5865dcc5654a601cc8db796e62ea928016396a Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Wed, 10 Dec 2025 21:16:29 +0200
Subject: [PATCH v1 1/1] refactor: Never store 0 as the nextMXact

Before this commit, when multixid wraparound happens,
MultiXactState->nextMXact goes to 0, which is invalid. All the readers
deal with that possibility and skip over the 0. That's error-prone,
we've missed that a few times in the past. This commit changes the
responsibility so that all writers of MultiXactState->nextMXact skip
over the zero already, and readers can trust that it's never 0.

Discussion: https://www.postgresql.org/message-id/3624730d-6dae-42bf-9458-76c4c965fb27@iki.fi
---
 src/backend/access/transam/multixact.c | 79 +++++++-------------------
 src/bin/pg_resetwal/pg_resetwal.c      |  2 +
 src/bin/pg_resetwal/t/001_basic.pl     | 15 +----
 3 files changed, 24 insertions(+), 72 deletions(-)

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 6ca3d44261e..2e20a0907d8 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -104,6 +104,12 @@ PreviousMultiXactId(MultiXactId multi)
 	return multi == FirstMultiXactId ? MaxMultiXactId : multi - 1;
 }
 
+static inline MultiXactId
+NextMultiXactId(MultiXactId multi)
+{
+	return multi == MaxMultiXactId ? FirstMultiXactId : multi + 1;
+}
+
 /*
  * Links to shared-memory data structures for MultiXact control
  */
@@ -552,14 +558,7 @@ MultiXactIdSetOldestMember(void)
 		 */
 		LWLockAcquire(MultiXactGenLock, LW_SHARED);
 
-		/*
-		 * We have to beware of the possibility that nextMXact is in the
-		 * wrapped-around state.  We don't fix the counter itself here, but we
-		 * must be sure to store a valid value in our array entry.
-		 */
 		nextMXact = MultiXactState->nextMXact;
-		if (nextMXact < FirstMultiXactId)
-			nextMXact = FirstMultiXactId;
 
 		OldestMemberMXactId[MyProcNumber] = nextMXact;
 
@@ -596,15 +595,7 @@ MultiXactIdSetOldestVisible(void)
 
 		LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
 
-		/*
-		 * We have to beware of the possibility that nextMXact is in the
-		 * wrapped-around state.  We don't fix the counter itself here, but we
-		 * must be sure to store a valid value in our array entry.
-		 */
 		oldestMXact = MultiXactState->nextMXact;
-		if (oldestMXact < FirstMultiXactId)
-			oldestMXact = FirstMultiXactId;
-
 		for (i = 0; i < MaxOldestSlot; i++)
 		{
 			MultiXactId thisoldest = OldestMemberMXactId[i];
@@ -637,9 +628,6 @@ ReadNextMultiXactId(void)
 	mxid = MultiXactState->nextMXact;
 	LWLockRelease(MultiXactGenLock);
 
-	if (mxid < FirstMultiXactId)
-		mxid = FirstMultiXactId;
-
 	return mxid;
 }
 
@@ -654,11 +642,6 @@ ReadMultiXactIdRange(MultiXactId *oldest, MultiXactId *next)
 	*oldest = MultiXactState->oldestMultiXactId;
 	*next = MultiXactState->nextMXact;
 	LWLockRelease(MultiXactGenLock);
-
-	if (*oldest < FirstMultiXactId)
-		*oldest = FirstMultiXactId;
-	if (*next < FirstMultiXactId)
-		*next = FirstMultiXactId;
 }
 
 
@@ -794,9 +777,7 @@ RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
 	entryno = MultiXactIdToOffsetEntry(multi);
 
 	/* position of the next multixid */
-	next = multi + 1;
-	if (next < FirstMultiXactId)
-		next = FirstMultiXactId;
+	next = NextMultiXactId(multi);
 	next_pageno = MultiXactIdToOffsetPage(next);
 	next_entryno = MultiXactIdToOffsetEntry(next);
 
@@ -955,10 +936,6 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 
 	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
 
-	/* Handle wraparound of the nextMXact counter */
-	if (MultiXactState->nextMXact < FirstMultiXactId)
-		MultiXactState->nextMXact = FirstMultiXactId;
-
 	/* Assign the MXID */
 	result = MultiXactState->nextMXact;
 
@@ -1025,7 +1002,7 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 		 * request only once per 64K multis generated.  This still gives
 		 * plenty of chances before we get into real trouble.
 		 */
-		if (IsUnderPostmaster && (result % 65536) == 0)
+		if (IsUnderPostmaster && ((result % 65536) == 0 || result == FirstMultiXactId))
 			SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER);
 
 		if (!MultiXactIdPrecedes(result, multiWarnLimit))
@@ -1056,15 +1033,13 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 		/* Re-acquire lock and start over */
 		LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
 		result = MultiXactState->nextMXact;
-		if (result < FirstMultiXactId)
-			result = FirstMultiXactId;
 	}
 
 	/*
 	 * Make sure there is room for the next MXID in the file.  Assigning this
 	 * MXID sets the next MXID's offset already.
 	 */
-	ExtendMultiXactOffset(result + 1);
+	ExtendMultiXactOffset(NextMultiXactId(result));
 
 	/*
 	 * Reserve the members space, similarly to above.
@@ -1098,15 +1073,8 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 	/*
 	 * Advance counters.  As in GetNewTransactionId(), this must not happen
 	 * until after file extension has succeeded!
-	 *
-	 * We don't care about MultiXactId wraparound here; it will be handled by
-	 * the next iteration.  But note that nextMXact may be InvalidMultiXactId
-	 * or the first value on a segment-beginning page after this routine
-	 * exits, so anyone else looking at the variable must be prepared to deal
-	 * with either case.
 	 */
-	(MultiXactState->nextMXact)++;
-
+	MultiXactState->nextMXact = NextMultiXactId(result);
 	MultiXactState->nextOffset += nmembers;
 
 	LWLockRelease(MultiXactGenLock);
@@ -1252,9 +1220,7 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
 		MultiXactOffset nextMXOffset;
 
 		/* handle wraparound if needed */
-		tmpMXact = multi + 1;
-		if (tmpMXact < FirstMultiXactId)
-			tmpMXact = FirstMultiXactId;
+		tmpMXact = NextMultiXactId(multi);
 
 		prev_pageno = pageno;
 
@@ -1898,7 +1864,7 @@ TrimMultiXact(void)
 		LWLock	   *lock = SimpleLruGetBankLock(MultiXactOffsetCtl, pageno);
 
 		LWLockAcquire(lock, LW_EXCLUSIVE);
-		if (entryno == 0)
+		if (entryno == 0 || nextMXact == FirstMultiXactId)
 			slotno = SimpleLruZeroPage(MultiXactOffsetCtl, pageno);
 		else
 			slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, nextMXact);
@@ -2014,8 +1980,10 @@ void
 MultiXactSetNextMXact(MultiXactId nextMulti,
 					  MultiXactOffset nextMultiOffset)
 {
+	Assert(MultiXactIdIsValid(nextMulti));
 	debug_elog4(DEBUG2, "MultiXact: setting next multi to %u offset %" PRIu64,
 				nextMulti, nextMultiOffset);
+
 	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
 	MultiXactState->nextMXact = nextMulti;
 	MultiXactState->nextOffset = nextMultiOffset;
@@ -2184,6 +2152,8 @@ void
 MultiXactAdvanceNextMXact(MultiXactId minMulti,
 						  MultiXactOffset minMultiOffset)
 {
+	Assert(MultiXactIdIsValid(minMulti));
+
 	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
 	if (MultiXactIdPrecedes(MultiXactState->nextMXact, minMulti))
 	{
@@ -2321,7 +2291,6 @@ MultiXactId
 GetOldestMultiXactId(void)
 {
 	MultiXactId oldestMXact;
-	MultiXactId nextMXact;
 	int			i;
 
 	/*
@@ -2329,17 +2298,7 @@ GetOldestMultiXactId(void)
 	 * OldestVisibleMXactId[] entries, or nextMXact if none are valid.
 	 */
 	LWLockAcquire(MultiXactGenLock, LW_SHARED);
-
-	/*
-	 * We have to beware of the possibility that nextMXact is in the
-	 * wrapped-around state.  We don't fix the counter itself here, but we
-	 * must be sure to use a valid value in our calculation.
-	 */
-	nextMXact = MultiXactState->nextMXact;
-	if (nextMXact < FirstMultiXactId)
-		nextMXact = FirstMultiXactId;
-
-	oldestMXact = nextMXact;
+	oldestMXact = MultiXactState->nextMXact;
 	for (i = 0; i < MaxOldestSlot; i++)
 	{
 		MultiXactId thisoldest;
@@ -2660,6 +2619,7 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 
 	Assert(!RecoveryInProgress());
 	Assert(MultiXactState->finishedStartup);
+	Assert(MultiXactIdIsValid(newOldestMulti));
 
 	/*
 	 * We can only allow one truncation to happen at once. Otherwise parts of
@@ -2674,7 +2634,6 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 	nextOffset = MultiXactState->nextOffset;
 	oldestMulti = MultiXactState->oldestMultiXactId;
 	LWLockRelease(MultiXactGenLock);
-	Assert(MultiXactIdIsValid(oldestMulti));
 
 	/*
 	 * Make sure to only attempt truncation if there's values to truncate
@@ -2944,7 +2903,7 @@ multixact_redo(XLogReaderState *record)
 						   xlrec->members);
 
 		/* Make sure nextMXact/nextOffset are beyond what this record has */
-		MultiXactAdvanceNextMXact(xlrec->mid + 1,
+		MultiXactAdvanceNextMXact(NextMultiXactId(xlrec->mid),
 								  xlrec->moff + xlrec->nmembers);
 
 		/*
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index 56012d5f4c4..9bfab8c307b 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -287,6 +287,8 @@ main(int argc, char *argv[])
 				 * XXX It'd be nice to have more sanity checks here, e.g. so
 				 * that oldest is not wrapped around w.r.t. nextMulti.
 				 */
+				if (next_mxid_val == 0)
+					pg_fatal("next multitransaction ID (-m) must not be 0");
 				if (oldest_mxid_val == 0)
 					pg_fatal("oldest multitransaction ID (-m) must not be 0");
 				mxids_given = true;
diff --git a/src/bin/pg_resetwal/t/001_basic.pl b/src/bin/pg_resetwal/t/001_basic.pl
index 8bab9add74f..dde024a7f14 100644
--- a/src/bin/pg_resetwal/t/001_basic.pl
+++ b/src/bin/pg_resetwal/t/001_basic.pl
@@ -119,19 +119,10 @@ command_fails_like(
 	[ 'pg_resetwal', '-m' => '10,bar', $node->data_dir ],
 	qr/error: invalid argument for option -m/,
 	'fails with incorrect -m option part 2');
-
-# This used to be forbidden, but nextMulti can legitimately be 0 after
-# wraparound, so we now accept it in pg_resetwal too.
-command_ok(
-	[ 'pg_resetwal', '-m' => '0,10', $node->data_dir ],
-	'succeeds with -m value 0 in the first part');
-
-# -0 doesn't make sense however
 command_fails_like(
-	[ 'pg_resetwal', '-m' => '-0,10', $node->data_dir ],
-	qr/error: invalid argument for option -m/,
-	'fails with -m value -0 in the first part');
-
+	[ 'pg_resetwal', '-m' => '0,10', $node->data_dir ],
+	qr/must not be 0/,
+	'fails with -m value 0 in the first part');
 command_fails_like(
 	[ 'pg_resetwal', '-m' => '10,0', $node->data_dir ],
 	qr/must not be 0/,
-- 
2.47.3

#91Ashutosh Bapat
ashutosh.bapat.oss@gmail.com
In reply to: Heikki Linnakangas (#90)
Re: POC: make mxidoff 64 bits

On Thu, Dec 11, 2025 at 12:49 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 09/12/2025 14:00, Heikki Linnakangas wrote:

1. Currently, at multixid wraparound, MultiXactState->nextMXact goes to
0, which is invalid. All the readers must be prepared for that, and skip
over the 0. That's error-prone, we've already missed that a few times.
Let's change things so that the code that *writes*
MultiXactState->nextMXact skips over the zero already.

Here's a patch for that. Does anyone see a problem with this?

The patch looks fine to me. It simplifies readers without affecting
writers much. I was expecting more explanation of why it wasn't done
that way to start with, and why it is safe to do so (now, if
applicable). There must be a reason why we chose to make readers
handle an invalid mxid instead of having writers never store one. If
it's for performance reasons, does the new arrangement cause any
regression? If it's for safety reasons, are we fixing one set of
problems but introducing a new one? I was expecting the commit message
to answer those questions.

--
Best Wishes,
Ashutosh Bapat

#92Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Ashutosh Bapat (#91)
Re: POC: make mxidoff 64 bits

On 11/12/2025 05:06, Ashutosh Bapat wrote:

On Thu, Dec 11, 2025 at 12:49 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 09/12/2025 14:00, Heikki Linnakangas wrote:

1. Currently, at multixid wraparound, MultiXactState->nextMXact goes to
0, which is invalid. All the readers must be prepared for that, and skip
over the 0. That's error-prone, we've already missed that a few times.
Let's change things so that the code that *writes*
MultiXactState->nextMXact skips over the zero already.

Here's a patch for that. Does anyone see a problem with this?

The patch looks fine to me. It simplifies readers without affecting
writers much. I was expecting more explanation of why it wasn't done
that way to start with, and why it is safe to do so (now, if
applicable). There must be a reason why we chose to make readers
handle an invalid mxid instead of having writers never store one. If
it's for performance reasons, does the new arrangement cause any
regression? If it's for safety reasons, are we fixing one set of
problems but introducing a new one? I was expecting the commit message
to answer those questions.

That's a great question and I've been wondering about it myself. It goes
all the way to the initial commit where multixacts were introduced, and
I don't see any particular reason for it even back then. Even in the
very first version of multixact.c, IMO it would've been simpler to have
the writer handle the wraparound.

Álvaro, would you happen to remember?

- Heikki

#93Maxim Orlov
orlovmg@gmail.com
In reply to: Heikki Linnakangas (#92)
Re: POC: make mxidoff 64 bits

On Thu, 11 Dec 2025 at 10:58, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

That's a great question and I've been wondering about it myself. It goes
all the way to the initial commit where multixacts were introduced, and
I don't see any particular reason for it even back then. Even in the
very first version of multixact.c, IMO it would've been simpler to have
the writer handle the wraparound.

+1. This code is quite old, and I don't see any particular reason for
doing it that way. Unfortunately, we cannot prove the absence of errors
here, but there is also no obvious statement of why it had to be done
this way. So, for me, it's much clearer to increment and handle
wraparound in one place rather than spread it across multiple calls in
the module.

--
Best regards,
Maxim Orlov.

#94Alvaro Herrera
alvherre@alvh.no-ip.org
In reply to: Heikki Linnakangas (#92)
Re: POC: make mxidoff 64 bits

On 2025-Dec-11, Heikki Linnakangas wrote:

That's a great question and I've been wondering about it myself. It goes all
the way to the initial commit where multixacts were introduced, and I don't
see any particular reason for it even back then. Even in the very first
version of multixact.c, IMO it would've been simpler to have the writer
handle the wraparound.

Álvaro, would you happen to remember?

Sorry, I have no recollection of the reason why it was done this way.

--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
Voy a acabar con todos los humanos / con los humanos yo acabaré
voy a acabar con todos (bis) / con todos los humanos acabaré ¡acabaré! (Bender)

#95Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Alvaro Herrera (#94)
Re: POC: make mxidoff 64 bits

On 11/12/2025 22:05, Alvaro Herrera wrote:

On 2025-Dec-11, Heikki Linnakangas wrote:

That's a great question and I've been wondering about it myself. It goes all
the way to the initial commit where multixacts were introduced, and I don't
see any particular reason for it even back then. Even in the very first
version of multixact.c, IMO it would've been simpler to have the writer
handle the wraparound.

Álvaro, would you happen to remember?

Sorry, I have no recollections of the reason why it was done this way.

Ok, I have pushed this. Thanks!

- Heikki

#96Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#95)
Re: POC: make mxidoff 64 bits

Heikki Linnakangas <hlinnaka@iki.fi> writes:

Ok, I have pushed this. Thanks!

Coverity is unhappy about this bit:

/srv/coverity/git/pgsql-git/postgresql/src/bin/pg_upgrade/multixact_read_v18.c: 282 in GetOldMultiXactIdSingleMember()
276 if (!TransactionIdIsValid(*xactptr))
277 {
278 /*
279 * Corner case 2: we are looking at unused slot zero
280 */
281 if (offset == 0)

CID 1676077: Control flow issues (DEADCODE)
Execution cannot reach this statement: "continue;".

282 continue;
283
284 /*
285 * Otherwise this is an invalid entry that should not be

It sees the earlier test for offset == 0, and evidently is assuming
that the loop's "offset++" will not wrap around. Now I think that
the point of this check is exactly that "offset++" could have wrapped
around, but the commentary is not so clear that I'm certain this is a
false positive. If that is the intention, what do you think of
rephrasing this comment as "we have wrapped around to unused slot
zero"?

regards, tom lane

#97Alexander Lakhin
exclusion@gmail.com
In reply to: Heikki Linnakangas (#89)
Re: POC: make mxidoff 64 bits

Hello Heikki,

09.12.2025 14:00, Heikki Linnakangas wrote:

Committed with that and some other minor cleanups. Thanks everyone! This patch has been brewing for a while :-).

I've spotted a couple of failures of the new test 007_multixact_conversion
on the buildfarm:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=urutu&dt=2025-12-09%2020%3A40%3A53
007_multixact_conversion_basic_oldnode.log:
...
2025-12-09 22:33:39.299 CET [2872679][client backend][21/2:0] LOG: statement: SET log_statement=none
    ;
2025-12-09 22:36:39.025 CET [2871745][postmaster][:0] LOG:  received immediate shutdown request
(180 seconds timeout)

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=canebrake&dt=2025-12-14%2023%3A53%3A48
007_multixact_conversion_basic_oldnode.log:
...
2025-12-15 01:57:01.380 CET [2178307][client backend][21/2:0] LOG: statement: SET log_statement=none
    ;
2025-12-15 02:00:01.020 CET [2177271][postmaster][:0] LOG:  received immediate shutdown request
(180 seconds timeout)

Both occurred on JIT-enabled animals (moreover, JIT is provided by a debug
build of LLVM), so these animals are very slow.

Looking at other urutu's runs, we can see:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=urutu&dt=2025-12-10%2013%3A06%3A35
007_multixact_conversion_basic_oldnode.log:
2025-12-10 14:37:05.254 CET [2322763][client backend][21/2:0] LOG: statement: SET log_statement=none
    ;
2025-12-10 14:39:21.784 CET [2322610][client backend][:0] LOG: disconnection: session time: 0:02:16.878 user=bf
database=postgres host=[local]
(136 seconds)

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=urutu&dt=2025-12-10%2017%3A43%3A02
2025-12-10 19:20:30.310 CET [1785680][client backend][21/2:0] LOG: statement: SET log_statement=none
    ;
2025-12-10 19:22:41.734 CET [1784967][client backend][:0] LOG: disconnection: session time: 0:02:11.903 user=bf
database=postgres host=[local]
(133 seconds)

Though most runs show timings under 80 seconds, e.g.:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=urutu&dt=2025-12-14%2021%3A20%3A49
2025-12-14 22:48:53.764 CET [3751001][client backend][21/2:0] LOG: statement: SET log_statement=none
    ;
2025-12-14 22:49:57.223 CET [3750961][client backend][:0] LOG: disconnection: session time: 0:01:03.571 user=bf
database=postgres host=[local]
(64 seconds)

And a couple of other (successful) canebrake's runs:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=canebrake&dt=2025-12-14%2021%3A09%3A31
2025-12-14 22:59:48.951 CET [3761649][client backend][21/2:0] LOG: statement: SET log_statement=none
    ;
2025-12-14 23:01:26.274 CET [3761608][client backend][:0] LOG: disconnection: session time: 0:01:37.441 user=bf
database=postgres host=[local]
(98 seconds)

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=canebrake&dt=2025-12-14%2012%3A03%3A29
2025-12-14 13:59:29.205 CET [1564673][client backend][21/2:0] LOG: statement: SET log_statement=none
    ;
2025-12-14 14:01:15.208 CET [1564633][client backend][:0] LOG: disconnection: session time: 0:01:46.147 user=bf
database=postgres host=[local]
(106 seconds)

Thus, it looks like these animals can hit the 180-second timeout when some
external factor (concurrent load?) makes them run 2-3 times slower...

Best regards,
Alexander

#98Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Tom Lane (#96)
Re: POC: make mxidoff 64 bits

On 15/12/2025 00:55, Tom Lane wrote:

Heikki Linnakangas <hlinnaka@iki.fi> writes:

Ok, I have pushed this. Thanks!

Coverity is unhappy about this bit:

/srv/coverity/git/pgsql-git/postgresql/src/bin/pg_upgrade/multixact_read_v18.c: 282 in GetOldMultiXactIdSingleMember()
276 if (!TransactionIdIsValid(*xactptr))
277 {
278 /*
279 * Corner case 2: we are looking at unused slot zero
280 */
281 if (offset == 0)

CID 1676077: Control flow issues (DEADCODE)
Execution cannot reach this statement: "continue;".

282 continue;
283
284 /*
285 * Otherwise this is an invalid entry that should not be

It sees the earlier test for offset == 0, and evidently is assuming
that the loop's "offset++" will not wrap around. Now I think that
the point of this check is exactly that "offset++" could have wrapped
around, but the commentary is not so clear that I'm certain this is a
false positive.

Correct.

If that is the intention, what do you think of rephrasing this
comment as "we have wrapped around to unused slot zero"?

Ah yes, that's much better. Changed it to "offset must have wrapped
around to unused slot zero". This code and its comments are copied from
v18 GetMultiXactIdMembers(), so in order to maintain the rhyme with
that, I changed the comment in the back branches too. It doesn't exist in the
'master' version of GetMultiXactIdMembers() anymore.

Coverity is also complaining about the 'length' variable being tainted,
because it's calculated from the values read from disk. That's bogus
because we trust and make assumptions about the values on disk. That said,
I think it would make sense to do some more sanity checking here. In
particular, length should never be negative. I added such sanity checks
to the 'master' version of the GetMultiXactIdMembers() server function
in commit d4b7bde418, but it would make sense to add them to the upgrade
code too. I'll look into that.

- Heikki

#99Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Alexander Lakhin (#97)
Re: POC: make mxidoff 64 bits

On 15/12/2025 10:00, Alexander Lakhin wrote:

Hello Heikki,

09.12.2025 14:00, Heikki Linnakangas wrote:

Committed with that and some other minor cleanups. Thanks everyone!
This patch has been brewing for a while :-).

I've spotted a couple of failures of new test 007_multixact_conversion at
buildfarm:
...

Thus, it looks like these animals can hit the 180-second timeout when some
external factor (concurrent load?) makes them run 2-3 times slower...

Hmm, so it's hitting the timeout while running the multixid generation
workload part of the test. The workload keeps 20 parallel connections
open, using them in a round-robin fashion, and apparently the timeout
spans the whole duration of each connection rather than the individual
queries run on them.

On my laptop with jit_above_cost=0 and jit_optimize_above_cost=1000,
like on those buildfarm animals, the multixid generation workload runs
in under 10 s. I suppose it could be 20x slower on a slow, busy system.
So the straightforward fix is to bump up the timeout.

I'm tempted to force "jit=off" here, as the multixid generation is not
really the thing that's being tested here. It's just used to generate
multixids in the cluster before the upgrade. Then again, the point of
forcing JIT in these buildfarm members is to test JITting as part of
everything else that wasn't originally written as a JIT test.

- Heikki

#100Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Heikki Linnakangas (#99)
Re: POC: make mxidoff 64 bits

On 15/12/2025 16:31, Heikki Linnakangas wrote:

On my laptop with jit_above_cost=0 and jit_optimize_above_cost=1000,
like on those buildfarm animals, the multixid generation workload runs
in under 10 s. I suppose it could be 20x slower on a slow, busy system.
So the straightforward fix is to bump up the timeout.

I bumped the timeout to 4 * 180 s. I also added a few progress report
notes to the test output, so that we get a better picture of where the
time goes.

- Heikki

#101zengman
zengman@halodbtech.com
In reply to: Heikki Linnakangas (#100)
1 attachment(s)
Re: POC: make mxidoff 64 bits

Hi,

I'm currently looking into the `SlruReadSwitchPageSlow` function and have a question regarding the expression `&state->buf.data + bytes_read` —
I suspect the ampersand (&) here might be misused. Would you be able to help me verify this?

```
	while (bytes_read < BLCKSZ)
	{
		ssize_t		rc;

		rc = pg_pread(state->fd,
					  &state->buf.data + bytes_read,
					  BLCKSZ - bytes_read,
					  offset);
		if (rc < 0)
		{
			if (errno == EINTR)
				continue;
			pg_fatal("could not read file \"%s\": %m", state->fn);
		}
		if (rc == 0)
		{
			/* unexpected EOF */
			pg_log(PG_WARNING, "unexpected EOF reading file \"%s\" at offset %u, reading as zeros",
				   state->fn, (unsigned int) offset);
			memset(&state->buf.data + bytes_read, 0, BLCKSZ - bytes_read);
			break;
		}
		bytes_read += rc;
		offset += rc;
	}
```

```
rc = pg_pread(state->fd,
&state->buf.data + bytes_read,
BLCKSZ - bytes_read,
offset);
memset(&state->buf.data + bytes_read, 0, BLCKSZ - bytes_read);
```

--
Regards,
Man Zeng
www.openhalo.org

Attachments:

slru_io.diff (application/octet-stream; charset=gb18030)
diff --git a/src/bin/pg_upgrade/slru_io.c b/src/bin/pg_upgrade/slru_io.c
index 0862cd33e6c..2df0a1d0a76 100644
--- a/src/bin/pg_upgrade/slru_io.c
+++ b/src/bin/pg_upgrade/slru_io.c
@@ -121,7 +121,7 @@ SlruReadSwitchPageSlow(SlruSegState *state, uint64 pageno)
 		ssize_t		rc;
 
 		rc = pg_pread(state->fd,
-					  &state->buf.data + bytes_read,
+					  state->buf.data + bytes_read,
 					  BLCKSZ - bytes_read,
 					  offset);
 		if (rc < 0)
@@ -135,7 +135,7 @@ SlruReadSwitchPageSlow(SlruSegState *state, uint64 pageno)
 			/* unexpected EOF */
 			pg_log(PG_WARNING, "unexpected EOF reading file \"%s\" at offset %u, reading as zeros",
 				   state->fn, (unsigned int) offset);
-			memset(&state->buf.data + bytes_read, 0, BLCKSZ - bytes_read);
+			memset(state->buf.data + bytes_read, 0, BLCKSZ - bytes_read);
 			break;
 		}
 		bytes_read += rc;
#102Heikki Linnakangas
hlinnaka@iki.fi
In reply to: zengman (#101)
Re: POC: make mxidoff 64 bits

On 30/12/2025 03:49, zengman wrote:

Hi,

I'm currently looking into the `SlruReadSwitchPageSlow` function and have a question regarding the expression `&state->buf.data + bytes_read` —
I suspect the ampersand (&) here might be misused. Would you be able to help me verify this?

```
while (bytes_read < BLCKSZ)
{
ssize_t rc;

rc = pg_pread(state->fd,
&state->buf.data + bytes_read,
BLCKSZ - bytes_read,
offset);
if (rc < 0)
{
if (errno == EINTR)
continue;
pg_fatal("could not read file \"%s\": %m", state->fn);
}
if (rc == 0)
{
/* unexpected EOF */
pg_log(PG_WARNING, "unexpected EOF reading file \"%s\" at offset %u, reading as zeros",
state->fn, (unsigned int) offset);
memset(&state->buf.data + bytes_read, 0, BLCKSZ - bytes_read);
break;
}
bytes_read += rc;
offset += rc;
}
```

```
rc = pg_pread(state->fd,
&state->buf.data + bytes_read,
BLCKSZ - bytes_read,
offset);
memset(&state->buf.data + bytes_read, 0, BLCKSZ - bytes_read);
```

Yes, you're right. Good catch! Committed the fix, thanks.

- Heikki

#103zengman
zengman@halodbtech.com
In reply to: Heikki Linnakangas (#102)
Re: POC: make mxidoff 64 bits

I'm currently looking into the `SlruReadSwitchPageSlow` function and have a question regarding the expression `&state->buf.data + bytes_read` —
I suspect the ampersand (&) here might be misused. Would you be able to help me verify this?

Yes, you're right. Good catch! Committed the fix, thanks.

Thank you for confirming this, committing the fix, and recognizing this.😀

--
Regards,
Man Zeng
www.openhalo.org

#104Chao Li
li.evan.chao@gmail.com
In reply to: Heikki Linnakangas (#102)
1 attachment(s)
Re: POC: make mxidoff 64 bits

On Jan 5, 2026, at 02:06, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Yes, you're right. Good catch! Committed the fix, thanks.

- Heikki

Hi Heikki,

I actually reviewed this patch and had a comment on slru_io.c, but I don't know why I left my comment email in the drafts folder and never sent it out.

The comment was that:
```
+void
+FreeSlruRead(SlruSegState *state)
+{
+	Assert(!state->writing);	/* read only mode */
+
+	if (state->fd != -1)
+		close(state->fd);
+	pg_free(state);
+}
+void
+FreeSlruWrite(SlruSegState *state)
+{
+	Assert(state->writing);
+
+	SlruFlush(state);
+
+	if (state->fd != -1)
+		close(state->fd);
+	pg_free(state);
+}
```

Both FreeSlruRead() and FreeSlruWrite() call pg_free(state), but I don't see a reason why we don't free state->dir and state->fn as well. They are allocated by pstrdup and psprintf, so this looks like a memory leak.

I made the change in the attached diff. Please see if you agree with it.

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/

Attachments:

slru_io_memory.diff (application/octet-stream)
diff --git a/src/bin/pg_upgrade/slru_io.c b/src/bin/pg_upgrade/slru_io.c
index ae3e224d7b1..ed79124c6d5 100644
--- a/src/bin/pg_upgrade/slru_io.c
+++ b/src/bin/pg_upgrade/slru_io.c
@@ -19,6 +19,7 @@
 #include "slru_io.h"
 
 static SlruSegState *AllocSlruSegState(const char *dir);
+static void FreeSlruSegState(SlruSegState *state);
 static char *SlruFileName(SlruSegState *state, int64 segno);
 static void SlruFlush(SlruSegState *state);
 
@@ -69,6 +70,18 @@ AllocSlruRead(const char *dir, bool long_segment_names)
 	return state;
 }
 
+static void
+FreeSlruSegState(SlruSegState *state)
+{
+	if (state->fd != -1)
+		close(state->fd);
+	if (state->fn)
+		pg_free(state->fn);
+	if (state->dir)
+		pg_free(state->dir);
+	pg_free(state);
+}
+
 /*
  * Read the given page into memory buffer.
  *
@@ -154,9 +167,7 @@ FreeSlruRead(SlruSegState *state)
 {
 	Assert(!state->writing);	/* read only mode */
 
-	if (state->fd != -1)
-		close(state->fd);
-	pg_free(state);
+	FreeSlruSegState(state);
 }
 
 /*
@@ -263,7 +274,5 @@ FreeSlruWrite(SlruSegState *state)
 
 	SlruFlush(state);
 
-	if (state->fd != -1)
-		close(state->fd);
-	pg_free(state);
+	FreeSlruSegState(state);
 }