Issues with 2PC at recovery: CLOG lookups and GlobalTransactionData

Started by Michael Paquier12 months ago13 messages

michael@paquier.xyz

12 months ago

6 attachment(s)

Hi all,
(Noah and Vitaly in CC, who got involved in the previous discussion)

This thread is a continuation of the discussion that happened here:
/messages/by-id/13b5b6-676c3080-4d-531db900@47931709

And I am beginning a new thread about going through an issue that Noah
has mentioned at [1]/messages/by-id/20250117005221.05.nmisch@google.com, which is that the 2PC code may attempt to do
CLOG lookups at very early stage of recovery, where the cluster is not
in a consistent state.

I have dug into this issue, and while working on it, noticed a
separate bug related to how ProcessTwoPhaseBuffer() decides to handle
2PC entries in its shmem area GlobalTransactionData. Most of the
callers of this routine follow this pattern:
for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
{
GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
[...]
buf = ProcessTwoPhaseBuffer(gxact->xid, ...);
[...]
}

That may look OK, but we may finish by calling PrepareRedoRemove(),
which itself does a RemoveGXact(), manipulating directly TwoPhaseState
while we're looping with it. At first, I thought that it can be
pretty bad because it is possible to leave TwoPhaseState entries
around at the end of recovery, assuming that these are still live
transactions in the cluster, keeping locks, etc. However, and as far
as I can see, this would be only reachable if there was 2PC data on
disk inconsistent with the cluster. Base backups via custom tools
would be a problem, making the problem less worse.

My conclusion is that while these two issues are independent, their
fix is the same. My reasoning is that we'd better extract the
internals of ProcessTwoPhaseBuffer() that do the CLOG lookup and
future-2PC checks and move them to the places where they are relevant,
in a fashion similar to the original 2PC code of recovery in the 2005
area with d0a89683a3a4, where CLOG lookups were only attempted at the
end of recovery, with future lookups done at its beginning.

In our case, CLOG lookups are safe to do in
RecoverPreparedTransactions(), where 2PC transaction data is
reinstated before the server is ready for writes. The cleanup of the
2PC entries should happen once all the entries have been scanned. The
checks for future files can be done in restoreTwoPhaseData(), code
path taken early at recovery when scanning the 2PC files on disk.

Another thing that's struck me a bit is that it is possible to make
the checks done at the beginning of recovery tighter, by checking if
the 2PC file involved is older than the cluster-wide oldest xmin,
something we know from the checkpoint record.

Attached is a set of patches for HEAD, down to v13. Most of them have
finished by having conflicts. The patches in ~16 are straight-forward
as 2PC files don't have epochs. The patches in 17~ are more
invasive because FullTransactionIds are not integrated fully in
the 2PC stack (aka 5a1dfde8334b). This relies on 81772a495ec9 that
has made the work simpler. Still, you can notice that the basics are
the same as the ~16 patches, where the logic from
ProcessTwoPhaseBuffer is extracted.

I have gone back and forth with the versions of 17~ for a couple of
days and decided to bite the bullet with basics taken from the area of
[2]: /messages/by-id/20250116205254.65.nmisch@google.com -- Michael
to remove 2PC files from disk. That's more invasive, perhaps that's
for the best at the end keeping 17~ more consistent, meaning less
conflicts.

Each patch has a set of regression tests that check all these
conditions. They are a bit artistic, still the first test has caught
the second problem while I was looking at ways to automate the first
problem.

This needs more eyes, so I'll go add an entry in the CF. Patches for
stable branches are labelled with .txt, to avoid any interference with
the CF bot.

Thoughts and comments are welcome.

[1]: /messages/by-id/20250117005221.05.nmisch@google.com
[2]: /messages/by-id/20250116205254.65.nmisch@google.com -- Michael
--
Michael

Attachments:

0001-Fix-set-of-issues-with-2PC-transaction-handli-master.patchtext/x-diff; charset=us-asciiDownload

From ffef8812802e5419e5195f26103444cbb492ebff Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Thu, 30 Jan 2025 14:45:53 +0900
Subject: [PATCH] Fix set of issues with 2PC transaction handling during
 recovery

This addresses two issues:
- CLOG lookups could happen at a too early stage of recovery, where
these may not be consistent.  These are delayed until the end of
recovery.
- Legit 2PC transactions found as already committed or aborted when
scanning the shared memory data could cause inconsistencies in
TwoPhaseState, leaving around dangling entries for transactions that
should not be there.
---
 src/include/access/multixact.h               |   9 +-
 src/include/access/twophase.h                |  12 +-
 src/include/access/twophase_rmgr.h           |   4 +-
 src/include/pgstat.h                         |   4 +-
 src/include/storage/lock.h                   |  11 +-
 src/include/storage/predicate.h              |   7 +-
 src/backend/access/transam/multixact.c       |  16 +-
 src/backend/access/transam/twophase.c        | 327 +++++++++++--------
 src/backend/access/transam/xact.c            |  13 +-
 src/backend/storage/lmgr/lock.c              |  20 +-
 src/backend/storage/lmgr/predicate.c         |  11 +-
 src/backend/utils/activity/pgstat_relation.c |   4 +-
 src/test/recovery/t/009_twophase.pl          | 140 ++++++++
 13 files changed, 388 insertions(+), 190 deletions(-)

diff --git a/src/include/access/multixact.h b/src/include/access/multixact.h
index 4e6b0eec2ff..b876e98f46e 100644
--- a/src/include/access/multixact.h
+++ b/src/include/access/multixact.h
@@ -11,6 +11,7 @@
 #ifndef MULTIXACT_H
 #define MULTIXACT_H
 
+#include "access/transam.h"
 #include "access/xlogreader.h"
 #include "lib/stringinfo.h"
 #include "storage/sync.h"
@@ -119,7 +120,7 @@ extern int	multixactmemberssyncfiletag(const FileTag *ftag, char *path);
 
 extern void AtEOXact_MultiXact(void);
 extern void AtPrepare_MultiXact(void);
-extern void PostPrepare_MultiXact(TransactionId xid);
+extern void PostPrepare_MultiXact(FullTransactionId fxid);
 
 extern Size MultiXactShmemSize(void);
 extern void MultiXactShmemInit(void);
@@ -145,11 +146,11 @@ extern void MultiXactAdvanceNextMXact(MultiXactId minMulti,
 extern void MultiXactAdvanceOldest(MultiXactId oldestMulti, Oid oldestMultiDB);
 extern int	MultiXactMemberFreezeThreshold(void);
 
-extern void multixact_twophase_recover(TransactionId xid, uint16 info,
+extern void multixact_twophase_recover(FullTransactionId fxid, uint16 info,
 									   void *recdata, uint32 len);
-extern void multixact_twophase_postcommit(TransactionId xid, uint16 info,
+extern void multixact_twophase_postcommit(FullTransactionId fxid, uint16 info,
 										  void *recdata, uint32 len);
-extern void multixact_twophase_postabort(TransactionId xid, uint16 info,
+extern void multixact_twophase_postabort(FullTransactionId fxid, uint16 info,
 										 void *recdata, uint32 len);
 
 extern void multixact_redo(XLogReaderState *record);
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 9fa82355033..0ab8b3e64a7 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -14,6 +14,7 @@
 #ifndef TWOPHASE_H
 #define TWOPHASE_H
 
+#include "access/transam.h"
 #include "access/xact.h"
 #include "access/xlogdefs.h"
 #include "datatype/timestamp.h"
@@ -36,10 +37,10 @@ extern void PostPrepare_Twophase(void);
 
 extern TransactionId TwoPhaseGetXidByVirtualXID(VirtualTransactionId vxid,
 												bool *have_more);
-extern PGPROC *TwoPhaseGetDummyProc(TransactionId xid, bool lock_held);
-extern int	TwoPhaseGetDummyProcNumber(TransactionId xid, bool lock_held);
+extern PGPROC *TwoPhaseGetDummyProc(FullTransactionId fxid, bool lock_held);
+extern int	TwoPhaseGetDummyProcNumber(FullTransactionId fxid, bool lock_held);
 
-extern GlobalTransaction MarkAsPreparing(TransactionId xid, const char *gid,
+extern GlobalTransaction MarkAsPreparing(FullTransactionId fxid, const char *gid,
 										 TimestampTz prepared_at,
 										 Oid owner, Oid databaseid);
 
@@ -56,8 +57,9 @@ extern void CheckPointTwoPhase(XLogRecPtr redo_horizon);
 
 extern void FinishPreparedTransaction(const char *gid, bool isCommit);
 
-extern void PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
-						   XLogRecPtr end_lsn, RepOriginId origin_id);
+extern void PrepareRedoAdd(FullTransactionId fxid, char *buf,
+						   XLogRecPtr start_lsn, XLogRecPtr end_lsn,
+						   RepOriginId origin_id);
 extern void PrepareRedoRemove(TransactionId xid, bool giveWarning);
 extern void restoreTwoPhaseData(void);
 extern bool LookupGXact(const char *gid, XLogRecPtr prepare_end_lsn,
diff --git a/src/include/access/twophase_rmgr.h b/src/include/access/twophase_rmgr.h
index 3ed154bb231..8f576402e36 100644
--- a/src/include/access/twophase_rmgr.h
+++ b/src/include/access/twophase_rmgr.h
@@ -14,7 +14,9 @@
 #ifndef TWOPHASE_RMGR_H
 #define TWOPHASE_RMGR_H
 
-typedef void (*TwoPhaseCallback) (TransactionId xid, uint16 info,
+#include "access/transam.h"
+
+typedef void (*TwoPhaseCallback) (FullTransactionId fxid, uint16 info,
 								  void *recdata, uint32 len);
 typedef uint8 TwoPhaseRmgrId;
 
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 81ec0161c09..dcf230e2b6c 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -713,9 +713,9 @@ extern void pgstat_count_heap_delete(Relation rel);
 extern void pgstat_count_truncate(Relation rel);
 extern void pgstat_update_heap_dead_tuples(Relation rel, int delta);
 
-extern void pgstat_twophase_postcommit(TransactionId xid, uint16 info,
+extern void pgstat_twophase_postcommit(FullTransactionId fxid, uint16 info,
 									   void *recdata, uint32 len);
-extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
+extern void pgstat_twophase_postabort(FullTransactionId fxid, uint16 info,
 									  void *recdata, uint32 len);
 
 extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry(Oid relid);
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index 1076995518f..3feedfc8cb2 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -18,6 +18,7 @@
 #error "lock.h may not be included from frontend code"
 #endif
 
+#include "access/transam.h"
 #include "lib/ilist.h"
 #include "storage/lockdefs.h"
 #include "storage/lwlock.h"
@@ -579,7 +580,7 @@ extern bool LockHasWaiters(const LOCKTAG *locktag,
 extern VirtualTransactionId *GetLockConflicts(const LOCKTAG *locktag,
 											  LOCKMODE lockmode, int *countp);
 extern void AtPrepare_Locks(void);
-extern void PostPrepare_Locks(TransactionId xid);
+extern void PostPrepare_Locks(FullTransactionId fxid);
 extern bool LockCheckConflicts(LockMethod lockMethodTable,
 							   LOCKMODE lockmode,
 							   LOCK *lock, PROCLOCK *proclock);
@@ -594,13 +595,13 @@ extern BlockedProcsData *GetBlockerStatusData(int blocked_pid);
 extern xl_standby_lock *GetRunningTransactionLocks(int *nlocks);
 extern const char *GetLockmodeName(LOCKMETHODID lockmethodid, LOCKMODE mode);
 
-extern void lock_twophase_recover(TransactionId xid, uint16 info,
+extern void lock_twophase_recover(FullTransactionId fxid, uint16 info,
 								  void *recdata, uint32 len);
-extern void lock_twophase_postcommit(TransactionId xid, uint16 info,
+extern void lock_twophase_postcommit(FullTransactionId fxid, uint16 info,
 									 void *recdata, uint32 len);
-extern void lock_twophase_postabort(TransactionId xid, uint16 info,
+extern void lock_twophase_postabort(FullTransactionId fxid, uint16 info,
 									void *recdata, uint32 len);
-extern void lock_twophase_standby_recover(TransactionId xid, uint16 info,
+extern void lock_twophase_standby_recover(FullTransactionId fxid, uint16 info,
 										  void *recdata, uint32 len);
 
 extern DeadLockState DeadLockCheck(PGPROC *proc);
diff --git a/src/include/storage/predicate.h b/src/include/storage/predicate.h
index 267d5d90e94..4d3f218f93b 100644
--- a/src/include/storage/predicate.h
+++ b/src/include/storage/predicate.h
@@ -14,6 +14,7 @@
 #ifndef PREDICATE_H
 #define PREDICATE_H
 
+#include "access/transam.h"
 #include "storage/itemptr.h"
 #include "storage/lock.h"
 #include "utils/relcache.h"
@@ -72,9 +73,9 @@ extern void PreCommit_CheckForSerializationFailure(void);
 
 /* two-phase commit support */
 extern void AtPrepare_PredicateLocks(void);
-extern void PostPrepare_PredicateLocks(TransactionId xid);
-extern void PredicateLockTwoPhaseFinish(TransactionId xid, bool isCommit);
-extern void predicatelock_twophase_recover(TransactionId xid, uint16 info,
+extern void PostPrepare_PredicateLocks(FullTransactionId fxid);
+extern void PredicateLockTwoPhaseFinish(FullTransactionId xid, bool isCommit);
+extern void predicatelock_twophase_recover(FullTransactionId fxid, uint16 info,
 										   void *recdata, uint32 len);
 
 /* parallel query support */
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 27ccdf9500f..4e52792bd1f 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1847,7 +1847,7 @@ AtPrepare_MultiXact(void)
  *		Clean up after successful PREPARE TRANSACTION
  */
 void
-PostPrepare_MultiXact(TransactionId xid)
+PostPrepare_MultiXact(FullTransactionId fxid)
 {
 	MultiXactId myOldestMember;
 
@@ -1858,7 +1858,7 @@ PostPrepare_MultiXact(TransactionId xid)
 	myOldestMember = OldestMemberMXactId[MyProcNumber];
 	if (MultiXactIdIsValid(myOldestMember))
 	{
-		ProcNumber	dummyProcNumber = TwoPhaseGetDummyProcNumber(xid, false);
+		ProcNumber	dummyProcNumber = TwoPhaseGetDummyProcNumber(fxid, false);
 
 		/*
 		 * Even though storing MultiXactId is atomic, acquire lock to make
@@ -1896,10 +1896,10 @@ PostPrepare_MultiXact(TransactionId xid)
  *		Recover the state of a prepared transaction at startup
  */
 void
-multixact_twophase_recover(TransactionId xid, uint16 info,
+multixact_twophase_recover(FullTransactionId fxid, uint16 info,
 						   void *recdata, uint32 len)
 {
-	ProcNumber	dummyProcNumber = TwoPhaseGetDummyProcNumber(xid, false);
+	ProcNumber	dummyProcNumber = TwoPhaseGetDummyProcNumber(fxid, false);
 	MultiXactId oldestMember;
 
 	/*
@@ -1917,10 +1917,10 @@ multixact_twophase_recover(TransactionId xid, uint16 info,
  *		Similar to AtEOXact_MultiXact but for COMMIT PREPARED
  */
 void
-multixact_twophase_postcommit(TransactionId xid, uint16 info,
+multixact_twophase_postcommit(FullTransactionId fxid, uint16 info,
 							  void *recdata, uint32 len)
 {
-	ProcNumber	dummyProcNumber = TwoPhaseGetDummyProcNumber(xid, true);
+	ProcNumber	dummyProcNumber = TwoPhaseGetDummyProcNumber(fxid, true);
 
 	Assert(len == sizeof(MultiXactId));
 
@@ -1932,10 +1932,10 @@ multixact_twophase_postcommit(TransactionId xid, uint16 info,
  *		This is actually just the same as the COMMIT case.
  */
 void
-multixact_twophase_postabort(TransactionId xid, uint16 info,
+multixact_twophase_postabort(FullTransactionId fxid, uint16 info,
 							 void *recdata, uint32 len)
 {
-	multixact_twophase_postcommit(xid, info, recdata, len);
+	multixact_twophase_postcommit(fxid, info, recdata, len);
 }
 
 /*
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 73a80559194..c8a8d774b10 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -159,7 +159,7 @@ typedef struct GlobalTransactionData
 	 */
 	XLogRecPtr	prepare_start_lsn;	/* XLOG offset of prepare record start */
 	XLogRecPtr	prepare_end_lsn;	/* XLOG offset of prepare record end */
-	TransactionId xid;			/* The GXACT id */
+	FullTransactionId fxid;		/* The GXACT full xid */
 
 	Oid			owner;			/* ID of user that executed the xact */
 	ProcNumber	locking_backend;	/* backend currently working on the xact */
@@ -197,6 +197,7 @@ static GlobalTransaction MyLockedGxact = NULL;
 
 static bool twophaseExitRegistered = false;
 
+static void PrepareRedoRemoveFull(FullTransactionId fxid, bool giveWarning);
 static void RecordTransactionCommitPrepared(TransactionId xid,
 											int nchildren,
 											TransactionId *children,
@@ -216,19 +217,19 @@ static void RecordTransactionAbortPrepared(TransactionId xid,
 										   int nstats,
 										   xl_xact_stats_item *stats,
 										   const char *gid);
-static void ProcessRecords(char *bufptr, TransactionId xid,
+static void ProcessRecords(char *bufptr, FullTransactionId fxid,
 						   const TwoPhaseCallback callbacks[]);
 static void RemoveGXact(GlobalTransaction gxact);
 
 static void XlogReadTwoPhaseData(XLogRecPtr lsn, char **buf, int *len);
-static char *ProcessTwoPhaseBuffer(TransactionId xid,
+static char *ProcessTwoPhaseBuffer(FullTransactionId fxid,
 								   XLogRecPtr prepare_start_lsn,
 								   bool fromdisk, bool setParent, bool setNextXid);
-static void MarkAsPreparingGuts(GlobalTransaction gxact, TransactionId xid,
+static void MarkAsPreparingGuts(GlobalTransaction gxact, FullTransactionId fxid,
 								const char *gid, TimestampTz prepared_at, Oid owner,
 								Oid databaseid);
-static void RemoveTwoPhaseFile(TransactionId xid, bool giveWarning);
-static void RecreateTwoPhaseFile(TransactionId xid, void *content, int len);
+static void RemoveTwoPhaseFile(FullTransactionId fxid, bool giveWarning);
+static void RecreateTwoPhaseFile(FullTransactionId fxid, void *content, int len);
 
 /*
  * Initialization of shared memory
@@ -356,7 +357,7 @@ PostPrepare_Twophase(void)
  *		Reserve the GID for the given transaction.
  */
 GlobalTransaction
-MarkAsPreparing(TransactionId xid, const char *gid,
+MarkAsPreparing(FullTransactionId fxid, const char *gid,
 				TimestampTz prepared_at, Oid owner, Oid databaseid)
 {
 	GlobalTransaction gxact;
@@ -407,7 +408,7 @@ MarkAsPreparing(TransactionId xid, const char *gid,
 	gxact = TwoPhaseState->freeGXacts;
 	TwoPhaseState->freeGXacts = gxact->next;
 
-	MarkAsPreparingGuts(gxact, xid, gid, prepared_at, owner, databaseid);
+	MarkAsPreparingGuts(gxact, fxid, gid, prepared_at, owner, databaseid);
 
 	gxact->ondisk = false;
 
@@ -430,11 +431,13 @@ MarkAsPreparing(TransactionId xid, const char *gid,
  * Note: This function should be called with appropriate locks held.
  */
 static void
-MarkAsPreparingGuts(GlobalTransaction gxact, TransactionId xid, const char *gid,
-					TimestampTz prepared_at, Oid owner, Oid databaseid)
+MarkAsPreparingGuts(GlobalTransaction gxact, FullTransactionId fxid,
+					const char *gid, TimestampTz prepared_at, Oid owner,
+					Oid databaseid)
 {
 	PGPROC	   *proc;
 	int			i;
+	TransactionId xid = XidFromFullTransactionId(fxid);
 
 	Assert(LWLockHeldByMeInMode(TwoPhaseStateLock, LW_EXCLUSIVE));
 
@@ -479,7 +482,7 @@ MarkAsPreparingGuts(GlobalTransaction gxact, TransactionId xid, const char *gid,
 	proc->subxidStatus.count = 0;
 
 	gxact->prepared_at = prepared_at;
-	gxact->xid = xid;
+	gxact->fxid = fxid;
 	gxact->owner = owner;
 	gxact->locking_backend = MyProcNumber;
 	gxact->valid = false;
@@ -797,12 +800,12 @@ pg_prepared_xact(PG_FUNCTION_ARGS)
  * caller had better hold it.
  */
 static GlobalTransaction
-TwoPhaseGetGXact(TransactionId xid, bool lock_held)
+TwoPhaseGetGXact(FullTransactionId fxid, bool lock_held)
 {
 	GlobalTransaction result = NULL;
 	int			i;
 
-	static TransactionId cached_xid = InvalidTransactionId;
+	static FullTransactionId cached_fxid = {0};
 	static GlobalTransaction cached_gxact = NULL;
 
 	Assert(!lock_held || LWLockHeldByMe(TwoPhaseStateLock));
@@ -811,7 +814,7 @@ TwoPhaseGetGXact(TransactionId xid, bool lock_held)
 	 * During a recovery, COMMIT PREPARED, or ABORT PREPARED, we'll be called
 	 * repeatedly for the same XID.  We can save work with a simple cache.
 	 */
-	if (xid == cached_xid)
+	if (FullTransactionIdEquals(fxid, cached_fxid))
 		return cached_gxact;
 
 	if (!lock_held)
@@ -821,7 +824,7 @@ TwoPhaseGetGXact(TransactionId xid, bool lock_held)
 	{
 		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
 
-		if (gxact->xid == xid)
+		if (FullTransactionIdEquals(gxact->fxid, fxid))
 		{
 			result = gxact;
 			break;
@@ -832,9 +835,10 @@ TwoPhaseGetGXact(TransactionId xid, bool lock_held)
 		LWLockRelease(TwoPhaseStateLock);
 
 	if (result == NULL)			/* should not happen */
-		elog(ERROR, "failed to find GlobalTransaction for xid %u", xid);
+		elog(ERROR, "failed to find GlobalTransaction for xid %u",
+			 XidFromFullTransactionId(fxid));
 
-	cached_xid = xid;
+	cached_fxid = fxid;
 	cached_gxact = result;
 
 	return result;
@@ -881,7 +885,7 @@ TwoPhaseGetXidByVirtualXID(VirtualTransactionId vxid,
 				*have_more = true;
 				break;
 			}
-			result = gxact->xid;
+			result = XidFromFullTransactionId(gxact->fxid);
 		}
 	}
 
@@ -892,7 +896,7 @@ TwoPhaseGetXidByVirtualXID(VirtualTransactionId vxid,
 
 /*
  * TwoPhaseGetDummyProcNumber
- *		Get the dummy proc number for prepared transaction specified by XID
+ *		Get the dummy proc number for prepared transaction
  *
  * Dummy proc numbers are similar to proc numbers of real backends.  They
  * start at MaxBackends, and are unique across all currently active real
@@ -900,24 +904,24 @@ TwoPhaseGetXidByVirtualXID(VirtualTransactionId vxid,
  * TwoPhaseStateLock will not be taken, so the caller had better hold it.
  */
 ProcNumber
-TwoPhaseGetDummyProcNumber(TransactionId xid, bool lock_held)
+TwoPhaseGetDummyProcNumber(FullTransactionId fxid, bool lock_held)
 {
-	GlobalTransaction gxact = TwoPhaseGetGXact(xid, lock_held);
+	GlobalTransaction gxact = TwoPhaseGetGXact(fxid, lock_held);
 
 	return gxact->pgprocno;
 }
 
 /*
  * TwoPhaseGetDummyProc
- *		Get the PGPROC that represents a prepared transaction specified by XID
+ *		Get the PGPROC that represents a prepared transaction
  *
  * If lock_held is set to true, TwoPhaseStateLock will not be taken, so the
  * caller had better hold it.
  */
 PGPROC *
-TwoPhaseGetDummyProc(TransactionId xid, bool lock_held)
+TwoPhaseGetDummyProc(FullTransactionId fxid, bool lock_held)
 {
-	GlobalTransaction gxact = TwoPhaseGetGXact(xid, lock_held);
+	GlobalTransaction gxact = TwoPhaseGetGXact(fxid, lock_held);
 
 	return GetPGProcByNumber(gxact->pgprocno);
 }
@@ -942,10 +946,8 @@ AdjustToFullTransactionId(TransactionId xid)
 }
 
 static inline int
-TwoPhaseFilePath(char *path, TransactionId xid)
+TwoPhaseFilePath(char *path, FullTransactionId fxid)
 {
-	FullTransactionId fxid = AdjustToFullTransactionId(xid);
-
 	return snprintf(path, MAXPGPATH, TWOPHASE_DIR "/%08X%08X",
 					EpochFromFullTransactionId(fxid),
 					XidFromFullTransactionId(fxid));
@@ -1049,7 +1051,7 @@ void
 StartPrepare(GlobalTransaction gxact)
 {
 	PGPROC	   *proc = GetPGProcByNumber(gxact->pgprocno);
-	TransactionId xid = gxact->xid;
+	TransactionId xid = XidFromFullTransactionId(gxact->fxid);
 	TwoPhaseFileHeader hdr;
 	TransactionId *children;
 	RelFileLocator *commitrels;
@@ -1281,10 +1283,11 @@ RegisterTwoPhaseRecord(TwoPhaseRmgrId rmid, uint16 info,
  * If it looks OK (has a valid magic number and CRC), return the palloc'd
  * contents of the file, issuing an error when finding corrupted data.  If
  * missing_ok is true, which indicates that missing files can be safely
- * ignored, then return NULL.  This state can be reached when doing recovery.
+ * ignored, then return NULL.  This state can be reached when doing recovery
+ * after discarding two-phase files from frozen epochs.
  */
 static char *
-ReadTwoPhaseFile(TransactionId xid, bool missing_ok)
+ReadTwoPhaseFile(FullTransactionId fxid, bool missing_ok)
 {
 	char		path[MAXPGPATH];
 	char	   *buf;
@@ -1296,7 +1299,7 @@ ReadTwoPhaseFile(TransactionId xid, bool missing_ok)
 				file_crc;
 	int			r;
 
-	TwoPhaseFilePath(path, xid);
+	TwoPhaseFilePath(path, fxid);
 
 	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
 	if (fd < 0)
@@ -1461,6 +1464,7 @@ StandbyTransactionIdIsPrepared(TransactionId xid)
 	char	   *buf;
 	TwoPhaseFileHeader *hdr;
 	bool		result;
+	FullTransactionId fxid;
 
 	Assert(TransactionIdIsValid(xid));
 
@@ -1468,7 +1472,8 @@ StandbyTransactionIdIsPrepared(TransactionId xid)
 		return false;			/* nothing to do */
 
 	/* Read and validate file */
-	buf = ReadTwoPhaseFile(xid, true);
+	fxid = FullTransactionIdFromAllowableAt(TransamVariables->nextXid, xid);
+	buf = ReadTwoPhaseFile(fxid, true);
 	if (buf == NULL)
 		return false;
 
@@ -1488,6 +1493,7 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
 {
 	GlobalTransaction gxact;
 	PGPROC	   *proc;
+	FullTransactionId fxid;
 	TransactionId xid;
 	bool		ondisk;
 	char	   *buf;
@@ -1509,7 +1515,8 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
 	 */
 	gxact = LockGXact(gid, GetUserId());
 	proc = GetPGProcByNumber(gxact->pgprocno);
-	xid = gxact->xid;
+	fxid = gxact->fxid;
+	xid = XidFromFullTransactionId(fxid);
 
 	/*
 	 * Read and validate 2PC state data. State data will typically be stored
@@ -1517,7 +1524,7 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
 	 * to disk if for some reason they have lived for a long time.
 	 */
 	if (gxact->ondisk)
-		buf = ReadTwoPhaseFile(xid, false);
+		buf = ReadTwoPhaseFile(fxid, false);
 	else
 		XlogReadTwoPhaseData(gxact->prepare_start_lsn, &buf, NULL);
 
@@ -1636,11 +1643,11 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
 
 	/* And now do the callbacks */
 	if (isCommit)
-		ProcessRecords(bufptr, xid, twophase_postcommit_callbacks);
+		ProcessRecords(bufptr, fxid, twophase_postcommit_callbacks);
 	else
-		ProcessRecords(bufptr, xid, twophase_postabort_callbacks);
+		ProcessRecords(bufptr, fxid, twophase_postabort_callbacks);
 
-	PredicateLockTwoPhaseFinish(xid, isCommit);
+	PredicateLockTwoPhaseFinish(fxid, isCommit);
 
 	/*
 	 * Read this value while holding the two-phase lock, as the on-disk 2PC
@@ -1664,7 +1671,7 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
 	 * And now we can clean up any files we may have left.
 	 */
 	if (ondisk)
-		RemoveTwoPhaseFile(xid, true);
+		RemoveTwoPhaseFile(fxid, true);
 
 	MyLockedGxact = NULL;
 
@@ -1677,7 +1684,7 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
  * Scan 2PC state data in memory and call the indicated callbacks for each 2PC record.
  */
 static void
-ProcessRecords(char *bufptr, TransactionId xid,
+ProcessRecords(char *bufptr, FullTransactionId fxid,
 			   const TwoPhaseCallback callbacks[])
 {
 	for (;;)
@@ -1691,24 +1698,28 @@ ProcessRecords(char *bufptr, TransactionId xid,
 		bufptr += MAXALIGN(sizeof(TwoPhaseRecordOnDisk));
 
 		if (callbacks[record->rmid] != NULL)
-			callbacks[record->rmid] (xid, record->info, bufptr, record->len);
+			callbacks[record->rmid] (fxid, record->info, bufptr, record->len);
 
 		bufptr += MAXALIGN(record->len);
 	}
 }
 
 /*
- * Remove the 2PC file for the specified XID.
+ * Remove the 2PC file.
  *
  * If giveWarning is false, do not complain about file-not-present;
  * this is an expected case during WAL replay.
+ *
+ * This routine is used at early stages at recovery where future and
+ * past orphaned files are checked, hence the FullTransactionId to build
+ * a complete file name fit for the removal.
  */
 static void
-RemoveTwoPhaseFile(TransactionId xid, bool giveWarning)
+RemoveTwoPhaseFile(FullTransactionId fxid, bool giveWarning)
 {
 	char		path[MAXPGPATH];
 
-	TwoPhaseFilePath(path, xid);
+	TwoPhaseFilePath(path, fxid);
 	if (unlink(path))
 		if (errno != ENOENT || giveWarning)
 			ereport(WARNING,
@@ -1723,7 +1734,7 @@ RemoveTwoPhaseFile(TransactionId xid, bool giveWarning)
  * Note: content and len don't include CRC.
  */
 static void
-RecreateTwoPhaseFile(TransactionId xid, void *content, int len)
+RecreateTwoPhaseFile(FullTransactionId fxid, void *content, int len)
 {
 	char		path[MAXPGPATH];
 	pg_crc32c	statefile_crc;
@@ -1734,7 +1745,7 @@ RecreateTwoPhaseFile(TransactionId xid, void *content, int len)
 	COMP_CRC32C(statefile_crc, content, len);
 	FIN_CRC32C(statefile_crc);
 
-	TwoPhaseFilePath(path, xid);
+	TwoPhaseFilePath(path, fxid);
 
 	fd = OpenTransientFile(path,
 						   O_CREAT | O_TRUNC | O_WRONLY | PG_BINARY);
@@ -1846,7 +1857,7 @@ CheckPointTwoPhase(XLogRecPtr redo_horizon)
 			int			len;
 
 			XlogReadTwoPhaseData(gxact->prepare_start_lsn, &buf, &len);
-			RecreateTwoPhaseFile(gxact->xid, buf, len);
+			RecreateTwoPhaseFile(gxact->fxid, buf, len);
 			gxact->ondisk = true;
 			gxact->prepare_start_lsn = InvalidXLogRecPtr;
 			gxact->prepare_end_lsn = InvalidXLogRecPtr;
@@ -1882,13 +1893,16 @@ CheckPointTwoPhase(XLogRecPtr redo_horizon)
  * Scan pg_twophase and fill TwoPhaseState depending on the on-disk data.
  * This is called once at the beginning of recovery, saving any extra
  * lookups in the future.  Two-phase files that are newer than the
- * minimum XID horizon are discarded on the way.
+ * minimum XID horizon are discarded on the way, as much as files that
+ * are older than the oldest XID horizon.
  */
 void
 restoreTwoPhaseData(void)
 {
 	DIR		   *cldir;
 	struct dirent *clde;
+	FullTransactionId nextXid = TransamVariables->nextXid;
+	FullTransactionId oldestXid = AdjustToFullTransactionId(TransamVariables->oldestXid);
 
 	LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
 	cldir = AllocateDir(TWOPHASE_DIR);
@@ -1897,19 +1911,33 @@ restoreTwoPhaseData(void)
 		if (strlen(clde->d_name) == 16 &&
 			strspn(clde->d_name, "0123456789ABCDEF") == 16)
 		{
-			TransactionId xid;
 			FullTransactionId fxid;
 			char	   *buf;
 
 			fxid = FullTransactionIdFromU64(strtou64(clde->d_name, NULL, 16));
-			xid = XidFromFullTransactionId(fxid);
 
-			buf = ProcessTwoPhaseBuffer(xid, InvalidXLogRecPtr,
-										true, false, false);
-			if (buf == NULL)
+			/* Reject XID if too new or too old */
+			if (FullTransactionIdFollowsOrEquals(fxid, nextXid) ||
+				FullTransactionIdPrecedes(fxid, oldestXid))
+			{
+				if (FullTransactionIdFollowsOrEquals(fxid, nextXid))
+					ereport(WARNING,
+							(errmsg("removing future two-phase state file for transaction %u of epoch %u",
+									XidFromFullTransactionId(fxid),
+									EpochFromFullTransactionId(fxid))));
+				else
+					ereport(WARNING,
+							(errmsg("removing past two-phase state file for transaction %u of epoch %u",
+									XidFromFullTransactionId(fxid),
+									EpochFromFullTransactionId(fxid))));
+				RemoveTwoPhaseFile(fxid, true);
 				continue;
+			}
 
-			PrepareRedoAdd(buf, InvalidXLogRecPtr,
+			buf = ProcessTwoPhaseBuffer(fxid, InvalidXLogRecPtr,
+										true, false, false);
+
+			PrepareRedoAdd(fxid, buf, InvalidXLogRecPtr,
 						   InvalidXLogRecPtr, InvalidRepOriginId);
 		}
 	}
@@ -1968,15 +1996,11 @@ PrescanPreparedTransactions(TransactionId **xids_p, int *nxids_p)
 
 		Assert(gxact->inredo);
 
-		xid = gxact->xid;
-
-		buf = ProcessTwoPhaseBuffer(xid,
+		xid = XidFromFullTransactionId(gxact->fxid);
+		buf = ProcessTwoPhaseBuffer(gxact->fxid,
 									gxact->prepare_start_lsn,
 									gxact->ondisk, false, true);
 
-		if (buf == NULL)
-			continue;
-
 		/*
 		 * OK, we think this file is valid.  Incorporate xid into the
 		 * running-minimum result.
@@ -2036,19 +2060,15 @@ StandbyRecoverPreparedTransactions(void)
 	LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
 	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
 	{
-		TransactionId xid;
 		char	   *buf;
 		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
 
 		Assert(gxact->inredo);
 
-		xid = gxact->xid;
-
-		buf = ProcessTwoPhaseBuffer(xid,
+		buf = ProcessTwoPhaseBuffer(gxact->fxid,
 									gxact->prepare_start_lsn,
 									gxact->ondisk, true, false);
-		if (buf != NULL)
-			pfree(buf);
+		pfree(buf);
 	}
 	LWLockRelease(TwoPhaseStateLock);
 }
@@ -2073,19 +2093,51 @@ void
 RecoverPreparedTransactions(void)
 {
 	int			i;
+	FullTransactionId *remove_fxids;
+	int			remove_fxids_cnt;
 
 	LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
+
+	/*
+	 * Track XIDs candidate for removal if found as already committed or
+	 * aborted, once the first scan through TwoPhaseState is done.  This
+	 * cannot happen while going through the entries in TwoPhaseState as
+	 * PrepareRedoRemove() manipulates it.
+	 */
+	remove_fxids_cnt = 0;
+	remove_fxids = (FullTransactionId *) palloc(TwoPhaseState->numPrepXacts *
+												sizeof(FullTransactionId));
+
 	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
 	{
-		TransactionId xid;
 		char	   *buf;
 		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+		FullTransactionId fxid = gxact->fxid;
 		char	   *bufptr;
 		TwoPhaseFileHeader *hdr;
 		TransactionId *subxids;
 		const char *gid;
 
-		xid = gxact->xid;
+		/*
+		 * Is this transaction already aborted or committed?  If yes, mark it
+		 * for removal.
+		 *
+		 * Checking CLOGs if these transactions have been already aborted or
+		 * committed is safe at this stage; we are at the end of recovery and
+		 * all WAL has been replayed, all 2PC transactions are reinstated and
+		 * should be tracked in TwoPhaseState.
+		 */
+		if (TransactionIdDidCommit(XidFromFullTransactionId(fxid)) ||
+			TransactionIdDidAbort(XidFromFullTransactionId(fxid)))
+		{
+			/*
+			 * Track this transaction ID for its removal from the shared
+			 * memory state at the end.
+			 */
+			remove_fxids[remove_fxids_cnt] = fxid;
+			remove_fxids_cnt++;
+			continue;
+		}
 
 		/*
 		 * Reconstruct subtrans state for the transaction --- needed because
@@ -2096,17 +2148,18 @@ RecoverPreparedTransactions(void)
 		 * SubTransSetParent has been set before, if the prepared transaction
 		 * generated xid assignment records.
 		 */
-		buf = ProcessTwoPhaseBuffer(xid,
+		buf = ProcessTwoPhaseBuffer(gxact->fxid,
 									gxact->prepare_start_lsn,
 									gxact->ondisk, true, false);
-		if (buf == NULL)
-			continue;
 
 		ereport(LOG,
-				(errmsg("recovering prepared transaction %u from shared memory", xid)));
+				(errmsg("recovering prepared transaction %u of epoch %u from shared memory",
+						XidFromFullTransactionId(gxact->fxid),
+						EpochFromFullTransactionId(gxact->fxid))));
 
 		hdr = (TwoPhaseFileHeader *) buf;
-		Assert(TransactionIdEquals(hdr->xid, xid));
+		Assert(TransactionIdEquals(hdr->xid,
+								   XidFromFullTransactionId(gxact->fxid)));
 		bufptr = buf + MAXALIGN(sizeof(TwoPhaseFileHeader));
 		gid = (const char *) bufptr;
 		bufptr += MAXALIGN(hdr->gidlen);
@@ -2122,7 +2175,7 @@ RecoverPreparedTransactions(void)
 		 * Recreate its GXACT and dummy PGPROC. But, check whether it was
 		 * added in redo and already has a shmem entry for it.
 		 */
-		MarkAsPreparingGuts(gxact, xid, gid,
+		MarkAsPreparingGuts(gxact, gxact->fxid, gid,
 							hdr->prepared_at,
 							hdr->owner, hdr->database);
 
@@ -2137,7 +2190,7 @@ RecoverPreparedTransactions(void)
 		/*
 		 * Recover other state (notably locks) using resource managers.
 		 */
-		ProcessRecords(bufptr, xid, twophase_recover_callbacks);
+		ProcessRecords(bufptr, fxid, twophase_recover_callbacks);
 
 		/*
 		 * Release locks held by the standby process after we process each
@@ -2145,7 +2198,7 @@ RecoverPreparedTransactions(void)
 		 * additional locks at any one time.
 		 */
 		if (InHotStandby)
-			StandbyReleaseLockTree(xid, hdr->nsubxacts, subxids);
+			StandbyReleaseLockTree(hdr->xid, hdr->nsubxacts, subxids);
 
 		/*
 		 * We're done with recovering this transaction. Clear MyLockedGxact,
@@ -2158,13 +2211,25 @@ RecoverPreparedTransactions(void)
 		LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
 	}
 
+	for (i = 0; i < remove_fxids_cnt; i++)
+	{
+		FullTransactionId fxid = remove_fxids[i];
+
+		ereport(WARNING,
+				(errmsg("removing stale two-phase state from memory for transaction %u of epoch %u",
+						XidFromFullTransactionId(fxid),
+						EpochFromFullTransactionId(fxid))));
+
+		PrepareRedoRemoveFull(fxid, true);
+	}
+
 	LWLockRelease(TwoPhaseStateLock);
 }
 
 /*
  * ProcessTwoPhaseBuffer
  *
- * Given a transaction id, read it either from disk or read it directly
+ * Given a FullTransactionId, read it either from disk or read it directly
  * via shmem xlog record pointer using the provided "prepare_start_lsn".
  *
  * If setParent is true, set up subtransaction parent linkages.
@@ -2173,13 +2238,11 @@ RecoverPreparedTransactions(void)
  * value scanned.
  */
 static char *
-ProcessTwoPhaseBuffer(TransactionId xid,
+ProcessTwoPhaseBuffer(FullTransactionId fxid,
 					  XLogRecPtr prepare_start_lsn,
 					  bool fromdisk,
 					  bool setParent, bool setNextXid)
 {
-	FullTransactionId nextXid = TransamVariables->nextXid;
-	TransactionId origNextXid = XidFromFullTransactionId(nextXid);
 	TransactionId *subxids;
 	char	   *buf;
 	TwoPhaseFileHeader *hdr;
@@ -2190,50 +2253,10 @@ ProcessTwoPhaseBuffer(TransactionId xid,
 	if (!fromdisk)
 		Assert(prepare_start_lsn != InvalidXLogRecPtr);
 
-	/* Already processed? */
-	if (TransactionIdDidCommit(xid) || TransactionIdDidAbort(xid))
-	{
-		if (fromdisk)
-		{
-			ereport(WARNING,
-					(errmsg("removing stale two-phase state file for transaction %u",
-							xid)));
-			RemoveTwoPhaseFile(xid, true);
-		}
-		else
-		{
-			ereport(WARNING,
-					(errmsg("removing stale two-phase state from memory for transaction %u",
-							xid)));
-			PrepareRedoRemove(xid, true);
-		}
-		return NULL;
-	}
-
-	/* Reject XID if too new */
-	if (TransactionIdFollowsOrEquals(xid, origNextXid))
-	{
-		if (fromdisk)
-		{
-			ereport(WARNING,
-					(errmsg("removing future two-phase state file for transaction %u",
-							xid)));
-			RemoveTwoPhaseFile(xid, true);
-		}
-		else
-		{
-			ereport(WARNING,
-					(errmsg("removing future two-phase state from memory for transaction %u",
-							xid)));
-			PrepareRedoRemove(xid, true);
-		}
-		return NULL;
-	}
-
 	if (fromdisk)
 	{
 		/* Read and validate file */
-		buf = ReadTwoPhaseFile(xid, false);
+		buf = ReadTwoPhaseFile(fxid, false);
 	}
 	else
 	{
@@ -2243,18 +2266,20 @@ ProcessTwoPhaseBuffer(TransactionId xid,
 
 	/* Deconstruct header */
 	hdr = (TwoPhaseFileHeader *) buf;
-	if (!TransactionIdEquals(hdr->xid, xid))
+	if (!TransactionIdEquals(hdr->xid, XidFromFullTransactionId(fxid)))
 	{
 		if (fromdisk)
 			ereport(ERROR,
 					(errcode(ERRCODE_DATA_CORRUPTED),
-					 errmsg("corrupted two-phase state file for transaction %u",
-							xid)));
+					 errmsg("corrupted two-phase state file for transaction %u of epoch %u",
+							XidFromFullTransactionId(fxid),
+							EpochFromFullTransactionId(fxid))));
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_DATA_CORRUPTED),
-					 errmsg("corrupted two-phase state in memory for transaction %u",
-							xid)));
+					 errmsg("corrupted two-phase state in memory for transaction %u of epoch %u",
+							XidFromFullTransactionId(fxid),
+							EpochFromFullTransactionId(fxid))));
 	}
 
 	/*
@@ -2268,14 +2293,14 @@ ProcessTwoPhaseBuffer(TransactionId xid,
 	{
 		TransactionId subxid = subxids[i];
 
-		Assert(TransactionIdFollows(subxid, xid));
+		Assert(TransactionIdFollows(subxid, XidFromFullTransactionId(fxid)));
 
 		/* update nextXid if needed */
 		if (setNextXid)
 			AdvanceNextFullTransactionIdPastXid(subxid);
 
 		if (setParent)
-			SubTransSetParent(subxid, xid);
+			SubTransSetParent(subxid, XidFromFullTransactionId(fxid));
 	}
 
 	return buf;
@@ -2466,8 +2491,9 @@ RecordTransactionAbortPrepared(TransactionId xid,
  * data, the entry is marked as located on disk.
  */
 void
-PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
-			   XLogRecPtr end_lsn, RepOriginId origin_id)
+PrepareRedoAdd(FullTransactionId fxid, char *buf,
+			   XLogRecPtr start_lsn, XLogRecPtr end_lsn,
+			   RepOriginId origin_id)
 {
 	TwoPhaseFileHeader *hdr = (TwoPhaseFileHeader *) buf;
 	char	   *bufptr;
@@ -2477,6 +2503,10 @@ PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 	Assert(LWLockHeldByMeInMode(TwoPhaseStateLock, LW_EXCLUSIVE));
 	Assert(RecoveryInProgress());
 
+	if (!FullTransactionIdIsValid(fxid))
+		fxid = FullTransactionIdFromAllowableAt(TransamVariables->nextXid,
+												hdr->xid);
+
 	bufptr = buf + MAXALIGN(sizeof(TwoPhaseFileHeader));
 	gid = (const char *) bufptr;
 
@@ -2505,7 +2535,8 @@ PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 	{
 		char		path[MAXPGPATH];
 
-		TwoPhaseFilePath(path, hdr->xid);
+		Assert(InRecovery);
+		TwoPhaseFilePath(path, fxid);
 
 		if (access(path, F_OK) == 0)
 		{
@@ -2536,7 +2567,7 @@ PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 	gxact->prepared_at = hdr->prepared_at;
 	gxact->prepare_start_lsn = start_lsn;
 	gxact->prepare_end_lsn = end_lsn;
-	gxact->xid = hdr->xid;
+	gxact->fxid = fxid;
 	gxact->owner = hdr->owner;
 	gxact->locking_backend = INVALID_PROC_NUMBER;
 	gxact->valid = false;
@@ -2555,11 +2586,13 @@ PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 						   false /* backward */ , false /* WAL */ );
 	}
 
-	elog(DEBUG2, "added 2PC data in shared memory for transaction %u", gxact->xid);
+	elog(DEBUG2, "added 2PC data in shared memory for transaction %u of epoch %u",
+		 XidFromFullTransactionId(gxact->fxid),
+		 EpochFromFullTransactionId(gxact->fxid));
 }
 
 /*
- * PrepareRedoRemove
+ * PrepareRedoRemoveFull
  *
  * Remove the corresponding gxact entry from TwoPhaseState. Also remove
  * the 2PC file if a prepared transaction was saved via an earlier checkpoint.
@@ -2567,8 +2600,8 @@ PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
  * Caller must hold TwoPhaseStateLock in exclusive mode, because TwoPhaseState
  * is updated.
  */
-void
-PrepareRedoRemove(TransactionId xid, bool giveWarning)
+static void
+PrepareRedoRemoveFull(FullTransactionId fxid, bool giveWarning)
 {
 	GlobalTransaction gxact = NULL;
 	int			i;
@@ -2581,7 +2614,7 @@ PrepareRedoRemove(TransactionId xid, bool giveWarning)
 	{
 		gxact = TwoPhaseState->prepXacts[i];
 
-		if (gxact->xid == xid)
+		if (FullTransactionIdEquals(gxact->fxid, fxid))
 		{
 			Assert(gxact->inredo);
 			found = true;
@@ -2598,12 +2631,28 @@ PrepareRedoRemove(TransactionId xid, bool giveWarning)
 	/*
 	 * And now we can clean up any files we may have left.
 	 */
-	elog(DEBUG2, "removing 2PC data for transaction %u", xid);
+	elog(DEBUG2, "removing 2PC data for transaction %u of epoch %u ",
+		 XidFromFullTransactionId(fxid),
+		 EpochFromFullTransactionId(fxid));
+
 	if (gxact->ondisk)
-		RemoveTwoPhaseFile(xid, giveWarning);
+		RemoveTwoPhaseFile(fxid, giveWarning);
+
 	RemoveGXact(gxact);
 }
 
+/*
+ * Wrapper of PrepareRedoRemoveFull(), for TransactionIds.
+ */
+void
+PrepareRedoRemove(TransactionId xid, bool giveWarning)
+{
+	FullTransactionId fxid =
+		FullTransactionIdFromAllowableAt(TransamVariables->nextXid, xid);
+
+	PrepareRedoRemoveFull(fxid, giveWarning);
+}
+
 /*
  * LookupGXact
  *		Check if the prepared transaction with the given GID, lsn and timestamp
@@ -2648,7 +2697,7 @@ LookupGXact(const char *gid, XLogRecPtr prepare_end_lsn,
 			 * between publisher and subscriber.
 			 */
 			if (gxact->ondisk)
-				buf = ReadTwoPhaseFile(gxact->xid, false);
+				buf = ReadTwoPhaseFile(gxact->fxid, false);
 			else
 			{
 				Assert(gxact->prepare_start_lsn);
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index d331ab90d78..8cd0c8bfbd8 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2512,7 +2512,7 @@ static void
 PrepareTransaction(void)
 {
 	TransactionState s = CurrentTransactionState;
-	TransactionId xid = GetCurrentTransactionId();
+	FullTransactionId fxid = GetCurrentFullTransactionId();
 	GlobalTransaction gxact;
 	TimestampTz prepared_at;
 
@@ -2641,7 +2641,7 @@ PrepareTransaction(void)
 	 * Reserve the GID for this transaction. This could fail if the requested
 	 * GID is invalid or already in use.
 	 */
-	gxact = MarkAsPreparing(xid, prepareGID, prepared_at,
+	gxact = MarkAsPreparing(fxid, prepareGID, prepared_at,
 							GetUserId(), MyDatabaseId);
 	prepareGID = NULL;
 
@@ -2691,7 +2691,7 @@ PrepareTransaction(void)
 	 * ProcArrayClearTransaction().  Otherwise, a GetLockConflicts() would
 	 * conclude "xact already committed or aborted" for our locks.
 	 */
-	PostPrepare_Locks(xid);
+	PostPrepare_Locks(fxid);
 
 	/*
 	 * Let others know about no transaction in progress by me.  This has to be
@@ -2733,9 +2733,9 @@ PrepareTransaction(void)
 
 	PostPrepare_smgr();
 
-	PostPrepare_MultiXact(xid);
+	PostPrepare_MultiXact(fxid);
 
-	PostPrepare_PredicateLocks(xid);
+	PostPrepare_PredicateLocks(fxid);
 
 	ResourceOwnerRelease(TopTransactionResourceOwner,
 						 RESOURCE_RELEASE_LOCKS,
@@ -6408,7 +6408,8 @@ xact_redo(XLogReaderState *record)
 		 * gxact entry.
 		 */
 		LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
-		PrepareRedoAdd(XLogRecGetData(record),
+		PrepareRedoAdd(InvalidFullTransactionId,
+					   XLogRecGetData(record),
 					   record->ReadRecPtr,
 					   record->EndRecPtr,
 					   XLogRecGetOrigin(record));
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 11b4d1085bb..9a361d4a988 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -3476,9 +3476,9 @@ AtPrepare_Locks(void)
  * but that probably costs more cycles.
  */
 void
-PostPrepare_Locks(TransactionId xid)
+PostPrepare_Locks(FullTransactionId fxid)
 {
-	PGPROC	   *newproc = TwoPhaseGetDummyProc(xid, false);
+	PGPROC	   *newproc = TwoPhaseGetDummyProc(fxid, false);
 	HASH_SEQ_STATUS status;
 	LOCALLOCK  *locallock;
 	LOCK	   *lock;
@@ -4261,11 +4261,11 @@ DumpAllLocks(void)
  * and PANIC anyway.
  */
 void
-lock_twophase_recover(TransactionId xid, uint16 info,
+lock_twophase_recover(FullTransactionId fxid, uint16 info,
 					  void *recdata, uint32 len)
 {
 	TwoPhaseLockRecord *rec = (TwoPhaseLockRecord *) recdata;
-	PGPROC	   *proc = TwoPhaseGetDummyProc(xid, false);
+	PGPROC	   *proc = TwoPhaseGetDummyProc(fxid, false);
 	LOCKTAG    *locktag;
 	LOCKMODE	lockmode;
 	LOCKMETHODID lockmethodid;
@@ -4442,7 +4442,7 @@ lock_twophase_recover(TransactionId xid, uint16 info,
  * starting up into hot standby mode.
  */
 void
-lock_twophase_standby_recover(TransactionId xid, uint16 info,
+lock_twophase_standby_recover(FullTransactionId fxid, uint16 info,
 							  void *recdata, uint32 len)
 {
 	TwoPhaseLockRecord *rec = (TwoPhaseLockRecord *) recdata;
@@ -4461,7 +4461,7 @@ lock_twophase_standby_recover(TransactionId xid, uint16 info,
 	if (lockmode == AccessExclusiveLock &&
 		locktag->locktag_type == LOCKTAG_RELATION)
 	{
-		StandbyAcquireAccessExclusiveLock(xid,
+		StandbyAcquireAccessExclusiveLock(XidFromFullTransactionId(fxid),
 										  locktag->locktag_field1 /* dboid */ ,
 										  locktag->locktag_field2 /* reloid */ );
 	}
@@ -4474,11 +4474,11 @@ lock_twophase_standby_recover(TransactionId xid, uint16 info,
  * Find and release the lock indicated by the 2PC record.
  */
 void
-lock_twophase_postcommit(TransactionId xid, uint16 info,
+lock_twophase_postcommit(FullTransactionId fxid, uint16 info,
 						 void *recdata, uint32 len)
 {
 	TwoPhaseLockRecord *rec = (TwoPhaseLockRecord *) recdata;
-	PGPROC	   *proc = TwoPhaseGetDummyProc(xid, true);
+	PGPROC	   *proc = TwoPhaseGetDummyProc(fxid, true);
 	LOCKTAG    *locktag;
 	LOCKMETHODID lockmethodid;
 	LockMethod	lockMethodTable;
@@ -4500,10 +4500,10 @@ lock_twophase_postcommit(TransactionId xid, uint16 info,
  * This is actually just the same as the COMMIT case.
  */
 void
-lock_twophase_postabort(TransactionId xid, uint16 info,
+lock_twophase_postabort(FullTransactionId fxid, uint16 info,
 						void *recdata, uint32 len)
 {
-	lock_twophase_postcommit(xid, info, recdata, len);
+	lock_twophase_postcommit(fxid, info, recdata, len);
 }
 
 /*
diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index 5b21a053981..928647d691e 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -191,7 +191,7 @@
  *		AtPrepare_PredicateLocks(void);
  *		PostPrepare_PredicateLocks(TransactionId xid);
  *		PredicateLockTwoPhaseFinish(TransactionId xid, bool isCommit);
- *		predicatelock_twophase_recover(TransactionId xid, uint16 info,
+ *		predicatelock_twophase_recover(FullTransactionId fxid, uint16 info,
  *									   void *recdata, uint32 len);
  */
 
@@ -4846,7 +4846,7 @@ AtPrepare_PredicateLocks(void)
  *		anyway. We only need to clean up our local state.
  */
 void
-PostPrepare_PredicateLocks(TransactionId xid)
+PostPrepare_PredicateLocks(FullTransactionId fxid)
 {
 	if (MySerializableXact == InvalidSerializableXact)
 		return;
@@ -4869,12 +4869,12 @@ PostPrepare_PredicateLocks(TransactionId xid)
  *		commits or aborts.
  */
 void
-PredicateLockTwoPhaseFinish(TransactionId xid, bool isCommit)
+PredicateLockTwoPhaseFinish(FullTransactionId fxid, bool isCommit)
 {
 	SERIALIZABLEXID *sxid;
 	SERIALIZABLEXIDTAG sxidtag;
 
-	sxidtag.xid = xid;
+	sxidtag.xid = XidFromFullTransactionId(fxid);
 
 	LWLockAcquire(SerializableXactHashLock, LW_SHARED);
 	sxid = (SERIALIZABLEXID *)
@@ -4896,10 +4896,11 @@ PredicateLockTwoPhaseFinish(TransactionId xid, bool isCommit)
  * Re-acquire a predicate lock belonging to a transaction that was prepared.
  */
 void
-predicatelock_twophase_recover(TransactionId xid, uint16 info,
+predicatelock_twophase_recover(FullTransactionId fxid, uint16 info,
 							   void *recdata, uint32 len)
 {
 	TwoPhasePredicateRecord *record;
+	TransactionId xid = XidFromFullTransactionId(fxid);
 
 	Assert(len == sizeof(TwoPhasePredicateRecord));
 
diff --git a/src/backend/utils/activity/pgstat_relation.c b/src/backend/utils/activity/pgstat_relation.c
index d64595a165c..b3c10abde4f 100644
--- a/src/backend/utils/activity/pgstat_relation.c
+++ b/src/backend/utils/activity/pgstat_relation.c
@@ -744,7 +744,7 @@ PostPrepare_PgStat_Relations(PgStat_SubXactStatus *xact_state)
  * Load the saved counts into our local pgstats state.
  */
 void
-pgstat_twophase_postcommit(TransactionId xid, uint16 info,
+pgstat_twophase_postcommit(FullTransactionId fxid, uint16 info,
 						   void *recdata, uint32 len)
 {
 	TwoPhasePgStatRecord *rec = (TwoPhasePgStatRecord *) recdata;
@@ -780,7 +780,7 @@ pgstat_twophase_postcommit(TransactionId xid, uint16 info,
  * as aborted.
  */
 void
-pgstat_twophase_postabort(TransactionId xid, uint16 info,
+pgstat_twophase_postabort(FullTransactionId fxid, uint16 info,
 						  void *recdata, uint32 len)
 {
 	TwoPhasePgStatRecord *rec = (TwoPhasePgStatRecord *) recdata;
diff --git a/src/test/recovery/t/009_twophase.pl b/src/test/recovery/t/009_twophase.pl
index 1a662ebe499..3a3714a2d8b 100644
--- a/src/test/recovery/t/009_twophase.pl
+++ b/src/test/recovery/t/009_twophase.pl
@@ -5,6 +5,7 @@
 use strict;
 use warnings FATAL => 'all';
 
+use File::Copy;
 use PostgreSQL::Test::Cluster;
 use PostgreSQL::Test::Utils;
 use Test::More;
@@ -28,6 +29,15 @@ sub configure_and_reload
 	return;
 }
 
+sub twophase_file_name
+{
+	local $Test::Builder::Level = $Test::Builder::Level + 1;
+
+	my $epoch = shift;
+	my $xid = shift;
+	return sprintf("%08X%08X", $epoch, $xid);
+}
+
 # Set up two nodes, which will alternately be primary and replication standby.
 
 # Setup london node
@@ -572,4 +582,134 @@ my $nsubtrans = $cur_primary->safe_psql('postgres',
 );
 isnt($osubtrans, $nsubtrans, "contents of pg_subtrans/ have changed");
 
+###############################################################################
+# Check handling of already committed or aborted 2PC files at recovery.
+# This test does a manual copy of 2PC files created in a running server,
+# to cheaply emulate situations that could be found in base backups.
+###############################################################################
+
+# Issue a set of transactions that will be used for this portion of the test:
+# - One transaction to hold on the minimum xid horizon at bay.
+# - One transaction that will be found as already committed at recovery.
+# - One transaction that will be fonnd as already rollbacked at recovery.
+$cur_primary->psql(
+	'postgres', "
+	BEGIN;
+	INSERT INTO t_009_tbl VALUES (40, 'transaction: xid horizon');
+	PREPARE TRANSACTION 'xact_009_40';
+	BEGIN;
+	INSERT INTO t_009_tbl VALUES (41, 'transaction: commit-prepared');
+	PREPARE TRANSACTION 'xact_009_41';
+	BEGIN;
+	INSERT INTO t_009_tbl VALUES (42, 'transaction: rollback-prepared');
+	PREPARE TRANSACTION 'xact_009_42';");
+
+# Issue a checkpoint, fixing the XID horizon based on the first transaction,
+# flushing to disk the two files to use.
+$cur_primary->psql('postgres', "CHECKPOINT");
+
+# Get the transaction IDs of the ones to 2PC files to manipulate.
+my $commit_prepared_xid = int(
+	$cur_primary->safe_psql(
+		'postgres',
+		"SELECT transaction FROM pg_prepared_xacts WHERE gid = 'xact_009_41'")
+);
+my $abort_prepared_xid = int(
+	$cur_primary->safe_psql(
+		'postgres',
+		"SELECT transaction FROM pg_prepared_xacts WHERE gid = 'xact_009_42'")
+);
+
+# Copy the two-phase files that will be put back later.  Assume an
+# epoch of 0.
+my $commit_prepared_name = twophase_file_name(0, $commit_prepared_xid);
+my $abort_prepared_name = twophase_file_name(0, $abort_prepared_xid);
+
+my $twophase_tmpdir = $PostgreSQL::Test::Utils::tmp_check . '/' . "2pc_files";
+mkdir($twophase_tmpdir);
+my $primary_twophase_folder = $cur_primary->data_dir . '/pg_twophase/';
+copy("$primary_twophase_folder/$commit_prepared_name", $twophase_tmpdir);
+copy("$primary_twophase_folder/$abort_prepared_name", $twophase_tmpdir);
+
+# Issue abort/commit prepared.
+$cur_primary->psql('postgres', "COMMIT PREPARED 'xact_009_41'");
+$cur_primary->psql('postgres', "ROLLBACK PREPARED 'xact_009_42'");
+
+# Again checkpoint, to advance the LSN past the point where the two previous
+# transaction records would be replayed.
+$cur_primary->psql('postgres', "CHECKPOINT");
+
+# Take down node.
+$cur_primary->teardown_node;
+
+# Move back the two twophase files.
+copy("$twophase_tmpdir/$commit_prepared_name", $primary_twophase_folder);
+copy("$twophase_tmpdir/$abort_prepared_name", $primary_twophase_folder);
+
+# Grab location in logs of primary
+my $log_offset = -s $cur_primary->logfile;
+
+# Start node and check that the two previous files are removed by checking the
+# server logs, following the CLOG lookup done at the end of recovery.
+$cur_primary->start;
+
+$cur_primary->log_check(
+	"two-phase files of committed transactions removed at recovery",
+	$log_offset,
+	log_like => [
+		qr/removing stale two-phase state from memory for transaction $commit_prepared_xid of epoch 0/,
+		qr/removing stale two-phase state from memory for transaction $abort_prepared_xid of epoch 0/
+	]);
+
+# Commit the first transaction.
+$cur_primary->psql('postgres', "COMMIT PREPARED 'xact_009_40'");
+# After replay, there should be no 2PC transactions.
+$cur_primary->psql(
+	'postgres',
+	"SELECT * FROM pg_prepared_xact",
+	stdout => \$psql_out);
+is($psql_out, qq{}, "Check expected pg_prepared_xact data on primary");
+# Data from transactions should be around.
+$cur_primary->psql(
+	'postgres',
+	"SELECT * FROM t_009_tbl WHERE id IN (40, 41, 42);",
+	stdout => \$psql_out);
+is( $psql_out, qq{40|transaction: xid horizon
+41|transaction: commit-prepared},
+	"Check expected table data on primary");
+
+###############################################################################
+# Check handling of orphaned 2PC files at recovery.
+###############################################################################
+
+$cur_standby->teardown_node;
+$cur_primary->teardown_node;
+
+# Grab location in logs of primary
+$log_offset = -s $cur_primary->logfile;
+
+# Create fake files with a transaction ID large or low enough to be in the
+# future or the past, in different epochs, then check that the primary is able
+# to start and remove these files at recovery.
+
+# First bump the epoch with pg_resetwal.
+$cur_primary->command_ok(
+	[ 'pg_resetwal', '-e', 256, '-f', $cur_primary->data_dir ],
+	'bump epoch of primary');
+
+my $future_2pc_file =
+  $cur_primary->data_dir . '/pg_twophase/000001FF00000FFF';
+append_to_file $future_2pc_file, "";
+my $past_2pc_file = $cur_primary->data_dir . '/pg_twophase/000000EE00000FFF';
+append_to_file $past_2pc_file, "";
+
+$cur_primary->start;
+$cur_primary->log_check(
+	"two-phase files removed at recovery",
+	$log_offset,
+	log_like => [
+		qr/removing past two-phase state file for transaction 4095 of epoch 238/,
+		qr/removing future two-phase state file for transaction 4095 of epoch 511/
+	]);
+
 done_testing();
-- 
2.47.2

0001-Fix-set-of-issues-with-2PC-transaction-handling-dur-13.txttext/plain; charset=us-asciiDownload

From 5bbc09abe3e0ec7c5f198dc1ed5ae7037e343ba1 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Thu, 30 Jan 2025 12:04:36 +0900
Subject: [PATCH] Fix set of issues with 2PC transaction handling during
 recovery

This addresses two issues:
- CLOG lookups could happen at a too early stage of recovery, where
these may not be consistent.  These are delayed until the end of
recovery.
- Legit 2PC transactions found as already committed or aborted when
scanning the shared memory data could cause inconsistencies in
TwoPhaseState, leaving around dangling entries for transactions that
should not be there.
---
 src/backend/access/transam/twophase.c | 119 +++++++++++++----------
 src/test/recovery/t/009_twophase.pl   | 135 +++++++++++++++++++++++++-
 2 files changed, 201 insertions(+), 53 deletions(-)

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 0a0932cff44..49c760dc19d 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1863,13 +1863,17 @@ CheckPointTwoPhase(XLogRecPtr redo_horizon)
  * Scan pg_twophase and fill TwoPhaseState depending on the on-disk data.
  * This is called once at the beginning of recovery, saving any extra
  * lookups in the future.  Two-phase files that are newer than the
- * minimum XID horizon are discarded on the way.
+ * minimum XID horizon are discarded on the way, as much as files that
+ * are older than the oldest XID horizon.
  */
 void
 restoreTwoPhaseData(void)
 {
 	DIR		   *cldir;
 	struct dirent *clde;
+	FullTransactionId nextFullXid = ShmemVariableCache->nextFullXid;
+	TransactionId origNextXid = XidFromFullTransactionId(nextFullXid);
+	TransactionId oldestXid = ShmemVariableCache->oldestXid;
 
 	LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
 	cldir = AllocateDir(TWOPHASE_DIR);
@@ -1883,10 +1887,24 @@ restoreTwoPhaseData(void)
 
 			xid = (TransactionId) strtoul(clde->d_name, NULL, 16);
 
+			/* Reject XID if too new or too old */
+			if (TransactionIdFollowsOrEquals(xid, origNextXid) ||
+				TransactionIdPrecedes(xid, oldestXid))
+			{
+				if (TransactionIdFollowsOrEquals(xid, origNextXid))
+					ereport(WARNING,
+							(errmsg("removing future two-phase state file for transaction %u",
+									xid)));
+				else
+					ereport(WARNING,
+							(errmsg("removing past two-phase state file for transaction %u",
+									xid)));
+				RemoveTwoPhaseFile(xid, true);
+				continue;
+			}
+
 			buf = ProcessTwoPhaseBuffer(xid, InvalidXLogRecPtr,
 										true, false, false);
-			if (buf == NULL)
-				continue;
 
 			PrepareRedoAdd(buf, InvalidXLogRecPtr,
 						   InvalidXLogRecPtr, InvalidRepOriginId);
@@ -1953,9 +1971,6 @@ PrescanPreparedTransactions(TransactionId **xids_p, int *nxids_p)
 									gxact->prepare_start_lsn,
 									gxact->ondisk, false, true);
 
-		if (buf == NULL)
-			continue;
-
 		/*
 		 * OK, we think this file is valid.  Incorporate xid into the
 		 * running-minimum result.
@@ -2026,8 +2041,7 @@ StandbyRecoverPreparedTransactions(void)
 		buf = ProcessTwoPhaseBuffer(xid,
 									gxact->prepare_start_lsn,
 									gxact->ondisk, true, false);
-		if (buf != NULL)
-			pfree(buf);
+		pfree(buf);
 	}
 	LWLockRelease(TwoPhaseStateLock);
 }
@@ -2052,8 +2066,21 @@ void
 RecoverPreparedTransactions(void)
 {
 	int			i;
+	TransactionId *remove_xids;
+	int			remove_xids_cnt;
 
 	LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
+
+	/*
+	 * Track XIDs candidate for removal if found as already committed or
+	 * aborted, once the first scan through TwoPhaseState is done.  This
+	 * cannot happen while going through the entries in TwoPhaseState as
+	 * PrepareRedoRemove() manipulates it.
+	 */
+	remove_xids_cnt = 0;
+	remove_xids = (TransactionId *) palloc(TwoPhaseState->numPrepXacts *
+										   sizeof(TransactionId));
+
 	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
 	{
 		TransactionId xid;
@@ -2066,6 +2093,26 @@ RecoverPreparedTransactions(void)
 
 		xid = gxact->xid;
 
+		/*
+		 * Is this transaction already aborted or committed?  If yes, mark it
+		 * for removal.
+		 *
+		 * Checking CLOGs if these transactions have been already aborted or
+		 * committed is safe at this stage; we are at the end of recovery and
+		 * all WAL has been replayed, all 2PC transactions are reinstated and
+		 * should be tracked in TwoPhaseState.
+		 */
+		if (TransactionIdDidCommit(xid) || TransactionIdDidAbort(xid))
+		{
+			/*
+			 * Track this transaction ID for its removal from the shared
+			 * memory state at the end.
+			 */
+			remove_xids[remove_xids_cnt] = xid;
+			remove_xids_cnt++;
+			continue;
+		}
+
 		/*
 		 * Reconstruct subtrans state for the transaction --- needed because
 		 * pg_subtrans is not preserved over a restart.  Note that we are
@@ -2078,8 +2125,6 @@ RecoverPreparedTransactions(void)
 		buf = ProcessTwoPhaseBuffer(xid,
 									gxact->prepare_start_lsn,
 									gxact->ondisk, true, false);
-		if (buf == NULL)
-			continue;
 
 		ereport(LOG,
 				(errmsg("recovering prepared transaction %u from shared memory", xid)));
@@ -2135,7 +2180,19 @@ RecoverPreparedTransactions(void)
 		LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
 	}
 
+	for (i = 0; i < remove_xids_cnt; i++)
+	{
+		TransactionId xid = remove_xids[i];
+
+		ereport(WARNING,
+				(errmsg("removing stale two-phase state from memory for transaction %u",
+						xid)));
+		PrepareRedoRemove(xid, true);
+	}
+
 	LWLockRelease(TwoPhaseStateLock);
+
+	pfree(remove_xids);
 }
 
 /*
@@ -2155,8 +2212,6 @@ ProcessTwoPhaseBuffer(TransactionId xid,
 					  bool fromdisk,
 					  bool setParent, bool setNextXid)
 {
-	FullTransactionId nextFullXid = ShmemVariableCache->nextFullXid;
-	TransactionId origNextXid = XidFromFullTransactionId(nextFullXid);
 	TransactionId *subxids;
 	char	   *buf;
 	TwoPhaseFileHeader *hdr;
@@ -2167,46 +2222,6 @@ ProcessTwoPhaseBuffer(TransactionId xid,
 	if (!fromdisk)
 		Assert(prepare_start_lsn != InvalidXLogRecPtr);
 
-	/* Already processed? */
-	if (TransactionIdDidCommit(xid) || TransactionIdDidAbort(xid))
-	{
-		if (fromdisk)
-		{
-			ereport(WARNING,
-					(errmsg("removing stale two-phase state file for transaction %u",
-							xid)));
-			RemoveTwoPhaseFile(xid, true);
-		}
-		else
-		{
-			ereport(WARNING,
-					(errmsg("removing stale two-phase state from memory for transaction %u",
-							xid)));
-			PrepareRedoRemove(xid, true);
-		}
-		return NULL;
-	}
-
-	/* Reject XID if too new */
-	if (TransactionIdFollowsOrEquals(xid, origNextXid))
-	{
-		if (fromdisk)
-		{
-			ereport(WARNING,
-					(errmsg("removing future two-phase state file for transaction %u",
-							xid)));
-			RemoveTwoPhaseFile(xid, true);
-		}
-		else
-		{
-			ereport(WARNING,
-					(errmsg("removing future two-phase state from memory for transaction %u",
-							xid)));
-			PrepareRedoRemove(xid, true);
-		}
-		return NULL;
-	}
-
 	if (fromdisk)
 	{
 		/* Read and validate file */
diff --git a/src/test/recovery/t/009_twophase.pl b/src/test/recovery/t/009_twophase.pl
index 15bb28627f9..de9acd5e43a 100644
--- a/src/test/recovery/t/009_twophase.pl
+++ b/src/test/recovery/t/009_twophase.pl
@@ -2,9 +2,10 @@
 use strict;
 use warnings;
 
+use File::Copy;
 use PostgresNode;
 use TestLib;
-use Test::More tests => 27;
+use Test::More tests => 33;
 
 my $psql_out = '';
 my $psql_rc  = '';
@@ -25,6 +26,14 @@ sub configure_and_reload
 	return;
 }
 
+sub twophase_file_name
+{
+	local $Test::Builder::Level = $Test::Builder::Level + 1;
+
+	my $xid = shift;
+	return sprintf("%08X", $xid);
+}
+
 # Set up two nodes, which will alternately be master and replication standby.
 
 # Setup london node
@@ -523,3 +532,127 @@ $cur_standby->psql(
 is( $psql_out,
 	qq{27|issued to paris},
 	"Check expected t_009_tbl2 data on standby");
+
+###############################################################################
+# Check handling of already committed or aborted 2PC files at recovery.
+# This test does a manual copy of 2PC files created in a running server,
+# to cheaply emulate situations that could be found in base backups.
+###############################################################################
+
+# Issue a set of transactions that will be used for this portion of the test:
+# - One transaction to hold on the minimum xid horizon at bay.
+# - One transaction that will be found as already committed at recovery.
+# - One transaction that will be fonnd as already rollbacked at recovery.
+$cur_master->psql(
+	'postgres', "
+	BEGIN;
+	INSERT INTO t_009_tbl VALUES (40, 'transaction: xid horizon');
+	PREPARE TRANSACTION 'xact_009_40';
+	BEGIN;
+	INSERT INTO t_009_tbl VALUES (41, 'transaction: commit-prepared');
+	PREPARE TRANSACTION 'xact_009_41';
+	BEGIN;
+	INSERT INTO t_009_tbl VALUES (42, 'transaction: rollback-prepared');
+	PREPARE TRANSACTION 'xact_009_42';");
+
+# Issue a checkpoint, fixing the XID horizon based on the first transaction,
+# flushing to disk the two files to use.
+$cur_master->psql('postgres', "CHECKPOINT");
+
+# Get the transaction IDs of the ones to 2PC files to manipulate.
+my $commit_prepared_xid = int(
+	$cur_master->safe_psql(
+		'postgres',
+		"SELECT transaction FROM pg_prepared_xacts WHERE gid = 'xact_009_41'")
+);
+my $abort_prepared_xid = int(
+	$cur_master->safe_psql(
+		'postgres',
+		"SELECT transaction FROM pg_prepared_xacts WHERE gid = 'xact_009_42'")
+);
+
+# Copy the two-phase files that will be put back later.
+my $commit_prepared_name = twophase_file_name($commit_prepared_xid);
+my $abort_prepared_name = twophase_file_name($abort_prepared_xid);
+
+my $twophase_tmpdir = $PostgreSQL::Test::Utils::tmp_check . '/' . "2pc_files";
+mkdir($twophase_tmpdir);
+my $primary_twophase_folder = $cur_master->data_dir . '/pg_twophase/';
+copy("$primary_twophase_folder/$commit_prepared_name", $twophase_tmpdir);
+copy("$primary_twophase_folder/$abort_prepared_name", $twophase_tmpdir);
+
+# Issue abort/commit prepared.
+$cur_master->psql('postgres', "COMMIT PREPARED 'xact_009_41'");
+$cur_master->psql('postgres', "ROLLBACK PREPARED 'xact_009_42'");
+
+# Again checkpoint, to advance the LSN past the point where the two previous
+# transaction records would be replayed.
+$cur_master->psql('postgres', "CHECKPOINT");
+
+# Take down node.
+$cur_master->teardown_node;
+
+# Move back the two twophase files.
+copy("$twophase_tmpdir/$commit_prepared_name", $primary_twophase_folder);
+copy("$twophase_tmpdir/$abort_prepared_name", $primary_twophase_folder);
+
+# Grab location in logs of primary
+my $log_offset = -s $cur_master->logfile;
+
+# Start node and check that the two previous files are removed by checking the
+# server logs, following the CLOG lookup done at the end of recovery.
+$cur_master->start;
+
+ok($cur_master->log_contains(
+	qr/removing stale two-phase state from memory for transaction $commit_prepared_xid/,
+	$log_offset),
+	"two-phase file of committed transaction removed at recovery");
+ok($cur_master->log_contains(
+	qr/removing stale two-phase state from memory for transaction $abort_prepared_xid/,
+	$log_offset),
+	"two-phase file of aborted transaction removed at recovery");
+
+# Commit the first transaction.
+$cur_master->psql('postgres', "COMMIT PREPARED 'xact_009_40'");
+# After replay, there should be no 2PC transactions.
+$cur_master->psql(
+	'postgres',
+	"SELECT * FROM pg_prepared_xact",
+	stdout => \$psql_out);
+is($psql_out, qq{}, "Check expected pg_prepared_xact data on primary");
+# Data from transactions should be around.
+$cur_master->psql(
+	'postgres',
+	"SELECT * FROM t_009_tbl WHERE id IN (40, 41, 42);",
+	stdout => \$psql_out);
+is( $psql_out, qq{40|transaction: xid horizon
+41|transaction: commit-prepared},
+	"Check expected table data on primary");
+
+###############################################################################
+# Check handling of orphaned 2PC files at recovery.
+###############################################################################
+
+$cur_master->teardown_node;
+
+# Grab location in logs of primary
+$log_offset = -s $cur_master->logfile;
+
+# Create fake files with a transaction ID large or low enough to be in the
+# future or the past, then check that the primary is able to start and remove
+# these files at recovery.
+
+my $future_2pc_file = $cur_master->data_dir . '/pg_twophase/00FFFFFF';
+append_to_file $future_2pc_file, "";
+my $past_2pc_file = $cur_master->data_dir . '/pg_twophase/000000FF';
+append_to_file $past_2pc_file, "";
+
+$cur_master->start;
+ok($cur_master->log_contains(
+	qr/removing future two-phase state file for transaction 16777215/,
+	$log_offset),
+	"removed future two-phase state file");
+ok($cur_master->log_contains(
+	qr/removing past two-phase state file for transaction 255/,
+	$log_offset),
+	"removed past two-phase state file");
-- 
2.47.2

0001-Fix-set-of-issues-with-2PC-transaction-handling-dur-14.txttext/plain; charset=us-asciiDownload

From 130917bcb8c3e895e06cc27e3e798a0a7fa6d357 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Thu, 30 Jan 2025 12:04:36 +0900
Subject: [PATCH] Fix set of issues with 2PC transaction handling during
 recovery

This addresses two issues:
- CLOG lookups could happen at a too early stage of recovery, where
these may not be consistent.  These are delayed until the end of
recovery.
- Legit 2PC transactions found as already committed or aborted when
scanning the shared memory data could cause inconsistencies in
TwoPhaseState, leaving around dangling entries for transactions that
should not be there.
---
 src/backend/access/transam/twophase.c | 119 +++++++++++++----------
 src/test/recovery/t/009_twophase.pl   | 133 +++++++++++++++++++++++++-
 2 files changed, 199 insertions(+), 53 deletions(-)

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 16848fa226c..7c9019ad8ee 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1852,13 +1852,17 @@ CheckPointTwoPhase(XLogRecPtr redo_horizon)
  * Scan pg_twophase and fill TwoPhaseState depending on the on-disk data.
  * This is called once at the beginning of recovery, saving any extra
  * lookups in the future.  Two-phase files that are newer than the
- * minimum XID horizon are discarded on the way.
+ * minimum XID horizon are discarded on the way, as much as files that
+ * are older than the oldest XID horizon.
  */
 void
 restoreTwoPhaseData(void)
 {
 	DIR		   *cldir;
 	struct dirent *clde;
+	FullTransactionId nextXid = ShmemVariableCache->nextXid;
+	TransactionId origNextXid = XidFromFullTransactionId(nextXid);
+	TransactionId oldestXid = ShmemVariableCache->oldestXid;
 
 	LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
 	cldir = AllocateDir(TWOPHASE_DIR);
@@ -1872,10 +1876,24 @@ restoreTwoPhaseData(void)
 
 			xid = (TransactionId) strtoul(clde->d_name, NULL, 16);
 
+			/* Reject XID if too new or too old */
+			if (TransactionIdFollowsOrEquals(xid, origNextXid) ||
+				TransactionIdPrecedes(xid, oldestXid))
+			{
+				if (TransactionIdFollowsOrEquals(xid, origNextXid))
+					ereport(WARNING,
+							(errmsg("removing future two-phase state file for transaction %u",
+									xid)));
+				else
+					ereport(WARNING,
+							(errmsg("removing past two-phase state file for transaction %u",
+									xid)));
+				RemoveTwoPhaseFile(xid, true);
+				continue;
+			}
+
 			buf = ProcessTwoPhaseBuffer(xid, InvalidXLogRecPtr,
 										true, false, false);
-			if (buf == NULL)
-				continue;
 
 			PrepareRedoAdd(buf, InvalidXLogRecPtr,
 						   InvalidXLogRecPtr, InvalidRepOriginId);
@@ -1942,9 +1960,6 @@ PrescanPreparedTransactions(TransactionId **xids_p, int *nxids_p)
 									gxact->prepare_start_lsn,
 									gxact->ondisk, false, true);
 
-		if (buf == NULL)
-			continue;
-
 		/*
 		 * OK, we think this file is valid.  Incorporate xid into the
 		 * running-minimum result.
@@ -2015,8 +2030,7 @@ StandbyRecoverPreparedTransactions(void)
 		buf = ProcessTwoPhaseBuffer(xid,
 									gxact->prepare_start_lsn,
 									gxact->ondisk, true, false);
-		if (buf != NULL)
-			pfree(buf);
+		pfree(buf);
 	}
 	LWLockRelease(TwoPhaseStateLock);
 }
@@ -2041,8 +2055,21 @@ void
 RecoverPreparedTransactions(void)
 {
 	int			i;
+	TransactionId *remove_xids;
+	int			remove_xids_cnt;
 
 	LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
+
+	/*
+	 * Track XIDs candidate for removal if found as already committed or
+	 * aborted, once the first scan through TwoPhaseState is done.  This
+	 * cannot happen while going through the entries in TwoPhaseState as
+	 * PrepareRedoRemove() manipulates it.
+	 */
+	remove_xids_cnt = 0;
+	remove_xids = (TransactionId *) palloc(TwoPhaseState->numPrepXacts *
+										   sizeof(TransactionId));
+
 	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
 	{
 		TransactionId xid;
@@ -2055,6 +2082,26 @@ RecoverPreparedTransactions(void)
 
 		xid = gxact->xid;
 
+		/*
+		 * Is this transaction already aborted or committed?  If yes, mark it
+		 * for removal.
+		 *
+		 * Checking CLOGs if these transactions have been already aborted or
+		 * committed is safe at this stage; we are at the end of recovery and
+		 * all WAL has been replayed, all 2PC transactions are reinstated and
+		 * should be tracked in TwoPhaseState.
+		 */
+		if (TransactionIdDidCommit(xid) || TransactionIdDidAbort(xid))
+		{
+			/*
+			 * Track this transaction ID for its removal from the shared
+			 * memory state at the end.
+			 */
+			remove_xids[remove_xids_cnt] = xid;
+			remove_xids_cnt++;
+			continue;
+		}
+
 		/*
 		 * Reconstruct subtrans state for the transaction --- needed because
 		 * pg_subtrans is not preserved over a restart.  Note that we are
@@ -2067,8 +2114,6 @@ RecoverPreparedTransactions(void)
 		buf = ProcessTwoPhaseBuffer(xid,
 									gxact->prepare_start_lsn,
 									gxact->ondisk, true, false);
-		if (buf == NULL)
-			continue;
 
 		ereport(LOG,
 				(errmsg("recovering prepared transaction %u from shared memory", xid)));
@@ -2124,7 +2169,19 @@ RecoverPreparedTransactions(void)
 		LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
 	}
 
+	for (i = 0; i < remove_xids_cnt; i++)
+	{
+		TransactionId xid = remove_xids[i];
+
+		ereport(WARNING,
+				(errmsg("removing stale two-phase state from memory for transaction %u",
+						xid)));
+		PrepareRedoRemove(xid, true);
+	}
+
 	LWLockRelease(TwoPhaseStateLock);
+
+	pfree(remove_xids);
 }
 
 /*
@@ -2144,8 +2201,6 @@ ProcessTwoPhaseBuffer(TransactionId xid,
 					  bool fromdisk,
 					  bool setParent, bool setNextXid)
 {
-	FullTransactionId nextXid = ShmemVariableCache->nextXid;
-	TransactionId origNextXid = XidFromFullTransactionId(nextXid);
 	TransactionId *subxids;
 	char	   *buf;
 	TwoPhaseFileHeader *hdr;
@@ -2156,46 +2211,6 @@ ProcessTwoPhaseBuffer(TransactionId xid,
 	if (!fromdisk)
 		Assert(prepare_start_lsn != InvalidXLogRecPtr);
 
-	/* Already processed? */
-	if (TransactionIdDidCommit(xid) || TransactionIdDidAbort(xid))
-	{
-		if (fromdisk)
-		{
-			ereport(WARNING,
-					(errmsg("removing stale two-phase state file for transaction %u",
-							xid)));
-			RemoveTwoPhaseFile(xid, true);
-		}
-		else
-		{
-			ereport(WARNING,
-					(errmsg("removing stale two-phase state from memory for transaction %u",
-							xid)));
-			PrepareRedoRemove(xid, true);
-		}
-		return NULL;
-	}
-
-	/* Reject XID if too new */
-	if (TransactionIdFollowsOrEquals(xid, origNextXid))
-	{
-		if (fromdisk)
-		{
-			ereport(WARNING,
-					(errmsg("removing future two-phase state file for transaction %u",
-							xid)));
-			RemoveTwoPhaseFile(xid, true);
-		}
-		else
-		{
-			ereport(WARNING,
-					(errmsg("removing future two-phase state from memory for transaction %u",
-							xid)));
-			PrepareRedoRemove(xid, true);
-		}
-		return NULL;
-	}
-
 	if (fromdisk)
 	{
 		/* Read and validate file */
diff --git a/src/test/recovery/t/009_twophase.pl b/src/test/recovery/t/009_twophase.pl
index a5686db2526..06557bebbe7 100644
--- a/src/test/recovery/t/009_twophase.pl
+++ b/src/test/recovery/t/009_twophase.pl
@@ -5,9 +5,10 @@
 use strict;
 use warnings;
 
+use File::Copy;
 use PostgresNode;
 use TestLib;
-use Test::More tests => 27;
+use Test::More tests => 33;
 
 my $psql_out = '';
 my $psql_rc  = '';
@@ -28,6 +29,14 @@ sub configure_and_reload
 	return;
 }
 
+sub twophase_file_name
+{
+	local $Test::Builder::Level = $Test::Builder::Level + 1;
+
+	my $xid = shift;
+	return sprintf("%08X", $xid);
+}
+
 # Set up two nodes, which will alternately be primary and replication standby.
 
 # Setup london node
@@ -527,3 +536,125 @@ $cur_standby->psql(
 is( $psql_out,
 	qq{27|issued to paris},
 	"Check expected t_009_tbl2 data on standby");
+
+###############################################################################
+# Check handling of already committed or aborted 2PC files at recovery.
+# This test does a manual copy of 2PC files created in a running server,
+# to cheaply emulate situations that could be found in base backups.
+###############################################################################
+
+# Issue a set of transactions that will be used for this portion of the test:
+# - One transaction to hold on the minimum xid horizon at bay.
+# - One transaction that will be found as already committed at recovery.
+# - One transaction that will be fonnd as already rollbacked at recovery.
+$cur_primary->psql(
+	'postgres', "
+	BEGIN;
+	INSERT INTO t_009_tbl VALUES (40, 'transaction: xid horizon');
+	PREPARE TRANSACTION 'xact_009_40';
+	BEGIN;
+	INSERT INTO t_009_tbl VALUES (41, 'transaction: commit-prepared');
+	PREPARE TRANSACTION 'xact_009_41';
+	BEGIN;
+	INSERT INTO t_009_tbl VALUES (42, 'transaction: rollback-prepared');
+	PREPARE TRANSACTION 'xact_009_42';");
+
+# Issue a checkpoint, fixing the XID horizon based on the first transaction,
+# flushing to disk the two files to use.
+$cur_primary->psql('postgres', "CHECKPOINT");
+
+# Get the transaction IDs of the ones to 2PC files to manipulate.
+my $commit_prepared_xid = int(
+	$cur_primary->safe_psql(
+		'postgres',
+		"SELECT transaction FROM pg_prepared_xacts WHERE gid = 'xact_009_41'")
+);
+my $abort_prepared_xid = int(
+	$cur_primary->safe_psql(
+		'postgres',
+		"SELECT transaction FROM pg_prepared_xacts WHERE gid = 'xact_009_42'")
+);
+
+# Copy the two-phase files that will be put back later.
+my $commit_prepared_name = twophase_file_name($commit_prepared_xid);
+my $abort_prepared_name = twophase_file_name($abort_prepared_xid);
+
+my $twophase_tmpdir = $PostgreSQL::Test::Utils::tmp_check . '/' . "2pc_files";
+mkdir($twophase_tmpdir);
+my $primary_twophase_folder = $cur_primary->data_dir . '/pg_twophase/';
+copy("$primary_twophase_folder/$commit_prepared_name", $twophase_tmpdir);
+copy("$primary_twophase_folder/$abort_prepared_name", $twophase_tmpdir);
+
+# Issue abort/commit prepared.
+$cur_primary->psql('postgres', "COMMIT PREPARED 'xact_009_41'");
+$cur_primary->psql('postgres', "ROLLBACK PREPARED 'xact_009_42'");
+
+# Again checkpoint, to advance the LSN past the point where the two previous
+# transaction records would be replayed.
+$cur_primary->psql('postgres', "CHECKPOINT");
+
+# Take down node.
+$cur_primary->teardown_node;
+
+# Move back the two twophase files.
+copy("$twophase_tmpdir/$commit_prepared_name", $primary_twophase_folder);
+copy("$twophase_tmpdir/$abort_prepared_name", $primary_twophase_folder);
+
+# Grab location in logs of primary
+my $log_offset = -s $cur_primary->logfile;
+
+# Start node and check that the two previous files are removed by checking the
+# server logs, following the CLOG lookup done at the end of recovery.
+$cur_primary->start;
+
+$cur_primary->log_check(
+	"two-phase files of committed transactions removed at recovery",
+	$log_offset,
+	log_like => [
+		qr/removing stale two-phase state from memory for transaction $commit_prepared_xid/,
+		qr/removing stale two-phase state from memory for transaction $abort_prepared_xid/
+	]);
+
+# Commit the first transaction.
+$cur_primary->psql('postgres', "COMMIT PREPARED 'xact_009_40'");
+# After replay, there should be no 2PC transactions.
+$cur_primary->psql(
+	'postgres',
+	"SELECT * FROM pg_prepared_xact",
+	stdout => \$psql_out);
+is($psql_out, qq{}, "Check expected pg_prepared_xact data on primary");
+# Data from transactions should be around.
+$cur_primary->psql(
+	'postgres',
+	"SELECT * FROM t_009_tbl WHERE id IN (40, 41, 42);",
+	stdout => \$psql_out);
+is( $psql_out, qq{40|transaction: xid horizon
+41|transaction: commit-prepared},
+	"Check expected table data on primary");
+
+###############################################################################
+# Check handling of orphaned 2PC files at recovery.
+###############################################################################
+
+$cur_primary->teardown_node;
+
+# Grab location in logs of primary
+$log_offset = -s $cur_primary->logfile;
+
+# Create fake files with a transaction ID large or low enough to be in the
+# future or the past, then check that the primary is able to start and remove
+# these files at recovery.
+
+my $future_2pc_file = $cur_primary->data_dir . '/pg_twophase/00FFFFFF';
+append_to_file $future_2pc_file, "";
+my $past_2pc_file = $cur_primary->data_dir . '/pg_twophase/000000FF';
+append_to_file $past_2pc_file, "";
+
+$cur_primary->start;
+$cur_primary->log_check(
+	"fake two-phase files removed at recovery",
+	$log_offset,
+	log_like => [
+		qr/removing future two-phase state file for transaction 16777215/,
+		qr/removing past two-phase state file for transaction 255/
+	]);
-- 
2.47.2

0001-Fix-set-of-issues-with-2PC-transaction-handling-dur-15.txttext/plain; charset=us-asciiDownload

From f8c19ef3057dfb98231f31f8227ddf35005a5c33 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Thu, 30 Jan 2025 11:57:04 +0900
Subject: [PATCH] Fix set of issues with 2PC transaction handling during
 recovery

This addresses two issues:
- CLOG lookups could happen at a too early stage of recovery, where
these may not be consistent.  These are delayed until the end of
recovery.
- Legit 2PC transactions found as already committed or aborted when
scanning the shared memory data could cause inconsistencies in
TwoPhaseState, leaving around dangling entries for transactions that
should not be there.
---
 src/backend/access/transam/twophase.c | 119 +++++++++++++----------
 src/test/recovery/t/009_twophase.pl   | 131 ++++++++++++++++++++++++++
 2 files changed, 198 insertions(+), 52 deletions(-)

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 8c5e5913df5..743bd8ca3f8 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1879,13 +1879,17 @@ CheckPointTwoPhase(XLogRecPtr redo_horizon)
  * Scan pg_twophase and fill TwoPhaseState depending on the on-disk data.
  * This is called once at the beginning of recovery, saving any extra
  * lookups in the future.  Two-phase files that are newer than the
- * minimum XID horizon are discarded on the way.
+ * minimum XID horizon are discarded on the way, as much as files that
+ * are older than the oldest XID horizon.
  */
 void
 restoreTwoPhaseData(void)
 {
 	DIR		   *cldir;
 	struct dirent *clde;
+	FullTransactionId nextXid = ShmemVariableCache->nextXid;
+	TransactionId origNextXid = XidFromFullTransactionId(nextXid);
+	TransactionId oldestXid = ShmemVariableCache->oldestXid;
 
 	LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
 	cldir = AllocateDir(TWOPHASE_DIR);
@@ -1899,10 +1903,24 @@ restoreTwoPhaseData(void)
 
 			xid = (TransactionId) strtoul(clde->d_name, NULL, 16);
 
+			/* Reject XID if too new or too old */
+			if (TransactionIdFollowsOrEquals(xid, origNextXid) ||
+				TransactionIdPrecedes(xid, oldestXid))
+			{
+				if (TransactionIdFollowsOrEquals(xid, origNextXid))
+					ereport(WARNING,
+							(errmsg("removing future two-phase state file for transaction %u",
+									xid)));
+				else
+					ereport(WARNING,
+							(errmsg("removing past two-phase state file for transaction %u",
+									xid)));
+				RemoveTwoPhaseFile(xid, true);
+				continue;
+			}
+
 			buf = ProcessTwoPhaseBuffer(xid, InvalidXLogRecPtr,
 										true, false, false);
-			if (buf == NULL)
-				continue;
 
 			PrepareRedoAdd(buf, InvalidXLogRecPtr,
 						   InvalidXLogRecPtr, InvalidRepOriginId);
@@ -1969,9 +1987,6 @@ PrescanPreparedTransactions(TransactionId **xids_p, int *nxids_p)
 									gxact->prepare_start_lsn,
 									gxact->ondisk, false, true);
 
-		if (buf == NULL)
-			continue;
-
 		/*
 		 * OK, we think this file is valid.  Incorporate xid into the
 		 * running-minimum result.
@@ -2042,8 +2057,7 @@ StandbyRecoverPreparedTransactions(void)
 		buf = ProcessTwoPhaseBuffer(xid,
 									gxact->prepare_start_lsn,
 									gxact->ondisk, true, false);
-		if (buf != NULL)
-			pfree(buf);
+		pfree(buf);
 	}
 	LWLockRelease(TwoPhaseStateLock);
 }
@@ -2068,8 +2082,21 @@ void
 RecoverPreparedTransactions(void)
 {
 	int			i;
+	TransactionId *remove_xids;
+	int			remove_xids_cnt;
 
 	LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
+
+	/*
+	 * Track XIDs candidate for removal if found as already committed or
+	 * aborted, once the first scan through TwoPhaseState is done.  This
+	 * cannot happen while going through the entries in TwoPhaseState as
+	 * PrepareRedoRemove() manipulates it.
+	 */
+	remove_xids_cnt = 0;
+	remove_xids = (TransactionId *) palloc(TwoPhaseState->numPrepXacts *
+										   sizeof(TransactionId));
+
 	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
 	{
 		TransactionId xid;
@@ -2082,6 +2109,26 @@ RecoverPreparedTransactions(void)
 
 		xid = gxact->xid;
 
+		/*
+		 * Is this transaction already aborted or committed?  If yes, mark it
+		 * for removal.
+		 *
+		 * Checking CLOGs if these transactions have been already aborted or
+		 * committed is safe at this stage; we are at the end of recovery and
+		 * all WAL has been replayed, all 2PC transactions are reinstated and
+		 * should be tracked in TwoPhaseState.
+		 */
+		if (TransactionIdDidCommit(xid) || TransactionIdDidAbort(xid))
+		{
+			/*
+			 * Track this transaction ID for its removal from the shared
+			 * memory state at the end.
+			 */
+			remove_xids[remove_xids_cnt] = xid;
+			remove_xids_cnt++;
+			continue;
+		}
+
 		/*
 		 * Reconstruct subtrans state for the transaction --- needed because
 		 * pg_subtrans is not preserved over a restart.  Note that we are
@@ -2094,8 +2141,6 @@ RecoverPreparedTransactions(void)
 		buf = ProcessTwoPhaseBuffer(xid,
 									gxact->prepare_start_lsn,
 									gxact->ondisk, true, false);
-		if (buf == NULL)
-			continue;
 
 		ereport(LOG,
 				(errmsg("recovering prepared transaction %u from shared memory", xid)));
@@ -2153,7 +2198,19 @@ RecoverPreparedTransactions(void)
 		LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
 	}
 
+	for (i = 0; i < remove_xids_cnt; i++)
+	{
+		TransactionId xid = remove_xids[i];
+
+		ereport(WARNING,
+				(errmsg("removing stale two-phase state from memory for transaction %u",
+						xid)));
+		PrepareRedoRemove(xid, true);
+	}
+
 	LWLockRelease(TwoPhaseStateLock);
+
+	pfree(remove_xids);
 }
 
 /*
@@ -2173,8 +2230,6 @@ ProcessTwoPhaseBuffer(TransactionId xid,
 					  bool fromdisk,
 					  bool setParent, bool setNextXid)
 {
-	FullTransactionId nextXid = ShmemVariableCache->nextXid;
-	TransactionId origNextXid = XidFromFullTransactionId(nextXid);
 	TransactionId *subxids;
 	char	   *buf;
 	TwoPhaseFileHeader *hdr;
@@ -2185,46 +2240,6 @@ ProcessTwoPhaseBuffer(TransactionId xid,
 	if (!fromdisk)
 		Assert(prepare_start_lsn != InvalidXLogRecPtr);
 
-	/* Already processed? */
-	if (TransactionIdDidCommit(xid) || TransactionIdDidAbort(xid))
-	{
-		if (fromdisk)
-		{
-			ereport(WARNING,
-					(errmsg("removing stale two-phase state file for transaction %u",
-							xid)));
-			RemoveTwoPhaseFile(xid, true);
-		}
-		else
-		{
-			ereport(WARNING,
-					(errmsg("removing stale two-phase state from memory for transaction %u",
-							xid)));
-			PrepareRedoRemove(xid, true);
-		}
-		return NULL;
-	}
-
-	/* Reject XID if too new */
-	if (TransactionIdFollowsOrEquals(xid, origNextXid))
-	{
-		if (fromdisk)
-		{
-			ereport(WARNING,
-					(errmsg("removing future two-phase state file for transaction %u",
-							xid)));
-			RemoveTwoPhaseFile(xid, true);
-		}
-		else
-		{
-			ereport(WARNING,
-					(errmsg("removing future two-phase state from memory for transaction %u",
-							xid)));
-			PrepareRedoRemove(xid, true);
-		}
-		return NULL;
-	}
-
 	if (fromdisk)
 	{
 		/* Read and validate file */
diff --git a/src/test/recovery/t/009_twophase.pl b/src/test/recovery/t/009_twophase.pl
index ad9b5371dd0..439ce37c0a5 100644
--- a/src/test/recovery/t/009_twophase.pl
+++ b/src/test/recovery/t/009_twophase.pl
@@ -5,6 +5,7 @@
 use strict;
 use warnings;
 
+use File::Copy;
 use PostgreSQL::Test::Cluster;
 use PostgreSQL::Test::Utils;
 use Test::More;
@@ -28,6 +29,14 @@ sub configure_and_reload
 	return;
 }
 
+sub twophase_file_name
+{
+	local $Test::Builder::Level = $Test::Builder::Level + 1;
+
+	my $xid = shift;
+	return sprintf("%08X", $xid);
+}
+
 # Set up two nodes, which will alternately be primary and replication standby.
 
 # Setup london node
@@ -528,4 +537,126 @@ is( $psql_out,
 	qq{27|issued to paris},
 	"Check expected t_009_tbl2 data on standby");
 
+###############################################################################
+# Check handling of already committed or aborted 2PC files at recovery.
+# This test does a manual copy of 2PC files created in a running server,
+# to cheaply emulate situations that could be found in base backups.
+###############################################################################
+
+# Issue a set of transactions that will be used for this portion of the test:
+# - One transaction to hold on the minimum xid horizon at bay.
+# - One transaction that will be found as already committed at recovery.
+# - One transaction that will be fonnd as already rollbacked at recovery.
+$cur_primary->psql(
+	'postgres', "
+	BEGIN;
+	INSERT INTO t_009_tbl VALUES (40, 'transaction: xid horizon');
+	PREPARE TRANSACTION 'xact_009_40';
+	BEGIN;
+	INSERT INTO t_009_tbl VALUES (41, 'transaction: commit-prepared');
+	PREPARE TRANSACTION 'xact_009_41';
+	BEGIN;
+	INSERT INTO t_009_tbl VALUES (42, 'transaction: rollback-prepared');
+	PREPARE TRANSACTION 'xact_009_42';");
+
+# Issue a checkpoint, fixing the XID horizon based on the first transaction,
+# flushing to disk the two files to use.
+$cur_primary->psql('postgres', "CHECKPOINT");
+
+# Get the transaction IDs of the ones to 2PC files to manipulate.
+my $commit_prepared_xid = int(
+	$cur_primary->safe_psql(
+		'postgres',
+		"SELECT transaction FROM pg_prepared_xacts WHERE gid = 'xact_009_41'")
+);
+my $abort_prepared_xid = int(
+	$cur_primary->safe_psql(
+		'postgres',
+		"SELECT transaction FROM pg_prepared_xacts WHERE gid = 'xact_009_42'")
+);
+
+# Copy the two-phase files that will be put back later.
+my $commit_prepared_name = twophase_file_name($commit_prepared_xid);
+my $abort_prepared_name = twophase_file_name($abort_prepared_xid);
+
+my $twophase_tmpdir = $PostgreSQL::Test::Utils::tmp_check . '/' . "2pc_files";
+mkdir($twophase_tmpdir);
+my $primary_twophase_folder = $cur_primary->data_dir . '/pg_twophase/';
+copy("$primary_twophase_folder/$commit_prepared_name", $twophase_tmpdir);
+copy("$primary_twophase_folder/$abort_prepared_name", $twophase_tmpdir);
+
+# Issue abort/commit prepared.
+$cur_primary->psql('postgres', "COMMIT PREPARED 'xact_009_41'");
+$cur_primary->psql('postgres', "ROLLBACK PREPARED 'xact_009_42'");
+
+# Again checkpoint, to advance the LSN past the point where the two previous
+# transaction records would be replayed.
+$cur_primary->psql('postgres', "CHECKPOINT");
+
+# Take down node.
+$cur_primary->teardown_node;
+
+# Move back the two twophase files.
+copy("$twophase_tmpdir/$commit_prepared_name", $primary_twophase_folder);
+copy("$twophase_tmpdir/$abort_prepared_name", $primary_twophase_folder);
+
+# Grab location in logs of primary
+my $log_offset = -s $cur_primary->logfile;
+
+# Start node and check that the two previous files are removed by checking the
+# server logs, following the CLOG lookup done at the end of recovery.
+$cur_primary->start;
+
+$cur_primary->log_check(
+	"two-phase files of committed transactions removed at recovery",
+	$log_offset,
+	log_like => [
+		qr/removing stale two-phase state from memory for transaction $commit_prepared_xid/,
+		qr/removing stale two-phase state from memory for transaction $abort_prepared_xid/
+	]);
+
+# Commit the first transaction.
+$cur_primary->psql('postgres', "COMMIT PREPARED 'xact_009_40'");
+# After replay, there should be no 2PC transactions.
+$cur_primary->psql(
+	'postgres',
+	"SELECT * FROM pg_prepared_xact",
+	stdout => \$psql_out);
+is($psql_out, qq{}, "Check expected pg_prepared_xact data on primary");
+# Data from transactions should be around.
+$cur_primary->psql(
+	'postgres',
+	"SELECT * FROM t_009_tbl WHERE id IN (40, 41, 42);",
+	stdout => \$psql_out);
+is( $psql_out, qq{40|transaction: xid horizon
+41|transaction: commit-prepared},
+	"Check expected table data on primary");
+
+###############################################################################
+# Check handling of orphaned 2PC files at recovery.
+###############################################################################
+
+$cur_primary->teardown_node;
+
+# Grab location in logs of primary
+$log_offset = -s $cur_primary->logfile;
+
+# Create fake files with a transaction ID large or low enough to be in the
+# future or the past, then check that the primary is able to start and remove
+# these files at recovery.
+
+my $future_2pc_file = $cur_primary->data_dir . '/pg_twophase/00FFFFFF';
+append_to_file $future_2pc_file, "";
+my $past_2pc_file = $cur_primary->data_dir . '/pg_twophase/000000FF';
+append_to_file $past_2pc_file, "";
+
+$cur_primary->start;
+$cur_primary->log_check(
+	"fake two-phase files removed at recovery",
+	$log_offset,
+	log_like => [
+		qr/removing future two-phase state file for transaction 16777215/,
+		qr/removing past two-phase state file for transaction 255/
+	]);
+
 done_testing();
-- 
2.47.2

0001-Fix-set-of-issues-with-2PC-transaction-handling-dur-16.txttext/plain; charset=us-asciiDownload

From 094e5a5f7f9cef688405a712b684ccbe8b1008ff Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Thu, 30 Jan 2025 11:57:04 +0900
Subject: [PATCH] Fix set of issues with 2PC transaction handling during
 recovery

This addresses two issues:
- CLOG lookups could happen at a too early stage of recovery, where
these may not be consistent.  These are delayed until the end of
recovery.
- Legit 2PC transactions found as already committed or aborted when
scanning the shared memory data could cause inconsistencies in
TwoPhaseState, leaving around dangling entries for transactions that
should not be there.
---
 src/backend/access/transam/twophase.c | 119 +++++++++++++----------
 src/test/recovery/t/009_twophase.pl   | 131 ++++++++++++++++++++++++++
 2 files changed, 198 insertions(+), 52 deletions(-)

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 95aa8be9c53..4a34e7697e5 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1877,13 +1877,17 @@ CheckPointTwoPhase(XLogRecPtr redo_horizon)
  * Scan pg_twophase and fill TwoPhaseState depending on the on-disk data.
  * This is called once at the beginning of recovery, saving any extra
  * lookups in the future.  Two-phase files that are newer than the
- * minimum XID horizon are discarded on the way.
+ * minimum XID horizon are discarded on the way, as much as files that
+ * are older than the oldest XID horizon.
  */
 void
 restoreTwoPhaseData(void)
 {
 	DIR		   *cldir;
 	struct dirent *clde;
+	FullTransactionId nextXid = ShmemVariableCache->nextXid;
+	TransactionId origNextXid = XidFromFullTransactionId(nextXid);
+	TransactionId oldestXid = ShmemVariableCache->oldestXid;
 
 	LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
 	cldir = AllocateDir(TWOPHASE_DIR);
@@ -1897,10 +1901,24 @@ restoreTwoPhaseData(void)
 
 			xid = (TransactionId) strtoul(clde->d_name, NULL, 16);
 
+			/* Reject XID if too new or too old */
+			if (TransactionIdFollowsOrEquals(xid, origNextXid) ||
+				TransactionIdPrecedes(xid, oldestXid))
+			{
+				if (TransactionIdFollowsOrEquals(xid, origNextXid))
+					ereport(WARNING,
+							(errmsg("removing future two-phase state file for transaction %u",
+									xid)));
+				else
+					ereport(WARNING,
+							(errmsg("removing past two-phase state file for transaction %u",
+									xid)));
+				RemoveTwoPhaseFile(xid, true);
+				continue;
+			}
+
 			buf = ProcessTwoPhaseBuffer(xid, InvalidXLogRecPtr,
 										true, false, false);
-			if (buf == NULL)
-				continue;
 
 			PrepareRedoAdd(buf, InvalidXLogRecPtr,
 						   InvalidXLogRecPtr, InvalidRepOriginId);
@@ -1967,9 +1985,6 @@ PrescanPreparedTransactions(TransactionId **xids_p, int *nxids_p)
 									gxact->prepare_start_lsn,
 									gxact->ondisk, false, true);
 
-		if (buf == NULL)
-			continue;
-
 		/*
 		 * OK, we think this file is valid.  Incorporate xid into the
 		 * running-minimum result.
@@ -2040,8 +2055,7 @@ StandbyRecoverPreparedTransactions(void)
 		buf = ProcessTwoPhaseBuffer(xid,
 									gxact->prepare_start_lsn,
 									gxact->ondisk, true, false);
-		if (buf != NULL)
-			pfree(buf);
+		pfree(buf);
 	}
 	LWLockRelease(TwoPhaseStateLock);
 }
@@ -2066,8 +2080,21 @@ void
 RecoverPreparedTransactions(void)
 {
 	int			i;
+	TransactionId *remove_xids;
+	int			remove_xids_cnt;
 
 	LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
+
+	/*
+	 * Track XIDs candidate for removal if found as already committed or
+	 * aborted, once the first scan through TwoPhaseState is done.  This
+	 * cannot happen while going through the entries in TwoPhaseState as
+	 * PrepareRedoRemove() manipulates it.
+	 */
+	remove_xids_cnt = 0;
+	remove_xids = (TransactionId *) palloc(TwoPhaseState->numPrepXacts *
+										   sizeof(TransactionId));
+
 	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
 	{
 		TransactionId xid;
@@ -2080,6 +2107,26 @@ RecoverPreparedTransactions(void)
 
 		xid = gxact->xid;
 
+		/*
+		 * Is this transaction already aborted or committed?  If yes, mark it
+		 * for removal.
+		 *
+		 * Checking CLOGs if these transactions have been already aborted or
+		 * committed is safe at this stage; we are at the end of recovery and
+		 * all WAL has been replayed, all 2PC transactions are reinstated and
+		 * should be tracked in TwoPhaseState.
+		 */
+		if (TransactionIdDidCommit(xid) || TransactionIdDidAbort(xid))
+		{
+			/*
+			 * Track this transaction ID for its removal from the shared
+			 * memory state at the end.
+			 */
+			remove_xids[remove_xids_cnt] = xid;
+			remove_xids_cnt++;
+			continue;
+		}
+
 		/*
 		 * Reconstruct subtrans state for the transaction --- needed because
 		 * pg_subtrans is not preserved over a restart.  Note that we are
@@ -2092,8 +2139,6 @@ RecoverPreparedTransactions(void)
 		buf = ProcessTwoPhaseBuffer(xid,
 									gxact->prepare_start_lsn,
 									gxact->ondisk, true, false);
-		if (buf == NULL)
-			continue;
 
 		ereport(LOG,
 				(errmsg("recovering prepared transaction %u from shared memory", xid)));
@@ -2151,7 +2196,19 @@ RecoverPreparedTransactions(void)
 		LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
 	}
 
+	for (i = 0; i < remove_xids_cnt; i++)
+	{
+		TransactionId xid = remove_xids[i];
+
+		ereport(WARNING,
+				(errmsg("removing stale two-phase state from memory for transaction %u",
+						xid)));
+		PrepareRedoRemove(xid, true);
+	}
+
 	LWLockRelease(TwoPhaseStateLock);
+
+	pfree(remove_xids);
 }
 
 /*
@@ -2171,8 +2228,6 @@ ProcessTwoPhaseBuffer(TransactionId xid,
 					  bool fromdisk,
 					  bool setParent, bool setNextXid)
 {
-	FullTransactionId nextXid = ShmemVariableCache->nextXid;
-	TransactionId origNextXid = XidFromFullTransactionId(nextXid);
 	TransactionId *subxids;
 	char	   *buf;
 	TwoPhaseFileHeader *hdr;
@@ -2183,46 +2238,6 @@ ProcessTwoPhaseBuffer(TransactionId xid,
 	if (!fromdisk)
 		Assert(prepare_start_lsn != InvalidXLogRecPtr);
 
-	/* Already processed? */
-	if (TransactionIdDidCommit(xid) || TransactionIdDidAbort(xid))
-	{
-		if (fromdisk)
-		{
-			ereport(WARNING,
-					(errmsg("removing stale two-phase state file for transaction %u",
-							xid)));
-			RemoveTwoPhaseFile(xid, true);
-		}
-		else
-		{
-			ereport(WARNING,
-					(errmsg("removing stale two-phase state from memory for transaction %u",
-							xid)));
-			PrepareRedoRemove(xid, true);
-		}
-		return NULL;
-	}
-
-	/* Reject XID if too new */
-	if (TransactionIdFollowsOrEquals(xid, origNextXid))
-	{
-		if (fromdisk)
-		{
-			ereport(WARNING,
-					(errmsg("removing future two-phase state file for transaction %u",
-							xid)));
-			RemoveTwoPhaseFile(xid, true);
-		}
-		else
-		{
-			ereport(WARNING,
-					(errmsg("removing future two-phase state from memory for transaction %u",
-							xid)));
-			PrepareRedoRemove(xid, true);
-		}
-		return NULL;
-	}
-
 	if (fromdisk)
 	{
 		/* Read and validate file */
diff --git a/src/test/recovery/t/009_twophase.pl b/src/test/recovery/t/009_twophase.pl
index fe7e8e79802..790f1a1d2ec 100644
--- a/src/test/recovery/t/009_twophase.pl
+++ b/src/test/recovery/t/009_twophase.pl
@@ -5,6 +5,7 @@
 use strict;
 use warnings;
 
+use File::Copy;
 use PostgreSQL::Test::Cluster;
 use PostgreSQL::Test::Utils;
 use Test::More;
@@ -28,6 +29,14 @@ sub configure_and_reload
 	return;
 }
 
+sub twophase_file_name
+{
+	local $Test::Builder::Level = $Test::Builder::Level + 1;
+
+	my $xid = shift;
+	return sprintf("%08X", $xid);
+}
+
 # Set up two nodes, which will alternately be primary and replication standby.
 
 # Setup london node
@@ -528,4 +537,126 @@ is( $psql_out,
 	qq{27|issued to paris},
 	"Check expected t_009_tbl2 data on standby");
 
+###############################################################################
+# Check handling of already committed or aborted 2PC files at recovery.
+# This test does a manual copy of 2PC files created in a running server,
+# to cheaply emulate situations that could be found in base backups.
+###############################################################################
+
+# Issue a set of transactions that will be used for this portion of the test:
+# - One transaction to hold on the minimum xid horizon at bay.
+# - One transaction that will be found as already committed at recovery.
+# - One transaction that will be fonnd as already rollbacked at recovery.
+$cur_primary->psql(
+	'postgres', "
+	BEGIN;
+	INSERT INTO t_009_tbl VALUES (40, 'transaction: xid horizon');
+	PREPARE TRANSACTION 'xact_009_40';
+	BEGIN;
+	INSERT INTO t_009_tbl VALUES (41, 'transaction: commit-prepared');
+	PREPARE TRANSACTION 'xact_009_41';
+	BEGIN;
+	INSERT INTO t_009_tbl VALUES (42, 'transaction: rollback-prepared');
+	PREPARE TRANSACTION 'xact_009_42';");
+
+# Issue a checkpoint, fixing the XID horizon based on the first transaction,
+# flushing to disk the two files to use.
+$cur_primary->psql('postgres', "CHECKPOINT");
+
+# Get the transaction IDs of the ones to 2PC files to manipulate.
+my $commit_prepared_xid = int(
+	$cur_primary->safe_psql(
+		'postgres',
+		"SELECT transaction FROM pg_prepared_xacts WHERE gid = 'xact_009_41'")
+);
+my $abort_prepared_xid = int(
+	$cur_primary->safe_psql(
+		'postgres',
+		"SELECT transaction FROM pg_prepared_xacts WHERE gid = 'xact_009_42'")
+);
+
+# Copy the two-phase files that will be put back later.
+my $commit_prepared_name = twophase_file_name($commit_prepared_xid);
+my $abort_prepared_name = twophase_file_name($abort_prepared_xid);
+
+my $twophase_tmpdir = $PostgreSQL::Test::Utils::tmp_check . '/' . "2pc_files";
+mkdir($twophase_tmpdir);
+my $primary_twophase_folder = $cur_primary->data_dir . '/pg_twophase/';
+copy("$primary_twophase_folder/$commit_prepared_name", $twophase_tmpdir);
+copy("$primary_twophase_folder/$abort_prepared_name", $twophase_tmpdir);
+
+# Issue abort/commit prepared.
+$cur_primary->psql('postgres', "COMMIT PREPARED 'xact_009_41'");
+$cur_primary->psql('postgres', "ROLLBACK PREPARED 'xact_009_42'");
+
+# Again checkpoint, to advance the LSN past the point where the two previous
+# transaction records would be replayed.
+$cur_primary->psql('postgres', "CHECKPOINT");
+
+# Take down node.
+$cur_primary->teardown_node;
+
+# Move back the two twophase files.
+copy("$twophase_tmpdir/$commit_prepared_name", $primary_twophase_folder);
+copy("$twophase_tmpdir/$abort_prepared_name", $primary_twophase_folder);
+
+# Grab location in logs of primary
+my $log_offset = -s $cur_primary->logfile;
+
+# Start node and check that the two previous files are removed by checking the
+# server logs, following the CLOG lookup done at the end of recovery.
+$cur_primary->start;
+
+$cur_primary->log_check(
+	"two-phase files of committed transactions removed at recovery",
+	$log_offset,
+	log_like => [
+		qr/removing stale two-phase state from memory for transaction $commit_prepared_xid/,
+		qr/removing stale two-phase state from memory for transaction $abort_prepared_xid/
+	]);
+
+# Commit the first transaction.
+$cur_primary->psql('postgres', "COMMIT PREPARED 'xact_009_40'");
+# After replay, there should be no 2PC transactions.
+$cur_primary->psql(
+	'postgres',
+	"SELECT * FROM pg_prepared_xact",
+	stdout => \$psql_out);
+is($psql_out, qq{}, "Check expected pg_prepared_xact data on primary");
+# Data from transactions should be around.
+$cur_primary->psql(
+	'postgres',
+	"SELECT * FROM t_009_tbl WHERE id IN (40, 41, 42);",
+	stdout => \$psql_out);
+is( $psql_out, qq{40|transaction: xid horizon
+41|transaction: commit-prepared},
+	"Check expected table data on primary");
+
+###############################################################################
+# Check handling of orphaned 2PC files at recovery.
+###############################################################################
+
+$cur_primary->teardown_node;
+
+# Grab location in logs of primary
+$log_offset = -s $cur_primary->logfile;
+
+# Create fake files with a transaction ID large or low enough to be in the
+# future or the past, then check that the primary is able to start and remove
+# these files at recovery.
+
+my $future_2pc_file = $cur_primary->data_dir . '/pg_twophase/00FFFFFF';
+append_to_file $future_2pc_file, "";
+my $past_2pc_file = $cur_primary->data_dir . '/pg_twophase/000000FF';
+append_to_file $past_2pc_file, "";
+
+$cur_primary->start;
+$cur_primary->log_check(
+	"fake two-phase files removed at recovery",
+	$log_offset,
+	log_like => [
+		qr/removing future two-phase state file for transaction 16777215/,
+		qr/removing past two-phase state file for transaction 255/
+	]);
+
 done_testing();
-- 
2.47.2

0001-Fix-set-of-issues-with-2PC-transaction-handling-dur-17.txttext/plain; charset=us-asciiDownload

From 585f2ce9e87455ac7ed7f879f002968810b33ab6 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Thu, 30 Jan 2025 14:51:11 +0900
Subject: [PATCH] Fix set of issues with 2PC transaction handling during
 recovery

This addresses two issues:
- CLOG lookups could happen at a too early stage of recovery, where
these may not be consistent.  These are delayed until the end of
recovery.
- Legit 2PC transactions found as already committed or aborted when
scanning the shared memory data could cause inconsistencies in
TwoPhaseState, leaving around dangling entries for transactions that
should not be there.
---
 src/include/access/multixact.h               |   9 +-
 src/include/access/twophase.h                |  12 +-
 src/include/access/twophase_rmgr.h           |   4 +-
 src/include/pgstat.h                         |   5 +-
 src/include/storage/lock.h                   |  11 +-
 src/include/storage/predicate.h              |   7 +-
 src/backend/access/transam/multixact.c       |  16 +-
 src/backend/access/transam/twophase.c        | 327 +++++++++++--------
 src/backend/access/transam/xact.c            |  13 +-
 src/backend/storage/lmgr/lock.c              |  20 +-
 src/backend/storage/lmgr/predicate.c         |  11 +-
 src/backend/utils/activity/pgstat_relation.c |   4 +-
 src/test/recovery/t/009_twophase.pl          | 140 ++++++++
 13 files changed, 389 insertions(+), 190 deletions(-)

diff --git a/src/include/access/multixact.h b/src/include/access/multixact.h
index 7ffd256c744..7f24cdbc348 100644
--- a/src/include/access/multixact.h
+++ b/src/include/access/multixact.h
@@ -11,6 +11,7 @@
 #ifndef MULTIXACT_H
 #define MULTIXACT_H
 
+#include "access/transam.h"
 #include "access/xlogreader.h"
 #include "lib/stringinfo.h"
 #include "storage/sync.h"
@@ -119,7 +120,7 @@ extern int	multixactmemberssyncfiletag(const FileTag *ftag, char *path);
 
 extern void AtEOXact_MultiXact(void);
 extern void AtPrepare_MultiXact(void);
-extern void PostPrepare_MultiXact(TransactionId xid);
+extern void PostPrepare_MultiXact(FullTransactionId fxid);
 
 extern Size MultiXactShmemSize(void);
 extern void MultiXactShmemInit(void);
@@ -145,11 +146,11 @@ extern void MultiXactAdvanceNextMXact(MultiXactId minMulti,
 extern void MultiXactAdvanceOldest(MultiXactId oldestMulti, Oid oldestMultiDB);
 extern int	MultiXactMemberFreezeThreshold(void);
 
-extern void multixact_twophase_recover(TransactionId xid, uint16 info,
+extern void multixact_twophase_recover(FullTransactionId fxid, uint16 info,
 									   void *recdata, uint32 len);
-extern void multixact_twophase_postcommit(TransactionId xid, uint16 info,
+extern void multixact_twophase_postcommit(FullTransactionId fxid, uint16 info,
 										  void *recdata, uint32 len);
-extern void multixact_twophase_postabort(TransactionId xid, uint16 info,
+extern void multixact_twophase_postabort(FullTransactionId fxid, uint16 info,
 										 void *recdata, uint32 len);
 
 extern void multixact_redo(XLogReaderState *record);
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 56248c00063..93c5caf8309 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -14,6 +14,7 @@
 #ifndef TWOPHASE_H
 #define TWOPHASE_H
 
+#include "access/transam.h"
 #include "access/xact.h"
 #include "access/xlogdefs.h"
 #include "datatype/timestamp.h"
@@ -36,10 +37,10 @@ extern void PostPrepare_Twophase(void);
 
 extern TransactionId TwoPhaseGetXidByVirtualXID(VirtualTransactionId vxid,
 												bool *have_more);
-extern PGPROC *TwoPhaseGetDummyProc(TransactionId xid, bool lock_held);
-extern int	TwoPhaseGetDummyProcNumber(TransactionId xid, bool lock_held);
+extern PGPROC *TwoPhaseGetDummyProc(FullTransactionId fxid, bool lock_held);
+extern int	TwoPhaseGetDummyProcNumber(FullTransactionId fxid, bool lock_held);
 
-extern GlobalTransaction MarkAsPreparing(TransactionId xid, const char *gid,
+extern GlobalTransaction MarkAsPreparing(FullTransactionId fxid, const char *gid,
 										 TimestampTz prepared_at,
 										 Oid owner, Oid databaseid);
 
@@ -56,8 +57,9 @@ extern void CheckPointTwoPhase(XLogRecPtr redo_horizon);
 
 extern void FinishPreparedTransaction(const char *gid, bool isCommit);
 
-extern void PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
-						   XLogRecPtr end_lsn, RepOriginId origin_id);
+extern void PrepareRedoAdd(FullTransactionId fxid, char *buf,
+						   XLogRecPtr start_lsn, XLogRecPtr end_lsn,
+						   RepOriginId origin_id);
 extern void PrepareRedoRemove(TransactionId xid, bool giveWarning);
 extern void restoreTwoPhaseData(void);
 extern bool LookupGXact(const char *gid, XLogRecPtr prepare_end_lsn,
diff --git a/src/include/access/twophase_rmgr.h b/src/include/access/twophase_rmgr.h
index bdd2cb7b339..64db7bade6f 100644
--- a/src/include/access/twophase_rmgr.h
+++ b/src/include/access/twophase_rmgr.h
@@ -14,7 +14,9 @@
 #ifndef TWOPHASE_RMGR_H
 #define TWOPHASE_RMGR_H
 
-typedef void (*TwoPhaseCallback) (TransactionId xid, uint16 info,
+#include "access/transam.h"
+
+typedef void (*TwoPhaseCallback) (FullTransactionId fxid, uint16 info,
 								  void *recdata, uint32 len);
 typedef uint8 TwoPhaseRmgrId;
 
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 2136239710e..f60824de31c 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -11,6 +11,7 @@
 #ifndef PGSTAT_H
 #define PGSTAT_H
 
+#include "access/transam.h"
 #include "datatype/timestamp.h"
 #include "portability/instr_time.h"
 #include "postmaster/pgarch.h"	/* for MAX_XFN_CHARS */
@@ -649,9 +650,9 @@ extern void pgstat_count_heap_delete(Relation rel);
 extern void pgstat_count_truncate(Relation rel);
 extern void pgstat_update_heap_dead_tuples(Relation rel, int delta);
 
-extern void pgstat_twophase_postcommit(TransactionId xid, uint16 info,
+extern void pgstat_twophase_postcommit(FullTransactionId fxid, uint16 info,
 									   void *recdata, uint32 len);
-extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
+extern void pgstat_twophase_postabort(FullTransactionId fxid, uint16 info,
 									  void *recdata, uint32 len);
 
 extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry(Oid relid);
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index cc1f6e78c39..b196b8a1f7f 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -18,6 +18,7 @@
 #error "lock.h may not be included from frontend code"
 #endif
 
+#include "access/transam.h"
 #include "lib/ilist.h"
 #include "storage/lockdefs.h"
 #include "storage/lwlock.h"
@@ -577,7 +578,7 @@ extern bool LockHasWaiters(const LOCKTAG *locktag,
 extern VirtualTransactionId *GetLockConflicts(const LOCKTAG *locktag,
 											  LOCKMODE lockmode, int *countp);
 extern void AtPrepare_Locks(void);
-extern void PostPrepare_Locks(TransactionId xid);
+extern void PostPrepare_Locks(FullTransactionId fxid);
 extern bool LockCheckConflicts(LockMethod lockMethodTable,
 							   LOCKMODE lockmode,
 							   LOCK *lock, PROCLOCK *proclock);
@@ -591,13 +592,13 @@ extern BlockedProcsData *GetBlockerStatusData(int blocked_pid);
 extern xl_standby_lock *GetRunningTransactionLocks(int *nlocks);
 extern const char *GetLockmodeName(LOCKMETHODID lockmethodid, LOCKMODE mode);
 
-extern void lock_twophase_recover(TransactionId xid, uint16 info,
+extern void lock_twophase_recover(FullTransactionId fxid, uint16 info,
 								  void *recdata, uint32 len);
-extern void lock_twophase_postcommit(TransactionId xid, uint16 info,
+extern void lock_twophase_postcommit(FullTransactionId fxid, uint16 info,
 									 void *recdata, uint32 len);
-extern void lock_twophase_postabort(TransactionId xid, uint16 info,
+extern void lock_twophase_postabort(FullTransactionId fxid, uint16 info,
 									void *recdata, uint32 len);
-extern void lock_twophase_standby_recover(TransactionId xid, uint16 info,
+extern void lock_twophase_standby_recover(FullTransactionId fxid, uint16 info,
 										  void *recdata, uint32 len);
 
 extern DeadLockState DeadLockCheck(PGPROC *proc);
diff --git a/src/include/storage/predicate.h b/src/include/storage/predicate.h
index 14ee9b94a2f..bdd5cdd3dbf 100644
--- a/src/include/storage/predicate.h
+++ b/src/include/storage/predicate.h
@@ -14,6 +14,7 @@
 #ifndef PREDICATE_H
 #define PREDICATE_H
 
+#include "access/transam.h"
 #include "storage/lock.h"
 #include "utils/relcache.h"
 #include "utils/snapshot.h"
@@ -71,9 +72,9 @@ extern void PreCommit_CheckForSerializationFailure(void);
 
 /* two-phase commit support */
 extern void AtPrepare_PredicateLocks(void);
-extern void PostPrepare_PredicateLocks(TransactionId xid);
-extern void PredicateLockTwoPhaseFinish(TransactionId xid, bool isCommit);
-extern void predicatelock_twophase_recover(TransactionId xid, uint16 info,
+extern void PostPrepare_PredicateLocks(FullTransactionId fxid);
+extern void PredicateLockTwoPhaseFinish(FullTransactionId xid, bool isCommit);
+extern void predicatelock_twophase_recover(FullTransactionId fxid, uint16 info,
 										   void *recdata, uint32 len);
 
 /* parallel query support */
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index b7b47ef076a..b51221cd1da 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1839,7 +1839,7 @@ AtPrepare_MultiXact(void)
  *		Clean up after successful PREPARE TRANSACTION
  */
 void
-PostPrepare_MultiXact(TransactionId xid)
+PostPrepare_MultiXact(FullTransactionId fxid)
 {
 	MultiXactId myOldestMember;
 
@@ -1850,7 +1850,7 @@ PostPrepare_MultiXact(TransactionId xid)
 	myOldestMember = OldestMemberMXactId[MyProcNumber];
 	if (MultiXactIdIsValid(myOldestMember))
 	{
-		ProcNumber	dummyProcNumber = TwoPhaseGetDummyProcNumber(xid, false);
+		ProcNumber	dummyProcNumber = TwoPhaseGetDummyProcNumber(fxid, false);
 
 		/*
 		 * Even though storing MultiXactId is atomic, acquire lock to make
@@ -1888,10 +1888,10 @@ PostPrepare_MultiXact(TransactionId xid)
  *		Recover the state of a prepared transaction at startup
  */
 void
-multixact_twophase_recover(TransactionId xid, uint16 info,
+multixact_twophase_recover(FullTransactionId fxid, uint16 info,
 						   void *recdata, uint32 len)
 {
-	ProcNumber	dummyProcNumber = TwoPhaseGetDummyProcNumber(xid, false);
+	ProcNumber	dummyProcNumber = TwoPhaseGetDummyProcNumber(fxid, false);
 	MultiXactId oldestMember;
 
 	/*
@@ -1909,10 +1909,10 @@ multixact_twophase_recover(TransactionId xid, uint16 info,
  *		Similar to AtEOXact_MultiXact but for COMMIT PREPARED
  */
 void
-multixact_twophase_postcommit(TransactionId xid, uint16 info,
+multixact_twophase_postcommit(FullTransactionId fxid, uint16 info,
 							  void *recdata, uint32 len)
 {
-	ProcNumber	dummyProcNumber = TwoPhaseGetDummyProcNumber(xid, true);
+	ProcNumber	dummyProcNumber = TwoPhaseGetDummyProcNumber(fxid, true);
 
 	Assert(len == sizeof(MultiXactId));
 
@@ -1924,10 +1924,10 @@ multixact_twophase_postcommit(TransactionId xid, uint16 info,
  *		This is actually just the same as the COMMIT case.
  */
 void
-multixact_twophase_postabort(TransactionId xid, uint16 info,
+multixact_twophase_postabort(FullTransactionId fxid, uint16 info,
 							 void *recdata, uint32 len)
 {
-	multixact_twophase_postcommit(xid, info, recdata, len);
+	multixact_twophase_postcommit(fxid, info, recdata, len);
 }
 
 /*
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 4f78c1dc579..20b0756c742 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -159,7 +159,7 @@ typedef struct GlobalTransactionData
 	 */
 	XLogRecPtr	prepare_start_lsn;	/* XLOG offset of prepare record start */
 	XLogRecPtr	prepare_end_lsn;	/* XLOG offset of prepare record end */
-	TransactionId xid;			/* The GXACT id */
+	FullTransactionId fxid;		/* The GXACT full xid */
 
 	Oid			owner;			/* ID of user that executed the xact */
 	ProcNumber	locking_backend;	/* backend currently working on the xact */
@@ -197,6 +197,7 @@ static GlobalTransaction MyLockedGxact = NULL;
 
 static bool twophaseExitRegistered = false;
 
+static void PrepareRedoRemoveFull(FullTransactionId fxid, bool giveWarning);
 static void RecordTransactionCommitPrepared(TransactionId xid,
 											int nchildren,
 											TransactionId *children,
@@ -216,19 +217,19 @@ static void RecordTransactionAbortPrepared(TransactionId xid,
 										   int nstats,
 										   xl_xact_stats_item *stats,
 										   const char *gid);
-static void ProcessRecords(char *bufptr, TransactionId xid,
+static void ProcessRecords(char *bufptr, FullTransactionId fxid,
 						   const TwoPhaseCallback callbacks[]);
 static void RemoveGXact(GlobalTransaction gxact);
 
 static void XlogReadTwoPhaseData(XLogRecPtr lsn, char **buf, int *len);
-static char *ProcessTwoPhaseBuffer(TransactionId xid,
+static char *ProcessTwoPhaseBuffer(FullTransactionId fxid,
 								   XLogRecPtr prepare_start_lsn,
 								   bool fromdisk, bool setParent, bool setNextXid);
-static void MarkAsPreparingGuts(GlobalTransaction gxact, TransactionId xid,
+static void MarkAsPreparingGuts(GlobalTransaction gxact, FullTransactionId fxid,
 								const char *gid, TimestampTz prepared_at, Oid owner,
 								Oid databaseid);
-static void RemoveTwoPhaseFile(TransactionId xid, bool giveWarning);
-static void RecreateTwoPhaseFile(TransactionId xid, void *content, int len);
+static void RemoveTwoPhaseFile(FullTransactionId fxid, bool giveWarning);
+static void RecreateTwoPhaseFile(FullTransactionId fxid, void *content, int len);
 
 /*
  * Initialization of shared memory
@@ -356,7 +357,7 @@ PostPrepare_Twophase(void)
  *		Reserve the GID for the given transaction.
  */
 GlobalTransaction
-MarkAsPreparing(TransactionId xid, const char *gid,
+MarkAsPreparing(FullTransactionId fxid, const char *gid,
 				TimestampTz prepared_at, Oid owner, Oid databaseid)
 {
 	GlobalTransaction gxact;
@@ -407,7 +408,7 @@ MarkAsPreparing(TransactionId xid, const char *gid,
 	gxact = TwoPhaseState->freeGXacts;
 	TwoPhaseState->freeGXacts = gxact->next;
 
-	MarkAsPreparingGuts(gxact, xid, gid, prepared_at, owner, databaseid);
+	MarkAsPreparingGuts(gxact, fxid, gid, prepared_at, owner, databaseid);
 
 	gxact->ondisk = false;
 
@@ -430,11 +431,13 @@ MarkAsPreparing(TransactionId xid, const char *gid,
  * Note: This function should be called with appropriate locks held.
  */
 static void
-MarkAsPreparingGuts(GlobalTransaction gxact, TransactionId xid, const char *gid,
-					TimestampTz prepared_at, Oid owner, Oid databaseid)
+MarkAsPreparingGuts(GlobalTransaction gxact, FullTransactionId fxid,
+					const char *gid, TimestampTz prepared_at, Oid owner,
+					Oid databaseid)
 {
 	PGPROC	   *proc;
 	int			i;
+	TransactionId xid = XidFromFullTransactionId(fxid);
 
 	Assert(LWLockHeldByMeInMode(TwoPhaseStateLock, LW_EXCLUSIVE));
 
@@ -479,7 +482,7 @@ MarkAsPreparingGuts(GlobalTransaction gxact, TransactionId xid, const char *gid,
 	proc->subxidStatus.count = 0;
 
 	gxact->prepared_at = prepared_at;
-	gxact->xid = xid;
+	gxact->fxid = fxid;
 	gxact->owner = owner;
 	gxact->locking_backend = MyProcNumber;
 	gxact->valid = false;
@@ -797,12 +800,12 @@ pg_prepared_xact(PG_FUNCTION_ARGS)
  * caller had better hold it.
  */
 static GlobalTransaction
-TwoPhaseGetGXact(TransactionId xid, bool lock_held)
+TwoPhaseGetGXact(FullTransactionId fxid, bool lock_held)
 {
 	GlobalTransaction result = NULL;
 	int			i;
 
-	static TransactionId cached_xid = InvalidTransactionId;
+	static FullTransactionId cached_fxid = {0};
 	static GlobalTransaction cached_gxact = NULL;
 
 	Assert(!lock_held || LWLockHeldByMe(TwoPhaseStateLock));
@@ -811,7 +814,7 @@ TwoPhaseGetGXact(TransactionId xid, bool lock_held)
 	 * During a recovery, COMMIT PREPARED, or ABORT PREPARED, we'll be called
 	 * repeatedly for the same XID.  We can save work with a simple cache.
 	 */
-	if (xid == cached_xid)
+	if (FullTransactionIdEquals(fxid, cached_fxid))
 		return cached_gxact;
 
 	if (!lock_held)
@@ -821,7 +824,7 @@ TwoPhaseGetGXact(TransactionId xid, bool lock_held)
 	{
 		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
 
-		if (gxact->xid == xid)
+		if (FullTransactionIdEquals(gxact->fxid, fxid))
 		{
 			result = gxact;
 			break;
@@ -832,9 +835,10 @@ TwoPhaseGetGXact(TransactionId xid, bool lock_held)
 		LWLockRelease(TwoPhaseStateLock);
 
 	if (result == NULL)			/* should not happen */
-		elog(ERROR, "failed to find GlobalTransaction for xid %u", xid);
+		elog(ERROR, "failed to find GlobalTransaction for xid %u",
+			 XidFromFullTransactionId(fxid));
 
-	cached_xid = xid;
+	cached_fxid = fxid;
 	cached_gxact = result;
 
 	return result;
@@ -881,7 +885,7 @@ TwoPhaseGetXidByVirtualXID(VirtualTransactionId vxid,
 				*have_more = true;
 				break;
 			}
-			result = gxact->xid;
+			result = XidFromFullTransactionId(gxact->fxid);
 		}
 	}
 
@@ -892,7 +896,7 @@ TwoPhaseGetXidByVirtualXID(VirtualTransactionId vxid,
 
 /*
  * TwoPhaseGetDummyProcNumber
- *		Get the dummy proc number for prepared transaction specified by XID
+ *		Get the dummy proc number for prepared transaction
  *
  * Dummy proc numbers are similar to proc numbers of real backends.  They
  * start at MaxBackends, and are unique across all currently active real
@@ -900,24 +904,24 @@ TwoPhaseGetXidByVirtualXID(VirtualTransactionId vxid,
  * TwoPhaseStateLock will not be taken, so the caller had better hold it.
  */
 ProcNumber
-TwoPhaseGetDummyProcNumber(TransactionId xid, bool lock_held)
+TwoPhaseGetDummyProcNumber(FullTransactionId fxid, bool lock_held)
 {
-	GlobalTransaction gxact = TwoPhaseGetGXact(xid, lock_held);
+	GlobalTransaction gxact = TwoPhaseGetGXact(fxid, lock_held);
 
 	return gxact->pgprocno;
 }
 
 /*
  * TwoPhaseGetDummyProc
- *		Get the PGPROC that represents a prepared transaction specified by XID
+ *		Get the PGPROC that represents a prepared transaction
  *
  * If lock_held is set to true, TwoPhaseStateLock will not be taken, so the
  * caller had better hold it.
  */
 PGPROC *
-TwoPhaseGetDummyProc(TransactionId xid, bool lock_held)
+TwoPhaseGetDummyProc(FullTransactionId fxid, bool lock_held)
 {
-	GlobalTransaction gxact = TwoPhaseGetGXact(xid, lock_held);
+	GlobalTransaction gxact = TwoPhaseGetGXact(fxid, lock_held);
 
 	return GetPGProcByNumber(gxact->pgprocno);
 }
@@ -942,10 +946,8 @@ AdjustToFullTransactionId(TransactionId xid)
 }
 
 static inline int
-TwoPhaseFilePath(char *path, TransactionId xid)
+TwoPhaseFilePath(char *path, FullTransactionId fxid)
 {
-	FullTransactionId fxid = AdjustToFullTransactionId(xid);
-
 	return snprintf(path, MAXPGPATH, TWOPHASE_DIR "/%08X%08X",
 					EpochFromFullTransactionId(fxid),
 					XidFromFullTransactionId(fxid));
@@ -1049,7 +1051,7 @@ void
 StartPrepare(GlobalTransaction gxact)
 {
 	PGPROC	   *proc = GetPGProcByNumber(gxact->pgprocno);
-	TransactionId xid = gxact->xid;
+	TransactionId xid = XidFromFullTransactionId(gxact->fxid);
 	TwoPhaseFileHeader hdr;
 	TransactionId *children;
 	RelFileLocator *commitrels;
@@ -1281,10 +1283,11 @@ RegisterTwoPhaseRecord(TwoPhaseRmgrId rmid, uint16 info,
  * If it looks OK (has a valid magic number and CRC), return the palloc'd
  * contents of the file, issuing an error when finding corrupted data.  If
  * missing_ok is true, which indicates that missing files can be safely
- * ignored, then return NULL.  This state can be reached when doing recovery.
+ * ignored, then return NULL.  This state can be reached when doing recovery
+ * after discarding two-phase files from frozen epochs.
  */
 static char *
-ReadTwoPhaseFile(TransactionId xid, bool missing_ok)
+ReadTwoPhaseFile(FullTransactionId fxid, bool missing_ok)
 {
 	char		path[MAXPGPATH];
 	char	   *buf;
@@ -1296,7 +1299,7 @@ ReadTwoPhaseFile(TransactionId xid, bool missing_ok)
 				file_crc;
 	int			r;
 
-	TwoPhaseFilePath(path, xid);
+	TwoPhaseFilePath(path, fxid);
 
 	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
 	if (fd < 0)
@@ -1461,6 +1464,7 @@ StandbyTransactionIdIsPrepared(TransactionId xid)
 	char	   *buf;
 	TwoPhaseFileHeader *hdr;
 	bool		result;
+	FullTransactionId fxid;
 
 	Assert(TransactionIdIsValid(xid));
 
@@ -1468,7 +1472,8 @@ StandbyTransactionIdIsPrepared(TransactionId xid)
 		return false;			/* nothing to do */
 
 	/* Read and validate file */
-	buf = ReadTwoPhaseFile(xid, true);
+	fxid = FullTransactionIdFromAllowableAt(TransamVariables->nextXid, xid);
+	buf = ReadTwoPhaseFile(fxid, true);
 	if (buf == NULL)
 		return false;
 
@@ -1488,6 +1493,7 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
 {
 	GlobalTransaction gxact;
 	PGPROC	   *proc;
+	FullTransactionId fxid;
 	TransactionId xid;
 	bool		ondisk;
 	char	   *buf;
@@ -1509,7 +1515,8 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
 	 */
 	gxact = LockGXact(gid, GetUserId());
 	proc = GetPGProcByNumber(gxact->pgprocno);
-	xid = gxact->xid;
+	fxid = gxact->fxid;
+	xid = XidFromFullTransactionId(fxid);
 
 	/*
 	 * Read and validate 2PC state data. State data will typically be stored
@@ -1517,7 +1524,7 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
 	 * to disk if for some reason they have lived for a long time.
 	 */
 	if (gxact->ondisk)
-		buf = ReadTwoPhaseFile(xid, false);
+		buf = ReadTwoPhaseFile(fxid, false);
 	else
 		XlogReadTwoPhaseData(gxact->prepare_start_lsn, &buf, NULL);
 
@@ -1636,11 +1643,11 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
 
 	/* And now do the callbacks */
 	if (isCommit)
-		ProcessRecords(bufptr, xid, twophase_postcommit_callbacks);
+		ProcessRecords(bufptr, fxid, twophase_postcommit_callbacks);
 	else
-		ProcessRecords(bufptr, xid, twophase_postabort_callbacks);
+		ProcessRecords(bufptr, fxid, twophase_postabort_callbacks);
 
-	PredicateLockTwoPhaseFinish(xid, isCommit);
+	PredicateLockTwoPhaseFinish(fxid, isCommit);
 
 	/*
 	 * Read this value while holding the two-phase lock, as the on-disk 2PC
@@ -1664,7 +1671,7 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
 	 * And now we can clean up any files we may have left.
 	 */
 	if (ondisk)
-		RemoveTwoPhaseFile(xid, true);
+		RemoveTwoPhaseFile(fxid, true);
 
 	MyLockedGxact = NULL;
 
@@ -1677,7 +1684,7 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
  * Scan 2PC state data in memory and call the indicated callbacks for each 2PC record.
  */
 static void
-ProcessRecords(char *bufptr, TransactionId xid,
+ProcessRecords(char *bufptr, FullTransactionId fxid,
 			   const TwoPhaseCallback callbacks[])
 {
 	for (;;)
@@ -1691,7 +1698,7 @@ ProcessRecords(char *bufptr, TransactionId xid,
 		bufptr += MAXALIGN(sizeof(TwoPhaseRecordOnDisk));
 
 		if (callbacks[record->rmid] != NULL)
-			callbacks[record->rmid] (xid, record->info,
+			callbacks[record->rmid] (fxid, record->info,
 									 (void *) bufptr, record->len);
 
 		bufptr += MAXALIGN(record->len);
@@ -1699,17 +1706,21 @@ ProcessRecords(char *bufptr, TransactionId xid,
 }
 
 /*
- * Remove the 2PC file for the specified XID.
+ * Remove the 2PC file.
  *
  * If giveWarning is false, do not complain about file-not-present;
  * this is an expected case during WAL replay.
+ *
+ * This routine is used at early stages at recovery where future and
+ * past orphaned files are checked, hence the FullTransactionId to build
+ * a complete file name fit for the removal.
  */
 static void
-RemoveTwoPhaseFile(TransactionId xid, bool giveWarning)
+RemoveTwoPhaseFile(FullTransactionId fxid, bool giveWarning)
 {
 	char		path[MAXPGPATH];
 
-	TwoPhaseFilePath(path, xid);
+	TwoPhaseFilePath(path, fxid);
 	if (unlink(path))
 		if (errno != ENOENT || giveWarning)
 			ereport(WARNING,
@@ -1724,7 +1735,7 @@ RemoveTwoPhaseFile(TransactionId xid, bool giveWarning)
  * Note: content and len don't include CRC.
  */
 static void
-RecreateTwoPhaseFile(TransactionId xid, void *content, int len)
+RecreateTwoPhaseFile(FullTransactionId fxid, void *content, int len)
 {
 	char		path[MAXPGPATH];
 	pg_crc32c	statefile_crc;
@@ -1735,7 +1746,7 @@ RecreateTwoPhaseFile(TransactionId xid, void *content, int len)
 	COMP_CRC32C(statefile_crc, content, len);
 	FIN_CRC32C(statefile_crc);
 
-	TwoPhaseFilePath(path, xid);
+	TwoPhaseFilePath(path, fxid);
 
 	fd = OpenTransientFile(path,
 						   O_CREAT | O_TRUNC | O_WRONLY | PG_BINARY);
@@ -1847,7 +1858,7 @@ CheckPointTwoPhase(XLogRecPtr redo_horizon)
 			int			len;
 
 			XlogReadTwoPhaseData(gxact->prepare_start_lsn, &buf, &len);
-			RecreateTwoPhaseFile(gxact->xid, buf, len);
+			RecreateTwoPhaseFile(gxact->fxid, buf, len);
 			gxact->ondisk = true;
 			gxact->prepare_start_lsn = InvalidXLogRecPtr;
 			gxact->prepare_end_lsn = InvalidXLogRecPtr;
@@ -1883,13 +1894,16 @@ CheckPointTwoPhase(XLogRecPtr redo_horizon)
  * Scan pg_twophase and fill TwoPhaseState depending on the on-disk data.
  * This is called once at the beginning of recovery, saving any extra
  * lookups in the future.  Two-phase files that are newer than the
- * minimum XID horizon are discarded on the way.
+ * minimum XID horizon are discarded on the way, as much as files that
+ * are older than the oldest XID horizon.
  */
 void
 restoreTwoPhaseData(void)
 {
 	DIR		   *cldir;
 	struct dirent *clde;
+	FullTransactionId nextXid = TransamVariables->nextXid;
+	FullTransactionId oldestXid = AdjustToFullTransactionId(TransamVariables->oldestXid);
 
 	LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
 	cldir = AllocateDir(TWOPHASE_DIR);
@@ -1898,19 +1912,33 @@ restoreTwoPhaseData(void)
 		if (strlen(clde->d_name) == 16 &&
 			strspn(clde->d_name, "0123456789ABCDEF") == 16)
 		{
-			TransactionId xid;
 			FullTransactionId fxid;
 			char	   *buf;
 
 			fxid = FullTransactionIdFromU64(strtou64(clde->d_name, NULL, 16));
-			xid = XidFromFullTransactionId(fxid);
 
-			buf = ProcessTwoPhaseBuffer(xid, InvalidXLogRecPtr,
-										true, false, false);
-			if (buf == NULL)
+			/* Reject XID if too new or too old */
+			if (FullTransactionIdFollowsOrEquals(fxid, nextXid) ||
+				FullTransactionIdPrecedes(fxid, oldestXid))
+			{
+				if (FullTransactionIdFollowsOrEquals(fxid, nextXid))
+					ereport(WARNING,
+							(errmsg("removing future two-phase state file for transaction %u of epoch %u",
+									XidFromFullTransactionId(fxid),
+									EpochFromFullTransactionId(fxid))));
+				else
+					ereport(WARNING,
+							(errmsg("removing past two-phase state file for transaction %u of epoch %u",
+									XidFromFullTransactionId(fxid),
+									EpochFromFullTransactionId(fxid))));
+				RemoveTwoPhaseFile(fxid, true);
 				continue;
+			}
 
-			PrepareRedoAdd(buf, InvalidXLogRecPtr,
+			buf = ProcessTwoPhaseBuffer(fxid, InvalidXLogRecPtr,
+										true, false, false);
+
+			PrepareRedoAdd(fxid, buf, InvalidXLogRecPtr,
 						   InvalidXLogRecPtr, InvalidRepOriginId);
 		}
 	}
@@ -1969,15 +1997,11 @@ PrescanPreparedTransactions(TransactionId **xids_p, int *nxids_p)
 
 		Assert(gxact->inredo);
 
-		xid = gxact->xid;
-
-		buf = ProcessTwoPhaseBuffer(xid,
+		xid = XidFromFullTransactionId(gxact->fxid);
+		buf = ProcessTwoPhaseBuffer(gxact->fxid,
 									gxact->prepare_start_lsn,
 									gxact->ondisk, false, true);
 
-		if (buf == NULL)
-			continue;
-
 		/*
 		 * OK, we think this file is valid.  Incorporate xid into the
 		 * running-minimum result.
@@ -2037,19 +2061,15 @@ StandbyRecoverPreparedTransactions(void)
 	LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
 	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
 	{
-		TransactionId xid;
 		char	   *buf;
 		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
 
 		Assert(gxact->inredo);
 
-		xid = gxact->xid;
-
-		buf = ProcessTwoPhaseBuffer(xid,
+		buf = ProcessTwoPhaseBuffer(gxact->fxid,
 									gxact->prepare_start_lsn,
 									gxact->ondisk, true, false);
-		if (buf != NULL)
-			pfree(buf);
+		pfree(buf);
 	}
 	LWLockRelease(TwoPhaseStateLock);
 }
@@ -2074,19 +2094,51 @@ void
 RecoverPreparedTransactions(void)
 {
 	int			i;
+	FullTransactionId *remove_fxids;
+	int			remove_fxids_cnt;
 
 	LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
+
+	/*
+	 * Track XIDs candidate for removal if found as already committed or
+	 * aborted, once the first scan through TwoPhaseState is done.  This
+	 * cannot happen while going through the entries in TwoPhaseState as
+	 * PrepareRedoRemove() manipulates it.
+	 */
+	remove_fxids_cnt = 0;
+	remove_fxids = (FullTransactionId *) palloc(TwoPhaseState->numPrepXacts *
+												sizeof(FullTransactionId));
+
 	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
 	{
-		TransactionId xid;
 		char	   *buf;
 		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+		FullTransactionId fxid = gxact->fxid;
 		char	   *bufptr;
 		TwoPhaseFileHeader *hdr;
 		TransactionId *subxids;
 		const char *gid;
 
-		xid = gxact->xid;
+		/*
+		 * Is this transaction already aborted or committed?  If yes, mark it
+		 * for removal.
+		 *
+		 * Checking CLOGs if these transactions have been already aborted or
+		 * committed is safe at this stage; we are at the end of recovery and
+		 * all WAL has been replayed, all 2PC transactions are reinstated and
+		 * should be tracked in TwoPhaseState.
+		 */
+		if (TransactionIdDidCommit(XidFromFullTransactionId(fxid)) ||
+			TransactionIdDidAbort(XidFromFullTransactionId(fxid)))
+		{
+			/*
+			 * Track this transaction ID for its removal from the shared
+			 * memory state at the end.
+			 */
+			remove_fxids[remove_fxids_cnt] = fxid;
+			remove_fxids_cnt++;
+			continue;
+		}
 
 		/*
 		 * Reconstruct subtrans state for the transaction --- needed because
@@ -2097,17 +2149,18 @@ RecoverPreparedTransactions(void)
 		 * SubTransSetParent has been set before, if the prepared transaction
 		 * generated xid assignment records.
 		 */
-		buf = ProcessTwoPhaseBuffer(xid,
+		buf = ProcessTwoPhaseBuffer(gxact->fxid,
 									gxact->prepare_start_lsn,
 									gxact->ondisk, true, false);
-		if (buf == NULL)
-			continue;
 
 		ereport(LOG,
-				(errmsg("recovering prepared transaction %u from shared memory", xid)));
+				(errmsg("recovering prepared transaction %u of epoch %u from shared memory",
+						XidFromFullTransactionId(gxact->fxid),
+						EpochFromFullTransactionId(gxact->fxid))));
 
 		hdr = (TwoPhaseFileHeader *) buf;
-		Assert(TransactionIdEquals(hdr->xid, xid));
+		Assert(TransactionIdEquals(hdr->xid,
+								   XidFromFullTransactionId(gxact->fxid)));
 		bufptr = buf + MAXALIGN(sizeof(TwoPhaseFileHeader));
 		gid = (const char *) bufptr;
 		bufptr += MAXALIGN(hdr->gidlen);
@@ -2123,7 +2176,7 @@ RecoverPreparedTransactions(void)
 		 * Recreate its GXACT and dummy PGPROC. But, check whether it was
 		 * added in redo and already has a shmem entry for it.
 		 */
-		MarkAsPreparingGuts(gxact, xid, gid,
+		MarkAsPreparingGuts(gxact, gxact->fxid, gid,
 							hdr->prepared_at,
 							hdr->owner, hdr->database);
 
@@ -2138,7 +2191,7 @@ RecoverPreparedTransactions(void)
 		/*
 		 * Recover other state (notably locks) using resource managers.
 		 */
-		ProcessRecords(bufptr, xid, twophase_recover_callbacks);
+		ProcessRecords(bufptr, fxid, twophase_recover_callbacks);
 
 		/*
 		 * Release locks held by the standby process after we process each
@@ -2146,7 +2199,7 @@ RecoverPreparedTransactions(void)
 		 * additional locks at any one time.
 		 */
 		if (InHotStandby)
-			StandbyReleaseLockTree(xid, hdr->nsubxacts, subxids);
+			StandbyReleaseLockTree(hdr->xid, hdr->nsubxacts, subxids);
 
 		/*
 		 * We're done with recovering this transaction. Clear MyLockedGxact,
@@ -2159,13 +2212,25 @@ RecoverPreparedTransactions(void)
 		LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
 	}
 
+	for (i = 0; i < remove_fxids_cnt; i++)
+	{
+		FullTransactionId fxid = remove_fxids[i];
+
+		ereport(WARNING,
+				(errmsg("removing stale two-phase state from memory for transaction %u of epoch %u",
+						XidFromFullTransactionId(fxid),
+						EpochFromFullTransactionId(fxid))));
+
+		PrepareRedoRemoveFull(fxid, true);
+	}
+
 	LWLockRelease(TwoPhaseStateLock);
 }
 
 /*
  * ProcessTwoPhaseBuffer
  *
- * Given a transaction id, read it either from disk or read it directly
+ * Given a FullTransactionId, read it either from disk or read it directly
  * via shmem xlog record pointer using the provided "prepare_start_lsn".
  *
  * If setParent is true, set up subtransaction parent linkages.
@@ -2174,13 +2239,11 @@ RecoverPreparedTransactions(void)
  * value scanned.
  */
 static char *
-ProcessTwoPhaseBuffer(TransactionId xid,
+ProcessTwoPhaseBuffer(FullTransactionId fxid,
 					  XLogRecPtr prepare_start_lsn,
 					  bool fromdisk,
 					  bool setParent, bool setNextXid)
 {
-	FullTransactionId nextXid = TransamVariables->nextXid;
-	TransactionId origNextXid = XidFromFullTransactionId(nextXid);
 	TransactionId *subxids;
 	char	   *buf;
 	TwoPhaseFileHeader *hdr;
@@ -2191,50 +2254,10 @@ ProcessTwoPhaseBuffer(TransactionId xid,
 	if (!fromdisk)
 		Assert(prepare_start_lsn != InvalidXLogRecPtr);
 
-	/* Already processed? */
-	if (TransactionIdDidCommit(xid) || TransactionIdDidAbort(xid))
-	{
-		if (fromdisk)
-		{
-			ereport(WARNING,
-					(errmsg("removing stale two-phase state file for transaction %u",
-							xid)));
-			RemoveTwoPhaseFile(xid, true);
-		}
-		else
-		{
-			ereport(WARNING,
-					(errmsg("removing stale two-phase state from memory for transaction %u",
-							xid)));
-			PrepareRedoRemove(xid, true);
-		}
-		return NULL;
-	}
-
-	/* Reject XID if too new */
-	if (TransactionIdFollowsOrEquals(xid, origNextXid))
-	{
-		if (fromdisk)
-		{
-			ereport(WARNING,
-					(errmsg("removing future two-phase state file for transaction %u",
-							xid)));
-			RemoveTwoPhaseFile(xid, true);
-		}
-		else
-		{
-			ereport(WARNING,
-					(errmsg("removing future two-phase state from memory for transaction %u",
-							xid)));
-			PrepareRedoRemove(xid, true);
-		}
-		return NULL;
-	}
-
 	if (fromdisk)
 	{
 		/* Read and validate file */
-		buf = ReadTwoPhaseFile(xid, false);
+		buf = ReadTwoPhaseFile(fxid, false);
 	}
 	else
 	{
@@ -2244,18 +2267,20 @@ ProcessTwoPhaseBuffer(TransactionId xid,
 
 	/* Deconstruct header */
 	hdr = (TwoPhaseFileHeader *) buf;
-	if (!TransactionIdEquals(hdr->xid, xid))
+	if (!TransactionIdEquals(hdr->xid, XidFromFullTransactionId(fxid)))
 	{
 		if (fromdisk)
 			ereport(ERROR,
 					(errcode(ERRCODE_DATA_CORRUPTED),
-					 errmsg("corrupted two-phase state file for transaction %u",
-							xid)));
+					 errmsg("corrupted two-phase state file for transaction %u of epoch %u",
+							XidFromFullTransactionId(fxid),
+							EpochFromFullTransactionId(fxid))));
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_DATA_CORRUPTED),
-					 errmsg("corrupted two-phase state in memory for transaction %u",
-							xid)));
+					 errmsg("corrupted two-phase state in memory for transaction %u of epoch %u",
+							XidFromFullTransactionId(fxid),
+							EpochFromFullTransactionId(fxid))));
 	}
 
 	/*
@@ -2269,14 +2294,14 @@ ProcessTwoPhaseBuffer(TransactionId xid,
 	{
 		TransactionId subxid = subxids[i];
 
-		Assert(TransactionIdFollows(subxid, xid));
+		Assert(TransactionIdFollows(subxid, XidFromFullTransactionId(fxid)));
 
 		/* update nextXid if needed */
 		if (setNextXid)
 			AdvanceNextFullTransactionIdPastXid(subxid);
 
 		if (setParent)
-			SubTransSetParent(subxid, xid);
+			SubTransSetParent(subxid, XidFromFullTransactionId(fxid));
 	}
 
 	return buf;
@@ -2467,8 +2492,9 @@ RecordTransactionAbortPrepared(TransactionId xid,
  * data, the entry is marked as located on disk.
  */
 void
-PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
-			   XLogRecPtr end_lsn, RepOriginId origin_id)
+PrepareRedoAdd(FullTransactionId fxid, char *buf,
+			   XLogRecPtr start_lsn, XLogRecPtr end_lsn,
+			   RepOriginId origin_id)
 {
 	TwoPhaseFileHeader *hdr = (TwoPhaseFileHeader *) buf;
 	char	   *bufptr;
@@ -2478,6 +2504,10 @@ PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 	Assert(LWLockHeldByMeInMode(TwoPhaseStateLock, LW_EXCLUSIVE));
 	Assert(RecoveryInProgress());
 
+	if (!FullTransactionIdIsValid(fxid))
+		fxid = FullTransactionIdFromAllowableAt(TransamVariables->nextXid,
+												hdr->xid);
+
 	bufptr = buf + MAXALIGN(sizeof(TwoPhaseFileHeader));
 	gid = (const char *) bufptr;
 
@@ -2506,7 +2536,8 @@ PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 	{
 		char		path[MAXPGPATH];
 
-		TwoPhaseFilePath(path, hdr->xid);
+		Assert(InRecovery);
+		TwoPhaseFilePath(path, fxid);
 
 		if (access(path, F_OK) == 0)
 		{
@@ -2537,7 +2568,7 @@ PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 	gxact->prepared_at = hdr->prepared_at;
 	gxact->prepare_start_lsn = start_lsn;
 	gxact->prepare_end_lsn = end_lsn;
-	gxact->xid = hdr->xid;
+	gxact->fxid = fxid;
 	gxact->owner = hdr->owner;
 	gxact->locking_backend = INVALID_PROC_NUMBER;
 	gxact->valid = false;
@@ -2556,11 +2587,13 @@ PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 						   false /* backward */ , false /* WAL */ );
 	}
 
-	elog(DEBUG2, "added 2PC data in shared memory for transaction %u", gxact->xid);
+	elog(DEBUG2, "added 2PC data in shared memory for transaction %u of epoch %u",
+		 XidFromFullTransactionId(gxact->fxid),
+		 EpochFromFullTransactionId(gxact->fxid));
 }
 
 /*
- * PrepareRedoRemove
+ * PrepareRedoRemoveFull
  *
  * Remove the corresponding gxact entry from TwoPhaseState. Also remove
  * the 2PC file if a prepared transaction was saved via an earlier checkpoint.
@@ -2568,8 +2601,8 @@ PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
  * Caller must hold TwoPhaseStateLock in exclusive mode, because TwoPhaseState
  * is updated.
  */
-void
-PrepareRedoRemove(TransactionId xid, bool giveWarning)
+static void
+PrepareRedoRemoveFull(FullTransactionId fxid, bool giveWarning)
 {
 	GlobalTransaction gxact = NULL;
 	int			i;
@@ -2582,7 +2615,7 @@ PrepareRedoRemove(TransactionId xid, bool giveWarning)
 	{
 		gxact = TwoPhaseState->prepXacts[i];
 
-		if (gxact->xid == xid)
+		if (FullTransactionIdEquals(gxact->fxid, fxid))
 		{
 			Assert(gxact->inredo);
 			found = true;
@@ -2599,12 +2632,28 @@ PrepareRedoRemove(TransactionId xid, bool giveWarning)
 	/*
 	 * And now we can clean up any files we may have left.
 	 */
-	elog(DEBUG2, "removing 2PC data for transaction %u", xid);
+	elog(DEBUG2, "removing 2PC data for transaction %u of epoch %u ",
+		 XidFromFullTransactionId(fxid),
+		 EpochFromFullTransactionId(fxid));
+
 	if (gxact->ondisk)
-		RemoveTwoPhaseFile(xid, giveWarning);
+		RemoveTwoPhaseFile(fxid, giveWarning);
+
 	RemoveGXact(gxact);
 }
 
+/*
+ * Wrapper of PrepareRedoRemoveFull(), for TransactionIds.
+ */
+void
+PrepareRedoRemove(TransactionId xid, bool giveWarning)
+{
+	FullTransactionId fxid =
+		FullTransactionIdFromAllowableAt(TransamVariables->nextXid, xid);
+
+	PrepareRedoRemoveFull(fxid, giveWarning);
+}
+
 /*
  * LookupGXact
  *		Check if the prepared transaction with the given GID, lsn and timestamp
@@ -2649,7 +2698,7 @@ LookupGXact(const char *gid, XLogRecPtr prepare_end_lsn,
 			 * between publisher and subscriber.
 			 */
 			if (gxact->ondisk)
-				buf = ReadTwoPhaseFile(gxact->xid, false);
+				buf = ReadTwoPhaseFile(gxact->fxid, false);
 			else
 			{
 				Assert(gxact->prepare_start_lsn);
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 4cecf630060..89ed5b726c8 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2460,7 +2460,7 @@ static void
 PrepareTransaction(void)
 {
 	TransactionState s = CurrentTransactionState;
-	TransactionId xid = GetCurrentTransactionId();
+	FullTransactionId fxid = GetCurrentFullTransactionId();
 	GlobalTransaction gxact;
 	TimestampTz prepared_at;
 
@@ -2589,7 +2589,7 @@ PrepareTransaction(void)
 	 * Reserve the GID for this transaction. This could fail if the requested
 	 * GID is invalid or already in use.
 	 */
-	gxact = MarkAsPreparing(xid, prepareGID, prepared_at,
+	gxact = MarkAsPreparing(fxid, prepareGID, prepared_at,
 							GetUserId(), MyDatabaseId);
 	prepareGID = NULL;
 
@@ -2639,7 +2639,7 @@ PrepareTransaction(void)
 	 * ProcArrayClearTransaction().  Otherwise, a GetLockConflicts() would
 	 * conclude "xact already committed or aborted" for our locks.
 	 */
-	PostPrepare_Locks(xid);
+	PostPrepare_Locks(fxid);
 
 	/*
 	 * Let others know about no transaction in progress by me.  This has to be
@@ -2678,9 +2678,9 @@ PrepareTransaction(void)
 
 	PostPrepare_smgr();
 
-	PostPrepare_MultiXact(xid);
+	PostPrepare_MultiXact(fxid);
 
-	PostPrepare_PredicateLocks(xid);
+	PostPrepare_PredicateLocks(fxid);
 
 	ResourceOwnerRelease(TopTransactionResourceOwner,
 						 RESOURCE_RELEASE_LOCKS,
@@ -6358,7 +6358,8 @@ xact_redo(XLogReaderState *record)
 		 * gxact entry.
 		 */
 		LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
-		PrepareRedoAdd(XLogRecGetData(record),
+		PrepareRedoAdd(InvalidFullTransactionId,
+					   XLogRecGetData(record),
 					   record->ReadRecPtr,
 					   record->EndRecPtr,
 					   XLogRecGetOrigin(record));
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index e5e7ab55716..7e49c8d7d8a 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -3397,9 +3397,9 @@ AtPrepare_Locks(void)
  * but that probably costs more cycles.
  */
 void
-PostPrepare_Locks(TransactionId xid)
+PostPrepare_Locks(FullTransactionId fxid)
 {
-	PGPROC	   *newproc = TwoPhaseGetDummyProc(xid, false);
+	PGPROC	   *newproc = TwoPhaseGetDummyProc(fxid, false);
 	HASH_SEQ_STATUS status;
 	LOCALLOCK  *locallock;
 	LOCK	   *lock;
@@ -4171,11 +4171,11 @@ DumpAllLocks(void)
  * and PANIC anyway.
  */
 void
-lock_twophase_recover(TransactionId xid, uint16 info,
+lock_twophase_recover(FullTransactionId fxid, uint16 info,
 					  void *recdata, uint32 len)
 {
 	TwoPhaseLockRecord *rec = (TwoPhaseLockRecord *) recdata;
-	PGPROC	   *proc = TwoPhaseGetDummyProc(xid, false);
+	PGPROC	   *proc = TwoPhaseGetDummyProc(fxid, false);
 	LOCKTAG    *locktag;
 	LOCKMODE	lockmode;
 	LOCKMETHODID lockmethodid;
@@ -4352,7 +4352,7 @@ lock_twophase_recover(TransactionId xid, uint16 info,
  * starting up into hot standby mode.
  */
 void
-lock_twophase_standby_recover(TransactionId xid, uint16 info,
+lock_twophase_standby_recover(FullTransactionId fxid, uint16 info,
 							  void *recdata, uint32 len)
 {
 	TwoPhaseLockRecord *rec = (TwoPhaseLockRecord *) recdata;
@@ -4371,7 +4371,7 @@ lock_twophase_standby_recover(TransactionId xid, uint16 info,
 	if (lockmode == AccessExclusiveLock &&
 		locktag->locktag_type == LOCKTAG_RELATION)
 	{
-		StandbyAcquireAccessExclusiveLock(xid,
+		StandbyAcquireAccessExclusiveLock(XidFromFullTransactionId(fxid),
 										  locktag->locktag_field1 /* dboid */ ,
 										  locktag->locktag_field2 /* reloid */ );
 	}
@@ -4384,11 +4384,11 @@ lock_twophase_standby_recover(TransactionId xid, uint16 info,
  * Find and release the lock indicated by the 2PC record.
  */
 void
-lock_twophase_postcommit(TransactionId xid, uint16 info,
+lock_twophase_postcommit(FullTransactionId fxid, uint16 info,
 						 void *recdata, uint32 len)
 {
 	TwoPhaseLockRecord *rec = (TwoPhaseLockRecord *) recdata;
-	PGPROC	   *proc = TwoPhaseGetDummyProc(xid, true);
+	PGPROC	   *proc = TwoPhaseGetDummyProc(fxid, true);
 	LOCKTAG    *locktag;
 	LOCKMETHODID lockmethodid;
 	LockMethod	lockMethodTable;
@@ -4410,10 +4410,10 @@ lock_twophase_postcommit(TransactionId xid, uint16 info,
  * This is actually just the same as the COMMIT case.
  */
 void
-lock_twophase_postabort(TransactionId xid, uint16 info,
+lock_twophase_postabort(FullTransactionId fxid, uint16 info,
 						void *recdata, uint32 len)
 {
-	lock_twophase_postcommit(xid, info, recdata, len);
+	lock_twophase_postcommit(fxid, info, recdata, len);
 }
 
 /*
diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index 2c87273e17e..82a58f54901 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -191,7 +191,7 @@
  *		AtPrepare_PredicateLocks(void);
  *		PostPrepare_PredicateLocks(TransactionId xid);
  *		PredicateLockTwoPhaseFinish(TransactionId xid, bool isCommit);
- *		predicatelock_twophase_recover(TransactionId xid, uint16 info,
+ *		predicatelock_twophase_recover(FullTransactionId fxid, uint16 info,
  *									   void *recdata, uint32 len);
  */
 
@@ -4846,7 +4846,7 @@ AtPrepare_PredicateLocks(void)
  *		anyway. We only need to clean up our local state.
  */
 void
-PostPrepare_PredicateLocks(TransactionId xid)
+PostPrepare_PredicateLocks(FullTransactionId fxid)
 {
 	if (MySerializableXact == InvalidSerializableXact)
 		return;
@@ -4869,12 +4869,12 @@ PostPrepare_PredicateLocks(TransactionId xid)
  *		commits or aborts.
  */
 void
-PredicateLockTwoPhaseFinish(TransactionId xid, bool isCommit)
+PredicateLockTwoPhaseFinish(FullTransactionId fxid, bool isCommit)
 {
 	SERIALIZABLEXID *sxid;
 	SERIALIZABLEXIDTAG sxidtag;
 
-	sxidtag.xid = xid;
+	sxidtag.xid = XidFromFullTransactionId(fxid);
 
 	LWLockAcquire(SerializableXactHashLock, LW_SHARED);
 	sxid = (SERIALIZABLEXID *)
@@ -4896,10 +4896,11 @@ PredicateLockTwoPhaseFinish(TransactionId xid, bool isCommit)
  * Re-acquire a predicate lock belonging to a transaction that was prepared.
  */
 void
-predicatelock_twophase_recover(TransactionId xid, uint16 info,
+predicatelock_twophase_recover(FullTransactionId fxid, uint16 info,
 							   void *recdata, uint32 len)
 {
 	TwoPhasePredicateRecord *record;
+	TransactionId xid = XidFromFullTransactionId(fxid);
 
 	Assert(len == sizeof(TwoPhasePredicateRecord));
 
diff --git a/src/backend/utils/activity/pgstat_relation.c b/src/backend/utils/activity/pgstat_relation.c
index 8a3f7d434cf..54500633be5 100644
--- a/src/backend/utils/activity/pgstat_relation.c
+++ b/src/backend/utils/activity/pgstat_relation.c
@@ -730,7 +730,7 @@ PostPrepare_PgStat_Relations(PgStat_SubXactStatus *xact_state)
  * Load the saved counts into our local pgstats state.
  */
 void
-pgstat_twophase_postcommit(TransactionId xid, uint16 info,
+pgstat_twophase_postcommit(FullTransactionId fxid, uint16 info,
 						   void *recdata, uint32 len)
 {
 	TwoPhasePgStatRecord *rec = (TwoPhasePgStatRecord *) recdata;
@@ -766,7 +766,7 @@ pgstat_twophase_postcommit(TransactionId xid, uint16 info,
  * as aborted.
  */
 void
-pgstat_twophase_postabort(TransactionId xid, uint16 info,
+pgstat_twophase_postabort(FullTransactionId fxid, uint16 info,
 						  void *recdata, uint32 len)
 {
 	TwoPhasePgStatRecord *rec = (TwoPhasePgStatRecord *) recdata;
diff --git a/src/test/recovery/t/009_twophase.pl b/src/test/recovery/t/009_twophase.pl
index 4b3e0f77dc0..c08d7676d66 100644
--- a/src/test/recovery/t/009_twophase.pl
+++ b/src/test/recovery/t/009_twophase.pl
@@ -5,6 +5,7 @@
 use strict;
 use warnings FATAL => 'all';
 
+use File::Copy;
 use PostgreSQL::Test::Cluster;
 use PostgreSQL::Test::Utils;
 use Test::More;
@@ -28,6 +29,15 @@ sub configure_and_reload
 	return;
 }
 
+sub twophase_file_name
+{
+	local $Test::Builder::Level = $Test::Builder::Level + 1;
+
+	my $epoch = shift;
+	my $xid = shift;
+	return sprintf("%08X%08X", $epoch, $xid);
+}
+
 # Set up two nodes, which will alternately be primary and replication standby.
 
 # Setup london node
@@ -572,4 +582,134 @@ my $nsubtrans = $cur_primary->safe_psql('postgres',
 );
 isnt($osubtrans, $nsubtrans, "contents of pg_subtrans/ have changed");
 
+###############################################################################
+# Check handling of already committed or aborted 2PC files at recovery.
+# This test does a manual copy of 2PC files created in a running server,
+# to cheaply emulate situations that could be found in base backups.
+###############################################################################
+
+# Issue a set of transactions that will be used for this portion of the test:
+# - One transaction to hold on the minimum xid horizon at bay.
+# - One transaction that will be found as already committed at recovery.
+# - One transaction that will be fonnd as already rollbacked at recovery.
+$cur_primary->psql(
+	'postgres', "
+	BEGIN;
+	INSERT INTO t_009_tbl VALUES (40, 'transaction: xid horizon');
+	PREPARE TRANSACTION 'xact_009_40';
+	BEGIN;
+	INSERT INTO t_009_tbl VALUES (41, 'transaction: commit-prepared');
+	PREPARE TRANSACTION 'xact_009_41';
+	BEGIN;
+	INSERT INTO t_009_tbl VALUES (42, 'transaction: rollback-prepared');
+	PREPARE TRANSACTION 'xact_009_42';");
+
+# Issue a checkpoint, fixing the XID horizon based on the first transaction,
+# flushing to disk the two files to use.
+$cur_primary->psql('postgres', "CHECKPOINT");
+
+# Get the transaction IDs of the ones to 2PC files to manipulate.
+my $commit_prepared_xid = int(
+	$cur_primary->safe_psql(
+		'postgres',
+		"SELECT transaction FROM pg_prepared_xacts WHERE gid = 'xact_009_41'")
+);
+my $abort_prepared_xid = int(
+	$cur_primary->safe_psql(
+		'postgres',
+		"SELECT transaction FROM pg_prepared_xacts WHERE gid = 'xact_009_42'")
+);
+
+# Copy the two-phase files that will be put back later.  Assume an
+# epoch of 0.
+my $commit_prepared_name = twophase_file_name(0, $commit_prepared_xid);
+my $abort_prepared_name = twophase_file_name(0, $abort_prepared_xid);
+
+my $twophase_tmpdir = $PostgreSQL::Test::Utils::tmp_check . '/' . "2pc_files";
+mkdir($twophase_tmpdir);
+my $primary_twophase_folder = $cur_primary->data_dir . '/pg_twophase/';
+copy("$primary_twophase_folder/$commit_prepared_name", $twophase_tmpdir);
+copy("$primary_twophase_folder/$abort_prepared_name", $twophase_tmpdir);
+
+# Issue abort/commit prepared.
+$cur_primary->psql('postgres', "COMMIT PREPARED 'xact_009_41'");
+$cur_primary->psql('postgres', "ROLLBACK PREPARED 'xact_009_42'");
+
+# Again checkpoint, to advance the LSN past the point where the two previous
+# transaction records would be replayed.
+$cur_primary->psql('postgres', "CHECKPOINT");
+
+# Take down node.
+$cur_primary->teardown_node;
+
+# Move back the two twophase files.
+copy("$twophase_tmpdir/$commit_prepared_name", $primary_twophase_folder);
+copy("$twophase_tmpdir/$abort_prepared_name", $primary_twophase_folder);
+
+# Grab location in logs of primary
+my $log_offset = -s $cur_primary->logfile;
+
+# Start node and check that the two previous files are removed by checking the
+# server logs, following the CLOG lookup done at the end of recovery.
+$cur_primary->start;
+
+$cur_primary->log_check(
+	"two-phase files of committed transactions removed at recovery",
+	$log_offset,
+	log_like => [
+		qr/removing stale two-phase state from memory for transaction $commit_prepared_xid of epoch 0/,
+		qr/removing stale two-phase state from memory for transaction $abort_prepared_xid of epoch 0/
+	]);
+
+# Commit the first transaction.
+$cur_primary->psql('postgres', "COMMIT PREPARED 'xact_009_40'");
+# After replay, there should be no 2PC transactions.
+$cur_primary->psql(
+	'postgres',
+	"SELECT * FROM pg_prepared_xact",
+	stdout => \$psql_out);
+is($psql_out, qq{}, "Check expected pg_prepared_xact data on primary");
+# Data from transactions should be around.
+$cur_primary->psql(
+	'postgres',
+	"SELECT * FROM t_009_tbl WHERE id IN (40, 41, 42);",
+	stdout => \$psql_out);
+is( $psql_out, qq{40|transaction: xid horizon
+41|transaction: commit-prepared},
+	"Check expected table data on primary");
+
+###############################################################################
+# Check handling of orphaned 2PC files at recovery.
+###############################################################################
+
+$cur_standby->teardown_node;
+$cur_primary->teardown_node;
+
+# Grab location in logs of primary
+$log_offset = -s $cur_primary->logfile;
+
+# Create fake files with a transaction ID large or low enough to be in the
+# future or the past, in different epochs, then check that the primary is able
+# to start and remove these files at recovery.
+
+# First bump the epoch with pg_resetwal.
+$cur_primary->command_ok(
+	[ 'pg_resetwal', '-e', 256, '-f', $cur_primary->data_dir ],
+	'bump epoch of primary');
+
+my $future_2pc_file =
+  $cur_primary->data_dir . '/pg_twophase/000001FF00000FFF';
+append_to_file $future_2pc_file, "";
+my $past_2pc_file = $cur_primary->data_dir . '/pg_twophase/000000EE00000FFF';
+append_to_file $past_2pc_file, "";
+
+$cur_primary->start;
+$cur_primary->log_check(
+	"two-phase files removed at recovery",
+	$log_offset,
+	log_like => [
+		qr/removing past two-phase state file for transaction 4095 of epoch 238/,
+		qr/removing future two-phase state file for transaction 4095 of epoch 511/
+	]);
+
 done_testing();
-- 
2.47.2

Michael Paquier

michael@paquier.xyz

12 months ago

In reply to: Michael Paquier (#1)

8 attachment(s)

Re: Issues with 2PC at recovery: CLOG lookups and GlobalTransactionData

On Thu, Jan 30, 2025 at 03:36:20PM +0900, Michael Paquier wrote:

This needs more eyes, so I'll go add an entry in the CF. Patches for
stable branches are labelled with .txt, to avoid any interference with
the CF bot.

Thoughts and comments are welcome.

After sleeping over it, I've been able to come up with a split of the
patches for v17 and HEAD: a first one to refactor the 2PC code to
integrate more FullTransactionIds (down to 17) and a second one to
implement a fix. That should make reviews easier. The second patch
for 17 and HEAD matches with the fixes for v13~v16, all are labelled
with a 0002.
--
Michael

Attachments:

v2-0001-Integrate-more-FullTransactionIds-into-2PC-code-17.txttext/plain; charset=us-asciiDownload

From 2c82b2cc596e711e8d59e2ce1831d56ec052b6ed Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Fri, 31 Jan 2025 09:00:38 +0900
Subject: [PATCH v2 1/2] Integrate more FullTransactionIds into 2PC code

This refactoring will help in fixing a follow-up issue.

Backpatch-through: 17 (?)
---
 src/include/access/multixact.h               |   9 +-
 src/include/access/twophase.h                |  12 +-
 src/include/access/twophase_rmgr.h           |   4 +-
 src/include/pgstat.h                         |   5 +-
 src/include/storage/lock.h                   |  11 +-
 src/include/storage/predicate.h              |   7 +-
 src/backend/access/transam/multixact.c       |  16 +-
 src/backend/access/transam/twophase.c        | 242 +++++++++++--------
 src/backend/access/transam/xact.c            |  13 +-
 src/backend/storage/lmgr/lock.c              |  20 +-
 src/backend/storage/lmgr/predicate.c         |  11 +-
 src/backend/utils/activity/pgstat_relation.c |   4 +-
 12 files changed, 200 insertions(+), 154 deletions(-)

diff --git a/src/include/access/multixact.h b/src/include/access/multixact.h
index 7ffd256c744..7f24cdbc348 100644
--- a/src/include/access/multixact.h
+++ b/src/include/access/multixact.h
@@ -11,6 +11,7 @@
 #ifndef MULTIXACT_H
 #define MULTIXACT_H
 
+#include "access/transam.h"
 #include "access/xlogreader.h"
 #include "lib/stringinfo.h"
 #include "storage/sync.h"
@@ -119,7 +120,7 @@ extern int	multixactmemberssyncfiletag(const FileTag *ftag, char *path);
 
 extern void AtEOXact_MultiXact(void);
 extern void AtPrepare_MultiXact(void);
-extern void PostPrepare_MultiXact(TransactionId xid);
+extern void PostPrepare_MultiXact(FullTransactionId fxid);
 
 extern Size MultiXactShmemSize(void);
 extern void MultiXactShmemInit(void);
@@ -145,11 +146,11 @@ extern void MultiXactAdvanceNextMXact(MultiXactId minMulti,
 extern void MultiXactAdvanceOldest(MultiXactId oldestMulti, Oid oldestMultiDB);
 extern int	MultiXactMemberFreezeThreshold(void);
 
-extern void multixact_twophase_recover(TransactionId xid, uint16 info,
+extern void multixact_twophase_recover(FullTransactionId fxid, uint16 info,
 									   void *recdata, uint32 len);
-extern void multixact_twophase_postcommit(TransactionId xid, uint16 info,
+extern void multixact_twophase_postcommit(FullTransactionId fxid, uint16 info,
 										  void *recdata, uint32 len);
-extern void multixact_twophase_postabort(TransactionId xid, uint16 info,
+extern void multixact_twophase_postabort(FullTransactionId fxid, uint16 info,
 										 void *recdata, uint32 len);
 
 extern void multixact_redo(XLogReaderState *record);
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 56248c00063..93c5caf8309 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -14,6 +14,7 @@
 #ifndef TWOPHASE_H
 #define TWOPHASE_H
 
+#include "access/transam.h"
 #include "access/xact.h"
 #include "access/xlogdefs.h"
 #include "datatype/timestamp.h"
@@ -36,10 +37,10 @@ extern void PostPrepare_Twophase(void);
 
 extern TransactionId TwoPhaseGetXidByVirtualXID(VirtualTransactionId vxid,
 												bool *have_more);
-extern PGPROC *TwoPhaseGetDummyProc(TransactionId xid, bool lock_held);
-extern int	TwoPhaseGetDummyProcNumber(TransactionId xid, bool lock_held);
+extern PGPROC *TwoPhaseGetDummyProc(FullTransactionId fxid, bool lock_held);
+extern int	TwoPhaseGetDummyProcNumber(FullTransactionId fxid, bool lock_held);
 
-extern GlobalTransaction MarkAsPreparing(TransactionId xid, const char *gid,
+extern GlobalTransaction MarkAsPreparing(FullTransactionId fxid, const char *gid,
 										 TimestampTz prepared_at,
 										 Oid owner, Oid databaseid);
 
@@ -56,8 +57,9 @@ extern void CheckPointTwoPhase(XLogRecPtr redo_horizon);
 
 extern void FinishPreparedTransaction(const char *gid, bool isCommit);
 
-extern void PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
-						   XLogRecPtr end_lsn, RepOriginId origin_id);
+extern void PrepareRedoAdd(FullTransactionId fxid, char *buf,
+						   XLogRecPtr start_lsn, XLogRecPtr end_lsn,
+						   RepOriginId origin_id);
 extern void PrepareRedoRemove(TransactionId xid, bool giveWarning);
 extern void restoreTwoPhaseData(void);
 extern bool LookupGXact(const char *gid, XLogRecPtr prepare_end_lsn,
diff --git a/src/include/access/twophase_rmgr.h b/src/include/access/twophase_rmgr.h
index bdd2cb7b339..64db7bade6f 100644
--- a/src/include/access/twophase_rmgr.h
+++ b/src/include/access/twophase_rmgr.h
@@ -14,7 +14,9 @@
 #ifndef TWOPHASE_RMGR_H
 #define TWOPHASE_RMGR_H
 
-typedef void (*TwoPhaseCallback) (TransactionId xid, uint16 info,
+#include "access/transam.h"
+
+typedef void (*TwoPhaseCallback) (FullTransactionId fxid, uint16 info,
 								  void *recdata, uint32 len);
 typedef uint8 TwoPhaseRmgrId;
 
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 2136239710e..f60824de31c 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -11,6 +11,7 @@
 #ifndef PGSTAT_H
 #define PGSTAT_H
 
+#include "access/transam.h"
 #include "datatype/timestamp.h"
 #include "portability/instr_time.h"
 #include "postmaster/pgarch.h"	/* for MAX_XFN_CHARS */
@@ -649,9 +650,9 @@ extern void pgstat_count_heap_delete(Relation rel);
 extern void pgstat_count_truncate(Relation rel);
 extern void pgstat_update_heap_dead_tuples(Relation rel, int delta);
 
-extern void pgstat_twophase_postcommit(TransactionId xid, uint16 info,
+extern void pgstat_twophase_postcommit(FullTransactionId fxid, uint16 info,
 									   void *recdata, uint32 len);
-extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
+extern void pgstat_twophase_postabort(FullTransactionId fxid, uint16 info,
 									  void *recdata, uint32 len);
 
 extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry(Oid relid);
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index cc1f6e78c39..b196b8a1f7f 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -18,6 +18,7 @@
 #error "lock.h may not be included from frontend code"
 #endif
 
+#include "access/transam.h"
 #include "lib/ilist.h"
 #include "storage/lockdefs.h"
 #include "storage/lwlock.h"
@@ -577,7 +578,7 @@ extern bool LockHasWaiters(const LOCKTAG *locktag,
 extern VirtualTransactionId *GetLockConflicts(const LOCKTAG *locktag,
 											  LOCKMODE lockmode, int *countp);
 extern void AtPrepare_Locks(void);
-extern void PostPrepare_Locks(TransactionId xid);
+extern void PostPrepare_Locks(FullTransactionId fxid);
 extern bool LockCheckConflicts(LockMethod lockMethodTable,
 							   LOCKMODE lockmode,
 							   LOCK *lock, PROCLOCK *proclock);
@@ -591,13 +592,13 @@ extern BlockedProcsData *GetBlockerStatusData(int blocked_pid);
 extern xl_standby_lock *GetRunningTransactionLocks(int *nlocks);
 extern const char *GetLockmodeName(LOCKMETHODID lockmethodid, LOCKMODE mode);
 
-extern void lock_twophase_recover(TransactionId xid, uint16 info,
+extern void lock_twophase_recover(FullTransactionId fxid, uint16 info,
 								  void *recdata, uint32 len);
-extern void lock_twophase_postcommit(TransactionId xid, uint16 info,
+extern void lock_twophase_postcommit(FullTransactionId fxid, uint16 info,
 									 void *recdata, uint32 len);
-extern void lock_twophase_postabort(TransactionId xid, uint16 info,
+extern void lock_twophase_postabort(FullTransactionId fxid, uint16 info,
 									void *recdata, uint32 len);
-extern void lock_twophase_standby_recover(TransactionId xid, uint16 info,
+extern void lock_twophase_standby_recover(FullTransactionId fxid, uint16 info,
 										  void *recdata, uint32 len);
 
 extern DeadLockState DeadLockCheck(PGPROC *proc);
diff --git a/src/include/storage/predicate.h b/src/include/storage/predicate.h
index 14ee9b94a2f..bdd5cdd3dbf 100644
--- a/src/include/storage/predicate.h
+++ b/src/include/storage/predicate.h
@@ -14,6 +14,7 @@
 #ifndef PREDICATE_H
 #define PREDICATE_H
 
+#include "access/transam.h"
 #include "storage/lock.h"
 #include "utils/relcache.h"
 #include "utils/snapshot.h"
@@ -71,9 +72,9 @@ extern void PreCommit_CheckForSerializationFailure(void);
 
 /* two-phase commit support */
 extern void AtPrepare_PredicateLocks(void);
-extern void PostPrepare_PredicateLocks(TransactionId xid);
-extern void PredicateLockTwoPhaseFinish(TransactionId xid, bool isCommit);
-extern void predicatelock_twophase_recover(TransactionId xid, uint16 info,
+extern void PostPrepare_PredicateLocks(FullTransactionId fxid);
+extern void PredicateLockTwoPhaseFinish(FullTransactionId xid, bool isCommit);
+extern void predicatelock_twophase_recover(FullTransactionId fxid, uint16 info,
 										   void *recdata, uint32 len);
 
 /* parallel query support */
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index b7b47ef076a..b51221cd1da 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1839,7 +1839,7 @@ AtPrepare_MultiXact(void)
  *		Clean up after successful PREPARE TRANSACTION
  */
 void
-PostPrepare_MultiXact(TransactionId xid)
+PostPrepare_MultiXact(FullTransactionId fxid)
 {
 	MultiXactId myOldestMember;
 
@@ -1850,7 +1850,7 @@ PostPrepare_MultiXact(TransactionId xid)
 	myOldestMember = OldestMemberMXactId[MyProcNumber];
 	if (MultiXactIdIsValid(myOldestMember))
 	{
-		ProcNumber	dummyProcNumber = TwoPhaseGetDummyProcNumber(xid, false);
+		ProcNumber	dummyProcNumber = TwoPhaseGetDummyProcNumber(fxid, false);
 
 		/*
 		 * Even though storing MultiXactId is atomic, acquire lock to make
@@ -1888,10 +1888,10 @@ PostPrepare_MultiXact(TransactionId xid)
  *		Recover the state of a prepared transaction at startup
  */
 void
-multixact_twophase_recover(TransactionId xid, uint16 info,
+multixact_twophase_recover(FullTransactionId fxid, uint16 info,
 						   void *recdata, uint32 len)
 {
-	ProcNumber	dummyProcNumber = TwoPhaseGetDummyProcNumber(xid, false);
+	ProcNumber	dummyProcNumber = TwoPhaseGetDummyProcNumber(fxid, false);
 	MultiXactId oldestMember;
 
 	/*
@@ -1909,10 +1909,10 @@ multixact_twophase_recover(TransactionId xid, uint16 info,
  *		Similar to AtEOXact_MultiXact but for COMMIT PREPARED
  */
 void
-multixact_twophase_postcommit(TransactionId xid, uint16 info,
+multixact_twophase_postcommit(FullTransactionId fxid, uint16 info,
 							  void *recdata, uint32 len)
 {
-	ProcNumber	dummyProcNumber = TwoPhaseGetDummyProcNumber(xid, true);
+	ProcNumber	dummyProcNumber = TwoPhaseGetDummyProcNumber(fxid, true);
 
 	Assert(len == sizeof(MultiXactId));
 
@@ -1924,10 +1924,10 @@ multixact_twophase_postcommit(TransactionId xid, uint16 info,
  *		This is actually just the same as the COMMIT case.
  */
 void
-multixact_twophase_postabort(TransactionId xid, uint16 info,
+multixact_twophase_postabort(FullTransactionId fxid, uint16 info,
 							 void *recdata, uint32 len)
 {
-	multixact_twophase_postcommit(xid, info, recdata, len);
+	multixact_twophase_postcommit(fxid, info, recdata, len);
 }
 
 /*
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 4f78c1dc579..0286336484e 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -159,7 +159,7 @@ typedef struct GlobalTransactionData
 	 */
 	XLogRecPtr	prepare_start_lsn;	/* XLOG offset of prepare record start */
 	XLogRecPtr	prepare_end_lsn;	/* XLOG offset of prepare record end */
-	TransactionId xid;			/* The GXACT id */
+	FullTransactionId fxid;		/* The GXACT full xid */
 
 	Oid			owner;			/* ID of user that executed the xact */
 	ProcNumber	locking_backend;	/* backend currently working on the xact */
@@ -197,6 +197,7 @@ static GlobalTransaction MyLockedGxact = NULL;
 
 static bool twophaseExitRegistered = false;
 
+static void PrepareRedoRemoveFull(FullTransactionId fxid, bool giveWarning);
 static void RecordTransactionCommitPrepared(TransactionId xid,
 											int nchildren,
 											TransactionId *children,
@@ -216,19 +217,19 @@ static void RecordTransactionAbortPrepared(TransactionId xid,
 										   int nstats,
 										   xl_xact_stats_item *stats,
 										   const char *gid);
-static void ProcessRecords(char *bufptr, TransactionId xid,
+static void ProcessRecords(char *bufptr, FullTransactionId fxid,
 						   const TwoPhaseCallback callbacks[]);
 static void RemoveGXact(GlobalTransaction gxact);
 
 static void XlogReadTwoPhaseData(XLogRecPtr lsn, char **buf, int *len);
-static char *ProcessTwoPhaseBuffer(TransactionId xid,
+static char *ProcessTwoPhaseBuffer(FullTransactionId fxid,
 								   XLogRecPtr prepare_start_lsn,
 								   bool fromdisk, bool setParent, bool setNextXid);
-static void MarkAsPreparingGuts(GlobalTransaction gxact, TransactionId xid,
+static void MarkAsPreparingGuts(GlobalTransaction gxact, FullTransactionId fxid,
 								const char *gid, TimestampTz prepared_at, Oid owner,
 								Oid databaseid);
-static void RemoveTwoPhaseFile(TransactionId xid, bool giveWarning);
-static void RecreateTwoPhaseFile(TransactionId xid, void *content, int len);
+static void RemoveTwoPhaseFile(FullTransactionId fxid, bool giveWarning);
+static void RecreateTwoPhaseFile(FullTransactionId fxid, void *content, int len);
 
 /*
  * Initialization of shared memory
@@ -356,7 +357,7 @@ PostPrepare_Twophase(void)
  *		Reserve the GID for the given transaction.
  */
 GlobalTransaction
-MarkAsPreparing(TransactionId xid, const char *gid,
+MarkAsPreparing(FullTransactionId fxid, const char *gid,
 				TimestampTz prepared_at, Oid owner, Oid databaseid)
 {
 	GlobalTransaction gxact;
@@ -407,7 +408,7 @@ MarkAsPreparing(TransactionId xid, const char *gid,
 	gxact = TwoPhaseState->freeGXacts;
 	TwoPhaseState->freeGXacts = gxact->next;
 
-	MarkAsPreparingGuts(gxact, xid, gid, prepared_at, owner, databaseid);
+	MarkAsPreparingGuts(gxact, fxid, gid, prepared_at, owner, databaseid);
 
 	gxact->ondisk = false;
 
@@ -430,11 +431,13 @@ MarkAsPreparing(TransactionId xid, const char *gid,
  * Note: This function should be called with appropriate locks held.
  */
 static void
-MarkAsPreparingGuts(GlobalTransaction gxact, TransactionId xid, const char *gid,
-					TimestampTz prepared_at, Oid owner, Oid databaseid)
+MarkAsPreparingGuts(GlobalTransaction gxact, FullTransactionId fxid,
+					const char *gid, TimestampTz prepared_at, Oid owner,
+					Oid databaseid)
 {
 	PGPROC	   *proc;
 	int			i;
+	TransactionId xid = XidFromFullTransactionId(fxid);
 
 	Assert(LWLockHeldByMeInMode(TwoPhaseStateLock, LW_EXCLUSIVE));
 
@@ -479,7 +482,7 @@ MarkAsPreparingGuts(GlobalTransaction gxact, TransactionId xid, const char *gid,
 	proc->subxidStatus.count = 0;
 
 	gxact->prepared_at = prepared_at;
-	gxact->xid = xid;
+	gxact->fxid = fxid;
 	gxact->owner = owner;
 	gxact->locking_backend = MyProcNumber;
 	gxact->valid = false;
@@ -797,12 +800,12 @@ pg_prepared_xact(PG_FUNCTION_ARGS)
  * caller had better hold it.
  */
 static GlobalTransaction
-TwoPhaseGetGXact(TransactionId xid, bool lock_held)
+TwoPhaseGetGXact(FullTransactionId fxid, bool lock_held)
 {
 	GlobalTransaction result = NULL;
 	int			i;
 
-	static TransactionId cached_xid = InvalidTransactionId;
+	static FullTransactionId cached_fxid = {0};
 	static GlobalTransaction cached_gxact = NULL;
 
 	Assert(!lock_held || LWLockHeldByMe(TwoPhaseStateLock));
@@ -811,7 +814,7 @@ TwoPhaseGetGXact(TransactionId xid, bool lock_held)
 	 * During a recovery, COMMIT PREPARED, or ABORT PREPARED, we'll be called
 	 * repeatedly for the same XID.  We can save work with a simple cache.
 	 */
-	if (xid == cached_xid)
+	if (FullTransactionIdEquals(fxid, cached_fxid))
 		return cached_gxact;
 
 	if (!lock_held)
@@ -821,7 +824,7 @@ TwoPhaseGetGXact(TransactionId xid, bool lock_held)
 	{
 		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
 
-		if (gxact->xid == xid)
+		if (FullTransactionIdEquals(gxact->fxid, fxid))
 		{
 			result = gxact;
 			break;
@@ -832,9 +835,10 @@ TwoPhaseGetGXact(TransactionId xid, bool lock_held)
 		LWLockRelease(TwoPhaseStateLock);
 
 	if (result == NULL)			/* should not happen */
-		elog(ERROR, "failed to find GlobalTransaction for xid %u", xid);
+		elog(ERROR, "failed to find GlobalTransaction for xid %u",
+			 XidFromFullTransactionId(fxid));
 
-	cached_xid = xid;
+	cached_fxid = fxid;
 	cached_gxact = result;
 
 	return result;
@@ -881,7 +885,7 @@ TwoPhaseGetXidByVirtualXID(VirtualTransactionId vxid,
 				*have_more = true;
 				break;
 			}
-			result = gxact->xid;
+			result = XidFromFullTransactionId(gxact->fxid);
 		}
 	}
 
@@ -892,7 +896,7 @@ TwoPhaseGetXidByVirtualXID(VirtualTransactionId vxid,
 
 /*
  * TwoPhaseGetDummyProcNumber
- *		Get the dummy proc number for prepared transaction specified by XID
+ *		Get the dummy proc number for prepared transaction
  *
  * Dummy proc numbers are similar to proc numbers of real backends.  They
  * start at MaxBackends, and are unique across all currently active real
@@ -900,24 +904,24 @@ TwoPhaseGetXidByVirtualXID(VirtualTransactionId vxid,
  * TwoPhaseStateLock will not be taken, so the caller had better hold it.
  */
 ProcNumber
-TwoPhaseGetDummyProcNumber(TransactionId xid, bool lock_held)
+TwoPhaseGetDummyProcNumber(FullTransactionId fxid, bool lock_held)
 {
-	GlobalTransaction gxact = TwoPhaseGetGXact(xid, lock_held);
+	GlobalTransaction gxact = TwoPhaseGetGXact(fxid, lock_held);
 
 	return gxact->pgprocno;
 }
 
 /*
  * TwoPhaseGetDummyProc
- *		Get the PGPROC that represents a prepared transaction specified by XID
+ *		Get the PGPROC that represents a prepared transaction
  *
  * If lock_held is set to true, TwoPhaseStateLock will not be taken, so the
  * caller had better hold it.
  */
 PGPROC *
-TwoPhaseGetDummyProc(TransactionId xid, bool lock_held)
+TwoPhaseGetDummyProc(FullTransactionId fxid, bool lock_held)
 {
-	GlobalTransaction gxact = TwoPhaseGetGXact(xid, lock_held);
+	GlobalTransaction gxact = TwoPhaseGetGXact(fxid, lock_held);
 
 	return GetPGProcByNumber(gxact->pgprocno);
 }
@@ -942,10 +946,8 @@ AdjustToFullTransactionId(TransactionId xid)
 }
 
 static inline int
-TwoPhaseFilePath(char *path, TransactionId xid)
+TwoPhaseFilePath(char *path, FullTransactionId fxid)
 {
-	FullTransactionId fxid = AdjustToFullTransactionId(xid);
-
 	return snprintf(path, MAXPGPATH, TWOPHASE_DIR "/%08X%08X",
 					EpochFromFullTransactionId(fxid),
 					XidFromFullTransactionId(fxid));
@@ -1049,7 +1051,7 @@ void
 StartPrepare(GlobalTransaction gxact)
 {
 	PGPROC	   *proc = GetPGProcByNumber(gxact->pgprocno);
-	TransactionId xid = gxact->xid;
+	TransactionId xid = XidFromFullTransactionId(gxact->fxid);
 	TwoPhaseFileHeader hdr;
 	TransactionId *children;
 	RelFileLocator *commitrels;
@@ -1281,10 +1283,11 @@ RegisterTwoPhaseRecord(TwoPhaseRmgrId rmid, uint16 info,
  * If it looks OK (has a valid magic number and CRC), return the palloc'd
  * contents of the file, issuing an error when finding corrupted data.  If
  * missing_ok is true, which indicates that missing files can be safely
- * ignored, then return NULL.  This state can be reached when doing recovery.
+ * ignored, then return NULL.  This state can be reached when doing recovery
+ * after discarding two-phase files from frozen epochs.
  */
 static char *
-ReadTwoPhaseFile(TransactionId xid, bool missing_ok)
+ReadTwoPhaseFile(FullTransactionId fxid, bool missing_ok)
 {
 	char		path[MAXPGPATH];
 	char	   *buf;
@@ -1296,7 +1299,7 @@ ReadTwoPhaseFile(TransactionId xid, bool missing_ok)
 				file_crc;
 	int			r;
 
-	TwoPhaseFilePath(path, xid);
+	TwoPhaseFilePath(path, fxid);
 
 	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
 	if (fd < 0)
@@ -1461,6 +1464,7 @@ StandbyTransactionIdIsPrepared(TransactionId xid)
 	char	   *buf;
 	TwoPhaseFileHeader *hdr;
 	bool		result;
+	FullTransactionId fxid;
 
 	Assert(TransactionIdIsValid(xid));
 
@@ -1468,7 +1472,8 @@ StandbyTransactionIdIsPrepared(TransactionId xid)
 		return false;			/* nothing to do */
 
 	/* Read and validate file */
-	buf = ReadTwoPhaseFile(xid, true);
+	fxid = FullTransactionIdFromAllowableAt(TransamVariables->nextXid, xid);
+	buf = ReadTwoPhaseFile(fxid, true);
 	if (buf == NULL)
 		return false;
 
@@ -1488,6 +1493,7 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
 {
 	GlobalTransaction gxact;
 	PGPROC	   *proc;
+	FullTransactionId fxid;
 	TransactionId xid;
 	bool		ondisk;
 	char	   *buf;
@@ -1509,7 +1515,8 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
 	 */
 	gxact = LockGXact(gid, GetUserId());
 	proc = GetPGProcByNumber(gxact->pgprocno);
-	xid = gxact->xid;
+	fxid = gxact->fxid;
+	xid = XidFromFullTransactionId(fxid);
 
 	/*
 	 * Read and validate 2PC state data. State data will typically be stored
@@ -1517,7 +1524,7 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
 	 * to disk if for some reason they have lived for a long time.
 	 */
 	if (gxact->ondisk)
-		buf = ReadTwoPhaseFile(xid, false);
+		buf = ReadTwoPhaseFile(fxid, false);
 	else
 		XlogReadTwoPhaseData(gxact->prepare_start_lsn, &buf, NULL);
 
@@ -1636,11 +1643,11 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
 
 	/* And now do the callbacks */
 	if (isCommit)
-		ProcessRecords(bufptr, xid, twophase_postcommit_callbacks);
+		ProcessRecords(bufptr, fxid, twophase_postcommit_callbacks);
 	else
-		ProcessRecords(bufptr, xid, twophase_postabort_callbacks);
+		ProcessRecords(bufptr, fxid, twophase_postabort_callbacks);
 
-	PredicateLockTwoPhaseFinish(xid, isCommit);
+	PredicateLockTwoPhaseFinish(fxid, isCommit);
 
 	/*
 	 * Read this value while holding the two-phase lock, as the on-disk 2PC
@@ -1664,7 +1671,7 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
 	 * And now we can clean up any files we may have left.
 	 */
 	if (ondisk)
-		RemoveTwoPhaseFile(xid, true);
+		RemoveTwoPhaseFile(fxid, true);
 
 	MyLockedGxact = NULL;
 
@@ -1677,7 +1684,7 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
  * Scan 2PC state data in memory and call the indicated callbacks for each 2PC record.
  */
 static void
-ProcessRecords(char *bufptr, TransactionId xid,
+ProcessRecords(char *bufptr, FullTransactionId fxid,
 			   const TwoPhaseCallback callbacks[])
 {
 	for (;;)
@@ -1691,7 +1698,7 @@ ProcessRecords(char *bufptr, TransactionId xid,
 		bufptr += MAXALIGN(sizeof(TwoPhaseRecordOnDisk));
 
 		if (callbacks[record->rmid] != NULL)
-			callbacks[record->rmid] (xid, record->info,
+			callbacks[record->rmid] (fxid, record->info,
 									 (void *) bufptr, record->len);
 
 		bufptr += MAXALIGN(record->len);
@@ -1699,17 +1706,21 @@ ProcessRecords(char *bufptr, TransactionId xid,
 }
 
 /*
- * Remove the 2PC file for the specified XID.
+ * Remove the 2PC file.
  *
  * If giveWarning is false, do not complain about file-not-present;
  * this is an expected case during WAL replay.
+ *
+ * This routine is used at early stages at recovery where future and
+ * past orphaned files are checked, hence the FullTransactionId to build
+ * a complete file name fit for the removal.
  */
 static void
-RemoveTwoPhaseFile(TransactionId xid, bool giveWarning)
+RemoveTwoPhaseFile(FullTransactionId fxid, bool giveWarning)
 {
 	char		path[MAXPGPATH];
 
-	TwoPhaseFilePath(path, xid);
+	TwoPhaseFilePath(path, fxid);
 	if (unlink(path))
 		if (errno != ENOENT || giveWarning)
 			ereport(WARNING,
@@ -1724,7 +1735,7 @@ RemoveTwoPhaseFile(TransactionId xid, bool giveWarning)
  * Note: content and len don't include CRC.
  */
 static void
-RecreateTwoPhaseFile(TransactionId xid, void *content, int len)
+RecreateTwoPhaseFile(FullTransactionId fxid, void *content, int len)
 {
 	char		path[MAXPGPATH];
 	pg_crc32c	statefile_crc;
@@ -1735,7 +1746,7 @@ RecreateTwoPhaseFile(TransactionId xid, void *content, int len)
 	COMP_CRC32C(statefile_crc, content, len);
 	FIN_CRC32C(statefile_crc);
 
-	TwoPhaseFilePath(path, xid);
+	TwoPhaseFilePath(path, fxid);
 
 	fd = OpenTransientFile(path,
 						   O_CREAT | O_TRUNC | O_WRONLY | PG_BINARY);
@@ -1847,7 +1858,7 @@ CheckPointTwoPhase(XLogRecPtr redo_horizon)
 			int			len;
 
 			XlogReadTwoPhaseData(gxact->prepare_start_lsn, &buf, &len);
-			RecreateTwoPhaseFile(gxact->xid, buf, len);
+			RecreateTwoPhaseFile(gxact->fxid, buf, len);
 			gxact->ondisk = true;
 			gxact->prepare_start_lsn = InvalidXLogRecPtr;
 			gxact->prepare_end_lsn = InvalidXLogRecPtr;
@@ -1898,19 +1909,17 @@ restoreTwoPhaseData(void)
 		if (strlen(clde->d_name) == 16 &&
 			strspn(clde->d_name, "0123456789ABCDEF") == 16)
 		{
-			TransactionId xid;
 			FullTransactionId fxid;
 			char	   *buf;
 
 			fxid = FullTransactionIdFromU64(strtou64(clde->d_name, NULL, 16));
-			xid = XidFromFullTransactionId(fxid);
 
-			buf = ProcessTwoPhaseBuffer(xid, InvalidXLogRecPtr,
+			buf = ProcessTwoPhaseBuffer(fxid, InvalidXLogRecPtr,
 										true, false, false);
 			if (buf == NULL)
 				continue;
 
-			PrepareRedoAdd(buf, InvalidXLogRecPtr,
+			PrepareRedoAdd(fxid, buf, InvalidXLogRecPtr,
 						   InvalidXLogRecPtr, InvalidRepOriginId);
 		}
 	}
@@ -1969,9 +1978,8 @@ PrescanPreparedTransactions(TransactionId **xids_p, int *nxids_p)
 
 		Assert(gxact->inredo);
 
-		xid = gxact->xid;
-
-		buf = ProcessTwoPhaseBuffer(xid,
+		xid = XidFromFullTransactionId(gxact->fxid);
+		buf = ProcessTwoPhaseBuffer(gxact->fxid,
 									gxact->prepare_start_lsn,
 									gxact->ondisk, false, true);
 
@@ -2037,15 +2045,12 @@ StandbyRecoverPreparedTransactions(void)
 	LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
 	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
 	{
-		TransactionId xid;
 		char	   *buf;
 		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
 
 		Assert(gxact->inredo);
 
-		xid = gxact->xid;
-
-		buf = ProcessTwoPhaseBuffer(xid,
+		buf = ProcessTwoPhaseBuffer(gxact->fxid,
 									gxact->prepare_start_lsn,
 									gxact->ondisk, true, false);
 		if (buf != NULL)
@@ -2078,16 +2083,14 @@ RecoverPreparedTransactions(void)
 	LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
 	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
 	{
-		TransactionId xid;
 		char	   *buf;
 		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+		FullTransactionId fxid = gxact->fxid;
 		char	   *bufptr;
 		TwoPhaseFileHeader *hdr;
 		TransactionId *subxids;
 		const char *gid;
 
-		xid = gxact->xid;
-
 		/*
 		 * Reconstruct subtrans state for the transaction --- needed because
 		 * pg_subtrans is not preserved over a restart.  Note that we are
@@ -2097,17 +2100,20 @@ RecoverPreparedTransactions(void)
 		 * SubTransSetParent has been set before, if the prepared transaction
 		 * generated xid assignment records.
 		 */
-		buf = ProcessTwoPhaseBuffer(xid,
+		buf = ProcessTwoPhaseBuffer(gxact->fxid,
 									gxact->prepare_start_lsn,
 									gxact->ondisk, true, false);
 		if (buf == NULL)
 			continue;
 
 		ereport(LOG,
-				(errmsg("recovering prepared transaction %u from shared memory", xid)));
+				(errmsg("recovering prepared transaction %u of epoch %u from shared memory",
+						XidFromFullTransactionId(gxact->fxid),
+						EpochFromFullTransactionId(gxact->fxid))));
 
 		hdr = (TwoPhaseFileHeader *) buf;
-		Assert(TransactionIdEquals(hdr->xid, xid));
+		Assert(TransactionIdEquals(hdr->xid,
+								   XidFromFullTransactionId(gxact->fxid)));
 		bufptr = buf + MAXALIGN(sizeof(TwoPhaseFileHeader));
 		gid = (const char *) bufptr;
 		bufptr += MAXALIGN(hdr->gidlen);
@@ -2123,7 +2129,7 @@ RecoverPreparedTransactions(void)
 		 * Recreate its GXACT and dummy PGPROC. But, check whether it was
 		 * added in redo and already has a shmem entry for it.
 		 */
-		MarkAsPreparingGuts(gxact, xid, gid,
+		MarkAsPreparingGuts(gxact, gxact->fxid, gid,
 							hdr->prepared_at,
 							hdr->owner, hdr->database);
 
@@ -2138,7 +2144,7 @@ RecoverPreparedTransactions(void)
 		/*
 		 * Recover other state (notably locks) using resource managers.
 		 */
-		ProcessRecords(bufptr, xid, twophase_recover_callbacks);
+		ProcessRecords(bufptr, fxid, twophase_recover_callbacks);
 
 		/*
 		 * Release locks held by the standby process after we process each
@@ -2146,7 +2152,7 @@ RecoverPreparedTransactions(void)
 		 * additional locks at any one time.
 		 */
 		if (InHotStandby)
-			StandbyReleaseLockTree(xid, hdr->nsubxacts, subxids);
+			StandbyReleaseLockTree(hdr->xid, hdr->nsubxacts, subxids);
 
 		/*
 		 * We're done with recovering this transaction. Clear MyLockedGxact,
@@ -2165,7 +2171,7 @@ RecoverPreparedTransactions(void)
 /*
  * ProcessTwoPhaseBuffer
  *
- * Given a transaction id, read it either from disk or read it directly
+ * Given a FullTransactionId, read it either from disk or read it directly
  * via shmem xlog record pointer using the provided "prepare_start_lsn".
  *
  * If setParent is true, set up subtransaction parent linkages.
@@ -2174,13 +2180,12 @@ RecoverPreparedTransactions(void)
  * value scanned.
  */
 static char *
-ProcessTwoPhaseBuffer(TransactionId xid,
+ProcessTwoPhaseBuffer(FullTransactionId fxid,
 					  XLogRecPtr prepare_start_lsn,
 					  bool fromdisk,
 					  bool setParent, bool setNextXid)
 {
 	FullTransactionId nextXid = TransamVariables->nextXid;
-	TransactionId origNextXid = XidFromFullTransactionId(nextXid);
 	TransactionId *subxids;
 	char	   *buf;
 	TwoPhaseFileHeader *hdr;
@@ -2192,41 +2197,46 @@ ProcessTwoPhaseBuffer(TransactionId xid,
 		Assert(prepare_start_lsn != InvalidXLogRecPtr);
 
 	/* Already processed? */
-	if (TransactionIdDidCommit(xid) || TransactionIdDidAbort(xid))
+	if (TransactionIdDidCommit(XidFromFullTransactionId(fxid)) ||
+		TransactionIdDidAbort(XidFromFullTransactionId(fxid)))
 	{
 		if (fromdisk)
 		{
 			ereport(WARNING,
-					(errmsg("removing stale two-phase state file for transaction %u",
-							xid)));
-			RemoveTwoPhaseFile(xid, true);
+					(errmsg("removing stale two-phase state file for transaction %u of epoch %u",
+							XidFromFullTransactionId(fxid),
+							EpochFromFullTransactionId(fxid))));
+			RemoveTwoPhaseFile(fxid, true);
 		}
 		else
 		{
 			ereport(WARNING,
-					(errmsg("removing stale two-phase state from memory for transaction %u",
-							xid)));
-			PrepareRedoRemove(xid, true);
+					(errmsg("removing stale two-phase state from memory for transaction %u of epoch %u",
+							XidFromFullTransactionId(fxid),
+							EpochFromFullTransactionId(fxid))));
+			PrepareRedoRemoveFull(fxid, true);
 		}
 		return NULL;
 	}
 
 	/* Reject XID if too new */
-	if (TransactionIdFollowsOrEquals(xid, origNextXid))
+	if (FullTransactionIdFollowsOrEquals(fxid, nextXid))
 	{
 		if (fromdisk)
 		{
 			ereport(WARNING,
-					(errmsg("removing future two-phase state file for transaction %u",
-							xid)));
-			RemoveTwoPhaseFile(xid, true);
+					(errmsg("removing future two-phase state file for transaction %u of epoch %u",
+							XidFromFullTransactionId(fxid),
+							EpochFromFullTransactionId(fxid))));
+			RemoveTwoPhaseFile(fxid, true);
 		}
 		else
 		{
 			ereport(WARNING,
-					(errmsg("removing future two-phase state from memory for transaction %u",
-							xid)));
-			PrepareRedoRemove(xid, true);
+					(errmsg("removing future two-phase state from memory for transaction %u of epoch %u",
+							XidFromFullTransactionId(fxid),
+							EpochFromFullTransactionId(fxid))));
+			PrepareRedoRemoveFull(fxid, true);
 		}
 		return NULL;
 	}
@@ -2234,7 +2244,7 @@ ProcessTwoPhaseBuffer(TransactionId xid,
 	if (fromdisk)
 	{
 		/* Read and validate file */
-		buf = ReadTwoPhaseFile(xid, false);
+		buf = ReadTwoPhaseFile(fxid, false);
 	}
 	else
 	{
@@ -2244,18 +2254,20 @@ ProcessTwoPhaseBuffer(TransactionId xid,
 
 	/* Deconstruct header */
 	hdr = (TwoPhaseFileHeader *) buf;
-	if (!TransactionIdEquals(hdr->xid, xid))
+	if (!TransactionIdEquals(hdr->xid, XidFromFullTransactionId(fxid)))
 	{
 		if (fromdisk)
 			ereport(ERROR,
 					(errcode(ERRCODE_DATA_CORRUPTED),
-					 errmsg("corrupted two-phase state file for transaction %u",
-							xid)));
+					 errmsg("corrupted two-phase state file for transaction %u of epoch %u",
+							XidFromFullTransactionId(fxid),
+							EpochFromFullTransactionId(fxid))));
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_DATA_CORRUPTED),
-					 errmsg("corrupted two-phase state in memory for transaction %u",
-							xid)));
+					 errmsg("corrupted two-phase state in memory for transaction %u of epoch %u",
+							XidFromFullTransactionId(fxid),
+							EpochFromFullTransactionId(fxid))));
 	}
 
 	/*
@@ -2269,14 +2281,14 @@ ProcessTwoPhaseBuffer(TransactionId xid,
 	{
 		TransactionId subxid = subxids[i];
 
-		Assert(TransactionIdFollows(subxid, xid));
+		Assert(TransactionIdFollows(subxid, XidFromFullTransactionId(fxid)));
 
 		/* update nextXid if needed */
 		if (setNextXid)
 			AdvanceNextFullTransactionIdPastXid(subxid);
 
 		if (setParent)
-			SubTransSetParent(subxid, xid);
+			SubTransSetParent(subxid, XidFromFullTransactionId(fxid));
 	}
 
 	return buf;
@@ -2467,8 +2479,9 @@ RecordTransactionAbortPrepared(TransactionId xid,
  * data, the entry is marked as located on disk.
  */
 void
-PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
-			   XLogRecPtr end_lsn, RepOriginId origin_id)
+PrepareRedoAdd(FullTransactionId fxid, char *buf,
+			   XLogRecPtr start_lsn, XLogRecPtr end_lsn,
+			   RepOriginId origin_id)
 {
 	TwoPhaseFileHeader *hdr = (TwoPhaseFileHeader *) buf;
 	char	   *bufptr;
@@ -2478,6 +2491,10 @@ PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 	Assert(LWLockHeldByMeInMode(TwoPhaseStateLock, LW_EXCLUSIVE));
 	Assert(RecoveryInProgress());
 
+	if (!FullTransactionIdIsValid(fxid))
+		fxid = FullTransactionIdFromAllowableAt(TransamVariables->nextXid,
+												hdr->xid);
+
 	bufptr = buf + MAXALIGN(sizeof(TwoPhaseFileHeader));
 	gid = (const char *) bufptr;
 
@@ -2506,7 +2523,8 @@ PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 	{
 		char		path[MAXPGPATH];
 
-		TwoPhaseFilePath(path, hdr->xid);
+		Assert(InRecovery);
+		TwoPhaseFilePath(path, fxid);
 
 		if (access(path, F_OK) == 0)
 		{
@@ -2537,7 +2555,7 @@ PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 	gxact->prepared_at = hdr->prepared_at;
 	gxact->prepare_start_lsn = start_lsn;
 	gxact->prepare_end_lsn = end_lsn;
-	gxact->xid = hdr->xid;
+	gxact->fxid = fxid;
 	gxact->owner = hdr->owner;
 	gxact->locking_backend = INVALID_PROC_NUMBER;
 	gxact->valid = false;
@@ -2556,11 +2574,13 @@ PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 						   false /* backward */ , false /* WAL */ );
 	}
 
-	elog(DEBUG2, "added 2PC data in shared memory for transaction %u", gxact->xid);
+	elog(DEBUG2, "added 2PC data in shared memory for transaction %u of epoch %u",
+		 XidFromFullTransactionId(gxact->fxid),
+		 EpochFromFullTransactionId(gxact->fxid));
 }
 
 /*
- * PrepareRedoRemove
+ * PrepareRedoRemoveFull
  *
  * Remove the corresponding gxact entry from TwoPhaseState. Also remove
  * the 2PC file if a prepared transaction was saved via an earlier checkpoint.
@@ -2568,8 +2588,8 @@ PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
  * Caller must hold TwoPhaseStateLock in exclusive mode, because TwoPhaseState
  * is updated.
  */
-void
-PrepareRedoRemove(TransactionId xid, bool giveWarning)
+static void
+PrepareRedoRemoveFull(FullTransactionId fxid, bool giveWarning)
 {
 	GlobalTransaction gxact = NULL;
 	int			i;
@@ -2582,7 +2602,7 @@ PrepareRedoRemove(TransactionId xid, bool giveWarning)
 	{
 		gxact = TwoPhaseState->prepXacts[i];
 
-		if (gxact->xid == xid)
+		if (FullTransactionIdEquals(gxact->fxid, fxid))
 		{
 			Assert(gxact->inredo);
 			found = true;
@@ -2599,12 +2619,28 @@ PrepareRedoRemove(TransactionId xid, bool giveWarning)
 	/*
 	 * And now we can clean up any files we may have left.
 	 */
-	elog(DEBUG2, "removing 2PC data for transaction %u", xid);
+	elog(DEBUG2, "removing 2PC data for transaction %u of epoch %u ",
+		 XidFromFullTransactionId(fxid),
+		 EpochFromFullTransactionId(fxid));
+
 	if (gxact->ondisk)
-		RemoveTwoPhaseFile(xid, giveWarning);
+		RemoveTwoPhaseFile(fxid, giveWarning);
+
 	RemoveGXact(gxact);
 }
 
+/*
+ * Wrapper of PrepareRedoRemoveFull(), for TransactionIds.
+ */
+void
+PrepareRedoRemove(TransactionId xid, bool giveWarning)
+{
+	FullTransactionId fxid =
+		FullTransactionIdFromAllowableAt(TransamVariables->nextXid, xid);
+
+	PrepareRedoRemoveFull(fxid, giveWarning);
+}
+
 /*
  * LookupGXact
  *		Check if the prepared transaction with the given GID, lsn and timestamp
@@ -2649,7 +2685,7 @@ LookupGXact(const char *gid, XLogRecPtr prepare_end_lsn,
 			 * between publisher and subscriber.
 			 */
 			if (gxact->ondisk)
-				buf = ReadTwoPhaseFile(gxact->xid, false);
+				buf = ReadTwoPhaseFile(gxact->fxid, false);
 			else
 			{
 				Assert(gxact->prepare_start_lsn);
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 4cecf630060..89ed5b726c8 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2460,7 +2460,7 @@ static void
 PrepareTransaction(void)
 {
 	TransactionState s = CurrentTransactionState;
-	TransactionId xid = GetCurrentTransactionId();
+	FullTransactionId fxid = GetCurrentFullTransactionId();
 	GlobalTransaction gxact;
 	TimestampTz prepared_at;
 
@@ -2589,7 +2589,7 @@ PrepareTransaction(void)
 	 * Reserve the GID for this transaction. This could fail if the requested
 	 * GID is invalid or already in use.
 	 */
-	gxact = MarkAsPreparing(xid, prepareGID, prepared_at,
+	gxact = MarkAsPreparing(fxid, prepareGID, prepared_at,
 							GetUserId(), MyDatabaseId);
 	prepareGID = NULL;
 
@@ -2639,7 +2639,7 @@ PrepareTransaction(void)
 	 * ProcArrayClearTransaction().  Otherwise, a GetLockConflicts() would
 	 * conclude "xact already committed or aborted" for our locks.
 	 */
-	PostPrepare_Locks(xid);
+	PostPrepare_Locks(fxid);
 
 	/*
 	 * Let others know about no transaction in progress by me.  This has to be
@@ -2678,9 +2678,9 @@ PrepareTransaction(void)
 
 	PostPrepare_smgr();
 
-	PostPrepare_MultiXact(xid);
+	PostPrepare_MultiXact(fxid);
 
-	PostPrepare_PredicateLocks(xid);
+	PostPrepare_PredicateLocks(fxid);
 
 	ResourceOwnerRelease(TopTransactionResourceOwner,
 						 RESOURCE_RELEASE_LOCKS,
@@ -6358,7 +6358,8 @@ xact_redo(XLogReaderState *record)
 		 * gxact entry.
 		 */
 		LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
-		PrepareRedoAdd(XLogRecGetData(record),
+		PrepareRedoAdd(InvalidFullTransactionId,
+					   XLogRecGetData(record),
 					   record->ReadRecPtr,
 					   record->EndRecPtr,
 					   XLogRecGetOrigin(record));
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index e5e7ab55716..7e49c8d7d8a 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -3397,9 +3397,9 @@ AtPrepare_Locks(void)
  * but that probably costs more cycles.
  */
 void
-PostPrepare_Locks(TransactionId xid)
+PostPrepare_Locks(FullTransactionId fxid)
 {
-	PGPROC	   *newproc = TwoPhaseGetDummyProc(xid, false);
+	PGPROC	   *newproc = TwoPhaseGetDummyProc(fxid, false);
 	HASH_SEQ_STATUS status;
 	LOCALLOCK  *locallock;
 	LOCK	   *lock;
@@ -4171,11 +4171,11 @@ DumpAllLocks(void)
  * and PANIC anyway.
  */
 void
-lock_twophase_recover(TransactionId xid, uint16 info,
+lock_twophase_recover(FullTransactionId fxid, uint16 info,
 					  void *recdata, uint32 len)
 {
 	TwoPhaseLockRecord *rec = (TwoPhaseLockRecord *) recdata;
-	PGPROC	   *proc = TwoPhaseGetDummyProc(xid, false);
+	PGPROC	   *proc = TwoPhaseGetDummyProc(fxid, false);
 	LOCKTAG    *locktag;
 	LOCKMODE	lockmode;
 	LOCKMETHODID lockmethodid;
@@ -4352,7 +4352,7 @@ lock_twophase_recover(TransactionId xid, uint16 info,
  * starting up into hot standby mode.
  */
 void
-lock_twophase_standby_recover(TransactionId xid, uint16 info,
+lock_twophase_standby_recover(FullTransactionId fxid, uint16 info,
 							  void *recdata, uint32 len)
 {
 	TwoPhaseLockRecord *rec = (TwoPhaseLockRecord *) recdata;
@@ -4371,7 +4371,7 @@ lock_twophase_standby_recover(TransactionId xid, uint16 info,
 	if (lockmode == AccessExclusiveLock &&
 		locktag->locktag_type == LOCKTAG_RELATION)
 	{
-		StandbyAcquireAccessExclusiveLock(xid,
+		StandbyAcquireAccessExclusiveLock(XidFromFullTransactionId(fxid),
 										  locktag->locktag_field1 /* dboid */ ,
 										  locktag->locktag_field2 /* reloid */ );
 	}
@@ -4384,11 +4384,11 @@ lock_twophase_standby_recover(TransactionId xid, uint16 info,
  * Find and release the lock indicated by the 2PC record.
  */
 void
-lock_twophase_postcommit(TransactionId xid, uint16 info,
+lock_twophase_postcommit(FullTransactionId fxid, uint16 info,
 						 void *recdata, uint32 len)
 {
 	TwoPhaseLockRecord *rec = (TwoPhaseLockRecord *) recdata;
-	PGPROC	   *proc = TwoPhaseGetDummyProc(xid, true);
+	PGPROC	   *proc = TwoPhaseGetDummyProc(fxid, true);
 	LOCKTAG    *locktag;
 	LOCKMETHODID lockmethodid;
 	LockMethod	lockMethodTable;
@@ -4410,10 +4410,10 @@ lock_twophase_postcommit(TransactionId xid, uint16 info,
  * This is actually just the same as the COMMIT case.
  */
 void
-lock_twophase_postabort(TransactionId xid, uint16 info,
+lock_twophase_postabort(FullTransactionId fxid, uint16 info,
 						void *recdata, uint32 len)
 {
-	lock_twophase_postcommit(xid, info, recdata, len);
+	lock_twophase_postcommit(fxid, info, recdata, len);
 }
 
 /*
diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index 2c87273e17e..82a58f54901 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -191,7 +191,7 @@
  *		AtPrepare_PredicateLocks(void);
  *		PostPrepare_PredicateLocks(TransactionId xid);
  *		PredicateLockTwoPhaseFinish(TransactionId xid, bool isCommit);
- *		predicatelock_twophase_recover(TransactionId xid, uint16 info,
+ *		predicatelock_twophase_recover(FullTransactionId fxid, uint16 info,
  *									   void *recdata, uint32 len);
  */
 
@@ -4846,7 +4846,7 @@ AtPrepare_PredicateLocks(void)
  *		anyway. We only need to clean up our local state.
  */
 void
-PostPrepare_PredicateLocks(TransactionId xid)
+PostPrepare_PredicateLocks(FullTransactionId fxid)
 {
 	if (MySerializableXact == InvalidSerializableXact)
 		return;
@@ -4869,12 +4869,12 @@ PostPrepare_PredicateLocks(TransactionId xid)
  *		commits or aborts.
  */
 void
-PredicateLockTwoPhaseFinish(TransactionId xid, bool isCommit)
+PredicateLockTwoPhaseFinish(FullTransactionId fxid, bool isCommit)
 {
 	SERIALIZABLEXID *sxid;
 	SERIALIZABLEXIDTAG sxidtag;
 
-	sxidtag.xid = xid;
+	sxidtag.xid = XidFromFullTransactionId(fxid);
 
 	LWLockAcquire(SerializableXactHashLock, LW_SHARED);
 	sxid = (SERIALIZABLEXID *)
@@ -4896,10 +4896,11 @@ PredicateLockTwoPhaseFinish(TransactionId xid, bool isCommit)
  * Re-acquire a predicate lock belonging to a transaction that was prepared.
  */
 void
-predicatelock_twophase_recover(TransactionId xid, uint16 info,
+predicatelock_twophase_recover(FullTransactionId fxid, uint16 info,
 							   void *recdata, uint32 len)
 {
 	TwoPhasePredicateRecord *record;
+	TransactionId xid = XidFromFullTransactionId(fxid);
 
 	Assert(len == sizeof(TwoPhasePredicateRecord));
 
diff --git a/src/backend/utils/activity/pgstat_relation.c b/src/backend/utils/activity/pgstat_relation.c
index 8a3f7d434cf..54500633be5 100644
--- a/src/backend/utils/activity/pgstat_relation.c
+++ b/src/backend/utils/activity/pgstat_relation.c
@@ -730,7 +730,7 @@ PostPrepare_PgStat_Relations(PgStat_SubXactStatus *xact_state)
  * Load the saved counts into our local pgstats state.
  */
 void
-pgstat_twophase_postcommit(TransactionId xid, uint16 info,
+pgstat_twophase_postcommit(FullTransactionId fxid, uint16 info,
 						   void *recdata, uint32 len)
 {
 	TwoPhasePgStatRecord *rec = (TwoPhasePgStatRecord *) recdata;
@@ -766,7 +766,7 @@ pgstat_twophase_postcommit(TransactionId xid, uint16 info,
  * as aborted.
  */
 void
-pgstat_twophase_postabort(TransactionId xid, uint16 info,
+pgstat_twophase_postabort(FullTransactionId fxid, uint16 info,
 						  void *recdata, uint32 len)
 {
 	TwoPhasePgStatRecord *rec = (TwoPhasePgStatRecord *) recdata;
-- 
2.47.2

v2-0001-Integrate-more-FullTransactionIds-into-2PC-master.patchtext/x-diff; charset=us-asciiDownload

From b5f64b910366ef0b0992ea878d1b1d2af63c8d87 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Fri, 31 Jan 2025 08:45:10 +0900
Subject: [PATCH v2 1/2] Integrate more FullTransactionIds into 2PC code

This refactoring will help in fixing a follow-up issue.

Backpatch-through: 17 (?)
---
 src/include/access/multixact.h               |   9 +-
 src/include/access/twophase.h                |  12 +-
 src/include/access/twophase_rmgr.h           |   4 +-
 src/include/pgstat.h                         |   4 +-
 src/include/storage/lock.h                   |  11 +-
 src/include/storage/predicate.h              |   7 +-
 src/backend/access/transam/multixact.c       |  16 +-
 src/backend/access/transam/twophase.c        | 242 +++++++++++--------
 src/backend/access/transam/xact.c            |  13 +-
 src/backend/storage/lmgr/lock.c              |  20 +-
 src/backend/storage/lmgr/predicate.c         |  11 +-
 src/backend/utils/activity/pgstat_relation.c |   4 +-
 12 files changed, 199 insertions(+), 154 deletions(-)

diff --git a/src/include/access/multixact.h b/src/include/access/multixact.h
index 4e6b0eec2ff..b876e98f46e 100644
--- a/src/include/access/multixact.h
+++ b/src/include/access/multixact.h
@@ -11,6 +11,7 @@
 #ifndef MULTIXACT_H
 #define MULTIXACT_H
 
+#include "access/transam.h"
 #include "access/xlogreader.h"
 #include "lib/stringinfo.h"
 #include "storage/sync.h"
@@ -119,7 +120,7 @@ extern int	multixactmemberssyncfiletag(const FileTag *ftag, char *path);
 
 extern void AtEOXact_MultiXact(void);
 extern void AtPrepare_MultiXact(void);
-extern void PostPrepare_MultiXact(TransactionId xid);
+extern void PostPrepare_MultiXact(FullTransactionId fxid);
 
 extern Size MultiXactShmemSize(void);
 extern void MultiXactShmemInit(void);
@@ -145,11 +146,11 @@ extern void MultiXactAdvanceNextMXact(MultiXactId minMulti,
 extern void MultiXactAdvanceOldest(MultiXactId oldestMulti, Oid oldestMultiDB);
 extern int	MultiXactMemberFreezeThreshold(void);
 
-extern void multixact_twophase_recover(TransactionId xid, uint16 info,
+extern void multixact_twophase_recover(FullTransactionId fxid, uint16 info,
 									   void *recdata, uint32 len);
-extern void multixact_twophase_postcommit(TransactionId xid, uint16 info,
+extern void multixact_twophase_postcommit(FullTransactionId fxid, uint16 info,
 										  void *recdata, uint32 len);
-extern void multixact_twophase_postabort(TransactionId xid, uint16 info,
+extern void multixact_twophase_postabort(FullTransactionId fxid, uint16 info,
 										 void *recdata, uint32 len);
 
 extern void multixact_redo(XLogReaderState *record);
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 9fa82355033..0ab8b3e64a7 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -14,6 +14,7 @@
 #ifndef TWOPHASE_H
 #define TWOPHASE_H
 
+#include "access/transam.h"
 #include "access/xact.h"
 #include "access/xlogdefs.h"
 #include "datatype/timestamp.h"
@@ -36,10 +37,10 @@ extern void PostPrepare_Twophase(void);
 
 extern TransactionId TwoPhaseGetXidByVirtualXID(VirtualTransactionId vxid,
 												bool *have_more);
-extern PGPROC *TwoPhaseGetDummyProc(TransactionId xid, bool lock_held);
-extern int	TwoPhaseGetDummyProcNumber(TransactionId xid, bool lock_held);
+extern PGPROC *TwoPhaseGetDummyProc(FullTransactionId fxid, bool lock_held);
+extern int	TwoPhaseGetDummyProcNumber(FullTransactionId fxid, bool lock_held);
 
-extern GlobalTransaction MarkAsPreparing(TransactionId xid, const char *gid,
+extern GlobalTransaction MarkAsPreparing(FullTransactionId fxid, const char *gid,
 										 TimestampTz prepared_at,
 										 Oid owner, Oid databaseid);
 
@@ -56,8 +57,9 @@ extern void CheckPointTwoPhase(XLogRecPtr redo_horizon);
 
 extern void FinishPreparedTransaction(const char *gid, bool isCommit);
 
-extern void PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
-						   XLogRecPtr end_lsn, RepOriginId origin_id);
+extern void PrepareRedoAdd(FullTransactionId fxid, char *buf,
+						   XLogRecPtr start_lsn, XLogRecPtr end_lsn,
+						   RepOriginId origin_id);
 extern void PrepareRedoRemove(TransactionId xid, bool giveWarning);
 extern void restoreTwoPhaseData(void);
 extern bool LookupGXact(const char *gid, XLogRecPtr prepare_end_lsn,
diff --git a/src/include/access/twophase_rmgr.h b/src/include/access/twophase_rmgr.h
index 3ed154bb231..8f576402e36 100644
--- a/src/include/access/twophase_rmgr.h
+++ b/src/include/access/twophase_rmgr.h
@@ -14,7 +14,9 @@
 #ifndef TWOPHASE_RMGR_H
 #define TWOPHASE_RMGR_H
 
-typedef void (*TwoPhaseCallback) (TransactionId xid, uint16 info,
+#include "access/transam.h"
+
+typedef void (*TwoPhaseCallback) (FullTransactionId fxid, uint16 info,
 								  void *recdata, uint32 len);
 typedef uint8 TwoPhaseRmgrId;
 
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 81ec0161c09..dcf230e2b6c 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -713,9 +713,9 @@ extern void pgstat_count_heap_delete(Relation rel);
 extern void pgstat_count_truncate(Relation rel);
 extern void pgstat_update_heap_dead_tuples(Relation rel, int delta);
 
-extern void pgstat_twophase_postcommit(TransactionId xid, uint16 info,
+extern void pgstat_twophase_postcommit(FullTransactionId fxid, uint16 info,
 									   void *recdata, uint32 len);
-extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
+extern void pgstat_twophase_postabort(FullTransactionId fxid, uint16 info,
 									  void *recdata, uint32 len);
 
 extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry(Oid relid);
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index 1076995518f..3feedfc8cb2 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -18,6 +18,7 @@
 #error "lock.h may not be included from frontend code"
 #endif
 
+#include "access/transam.h"
 #include "lib/ilist.h"
 #include "storage/lockdefs.h"
 #include "storage/lwlock.h"
@@ -579,7 +580,7 @@ extern bool LockHasWaiters(const LOCKTAG *locktag,
 extern VirtualTransactionId *GetLockConflicts(const LOCKTAG *locktag,
 											  LOCKMODE lockmode, int *countp);
 extern void AtPrepare_Locks(void);
-extern void PostPrepare_Locks(TransactionId xid);
+extern void PostPrepare_Locks(FullTransactionId fxid);
 extern bool LockCheckConflicts(LockMethod lockMethodTable,
 							   LOCKMODE lockmode,
 							   LOCK *lock, PROCLOCK *proclock);
@@ -594,13 +595,13 @@ extern BlockedProcsData *GetBlockerStatusData(int blocked_pid);
 extern xl_standby_lock *GetRunningTransactionLocks(int *nlocks);
 extern const char *GetLockmodeName(LOCKMETHODID lockmethodid, LOCKMODE mode);
 
-extern void lock_twophase_recover(TransactionId xid, uint16 info,
+extern void lock_twophase_recover(FullTransactionId fxid, uint16 info,
 								  void *recdata, uint32 len);
-extern void lock_twophase_postcommit(TransactionId xid, uint16 info,
+extern void lock_twophase_postcommit(FullTransactionId fxid, uint16 info,
 									 void *recdata, uint32 len);
-extern void lock_twophase_postabort(TransactionId xid, uint16 info,
+extern void lock_twophase_postabort(FullTransactionId fxid, uint16 info,
 									void *recdata, uint32 len);
-extern void lock_twophase_standby_recover(TransactionId xid, uint16 info,
+extern void lock_twophase_standby_recover(FullTransactionId fxid, uint16 info,
 										  void *recdata, uint32 len);
 
 extern DeadLockState DeadLockCheck(PGPROC *proc);
diff --git a/src/include/storage/predicate.h b/src/include/storage/predicate.h
index 267d5d90e94..4d3f218f93b 100644
--- a/src/include/storage/predicate.h
+++ b/src/include/storage/predicate.h
@@ -14,6 +14,7 @@
 #ifndef PREDICATE_H
 #define PREDICATE_H
 
+#include "access/transam.h"
 #include "storage/itemptr.h"
 #include "storage/lock.h"
 #include "utils/relcache.h"
@@ -72,9 +73,9 @@ extern void PreCommit_CheckForSerializationFailure(void);
 
 /* two-phase commit support */
 extern void AtPrepare_PredicateLocks(void);
-extern void PostPrepare_PredicateLocks(TransactionId xid);
-extern void PredicateLockTwoPhaseFinish(TransactionId xid, bool isCommit);
-extern void predicatelock_twophase_recover(TransactionId xid, uint16 info,
+extern void PostPrepare_PredicateLocks(FullTransactionId fxid);
+extern void PredicateLockTwoPhaseFinish(FullTransactionId xid, bool isCommit);
+extern void predicatelock_twophase_recover(FullTransactionId fxid, uint16 info,
 										   void *recdata, uint32 len);
 
 /* parallel query support */
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 27ccdf9500f..4e52792bd1f 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1847,7 +1847,7 @@ AtPrepare_MultiXact(void)
  *		Clean up after successful PREPARE TRANSACTION
  */
 void
-PostPrepare_MultiXact(TransactionId xid)
+PostPrepare_MultiXact(FullTransactionId fxid)
 {
 	MultiXactId myOldestMember;
 
@@ -1858,7 +1858,7 @@ PostPrepare_MultiXact(TransactionId xid)
 	myOldestMember = OldestMemberMXactId[MyProcNumber];
 	if (MultiXactIdIsValid(myOldestMember))
 	{
-		ProcNumber	dummyProcNumber = TwoPhaseGetDummyProcNumber(xid, false);
+		ProcNumber	dummyProcNumber = TwoPhaseGetDummyProcNumber(fxid, false);
 
 		/*
 		 * Even though storing MultiXactId is atomic, acquire lock to make
@@ -1896,10 +1896,10 @@ PostPrepare_MultiXact(TransactionId xid)
  *		Recover the state of a prepared transaction at startup
  */
 void
-multixact_twophase_recover(TransactionId xid, uint16 info,
+multixact_twophase_recover(FullTransactionId fxid, uint16 info,
 						   void *recdata, uint32 len)
 {
-	ProcNumber	dummyProcNumber = TwoPhaseGetDummyProcNumber(xid, false);
+	ProcNumber	dummyProcNumber = TwoPhaseGetDummyProcNumber(fxid, false);
 	MultiXactId oldestMember;
 
 	/*
@@ -1917,10 +1917,10 @@ multixact_twophase_recover(TransactionId xid, uint16 info,
  *		Similar to AtEOXact_MultiXact but for COMMIT PREPARED
  */
 void
-multixact_twophase_postcommit(TransactionId xid, uint16 info,
+multixact_twophase_postcommit(FullTransactionId fxid, uint16 info,
 							  void *recdata, uint32 len)
 {
-	ProcNumber	dummyProcNumber = TwoPhaseGetDummyProcNumber(xid, true);
+	ProcNumber	dummyProcNumber = TwoPhaseGetDummyProcNumber(fxid, true);
 
 	Assert(len == sizeof(MultiXactId));
 
@@ -1932,10 +1932,10 @@ multixact_twophase_postcommit(TransactionId xid, uint16 info,
  *		This is actually just the same as the COMMIT case.
  */
 void
-multixact_twophase_postabort(TransactionId xid, uint16 info,
+multixact_twophase_postabort(FullTransactionId fxid, uint16 info,
 							 void *recdata, uint32 len)
 {
-	multixact_twophase_postcommit(xid, info, recdata, len);
+	multixact_twophase_postcommit(fxid, info, recdata, len);
 }
 
 /*
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 73a80559194..4f5b4542662 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -159,7 +159,7 @@ typedef struct GlobalTransactionData
 	 */
 	XLogRecPtr	prepare_start_lsn;	/* XLOG offset of prepare record start */
 	XLogRecPtr	prepare_end_lsn;	/* XLOG offset of prepare record end */
-	TransactionId xid;			/* The GXACT id */
+	FullTransactionId fxid;		/* The GXACT full xid */
 
 	Oid			owner;			/* ID of user that executed the xact */
 	ProcNumber	locking_backend;	/* backend currently working on the xact */
@@ -197,6 +197,7 @@ static GlobalTransaction MyLockedGxact = NULL;
 
 static bool twophaseExitRegistered = false;
 
+static void PrepareRedoRemoveFull(FullTransactionId fxid, bool giveWarning);
 static void RecordTransactionCommitPrepared(TransactionId xid,
 											int nchildren,
 											TransactionId *children,
@@ -216,19 +217,19 @@ static void RecordTransactionAbortPrepared(TransactionId xid,
 										   int nstats,
 										   xl_xact_stats_item *stats,
 										   const char *gid);
-static void ProcessRecords(char *bufptr, TransactionId xid,
+static void ProcessRecords(char *bufptr, FullTransactionId fxid,
 						   const TwoPhaseCallback callbacks[]);
 static void RemoveGXact(GlobalTransaction gxact);
 
 static void XlogReadTwoPhaseData(XLogRecPtr lsn, char **buf, int *len);
-static char *ProcessTwoPhaseBuffer(TransactionId xid,
+static char *ProcessTwoPhaseBuffer(FullTransactionId fxid,
 								   XLogRecPtr prepare_start_lsn,
 								   bool fromdisk, bool setParent, bool setNextXid);
-static void MarkAsPreparingGuts(GlobalTransaction gxact, TransactionId xid,
+static void MarkAsPreparingGuts(GlobalTransaction gxact, FullTransactionId fxid,
 								const char *gid, TimestampTz prepared_at, Oid owner,
 								Oid databaseid);
-static void RemoveTwoPhaseFile(TransactionId xid, bool giveWarning);
-static void RecreateTwoPhaseFile(TransactionId xid, void *content, int len);
+static void RemoveTwoPhaseFile(FullTransactionId fxid, bool giveWarning);
+static void RecreateTwoPhaseFile(FullTransactionId fxid, void *content, int len);
 
 /*
  * Initialization of shared memory
@@ -356,7 +357,7 @@ PostPrepare_Twophase(void)
  *		Reserve the GID for the given transaction.
  */
 GlobalTransaction
-MarkAsPreparing(TransactionId xid, const char *gid,
+MarkAsPreparing(FullTransactionId fxid, const char *gid,
 				TimestampTz prepared_at, Oid owner, Oid databaseid)
 {
 	GlobalTransaction gxact;
@@ -407,7 +408,7 @@ MarkAsPreparing(TransactionId xid, const char *gid,
 	gxact = TwoPhaseState->freeGXacts;
 	TwoPhaseState->freeGXacts = gxact->next;
 
-	MarkAsPreparingGuts(gxact, xid, gid, prepared_at, owner, databaseid);
+	MarkAsPreparingGuts(gxact, fxid, gid, prepared_at, owner, databaseid);
 
 	gxact->ondisk = false;
 
@@ -430,11 +431,13 @@ MarkAsPreparing(TransactionId xid, const char *gid,
  * Note: This function should be called with appropriate locks held.
  */
 static void
-MarkAsPreparingGuts(GlobalTransaction gxact, TransactionId xid, const char *gid,
-					TimestampTz prepared_at, Oid owner, Oid databaseid)
+MarkAsPreparingGuts(GlobalTransaction gxact, FullTransactionId fxid,
+					const char *gid, TimestampTz prepared_at, Oid owner,
+					Oid databaseid)
 {
 	PGPROC	   *proc;
 	int			i;
+	TransactionId xid = XidFromFullTransactionId(fxid);
 
 	Assert(LWLockHeldByMeInMode(TwoPhaseStateLock, LW_EXCLUSIVE));
 
@@ -479,7 +482,7 @@ MarkAsPreparingGuts(GlobalTransaction gxact, TransactionId xid, const char *gid,
 	proc->subxidStatus.count = 0;
 
 	gxact->prepared_at = prepared_at;
-	gxact->xid = xid;
+	gxact->fxid = fxid;
 	gxact->owner = owner;
 	gxact->locking_backend = MyProcNumber;
 	gxact->valid = false;
@@ -797,12 +800,12 @@ pg_prepared_xact(PG_FUNCTION_ARGS)
  * caller had better hold it.
  */
 static GlobalTransaction
-TwoPhaseGetGXact(TransactionId xid, bool lock_held)
+TwoPhaseGetGXact(FullTransactionId fxid, bool lock_held)
 {
 	GlobalTransaction result = NULL;
 	int			i;
 
-	static TransactionId cached_xid = InvalidTransactionId;
+	static FullTransactionId cached_fxid = {0};
 	static GlobalTransaction cached_gxact = NULL;
 
 	Assert(!lock_held || LWLockHeldByMe(TwoPhaseStateLock));
@@ -811,7 +814,7 @@ TwoPhaseGetGXact(TransactionId xid, bool lock_held)
 	 * During a recovery, COMMIT PREPARED, or ABORT PREPARED, we'll be called
 	 * repeatedly for the same XID.  We can save work with a simple cache.
 	 */
-	if (xid == cached_xid)
+	if (FullTransactionIdEquals(fxid, cached_fxid))
 		return cached_gxact;
 
 	if (!lock_held)
@@ -821,7 +824,7 @@ TwoPhaseGetGXact(TransactionId xid, bool lock_held)
 	{
 		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
 
-		if (gxact->xid == xid)
+		if (FullTransactionIdEquals(gxact->fxid, fxid))
 		{
 			result = gxact;
 			break;
@@ -832,9 +835,10 @@ TwoPhaseGetGXact(TransactionId xid, bool lock_held)
 		LWLockRelease(TwoPhaseStateLock);
 
 	if (result == NULL)			/* should not happen */
-		elog(ERROR, "failed to find GlobalTransaction for xid %u", xid);
+		elog(ERROR, "failed to find GlobalTransaction for xid %u",
+			 XidFromFullTransactionId(fxid));
 
-	cached_xid = xid;
+	cached_fxid = fxid;
 	cached_gxact = result;
 
 	return result;
@@ -881,7 +885,7 @@ TwoPhaseGetXidByVirtualXID(VirtualTransactionId vxid,
 				*have_more = true;
 				break;
 			}
-			result = gxact->xid;
+			result = XidFromFullTransactionId(gxact->fxid);
 		}
 	}
 
@@ -892,7 +896,7 @@ TwoPhaseGetXidByVirtualXID(VirtualTransactionId vxid,
 
 /*
  * TwoPhaseGetDummyProcNumber
- *		Get the dummy proc number for prepared transaction specified by XID
+ *		Get the dummy proc number for prepared transaction
  *
  * Dummy proc numbers are similar to proc numbers of real backends.  They
  * start at MaxBackends, and are unique across all currently active real
@@ -900,24 +904,24 @@ TwoPhaseGetXidByVirtualXID(VirtualTransactionId vxid,
  * TwoPhaseStateLock will not be taken, so the caller had better hold it.
  */
 ProcNumber
-TwoPhaseGetDummyProcNumber(TransactionId xid, bool lock_held)
+TwoPhaseGetDummyProcNumber(FullTransactionId fxid, bool lock_held)
 {
-	GlobalTransaction gxact = TwoPhaseGetGXact(xid, lock_held);
+	GlobalTransaction gxact = TwoPhaseGetGXact(fxid, lock_held);
 
 	return gxact->pgprocno;
 }
 
 /*
  * TwoPhaseGetDummyProc
- *		Get the PGPROC that represents a prepared transaction specified by XID
+ *		Get the PGPROC that represents a prepared transaction
  *
  * If lock_held is set to true, TwoPhaseStateLock will not be taken, so the
  * caller had better hold it.
  */
 PGPROC *
-TwoPhaseGetDummyProc(TransactionId xid, bool lock_held)
+TwoPhaseGetDummyProc(FullTransactionId fxid, bool lock_held)
 {
-	GlobalTransaction gxact = TwoPhaseGetGXact(xid, lock_held);
+	GlobalTransaction gxact = TwoPhaseGetGXact(fxid, lock_held);
 
 	return GetPGProcByNumber(gxact->pgprocno);
 }
@@ -942,10 +946,8 @@ AdjustToFullTransactionId(TransactionId xid)
 }
 
 static inline int
-TwoPhaseFilePath(char *path, TransactionId xid)
+TwoPhaseFilePath(char *path, FullTransactionId fxid)
 {
-	FullTransactionId fxid = AdjustToFullTransactionId(xid);
-
 	return snprintf(path, MAXPGPATH, TWOPHASE_DIR "/%08X%08X",
 					EpochFromFullTransactionId(fxid),
 					XidFromFullTransactionId(fxid));
@@ -1049,7 +1051,7 @@ void
 StartPrepare(GlobalTransaction gxact)
 {
 	PGPROC	   *proc = GetPGProcByNumber(gxact->pgprocno);
-	TransactionId xid = gxact->xid;
+	TransactionId xid = XidFromFullTransactionId(gxact->fxid);
 	TwoPhaseFileHeader hdr;
 	TransactionId *children;
 	RelFileLocator *commitrels;
@@ -1281,10 +1283,11 @@ RegisterTwoPhaseRecord(TwoPhaseRmgrId rmid, uint16 info,
  * If it looks OK (has a valid magic number and CRC), return the palloc'd
  * contents of the file, issuing an error when finding corrupted data.  If
  * missing_ok is true, which indicates that missing files can be safely
- * ignored, then return NULL.  This state can be reached when doing recovery.
+ * ignored, then return NULL.  This state can be reached when doing recovery
+ * after discarding two-phase files from frozen epochs.
  */
 static char *
-ReadTwoPhaseFile(TransactionId xid, bool missing_ok)
+ReadTwoPhaseFile(FullTransactionId fxid, bool missing_ok)
 {
 	char		path[MAXPGPATH];
 	char	   *buf;
@@ -1296,7 +1299,7 @@ ReadTwoPhaseFile(TransactionId xid, bool missing_ok)
 				file_crc;
 	int			r;
 
-	TwoPhaseFilePath(path, xid);
+	TwoPhaseFilePath(path, fxid);
 
 	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
 	if (fd < 0)
@@ -1461,6 +1464,7 @@ StandbyTransactionIdIsPrepared(TransactionId xid)
 	char	   *buf;
 	TwoPhaseFileHeader *hdr;
 	bool		result;
+	FullTransactionId fxid;
 
 	Assert(TransactionIdIsValid(xid));
 
@@ -1468,7 +1472,8 @@ StandbyTransactionIdIsPrepared(TransactionId xid)
 		return false;			/* nothing to do */
 
 	/* Read and validate file */
-	buf = ReadTwoPhaseFile(xid, true);
+	fxid = FullTransactionIdFromAllowableAt(TransamVariables->nextXid, xid);
+	buf = ReadTwoPhaseFile(fxid, true);
 	if (buf == NULL)
 		return false;
 
@@ -1488,6 +1493,7 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
 {
 	GlobalTransaction gxact;
 	PGPROC	   *proc;
+	FullTransactionId fxid;
 	TransactionId xid;
 	bool		ondisk;
 	char	   *buf;
@@ -1509,7 +1515,8 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
 	 */
 	gxact = LockGXact(gid, GetUserId());
 	proc = GetPGProcByNumber(gxact->pgprocno);
-	xid = gxact->xid;
+	fxid = gxact->fxid;
+	xid = XidFromFullTransactionId(fxid);
 
 	/*
 	 * Read and validate 2PC state data. State data will typically be stored
@@ -1517,7 +1524,7 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
 	 * to disk if for some reason they have lived for a long time.
 	 */
 	if (gxact->ondisk)
-		buf = ReadTwoPhaseFile(xid, false);
+		buf = ReadTwoPhaseFile(fxid, false);
 	else
 		XlogReadTwoPhaseData(gxact->prepare_start_lsn, &buf, NULL);
 
@@ -1636,11 +1643,11 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
 
 	/* And now do the callbacks */
 	if (isCommit)
-		ProcessRecords(bufptr, xid, twophase_postcommit_callbacks);
+		ProcessRecords(bufptr, fxid, twophase_postcommit_callbacks);
 	else
-		ProcessRecords(bufptr, xid, twophase_postabort_callbacks);
+		ProcessRecords(bufptr, fxid, twophase_postabort_callbacks);
 
-	PredicateLockTwoPhaseFinish(xid, isCommit);
+	PredicateLockTwoPhaseFinish(fxid, isCommit);
 
 	/*
 	 * Read this value while holding the two-phase lock, as the on-disk 2PC
@@ -1664,7 +1671,7 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
 	 * And now we can clean up any files we may have left.
 	 */
 	if (ondisk)
-		RemoveTwoPhaseFile(xid, true);
+		RemoveTwoPhaseFile(fxid, true);
 
 	MyLockedGxact = NULL;
 
@@ -1677,7 +1684,7 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
  * Scan 2PC state data in memory and call the indicated callbacks for each 2PC record.
  */
 static void
-ProcessRecords(char *bufptr, TransactionId xid,
+ProcessRecords(char *bufptr, FullTransactionId fxid,
 			   const TwoPhaseCallback callbacks[])
 {
 	for (;;)
@@ -1691,24 +1698,28 @@ ProcessRecords(char *bufptr, TransactionId xid,
 		bufptr += MAXALIGN(sizeof(TwoPhaseRecordOnDisk));
 
 		if (callbacks[record->rmid] != NULL)
-			callbacks[record->rmid] (xid, record->info, bufptr, record->len);
+			callbacks[record->rmid] (fxid, record->info, bufptr, record->len);
 
 		bufptr += MAXALIGN(record->len);
 	}
 }
 
 /*
- * Remove the 2PC file for the specified XID.
+ * Remove the 2PC file.
  *
  * If giveWarning is false, do not complain about file-not-present;
  * this is an expected case during WAL replay.
+ *
+ * This routine is used at early stages at recovery where future and
+ * past orphaned files are checked, hence the FullTransactionId to build
+ * a complete file name fit for the removal.
  */
 static void
-RemoveTwoPhaseFile(TransactionId xid, bool giveWarning)
+RemoveTwoPhaseFile(FullTransactionId fxid, bool giveWarning)
 {
 	char		path[MAXPGPATH];
 
-	TwoPhaseFilePath(path, xid);
+	TwoPhaseFilePath(path, fxid);
 	if (unlink(path))
 		if (errno != ENOENT || giveWarning)
 			ereport(WARNING,
@@ -1723,7 +1734,7 @@ RemoveTwoPhaseFile(TransactionId xid, bool giveWarning)
  * Note: content and len don't include CRC.
  */
 static void
-RecreateTwoPhaseFile(TransactionId xid, void *content, int len)
+RecreateTwoPhaseFile(FullTransactionId fxid, void *content, int len)
 {
 	char		path[MAXPGPATH];
 	pg_crc32c	statefile_crc;
@@ -1734,7 +1745,7 @@ RecreateTwoPhaseFile(TransactionId xid, void *content, int len)
 	COMP_CRC32C(statefile_crc, content, len);
 	FIN_CRC32C(statefile_crc);
 
-	TwoPhaseFilePath(path, xid);
+	TwoPhaseFilePath(path, fxid);
 
 	fd = OpenTransientFile(path,
 						   O_CREAT | O_TRUNC | O_WRONLY | PG_BINARY);
@@ -1846,7 +1857,7 @@ CheckPointTwoPhase(XLogRecPtr redo_horizon)
 			int			len;
 
 			XlogReadTwoPhaseData(gxact->prepare_start_lsn, &buf, &len);
-			RecreateTwoPhaseFile(gxact->xid, buf, len);
+			RecreateTwoPhaseFile(gxact->fxid, buf, len);
 			gxact->ondisk = true;
 			gxact->prepare_start_lsn = InvalidXLogRecPtr;
 			gxact->prepare_end_lsn = InvalidXLogRecPtr;
@@ -1897,19 +1908,17 @@ restoreTwoPhaseData(void)
 		if (strlen(clde->d_name) == 16 &&
 			strspn(clde->d_name, "0123456789ABCDEF") == 16)
 		{
-			TransactionId xid;
 			FullTransactionId fxid;
 			char	   *buf;
 
 			fxid = FullTransactionIdFromU64(strtou64(clde->d_name, NULL, 16));
-			xid = XidFromFullTransactionId(fxid);
 
-			buf = ProcessTwoPhaseBuffer(xid, InvalidXLogRecPtr,
+			buf = ProcessTwoPhaseBuffer(fxid, InvalidXLogRecPtr,
 										true, false, false);
 			if (buf == NULL)
 				continue;
 
-			PrepareRedoAdd(buf, InvalidXLogRecPtr,
+			PrepareRedoAdd(fxid, buf, InvalidXLogRecPtr,
 						   InvalidXLogRecPtr, InvalidRepOriginId);
 		}
 	}
@@ -1968,9 +1977,8 @@ PrescanPreparedTransactions(TransactionId **xids_p, int *nxids_p)
 
 		Assert(gxact->inredo);
 
-		xid = gxact->xid;
-
-		buf = ProcessTwoPhaseBuffer(xid,
+		xid = XidFromFullTransactionId(gxact->fxid);
+		buf = ProcessTwoPhaseBuffer(gxact->fxid,
 									gxact->prepare_start_lsn,
 									gxact->ondisk, false, true);
 
@@ -2036,15 +2044,12 @@ StandbyRecoverPreparedTransactions(void)
 	LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
 	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
 	{
-		TransactionId xid;
 		char	   *buf;
 		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
 
 		Assert(gxact->inredo);
 
-		xid = gxact->xid;
-
-		buf = ProcessTwoPhaseBuffer(xid,
+		buf = ProcessTwoPhaseBuffer(gxact->fxid,
 									gxact->prepare_start_lsn,
 									gxact->ondisk, true, false);
 		if (buf != NULL)
@@ -2077,16 +2082,14 @@ RecoverPreparedTransactions(void)
 	LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
 	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
 	{
-		TransactionId xid;
 		char	   *buf;
 		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+		FullTransactionId fxid = gxact->fxid;
 		char	   *bufptr;
 		TwoPhaseFileHeader *hdr;
 		TransactionId *subxids;
 		const char *gid;
 
-		xid = gxact->xid;
-
 		/*
 		 * Reconstruct subtrans state for the transaction --- needed because
 		 * pg_subtrans is not preserved over a restart.  Note that we are
@@ -2096,17 +2099,20 @@ RecoverPreparedTransactions(void)
 		 * SubTransSetParent has been set before, if the prepared transaction
 		 * generated xid assignment records.
 		 */
-		buf = ProcessTwoPhaseBuffer(xid,
+		buf = ProcessTwoPhaseBuffer(gxact->fxid,
 									gxact->prepare_start_lsn,
 									gxact->ondisk, true, false);
 		if (buf == NULL)
 			continue;
 
 		ereport(LOG,
-				(errmsg("recovering prepared transaction %u from shared memory", xid)));
+				(errmsg("recovering prepared transaction %u of epoch %u from shared memory",
+						XidFromFullTransactionId(gxact->fxid),
+						EpochFromFullTransactionId(gxact->fxid))));
 
 		hdr = (TwoPhaseFileHeader *) buf;
-		Assert(TransactionIdEquals(hdr->xid, xid));
+		Assert(TransactionIdEquals(hdr->xid,
+								   XidFromFullTransactionId(gxact->fxid)));
 		bufptr = buf + MAXALIGN(sizeof(TwoPhaseFileHeader));
 		gid = (const char *) bufptr;
 		bufptr += MAXALIGN(hdr->gidlen);
@@ -2122,7 +2128,7 @@ RecoverPreparedTransactions(void)
 		 * Recreate its GXACT and dummy PGPROC. But, check whether it was
 		 * added in redo and already has a shmem entry for it.
 		 */
-		MarkAsPreparingGuts(gxact, xid, gid,
+		MarkAsPreparingGuts(gxact, gxact->fxid, gid,
 							hdr->prepared_at,
 							hdr->owner, hdr->database);
 
@@ -2137,7 +2143,7 @@ RecoverPreparedTransactions(void)
 		/*
 		 * Recover other state (notably locks) using resource managers.
 		 */
-		ProcessRecords(bufptr, xid, twophase_recover_callbacks);
+		ProcessRecords(bufptr, fxid, twophase_recover_callbacks);
 
 		/*
 		 * Release locks held by the standby process after we process each
@@ -2145,7 +2151,7 @@ RecoverPreparedTransactions(void)
 		 * additional locks at any one time.
 		 */
 		if (InHotStandby)
-			StandbyReleaseLockTree(xid, hdr->nsubxacts, subxids);
+			StandbyReleaseLockTree(hdr->xid, hdr->nsubxacts, subxids);
 
 		/*
 		 * We're done with recovering this transaction. Clear MyLockedGxact,
@@ -2164,7 +2170,7 @@ RecoverPreparedTransactions(void)
 /*
  * ProcessTwoPhaseBuffer
  *
- * Given a transaction id, read it either from disk or read it directly
+ * Given a FullTransactionId, read it either from disk or read it directly
  * via shmem xlog record pointer using the provided "prepare_start_lsn".
  *
  * If setParent is true, set up subtransaction parent linkages.
@@ -2173,13 +2179,12 @@ RecoverPreparedTransactions(void)
  * value scanned.
  */
 static char *
-ProcessTwoPhaseBuffer(TransactionId xid,
+ProcessTwoPhaseBuffer(FullTransactionId fxid,
 					  XLogRecPtr prepare_start_lsn,
 					  bool fromdisk,
 					  bool setParent, bool setNextXid)
 {
 	FullTransactionId nextXid = TransamVariables->nextXid;
-	TransactionId origNextXid = XidFromFullTransactionId(nextXid);
 	TransactionId *subxids;
 	char	   *buf;
 	TwoPhaseFileHeader *hdr;
@@ -2191,41 +2196,46 @@ ProcessTwoPhaseBuffer(TransactionId xid,
 		Assert(prepare_start_lsn != InvalidXLogRecPtr);
 
 	/* Already processed? */
-	if (TransactionIdDidCommit(xid) || TransactionIdDidAbort(xid))
+	if (TransactionIdDidCommit(XidFromFullTransactionId(fxid)) ||
+		TransactionIdDidAbort(XidFromFullTransactionId(fxid)))
 	{
 		if (fromdisk)
 		{
 			ereport(WARNING,
-					(errmsg("removing stale two-phase state file for transaction %u",
-							xid)));
-			RemoveTwoPhaseFile(xid, true);
+					(errmsg("removing stale two-phase state file for transaction %u of epoch %u",
+							XidFromFullTransactionId(fxid),
+							EpochFromFullTransactionId(fxid))));
+			RemoveTwoPhaseFile(fxid, true);
 		}
 		else
 		{
 			ereport(WARNING,
-					(errmsg("removing stale two-phase state from memory for transaction %u",
-							xid)));
-			PrepareRedoRemove(xid, true);
+					(errmsg("removing stale two-phase state from memory for transaction %u of epoch %u",
+							XidFromFullTransactionId(fxid),
+							EpochFromFullTransactionId(fxid))));
+			PrepareRedoRemoveFull(fxid, true);
 		}
 		return NULL;
 	}
 
 	/* Reject XID if too new */
-	if (TransactionIdFollowsOrEquals(xid, origNextXid))
+	if (FullTransactionIdFollowsOrEquals(fxid, nextXid))
 	{
 		if (fromdisk)
 		{
 			ereport(WARNING,
-					(errmsg("removing future two-phase state file for transaction %u",
-							xid)));
-			RemoveTwoPhaseFile(xid, true);
+					(errmsg("removing future two-phase state file for transaction %u of epoch %u",
+							XidFromFullTransactionId(fxid),
+							EpochFromFullTransactionId(fxid))));
+			RemoveTwoPhaseFile(fxid, true);
 		}
 		else
 		{
 			ereport(WARNING,
-					(errmsg("removing future two-phase state from memory for transaction %u",
-							xid)));
-			PrepareRedoRemove(xid, true);
+					(errmsg("removing future two-phase state from memory for transaction %u of epoch %u",
+							XidFromFullTransactionId(fxid),
+							EpochFromFullTransactionId(fxid))));
+			PrepareRedoRemoveFull(fxid, true);
 		}
 		return NULL;
 	}
@@ -2233,7 +2243,7 @@ ProcessTwoPhaseBuffer(TransactionId xid,
 	if (fromdisk)
 	{
 		/* Read and validate file */
-		buf = ReadTwoPhaseFile(xid, false);
+		buf = ReadTwoPhaseFile(fxid, false);
 	}
 	else
 	{
@@ -2243,18 +2253,20 @@ ProcessTwoPhaseBuffer(TransactionId xid,
 
 	/* Deconstruct header */
 	hdr = (TwoPhaseFileHeader *) buf;
-	if (!TransactionIdEquals(hdr->xid, xid))
+	if (!TransactionIdEquals(hdr->xid, XidFromFullTransactionId(fxid)))
 	{
 		if (fromdisk)
 			ereport(ERROR,
 					(errcode(ERRCODE_DATA_CORRUPTED),
-					 errmsg("corrupted two-phase state file for transaction %u",
-							xid)));
+					 errmsg("corrupted two-phase state file for transaction %u of epoch %u",
+							XidFromFullTransactionId(fxid),
+							EpochFromFullTransactionId(fxid))));
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_DATA_CORRUPTED),
-					 errmsg("corrupted two-phase state in memory for transaction %u",
-							xid)));
+					 errmsg("corrupted two-phase state in memory for transaction %u of epoch %u",
+							XidFromFullTransactionId(fxid),
+							EpochFromFullTransactionId(fxid))));
 	}
 
 	/*
@@ -2268,14 +2280,14 @@ ProcessTwoPhaseBuffer(TransactionId xid,
 	{
 		TransactionId subxid = subxids[i];
 
-		Assert(TransactionIdFollows(subxid, xid));
+		Assert(TransactionIdFollows(subxid, XidFromFullTransactionId(fxid)));
 
 		/* update nextXid if needed */
 		if (setNextXid)
 			AdvanceNextFullTransactionIdPastXid(subxid);
 
 		if (setParent)
-			SubTransSetParent(subxid, xid);
+			SubTransSetParent(subxid, XidFromFullTransactionId(fxid));
 	}
 
 	return buf;
@@ -2466,8 +2478,9 @@ RecordTransactionAbortPrepared(TransactionId xid,
  * data, the entry is marked as located on disk.
  */
 void
-PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
-			   XLogRecPtr end_lsn, RepOriginId origin_id)
+PrepareRedoAdd(FullTransactionId fxid, char *buf,
+			   XLogRecPtr start_lsn, XLogRecPtr end_lsn,
+			   RepOriginId origin_id)
 {
 	TwoPhaseFileHeader *hdr = (TwoPhaseFileHeader *) buf;
 	char	   *bufptr;
@@ -2477,6 +2490,10 @@ PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 	Assert(LWLockHeldByMeInMode(TwoPhaseStateLock, LW_EXCLUSIVE));
 	Assert(RecoveryInProgress());
 
+	if (!FullTransactionIdIsValid(fxid))
+		fxid = FullTransactionIdFromAllowableAt(TransamVariables->nextXid,
+												hdr->xid);
+
 	bufptr = buf + MAXALIGN(sizeof(TwoPhaseFileHeader));
 	gid = (const char *) bufptr;
 
@@ -2505,7 +2522,8 @@ PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 	{
 		char		path[MAXPGPATH];
 
-		TwoPhaseFilePath(path, hdr->xid);
+		Assert(InRecovery);
+		TwoPhaseFilePath(path, fxid);
 
 		if (access(path, F_OK) == 0)
 		{
@@ -2536,7 +2554,7 @@ PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 	gxact->prepared_at = hdr->prepared_at;
 	gxact->prepare_start_lsn = start_lsn;
 	gxact->prepare_end_lsn = end_lsn;
-	gxact->xid = hdr->xid;
+	gxact->fxid = fxid;
 	gxact->owner = hdr->owner;
 	gxact->locking_backend = INVALID_PROC_NUMBER;
 	gxact->valid = false;
@@ -2555,11 +2573,13 @@ PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 						   false /* backward */ , false /* WAL */ );
 	}
 
-	elog(DEBUG2, "added 2PC data in shared memory for transaction %u", gxact->xid);
+	elog(DEBUG2, "added 2PC data in shared memory for transaction %u of epoch %u",
+		 XidFromFullTransactionId(gxact->fxid),
+		 EpochFromFullTransactionId(gxact->fxid));
 }
 
 /*
- * PrepareRedoRemove
+ * PrepareRedoRemoveFull
  *
  * Remove the corresponding gxact entry from TwoPhaseState. Also remove
  * the 2PC file if a prepared transaction was saved via an earlier checkpoint.
@@ -2567,8 +2587,8 @@ PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
  * Caller must hold TwoPhaseStateLock in exclusive mode, because TwoPhaseState
  * is updated.
  */
-void
-PrepareRedoRemove(TransactionId xid, bool giveWarning)
+static void
+PrepareRedoRemoveFull(FullTransactionId fxid, bool giveWarning)
 {
 	GlobalTransaction gxact = NULL;
 	int			i;
@@ -2581,7 +2601,7 @@ PrepareRedoRemove(TransactionId xid, bool giveWarning)
 	{
 		gxact = TwoPhaseState->prepXacts[i];
 
-		if (gxact->xid == xid)
+		if (FullTransactionIdEquals(gxact->fxid, fxid))
 		{
 			Assert(gxact->inredo);
 			found = true;
@@ -2598,12 +2618,28 @@ PrepareRedoRemove(TransactionId xid, bool giveWarning)
 	/*
 	 * And now we can clean up any files we may have left.
 	 */
-	elog(DEBUG2, "removing 2PC data for transaction %u", xid);
+	elog(DEBUG2, "removing 2PC data for transaction %u of epoch %u ",
+		 XidFromFullTransactionId(fxid),
+		 EpochFromFullTransactionId(fxid));
+
 	if (gxact->ondisk)
-		RemoveTwoPhaseFile(xid, giveWarning);
+		RemoveTwoPhaseFile(fxid, giveWarning);
+
 	RemoveGXact(gxact);
 }
 
+/*
+ * Wrapper of PrepareRedoRemoveFull(), for TransactionIds.
+ */
+void
+PrepareRedoRemove(TransactionId xid, bool giveWarning)
+{
+	FullTransactionId fxid =
+		FullTransactionIdFromAllowableAt(TransamVariables->nextXid, xid);
+
+	PrepareRedoRemoveFull(fxid, giveWarning);
+}
+
 /*
  * LookupGXact
  *		Check if the prepared transaction with the given GID, lsn and timestamp
@@ -2648,7 +2684,7 @@ LookupGXact(const char *gid, XLogRecPtr prepare_end_lsn,
 			 * between publisher and subscriber.
 			 */
 			if (gxact->ondisk)
-				buf = ReadTwoPhaseFile(gxact->xid, false);
+				buf = ReadTwoPhaseFile(gxact->fxid, false);
 			else
 			{
 				Assert(gxact->prepare_start_lsn);
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index d331ab90d78..8cd0c8bfbd8 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2512,7 +2512,7 @@ static void
 PrepareTransaction(void)
 {
 	TransactionState s = CurrentTransactionState;
-	TransactionId xid = GetCurrentTransactionId();
+	FullTransactionId fxid = GetCurrentFullTransactionId();
 	GlobalTransaction gxact;
 	TimestampTz prepared_at;
 
@@ -2641,7 +2641,7 @@ PrepareTransaction(void)
 	 * Reserve the GID for this transaction. This could fail if the requested
 	 * GID is invalid or already in use.
 	 */
-	gxact = MarkAsPreparing(xid, prepareGID, prepared_at,
+	gxact = MarkAsPreparing(fxid, prepareGID, prepared_at,
 							GetUserId(), MyDatabaseId);
 	prepareGID = NULL;
 
@@ -2691,7 +2691,7 @@ PrepareTransaction(void)
 	 * ProcArrayClearTransaction().  Otherwise, a GetLockConflicts() would
 	 * conclude "xact already committed or aborted" for our locks.
 	 */
-	PostPrepare_Locks(xid);
+	PostPrepare_Locks(fxid);
 
 	/*
 	 * Let others know about no transaction in progress by me.  This has to be
@@ -2733,9 +2733,9 @@ PrepareTransaction(void)
 
 	PostPrepare_smgr();
 
-	PostPrepare_MultiXact(xid);
+	PostPrepare_MultiXact(fxid);
 
-	PostPrepare_PredicateLocks(xid);
+	PostPrepare_PredicateLocks(fxid);
 
 	ResourceOwnerRelease(TopTransactionResourceOwner,
 						 RESOURCE_RELEASE_LOCKS,
@@ -6408,7 +6408,8 @@ xact_redo(XLogReaderState *record)
 		 * gxact entry.
 		 */
 		LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
-		PrepareRedoAdd(XLogRecGetData(record),
+		PrepareRedoAdd(InvalidFullTransactionId,
+					   XLogRecGetData(record),
 					   record->ReadRecPtr,
 					   record->EndRecPtr,
 					   XLogRecGetOrigin(record));
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 11b4d1085bb..9a361d4a988 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -3476,9 +3476,9 @@ AtPrepare_Locks(void)
  * but that probably costs more cycles.
  */
 void
-PostPrepare_Locks(TransactionId xid)
+PostPrepare_Locks(FullTransactionId fxid)
 {
-	PGPROC	   *newproc = TwoPhaseGetDummyProc(xid, false);
+	PGPROC	   *newproc = TwoPhaseGetDummyProc(fxid, false);
 	HASH_SEQ_STATUS status;
 	LOCALLOCK  *locallock;
 	LOCK	   *lock;
@@ -4261,11 +4261,11 @@ DumpAllLocks(void)
  * and PANIC anyway.
  */
 void
-lock_twophase_recover(TransactionId xid, uint16 info,
+lock_twophase_recover(FullTransactionId fxid, uint16 info,
 					  void *recdata, uint32 len)
 {
 	TwoPhaseLockRecord *rec = (TwoPhaseLockRecord *) recdata;
-	PGPROC	   *proc = TwoPhaseGetDummyProc(xid, false);
+	PGPROC	   *proc = TwoPhaseGetDummyProc(fxid, false);
 	LOCKTAG    *locktag;
 	LOCKMODE	lockmode;
 	LOCKMETHODID lockmethodid;
@@ -4442,7 +4442,7 @@ lock_twophase_recover(TransactionId xid, uint16 info,
  * starting up into hot standby mode.
  */
 void
-lock_twophase_standby_recover(TransactionId xid, uint16 info,
+lock_twophase_standby_recover(FullTransactionId fxid, uint16 info,
 							  void *recdata, uint32 len)
 {
 	TwoPhaseLockRecord *rec = (TwoPhaseLockRecord *) recdata;
@@ -4461,7 +4461,7 @@ lock_twophase_standby_recover(TransactionId xid, uint16 info,
 	if (lockmode == AccessExclusiveLock &&
 		locktag->locktag_type == LOCKTAG_RELATION)
 	{
-		StandbyAcquireAccessExclusiveLock(xid,
+		StandbyAcquireAccessExclusiveLock(XidFromFullTransactionId(fxid),
 										  locktag->locktag_field1 /* dboid */ ,
 										  locktag->locktag_field2 /* reloid */ );
 	}
@@ -4474,11 +4474,11 @@ lock_twophase_standby_recover(TransactionId xid, uint16 info,
  * Find and release the lock indicated by the 2PC record.
  */
 void
-lock_twophase_postcommit(TransactionId xid, uint16 info,
+lock_twophase_postcommit(FullTransactionId fxid, uint16 info,
 						 void *recdata, uint32 len)
 {
 	TwoPhaseLockRecord *rec = (TwoPhaseLockRecord *) recdata;
-	PGPROC	   *proc = TwoPhaseGetDummyProc(xid, true);
+	PGPROC	   *proc = TwoPhaseGetDummyProc(fxid, true);
 	LOCKTAG    *locktag;
 	LOCKMETHODID lockmethodid;
 	LockMethod	lockMethodTable;
@@ -4500,10 +4500,10 @@ lock_twophase_postcommit(TransactionId xid, uint16 info,
  * This is actually just the same as the COMMIT case.
  */
 void
-lock_twophase_postabort(TransactionId xid, uint16 info,
+lock_twophase_postabort(FullTransactionId fxid, uint16 info,
 						void *recdata, uint32 len)
 {
-	lock_twophase_postcommit(xid, info, recdata, len);
+	lock_twophase_postcommit(fxid, info, recdata, len);
 }
 
 /*
diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index 5b21a053981..928647d691e 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -191,7 +191,7 @@
  *		AtPrepare_PredicateLocks(void);
  *		PostPrepare_PredicateLocks(TransactionId xid);
  *		PredicateLockTwoPhaseFinish(TransactionId xid, bool isCommit);
- *		predicatelock_twophase_recover(TransactionId xid, uint16 info,
+ *		predicatelock_twophase_recover(FullTransactionId fxid, uint16 info,
  *									   void *recdata, uint32 len);
  */
 
@@ -4846,7 +4846,7 @@ AtPrepare_PredicateLocks(void)
  *		anyway. We only need to clean up our local state.
  */
 void
-PostPrepare_PredicateLocks(TransactionId xid)
+PostPrepare_PredicateLocks(FullTransactionId fxid)
 {
 	if (MySerializableXact == InvalidSerializableXact)
 		return;
@@ -4869,12 +4869,12 @@ PostPrepare_PredicateLocks(TransactionId xid)
  *		commits or aborts.
  */
 void
-PredicateLockTwoPhaseFinish(TransactionId xid, bool isCommit)
+PredicateLockTwoPhaseFinish(FullTransactionId fxid, bool isCommit)
 {
 	SERIALIZABLEXID *sxid;
 	SERIALIZABLEXIDTAG sxidtag;
 
-	sxidtag.xid = xid;
+	sxidtag.xid = XidFromFullTransactionId(fxid);
 
 	LWLockAcquire(SerializableXactHashLock, LW_SHARED);
 	sxid = (SERIALIZABLEXID *)
@@ -4896,10 +4896,11 @@ PredicateLockTwoPhaseFinish(TransactionId xid, bool isCommit)
  * Re-acquire a predicate lock belonging to a transaction that was prepared.
  */
 void
-predicatelock_twophase_recover(TransactionId xid, uint16 info,
+predicatelock_twophase_recover(FullTransactionId fxid, uint16 info,
 							   void *recdata, uint32 len)
 {
 	TwoPhasePredicateRecord *record;
+	TransactionId xid = XidFromFullTransactionId(fxid);
 
 	Assert(len == sizeof(TwoPhasePredicateRecord));
 
diff --git a/src/backend/utils/activity/pgstat_relation.c b/src/backend/utils/activity/pgstat_relation.c
index d64595a165c..b3c10abde4f 100644
--- a/src/backend/utils/activity/pgstat_relation.c
+++ b/src/backend/utils/activity/pgstat_relation.c
@@ -744,7 +744,7 @@ PostPrepare_PgStat_Relations(PgStat_SubXactStatus *xact_state)
  * Load the saved counts into our local pgstats state.
  */
 void
-pgstat_twophase_postcommit(TransactionId xid, uint16 info,
+pgstat_twophase_postcommit(FullTransactionId fxid, uint16 info,
 						   void *recdata, uint32 len)
 {
 	TwoPhasePgStatRecord *rec = (TwoPhasePgStatRecord *) recdata;
@@ -780,7 +780,7 @@ pgstat_twophase_postcommit(TransactionId xid, uint16 info,
  * as aborted.
  */
 void
-pgstat_twophase_postabort(TransactionId xid, uint16 info,
+pgstat_twophase_postabort(FullTransactionId fxid, uint16 info,
 						  void *recdata, uint32 len)
 {
 	TwoPhasePgStatRecord *rec = (TwoPhasePgStatRecord *) recdata;
-- 
2.47.2

v2-0002-Fix-issues-with-2PC-file-handling-at-recovery-13.txttext/plain; charset=us-asciiDownload

From 5a151afb34f639979407d8e37087a49c915f8aa6 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Fri, 31 Jan 2025 09:18:08 +0900
Subject: [PATCH v2] Fix issues with 2PC file handling at recovery

This addresses two issues:
- Avoid CLOG file lookups until we are sure that this is safe.  This is
now done at the end of recovery.
- Avoid mishandling of 2PC shmem state data.

Tests are added to show the problems possible.

Backpatch-through: 13
---
 src/backend/access/transam/twophase.c | 119 +++++++++++++----------
 src/test/recovery/t/009_twophase.pl   | 135 +++++++++++++++++++++++++-
 2 files changed, 201 insertions(+), 53 deletions(-)

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 0a0932cff44..49c760dc19d 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1863,13 +1863,17 @@ CheckPointTwoPhase(XLogRecPtr redo_horizon)
  * Scan pg_twophase and fill TwoPhaseState depending on the on-disk data.
  * This is called once at the beginning of recovery, saving any extra
  * lookups in the future.  Two-phase files that are newer than the
- * minimum XID horizon are discarded on the way.
+ * minimum XID horizon are discarded on the way, as much as files that
+ * are older than the oldest XID horizon.
  */
 void
 restoreTwoPhaseData(void)
 {
 	DIR		   *cldir;
 	struct dirent *clde;
+	FullTransactionId nextFullXid = ShmemVariableCache->nextFullXid;
+	TransactionId origNextXid = XidFromFullTransactionId(nextFullXid);
+	TransactionId oldestXid = ShmemVariableCache->oldestXid;
 
 	LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
 	cldir = AllocateDir(TWOPHASE_DIR);
@@ -1883,10 +1887,24 @@ restoreTwoPhaseData(void)
 
 			xid = (TransactionId) strtoul(clde->d_name, NULL, 16);
 
+			/* Reject XID if too new or too old */
+			if (TransactionIdFollowsOrEquals(xid, origNextXid) ||
+				TransactionIdPrecedes(xid, oldestXid))
+			{
+				if (TransactionIdFollowsOrEquals(xid, origNextXid))
+					ereport(WARNING,
+							(errmsg("removing future two-phase state file for transaction %u",
+									xid)));
+				else
+					ereport(WARNING,
+							(errmsg("removing past two-phase state file for transaction %u",
+									xid)));
+				RemoveTwoPhaseFile(xid, true);
+				continue;
+			}
+
 			buf = ProcessTwoPhaseBuffer(xid, InvalidXLogRecPtr,
 										true, false, false);
-			if (buf == NULL)
-				continue;
 
 			PrepareRedoAdd(buf, InvalidXLogRecPtr,
 						   InvalidXLogRecPtr, InvalidRepOriginId);
@@ -1953,9 +1971,6 @@ PrescanPreparedTransactions(TransactionId **xids_p, int *nxids_p)
 									gxact->prepare_start_lsn,
 									gxact->ondisk, false, true);
 
-		if (buf == NULL)
-			continue;
-
 		/*
 		 * OK, we think this file is valid.  Incorporate xid into the
 		 * running-minimum result.
@@ -2026,8 +2041,7 @@ StandbyRecoverPreparedTransactions(void)
 		buf = ProcessTwoPhaseBuffer(xid,
 									gxact->prepare_start_lsn,
 									gxact->ondisk, true, false);
-		if (buf != NULL)
-			pfree(buf);
+		pfree(buf);
 	}
 	LWLockRelease(TwoPhaseStateLock);
 }
@@ -2052,8 +2066,21 @@ void
 RecoverPreparedTransactions(void)
 {
 	int			i;
+	TransactionId *remove_xids;
+	int			remove_xids_cnt;
 
 	LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
+
+	/*
+	 * Track XIDs candidate for removal if found as already committed or
+	 * aborted, once the first scan through TwoPhaseState is done.  This
+	 * cannot happen while going through the entries in TwoPhaseState as
+	 * PrepareRedoRemove() manipulates it.
+	 */
+	remove_xids_cnt = 0;
+	remove_xids = (TransactionId *) palloc(TwoPhaseState->numPrepXacts *
+										   sizeof(TransactionId));
+
 	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
 	{
 		TransactionId xid;
@@ -2066,6 +2093,26 @@ RecoverPreparedTransactions(void)
 
 		xid = gxact->xid;
 
+		/*
+		 * Is this transaction already aborted or committed?  If yes, mark it
+		 * for removal.
+		 *
+		 * Checking CLOGs if these transactions have been already aborted or
+		 * committed is safe at this stage; we are at the end of recovery and
+		 * all WAL has been replayed, all 2PC transactions are reinstated and
+		 * should be tracked in TwoPhaseState.
+		 */
+		if (TransactionIdDidCommit(xid) || TransactionIdDidAbort(xid))
+		{
+			/*
+			 * Track this transaction ID for its removal from the shared
+			 * memory state at the end.
+			 */
+			remove_xids[remove_xids_cnt] = xid;
+			remove_xids_cnt++;
+			continue;
+		}
+
 		/*
 		 * Reconstruct subtrans state for the transaction --- needed because
 		 * pg_subtrans is not preserved over a restart.  Note that we are
@@ -2078,8 +2125,6 @@ RecoverPreparedTransactions(void)
 		buf = ProcessTwoPhaseBuffer(xid,
 									gxact->prepare_start_lsn,
 									gxact->ondisk, true, false);
-		if (buf == NULL)
-			continue;
 
 		ereport(LOG,
 				(errmsg("recovering prepared transaction %u from shared memory", xid)));
@@ -2135,7 +2180,19 @@ RecoverPreparedTransactions(void)
 		LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
 	}
 
+	for (i = 0; i < remove_xids_cnt; i++)
+	{
+		TransactionId xid = remove_xids[i];
+
+		ereport(WARNING,
+				(errmsg("removing stale two-phase state from memory for transaction %u",
+						xid)));
+		PrepareRedoRemove(xid, true);
+	}
+
 	LWLockRelease(TwoPhaseStateLock);
+
+	pfree(remove_xids);
 }
 
 /*
@@ -2155,8 +2212,6 @@ ProcessTwoPhaseBuffer(TransactionId xid,
 					  bool fromdisk,
 					  bool setParent, bool setNextXid)
 {
-	FullTransactionId nextFullXid = ShmemVariableCache->nextFullXid;
-	TransactionId origNextXid = XidFromFullTransactionId(nextFullXid);
 	TransactionId *subxids;
 	char	   *buf;
 	TwoPhaseFileHeader *hdr;
@@ -2167,46 +2222,6 @@ ProcessTwoPhaseBuffer(TransactionId xid,
 	if (!fromdisk)
 		Assert(prepare_start_lsn != InvalidXLogRecPtr);
 
-	/* Already processed? */
-	if (TransactionIdDidCommit(xid) || TransactionIdDidAbort(xid))
-	{
-		if (fromdisk)
-		{
-			ereport(WARNING,
-					(errmsg("removing stale two-phase state file for transaction %u",
-							xid)));
-			RemoveTwoPhaseFile(xid, true);
-		}
-		else
-		{
-			ereport(WARNING,
-					(errmsg("removing stale two-phase state from memory for transaction %u",
-							xid)));
-			PrepareRedoRemove(xid, true);
-		}
-		return NULL;
-	}
-
-	/* Reject XID if too new */
-	if (TransactionIdFollowsOrEquals(xid, origNextXid))
-	{
-		if (fromdisk)
-		{
-			ereport(WARNING,
-					(errmsg("removing future two-phase state file for transaction %u",
-							xid)));
-			RemoveTwoPhaseFile(xid, true);
-		}
-		else
-		{
-			ereport(WARNING,
-					(errmsg("removing future two-phase state from memory for transaction %u",
-							xid)));
-			PrepareRedoRemove(xid, true);
-		}
-		return NULL;
-	}
-
 	if (fromdisk)
 	{
 		/* Read and validate file */
diff --git a/src/test/recovery/t/009_twophase.pl b/src/test/recovery/t/009_twophase.pl
index 15bb28627f9..de9acd5e43a 100644
--- a/src/test/recovery/t/009_twophase.pl
+++ b/src/test/recovery/t/009_twophase.pl
@@ -2,9 +2,10 @@
 use strict;
 use warnings;
 
+use File::Copy;
 use PostgresNode;
 use TestLib;
-use Test::More tests => 27;
+use Test::More tests => 33;
 
 my $psql_out = '';
 my $psql_rc  = '';
@@ -25,6 +26,14 @@ sub configure_and_reload
 	return;
 }
 
+sub twophase_file_name
+{
+	local $Test::Builder::Level = $Test::Builder::Level + 1;
+
+	my $xid = shift;
+	return sprintf("%08X", $xid);
+}
+
 # Set up two nodes, which will alternately be master and replication standby.
 
 # Setup london node
@@ -523,3 +532,127 @@ $cur_standby->psql(
 is( $psql_out,
 	qq{27|issued to paris},
 	"Check expected t_009_tbl2 data on standby");
+
+###############################################################################
+# Check handling of already committed or aborted 2PC files at recovery.
+# This test does a manual copy of 2PC files created in a running server,
+# to cheaply emulate situations that could be found in base backups.
+###############################################################################
+
+# Issue a set of transactions that will be used for this portion of the test:
+# - One transaction to hold on the minimum xid horizon at bay.
+# - One transaction that will be found as already committed at recovery.
+# - One transaction that will be fonnd as already rollbacked at recovery.
+$cur_master->psql(
+	'postgres', "
+	BEGIN;
+	INSERT INTO t_009_tbl VALUES (40, 'transaction: xid horizon');
+	PREPARE TRANSACTION 'xact_009_40';
+	BEGIN;
+	INSERT INTO t_009_tbl VALUES (41, 'transaction: commit-prepared');
+	PREPARE TRANSACTION 'xact_009_41';
+	BEGIN;
+	INSERT INTO t_009_tbl VALUES (42, 'transaction: rollback-prepared');
+	PREPARE TRANSACTION 'xact_009_42';");
+
+# Issue a checkpoint, fixing the XID horizon based on the first transaction,
+# flushing to disk the two files to use.
+$cur_master->psql('postgres', "CHECKPOINT");
+
+# Get the transaction IDs of the ones to 2PC files to manipulate.
+my $commit_prepared_xid = int(
+	$cur_master->safe_psql(
+		'postgres',
+		"SELECT transaction FROM pg_prepared_xacts WHERE gid = 'xact_009_41'")
+);
+my $abort_prepared_xid = int(
+	$cur_master->safe_psql(
+		'postgres',
+		"SELECT transaction FROM pg_prepared_xacts WHERE gid = 'xact_009_42'")
+);
+
+# Copy the two-phase files that will be put back later.
+my $commit_prepared_name = twophase_file_name($commit_prepared_xid);
+my $abort_prepared_name = twophase_file_name($abort_prepared_xid);
+
+my $twophase_tmpdir = $PostgreSQL::Test::Utils::tmp_check . '/' . "2pc_files";
+mkdir($twophase_tmpdir);
+my $primary_twophase_folder = $cur_master->data_dir . '/pg_twophase/';
+copy("$primary_twophase_folder/$commit_prepared_name", $twophase_tmpdir);
+copy("$primary_twophase_folder/$abort_prepared_name", $twophase_tmpdir);
+
+# Issue abort/commit prepared.
+$cur_master->psql('postgres', "COMMIT PREPARED 'xact_009_41'");
+$cur_master->psql('postgres', "ROLLBACK PREPARED 'xact_009_42'");
+
+# Again checkpoint, to advance the LSN past the point where the two previous
+# transaction records would be replayed.
+$cur_master->psql('postgres', "CHECKPOINT");
+
+# Take down node.
+$cur_master->teardown_node;
+
+# Move back the two twophase files.
+copy("$twophase_tmpdir/$commit_prepared_name", $primary_twophase_folder);
+copy("$twophase_tmpdir/$abort_prepared_name", $primary_twophase_folder);
+
+# Grab location in logs of primary
+my $log_offset = -s $cur_master->logfile;
+
+# Start node and check that the two previous files are removed by checking the
+# server logs, following the CLOG lookup done at the end of recovery.
+$cur_master->start;
+
+ok($cur_master->log_contains(
+	qr/removing stale two-phase state from memory for transaction $commit_prepared_xid/,
+	$log_offset),
+	"two-phase file of committed transaction removed at recovery");
+ok($cur_master->log_contains(
+	qr/removing stale two-phase state from memory for transaction $abort_prepared_xid/,
+	$log_offset),
+	"two-phase file of aborted transaction removed at recovery");
+
+# Commit the first transaction.
+$cur_master->psql('postgres', "COMMIT PREPARED 'xact_009_40'");
+# After replay, there should be no 2PC transactions.
+$cur_master->psql(
+	'postgres',
+	"SELECT * FROM pg_prepared_xact",
+	stdout => \$psql_out);
+is($psql_out, qq{}, "Check expected pg_prepared_xact data on primary");
+# Data from transactions should be around.
+$cur_master->psql(
+	'postgres',
+	"SELECT * FROM t_009_tbl WHERE id IN (40, 41, 42);",
+	stdout => \$psql_out);
+is( $psql_out, qq{40|transaction: xid horizon
+41|transaction: commit-prepared},
+	"Check expected table data on primary");
+
+###############################################################################
+# Check handling of orphaned 2PC files at recovery.
+###############################################################################
+
+$cur_master->teardown_node;
+
+# Grab location in logs of primary
+$log_offset = -s $cur_master->logfile;
+
+# Create fake files with a transaction ID large or low enough to be in the
+# future or the past, then check that the primary is able to start and remove
+# these files at recovery.
+
+my $future_2pc_file = $cur_master->data_dir . '/pg_twophase/00FFFFFF';
+append_to_file $future_2pc_file, "";
+my $past_2pc_file = $cur_master->data_dir . '/pg_twophase/000000FF';
+append_to_file $past_2pc_file, "";
+
+$cur_master->start;
+ok($cur_master->log_contains(
+	qr/removing future two-phase state file for transaction 16777215/,
+	$log_offset),
+	"removed future two-phase state file");
+ok($cur_master->log_contains(
+	qr/removing past two-phase state file for transaction 255/,
+	$log_offset),
+	"removed past two-phase state file");
-- 
2.47.2

v2-0002-Fix-issues-with-2PC-file-handling-at-recovery-14.txttext/plain; charset=us-asciiDownload

From 42d53985268d69555b2407e711630c3a6254f190 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Fri, 31 Jan 2025 09:18:04 +0900
Subject: [PATCH v2] Fix issues with 2PC file handling at recovery

This addresses two issues:
- Avoid CLOG file lookups until we are sure that this is safe.  This is
now done at the end of recovery.
- Avoid mishandling of 2PC shmem state data.

Tests are added to show the problems possible.

Backpatch-through: 13
---
 src/backend/access/transam/twophase.c | 119 +++++++++++++----------
 src/test/recovery/t/009_twophase.pl   | 133 +++++++++++++++++++++++++-
 2 files changed, 199 insertions(+), 53 deletions(-)

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 16848fa226c..7c9019ad8ee 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1852,13 +1852,17 @@ CheckPointTwoPhase(XLogRecPtr redo_horizon)
  * Scan pg_twophase and fill TwoPhaseState depending on the on-disk data.
  * This is called once at the beginning of recovery, saving any extra
  * lookups in the future.  Two-phase files that are newer than the
- * minimum XID horizon are discarded on the way.
+ * minimum XID horizon are discarded on the way, as much as files that
+ * are older than the oldest XID horizon.
  */
 void
 restoreTwoPhaseData(void)
 {
 	DIR		   *cldir;
 	struct dirent *clde;
+	FullTransactionId nextXid = ShmemVariableCache->nextXid;
+	TransactionId origNextXid = XidFromFullTransactionId(nextXid);
+	TransactionId oldestXid = ShmemVariableCache->oldestXid;
 
 	LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
 	cldir = AllocateDir(TWOPHASE_DIR);
@@ -1872,10 +1876,24 @@ restoreTwoPhaseData(void)
 
 			xid = (TransactionId) strtoul(clde->d_name, NULL, 16);
 
+			/* Reject XID if too new or too old */
+			if (TransactionIdFollowsOrEquals(xid, origNextXid) ||
+				TransactionIdPrecedes(xid, oldestXid))
+			{
+				if (TransactionIdFollowsOrEquals(xid, origNextXid))
+					ereport(WARNING,
+							(errmsg("removing future two-phase state file for transaction %u",
+									xid)));
+				else
+					ereport(WARNING,
+							(errmsg("removing past two-phase state file for transaction %u",
+									xid)));
+				RemoveTwoPhaseFile(xid, true);
+				continue;
+			}
+
 			buf = ProcessTwoPhaseBuffer(xid, InvalidXLogRecPtr,
 										true, false, false);
-			if (buf == NULL)
-				continue;
 
 			PrepareRedoAdd(buf, InvalidXLogRecPtr,
 						   InvalidXLogRecPtr, InvalidRepOriginId);
@@ -1942,9 +1960,6 @@ PrescanPreparedTransactions(TransactionId **xids_p, int *nxids_p)
 									gxact->prepare_start_lsn,
 									gxact->ondisk, false, true);
 
-		if (buf == NULL)
-			continue;
-
 		/*
 		 * OK, we think this file is valid.  Incorporate xid into the
 		 * running-minimum result.
@@ -2015,8 +2030,7 @@ StandbyRecoverPreparedTransactions(void)
 		buf = ProcessTwoPhaseBuffer(xid,
 									gxact->prepare_start_lsn,
 									gxact->ondisk, true, false);
-		if (buf != NULL)
-			pfree(buf);
+		pfree(buf);
 	}
 	LWLockRelease(TwoPhaseStateLock);
 }
@@ -2041,8 +2055,21 @@ void
 RecoverPreparedTransactions(void)
 {
 	int			i;
+	TransactionId *remove_xids;
+	int			remove_xids_cnt;
 
 	LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
+
+	/*
+	 * Track XIDs candidate for removal if found as already committed or
+	 * aborted, once the first scan through TwoPhaseState is done.  This
+	 * cannot happen while going through the entries in TwoPhaseState as
+	 * PrepareRedoRemove() manipulates it.
+	 */
+	remove_xids_cnt = 0;
+	remove_xids = (TransactionId *) palloc(TwoPhaseState->numPrepXacts *
+										   sizeof(TransactionId));
+
 	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
 	{
 		TransactionId xid;
@@ -2055,6 +2082,26 @@ RecoverPreparedTransactions(void)
 
 		xid = gxact->xid;
 
+		/*
+		 * Is this transaction already aborted or committed?  If yes, mark it
+		 * for removal.
+		 *
+		 * Checking CLOGs if these transactions have been already aborted or
+		 * committed is safe at this stage; we are at the end of recovery and
+		 * all WAL has been replayed, all 2PC transactions are reinstated and
+		 * should be tracked in TwoPhaseState.
+		 */
+		if (TransactionIdDidCommit(xid) || TransactionIdDidAbort(xid))
+		{
+			/*
+			 * Track this transaction ID for its removal from the shared
+			 * memory state at the end.
+			 */
+			remove_xids[remove_xids_cnt] = xid;
+			remove_xids_cnt++;
+			continue;
+		}
+
 		/*
 		 * Reconstruct subtrans state for the transaction --- needed because
 		 * pg_subtrans is not preserved over a restart.  Note that we are
@@ -2067,8 +2114,6 @@ RecoverPreparedTransactions(void)
 		buf = ProcessTwoPhaseBuffer(xid,
 									gxact->prepare_start_lsn,
 									gxact->ondisk, true, false);
-		if (buf == NULL)
-			continue;
 
 		ereport(LOG,
 				(errmsg("recovering prepared transaction %u from shared memory", xid)));
@@ -2124,7 +2169,19 @@ RecoverPreparedTransactions(void)
 		LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
 	}
 
+	for (i = 0; i < remove_xids_cnt; i++)
+	{
+		TransactionId xid = remove_xids[i];
+
+		ereport(WARNING,
+				(errmsg("removing stale two-phase state from memory for transaction %u",
+						xid)));
+		PrepareRedoRemove(xid, true);
+	}
+
 	LWLockRelease(TwoPhaseStateLock);
+
+	pfree(remove_xids);
 }
 
 /*
@@ -2144,8 +2201,6 @@ ProcessTwoPhaseBuffer(TransactionId xid,
 					  bool fromdisk,
 					  bool setParent, bool setNextXid)
 {
-	FullTransactionId nextXid = ShmemVariableCache->nextXid;
-	TransactionId origNextXid = XidFromFullTransactionId(nextXid);
 	TransactionId *subxids;
 	char	   *buf;
 	TwoPhaseFileHeader *hdr;
@@ -2156,46 +2211,6 @@ ProcessTwoPhaseBuffer(TransactionId xid,
 	if (!fromdisk)
 		Assert(prepare_start_lsn != InvalidXLogRecPtr);
 
-	/* Already processed? */
-	if (TransactionIdDidCommit(xid) || TransactionIdDidAbort(xid))
-	{
-		if (fromdisk)
-		{
-			ereport(WARNING,
-					(errmsg("removing stale two-phase state file for transaction %u",
-							xid)));
-			RemoveTwoPhaseFile(xid, true);
-		}
-		else
-		{
-			ereport(WARNING,
-					(errmsg("removing stale two-phase state from memory for transaction %u",
-							xid)));
-			PrepareRedoRemove(xid, true);
-		}
-		return NULL;
-	}
-
-	/* Reject XID if too new */
-	if (TransactionIdFollowsOrEquals(xid, origNextXid))
-	{
-		if (fromdisk)
-		{
-			ereport(WARNING,
-					(errmsg("removing future two-phase state file for transaction %u",
-							xid)));
-			RemoveTwoPhaseFile(xid, true);
-		}
-		else
-		{
-			ereport(WARNING,
-					(errmsg("removing future two-phase state from memory for transaction %u",
-							xid)));
-			PrepareRedoRemove(xid, true);
-		}
-		return NULL;
-	}
-
 	if (fromdisk)
 	{
 		/* Read and validate file */
diff --git a/src/test/recovery/t/009_twophase.pl b/src/test/recovery/t/009_twophase.pl
index a5686db2526..06557bebbe7 100644
--- a/src/test/recovery/t/009_twophase.pl
+++ b/src/test/recovery/t/009_twophase.pl
@@ -5,9 +5,10 @@
 use strict;
 use warnings;
 
+use File::Copy;
 use PostgresNode;
 use TestLib;
-use Test::More tests => 27;
+use Test::More tests => 33;
 
 my $psql_out = '';
 my $psql_rc  = '';
@@ -28,6 +29,14 @@ sub configure_and_reload
 	return;
 }
 
+sub twophase_file_name
+{
+	local $Test::Builder::Level = $Test::Builder::Level + 1;
+
+	my $xid = shift;
+	return sprintf("%08X", $xid);
+}
+
 # Set up two nodes, which will alternately be primary and replication standby.
 
 # Setup london node
@@ -527,3 +536,125 @@ $cur_standby->psql(
 is( $psql_out,
 	qq{27|issued to paris},
 	"Check expected t_009_tbl2 data on standby");
+
+###############################################################################
+# Check handling of already committed or aborted 2PC files at recovery.
+# This test does a manual copy of 2PC files created in a running server,
+# to cheaply emulate situations that could be found in base backups.
+###############################################################################
+
+# Issue a set of transactions that will be used for this portion of the test:
+# - One transaction to hold on the minimum xid horizon at bay.
+# - One transaction that will be found as already committed at recovery.
+# - One transaction that will be fonnd as already rollbacked at recovery.
+$cur_primary->psql(
+	'postgres', "
+	BEGIN;
+	INSERT INTO t_009_tbl VALUES (40, 'transaction: xid horizon');
+	PREPARE TRANSACTION 'xact_009_40';
+	BEGIN;
+	INSERT INTO t_009_tbl VALUES (41, 'transaction: commit-prepared');
+	PREPARE TRANSACTION 'xact_009_41';
+	BEGIN;
+	INSERT INTO t_009_tbl VALUES (42, 'transaction: rollback-prepared');
+	PREPARE TRANSACTION 'xact_009_42';");
+
+# Issue a checkpoint, fixing the XID horizon based on the first transaction,
+# flushing to disk the two files to use.
+$cur_primary->psql('postgres', "CHECKPOINT");
+
+# Get the transaction IDs of the ones to 2PC files to manipulate.
+my $commit_prepared_xid = int(
+	$cur_primary->safe_psql(
+		'postgres',
+		"SELECT transaction FROM pg_prepared_xacts WHERE gid = 'xact_009_41'")
+);
+my $abort_prepared_xid = int(
+	$cur_primary->safe_psql(
+		'postgres',
+		"SELECT transaction FROM pg_prepared_xacts WHERE gid = 'xact_009_42'")
+);
+
+# Copy the two-phase files that will be put back later.
+my $commit_prepared_name = twophase_file_name($commit_prepared_xid);
+my $abort_prepared_name = twophase_file_name($abort_prepared_xid);
+
+my $twophase_tmpdir = $PostgreSQL::Test::Utils::tmp_check . '/' . "2pc_files";
+mkdir($twophase_tmpdir);
+my $primary_twophase_folder = $cur_primary->data_dir . '/pg_twophase/';
+copy("$primary_twophase_folder/$commit_prepared_name", $twophase_tmpdir);
+copy("$primary_twophase_folder/$abort_prepared_name", $twophase_tmpdir);
+
+# Issue abort/commit prepared.
+$cur_primary->psql('postgres', "COMMIT PREPARED 'xact_009_41'");
+$cur_primary->psql('postgres', "ROLLBACK PREPARED 'xact_009_42'");
+
+# Again checkpoint, to advance the LSN past the point where the two previous
+# transaction records would be replayed.
+$cur_primary->psql('postgres', "CHECKPOINT");
+
+# Take down node.
+$cur_primary->teardown_node;
+
+# Move back the two twophase files.
+copy("$twophase_tmpdir/$commit_prepared_name", $primary_twophase_folder);
+copy("$twophase_tmpdir/$abort_prepared_name", $primary_twophase_folder);
+
+# Grab location in logs of primary
+my $log_offset = -s $cur_primary->logfile;
+
+# Start node and check that the two previous files are removed by checking the
+# server logs, following the CLOG lookup done at the end of recovery.
+$cur_primary->start;
+
+$cur_primary->log_check(
+	"two-phase files of committed transactions removed at recovery",
+	$log_offset,
+	log_like => [
+		qr/removing stale two-phase state from memory for transaction $commit_prepared_xid/,
+		qr/removing stale two-phase state from memory for transaction $abort_prepared_xid/
+	]);
+
+# Commit the first transaction.
+$cur_primary->psql('postgres', "COMMIT PREPARED 'xact_009_40'");
+# After replay, there should be no 2PC transactions.
+$cur_primary->psql(
+	'postgres',
+	"SELECT * FROM pg_prepared_xact",
+	stdout => \$psql_out);
+is($psql_out, qq{}, "Check expected pg_prepared_xact data on primary");
+# Data from transactions should be around.
+$cur_primary->psql(
+	'postgres',
+	"SELECT * FROM t_009_tbl WHERE id IN (40, 41, 42);",
+	stdout => \$psql_out);
+is( $psql_out, qq{40|transaction: xid horizon
+41|transaction: commit-prepared},
+	"Check expected table data on primary");
+
+###############################################################################
+# Check handling of orphaned 2PC files at recovery.
+###############################################################################
+
+$cur_primary->teardown_node;
+
+# Grab location in logs of primary
+$log_offset = -s $cur_primary->logfile;
+
+# Create fake files with a transaction ID large or low enough to be in the
+# future or the past, then check that the primary is able to start and remove
+# these files at recovery.
+
+my $future_2pc_file = $cur_primary->data_dir . '/pg_twophase/00FFFFFF';
+append_to_file $future_2pc_file, "";
+my $past_2pc_file = $cur_primary->data_dir . '/pg_twophase/000000FF';
+append_to_file $past_2pc_file, "";
+
+$cur_primary->start;
+$cur_primary->log_check(
+	"fake two-phase files removed at recovery",
+	$log_offset,
+	log_like => [
+		qr/removing future two-phase state file for transaction 16777215/,
+		qr/removing past two-phase state file for transaction 255/
+	]);
-- 
2.47.2

v2-0002-Fix-issues-with-2PC-file-handling-at-recovery-15.txttext/plain; charset=us-asciiDownload

From 969e3a48a99437bd50a6b496965f1848b90a02f3 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Fri, 31 Jan 2025 09:18:01 +0900
Subject: [PATCH v2] Fix issues with 2PC file handling at recovery

This addresses two issues:
- Avoid CLOG file lookups until we are sure that this is safe.  This is
now done at the end of recovery.
- Avoid mishandling of 2PC shmem state data.

Tests are added to show the problems possible.

Backpatch-through: 13
---
 src/backend/access/transam/twophase.c | 119 +++++++++++++----------
 src/test/recovery/t/009_twophase.pl   | 131 ++++++++++++++++++++++++++
 2 files changed, 198 insertions(+), 52 deletions(-)

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 8c5e5913df5..743bd8ca3f8 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1879,13 +1879,17 @@ CheckPointTwoPhase(XLogRecPtr redo_horizon)
  * Scan pg_twophase and fill TwoPhaseState depending on the on-disk data.
  * This is called once at the beginning of recovery, saving any extra
  * lookups in the future.  Two-phase files that are newer than the
- * minimum XID horizon are discarded on the way.
+ * minimum XID horizon are discarded on the way, as much as files that
+ * are older than the oldest XID horizon.
  */
 void
 restoreTwoPhaseData(void)
 {
 	DIR		   *cldir;
 	struct dirent *clde;
+	FullTransactionId nextXid = ShmemVariableCache->nextXid;
+	TransactionId origNextXid = XidFromFullTransactionId(nextXid);
+	TransactionId oldestXid = ShmemVariableCache->oldestXid;
 
 	LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
 	cldir = AllocateDir(TWOPHASE_DIR);
@@ -1899,10 +1903,24 @@ restoreTwoPhaseData(void)
 
 			xid = (TransactionId) strtoul(clde->d_name, NULL, 16);
 
+			/* Reject XID if too new or too old */
+			if (TransactionIdFollowsOrEquals(xid, origNextXid) ||
+				TransactionIdPrecedes(xid, oldestXid))
+			{
+				if (TransactionIdFollowsOrEquals(xid, origNextXid))
+					ereport(WARNING,
+							(errmsg("removing future two-phase state file for transaction %u",
+									xid)));
+				else
+					ereport(WARNING,
+							(errmsg("removing past two-phase state file for transaction %u",
+									xid)));
+				RemoveTwoPhaseFile(xid, true);
+				continue;
+			}
+
 			buf = ProcessTwoPhaseBuffer(xid, InvalidXLogRecPtr,
 										true, false, false);
-			if (buf == NULL)
-				continue;
 
 			PrepareRedoAdd(buf, InvalidXLogRecPtr,
 						   InvalidXLogRecPtr, InvalidRepOriginId);
@@ -1969,9 +1987,6 @@ PrescanPreparedTransactions(TransactionId **xids_p, int *nxids_p)
 									gxact->prepare_start_lsn,
 									gxact->ondisk, false, true);
 
-		if (buf == NULL)
-			continue;
-
 		/*
 		 * OK, we think this file is valid.  Incorporate xid into the
 		 * running-minimum result.
@@ -2042,8 +2057,7 @@ StandbyRecoverPreparedTransactions(void)
 		buf = ProcessTwoPhaseBuffer(xid,
 									gxact->prepare_start_lsn,
 									gxact->ondisk, true, false);
-		if (buf != NULL)
-			pfree(buf);
+		pfree(buf);
 	}
 	LWLockRelease(TwoPhaseStateLock);
 }
@@ -2068,8 +2082,21 @@ void
 RecoverPreparedTransactions(void)
 {
 	int			i;
+	TransactionId *remove_xids;
+	int			remove_xids_cnt;
 
 	LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
+
+	/*
+	 * Track XIDs candidate for removal if found as already committed or
+	 * aborted, once the first scan through TwoPhaseState is done.  This
+	 * cannot happen while going through the entries in TwoPhaseState as
+	 * PrepareRedoRemove() manipulates it.
+	 */
+	remove_xids_cnt = 0;
+	remove_xids = (TransactionId *) palloc(TwoPhaseState->numPrepXacts *
+										   sizeof(TransactionId));
+
 	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
 	{
 		TransactionId xid;
@@ -2082,6 +2109,26 @@ RecoverPreparedTransactions(void)
 
 		xid = gxact->xid;
 
+		/*
+		 * Is this transaction already aborted or committed?  If yes, mark it
+		 * for removal.
+		 *
+		 * Checking CLOGs if these transactions have been already aborted or
+		 * committed is safe at this stage; we are at the end of recovery and
+		 * all WAL has been replayed, all 2PC transactions are reinstated and
+		 * should be tracked in TwoPhaseState.
+		 */
+		if (TransactionIdDidCommit(xid) || TransactionIdDidAbort(xid))
+		{
+			/*
+			 * Track this transaction ID for its removal from the shared
+			 * memory state at the end.
+			 */
+			remove_xids[remove_xids_cnt] = xid;
+			remove_xids_cnt++;
+			continue;
+		}
+
 		/*
 		 * Reconstruct subtrans state for the transaction --- needed because
 		 * pg_subtrans is not preserved over a restart.  Note that we are
@@ -2094,8 +2141,6 @@ RecoverPreparedTransactions(void)
 		buf = ProcessTwoPhaseBuffer(xid,
 									gxact->prepare_start_lsn,
 									gxact->ondisk, true, false);
-		if (buf == NULL)
-			continue;
 
 		ereport(LOG,
 				(errmsg("recovering prepared transaction %u from shared memory", xid)));
@@ -2153,7 +2198,19 @@ RecoverPreparedTransactions(void)
 		LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
 	}
 
+	for (i = 0; i < remove_xids_cnt; i++)
+	{
+		TransactionId xid = remove_xids[i];
+
+		ereport(WARNING,
+				(errmsg("removing stale two-phase state from memory for transaction %u",
+						xid)));
+		PrepareRedoRemove(xid, true);
+	}
+
 	LWLockRelease(TwoPhaseStateLock);
+
+	pfree(remove_xids);
 }
 
 /*
@@ -2173,8 +2230,6 @@ ProcessTwoPhaseBuffer(TransactionId xid,
 					  bool fromdisk,
 					  bool setParent, bool setNextXid)
 {
-	FullTransactionId nextXid = ShmemVariableCache->nextXid;
-	TransactionId origNextXid = XidFromFullTransactionId(nextXid);
 	TransactionId *subxids;
 	char	   *buf;
 	TwoPhaseFileHeader *hdr;
@@ -2185,46 +2240,6 @@ ProcessTwoPhaseBuffer(TransactionId xid,
 	if (!fromdisk)
 		Assert(prepare_start_lsn != InvalidXLogRecPtr);
 
-	/* Already processed? */
-	if (TransactionIdDidCommit(xid) || TransactionIdDidAbort(xid))
-	{
-		if (fromdisk)
-		{
-			ereport(WARNING,
-					(errmsg("removing stale two-phase state file for transaction %u",
-							xid)));
-			RemoveTwoPhaseFile(xid, true);
-		}
-		else
-		{
-			ereport(WARNING,
-					(errmsg("removing stale two-phase state from memory for transaction %u",
-							xid)));
-			PrepareRedoRemove(xid, true);
-		}
-		return NULL;
-	}
-
-	/* Reject XID if too new */
-	if (TransactionIdFollowsOrEquals(xid, origNextXid))
-	{
-		if (fromdisk)
-		{
-			ereport(WARNING,
-					(errmsg("removing future two-phase state file for transaction %u",
-							xid)));
-			RemoveTwoPhaseFile(xid, true);
-		}
-		else
-		{
-			ereport(WARNING,
-					(errmsg("removing future two-phase state from memory for transaction %u",
-							xid)));
-			PrepareRedoRemove(xid, true);
-		}
-		return NULL;
-	}
-
 	if (fromdisk)
 	{
 		/* Read and validate file */
diff --git a/src/test/recovery/t/009_twophase.pl b/src/test/recovery/t/009_twophase.pl
index ad9b5371dd0..439ce37c0a5 100644
--- a/src/test/recovery/t/009_twophase.pl
+++ b/src/test/recovery/t/009_twophase.pl
@@ -5,6 +5,7 @@
 use strict;
 use warnings;
 
+use File::Copy;
 use PostgreSQL::Test::Cluster;
 use PostgreSQL::Test::Utils;
 use Test::More;
@@ -28,6 +29,14 @@ sub configure_and_reload
 	return;
 }
 
+sub twophase_file_name
+{
+	local $Test::Builder::Level = $Test::Builder::Level + 1;
+
+	my $xid = shift;
+	return sprintf("%08X", $xid);
+}
+
 # Set up two nodes, which will alternately be primary and replication standby.
 
 # Setup london node
@@ -528,4 +537,126 @@ is( $psql_out,
 	qq{27|issued to paris},
 	"Check expected t_009_tbl2 data on standby");
 
+###############################################################################
+# Check handling of already committed or aborted 2PC files at recovery.
+# This test does a manual copy of 2PC files created in a running server,
+# to cheaply emulate situations that could be found in base backups.
+###############################################################################
+
+# Issue a set of transactions that will be used for this portion of the test:
+# - One transaction to hold on the minimum xid horizon at bay.
+# - One transaction that will be found as already committed at recovery.
+# - One transaction that will be fonnd as already rollbacked at recovery.
+$cur_primary->psql(
+	'postgres', "
+	BEGIN;
+	INSERT INTO t_009_tbl VALUES (40, 'transaction: xid horizon');
+	PREPARE TRANSACTION 'xact_009_40';
+	BEGIN;
+	INSERT INTO t_009_tbl VALUES (41, 'transaction: commit-prepared');
+	PREPARE TRANSACTION 'xact_009_41';
+	BEGIN;
+	INSERT INTO t_009_tbl VALUES (42, 'transaction: rollback-prepared');
+	PREPARE TRANSACTION 'xact_009_42';");
+
+# Issue a checkpoint, fixing the XID horizon based on the first transaction,
+# flushing to disk the two files to use.
+$cur_primary->psql('postgres', "CHECKPOINT");
+
+# Get the transaction IDs of the ones to 2PC files to manipulate.
+my $commit_prepared_xid = int(
+	$cur_primary->safe_psql(
+		'postgres',
+		"SELECT transaction FROM pg_prepared_xacts WHERE gid = 'xact_009_41'")
+);
+my $abort_prepared_xid = int(
+	$cur_primary->safe_psql(
+		'postgres',
+		"SELECT transaction FROM pg_prepared_xacts WHERE gid = 'xact_009_42'")
+);
+
+# Copy the two-phase files that will be put back later.
+my $commit_prepared_name = twophase_file_name($commit_prepared_xid);
+my $abort_prepared_name = twophase_file_name($abort_prepared_xid);
+
+my $twophase_tmpdir = $PostgreSQL::Test::Utils::tmp_check . '/' . "2pc_files";
+mkdir($twophase_tmpdir);
+my $primary_twophase_folder = $cur_primary->data_dir . '/pg_twophase/';
+copy("$primary_twophase_folder/$commit_prepared_name", $twophase_tmpdir);
+copy("$primary_twophase_folder/$abort_prepared_name", $twophase_tmpdir);
+
+# Issue abort/commit prepared.
+$cur_primary->psql('postgres', "COMMIT PREPARED 'xact_009_41'");
+$cur_primary->psql('postgres', "ROLLBACK PREPARED 'xact_009_42'");
+
+# Again checkpoint, to advance the LSN past the point where the two previous
+# transaction records would be replayed.
+$cur_primary->psql('postgres', "CHECKPOINT");
+
+# Take down node.
+$cur_primary->teardown_node;
+
+# Move back the two twophase files.
+copy("$twophase_tmpdir/$commit_prepared_name", $primary_twophase_folder);
+copy("$twophase_tmpdir/$abort_prepared_name", $primary_twophase_folder);
+
+# Grab location in logs of primary
+my $log_offset = -s $cur_primary->logfile;
+
+# Start node and check that the two previous files are removed by checking the
+# server logs, following the CLOG lookup done at the end of recovery.
+$cur_primary->start;
+
+$cur_primary->log_check(
+	"two-phase files of committed transactions removed at recovery",
+	$log_offset,
+	log_like => [
+		qr/removing stale two-phase state from memory for transaction $commit_prepared_xid/,
+		qr/removing stale two-phase state from memory for transaction $abort_prepared_xid/
+	]);
+
+# Commit the first transaction.
+$cur_primary->psql('postgres', "COMMIT PREPARED 'xact_009_40'");
+# After replay, there should be no 2PC transactions.
+$cur_primary->psql(
+	'postgres',
+	"SELECT * FROM pg_prepared_xact",
+	stdout => \$psql_out);
+is($psql_out, qq{}, "Check expected pg_prepared_xact data on primary");
+# Data from transactions should be around.
+$cur_primary->psql(
+	'postgres',
+	"SELECT * FROM t_009_tbl WHERE id IN (40, 41, 42);",
+	stdout => \$psql_out);
+is( $psql_out, qq{40|transaction: xid horizon
+41|transaction: commit-prepared},
+	"Check expected table data on primary");
+
+###############################################################################
+# Check handling of orphaned 2PC files at recovery.
+###############################################################################
+
+$cur_primary->teardown_node;
+
+# Grab location in logs of primary
+$log_offset = -s $cur_primary->logfile;
+
+# Create fake files with a transaction ID large or low enough to be in the
+# future or the past, then check that the primary is able to start and remove
+# these files at recovery.
+
+my $future_2pc_file = $cur_primary->data_dir . '/pg_twophase/00FFFFFF';
+append_to_file $future_2pc_file, "";
+my $past_2pc_file = $cur_primary->data_dir . '/pg_twophase/000000FF';
+append_to_file $past_2pc_file, "";
+
+$cur_primary->start;
+$cur_primary->log_check(
+	"fake two-phase files removed at recovery",
+	$log_offset,
+	log_like => [
+		qr/removing future two-phase state file for transaction 16777215/,
+		qr/removing past two-phase state file for transaction 255/
+	]);
+
 done_testing();
-- 
2.47.2

v2-0002-Fix-issues-with-2PC-file-handling-at-recovery-16.txttext/plain; charset=us-asciiDownload

From fc1b0f10e6622f3f6e7f0c19212f06db2c553fb3 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Fri, 31 Jan 2025 09:17:43 +0900
Subject: [PATCH v2] Fix issues with 2PC file handling at recovery

This addresses two issues:
- Avoid CLOG file lookups until we are sure that this is safe.  This is
now done at the end of recovery.
- Avoid mishandling of 2PC shmem state data.

Tests are added to show the problems possible.

Backpatch-through: 13
---
 src/backend/access/transam/twophase.c | 119 +++++++++++++----------
 src/test/recovery/t/009_twophase.pl   | 131 ++++++++++++++++++++++++++
 2 files changed, 198 insertions(+), 52 deletions(-)

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 95aa8be9c53..4a34e7697e5 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1877,13 +1877,17 @@ CheckPointTwoPhase(XLogRecPtr redo_horizon)
  * Scan pg_twophase and fill TwoPhaseState depending on the on-disk data.
  * This is called once at the beginning of recovery, saving any extra
  * lookups in the future.  Two-phase files that are newer than the
- * minimum XID horizon are discarded on the way.
+ * minimum XID horizon are discarded on the way, as much as files that
+ * are older than the oldest XID horizon.
  */
 void
 restoreTwoPhaseData(void)
 {
 	DIR		   *cldir;
 	struct dirent *clde;
+	FullTransactionId nextXid = ShmemVariableCache->nextXid;
+	TransactionId origNextXid = XidFromFullTransactionId(nextXid);
+	TransactionId oldestXid = ShmemVariableCache->oldestXid;
 
 	LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
 	cldir = AllocateDir(TWOPHASE_DIR);
@@ -1897,10 +1901,24 @@ restoreTwoPhaseData(void)
 
 			xid = (TransactionId) strtoul(clde->d_name, NULL, 16);
 
+			/* Reject XID if too new or too old */
+			if (TransactionIdFollowsOrEquals(xid, origNextXid) ||
+				TransactionIdPrecedes(xid, oldestXid))
+			{
+				if (TransactionIdFollowsOrEquals(xid, origNextXid))
+					ereport(WARNING,
+							(errmsg("removing future two-phase state file for transaction %u",
+									xid)));
+				else
+					ereport(WARNING,
+							(errmsg("removing past two-phase state file for transaction %u",
+									xid)));
+				RemoveTwoPhaseFile(xid, true);
+				continue;
+			}
+
 			buf = ProcessTwoPhaseBuffer(xid, InvalidXLogRecPtr,
 										true, false, false);
-			if (buf == NULL)
-				continue;
 
 			PrepareRedoAdd(buf, InvalidXLogRecPtr,
 						   InvalidXLogRecPtr, InvalidRepOriginId);
@@ -1967,9 +1985,6 @@ PrescanPreparedTransactions(TransactionId **xids_p, int *nxids_p)
 									gxact->prepare_start_lsn,
 									gxact->ondisk, false, true);
 
-		if (buf == NULL)
-			continue;
-
 		/*
 		 * OK, we think this file is valid.  Incorporate xid into the
 		 * running-minimum result.
@@ -2040,8 +2055,7 @@ StandbyRecoverPreparedTransactions(void)
 		buf = ProcessTwoPhaseBuffer(xid,
 									gxact->prepare_start_lsn,
 									gxact->ondisk, true, false);
-		if (buf != NULL)
-			pfree(buf);
+		pfree(buf);
 	}
 	LWLockRelease(TwoPhaseStateLock);
 }
@@ -2066,8 +2080,21 @@ void
 RecoverPreparedTransactions(void)
 {
 	int			i;
+	TransactionId *remove_xids;
+	int			remove_xids_cnt;
 
 	LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
+
+	/*
+	 * Track XIDs candidate for removal if found as already committed or
+	 * aborted, once the first scan through TwoPhaseState is done.  This
+	 * cannot happen while going through the entries in TwoPhaseState as
+	 * PrepareRedoRemove() manipulates it.
+	 */
+	remove_xids_cnt = 0;
+	remove_xids = (TransactionId *) palloc(TwoPhaseState->numPrepXacts *
+										   sizeof(TransactionId));
+
 	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
 	{
 		TransactionId xid;
@@ -2080,6 +2107,26 @@ RecoverPreparedTransactions(void)
 
 		xid = gxact->xid;
 
+		/*
+		 * Is this transaction already aborted or committed?  If yes, mark it
+		 * for removal.
+		 *
+		 * Checking CLOGs if these transactions have been already aborted or
+		 * committed is safe at this stage; we are at the end of recovery and
+		 * all WAL has been replayed, all 2PC transactions are reinstated and
+		 * should be tracked in TwoPhaseState.
+		 */
+		if (TransactionIdDidCommit(xid) || TransactionIdDidAbort(xid))
+		{
+			/*
+			 * Track this transaction ID for its removal from the shared
+			 * memory state at the end.
+			 */
+			remove_xids[remove_xids_cnt] = xid;
+			remove_xids_cnt++;
+			continue;
+		}
+
 		/*
 		 * Reconstruct subtrans state for the transaction --- needed because
 		 * pg_subtrans is not preserved over a restart.  Note that we are
@@ -2092,8 +2139,6 @@ RecoverPreparedTransactions(void)
 		buf = ProcessTwoPhaseBuffer(xid,
 									gxact->prepare_start_lsn,
 									gxact->ondisk, true, false);
-		if (buf == NULL)
-			continue;
 
 		ereport(LOG,
 				(errmsg("recovering prepared transaction %u from shared memory", xid)));
@@ -2151,7 +2196,19 @@ RecoverPreparedTransactions(void)
 		LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
 	}
 
+	for (i = 0; i < remove_xids_cnt; i++)
+	{
+		TransactionId xid = remove_xids[i];
+
+		ereport(WARNING,
+				(errmsg("removing stale two-phase state from memory for transaction %u",
+						xid)));
+		PrepareRedoRemove(xid, true);
+	}
+
 	LWLockRelease(TwoPhaseStateLock);
+
+	pfree(remove_xids);
 }
 
 /*
@@ -2171,8 +2228,6 @@ ProcessTwoPhaseBuffer(TransactionId xid,
 					  bool fromdisk,
 					  bool setParent, bool setNextXid)
 {
-	FullTransactionId nextXid = ShmemVariableCache->nextXid;
-	TransactionId origNextXid = XidFromFullTransactionId(nextXid);
 	TransactionId *subxids;
 	char	   *buf;
 	TwoPhaseFileHeader *hdr;
@@ -2183,46 +2238,6 @@ ProcessTwoPhaseBuffer(TransactionId xid,
 	if (!fromdisk)
 		Assert(prepare_start_lsn != InvalidXLogRecPtr);
 
-	/* Already processed? */
-	if (TransactionIdDidCommit(xid) || TransactionIdDidAbort(xid))
-	{
-		if (fromdisk)
-		{
-			ereport(WARNING,
-					(errmsg("removing stale two-phase state file for transaction %u",
-							xid)));
-			RemoveTwoPhaseFile(xid, true);
-		}
-		else
-		{
-			ereport(WARNING,
-					(errmsg("removing stale two-phase state from memory for transaction %u",
-							xid)));
-			PrepareRedoRemove(xid, true);
-		}
-		return NULL;
-	}
-
-	/* Reject XID if too new */
-	if (TransactionIdFollowsOrEquals(xid, origNextXid))
-	{
-		if (fromdisk)
-		{
-			ereport(WARNING,
-					(errmsg("removing future two-phase state file for transaction %u",
-							xid)));
-			RemoveTwoPhaseFile(xid, true);
-		}
-		else
-		{
-			ereport(WARNING,
-					(errmsg("removing future two-phase state from memory for transaction %u",
-							xid)));
-			PrepareRedoRemove(xid, true);
-		}
-		return NULL;
-	}
-
 	if (fromdisk)
 	{
 		/* Read and validate file */
diff --git a/src/test/recovery/t/009_twophase.pl b/src/test/recovery/t/009_twophase.pl
index fe7e8e79802..790f1a1d2ec 100644
--- a/src/test/recovery/t/009_twophase.pl
+++ b/src/test/recovery/t/009_twophase.pl
@@ -5,6 +5,7 @@
 use strict;
 use warnings;
 
+use File::Copy;
 use PostgreSQL::Test::Cluster;
 use PostgreSQL::Test::Utils;
 use Test::More;
@@ -28,6 +29,14 @@ sub configure_and_reload
 	return;
 }
 
+sub twophase_file_name
+{
+	local $Test::Builder::Level = $Test::Builder::Level + 1;
+
+	my $xid = shift;
+	return sprintf("%08X", $xid);
+}
+
 # Set up two nodes, which will alternately be primary and replication standby.
 
 # Setup london node
@@ -528,4 +537,126 @@ is( $psql_out,
 	qq{27|issued to paris},
 	"Check expected t_009_tbl2 data on standby");
 
+###############################################################################
+# Check handling of already committed or aborted 2PC files at recovery.
+# This test does a manual copy of 2PC files created in a running server,
+# to cheaply emulate situations that could be found in base backups.
+###############################################################################
+
+# Issue a set of transactions that will be used for this portion of the test:
+# - One transaction to hold on the minimum xid horizon at bay.
+# - One transaction that will be found as already committed at recovery.
+# - One transaction that will be fonnd as already rollbacked at recovery.
+$cur_primary->psql(
+	'postgres', "
+	BEGIN;
+	INSERT INTO t_009_tbl VALUES (40, 'transaction: xid horizon');
+	PREPARE TRANSACTION 'xact_009_40';
+	BEGIN;
+	INSERT INTO t_009_tbl VALUES (41, 'transaction: commit-prepared');
+	PREPARE TRANSACTION 'xact_009_41';
+	BEGIN;
+	INSERT INTO t_009_tbl VALUES (42, 'transaction: rollback-prepared');
+	PREPARE TRANSACTION 'xact_009_42';");
+
+# Issue a checkpoint, fixing the XID horizon based on the first transaction,
+# flushing to disk the two files to use.
+$cur_primary->psql('postgres', "CHECKPOINT");
+
+# Get the transaction IDs of the ones to 2PC files to manipulate.
+my $commit_prepared_xid = int(
+	$cur_primary->safe_psql(
+		'postgres',
+		"SELECT transaction FROM pg_prepared_xacts WHERE gid = 'xact_009_41'")
+);
+my $abort_prepared_xid = int(
+	$cur_primary->safe_psql(
+		'postgres',
+		"SELECT transaction FROM pg_prepared_xacts WHERE gid = 'xact_009_42'")
+);
+
+# Copy the two-phase files that will be put back later.
+my $commit_prepared_name = twophase_file_name($commit_prepared_xid);
+my $abort_prepared_name = twophase_file_name($abort_prepared_xid);
+
+my $twophase_tmpdir = $PostgreSQL::Test::Utils::tmp_check . '/' . "2pc_files";
+mkdir($twophase_tmpdir);
+my $primary_twophase_folder = $cur_primary->data_dir . '/pg_twophase/';
+copy("$primary_twophase_folder/$commit_prepared_name", $twophase_tmpdir);
+copy("$primary_twophase_folder/$abort_prepared_name", $twophase_tmpdir);
+
+# Issue abort/commit prepared.
+$cur_primary->psql('postgres', "COMMIT PREPARED 'xact_009_41'");
+$cur_primary->psql('postgres', "ROLLBACK PREPARED 'xact_009_42'");
+
+# Again checkpoint, to advance the LSN past the point where the two previous
+# transaction records would be replayed.
+$cur_primary->psql('postgres', "CHECKPOINT");
+
+# Take down node.
+$cur_primary->teardown_node;
+
+# Move back the two twophase files.
+copy("$twophase_tmpdir/$commit_prepared_name", $primary_twophase_folder);
+copy("$twophase_tmpdir/$abort_prepared_name", $primary_twophase_folder);
+
+# Grab location in logs of primary
+my $log_offset = -s $cur_primary->logfile;
+
+# Start node and check that the two previous files are removed by checking the
+# server logs, following the CLOG lookup done at the end of recovery.
+$cur_primary->start;
+
+$cur_primary->log_check(
+	"two-phase files of committed transactions removed at recovery",
+	$log_offset,
+	log_like => [
+		qr/removing stale two-phase state from memory for transaction $commit_prepared_xid/,
+		qr/removing stale two-phase state from memory for transaction $abort_prepared_xid/
+	]);
+
+# Commit the first transaction.
+$cur_primary->psql('postgres', "COMMIT PREPARED 'xact_009_40'");
+# After replay, there should be no 2PC transactions.
+$cur_primary->psql(
+	'postgres',
+	"SELECT * FROM pg_prepared_xact",
+	stdout => \$psql_out);
+is($psql_out, qq{}, "Check expected pg_prepared_xact data on primary");
+# Data from transactions should be around.
+$cur_primary->psql(
+	'postgres',
+	"SELECT * FROM t_009_tbl WHERE id IN (40, 41, 42);",
+	stdout => \$psql_out);
+is( $psql_out, qq{40|transaction: xid horizon
+41|transaction: commit-prepared},
+	"Check expected table data on primary");
+
+###############################################################################
+# Check handling of orphaned 2PC files at recovery.
+###############################################################################
+
+$cur_primary->teardown_node;
+
+# Grab location in logs of primary
+$log_offset = -s $cur_primary->logfile;
+
+# Create fake files with a transaction ID large or low enough to be in the
+# future or the past, then check that the primary is able to start and remove
+# these files at recovery.
+
+my $future_2pc_file = $cur_primary->data_dir . '/pg_twophase/00FFFFFF';
+append_to_file $future_2pc_file, "";
+my $past_2pc_file = $cur_primary->data_dir . '/pg_twophase/000000FF';
+append_to_file $past_2pc_file, "";
+
+$cur_primary->start;
+$cur_primary->log_check(
+	"fake two-phase files removed at recovery",
+	$log_offset,
+	log_like => [
+		qr/removing future two-phase state file for transaction 16777215/,
+		qr/removing past two-phase state file for transaction 255/
+	]);
+
 done_testing();
-- 
2.47.2

v2-0002-Fix-issues-with-2PC-file-handling-at-recovery-17.txttext/plain; charset=us-asciiDownload

From 30e93dc08c73ea3f93b3b260ede71d5c2919e68e Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Fri, 31 Jan 2025 08:47:24 +0900
Subject: [PATCH v2 2/2] Fix issues with 2PC file handling at recovery

This addresses two issues:
- Avoid CLOG file lookups until we are sure that this is safe.  This is
now done at the end of recovery.
- Avoid mishandling of 2PC shmem state data.

Tests are added to show the problems possible.

Backpatch-through: 13
---
 src/backend/access/transam/twophase.c | 125 ++++++++++++-----------
 src/test/recovery/t/009_twophase.pl   | 140 ++++++++++++++++++++++++++
 2 files changed, 209 insertions(+), 56 deletions(-)

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 0286336484e..20b0756c742 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1894,13 +1894,16 @@ CheckPointTwoPhase(XLogRecPtr redo_horizon)
  * Scan pg_twophase and fill TwoPhaseState depending on the on-disk data.
  * This is called once at the beginning of recovery, saving any extra
  * lookups in the future.  Two-phase files that are newer than the
- * minimum XID horizon are discarded on the way.
+ * minimum XID horizon are discarded on the way, as much as files that
+ * are older than the oldest XID horizon.
  */
 void
 restoreTwoPhaseData(void)
 {
 	DIR		   *cldir;
 	struct dirent *clde;
+	FullTransactionId nextXid = TransamVariables->nextXid;
+	FullTransactionId oldestXid = AdjustToFullTransactionId(TransamVariables->oldestXid);
 
 	LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
 	cldir = AllocateDir(TWOPHASE_DIR);
@@ -1914,10 +1917,26 @@ restoreTwoPhaseData(void)
 
 			fxid = FullTransactionIdFromU64(strtou64(clde->d_name, NULL, 16));
 
+			/* Reject XID if too new or too old */
+			if (FullTransactionIdFollowsOrEquals(fxid, nextXid) ||
+				FullTransactionIdPrecedes(fxid, oldestXid))
+			{
+				if (FullTransactionIdFollowsOrEquals(fxid, nextXid))
+					ereport(WARNING,
+							(errmsg("removing future two-phase state file for transaction %u of epoch %u",
+									XidFromFullTransactionId(fxid),
+									EpochFromFullTransactionId(fxid))));
+				else
+					ereport(WARNING,
+							(errmsg("removing past two-phase state file for transaction %u of epoch %u",
+									XidFromFullTransactionId(fxid),
+									EpochFromFullTransactionId(fxid))));
+				RemoveTwoPhaseFile(fxid, true);
+				continue;
+			}
+
 			buf = ProcessTwoPhaseBuffer(fxid, InvalidXLogRecPtr,
 										true, false, false);
-			if (buf == NULL)
-				continue;
 
 			PrepareRedoAdd(fxid, buf, InvalidXLogRecPtr,
 						   InvalidXLogRecPtr, InvalidRepOriginId);
@@ -1983,9 +2002,6 @@ PrescanPreparedTransactions(TransactionId **xids_p, int *nxids_p)
 									gxact->prepare_start_lsn,
 									gxact->ondisk, false, true);
 
-		if (buf == NULL)
-			continue;
-
 		/*
 		 * OK, we think this file is valid.  Incorporate xid into the
 		 * running-minimum result.
@@ -2053,8 +2069,7 @@ StandbyRecoverPreparedTransactions(void)
 		buf = ProcessTwoPhaseBuffer(gxact->fxid,
 									gxact->prepare_start_lsn,
 									gxact->ondisk, true, false);
-		if (buf != NULL)
-			pfree(buf);
+		pfree(buf);
 	}
 	LWLockRelease(TwoPhaseStateLock);
 }
@@ -2079,8 +2094,21 @@ void
 RecoverPreparedTransactions(void)
 {
 	int			i;
+	FullTransactionId *remove_fxids;
+	int			remove_fxids_cnt;
 
 	LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
+
+	/*
+	 * Track XIDs candidate for removal if found as already committed or
+	 * aborted, once the first scan through TwoPhaseState is done.  This
+	 * cannot happen while going through the entries in TwoPhaseState as
+	 * PrepareRedoRemove() manipulates it.
+	 */
+	remove_fxids_cnt = 0;
+	remove_fxids = (FullTransactionId *) palloc(TwoPhaseState->numPrepXacts *
+												sizeof(FullTransactionId));
+
 	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
 	{
 		char	   *buf;
@@ -2091,6 +2119,27 @@ RecoverPreparedTransactions(void)
 		TransactionId *subxids;
 		const char *gid;
 
+		/*
+		 * Is this transaction already aborted or committed?  If yes, mark it
+		 * for removal.
+		 *
+		 * Checking CLOGs if these transactions have been already aborted or
+		 * committed is safe at this stage; we are at the end of recovery and
+		 * all WAL has been replayed, all 2PC transactions are reinstated and
+		 * should be tracked in TwoPhaseState.
+		 */
+		if (TransactionIdDidCommit(XidFromFullTransactionId(fxid)) ||
+			TransactionIdDidAbort(XidFromFullTransactionId(fxid)))
+		{
+			/*
+			 * Track this transaction ID for its removal from the shared
+			 * memory state at the end.
+			 */
+			remove_fxids[remove_fxids_cnt] = fxid;
+			remove_fxids_cnt++;
+			continue;
+		}
+
 		/*
 		 * Reconstruct subtrans state for the transaction --- needed because
 		 * pg_subtrans is not preserved over a restart.  Note that we are
@@ -2103,8 +2152,6 @@ RecoverPreparedTransactions(void)
 		buf = ProcessTwoPhaseBuffer(gxact->fxid,
 									gxact->prepare_start_lsn,
 									gxact->ondisk, true, false);
-		if (buf == NULL)
-			continue;
 
 		ereport(LOG,
 				(errmsg("recovering prepared transaction %u of epoch %u from shared memory",
@@ -2165,6 +2212,18 @@ RecoverPreparedTransactions(void)
 		LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
 	}
 
+	for (i = 0; i < remove_fxids_cnt; i++)
+	{
+		FullTransactionId fxid = remove_fxids[i];
+
+		ereport(WARNING,
+				(errmsg("removing stale two-phase state from memory for transaction %u of epoch %u",
+						XidFromFullTransactionId(fxid),
+						EpochFromFullTransactionId(fxid))));
+
+		PrepareRedoRemoveFull(fxid, true);
+	}
+
 	LWLockRelease(TwoPhaseStateLock);
 }
 
@@ -2185,7 +2244,6 @@ ProcessTwoPhaseBuffer(FullTransactionId fxid,
 					  bool fromdisk,
 					  bool setParent, bool setNextXid)
 {
-	FullTransactionId nextXid = TransamVariables->nextXid;
 	TransactionId *subxids;
 	char	   *buf;
 	TwoPhaseFileHeader *hdr;
@@ -2196,51 +2254,6 @@ ProcessTwoPhaseBuffer(FullTransactionId fxid,
 	if (!fromdisk)
 		Assert(prepare_start_lsn != InvalidXLogRecPtr);
 
-	/* Already processed? */
-	if (TransactionIdDidCommit(XidFromFullTransactionId(fxid)) ||
-		TransactionIdDidAbort(XidFromFullTransactionId(fxid)))
-	{
-		if (fromdisk)
-		{
-			ereport(WARNING,
-					(errmsg("removing stale two-phase state file for transaction %u of epoch %u",
-							XidFromFullTransactionId(fxid),
-							EpochFromFullTransactionId(fxid))));
-			RemoveTwoPhaseFile(fxid, true);
-		}
-		else
-		{
-			ereport(WARNING,
-					(errmsg("removing stale two-phase state from memory for transaction %u of epoch %u",
-							XidFromFullTransactionId(fxid),
-							EpochFromFullTransactionId(fxid))));
-			PrepareRedoRemoveFull(fxid, true);
-		}
-		return NULL;
-	}
-
-	/* Reject XID if too new */
-	if (FullTransactionIdFollowsOrEquals(fxid, nextXid))
-	{
-		if (fromdisk)
-		{
-			ereport(WARNING,
-					(errmsg("removing future two-phase state file for transaction %u of epoch %u",
-							XidFromFullTransactionId(fxid),
-							EpochFromFullTransactionId(fxid))));
-			RemoveTwoPhaseFile(fxid, true);
-		}
-		else
-		{
-			ereport(WARNING,
-					(errmsg("removing future two-phase state from memory for transaction %u of epoch %u",
-							XidFromFullTransactionId(fxid),
-							EpochFromFullTransactionId(fxid))));
-			PrepareRedoRemoveFull(fxid, true);
-		}
-		return NULL;
-	}
-
 	if (fromdisk)
 	{
 		/* Read and validate file */
diff --git a/src/test/recovery/t/009_twophase.pl b/src/test/recovery/t/009_twophase.pl
index 4b3e0f77dc0..c08d7676d66 100644
--- a/src/test/recovery/t/009_twophase.pl
+++ b/src/test/recovery/t/009_twophase.pl
@@ -5,6 +5,7 @@
 use strict;
 use warnings FATAL => 'all';
 
+use File::Copy;
 use PostgreSQL::Test::Cluster;
 use PostgreSQL::Test::Utils;
 use Test::More;
@@ -28,6 +29,15 @@ sub configure_and_reload
 	return;
 }
 
+sub twophase_file_name
+{
+	local $Test::Builder::Level = $Test::Builder::Level + 1;
+
+	my $epoch = shift;
+	my $xid = shift;
+	return sprintf("%08X%08X", $epoch, $xid);
+}
+
 # Set up two nodes, which will alternately be primary and replication standby.
 
 # Setup london node
@@ -572,4 +582,134 @@ my $nsubtrans = $cur_primary->safe_psql('postgres',
 );
 isnt($osubtrans, $nsubtrans, "contents of pg_subtrans/ have changed");
 
+###############################################################################
+# Check handling of already committed or aborted 2PC files at recovery.
+# This test does a manual copy of 2PC files created in a running server,
+# to cheaply emulate situations that could be found in base backups.
+###############################################################################
+
+# Issue a set of transactions that will be used for this portion of the test:
+# - One transaction to hold on the minimum xid horizon at bay.
+# - One transaction that will be found as already committed at recovery.
+# - One transaction that will be fonnd as already rollbacked at recovery.
+$cur_primary->psql(
+	'postgres', "
+	BEGIN;
+	INSERT INTO t_009_tbl VALUES (40, 'transaction: xid horizon');
+	PREPARE TRANSACTION 'xact_009_40';
+	BEGIN;
+	INSERT INTO t_009_tbl VALUES (41, 'transaction: commit-prepared');
+	PREPARE TRANSACTION 'xact_009_41';
+	BEGIN;
+	INSERT INTO t_009_tbl VALUES (42, 'transaction: rollback-prepared');
+	PREPARE TRANSACTION 'xact_009_42';");
+
+# Issue a checkpoint, fixing the XID horizon based on the first transaction,
+# flushing to disk the two files to use.
+$cur_primary->psql('postgres', "CHECKPOINT");
+
+# Get the transaction IDs of the ones to 2PC files to manipulate.
+my $commit_prepared_xid = int(
+	$cur_primary->safe_psql(
+		'postgres',
+		"SELECT transaction FROM pg_prepared_xacts WHERE gid = 'xact_009_41'")
+);
+my $abort_prepared_xid = int(
+	$cur_primary->safe_psql(
+		'postgres',
+		"SELECT transaction FROM pg_prepared_xacts WHERE gid = 'xact_009_42'")
+);
+
+# Copy the two-phase files that will be put back later.  Assume an
+# epoch of 0.
+my $commit_prepared_name = twophase_file_name(0, $commit_prepared_xid);
+my $abort_prepared_name = twophase_file_name(0, $abort_prepared_xid);
+
+my $twophase_tmpdir = $PostgreSQL::Test::Utils::tmp_check . '/' . "2pc_files";
+mkdir($twophase_tmpdir);
+my $primary_twophase_folder = $cur_primary->data_dir . '/pg_twophase/';
+copy("$primary_twophase_folder/$commit_prepared_name", $twophase_tmpdir);
+copy("$primary_twophase_folder/$abort_prepared_name", $twophase_tmpdir);
+
+# Issue abort/commit prepared.
+$cur_primary->psql('postgres', "COMMIT PREPARED 'xact_009_41'");
+$cur_primary->psql('postgres', "ROLLBACK PREPARED 'xact_009_42'");
+
+# Again checkpoint, to advance the LSN past the point where the two previous
+# transaction records would be replayed.
+$cur_primary->psql('postgres', "CHECKPOINT");
+
+# Take down node.
+$cur_primary->teardown_node;
+
+# Move back the two twophase files.
+copy("$twophase_tmpdir/$commit_prepared_name", $primary_twophase_folder);
+copy("$twophase_tmpdir/$abort_prepared_name", $primary_twophase_folder);
+
+# Grab location in logs of primary
+my $log_offset = -s $cur_primary->logfile;
+
+# Start node and check that the two previous files are removed by checking the
+# server logs, following the CLOG lookup done at the end of recovery.
+$cur_primary->start;
+
+$cur_primary->log_check(
+	"two-phase files of committed transactions removed at recovery",
+	$log_offset,
+	log_like => [
+		qr/removing stale two-phase state from memory for transaction $commit_prepared_xid of epoch 0/,
+		qr/removing stale two-phase state from memory for transaction $abort_prepared_xid of epoch 0/
+	]);
+
+# Commit the first transaction.
+$cur_primary->psql('postgres', "COMMIT PREPARED 'xact_009_40'");
+# After replay, there should be no 2PC transactions.
+$cur_primary->psql(
+	'postgres',
+	"SELECT * FROM pg_prepared_xact",
+	stdout => \$psql_out);
+is($psql_out, qq{}, "Check expected pg_prepared_xact data on primary");
+# Data from transactions should be around.
+$cur_primary->psql(
+	'postgres',
+	"SELECT * FROM t_009_tbl WHERE id IN (40, 41, 42);",
+	stdout => \$psql_out);
+is( $psql_out, qq{40|transaction: xid horizon
+41|transaction: commit-prepared},
+	"Check expected table data on primary");
+
+###############################################################################
+# Check handling of orphaned 2PC files at recovery.
+###############################################################################
+
+$cur_standby->teardown_node;
+$cur_primary->teardown_node;
+
+# Grab location in logs of primary
+$log_offset = -s $cur_primary->logfile;
+
+# Create fake files with a transaction ID large or low enough to be in the
+# future or the past, in different epochs, then check that the primary is able
+# to start and remove these files at recovery.
+
+# First bump the epoch with pg_resetwal.
+$cur_primary->command_ok(
+	[ 'pg_resetwal', '-e', 256, '-f', $cur_primary->data_dir ],
+	'bump epoch of primary');
+
+my $future_2pc_file =
+  $cur_primary->data_dir . '/pg_twophase/000001FF00000FFF';
+append_to_file $future_2pc_file, "";
+my $past_2pc_file = $cur_primary->data_dir . '/pg_twophase/000000EE00000FFF';
+append_to_file $past_2pc_file, "";
+
+$cur_primary->start;
+$cur_primary->log_check(
+	"two-phase files removed at recovery",
+	$log_offset,
+	log_like => [
+		qr/removing past two-phase state file for transaction 4095 of epoch 238/,
+		qr/removing future two-phase state file for transaction 4095 of epoch 511/
+	]);
+
 done_testing();
-- 
2.47.2

v2-0002-Fix-issues-with-2PC-file-handling-at-recov-master.patchtext/x-diff; charset=us-asciiDownload

From 56c94091614a884f08ca2aa11e96a1718605e1d9 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Fri, 31 Jan 2025 08:47:24 +0900
Subject: [PATCH v2 2/2] Fix issues with 2PC file handling at recovery

This addresses two issues:
- Avoid CLOG file lookups until we are sure that this is safe.  This is
now done at the end of recovery.
- Avoid mishandling of 2PC shmem state data.

Tests are added to show the problems possible.

Backpatch-through: 13
---
 src/backend/access/transam/twophase.c | 125 ++++++++++++-----------
 src/test/recovery/t/009_twophase.pl   | 140 ++++++++++++++++++++++++++
 2 files changed, 209 insertions(+), 56 deletions(-)

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 4f5b4542662..c8a8d774b10 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1893,13 +1893,16 @@ CheckPointTwoPhase(XLogRecPtr redo_horizon)
  * Scan pg_twophase and fill TwoPhaseState depending on the on-disk data.
  * This is called once at the beginning of recovery, saving any extra
  * lookups in the future.  Two-phase files that are newer than the
- * minimum XID horizon are discarded on the way.
+ * minimum XID horizon are discarded on the way, as much as files that
+ * are older than the oldest XID horizon.
  */
 void
 restoreTwoPhaseData(void)
 {
 	DIR		   *cldir;
 	struct dirent *clde;
+	FullTransactionId nextXid = TransamVariables->nextXid;
+	FullTransactionId oldestXid = AdjustToFullTransactionId(TransamVariables->oldestXid);
 
 	LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
 	cldir = AllocateDir(TWOPHASE_DIR);
@@ -1913,10 +1916,26 @@ restoreTwoPhaseData(void)
 
 			fxid = FullTransactionIdFromU64(strtou64(clde->d_name, NULL, 16));
 
+			/* Reject XID if too new or too old */
+			if (FullTransactionIdFollowsOrEquals(fxid, nextXid) ||
+				FullTransactionIdPrecedes(fxid, oldestXid))
+			{
+				if (FullTransactionIdFollowsOrEquals(fxid, nextXid))
+					ereport(WARNING,
+							(errmsg("removing future two-phase state file for transaction %u of epoch %u",
+									XidFromFullTransactionId(fxid),
+									EpochFromFullTransactionId(fxid))));
+				else
+					ereport(WARNING,
+							(errmsg("removing past two-phase state file for transaction %u of epoch %u",
+									XidFromFullTransactionId(fxid),
+									EpochFromFullTransactionId(fxid))));
+				RemoveTwoPhaseFile(fxid, true);
+				continue;
+			}
+
 			buf = ProcessTwoPhaseBuffer(fxid, InvalidXLogRecPtr,
 										true, false, false);
-			if (buf == NULL)
-				continue;
 
 			PrepareRedoAdd(fxid, buf, InvalidXLogRecPtr,
 						   InvalidXLogRecPtr, InvalidRepOriginId);
@@ -1982,9 +2001,6 @@ PrescanPreparedTransactions(TransactionId **xids_p, int *nxids_p)
 									gxact->prepare_start_lsn,
 									gxact->ondisk, false, true);
 
-		if (buf == NULL)
-			continue;
-
 		/*
 		 * OK, we think this file is valid.  Incorporate xid into the
 		 * running-minimum result.
@@ -2052,8 +2068,7 @@ StandbyRecoverPreparedTransactions(void)
 		buf = ProcessTwoPhaseBuffer(gxact->fxid,
 									gxact->prepare_start_lsn,
 									gxact->ondisk, true, false);
-		if (buf != NULL)
-			pfree(buf);
+		pfree(buf);
 	}
 	LWLockRelease(TwoPhaseStateLock);
 }
@@ -2078,8 +2093,21 @@ void
 RecoverPreparedTransactions(void)
 {
 	int			i;
+	FullTransactionId *remove_fxids;
+	int			remove_fxids_cnt;
 
 	LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
+
+	/*
+	 * Track XIDs candidate for removal if found as already committed or
+	 * aborted, once the first scan through TwoPhaseState is done.  This
+	 * cannot happen while going through the entries in TwoPhaseState as
+	 * PrepareRedoRemove() manipulates it.
+	 */
+	remove_fxids_cnt = 0;
+	remove_fxids = (FullTransactionId *) palloc(TwoPhaseState->numPrepXacts *
+												sizeof(FullTransactionId));
+
 	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
 	{
 		char	   *buf;
@@ -2090,6 +2118,27 @@ RecoverPreparedTransactions(void)
 		TransactionId *subxids;
 		const char *gid;
 
+		/*
+		 * Is this transaction already aborted or committed?  If yes, mark it
+		 * for removal.
+		 *
+		 * Checking CLOGs if these transactions have been already aborted or
+		 * committed is safe at this stage; we are at the end of recovery and
+		 * all WAL has been replayed, all 2PC transactions are reinstated and
+		 * should be tracked in TwoPhaseState.
+		 */
+		if (TransactionIdDidCommit(XidFromFullTransactionId(fxid)) ||
+			TransactionIdDidAbort(XidFromFullTransactionId(fxid)))
+		{
+			/*
+			 * Track this transaction ID for its removal from the shared
+			 * memory state at the end.
+			 */
+			remove_fxids[remove_fxids_cnt] = fxid;
+			remove_fxids_cnt++;
+			continue;
+		}
+
 		/*
 		 * Reconstruct subtrans state for the transaction --- needed because
 		 * pg_subtrans is not preserved over a restart.  Note that we are
@@ -2102,8 +2151,6 @@ RecoverPreparedTransactions(void)
 		buf = ProcessTwoPhaseBuffer(gxact->fxid,
 									gxact->prepare_start_lsn,
 									gxact->ondisk, true, false);
-		if (buf == NULL)
-			continue;
 
 		ereport(LOG,
 				(errmsg("recovering prepared transaction %u of epoch %u from shared memory",
@@ -2164,6 +2211,18 @@ RecoverPreparedTransactions(void)
 		LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
 	}
 
+	for (i = 0; i < remove_fxids_cnt; i++)
+	{
+		FullTransactionId fxid = remove_fxids[i];
+
+		ereport(WARNING,
+				(errmsg("removing stale two-phase state from memory for transaction %u of epoch %u",
+						XidFromFullTransactionId(fxid),
+						EpochFromFullTransactionId(fxid))));
+
+		PrepareRedoRemoveFull(fxid, true);
+	}
+
 	LWLockRelease(TwoPhaseStateLock);
 }
 
@@ -2184,7 +2243,6 @@ ProcessTwoPhaseBuffer(FullTransactionId fxid,
 					  bool fromdisk,
 					  bool setParent, bool setNextXid)
 {
-	FullTransactionId nextXid = TransamVariables->nextXid;
 	TransactionId *subxids;
 	char	   *buf;
 	TwoPhaseFileHeader *hdr;
@@ -2195,51 +2253,6 @@ ProcessTwoPhaseBuffer(FullTransactionId fxid,
 	if (!fromdisk)
 		Assert(prepare_start_lsn != InvalidXLogRecPtr);
 
-	/* Already processed? */
-	if (TransactionIdDidCommit(XidFromFullTransactionId(fxid)) ||
-		TransactionIdDidAbort(XidFromFullTransactionId(fxid)))
-	{
-		if (fromdisk)
-		{
-			ereport(WARNING,
-					(errmsg("removing stale two-phase state file for transaction %u of epoch %u",
-							XidFromFullTransactionId(fxid),
-							EpochFromFullTransactionId(fxid))));
-			RemoveTwoPhaseFile(fxid, true);
-		}
-		else
-		{
-			ereport(WARNING,
-					(errmsg("removing stale two-phase state from memory for transaction %u of epoch %u",
-							XidFromFullTransactionId(fxid),
-							EpochFromFullTransactionId(fxid))));
-			PrepareRedoRemoveFull(fxid, true);
-		}
-		return NULL;
-	}
-
-	/* Reject XID if too new */
-	if (FullTransactionIdFollowsOrEquals(fxid, nextXid))
-	{
-		if (fromdisk)
-		{
-			ereport(WARNING,
-					(errmsg("removing future two-phase state file for transaction %u of epoch %u",
-							XidFromFullTransactionId(fxid),
-							EpochFromFullTransactionId(fxid))));
-			RemoveTwoPhaseFile(fxid, true);
-		}
-		else
-		{
-			ereport(WARNING,
-					(errmsg("removing future two-phase state from memory for transaction %u of epoch %u",
-							XidFromFullTransactionId(fxid),
-							EpochFromFullTransactionId(fxid))));
-			PrepareRedoRemoveFull(fxid, true);
-		}
-		return NULL;
-	}
-
 	if (fromdisk)
 	{
 		/* Read and validate file */
diff --git a/src/test/recovery/t/009_twophase.pl b/src/test/recovery/t/009_twophase.pl
index 1a662ebe499..3a3714a2d8b 100644
--- a/src/test/recovery/t/009_twophase.pl
+++ b/src/test/recovery/t/009_twophase.pl
@@ -5,6 +5,7 @@
 use strict;
 use warnings FATAL => 'all';
 
+use File::Copy;
 use PostgreSQL::Test::Cluster;
 use PostgreSQL::Test::Utils;
 use Test::More;
@@ -28,6 +29,15 @@ sub configure_and_reload
 	return;
 }
 
+sub twophase_file_name
+{
+	local $Test::Builder::Level = $Test::Builder::Level + 1;
+
+	my $epoch = shift;
+	my $xid = shift;
+	return sprintf("%08X%08X", $epoch, $xid);
+}
+
 # Set up two nodes, which will alternately be primary and replication standby.
 
 # Setup london node
@@ -572,4 +582,134 @@ my $nsubtrans = $cur_primary->safe_psql('postgres',
 );
 isnt($osubtrans, $nsubtrans, "contents of pg_subtrans/ have changed");
 
+###############################################################################
+# Check handling of already committed or aborted 2PC files at recovery.
+# This test does a manual copy of 2PC files created in a running server,
+# to cheaply emulate situations that could be found in base backups.
+###############################################################################
+
+# Issue a set of transactions that will be used for this portion of the test:
+# - One transaction to hold on the minimum xid horizon at bay.
+# - One transaction that will be found as already committed at recovery.
+# - One transaction that will be fonnd as already rollbacked at recovery.
+$cur_primary->psql(
+	'postgres', "
+	BEGIN;
+	INSERT INTO t_009_tbl VALUES (40, 'transaction: xid horizon');
+	PREPARE TRANSACTION 'xact_009_40';
+	BEGIN;
+	INSERT INTO t_009_tbl VALUES (41, 'transaction: commit-prepared');
+	PREPARE TRANSACTION 'xact_009_41';
+	BEGIN;
+	INSERT INTO t_009_tbl VALUES (42, 'transaction: rollback-prepared');
+	PREPARE TRANSACTION 'xact_009_42';");
+
+# Issue a checkpoint, fixing the XID horizon based on the first transaction,
+# flushing to disk the two files to use.
+$cur_primary->psql('postgres', "CHECKPOINT");
+
+# Get the transaction IDs of the ones to 2PC files to manipulate.
+my $commit_prepared_xid = int(
+	$cur_primary->safe_psql(
+		'postgres',
+		"SELECT transaction FROM pg_prepared_xacts WHERE gid = 'xact_009_41'")
+);
+my $abort_prepared_xid = int(
+	$cur_primary->safe_psql(
+		'postgres',
+		"SELECT transaction FROM pg_prepared_xacts WHERE gid = 'xact_009_42'")
+);
+
+# Copy the two-phase files that will be put back later.  Assume an
+# epoch of 0.
+my $commit_prepared_name = twophase_file_name(0, $commit_prepared_xid);
+my $abort_prepared_name = twophase_file_name(0, $abort_prepared_xid);
+
+my $twophase_tmpdir = $PostgreSQL::Test::Utils::tmp_check . '/' . "2pc_files";
+mkdir($twophase_tmpdir);
+my $primary_twophase_folder = $cur_primary->data_dir . '/pg_twophase/';
+copy("$primary_twophase_folder/$commit_prepared_name", $twophase_tmpdir);
+copy("$primary_twophase_folder/$abort_prepared_name", $twophase_tmpdir);
+
+# Issue abort/commit prepared.
+$cur_primary->psql('postgres', "COMMIT PREPARED 'xact_009_41'");
+$cur_primary->psql('postgres', "ROLLBACK PREPARED 'xact_009_42'");
+
+# Again checkpoint, to advance the LSN past the point where the two previous
+# transaction records would be replayed.
+$cur_primary->psql('postgres', "CHECKPOINT");
+
+# Take down node.
+$cur_primary->teardown_node;
+
+# Move back the two twophase files.
+copy("$twophase_tmpdir/$commit_prepared_name", $primary_twophase_folder);
+copy("$twophase_tmpdir/$abort_prepared_name", $primary_twophase_folder);
+
+# Grab location in logs of primary
+my $log_offset = -s $cur_primary->logfile;
+
+# Start node and check that the two previous files are removed by checking the
+# server logs, following the CLOG lookup done at the end of recovery.
+$cur_primary->start;
+
+$cur_primary->log_check(
+	"two-phase files of committed transactions removed at recovery",
+	$log_offset,
+	log_like => [
+		qr/removing stale two-phase state from memory for transaction $commit_prepared_xid of epoch 0/,
+		qr/removing stale two-phase state from memory for transaction $abort_prepared_xid of epoch 0/
+	]);
+
+# Commit the first transaction.
+$cur_primary->psql('postgres', "COMMIT PREPARED 'xact_009_40'");
+# After replay, there should be no 2PC transactions.
+$cur_primary->psql(
+	'postgres',
+	"SELECT * FROM pg_prepared_xact",
+	stdout => \$psql_out);
+is($psql_out, qq{}, "Check expected pg_prepared_xact data on primary");
+# Data from transactions should be around.
+$cur_primary->psql(
+	'postgres',
+	"SELECT * FROM t_009_tbl WHERE id IN (40, 41, 42);",
+	stdout => \$psql_out);
+is( $psql_out, qq{40|transaction: xid horizon
+41|transaction: commit-prepared},
+	"Check expected table data on primary");
+
+###############################################################################
+# Check handling of orphaned 2PC files at recovery.
+###############################################################################
+
+$cur_standby->teardown_node;
+$cur_primary->teardown_node;
+
+# Grab location in logs of primary
+$log_offset = -s $cur_primary->logfile;
+
+# Create fake files with a transaction ID large or low enough to be in the
+# future or the past, in different epochs, then check that the primary is able
+# to start and remove these files at recovery.
+
+# First bump the epoch with pg_resetwal.
+$cur_primary->command_ok(
+	[ 'pg_resetwal', '-e', 256, '-f', $cur_primary->data_dir ],
+	'bump epoch of primary');
+
+my $future_2pc_file =
+  $cur_primary->data_dir . '/pg_twophase/000001FF00000FFF';
+append_to_file $future_2pc_file, "";
+my $past_2pc_file = $cur_primary->data_dir . '/pg_twophase/000000EE00000FFF';
+append_to_file $past_2pc_file, "";
+
+$cur_primary->start;
+$cur_primary->log_check(
+	"two-phase files removed at recovery",
+	$log_offset,
+	log_like => [
+		qr/removing past two-phase state file for transaction 4095 of epoch 238/,
+		qr/removing future two-phase state file for transaction 4095 of epoch 511/
+	]);
+
 done_testing();
-- 
2.47.2

Vitaly Davydov

v.davydov@postgrespro.ru

12 months ago

In reply to: Michael Paquier (#2)

Re: Issues with 2PC at recovery: CLOG lookups and GlobalTransactionData

On Friday, January 31, 2025 03:21 MSK, Michael Paquier <michael@paquier.xyz>
wrote:

Thoughts and comments are welcome.

I'm looking at the v13 patch. I see, there is the only file for v13:
v2-0002-Fix-issues-with-2PC-file-handling-at-recovery-13.txt

There are two points I would like to highlight:

#1. In RecoverPreparedTransactions we iterave over all in-memory two phase
states and check every xid state in CLOG unconditionally. Image, we have
a very old transaction that is close to the oldestXid. Will CLOG state be
available for such transaction? I'm not sure about it.

#2. In restoreTwoPhaseData we load all the twostate files that are in
the valid xid range (from oldestXid to nextFullXid in terms of logical
comparision of xids with epoch). The question - should we load files
which xids greater than ControlFile->checkPointCopy.nextXid
(xids after last checkpoint). In general, all twophase files should belong
to xids before the last checkpoint. I guess, we should probably ignore
files which xid is equal or greater than the xid of the last checkpoint -
twostate data should be in the WAL. If not, I guess, we may see error messages
like show below when doing xact_redo -> PrepareRedoAdd:

Two-phase state file has been found in WAL record %X/%X, but this transaction
has already been restored from disk

I'm not sure about the logic related to this message in PrepareRedoAdd.

P.S. Thank you for responses on my emails. I also apologize for the
formatting of my emails. I will check what is wrong and fix.

With best regards,
Vitaly

Michael Paquier

michael@paquier.xyz

11 months ago

In reply to: Vitaly Davydov (#3)

Re: Issues with 2PC at recovery: CLOG lookups and GlobalTransactionData

On Fri, Jan 31, 2025 at 04:54:17PM +0300, Vitaly Davydov wrote:

I'm looking at the v13 patch. I see, there is the only file for v13:
v2-0002-Fix-issues-with-2PC-file-handling-at-recovery-13.txt

Yes. The fixes in 13~16 are simpler. There are so many conflicts
across all the branches that I'm planning to maintain branches for
each stable versions, depending on the discussion going on. The
refactoring of 17~ and HEAD could be done first, I suppose.

#1. In RecoverPreparedTransactions we iterave over all in-memory two phase
states and check every xid state in CLOG unconditionally. Image, we have
a very old transaction that is close to the oldestXid. Will CLOG state be
available for such transaction? I'm not sure about it.

Yes. RecoverPreparedTransactions() where the cluster is officially at
the end of recovery, and we've checked that we are consistent.

#2. In restoreTwoPhaseData we load all the twostate files that are in
the valid xid range (from oldestXid to nextFullXid in terms of logical
comparision of xids with epoch). The question - should we load files
which xids greater than ControlFile->checkPointCopy.nextXid
(xids after last checkpoint). In general, all twophase files should belong
to xids before the last checkpoint. I guess, we should probably ignore
files which xid is equal or greater than the xid of the last checkpoint -
twostate data should be in the WAL. If not, I guess, we may see error messages
like show below when doing xact_redo -> PrepareRedoAdd:

We are already using the nextFullXid from ShmemVariableCache, which is
something that StartupXLOG() fetches from the checkpoint record we
define as the starting point of recovery, which provides a protection
good enough. Grabbing this information from checkPointCopy.nextXid
would be actually worse when recovering from a backup_label, no? The
copy of the checkpoint record in the control file would be easily
newer than the checkpoint record retrieved from the LSN in the
backup_label, leading to less files considered as in the future. I'd
rather trust what we have in WAL a maximum for this data.

Note that there is a minor release next week on the roadmap, and for
clarity I am not planning to rush this stuff in the tree this week:
https://www.postgresql.org/developer/roadmap/

Let's give it more time to find a solution we're all OK with. I'd
love to hear from Noah here as well, as he got involved in the
discussion of the other thread.
--
Michael

Noah Misch

noah@leadboat.com

11 months ago

In reply to: Michael Paquier (#2)

Re: Issues with 2PC at recovery: CLOG lookups and GlobalTransactionData

On Thu, Jan 30, 2025 at 03:36:20PM +0900, Michael Paquier wrote:

And I am beginning a new thread about going through an issue that Noah
has mentioned at [1], which is that the 2PC code may attempt to do
CLOG lookups at very early stage of recovery, where the cluster is not
in a consistent state.

It's broader than CLOG lookups. I wrote in [1], "We must not read the old
pg_twophase file, which may contain an unfinished write." Until recovery
reaches consistency, none of the checks in ProcessTwoPhaseBuffer() or its
callee ReadTwoPhaseFile() are safe.

[1]: /messages/by-id/20250117005221.05.nmisch@google.com
[2]: /messages/by-id/20250116205254.65.nmisch@google.com

On Fri, Jan 31, 2025 at 09:21:53AM +0900, Michael Paquier wrote:

--- a/src/test/recovery/t/009_twophase.pl
+++ b/src/test/recovery/t/009_twophase.pl

+	log_like => [
+		qr/removing stale two-phase state from memory for transaction $commit_prepared_xid of epoch 0/,
+		qr/removing stale two-phase state from memory for transaction $abort_prepared_xid of epoch 0/
+	]);

+	log_like => [
+		qr/removing past two-phase state file for transaction 4095 of epoch 238/,
+		qr/removing future two-phase state file for transaction 4095 of epoch 511/
+	]);

As I wrote in [1], "By the time we reach consistency, every file in
pg_twophase will be applicable (not committed or aborted)." If we find
otherwise, the user didn't follow the backup protocol (or there's another
bug). Hence, long-term, we should stop these removals and just fail recovery.
We can't fix all data loss consequences of not following the backup protocol,
so the biggest favor we can do the user is draw their attention to the
problem. How do you see it?

For back branches, the ideal is less clear. If we can convince ourselves that
enough of these events will indicate damaging problems (user error, hardware
failure, or PostgreSQL bugs), the long-term ideal of failing recovery is also
right for back branches. However, it could be too hard to convince ourselves
of that. If so, that could justify keeping these removals in back branches.

Michael Paquier

michael@paquier.xyz

8 months ago

In reply to: Noah Misch (#5)

Re: Issues with 2PC at recovery: CLOG lookups and GlobalTransactionData

On Tue, Feb 18, 2025 at 04:57:47PM -0800, Noah Misch wrote:

As I wrote in [1], "By the time we reach consistency, every file in
pg_twophase will be applicable (not committed or aborted)." If we find
otherwise, the user didn't follow the backup protocol (or there's another
bug). Hence, long-term, we should stop these removals and just fail recovery.
We can't fix all data loss consequences of not following the backup protocol,
so the biggest favor we can do the user is draw their attention to the
problem. How do you see it?

Deciding to not trust at all any of the contents of pg_twophase/ until
consistency is reached is not something we should aim for, IMO. Going
in this direction would mean to delay restoreTwoPhaseData() until
consistency is reached, but there are cases where we can read that
safely, and where we should do so. For example, this flow is
perfectly OK to do in the wasShutdown case, where
PrescanPreparedTransactions() would be able to do its initialization
job before performing WAL recovery to get a clean list of running
XIDs.

I agree that moving towards a solution where we get rid entirely of
the CLOG lookups in ProcessTwoPhaseBuffer() is what we should aim for,
and actually is there a reason to not just nuke and replace them
something based on the checkpoint record itself? I have to admit that
I don't quite see the issue with ReadTwoPhaseFile() when it comes to
crash recovery. For example, in the case of a partial write, doesn't
the CRC32 check offer some protection about the contents of the file?
Wouldn't it be OK in this case to assume that the contents of this
file will be in WAL anyway? We have one case based on
reachedConsistency in PrepareRedoAdd(), should this happen, for
crashes happening during checkpoints where we know that the files
could be found in WAL but they could have been loaded at the beginning
of recovery.

The base backup issue is a different one, of course, and I think that
we are going to require more data in the 2PC file to provide a better
cross-check barrier, which would be the addition to the 2PC file of
the end LSN where the 2PC file record has been inserted. Then we
could cross-check that with the redo location, and see that it's
actually safe to discard the file because we know it will be in WAL.
This seems like a hefty cost to pay for, though, meaning 8 bytes in
each 2PC file because base backups were done wrong. Bleh.

One extra thing that I have mentioned is that we could replace the
CLOG safeguards based on what we know from the checkpoint record based
on the oldest XID horizon of the checkpoint record and its next XID:
- If we have a 2PC file older than the oldest XID horizon, we know
that it should not exist.
- If we have a 2PC file newer than the next XID, same, or we'll know
about it while replaying.

WDYT?

For back branches, the ideal is less clear. If we can convince ourselves that
enough of these events will indicate damaging problems (user error, hardware
failure, or PostgreSQL bugs), the long-term ideal of failing recovery is also
right for back branches. However, it could be too hard to convince ourselves
of that. If so, that could justify keeping these removals in back branches.

While I would like to see something in back-branches, I am not seeing
occurences of the current behavior being hit by users in the wild, so
I'd rather consider that as material only for v19 at this stage. If
we figure out a clear picture on HEAD, perhaps it will be possible to
salvage some of it for the back branches, but I'm not clear if this
would be possible yet. That would be nice, but we'll see.
--
Michael

Michael Paquier

michael@paquier.xyz

8 months ago

In reply to: Michael Paquier (#6)

2 attachment(s)

Re: Issues with 2PC at recovery: CLOG lookups and GlobalTransactionData

On Fri, May 09, 2025 at 02:08:26PM +0900, Michael Paquier wrote:

One extra thing that I have mentioned is that we could replace the
CLOG safeguards based on what we know from the checkpoint record based
on the oldest XID horizon of the checkpoint record and its next XID:
- If we have a 2PC file older than the oldest XID horizon, we know
that it should not exist.
- If we have a 2PC file newer than the next XID, same, or we'll know
about it while replaying.

WDYT?

I have been going back and forth on this patch set for the last few
weeks. Please find a refreshed version as of the attached. These are
aimed only for v19 for the time being.

Patch 0001 is a refactoring of the existing 2PC code to integrate full
XIDs deeper into these code paths, easing the evaluation of the 2PC
files by not having to guess from which epoch they are. This is worth
improving on its own, and I'm hoing that this is acceptable as-is for
HEAD.

To summarize, patch 0002 provides fixes in the lines of what I have
mentioned upthread, based on the following lines:
- restoreTwoPhaseData() is in charge of checking the 2PC files in
pg_twophase/ at the beginning of recovery based on the oldest and
newest XID horizons retrieved from the checkpoint record, discarding
files seen as too new or too old.
- CLOG lookups are delayed at the end of recovery, handled by
RecoverPreparedTransactions() when we retore the 2PC data in shmem
filled during recovery. At this stage, checking for aborted and
committed transactions should be safe, the cluster is moved as ready
for WAL activity in the steps after this call.
- ProcessTwoPhaseBuffer() can never return NULL, does not include any
sanity checks anymore.

Thoughts or comments are welcome.
--
Michael

Attachments:

v3-0001-Integrate-more-FullTransactionIds-into-2PC-code.patchtext/x-diff; charset=us-asciiDownload

From f4d7dbf4b6eabac39752ee10938099e1c05ad937 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Tue, 20 May 2025 06:01:51 +0900
Subject: [PATCH v3 1/2] Integrate more FullTransactionIds into 2PC code

This refactoring is a follow-up of the work done in 5a1dfde8334b, that
has switched 2PC file names to use FullTransactionIds.  This will help
with the integration of solutions for the handling on-disk orphaned 2PC
files, without requiring a need to build the file names from epoch
numbers and TransactionId, avoiding a lot of back-and-forth between the
FullTransactionIds retrieved from the file names and how these are
passed around in the internal 2PC logic.

Note that the core of the change is use of a FullTransactionId instead
of a TransactionId in GlobalTransactionData, that tracks 2PC file
information in shared memory.
---
 src/include/access/multixact.h               |   9 +-
 src/include/access/twophase.h                |  11 +-
 src/include/access/twophase_rmgr.h           |   4 +-
 src/include/pgstat.h                         |   4 +-
 src/include/storage/lock.h                   |  11 +-
 src/include/storage/predicate.h              |   6 +-
 src/backend/access/transam/multixact.c       |  16 +-
 src/backend/access/transam/twophase.c        | 242 +++++++++++--------
 src/backend/access/transam/xact.c            |  13 +-
 src/backend/storage/lmgr/lock.c              |  20 +-
 src/backend/storage/lmgr/predicate.c         |  11 +-
 src/backend/utils/activity/pgstat_relation.c |   4 +-
 12 files changed, 197 insertions(+), 154 deletions(-)

diff --git a/src/include/access/multixact.h b/src/include/access/multixact.h
index 4e6b0eec2ff4..b876e98f46ed 100644
--- a/src/include/access/multixact.h
+++ b/src/include/access/multixact.h
@@ -11,6 +11,7 @@
 #ifndef MULTIXACT_H
 #define MULTIXACT_H
 
+#include "access/transam.h"
 #include "access/xlogreader.h"
 #include "lib/stringinfo.h"
 #include "storage/sync.h"
@@ -119,7 +120,7 @@ extern int	multixactmemberssyncfiletag(const FileTag *ftag, char *path);
 
 extern void AtEOXact_MultiXact(void);
 extern void AtPrepare_MultiXact(void);
-extern void PostPrepare_MultiXact(TransactionId xid);
+extern void PostPrepare_MultiXact(FullTransactionId fxid);
 
 extern Size MultiXactShmemSize(void);
 extern void MultiXactShmemInit(void);
@@ -145,11 +146,11 @@ extern void MultiXactAdvanceNextMXact(MultiXactId minMulti,
 extern void MultiXactAdvanceOldest(MultiXactId oldestMulti, Oid oldestMultiDB);
 extern int	MultiXactMemberFreezeThreshold(void);
 
-extern void multixact_twophase_recover(TransactionId xid, uint16 info,
+extern void multixact_twophase_recover(FullTransactionId fxid, uint16 info,
 									   void *recdata, uint32 len);
-extern void multixact_twophase_postcommit(TransactionId xid, uint16 info,
+extern void multixact_twophase_postcommit(FullTransactionId fxid, uint16 info,
 										  void *recdata, uint32 len);
-extern void multixact_twophase_postabort(TransactionId xid, uint16 info,
+extern void multixact_twophase_postabort(FullTransactionId fxid, uint16 info,
 										 void *recdata, uint32 len);
 
 extern void multixact_redo(XLogReaderState *record);
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 9fa823550337..509bdad9a5d5 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -36,10 +36,10 @@ extern void PostPrepare_Twophase(void);
 
 extern TransactionId TwoPhaseGetXidByVirtualXID(VirtualTransactionId vxid,
 												bool *have_more);
-extern PGPROC *TwoPhaseGetDummyProc(TransactionId xid, bool lock_held);
-extern int	TwoPhaseGetDummyProcNumber(TransactionId xid, bool lock_held);
+extern PGPROC *TwoPhaseGetDummyProc(FullTransactionId fxid, bool lock_held);
+extern int	TwoPhaseGetDummyProcNumber(FullTransactionId fxid, bool lock_held);
 
-extern GlobalTransaction MarkAsPreparing(TransactionId xid, const char *gid,
+extern GlobalTransaction MarkAsPreparing(FullTransactionId fxid, const char *gid,
 										 TimestampTz prepared_at,
 										 Oid owner, Oid databaseid);
 
@@ -56,8 +56,9 @@ extern void CheckPointTwoPhase(XLogRecPtr redo_horizon);
 
 extern void FinishPreparedTransaction(const char *gid, bool isCommit);
 
-extern void PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
-						   XLogRecPtr end_lsn, RepOriginId origin_id);
+extern void PrepareRedoAdd(FullTransactionId fxid, char *buf,
+						   XLogRecPtr start_lsn, XLogRecPtr end_lsn,
+						   RepOriginId origin_id);
 extern void PrepareRedoRemove(TransactionId xid, bool giveWarning);
 extern void restoreTwoPhaseData(void);
 extern bool LookupGXact(const char *gid, XLogRecPtr prepare_end_lsn,
diff --git a/src/include/access/twophase_rmgr.h b/src/include/access/twophase_rmgr.h
index 3ed154bb2312..8f576402e360 100644
--- a/src/include/access/twophase_rmgr.h
+++ b/src/include/access/twophase_rmgr.h
@@ -14,7 +14,9 @@
 #ifndef TWOPHASE_RMGR_H
 #define TWOPHASE_RMGR_H
 
-typedef void (*TwoPhaseCallback) (TransactionId xid, uint16 info,
+#include "access/transam.h"
+
+typedef void (*TwoPhaseCallback) (FullTransactionId fxid, uint16 info,
 								  void *recdata, uint32 len);
 typedef uint8 TwoPhaseRmgrId;
 
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 378f2f2c2ba2..202bd2d5aced 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -718,9 +718,9 @@ extern void pgstat_count_heap_delete(Relation rel);
 extern void pgstat_count_truncate(Relation rel);
 extern void pgstat_update_heap_dead_tuples(Relation rel, int delta);
 
-extern void pgstat_twophase_postcommit(TransactionId xid, uint16 info,
+extern void pgstat_twophase_postcommit(FullTransactionId fxid, uint16 info,
 									   void *recdata, uint32 len);
-extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
+extern void pgstat_twophase_postabort(FullTransactionId fxid, uint16 info,
 									  void *recdata, uint32 len);
 
 extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry(Oid relid);
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index 6f2108a44e8f..0f0fc446a197 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -18,6 +18,7 @@
 #error "lock.h may not be included from frontend code"
 #endif
 
+#include "access/transam.h"
 #include "lib/ilist.h"
 #include "storage/lockdefs.h"
 #include "storage/lwlock.h"
@@ -581,7 +582,7 @@ extern bool LockHasWaiters(const LOCKTAG *locktag,
 extern VirtualTransactionId *GetLockConflicts(const LOCKTAG *locktag,
 											  LOCKMODE lockmode, int *countp);
 extern void AtPrepare_Locks(void);
-extern void PostPrepare_Locks(TransactionId xid);
+extern void PostPrepare_Locks(FullTransactionId fxid);
 extern bool LockCheckConflicts(LockMethod lockMethodTable,
 							   LOCKMODE lockmode,
 							   LOCK *lock, PROCLOCK *proclock);
@@ -597,13 +598,13 @@ extern BlockedProcsData *GetBlockerStatusData(int blocked_pid);
 extern xl_standby_lock *GetRunningTransactionLocks(int *nlocks);
 extern const char *GetLockmodeName(LOCKMETHODID lockmethodid, LOCKMODE mode);
 
-extern void lock_twophase_recover(TransactionId xid, uint16 info,
+extern void lock_twophase_recover(FullTransactionId fxid, uint16 info,
 								  void *recdata, uint32 len);
-extern void lock_twophase_postcommit(TransactionId xid, uint16 info,
+extern void lock_twophase_postcommit(FullTransactionId fxid, uint16 info,
 									 void *recdata, uint32 len);
-extern void lock_twophase_postabort(TransactionId xid, uint16 info,
+extern void lock_twophase_postabort(FullTransactionId fxid, uint16 info,
 									void *recdata, uint32 len);
-extern void lock_twophase_standby_recover(TransactionId xid, uint16 info,
+extern void lock_twophase_standby_recover(FullTransactionId fxid, uint16 info,
 										  void *recdata, uint32 len);
 
 extern DeadLockState DeadLockCheck(PGPROC *proc);
diff --git a/src/include/storage/predicate.h b/src/include/storage/predicate.h
index 267d5d90e948..c1e3a4d9f64a 100644
--- a/src/include/storage/predicate.h
+++ b/src/include/storage/predicate.h
@@ -72,9 +72,9 @@ extern void PreCommit_CheckForSerializationFailure(void);
 
 /* two-phase commit support */
 extern void AtPrepare_PredicateLocks(void);
-extern void PostPrepare_PredicateLocks(TransactionId xid);
-extern void PredicateLockTwoPhaseFinish(TransactionId xid, bool isCommit);
-extern void predicatelock_twophase_recover(TransactionId xid, uint16 info,
+extern void PostPrepare_PredicateLocks(FullTransactionId fxid);
+extern void PredicateLockTwoPhaseFinish(FullTransactionId xid, bool isCommit);
+extern void predicatelock_twophase_recover(FullTransactionId fxid, uint16 info,
 										   void *recdata, uint32 len);
 
 /* parallel query support */
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 3c06ac45532f..7a7afe3edc67 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1847,7 +1847,7 @@ AtPrepare_MultiXact(void)
  *		Clean up after successful PREPARE TRANSACTION
  */
 void
-PostPrepare_MultiXact(TransactionId xid)
+PostPrepare_MultiXact(FullTransactionId fxid)
 {
 	MultiXactId myOldestMember;
 
@@ -1858,7 +1858,7 @@ PostPrepare_MultiXact(TransactionId xid)
 	myOldestMember = OldestMemberMXactId[MyProcNumber];
 	if (MultiXactIdIsValid(myOldestMember))
 	{
-		ProcNumber	dummyProcNumber = TwoPhaseGetDummyProcNumber(xid, false);
+		ProcNumber	dummyProcNumber = TwoPhaseGetDummyProcNumber(fxid, false);
 
 		/*
 		 * Even though storing MultiXactId is atomic, acquire lock to make
@@ -1896,10 +1896,10 @@ PostPrepare_MultiXact(TransactionId xid)
  *		Recover the state of a prepared transaction at startup
  */
 void
-multixact_twophase_recover(TransactionId xid, uint16 info,
+multixact_twophase_recover(FullTransactionId fxid, uint16 info,
 						   void *recdata, uint32 len)
 {
-	ProcNumber	dummyProcNumber = TwoPhaseGetDummyProcNumber(xid, false);
+	ProcNumber	dummyProcNumber = TwoPhaseGetDummyProcNumber(fxid, false);
 	MultiXactId oldestMember;
 
 	/*
@@ -1917,10 +1917,10 @@ multixact_twophase_recover(TransactionId xid, uint16 info,
  *		Similar to AtEOXact_MultiXact but for COMMIT PREPARED
  */
 void
-multixact_twophase_postcommit(TransactionId xid, uint16 info,
+multixact_twophase_postcommit(FullTransactionId fxid, uint16 info,
 							  void *recdata, uint32 len)
 {
-	ProcNumber	dummyProcNumber = TwoPhaseGetDummyProcNumber(xid, true);
+	ProcNumber	dummyProcNumber = TwoPhaseGetDummyProcNumber(fxid, true);
 
 	Assert(len == sizeof(MultiXactId));
 
@@ -1932,10 +1932,10 @@ multixact_twophase_postcommit(TransactionId xid, uint16 info,
  *		This is actually just the same as the COMMIT case.
  */
 void
-multixact_twophase_postabort(TransactionId xid, uint16 info,
+multixact_twophase_postabort(FullTransactionId fxid, uint16 info,
 							 void *recdata, uint32 len)
 {
-	multixact_twophase_postcommit(xid, info, recdata, len);
+	multixact_twophase_postcommit(fxid, info, recdata, len);
 }
 
 /*
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 73a80559194e..82324395f5a1 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -159,7 +159,7 @@ typedef struct GlobalTransactionData
 	 */
 	XLogRecPtr	prepare_start_lsn;	/* XLOG offset of prepare record start */
 	XLogRecPtr	prepare_end_lsn;	/* XLOG offset of prepare record end */
-	TransactionId xid;			/* The GXACT id */
+	FullTransactionId fxid;		/* The GXACT full xid */
 
 	Oid			owner;			/* ID of user that executed the xact */
 	ProcNumber	locking_backend;	/* backend currently working on the xact */
@@ -197,6 +197,7 @@ static GlobalTransaction MyLockedGxact = NULL;
 
 static bool twophaseExitRegistered = false;
 
+static void PrepareRedoRemoveFull(FullTransactionId fxid, bool giveWarning);
 static void RecordTransactionCommitPrepared(TransactionId xid,
 											int nchildren,
 											TransactionId *children,
@@ -216,19 +217,19 @@ static void RecordTransactionAbortPrepared(TransactionId xid,
 										   int nstats,
 										   xl_xact_stats_item *stats,
 										   const char *gid);
-static void ProcessRecords(char *bufptr, TransactionId xid,
+static void ProcessRecords(char *bufptr, FullTransactionId fxid,
 						   const TwoPhaseCallback callbacks[]);
 static void RemoveGXact(GlobalTransaction gxact);
 
 static void XlogReadTwoPhaseData(XLogRecPtr lsn, char **buf, int *len);
-static char *ProcessTwoPhaseBuffer(TransactionId xid,
+static char *ProcessTwoPhaseBuffer(FullTransactionId fxid,
 								   XLogRecPtr prepare_start_lsn,
 								   bool fromdisk, bool setParent, bool setNextXid);
-static void MarkAsPreparingGuts(GlobalTransaction gxact, TransactionId xid,
+static void MarkAsPreparingGuts(GlobalTransaction gxact, FullTransactionId fxid,
 								const char *gid, TimestampTz prepared_at, Oid owner,
 								Oid databaseid);
-static void RemoveTwoPhaseFile(TransactionId xid, bool giveWarning);
-static void RecreateTwoPhaseFile(TransactionId xid, void *content, int len);
+static void RemoveTwoPhaseFile(FullTransactionId fxid, bool giveWarning);
+static void RecreateTwoPhaseFile(FullTransactionId fxid, void *content, int len);
 
 /*
  * Initialization of shared memory
@@ -356,7 +357,7 @@ PostPrepare_Twophase(void)
  *		Reserve the GID for the given transaction.
  */
 GlobalTransaction
-MarkAsPreparing(TransactionId xid, const char *gid,
+MarkAsPreparing(FullTransactionId fxid, const char *gid,
 				TimestampTz prepared_at, Oid owner, Oid databaseid)
 {
 	GlobalTransaction gxact;
@@ -407,7 +408,7 @@ MarkAsPreparing(TransactionId xid, const char *gid,
 	gxact = TwoPhaseState->freeGXacts;
 	TwoPhaseState->freeGXacts = gxact->next;
 
-	MarkAsPreparingGuts(gxact, xid, gid, prepared_at, owner, databaseid);
+	MarkAsPreparingGuts(gxact, fxid, gid, prepared_at, owner, databaseid);
 
 	gxact->ondisk = false;
 
@@ -430,11 +431,13 @@ MarkAsPreparing(TransactionId xid, const char *gid,
  * Note: This function should be called with appropriate locks held.
  */
 static void
-MarkAsPreparingGuts(GlobalTransaction gxact, TransactionId xid, const char *gid,
-					TimestampTz prepared_at, Oid owner, Oid databaseid)
+MarkAsPreparingGuts(GlobalTransaction gxact, FullTransactionId fxid,
+					const char *gid, TimestampTz prepared_at, Oid owner,
+					Oid databaseid)
 {
 	PGPROC	   *proc;
 	int			i;
+	TransactionId xid = XidFromFullTransactionId(fxid);
 
 	Assert(LWLockHeldByMeInMode(TwoPhaseStateLock, LW_EXCLUSIVE));
 
@@ -479,7 +482,7 @@ MarkAsPreparingGuts(GlobalTransaction gxact, TransactionId xid, const char *gid,
 	proc->subxidStatus.count = 0;
 
 	gxact->prepared_at = prepared_at;
-	gxact->xid = xid;
+	gxact->fxid = fxid;
 	gxact->owner = owner;
 	gxact->locking_backend = MyProcNumber;
 	gxact->valid = false;
@@ -797,12 +800,12 @@ pg_prepared_xact(PG_FUNCTION_ARGS)
  * caller had better hold it.
  */
 static GlobalTransaction
-TwoPhaseGetGXact(TransactionId xid, bool lock_held)
+TwoPhaseGetGXact(FullTransactionId fxid, bool lock_held)
 {
 	GlobalTransaction result = NULL;
 	int			i;
 
-	static TransactionId cached_xid = InvalidTransactionId;
+	static FullTransactionId cached_fxid = {0};
 	static GlobalTransaction cached_gxact = NULL;
 
 	Assert(!lock_held || LWLockHeldByMe(TwoPhaseStateLock));
@@ -811,7 +814,7 @@ TwoPhaseGetGXact(TransactionId xid, bool lock_held)
 	 * During a recovery, COMMIT PREPARED, or ABORT PREPARED, we'll be called
 	 * repeatedly for the same XID.  We can save work with a simple cache.
 	 */
-	if (xid == cached_xid)
+	if (FullTransactionIdEquals(fxid, cached_fxid))
 		return cached_gxact;
 
 	if (!lock_held)
@@ -821,7 +824,7 @@ TwoPhaseGetGXact(TransactionId xid, bool lock_held)
 	{
 		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
 
-		if (gxact->xid == xid)
+		if (FullTransactionIdEquals(gxact->fxid, fxid))
 		{
 			result = gxact;
 			break;
@@ -832,9 +835,10 @@ TwoPhaseGetGXact(TransactionId xid, bool lock_held)
 		LWLockRelease(TwoPhaseStateLock);
 
 	if (result == NULL)			/* should not happen */
-		elog(ERROR, "failed to find GlobalTransaction for xid %u", xid);
+		elog(ERROR, "failed to find GlobalTransaction for xid %u",
+			 XidFromFullTransactionId(fxid));
 
-	cached_xid = xid;
+	cached_fxid = fxid;
 	cached_gxact = result;
 
 	return result;
@@ -881,7 +885,7 @@ TwoPhaseGetXidByVirtualXID(VirtualTransactionId vxid,
 				*have_more = true;
 				break;
 			}
-			result = gxact->xid;
+			result = XidFromFullTransactionId(gxact->fxid);
 		}
 	}
 
@@ -892,7 +896,7 @@ TwoPhaseGetXidByVirtualXID(VirtualTransactionId vxid,
 
 /*
  * TwoPhaseGetDummyProcNumber
- *		Get the dummy proc number for prepared transaction specified by XID
+ *		Get the dummy proc number for prepared transaction
  *
  * Dummy proc numbers are similar to proc numbers of real backends.  They
  * start at MaxBackends, and are unique across all currently active real
@@ -900,24 +904,24 @@ TwoPhaseGetXidByVirtualXID(VirtualTransactionId vxid,
  * TwoPhaseStateLock will not be taken, so the caller had better hold it.
  */
 ProcNumber
-TwoPhaseGetDummyProcNumber(TransactionId xid, bool lock_held)
+TwoPhaseGetDummyProcNumber(FullTransactionId fxid, bool lock_held)
 {
-	GlobalTransaction gxact = TwoPhaseGetGXact(xid, lock_held);
+	GlobalTransaction gxact = TwoPhaseGetGXact(fxid, lock_held);
 
 	return gxact->pgprocno;
 }
 
 /*
  * TwoPhaseGetDummyProc
- *		Get the PGPROC that represents a prepared transaction specified by XID
+ *		Get the PGPROC that represents a prepared transaction
  *
  * If lock_held is set to true, TwoPhaseStateLock will not be taken, so the
  * caller had better hold it.
  */
 PGPROC *
-TwoPhaseGetDummyProc(TransactionId xid, bool lock_held)
+TwoPhaseGetDummyProc(FullTransactionId fxid, bool lock_held)
 {
-	GlobalTransaction gxact = TwoPhaseGetGXact(xid, lock_held);
+	GlobalTransaction gxact = TwoPhaseGetGXact(fxid, lock_held);
 
 	return GetPGProcByNumber(gxact->pgprocno);
 }
@@ -942,10 +946,8 @@ AdjustToFullTransactionId(TransactionId xid)
 }
 
 static inline int
-TwoPhaseFilePath(char *path, TransactionId xid)
+TwoPhaseFilePath(char *path, FullTransactionId fxid)
 {
-	FullTransactionId fxid = AdjustToFullTransactionId(xid);
-
 	return snprintf(path, MAXPGPATH, TWOPHASE_DIR "/%08X%08X",
 					EpochFromFullTransactionId(fxid),
 					XidFromFullTransactionId(fxid));
@@ -1049,7 +1051,7 @@ void
 StartPrepare(GlobalTransaction gxact)
 {
 	PGPROC	   *proc = GetPGProcByNumber(gxact->pgprocno);
-	TransactionId xid = gxact->xid;
+	TransactionId xid = XidFromFullTransactionId(gxact->fxid);
 	TwoPhaseFileHeader hdr;
 	TransactionId *children;
 	RelFileLocator *commitrels;
@@ -1281,10 +1283,11 @@ RegisterTwoPhaseRecord(TwoPhaseRmgrId rmid, uint16 info,
  * If it looks OK (has a valid magic number and CRC), return the palloc'd
  * contents of the file, issuing an error when finding corrupted data.  If
  * missing_ok is true, which indicates that missing files can be safely
- * ignored, then return NULL.  This state can be reached when doing recovery.
+ * ignored, then return NULL.  This state can be reached when doing recovery
+ * after discarding two-phase files from frozen epochs.
  */
 static char *
-ReadTwoPhaseFile(TransactionId xid, bool missing_ok)
+ReadTwoPhaseFile(FullTransactionId fxid, bool missing_ok)
 {
 	char		path[MAXPGPATH];
 	char	   *buf;
@@ -1296,7 +1299,7 @@ ReadTwoPhaseFile(TransactionId xid, bool missing_ok)
 				file_crc;
 	int			r;
 
-	TwoPhaseFilePath(path, xid);
+	TwoPhaseFilePath(path, fxid);
 
 	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
 	if (fd < 0)
@@ -1461,6 +1464,7 @@ StandbyTransactionIdIsPrepared(TransactionId xid)
 	char	   *buf;
 	TwoPhaseFileHeader *hdr;
 	bool		result;
+	FullTransactionId fxid;
 
 	Assert(TransactionIdIsValid(xid));
 
@@ -1468,7 +1472,8 @@ StandbyTransactionIdIsPrepared(TransactionId xid)
 		return false;			/* nothing to do */
 
 	/* Read and validate file */
-	buf = ReadTwoPhaseFile(xid, true);
+	fxid = FullTransactionIdFromAllowableAt(TransamVariables->nextXid, xid);
+	buf = ReadTwoPhaseFile(fxid, true);
 	if (buf == NULL)
 		return false;
 
@@ -1488,6 +1493,7 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
 {
 	GlobalTransaction gxact;
 	PGPROC	   *proc;
+	FullTransactionId fxid;
 	TransactionId xid;
 	bool		ondisk;
 	char	   *buf;
@@ -1509,7 +1515,8 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
 	 */
 	gxact = LockGXact(gid, GetUserId());
 	proc = GetPGProcByNumber(gxact->pgprocno);
-	xid = gxact->xid;
+	fxid = gxact->fxid;
+	xid = XidFromFullTransactionId(fxid);
 
 	/*
 	 * Read and validate 2PC state data. State data will typically be stored
@@ -1517,7 +1524,7 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
 	 * to disk if for some reason they have lived for a long time.
 	 */
 	if (gxact->ondisk)
-		buf = ReadTwoPhaseFile(xid, false);
+		buf = ReadTwoPhaseFile(fxid, false);
 	else
 		XlogReadTwoPhaseData(gxact->prepare_start_lsn, &buf, NULL);
 
@@ -1636,11 +1643,11 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
 
 	/* And now do the callbacks */
 	if (isCommit)
-		ProcessRecords(bufptr, xid, twophase_postcommit_callbacks);
+		ProcessRecords(bufptr, fxid, twophase_postcommit_callbacks);
 	else
-		ProcessRecords(bufptr, xid, twophase_postabort_callbacks);
+		ProcessRecords(bufptr, fxid, twophase_postabort_callbacks);
 
-	PredicateLockTwoPhaseFinish(xid, isCommit);
+	PredicateLockTwoPhaseFinish(fxid, isCommit);
 
 	/*
 	 * Read this value while holding the two-phase lock, as the on-disk 2PC
@@ -1664,7 +1671,7 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
 	 * And now we can clean up any files we may have left.
 	 */
 	if (ondisk)
-		RemoveTwoPhaseFile(xid, true);
+		RemoveTwoPhaseFile(fxid, true);
 
 	MyLockedGxact = NULL;
 
@@ -1677,7 +1684,7 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
  * Scan 2PC state data in memory and call the indicated callbacks for each 2PC record.
  */
 static void
-ProcessRecords(char *bufptr, TransactionId xid,
+ProcessRecords(char *bufptr, FullTransactionId fxid,
 			   const TwoPhaseCallback callbacks[])
 {
 	for (;;)
@@ -1691,24 +1698,28 @@ ProcessRecords(char *bufptr, TransactionId xid,
 		bufptr += MAXALIGN(sizeof(TwoPhaseRecordOnDisk));
 
 		if (callbacks[record->rmid] != NULL)
-			callbacks[record->rmid] (xid, record->info, bufptr, record->len);
+			callbacks[record->rmid] (fxid, record->info, bufptr, record->len);
 
 		bufptr += MAXALIGN(record->len);
 	}
 }
 
 /*
- * Remove the 2PC file for the specified XID.
+ * Remove the 2PC file.
  *
  * If giveWarning is false, do not complain about file-not-present;
  * this is an expected case during WAL replay.
+ *
+ * This routine is used at early stages at recovery where future and
+ * past orphaned files are checked, hence the FullTransactionId to build
+ * a complete file name fit for the removal.
  */
 static void
-RemoveTwoPhaseFile(TransactionId xid, bool giveWarning)
+RemoveTwoPhaseFile(FullTransactionId fxid, bool giveWarning)
 {
 	char		path[MAXPGPATH];
 
-	TwoPhaseFilePath(path, xid);
+	TwoPhaseFilePath(path, fxid);
 	if (unlink(path))
 		if (errno != ENOENT || giveWarning)
 			ereport(WARNING,
@@ -1723,7 +1734,7 @@ RemoveTwoPhaseFile(TransactionId xid, bool giveWarning)
  * Note: content and len don't include CRC.
  */
 static void
-RecreateTwoPhaseFile(TransactionId xid, void *content, int len)
+RecreateTwoPhaseFile(FullTransactionId fxid, void *content, int len)
 {
 	char		path[MAXPGPATH];
 	pg_crc32c	statefile_crc;
@@ -1734,7 +1745,7 @@ RecreateTwoPhaseFile(TransactionId xid, void *content, int len)
 	COMP_CRC32C(statefile_crc, content, len);
 	FIN_CRC32C(statefile_crc);
 
-	TwoPhaseFilePath(path, xid);
+	TwoPhaseFilePath(path, fxid);
 
 	fd = OpenTransientFile(path,
 						   O_CREAT | O_TRUNC | O_WRONLY | PG_BINARY);
@@ -1846,7 +1857,7 @@ CheckPointTwoPhase(XLogRecPtr redo_horizon)
 			int			len;
 
 			XlogReadTwoPhaseData(gxact->prepare_start_lsn, &buf, &len);
-			RecreateTwoPhaseFile(gxact->xid, buf, len);
+			RecreateTwoPhaseFile(gxact->fxid, buf, len);
 			gxact->ondisk = true;
 			gxact->prepare_start_lsn = InvalidXLogRecPtr;
 			gxact->prepare_end_lsn = InvalidXLogRecPtr;
@@ -1897,19 +1908,17 @@ restoreTwoPhaseData(void)
 		if (strlen(clde->d_name) == 16 &&
 			strspn(clde->d_name, "0123456789ABCDEF") == 16)
 		{
-			TransactionId xid;
 			FullTransactionId fxid;
 			char	   *buf;
 
 			fxid = FullTransactionIdFromU64(strtou64(clde->d_name, NULL, 16));
-			xid = XidFromFullTransactionId(fxid);
 
-			buf = ProcessTwoPhaseBuffer(xid, InvalidXLogRecPtr,
+			buf = ProcessTwoPhaseBuffer(fxid, InvalidXLogRecPtr,
 										true, false, false);
 			if (buf == NULL)
 				continue;
 
-			PrepareRedoAdd(buf, InvalidXLogRecPtr,
+			PrepareRedoAdd(fxid, buf, InvalidXLogRecPtr,
 						   InvalidXLogRecPtr, InvalidRepOriginId);
 		}
 	}
@@ -1968,9 +1977,7 @@ PrescanPreparedTransactions(TransactionId **xids_p, int *nxids_p)
 
 		Assert(gxact->inredo);
 
-		xid = gxact->xid;
-
-		buf = ProcessTwoPhaseBuffer(xid,
+		buf = ProcessTwoPhaseBuffer(gxact->fxid,
 									gxact->prepare_start_lsn,
 									gxact->ondisk, false, true);
 
@@ -1981,6 +1988,7 @@ PrescanPreparedTransactions(TransactionId **xids_p, int *nxids_p)
 		 * OK, we think this file is valid.  Incorporate xid into the
 		 * running-minimum result.
 		 */
+		xid = XidFromFullTransactionId(gxact->fxid);
 		if (TransactionIdPrecedes(xid, result))
 			result = xid;
 
@@ -2036,15 +2044,12 @@ StandbyRecoverPreparedTransactions(void)
 	LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
 	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
 	{
-		TransactionId xid;
 		char	   *buf;
 		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
 
 		Assert(gxact->inredo);
 
-		xid = gxact->xid;
-
-		buf = ProcessTwoPhaseBuffer(xid,
+		buf = ProcessTwoPhaseBuffer(gxact->fxid,
 									gxact->prepare_start_lsn,
 									gxact->ondisk, true, false);
 		if (buf != NULL)
@@ -2077,16 +2082,14 @@ RecoverPreparedTransactions(void)
 	LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
 	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
 	{
-		TransactionId xid;
 		char	   *buf;
 		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+		FullTransactionId fxid = gxact->fxid;
 		char	   *bufptr;
 		TwoPhaseFileHeader *hdr;
 		TransactionId *subxids;
 		const char *gid;
 
-		xid = gxact->xid;
-
 		/*
 		 * Reconstruct subtrans state for the transaction --- needed because
 		 * pg_subtrans is not preserved over a restart.  Note that we are
@@ -2096,17 +2099,20 @@ RecoverPreparedTransactions(void)
 		 * SubTransSetParent has been set before, if the prepared transaction
 		 * generated xid assignment records.
 		 */
-		buf = ProcessTwoPhaseBuffer(xid,
+		buf = ProcessTwoPhaseBuffer(gxact->fxid,
 									gxact->prepare_start_lsn,
 									gxact->ondisk, true, false);
 		if (buf == NULL)
 			continue;
 
 		ereport(LOG,
-				(errmsg("recovering prepared transaction %u from shared memory", xid)));
+				(errmsg("recovering prepared transaction %u of epoch %u from shared memory",
+						XidFromFullTransactionId(gxact->fxid),
+						EpochFromFullTransactionId(gxact->fxid))));
 
 		hdr = (TwoPhaseFileHeader *) buf;
-		Assert(TransactionIdEquals(hdr->xid, xid));
+		Assert(TransactionIdEquals(hdr->xid,
+								   XidFromFullTransactionId(gxact->fxid)));
 		bufptr = buf + MAXALIGN(sizeof(TwoPhaseFileHeader));
 		gid = (const char *) bufptr;
 		bufptr += MAXALIGN(hdr->gidlen);
@@ -2122,7 +2128,7 @@ RecoverPreparedTransactions(void)
 		 * Recreate its GXACT and dummy PGPROC. But, check whether it was
 		 * added in redo and already has a shmem entry for it.
 		 */
-		MarkAsPreparingGuts(gxact, xid, gid,
+		MarkAsPreparingGuts(gxact, gxact->fxid, gid,
 							hdr->prepared_at,
 							hdr->owner, hdr->database);
 
@@ -2137,7 +2143,7 @@ RecoverPreparedTransactions(void)
 		/*
 		 * Recover other state (notably locks) using resource managers.
 		 */
-		ProcessRecords(bufptr, xid, twophase_recover_callbacks);
+		ProcessRecords(bufptr, fxid, twophase_recover_callbacks);
 
 		/*
 		 * Release locks held by the standby process after we process each
@@ -2145,7 +2151,7 @@ RecoverPreparedTransactions(void)
 		 * additional locks at any one time.
 		 */
 		if (InHotStandby)
-			StandbyReleaseLockTree(xid, hdr->nsubxacts, subxids);
+			StandbyReleaseLockTree(hdr->xid, hdr->nsubxacts, subxids);
 
 		/*
 		 * We're done with recovering this transaction. Clear MyLockedGxact,
@@ -2164,7 +2170,7 @@ RecoverPreparedTransactions(void)
 /*
  * ProcessTwoPhaseBuffer
  *
- * Given a transaction id, read it either from disk or read it directly
+ * Given a FullTransactionId, read it either from disk or read it directly
  * via shmem xlog record pointer using the provided "prepare_start_lsn".
  *
  * If setParent is true, set up subtransaction parent linkages.
@@ -2173,13 +2179,12 @@ RecoverPreparedTransactions(void)
  * value scanned.
  */
 static char *
-ProcessTwoPhaseBuffer(TransactionId xid,
+ProcessTwoPhaseBuffer(FullTransactionId fxid,
 					  XLogRecPtr prepare_start_lsn,
 					  bool fromdisk,
 					  bool setParent, bool setNextXid)
 {
 	FullTransactionId nextXid = TransamVariables->nextXid;
-	TransactionId origNextXid = XidFromFullTransactionId(nextXid);
 	TransactionId *subxids;
 	char	   *buf;
 	TwoPhaseFileHeader *hdr;
@@ -2191,41 +2196,46 @@ ProcessTwoPhaseBuffer(TransactionId xid,
 		Assert(prepare_start_lsn != InvalidXLogRecPtr);
 
 	/* Already processed? */
-	if (TransactionIdDidCommit(xid) || TransactionIdDidAbort(xid))
+	if (TransactionIdDidCommit(XidFromFullTransactionId(fxid)) ||
+		TransactionIdDidAbort(XidFromFullTransactionId(fxid)))
 	{
 		if (fromdisk)
 		{
 			ereport(WARNING,
-					(errmsg("removing stale two-phase state file for transaction %u",
-							xid)));
-			RemoveTwoPhaseFile(xid, true);
+					(errmsg("removing stale two-phase state file for transaction %u of epoch %u",
+							XidFromFullTransactionId(fxid),
+							EpochFromFullTransactionId(fxid))));
+			RemoveTwoPhaseFile(fxid, true);
 		}
 		else
 		{
 			ereport(WARNING,
-					(errmsg("removing stale two-phase state from memory for transaction %u",
-							xid)));
-			PrepareRedoRemove(xid, true);
+					(errmsg("removing stale two-phase state from memory for transaction %u of epoch %u",
+							XidFromFullTransactionId(fxid),
+							EpochFromFullTransactionId(fxid))));
+			PrepareRedoRemoveFull(fxid, true);
 		}
 		return NULL;
 	}
 
 	/* Reject XID if too new */
-	if (TransactionIdFollowsOrEquals(xid, origNextXid))
+	if (FullTransactionIdFollowsOrEquals(fxid, nextXid))
 	{
 		if (fromdisk)
 		{
 			ereport(WARNING,
-					(errmsg("removing future two-phase state file for transaction %u",
-							xid)));
-			RemoveTwoPhaseFile(xid, true);
+					(errmsg("removing future two-phase state file for transaction %u of epoch %u",
+							XidFromFullTransactionId(fxid),
+							EpochFromFullTransactionId(fxid))));
+			RemoveTwoPhaseFile(fxid, true);
 		}
 		else
 		{
 			ereport(WARNING,
-					(errmsg("removing future two-phase state from memory for transaction %u",
-							xid)));
-			PrepareRedoRemove(xid, true);
+					(errmsg("removing future two-phase state from memory for transaction %u of epoch %u",
+							XidFromFullTransactionId(fxid),
+							EpochFromFullTransactionId(fxid))));
+			PrepareRedoRemoveFull(fxid, true);
 		}
 		return NULL;
 	}
@@ -2233,7 +2243,7 @@ ProcessTwoPhaseBuffer(TransactionId xid,
 	if (fromdisk)
 	{
 		/* Read and validate file */
-		buf = ReadTwoPhaseFile(xid, false);
+		buf = ReadTwoPhaseFile(fxid, false);
 	}
 	else
 	{
@@ -2243,18 +2253,20 @@ ProcessTwoPhaseBuffer(TransactionId xid,
 
 	/* Deconstruct header */
 	hdr = (TwoPhaseFileHeader *) buf;
-	if (!TransactionIdEquals(hdr->xid, xid))
+	if (!TransactionIdEquals(hdr->xid, XidFromFullTransactionId(fxid)))
 	{
 		if (fromdisk)
 			ereport(ERROR,
 					(errcode(ERRCODE_DATA_CORRUPTED),
-					 errmsg("corrupted two-phase state file for transaction %u",
-							xid)));
+					 errmsg("corrupted two-phase state file for transaction %u of epoch %u",
+							XidFromFullTransactionId(fxid),
+							EpochFromFullTransactionId(fxid))));
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_DATA_CORRUPTED),
-					 errmsg("corrupted two-phase state in memory for transaction %u",
-							xid)));
+					 errmsg("corrupted two-phase state in memory for transaction %u of epoch %u",
+							XidFromFullTransactionId(fxid),
+							EpochFromFullTransactionId(fxid))));
 	}
 
 	/*
@@ -2268,14 +2280,14 @@ ProcessTwoPhaseBuffer(TransactionId xid,
 	{
 		TransactionId subxid = subxids[i];
 
-		Assert(TransactionIdFollows(subxid, xid));
+		Assert(TransactionIdFollows(subxid, XidFromFullTransactionId(fxid)));
 
 		/* update nextXid if needed */
 		if (setNextXid)
 			AdvanceNextFullTransactionIdPastXid(subxid);
 
 		if (setParent)
-			SubTransSetParent(subxid, xid);
+			SubTransSetParent(subxid, XidFromFullTransactionId(fxid));
 	}
 
 	return buf;
@@ -2466,8 +2478,9 @@ RecordTransactionAbortPrepared(TransactionId xid,
  * data, the entry is marked as located on disk.
  */
 void
-PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
-			   XLogRecPtr end_lsn, RepOriginId origin_id)
+PrepareRedoAdd(FullTransactionId fxid, char *buf,
+			   XLogRecPtr start_lsn, XLogRecPtr end_lsn,
+			   RepOriginId origin_id)
 {
 	TwoPhaseFileHeader *hdr = (TwoPhaseFileHeader *) buf;
 	char	   *bufptr;
@@ -2477,6 +2490,10 @@ PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 	Assert(LWLockHeldByMeInMode(TwoPhaseStateLock, LW_EXCLUSIVE));
 	Assert(RecoveryInProgress());
 
+	if (!FullTransactionIdIsValid(fxid))
+		fxid = FullTransactionIdFromAllowableAt(TransamVariables->nextXid,
+												hdr->xid);
+
 	bufptr = buf + MAXALIGN(sizeof(TwoPhaseFileHeader));
 	gid = (const char *) bufptr;
 
@@ -2505,7 +2522,8 @@ PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 	{
 		char		path[MAXPGPATH];
 
-		TwoPhaseFilePath(path, hdr->xid);
+		Assert(InRecovery);
+		TwoPhaseFilePath(path, fxid);
 
 		if (access(path, F_OK) == 0)
 		{
@@ -2536,7 +2554,7 @@ PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 	gxact->prepared_at = hdr->prepared_at;
 	gxact->prepare_start_lsn = start_lsn;
 	gxact->prepare_end_lsn = end_lsn;
-	gxact->xid = hdr->xid;
+	gxact->fxid = fxid;
 	gxact->owner = hdr->owner;
 	gxact->locking_backend = INVALID_PROC_NUMBER;
 	gxact->valid = false;
@@ -2555,11 +2573,13 @@ PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 						   false /* backward */ , false /* WAL */ );
 	}
 
-	elog(DEBUG2, "added 2PC data in shared memory for transaction %u", gxact->xid);
+	elog(DEBUG2, "added 2PC data in shared memory for transaction %u of epoch %u",
+		 XidFromFullTransactionId(gxact->fxid),
+		 EpochFromFullTransactionId(gxact->fxid));
 }
 
 /*
- * PrepareRedoRemove
+ * PrepareRedoRemoveFull
  *
  * Remove the corresponding gxact entry from TwoPhaseState. Also remove
  * the 2PC file if a prepared transaction was saved via an earlier checkpoint.
@@ -2567,8 +2587,8 @@ PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
  * Caller must hold TwoPhaseStateLock in exclusive mode, because TwoPhaseState
  * is updated.
  */
-void
-PrepareRedoRemove(TransactionId xid, bool giveWarning)
+static void
+PrepareRedoRemoveFull(FullTransactionId fxid, bool giveWarning)
 {
 	GlobalTransaction gxact = NULL;
 	int			i;
@@ -2581,7 +2601,7 @@ PrepareRedoRemove(TransactionId xid, bool giveWarning)
 	{
 		gxact = TwoPhaseState->prepXacts[i];
 
-		if (gxact->xid == xid)
+		if (FullTransactionIdEquals(gxact->fxid, fxid))
 		{
 			Assert(gxact->inredo);
 			found = true;
@@ -2598,12 +2618,28 @@ PrepareRedoRemove(TransactionId xid, bool giveWarning)
 	/*
 	 * And now we can clean up any files we may have left.
 	 */
-	elog(DEBUG2, "removing 2PC data for transaction %u", xid);
+	elog(DEBUG2, "removing 2PC data for transaction %u of epoch %u ",
+		 XidFromFullTransactionId(fxid),
+		 EpochFromFullTransactionId(fxid));
+
 	if (gxact->ondisk)
-		RemoveTwoPhaseFile(xid, giveWarning);
+		RemoveTwoPhaseFile(fxid, giveWarning);
+
 	RemoveGXact(gxact);
 }
 
+/*
+ * Wrapper of PrepareRedoRemoveFull(), for TransactionIds.
+ */
+void
+PrepareRedoRemove(TransactionId xid, bool giveWarning)
+{
+	FullTransactionId fxid =
+		FullTransactionIdFromAllowableAt(TransamVariables->nextXid, xid);
+
+	PrepareRedoRemoveFull(fxid, giveWarning);
+}
+
 /*
  * LookupGXact
  *		Check if the prepared transaction with the given GID, lsn and timestamp
@@ -2648,7 +2684,7 @@ LookupGXact(const char *gid, XLogRecPtr prepare_end_lsn,
 			 * between publisher and subscriber.
 			 */
 			if (gxact->ondisk)
-				buf = ReadTwoPhaseFile(gxact->xid, false);
+				buf = ReadTwoPhaseFile(gxact->fxid, false);
 			else
 			{
 				Assert(gxact->prepare_start_lsn);
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index b885513f7654..41601fcb2803 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2515,7 +2515,7 @@ static void
 PrepareTransaction(void)
 {
 	TransactionState s = CurrentTransactionState;
-	TransactionId xid = GetCurrentTransactionId();
+	FullTransactionId fxid = GetCurrentFullTransactionId();
 	GlobalTransaction gxact;
 	TimestampTz prepared_at;
 
@@ -2644,7 +2644,7 @@ PrepareTransaction(void)
 	 * Reserve the GID for this transaction. This could fail if the requested
 	 * GID is invalid or already in use.
 	 */
-	gxact = MarkAsPreparing(xid, prepareGID, prepared_at,
+	gxact = MarkAsPreparing(fxid, prepareGID, prepared_at,
 							GetUserId(), MyDatabaseId);
 	prepareGID = NULL;
 
@@ -2694,7 +2694,7 @@ PrepareTransaction(void)
 	 * ProcArrayClearTransaction().  Otherwise, a GetLockConflicts() would
 	 * conclude "xact already committed or aborted" for our locks.
 	 */
-	PostPrepare_Locks(xid);
+	PostPrepare_Locks(fxid);
 
 	/*
 	 * Let others know about no transaction in progress by me.  This has to be
@@ -2738,9 +2738,9 @@ PrepareTransaction(void)
 
 	PostPrepare_smgr();
 
-	PostPrepare_MultiXact(xid);
+	PostPrepare_MultiXact(fxid);
 
-	PostPrepare_PredicateLocks(xid);
+	PostPrepare_PredicateLocks(fxid);
 
 	ResourceOwnerRelease(TopTransactionResourceOwner,
 						 RESOURCE_RELEASE_LOCKS,
@@ -6420,7 +6420,8 @@ xact_redo(XLogReaderState *record)
 		 * gxact entry.
 		 */
 		LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
-		PrepareRedoAdd(XLogRecGetData(record),
+		PrepareRedoAdd(InvalidFullTransactionId,
+					   XLogRecGetData(record),
 					   record->ReadRecPtr,
 					   record->EndRecPtr,
 					   XLogRecGetOrigin(record));
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 86b06b9223f0..fbd331f2d711 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -3539,9 +3539,9 @@ AtPrepare_Locks(void)
  * but that probably costs more cycles.
  */
 void
-PostPrepare_Locks(TransactionId xid)
+PostPrepare_Locks(FullTransactionId fxid)
 {
-	PGPROC	   *newproc = TwoPhaseGetDummyProc(xid, false);
+	PGPROC	   *newproc = TwoPhaseGetDummyProc(fxid, false);
 	HASH_SEQ_STATUS status;
 	LOCALLOCK  *locallock;
 	LOCK	   *lock;
@@ -4324,11 +4324,11 @@ DumpAllLocks(void)
  * and PANIC anyway.
  */
 void
-lock_twophase_recover(TransactionId xid, uint16 info,
+lock_twophase_recover(FullTransactionId fxid, uint16 info,
 					  void *recdata, uint32 len)
 {
 	TwoPhaseLockRecord *rec = (TwoPhaseLockRecord *) recdata;
-	PGPROC	   *proc = TwoPhaseGetDummyProc(xid, false);
+	PGPROC	   *proc = TwoPhaseGetDummyProc(fxid, false);
 	LOCKTAG    *locktag;
 	LOCKMODE	lockmode;
 	LOCKMETHODID lockmethodid;
@@ -4505,7 +4505,7 @@ lock_twophase_recover(TransactionId xid, uint16 info,
  * starting up into hot standby mode.
  */
 void
-lock_twophase_standby_recover(TransactionId xid, uint16 info,
+lock_twophase_standby_recover(FullTransactionId fxid, uint16 info,
 							  void *recdata, uint32 len)
 {
 	TwoPhaseLockRecord *rec = (TwoPhaseLockRecord *) recdata;
@@ -4524,7 +4524,7 @@ lock_twophase_standby_recover(TransactionId xid, uint16 info,
 	if (lockmode == AccessExclusiveLock &&
 		locktag->locktag_type == LOCKTAG_RELATION)
 	{
-		StandbyAcquireAccessExclusiveLock(xid,
+		StandbyAcquireAccessExclusiveLock(XidFromFullTransactionId(fxid),
 										  locktag->locktag_field1 /* dboid */ ,
 										  locktag->locktag_field2 /* reloid */ );
 	}
@@ -4537,11 +4537,11 @@ lock_twophase_standby_recover(TransactionId xid, uint16 info,
  * Find and release the lock indicated by the 2PC record.
  */
 void
-lock_twophase_postcommit(TransactionId xid, uint16 info,
+lock_twophase_postcommit(FullTransactionId fxid, uint16 info,
 						 void *recdata, uint32 len)
 {
 	TwoPhaseLockRecord *rec = (TwoPhaseLockRecord *) recdata;
-	PGPROC	   *proc = TwoPhaseGetDummyProc(xid, true);
+	PGPROC	   *proc = TwoPhaseGetDummyProc(fxid, true);
 	LOCKTAG    *locktag;
 	LOCKMETHODID lockmethodid;
 	LockMethod	lockMethodTable;
@@ -4563,10 +4563,10 @@ lock_twophase_postcommit(TransactionId xid, uint16 info,
  * This is actually just the same as the COMMIT case.
  */
 void
-lock_twophase_postabort(TransactionId xid, uint16 info,
+lock_twophase_postabort(FullTransactionId fxid, uint16 info,
 						void *recdata, uint32 len)
 {
-	lock_twophase_postcommit(xid, info, recdata, len);
+	lock_twophase_postcommit(fxid, info, recdata, len);
 }
 
 /*
diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index d82114ffca16..c07fb5883555 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -191,7 +191,7 @@
  *		AtPrepare_PredicateLocks(void);
  *		PostPrepare_PredicateLocks(TransactionId xid);
  *		PredicateLockTwoPhaseFinish(TransactionId xid, bool isCommit);
- *		predicatelock_twophase_recover(TransactionId xid, uint16 info,
+ *		predicatelock_twophase_recover(FullTransactionId fxid, uint16 info,
  *									   void *recdata, uint32 len);
  */
 
@@ -4856,7 +4856,7 @@ AtPrepare_PredicateLocks(void)
  *		anyway. We only need to clean up our local state.
  */
 void
-PostPrepare_PredicateLocks(TransactionId xid)
+PostPrepare_PredicateLocks(FullTransactionId fxid)
 {
 	if (MySerializableXact == InvalidSerializableXact)
 		return;
@@ -4879,12 +4879,12 @@ PostPrepare_PredicateLocks(TransactionId xid)
  *		commits or aborts.
  */
 void
-PredicateLockTwoPhaseFinish(TransactionId xid, bool isCommit)
+PredicateLockTwoPhaseFinish(FullTransactionId fxid, bool isCommit)
 {
 	SERIALIZABLEXID *sxid;
 	SERIALIZABLEXIDTAG sxidtag;
 
-	sxidtag.xid = xid;
+	sxidtag.xid = XidFromFullTransactionId(fxid);
 
 	LWLockAcquire(SerializableXactHashLock, LW_SHARED);
 	sxid = (SERIALIZABLEXID *)
@@ -4906,10 +4906,11 @@ PredicateLockTwoPhaseFinish(TransactionId xid, bool isCommit)
  * Re-acquire a predicate lock belonging to a transaction that was prepared.
  */
 void
-predicatelock_twophase_recover(TransactionId xid, uint16 info,
+predicatelock_twophase_recover(FullTransactionId fxid, uint16 info,
 							   void *recdata, uint32 len)
 {
 	TwoPhasePredicateRecord *record;
+	TransactionId xid = XidFromFullTransactionId(fxid);
 
 	Assert(len == sizeof(TwoPhasePredicateRecord));
 
diff --git a/src/backend/utils/activity/pgstat_relation.c b/src/backend/utils/activity/pgstat_relation.c
index 28587e2916b1..69df741cbf63 100644
--- a/src/backend/utils/activity/pgstat_relation.c
+++ b/src/backend/utils/activity/pgstat_relation.c
@@ -744,7 +744,7 @@ PostPrepare_PgStat_Relations(PgStat_SubXactStatus *xact_state)
  * Load the saved counts into our local pgstats state.
  */
 void
-pgstat_twophase_postcommit(TransactionId xid, uint16 info,
+pgstat_twophase_postcommit(FullTransactionId fxid, uint16 info,
 						   void *recdata, uint32 len)
 {
 	TwoPhasePgStatRecord *rec = (TwoPhasePgStatRecord *) recdata;
@@ -780,7 +780,7 @@ pgstat_twophase_postcommit(TransactionId xid, uint16 info,
  * as aborted.
  */
 void
-pgstat_twophase_postabort(TransactionId xid, uint16 info,
+pgstat_twophase_postabort(FullTransactionId fxid, uint16 info,
 						  void *recdata, uint32 len)
 {
 	TwoPhasePgStatRecord *rec = (TwoPhasePgStatRecord *) recdata;
-- 
2.49.0

v3-0002-Improve-handling-of-2PC-files-during-recovery.patchtext/x-diff; charset=us-asciiDownload

From 12c863ddd9721e87c81ad755d1729d8241ab2204 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Thu, 22 May 2025 10:09:36 +0900
Subject: [PATCH v3 2/2] Improve handling of 2PC files during recovery

This addresses two issues with 2PC files.

First, at the beginning of recovery, 2PC file data is loaded from
pg_twophase/.  The previous codes used CLOG lookups to determine fif
transactions were committed and/or aborted, which is unsafe at this
stage of recovery if a consistent state may not have been reached.  The
code is changed so as files are discarded based on the horizon seen by
the checkpoint record from where recovery begins:
- If 2PC files are older than the checkpoint XID horizon, these files
are useless, and can be safely removed/
- If 2PC files are newer than the next XID available, these files should
not exist yet, and can be discarded.  If a 2PC transaction exists with
the same transaction ID, WAL replay will take care of adding it.

Second, at the end of recovery, it may be possible that dangling files
exist for transactions already committed and/or aborted, when these are
loaded into memory based on the XID horizon available at the beginning
of recovery.  At this stage of recovery, CLOG lookups are safe as
recovery finishes so the state of the on-disk CLOG data is consistent.

The first step is now done by restoreTwoPhaseData(), which is the code
in charge of loading the 2PC state files from pg_twophase at the
beginning of recovery.  The second step is done in
RecoverPreparedTransactions(), called at the end of recovery to restore
full the state of existing 2PC transactions based on their in-memory
data filled during recovery.  Previously, ProcessTwoPhaseBuffer() was in
charge of doing such checks, based on CLOG lookups, which was unreliable
at the beginning of recovery.

Tests are added to show these two problems.
---
 src/backend/access/transam/twophase.c | 140 +++++++++++++----------
 src/test/recovery/t/009_twophase.pl   | 153 ++++++++++++++++++++++++++
 2 files changed, 236 insertions(+), 57 deletions(-)

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 82324395f5a1..644642e80df0 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -54,7 +54,9 @@
  *		* At the beginning of recovery, pg_twophase is scanned once, filling
  *		  TwoPhaseState with entries marked with gxact->inredo and
  *		  gxact->ondisk.  Two-phase file data older than the XID horizon of
- *		  the redo position are discarded.
+ *		  the redo position and two-phase file data newer than the next XID
+ *		  are discarded, based on the information of the checkpoint record
+ *		  retrieved at the beginning of recovery.
  *		* On PREPARE redo, the transaction is added to TwoPhaseState->prepXacts.
  *		  gxact->inredo is set to true for such entries.
  *		* On Checkpoint we iterate through TwoPhaseState->prepXacts entries
@@ -66,6 +68,10 @@
  *		* RecoverPreparedTransactions(), StandbyRecoverPreparedTransactions()
  *		  and PrescanPreparedTransactions() have been modified to go through
  *		  gxact->inredo entries that have not made it to disk.
+ *		* At the end of recovery, RecoverPreparedTransactions() performs an
+ *		  extra check for transactions that could be found as already
+ *		  committed or aborted.  This is safe to do as recovery is done,
+ *		  making CLOG lookups a safe operation.
  *
  *-------------------------------------------------------------------------
  */
@@ -1893,13 +1899,16 @@ CheckPointTwoPhase(XLogRecPtr redo_horizon)
  * Scan pg_twophase and fill TwoPhaseState depending on the on-disk data.
  * This is called once at the beginning of recovery, saving any extra
  * lookups in the future.  Two-phase files that are newer than the
- * minimum XID horizon are discarded on the way.
+ * minimum XID horizon and two-phase files that are older than the oldest
+ * XID horizon are discarded on the way.
  */
 void
 restoreTwoPhaseData(void)
 {
 	DIR		   *cldir;
 	struct dirent *clde;
+	FullTransactionId nextXid = TransamVariables->nextXid;
+	FullTransactionId oldestXid = AdjustToFullTransactionId(TransamVariables->oldestXid);
 
 	LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
 	cldir = AllocateDir(TWOPHASE_DIR);
@@ -1913,10 +1922,26 @@ restoreTwoPhaseData(void)
 
 			fxid = FullTransactionIdFromU64(strtou64(clde->d_name, NULL, 16));
 
+			/* Reject XID if too new or too old */
+			if (FullTransactionIdFollowsOrEquals(fxid, nextXid) ||
+				FullTransactionIdPrecedes(fxid, oldestXid))
+			{
+				if (FullTransactionIdFollowsOrEquals(fxid, nextXid))
+					ereport(WARNING,
+							(errmsg("removing future two-phase state file for transaction %u of epoch %u",
+									XidFromFullTransactionId(fxid),
+									EpochFromFullTransactionId(fxid))));
+				else
+					ereport(WARNING,
+							(errmsg("removing past two-phase state file for transaction %u of epoch %u",
+									XidFromFullTransactionId(fxid),
+									EpochFromFullTransactionId(fxid))));
+				RemoveTwoPhaseFile(fxid, true);
+				continue;
+			}
+
 			buf = ProcessTwoPhaseBuffer(fxid, InvalidXLogRecPtr,
 										true, false, false);
-			if (buf == NULL)
-				continue;
 
 			PrepareRedoAdd(fxid, buf, InvalidXLogRecPtr,
 						   InvalidXLogRecPtr, InvalidRepOriginId);
@@ -1981,9 +2006,6 @@ PrescanPreparedTransactions(TransactionId **xids_p, int *nxids_p)
 									gxact->prepare_start_lsn,
 									gxact->ondisk, false, true);
 
-		if (buf == NULL)
-			continue;
-
 		/*
 		 * OK, we think this file is valid.  Incorporate xid into the
 		 * running-minimum result.
@@ -2052,8 +2074,7 @@ StandbyRecoverPreparedTransactions(void)
 		buf = ProcessTwoPhaseBuffer(gxact->fxid,
 									gxact->prepare_start_lsn,
 									gxact->ondisk, true, false);
-		if (buf != NULL)
-			pfree(buf);
+		pfree(buf);
 	}
 	LWLockRelease(TwoPhaseStateLock);
 }
@@ -2078,8 +2099,28 @@ void
 RecoverPreparedTransactions(void)
 {
 	int			i;
+	FullTransactionId *remove_fxids;
+	int			remove_fxids_cnt;
 
 	LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
+
+	/* Fast-exit path if no work is required */
+	if (TwoPhaseState->numPrepXacts == 0)
+	{
+		LWLockRelease(TwoPhaseStateLock);
+		return;
+	}
+
+	/*
+	 * Track XIDs candidate for removal if found as already committed or
+	 * aborted, once the first scan through TwoPhaseState is done.  This
+	 * cannot happen while going through the entries in TwoPhaseState as
+	 * PrepareRedoRemove() manipulates it.
+	 */
+	remove_fxids_cnt = 0;
+	remove_fxids = (FullTransactionId *) palloc(TwoPhaseState->numPrepXacts *
+												sizeof(FullTransactionId));
+
 	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
 	{
 		char	   *buf;
@@ -2090,6 +2131,27 @@ RecoverPreparedTransactions(void)
 		TransactionId *subxids;
 		const char *gid;
 
+		/*
+		 * Is this transaction already aborted or committed?  If yes, mark it
+		 * for removal.
+		 *
+		 * Checking CLOGs if these transactions have been already aborted or
+		 * committed is safe at this stage; we are at the end of recovery and
+		 * all WAL has been replayed, all 2PC transactions are reinstated and
+		 * are tracked in TwoPhaseState.
+		 */
+		if (TransactionIdDidCommit(XidFromFullTransactionId(fxid)) ||
+			TransactionIdDidAbort(XidFromFullTransactionId(fxid)))
+		{
+			/*
+			 * Track this transaction ID for its removal from the shared
+			 * memory state at the end.
+			 */
+			remove_fxids[remove_fxids_cnt] = fxid;
+			remove_fxids_cnt++;
+			continue;
+		}
+
 		/*
 		 * Reconstruct subtrans state for the transaction --- needed because
 		 * pg_subtrans is not preserved over a restart.  Note that we are
@@ -2102,8 +2164,6 @@ RecoverPreparedTransactions(void)
 		buf = ProcessTwoPhaseBuffer(gxact->fxid,
 									gxact->prepare_start_lsn,
 									gxact->ondisk, true, false);
-		if (buf == NULL)
-			continue;
 
 		ereport(LOG,
 				(errmsg("recovering prepared transaction %u of epoch %u from shared memory",
@@ -2164,6 +2224,18 @@ RecoverPreparedTransactions(void)
 		LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
 	}
 
+	for (i = 0; i < remove_fxids_cnt; i++)
+	{
+		FullTransactionId fxid = remove_fxids[i];
+
+		ereport(WARNING,
+				(errmsg("removing stale two-phase state from memory for transaction %u of epoch %u",
+						XidFromFullTransactionId(fxid),
+						EpochFromFullTransactionId(fxid))));
+
+		PrepareRedoRemoveFull(fxid, true);
+	}
+
 	LWLockRelease(TwoPhaseStateLock);
 }
 
@@ -2184,7 +2256,6 @@ ProcessTwoPhaseBuffer(FullTransactionId fxid,
 					  bool fromdisk,
 					  bool setParent, bool setNextXid)
 {
-	FullTransactionId nextXid = TransamVariables->nextXid;
 	TransactionId *subxids;
 	char	   *buf;
 	TwoPhaseFileHeader *hdr;
@@ -2195,51 +2266,6 @@ ProcessTwoPhaseBuffer(FullTransactionId fxid,
 	if (!fromdisk)
 		Assert(prepare_start_lsn != InvalidXLogRecPtr);
 
-	/* Already processed? */
-	if (TransactionIdDidCommit(XidFromFullTransactionId(fxid)) ||
-		TransactionIdDidAbort(XidFromFullTransactionId(fxid)))
-	{
-		if (fromdisk)
-		{
-			ereport(WARNING,
-					(errmsg("removing stale two-phase state file for transaction %u of epoch %u",
-							XidFromFullTransactionId(fxid),
-							EpochFromFullTransactionId(fxid))));
-			RemoveTwoPhaseFile(fxid, true);
-		}
-		else
-		{
-			ereport(WARNING,
-					(errmsg("removing stale two-phase state from memory for transaction %u of epoch %u",
-							XidFromFullTransactionId(fxid),
-							EpochFromFullTransactionId(fxid))));
-			PrepareRedoRemoveFull(fxid, true);
-		}
-		return NULL;
-	}
-
-	/* Reject XID if too new */
-	if (FullTransactionIdFollowsOrEquals(fxid, nextXid))
-	{
-		if (fromdisk)
-		{
-			ereport(WARNING,
-					(errmsg("removing future two-phase state file for transaction %u of epoch %u",
-							XidFromFullTransactionId(fxid),
-							EpochFromFullTransactionId(fxid))));
-			RemoveTwoPhaseFile(fxid, true);
-		}
-		else
-		{
-			ereport(WARNING,
-					(errmsg("removing future two-phase state from memory for transaction %u of epoch %u",
-							XidFromFullTransactionId(fxid),
-							EpochFromFullTransactionId(fxid))));
-			PrepareRedoRemoveFull(fxid, true);
-		}
-		return NULL;
-	}
-
 	if (fromdisk)
 	{
 		/* Read and validate file */
diff --git a/src/test/recovery/t/009_twophase.pl b/src/test/recovery/t/009_twophase.pl
index 1a662ebe499d..2d0bf63a27c0 100644
--- a/src/test/recovery/t/009_twophase.pl
+++ b/src/test/recovery/t/009_twophase.pl
@@ -5,6 +5,7 @@
 use strict;
 use warnings FATAL => 'all';
 
+use File::Copy;
 use PostgreSQL::Test::Cluster;
 use PostgreSQL::Test::Utils;
 use Test::More;
@@ -28,6 +29,15 @@ sub configure_and_reload
 	return;
 }
 
+sub twophase_file_name
+{
+	local $Test::Builder::Level = $Test::Builder::Level + 1;
+
+	my $epoch = shift;
+	my $xid = shift;
+	return sprintf("%08X%08X", $epoch, $xid);
+}
+
 # Set up two nodes, which will alternately be primary and replication standby.
 
 # Setup london node
@@ -572,4 +582,147 @@ my $nsubtrans = $cur_primary->safe_psql('postgres',
 );
 isnt($osubtrans, $nsubtrans, "contents of pg_subtrans/ have changed");
 
+###############################################################################
+# Check handling of already committed or aborted 2PC files at recovery.
+# This test does a manual copy of 2PC files created in a running server,
+# to cheaply emulate situations that could be found in base backups.
+###############################################################################
+
+# Issue a set of transactions that will be used for this portion of the test:
+# - One transaction to hold on the minimum xid horizon at bay.
+# - One transaction that will be found as already committed at recovery.
+# - One transaction that will be fonnd as already rollbacked at recovery.
+$cur_primary->psql(
+	'postgres', "
+	BEGIN;
+	INSERT INTO t_009_tbl VALUES (40, 'transaction: xid horizon');
+	PREPARE TRANSACTION 'xact_009_40';
+	BEGIN;
+	INSERT INTO t_009_tbl VALUES (41, 'transaction: commit-prepared');
+	PREPARE TRANSACTION 'xact_009_41';
+	BEGIN;
+	INSERT INTO t_009_tbl VALUES (42, 'transaction: rollback-prepared');
+	PREPARE TRANSACTION 'xact_009_42';");
+
+# Issue a checkpoint, defining the XID horizon based on the first transaction
+# and flushing to disk the two-phase files that are used later in this test.
+$cur_primary->psql('postgres', "CHECKPOINT");
+
+# Get the transaction IDs of the 2PC files.
+my $horizon_xid = int(
+	$cur_primary->safe_psql(
+		'postgres',
+		"SELECT transaction FROM pg_prepared_xacts WHERE gid = 'xact_009_40'")
+);
+my $commit_prepared_xid = int(
+	$cur_primary->safe_psql(
+		'postgres',
+		"SELECT transaction FROM pg_prepared_xacts WHERE gid = 'xact_009_41'")
+);
+my $abort_prepared_xid = int(
+	$cur_primary->safe_psql(
+		'postgres',
+		"SELECT transaction FROM pg_prepared_xacts WHERE gid = 'xact_009_42'")
+);
+
+# Copy the two-phase files that will be put back later.  Assume an
+# epoch of 0.
+my $commit_prepared_name = twophase_file_name(0, $commit_prepared_xid);
+my $abort_prepared_name = twophase_file_name(0, $abort_prepared_xid);
+
+my $twophase_tmpdir = $PostgreSQL::Test::Utils::tmp_check . '/' . "2pc_files";
+mkdir($twophase_tmpdir);
+my $primary_twophase_folder = $cur_primary->data_dir . '/pg_twophase/';
+copy("$primary_twophase_folder/$commit_prepared_name", $twophase_tmpdir);
+copy("$primary_twophase_folder/$abort_prepared_name", $twophase_tmpdir);
+
+# Issue abort/commit prepared.
+$cur_primary->psql('postgres', "COMMIT PREPARED 'xact_009_41'");
+$cur_primary->psql('postgres', "ROLLBACK PREPARED 'xact_009_42'");
+
+# Again checkpoint, to advance the LSN past the point where the two previous
+# transaction records would be replayed.
+$cur_primary->psql('postgres', "CHECKPOINT");
+
+# Take down node.
+$cur_primary->teardown_node;
+
+# Move back the two twophase files.
+copy("$twophase_tmpdir/$commit_prepared_name", $primary_twophase_folder);
+copy("$twophase_tmpdir/$abort_prepared_name", $primary_twophase_folder);
+
+# Grab location in logs of primary
+my $log_offset = -s $cur_primary->logfile;
+
+# Start node and check how the two previous files are handled during recovery
+# by checking the server logs: they are loaded when the contents of
+# pg_twophase/ are initially scanned, then removed at the end of recovery once
+# CLOG lookups are safe to do.
+$cur_primary->start;
+
+$cur_primary->log_check(
+	"two-phase file of transaction recovered at end of recovery",
+	$log_offset,
+	log_like => [
+		qr/recovering prepared transaction $horizon_xid of epoch 0 from shared memory/,
+	]);
+$cur_primary->log_check(
+	"two-phase files of transactions removed at end of recovery",
+	$log_offset,
+	log_like => [
+		qr/removing stale two-phase state from memory for transaction $commit_prepared_xid of epoch 0/,
+		qr/removing stale two-phase state from memory for transaction $abort_prepared_xid of epoch 0/,
+	]);
+
+# Commit the first transaction.
+$cur_primary->psql('postgres', "COMMIT PREPARED 'xact_009_40'");
+# After replay, there should be no 2PC transactions.
+$cur_primary->psql(
+	'postgres',
+	"SELECT * FROM pg_prepared_xact",
+	stdout => \$psql_out);
+is($psql_out, qq{}, "Check expected pg_prepared_xact data on primary");
+# Data from transactions should be around.
+$cur_primary->psql(
+	'postgres',
+	"SELECT * FROM t_009_tbl WHERE id IN (40, 41, 42);",
+	stdout => \$psql_out);
+is( $psql_out, qq{40|transaction: xid horizon
+41|transaction: commit-prepared},
+	"Check expected table data on primary");
+
+###############################################################################
+# Check handling of orphaned 2PC files at recovery.
+###############################################################################
+
+$cur_standby->teardown_node;
+$cur_primary->teardown_node;
+
+# Grab location in logs of primary
+$log_offset = -s $cur_primary->logfile;
+
+# Create fake files with a transaction ID large or low enough to be in the
+# future or the past, in different epochs, then check that the primary is able
+# to start and remove these files at recovery.
+
+# First bump the epoch with pg_resetwal.
+$cur_primary->command_ok(
+	[ 'pg_resetwal', '-e', 256, '-f', $cur_primary->data_dir ],
+	'bump epoch of primary');
+
+my $future_2pc_file =
+  $cur_primary->data_dir . '/pg_twophase/000001FF00000FFF';
+append_to_file $future_2pc_file, "";
+my $past_2pc_file = $cur_primary->data_dir . '/pg_twophase/000000EE00000FFF';
+append_to_file $past_2pc_file, "";
+
+$cur_primary->start;
+$cur_primary->log_check(
+	"two-phase files removed at recovery",
+	$log_offset,
+	log_like => [
+		qr/removing past two-phase state file for transaction 4095 of epoch 238/,
+		qr/removing future two-phase state file for transaction 4095 of epoch 511/
+	]);
+
 done_testing();
-- 
2.49.0

Noah Misch

noah@leadboat.com

7 months ago

In reply to: Michael Paquier (#6)

Re: Issues with 2PC at recovery: CLOG lookups and GlobalTransactionData

On Fri, May 09, 2025 at 02:08:26PM +0900, Michael Paquier wrote:

On Tue, Feb 18, 2025 at 04:57:47PM -0800, Noah Misch wrote:

As I wrote in [1], "By the time we reach consistency, every file in
pg_twophase will be applicable (not committed or aborted)." If we find
otherwise, the user didn't follow the backup protocol (or there's another
bug). Hence, long-term, we should stop these removals and just fail recovery.
We can't fix all data loss consequences of not following the backup protocol,
so the biggest favor we can do the user is draw their attention to the
problem. How do you see it?

Deciding to not trust at all any of the contents of pg_twophase/ until
consistency is reached is not something we should aim for, IMO. Going
in this direction would mean to delay restoreTwoPhaseData() until
consistency is reached, but there are cases where we can read that
safely, and where we should do so. For example, this flow is
perfectly OK to do in the wasShutdown case, where
PrescanPreparedTransactions() would be able to do its initialization
job before performing WAL recovery to get a clean list of running
XIDs.

The wasShutdown case reaches consistency from the beginning, so I don't see
that as an example of a time we benefit from reading pg_twophase before
reaching consistency. Can you elaborate on that?

What's the benefit you're trying to get by reading pg_twophase before reaching
consistency?

Before reaching consistency, our normal approach is to let WAL tell us what to
read, not explore the data directory for files of interest. That's a good
principle, because there are few bounds on the chaos that may exist in the
files of the data directory before reaching consistency. Today's twophase
departs from that principle. In light of this thread's problems, we should
have a strong reason for keeping that departure. The default should be to
align with the rest of recovery in this respect.

I can think of one benefit of attempting to read pg_twophase before reaching
consistency. Suppose we can prove that a pg_twophase file will cause an error
by end of recovery, regardless of what WAL contains. It's nice to fail
recovery immediately instead of failing recovery when we reach consistency.
However, I doubt that benefit is important enough to depart from our usual
principle and incur additional storage seeks in order to achieve that benefit.
If recovery will certainly fail, you are going to have a bad day anyway.
Accelerating recovery failure is a small benefit, particularly when we'd
accelerate failure for only a small slice of recovery failure causes.

I agree that moving towards a solution where we get rid entirely of
the CLOG lookups in ProcessTwoPhaseBuffer() is what we should aim for,
and actually is there a reason to not just nuke and replace them
something based on the checkpoint record itself?

I don't know what this means.

I have to admit that
I don't quite see the issue with ReadTwoPhaseFile() when it comes to
crash recovery. For example, in the case of a partial write, doesn't
the CRC32 check offer some protection about the contents of the file?

Not the protection we want. If we've not reached consistency, we must not
ERROR "calculated CRC checksum does not match value stored in file" for a file
that later WAL may recreate. That might be what you're saying:

Wouldn't it be OK in this case to assume that the contents of this
file will be in WAL anyway?

Sure. Meanwhile, if a twophase file is going to be in later WAL, what's the
value in opening the file before we get to that WAL?

The base backup issue is a different one, of course, and I think that
we are going to require more data in the 2PC file to provide a better
cross-check barrier, which would be the addition to the 2PC file of
the end LSN where the 2PC file record has been inserted. Then we
could cross-check that with the redo location, and see that it's
actually safe to discard the file because we know it will be in WAL.
This seems like a hefty cost to pay for, though, meaning 8 bytes in
each 2PC file because base backups were done wrong. Bleh.

I'm not saying we should go out of our way to detect base backup protocol
violations. Weakened detection of base backup protocol violations is one
drawback of acting on pg_twophase before consistency, but it's less important
than the deviation from standard recovery principles.

Michael Paquier

michael@paquier.xyz

7 months ago

In reply to: Noah Misch (#8)

Re: Issues with 2PC at recovery: CLOG lookups and GlobalTransactionData

On Mon, Jun 02, 2025 at 06:48:46PM -0700, Noah Misch wrote:

The wasShutdown case reaches consistency from the beginning, so I don't see
that as an example of a time we benefit from reading pg_twophase before
reaching consistency. Can you elaborate on that?

What's the benefit you're trying to get by reading pg_twophase before reaching
consistency?

My point is mostly about code complicity and consistency, by using the
same logic (aka reading the contents of pg_twophase) at the same point
in the recovery process and perform some filtering of the contents we
know we will not be able to trust. So I mean to do that once at its
beginning of recovery, where we only compare the names of the 2PC file
names with the XID boundaries in the checkpoint record:
- Discard any files with an XID newer than the next XID. We know that
these would be in WAL anyway, if they replay from a timeline where it
matters.
- Discard any files that are older than the oldest XID. We know that
these files don't matter as 2PC transactions hold the XID horizon.
- Keep the others for later evaluation.

So there's no actual need to check the contents of the files, still
that implies trusting the names of the 2PC files in pg_twophase/.

I can think of one benefit of attempting to read pg_twophase before reaching
consistency. Suppose we can prove that a pg_twophase file will cause an error
by end of recovery, regardless of what WAL contains. It's nice to fail
recovery immediately instead of failing recovery when we reach consistency.
However, I doubt that benefit is important enough to depart from our usual
principle and incur additional storage seeks in order to achieve that benefit.
If recovery will certainly fail, you are going to have a bad day anyway.
Accelerating recovery failure is a small benefit, particularly when we'd
accelerate failure for only a small slice of recovery failure causes.

Well, I kind of disagree here. Failing recovery faster can be
beneficial. It perhaps has less merit since 7ff23c6d277d, meaning
that we should replay less after a failure during crash recovery,
still that seems useful to me if we can do that. That depends on the
amount of trust put in the data read, of course, and if only WAL is
trusted, there's not much that can be done at the beginning of
recovery.

I agree that moving towards a solution where we get rid entirely of
the CLOG lookups in ProcessTwoPhaseBuffer() is what we should aim for,
and actually is there a reason to not just nuke and replace them
something based on the checkpoint record itself?

I don't know what this means.

As of the contents of the last patch posted on this thread, I am
referring to the calls of TransactionIdDidCommit() and
TransactionIdDidAbort() moved from ProcessTwoPhaseBuffer(), which is
called mostly always when WAL is replayed (consistency may or may not
be reached), to RecoverPreparedTransactions(), which is run just
before consistency is marked as such in the system. When
RecoverPreparedTransactions() is called, we're ready to mark the
cluster as OK for writes and WAL has been already fully replayed with
all sanity checks done. CLOG accesses at this stage would not be an
issue.

Wouldn't it be OK in this case to assume that the contents of this
file will be in WAL anyway?

Sure. Meanwhile, if a twophase file is going to be in later WAL, what's the
value in opening the file before we get to that WAL?

True. We could avoid loading some 2PC files if we know that these
could be in WAL when replaying.

There is still one point that I'm really unclear about. Some 2PC
transactions are flushed at checkpoint time, so we will have to trust
the contents of pg_twophase/ at some point. Do you mean to delay that
to happen always when consistency is reached or if we know that we're
starting from a clean state? What I'd really prefer to avoid is
having two code paths in charge of reading the contents of
pg_twophase.

The point I am trying to make is that there has to be a certain level
of trust in the contents of pg_twophase, at some point during replay.
Or, we invent a new mechanism where all the twophase files go through
WAL and remove the need for pg_twophase/ when recovering. For
example, we could have an extra record generated at each checkpoint
with the contents of the 2PC files still pending for commit
(potentially costly if the same transaction persists across multiple
checkpoints as this gets repeated), or something entirely different.
Something like that would put WAL as the sole source of trust by
design.

Or are you seeing things differently? In which case, I am not sure
what's the line you think would be "correct" here (well, you did say
only WAL until consistency is reached), nor do I completely understand
how much 2PC transaction state we should have in shared memory until
consistency is reached. Or perhaps you mean to somehow eliminate more
that? I'm unsure how much this would imply for the existing recovery
mechanisms (replication origin advancing at replay for prepared
transactions may be one area to look at, for example).
--
Michael

#10

Noah Misch

noah@leadboat.com

7 months ago

In reply to: Michael Paquier (#9)

Re: Issues with 2PC at recovery: CLOG lookups and GlobalTransactionData

On Thu, Jun 05, 2025 at 04:22:48PM +0900, Michael Paquier wrote:

On Mon, Jun 02, 2025 at 06:48:46PM -0700, Noah Misch wrote:

The wasShutdown case reaches consistency from the beginning, so I don't see
that as an example of a time we benefit from reading pg_twophase before
reaching consistency. Can you elaborate on that?

What's the benefit you're trying to get by reading pg_twophase before reaching
consistency?

My point is mostly about code complicity and consistency, by using the
same logic (aka reading the contents of pg_twophase) at the same point
in the recovery process

That's a nice goal.

and perform some filtering of the contents we
know we will not be able to trust. So I mean to do that once at its
beginning of recovery, where we only compare the names of the 2PC file
names with the XID boundaries in the checkpoint record:

- Discard any files with an XID newer than the next XID. We know that
these would be in WAL anyway, if they replay from a timeline where it
matters.

v3-0002-Improve-handling-of-2PC-files-during-recovery.patch has this continue
to emit a WARNING. However, this can happen without any exceptional events in
the backup's history. The message should be LOG or DEBUG if done this early.
This is not new in the patch, and it would be okay to not change it as part of
$SUBJECT. I mention it because, if we scanned pg_twophase after reaching
consistency, an ERROR would be appropriate. (A too-new pg_twophase file then
signifies the backup creator having copied files after the backup stop, a
violation of the backup protocol.)

- Discard any files that are older than the oldest XID. We know that
these files don't matter as 2PC transactions hold the XID horizon.

In contrast, a WARNING is fine here. A too-old file implies one of these
three, each of which counts as an exceptional event:

- an OS crash revived a file after unlink(), since RemoveTwoPhaseFile() does not fsync
- unlink() failed in RemoveTwoPhaseFile()
- backup protocol violation

I agree start-of-recovery can correctly do your list of two discard actions;
they do not require a consistent state. If the patch brings us to the point
that recovery does only those two things with pg_twophase before reaching
consistency, that sounds fine. Doing them that early doesn't sound optimal to
me, but I've not edited that area as much as you have. If it gets the
user-visible behaviors right and doesn't look fragile, I'll be fine with it.

- Keep the others for later evaluation.

So there's no actual need to check the contents of the files, still
that implies trusting the names of the 2PC files in pg_twophase/.

Trusting file names is fine.

I can think of one benefit of attempting to read pg_twophase before reaching
consistency. Suppose we can prove that a pg_twophase file will cause an error
by end of recovery, regardless of what WAL contains.

Thinking about that more, few errors are detectable at that time. Here too,
if your patch achieves some of this while getting the user-visible behaviors
right and not looking fragile, I'll be fine with it.

Wouldn't it be OK in this case to assume that the contents of this
file will be in WAL anyway?

Sure. Meanwhile, if a twophase file is going to be in later WAL, what's the
value in opening the file before we get to that WAL?

True. We could avoid loading some 2PC files if we know that these
could be in WAL when replaying.

If the 2PC data is in future WAL, the file on disk may be a partially-written
file that fails the CRC check. That's a key benefit of not reading the file.

There is still one point that I'm really unclear about. Some 2PC
transactions are flushed at checkpoint time, so we will have to trust
the contents of pg_twophase/ at some point. Do you mean to delay that
to happen always when consistency is reached

Yep.

or if we know that we're
starting from a clean state?

I don't know what "clean slate" means. If "clean state"=wasShutdown, that's
just a special case of "consistency is reached".

What I'd really prefer to avoid is
having two code paths in charge of reading the contents of
pg_twophase.

That's a good goal. I suspect you'll achieve that.

The point I am trying to make is that there has to be a certain level
of trust in the contents of pg_twophase, at some point during replay.
Or, we invent a new mechanism where all the twophase files go through
WAL and remove the need for pg_twophase/ when recovering. For
example, we could have an extra record generated at each checkpoint
with the contents of the 2PC files still pending for commit
(potentially costly if the same transaction persists across multiple
checkpoints as this gets repeated), or something entirely different.
Something like that would put WAL as the sole source of trust by
design.

Adding such a record would be nonstandard and unnecessary. Before reaching
consistency, the standard is that recovery reads and writes what WAL directs
it to read and write. Upon reaching consistency, the server can trust the
entire data directory, including parts that WAL never referenced.

Or are you seeing things differently? In which case, I am not sure
what's the line you think would be "correct" here (well, you did say
only WAL until consistency is reached), nor do I completely understand
how much 2PC transaction state we should have in shared memory until
consistency is reached.

Shared memory should (not "must") omit a GXACT whose prepare_start_lsn lies
beyond the point recovery has reached. In other words, recovery should keep
anachronistic information out of shared memory. For example, one could
imagine a straw man implementation in which, before reaching consistency, we
load each pg_twophase file that does pass its CRC check. I'd dislike that,
because it would facilitate anachronisms involving the
GlobalTransactionData.gid field. We could end up with two GXACT having the
same gid, one being the present (according to recovery progress) instance of
that gid and another being a future instance. The system might not
malfunction today, but I'd consider that fragile. Anachronistic entries might
cause recovery to need more shared memory than max_prepared_transactions has
allocated.

You might find some more-important goal preempts that no-anachronisms goal.

#11

Michael Paquier

michael@paquier.xyz

7 months ago

In reply to: Noah Misch (#10)

1 attachment(s)

Re: Issues with 2PC at recovery: CLOG lookups and GlobalTransactionData

On Fri, Jun 06, 2025 at 03:34:13PM -0700, Noah Misch wrote:

On Thu, Jun 05, 2025 at 04:22:48PM +0900, Michael Paquier wrote:

On Mon, Jun 02, 2025 at 06:48:46PM -0700, Noah Misch wrote:
and perform some filtering of the contents we
know we will not be able to trust. So I mean to do that once at its
beginning of recovery, where we only compare the names of the 2PC file
names with the XID boundaries in the checkpoint record:

- Discard any files with an XID newer than the next XID. We know that
these would be in WAL anyway, if they replay from a timeline where it
matters.

v3-0002-Improve-handling-of-2PC-files-during-recovery.patch has this continue
to emit a WARNING. However, this can happen without any exceptional events in
the backup's history. The message should be LOG or DEBUG if done this early.
This is not new in the patch, and it would be okay to not change it as part of
$SUBJECT. I mention it because, if we scanned pg_twophase after reaching
consistency, an ERROR would be appropriate. (A too-new pg_twophase file then
signifies the backup creator having copied files after the backup stop, a
violation of the backup protocol.)

I need to ponder more about this point, I think.

- Discard any files that are older than the oldest XID. We know that
these files don't matter as 2PC transactions hold the XID horizon.

In contrast, a WARNING is fine here. A too-old file implies one of these
three, each of which counts as an exceptional event:

- an OS crash revived a file after unlink(), since RemoveTwoPhaseFile() does not fsync
- unlink() failed in RemoveTwoPhaseFile()
- backup protocol violation

I agree start-of-recovery can correctly do your list of two discard actions;
they do not require a consistent state. If the patch brings us to the point
that recovery does only those two things with pg_twophase before reaching
consistency, that sounds fine. Doing them that early doesn't sound optimal to
me, but I've not edited that area as much as you have. If it gets the
user-visible behaviors right and doesn't look fragile, I'll be fine with it.

Noted.

So there's no actual need to check the contents of the files, still
that implies trusting the names of the 2PC files in pg_twophase/.

Trusting file names is fine.

Noted, thanks.

or if we know that we're
starting from a clean state?

I don't know what "clean slate" means. If "clean state"=wasShutdown, that's
just a special case of "consistency is reached".

By "clean state", I meant wasShutdown = true, yes. Sorry for the
confusion.

Or are you seeing things differently? In which case, I am not sure
what's the line you think would be "correct" here (well, you did say
only WAL until consistency is reached), nor do I completely understand
how much 2PC transaction state we should have in shared memory until
consistency is reached.

Shared memory should (not "must") omit a GXACT whose prepare_start_lsn lies
beyond the point recovery has reached. In other words, recovery should keep
anachronistic information out of shared memory. For example, one could
imagine a straw man implementation in which, before reaching consistency, we
load each pg_twophase file that does pass its CRC check. I'd dislike that,
because it would facilitate anachronisms involving the
GlobalTransactionData.gid field. We could end up with two GXACT having the
same gid, one being the present (according to recovery progress) instance of
that gid and another being a future instance. The system might not
malfunction today, but I'd consider that fragile. Anachronistic entries might
cause recovery to need more shared memory than max_prepared_transactions has
allocated.

So this comes down to attempting to translate the information from the
2PC files to shared memory at a point earlier than what we could.
Thinking more about this point, there is something that I can see us
potentially do here: how about tracking the file names in shared
memory when we go through them in restoreTwoPhaseData()? That
addresses my point about doing only one scan of pg_twophase/ at the
beginning of recovery. I have not studied in details how doable it
is, but we should be able to delay any calls to
ProcessTwoPhaseBuffer() until we are sure that it is safe to do, which
would in charge of reading the files based on the names we've picked
at the early stages.

RecoverPreparedTransactions() is an easy one, we only call it at the
end of recovery. The sticky point to study is the call of
StandbyRecoverPreparedTransactions() when replaying a shutdown
checkpoint.. Ugh.

You might find some more-important goal preempts that no-anachronisms goal.

Right, thanks.

By the way, what do you think we should do with 0001 at this stage
(your refactoring work with some changes I've sprinkled)? I was
looking at it this morning with fresh eyes and my opinion about it is
that it improves the situation in all these 2PC callers on HEAD now
that we need to deal with fxids for the file names, so that's an
independent piece. Perhaps this should be applied once v19 opens for
business while we sort out the rest that 0002 was trying to cover?
It's not something that we can backpatch safely due to the ABI change
in 2PC callback, unfortunately, just a cleanup piece.

Note: this introduced two calls to FullTransactionIdFromAllowableAt()
in StandbyTransactionIdIsPrepared() and PrepareRedoAdd(), which worried
me a bit on second look and gave me a pause:
- The call in StandbyTransactionIdIsPrepared() should use
AdjustToFullTransactionId(), ReadTwoPhaseFile() calls
TwoPhaseFilePath(), which uses AdjustToFullTransactionId() on HEAD.
- The call in PrepareRedoAdd() slightly worried me, but I think that
we're OK with using FullTransactionIdFromAllowableAt(), as we are in
the case of a 2PC transaction recovered from WAL, with the XID coming
from the file's header. I think that we'd better put an
Assert(InRecovery) in this path, just for safety.

What do you think? 0002 needs more thoughts, so I'm attaching a
slightly-reviewed version of 0001 for now.
--
Michael

Attachments:

v4-0001-Integrate-more-FullTransactionIds-into-2PC-code.patchtext/x-diff; charset=us-asciiDownload

From d2fa85980f70f1dc076d75d6c57bc2d736d33b4d Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Fri, 20 Jun 2025 08:59:00 +0900
Subject: [PATCH v4] Integrate more FullTransactionIds into 2PC code

This refactoring is a follow-up of the work done in 5a1dfde8334b, that
has switched 2PC file names to use FullTransactionIds.  This will help
with the integration of solutions for the handling on-disk orphaned 2PC
files, without requiring a need to build the file names from epoch
numbers and TransactionId, avoiding a lot of back-and-forth between the
FullTransactionIds retrieved from the file names and how these are
passed around in the internal 2PC logic.

Note that the core of the change is use of a FullTransactionId instead
of a TransactionId in GlobalTransactionData, that tracks 2PC file
information in shared memory.
---
 src/include/access/multixact.h               |   9 +-
 src/include/access/twophase.h                |  11 +-
 src/include/access/twophase_rmgr.h           |   4 +-
 src/include/pgstat.h                         |   4 +-
 src/include/storage/lock.h                   |  11 +-
 src/include/storage/predicate.h              |   6 +-
 src/backend/access/transam/multixact.c       |  16 +-
 src/backend/access/transam/twophase.c        | 245 +++++++++++--------
 src/backend/access/transam/xact.c            |  13 +-
 src/backend/storage/lmgr/lock.c              |  20 +-
 src/backend/storage/lmgr/predicate.c         |  11 +-
 src/backend/utils/activity/pgstat_relation.c |   4 +-
 12 files changed, 200 insertions(+), 154 deletions(-)

diff --git a/src/include/access/multixact.h b/src/include/access/multixact.h
index 4e6b0eec2ff4..b876e98f46ed 100644
--- a/src/include/access/multixact.h
+++ b/src/include/access/multixact.h
@@ -11,6 +11,7 @@
 #ifndef MULTIXACT_H
 #define MULTIXACT_H
 
+#include "access/transam.h"
 #include "access/xlogreader.h"
 #include "lib/stringinfo.h"
 #include "storage/sync.h"
@@ -119,7 +120,7 @@ extern int	multixactmemberssyncfiletag(const FileTag *ftag, char *path);
 
 extern void AtEOXact_MultiXact(void);
 extern void AtPrepare_MultiXact(void);
-extern void PostPrepare_MultiXact(TransactionId xid);
+extern void PostPrepare_MultiXact(FullTransactionId fxid);
 
 extern Size MultiXactShmemSize(void);
 extern void MultiXactShmemInit(void);
@@ -145,11 +146,11 @@ extern void MultiXactAdvanceNextMXact(MultiXactId minMulti,
 extern void MultiXactAdvanceOldest(MultiXactId oldestMulti, Oid oldestMultiDB);
 extern int	MultiXactMemberFreezeThreshold(void);
 
-extern void multixact_twophase_recover(TransactionId xid, uint16 info,
+extern void multixact_twophase_recover(FullTransactionId fxid, uint16 info,
 									   void *recdata, uint32 len);
-extern void multixact_twophase_postcommit(TransactionId xid, uint16 info,
+extern void multixact_twophase_postcommit(FullTransactionId fxid, uint16 info,
 										  void *recdata, uint32 len);
-extern void multixact_twophase_postabort(TransactionId xid, uint16 info,
+extern void multixact_twophase_postabort(FullTransactionId fxid, uint16 info,
 										 void *recdata, uint32 len);
 
 extern void multixact_redo(XLogReaderState *record);
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 9fa823550337..509bdad9a5d5 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -36,10 +36,10 @@ extern void PostPrepare_Twophase(void);
 
 extern TransactionId TwoPhaseGetXidByVirtualXID(VirtualTransactionId vxid,
 												bool *have_more);
-extern PGPROC *TwoPhaseGetDummyProc(TransactionId xid, bool lock_held);
-extern int	TwoPhaseGetDummyProcNumber(TransactionId xid, bool lock_held);
+extern PGPROC *TwoPhaseGetDummyProc(FullTransactionId fxid, bool lock_held);
+extern int	TwoPhaseGetDummyProcNumber(FullTransactionId fxid, bool lock_held);
 
-extern GlobalTransaction MarkAsPreparing(TransactionId xid, const char *gid,
+extern GlobalTransaction MarkAsPreparing(FullTransactionId fxid, const char *gid,
 										 TimestampTz prepared_at,
 										 Oid owner, Oid databaseid);
 
@@ -56,8 +56,9 @@ extern void CheckPointTwoPhase(XLogRecPtr redo_horizon);
 
 extern void FinishPreparedTransaction(const char *gid, bool isCommit);
 
-extern void PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
-						   XLogRecPtr end_lsn, RepOriginId origin_id);
+extern void PrepareRedoAdd(FullTransactionId fxid, char *buf,
+						   XLogRecPtr start_lsn, XLogRecPtr end_lsn,
+						   RepOriginId origin_id);
 extern void PrepareRedoRemove(TransactionId xid, bool giveWarning);
 extern void restoreTwoPhaseData(void);
 extern bool LookupGXact(const char *gid, XLogRecPtr prepare_end_lsn,
diff --git a/src/include/access/twophase_rmgr.h b/src/include/access/twophase_rmgr.h
index 3ed154bb2312..8f576402e360 100644
--- a/src/include/access/twophase_rmgr.h
+++ b/src/include/access/twophase_rmgr.h
@@ -14,7 +14,9 @@
 #ifndef TWOPHASE_RMGR_H
 #define TWOPHASE_RMGR_H
 
-typedef void (*TwoPhaseCallback) (TransactionId xid, uint16 info,
+#include "access/transam.h"
+
+typedef void (*TwoPhaseCallback) (FullTransactionId fxid, uint16 info,
 								  void *recdata, uint32 len);
 typedef uint8 TwoPhaseRmgrId;
 
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 378f2f2c2ba2..202bd2d5aced 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -718,9 +718,9 @@ extern void pgstat_count_heap_delete(Relation rel);
 extern void pgstat_count_truncate(Relation rel);
 extern void pgstat_update_heap_dead_tuples(Relation rel, int delta);
 
-extern void pgstat_twophase_postcommit(TransactionId xid, uint16 info,
+extern void pgstat_twophase_postcommit(FullTransactionId fxid, uint16 info,
 									   void *recdata, uint32 len);
-extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
+extern void pgstat_twophase_postabort(FullTransactionId fxid, uint16 info,
 									  void *recdata, uint32 len);
 
 extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry(Oid relid);
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index 4862b80eec3c..826cf28fdbd9 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -18,6 +18,7 @@
 #error "lock.h may not be included from frontend code"
 #endif
 
+#include "access/transam.h"
 #include "lib/ilist.h"
 #include "storage/lockdefs.h"
 #include "storage/lwlock.h"
@@ -581,7 +582,7 @@ extern bool LockHasWaiters(const LOCKTAG *locktag,
 extern VirtualTransactionId *GetLockConflicts(const LOCKTAG *locktag,
 											  LOCKMODE lockmode, int *countp);
 extern void AtPrepare_Locks(void);
-extern void PostPrepare_Locks(TransactionId xid);
+extern void PostPrepare_Locks(FullTransactionId fxid);
 extern bool LockCheckConflicts(LockMethod lockMethodTable,
 							   LOCKMODE lockmode,
 							   LOCK *lock, PROCLOCK *proclock);
@@ -597,13 +598,13 @@ extern BlockedProcsData *GetBlockerStatusData(int blocked_pid);
 extern xl_standby_lock *GetRunningTransactionLocks(int *nlocks);
 extern const char *GetLockmodeName(LOCKMETHODID lockmethodid, LOCKMODE mode);
 
-extern void lock_twophase_recover(TransactionId xid, uint16 info,
+extern void lock_twophase_recover(FullTransactionId fxid, uint16 info,
 								  void *recdata, uint32 len);
-extern void lock_twophase_postcommit(TransactionId xid, uint16 info,
+extern void lock_twophase_postcommit(FullTransactionId fxid, uint16 info,
 									 void *recdata, uint32 len);
-extern void lock_twophase_postabort(TransactionId xid, uint16 info,
+extern void lock_twophase_postabort(FullTransactionId fxid, uint16 info,
 									void *recdata, uint32 len);
-extern void lock_twophase_standby_recover(TransactionId xid, uint16 info,
+extern void lock_twophase_standby_recover(FullTransactionId fxid, uint16 info,
 										  void *recdata, uint32 len);
 
 extern DeadLockState DeadLockCheck(PGPROC *proc);
diff --git a/src/include/storage/predicate.h b/src/include/storage/predicate.h
index 267d5d90e948..c1e3a4d9f64a 100644
--- a/src/include/storage/predicate.h
+++ b/src/include/storage/predicate.h
@@ -72,9 +72,9 @@ extern void PreCommit_CheckForSerializationFailure(void);
 
 /* two-phase commit support */
 extern void AtPrepare_PredicateLocks(void);
-extern void PostPrepare_PredicateLocks(TransactionId xid);
-extern void PredicateLockTwoPhaseFinish(TransactionId xid, bool isCommit);
-extern void predicatelock_twophase_recover(TransactionId xid, uint16 info,
+extern void PostPrepare_PredicateLocks(FullTransactionId fxid);
+extern void PredicateLockTwoPhaseFinish(FullTransactionId xid, bool isCommit);
+extern void predicatelock_twophase_recover(FullTransactionId fxid, uint16 info,
 										   void *recdata, uint32 len);
 
 /* parallel query support */
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 3c06ac45532f..7a7afe3edc67 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1847,7 +1847,7 @@ AtPrepare_MultiXact(void)
  *		Clean up after successful PREPARE TRANSACTION
  */
 void
-PostPrepare_MultiXact(TransactionId xid)
+PostPrepare_MultiXact(FullTransactionId fxid)
 {
 	MultiXactId myOldestMember;
 
@@ -1858,7 +1858,7 @@ PostPrepare_MultiXact(TransactionId xid)
 	myOldestMember = OldestMemberMXactId[MyProcNumber];
 	if (MultiXactIdIsValid(myOldestMember))
 	{
-		ProcNumber	dummyProcNumber = TwoPhaseGetDummyProcNumber(xid, false);
+		ProcNumber	dummyProcNumber = TwoPhaseGetDummyProcNumber(fxid, false);
 
 		/*
 		 * Even though storing MultiXactId is atomic, acquire lock to make
@@ -1896,10 +1896,10 @@ PostPrepare_MultiXact(TransactionId xid)
  *		Recover the state of a prepared transaction at startup
  */
 void
-multixact_twophase_recover(TransactionId xid, uint16 info,
+multixact_twophase_recover(FullTransactionId fxid, uint16 info,
 						   void *recdata, uint32 len)
 {
-	ProcNumber	dummyProcNumber = TwoPhaseGetDummyProcNumber(xid, false);
+	ProcNumber	dummyProcNumber = TwoPhaseGetDummyProcNumber(fxid, false);
 	MultiXactId oldestMember;
 
 	/*
@@ -1917,10 +1917,10 @@ multixact_twophase_recover(TransactionId xid, uint16 info,
  *		Similar to AtEOXact_MultiXact but for COMMIT PREPARED
  */
 void
-multixact_twophase_postcommit(TransactionId xid, uint16 info,
+multixact_twophase_postcommit(FullTransactionId fxid, uint16 info,
 							  void *recdata, uint32 len)
 {
-	ProcNumber	dummyProcNumber = TwoPhaseGetDummyProcNumber(xid, true);
+	ProcNumber	dummyProcNumber = TwoPhaseGetDummyProcNumber(fxid, true);
 
 	Assert(len == sizeof(MultiXactId));
 
@@ -1932,10 +1932,10 @@ multixact_twophase_postcommit(TransactionId xid, uint16 info,
  *		This is actually just the same as the COMMIT case.
  */
 void
-multixact_twophase_postabort(TransactionId xid, uint16 info,
+multixact_twophase_postabort(FullTransactionId fxid, uint16 info,
 							 void *recdata, uint32 len)
 {
-	multixact_twophase_postcommit(xid, info, recdata, len);
+	multixact_twophase_postcommit(fxid, info, recdata, len);
 }
 
 /*
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 73a80559194e..ef6780d998d1 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -159,7 +159,7 @@ typedef struct GlobalTransactionData
 	 */
 	XLogRecPtr	prepare_start_lsn;	/* XLOG offset of prepare record start */
 	XLogRecPtr	prepare_end_lsn;	/* XLOG offset of prepare record end */
-	TransactionId xid;			/* The GXACT id */
+	FullTransactionId fxid;		/* The GXACT full xid */
 
 	Oid			owner;			/* ID of user that executed the xact */
 	ProcNumber	locking_backend;	/* backend currently working on the xact */
@@ -197,6 +197,7 @@ static GlobalTransaction MyLockedGxact = NULL;
 
 static bool twophaseExitRegistered = false;
 
+static void PrepareRedoRemoveFull(FullTransactionId fxid, bool giveWarning);
 static void RecordTransactionCommitPrepared(TransactionId xid,
 											int nchildren,
 											TransactionId *children,
@@ -216,19 +217,19 @@ static void RecordTransactionAbortPrepared(TransactionId xid,
 										   int nstats,
 										   xl_xact_stats_item *stats,
 										   const char *gid);
-static void ProcessRecords(char *bufptr, TransactionId xid,
+static void ProcessRecords(char *bufptr, FullTransactionId fxid,
 						   const TwoPhaseCallback callbacks[]);
 static void RemoveGXact(GlobalTransaction gxact);
 
 static void XlogReadTwoPhaseData(XLogRecPtr lsn, char **buf, int *len);
-static char *ProcessTwoPhaseBuffer(TransactionId xid,
+static char *ProcessTwoPhaseBuffer(FullTransactionId fxid,
 								   XLogRecPtr prepare_start_lsn,
 								   bool fromdisk, bool setParent, bool setNextXid);
-static void MarkAsPreparingGuts(GlobalTransaction gxact, TransactionId xid,
+static void MarkAsPreparingGuts(GlobalTransaction gxact, FullTransactionId fxid,
 								const char *gid, TimestampTz prepared_at, Oid owner,
 								Oid databaseid);
-static void RemoveTwoPhaseFile(TransactionId xid, bool giveWarning);
-static void RecreateTwoPhaseFile(TransactionId xid, void *content, int len);
+static void RemoveTwoPhaseFile(FullTransactionId fxid, bool giveWarning);
+static void RecreateTwoPhaseFile(FullTransactionId fxid, void *content, int len);
 
 /*
  * Initialization of shared memory
@@ -356,7 +357,7 @@ PostPrepare_Twophase(void)
  *		Reserve the GID for the given transaction.
  */
 GlobalTransaction
-MarkAsPreparing(TransactionId xid, const char *gid,
+MarkAsPreparing(FullTransactionId fxid, const char *gid,
 				TimestampTz prepared_at, Oid owner, Oid databaseid)
 {
 	GlobalTransaction gxact;
@@ -407,7 +408,7 @@ MarkAsPreparing(TransactionId xid, const char *gid,
 	gxact = TwoPhaseState->freeGXacts;
 	TwoPhaseState->freeGXacts = gxact->next;
 
-	MarkAsPreparingGuts(gxact, xid, gid, prepared_at, owner, databaseid);
+	MarkAsPreparingGuts(gxact, fxid, gid, prepared_at, owner, databaseid);
 
 	gxact->ondisk = false;
 
@@ -430,11 +431,13 @@ MarkAsPreparing(TransactionId xid, const char *gid,
  * Note: This function should be called with appropriate locks held.
  */
 static void
-MarkAsPreparingGuts(GlobalTransaction gxact, TransactionId xid, const char *gid,
-					TimestampTz prepared_at, Oid owner, Oid databaseid)
+MarkAsPreparingGuts(GlobalTransaction gxact, FullTransactionId fxid,
+					const char *gid, TimestampTz prepared_at, Oid owner,
+					Oid databaseid)
 {
 	PGPROC	   *proc;
 	int			i;
+	TransactionId xid = XidFromFullTransactionId(fxid);
 
 	Assert(LWLockHeldByMeInMode(TwoPhaseStateLock, LW_EXCLUSIVE));
 
@@ -479,7 +482,7 @@ MarkAsPreparingGuts(GlobalTransaction gxact, TransactionId xid, const char *gid,
 	proc->subxidStatus.count = 0;
 
 	gxact->prepared_at = prepared_at;
-	gxact->xid = xid;
+	gxact->fxid = fxid;
 	gxact->owner = owner;
 	gxact->locking_backend = MyProcNumber;
 	gxact->valid = false;
@@ -797,12 +800,12 @@ pg_prepared_xact(PG_FUNCTION_ARGS)
  * caller had better hold it.
  */
 static GlobalTransaction
-TwoPhaseGetGXact(TransactionId xid, bool lock_held)
+TwoPhaseGetGXact(FullTransactionId fxid, bool lock_held)
 {
 	GlobalTransaction result = NULL;
 	int			i;
 
-	static TransactionId cached_xid = InvalidTransactionId;
+	static FullTransactionId cached_fxid = {0};
 	static GlobalTransaction cached_gxact = NULL;
 
 	Assert(!lock_held || LWLockHeldByMe(TwoPhaseStateLock));
@@ -811,7 +814,7 @@ TwoPhaseGetGXact(TransactionId xid, bool lock_held)
 	 * During a recovery, COMMIT PREPARED, or ABORT PREPARED, we'll be called
 	 * repeatedly for the same XID.  We can save work with a simple cache.
 	 */
-	if (xid == cached_xid)
+	if (FullTransactionIdEquals(fxid, cached_fxid))
 		return cached_gxact;
 
 	if (!lock_held)
@@ -821,7 +824,7 @@ TwoPhaseGetGXact(TransactionId xid, bool lock_held)
 	{
 		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
 
-		if (gxact->xid == xid)
+		if (FullTransactionIdEquals(gxact->fxid, fxid))
 		{
 			result = gxact;
 			break;
@@ -832,9 +835,10 @@ TwoPhaseGetGXact(TransactionId xid, bool lock_held)
 		LWLockRelease(TwoPhaseStateLock);
 
 	if (result == NULL)			/* should not happen */
-		elog(ERROR, "failed to find GlobalTransaction for xid %u", xid);
+		elog(ERROR, "failed to find GlobalTransaction for xid %u",
+			 XidFromFullTransactionId(fxid));
 
-	cached_xid = xid;
+	cached_fxid = fxid;
 	cached_gxact = result;
 
 	return result;
@@ -881,7 +885,7 @@ TwoPhaseGetXidByVirtualXID(VirtualTransactionId vxid,
 				*have_more = true;
 				break;
 			}
-			result = gxact->xid;
+			result = XidFromFullTransactionId(gxact->fxid);
 		}
 	}
 
@@ -892,7 +896,7 @@ TwoPhaseGetXidByVirtualXID(VirtualTransactionId vxid,
 
 /*
  * TwoPhaseGetDummyProcNumber
- *		Get the dummy proc number for prepared transaction specified by XID
+ *		Get the dummy proc number for prepared transaction
  *
  * Dummy proc numbers are similar to proc numbers of real backends.  They
  * start at MaxBackends, and are unique across all currently active real
@@ -900,24 +904,24 @@ TwoPhaseGetXidByVirtualXID(VirtualTransactionId vxid,
  * TwoPhaseStateLock will not be taken, so the caller had better hold it.
  */
 ProcNumber
-TwoPhaseGetDummyProcNumber(TransactionId xid, bool lock_held)
+TwoPhaseGetDummyProcNumber(FullTransactionId fxid, bool lock_held)
 {
-	GlobalTransaction gxact = TwoPhaseGetGXact(xid, lock_held);
+	GlobalTransaction gxact = TwoPhaseGetGXact(fxid, lock_held);
 
 	return gxact->pgprocno;
 }
 
 /*
  * TwoPhaseGetDummyProc
- *		Get the PGPROC that represents a prepared transaction specified by XID
+ *		Get the PGPROC that represents a prepared transaction
  *
  * If lock_held is set to true, TwoPhaseStateLock will not be taken, so the
  * caller had better hold it.
  */
 PGPROC *
-TwoPhaseGetDummyProc(TransactionId xid, bool lock_held)
+TwoPhaseGetDummyProc(FullTransactionId fxid, bool lock_held)
 {
-	GlobalTransaction gxact = TwoPhaseGetGXact(xid, lock_held);
+	GlobalTransaction gxact = TwoPhaseGetGXact(fxid, lock_held);
 
 	return GetPGProcByNumber(gxact->pgprocno);
 }
@@ -942,10 +946,8 @@ AdjustToFullTransactionId(TransactionId xid)
 }
 
 static inline int
-TwoPhaseFilePath(char *path, TransactionId xid)
+TwoPhaseFilePath(char *path, FullTransactionId fxid)
 {
-	FullTransactionId fxid = AdjustToFullTransactionId(xid);
-
 	return snprintf(path, MAXPGPATH, TWOPHASE_DIR "/%08X%08X",
 					EpochFromFullTransactionId(fxid),
 					XidFromFullTransactionId(fxid));
@@ -1049,7 +1051,7 @@ void
 StartPrepare(GlobalTransaction gxact)
 {
 	PGPROC	   *proc = GetPGProcByNumber(gxact->pgprocno);
-	TransactionId xid = gxact->xid;
+	TransactionId xid = XidFromFullTransactionId(gxact->fxid);
 	TwoPhaseFileHeader hdr;
 	TransactionId *children;
 	RelFileLocator *commitrels;
@@ -1281,10 +1283,11 @@ RegisterTwoPhaseRecord(TwoPhaseRmgrId rmid, uint16 info,
  * If it looks OK (has a valid magic number and CRC), return the palloc'd
  * contents of the file, issuing an error when finding corrupted data.  If
  * missing_ok is true, which indicates that missing files can be safely
- * ignored, then return NULL.  This state can be reached when doing recovery.
+ * ignored, then return NULL.  This state can be reached when doing recovery
+ * after discarding two-phase files from frozen epochs.
  */
 static char *
-ReadTwoPhaseFile(TransactionId xid, bool missing_ok)
+ReadTwoPhaseFile(FullTransactionId fxid, bool missing_ok)
 {
 	char		path[MAXPGPATH];
 	char	   *buf;
@@ -1296,7 +1299,7 @@ ReadTwoPhaseFile(TransactionId xid, bool missing_ok)
 				file_crc;
 	int			r;
 
-	TwoPhaseFilePath(path, xid);
+	TwoPhaseFilePath(path, fxid);
 
 	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
 	if (fd < 0)
@@ -1461,6 +1464,7 @@ StandbyTransactionIdIsPrepared(TransactionId xid)
 	char	   *buf;
 	TwoPhaseFileHeader *hdr;
 	bool		result;
+	FullTransactionId fxid;
 
 	Assert(TransactionIdIsValid(xid));
 
@@ -1468,7 +1472,8 @@ StandbyTransactionIdIsPrepared(TransactionId xid)
 		return false;			/* nothing to do */
 
 	/* Read and validate file */
-	buf = ReadTwoPhaseFile(xid, true);
+	fxid = AdjustToFullTransactionId(xid);
+	buf = ReadTwoPhaseFile(fxid, true);
 	if (buf == NULL)
 		return false;
 
@@ -1488,6 +1493,7 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
 {
 	GlobalTransaction gxact;
 	PGPROC	   *proc;
+	FullTransactionId fxid;
 	TransactionId xid;
 	bool		ondisk;
 	char	   *buf;
@@ -1509,7 +1515,8 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
 	 */
 	gxact = LockGXact(gid, GetUserId());
 	proc = GetPGProcByNumber(gxact->pgprocno);
-	xid = gxact->xid;
+	fxid = gxact->fxid;
+	xid = XidFromFullTransactionId(fxid);
 
 	/*
 	 * Read and validate 2PC state data. State data will typically be stored
@@ -1517,7 +1524,7 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
 	 * to disk if for some reason they have lived for a long time.
 	 */
 	if (gxact->ondisk)
-		buf = ReadTwoPhaseFile(xid, false);
+		buf = ReadTwoPhaseFile(fxid, false);
 	else
 		XlogReadTwoPhaseData(gxact->prepare_start_lsn, &buf, NULL);
 
@@ -1636,11 +1643,11 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
 
 	/* And now do the callbacks */
 	if (isCommit)
-		ProcessRecords(bufptr, xid, twophase_postcommit_callbacks);
+		ProcessRecords(bufptr, fxid, twophase_postcommit_callbacks);
 	else
-		ProcessRecords(bufptr, xid, twophase_postabort_callbacks);
+		ProcessRecords(bufptr, fxid, twophase_postabort_callbacks);
 
-	PredicateLockTwoPhaseFinish(xid, isCommit);
+	PredicateLockTwoPhaseFinish(fxid, isCommit);
 
 	/*
 	 * Read this value while holding the two-phase lock, as the on-disk 2PC
@@ -1664,7 +1671,7 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
 	 * And now we can clean up any files we may have left.
 	 */
 	if (ondisk)
-		RemoveTwoPhaseFile(xid, true);
+		RemoveTwoPhaseFile(fxid, true);
 
 	MyLockedGxact = NULL;
 
@@ -1677,7 +1684,7 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
  * Scan 2PC state data in memory and call the indicated callbacks for each 2PC record.
  */
 static void
-ProcessRecords(char *bufptr, TransactionId xid,
+ProcessRecords(char *bufptr, FullTransactionId fxid,
 			   const TwoPhaseCallback callbacks[])
 {
 	for (;;)
@@ -1691,24 +1698,28 @@ ProcessRecords(char *bufptr, TransactionId xid,
 		bufptr += MAXALIGN(sizeof(TwoPhaseRecordOnDisk));
 
 		if (callbacks[record->rmid] != NULL)
-			callbacks[record->rmid] (xid, record->info, bufptr, record->len);
+			callbacks[record->rmid] (fxid, record->info, bufptr, record->len);
 
 		bufptr += MAXALIGN(record->len);
 	}
 }
 
 /*
- * Remove the 2PC file for the specified XID.
+ * Remove the 2PC file.
  *
  * If giveWarning is false, do not complain about file-not-present;
  * this is an expected case during WAL replay.
+ *
+ * This routine is used at early stages at recovery where future and
+ * past orphaned files are checked, hence the FullTransactionId to build
+ * a complete file name fit for the removal.
  */
 static void
-RemoveTwoPhaseFile(TransactionId xid, bool giveWarning)
+RemoveTwoPhaseFile(FullTransactionId fxid, bool giveWarning)
 {
 	char		path[MAXPGPATH];
 
-	TwoPhaseFilePath(path, xid);
+	TwoPhaseFilePath(path, fxid);
 	if (unlink(path))
 		if (errno != ENOENT || giveWarning)
 			ereport(WARNING,
@@ -1723,7 +1734,7 @@ RemoveTwoPhaseFile(TransactionId xid, bool giveWarning)
  * Note: content and len don't include CRC.
  */
 static void
-RecreateTwoPhaseFile(TransactionId xid, void *content, int len)
+RecreateTwoPhaseFile(FullTransactionId fxid, void *content, int len)
 {
 	char		path[MAXPGPATH];
 	pg_crc32c	statefile_crc;
@@ -1734,7 +1745,7 @@ RecreateTwoPhaseFile(TransactionId xid, void *content, int len)
 	COMP_CRC32C(statefile_crc, content, len);
 	FIN_CRC32C(statefile_crc);
 
-	TwoPhaseFilePath(path, xid);
+	TwoPhaseFilePath(path, fxid);
 
 	fd = OpenTransientFile(path,
 						   O_CREAT | O_TRUNC | O_WRONLY | PG_BINARY);
@@ -1846,7 +1857,7 @@ CheckPointTwoPhase(XLogRecPtr redo_horizon)
 			int			len;
 
 			XlogReadTwoPhaseData(gxact->prepare_start_lsn, &buf, &len);
-			RecreateTwoPhaseFile(gxact->xid, buf, len);
+			RecreateTwoPhaseFile(gxact->fxid, buf, len);
 			gxact->ondisk = true;
 			gxact->prepare_start_lsn = InvalidXLogRecPtr;
 			gxact->prepare_end_lsn = InvalidXLogRecPtr;
@@ -1897,19 +1908,17 @@ restoreTwoPhaseData(void)
 		if (strlen(clde->d_name) == 16 &&
 			strspn(clde->d_name, "0123456789ABCDEF") == 16)
 		{
-			TransactionId xid;
 			FullTransactionId fxid;
 			char	   *buf;
 
 			fxid = FullTransactionIdFromU64(strtou64(clde->d_name, NULL, 16));
-			xid = XidFromFullTransactionId(fxid);
 
-			buf = ProcessTwoPhaseBuffer(xid, InvalidXLogRecPtr,
+			buf = ProcessTwoPhaseBuffer(fxid, InvalidXLogRecPtr,
 										true, false, false);
 			if (buf == NULL)
 				continue;
 
-			PrepareRedoAdd(buf, InvalidXLogRecPtr,
+			PrepareRedoAdd(fxid, buf, InvalidXLogRecPtr,
 						   InvalidXLogRecPtr, InvalidRepOriginId);
 		}
 	}
@@ -1968,9 +1977,7 @@ PrescanPreparedTransactions(TransactionId **xids_p, int *nxids_p)
 
 		Assert(gxact->inredo);
 
-		xid = gxact->xid;
-
-		buf = ProcessTwoPhaseBuffer(xid,
+		buf = ProcessTwoPhaseBuffer(gxact->fxid,
 									gxact->prepare_start_lsn,
 									gxact->ondisk, false, true);
 
@@ -1981,6 +1988,7 @@ PrescanPreparedTransactions(TransactionId **xids_p, int *nxids_p)
 		 * OK, we think this file is valid.  Incorporate xid into the
 		 * running-minimum result.
 		 */
+		xid = XidFromFullTransactionId(gxact->fxid);
 		if (TransactionIdPrecedes(xid, result))
 			result = xid;
 
@@ -2036,15 +2044,12 @@ StandbyRecoverPreparedTransactions(void)
 	LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
 	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
 	{
-		TransactionId xid;
 		char	   *buf;
 		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
 
 		Assert(gxact->inredo);
 
-		xid = gxact->xid;
-
-		buf = ProcessTwoPhaseBuffer(xid,
+		buf = ProcessTwoPhaseBuffer(gxact->fxid,
 									gxact->prepare_start_lsn,
 									gxact->ondisk, true, false);
 		if (buf != NULL)
@@ -2077,16 +2082,14 @@ RecoverPreparedTransactions(void)
 	LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
 	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
 	{
-		TransactionId xid;
 		char	   *buf;
 		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+		FullTransactionId fxid = gxact->fxid;
 		char	   *bufptr;
 		TwoPhaseFileHeader *hdr;
 		TransactionId *subxids;
 		const char *gid;
 
-		xid = gxact->xid;
-
 		/*
 		 * Reconstruct subtrans state for the transaction --- needed because
 		 * pg_subtrans is not preserved over a restart.  Note that we are
@@ -2096,17 +2099,20 @@ RecoverPreparedTransactions(void)
 		 * SubTransSetParent has been set before, if the prepared transaction
 		 * generated xid assignment records.
 		 */
-		buf = ProcessTwoPhaseBuffer(xid,
+		buf = ProcessTwoPhaseBuffer(gxact->fxid,
 									gxact->prepare_start_lsn,
 									gxact->ondisk, true, false);
 		if (buf == NULL)
 			continue;
 
 		ereport(LOG,
-				(errmsg("recovering prepared transaction %u from shared memory", xid)));
+				(errmsg("recovering prepared transaction %u of epoch %u from shared memory",
+						XidFromFullTransactionId(gxact->fxid),
+						EpochFromFullTransactionId(gxact->fxid))));
 
 		hdr = (TwoPhaseFileHeader *) buf;
-		Assert(TransactionIdEquals(hdr->xid, xid));
+		Assert(TransactionIdEquals(hdr->xid,
+								   XidFromFullTransactionId(gxact->fxid)));
 		bufptr = buf + MAXALIGN(sizeof(TwoPhaseFileHeader));
 		gid = (const char *) bufptr;
 		bufptr += MAXALIGN(hdr->gidlen);
@@ -2122,7 +2128,7 @@ RecoverPreparedTransactions(void)
 		 * Recreate its GXACT and dummy PGPROC. But, check whether it was
 		 * added in redo and already has a shmem entry for it.
 		 */
-		MarkAsPreparingGuts(gxact, xid, gid,
+		MarkAsPreparingGuts(gxact, gxact->fxid, gid,
 							hdr->prepared_at,
 							hdr->owner, hdr->database);
 
@@ -2137,7 +2143,7 @@ RecoverPreparedTransactions(void)
 		/*
 		 * Recover other state (notably locks) using resource managers.
 		 */
-		ProcessRecords(bufptr, xid, twophase_recover_callbacks);
+		ProcessRecords(bufptr, fxid, twophase_recover_callbacks);
 
 		/*
 		 * Release locks held by the standby process after we process each
@@ -2145,7 +2151,7 @@ RecoverPreparedTransactions(void)
 		 * additional locks at any one time.
 		 */
 		if (InHotStandby)
-			StandbyReleaseLockTree(xid, hdr->nsubxacts, subxids);
+			StandbyReleaseLockTree(hdr->xid, hdr->nsubxacts, subxids);
 
 		/*
 		 * We're done with recovering this transaction. Clear MyLockedGxact,
@@ -2164,7 +2170,7 @@ RecoverPreparedTransactions(void)
 /*
  * ProcessTwoPhaseBuffer
  *
- * Given a transaction id, read it either from disk or read it directly
+ * Given a FullTransactionId, read it either from disk or read it directly
  * via shmem xlog record pointer using the provided "prepare_start_lsn".
  *
  * If setParent is true, set up subtransaction parent linkages.
@@ -2173,13 +2179,12 @@ RecoverPreparedTransactions(void)
  * value scanned.
  */
 static char *
-ProcessTwoPhaseBuffer(TransactionId xid,
+ProcessTwoPhaseBuffer(FullTransactionId fxid,
 					  XLogRecPtr prepare_start_lsn,
 					  bool fromdisk,
 					  bool setParent, bool setNextXid)
 {
 	FullTransactionId nextXid = TransamVariables->nextXid;
-	TransactionId origNextXid = XidFromFullTransactionId(nextXid);
 	TransactionId *subxids;
 	char	   *buf;
 	TwoPhaseFileHeader *hdr;
@@ -2191,41 +2196,46 @@ ProcessTwoPhaseBuffer(TransactionId xid,
 		Assert(prepare_start_lsn != InvalidXLogRecPtr);
 
 	/* Already processed? */
-	if (TransactionIdDidCommit(xid) || TransactionIdDidAbort(xid))
+	if (TransactionIdDidCommit(XidFromFullTransactionId(fxid)) ||
+		TransactionIdDidAbort(XidFromFullTransactionId(fxid)))
 	{
 		if (fromdisk)
 		{
 			ereport(WARNING,
-					(errmsg("removing stale two-phase state file for transaction %u",
-							xid)));
-			RemoveTwoPhaseFile(xid, true);
+					(errmsg("removing stale two-phase state file for transaction %u of epoch %u",
+							XidFromFullTransactionId(fxid),
+							EpochFromFullTransactionId(fxid))));
+			RemoveTwoPhaseFile(fxid, true);
 		}
 		else
 		{
 			ereport(WARNING,
-					(errmsg("removing stale two-phase state from memory for transaction %u",
-							xid)));
-			PrepareRedoRemove(xid, true);
+					(errmsg("removing stale two-phase state from memory for transaction %u of epoch %u",
+							XidFromFullTransactionId(fxid),
+							EpochFromFullTransactionId(fxid))));
+			PrepareRedoRemoveFull(fxid, true);
 		}
 		return NULL;
 	}
 
 	/* Reject XID if too new */
-	if (TransactionIdFollowsOrEquals(xid, origNextXid))
+	if (FullTransactionIdFollowsOrEquals(fxid, nextXid))
 	{
 		if (fromdisk)
 		{
 			ereport(WARNING,
-					(errmsg("removing future two-phase state file for transaction %u",
-							xid)));
-			RemoveTwoPhaseFile(xid, true);
+					(errmsg("removing future two-phase state file for transaction %u of epoch %u",
+							XidFromFullTransactionId(fxid),
+							EpochFromFullTransactionId(fxid))));
+			RemoveTwoPhaseFile(fxid, true);
 		}
 		else
 		{
 			ereport(WARNING,
-					(errmsg("removing future two-phase state from memory for transaction %u",
-							xid)));
-			PrepareRedoRemove(xid, true);
+					(errmsg("removing future two-phase state from memory for transaction %u of epoch %u",
+							XidFromFullTransactionId(fxid),
+							EpochFromFullTransactionId(fxid))));
+			PrepareRedoRemoveFull(fxid, true);
 		}
 		return NULL;
 	}
@@ -2233,7 +2243,7 @@ ProcessTwoPhaseBuffer(TransactionId xid,
 	if (fromdisk)
 	{
 		/* Read and validate file */
-		buf = ReadTwoPhaseFile(xid, false);
+		buf = ReadTwoPhaseFile(fxid, false);
 	}
 	else
 	{
@@ -2243,18 +2253,20 @@ ProcessTwoPhaseBuffer(TransactionId xid,
 
 	/* Deconstruct header */
 	hdr = (TwoPhaseFileHeader *) buf;
-	if (!TransactionIdEquals(hdr->xid, xid))
+	if (!TransactionIdEquals(hdr->xid, XidFromFullTransactionId(fxid)))
 	{
 		if (fromdisk)
 			ereport(ERROR,
 					(errcode(ERRCODE_DATA_CORRUPTED),
-					 errmsg("corrupted two-phase state file for transaction %u",
-							xid)));
+					 errmsg("corrupted two-phase state file for transaction %u of epoch %u",
+							XidFromFullTransactionId(fxid),
+							EpochFromFullTransactionId(fxid))));
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_DATA_CORRUPTED),
-					 errmsg("corrupted two-phase state in memory for transaction %u",
-							xid)));
+					 errmsg("corrupted two-phase state in memory for transaction %u of epoch %u",
+							XidFromFullTransactionId(fxid),
+							EpochFromFullTransactionId(fxid))));
 	}
 
 	/*
@@ -2268,14 +2280,14 @@ ProcessTwoPhaseBuffer(TransactionId xid,
 	{
 		TransactionId subxid = subxids[i];
 
-		Assert(TransactionIdFollows(subxid, xid));
+		Assert(TransactionIdFollows(subxid, XidFromFullTransactionId(fxid)));
 
 		/* update nextXid if needed */
 		if (setNextXid)
 			AdvanceNextFullTransactionIdPastXid(subxid);
 
 		if (setParent)
-			SubTransSetParent(subxid, xid);
+			SubTransSetParent(subxid, XidFromFullTransactionId(fxid));
 	}
 
 	return buf;
@@ -2466,8 +2478,9 @@ RecordTransactionAbortPrepared(TransactionId xid,
  * data, the entry is marked as located on disk.
  */
 void
-PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
-			   XLogRecPtr end_lsn, RepOriginId origin_id)
+PrepareRedoAdd(FullTransactionId fxid, char *buf,
+			   XLogRecPtr start_lsn, XLogRecPtr end_lsn,
+			   RepOriginId origin_id)
 {
 	TwoPhaseFileHeader *hdr = (TwoPhaseFileHeader *) buf;
 	char	   *bufptr;
@@ -2477,6 +2490,13 @@ PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 	Assert(LWLockHeldByMeInMode(TwoPhaseStateLock, LW_EXCLUSIVE));
 	Assert(RecoveryInProgress());
 
+	if (!FullTransactionIdIsValid(fxid))
+	{
+		Assert(InRecovery);
+		fxid = FullTransactionIdFromAllowableAt(TransamVariables->nextXid,
+												hdr->xid);
+	}
+
 	bufptr = buf + MAXALIGN(sizeof(TwoPhaseFileHeader));
 	gid = (const char *) bufptr;
 
@@ -2505,7 +2525,8 @@ PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 	{
 		char		path[MAXPGPATH];
 
-		TwoPhaseFilePath(path, hdr->xid);
+		Assert(InRecovery);
+		TwoPhaseFilePath(path, fxid);
 
 		if (access(path, F_OK) == 0)
 		{
@@ -2536,7 +2557,7 @@ PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 	gxact->prepared_at = hdr->prepared_at;
 	gxact->prepare_start_lsn = start_lsn;
 	gxact->prepare_end_lsn = end_lsn;
-	gxact->xid = hdr->xid;
+	gxact->fxid = fxid;
 	gxact->owner = hdr->owner;
 	gxact->locking_backend = INVALID_PROC_NUMBER;
 	gxact->valid = false;
@@ -2555,11 +2576,13 @@ PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 						   false /* backward */ , false /* WAL */ );
 	}
 
-	elog(DEBUG2, "added 2PC data in shared memory for transaction %u", gxact->xid);
+	elog(DEBUG2, "added 2PC data in shared memory for transaction %u of epoch %u",
+		 XidFromFullTransactionId(gxact->fxid),
+		 EpochFromFullTransactionId(gxact->fxid));
 }
 
 /*
- * PrepareRedoRemove
+ * PrepareRedoRemoveFull
  *
  * Remove the corresponding gxact entry from TwoPhaseState. Also remove
  * the 2PC file if a prepared transaction was saved via an earlier checkpoint.
@@ -2567,8 +2590,8 @@ PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
  * Caller must hold TwoPhaseStateLock in exclusive mode, because TwoPhaseState
  * is updated.
  */
-void
-PrepareRedoRemove(TransactionId xid, bool giveWarning)
+static void
+PrepareRedoRemoveFull(FullTransactionId fxid, bool giveWarning)
 {
 	GlobalTransaction gxact = NULL;
 	int			i;
@@ -2581,7 +2604,7 @@ PrepareRedoRemove(TransactionId xid, bool giveWarning)
 	{
 		gxact = TwoPhaseState->prepXacts[i];
 
-		if (gxact->xid == xid)
+		if (FullTransactionIdEquals(gxact->fxid, fxid))
 		{
 			Assert(gxact->inredo);
 			found = true;
@@ -2598,12 +2621,28 @@ PrepareRedoRemove(TransactionId xid, bool giveWarning)
 	/*
 	 * And now we can clean up any files we may have left.
 	 */
-	elog(DEBUG2, "removing 2PC data for transaction %u", xid);
+	elog(DEBUG2, "removing 2PC data for transaction %u of epoch %u ",
+		 XidFromFullTransactionId(fxid),
+		 EpochFromFullTransactionId(fxid));
+
 	if (gxact->ondisk)
-		RemoveTwoPhaseFile(xid, giveWarning);
+		RemoveTwoPhaseFile(fxid, giveWarning);
+
 	RemoveGXact(gxact);
 }
 
+/*
+ * Wrapper of PrepareRedoRemoveFull(), for TransactionIds.
+ */
+void
+PrepareRedoRemove(TransactionId xid, bool giveWarning)
+{
+	FullTransactionId fxid =
+		FullTransactionIdFromAllowableAt(TransamVariables->nextXid, xid);
+
+	PrepareRedoRemoveFull(fxid, giveWarning);
+}
+
 /*
  * LookupGXact
  *		Check if the prepared transaction with the given GID, lsn and timestamp
@@ -2648,7 +2687,7 @@ LookupGXact(const char *gid, XLogRecPtr prepare_end_lsn,
 			 * between publisher and subscriber.
 			 */
 			if (gxact->ondisk)
-				buf = ReadTwoPhaseFile(gxact->xid, false);
+				buf = ReadTwoPhaseFile(gxact->fxid, false);
 			else
 			{
 				Assert(gxact->prepare_start_lsn);
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index b885513f7654..41601fcb2803 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2515,7 +2515,7 @@ static void
 PrepareTransaction(void)
 {
 	TransactionState s = CurrentTransactionState;
-	TransactionId xid = GetCurrentTransactionId();
+	FullTransactionId fxid = GetCurrentFullTransactionId();
 	GlobalTransaction gxact;
 	TimestampTz prepared_at;
 
@@ -2644,7 +2644,7 @@ PrepareTransaction(void)
 	 * Reserve the GID for this transaction. This could fail if the requested
 	 * GID is invalid or already in use.
 	 */
-	gxact = MarkAsPreparing(xid, prepareGID, prepared_at,
+	gxact = MarkAsPreparing(fxid, prepareGID, prepared_at,
 							GetUserId(), MyDatabaseId);
 	prepareGID = NULL;
 
@@ -2694,7 +2694,7 @@ PrepareTransaction(void)
 	 * ProcArrayClearTransaction().  Otherwise, a GetLockConflicts() would
 	 * conclude "xact already committed or aborted" for our locks.
 	 */
-	PostPrepare_Locks(xid);
+	PostPrepare_Locks(fxid);
 
 	/*
 	 * Let others know about no transaction in progress by me.  This has to be
@@ -2738,9 +2738,9 @@ PrepareTransaction(void)
 
 	PostPrepare_smgr();
 
-	PostPrepare_MultiXact(xid);
+	PostPrepare_MultiXact(fxid);
 
-	PostPrepare_PredicateLocks(xid);
+	PostPrepare_PredicateLocks(fxid);
 
 	ResourceOwnerRelease(TopTransactionResourceOwner,
 						 RESOURCE_RELEASE_LOCKS,
@@ -6420,7 +6420,8 @@ xact_redo(XLogReaderState *record)
 		 * gxact entry.
 		 */
 		LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
-		PrepareRedoAdd(XLogRecGetData(record),
+		PrepareRedoAdd(InvalidFullTransactionId,
+					   XLogRecGetData(record),
 					   record->ReadRecPtr,
 					   record->EndRecPtr,
 					   XLogRecGetOrigin(record));
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 2776ceb295be..62f3471448eb 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -3539,9 +3539,9 @@ AtPrepare_Locks(void)
  * but that probably costs more cycles.
  */
 void
-PostPrepare_Locks(TransactionId xid)
+PostPrepare_Locks(FullTransactionId fxid)
 {
-	PGPROC	   *newproc = TwoPhaseGetDummyProc(xid, false);
+	PGPROC	   *newproc = TwoPhaseGetDummyProc(fxid, false);
 	HASH_SEQ_STATUS status;
 	LOCALLOCK  *locallock;
 	LOCK	   *lock;
@@ -4324,11 +4324,11 @@ DumpAllLocks(void)
  * and PANIC anyway.
  */
 void
-lock_twophase_recover(TransactionId xid, uint16 info,
+lock_twophase_recover(FullTransactionId fxid, uint16 info,
 					  void *recdata, uint32 len)
 {
 	TwoPhaseLockRecord *rec = (TwoPhaseLockRecord *) recdata;
-	PGPROC	   *proc = TwoPhaseGetDummyProc(xid, false);
+	PGPROC	   *proc = TwoPhaseGetDummyProc(fxid, false);
 	LOCKTAG    *locktag;
 	LOCKMODE	lockmode;
 	LOCKMETHODID lockmethodid;
@@ -4505,7 +4505,7 @@ lock_twophase_recover(TransactionId xid, uint16 info,
  * starting up into hot standby mode.
  */
 void
-lock_twophase_standby_recover(TransactionId xid, uint16 info,
+lock_twophase_standby_recover(FullTransactionId fxid, uint16 info,
 							  void *recdata, uint32 len)
 {
 	TwoPhaseLockRecord *rec = (TwoPhaseLockRecord *) recdata;
@@ -4524,7 +4524,7 @@ lock_twophase_standby_recover(TransactionId xid, uint16 info,
 	if (lockmode == AccessExclusiveLock &&
 		locktag->locktag_type == LOCKTAG_RELATION)
 	{
-		StandbyAcquireAccessExclusiveLock(xid,
+		StandbyAcquireAccessExclusiveLock(XidFromFullTransactionId(fxid),
 										  locktag->locktag_field1 /* dboid */ ,
 										  locktag->locktag_field2 /* reloid */ );
 	}
@@ -4537,11 +4537,11 @@ lock_twophase_standby_recover(TransactionId xid, uint16 info,
  * Find and release the lock indicated by the 2PC record.
  */
 void
-lock_twophase_postcommit(TransactionId xid, uint16 info,
+lock_twophase_postcommit(FullTransactionId fxid, uint16 info,
 						 void *recdata, uint32 len)
 {
 	TwoPhaseLockRecord *rec = (TwoPhaseLockRecord *) recdata;
-	PGPROC	   *proc = TwoPhaseGetDummyProc(xid, true);
+	PGPROC	   *proc = TwoPhaseGetDummyProc(fxid, true);
 	LOCKTAG    *locktag;
 	LOCKMETHODID lockmethodid;
 	LockMethod	lockMethodTable;
@@ -4563,10 +4563,10 @@ lock_twophase_postcommit(TransactionId xid, uint16 info,
  * This is actually just the same as the COMMIT case.
  */
 void
-lock_twophase_postabort(TransactionId xid, uint16 info,
+lock_twophase_postabort(FullTransactionId fxid, uint16 info,
 						void *recdata, uint32 len)
 {
-	lock_twophase_postcommit(xid, info, recdata, len);
+	lock_twophase_postcommit(fxid, info, recdata, len);
 }
 
 /*
diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index d82114ffca16..c07fb5883555 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -191,7 +191,7 @@
  *		AtPrepare_PredicateLocks(void);
  *		PostPrepare_PredicateLocks(TransactionId xid);
  *		PredicateLockTwoPhaseFinish(TransactionId xid, bool isCommit);
- *		predicatelock_twophase_recover(TransactionId xid, uint16 info,
+ *		predicatelock_twophase_recover(FullTransactionId fxid, uint16 info,
  *									   void *recdata, uint32 len);
  */
 
@@ -4856,7 +4856,7 @@ AtPrepare_PredicateLocks(void)
  *		anyway. We only need to clean up our local state.
  */
 void
-PostPrepare_PredicateLocks(TransactionId xid)
+PostPrepare_PredicateLocks(FullTransactionId fxid)
 {
 	if (MySerializableXact == InvalidSerializableXact)
 		return;
@@ -4879,12 +4879,12 @@ PostPrepare_PredicateLocks(TransactionId xid)
  *		commits or aborts.
  */
 void
-PredicateLockTwoPhaseFinish(TransactionId xid, bool isCommit)
+PredicateLockTwoPhaseFinish(FullTransactionId fxid, bool isCommit)
 {
 	SERIALIZABLEXID *sxid;
 	SERIALIZABLEXIDTAG sxidtag;
 
-	sxidtag.xid = xid;
+	sxidtag.xid = XidFromFullTransactionId(fxid);
 
 	LWLockAcquire(SerializableXactHashLock, LW_SHARED);
 	sxid = (SERIALIZABLEXID *)
@@ -4906,10 +4906,11 @@ PredicateLockTwoPhaseFinish(TransactionId xid, bool isCommit)
  * Re-acquire a predicate lock belonging to a transaction that was prepared.
  */
 void
-predicatelock_twophase_recover(TransactionId xid, uint16 info,
+predicatelock_twophase_recover(FullTransactionId fxid, uint16 info,
 							   void *recdata, uint32 len)
 {
 	TwoPhasePredicateRecord *record;
+	TransactionId xid = XidFromFullTransactionId(fxid);
 
 	Assert(len == sizeof(TwoPhasePredicateRecord));
 
diff --git a/src/backend/utils/activity/pgstat_relation.c b/src/backend/utils/activity/pgstat_relation.c
index 28587e2916b1..69df741cbf63 100644
--- a/src/backend/utils/activity/pgstat_relation.c
+++ b/src/backend/utils/activity/pgstat_relation.c
@@ -744,7 +744,7 @@ PostPrepare_PgStat_Relations(PgStat_SubXactStatus *xact_state)
  * Load the saved counts into our local pgstats state.
  */
 void
-pgstat_twophase_postcommit(TransactionId xid, uint16 info,
+pgstat_twophase_postcommit(FullTransactionId fxid, uint16 info,
 						   void *recdata, uint32 len)
 {
 	TwoPhasePgStatRecord *rec = (TwoPhasePgStatRecord *) recdata;
@@ -780,7 +780,7 @@ pgstat_twophase_postcommit(TransactionId xid, uint16 info,
  * as aborted.
  */
 void
-pgstat_twophase_postabort(TransactionId xid, uint16 info,
+pgstat_twophase_postabort(FullTransactionId fxid, uint16 info,
 						  void *recdata, uint32 len)
 {
 	TwoPhasePgStatRecord *rec = (TwoPhasePgStatRecord *) recdata;
-- 
2.49.0

#12

Noah Misch

noah@leadboat.com

6 months ago

In reply to: Michael Paquier (#11)

Re: Issues with 2PC at recovery: CLOG lookups and GlobalTransactionData

On Fri, Jun 20, 2025 at 09:02:00AM +0900, Michael Paquier wrote:

On Fri, Jun 06, 2025 at 03:34:13PM -0700, Noah Misch wrote:

Shared memory should (not "must") omit a GXACT whose prepare_start_lsn lies
beyond the point recovery has reached. In other words, recovery should keep
anachronistic information out of shared memory. For example, one could
imagine a straw man implementation in which, before reaching consistency, we
load each pg_twophase file that does pass its CRC check. I'd dislike that,
because it would facilitate anachronisms involving the
GlobalTransactionData.gid field. We could end up with two GXACT having the
same gid, one being the present (according to recovery progress) instance of
that gid and another being a future instance. The system might not
malfunction today, but I'd consider that fragile. Anachronistic entries might
cause recovery to need more shared memory than max_prepared_transactions has
allocated.

So this comes down to attempting to translate the information from the
2PC files to shared memory at a point earlier than what we could.
Thinking more about this point, there is something that I can see us
potentially do here: how about tracking the file names in shared
memory when we go through them in restoreTwoPhaseData()? That
addresses my point about doing only one scan of pg_twophase/ at the
beginning of recovery. I have not studied in details how doable it
is,

A challenge is that the number of files in pg_twophase can exceed
max_prepared_transactions if we've not yet reached consistency. That happens
with a sequence like this:

- exactly max_prepared_transactions files are in pg_twophase
- backup process copies pg_twophase/$file1, then pauses before the next readdir
- some backend deletes pg_twophase/$file1
- checkpointer creates pg_twophase/$fileN
- backup process wakes up and copies remaining pg_twophase files

That's the only problem coming to mind. If that problem doesn't cancel out
the benefits of scanning pg_twophase early, you may as well do it.

By the way, what do you think we should do with 0001 at this stage
(your refactoring work with some changes I've sprinkled)? I was
looking at it this morning with fresh eyes and my opinion about it is
that it improves the situation in all these 2PC callers on HEAD now
that we need to deal with fxids for the file names, so that's an
independent piece. Perhaps this should be applied once v19 opens for
business while we sort out the rest that 0002 was trying to cover?

That looks harmless.

Note: this introduced two calls to FullTransactionIdFromAllowableAt()
in StandbyTransactionIdIsPrepared() and PrepareRedoAdd(), which worried
me a bit on second look and gave me a pause:
- The call in StandbyTransactionIdIsPrepared() should use
AdjustToFullTransactionId(), ReadTwoPhaseFile() calls
TwoPhaseFilePath(), which uses AdjustToFullTransactionId() on HEAD.
- The call in PrepareRedoAdd() slightly worried me, but I think that
we're OK with using FullTransactionIdFromAllowableAt(), as we are in
the case of a 2PC transaction recovered from WAL, with the XID coming
from the file's header. I think that we'd better put an
Assert(InRecovery) in this path, just for safety.

What do you think?

It's an appropriate level of risk-taking.

#13

Michael Paquier

michael@paquier.xyz

6 months ago

In reply to: Noah Misch (#12)

Re: Issues with 2PC at recovery: CLOG lookups and GlobalTransactionData

On Sun, Jul 06, 2025 at 11:07:11AM -0700, Noah Misch wrote:

A challenge is that the number of files in pg_twophase can exceed
max_prepared_transactions if we've not yet reached consistency. That happens
with a sequence like this:

- exactly max_prepared_transactions files are in pg_twophase
- backup process copies pg_twophase/$file1, then pauses before the next readdir
- some backend deletes pg_twophase/$file1
- checkpointer creates pg_twophase/$fileN
- backup process wakes up and copies remaining pg_twophase files

That's the only problem coming to mind. If that problem doesn't cancel out
the benefits of scanning pg_twophase early, you may as well do it.

Yeah.. That's where a linked link could become useful, and where
we'll need something different for PreparedXactProcs and the fake proc
entries. Not sure about that yet.

By the way, what do you think we should do with 0001 at this stage
(your refactoring work with some changes I've sprinkled)? I was
looking at it this morning with fresh eyes and my opinion about it is
that it improves the situation in all these 2PC callers on HEAD now
that we need to deal with fxids for the file names, so that's an
independent piece. Perhaps this should be applied once v19 opens for
business while we sort out the rest that 0002 was trying to cover?

That looks harmless.

Thanks. I've applied the refactoring piece for now, and marked the CF
entry as returned with feedback. I'm hoping to get back to that for
the September commit fest.
--
Michael