An improvement of ProcessTwoPhaseBuffer logic

Started by Vitaly Davydovabout 1 year ago24 messages
#1Vitaly Davydov
v.davydov@postgrespro.ru

Dear Hackers,

I would like to discuss ProcessTwoPhaseBuffer function. It reads two-phase transaction states from disk or the WAL. It takes xid as well as some other input parameters and executes the following steps:

Step #1: Check if xid is committed or aborted in clog (TransactionIdDidCommit, TransactionIdDidAbort)
Step #2: Check if xid is not equal or greater than ShmemVariableCache->nextXid
Step #3: Read two-phase state for the specified xid from memory or the corresponding file and returns it

In some, very rare scenarios, the postgres instance will newer recover because of such logic. Imagine, that the two_phase directory contains some files with two-phase states of transactions of distant future. I assume, it can happen if some WAL segments are broken and ignored (as well as clog data) but two_phase directory was not broken. In recovery, postgresql reads all the files in two_phase and tries to recover two-phase states.

The problem appears in the functions TransactionIdDidCommit or TransactionIdDidAbort. These functions may fail with the FATAL message like below when no clog state on disk is available for the xid:

FATAL: could not access status of transaction 286331153
DETAIL: Could not open file "pg_xact/0111": No such file or directory.

Such error do not allow the postgresql instance to be started.

My guess, if to swap Step #1 with Step #2 such error will disappear because transactions will be filtered when comparing xid with ShmemVariableCache->nextXid before accessing clog. The function will be more robust. In general, it works but I'm not sure that such logic will not break some rare boundary cases. Another solution is to catch and ignore such error, but the original solution is the simpler one. I appreciate any thoughts concerning this topic. May be, you know some cases when such change in logic is not relevant?

Thank you in advance!

With best regards,
Vitaly

#2Michael Paquier
michael@paquier.xyz
In reply to: Vitaly Davydov (#1)
1 attachment(s)
Re: An improvement of ProcessTwoPhaseBuffer logic

On Tue, Dec 24, 2024 at 04:26:32PM +0300, Vitaly Davydov wrote:

In some, very rare scenarios, the postgres instance will newer
recover because of such logic. Imagine, that the two_phase directory
contains some files with two-phase states of transactions of distant
future. I assume, it can happen if some WAL segments are broken and
ignored (as well as clog data) but two_phase directory was not
broken. In recovery, postgresql reads all the files in two_phase and
tries to recover two-phase states.

Well, one recent issue in this area is a bug fixed by cf4401fe6cf5:
/messages/by-id/tencent_A7F059B5136A359625C7B2E4A386B3C3F007@qq.com

If you see other bug patterns like this one, please let me know.
There are good changes that I could be involved in stuff that touched
this area of the code, so I'm mostly behind its maintenance these
days.

My guess, if to swap Step #1 with Step #2 such error will disappear
because transactions will be filtered when comparing xid with
ShmemVariableCache->nextXid before accessing clog. The function will
be more robust. In general, it works but I'm not sure that such
logic will not break some rare boundary cases. Another solution is
to catch and ignore such error, but the original solution is the
simpler one. I appreciate any thoughts concerning this topic. May
be, you know some cases when such change in logic is not relevant?

Hmm. Historically, up to 9.6, StandbyRecoverPreparedTransactions()
and RecoverPreparedTransactions() have been in charge of doing the
commit/abort checks with potentially clog lookups when it came to past
files. PrescanPreparedTransactions() has been in charge of doing the
check about future files.

If you look closely, PrescanPreparedTransactions() is called in two
code paths before StandbyRecoverPreparedTransactions(), meaning that
we did the checks for the future files first, then the past checks
with potential CLOG lookups.

In v10~, things have changed to use ProcessTwoPhaseBuffer(), and as
you say, the checks are what we have now: they're done in a reverse
order. So, yes, I agree that you have a point here: the past code was
safer because it would have been able to avoid clog lookups if we had
a 2PC file in the future for a reason or another so it was safer than
what we have now.

I'm wondering if the attached patch is something worth backpatching,
actually, as these failing CLOG lookups can lead to spurious failures
at startup. HEAD-only would be OK based on the odds so that's a
no-brainer. Vitaly, have you seen that in the wild as an effect of
future 2PC files?

Any opinions from others?
--
Michael

Attachments:

twophase-process-checks.patchtext/x-diff; charset=us-asciiDownload
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 49be1df91c..d3d6ff3dce 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -2206,26 +2206,6 @@ ProcessTwoPhaseBuffer(TransactionId xid,
 	if (!fromdisk)
 		Assert(prepare_start_lsn != InvalidXLogRecPtr);
 
-	/* Already processed? */
-	if (TransactionIdDidCommit(xid) || TransactionIdDidAbort(xid))
-	{
-		if (fromdisk)
-		{
-			ereport(WARNING,
-					(errmsg("removing stale two-phase state file for transaction %u",
-							xid)));
-			RemoveTwoPhaseFile(xid, true);
-		}
-		else
-		{
-			ereport(WARNING,
-					(errmsg("removing stale two-phase state from memory for transaction %u",
-							xid)));
-			PrepareRedoRemove(xid, true);
-		}
-		return NULL;
-	}
-
 	/* Reject XID if too new */
 	if (TransactionIdFollowsOrEquals(xid, origNextXid))
 	{
@@ -2246,6 +2226,26 @@ ProcessTwoPhaseBuffer(TransactionId xid,
 		return NULL;
 	}
 
+	/* Already processed? */
+	if (TransactionIdDidCommit(xid) || TransactionIdDidAbort(xid))
+	{
+		if (fromdisk)
+		{
+			ereport(WARNING,
+					(errmsg("removing stale two-phase state file for transaction %u",
+							xid)));
+			RemoveTwoPhaseFile(xid, true);
+		}
+		else
+		{
+			ereport(WARNING,
+					(errmsg("removing stale two-phase state from memory for transaction %u",
+							xid)));
+			PrepareRedoRemove(xid, true);
+		}
+		return NULL;
+	}
+
 	if (fromdisk)
 	{
 		/* Read and validate file */
#3Vitaly Davydov
v.davydov@postgrespro.ru
In reply to: Michael Paquier (#2)
Re: An improvement of ProcessTwoPhaseBuffer logic

Hi Michael,

Thank you for the explanation and the patch! I'm happy, that I seem to be on the right way.

On Wednesday, December 25, 2024 08:04 MSK, Michael Paquier <michael@paquier.xyz> wrote:

Vitaly, have you seen that in the wild as an effect of future 2PC files?

I haven't heard about this problem in production, but I've encountered it when I did some development in two-phase functionality. I fixed it by swapping the if blocks and it solved my problem.

It is pretty easy to reproduce it on the master branch:

1. Start an instance with enabled prepared transactions
2. Create a simple prepared transaction
3. Do checkpoint to write the transaction two-phase state into a file in pg_twophase subdirectory
4. Copy the created file, change its name to reflect a future xid (in my case: cp 00000000000002E8 00000000000FF2E8)
5. Commit the prepared transaction
6. Stop the instance with -m immediate
7. Start the instance

After starting, you can get an error like "could not start server". In the log file you can find a message like:

LOG: database system was not properly shut down; automatic recovery in progress
FATAL: could not access status of transaction 1045224
DETAIL: Could not read from file "pg_xact/0000" at offset 253952: read too few bytes.
2024-12-25 18:38:30.606 MSK [795557] LOG: startup process (PID 795560) exited with exit code 1

I tried your patch and it seems the server is started successfully. But I've found another problem in my synthetic test - it can not remove the file with the following message:

LOG: database system was not properly shut down; automatic recovery in progress
WARNING: removing future two-phase state file for transaction 1045224
WARNING: could not remove file "pg_twophase/FFFFFFFF000FF2E8": No such file or directory
LOG: redo starts at 0/1762978

The fill will never be removed automatically.
I guess, it is because we incorrectly calculate the two-phase file name using TwoPhaseFilePath in RemoveTwoPhaseFile in this scenario. It can be fixed if to pass file path directly from RecoverPreparedTransactions or StandbyRecoverPreparedTransaction into ProcessTwoPhaseBuffer -> RemoveTwoPhaseFile. I did it in the proposed patch /messages/by-id/cedbe-65e0c000-1-6db17700@133269862 (it is incomplete now).

With best regards,
Vitaly

#4Michael Paquier
michael@paquier.xyz
In reply to: Vitaly Davydov (#3)
Re: An improvement of ProcessTwoPhaseBuffer logic

On Wed, Dec 25, 2024 at 07:18:18PM +0300, Vitaly Davydov wrote:

I tried your patch and it seems the server is started
successfully. But I've found another problem in my synthetic test -
it can not remove the file with the following message:

LOG: database system was not properly shut down; automatic recovery in progress
WARNING: removing future two-phase state file for transaction 1045224
WARNING: could not remove file "pg_twophase/FFFFFFFF000FF2E8": No such file or directory
LOG: redo starts at 0/1762978

That works properly up to v16~. I have slept over this problem and I
think that we'd better backpatch the fix for the first problem. Now,
I also want to add a test to provide coverage. As you say, that's
simple:
- Create an empty file with a future XID name
- Restart the server.
- Scan the logs for the WARNING where the future file is removed.

The fill will never be removed automatically.
I guess, it is because we incorrectly calculate the two-phase file
name using TwoPhaseFilePath in RemoveTwoPhaseFile in this
scenario. It can be fixed if to pass file path directly from
RecoverPreparedTransactions or StandbyRecoverPreparedTransaction
into ProcessTwoPhaseBuffer -> RemoveTwoPhaseFile.

Yeah, I've noticed that as well. We are dealing with a second bug
that affects 17~ caused by the switch to FullTransactionIds for 2PC
file names, which had been introduced in 5a1dfde8334b (pretty sure
that it is the culprit, did not bisect). Attempting to remove a
future file triggers an assertion failure:
#3 0x00007fed046324f0 in __GI_abort () at ./stdlib/abort.c:79
#4 0x0000563e68a9bd07 in ExceptionalCondition
(conditionName=0x563e68c66fc3 "epoch > 0", fileName=0x563e68c66b0b
"twophase.c", lineNumber=953) at assert.c:66
#5 0x0000563e67683060 in AdjustToFullTransactionId (xid=4095) at
twophase.c:953
#6 0x0000563e67683092 in TwoPhaseFilePath (path=0x7ffcb515b4a0
"h\307!\227>V", xid=4095) at twophase.c:963
#7 0x0000563e67686603 in RemoveTwoPhaseFile (xid=4095,
giveWarning=true) at twophase.c:1728
#8 0x0000563e67688989 in ProcessTwoPhaseBuffer (xid=4095,
prepare_start_lsn=0, fromdisk=true, setParent=false, setNextXid=false)
at twophase.c:2219
#9 0x0000563e676872a6 in restoreTwoPhaseData () at twophase.c:1924
#10 0x0000563e676b5c8b in StartupXLOG () at xlog.c:5642

So this means that we are dealing with two different bugs, and that we
need to fix the assertion failure of the second problem first down to
17, then fix the first problem on all stable branches with a test to
cover both the first and second problems.

I did it in the proposed patch
/messages/by-id/cedbe-65e0c000-1-6db17700@133269862
(it is incomplete now).

I suspect that this can be still dangerous as-is while complicating
the code with more possible paths for the removal of the 2PC files,
because we may still build a file name from an XID, and that's what's
causing this issue..
--
Michael

#5Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#4)
3 attachment(s)
Re: An improvement of ProcessTwoPhaseBuffer logic

On Thu, Dec 26, 2024 at 07:42:34AM +0900, Michael Paquier wrote:

I suspect that this can be still dangerous as-is while complicating
the code with more possible paths for the removal of the 2PC files,
because we may still build a file name from an XID, and that's what's
causing this issue..

Here are two patches to address both issues:
- 0001 for the epoch calculation, down to 17, which would take care of
the underflow problem when having a 2PC file that has an XID in the
future at epoch 0. It is true that it could be smarter when dealing
with files from other epochs, and that this is a problem in v17~. I
think that this should integrate better with FullTransactionIds moving
forward rather than pass the file names. That would be something
quite invasive, only for HEAD. At least that's my impression.
- 0002, to take care of the future file issue, down to 13. This
includes a TAP test to demonstrate the problem. The test needs a
tweak for the 2PC file name in v17~.
--
Michael

Attachments:

v1-0001-Fix-failure-with-incorrect-epoch-for-future-2PC-f.patchtext/x-diff; charset=us-asciiDownload
From d45d0c95a0ddee51563b5e94ea046ce4c5cf1baf Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Thu, 26 Dec 2024 14:06:01 +0900
Subject: [PATCH v1 1/2] Fix failure with incorrect epoch for future 2PC files
 at recovery

A two-phase file in the future would be able to trigger an assertion
failure in AdjustToFullTransactionId() as the epoch counter would
underflow if the checkpoint record uses an epoch of 0.

In non-assert builds, this would create a WARNING message referring to a
2PC file with an epoch of "FFFFFFFF" (or UINT32_MAX), as an effect of
the underflow calculation.

Issue introduced by 5a1dfde8334b.

Reported-by: Vitaly Davydov
Discussion: https://postgr.es/m/13b5b6-676c3080-4d-531db900@47931709
Backpatch-through: 17
---
 src/backend/access/transam/twophase.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 49be1df91c..d2649264f9 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -947,10 +947,9 @@ AdjustToFullTransactionId(TransactionId xid)
 
 	nextXid = XidFromFullTransactionId(nextFullXid);
 	epoch = EpochFromFullTransactionId(nextFullXid);
-	if (unlikely(xid > nextXid))
+	if (unlikely(xid > nextXid) && epoch > 0)
 	{
 		/* Wraparound occurred, must be from a prev epoch. */
-		Assert(epoch > 0);
 		epoch--;
 	}
 
-- 
2.45.2

v1-0002-Fix-handling-of-orphaned-2PC-files-in-the-future-.patchtext/x-diff; charset=us-asciiDownload
From 31ebbadbca629abf8b7cf982d573ad2c838bf27f Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Thu, 26 Dec 2024 14:15:51 +0900
Subject: [PATCH v1 2/2] Fix handling of orphaned 2PC files in the future at
 recovery

Issue introduced by 728bd991c3c4, that has added support for 2PC files
in shared memory at recovery.

Before this feature was introduced, files in the future of the
transaction ID horizon were checked first, followed by a check if a
transaction ID is aborted or committed which could involve a pg_xact
lookup.  After this commit, these checks were done in a reversed order,
but files in the future do not have a state that can be checked in
pg_xact, hence this caused recovery to fail abruptly should an orphaned
2PC file in the future of the transaction horizon exist in pg_twophase
at the beginning of recovery.

A test is added to check this scenario, using an empty 2PC with a
transaction ID large enough to be in the future when running the test.

Author: Vitaly Davydov, Michael Paquier
Discussion: https://postgr.es/m/11e597-676ab680-8d-374f23c0@145466129
Backpatch-through: 13
---
 src/backend/access/transam/twophase.c | 40 ++++++++++-----------
 src/test/recovery/t/009_twophase.pl   | 51 ++++++++++-----------------
 2 files changed, 38 insertions(+), 53 deletions(-)

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index d2649264f9..4b5a3146f1 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -2205,26 +2205,6 @@ ProcessTwoPhaseBuffer(TransactionId xid,
 	if (!fromdisk)
 		Assert(prepare_start_lsn != InvalidXLogRecPtr);
 
-	/* Already processed? */
-	if (TransactionIdDidCommit(xid) || TransactionIdDidAbort(xid))
-	{
-		if (fromdisk)
-		{
-			ereport(WARNING,
-					(errmsg("removing stale two-phase state file for transaction %u",
-							xid)));
-			RemoveTwoPhaseFile(xid, true);
-		}
-		else
-		{
-			ereport(WARNING,
-					(errmsg("removing stale two-phase state from memory for transaction %u",
-							xid)));
-			PrepareRedoRemove(xid, true);
-		}
-		return NULL;
-	}
-
 	/* Reject XID if too new */
 	if (TransactionIdFollowsOrEquals(xid, origNextXid))
 	{
@@ -2245,6 +2225,26 @@ ProcessTwoPhaseBuffer(TransactionId xid,
 		return NULL;
 	}
 
+	/* Already processed? */
+	if (TransactionIdDidCommit(xid) || TransactionIdDidAbort(xid))
+	{
+		if (fromdisk)
+		{
+			ereport(WARNING,
+					(errmsg("removing stale two-phase state file for transaction %u",
+							xid)));
+			RemoveTwoPhaseFile(xid, true);
+		}
+		else
+		{
+			ereport(WARNING,
+					(errmsg("removing stale two-phase state from memory for transaction %u",
+							xid)));
+			PrepareRedoRemove(xid, true);
+		}
+		return NULL;
+	}
+
 	if (fromdisk)
 	{
 		/* Read and validate file */
diff --git a/src/test/recovery/t/009_twophase.pl b/src/test/recovery/t/009_twophase.pl
index 4b3e0f77dc..854910ea68 100644
--- a/src/test/recovery/t/009_twophase.pl
+++ b/src/test/recovery/t/009_twophase.pl
@@ -534,42 +534,27 @@ is( $psql_out,
 	qq{27|issued to paris},
 	"Check expected t_009_tbl2 data on standby");
 
+###############################################################################
+# Check handling of orphaned 2PC files at recovery.
+###############################################################################
 
-# Exercise the 2PC recovery code in StartupSUBTRANS, which is concerned with
-# ensuring that enough pg_subtrans pages exist on disk to cover the range of
-# prepared transactions at server start time.  There's not much we can verify
-# directly, but let's at least get the code to run.
-$cur_standby->stop();
-configure_and_reload($cur_primary, "synchronous_standby_names = ''");
+$cur_primary->teardown_node;
 
-$cur_primary->safe_psql('postgres', "CHECKPOINT");
+# Grab location in logs of primary
+my $log_offset = -s $cur_primary->logfile;
+
+# Create a fake file with a transaction ID large enough to be in the future,
+# then check that the primary is able to start and remove this file at
+# recovery.
+
+my $future_2pc_file = $cur_primary->data_dir . '/pg_twophase/0000000000FFFFFF';
+append_to_file $future_2pc_file, "";
 
-my $start_lsn =
-  $cur_primary->safe_psql('postgres', 'select pg_current_wal_insert_lsn()');
-$cur_primary->safe_psql('postgres',
-	"CREATE TABLE test(); BEGIN; CREATE TABLE test1(); PREPARE TRANSACTION 'foo';"
-);
-my $osubtrans = $cur_primary->safe_psql('postgres',
-	"select 'pg_subtrans/'||f, s.size from pg_ls_dir('pg_subtrans') f, pg_stat_file('pg_subtrans/'||f) s"
-);
-$cur_primary->pgbench(
-	'--no-vacuum --client=5 --transactions=1000',
-	0,
-	[],
-	[],
-	'pgbench run to cause pg_subtrans traffic',
-	{
-		'009_twophase.pgb' => 'insert into test default values'
-	});
-# StartupSUBTRANS is exercised with a wide range of visible XIDs in this
-# stop/start sequence, because we left a prepared transaction open above.
-# Also, setting subtransaction_buffers to 32 above causes to switch SLRU
-# bank, for additional code coverage.
-$cur_primary->stop;
 $cur_primary->start;
-my $nsubtrans = $cur_primary->safe_psql('postgres',
-	"select 'pg_subtrans/'||f, s.size from pg_ls_dir('pg_subtrans') f, pg_stat_file('pg_subtrans/'||f) s"
-);
-isnt($osubtrans, $nsubtrans, "contents of pg_subtrans/ have changed");
+$cur_primary->log_check(
+	"future two-phase file removed at recovery",
+	$log_offset,
+	log_like =>
+	  [qr/removing future two-phase state file for transaction 16777215/]);
 
 done_testing();
-- 
2.45.2

v1-0002-Fix-handling-of-orphaned-2PC-files-in-the-fut-v16.patchtext/x-diff; charset=us-asciiDownload
From 854c87ae2ec2a44457143f24bc457158d2dac3ac Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Thu, 26 Dec 2024 13:26:46 +0900
Subject: [PATCH v1] Fix handling of orphaned 2PC files in the future at
 recovery

Issue introduced by 728bd991c3c4, that has added support for 2PC files
in shared memory at recovery.

Before this feature was introduced, files in the future of the
transaction ID horizon were checked first, followed by a check if a
transaction ID is aborted or committed which could involve a pg_xact
lookup.  After this commit, these checks were done in a reversed order,
but files in the future do not have a state that can be checked in
pg_xact, hence this caused recovery to fail abruptly should an orphaned
2PC file in the future of the transaction horizon exist in pg_twophase
at the beginning of recovery.

A test is added to check this scenario, using an empty 2PC with a
transaction ID large enough to be in the future when running the test.

Author: Vitaly Davydov, Michael Paquier
Discussion: https://postgr.es/m/11e597-676ab680-8d-374f23c0@145466129
Backpatch-through: 13
---
 src/backend/access/transam/twophase.c | 40 +++++++++++++--------------
 src/test/recovery/t/009_twophase.pl   | 23 +++++++++++++++
 2 files changed, 43 insertions(+), 20 deletions(-)

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index bca653e485..5adb98163a 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -2183,26 +2183,6 @@ ProcessTwoPhaseBuffer(TransactionId xid,
 	if (!fromdisk)
 		Assert(prepare_start_lsn != InvalidXLogRecPtr);
 
-	/* Already processed? */
-	if (TransactionIdDidCommit(xid) || TransactionIdDidAbort(xid))
-	{
-		if (fromdisk)
-		{
-			ereport(WARNING,
-					(errmsg("removing stale two-phase state file for transaction %u",
-							xid)));
-			RemoveTwoPhaseFile(xid, true);
-		}
-		else
-		{
-			ereport(WARNING,
-					(errmsg("removing stale two-phase state from memory for transaction %u",
-							xid)));
-			PrepareRedoRemove(xid, true);
-		}
-		return NULL;
-	}
-
 	/* Reject XID if too new */
 	if (TransactionIdFollowsOrEquals(xid, origNextXid))
 	{
@@ -2223,6 +2203,26 @@ ProcessTwoPhaseBuffer(TransactionId xid,
 		return NULL;
 	}
 
+	/* Already processed? */
+	if (TransactionIdDidCommit(xid) || TransactionIdDidAbort(xid))
+	{
+		if (fromdisk)
+		{
+			ereport(WARNING,
+					(errmsg("removing stale two-phase state file for transaction %u",
+							xid)));
+			RemoveTwoPhaseFile(xid, true);
+		}
+		else
+		{
+			ereport(WARNING,
+					(errmsg("removing stale two-phase state from memory for transaction %u",
+							xid)));
+			PrepareRedoRemove(xid, true);
+		}
+		return NULL;
+	}
+
 	if (fromdisk)
 	{
 		/* Read and validate file */
diff --git a/src/test/recovery/t/009_twophase.pl b/src/test/recovery/t/009_twophase.pl
index fe7e8e7980..7642e4f0b3 100644
--- a/src/test/recovery/t/009_twophase.pl
+++ b/src/test/recovery/t/009_twophase.pl
@@ -528,4 +528,27 @@ is( $psql_out,
 	qq{27|issued to paris},
 	"Check expected t_009_tbl2 data on standby");
 
+###############################################################################
+# Check handling of orphaned 2PC files at recovery.
+###############################################################################
+
+$cur_primary->teardown_node;
+
+# Grab location in logs of primary
+my $log_offset = -s $cur_primary->logfile;
+
+# Create a fake file with a transaction ID large enough to be in the future,
+# then check that the primary is able to start and remove this file at
+# recovery.
+
+my $future_2pc_file = $cur_primary->data_dir . '/pg_twophase/00FFFFFF';
+append_to_file $future_2pc_file, "";
+
+$cur_primary->start;
+$cur_primary->log_check(
+	"future two-phase file removed at recovery",
+	$log_offset,
+	log_like =>
+	  [qr/removing future two-phase state file for transaction 16777215/]);
+
 done_testing();
-- 
2.45.2

#6Давыдов Виталий
v.davydov@postgrespro.ru
In reply to: Michael Paquier (#5)
Re: An improvement of ProcessTwoPhaseBuffer logic

Hi Michael,

Here are two patches to address both issues:

Thank you for the preparing the patches and test simplification. My bad, I overcompilcated it.

I concerned about twophase file name generation. The function TwoPhaseFilePath() is pretty straitforward and unambiguous in 16 and earlier versions. The logic of this function in 17+ seems to be more complex. I do not understand it clearly. But, I guess, it will work incorrectly after turning to a newer epoch, because the epoch is calculated from TransamVariables->nextXid, but not from problem xid. The same problem may happen if we are in epoch 1 or greater. It will produce a wrong file name, because the epoch will be obtained from the last xid, not from file name xid. In another words, the function AdjustToFullTransactionId assumes that if xid > TransamVariables->nextXid, then the xid from the previous epoch. I may be not the case in our scenario.

I suspect that this can be still dangerous as-is while complicating the code with more possible paths for the removal of the 2PC files

Agree, but we may pass file name into TwoPhaseFilePath if any, instead of creation of two functions as in the patch. The cost - two new if conditions. Using file names is pretty safe. Once we read the file and extract xid from its name, just pass this file name to TwoPhaseFilePath(). If not, try to generate it. Anyway, I do not insist on it, just try to discuss.

With best regards,
Vitaly

#7Michael Paquier
michael@paquier.xyz
In reply to: Давыдов Виталий (#6)
1 attachment(s)
Re: An improvement of ProcessTwoPhaseBuffer logic

On Thu, Dec 26, 2024 at 06:11:25PM +0300, Давыдов Виталий wrote:

I concerned about twophase file name generation. The function
TwoPhaseFilePath() is pretty straitforward and unambiguous in 16 and
earlier versions. The logic of this function in 17+ seems to be more
complex. I do not understand it clearly. But, I guess, it will work
incorrectly after turning to a newer epoch, because the epoch is
calculated from TransamVariables->nextXid, but not from problem
xid. The same problem may happen if we are in epoch 1 or greater. It
will produce a wrong file name, because the epoch will be obtained
from the last xid, not from file name xid. In another words, the
function AdjustToFullTransactionId assumes that if xid >
TransamVariables->nextXid, then the xid from the previous epoch. I
may be not the case in our scenario.

Yeah. At this stage, we can pretty much say that the whole idea of
relying AdjustToFullTransactionId() is broken, because we would build
2PC file names based on wrong assumptions, while orphaned files could
be in the far past or far future depending on the epoch.

TBH, we'll live better if we remove AdjustToFullTransactionId() and
sprinkle a bit more the knowledge of FullTransactionIds to build
correctly the 2PC file path in v17~. I've been playing with this code
for a couple of hours and finished with the attached patch. I have
wondered if ReadTwoPhaseFile() should gain the same knowledge as
TwoPhaseFilePath(), but decided to limit the invasiness because we
always call ReadTwoPhaseFile() once we don't have any orphaned files,
for all the phases of recovery. So while this requires a bit more
logic that depends on FullTransactionIdFromEpochAndXid() and
ReadNextFullTransactionId() to build a FullTransactionId and get the
current epoch, that's still acceptable as we only store an XID in the
2PC file.

So please see the attached. You will note that RemoveTwoPhaseFile(),
ProcessTwoPhaseBuffer() and TwoPhaseFilePath() now require a
FullTransactionId, hence callers need to think about the epoch to use.
That should limit future errors compared to passing the file name as
optional argument.

What do you think?
--
Michael

Attachments:

v2-0001-Fix-failure-with-incorrect-epoch-handling-for-2PC.patchtext/x-diff; charset=us-asciiDownload
From bb250c0cd303f0d8db0fee2d26353bf0e0cb1275 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Fri, 27 Dec 2024 16:34:44 +0900
Subject: [PATCH v2] Fix failure with incorrect epoch handling for 2PC files at
 recovery

At the beginning of recovery, an orphaned two-phase file in an epoch
different than the one defined in the checkpoint record could not be
removed based on the assumption that AdjustToFullTransactionId() relies
on, assuming that all files would be either from the current epoch for
the epoch before that.

If the checkpoint epoch was 0 while the 2PC file was orphaned and in the
future, AdjustToFullTransactionId() would underflow the epoch used to
build the 2PC file path.  In non-assert builds, this would create a
WARNING message referring to a 2PC file with an epoch of "FFFFFFFF" (or
UINT32_MAX), as an effect of the underflow calculation.

Some tests are added with dummy 2PC files in the past and the future,
checking that these are properly removed.

Issue introduced by 5a1dfde8334b.

Reported-by: Vitaly Davydov
Discussion: https://postgr.es/m/13b5b6-676c3080-4d-531db900@47931709
Backpatch-through: 17
---
 src/backend/access/transam/twophase.c | 231 +++++++++++++++++---------
 src/test/recovery/t/009_twophase.pl   |  31 ++++
 2 files changed, 184 insertions(+), 78 deletions(-)

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 49be1df91c..01f5d728be 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -221,13 +221,13 @@ static void ProcessRecords(char *bufptr, TransactionId xid,
 static void RemoveGXact(GlobalTransaction gxact);
 
 static void XlogReadTwoPhaseData(XLogRecPtr lsn, char **buf, int *len);
-static char *ProcessTwoPhaseBuffer(TransactionId xid,
+static char *ProcessTwoPhaseBuffer(FullTransactionId xid,
 								   XLogRecPtr prepare_start_lsn,
 								   bool fromdisk, bool setParent, bool setNextXid);
 static void MarkAsPreparingGuts(GlobalTransaction gxact, TransactionId xid,
 								const char *gid, TimestampTz prepared_at, Oid owner,
 								Oid databaseid);
-static void RemoveTwoPhaseFile(TransactionId xid, bool giveWarning);
+static void RemoveTwoPhaseFile(FullTransactionId fxid, bool giveWarning);
 static void RecreateTwoPhaseFile(TransactionId xid, void *content, int len);
 
 /*
@@ -926,42 +926,9 @@ TwoPhaseGetDummyProc(TransactionId xid, bool lock_held)
 /* State file support													*/
 /************************************************************************/
 
-/*
- * Compute the FullTransactionId for the given TransactionId.
- *
- * The wrap logic is safe here because the span of active xids cannot exceed one
- * epoch at any given time.
- */
-static inline FullTransactionId
-AdjustToFullTransactionId(TransactionId xid)
-{
-	FullTransactionId nextFullXid;
-	TransactionId nextXid;
-	uint32		epoch;
-
-	Assert(TransactionIdIsValid(xid));
-
-	LWLockAcquire(XidGenLock, LW_SHARED);
-	nextFullXid = TransamVariables->nextXid;
-	LWLockRelease(XidGenLock);
-
-	nextXid = XidFromFullTransactionId(nextFullXid);
-	epoch = EpochFromFullTransactionId(nextFullXid);
-	if (unlikely(xid > nextXid))
-	{
-		/* Wraparound occurred, must be from a prev epoch. */
-		Assert(epoch > 0);
-		epoch--;
-	}
-
-	return FullTransactionIdFromEpochAndXid(epoch, xid);
-}
-
 static inline int
-TwoPhaseFilePath(char *path, TransactionId xid)
+TwoPhaseFilePath(char *path, FullTransactionId fxid)
 {
-	FullTransactionId fxid = AdjustToFullTransactionId(xid);
-
 	return snprintf(path, MAXPGPATH, TWOPHASE_DIR "/%08X%08X",
 					EpochFromFullTransactionId(fxid),
 					XidFromFullTransactionId(fxid));
@@ -1297,7 +1264,8 @@ RegisterTwoPhaseRecord(TwoPhaseRmgrId rmid, uint16 info,
  * If it looks OK (has a valid magic number and CRC), return the palloc'd
  * contents of the file, issuing an error when finding corrupted data.  If
  * missing_ok is true, which indicates that missing files can be safely
- * ignored, then return NULL.  This state can be reached when doing recovery.
+ * ignored, then return NULL.  This state can be reached when doing recovery
+ * after discarding twophase files from other epochs.
  */
 static char *
 ReadTwoPhaseFile(TransactionId xid, bool missing_ok)
@@ -1311,8 +1279,16 @@ ReadTwoPhaseFile(TransactionId xid, bool missing_ok)
 	pg_crc32c	calc_crc,
 				file_crc;
 	int			r;
+	FullTransactionId fxid;
+	FullTransactionId nextFullXid;
+	uint32		epoch;
 
-	TwoPhaseFilePath(path, xid);
+	/* get current epoch */
+	nextFullXid = ReadNextFullTransactionId();
+	epoch = EpochFromFullTransactionId(nextFullXid);
+
+	fxid = FullTransactionIdFromEpochAndXid(epoch, xid);
+	TwoPhaseFilePath(path, fxid);
 
 	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
 	if (fd < 0)
@@ -1537,7 +1513,6 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
 	else
 		XlogReadTwoPhaseData(gxact->prepare_start_lsn, &buf, NULL);
 
-
 	/*
 	 * Disassemble the header area
 	 */
@@ -1677,10 +1652,22 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
 	AtEOXact_PgStat(isCommit, false);
 
 	/*
-	 * And now we can clean up any files we may have left.
+	 * And now we can clean up any files we may have left.  These should
+	 * be from the current epoch.
 	 */
 	if (ondisk)
-		RemoveTwoPhaseFile(xid, true);
+	{
+		uint32		epoch;
+		FullTransactionId nextFullXid;
+		FullTransactionId fxid;
+
+		/* get current epoch */
+		nextFullXid = ReadNextFullTransactionId();
+		epoch = EpochFromFullTransactionId(nextFullXid);
+
+		fxid = FullTransactionIdFromEpochAndXid(epoch, xid);
+		RemoveTwoPhaseFile(fxid, true);
+	}
 
 	MyLockedGxact = NULL;
 
@@ -1718,13 +1705,17 @@ ProcessRecords(char *bufptr, TransactionId xid,
  *
  * If giveWarning is false, do not complain about file-not-present;
  * this is an expected case during WAL replay.
+ *
+ * This routine is used at early stages at recovery where future and
+ * past orphaned files are checked, hence the FullTransactionId to build
+ * a complete file name fit for the removal.
  */
 static void
-RemoveTwoPhaseFile(TransactionId xid, bool giveWarning)
+RemoveTwoPhaseFile(FullTransactionId fxid, bool giveWarning)
 {
 	char		path[MAXPGPATH];
 
-	TwoPhaseFilePath(path, xid);
+	TwoPhaseFilePath(path, fxid);
 	if (unlink(path))
 		if (errno != ENOENT || giveWarning)
 			ereport(WARNING,
@@ -1744,13 +1735,21 @@ RecreateTwoPhaseFile(TransactionId xid, void *content, int len)
 	char		path[MAXPGPATH];
 	pg_crc32c	statefile_crc;
 	int			fd;
+	FullTransactionId fxid;
+	FullTransactionId nextFullXid;
+	uint32		epoch;
 
 	/* Recompute CRC */
 	INIT_CRC32C(statefile_crc);
 	COMP_CRC32C(statefile_crc, content, len);
 	FIN_CRC32C(statefile_crc);
 
-	TwoPhaseFilePath(path, xid);
+	/* Use current epoch */
+	nextFullXid = ReadNextFullTransactionId();
+	epoch = EpochFromFullTransactionId(nextFullXid);
+
+	fxid = FullTransactionIdFromEpochAndXid(epoch, xid);
+	TwoPhaseFilePath(path, fxid);
 
 	fd = OpenTransientFile(path,
 						   O_CREAT | O_TRUNC | O_WRONLY | PG_BINARY);
@@ -1898,7 +1897,9 @@ CheckPointTwoPhase(XLogRecPtr redo_horizon)
  * Scan pg_twophase and fill TwoPhaseState depending on the on-disk data.
  * This is called once at the beginning of recovery, saving any extra
  * lookups in the future.  Two-phase files that are newer than the
- * minimum XID horizon are discarded on the way.
+ * minimum XID horizon are discarded on the way.  Two phase files with
+ * an epoch older or newer than the current checkpoint's record epoch
+ * are also discarded.
  */
 void
 restoreTwoPhaseData(void)
@@ -1913,14 +1914,11 @@ restoreTwoPhaseData(void)
 		if (strlen(clde->d_name) == 16 &&
 			strspn(clde->d_name, "0123456789ABCDEF") == 16)
 		{
-			TransactionId xid;
 			FullTransactionId fxid;
 			char	   *buf;
 
 			fxid = FullTransactionIdFromU64(strtou64(clde->d_name, NULL, 16));
-			xid = XidFromFullTransactionId(fxid);
-
-			buf = ProcessTwoPhaseBuffer(xid, InvalidXLogRecPtr,
+			buf = ProcessTwoPhaseBuffer(fxid, InvalidXLogRecPtr,
 										true, false, false);
 			if (buf == NULL)
 				continue;
@@ -1971,6 +1969,7 @@ PrescanPreparedTransactions(TransactionId **xids_p, int *nxids_p)
 	TransactionId origNextXid = XidFromFullTransactionId(nextXid);
 	TransactionId result = origNextXid;
 	TransactionId *xids = NULL;
+	uint32		epoch = EpochFromFullTransactionId(nextXid);
 	int			nxids = 0;
 	int			allocsize = 0;
 	int			i;
@@ -1979,6 +1978,7 @@ PrescanPreparedTransactions(TransactionId **xids_p, int *nxids_p)
 	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
 	{
 		TransactionId xid;
+		FullTransactionId fxid;
 		char	   *buf;
 		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
 
@@ -1986,7 +1986,12 @@ PrescanPreparedTransactions(TransactionId **xids_p, int *nxids_p)
 
 		xid = gxact->xid;
 
-		buf = ProcessTwoPhaseBuffer(xid,
+		/*
+		 * All two-phase files with past and future epoch in pg_twophase are
+		 * gone at this point, so we're OK to rely on only the current epoch.
+		 */
+		fxid = FullTransactionIdFromEpochAndXid(epoch, xid);
+		buf = ProcessTwoPhaseBuffer(fxid,
 									gxact->prepare_start_lsn,
 									gxact->ondisk, false, true);
 
@@ -2048,11 +2053,18 @@ void
 StandbyRecoverPreparedTransactions(void)
 {
 	int			i;
+	uint32		epoch;
+	FullTransactionId nextFullXid;
+
+	/* get current epoch */
+	nextFullXid = ReadNextFullTransactionId();
+	epoch = EpochFromFullTransactionId(nextFullXid);
 
 	LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
 	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
 	{
 		TransactionId xid;
+		FullTransactionId fxid;
 		char	   *buf;
 		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
 
@@ -2060,7 +2072,12 @@ StandbyRecoverPreparedTransactions(void)
 
 		xid = gxact->xid;
 
-		buf = ProcessTwoPhaseBuffer(xid,
+		/*
+		 * At this stage, we're OK to work with the current epoch
+		 * as all past and future files have been already discarded.
+		 */
+		fxid = FullTransactionIdFromEpochAndXid(epoch, xid);
+		buf = ProcessTwoPhaseBuffer(fxid,
 									gxact->prepare_start_lsn,
 									gxact->ondisk, true, false);
 		if (buf != NULL)
@@ -2089,11 +2106,18 @@ void
 RecoverPreparedTransactions(void)
 {
 	int			i;
+	uint32		epoch;
+	FullTransactionId nextFullXid;
+
+	/* get current epoch */
+	nextFullXid = ReadNextFullTransactionId();
+	epoch = EpochFromFullTransactionId(nextFullXid);
 
 	LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
 	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
 	{
 		TransactionId xid;
+		FullTransactionId fxid;
 		char	   *buf;
 		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
 		char	   *bufptr;
@@ -2101,6 +2125,10 @@ RecoverPreparedTransactions(void)
 		TransactionId *subxids;
 		const char *gid;
 
+		/*
+		 * At this stage, we're OK to work with the current epoch
+		 * as all past and future files have been already discarded.
+		 */
 		xid = gxact->xid;
 
 		/*
@@ -2112,7 +2140,8 @@ RecoverPreparedTransactions(void)
 		 * SubTransSetParent has been set before, if the prepared transaction
 		 * generated xid assignment records.
 		 */
-		buf = ProcessTwoPhaseBuffer(xid,
+		fxid = FullTransactionIdFromEpochAndXid(epoch, xid);
+		buf = ProcessTwoPhaseBuffer(fxid,
 									gxact->prepare_start_lsn,
 									gxact->ondisk, true, false);
 		if (buf == NULL)
@@ -2180,7 +2209,7 @@ RecoverPreparedTransactions(void)
 /*
  * ProcessTwoPhaseBuffer
  *
- * Given a transaction id, read it either from disk or read it directly
+ * Given a FullTransactionId, read it either from disk or read it directly
  * via shmem xlog record pointer using the provided "prepare_start_lsn".
  *
  * If setParent is true, set up subtransaction parent linkages.
@@ -2189,23 +2218,66 @@ RecoverPreparedTransactions(void)
  * value scanned.
  */
 static char *
-ProcessTwoPhaseBuffer(TransactionId xid,
+ProcessTwoPhaseBuffer(FullTransactionId fxid,
 					  XLogRecPtr prepare_start_lsn,
 					  bool fromdisk,
 					  bool setParent, bool setNextXid)
 {
 	FullTransactionId nextXid = TransamVariables->nextXid;
-	TransactionId origNextXid = XidFromFullTransactionId(nextXid);
 	TransactionId *subxids;
 	char	   *buf;
 	TwoPhaseFileHeader *hdr;
 	int			i;
+	TransactionId xid = XidFromFullTransactionId(fxid);
 
 	Assert(LWLockHeldByMeInMode(TwoPhaseStateLock, LW_EXCLUSIVE));
 
 	if (!fromdisk)
 		Assert(prepare_start_lsn != InvalidXLogRecPtr);
 
+	/*
+	 * Reject full XID if too new.  Note that this discards files from future
+	 * epochs.
+	 */
+	if (FullTransactionIdFollowsOrEquals(fxid, nextXid))
+	{
+		if (fromdisk)
+		{
+			ereport(WARNING,
+					(errmsg("removing future two-phase state file of epoch %u for transaction %u",
+							EpochFromFullTransactionId(fxid), xid)));
+			RemoveTwoPhaseFile(fxid, true);
+		}
+		else
+		{
+			ereport(WARNING,
+					(errmsg("removing future two-phase state from memory for transaction %u",
+							xid)));
+			PrepareRedoRemove(xid, true);
+		}
+		return NULL;
+	}
+
+	/* Discard files from past epochs */
+	if (EpochFromFullTransactionId(fxid) < EpochFromFullTransactionId(nextXid))
+	{
+		if (fromdisk)
+		{
+			ereport(WARNING,
+					(errmsg("removing past two-phase state file of epoch %u for transaction %u",
+							EpochFromFullTransactionId(fxid), xid)));
+			RemoveTwoPhaseFile(fxid, true);
+		}
+		else
+		{
+			ereport(WARNING,
+					(errmsg("removing past two-phase state from memory for transaction %u",
+							xid)));
+			PrepareRedoRemove(xid, true);
+		}
+		return NULL;
+	}
+
 	/* Already processed? */
 	if (TransactionIdDidCommit(xid) || TransactionIdDidAbort(xid))
 	{
@@ -2214,7 +2286,8 @@ ProcessTwoPhaseBuffer(TransactionId xid,
 			ereport(WARNING,
 					(errmsg("removing stale two-phase state file for transaction %u",
 							xid)));
-			RemoveTwoPhaseFile(xid, true);
+
+			RemoveTwoPhaseFile(fxid, true);
 		}
 		else
 		{
@@ -2226,26 +2299,6 @@ ProcessTwoPhaseBuffer(TransactionId xid,
 		return NULL;
 	}
 
-	/* Reject XID if too new */
-	if (TransactionIdFollowsOrEquals(xid, origNextXid))
-	{
-		if (fromdisk)
-		{
-			ereport(WARNING,
-					(errmsg("removing future two-phase state file for transaction %u",
-							xid)));
-			RemoveTwoPhaseFile(xid, true);
-		}
-		else
-		{
-			ereport(WARNING,
-					(errmsg("removing future two-phase state from memory for transaction %u",
-							xid)));
-			PrepareRedoRemove(xid, true);
-		}
-		return NULL;
-	}
-
 	if (fromdisk)
 	{
 		/* Read and validate file */
@@ -2520,8 +2573,16 @@ PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 	if (!XLogRecPtrIsInvalid(start_lsn))
 	{
 		char		path[MAXPGPATH];
+		uint32		epoch;
+		FullTransactionId fxid;
+		FullTransactionId nextFullXid;
 
-		TwoPhaseFilePath(path, hdr->xid);
+		/* Use current epoch */
+		nextFullXid = ReadNextFullTransactionId();
+		epoch = EpochFromFullTransactionId(nextFullXid);
+
+		fxid = FullTransactionIdFromEpochAndXid(epoch, hdr->xid);
+		TwoPhaseFilePath(path, fxid);
 
 		if (access(path, F_OK) == 0)
 		{
@@ -2616,7 +2677,21 @@ PrepareRedoRemove(TransactionId xid, bool giveWarning)
 	 */
 	elog(DEBUG2, "removing 2PC data for transaction %u", xid);
 	if (gxact->ondisk)
-		RemoveTwoPhaseFile(xid, giveWarning);
+	{
+		uint32		epoch;
+		FullTransactionId nextFullXid;
+		FullTransactionId fxid;
+
+		/*
+		 * We should deal with a file at the current epoch here, so
+		 * grab it.
+		 */
+		nextFullXid = ReadNextFullTransactionId();
+		epoch = EpochFromFullTransactionId(nextFullXid);
+		fxid = FullTransactionIdFromEpochAndXid(epoch, xid);
+
+		RemoveTwoPhaseFile(fxid, giveWarning);
+	}
 	RemoveGXact(gxact);
 }
 
diff --git a/src/test/recovery/t/009_twophase.pl b/src/test/recovery/t/009_twophase.pl
index 4b3e0f77dc..2463b7ea30 100644
--- a/src/test/recovery/t/009_twophase.pl
+++ b/src/test/recovery/t/009_twophase.pl
@@ -572,4 +572,35 @@ my $nsubtrans = $cur_primary->safe_psql('postgres',
 );
 isnt($osubtrans, $nsubtrans, "contents of pg_subtrans/ have changed");
 
+###############################################################################
+# Check handling of orphaned 2PC files at recovery.
+###############################################################################
+
+$cur_standby->teardown_node;
+$cur_primary->teardown_node;
+
+# Grab location in logs of primary
+my $log_offset = -s $cur_primary->logfile;
+
+# Create fake files with a transaction ID large or low enough to be in the
+# future or the past, in different epochs, then check that the primary is able
+# to start and remove these files at recovery.
+
+# First bump the epoch with pg_resetwal.
+$cur_primary->command_ok([ 'pg_resetwal', '-e', 256, '-f', $cur_primary->data_dir ],
+						 'bump epoch of primary');
+
+my $future_2pc_file = $cur_primary->data_dir . '/pg_twophase/000001FF00000FFF';
+append_to_file $future_2pc_file, "";
+my $past_2pc_file = $cur_primary->data_dir . '/pg_twophase/000000EE00000FFF';
+append_to_file $past_2pc_file, "";
+
+$cur_primary->start;
+$cur_primary->log_check(
+	"two-phase files removed at recovery",
+	$log_offset,
+	log_like =>
+	[qr/removing past two-phase state file of epoch 238 for transaction 4095/,
+	qr/removing future two-phase state file of epoch 511 for transaction 4095/]);
+
 done_testing();
-- 
2.45.2

#8Vitaly Davydov
v.davydov@postgrespro.ru
In reply to: Michael Paquier (#7)
1 attachment(s)
Re: An improvement of ProcessTwoPhaseBuffer logic

On Friday, December 27, 2024 10:37 MSK, Michael Paquier <michael@paquier.xyz> wrote:

So please see the attached. You will note that RemoveTwoPhaseFile(),
ProcessTwoPhaseBuffer() and TwoPhaseFilePath() now require a
FullTransactionId, hence callers need to think about the epoch to use.
That should limit future errors compared to passing the file name as
optional argument.

In general, I like your solution to use FullTransactionId. I haven't found any evident problems. I just would like to propose to create a couple of new functions like RemoveTwoPhaseFileInCurrentEpoch, TwoPhaseFilePathInCurrentEpoch which accept TransactionId instead FullTransactionId. It may make the code clearer and result into less boilerplate code. I tried to do some hand testing. It seems, the problem is gone with the patch. Thank you!

As an idea, I would like to propose to store FullTransactionId in global transaction state instead of TransactionId. I'm not sure, it will consume significant additional memory, but it make the code more clear and probably result into less number of locks.

With best regards,
Vitaly

Attachments:

0001-Some-cosmetic-fixes.patchtext/x-patchDownload
From 1e74e01a2485e2b3a59f880a5eaf86af1af81cc5 Mon Sep 17 00:00:00 2001
From: Vitaly Davydov <v.davydov@postgrespro.ru>
Date: Fri, 27 Dec 2024 18:03:28 +0300
Subject: [PATCH] Some cosmetic fixes

---
 src/backend/access/transam/twophase.c | 92 +++++++++++----------------
 1 file changed, 37 insertions(+), 55 deletions(-)

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 01f5d728be..9418fd74ee 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -228,6 +228,7 @@ static void MarkAsPreparingGuts(GlobalTransaction gxact, TransactionId xid,
 								const char *gid, TimestampTz prepared_at, Oid owner,
 								Oid databaseid);
 static void RemoveTwoPhaseFile(FullTransactionId fxid, bool giveWarning);
+static void RemoveTwoPhaseFileInCurrentEpoch(TransactionId xid, bool giveWarning);
 static void RecreateTwoPhaseFile(TransactionId xid, void *content, int len);
 
 /*
@@ -934,6 +935,22 @@ TwoPhaseFilePath(char *path, FullTransactionId fxid)
 					XidFromFullTransactionId(fxid));
 }
 
+static inline int
+TwoPhaseFilePathInCurrentEpoch(char *path, TransactionId xid)
+{
+	uint32		epoch;
+	FullTransactionId fxid;
+	FullTransactionId nextFullXid;
+
+	/* Use current epoch */
+	nextFullXid = ReadNextFullTransactionId();
+	epoch = EpochFromFullTransactionId(nextFullXid);
+
+	fxid = FullTransactionIdFromEpochAndXid(epoch, xid);
+
+	return TwoPhaseFilePath(path, fxid);
+}
+
 /*
  * 2PC state file format:
  *
@@ -1279,16 +1296,8 @@ ReadTwoPhaseFile(TransactionId xid, bool missing_ok)
 	pg_crc32c	calc_crc,
 				file_crc;
 	int			r;
-	FullTransactionId fxid;
-	FullTransactionId nextFullXid;
-	uint32		epoch;
-
-	/* get current epoch */
-	nextFullXid = ReadNextFullTransactionId();
-	epoch = EpochFromFullTransactionId(nextFullXid);
 
-	fxid = FullTransactionIdFromEpochAndXid(epoch, xid);
-	TwoPhaseFilePath(path, fxid);
+	TwoPhaseFilePathInCurrentEpoch(path, xid);
 
 	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
 	if (fd < 0)
@@ -1656,18 +1665,7 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
 	 * be from the current epoch.
 	 */
 	if (ondisk)
-	{
-		uint32		epoch;
-		FullTransactionId nextFullXid;
-		FullTransactionId fxid;
-
-		/* get current epoch */
-		nextFullXid = ReadNextFullTransactionId();
-		epoch = EpochFromFullTransactionId(nextFullXid);
-
-		fxid = FullTransactionIdFromEpochAndXid(epoch, xid);
-		RemoveTwoPhaseFile(fxid, true);
-	}
+		RemoveTwoPhaseFileInCurrentEpoch(xid, true);
 
 	MyLockedGxact = NULL;
 
@@ -1723,6 +1721,21 @@ RemoveTwoPhaseFile(FullTransactionId fxid, bool giveWarning)
 					 errmsg("could not remove file \"%s\": %m", path)));
 }
 
+static void
+RemoveTwoPhaseFileInCurrentEpoch(TransactionId xid, bool giveWarning)
+{
+	uint32		epoch;
+	FullTransactionId nextFullXid;
+	FullTransactionId fxid;
+
+	/* get current epoch */
+	nextFullXid = ReadNextFullTransactionId();
+	epoch = EpochFromFullTransactionId(nextFullXid);
+
+	fxid = FullTransactionIdFromEpochAndXid(epoch, xid);
+	RemoveTwoPhaseFile(fxid, true);
+}
+
 /*
  * Recreates a state file. This is used in WAL replay and during
  * checkpoint creation.
@@ -1735,21 +1748,13 @@ RecreateTwoPhaseFile(TransactionId xid, void *content, int len)
 	char		path[MAXPGPATH];
 	pg_crc32c	statefile_crc;
 	int			fd;
-	FullTransactionId fxid;
-	FullTransactionId nextFullXid;
-	uint32		epoch;
 
 	/* Recompute CRC */
 	INIT_CRC32C(statefile_crc);
 	COMP_CRC32C(statefile_crc, content, len);
 	FIN_CRC32C(statefile_crc);
 
-	/* Use current epoch */
-	nextFullXid = ReadNextFullTransactionId();
-	epoch = EpochFromFullTransactionId(nextFullXid);
-
-	fxid = FullTransactionIdFromEpochAndXid(epoch, xid);
-	TwoPhaseFilePath(path, fxid);
+	TwoPhaseFilePathInCurrentEpoch(path, xid);
 
 	fd = OpenTransientFile(path,
 						   O_CREAT | O_TRUNC | O_WRONLY | PG_BINARY);
@@ -2286,7 +2291,6 @@ ProcessTwoPhaseBuffer(FullTransactionId fxid,
 			ereport(WARNING,
 					(errmsg("removing stale two-phase state file for transaction %u",
 							xid)));
-
 			RemoveTwoPhaseFile(fxid, true);
 		}
 		else
@@ -2573,16 +2577,8 @@ PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 	if (!XLogRecPtrIsInvalid(start_lsn))
 	{
 		char		path[MAXPGPATH];
-		uint32		epoch;
-		FullTransactionId fxid;
-		FullTransactionId nextFullXid;
 
-		/* Use current epoch */
-		nextFullXid = ReadNextFullTransactionId();
-		epoch = EpochFromFullTransactionId(nextFullXid);
-
-		fxid = FullTransactionIdFromEpochAndXid(epoch, hdr->xid);
-		TwoPhaseFilePath(path, fxid);
+		TwoPhaseFilePathInCurrentEpoch(path, hdr->xid);
 
 		if (access(path, F_OK) == 0)
 		{
@@ -2677,21 +2673,7 @@ PrepareRedoRemove(TransactionId xid, bool giveWarning)
 	 */
 	elog(DEBUG2, "removing 2PC data for transaction %u", xid);
 	if (gxact->ondisk)
-	{
-		uint32		epoch;
-		FullTransactionId nextFullXid;
-		FullTransactionId fxid;
-
-		/*
-		 * We should deal with a file at the current epoch here, so
-		 * grab it.
-		 */
-		nextFullXid = ReadNextFullTransactionId();
-		epoch = EpochFromFullTransactionId(nextFullXid);
-		fxid = FullTransactionIdFromEpochAndXid(epoch, xid);
-
-		RemoveTwoPhaseFile(fxid, giveWarning);
-	}
+		RemoveTwoPhaseFileInCurrentEpoch(xid, giveWarning);
 	RemoveGXact(gxact);
 }
 
-- 
2.34.1

#9Michael Paquier
michael@paquier.xyz
In reply to: Vitaly Davydov (#8)
Re: An improvement of ProcessTwoPhaseBuffer logic

On Fri, Dec 27, 2024 at 06:16:24PM +0300, Vitaly Davydov wrote:

In general, I like your solution to use FullTransactionId. I haven't
found any evident problems. I just would like to propose to create a
couple of new functions like RemoveTwoPhaseFileInCurrentEpoch,
TwoPhaseFilePathInCurrentEpoch which accept TransactionId instead
FullTransactionId. It may make the code clearer and result into less
boilerplate code. I tried to do some hand testing. It seems, the
problem is gone with the patch.

Thanks for the review. Agreed that it would be a good thing to limit
the number of paths calling ReadNextFullTransactionId(), but I did not
like much the suggestion TwoPhaseFilePathInCurrentEpoch(), feeling
that it was more important to keep a single code path in charge of
building the file names. Instead, I have gone with a new
FullTransactionIdFromCurrentEpoch() that replaces
AdjustToFullTransactionId(). It cleans up most of the calls to
ReadNextFullTransactionId() compared to the previous patch. It is
true that these couple with calls to ProcessTwoPhaseBuffer(), but the
result felt OK this way in the scope of this fix.

As an idea, I would like to propose to store FullTransactionId in
global transaction state instead of TransactionId. I'm not sure, it
will consume significant additional memory, but it make the code
more clear and probably result into less number of locks.

FWIW, I was wondering about doing the same thing. However, I have
concluded that this some refactoring work out of the scope of fixing
the primary issue we have here, as we are hit by the way the file
names are built when we attempt to remove them.

Note that the memory footprint of storing a FullTransactionId in
twophase.c's GlobalTransactionData does not worry me much. It is more
important to not increase the size of the two-phase state data,
impacting files on disk and their WAL records. This size argument
has been mentioned on the thread that has added epochs to the 2PC file
names, as far as I recall.

At the end, I have applied two patches, the first one down to 13 that
took care of the "future" issue, with tests added in the v14~v16
range. v13 was lacking a perl routine, and it's not a big deal to not
have coverage for this code path anyway. The second patch has been
applied down to v17, to fix the epoch issue, with the more advanced
tests.

If you have a suggestion of patch to plug in a FullTransactionId into
GlobalTransactionData rather than a TransactionId, feel free to
propose something! Help is always welcome, and this would be
HEAD-only work, making it less urgent to deal with.
--
Michael

#10Давыдов Виталий
v.davydov@postgrespro.ru
In reply to: Michael Paquier (#9)
Re: An improvement of ProcessTwoPhaseBuffer logic

On Monday, December 30, 2024 04:08 MSK, Michael Paquier <michael@paquier.xyz> wrote:

Instead, I have gone with a new
FullTransactionIdFromCurrentEpoch() that replaces
AdjustToFullTransactionId(). It cleans up most of the calls to
ReadNextFullTransactionId() compared to the previous patch. It is
true that these couple with calls to ProcessTwoPhaseBuffer(), but the
result felt OK this way in the scope of this fix.

Could you please send the latest version of the patch, if any? I haven't found any new patches in the latest email.

Happy New Year!

With best regards,
Vitaly

#11Michael Paquier
michael@paquier.xyz
In reply to: Давыдов Виталий (#10)
Re: An improvement of ProcessTwoPhaseBuffer logic

On Thu, Jan 09, 2025 at 11:21:38AM +0300, Давыдов Виталий wrote:

Could you please send the latest version of the patch, if any? I haven't found any new patches in the latest email.

I've applied fixes for this stuff as of e3584258154f (down to 13) and
7e125b20eed6 (down to 17) with tests for all the supported branches,
meaning that we should be done here.
--
Michael

#12Noah Misch
noah@leadboat.com
In reply to: Michael Paquier (#9)
1 attachment(s)
Re: An improvement of ProcessTwoPhaseBuffer logic

On Mon, Dec 30, 2024 at 10:08:31AM +0900, Michael Paquier wrote:

On Fri, Dec 27, 2024 at 06:16:24PM +0300, Vitaly Davydov wrote:

As an idea, I would like to propose to store FullTransactionId in
global transaction state instead of TransactionId. I'm not sure, it
will consume significant additional memory, but it make the code
more clear and probably result into less number of locks.

FWIW, I was wondering about doing the same thing. However, I have
concluded that this some refactoring work out of the scope of fixing
the primary issue we have here, as we are hit by the way the file
names are built when we attempt to remove them.

Note that the memory footprint of storing a FullTransactionId in
twophase.c's GlobalTransactionData does not worry me much. It is more
important to not increase the size of the two-phase state data,
impacting files on disk and their WAL records. This size argument
has been mentioned on the thread that has added epochs to the 2PC file
names, as far as I recall.

I agree with accepting +4 bytes in GlobalTransactionData.

At the end, I have applied two patches, the first one down to 13 that
took care of the "future" issue, with tests added in the v14~v16
range. v13 was lacking a perl routine, and it's not a big deal to not
have coverage for this code path anyway. The second patch has been
applied down to v17, to fix the epoch issue, with the more advanced
tests.

If you have a suggestion of patch to plug in a FullTransactionId into
GlobalTransactionData rather than a TransactionId, feel free to
propose something! Help is always welcome, and this would be
HEAD-only work, making it less urgent to deal with.

I suspect we should do that and back-patch to v17 before the next releases,
because commit 7e125b2 wrote this function:

/*
* Compute FullTransactionId for the given TransactionId, using the current
* epoch.
*/
static inline FullTransactionId
FullTransactionIdFromCurrentEpoch(TransactionId xid)
{
FullTransactionId fxid;
FullTransactionId nextFullXid;
uint32 epoch;

nextFullXid = ReadNextFullTransactionId();
epoch = EpochFromFullTransactionId(nextFullXid);

fxid = FullTransactionIdFromEpochAndXid(epoch, xid);
return fxid;
}

I think "using the current epoch" is wrong for half of the nextFullXid values
having epoch > 0. For example, nextFullId==2^32 is in epoch 1, but all the
allowable XIDs are in epoch 0. (I mean "allowable" in the sense of
AssertTransactionIdInAllowableRange().) From then until we assign another
2^31 XIDs, epochs 0 and 1 are both expected in XID values. After 2^31 XID
assignments, every allowable XID be in epoch 1. Hence, twophase.c would need
two-epoch logic like we have in widen_snapshot_xid() and XLogRecGetFullXid().
Is that right? (I wrote this in a hurry, so this email may have more than the
standard level of errors.) Before commit 7e125b2, twophase also had that
logic. I didn't work out the user-visible consequences of that logic's new
absence here, but I bet on twophase recovery breakage. Similar problem here
(up to two epochs are acceptable, not just one):

+	/* Discard files from past epochs */
+	if (EpochFromFullTransactionId(fxid) < EpochFromFullTransactionId(nextXid))

I wrote the attached half-baked patch to fix those, but I tend to think it's
better to use FullTransactionId as many places as possible in twophase.c.
(We'd still need to convert XIDs that we read from xl_xact_prepare records,
along the lines of XLogRecGetFullXid().) How do you see it?

Attachments:

FullTransactionIdFromCurrentEpoch-v0.patchtext/plain; charset=us-asciiDownload
Author:     Noah Misch <noah@leadboat.com>
Commit:     Noah Misch <noah@leadboat.com>

    Fix twophase.c XID epoch tracking
    
    Half the time after epoch 0, allowable XIDs span two epochs.  This would
    have no user-visible consequences during epoch 0, but I expect
    (unconfirmed) twophase breakage during other epochs.
    
    FIXME likely rework this in favor of broader fulltransaction use in twophase.c

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index a3190dc..7a16293 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -926,24 +926,6 @@ TwoPhaseGetDummyProc(TransactionId xid, bool lock_held)
 /* State file support													*/
 /************************************************************************/
 
-/*
- * Compute FullTransactionId for the given TransactionId, using the current
- * epoch.
- */
-static inline FullTransactionId
-FullTransactionIdFromCurrentEpoch(TransactionId xid)
-{
-	FullTransactionId fxid;
-	FullTransactionId nextFullXid;
-	uint32		epoch;
-
-	nextFullXid = ReadNextFullTransactionId();
-	epoch = EpochFromFullTransactionId(nextFullXid);
-
-	fxid = FullTransactionIdFromEpochAndXid(epoch, xid);
-	return fxid;
-}
-
 static inline int
 TwoPhaseFilePath(char *path, FullTransactionId fxid)
 {
@@ -1283,7 +1265,7 @@ RegisterTwoPhaseRecord(TwoPhaseRmgrId rmid, uint16 info,
  * contents of the file, issuing an error when finding corrupted data.  If
  * missing_ok is true, which indicates that missing files can be safely
  * ignored, then return NULL.  This state can be reached when doing recovery
- * after discarding two-phase files from other epochs.
+ * after discarding two-phase files from frozen epochs.
  */
 static char *
 ReadTwoPhaseFile(TransactionId xid, bool missing_ok)
@@ -1299,7 +1281,7 @@ ReadTwoPhaseFile(TransactionId xid, bool missing_ok)
 	int			r;
 	FullTransactionId fxid;
 
-	fxid = FullTransactionIdFromCurrentEpoch(xid);
+	fxid = FullTransactionIdFromAllowableAt(ReadNextFullTransactionId(), xid);
 	TwoPhaseFilePath(path, fxid);
 
 	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
@@ -1664,15 +1646,13 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
 	/* Count the prepared xact as committed or aborted */
 	AtEOXact_PgStat(isCommit, false);
 
-	/*
-	 * And now we can clean up any files we may have left.  These should be
-	 * from the current epoch.
-	 */
+	/* And now we can clean up any files we may have left. */
 	if (ondisk)
 	{
 		FullTransactionId fxid;
 
-		fxid = FullTransactionIdFromCurrentEpoch(xid);
+		fxid = FullTransactionIdFromAllowableAt(ReadNextFullTransactionId(),
+												xid);
 		RemoveTwoPhaseFile(fxid, true);
 	}
 
@@ -1749,8 +1729,7 @@ RecreateTwoPhaseFile(TransactionId xid, void *content, int len)
 	COMP_CRC32C(statefile_crc, content, len);
 	FIN_CRC32C(statefile_crc);
 
-	/* Use current epoch */
-	fxid = FullTransactionIdFromCurrentEpoch(xid);
+	fxid = FullTransactionIdFromAllowableAt(ReadNextFullTransactionId(), xid);
 	TwoPhaseFilePath(path, fxid);
 
 	fd = OpenTransientFile(path,
@@ -1900,7 +1879,7 @@ CheckPointTwoPhase(XLogRecPtr redo_horizon)
  * This is called once at the beginning of recovery, saving any extra
  * lookups in the future.  Two-phase files that are newer than the
  * minimum XID horizon are discarded on the way.  Two-phase files with
- * an epoch older or newer than the current checkpoint's record epoch
+ * an epoch frozen relative to the current checkpoint's record epoch
  * are also discarded.
  */
 void
@@ -1971,7 +1950,6 @@ PrescanPreparedTransactions(TransactionId **xids_p, int *nxids_p)
 	TransactionId origNextXid = XidFromFullTransactionId(nextXid);
 	TransactionId result = origNextXid;
 	TransactionId *xids = NULL;
-	uint32		epoch = EpochFromFullTransactionId(nextXid);
 	int			nxids = 0;
 	int			allocsize = 0;
 	int			i;
@@ -1988,11 +1966,7 @@ PrescanPreparedTransactions(TransactionId **xids_p, int *nxids_p)
 
 		xid = gxact->xid;
 
-		/*
-		 * All two-phase files with past and future epoch in pg_twophase are
-		 * gone at this point, so we're OK to rely on only the current epoch.
-		 */
-		fxid = FullTransactionIdFromEpochAndXid(epoch, xid);
+		fxid = FullTransactionIdFromAllowableAt(nextXid, xid);
 		buf = ProcessTwoPhaseBuffer(fxid,
 									gxact->prepare_start_lsn,
 									gxact->ondisk, false, true);
@@ -2055,12 +2029,9 @@ void
 StandbyRecoverPreparedTransactions(void)
 {
 	int			i;
-	uint32		epoch;
 	FullTransactionId nextFullXid;
 
-	/* get current epoch */
 	nextFullXid = ReadNextFullTransactionId();
-	epoch = EpochFromFullTransactionId(nextFullXid);
 
 	LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
 	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
@@ -2074,11 +2045,7 @@ StandbyRecoverPreparedTransactions(void)
 
 		xid = gxact->xid;
 
-		/*
-		 * At this stage, we're OK to work with the current epoch as all past
-		 * and future files have been already discarded.
-		 */
-		fxid = FullTransactionIdFromEpochAndXid(epoch, xid);
+		fxid = FullTransactionIdFromAllowableAt(nextFullXid, xid);
 		buf = ProcessTwoPhaseBuffer(fxid,
 									gxact->prepare_start_lsn,
 									gxact->ondisk, true, false);
@@ -2108,12 +2075,9 @@ void
 RecoverPreparedTransactions(void)
 {
 	int			i;
-	uint32		epoch;
 	FullTransactionId nextFullXid;
 
-	/* get current epoch */
 	nextFullXid = ReadNextFullTransactionId();
-	epoch = EpochFromFullTransactionId(nextFullXid);
 
 	LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
 	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
@@ -2127,10 +2091,6 @@ RecoverPreparedTransactions(void)
 		TransactionId *subxids;
 		const char *gid;
 
-		/*
-		 * At this stage, we're OK to work with the current epoch as all past
-		 * and future files have been already discarded.
-		 */
 		xid = gxact->xid;
 
 		/*
@@ -2142,7 +2102,7 @@ RecoverPreparedTransactions(void)
 		 * SubTransSetParent has been set before, if the prepared transaction
 		 * generated xid assignment records.
 		 */
-		fxid = FullTransactionIdFromEpochAndXid(epoch, xid);
+		fxid = FullTransactionIdFromAllowableAt(nextFullXid, xid);
 		buf = ProcessTwoPhaseBuffer(fxid,
 									gxact->prepare_start_lsn,
 									gxact->ondisk, true, false);
@@ -2260,8 +2220,9 @@ ProcessTwoPhaseBuffer(FullTransactionId fxid,
 		return NULL;
 	}
 
-	/* Discard files from past epochs */
-	if (EpochFromFullTransactionId(fxid) < EpochFromFullTransactionId(nextXid))
+	/* Discard files from frozen epochs */
+	if (EpochFromFullTransactionId(fxid) + 1 <
+		EpochFromFullTransactionId(nextXid))
 	{
 		if (fromdisk)
 		{
@@ -2576,8 +2537,8 @@ PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 		char		path[MAXPGPATH];
 		FullTransactionId fxid;
 
-		/* Use current epoch */
-		fxid = FullTransactionIdFromCurrentEpoch(hdr->xid);
+		fxid = FullTransactionIdFromAllowableAt(ReadNextFullTransactionId(),
+												hdr->xid);
 		TwoPhaseFilePath(path, fxid);
 
 		if (access(path, F_OK) == 0)
@@ -2676,10 +2637,8 @@ PrepareRedoRemove(TransactionId xid, bool giveWarning)
 	{
 		FullTransactionId fxid;
 
-		/*
-		 * We should deal with a file at the current epoch here.
-		 */
-		fxid = FullTransactionIdFromCurrentEpoch(xid);
+		fxid = FullTransactionIdFromAllowableAt(ReadNextFullTransactionId(),
+												xid);
 		RemoveTwoPhaseFile(fxid, giveWarning);
 	}
 	RemoveGXact(gxact);
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 3596af0..91b6a91 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -2166,28 +2166,14 @@ RestoreBlockImage(XLogReaderState *record, uint8 block_id, char *page)
 FullTransactionId
 XLogRecGetFullXid(XLogReaderState *record)
 {
-	TransactionId xid,
-				next_xid;
-	uint32		epoch;
-
 	/*
 	 * This function is only safe during replay, because it depends on the
 	 * replay state.  See AdvanceNextFullTransactionIdPastXid() for more.
 	 */
 	Assert(AmStartupProcess() || !IsUnderPostmaster);
 
-	xid = XLogRecGetXid(record);
-	next_xid = XidFromFullTransactionId(TransamVariables->nextXid);
-	epoch = EpochFromFullTransactionId(TransamVariables->nextXid);
-
-	/*
-	 * If xid is numerically greater than next_xid, it has to be from the last
-	 * epoch.
-	 */
-	if (unlikely(xid > next_xid))
-		--epoch;
-
-	return FullTransactionIdFromEpochAndXid(epoch, xid);
+	return FullTransactionIdFromAllowableAt(TransamVariables->nextXid,
+											XLogRecGetXid(record));
 }
 
 #endif
diff --git a/src/backend/utils/adt/xid8funcs.c b/src/backend/utils/adt/xid8funcs.c
index 4736755..b173956 100644
--- a/src/backend/utils/adt/xid8funcs.c
+++ b/src/backend/utils/adt/xid8funcs.c
@@ -155,35 +155,6 @@ TransactionIdInRecentPast(FullTransactionId fxid, TransactionId *extracted_xid)
 }
 
 /*
- * Convert a TransactionId obtained from a snapshot held by the caller to a
- * FullTransactionId.  Use next_fxid as a reference FullTransactionId, so that
- * we can compute the high order bits.  It must have been obtained by the
- * caller with ReadNextFullTransactionId() after the snapshot was created.
- */
-static FullTransactionId
-widen_snapshot_xid(TransactionId xid, FullTransactionId next_fxid)
-{
-	TransactionId next_xid = XidFromFullTransactionId(next_fxid);
-	uint32		epoch = EpochFromFullTransactionId(next_fxid);
-
-	/* Special transaction ID. */
-	if (!TransactionIdIsNormal(xid))
-		return FullTransactionIdFromEpochAndXid(0, xid);
-
-	/*
-	 * The 64 bit result must be <= next_fxid, since next_fxid hadn't been
-	 * issued yet when the snapshot was created.  Every TransactionId in the
-	 * snapshot must therefore be from the same epoch as next_fxid, or the
-	 * epoch before.  We know this because next_fxid is never allow to get
-	 * more than one epoch ahead of the TransactionIds in any snapshot.
-	 */
-	if (xid > next_xid)
-		epoch--;
-
-	return FullTransactionIdFromEpochAndXid(epoch, xid);
-}
-
-/*
  * txid comparator for qsort/bsearch
  */
 static int
@@ -420,12 +391,18 @@ pg_current_snapshot(PG_FUNCTION_ARGS)
 	nxip = cur->xcnt;
 	snap = palloc(PG_SNAPSHOT_SIZE(nxip));
 
-	/* fill */
-	snap->xmin = widen_snapshot_xid(cur->xmin, next_fxid);
-	snap->xmax = widen_snapshot_xid(cur->xmax, next_fxid);
+	/*
+	 * Fill.  This is the current backend's active snapshot, so MyProc->xmin
+	 * is <= all these XIDs.  As long as that remains so, oldestXid can't
+	 * advance past any of these XIDs.  Hence, these XIDs remain allowable
+	 * relative to next_fxid.
+	 */
+	snap->xmin = FullTransactionIdFromAllowableAt(next_fxid, cur->xmin);
+	snap->xmax = FullTransactionIdFromAllowableAt(next_fxid, cur->xmax);
 	snap->nxip = nxip;
 	for (i = 0; i < nxip; i++)
-		snap->xip[i] = widen_snapshot_xid(cur->xip[i], next_fxid);
+		snap->xip[i] =
+			FullTransactionIdFromAllowableAt(next_fxid, cur->xip[i]);
 
 	/*
 	 * We want them guaranteed to be in ascending order.  This also removes
diff --git a/src/include/access/transam.h b/src/include/access/transam.h
index 0cab865..48fce16 100644
--- a/src/include/access/transam.h
+++ b/src/include/access/transam.h
@@ -77,6 +77,35 @@ FullTransactionIdFromEpochAndXid(uint32 epoch, TransactionId xid)
 	return result;
 }
 
+/*
+ * Compute FullTransactionId for the given TransactionId, assuming xid was
+ * between [oldestXid, nextXid] at the time when TransamVariables->nextXid was
+ * nextFullXid.
+ */
+static inline FullTransactionId
+FullTransactionIdFromAllowableAt(FullTransactionId nextFullXid,
+								 TransactionId xid)
+{
+	uint32		epoch;
+
+	/* Special transaction ID. */
+	if (!TransactionIdIsNormal(xid))
+		return FullTransactionIdFromEpochAndXid(0, xid);
+
+	/*
+	 * The 64 bit result must be <= nextFullXid, since nextFullXid hadn't been
+	 * issued yet when xid was in the past.  The xid must therefore be from
+	 * the epoch of nextFullXid or the epoch before.  We know this because we
+	 * must remove (by freezing) an XID before assigning the XID half an epoch
+	 * ahead of it.
+	 */
+	epoch = EpochFromFullTransactionId(nextFullXid);
+	if (xid > XidFromFullTransactionId(nextFullXid))
+		epoch--;
+
+	return FullTransactionIdFromEpochAndXid(epoch, xid);
+}
+
 static inline FullTransactionId
 FullTransactionIdFromU64(uint64 value)
 {
#13Michael Paquier
michael@paquier.xyz
In reply to: Noah Misch (#12)
2 attachment(s)
Re: An improvement of ProcessTwoPhaseBuffer logic

On Wed, Jan 15, 2025 at 05:00:51PM -0800, Noah Misch wrote:

I agree with accepting +4 bytes in GlobalTransactionData.

Let's just bite the bullet and do that on HEAD and v17, then,
integrating deeper FullTransactionIds into the internals of
twophase.c.

I think "using the current epoch" is wrong for half of the nextFullXid values
having epoch > 0. For example, nextFullId==2^32 is in epoch 1, but all the
allowable XIDs are in epoch 0. (I mean "allowable" in the sense of
AssertTransactionIdInAllowableRange().) From then until we assign another
2^31 XIDs, epochs 0 and 1 are both expected in XID values. After 2^31 XID
assignments, every allowable XID be in epoch 1. Hence, twophase.c would need
two-epoch logic like we have in widen_snapshot_xid() and XLogRecGetFullXid().
Is that right? (I wrote this in a hurry, so this email may have more than the
standard level of errors.) Before commit 7e125b2, twophase also had that
logic. I didn't work out the user-visible consequences of that logic's new
absence here, but I bet on twophase recovery breakage. Similar problem here
(up to two epochs are acceptable, not just one):

+	/* Discard files from past epochs */
+	if (EpochFromFullTransactionId(fxid) < EpochFromFullTransactionId(nextXid))

Oops, you're right. Your suggestion to unify all that in a single
routine is an excellent idea. Missed the bits in xid8funcs.c.

I wrote the attached half-baked patch to fix those, but I tend to think it's
better to use FullTransactionId as many places as possible in twophase.c.
(We'd still need to convert XIDs that we read from xl_xact_prepare records,
along the lines of XLogRecGetFullXid().) How do you see it?

I'm all for integrating more FullTransactionIds now that these reflect
in the file names, and do a deeper cut.

As far as I understand, the most important point of the logic is to
detect and discard the future files first in restoreTwoPhaseData() ->
ProcessTwoPhaseBuffer() when scanning the contents of pg_twophase at
the beginning of recovery. Once this filtering is done, it should be
safe to use your FullTransactionIdFromAllowableAt() when doing
the fxid <-> xid transitions between the records and the files on disk
flushed by a restartpoint which store an XID, and the shmem state of
GlobalTransactionData with a fxid.

With the additions attached, FullTransactionIdFromAllowableAt() gets
down from 8 to 6 calls in twophase.c. The change related to
MarkAsPreparingGuts() seems optional, though. I am trying to figure
out how to write a regression test to trigger this error, lacking a
bit of time today. That's going to require more trickery with
pg_resetwal to make that cheap, or something like that.. Attached are
some suggestions, as of a 0002 that applies on top of your 0001.

XLogRecGetFullXid() is used nowhere. This could be removed, perhaps,
or not?
--
Michael

Attachments:

v2-0001-Fix-twophase.c-XID-epoch-tracking.patchtext/x-diff; charset=us-asciiDownload
From 0e1538fa72ede3566ea3dd03958540b80e9fd15c Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Thu, 16 Jan 2025 15:14:06 +0900
Subject: [PATCH v2 1/2] Fix twophase.c XID epoch tracking

Half the time after epoch 0, allowable XIDs span two epochs.  This would
have no user-visible consequences during epoch 0, but I expect
(unconfirmed) twophase breakage during other epochs.

FIXME likely rework this in favor of broader fulltransaction use in
twophase.c
---
 src/include/access/transam.h            | 29 ++++++++++
 src/backend/access/transam/twophase.c   | 75 ++++++-------------------
 src/backend/access/transam/xlogreader.c | 18 +-----
 src/backend/utils/adt/xid8funcs.c       | 43 ++++----------
 4 files changed, 58 insertions(+), 107 deletions(-)

diff --git a/src/include/access/transam.h b/src/include/access/transam.h
index 0cab8653f1..48fce16aeb 100644
--- a/src/include/access/transam.h
+++ b/src/include/access/transam.h
@@ -77,6 +77,35 @@ FullTransactionIdFromEpochAndXid(uint32 epoch, TransactionId xid)
 	return result;
 }
 
+/*
+ * Compute FullTransactionId for the given TransactionId, assuming xid was
+ * between [oldestXid, nextXid] at the time when TransamVariables->nextXid was
+ * nextFullXid.
+ */
+static inline FullTransactionId
+FullTransactionIdFromAllowableAt(FullTransactionId nextFullXid,
+								 TransactionId xid)
+{
+	uint32		epoch;
+
+	/* Special transaction ID. */
+	if (!TransactionIdIsNormal(xid))
+		return FullTransactionIdFromEpochAndXid(0, xid);
+
+	/*
+	 * The 64 bit result must be <= nextFullXid, since nextFullXid hadn't been
+	 * issued yet when xid was in the past.  The xid must therefore be from
+	 * the epoch of nextFullXid or the epoch before.  We know this because we
+	 * must remove (by freezing) an XID before assigning the XID half an epoch
+	 * ahead of it.
+	 */
+	epoch = EpochFromFullTransactionId(nextFullXid);
+	if (xid > XidFromFullTransactionId(nextFullXid))
+		epoch--;
+
+	return FullTransactionIdFromEpochAndXid(epoch, xid);
+}
+
 static inline FullTransactionId
 FullTransactionIdFromU64(uint64 value)
 {
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index a3190dc4f1..7a162938f4 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -926,24 +926,6 @@ TwoPhaseGetDummyProc(TransactionId xid, bool lock_held)
 /* State file support													*/
 /************************************************************************/
 
-/*
- * Compute FullTransactionId for the given TransactionId, using the current
- * epoch.
- */
-static inline FullTransactionId
-FullTransactionIdFromCurrentEpoch(TransactionId xid)
-{
-	FullTransactionId fxid;
-	FullTransactionId nextFullXid;
-	uint32		epoch;
-
-	nextFullXid = ReadNextFullTransactionId();
-	epoch = EpochFromFullTransactionId(nextFullXid);
-
-	fxid = FullTransactionIdFromEpochAndXid(epoch, xid);
-	return fxid;
-}
-
 static inline int
 TwoPhaseFilePath(char *path, FullTransactionId fxid)
 {
@@ -1283,7 +1265,7 @@ RegisterTwoPhaseRecord(TwoPhaseRmgrId rmid, uint16 info,
  * contents of the file, issuing an error when finding corrupted data.  If
  * missing_ok is true, which indicates that missing files can be safely
  * ignored, then return NULL.  This state can be reached when doing recovery
- * after discarding two-phase files from other epochs.
+ * after discarding two-phase files from frozen epochs.
  */
 static char *
 ReadTwoPhaseFile(TransactionId xid, bool missing_ok)
@@ -1299,7 +1281,7 @@ ReadTwoPhaseFile(TransactionId xid, bool missing_ok)
 	int			r;
 	FullTransactionId fxid;
 
-	fxid = FullTransactionIdFromCurrentEpoch(xid);
+	fxid = FullTransactionIdFromAllowableAt(ReadNextFullTransactionId(), xid);
 	TwoPhaseFilePath(path, fxid);
 
 	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
@@ -1664,15 +1646,13 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
 	/* Count the prepared xact as committed or aborted */
 	AtEOXact_PgStat(isCommit, false);
 
-	/*
-	 * And now we can clean up any files we may have left.  These should be
-	 * from the current epoch.
-	 */
+	/* And now we can clean up any files we may have left. */
 	if (ondisk)
 	{
 		FullTransactionId fxid;
 
-		fxid = FullTransactionIdFromCurrentEpoch(xid);
+		fxid = FullTransactionIdFromAllowableAt(ReadNextFullTransactionId(),
+												xid);
 		RemoveTwoPhaseFile(fxid, true);
 	}
 
@@ -1749,8 +1729,7 @@ RecreateTwoPhaseFile(TransactionId xid, void *content, int len)
 	COMP_CRC32C(statefile_crc, content, len);
 	FIN_CRC32C(statefile_crc);
 
-	/* Use current epoch */
-	fxid = FullTransactionIdFromCurrentEpoch(xid);
+	fxid = FullTransactionIdFromAllowableAt(ReadNextFullTransactionId(), xid);
 	TwoPhaseFilePath(path, fxid);
 
 	fd = OpenTransientFile(path,
@@ -1900,7 +1879,7 @@ CheckPointTwoPhase(XLogRecPtr redo_horizon)
  * This is called once at the beginning of recovery, saving any extra
  * lookups in the future.  Two-phase files that are newer than the
  * minimum XID horizon are discarded on the way.  Two-phase files with
- * an epoch older or newer than the current checkpoint's record epoch
+ * an epoch frozen relative to the current checkpoint's record epoch
  * are also discarded.
  */
 void
@@ -1971,7 +1950,6 @@ PrescanPreparedTransactions(TransactionId **xids_p, int *nxids_p)
 	TransactionId origNextXid = XidFromFullTransactionId(nextXid);
 	TransactionId result = origNextXid;
 	TransactionId *xids = NULL;
-	uint32		epoch = EpochFromFullTransactionId(nextXid);
 	int			nxids = 0;
 	int			allocsize = 0;
 	int			i;
@@ -1988,11 +1966,7 @@ PrescanPreparedTransactions(TransactionId **xids_p, int *nxids_p)
 
 		xid = gxact->xid;
 
-		/*
-		 * All two-phase files with past and future epoch in pg_twophase are
-		 * gone at this point, so we're OK to rely on only the current epoch.
-		 */
-		fxid = FullTransactionIdFromEpochAndXid(epoch, xid);
+		fxid = FullTransactionIdFromAllowableAt(nextXid, xid);
 		buf = ProcessTwoPhaseBuffer(fxid,
 									gxact->prepare_start_lsn,
 									gxact->ondisk, false, true);
@@ -2055,12 +2029,9 @@ void
 StandbyRecoverPreparedTransactions(void)
 {
 	int			i;
-	uint32		epoch;
 	FullTransactionId nextFullXid;
 
-	/* get current epoch */
 	nextFullXid = ReadNextFullTransactionId();
-	epoch = EpochFromFullTransactionId(nextFullXid);
 
 	LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
 	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
@@ -2074,11 +2045,7 @@ StandbyRecoverPreparedTransactions(void)
 
 		xid = gxact->xid;
 
-		/*
-		 * At this stage, we're OK to work with the current epoch as all past
-		 * and future files have been already discarded.
-		 */
-		fxid = FullTransactionIdFromEpochAndXid(epoch, xid);
+		fxid = FullTransactionIdFromAllowableAt(nextFullXid, xid);
 		buf = ProcessTwoPhaseBuffer(fxid,
 									gxact->prepare_start_lsn,
 									gxact->ondisk, true, false);
@@ -2108,12 +2075,9 @@ void
 RecoverPreparedTransactions(void)
 {
 	int			i;
-	uint32		epoch;
 	FullTransactionId nextFullXid;
 
-	/* get current epoch */
 	nextFullXid = ReadNextFullTransactionId();
-	epoch = EpochFromFullTransactionId(nextFullXid);
 
 	LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
 	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
@@ -2127,10 +2091,6 @@ RecoverPreparedTransactions(void)
 		TransactionId *subxids;
 		const char *gid;
 
-		/*
-		 * At this stage, we're OK to work with the current epoch as all past
-		 * and future files have been already discarded.
-		 */
 		xid = gxact->xid;
 
 		/*
@@ -2142,7 +2102,7 @@ RecoverPreparedTransactions(void)
 		 * SubTransSetParent has been set before, if the prepared transaction
 		 * generated xid assignment records.
 		 */
-		fxid = FullTransactionIdFromEpochAndXid(epoch, xid);
+		fxid = FullTransactionIdFromAllowableAt(nextFullXid, xid);
 		buf = ProcessTwoPhaseBuffer(fxid,
 									gxact->prepare_start_lsn,
 									gxact->ondisk, true, false);
@@ -2260,8 +2220,9 @@ ProcessTwoPhaseBuffer(FullTransactionId fxid,
 		return NULL;
 	}
 
-	/* Discard files from past epochs */
-	if (EpochFromFullTransactionId(fxid) < EpochFromFullTransactionId(nextXid))
+	/* Discard files from frozen epochs */
+	if (EpochFromFullTransactionId(fxid) + 1 <
+		EpochFromFullTransactionId(nextXid))
 	{
 		if (fromdisk)
 		{
@@ -2576,8 +2537,8 @@ PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 		char		path[MAXPGPATH];
 		FullTransactionId fxid;
 
-		/* Use current epoch */
-		fxid = FullTransactionIdFromCurrentEpoch(hdr->xid);
+		fxid = FullTransactionIdFromAllowableAt(ReadNextFullTransactionId(),
+												hdr->xid);
 		TwoPhaseFilePath(path, fxid);
 
 		if (access(path, F_OK) == 0)
@@ -2676,10 +2637,8 @@ PrepareRedoRemove(TransactionId xid, bool giveWarning)
 	{
 		FullTransactionId fxid;
 
-		/*
-		 * We should deal with a file at the current epoch here.
-		 */
-		fxid = FullTransactionIdFromCurrentEpoch(xid);
+		fxid = FullTransactionIdFromAllowableAt(ReadNextFullTransactionId(),
+												xid);
 		RemoveTwoPhaseFile(fxid, giveWarning);
 	}
 	RemoveGXact(gxact);
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 3596af0617..91b6a91767 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -2166,28 +2166,14 @@ RestoreBlockImage(XLogReaderState *record, uint8 block_id, char *page)
 FullTransactionId
 XLogRecGetFullXid(XLogReaderState *record)
 {
-	TransactionId xid,
-				next_xid;
-	uint32		epoch;
-
 	/*
 	 * This function is only safe during replay, because it depends on the
 	 * replay state.  See AdvanceNextFullTransactionIdPastXid() for more.
 	 */
 	Assert(AmStartupProcess() || !IsUnderPostmaster);
 
-	xid = XLogRecGetXid(record);
-	next_xid = XidFromFullTransactionId(TransamVariables->nextXid);
-	epoch = EpochFromFullTransactionId(TransamVariables->nextXid);
-
-	/*
-	 * If xid is numerically greater than next_xid, it has to be from the last
-	 * epoch.
-	 */
-	if (unlikely(xid > next_xid))
-		--epoch;
-
-	return FullTransactionIdFromEpochAndXid(epoch, xid);
+	return FullTransactionIdFromAllowableAt(TransamVariables->nextXid,
+											XLogRecGetXid(record));
 }
 
 #endif
diff --git a/src/backend/utils/adt/xid8funcs.c b/src/backend/utils/adt/xid8funcs.c
index 4736755b29..b17395617f 100644
--- a/src/backend/utils/adt/xid8funcs.c
+++ b/src/backend/utils/adt/xid8funcs.c
@@ -154,35 +154,6 @@ TransactionIdInRecentPast(FullTransactionId fxid, TransactionId *extracted_xid)
 	return !FullTransactionIdPrecedes(fxid, oldest_fxid);
 }
 
-/*
- * Convert a TransactionId obtained from a snapshot held by the caller to a
- * FullTransactionId.  Use next_fxid as a reference FullTransactionId, so that
- * we can compute the high order bits.  It must have been obtained by the
- * caller with ReadNextFullTransactionId() after the snapshot was created.
- */
-static FullTransactionId
-widen_snapshot_xid(TransactionId xid, FullTransactionId next_fxid)
-{
-	TransactionId next_xid = XidFromFullTransactionId(next_fxid);
-	uint32		epoch = EpochFromFullTransactionId(next_fxid);
-
-	/* Special transaction ID. */
-	if (!TransactionIdIsNormal(xid))
-		return FullTransactionIdFromEpochAndXid(0, xid);
-
-	/*
-	 * The 64 bit result must be <= next_fxid, since next_fxid hadn't been
-	 * issued yet when the snapshot was created.  Every TransactionId in the
-	 * snapshot must therefore be from the same epoch as next_fxid, or the
-	 * epoch before.  We know this because next_fxid is never allow to get
-	 * more than one epoch ahead of the TransactionIds in any snapshot.
-	 */
-	if (xid > next_xid)
-		epoch--;
-
-	return FullTransactionIdFromEpochAndXid(epoch, xid);
-}
-
 /*
  * txid comparator for qsort/bsearch
  */
@@ -420,12 +391,18 @@ pg_current_snapshot(PG_FUNCTION_ARGS)
 	nxip = cur->xcnt;
 	snap = palloc(PG_SNAPSHOT_SIZE(nxip));
 
-	/* fill */
-	snap->xmin = widen_snapshot_xid(cur->xmin, next_fxid);
-	snap->xmax = widen_snapshot_xid(cur->xmax, next_fxid);
+	/*
+	 * Fill.  This is the current backend's active snapshot, so MyProc->xmin
+	 * is <= all these XIDs.  As long as that remains so, oldestXid can't
+	 * advance past any of these XIDs.  Hence, these XIDs remain allowable
+	 * relative to next_fxid.
+	 */
+	snap->xmin = FullTransactionIdFromAllowableAt(next_fxid, cur->xmin);
+	snap->xmax = FullTransactionIdFromAllowableAt(next_fxid, cur->xmax);
 	snap->nxip = nxip;
 	for (i = 0; i < nxip; i++)
-		snap->xip[i] = widen_snapshot_xid(cur->xip[i], next_fxid);
+		snap->xip[i] =
+			FullTransactionIdFromAllowableAt(next_fxid, cur->xip[i]);
 
 	/*
 	 * We want them guaranteed to be in ascending order.  This also removes
-- 
2.47.1

v2-0002-Integrate-deeper-FullTransactionIds-into-twophase.patchtext/x-diff; charset=us-asciiDownload
From 55b6d87b3cf032a65efdce260f5422fdee6426f1 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Thu, 16 Jan 2025 16:45:33 +0900
Subject: [PATCH v2 2/2] Integrate deeper FullTransactionIds into twophase.c

---
 src/backend/access/transam/twophase.c | 115 ++++++++++++--------------
 1 file changed, 55 insertions(+), 60 deletions(-)

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 7a162938f4..75008fb949 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -159,7 +159,7 @@ typedef struct GlobalTransactionData
 	 */
 	XLogRecPtr	prepare_start_lsn;	/* XLOG offset of prepare record start */
 	XLogRecPtr	prepare_end_lsn;	/* XLOG offset of prepare record end */
-	TransactionId xid;			/* The GXACT id */
+	FullTransactionId fxid;		/* The GXACT full xid */
 
 	Oid			owner;			/* ID of user that executed the xact */
 	ProcNumber	locking_backend;	/* backend currently working on the xact */
@@ -224,11 +224,11 @@ static void XlogReadTwoPhaseData(XLogRecPtr lsn, char **buf, int *len);
 static char *ProcessTwoPhaseBuffer(FullTransactionId xid,
 								   XLogRecPtr prepare_start_lsn,
 								   bool fromdisk, bool setParent, bool setNextXid);
-static void MarkAsPreparingGuts(GlobalTransaction gxact, TransactionId xid,
+static void MarkAsPreparingGuts(GlobalTransaction gxact, FullTransactionId fxid,
 								const char *gid, TimestampTz prepared_at, Oid owner,
 								Oid databaseid);
 static void RemoveTwoPhaseFile(FullTransactionId fxid, bool giveWarning);
-static void RecreateTwoPhaseFile(TransactionId xid, void *content, int len);
+static void RecreateTwoPhaseFile(FullTransactionId fxid, void *content, int len);
 
 /*
  * Initialization of shared memory
@@ -360,6 +360,7 @@ MarkAsPreparing(TransactionId xid, const char *gid,
 				TimestampTz prepared_at, Oid owner, Oid databaseid)
 {
 	GlobalTransaction gxact;
+	FullTransactionId fxid;
 	int			i;
 
 	if (strlen(gid) >= GIDSIZE)
@@ -407,7 +408,9 @@ MarkAsPreparing(TransactionId xid, const char *gid,
 	gxact = TwoPhaseState->freeGXacts;
 	TwoPhaseState->freeGXacts = gxact->next;
 
-	MarkAsPreparingGuts(gxact, xid, gid, prepared_at, owner, databaseid);
+	fxid = FullTransactionIdFromAllowableAt(ReadNextFullTransactionId(),
+											xid);
+	MarkAsPreparingGuts(gxact, fxid, gid, prepared_at, owner, databaseid);
 
 	gxact->ondisk = false;
 
@@ -430,11 +433,13 @@ MarkAsPreparing(TransactionId xid, const char *gid,
  * Note: This function should be called with appropriate locks held.
  */
 static void
-MarkAsPreparingGuts(GlobalTransaction gxact, TransactionId xid, const char *gid,
-					TimestampTz prepared_at, Oid owner, Oid databaseid)
+MarkAsPreparingGuts(GlobalTransaction gxact, FullTransactionId fxid,
+					const char *gid, TimestampTz prepared_at, Oid owner,
+					Oid databaseid)
 {
 	PGPROC	   *proc;
 	int			i;
+	TransactionId xid = XidFromFullTransactionId(fxid);
 
 	Assert(LWLockHeldByMeInMode(TwoPhaseStateLock, LW_EXCLUSIVE));
 
@@ -479,7 +484,7 @@ MarkAsPreparingGuts(GlobalTransaction gxact, TransactionId xid, const char *gid,
 	proc->subxidStatus.count = 0;
 
 	gxact->prepared_at = prepared_at;
-	gxact->xid = xid;
+	gxact->fxid = fxid;
 	gxact->owner = owner;
 	gxact->locking_backend = MyProcNumber;
 	gxact->valid = false;
@@ -800,6 +805,7 @@ static GlobalTransaction
 TwoPhaseGetGXact(TransactionId xid, bool lock_held)
 {
 	GlobalTransaction result = NULL;
+	FullTransactionId fxid;
 	int			i;
 
 	static TransactionId cached_xid = InvalidTransactionId;
@@ -817,11 +823,14 @@ TwoPhaseGetGXact(TransactionId xid, bool lock_held)
 	if (!lock_held)
 		LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
 
+	fxid = FullTransactionIdFromAllowableAt(ReadNextFullTransactionId(),
+											xid);
+
 	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
 	{
 		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
 
-		if (gxact->xid == xid)
+		if (FullTransactionIdEquals(gxact->fxid, fxid))
 		{
 			result = gxact;
 			break;
@@ -881,7 +890,7 @@ TwoPhaseGetXidByVirtualXID(VirtualTransactionId vxid,
 				*have_more = true;
 				break;
 			}
-			result = gxact->xid;
+			result = XidFromFullTransactionId(gxact->fxid);
 		}
 	}
 
@@ -1032,7 +1041,7 @@ void
 StartPrepare(GlobalTransaction gxact)
 {
 	PGPROC	   *proc = GetPGProcByNumber(gxact->pgprocno);
-	TransactionId xid = gxact->xid;
+	TransactionId xid = XidFromFullTransactionId(gxact->fxid);
 	TwoPhaseFileHeader hdr;
 	TransactionId *children;
 	RelFileLocator *commitrels;
@@ -1268,7 +1277,7 @@ RegisterTwoPhaseRecord(TwoPhaseRmgrId rmid, uint16 info,
  * after discarding two-phase files from frozen epochs.
  */
 static char *
-ReadTwoPhaseFile(TransactionId xid, bool missing_ok)
+ReadTwoPhaseFile(FullTransactionId fxid, bool missing_ok)
 {
 	char		path[MAXPGPATH];
 	char	   *buf;
@@ -1279,9 +1288,7 @@ ReadTwoPhaseFile(TransactionId xid, bool missing_ok)
 	pg_crc32c	calc_crc,
 				file_crc;
 	int			r;
-	FullTransactionId fxid;
 
-	fxid = FullTransactionIdFromAllowableAt(ReadNextFullTransactionId(), xid);
 	TwoPhaseFilePath(path, fxid);
 
 	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
@@ -1447,6 +1454,7 @@ StandbyTransactionIdIsPrepared(TransactionId xid)
 	char	   *buf;
 	TwoPhaseFileHeader *hdr;
 	bool		result;
+	FullTransactionId fxid;
 
 	Assert(TransactionIdIsValid(xid));
 
@@ -1454,7 +1462,8 @@ StandbyTransactionIdIsPrepared(TransactionId xid)
 		return false;			/* nothing to do */
 
 	/* Read and validate file */
-	buf = ReadTwoPhaseFile(xid, true);
+	fxid = FullTransactionIdFromAllowableAt(ReadNextFullTransactionId(), xid);
+	buf = ReadTwoPhaseFile(fxid, true);
 	if (buf == NULL)
 		return false;
 
@@ -1474,6 +1483,7 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
 {
 	GlobalTransaction gxact;
 	PGPROC	   *proc;
+	FullTransactionId fxid;
 	TransactionId xid;
 	bool		ondisk;
 	char	   *buf;
@@ -1495,7 +1505,8 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
 	 */
 	gxact = LockGXact(gid, GetUserId());
 	proc = GetPGProcByNumber(gxact->pgprocno);
-	xid = gxact->xid;
+	fxid = gxact->fxid;
+	xid = XidFromFullTransactionId(fxid);
 
 	/*
 	 * Read and validate 2PC state data. State data will typically be stored
@@ -1503,7 +1514,7 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
 	 * to disk if for some reason they have lived for a long time.
 	 */
 	if (gxact->ondisk)
-		buf = ReadTwoPhaseFile(xid, false);
+		buf = ReadTwoPhaseFile(fxid, false);
 	else
 		XlogReadTwoPhaseData(gxact->prepare_start_lsn, &buf, NULL);
 
@@ -1648,13 +1659,7 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
 
 	/* And now we can clean up any files we may have left. */
 	if (ondisk)
-	{
-		FullTransactionId fxid;
-
-		fxid = FullTransactionIdFromAllowableAt(ReadNextFullTransactionId(),
-												xid);
 		RemoveTwoPhaseFile(fxid, true);
-	}
 
 	MyLockedGxact = NULL;
 
@@ -1717,19 +1722,17 @@ RemoveTwoPhaseFile(FullTransactionId fxid, bool giveWarning)
  * Note: content and len don't include CRC.
  */
 static void
-RecreateTwoPhaseFile(TransactionId xid, void *content, int len)
+RecreateTwoPhaseFile(FullTransactionId fxid, void *content, int len)
 {
 	char		path[MAXPGPATH];
 	pg_crc32c	statefile_crc;
 	int			fd;
-	FullTransactionId fxid;
 
 	/* Recompute CRC */
 	INIT_CRC32C(statefile_crc);
 	COMP_CRC32C(statefile_crc, content, len);
 	FIN_CRC32C(statefile_crc);
 
-	fxid = FullTransactionIdFromAllowableAt(ReadNextFullTransactionId(), xid);
 	TwoPhaseFilePath(path, fxid);
 
 	fd = OpenTransientFile(path,
@@ -1842,7 +1845,7 @@ CheckPointTwoPhase(XLogRecPtr redo_horizon)
 			int			len;
 
 			XlogReadTwoPhaseData(gxact->prepare_start_lsn, &buf, &len);
-			RecreateTwoPhaseFile(gxact->xid, buf, len);
+			RecreateTwoPhaseFile(gxact->fxid, buf, len);
 			gxact->ondisk = true;
 			gxact->prepare_start_lsn = InvalidXLogRecPtr;
 			gxact->prepare_end_lsn = InvalidXLogRecPtr;
@@ -1964,9 +1967,8 @@ PrescanPreparedTransactions(TransactionId **xids_p, int *nxids_p)
 
 		Assert(gxact->inredo);
 
-		xid = gxact->xid;
-
-		fxid = FullTransactionIdFromAllowableAt(nextXid, xid);
+		fxid = gxact->fxid;
+		xid = XidFromFullTransactionId(fxid);
 		buf = ProcessTwoPhaseBuffer(fxid,
 									gxact->prepare_start_lsn,
 									gxact->ondisk, false, true);
@@ -2029,24 +2031,16 @@ void
 StandbyRecoverPreparedTransactions(void)
 {
 	int			i;
-	FullTransactionId nextFullXid;
-
-	nextFullXid = ReadNextFullTransactionId();
 
 	LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
 	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
 	{
-		TransactionId xid;
-		FullTransactionId fxid;
 		char	   *buf;
 		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
 
 		Assert(gxact->inredo);
 
-		xid = gxact->xid;
-
-		fxid = FullTransactionIdFromAllowableAt(nextFullXid, xid);
-		buf = ProcessTwoPhaseBuffer(fxid,
+		buf = ProcessTwoPhaseBuffer(gxact->fxid,
 									gxact->prepare_start_lsn,
 									gxact->ondisk, true, false);
 		if (buf != NULL)
@@ -2075,15 +2069,11 @@ void
 RecoverPreparedTransactions(void)
 {
 	int			i;
-	FullTransactionId nextFullXid;
-
-	nextFullXid = ReadNextFullTransactionId();
 
 	LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
 	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
 	{
 		TransactionId xid;
-		FullTransactionId fxid;
 		char	   *buf;
 		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
 		char	   *bufptr;
@@ -2091,8 +2081,6 @@ RecoverPreparedTransactions(void)
 		TransactionId *subxids;
 		const char *gid;
 
-		xid = gxact->xid;
-
 		/*
 		 * Reconstruct subtrans state for the transaction --- needed because
 		 * pg_subtrans is not preserved over a restart.  Note that we are
@@ -2102,15 +2090,17 @@ RecoverPreparedTransactions(void)
 		 * SubTransSetParent has been set before, if the prepared transaction
 		 * generated xid assignment records.
 		 */
-		fxid = FullTransactionIdFromAllowableAt(nextFullXid, xid);
-		buf = ProcessTwoPhaseBuffer(fxid,
+		buf = ProcessTwoPhaseBuffer(gxact->fxid,
 									gxact->prepare_start_lsn,
 									gxact->ondisk, true, false);
 		if (buf == NULL)
 			continue;
 
+		xid = XidFromFullTransactionId(gxact->fxid);
 		ereport(LOG,
-				(errmsg("recovering prepared transaction %u from shared memory", xid)));
+				(errmsg("recovering prepared transaction %u of epoch %u from shared memory",
+						XidFromFullTransactionId(gxact->fxid),
+						EpochFromFullTransactionId(gxact->fxid))));
 
 		hdr = (TwoPhaseFileHeader *) buf;
 		Assert(TransactionIdEquals(hdr->xid, xid));
@@ -2129,7 +2119,7 @@ RecoverPreparedTransactions(void)
 		 * Recreate its GXACT and dummy PGPROC. But, check whether it was
 		 * added in redo and already has a shmem entry for it.
 		 */
-		MarkAsPreparingGuts(gxact, xid, gid,
+		MarkAsPreparingGuts(gxact, gxact->fxid, gid,
 							hdr->prepared_at,
 							hdr->owner, hdr->database);
 
@@ -2264,7 +2254,7 @@ ProcessTwoPhaseBuffer(FullTransactionId fxid,
 	if (fromdisk)
 	{
 		/* Read and validate file */
-		buf = ReadTwoPhaseFile(xid, false);
+		buf = ReadTwoPhaseFile(fxid, false);
 	}
 	else
 	{
@@ -2570,7 +2560,8 @@ PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 	gxact->prepared_at = hdr->prepared_at;
 	gxact->prepare_start_lsn = start_lsn;
 	gxact->prepare_end_lsn = end_lsn;
-	gxact->xid = hdr->xid;
+	gxact->fxid = FullTransactionIdFromAllowableAt(ReadNextFullTransactionId(),
+												   hdr->xid);
 	gxact->owner = hdr->owner;
 	gxact->locking_backend = INVALID_PROC_NUMBER;
 	gxact->valid = false;
@@ -2589,7 +2580,9 @@ PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 						   false /* backward */ , false /* WAL */ );
 	}
 
-	elog(DEBUG2, "added 2PC data in shared memory for transaction %u", gxact->xid);
+	elog(DEBUG2, "added 2PC data in shared memory for transaction %u of epoch %u",
+		 XidFromFullTransactionId(gxact->fxid),
+		 EpochFromFullTransactionId(gxact->fxid));
 }
 
 /*
@@ -2605,17 +2598,21 @@ void
 PrepareRedoRemove(TransactionId xid, bool giveWarning)
 {
 	GlobalTransaction gxact = NULL;
+	FullTransactionId fxid;
 	int			i;
 	bool		found = false;
 
 	Assert(LWLockHeldByMeInMode(TwoPhaseStateLock, LW_EXCLUSIVE));
 	Assert(RecoveryInProgress());
 
+	fxid = FullTransactionIdFromAllowableAt(ReadNextFullTransactionId(),
+											xid);
+
 	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
 	{
 		gxact = TwoPhaseState->prepXacts[i];
 
-		if (gxact->xid == xid)
+		if (FullTransactionIdEquals(gxact->fxid, fxid))
 		{
 			Assert(gxact->inredo);
 			found = true;
@@ -2632,15 +2629,13 @@ PrepareRedoRemove(TransactionId xid, bool giveWarning)
 	/*
 	 * And now we can clean up any files we may have left.
 	 */
-	elog(DEBUG2, "removing 2PC data for transaction %u", xid);
-	if (gxact->ondisk)
-	{
-		FullTransactionId fxid;
+	elog(DEBUG2, "removing 2PC data for transaction %u of epoch %u ",
+		 XidFromFullTransactionId(fxid),
+		 EpochFromFullTransactionId(fxid));
 
-		fxid = FullTransactionIdFromAllowableAt(ReadNextFullTransactionId(),
-												xid);
+	if (gxact->ondisk)
 		RemoveTwoPhaseFile(fxid, giveWarning);
-	}
+
 	RemoveGXact(gxact);
 }
 
@@ -2688,7 +2683,7 @@ LookupGXact(const char *gid, XLogRecPtr prepare_end_lsn,
 			 * between publisher and subscriber.
 			 */
 			if (gxact->ondisk)
-				buf = ReadTwoPhaseFile(gxact->xid, false);
+				buf = ReadTwoPhaseFile(gxact->fxid, false);
 			else
 			{
 				Assert(gxact->prepare_start_lsn);
-- 
2.47.1

#14Noah Misch
noah@leadboat.com
In reply to: Michael Paquier (#13)
1 attachment(s)
Re: An improvement of ProcessTwoPhaseBuffer logic

On Thu, Jan 16, 2025 at 04:50:09PM +0900, Michael Paquier wrote:

On Wed, Jan 15, 2025 at 05:00:51PM -0800, Noah Misch wrote:

I think "using the current epoch" is wrong for half of the nextFullXid values
having epoch > 0. For example, nextFullId==2^32 is in epoch 1, but all the
allowable XIDs are in epoch 0. (I mean "allowable" in the sense of
AssertTransactionIdInAllowableRange().) From then until we assign another
2^31 XIDs, epochs 0 and 1 are both expected in XID values. After 2^31 XID
assignments, every allowable XID be in epoch 1. Hence, twophase.c would need
two-epoch logic like we have in widen_snapshot_xid() and XLogRecGetFullXid().
Is that right? (I wrote this in a hurry, so this email may have more than the
standard level of errors.) Before commit 7e125b2, twophase also had that
logic. I didn't work out the user-visible consequences of that logic's new
absence here, but I bet on twophase recovery breakage. Similar problem here
(up to two epochs are acceptable, not just one):

+	/* Discard files from past epochs */
+	if (EpochFromFullTransactionId(fxid) < EpochFromFullTransactionId(nextXid))

Oops, you're right. Your suggestion to unify all that in a single
routine is an excellent idea. Missed the bits in xid8funcs.c.

Added.

I wrote the attached half-baked patch to fix those, but I tend to think it's
better to use FullTransactionId as many places as possible in twophase.c.
(We'd still need to convert XIDs that we read from xl_xact_prepare records,
along the lines of XLogRecGetFullXid().) How do you see it?

I'm all for integrating more FullTransactionIds now that these reflect
in the file names, and do a deeper cut.

As far as I understand, the most important point of the logic is to
detect and discard the future files first in restoreTwoPhaseData() ->
ProcessTwoPhaseBuffer() when scanning the contents of pg_twophase at
the beginning of recovery. Once this filtering is done, it should be
safe to use your FullTransactionIdFromAllowableAt() when doing
the fxid <-> xid transitions between the records and the files on disk
flushed by a restartpoint which store an XID, and the shmem state of
GlobalTransactionData with a fxid.

(I did not expect that a function called restoreTwoPhaseData() would run before
a function called PrescanPreparedTransactions(), but so it is.)

How is it that restoreTwoPhaseData() -> ProcessTwoPhaseBuffer() can safely
call TransactionIdDidAbort() when we've not replayed WAL to make CLOG
consistent? What can it assume about the value of TransamVariables->nextXid
at that early time?

With the additions attached, FullTransactionIdFromAllowableAt() gets
down from 8 to 6 calls in twophase.c. The change related to
MarkAsPreparingGuts() seems optional, though.

Thanks. It's probably not worth doing at that level of reduction. I tried
spreading fxid further and got it down to three conversions, corresponding to
redo of each of XLOG_XACT_PREPARE, XLOG_XACT_COMMIT_PREPARED, and
XLOG_XACT_ABORT_PREPARED. I'm attaching the WIP of that. It's not as
satisfying as I expected, so FullTransactionIdFromCurrentEpoch-v0.patch may
yet be the better direction.

The ProcessTwoPhaseBuffer() code to remove "past two-phase state" seems
best-effort, independent of this change. Just because an XID is in a
potentially-acceptable epoch doesn't mean TransactionIdDidCommit() will find
clog for it. I've not studied whether that matters.

Incidentally, this comment about when a function is called:

* PrescanPreparedTransactions
*
* Scan the shared memory entries of TwoPhaseState and determine the range
* of valid XIDs present. This is run during database startup, after we
* have completed reading WAL. TransamVariables->nextXid has been set to
* one more than the highest XID for which evidence exists in WAL.

doesn't match the timing of the actual earliest call:

/* REDO */
if (InRecovery)
{
...
if (ArchiveRecoveryRequested && EnableHotStandby)
{
...
if (wasShutdown)
oldestActiveXID = PrescanPreparedTransactions(&xids, &nxids);
else
oldestActiveXID = checkPoint.oldestActiveXid;
...
PerformWalRecovery();
performedWalRecovery = true;

I am trying to figure
out how to write a regression test to trigger this error, lacking a
bit of time today. That's going to require more trickery with
pg_resetwal to make that cheap, or something like that.. Attached are
some suggestions, as of a 0002 that applies on top of your 0001.

Thanks. I'd value having your regression test, but manual-ish testing could
suffice if it's too hard.

XLogRecGetFullXid() is used nowhere. This could be removed, perhaps,
or not?

Maybe. Looks like it was born unused in 67b9b3c (2019-07), so removing may as
well be a separate discussion.

Attachments:

FullTransactionIdFromCurrentEpoch-v0.1.patchtext/plain; charset=us-asciiDownload
diff --git a/contrib/amcheck/verify_heapam.c b/contrib/amcheck/verify_heapam.c
index 8a8e36d..8273123 100644
--- a/contrib/amcheck/verify_heapam.c
+++ b/contrib/amcheck/verify_heapam.c
@@ -1877,7 +1877,9 @@ check_tuple(HeapCheckContext *ctx, bool *xmin_commit_status_ok,
 /*
  * Convert a TransactionId into a FullTransactionId using our cached values of
  * the valid transaction ID range.  It is the caller's responsibility to have
- * already updated the cached values, if necessary.
+ * already updated the cached values, if necessary.  This is akin to
+ * FullTransactionIdFromAllowableAt(), but it tolerates corruption in the form
+ * of an xid before epoch 0.
  */
 static FullTransactionId
 FullTransactionIdFromXidAndCtx(TransactionId xid, const HeapCheckContext *ctx)
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 27ccdf9..4e52792 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1847,7 +1847,7 @@ AtPrepare_MultiXact(void)
  *		Clean up after successful PREPARE TRANSACTION
  */
 void
-PostPrepare_MultiXact(TransactionId xid)
+PostPrepare_MultiXact(FullTransactionId fxid)
 {
 	MultiXactId myOldestMember;
 
@@ -1858,7 +1858,7 @@ PostPrepare_MultiXact(TransactionId xid)
 	myOldestMember = OldestMemberMXactId[MyProcNumber];
 	if (MultiXactIdIsValid(myOldestMember))
 	{
-		ProcNumber	dummyProcNumber = TwoPhaseGetDummyProcNumber(xid, false);
+		ProcNumber	dummyProcNumber = TwoPhaseGetDummyProcNumber(fxid, false);
 
 		/*
 		 * Even though storing MultiXactId is atomic, acquire lock to make
@@ -1896,10 +1896,10 @@ PostPrepare_MultiXact(TransactionId xid)
  *		Recover the state of a prepared transaction at startup
  */
 void
-multixact_twophase_recover(TransactionId xid, uint16 info,
+multixact_twophase_recover(FullTransactionId fxid, uint16 info,
 						   void *recdata, uint32 len)
 {
-	ProcNumber	dummyProcNumber = TwoPhaseGetDummyProcNumber(xid, false);
+	ProcNumber	dummyProcNumber = TwoPhaseGetDummyProcNumber(fxid, false);
 	MultiXactId oldestMember;
 
 	/*
@@ -1917,10 +1917,10 @@ multixact_twophase_recover(TransactionId xid, uint16 info,
  *		Similar to AtEOXact_MultiXact but for COMMIT PREPARED
  */
 void
-multixact_twophase_postcommit(TransactionId xid, uint16 info,
+multixact_twophase_postcommit(FullTransactionId fxid, uint16 info,
 							  void *recdata, uint32 len)
 {
-	ProcNumber	dummyProcNumber = TwoPhaseGetDummyProcNumber(xid, true);
+	ProcNumber	dummyProcNumber = TwoPhaseGetDummyProcNumber(fxid, true);
 
 	Assert(len == sizeof(MultiXactId));
 
@@ -1932,10 +1932,10 @@ multixact_twophase_postcommit(TransactionId xid, uint16 info,
  *		This is actually just the same as the COMMIT case.
  */
 void
-multixact_twophase_postabort(TransactionId xid, uint16 info,
+multixact_twophase_postabort(FullTransactionId fxid, uint16 info,
 							 void *recdata, uint32 len)
 {
-	multixact_twophase_postcommit(xid, info, recdata, len);
+	multixact_twophase_postcommit(fxid, info, recdata, len);
 }
 
 /*
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index a3190dc..8b2d102 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -159,7 +159,7 @@ typedef struct GlobalTransactionData
 	 */
 	XLogRecPtr	prepare_start_lsn;	/* XLOG offset of prepare record start */
 	XLogRecPtr	prepare_end_lsn;	/* XLOG offset of prepare record end */
-	TransactionId xid;			/* The GXACT id */
+	FullTransactionId fxid;		/* The GXACT full xid */
 
 	Oid			owner;			/* ID of user that executed the xact */
 	ProcNumber	locking_backend;	/* backend currently working on the xact */
@@ -197,6 +197,7 @@ static GlobalTransaction MyLockedGxact = NULL;
 
 static bool twophaseExitRegistered = false;
 
+static void PrepareRedoRemoveFull(FullTransactionId fxid, bool giveWarning);
 static void RecordTransactionCommitPrepared(TransactionId xid,
 											int nchildren,
 											TransactionId *children,
@@ -216,19 +217,19 @@ static void RecordTransactionAbortPrepared(TransactionId xid,
 										   int nstats,
 										   xl_xact_stats_item *stats,
 										   const char *gid);
-static void ProcessRecords(char *bufptr, TransactionId xid,
+static void ProcessRecords(char *bufptr, FullTransactionId fxid,
 						   const TwoPhaseCallback callbacks[]);
 static void RemoveGXact(GlobalTransaction gxact);
 
 static void XlogReadTwoPhaseData(XLogRecPtr lsn, char **buf, int *len);
-static char *ProcessTwoPhaseBuffer(FullTransactionId xid,
+static char *ProcessTwoPhaseBuffer(FullTransactionId fxid,
 								   XLogRecPtr prepare_start_lsn,
 								   bool fromdisk, bool setParent, bool setNextXid);
-static void MarkAsPreparingGuts(GlobalTransaction gxact, TransactionId xid,
+static void MarkAsPreparingGuts(GlobalTransaction gxact, FullTransactionId fxid,
 								const char *gid, TimestampTz prepared_at, Oid owner,
 								Oid databaseid);
 static void RemoveTwoPhaseFile(FullTransactionId fxid, bool giveWarning);
-static void RecreateTwoPhaseFile(TransactionId xid, void *content, int len);
+static void RecreateTwoPhaseFile(FullTransactionId fxid, void *content, int len);
 
 /*
  * Initialization of shared memory
@@ -356,7 +357,7 @@ PostPrepare_Twophase(void)
  *		Reserve the GID for the given transaction.
  */
 GlobalTransaction
-MarkAsPreparing(TransactionId xid, const char *gid,
+MarkAsPreparing(FullTransactionId fxid, const char *gid,
 				TimestampTz prepared_at, Oid owner, Oid databaseid)
 {
 	GlobalTransaction gxact;
@@ -407,7 +408,7 @@ MarkAsPreparing(TransactionId xid, const char *gid,
 	gxact = TwoPhaseState->freeGXacts;
 	TwoPhaseState->freeGXacts = gxact->next;
 
-	MarkAsPreparingGuts(gxact, xid, gid, prepared_at, owner, databaseid);
+	MarkAsPreparingGuts(gxact, fxid, gid, prepared_at, owner, databaseid);
 
 	gxact->ondisk = false;
 
@@ -430,11 +431,13 @@ MarkAsPreparing(TransactionId xid, const char *gid,
  * Note: This function should be called with appropriate locks held.
  */
 static void
-MarkAsPreparingGuts(GlobalTransaction gxact, TransactionId xid, const char *gid,
-					TimestampTz prepared_at, Oid owner, Oid databaseid)
+MarkAsPreparingGuts(GlobalTransaction gxact, FullTransactionId fxid,
+					const char *gid, TimestampTz prepared_at, Oid owner,
+					Oid databaseid)
 {
 	PGPROC	   *proc;
 	int			i;
+	TransactionId xid = XidFromFullTransactionId(fxid);
 
 	Assert(LWLockHeldByMeInMode(TwoPhaseStateLock, LW_EXCLUSIVE));
 
@@ -479,7 +482,7 @@ MarkAsPreparingGuts(GlobalTransaction gxact, TransactionId xid, const char *gid,
 	proc->subxidStatus.count = 0;
 
 	gxact->prepared_at = prepared_at;
-	gxact->xid = xid;
+	gxact->fxid = fxid;
 	gxact->owner = owner;
 	gxact->locking_backend = MyProcNumber;
 	gxact->valid = false;
@@ -797,12 +800,12 @@ pg_prepared_xact(PG_FUNCTION_ARGS)
  * caller had better hold it.
  */
 static GlobalTransaction
-TwoPhaseGetGXact(TransactionId xid, bool lock_held)
+TwoPhaseGetGXact(FullTransactionId fxid, bool lock_held)
 {
 	GlobalTransaction result = NULL;
 	int			i;
 
-	static TransactionId cached_xid = InvalidTransactionId;
+	static FullTransactionId cached_fxid = {0};
 	static GlobalTransaction cached_gxact = NULL;
 
 	Assert(!lock_held || LWLockHeldByMe(TwoPhaseStateLock));
@@ -811,7 +814,7 @@ TwoPhaseGetGXact(TransactionId xid, bool lock_held)
 	 * During a recovery, COMMIT PREPARED, or ABORT PREPARED, we'll be called
 	 * repeatedly for the same XID.  We can save work with a simple cache.
 	 */
-	if (xid == cached_xid)
+	if (FullTransactionIdEquals(fxid, cached_fxid))
 		return cached_gxact;
 
 	if (!lock_held)
@@ -821,7 +824,7 @@ TwoPhaseGetGXact(TransactionId xid, bool lock_held)
 	{
 		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
 
-		if (gxact->xid == xid)
+		if (FullTransactionIdEquals(gxact->fxid, fxid))
 		{
 			result = gxact;
 			break;
@@ -832,9 +835,10 @@ TwoPhaseGetGXact(TransactionId xid, bool lock_held)
 		LWLockRelease(TwoPhaseStateLock);
 
 	if (result == NULL)			/* should not happen */
-		elog(ERROR, "failed to find GlobalTransaction for xid %u", xid);
+		elog(ERROR, "failed to find GlobalTransaction for xid %u",
+			 XidFromFullTransactionId(fxid));
 
-	cached_xid = xid;
+	cached_fxid = fxid;
 	cached_gxact = result;
 
 	return result;
@@ -881,7 +885,7 @@ TwoPhaseGetXidByVirtualXID(VirtualTransactionId vxid,
 				*have_more = true;
 				break;
 			}
-			result = gxact->xid;
+			result = XidFromFullTransactionId(gxact->fxid);
 		}
 	}
 
@@ -892,7 +896,7 @@ TwoPhaseGetXidByVirtualXID(VirtualTransactionId vxid,
 
 /*
  * TwoPhaseGetDummyProcNumber
- *		Get the dummy proc number for prepared transaction specified by XID
+ *		Get the dummy proc number for prepared transaction
  *
  * Dummy proc numbers are similar to proc numbers of real backends.  They
  * start at MaxBackends, and are unique across all currently active real
@@ -900,24 +904,24 @@ TwoPhaseGetXidByVirtualXID(VirtualTransactionId vxid,
  * TwoPhaseStateLock will not be taken, so the caller had better hold it.
  */
 ProcNumber
-TwoPhaseGetDummyProcNumber(TransactionId xid, bool lock_held)
+TwoPhaseGetDummyProcNumber(FullTransactionId fxid, bool lock_held)
 {
-	GlobalTransaction gxact = TwoPhaseGetGXact(xid, lock_held);
+	GlobalTransaction gxact = TwoPhaseGetGXact(fxid, lock_held);
 
 	return gxact->pgprocno;
 }
 
 /*
  * TwoPhaseGetDummyProc
- *		Get the PGPROC that represents a prepared transaction specified by XID
+ *		Get the PGPROC that represents a prepared transaction
  *
  * If lock_held is set to true, TwoPhaseStateLock will not be taken, so the
  * caller had better hold it.
  */
 PGPROC *
-TwoPhaseGetDummyProc(TransactionId xid, bool lock_held)
+TwoPhaseGetDummyProc(FullTransactionId fxid, bool lock_held)
 {
-	GlobalTransaction gxact = TwoPhaseGetGXact(xid, lock_held);
+	GlobalTransaction gxact = TwoPhaseGetGXact(fxid, lock_held);
 
 	return GetPGProcByNumber(gxact->pgprocno);
 }
@@ -926,24 +930,6 @@ TwoPhaseGetDummyProc(TransactionId xid, bool lock_held)
 /* State file support													*/
 /************************************************************************/
 
-/*
- * Compute FullTransactionId for the given TransactionId, using the current
- * epoch.
- */
-static inline FullTransactionId
-FullTransactionIdFromCurrentEpoch(TransactionId xid)
-{
-	FullTransactionId fxid;
-	FullTransactionId nextFullXid;
-	uint32		epoch;
-
-	nextFullXid = ReadNextFullTransactionId();
-	epoch = EpochFromFullTransactionId(nextFullXid);
-
-	fxid = FullTransactionIdFromEpochAndXid(epoch, xid);
-	return fxid;
-}
-
 static inline int
 TwoPhaseFilePath(char *path, FullTransactionId fxid)
 {
@@ -1050,7 +1036,7 @@ void
 StartPrepare(GlobalTransaction gxact)
 {
 	PGPROC	   *proc = GetPGProcByNumber(gxact->pgprocno);
-	TransactionId xid = gxact->xid;
+	TransactionId xid = XidFromFullTransactionId(gxact->fxid);
 	TwoPhaseFileHeader hdr;
 	TransactionId *children;
 	RelFileLocator *commitrels;
@@ -1283,10 +1269,10 @@ RegisterTwoPhaseRecord(TwoPhaseRmgrId rmid, uint16 info,
  * contents of the file, issuing an error when finding corrupted data.  If
  * missing_ok is true, which indicates that missing files can be safely
  * ignored, then return NULL.  This state can be reached when doing recovery
- * after discarding two-phase files from other epochs.
+ * after discarding two-phase files from frozen epochs.
  */
 static char *
-ReadTwoPhaseFile(TransactionId xid, bool missing_ok)
+ReadTwoPhaseFile(FullTransactionId fxid, bool missing_ok)
 {
 	char		path[MAXPGPATH];
 	char	   *buf;
@@ -1297,9 +1283,7 @@ ReadTwoPhaseFile(TransactionId xid, bool missing_ok)
 	pg_crc32c	calc_crc,
 				file_crc;
 	int			r;
-	FullTransactionId fxid;
 
-	fxid = FullTransactionIdFromCurrentEpoch(xid);
 	TwoPhaseFilePath(path, fxid);
 
 	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
@@ -1465,6 +1449,7 @@ StandbyTransactionIdIsPrepared(TransactionId xid)
 	char	   *buf;
 	TwoPhaseFileHeader *hdr;
 	bool		result;
+	FullTransactionId fxid;
 
 	Assert(TransactionIdIsValid(xid));
 
@@ -1472,7 +1457,8 @@ StandbyTransactionIdIsPrepared(TransactionId xid)
 		return false;			/* nothing to do */
 
 	/* Read and validate file */
-	buf = ReadTwoPhaseFile(xid, true);
+	fxid = FullTransactionIdFromAllowableAt(TransamVariables->nextXid, xid);
+	buf = ReadTwoPhaseFile(fxid, true);
 	if (buf == NULL)
 		return false;
 
@@ -1492,6 +1478,7 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
 {
 	GlobalTransaction gxact;
 	PGPROC	   *proc;
+	FullTransactionId fxid;
 	TransactionId xid;
 	bool		ondisk;
 	char	   *buf;
@@ -1513,7 +1500,8 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
 	 */
 	gxact = LockGXact(gid, GetUserId());
 	proc = GetPGProcByNumber(gxact->pgprocno);
-	xid = gxact->xid;
+	fxid = gxact->fxid;
+	xid = XidFromFullTransactionId(fxid);
 
 	/*
 	 * Read and validate 2PC state data. State data will typically be stored
@@ -1521,7 +1509,7 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
 	 * to disk if for some reason they have lived for a long time.
 	 */
 	if (gxact->ondisk)
-		buf = ReadTwoPhaseFile(xid, false);
+		buf = ReadTwoPhaseFile(fxid, false);
 	else
 		XlogReadTwoPhaseData(gxact->prepare_start_lsn, &buf, NULL);
 
@@ -1640,11 +1628,11 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
 
 	/* And now do the callbacks */
 	if (isCommit)
-		ProcessRecords(bufptr, xid, twophase_postcommit_callbacks);
+		ProcessRecords(bufptr, fxid, twophase_postcommit_callbacks);
 	else
-		ProcessRecords(bufptr, xid, twophase_postabort_callbacks);
+		ProcessRecords(bufptr, fxid, twophase_postabort_callbacks);
 
-	PredicateLockTwoPhaseFinish(xid, isCommit);
+	PredicateLockTwoPhaseFinish(fxid, isCommit);
 
 	/*
 	 * Read this value while holding the two-phase lock, as the on-disk 2PC
@@ -1664,17 +1652,9 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
 	/* Count the prepared xact as committed or aborted */
 	AtEOXact_PgStat(isCommit, false);
 
-	/*
-	 * And now we can clean up any files we may have left.  These should be
-	 * from the current epoch.
-	 */
+	/* And now we can clean up any files we may have left. */
 	if (ondisk)
-	{
-		FullTransactionId fxid;
-
-		fxid = FullTransactionIdFromCurrentEpoch(xid);
 		RemoveTwoPhaseFile(fxid, true);
-	}
 
 	MyLockedGxact = NULL;
 
@@ -1687,7 +1667,7 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
  * Scan 2PC state data in memory and call the indicated callbacks for each 2PC record.
  */
 static void
-ProcessRecords(char *bufptr, TransactionId xid,
+ProcessRecords(char *bufptr, FullTransactionId fxid,
 			   const TwoPhaseCallback callbacks[])
 {
 	for (;;)
@@ -1701,14 +1681,14 @@ ProcessRecords(char *bufptr, TransactionId xid,
 		bufptr += MAXALIGN(sizeof(TwoPhaseRecordOnDisk));
 
 		if (callbacks[record->rmid] != NULL)
-			callbacks[record->rmid] (xid, record->info, bufptr, record->len);
+			callbacks[record->rmid] (fxid, record->info, bufptr, record->len);
 
 		bufptr += MAXALIGN(record->len);
 	}
 }
 
 /*
- * Remove the 2PC file for the specified XID.
+ * Remove the 2PC file.
  *
  * If giveWarning is false, do not complain about file-not-present;
  * this is an expected case during WAL replay.
@@ -1737,20 +1717,17 @@ RemoveTwoPhaseFile(FullTransactionId fxid, bool giveWarning)
  * Note: content and len don't include CRC.
  */
 static void
-RecreateTwoPhaseFile(TransactionId xid, void *content, int len)
+RecreateTwoPhaseFile(FullTransactionId fxid, void *content, int len)
 {
 	char		path[MAXPGPATH];
 	pg_crc32c	statefile_crc;
 	int			fd;
-	FullTransactionId fxid;
 
 	/* Recompute CRC */
 	INIT_CRC32C(statefile_crc);
 	COMP_CRC32C(statefile_crc, content, len);
 	FIN_CRC32C(statefile_crc);
 
-	/* Use current epoch */
-	fxid = FullTransactionIdFromCurrentEpoch(xid);
 	TwoPhaseFilePath(path, fxid);
 
 	fd = OpenTransientFile(path,
@@ -1863,7 +1840,7 @@ CheckPointTwoPhase(XLogRecPtr redo_horizon)
 			int			len;
 
 			XlogReadTwoPhaseData(gxact->prepare_start_lsn, &buf, &len);
-			RecreateTwoPhaseFile(gxact->xid, buf, len);
+			RecreateTwoPhaseFile(gxact->fxid, buf, len);
 			gxact->ondisk = true;
 			gxact->prepare_start_lsn = InvalidXLogRecPtr;
 			gxact->prepare_end_lsn = InvalidXLogRecPtr;
@@ -1900,7 +1877,7 @@ CheckPointTwoPhase(XLogRecPtr redo_horizon)
  * This is called once at the beginning of recovery, saving any extra
  * lookups in the future.  Two-phase files that are newer than the
  * minimum XID horizon are discarded on the way.  Two-phase files with
- * an epoch older or newer than the current checkpoint's record epoch
+ * an epoch frozen relative to the current checkpoint's record epoch
  * are also discarded.
  */
 void
@@ -1925,7 +1902,7 @@ restoreTwoPhaseData(void)
 			if (buf == NULL)
 				continue;
 
-			PrepareRedoAdd(buf, InvalidXLogRecPtr,
+			PrepareRedoAdd(fxid, buf, InvalidXLogRecPtr,
 						   InvalidXLogRecPtr, InvalidRepOriginId);
 		}
 	}
@@ -1971,7 +1948,6 @@ PrescanPreparedTransactions(TransactionId **xids_p, int *nxids_p)
 	TransactionId origNextXid = XidFromFullTransactionId(nextXid);
 	TransactionId result = origNextXid;
 	TransactionId *xids = NULL;
-	uint32		epoch = EpochFromFullTransactionId(nextXid);
 	int			nxids = 0;
 	int			allocsize = 0;
 	int			i;
@@ -1979,20 +1955,15 @@ PrescanPreparedTransactions(TransactionId **xids_p, int *nxids_p)
 	LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
 	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
 	{
-		TransactionId xid;
 		FullTransactionId fxid;
+		TransactionId xid;
 		char	   *buf;
 		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
 
 		Assert(gxact->inredo);
 
-		xid = gxact->xid;
-
-		/*
-		 * All two-phase files with past and future epoch in pg_twophase are
-		 * gone at this point, so we're OK to rely on only the current epoch.
-		 */
-		fxid = FullTransactionIdFromEpochAndXid(epoch, xid);
+		fxid = gxact->fxid;
+		xid = XidFromFullTransactionId(fxid);
 		buf = ProcessTwoPhaseBuffer(fxid,
 									gxact->prepare_start_lsn,
 									gxact->ondisk, false, true);
@@ -2055,31 +2026,16 @@ void
 StandbyRecoverPreparedTransactions(void)
 {
 	int			i;
-	uint32		epoch;
-	FullTransactionId nextFullXid;
-
-	/* get current epoch */
-	nextFullXid = ReadNextFullTransactionId();
-	epoch = EpochFromFullTransactionId(nextFullXid);
 
 	LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
 	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
 	{
-		TransactionId xid;
-		FullTransactionId fxid;
 		char	   *buf;
 		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
 
 		Assert(gxact->inredo);
 
-		xid = gxact->xid;
-
-		/*
-		 * At this stage, we're OK to work with the current epoch as all past
-		 * and future files have been already discarded.
-		 */
-		fxid = FullTransactionIdFromEpochAndXid(epoch, xid);
-		buf = ProcessTwoPhaseBuffer(fxid,
+		buf = ProcessTwoPhaseBuffer(gxact->fxid,
 									gxact->prepare_start_lsn,
 									gxact->ondisk, true, false);
 		if (buf != NULL)
@@ -2108,17 +2064,10 @@ void
 RecoverPreparedTransactions(void)
 {
 	int			i;
-	uint32		epoch;
-	FullTransactionId nextFullXid;
-
-	/* get current epoch */
-	nextFullXid = ReadNextFullTransactionId();
-	epoch = EpochFromFullTransactionId(nextFullXid);
 
 	LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
 	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
 	{
-		TransactionId xid;
 		FullTransactionId fxid;
 		char	   *buf;
 		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
@@ -2128,12 +2077,6 @@ RecoverPreparedTransactions(void)
 		const char *gid;
 
 		/*
-		 * At this stage, we're OK to work with the current epoch as all past
-		 * and future files have been already discarded.
-		 */
-		xid = gxact->xid;
-
-		/*
 		 * Reconstruct subtrans state for the transaction --- needed because
 		 * pg_subtrans is not preserved over a restart.  Note that we are
 		 * linking all the subtransactions directly to the top-level XID;
@@ -2142,18 +2085,20 @@ RecoverPreparedTransactions(void)
 		 * SubTransSetParent has been set before, if the prepared transaction
 		 * generated xid assignment records.
 		 */
-		fxid = FullTransactionIdFromEpochAndXid(epoch, xid);
-		buf = ProcessTwoPhaseBuffer(fxid,
+		buf = ProcessTwoPhaseBuffer(gxact->fxid,
 									gxact->prepare_start_lsn,
 									gxact->ondisk, true, false);
 		if (buf == NULL)
 			continue;
 
 		ereport(LOG,
-				(errmsg("recovering prepared transaction %u from shared memory", xid)));
+				(errmsg("recovering prepared transaction %u of epoch %u from shared memory",
+						XidFromFullTransactionId(gxact->fxid),
+						EpochFromFullTransactionId(gxact->fxid))));
 
 		hdr = (TwoPhaseFileHeader *) buf;
-		Assert(TransactionIdEquals(hdr->xid, xid));
+		Assert(TransactionIdEquals(hdr->xid,
+								   XidFromFullTransactionId(gxact->fxid)));
 		bufptr = buf + MAXALIGN(sizeof(TwoPhaseFileHeader));
 		gid = (const char *) bufptr;
 		bufptr += MAXALIGN(hdr->gidlen);
@@ -2169,7 +2114,7 @@ RecoverPreparedTransactions(void)
 		 * Recreate its GXACT and dummy PGPROC. But, check whether it was
 		 * added in redo and already has a shmem entry for it.
 		 */
-		MarkAsPreparingGuts(gxact, xid, gid,
+		MarkAsPreparingGuts(gxact, gxact->fxid, gid,
 							hdr->prepared_at,
 							hdr->owner, hdr->database);
 
@@ -2184,7 +2129,7 @@ RecoverPreparedTransactions(void)
 		/*
 		 * Recover other state (notably locks) using resource managers.
 		 */
-		ProcessRecords(bufptr, xid, twophase_recover_callbacks);
+		ProcessRecords(bufptr, fxid, twophase_recover_callbacks);
 
 		/*
 		 * Release locks held by the standby process after we process each
@@ -2192,7 +2137,7 @@ RecoverPreparedTransactions(void)
 		 * additional locks at any one time.
 		 */
 		if (InHotStandby)
-			StandbyReleaseLockTree(xid, hdr->nsubxacts, subxids);
+			StandbyReleaseLockTree(hdr->xid, hdr->nsubxacts, subxids);
 
 		/*
 		 * We're done with recovering this transaction. Clear MyLockedGxact,
@@ -2226,20 +2171,22 @@ ProcessTwoPhaseBuffer(FullTransactionId fxid,
 					  bool setParent, bool setNextXid)
 {
 	FullTransactionId nextXid = TransamVariables->nextXid;
+	TransactionId xid = XidFromFullTransactionId(fxid);
 	TransactionId *subxids;
 	char	   *buf;
 	TwoPhaseFileHeader *hdr;
 	int			i;
-	TransactionId xid = XidFromFullTransactionId(fxid);
 
+	Assert(InRecovery);
 	Assert(LWLockHeldByMeInMode(TwoPhaseStateLock, LW_EXCLUSIVE));
 
 	if (!fromdisk)
 		Assert(prepare_start_lsn != InvalidXLogRecPtr);
 
 	/*
-	 * Reject full XID if too new.  Note that this discards files from future
-	 * epochs.
+	 * Reject future XIDs and delete their files; WAL will recreate them if
+	 * needed.  This is a normal outcome if the base backup process copied
+	 * twophase files after a long time copying other files.
 	 */
 	if (FullTransactionIdFollowsOrEquals(fxid, nextXid))
 	{
@@ -2255,13 +2202,23 @@ ProcessTwoPhaseBuffer(FullTransactionId fxid,
 			ereport(WARNING,
 					(errmsg("removing future two-phase state from memory for transaction %u",
 							xid)));
-			PrepareRedoRemove(xid, true);
+			PrepareRedoRemoveFull(fxid, true);
 		}
 		return NULL;
 	}
 
-	/* Discard files from past epochs */
-	if (EpochFromFullTransactionId(fxid) < EpochFromFullTransactionId(nextXid))
+	/*
+	 * Reject XIDs for which we've already trimmed clog.  This should not
+	 * happen, because redo begins at a moment in the XID assignments before
+	 * any pg_twophase file that existed during the base backup.
+	 *
+	 * Since AssignTransactionId() ensures subxact XIDs follow the top XID, an
+	 * allowable top XID implies allowable subxact XIDs.
+	 *
+	 * FIXME reimplement not in terms of epochs
+	 */
+	if (EpochFromFullTransactionId(fxid) + 1 <
+		EpochFromFullTransactionId(nextXid))
 	{
 		if (fromdisk)
 		{
@@ -2275,13 +2232,14 @@ ProcessTwoPhaseBuffer(FullTransactionId fxid,
 			ereport(WARNING,
 					(errmsg("removing past two-phase state from memory for transaction %u",
 							xid)));
-			PrepareRedoRemove(xid, true);
+			PrepareRedoRemoveFull(fxid, true);
 		}
 		return NULL;
 	}
 
-	/* Already processed? */
-	if (TransactionIdDidCommit(xid) || TransactionIdDidAbort(xid))
+	/* Reject already-resolved XIDs, deleting their files.  FIXME can this happen? */
+	if (TransactionIdDidCommit(XidFromFullTransactionId(fxid)) ||
+		TransactionIdDidAbort(XidFromFullTransactionId(fxid)))
 	{
 		if (fromdisk)
 		{
@@ -2295,7 +2253,7 @@ ProcessTwoPhaseBuffer(FullTransactionId fxid,
 			ereport(WARNING,
 					(errmsg("removing stale two-phase state from memory for transaction %u",
 							xid)));
-			PrepareRedoRemove(xid, true);
+			PrepareRedoRemoveFull(fxid, true);
 		}
 		return NULL;
 	}
@@ -2303,7 +2261,7 @@ ProcessTwoPhaseBuffer(FullTransactionId fxid,
 	if (fromdisk)
 	{
 		/* Read and validate file */
-		buf = ReadTwoPhaseFile(xid, false);
+		buf = ReadTwoPhaseFile(fxid, false);
 	}
 	else
 	{
@@ -2536,8 +2494,9 @@ RecordTransactionAbortPrepared(TransactionId xid,
  * data, the entry is marked as located on disk.
  */
 void
-PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
-			   XLogRecPtr end_lsn, RepOriginId origin_id)
+PrepareRedoAdd(FullTransactionId fxid, char *buf,
+			   XLogRecPtr start_lsn, XLogRecPtr end_lsn,
+			   RepOriginId origin_id)
 {
 	TwoPhaseFileHeader *hdr = (TwoPhaseFileHeader *) buf;
 	char	   *bufptr;
@@ -2545,7 +2504,11 @@ PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 	GlobalTransaction gxact;
 
 	Assert(LWLockHeldByMeInMode(TwoPhaseStateLock, LW_EXCLUSIVE));
-	Assert(RecoveryInProgress());
+	Assert(InRecovery);
+
+	if (!FullTransactionIdIsValid(fxid))
+		fxid = FullTransactionIdFromAllowableAt(TransamVariables->nextXid,
+												hdr->xid);
 
 	bufptr = buf + MAXALIGN(sizeof(TwoPhaseFileHeader));
 	gid = (const char *) bufptr;
@@ -2574,10 +2537,8 @@ PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 	if (!XLogRecPtrIsInvalid(start_lsn))
 	{
 		char		path[MAXPGPATH];
-		FullTransactionId fxid;
 
-		/* Use current epoch */
-		fxid = FullTransactionIdFromCurrentEpoch(hdr->xid);
+		Assert(InRecovery);
 		TwoPhaseFilePath(path, fxid);
 
 		if (access(path, F_OK) == 0)
@@ -2609,7 +2570,7 @@ PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 	gxact->prepared_at = hdr->prepared_at;
 	gxact->prepare_start_lsn = start_lsn;
 	gxact->prepare_end_lsn = end_lsn;
-	gxact->xid = hdr->xid;
+	gxact->fxid = fxid;
 	gxact->owner = hdr->owner;
 	gxact->locking_backend = INVALID_PROC_NUMBER;
 	gxact->valid = false;
@@ -2628,7 +2589,9 @@ PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 						   false /* backward */ , false /* WAL */ );
 	}
 
-	elog(DEBUG2, "added 2PC data in shared memory for transaction %u", gxact->xid);
+	elog(DEBUG2, "added 2PC data in shared memory for transaction %u of epoch %u",
+		 XidFromFullTransactionId(gxact->fxid),
+		 EpochFromFullTransactionId(gxact->fxid));
 }
 
 /*
@@ -2640,8 +2603,8 @@ PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
  * Caller must hold TwoPhaseStateLock in exclusive mode, because TwoPhaseState
  * is updated.
  */
-void
-PrepareRedoRemove(TransactionId xid, bool giveWarning)
+static void
+PrepareRedoRemoveFull(FullTransactionId fxid, bool giveWarning)
 {
 	GlobalTransaction gxact = NULL;
 	int			i;
@@ -2654,7 +2617,7 @@ PrepareRedoRemove(TransactionId xid, bool giveWarning)
 	{
 		gxact = TwoPhaseState->prepXacts[i];
 
-		if (gxact->xid == xid)
+		if (FullTransactionIdEquals(gxact->fxid, fxid))
 		{
 			Assert(gxact->inredo);
 			found = true;
@@ -2671,20 +2634,25 @@ PrepareRedoRemove(TransactionId xid, bool giveWarning)
 	/*
 	 * And now we can clean up any files we may have left.
 	 */
-	elog(DEBUG2, "removing 2PC data for transaction %u", xid);
-	if (gxact->ondisk)
-	{
-		FullTransactionId fxid;
+	elog(DEBUG2, "removing 2PC data for transaction %u of epoch %u ",
+		 XidFromFullTransactionId(fxid),
+		 EpochFromFullTransactionId(fxid));
 
-		/*
-		 * We should deal with a file at the current epoch here.
-		 */
-		fxid = FullTransactionIdFromCurrentEpoch(xid);
+	if (gxact->ondisk)
 		RemoveTwoPhaseFile(fxid, giveWarning);
-	}
+
 	RemoveGXact(gxact);
 }
 
+void
+PrepareRedoRemove(TransactionId xid, bool giveWarning)
+{
+	FullTransactionId fxid =
+		FullTransactionIdFromAllowableAt(TransamVariables->nextXid, xid);
+
+	PrepareRedoRemoveFull(fxid, giveWarning);
+}
+
 /*
  * LookupGXact
  *		Check if the prepared transaction with the given GID, lsn and timestamp
@@ -2729,7 +2697,7 @@ LookupGXact(const char *gid, XLogRecPtr prepare_end_lsn,
 			 * between publisher and subscriber.
 			 */
 			if (gxact->ondisk)
-				buf = ReadTwoPhaseFile(gxact->xid, false);
+				buf = ReadTwoPhaseFile(gxact->fxid, false);
 			else
 			{
 				Assert(gxact->prepare_start_lsn);
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index d331ab9..8cd0c8b 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2512,7 +2512,7 @@ static void
 PrepareTransaction(void)
 {
 	TransactionState s = CurrentTransactionState;
-	TransactionId xid = GetCurrentTransactionId();
+	FullTransactionId fxid = GetCurrentFullTransactionId();
 	GlobalTransaction gxact;
 	TimestampTz prepared_at;
 
@@ -2641,7 +2641,7 @@ PrepareTransaction(void)
 	 * Reserve the GID for this transaction. This could fail if the requested
 	 * GID is invalid or already in use.
 	 */
-	gxact = MarkAsPreparing(xid, prepareGID, prepared_at,
+	gxact = MarkAsPreparing(fxid, prepareGID, prepared_at,
 							GetUserId(), MyDatabaseId);
 	prepareGID = NULL;
 
@@ -2691,7 +2691,7 @@ PrepareTransaction(void)
 	 * ProcArrayClearTransaction().  Otherwise, a GetLockConflicts() would
 	 * conclude "xact already committed or aborted" for our locks.
 	 */
-	PostPrepare_Locks(xid);
+	PostPrepare_Locks(fxid);
 
 	/*
 	 * Let others know about no transaction in progress by me.  This has to be
@@ -2733,9 +2733,9 @@ PrepareTransaction(void)
 
 	PostPrepare_smgr();
 
-	PostPrepare_MultiXact(xid);
+	PostPrepare_MultiXact(fxid);
 
-	PostPrepare_PredicateLocks(xid);
+	PostPrepare_PredicateLocks(fxid);
 
 	ResourceOwnerRelease(TopTransactionResourceOwner,
 						 RESOURCE_RELEASE_LOCKS,
@@ -6408,7 +6408,8 @@ xact_redo(XLogReaderState *record)
 		 * gxact entry.
 		 */
 		LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
-		PrepareRedoAdd(XLogRecGetData(record),
+		PrepareRedoAdd(InvalidFullTransactionId,
+					   XLogRecGetData(record),
 					   record->ReadRecPtr,
 					   record->EndRecPtr,
 					   XLogRecGetOrigin(record));
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 3596af0..91b6a91 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -2166,28 +2166,14 @@ RestoreBlockImage(XLogReaderState *record, uint8 block_id, char *page)
 FullTransactionId
 XLogRecGetFullXid(XLogReaderState *record)
 {
-	TransactionId xid,
-				next_xid;
-	uint32		epoch;
-
 	/*
 	 * This function is only safe during replay, because it depends on the
 	 * replay state.  See AdvanceNextFullTransactionIdPastXid() for more.
 	 */
 	Assert(AmStartupProcess() || !IsUnderPostmaster);
 
-	xid = XLogRecGetXid(record);
-	next_xid = XidFromFullTransactionId(TransamVariables->nextXid);
-	epoch = EpochFromFullTransactionId(TransamVariables->nextXid);
-
-	/*
-	 * If xid is numerically greater than next_xid, it has to be from the last
-	 * epoch.
-	 */
-	if (unlikely(xid > next_xid))
-		--epoch;
-
-	return FullTransactionIdFromEpochAndXid(epoch, xid);
+	return FullTransactionIdFromAllowableAt(TransamVariables->nextXid,
+											XLogRecGetXid(record));
 }
 
 #endif
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 3e2f98b..e9bac94 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -3476,9 +3476,9 @@ AtPrepare_Locks(void)
  * but that probably costs more cycles.
  */
 void
-PostPrepare_Locks(TransactionId xid)
+PostPrepare_Locks(FullTransactionId fxid)
 {
-	PGPROC	   *newproc = TwoPhaseGetDummyProc(xid, false);
+	PGPROC	   *newproc = TwoPhaseGetDummyProc(fxid, false);
 	HASH_SEQ_STATUS status;
 	LOCALLOCK  *locallock;
 	LOCK	   *lock;
@@ -4261,11 +4261,11 @@ DumpAllLocks(void)
  * and PANIC anyway.
  */
 void
-lock_twophase_recover(TransactionId xid, uint16 info,
+lock_twophase_recover(FullTransactionId fxid, uint16 info,
 					  void *recdata, uint32 len)
 {
 	TwoPhaseLockRecord *rec = (TwoPhaseLockRecord *) recdata;
-	PGPROC	   *proc = TwoPhaseGetDummyProc(xid, false);
+	PGPROC	   *proc = TwoPhaseGetDummyProc(fxid, false);
 	LOCKTAG    *locktag;
 	LOCKMODE	lockmode;
 	LOCKMETHODID lockmethodid;
@@ -4442,7 +4442,7 @@ lock_twophase_recover(TransactionId xid, uint16 info,
  * starting up into hot standby mode.
  */
 void
-lock_twophase_standby_recover(TransactionId xid, uint16 info,
+lock_twophase_standby_recover(FullTransactionId fxid, uint16 info,
 							  void *recdata, uint32 len)
 {
 	TwoPhaseLockRecord *rec = (TwoPhaseLockRecord *) recdata;
@@ -4461,7 +4461,7 @@ lock_twophase_standby_recover(TransactionId xid, uint16 info,
 	if (lockmode == AccessExclusiveLock &&
 		locktag->locktag_type == LOCKTAG_RELATION)
 	{
-		StandbyAcquireAccessExclusiveLock(xid,
+		StandbyAcquireAccessExclusiveLock(XidFromFullTransactionId(fxid),
 										  locktag->locktag_field1 /* dboid */ ,
 										  locktag->locktag_field2 /* reloid */ );
 	}
@@ -4474,11 +4474,11 @@ lock_twophase_standby_recover(TransactionId xid, uint16 info,
  * Find and release the lock indicated by the 2PC record.
  */
 void
-lock_twophase_postcommit(TransactionId xid, uint16 info,
+lock_twophase_postcommit(FullTransactionId fxid, uint16 info,
 						 void *recdata, uint32 len)
 {
 	TwoPhaseLockRecord *rec = (TwoPhaseLockRecord *) recdata;
-	PGPROC	   *proc = TwoPhaseGetDummyProc(xid, true);
+	PGPROC	   *proc = TwoPhaseGetDummyProc(fxid, true);
 	LOCKTAG    *locktag;
 	LOCKMETHODID lockmethodid;
 	LockMethod	lockMethodTable;
@@ -4500,10 +4500,10 @@ lock_twophase_postcommit(TransactionId xid, uint16 info,
  * This is actually just the same as the COMMIT case.
  */
 void
-lock_twophase_postabort(TransactionId xid, uint16 info,
+lock_twophase_postabort(FullTransactionId fxid, uint16 info,
 						void *recdata, uint32 len)
 {
-	lock_twophase_postcommit(xid, info, recdata, len);
+	lock_twophase_postcommit(fxid, info, recdata, len);
 }
 
 /*
diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index 5b21a05..928647d 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -191,7 +191,7 @@
  *		AtPrepare_PredicateLocks(void);
  *		PostPrepare_PredicateLocks(TransactionId xid);
  *		PredicateLockTwoPhaseFinish(TransactionId xid, bool isCommit);
- *		predicatelock_twophase_recover(TransactionId xid, uint16 info,
+ *		predicatelock_twophase_recover(FullTransactionId fxid, uint16 info,
  *									   void *recdata, uint32 len);
  */
 
@@ -4846,7 +4846,7 @@ AtPrepare_PredicateLocks(void)
  *		anyway. We only need to clean up our local state.
  */
 void
-PostPrepare_PredicateLocks(TransactionId xid)
+PostPrepare_PredicateLocks(FullTransactionId fxid)
 {
 	if (MySerializableXact == InvalidSerializableXact)
 		return;
@@ -4869,12 +4869,12 @@ PostPrepare_PredicateLocks(TransactionId xid)
  *		commits or aborts.
  */
 void
-PredicateLockTwoPhaseFinish(TransactionId xid, bool isCommit)
+PredicateLockTwoPhaseFinish(FullTransactionId fxid, bool isCommit)
 {
 	SERIALIZABLEXID *sxid;
 	SERIALIZABLEXIDTAG sxidtag;
 
-	sxidtag.xid = xid;
+	sxidtag.xid = XidFromFullTransactionId(fxid);
 
 	LWLockAcquire(SerializableXactHashLock, LW_SHARED);
 	sxid = (SERIALIZABLEXID *)
@@ -4896,10 +4896,11 @@ PredicateLockTwoPhaseFinish(TransactionId xid, bool isCommit)
  * Re-acquire a predicate lock belonging to a transaction that was prepared.
  */
 void
-predicatelock_twophase_recover(TransactionId xid, uint16 info,
+predicatelock_twophase_recover(FullTransactionId fxid, uint16 info,
 							   void *recdata, uint32 len)
 {
 	TwoPhasePredicateRecord *record;
+	TransactionId xid = XidFromFullTransactionId(fxid);
 
 	Assert(len == sizeof(TwoPhasePredicateRecord));
 
diff --git a/src/backend/utils/activity/pgstat_relation.c b/src/backend/utils/activity/pgstat_relation.c
index 09247ba..8f5945a 100644
--- a/src/backend/utils/activity/pgstat_relation.c
+++ b/src/backend/utils/activity/pgstat_relation.c
@@ -731,7 +731,7 @@ PostPrepare_PgStat_Relations(PgStat_SubXactStatus *xact_state)
  * Load the saved counts into our local pgstats state.
  */
 void
-pgstat_twophase_postcommit(TransactionId xid, uint16 info,
+pgstat_twophase_postcommit(FullTransactionId fxid, uint16 info,
 						   void *recdata, uint32 len)
 {
 	TwoPhasePgStatRecord *rec = (TwoPhasePgStatRecord *) recdata;
@@ -767,7 +767,7 @@ pgstat_twophase_postcommit(TransactionId xid, uint16 info,
  * as aborted.
  */
 void
-pgstat_twophase_postabort(TransactionId xid, uint16 info,
+pgstat_twophase_postabort(FullTransactionId fxid, uint16 info,
 						  void *recdata, uint32 len)
 {
 	TwoPhasePgStatRecord *rec = (TwoPhasePgStatRecord *) recdata;
diff --git a/src/backend/utils/adt/xid8funcs.c b/src/backend/utils/adt/xid8funcs.c
index 4736755..20b28b2 100644
--- a/src/backend/utils/adt/xid8funcs.c
+++ b/src/backend/utils/adt/xid8funcs.c
@@ -97,15 +97,11 @@ static bool
 TransactionIdInRecentPast(FullTransactionId fxid, TransactionId *extracted_xid)
 {
 	TransactionId xid = XidFromFullTransactionId(fxid);
-	uint32		now_epoch;
-	TransactionId now_epoch_next_xid;
 	FullTransactionId now_fullxid;
-	TransactionId oldest_xid;
-	FullTransactionId oldest_fxid;
+	TransactionId oldest_clog_xid;
+	FullTransactionId oldest_clog_fxid;
 
 	now_fullxid = ReadNextFullTransactionId();
-	now_epoch_next_xid = XidFromFullTransactionId(now_fullxid);
-	now_epoch = EpochFromFullTransactionId(now_fullxid);
 
 	if (extracted_xid != NULL)
 		*extracted_xid = xid;
@@ -135,52 +131,19 @@ TransactionIdInRecentPast(FullTransactionId fxid, TransactionId *extracted_xid)
 
 	/*
 	 * If fxid is not older than TransamVariables->oldestClogXid, the relevant
-	 * CLOG entry is guaranteed to still exist.  Convert
-	 * TransamVariables->oldestClogXid into a FullTransactionId to compare it
-	 * with fxid.  Determine the right epoch knowing that oldest_fxid
-	 * shouldn't be more than 2^31 older than now_fullxid.
-	 */
-	oldest_xid = TransamVariables->oldestClogXid;
-	Assert(TransactionIdPrecedesOrEquals(oldest_xid, now_epoch_next_xid));
-	if (oldest_xid <= now_epoch_next_xid)
-	{
-		oldest_fxid = FullTransactionIdFromEpochAndXid(now_epoch, oldest_xid);
-	}
-	else
-	{
-		Assert(now_epoch > 0);
-		oldest_fxid = FullTransactionIdFromEpochAndXid(now_epoch - 1, oldest_xid);
-	}
-	return !FullTransactionIdPrecedes(fxid, oldest_fxid);
-}
-
-/*
- * Convert a TransactionId obtained from a snapshot held by the caller to a
- * FullTransactionId.  Use next_fxid as a reference FullTransactionId, so that
- * we can compute the high order bits.  It must have been obtained by the
- * caller with ReadNextFullTransactionId() after the snapshot was created.
- */
-static FullTransactionId
-widen_snapshot_xid(TransactionId xid, FullTransactionId next_fxid)
-{
-	TransactionId next_xid = XidFromFullTransactionId(next_fxid);
-	uint32		epoch = EpochFromFullTransactionId(next_fxid);
-
-	/* Special transaction ID. */
-	if (!TransactionIdIsNormal(xid))
-		return FullTransactionIdFromEpochAndXid(0, xid);
-
-	/*
-	 * The 64 bit result must be <= next_fxid, since next_fxid hadn't been
-	 * issued yet when the snapshot was created.  Every TransactionId in the
-	 * snapshot must therefore be from the same epoch as next_fxid, or the
-	 * epoch before.  We know this because next_fxid is never allow to get
-	 * more than one epoch ahead of the TransactionIds in any snapshot.
+	 * CLOG entry is guaranteed to still exist.
+	 *
+	 * TransamVariables->oldestXid governs allowable XIDs.  Usually,
+	 * oldestClogXid==oldestXid.  It's also possible for oldestClogXid to
+	 * follow oldestXid, in which case oldestXid might advance after our
+	 * ReadNextFullTransactionId() call.  If oldestXid has advanced, that
+	 * advancement reinstated the usual oldestClogXid==oldestXid.  Whether or
+	 * not that happened, oldestClogXid is allowable relative to now_fullxid.
 	 */
-	if (xid > next_xid)
-		epoch--;
-
-	return FullTransactionIdFromEpochAndXid(epoch, xid);
+	oldest_clog_xid = TransamVariables->oldestClogXid;
+	oldest_clog_fxid =
+		FullTransactionIdFromAllowableAt(now_fullxid, oldest_clog_xid);
+	return !FullTransactionIdPrecedes(fxid, oldest_clog_fxid);
 }
 
 /*
@@ -420,12 +383,18 @@ pg_current_snapshot(PG_FUNCTION_ARGS)
 	nxip = cur->xcnt;
 	snap = palloc(PG_SNAPSHOT_SIZE(nxip));
 
-	/* fill */
-	snap->xmin = widen_snapshot_xid(cur->xmin, next_fxid);
-	snap->xmax = widen_snapshot_xid(cur->xmax, next_fxid);
+	/*
+	 * Fill.  This is the current backend's active snapshot, so MyProc->xmin
+	 * is <= all these XIDs.  As long as that remains so, oldestXid can't
+	 * advance past any of these XIDs.  Hence, these XIDs remain allowable
+	 * relative to next_fxid.
+	 */
+	snap->xmin = FullTransactionIdFromAllowableAt(next_fxid, cur->xmin);
+	snap->xmax = FullTransactionIdFromAllowableAt(next_fxid, cur->xmax);
 	snap->nxip = nxip;
 	for (i = 0; i < nxip; i++)
-		snap->xip[i] = widen_snapshot_xid(cur->xip[i], next_fxid);
+		snap->xip[i] =
+			FullTransactionIdFromAllowableAt(next_fxid, cur->xip[i]);
 
 	/*
 	 * We want them guaranteed to be in ascending order.  This also removes
diff --git a/src/include/access/multixact.h b/src/include/access/multixact.h
index 4e6b0ee..b876e98 100644
--- a/src/include/access/multixact.h
+++ b/src/include/access/multixact.h
@@ -11,6 +11,7 @@
 #ifndef MULTIXACT_H
 #define MULTIXACT_H
 
+#include "access/transam.h"
 #include "access/xlogreader.h"
 #include "lib/stringinfo.h"
 #include "storage/sync.h"
@@ -119,7 +120,7 @@ extern int	multixactmemberssyncfiletag(const FileTag *ftag, char *path);
 
 extern void AtEOXact_MultiXact(void);
 extern void AtPrepare_MultiXact(void);
-extern void PostPrepare_MultiXact(TransactionId xid);
+extern void PostPrepare_MultiXact(FullTransactionId fxid);
 
 extern Size MultiXactShmemSize(void);
 extern void MultiXactShmemInit(void);
@@ -145,11 +146,11 @@ extern void MultiXactAdvanceNextMXact(MultiXactId minMulti,
 extern void MultiXactAdvanceOldest(MultiXactId oldestMulti, Oid oldestMultiDB);
 extern int	MultiXactMemberFreezeThreshold(void);
 
-extern void multixact_twophase_recover(TransactionId xid, uint16 info,
+extern void multixact_twophase_recover(FullTransactionId fxid, uint16 info,
 									   void *recdata, uint32 len);
-extern void multixact_twophase_postcommit(TransactionId xid, uint16 info,
+extern void multixact_twophase_postcommit(FullTransactionId fxid, uint16 info,
 										  void *recdata, uint32 len);
-extern void multixact_twophase_postabort(TransactionId xid, uint16 info,
+extern void multixact_twophase_postabort(FullTransactionId fxid, uint16 info,
 										 void *recdata, uint32 len);
 
 extern void multixact_redo(XLogReaderState *record);
diff --git a/src/include/access/transam.h b/src/include/access/transam.h
index 0cab865..604c5d0 100644
--- a/src/include/access/transam.h
+++ b/src/include/access/transam.h
@@ -370,6 +370,43 @@ FullTransactionIdNewer(FullTransactionId a, FullTransactionId b)
 	return b;
 }
 
+/*
+ * Compute FullTransactionId for the given TransactionId, assuming xid was
+ * between [oldestXid, nextXid] at the time when TransamVariables->nextXid was
+ * nextFullXid.  When adding calls, evaluate what prevents xid from preceding
+ * oldestXid if SetTransactionIdLimit() runs between the collection of xid and
+ * the collection of nextFullXid.
+ */
+static inline FullTransactionId
+FullTransactionIdFromAllowableAt(FullTransactionId nextFullXid,
+								 TransactionId xid)
+{
+	uint32		epoch;
+
+	/* Special transaction ID. */
+	if (!TransactionIdIsNormal(xid))
+		return FullTransactionIdFromEpochAndXid(0, xid);
+
+	Assert(TransactionIdPrecedesOrEquals(xid,
+										 XidFromFullTransactionId(nextFullXid)));
+
+	/*
+	 * The 64 bit result must be <= nextFullXid, since nextFullXid hadn't been
+	 * issued yet when xid was in the past.  The xid must therefore be from
+	 * the epoch of nextFullXid or the epoch before.  We know this because we
+	 * must remove (by freezing) an XID before assigning the XID half an epoch
+	 * ahead of it.
+	 */
+	epoch = EpochFromFullTransactionId(nextFullXid);
+	if (xid > XidFromFullTransactionId(nextFullXid))
+	{
+		Assert(epoch != 0);
+		epoch--;
+	}
+
+	return FullTransactionIdFromEpochAndXid(epoch, xid);
+}
+
 #endif							/* FRONTEND */
 
 #endif							/* TRANSAM_H */
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 9fa8235..0ab8b3e 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -14,6 +14,7 @@
 #ifndef TWOPHASE_H
 #define TWOPHASE_H
 
+#include "access/transam.h"
 #include "access/xact.h"
 #include "access/xlogdefs.h"
 #include "datatype/timestamp.h"
@@ -36,10 +37,10 @@ extern void PostPrepare_Twophase(void);
 
 extern TransactionId TwoPhaseGetXidByVirtualXID(VirtualTransactionId vxid,
 												bool *have_more);
-extern PGPROC *TwoPhaseGetDummyProc(TransactionId xid, bool lock_held);
-extern int	TwoPhaseGetDummyProcNumber(TransactionId xid, bool lock_held);
+extern PGPROC *TwoPhaseGetDummyProc(FullTransactionId fxid, bool lock_held);
+extern int	TwoPhaseGetDummyProcNumber(FullTransactionId fxid, bool lock_held);
 
-extern GlobalTransaction MarkAsPreparing(TransactionId xid, const char *gid,
+extern GlobalTransaction MarkAsPreparing(FullTransactionId fxid, const char *gid,
 										 TimestampTz prepared_at,
 										 Oid owner, Oid databaseid);
 
@@ -56,8 +57,9 @@ extern void CheckPointTwoPhase(XLogRecPtr redo_horizon);
 
 extern void FinishPreparedTransaction(const char *gid, bool isCommit);
 
-extern void PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
-						   XLogRecPtr end_lsn, RepOriginId origin_id);
+extern void PrepareRedoAdd(FullTransactionId fxid, char *buf,
+						   XLogRecPtr start_lsn, XLogRecPtr end_lsn,
+						   RepOriginId origin_id);
 extern void PrepareRedoRemove(TransactionId xid, bool giveWarning);
 extern void restoreTwoPhaseData(void);
 extern bool LookupGXact(const char *gid, XLogRecPtr prepare_end_lsn,
diff --git a/src/include/access/twophase_rmgr.h b/src/include/access/twophase_rmgr.h
index 3ed154b..8f57640 100644
--- a/src/include/access/twophase_rmgr.h
+++ b/src/include/access/twophase_rmgr.h
@@ -14,7 +14,9 @@
 #ifndef TWOPHASE_RMGR_H
 #define TWOPHASE_RMGR_H
 
-typedef void (*TwoPhaseCallback) (TransactionId xid, uint16 info,
+#include "access/transam.h"
+
+typedef void (*TwoPhaseCallback) (FullTransactionId fxid, uint16 info,
 								  void *recdata, uint32 len);
 typedef uint8 TwoPhaseRmgrId;
 
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 6475889..a8c09de 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -734,9 +734,9 @@ extern void pgstat_count_heap_delete(Relation rel);
 extern void pgstat_count_truncate(Relation rel);
 extern void pgstat_update_heap_dead_tuples(Relation rel, int delta);
 
-extern void pgstat_twophase_postcommit(TransactionId xid, uint16 info,
+extern void pgstat_twophase_postcommit(FullTransactionId fxid, uint16 info,
 									   void *recdata, uint32 len);
-extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
+extern void pgstat_twophase_postabort(FullTransactionId fxid, uint16 info,
 									  void *recdata, uint32 len);
 
 extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry(Oid relid);
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index 1076995..3feedfc 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -18,6 +18,7 @@
 #error "lock.h may not be included from frontend code"
 #endif
 
+#include "access/transam.h"
 #include "lib/ilist.h"
 #include "storage/lockdefs.h"
 #include "storage/lwlock.h"
@@ -579,7 +580,7 @@ extern bool LockHasWaiters(const LOCKTAG *locktag,
 extern VirtualTransactionId *GetLockConflicts(const LOCKTAG *locktag,
 											  LOCKMODE lockmode, int *countp);
 extern void AtPrepare_Locks(void);
-extern void PostPrepare_Locks(TransactionId xid);
+extern void PostPrepare_Locks(FullTransactionId fxid);
 extern bool LockCheckConflicts(LockMethod lockMethodTable,
 							   LOCKMODE lockmode,
 							   LOCK *lock, PROCLOCK *proclock);
@@ -594,13 +595,13 @@ extern BlockedProcsData *GetBlockerStatusData(int blocked_pid);
 extern xl_standby_lock *GetRunningTransactionLocks(int *nlocks);
 extern const char *GetLockmodeName(LOCKMETHODID lockmethodid, LOCKMODE mode);
 
-extern void lock_twophase_recover(TransactionId xid, uint16 info,
+extern void lock_twophase_recover(FullTransactionId fxid, uint16 info,
 								  void *recdata, uint32 len);
-extern void lock_twophase_postcommit(TransactionId xid, uint16 info,
+extern void lock_twophase_postcommit(FullTransactionId fxid, uint16 info,
 									 void *recdata, uint32 len);
-extern void lock_twophase_postabort(TransactionId xid, uint16 info,
+extern void lock_twophase_postabort(FullTransactionId fxid, uint16 info,
 									void *recdata, uint32 len);
-extern void lock_twophase_standby_recover(TransactionId xid, uint16 info,
+extern void lock_twophase_standby_recover(FullTransactionId fxid, uint16 info,
 										  void *recdata, uint32 len);
 
 extern DeadLockState DeadLockCheck(PGPROC *proc);
diff --git a/src/include/storage/predicate.h b/src/include/storage/predicate.h
index 267d5d9..4d3f218 100644
--- a/src/include/storage/predicate.h
+++ b/src/include/storage/predicate.h
@@ -14,6 +14,7 @@
 #ifndef PREDICATE_H
 #define PREDICATE_H
 
+#include "access/transam.h"
 #include "storage/itemptr.h"
 #include "storage/lock.h"
 #include "utils/relcache.h"
@@ -72,9 +73,9 @@ extern void PreCommit_CheckForSerializationFailure(void);
 
 /* two-phase commit support */
 extern void AtPrepare_PredicateLocks(void);
-extern void PostPrepare_PredicateLocks(TransactionId xid);
-extern void PredicateLockTwoPhaseFinish(TransactionId xid, bool isCommit);
-extern void predicatelock_twophase_recover(TransactionId xid, uint16 info,
+extern void PostPrepare_PredicateLocks(FullTransactionId fxid);
+extern void PredicateLockTwoPhaseFinish(FullTransactionId xid, bool isCommit);
+extern void predicatelock_twophase_recover(FullTransactionId fxid, uint16 info,
 										   void *recdata, uint32 len);
 
 /* parallel query support */
#15Noah Misch
noah@leadboat.com
In reply to: Noah Misch (#14)
Re: An improvement of ProcessTwoPhaseBuffer logic

On Thu, Jan 16, 2025 at 12:52:54PM -0800, Noah Misch wrote:

On Thu, Jan 16, 2025 at 04:50:09PM +0900, Michael Paquier wrote:

As far as I understand, the most important point of the logic is to
detect and discard the future files first in restoreTwoPhaseData() ->
ProcessTwoPhaseBuffer() when scanning the contents of pg_twophase at
the beginning of recovery. Once this filtering is done, it should be
safe to use your FullTransactionIdFromAllowableAt() when doing
the fxid <-> xid transitions between the records and the files on disk
flushed by a restartpoint which store an XID, and the shmem state of
GlobalTransactionData with a fxid.

(I did not expect that a function called restoreTwoPhaseData() would run before
a function called PrescanPreparedTransactions(), but so it is.)

How is it that restoreTwoPhaseData() -> ProcessTwoPhaseBuffer() can safely
call TransactionIdDidAbort() when we've not replayed WAL to make CLOG
consistent? What can it assume about the value of TransamVariables->nextXid
at that early time?

I think it can't use CLOG safely. I regret answering my own question now as
opposed to refraining from asking it, but hopefully this will save you time.

Since EndPrepare() flushes the XLOG_XACT_PREPARE record before anything writes
to pg_twophase, presence of a pg_twophase file tells us its record once
existed in the flushed WAL stream. Even before we've started the server on a
base backup, each pg_twophase file pertains to an XLOG_XACT_PREPARE LSN no
later than the end-of-backup checkpoint. If a later file exists, the backup
copied a file after the end-of-backup checkpoint started, which is a violation
of the base backup spec.

Elsewhere in recovery, we have the principle that data directory content means
little until we reach a consistent state by replaying the end-of-backup
checkpoint or by crash recovery reaching the end of WAL. For twophase, that
calls for ignoring pg_twophase content. If a restartpoint has a pg_twophase
file to write, we should allow that to overwrite an old file iff we've not
reached consistency. (We must not read the old pg_twophase file, which may
contain an unfinished write.)

By the time we reach consistency, every file in pg_twophase will be applicable
(not committed or aborted). Each file either predates the start-of-backup
checkpoint (or crash recovery equivalent), or recovery wrote that file. We
need readdir(pg_twophase) only at end of recovery, when we need
_twophase_recover callbacks to restore enough state to accept writes. (During
hot standby, state from XLOG_RUNNING_XACTS suffices[1]Incidentally, the comment lock_twophase_standby_recover incorrectly claims it runs "when starting up into hot standby mode." for the actions
possible in that mode.)

In other words, the root problem that led to commits e358425 and 7e125b2 was
recovery interpreting pg_twophase file content before reaching consistency.
We can't make the ProcessTwoPhaseBuffer() checks safe before reaching
consistency. Is that right? Once ProcessTwoPhaseBuffer() stops running
before consistency, it won't strictly need the range checks. We may as well
have something like those range checks as a best-effort way to detect base
backup spec violations.

Thanks,
nm

[1]: Incidentally, the comment lock_twophase_standby_recover incorrectly claims it runs "when starting up into hot standby mode."
it runs "when starting up into hot standby mode."

#16Michael Paquier
michael@paquier.xyz
In reply to: Noah Misch (#15)
Re: An improvement of ProcessTwoPhaseBuffer logic

On Thu, Jan 16, 2025 at 04:52:21PM -0800, Noah Misch wrote:

In other words, the root problem that led to commits e358425 and 7e125b2 was
recovery interpreting pg_twophase file content before reaching consistency.
We can't make the ProcessTwoPhaseBuffer() checks safe before reaching
consistency. Is that right? Once ProcessTwoPhaseBuffer() stops running
before consistency, it won't strictly need the range checks. We may as well
have something like those range checks as a best-effort way to detect base
backup spec violations.

Oh, yeah, it seems like you have a very good point here.
ProcessTwoPhaseBuffer() is an effort to use a unique path for the
handling of this 2PC data, and we cannot do that safely at this stage.
The problem is restoreTwoPhaseData(), which is called very early in
the recovery process, while all the other code paths calling
ProcessTwoPhaseBuffer() are done when we're doing reading WAL after
we've reached consistency as far as I can see. This is a wrong
concept since 728bd991c3c4. 5a1dfde8334b has made that much harder to
think about, while not integrating fully things. It also looks like
we may be able to just get rid of the CLOG checks entirely if we
overwrite existing files as long as we are not in a consistent state,
as you say. Hmm.

I agree that e358425 and 7e125b2 are weird attempts that just try to
work around the real issue, and that they're making some cases wrong
while on it with potential file removals.

+typedef void (*TwoPhaseCallback) (FullTransactionId fxid, uint16 info,
void *recdata, uint32 len);

Based on your latest patch, I doubt that we'll be able to do any of
that in the back-branches. That's much nicer in the long term to show
what this code relies on.

I've been thinking about the strategy to use here, and here are some
rough steps:
- Let's first revert e358425 and 7e125b2. This indeed needs to be
reworked, and there is a release coming by.
- Your refactoring around xid8funcs.c is a good idea on its own. This
can be an independent patch.
- Let's figure out how much surgery is required with a focus on HEAD
for now (I'm feeling that more non-backpatchable pieces are going to
be needed here), then extract the relevant pieces that could be
backpatchable. Hard to say if this is going to be doable at this
stage, or even worth doing, but it will be possible to dig into the
details once we're happy with the state of HEAD. My first intuition
is that the risk of touching the back-branches may not be worth the
potential reward based on the lack of field complaints (not seen
customer cases, tbh): high risk, low reward.

Another point to consider is if we'd better switch 2PC files to store
a fxid rather than a xid.. Perhaps that's not necessary, but if we're
using FullTransactionIds across the full stack of twophase.c that
could make the whole better with even less xid <-> fxid translations
in all these paths. There is always the counter-argument of the extra
4 bytes in the 2PC files and the records, though.
--
Michael

#17Noah Misch
noah@leadboat.com
In reply to: Michael Paquier (#16)
Re: An improvement of ProcessTwoPhaseBuffer logic

On Fri, Jan 17, 2025 at 11:04:03AM +0900, Michael Paquier wrote:

On Thu, Jan 16, 2025 at 04:52:21PM -0800, Noah Misch wrote:

In other words, the root problem that led to commits e358425 and 7e125b2 was
recovery interpreting pg_twophase file content before reaching consistency.
We can't make the ProcessTwoPhaseBuffer() checks safe before reaching
consistency. Is that right? Once ProcessTwoPhaseBuffer() stops running
before consistency, it won't strictly need the range checks. We may as well
have something like those range checks as a best-effort way to detect base
backup spec violations.

Oh, yeah, it seems like you have a very good point here.
ProcessTwoPhaseBuffer() is an effort to use a unique path for the
handling of this 2PC data, and we cannot do that safely at this stage.
The problem is restoreTwoPhaseData(), which is called very early in

Agreed.

the recovery process, while all the other code paths calling
ProcessTwoPhaseBuffer() are done when we're doing reading WAL after
we've reached consistency as far as I can see. This is a wrong
concept since 728bd991c3c4. 5a1dfde8334b has made that much harder to
think about, while not integrating fully things. It also looks like
we may be able to just get rid of the CLOG checks entirely if we
overwrite existing files as long as we are not in a consistent state,
as you say. Hmm.

I agree that e358425 and 7e125b2 are weird attempts that just try to
work around the real issue, and that they're making some cases wrong
while on it with potential file removals.

+typedef void (*TwoPhaseCallback) (FullTransactionId fxid, uint16 info,
void *recdata, uint32 len);

Based on your latest patch, I doubt that we'll be able to do any of
that in the back-branches. That's much nicer in the long term to show
what this code relies on.

The twophase.c bits of that patch turned out to be orthogonal to the key
issue, so that's fine. I suspect the signature changes would be okay to
back-patch to v17, subject to PGXN confirmation of no extension using those
APIs. (The TwoPhaseCallback list is hard-coded; extensions can't add
callbacks.) That said, the rationale for back-patching fxid stuff is gone.

I've been thinking about the strategy to use here, and here are some
rough steps:
- Let's first revert e358425 and 7e125b2. This indeed needs to be
reworked, and there is a release coming by.

+1. e358425 is borderline, but it's hard to rule out ways of it making
recovery unlink pg_twophase files improperly. Released code may also be doing
that, but that code's seven years of history suggests it does so infrequently
if at all. We don't have as much evidence about what the frequency would be
under e358425.

- Your refactoring around xid8funcs.c is a good idea on its own. This
can be an independent patch.

I got here on a yak shave for postgr.es/m/20250111214454.9a.nmisch@google.com.
I expect that project will still want FullTransactionIdFromAllowableAt(). If
so, I'll include it in that thread's patch series.

- Let's figure out how much surgery is required with a focus on HEAD
for now (I'm feeling that more non-backpatchable pieces are going to
be needed here), then extract the relevant pieces that could be
backpatchable. Hard to say if this is going to be doable at this
stage, or even worth doing, but it will be possible to dig into the
details once we're happy with the state of HEAD. My first intuition
is that the risk of touching the back-branches may not be worth the
potential reward based on the lack of field complaints (not seen
customer cases, tbh): high risk, low reward.

Let's see how it turns out. My first intuition would be to target a
back-patch, because a corner-case PITR failure is the kind of thing that can
go unreported and yet be hard to forgive.

I'm not confident I could fix both that other thread's data loss bug and
$SUBJECT in time for the 2024-02 releases. (By $SUBJECT, I mean the
seven-year-old bug of interpreting pg_twophase before recovery consistency,
which has a higher chance of causing spurious recovery failure starting in
v17.) Would your or someone else like to fix $SUBJECT, before or after
2024-02 releases? I'd make time to review a patch.

Another point to consider is if we'd better switch 2PC files to store
a fxid rather than a xid.. Perhaps that's not necessary, but if we're
using FullTransactionIds across the full stack of twophase.c that
could make the whole better with even less xid <-> fxid translations
in all these paths. There is always the counter-argument of the extra
4 bytes in the 2PC files and the records, though.

Yes, perhaps so. It could also be an option to store it only in the
pg_twophase file, not in the WAL record. Files are a lot rarer.

Similarly, I wondered if pg_twophase files should contain the LSN of the
XLOG_XACT_PREPARE record, which would make the file's oldness unambiguous in a
way fxid doesn't achieve. I didn't come up with a problem whose solution
requires it, though.

Thanks,
nm

#18Michael Paquier
michael@paquier.xyz
In reply to: Noah Misch (#17)
Re: An improvement of ProcessTwoPhaseBuffer logic

On Thu, Jan 16, 2025 at 06:44:16PM -0800, Noah Misch wrote:

On Fri, Jan 17, 2025 at 11:04:03AM +0900, Michael Paquier wrote:

+typedef void (*TwoPhaseCallback) (FullTransactionId fxid, uint16 info,
void *recdata, uint32 len);

Based on your latest patch, I doubt that we'll be able to do any of
that in the back-branches. That's much nicer in the long term to show
what this code relies on.

The twophase.c bits of that patch turned out to be orthogonal to the key
issue, so that's fine. I suspect the signature changes would be okay to
back-patch to v17, subject to PGXN confirmation of no extension using those
APIs. (The TwoPhaseCallback list is hard-coded; extensions can't add
callbacks.) That said, the rationale for back-patching fxid stuff is gone.

I would not bet on this part.

- Let's first revert e358425 and 7e125b2. This indeed needs to be
reworked, and there is a release coming by.

+1. e358425 is borderline, but it's hard to rule out ways of it making
recovery unlink pg_twophase files improperly. Released code may also be doing
that, but that code's seven years of history suggests it does so infrequently
if at all. We don't have as much evidence about what the frequency would be
under e358425.

Thanks. I'll got do that now as I can look at the buildfarm today,
then look at the rest.

- Your refactoring around xid8funcs.c is a good idea on its own. This
can be an independent patch.

I got here on a yak shave for postgr.es/m/20250111214454.9a.nmisch@google.com.
I expect that project will still want FullTransactionIdFromAllowableAt(). If
so, I'll include it in that thread's patch series.

Okay.

I'm not confident I could fix both that other thread's data loss bug and
$SUBJECT in time for the 2024-02 releases. (By $SUBJECT, I mean the
seven-year-old bug of interpreting pg_twophase before recovery consistency,
which has a higher chance of causing spurious recovery failure starting in
v17.) Would your or someone else like to fix $SUBJECT, before or after
2024-02 releases? I'd make time to review a patch.

Yes, I can work on these two and dig into the pieces, once the state
of the branch a bit cleaner. I'm not sure that a fully-ready patch
will be able to emerge within two weeks as this may need some
interations, but let's see. Backpatching something to remove the clog
lookups does not seem that complicated, actually, now that I'm looking
at it. Simpler in ~16 of course, but it does not seem that bad for
17~ as well if taken the simplest way.

Another point to consider is if we'd better switch 2PC files to store
a fxid rather than a xid.. Perhaps that's not necessary, but if we're
using FullTransactionIds across the full stack of twophase.c that
could make the whole better with even less xid <-> fxid translations
in all these paths. There is always the counter-argument of the extra
4 bytes in the 2PC files and the records, though.

Yes, perhaps so. It could also be an option to store it only in the
pg_twophase file, not in the WAL record. Files are a lot rarer.

It would be slightly simpler to do both, I guess. Less translations
between fxids and xids involved this way.

Similarly, I wondered if pg_twophase files should contain the LSN of the
XLOG_XACT_PREPARE record, which would make the file's oldness unambiguous in a
way fxid doesn't achieve. I didn't come up with a problem whose solution
requires it, though.

Not sure about this one.
--
Michael

#19Vitaly Davydov
v.davydov@postgrespro.ru
In reply to: Michael Paquier (#18)
1 attachment(s)
Re: An improvement of ProcessTwoPhaseBuffer logic

Dear Hackers,

I hope you may find this patch worth to consider. You may consider this patch as one more step to a complete migration to FullTransactionId in twophase.c for 17+ versions.

The patch fixes TwoPhaseFilePath function. The current version gets TransactionId as an argument and makes a path to the file with twophase transaction state. The problem - this logic is ambiguous, because xid doesn't contain the epoch. By replacing the argument with FullTransactionId, the logic of this function becomes straightforward. All the responsibility to properly form a full xid were delegated to the caller functions.

The change in TwoPhaseFilePath implies to migrate to FullTransactionId in some other functions. There is a function AdjustToFullTransactionId (in the current master) which converts xid to full xid. The problem with this function is that it works properly only with those xids that are inside the current xid range which is defined by the xid horizon (wraparound), because it applies some assumptions concerning epoch definition (two epoches may be in action).

I assume that if a xid is coming from in-memory states, it has to be in the current xid range. Based on this assumption, I would conclude that if the xid is coming via the interface (external) functions which are defined in twophase.h, AdjustToFullTransactionId can be applied to such xid.

There is another story when we define xid from two phase file names, when reading pg_twophase directory. In 17+ twophase file names was changed to contain tx epoch as well. Once we work with twophase files, we have to use full xids. Function AdjustToFullTransactionId is not applicable in this case, because pg_twophase directory may contain any garbage files with future or past full xids which are outside of the current range.

Based on these assumptions (AdjustToFullTransactionId or use full xids) some other functions were modified as shown in the patch.

With best regards,
Vitaly

Attachments:

0001-Use-FullTransactionId-in-TwoPhaseFilePath-and-in-som.patchtext/x-patchDownload
From 6fa76418466a511a8af32fc3032b4f2ac9daae77 Mon Sep 17 00:00:00 2001
From: Vitaly Davydov <v.davydov@postgrespro.ru>
Date: Wed, 22 Jan 2025 11:12:32 +0300
Subject: [PATCH] Use FullTransactionId in TwoPhaseFilePath and in some other
 functions

The patch fixes TwoPhaseFilePath function. The current version gets
TransactionId as an argument and makes a path to the file with twophase
transaction state. The problem - this logic is ambiguous, because xid
doesn't contain the epoch. By replacing the argument with
FullTransactionId, the logic of this function becomes straightforward.
All the responsibility to properly form a full xid were delegated to
the caller functions.
---
 src/backend/access/transam/twophase.c | 84 ++++++++++++++++-----------
 1 file changed, 51 insertions(+), 33 deletions(-)

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index ab2f4a8a92..e215170cca 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -221,14 +221,14 @@ static void ProcessRecords(char *bufptr, TransactionId xid,
 static void RemoveGXact(GlobalTransaction gxact);
 
 static void XlogReadTwoPhaseData(XLogRecPtr lsn, char **buf, int *len);
-static char *ProcessTwoPhaseBuffer(TransactionId xid,
+static char *ProcessTwoPhaseBuffer(FullTransactionId xid,
 								   XLogRecPtr prepare_start_lsn,
 								   bool fromdisk, bool setParent, bool setNextXid);
 static void MarkAsPreparingGuts(GlobalTransaction gxact, TransactionId xid,
 								const char *gid, TimestampTz prepared_at, Oid owner,
 								Oid databaseid);
-static void RemoveTwoPhaseFile(TransactionId xid, bool giveWarning);
-static void RecreateTwoPhaseFile(TransactionId xid, void *content, int len);
+static void RemoveTwoPhaseFile(FullTransactionId xid, bool giveWarning);
+static void RecreateTwoPhaseFile(FullTransactionId xid, void *content, int len);
 
 /*
  * Initialization of shared memory
@@ -958,10 +958,8 @@ AdjustToFullTransactionId(TransactionId xid)
 }
 
 static inline int
-TwoPhaseFilePath(char *path, TransactionId xid)
+TwoPhaseFilePath(char *path, FullTransactionId fxid)
 {
-	FullTransactionId fxid = AdjustToFullTransactionId(xid);
-
 	return snprintf(path, MAXPGPATH, TWOPHASE_DIR "/%08X%08X",
 					EpochFromFullTransactionId(fxid),
 					XidFromFullTransactionId(fxid));
@@ -1300,7 +1298,7 @@ RegisterTwoPhaseRecord(TwoPhaseRmgrId rmid, uint16 info,
  * ignored, then return NULL.  This state can be reached when doing recovery.
  */
 static char *
-ReadTwoPhaseFile(TransactionId xid, bool missing_ok)
+ReadTwoPhaseFile(FullTransactionId xid, bool missing_ok)
 {
 	char		path[MAXPGPATH];
 	char	   *buf;
@@ -1477,14 +1475,17 @@ StandbyTransactionIdIsPrepared(TransactionId xid)
 	char	   *buf;
 	TwoPhaseFileHeader *hdr;
 	bool		result;
+	FullTransactionId fxid;
 
 	Assert(TransactionIdIsValid(xid));
 
 	if (max_prepared_xacts <= 0)
 		return false;			/* nothing to do */
 
+	fxid = AdjustToFullTransactionId(xid);
+
 	/* Read and validate file */
-	buf = ReadTwoPhaseFile(xid, true);
+	buf = ReadTwoPhaseFile(fxid, true);
 	if (buf == NULL)
 		return false;
 
@@ -1518,6 +1519,7 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
 	xl_xact_stats_item *commitstats;
 	xl_xact_stats_item *abortstats;
 	SharedInvalidationMessage *invalmsgs;
+	FullTransactionId fxid = InvalidFullTransactionId;
 
 	/*
 	 * Validate the GID, and lock the GXACT to ensure that two backends do not
@@ -1532,8 +1534,10 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
 	 * in WAL files if the LSN is after the last checkpoint record, or moved
 	 * to disk if for some reason they have lived for a long time.
 	 */
-	if (gxact->ondisk)
-		buf = ReadTwoPhaseFile(xid, false);
+	if (gxact->ondisk) {
+		fxid = AdjustToFullTransactionId(xid);
+		buf = ReadTwoPhaseFile(fxid, false);
+	}
 	else
 		XlogReadTwoPhaseData(gxact->prepare_start_lsn, &buf, NULL);
 
@@ -1679,8 +1683,10 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
 	/*
 	 * And now we can clean up any files we may have left.
 	 */
-	if (ondisk)
-		RemoveTwoPhaseFile(xid, true);
+	if (ondisk) {
+		Assert(FullTransactionIdIsValid(fxid));
+		RemoveTwoPhaseFile(fxid, true);
+	}
 
 	MyLockedGxact = NULL;
 
@@ -1720,11 +1726,11 @@ ProcessRecords(char *bufptr, TransactionId xid,
  * this is an expected case during WAL replay.
  */
 static void
-RemoveTwoPhaseFile(TransactionId xid, bool giveWarning)
+RemoveTwoPhaseFile(FullTransactionId fxid, bool giveWarning)
 {
 	char		path[MAXPGPATH];
 
-	TwoPhaseFilePath(path, xid);
+	TwoPhaseFilePath(path, fxid);
 	if (unlink(path))
 		if (errno != ENOENT || giveWarning)
 			ereport(WARNING,
@@ -1739,7 +1745,7 @@ RemoveTwoPhaseFile(TransactionId xid, bool giveWarning)
  * Note: content and len don't include CRC.
  */
 static void
-RecreateTwoPhaseFile(TransactionId xid, void *content, int len)
+RecreateTwoPhaseFile(FullTransactionId xid, void *content, int len)
 {
 	char		path[MAXPGPATH];
 	pg_crc32c	statefile_crc;
@@ -1860,9 +1866,10 @@ CheckPointTwoPhase(XLogRecPtr redo_horizon)
 		{
 			char	   *buf;
 			int			len;
+			FullTransactionId fxid = AdjustToFullTransactionId(gxact->xid);
 
 			XlogReadTwoPhaseData(gxact->prepare_start_lsn, &buf, &len);
-			RecreateTwoPhaseFile(gxact->xid, buf, len);
+			RecreateTwoPhaseFile(fxid, buf, len);
 			gxact->ondisk = true;
 			gxact->prepare_start_lsn = InvalidXLogRecPtr;
 			gxact->prepare_end_lsn = InvalidXLogRecPtr;
@@ -1913,14 +1920,12 @@ restoreTwoPhaseData(void)
 		if (strlen(clde->d_name) == 16 &&
 			strspn(clde->d_name, "0123456789ABCDEF") == 16)
 		{
-			TransactionId xid;
 			FullTransactionId fxid;
 			char	   *buf;
 
 			fxid = FullTransactionIdFromU64(strtou64(clde->d_name, NULL, 16));
-			xid = XidFromFullTransactionId(fxid);
 
-			buf = ProcessTwoPhaseBuffer(xid, InvalidXLogRecPtr,
+			buf = ProcessTwoPhaseBuffer(fxid, InvalidXLogRecPtr,
 										true, false, false);
 			if (buf == NULL)
 				continue;
@@ -1981,12 +1986,14 @@ PrescanPreparedTransactions(TransactionId **xids_p, int *nxids_p)
 		TransactionId xid;
 		char	   *buf;
 		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+		FullTransactionId fxid;
 
 		Assert(gxact->inredo);
 
 		xid = gxact->xid;
+		fxid = AdjustToFullTransactionId(xid);
 
-		buf = ProcessTwoPhaseBuffer(xid,
+		buf = ProcessTwoPhaseBuffer(fxid,
 									gxact->prepare_start_lsn,
 									gxact->ondisk, false, true);
 
@@ -2055,12 +2062,14 @@ StandbyRecoverPreparedTransactions(void)
 		TransactionId xid;
 		char	   *buf;
 		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+		FullTransactionId fxid;
 
 		Assert(gxact->inredo);
 
 		xid = gxact->xid;
+		fxid = AdjustToFullTransactionId(xid);
 
-		buf = ProcessTwoPhaseBuffer(xid,
+		buf = ProcessTwoPhaseBuffer(fxid,
 									gxact->prepare_start_lsn,
 									gxact->ondisk, true, false);
 		if (buf != NULL)
@@ -2100,8 +2109,10 @@ RecoverPreparedTransactions(void)
 		TwoPhaseFileHeader *hdr;
 		TransactionId *subxids;
 		const char *gid;
+		FullTransactionId fxid;
 
 		xid = gxact->xid;
+		fxid = AdjustToFullTransactionId(xid);
 
 		/*
 		 * Reconstruct subtrans state for the transaction --- needed because
@@ -2112,7 +2123,7 @@ RecoverPreparedTransactions(void)
 		 * SubTransSetParent has been set before, if the prepared transaction
 		 * generated xid assignment records.
 		 */
-		buf = ProcessTwoPhaseBuffer(xid,
+		buf = ProcessTwoPhaseBuffer(fxid,
 									gxact->prepare_start_lsn,
 									gxact->ondisk, true, false);
 		if (buf == NULL)
@@ -2189,13 +2200,13 @@ RecoverPreparedTransactions(void)
  * value scanned.
  */
 static char *
-ProcessTwoPhaseBuffer(TransactionId xid,
+ProcessTwoPhaseBuffer(FullTransactionId fxid,
 					  XLogRecPtr prepare_start_lsn,
 					  bool fromdisk,
 					  bool setParent, bool setNextXid)
 {
 	FullTransactionId nextXid = TransamVariables->nextXid;
-	TransactionId origNextXid = XidFromFullTransactionId(nextXid);
+	TransactionId xid = XidFromFullTransactionId(fxid);
 	TransactionId *subxids;
 	char	   *buf;
 	TwoPhaseFileHeader *hdr;
@@ -2214,7 +2225,7 @@ ProcessTwoPhaseBuffer(TransactionId xid,
 			ereport(WARNING,
 					(errmsg("removing stale two-phase state file for transaction %u",
 							xid)));
-			RemoveTwoPhaseFile(xid, true);
+			RemoveTwoPhaseFile(fxid, true);
 		}
 		else
 		{
@@ -2227,14 +2238,14 @@ ProcessTwoPhaseBuffer(TransactionId xid,
 	}
 
 	/* Reject XID if too new */
-	if (TransactionIdFollowsOrEquals(xid, origNextXid))
+	if (FullTransactionIdFollowsOrEquals(fxid, nextXid))
 	{
 		if (fromdisk)
 		{
 			ereport(WARNING,
 					(errmsg("removing future two-phase state file for transaction %u",
 							xid)));
-			RemoveTwoPhaseFile(xid, true);
+			RemoveTwoPhaseFile(fxid, true);
 		}
 		else
 		{
@@ -2249,7 +2260,7 @@ ProcessTwoPhaseBuffer(TransactionId xid,
 	if (fromdisk)
 	{
 		/* Read and validate file */
-		buf = ReadTwoPhaseFile(xid, false);
+		buf = ReadTwoPhaseFile(fxid, false);
 	}
 	else
 	{
@@ -2520,8 +2531,9 @@ PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 	if (!XLogRecPtrIsInvalid(start_lsn))
 	{
 		char		path[MAXPGPATH];
+		FullTransactionId fxid = AdjustToFullTransactionId(hdr->xid);
 
-		TwoPhaseFilePath(path, hdr->xid);
+		TwoPhaseFilePath(path, fxid);
 
 		if (access(path, F_OK) == 0)
 		{
@@ -2615,8 +2627,11 @@ PrepareRedoRemove(TransactionId xid, bool giveWarning)
 	 * And now we can clean up any files we may have left.
 	 */
 	elog(DEBUG2, "removing 2PC data for transaction %u", xid);
-	if (gxact->ondisk)
-		RemoveTwoPhaseFile(xid, giveWarning);
+	if (gxact->ondisk) {
+		FullTransactionId fxid = AdjustToFullTransactionId(xid);
+
+		RemoveTwoPhaseFile(fxid, giveWarning);
+	}
 	RemoveGXact(gxact);
 }
 
@@ -2663,8 +2678,11 @@ LookupGXact(const char *gid, XLogRecPtr prepare_end_lsn,
 			 * do this optimization if we encounter many collisions in GID
 			 * between publisher and subscriber.
 			 */
-			if (gxact->ondisk)
-				buf = ReadTwoPhaseFile(gxact->xid, false);
+			if (gxact->ondisk) {
+				FullTransactionId fxid = AdjustToFullTransactionId(gxact->xid);
+
+				buf = ReadTwoPhaseFile(fxid, false);
+			}
 			else
 			{
 				Assert(gxact->prepare_start_lsn);
-- 
2.34.1

#20Michael Paquier
michael@paquier.xyz
In reply to: Noah Misch (#17)
Re: An improvement of ProcessTwoPhaseBuffer logic

On Thu, Jan 16, 2025 at 06:44:16PM -0800, Noah Misch wrote:

I got here on a yak shave for postgr.es/m/20250111214454.9a.nmisch@google.com.
I expect that project will still want FullTransactionIdFromAllowableAt(). If
so, I'll include it in that thread's patch series.

Note that this one has been committed as 81772a495ec9 by Noah. I am
going to spend some time looking at the rest.
--
Michael

#21Michael Paquier
michael@paquier.xyz
In reply to: Vitaly Davydov (#19)
Re: An improvement of ProcessTwoPhaseBuffer logic

On Wed, Jan 22, 2025 at 03:45:30PM +0300, Vitaly Davydov wrote:

I hope you may find this patch worth to consider. You may consider
this patch as one more step to a complete migration to
FullTransactionId in twophase.c for 17+ versions.

Heavy refactorings are only going to be something for 18~, as far as
my analysis goes.

I assume that if a xid is coming from in-memory states, it has to be
in the current xid range. Based on this assumption, I would conclude
that if the xid is coming via the interface (external) functions
which are defined in twophase.h, AdjustToFullTransactionId can be
applied to such xid.

There is another story when we define xid from two phase file names,
when reading pg_twophase directory. In 17+ twophase file names was
changed to contain tx epoch as well. Once we work with twophase
files, we have to use full xids. Function AdjustToFullTransactionId
is not applicable in this case, because pg_twophase directory may
contain any garbage files with future or past full xids which are
outside of the current range.

Based on these assumptions (AdjustToFullTransactionId or use full
xids) some other functions were modified as shown in the patch.

Yeah, we need to do something here. Anyway, I don't think what you
are posting here is ambitious enough for two reasons:
- We should try to limit the number of times where we do maths between
TransactionId and FullTransactionId. It's going to be cleaner if we
make the 2PC interface require FullTransactionIds all the time (note
that the point of storing a fxid or an xid is different). Noah's
proposal at [1]/messages/by-id/20250116205254.65.nmisch@google.com -- Michael is much closer to the long-term picture that would
look adapted.
- The CLOG lookups that can happen in ProcessTwoPhaseBuffer() during
recovery while a consistent state is not reached are still possible
(planning to start a different thread about this specific issue).

[1]: /messages/by-id/20250116205254.65.nmisch@google.com -- Michael
--
Michael

#22Vitaly Davydov
v.davydov@postgrespro.ru
In reply to: Michael Paquier (#21)
Re: An improvement of ProcessTwoPhaseBuffer logic

It seems, there are much deeper problems with twophase transactions as I thought. I'm interested in fixing twophase transactions, because I support a solution which actively uses twophase transactions. I'm interested to get more deeply into the twophase functionality. Below, I just want to clarify some ideas behind twophase transactions. I appreciate, if you comment my point of view, or just ignore this email if you find it too naive and boring.

Two phase files are created after checkpoint to keep twophase states on disk after WAL truncation. For transactions, that are inside the checkpoint horizon, we do not create two phase files because the states are currently stored in the WAL.

Based on the thesis above, I guess, we have to read only those twophase files which are related to the transactions before the latest checkpoint. Its full xid should be lesser than TransamVariables->nextXid (which is the same as ControlFile->checkPointCopy.nextXid at the moment of StartupXLOG -> restoreTwoPhaseData call). The files with greater (or equal) full xids, should be ignored and removed. That's all what we need in restoreTwoPhaseData, I believe.

In the current implementation, such check is applied in restoreTwoPhaseData -> ProcessTwoPhaseBuffer but after checking for xid in CLOG. I'm not sure, why we check CLOG here. Once we truncate the WAL on checkpoint and save twophase states into pg_twophase, these files must store states of real transactions from past. I mean, if someone creates a stub file with full xid < TransamVariables->nextXid, we have no means (except CLOG ?) to check that this file belongs to a real transaction from past. CLOG check seems to be a weak attempt to deal with it. At this point, I'm not sure that CLOG may contain states for all full xids of existing twophase files. I guess, we should call restoreTwoPhaseData at start of recovery but we shouldn't check CLOG at this stage. May be it is reasonable to check some not so old xids which are greater than the current CLOG horizon, but I'm not sure how CLOG segments are managed and how the horizon is moving.

There is another question about the loading order of twophase files but I think it doesn't matter in which order we load these files. But I would prefer to load it in full xid ascending order.

On Tuesday, January 28, 2025 08:02 MSK, Michael Paquier <michael@paquier.xyz> wrote:

Noah's
proposal at [1] is much closer to the long-term picture that would
look adapted.
- The CLOG lookups that can happen in ProcessTwoPhaseBuffer() during
recovery while a consistent state is not reached are still possible
(planning to start a different thread about this specific issue).

[1]: /messages/by-id/20250116205254.65.nmisch@google.com

Agree, thank you, but my simple patch with some adjustments and swapping of checks in ProcessTwoPhaseBuffer may be back-ported. It doesn't fix all the problems but may help to fix the problem with twophase files related to broken latest WAL segments.

With best regards,
Vitaly

#23Michael Paquier
michael@paquier.xyz
In reply to: Vitaly Davydov (#22)
Re: An improvement of ProcessTwoPhaseBuffer logic

On Wed, Jan 29, 2025 at 03:00:54PM +0300, Vitaly Davydov wrote:

It seems, there are much deeper problems with twophase transactions
as I thought. I'm interested in fixing twophase transactions,
because I support a solution which actively uses twophase
transactions. I'm interested to get more deeply into the twophase
functionality. Below, I just want to clarify some ideas behind
twophase transactions. I appreciate, if you comment my point of
view, or just ignore this email if you find it too naive and boring.

(Please be careful about your email indentation. These are hard to
read.)

Two phase files are created after checkpoint to keep twophase states
on disk after WAL truncation. For transactions, that are inside the
checkpoint horizon, we do not create two phase files because the
states are currently stored in the WAL.

Yeah, the oldest XID horizon stored in the checkpoint records would be
stuck if you have a long-running 2PC transaction. It's actually
something I have been brewing a bit for the last few days: we should
discard early at recovery files that are older than the horizon in the
checkpoint record we've retrieved at the beginning of recovery. So
the early check can be tighter with past files.

Based on the thesis above, I guess, we have to read only those
twophase files which are related to the transactions before the
latest checkpoint. Its full xid should be lesser than
TransamVariables->nextXid (which is the same as
ControlFile->checkPointCopy.nextXid at the moment of StartupXLOG ->
restoreTwoPhaseData call). The files with greater (or equal) full
xids, should be ignored and removed. That's all what we need in
restoreTwoPhaseData, I believe.

Yeah. It is important to do that only in restoreTwoPhaseData(). The
check about already-aborted and already-committed 2PC files is
something we have to keep as well. We should only do it at the end of
recovery.

In the current implementation, such check is applied in
restoreTwoPhaseData -> ProcessTwoPhaseBuffer but after checking for
xid in CLOG. I'm not sure, why we check CLOG here. Once we truncate
the WAL on checkpoint and save twophase states into pg_twophase,
these files must store states of real transactions from past.

Based on my analysis, the CLOG check ought to happen only in
RecoverPreparedTransactions(), which is the last step taken at
recovery. The future and past checks only need to happen in
restoreTwoPhaseData(). I'm been bumping my head on my desk on this
area for a few days now, but I think that the solution is simple:
these checks should be moved *outside* ProcessTwoPhaseBuffer() and
applied one layer higher in the two places I've mentioning above,
where they matter.

There is another question about the loading order of twophase files
but I think it doesn't matter in which order we load these
files. But I would prefer to load it in full xid ascending order.

I don't see a need in doing that in the short-term, but perhaps you
have a point with the ordering of the entries in the TwoPhaseState
shmem data if there are many entries to handle. Another idea would be
to switch to a hash table.

Note that I've actually found what looks like a potential data
corruption bug in the logic while doing this investigation and adding
TAP tests to automate all that: if we detect a 2PC file as already
aborted or committed in ProcessTwoPhaseBuffer() while scanning
TwoPhaseState->numPrepXacts, we could finish by calling
PrepareRedoRemove(), which itself *manipulates* TwoPhaseState,
decrementing numPrepXacts. This can leave dangling entries in
TwoPhaseState if you have multiple TwoPhaseState entries (my
implemented TAP tests have been doing exactly that). This is relevant
at the end of recovery where CLOG lookups are OK to do.

So the situation is a bit of a mess in all the branches, a bit worse
in 17~ due to the fact that epochs are poorly integrated, but I'm
getting there with a set of patches mostly ready to be sent, and I
think that this would be OK for a backpatch. My plan is to start a
new thread to deal with the matter, because that's a bit larger than
the topic you have raised on this thread. I'll add you and Noah in CC
for awareness.
--
Michael

#24Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#23)
Re: An improvement of ProcessTwoPhaseBuffer logic

On Thu, Jan 30, 2025 at 08:57:23AM +0900, Michael Paquier wrote:

So the situation is a bit of a mess in all the branches, a bit worse
in 17~ due to the fact that epochs are poorly integrated, but I'm
getting there with a set of patches mostly ready to be sent, and I
think that this would be OK for a backpatch. My plan is to start a
new thread to deal with the matter, because that's a bit larger than
the topic you have raised on this thread. I'll add you and Noah in CC
for awareness.

And done that here, with a closer lookup at everything I've bumped
into:
/messages/by-id/Z5sd5O9JO7NYNK-C@paquier.xyz
--
Michael