The same 2PC data may be recovered twice
Hi, all. I want to report a bug in the recovery of 2PC data. In the current implementation of crash recovery, there are two ways to recover 2PC data:
1. Before redo, restoreTwoPhaseData() restores the 2PC data whose xid < ShmemVariableCache->nextXid, which is initialized from checkPoint.nextXid;
2. During redo, xact_redo() adds 2PC data from the WAL.
The following scenario can cause the same 2PC transaction to be added twice:
1. Checkpoint_1 starts; checkpoint_1.redo is set to curInsert.
2. Before checkPoint_1.nextXid is set, a new transaction is prepared; suppose the xid of this 2PC is 100, so ShmemVariableCache->nextXid is advanced to 101.
3. checkPoint_1.nextXid is set to 101.
4. In CheckPointTwoPhase() of checkpoint_1, 2pc_100 is not copied to disk because its prepare_end_lsn > checkpoint_1.redo.
5. checkPoint_1 finishes; after checkpoint_timeout, checkpoint_2 starts.
6. During checkpoint_2, the data of 2pc_100 is copied to disk.
7. A crash happens before UpdateControlFile() of checkpoint_2.
8. During crash recovery, redo starts from checkpoint_1, and 2pc_100 is first restored by restoreTwoPhaseData() because xid 100 < checkPoint_1.nextXid, which is 101.
9. Because the prepare_start_lsn of 2pc_100 > checkpoint_1.redo, 2pc_100 is added again by xact_redo() during WAL replay, resulting in the same 2PC data being added twice.
10. In RecoverPreparedTransactions() -> lock_twophase_recover(), locking the same 2PC twice causes a PANIC.
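Put differently (a rough restatement of the two conditions above, not actual server code), 2pc_100 matches both recovery paths after the crash:
    /* sketch only, reusing the names from the steps above */
    /* path 1: restoreTwoPhaseData() keeps the on-disk file written by checkpoint_2 */
    xid < checkPoint_1.nextXid            /* 100 < 101 */
    /* path 2: redo starts at checkpoint_1.redo, so xact_redo() replays the PREPARE record */
    prepare_start_lsn > checkpoint_1.redo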
Is the above scenario reasonable, and do you have any good ideas for fixing this bug?
Thanks & Best Regards
Hi, all
I have added a patch for pg11 to fix this bug; I hope you can check it.
Thanks & Best Regards
Attachments:
v1-0001-Fix-a-2PC-transaction-maybe-recovered-twice_11.patch (application/octet-stream)
From 1ba372ef6e4e77154d04c2665f22fcf9cc4c53d8 Mon Sep 17 00:00:00 2001
From: "suyu.cmj" <mengjuan.cmj@alibaba-inc.com>
Date: Fri, 7 Jul 2023 08:44:49 +0000
Subject: [PATCH] Fix the bug of a 2PC transaction maybe recovered twice
During recovery, a two-phase transaction should be restored either
from the disk file or from the WAL. However, a two-phase transaction
may be restored from both if we crashed while performing a checkpoint.
Since the data in the disk file may not be reliable, in this case we
keep only the transaction data recorded in the WAL.
---
src/backend/access/transam/twophase.c | 40 ++++++++++++++++++++++++---
1 file changed, 36 insertions(+), 4 deletions(-)
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index d145836a79..e1dda1b3fe 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -2491,6 +2491,8 @@ PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
char *bufptr;
const char *gid;
GlobalTransaction gxact;
+ bool addnewgxact = true;
+ int i;
Assert(LWLockHeldByMeInMode(TwoPhaseStateLock, LW_EXCLUSIVE));
Assert(RecoveryInProgress());
@@ -2498,6 +2500,30 @@ PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
bufptr = buf + MAXALIGN(sizeof(TwoPhaseFileHeader));
gid = (const char *) bufptr;
+ /*
+ * During recovery, the two-phase data can be added in two ways:
+ * 1) restored from disk file when its xid < checkPoint.nextxid,
+ * 2) restored from the WAL when its prepare_start_lsn > checkPoint.redo,
+ * A two-phase transaction may satisfy above two conditions if we
+ * crashed during doing a checkpoint. Considering that the data
+ * on disk file may not be reliable, so in this case, we only store
+ * the transaction data recorded in the WAL.
+ */
+ for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+ {
+ GlobalTransaction curxact = TwoPhaseState->prepXacts[i];
+
+ if (curxact->xid == hdr->xid && curxact->ondisk && !XLogRecPtrIsInvalid(start_lsn))
+ {
+ gxact = curxact;
+ addnewgxact = false;
+ ereport(WARNING,
+ (errmsg("found duplicate two-phase transaction %u, store data from the WAL",
+ hdr->xid)));
+ break;
+ }
+ }
+
/*
* Reserve the GID for the given transaction in the redo code path.
*
@@ -2510,14 +2536,18 @@ PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
*/
/* Get a free gxact from the freelist */
- if (TwoPhaseState->freeGXacts == NULL)
+ if (TwoPhaseState->freeGXacts == NULL && addnewgxact)
ereport(ERROR,
(errcode(ERRCODE_OUT_OF_MEMORY),
errmsg("maximum number of prepared transactions reached"),
errhint("Increase max_prepared_transactions (currently %d).",
max_prepared_xacts)));
- gxact = TwoPhaseState->freeGXacts;
- TwoPhaseState->freeGXacts = gxact->next;
+
+ if (addnewgxact)
+ {
+ gxact = TwoPhaseState->freeGXacts;
+ TwoPhaseState->freeGXacts = gxact->next;
+ }
gxact->prepared_at = hdr->prepared_at;
gxact->prepare_start_lsn = start_lsn;
@@ -2532,7 +2562,9 @@ PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
/* And insert it into the active array */
Assert(TwoPhaseState->numPrepXacts < max_prepared_xacts);
- TwoPhaseState->prepXacts[TwoPhaseState->numPrepXacts++] = gxact;
+
+ if (addnewgxact)
+ TwoPhaseState->prepXacts[TwoPhaseState->numPrepXacts++] = gxact;
if (origin_id != InvalidRepOriginId)
{
--
2.19.1.6.gb485710b
Hi:
On Sat, Jul 8, 2023 at 2:53 AM 蔡梦娟(玊于) <mengjuan.cmj@alibaba-inc.com> wrote:
Hi, all
I add a patch for pg11 to fix this bug, hope you can check it.
Thanks for the bug report and patch! Usually we discuss bugs
against the master branch; nobody wants to check out a back
branch and hold the discussion there :) This bug is reproducible on
master, IIUC.
I dislike the patch here because it spends extra CPU cycles detecting
duplication for every 2PC record; how many CPU cycles depends on how
many 2PC transactions are in use. How about detecting such
duplication only at the restoreTwoPhaseData() stage? Instead of this in
ProcessTwoPhaseBuffer():
    if (TransactionIdFollowsOrEquals(xid, origNextXid))
    {
        ...
        ereport(WARNING,
                (errmsg("removing future two-phase state file for transaction %u",
                        xid)));
        RemoveTwoPhaseFile(xid, true);
        ...
    }
we use:
    if (TwoPhaseFileHeader.startup_lsn > checkpoint.redo)
    {
        ereport(WARNING,
                (errmsg("removing future two-phase state file for transaction %u",
                        xid)));
    }
We have several advantages with this approach. a) We only touch
restoreTwoPhaseData(), not every WAL record during recovery.
b) We use a constant-time comparison rather than a loop over an array. c)
It is a better design, since we avoid the issue in the first place rather
than allowing it at one stage and fixing it at a later stage.
The only blocker I know of is that we currently don't write startup_lsn
into the 2PC state file, and if we start doing that, decoding old 2PC
files will fail. We have several choices here:
a) document that users must complete all pending 2PC transactions
before upgrading; b) use a different magic number in the 2PC
state file to distinguish the two versions. Basically I prefer
method a).
Any suggestions are welcome.
--
Best Regards
Andy Fan
Yes, this bug can also be reproduced on the master branch, and the corresponding reproduction patch is attached.
I also considered comparing 2pc.prepare_start_lsn with checkpoint.redo in ProcessTwoPhaseBuffer(), but this method requires modifying the format of the 2PC state file, which brings compatibility issues, especially for released branches: assuming a node has hit this bug, it will not be able to start because of the FATAL during crash recovery, and therefore its previous two-phase transactions cannot be committed manually. Using a magic number to distinguish 2PC state file versions can't solve the problem in that scenario either.
For unreleased branches, writing 2pc.prepare_start_lsn into the state file would be a good solution, but for released branches I personally think letting the WAL record overwrite the checkpointed data is a more reasonable approach. What do you think?
Best Regards
suyu.cmj
Attachments:
0001-Reproduce-the-error_master.patch (application/octet-stream)
From 1a87d755258cf96d82c67fca18b3db97980af237 Mon Sep 17 00:00:00 2001
From: "suyu.cmj" <mengjuan.cmj@alibaba-inc.com>
Date: Wed, 12 Jul 2023 04:05:48 +0000
Subject: [PATCH] Reproduce the error that the same 2pc may be added repeatedly
---
src/backend/access/transam/xlog.c | 10 +++
src/backend/postmaster/checkpointer.c | 51 ++++++++++++++
src/include/catalog/pg_proc.dat | 8 +++
src/include/postmaster/bgwriter.h | 3 +
.../037_add_duplicate_2pc_during_recovery.pl | 68 +++++++++++++++++++
5 files changed, 140 insertions(+)
create mode 100644 src/test/recovery/t/037_add_duplicate_2pc_during_recovery.pl
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 8b0710abe6..1e91f1e342 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6624,6 +6624,11 @@ CreateCheckPoint(int flags)
TRACE_POSTGRESQL_CHECKPOINT_START(flags);
+ /* Test inject fault */
+ while (IsCheckpointDelayed())
+ pg_usleep(1000000);
+ /* Test inject fault end */
+
/*
* Get the other info we need for the checkpoint record.
*
@@ -7039,6 +7044,11 @@ CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
/* We deliberately delay 2PC checkpointing as long as possible */
CheckPointTwoPhase(checkPointRedo);
+
+ /* Test inject fault */
+ if (IsCheckpointPanicInjected())
+ elog(PANIC, "panic is injected during checkpoint");
+ /* Test inject fault end */
}
/*
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index ace9893d95..161f98a1a3 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -122,6 +122,9 @@ typedef struct
int ckpt_flags; /* checkpoint flags, as defined in xlog.h */
+ bool is_ckpt_delayed; /* add for test */
+ bool is_ckpt_panic_injected; /* add for test */
+
ConditionVariable start_cv; /* signaled when ckpt_started advances */
ConditionVariable done_cv; /* signaled when ckpt_done advances */
@@ -1351,3 +1354,51 @@ FirstCallSinceLastCheckpoint(void)
return FirstCall;
}
+
+bool
+IsCheckpointDelayed(void)
+{
+ bool is_delayed = false;
+
+ SpinLockAcquire(&CheckpointerShmem->ckpt_lck);
+ is_delayed = CheckpointerShmem->is_ckpt_delayed;
+ SpinLockRelease(&CheckpointerShmem->ckpt_lck);
+
+ return is_delayed;
+}
+
+bool
+IsCheckpointPanicInjected(void)
+{
+ bool inject_panic = false;
+
+ SpinLockAcquire(&CheckpointerShmem->ckpt_lck);
+ inject_panic = CheckpointerShmem->is_ckpt_panic_injected;
+ SpinLockRelease(&CheckpointerShmem->ckpt_lck);
+
+ return inject_panic;
+}
+
+Datum
+pg_test_inject_checkpoint_delay(PG_FUNCTION_ARGS)
+{
+ bool set_delay = PG_GETARG_BOOL(0);
+
+ SpinLockAcquire(&CheckpointerShmem->ckpt_lck);
+ CheckpointerShmem->is_ckpt_delayed = set_delay;
+ SpinLockRelease(&CheckpointerShmem->ckpt_lck);
+
+ PG_RETURN_VOID();
+}
+
+Datum
+pg_test_inject_checkpoint_panic(PG_FUNCTION_ARGS)
+{
+ bool inject_panic = PG_GETARG_BOOL(0);
+
+ SpinLockAcquire(&CheckpointerShmem->ckpt_lck);
+ CheckpointerShmem->is_ckpt_panic_injected = inject_panic;
+ SpinLockRelease(&CheckpointerShmem->ckpt_lck);
+
+ PG_RETURN_VOID();
+}
\ No newline at end of file
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 6996073989..8e3d9c5ea3 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -12043,4 +12043,12 @@
proname => 'any_value_transfn', prorettype => 'anyelement',
proargtypes => 'anyelement anyelement', prosrc => 'any_value_transfn' },
+{ oid => '6393', descr => 'inject delay during checkpoint',
+ proname => 'pg_test_inject_checkpoint_delay', provolatile => 'v', prorettype => 'void',
+ proargtypes => 'bool', prosrc => 'pg_test_inject_checkpoint_delay' },
+
+{ oid => '6394', descr => 'inject panic during checkpoint',
+ proname => 'pg_test_inject_checkpoint_panic', provolatile => 'v', prorettype => 'void',
+ proargtypes => 'bool', prosrc => 'pg_test_inject_checkpoint_panic' },
+
]
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index a66722873f..3b845680ff 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -42,4 +42,7 @@ extern void CheckpointerShmemInit(void);
extern bool FirstCallSinceLastCheckpoint(void);
+extern bool IsCheckpointDelayed(void);
+extern bool IsCheckpointPanicInjected(void);
+
#endif /* _BGWRITER_H */
diff --git a/src/test/recovery/t/037_add_duplicate_2pc_during_recovery.pl b/src/test/recovery/t/037_add_duplicate_2pc_during_recovery.pl
new file mode 100644
index 0000000000..9896c65bf9
--- /dev/null
+++ b/src/test/recovery/t/037_add_duplicate_2pc_during_recovery.pl
@@ -0,0 +1,68 @@
+
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+
+# Tests add duplicate two-phase transaction during recovery
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More tests => 1;
+
+# Setup primary node
+my $node_primary = PostgreSQL::Test::Cluster->new("primary");
+$node_primary->init;
+$node_primary->append_conf(
+ 'postgresql.conf', qq(
+ max_prepared_transactions = 10
+ log_checkpoints = true
+ checkpoint_timeout = 3000
+));
+$node_primary->start;
+
+$node_primary->psql('postgres', "create table t_test_2pc(id int, msg text)");
+$node_primary->psql('postgres', "select pg_test_inject_checkpoint_delay(true)");
+
+# start do checkpoint1
+my $host = $node_primary->host;
+my $port = $node_primary->port;
+`nohup psql -h $host -p $port -c 'checkpoint' >tmp_check/test.file 2>&1 &`;
+
+# prepare transaction during checkpoint1
+$node_primary->psql(
+ 'postgres', "
+ BEGIN;
+ INSERT INTO t_test_2pc VALUES (1, 'test add duplicate 2pc');
+ PREPARE TRANSACTION 'xact_test_1';");
+
+# reset delay flag to finish checkpoint1
+$node_primary->psql('postgres', "select pg_test_inject_checkpoint_delay(false)");
+
+my $logdir = $node_primary->logfile;
+my $found = 0;
+my @result;
+while ($found == 0)
+{
+ @result = `grep -rn "checkpoint complete" $logdir`;
+ $found = @result;
+}
+print "checkpoint1 finished\n";
+
+# inject panic during checkpoint2
+$node_primary->psql('postgres', "select pg_test_inject_checkpoint_panic(true)");
+$node_primary->psql('postgres', "checkpoint");
+
+$found = 0;
+while ($found == 0)
+{
+ @result = `grep -rn "terminating any other active server processes" $logdir`;
+ $found = @result;
+}
+
+@result = `grep -rn "panic is injected during checkpoint" $logdir`;
+$found = @result;
+ok($found == 1, "inject panic success");
+
+# start will fail due to error in lock_twophase_recover()
+$node_primary->_update_pid(0);
+$node_primary->start;
\ No newline at end of file
--
2.19.1.6.gb485710b
On Wed, Jul 12, 2023 at 03:20:57PM +0800, suyu.cmj wrote:
Yes, this bug can also be reproduced on the master branch, and the
corresponding reproduction patch is attached.
That's an interesting reproducer with injection points. It looks like
you've spent a lot of time investigating that. So, basically, a
checkpoint fails after writing a 2PC file to disk, but before the redo
LSN has been updated.
I also considered comparing the 2pc.prepare_start_lsn and
checkpoint.redo in ProcessTwoPhaseBuffer before, but this method
requires modifying the format of the 2pc checkpoint file, which will
bring compatibility issues. Especially for released branches,
assuming that a node has encountered this bug, it will not be able
to start successfully due to FATAL during crash recovery, and
therefore cannot manually commit previous two-phase
transactions. Using magic number to distinguish 2pc checkpoint file
versions can't solve the problem in the above scenario either.
For unreleased branches, writing 2pc.prepare_start_lsn into the
checkpoint file will be a good solution, but for released branches,
Yes, changing anything in this format is a no-go. Now, things could
be written so that the recovery code is able to handle multiple formats,
meaning that it could read from a new format that includes an LSN or
something else for the comparison, but that would not save us from the
case where 2PC files with the old format are still around and a 2PC WAL
record is replayed.
I personally think using WAL record to overwrite checkpoint data
would be a more reasonable approach, What do you think?
The O(N) loop added in PrepareRedoAdd() to scan the set of 2PC
transactions stored in TwoPhaseState for the purpose of checking for a
duplicate sucks from a performance point of view, particularly for
deployments with many 2PC transactions allowed. It could delay
recovery a lot. And actually, this is not completely correct, no?
It is OK to bypass the recovery of the same transaction if the server
has not reached a consistent state, but getting a duplicate once
consistency has been reached should lead to a hard failure.
One approach to avoid this O(N) scan would be to use a hash table to store
the 2PC entries, for example, rather than an array. That would be
simple enough, but such refactoring is rather scary from the point of
view of recovery.
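(For illustration only, a rough sketch of what that might look like with the backend's dynahash API; TwoPhaseStateHash and TwoPhaseHashEntry are hypothetical names made up here, not an actual proposal:)
    /* hypothetical sketch: key the prepared-transaction entries by xid */
    typedef struct TwoPhaseHashEntry
    {
        TransactionId   xid;        /* hash key */
        GlobalTransaction gxact;    /* pointer into TwoPhaseState->prepXacts */
    } TwoPhaseHashEntry;

    static HTAB *TwoPhaseStateHash;     /* hypothetical shared hash table */

    HASHCTL     ctl;

    memset(&ctl, 0, sizeof(ctl));
    ctl.keysize = sizeof(TransactionId);
    ctl.entrysize = sizeof(TwoPhaseHashEntry);
    TwoPhaseStateHash = ShmemInitHash("Prepared Transactions Hash",
                                      max_prepared_xacts, max_prepared_xacts,
                                      &ctl, HASH_ELEM | HASH_BLOBS);

    /* the duplicate check in PrepareRedoAdd() then becomes a single lookup */
    bool        found;

    (void) hash_search(TwoPhaseStateHash, &hdr->xid, HASH_FIND, &found);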
And, actually, we could do something much simpler than what's
been proposed on this thread. PrepareRedoAdd() would be called when
scanning pg_twophase at the beginning of recovery, or when replaying a
PREPARE record, both aiming at adding an entry in shmem for the 2PC
transaction tracked. Here is a simpler idea: why don't we just check
in PrepareRedoAdd() whether the 2PC file of the transaction being recovered
is in pg_twophase/ when adding an entry from a WAL record? If a
consistent point has *not* been reached by recovery and we find a file
on disk, then do nothing, because we *know* thanks to
restoreTwoPhaseData() done at the beginning of recovery that there is
an entry for this file. If a consistent point has been reached in
recovery and we find a file on disk while replaying a WAL record for
the same 2PC file, then fail. If there is no file in pg_twophase for
the record replayed, then add it to the array TwoPhaseState.
Adding an O(N) loop that checks for duplicates may be a good idea as a
cross-check when replaying a record, but I'd rather put that under
USE_ASSERT_CHECKING so that there is no impact on production systems,
while we still have some sanity checks for test setups.
--
Michael
Yes, the method you proposed is simpler and more efficient. Following your idea, I have modified the corresponding patch; I hope you can review it when you have time.
Best Regards
suyu.cmj
Attachments:
v2-0001-Fix-a-2PC-transaction-maybe-recovered-twice_master.patch (application/octet-stream)
From e66b8c3285331537b09df577f4d487847e085a36 Mon Sep 17 00:00:00 2001
From: "suyu.cmj" <mengjuan.cmj@alibaba-inc.com>
Date: Mon, 17 Jul 2023 05:52:59 +0000
Subject: [PATCH] Fix the bug of a 2PC transaction maybe recovered twice
During recovery, a two-phase transaction should be restored either from
the disk file or from the WAL. However, a two-phase transaction may be
restored from both if we crashed after writing a 2PC file to disk but
before the redo LSN was updated in pg_control during a checkpoint. We
check whether the 2PC file of the transaction being recovered is already
in pg_twophase/ when adding an entry from a WAL record: if a consistent
point has not been reached by recovery and we find a file on disk, do
nothing, because we know restoreTwoPhaseData() already handled it at the
beginning of recovery. If a consistent point has been reached in recovery
and we find a file on disk while replaying a WAL record for the same 2PC
file, then fail. If there is no file in pg_twophase/ for the record
replayed, then add it to the array.
---
src/backend/access/transam/twophase.c | 52 +++++++++++++++++++++++++++
1 file changed, 52 insertions(+)
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 068e59bec0..9971a78d30 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -86,6 +86,7 @@
#include "access/xlog.h"
#include "access/xloginsert.h"
#include "access/xlogreader.h"
+#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
#include "catalog/pg_type.h"
#include "catalog/storage.h"
@@ -2459,6 +2460,8 @@ PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
char *bufptr;
const char *gid;
GlobalTransaction gxact;
+ char path[MAXPGPATH];
+ struct stat stat_buf;
Assert(LWLockHeldByMeInMode(TwoPhaseStateLock, LW_EXCLUSIVE));
Assert(RecoveryInProgress());
@@ -2477,6 +2480,55 @@ PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
* that it got added in the redo phase
*/
+ /*
+ * During recovery, the two-phase data can be restored in two ways:
+ * 1) restored from disk file when its xid < checkPoint.nextxid;
+ * 2) restored from the WAL record when its prepare_start_lsn > checkPoint.redo.
+ * A two-phase transaction can be restored from either disk file or WAL,
+ * but can't both. However, a 2PC may satisfy above two conditions if
+ * we crashed after writing a 2PC file to disk, but before the redo LSN
+ * has been updated to pg_control during doing a checkpoint.
+ *
+ * We check if the 2PC file of the transaction being recovery is already
+ * in pg_twophase/ when adding an entry from a WAL record, if a consistent
+ * point has not been reached by recovery and we find a file on disk, then
+ * do nothing because we know restoreTwoPhaseData() has done that at the
+ * beginning of recovery. If a consistent point has been reached in recovery
+ * and we find a file on disk while replaying a WAL record for the same 2PC
+ * file, then fail. If there is no file in pg_twophase/ for the record
+ * replayed, then add it to the array.
+ */
+ if (!XLogRecPtrIsInvalid(start_lsn))
+ {
+#ifdef USE_ASSERT_CHECKING
+ int i;
+#endif
+ TwoPhaseFilePath(path, hdr->xid);
+
+ if (stat(path, &stat_buf) == 0)
+ {
+ Assert(S_ISREG(stat_buf.st_mode));
+
+ if (reachedConsistency)
+ ereport(FATAL,
+ (errmsg("found unexpected duplicate two-phase transaction:%u in pg_twophase, check for data correctness.",
+ hdr->xid)));
+ else
+ {
+ ereport(WARNING,
+ (errmsg("found duplicate two-phase transaction:%u in pg_twophase, skip adding this transaction.",
+ hdr->xid)));
+ return;
+ }
+ }
+
+#ifdef USE_ASSERT_CHECKING
+ /* cross-check for duplicates in array during replay */
+ for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+ Assert(TwoPhaseState->prepXacts[i]->xid != hdr->xid);
+#endif
+ }
+
/* Get a free gxact from the freelist */
if (TwoPhaseState->freeGXacts == NULL)
ereport(ERROR,
--
2.19.1.6.gb485710b
On Mon, Jul 17, 2023 at 02:26:56PM +0800, suyu.cmj wrote:
Yes, the method you proposed is simpler and more
efficient. Following your idea, I have modified the corresponding
patch, hope you can review it when you have time.
I'll double-check that tomorrow, but yes, that's basically what I had
in mind. Thanks for the patch!
+ char path[MAXPGPATH];
+ struct stat stat_buf;
These two variables can be declared in the code block added by the
patch where start_lsn is valid.
+ ereport(FATAL,
+         (errmsg("found unexpected duplicate two-phase transaction:%u in pg_twophase, check for data correctness.",
+                 hdr->xid)));
The last part of this sentence is not needed IMO, because it is
misleading when building without assertions. How about a single
FATAL/WARNING like this:
- errmsg: "could not recover two-phase state file for transaction %u"
- errdetail: "Two-phase state file has been found in WAL record %X/%X
but this transaction has already been restored from disk."
Then a WARNING simply means that we've skipped the record entirely.
--
Michael
Thanks for the feedback! I have updated the patch; I hope you can check it.
Best Regards
suyu.cmj
Attachments:
v3-0001-Fix-a-2PC-transaction-maybe-recovered-twice_master.patch (application/octet-stream)
From a67bb9034800fcc2c7790a1547a2ddd8d4ec18ae Mon Sep 17 00:00:00 2001
From: "suyu.cmj" <mengjuan.cmj@alibaba-inc.com>
Date: Mon, 17 Jul 2023 05:52:59 +0000
Subject: [PATCH] Fix the bug of a 2PC transaction maybe recovered twice
During recovery, a two-phase transaction should be restored either from
the disk file or from the WAL. However, a two-phase transaction may be
restored from both if we crashed after writing a 2PC file to disk but
before the redo LSN was updated in pg_control during a checkpoint. We
check whether the 2PC file of the transaction being recovered is already
in pg_twophase/ when adding an entry from a WAL record: if a consistent
point has not been reached by recovery and we find a file on disk, do
nothing, because we know restoreTwoPhaseData() already handled it at the
beginning of recovery. If a consistent point has been reached in recovery
and we find a file on disk while replaying a WAL record for the same 2PC
file, then fail. If there is no file in pg_twophase/ for the record
replayed, then add it to the array.
---
src/backend/access/transam/twophase.c | 47 +++++++++++++++++++++++++++
1 file changed, 47 insertions(+)
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 068e59bec0..0be105bbbc 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -86,6 +86,7 @@
#include "access/xlog.h"
#include "access/xloginsert.h"
#include "access/xlogreader.h"
+#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
#include "catalog/pg_type.h"
#include "catalog/storage.h"
@@ -2477,6 +2478,52 @@ PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
* that it got added in the redo phase
*/
+ /*
+ * During recovery, the two-phase data can be restored in two ways:
+ * 1) restored from disk file when its xid < checkPoint.nextxid;
+ * 2) restored from the WAL record when its prepare_start_lsn > checkPoint.redo.
+ * A two-phase transaction can be restored from either disk file or WAL,
+ * but can't both. However, a 2PC may satisfy above two conditions if
+ * we crashed after writing a 2PC file to disk, but before the redo LSN
+ * has been updated to pg_control during doing a checkpoint.
+ *
+ * We check if the 2PC file of the transaction being recovery is already
+ * in pg_twophase/ when adding an entry from a WAL record, if a consistent
+ * point has not been reached by recovery and we find a file on disk, then
+ * do nothing because we know restoreTwoPhaseData() has done that at the
+ * beginning of recovery. If a consistent point has been reached in recovery
+ * and we find a file on disk while replaying a WAL record for the same 2PC
+ * file, then fail. If there is no file in pg_twophase/ for the record
+ * replayed, then add it to the array.
+ */
+ if (!XLogRecPtrIsInvalid(start_lsn))
+ {
+ char path[MAXPGPATH];
+ struct stat stat_buf;
+#ifdef USE_ASSERT_CHECKING
+ int i;
+#endif
+
+ TwoPhaseFilePath(path, hdr->xid);
+
+ if (stat(path, &stat_buf) == 0)
+ {
+ Assert(S_ISREG(stat_buf.st_mode));
+ ereport((reachedConsistency ? FATAL : WARNING),
+ (errmsg("could not recover two-phase state file for transaction %u", hdr->xid),
+ errdetail("Two-phase state file has been found in WAL record %X/%X,"
+ " but this transaction has already been restored from disk.",
+ LSN_FORMAT_ARGS(start_lsn))));
+ return;
+ }
+
+#ifdef USE_ASSERT_CHECKING
+ /* cross-check for duplicates in array */
+ for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+ Assert(TwoPhaseState->prepXacts[i]->xid != hdr->xid);
+#endif
+ }
+
/* Get a free gxact from the freelist */
if (TwoPhaseState->freeGXacts == NULL)
ereport(ERROR,
--
2.19.1.6.gb485710b
On Mon, Jul 17, 2023 at 05:20:00PM +0800, suyu.cmj wrote:
Thanks for the feedback! I have updated the patch, hope you can check it.
I have looked at v3, and noticed that the stat() call is actually a
bit incorrect in its error handling, because it would miss any errors
that happen when checking for the existence of the file. The only
error that we should safely expect is ENOENT, for a missing entry.
All the others had better fail like the other code paths restoring 2PC
entries from the shared state. In the end, I have made the choice of
relying only on access() to check for the existence of the file, as this
is an internal path, and simplified the comment a bit. Finally, I have
made the choice to remove the assert-only check. After sleeping on it,
it would have value in very limited cases while a bunch of recovery cases
would take a hit, including developers with large 2PC setups, so there
are a lot of downsides with limited upsides.
--
Michael
Hi, all. I would like to sync up on an issue I've recently encountered.
Previously, the following logic was added to PrepareRedoAdd() when fixing this 2PC recovery bug:
+ if (!XLogRecPtrIsInvalid(start_lsn))
+ {
+ char path[MAXPGPATH];
+
+ TwoPhaseFilePath(path, hdr->xid);
+
+ if (access(path, F_OK) == 0)
+ {
+ ereport(reachedConsistency ? ERROR : WARNING,
+ (errmsg("could not recover two-phase state file for transaction %u",
+ hdr->xid),
+ errdetail("Two-phase state file has been found in WAL record %X/%X, but thi
s transaction has already been restored from disk.",
+ LSN_FORMAT_ARGS(start_lsn))));
+ return;
+ }
+ }
Every time we add a 2PC transaction from a WAL record, we check whether the corresponding 2PC file exists on disk. Once a consistent state is reached, it is highly likely that there is no corresponding file on disk, so the access() call is most often probing a file that does not exist. Each time a file is accessed, the OS creates a dentry for it and increases the reference count of the parent directory's dentry. The type of the dentry reference count is int32. When there are many 2PC transactions, this logic may lead to an overflow of the parent dentry's reference count. When a dentry is reclaimed, for example during memory reclamation or when the directory is deleted, the overflowed reference count can ultimately cause the OS to crash.
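For scale (a rough back-of-envelope figure based only on the description above, assuming the kernel keeps one negative dentry cached per distinct missing file name): the count is a signed 32-bit integer, so on the order of 2^31, roughly 2.1 billion, replayed PREPARE records probing distinct nonexistent files under pg_twophase/ would be needed before the parent directory's reference count wraps around.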
I am considering whether the logic in this part can be modified as follows:
+ if (!XLogRecPtrIsInvalid(start_lsn) && !reachedConsistency)
+ {
+ char path[MAXPGPATH];
+
+ TwoPhaseFilePath(path, hdr->xid);
+
+ if (access(path, F_OK) == 0)
+ {
+ ereport(WARNING,
+ (errmsg("could not recover two-phase state file for transaction %u",
+ hdr->xid),
+ errdetail("Two-phase state file has been found in WAL record %X/%X, but thi
s transaction has already been restored from disk.",
+ LSN_FORMAT_ARGS(start_lsn))));
+ return;
+ }
+ }
Currently, restoreTwoPhaseData() is the only function that restores 2PC transactions from disk, and it runs before recovery starts, so after a consistent state has been reached the 2PC transactions are only added from the WAL. Under normal circumstances, there should not be any corresponding 2PC files on disk at that point. Therefore, I prefer to perform the file access check only while the server has not yet reached a consistent state. Once consistency has been reached, if a duplicate 2PC transaction is added, it will directly result in an error in the subsequent replay logic.
Do you think this modification is feasible? I look forward to your feedback.
Best Regards
suyu.cmj
On Wed, Jul 09, 2025 at 01:51:03PM +0800, suyu.cmj wrote:
Currently, since restoreTwoPhaseData() is the only function that
restores 2PC transactions from disk before the recovery starts,
after reaching a consistent state, the 2PC transactions are only
added from the WAL. Under normal circumstances, there should not be
any corresponding 2PC files on the storage at that point. Therefore,
I prefer to perform the file access checks only when the server has
not yet reached a consistent state. Once consistency has been
reached, if a duplicate 2PC transaction is added, it will directly
result in an error in the subsequent replay logic.
There is a bit more going on in this code that needs to be fixed.
Please see the following thread, which also relates to the state of the
2PC recovery code when we read these files from disk at the beginning of
recovery, with the bottom part being the most relevant:
/messages/by-id/Z5sd5O9JO7NYNK-C@paquier.xyz
--
Michael
Hi Michael,
Thank you for your reply. I reviewed the thread you mentioned, and it seems that the issue to be fixed there is not the same as the one I previously raised.
I am considering whether the 2PC file check logic in PrepareRedoAdd() can be modified. Currently, each time an XLOG_XACT_PREPARE WAL record is replayed, the code checks for the corresponding 2PC file using access(). Each access operation creates a dentry in the OS, and in most cases the file being accessed does not exist. When there are many 2PC transactions, this logic may lead to an increase in OS slab memory. In worse scenarios, it could cause the reference count of the parent directory's dentry to overflow, potentially leading to an OS crash. Typically, when accessing files that do exist, the disk fills up before the dentry reference count overflows.
Therefore, I would like to propose a modification. Attached is my patch for your review, and I hope you can take a look at it.
Best regards,
suyu.cmj
Attachments:
0001-Fix-2PC-file-check-logic-in-PrepareRedoAdd.patch (application/octet-stream)
From c2e603f95a0b66a924e854fc2a0cb08b75eb55ea Mon Sep 17 00:00:00 2001
From: "suyu.cmj" <mengjuan.cmj@alibaba-inc.com>
Date: Mon, 14 Jul 2025 01:59:26 +0000
Subject: [PATCH] Fix 2PC file check logic in PrepareRedoAdd()
In commit cb0cca1880723b4c90c56cdcf3842489ef036800, a 2PC state
file check was added to PrepareRedoAdd() to fix a 2PC recovery
bug. Each time a file is accessed,
the OS creates a dentry for it and increases the reference
count of the parent directory's dentry. The type of the dentry
reference count is int32. When there are many 2PC transactions,
this logic may lead to an overflow of the parent dentry's
reference count, which could ultimately cause the OS to crash.
Since the 2PC data on disk is only restored at the beginning
of recovery with restoreTwoPhaseData(), after reaching a
consistent state, the 2PC transactions are only added from the
WAL. So we perform the file access check only when the server
has not yet reached a consistent state. Once consistency has
been reached, if a duplicate 2PC transaction is added, it will
directly result in an error in the subsequent replay logic.
---
src/backend/access/transam/twophase.c | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 85cbe397cb..b69aca0977 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -2520,8 +2520,16 @@ PrepareRedoAdd(FullTransactionId fxid, char *buf,
* duplicates in TwoPhaseState. If a consistent state has been reached,
* the record is added to TwoPhaseState and it should have no
* corresponding file in pg_twophase.
+ *
+ * When there are many 2PC transactions, frequently accessing empty files
+ * may lead to an increase in OS slab memory. In worse cases, this could
+ * cause the reference count of the directory dentry to overflow,
+ * potentially resulting in an OS crash. Therefore, this check is only
+ * performed when a consistent state has not yet been reached. Once
+ * consistency has been reached, if a duplicate 2PC transaction is added,
+ * it will directly result in an error in the subsequent replay logic.
*/
- if (!XLogRecPtrIsInvalid(start_lsn))
+ if (!XLogRecPtrIsInvalid(start_lsn) && !reachedConsistency)
{
char path[MAXPGPATH];
@@ -2530,7 +2538,7 @@ PrepareRedoAdd(FullTransactionId fxid, char *buf,
if (access(path, F_OK) == 0)
{
- ereport(reachedConsistency ? ERROR : WARNING,
+ ereport(WARNING,
(errmsg("could not recover two-phase state file for transaction %u",
hdr->xid),
errdetail("Two-phase state file has been found in WAL record %X/%08X, but this transaction has already been restored from disk.",
--
2.27.0
On Mon, Jul 14, 2025 at 02:46:25PM +0800, CAI, Mengjuan wrote:
Thank you for your reply. I reviewed the thread you mentioned, and
it seems that the issue needing to be fixed is not the same as the
one I previously raised.
I am considering whether the 2PC file check logic in
PrepareRedoAdd() can be modified. Currently, each time a
XLOG_XACT_PREPARE WAL entry is replayed, it checks for the
corresponding 2PC file using access(). Each access operation creates
a dentry in the OS, and in most cases, the file being accessed does
not exist. When there are many 2PC transactions, this logic may lead
to an increase in OS slab memory. In worse scenarios, it could cause
the reference count of the parent directory's dentry to overflow,
potentially leading to an OS crash. Typically, when accessing
existing files, the disk will fill up before the dentry reference
count overflows.
Therefore, I would like to propose a modification. Attached is my
patch for your review, and I hope you can take a look at it.
@@ -2520,8 +2520,16 @@ PrepareRedoAdd(FullTransactionId fxid, char *buf,
[...]
- if (!XLogRecPtrIsInvalid(start_lsn))
+ if (!XLogRecPtrIsInvalid(start_lsn) && !reachedConsistency)
Actually, what you are doing is incorrect, because we could miss some
ERRORs, for example if a base backup was incorrect and some files were
present in pg_twophase?
It's not really true that what you are changing here has no
interaction with the beginning of recovery; the other thread is about
the fact that reading the 2PC files from disk when !reachedConsistency
is a bad concept that we should avoid, which impacts the assumption that
the code path you are changing here relies on. In the end, it may be
possible that we're able to remove this check entirely.
--
Michael
Actually, what you are doing is incorrect, because we could miss some
ERRORs, for example if a base backup was incorrect and some files were
present in pg_twophase?
Yes, you are right; a base backup with incorrect files would be a problem.
It's not really true that what you are changing here has no
interaction with the beginning of recovery; the other thread is about
the fact that reading the 2PC files from disk when !reachedConsistency
is a bad concept that we should avoid, which impacts the assumption that
the code path you are changing here relies on. In the end, it may be
possible that we're able to remove this check entirely.
Is there any recent discussion or plan regarding this? I noticed that in the latest code branch, the 2PC data on disk is still restored when !reachedConsistency. If there are any updates or discussions, I would greatly appreciate it if you could let me know.
Best regards