Got FATAL in lock_twophase_recover() during recovery
Hi, all.
I want to report a bug about the recovery of two-phase transaction, in current implementation of crash recovery, there are two ways to recover 2pc data:
1、before redo, func restoreTwoPhaseData() will restore 2pc data those xid < ShmemVariableCache->nextXid, which is initialized from checkPoint.nextXid;
2、during redo, func xact_redo() will add 2pc from wal;
The following scenario may cause the same 2pc transaction to be added repeatedly, I have attached a patch for pg11 that reproduces the error:
1、start creating checkpoint_1, checkpoint_1.redo is set as curInsert;
2、before set checkPoint_1.nextXid, a new 2pc is prepared, suppose the xid of this 2pc is 100, and then ShmemVariableCache->nextXid will be advanced as 101;
3、checkPoint_1.nextXid is set as 101;
4、in CheckPointTwoPhase() of checkpoint_1, 2pc_100 won't be copied to disk because its prepare_end_lsn > checkpoint_1.redo;
5、checkPoint_1 is finished, after checkpoint_timeout, start creating checkpoint_2;
6、during checkpoint_2, data of 2pc_100 will be copied to disk;
7、before UpdateControlFile() of checkpoint_2, crash happened;
8、during crash recovery, redo will start from checkpoint_1, and 2pc_100 will be restored first by restoreTwoPhaseData() because xid_100 < checkPoint_1.nextXid, which is 101;
9、because prepare_start_lsn of 2pc_100 > checkpoint_1.redo, 2pc_100 will be added again by xact_redo() during wal replay, resulting in the same 2pc data being added twice;
10、In RecoverPreparedTransactions() -> lock_twophase_recover(), lock the same 2pc will cause FATAL.
After running the patch that reproduced the error, you will get the following error during crash recovery:
2023-07-10 13:04:30.670 UTC [11169] LOG: recovering prepared transaction 569 from shared memory
2023-07-10 13:04:30.670 UTC [11169] LOG: recovering prepared transaction 569 from shared memory
2023-07-10 13:04:30.670 UTC [11169] FATAL: lock ExclusiveLock on object 569/0/0 is already held
2023-07-10 13:04:30.670 UTC [11168] LOG: startup process (PID 11169) exited with exit code 1
2023-07-10 13:04:30.670 UTC [11168] LOG: aborting startup due to startup process failure
I also added a patch for pg11 to fix this problem, hope you can check it when you have time.
Thanks & Best Regard
Attachments:
0001-Reproduce-the-error-in-lock_twophase_recover_11.patchapplication/octet-streamDownload
From 90cedbdcadfe6e68ba2c0e74ba69a626fd402964 Mon Sep 17 00:00:00 2001
From: "suyu.cmj" <mengjuan.cmj@alibaba-inc.com>
Date: Mon, 10 Jul 2023 13:09:09 +0000
Subject: [PATCH] Reproduce the error in lock_twophase_recover()
---
src/backend/access/transam/xlog.c | 10 ++++
src/backend/postmaster/checkpointer.c | 51 +++++++++++++++++++
src/include/catalog/pg_proc.dat | 8 +++
src/include/postmaster/bgwriter.h | 3 ++
..._add_duplicate_twophase_during_recovery.pl | 50 ++++++++++++++++++
5 files changed, 122 insertions(+)
create mode 100644 src/test/recovery/t/034_add_duplicate_twophase_during_recovery.pl
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 1174dd5beb..9c6f287267 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -9072,6 +9072,11 @@ CreateCheckPoint(int flags)
TRACE_POSTGRESQL_CHECKPOINT_START(flags);
+ /* Test inject fault */
+ while (IsCheckpointDelayed())
+ pg_usleep(1000000);
+ /* Test inject fault end */
+
/*
* Get the other info we need for the checkpoint record.
*
@@ -9169,6 +9174,11 @@ CreateCheckPoint(int flags)
}
pfree(vxids);
+ /* Test inject fault */
+ if (IsCheckpointPanicInjected())
+ elog(PANIC, "panic is injected during checkpoint");
+ /* Test inject fault end */
+
/*
* Take a snapshot of running transactions and write this to WAL. This
* allows us to reconstruct the state of running transactions during
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index fb912c0381..adcecd902e 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -126,6 +126,9 @@ typedef struct
int ckpt_flags; /* checkpoint flags, as defined in xlog.h */
+ bool is_ckpt_delayed; /* add for test */
+ bool is_ckpt_panic_injected; /* add for test */
+
uint32 num_backend_writes; /* counts user backend buffer writes */
uint32 num_backend_fsync; /* counts user backend fsync calls */
@@ -1392,3 +1395,51 @@ FirstCallSinceLastCheckpoint(void)
return FirstCall;
}
+
+bool
+IsCheckpointDelayed(void)
+{
+ bool is_delayed = false;
+
+ SpinLockAcquire(&CheckpointerShmem->ckpt_lck);
+ is_delayed = CheckpointerShmem->is_ckpt_delayed;
+ SpinLockRelease(&CheckpointerShmem->ckpt_lck);
+
+ return is_delayed;
+}
+
+bool
+IsCheckpointPanicInjected(void)
+{
+ bool inject_panic = false;
+
+ SpinLockAcquire(&CheckpointerShmem->ckpt_lck);
+ inject_panic = CheckpointerShmem->is_ckpt_panic_injected;
+ SpinLockRelease(&CheckpointerShmem->ckpt_lck);
+
+ return inject_panic;
+}
+
+Datum
+pg_test_inject_checkpoint_delay(PG_FUNCTION_ARGS)
+{
+ bool set_delay = PG_GETARG_BOOL(0);
+
+ SpinLockAcquire(&CheckpointerShmem->ckpt_lck);
+ CheckpointerShmem->is_ckpt_delayed = set_delay;
+ SpinLockRelease(&CheckpointerShmem->ckpt_lck);
+
+ PG_RETURN_VOID();
+}
+
+Datum
+pg_test_inject_checkpoint_panic(PG_FUNCTION_ARGS)
+{
+ bool inject_panic = PG_GETARG_BOOL(0);
+
+ SpinLockAcquire(&CheckpointerShmem->ckpt_lck);
+ CheckpointerShmem->is_ckpt_panic_injected = inject_panic;
+ SpinLockRelease(&CheckpointerShmem->ckpt_lck);
+
+ PG_RETURN_VOID();
+}
\ No newline at end of file
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 658229ce77..10ec7b4d32 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -10206,4 +10206,12 @@
proisstrict => 'f', prorettype => 'bool', proargtypes => 'oid int4 int4 any',
proargmodes => '{i,i,i,v}', prosrc => 'satisfies_hash_partition' },
+{ oid => '5058', descr => 'inject delay during checkpoint',
+ proname => 'pg_test_inject_checkpoint_delay', provolatile => 'v', prorettype => 'void',
+ proargtypes => 'bool', prosrc => 'pg_test_inject_checkpoint_delay' },
+
+{ oid => '5059', descr => 'inject panic during checkpoint',
+ proname => 'pg_test_inject_checkpoint_panic', provolatile => 'v', prorettype => 'void',
+ proargtypes => 'bool', prosrc => 'pg_test_inject_checkpoint_panic' },
+
]
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index 941c6aba7d..40e1eb00f2 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -40,4 +40,7 @@ extern void CheckpointerShmemInit(void);
extern bool FirstCallSinceLastCheckpoint(void);
+extern bool IsCheckpointDelayed(void);
+extern bool IsCheckpointPanicInjected(void);
+
#endif /* _BGWRITER_H */
diff --git a/src/test/recovery/t/034_add_duplicate_twophase_during_recovery.pl b/src/test/recovery/t/034_add_duplicate_twophase_during_recovery.pl
new file mode 100644
index 0000000000..acd16cb494
--- /dev/null
+++ b/src/test/recovery/t/034_add_duplicate_twophase_during_recovery.pl
@@ -0,0 +1,50 @@
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 1;
+
+# Setup primary node
+my $node_primary = get_new_node("primary");
+$node_primary->init;
+$node_primary->append_conf(
+ 'postgresql.conf', qq(
+ max_prepared_transactions = 10
+ log_checkpoints = true
+ checkpoint_timeout = 3000
+));
+$node_primary->start;
+
+$node_primary->psql('postgres', "create table t_test_2pc(id int, msg text)");
+$node_primary->psql('postgres', "select pg_test_inject_checkpoint_delay(true)");
+
+# start do checkpoint1
+my $host = $node_primary->host;
+my $port = $node_primary->port;
+`nohup psql -h $host -p $port -c 'checkpoint' >test.file 2>&1 &`;
+
+# prepare transaction during checkpoint1
+$node_primary->psql(
+ 'postgres', "
+ BEGIN;
+ INSERT INTO t_test_2pc VALUES (1, 'test add duplicate 2pc');
+ PREPARE TRANSACTION 'xact_test_1';");
+
+# reset delay flag to finish checkpoint1
+$node_primary->psql('postgres', "select pg_test_inject_checkpoint_delay(false)");
+sleep 3;
+
+# inject panic during checkpoint2
+$node_primary->psql('postgres', "select pg_test_inject_checkpoint_panic(true)");
+$node_primary->psql('postgres', "checkpoint");
+
+sleep 3;
+my $logdir = $node_primary->logfile;
+my @res = `grep -rn "panic is injected during checkpoint" $logdir`;
+my $found = @res;
+ok($found == 1, "inject panic success");
+
+# start will fail due to error in lock_twophase_recover()
+$node_primary->_update_pid(0);
+$node_primary->start;
\ No newline at end of file
--
2.19.1.6.gb485710b
v1-0001-Fix-a-2PC-transaction-maybe-recovered-twice_11.patchapplication/octet-streamDownload
From cae8ef31feb4c8dc8beb3c7422548955455a35c9 Mon Sep 17 00:00:00 2001
From: "suyu.cmj" <mengjuan.cmj@alibaba-inc.com>
Date: Fri, 7 Jul 2023 08:44:49 +0000
Subject: [PATCH] Fix the bug of a 2PC transaction maybe recovered twice
During recovery, a two-phase transaction should be restored
either from the disk file or from the WAL. However, a two-phase
transaction maybe restored from both way if we crashed during
doing a checkpoint. Considering that the data on disk file may
not be reliable, so in this case, we only store the transaction
data recorded in the WAL.
---
src/backend/access/transam/twophase.c | 61 +++++++++++++++++++++------
1 file changed, 49 insertions(+), 12 deletions(-)
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index d145836a79..629b79401c 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -2491,6 +2491,8 @@ PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
char *bufptr;
const char *gid;
GlobalTransaction gxact;
+ bool addnewgxact = true;
+ int i;
Assert(LWLockHeldByMeInMode(TwoPhaseStateLock, LW_EXCLUSIVE));
Assert(RecoveryInProgress());
@@ -2509,15 +2511,54 @@ PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
* that it got added in the redo phase
*/
+ /*
+ * During recovery, the two-phase data can be added in two ways:
+ * 1) restored from disk file when its xid < checkPoint.nextxid,
+ * 2) restored from the WAL when its prepare_start_lsn > checkPoint.redo,
+ * A two-phase transaction may satisfy above two conditions if we
+ * crashed during doing a checkpoint. Considering that the data
+ * on disk file may not be reliable, so in this case, we only store
+ * the transaction data recorded in the WAL.
+ */
+ for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+ {
+ GlobalTransaction curxact = TwoPhaseState->prepXacts[i];
+
+ if (curxact->xid == hdr->xid)
+ {
+ if (curxact->ondisk && !XLogRecPtrIsInvalid(start_lsn))
+ {
+ gxact = curxact;
+ addnewgxact = false;
+ ereport(WARNING,
+ (errmsg("found duplicate two-phase transaction:%u, store data from the WAL",
+ hdr->xid)));
+ break;
+ }
+ else
+ ereport(ERROR,
+ (errmsg("found unexpected duplicate two-phase transaction:%u, check for data correctness.",
+ hdr->xid)));
+ }
+ }
+
/* Get a free gxact from the freelist */
- if (TwoPhaseState->freeGXacts == NULL)
- ereport(ERROR,
- (errcode(ERRCODE_OUT_OF_MEMORY),
- errmsg("maximum number of prepared transactions reached"),
- errhint("Increase max_prepared_transactions (currently %d).",
- max_prepared_xacts)));
- gxact = TwoPhaseState->freeGXacts;
- TwoPhaseState->freeGXacts = gxact->next;
+ if (addnewgxact)
+ {
+ if (TwoPhaseState->freeGXacts == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OUT_OF_MEMORY),
+ errmsg("maximum number of prepared transactions reached"),
+ errhint("Increase max_prepared_transactions (currently %d).",
+ max_prepared_xacts)));
+
+ gxact = TwoPhaseState->freeGXacts;
+ TwoPhaseState->freeGXacts = gxact->next;
+
+ /* And insert it into the active array */
+ Assert(TwoPhaseState->numPrepXacts < max_prepared_xacts);
+ TwoPhaseState->prepXacts[TwoPhaseState->numPrepXacts++] = gxact;
+ }
gxact->prepared_at = hdr->prepared_at;
gxact->prepare_start_lsn = start_lsn;
@@ -2530,10 +2571,6 @@ PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
gxact->inredo = true; /* yes, added in redo */
strcpy(gxact->gid, gid);
- /* And insert it into the active array */
- Assert(TwoPhaseState->numPrepXacts < max_prepared_xacts);
- TwoPhaseState->prepXacts[TwoPhaseState->numPrepXacts++] = gxact;
-
if (origin_id != InvalidRepOriginId)
{
/* recover apply progress */
--
2.19.1.6.gb485710b
On Tue, Jul 11, 2023 at 10:35:15AM +0800, suyu.cmj wrote:
I want to report a bug about the recovery of two-phase transaction,
in current implementation of crash recovery, there are two ways to
recover 2pc data:
1、before redo, func restoreTwoPhaseData() will restore 2pc data
those xid < ShmemVariableCache->nextXid, which is initialized from
checkPoint.nextXid;
2、during redo, func xact_redo() will add 2pc from wal;
The following scenario may cause the same 2pc transaction to be
added repeatedly, I have attached a patch for pg11 that reproduces
the error:
1、start creating checkpoint_1, checkpoint_1.redo is set as
curInsert;
2、before set checkPoint_1.nextXid, a new 2pc is prepared, suppose
the xid of this 2pc is 100, and then ShmemVariableCache->nextXid
will be advanced as 101;
3、checkPoint_1.nextXid is set as 101;
4、in CheckPointTwoPhase() of checkpoint_1, 2pc_100 won't be copied
to disk because its prepare_end_lsn > checkpoint_1.redo;
5、checkPoint_1 is finished, after checkpoint_timeout, start
creating checkpoint_2;
6、during checkpoint_2, data of 2pc_100 will be copied to disk;
7、before UpdateControlFile() of checkpoint_2, crash happened;
8、during crash recovery, redo will start from checkpoint_1, and
2pc_100 will be restored first by restoreTwoPhaseData() because
xid_100 < checkPoint_1.nextXid, which is 101;
9、because prepare_start_lsn of 2pc_100 > checkpoint_1.redo, 2pc_100
will be added again by xact_redo() during wal replay, resulting in
the same 2pc data being added twice;
It looks like you have something here. I'll try to look at it. This
is a bug, so I have removed pgsql-hackers from the CC list keeping
only pgsql-bugs as cross-list posts are not encouraged.
--
Michael