Got FATAL in lock_twophase_recover() during recovery

Started by suyu.cmjover 2 years ago2 messages

mengjuan.cmj@alibaba-inc.com

over 2 years ago

2 attachment(s)

Hi, all.
I want to report a bug about the recovery of two-phase transaction, in current implementation of crash recovery, there are two ways to recover 2pc data:
1、before redo, func restoreTwoPhaseData() will restore 2pc data those xid < ShmemVariableCache->nextXid, which is initialized from checkPoint.nextXid;
2、during redo, func xact_redo() will add 2pc from wal;
The following scenario may cause the same 2pc transaction to be added repeatedly, I have attached a patch for pg11 that reproduces the error:
1、start creating checkpoint_1, checkpoint_1.redo is set as curInsert;
2、before set checkPoint_1.nextXid, a new 2pc is prepared, suppose the xid of this 2pc is 100, and then ShmemVariableCache->nextXid will be advanced as 101;
3、checkPoint_1.nextXid is set as 101;
4、in CheckPointTwoPhase() of checkpoint_1, 2pc_100 won't be copied to disk because its prepare_end_lsn > checkpoint_1.redo;
5、checkPoint_1 is finished, after checkpoint_timeout, start creating checkpoint_2;
6、during checkpoint_2, data of 2pc_100 will be copied to disk;
7、before UpdateControlFile() of checkpoint_2, crash happened;
8、during crash recovery, redo will start from checkpoint_1, and 2pc_100 will be restored first by restoreTwoPhaseData() because xid_100 < checkPoint_1.nextXid, which is 101;
9、because prepare_start_lsn of 2pc_100 > checkpoint_1.redo, 2pc_100 will be added again by xact_redo() during wal replay, resulting in the same 2pc data being added twice;
10、In RecoverPreparedTransactions() -> lock_twophase_recover(), lock the same 2pc will cause FATAL.
After running the patch that reproduced the error, you will get the following error during crash recovery:
2023-07-10 13:04:30.670 UTC [11169] LOG: recovering prepared transaction 569 from shared memory
2023-07-10 13:04:30.670 UTC [11169] LOG: recovering prepared transaction 569 from shared memory
2023-07-10 13:04:30.670 UTC [11169] FATAL: lock ExclusiveLock on object 569/0/0 is already held
2023-07-10 13:04:30.670 UTC [11168] LOG: startup process (PID 11169) exited with exit code 1
2023-07-10 13:04:30.670 UTC [11168] LOG: aborting startup due to startup process failure
I also added a patch for pg11 to fix this problem, hope you can check it when you have time.
Thanks & Best Regard

Attachments:

0001-Reproduce-the-error-in-lock_twophase_recover_11.patchapplication/octet-streamDownload

From 90cedbdcadfe6e68ba2c0e74ba69a626fd402964 Mon Sep 17 00:00:00 2001
From: "suyu.cmj" <mengjuan.cmj@alibaba-inc.com>
Date: Mon, 10 Jul 2023 13:09:09 +0000
Subject: [PATCH] Reproduce the error in lock_twophase_recover()

---
 src/backend/access/transam/xlog.c             | 10 ++++
 src/backend/postmaster/checkpointer.c         | 51 +++++++++++++++++++
 src/include/catalog/pg_proc.dat               |  8 +++
 src/include/postmaster/bgwriter.h             |  3 ++
 ..._add_duplicate_twophase_during_recovery.pl | 50 ++++++++++++++++++
 5 files changed, 122 insertions(+)
 create mode 100644 src/test/recovery/t/034_add_duplicate_twophase_during_recovery.pl

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 1174dd5beb..9c6f287267 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -9072,6 +9072,11 @@ CreateCheckPoint(int flags)
 
 	TRACE_POSTGRESQL_CHECKPOINT_START(flags);
 
+	/* Test inject fault */
+	while (IsCheckpointDelayed())
+		pg_usleep(1000000);
+	/* Test inject fault end */
+
 	/*
 	 * Get the other info we need for the checkpoint record.
 	 *
@@ -9169,6 +9174,11 @@ CreateCheckPoint(int flags)
 	}
 	pfree(vxids);
 
+	/* Test inject fault */
+	if (IsCheckpointPanicInjected())
+		elog(PANIC, "panic is injected during checkpoint");
+	/* Test inject fault end */
+
 	/*
 	 * Take a snapshot of running transactions and write this to WAL. This
 	 * allows us to reconstruct the state of running transactions during
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index fb912c0381..adcecd902e 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -126,6 +126,9 @@ typedef struct
 
 	int			ckpt_flags;		/* checkpoint flags, as defined in xlog.h */
 
+	bool		is_ckpt_delayed;	/* add for test */
+	bool		is_ckpt_panic_injected; /* add for test */
+
 	uint32		num_backend_writes; /* counts user backend buffer writes */
 	uint32		num_backend_fsync;	/* counts user backend fsync calls */
 
@@ -1392,3 +1395,51 @@ FirstCallSinceLastCheckpoint(void)
 
 	return FirstCall;
 }
+
+bool
+IsCheckpointDelayed(void)
+{
+	bool is_delayed = false;
+
+	SpinLockAcquire(&CheckpointerShmem->ckpt_lck);
+	is_delayed = CheckpointerShmem->is_ckpt_delayed;
+	SpinLockRelease(&CheckpointerShmem->ckpt_lck);
+
+	return is_delayed;
+}
+
+bool
+IsCheckpointPanicInjected(void)
+{
+	bool inject_panic = false;
+
+	SpinLockAcquire(&CheckpointerShmem->ckpt_lck);
+	inject_panic = CheckpointerShmem->is_ckpt_panic_injected;
+	SpinLockRelease(&CheckpointerShmem->ckpt_lck);
+
+	return inject_panic;
+}
+
+Datum
+pg_test_inject_checkpoint_delay(PG_FUNCTION_ARGS)
+{
+	bool set_delay = PG_GETARG_BOOL(0);
+
+	SpinLockAcquire(&CheckpointerShmem->ckpt_lck);
+	CheckpointerShmem->is_ckpt_delayed = set_delay;
+	SpinLockRelease(&CheckpointerShmem->ckpt_lck);
+
+	PG_RETURN_VOID();
+}
+
+Datum
+pg_test_inject_checkpoint_panic(PG_FUNCTION_ARGS)
+{
+	bool inject_panic = PG_GETARG_BOOL(0);
+
+	SpinLockAcquire(&CheckpointerShmem->ckpt_lck);
+	CheckpointerShmem->is_ckpt_panic_injected = inject_panic;
+	SpinLockRelease(&CheckpointerShmem->ckpt_lck);
+
+	PG_RETURN_VOID();
+}
\ No newline at end of file
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 658229ce77..10ec7b4d32 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -10206,4 +10206,12 @@
   proisstrict => 'f', prorettype => 'bool', proargtypes => 'oid int4 int4 any',
   proargmodes => '{i,i,i,v}', prosrc => 'satisfies_hash_partition' },
 
+{ oid => '5058', descr => 'inject delay during checkpoint',
+  proname => 'pg_test_inject_checkpoint_delay', provolatile => 'v', prorettype => 'void',
+  proargtypes => 'bool', prosrc => 'pg_test_inject_checkpoint_delay' },
+
+{ oid => '5059', descr => 'inject panic during checkpoint',
+  proname => 'pg_test_inject_checkpoint_panic', provolatile => 'v', prorettype => 'void',
+  proargtypes => 'bool', prosrc => 'pg_test_inject_checkpoint_panic' },
+
 ]
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index 941c6aba7d..40e1eb00f2 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -40,4 +40,7 @@ extern void CheckpointerShmemInit(void);
 
 extern bool FirstCallSinceLastCheckpoint(void);
 
+extern bool IsCheckpointDelayed(void);
+extern bool IsCheckpointPanicInjected(void);
+
 #endif							/* _BGWRITER_H */
diff --git a/src/test/recovery/t/034_add_duplicate_twophase_during_recovery.pl b/src/test/recovery/t/034_add_duplicate_twophase_during_recovery.pl
new file mode 100644
index 0000000000..acd16cb494
--- /dev/null
+++ b/src/test/recovery/t/034_add_duplicate_twophase_during_recovery.pl
@@ -0,0 +1,50 @@
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 1;
+
+# Setup primary node
+my $node_primary = get_new_node("primary");
+$node_primary->init;
+$node_primary->append_conf(
+	'postgresql.conf', qq(
+	max_prepared_transactions = 10
+	log_checkpoints = true
+	checkpoint_timeout = 3000
+));
+$node_primary->start;
+
+$node_primary->psql('postgres', "create table t_test_2pc(id int, msg text)");
+$node_primary->psql('postgres', "select pg_test_inject_checkpoint_delay(true)");
+
+# start do checkpoint1
+my $host = $node_primary->host;
+my $port = $node_primary->port;
+`nohup psql -h $host -p $port -c 'checkpoint' >test.file 2>&1 &`;
+
+# prepare transaction during checkpoint1
+$node_primary->psql(
+	'postgres', "
+	BEGIN;
+	INSERT INTO t_test_2pc VALUES (1, 'test add duplicate 2pc');
+	PREPARE TRANSACTION 'xact_test_1';");
+
+# reset delay flag to finish checkpoint1
+$node_primary->psql('postgres', "select pg_test_inject_checkpoint_delay(false)");
+sleep 3;
+
+# inject panic during checkpoint2
+$node_primary->psql('postgres', "select pg_test_inject_checkpoint_panic(true)");
+$node_primary->psql('postgres', "checkpoint");
+
+sleep 3;
+my $logdir = $node_primary->logfile;
+my @res   = `grep -rn "panic is injected during checkpoint" $logdir`;
+my $found = @res;
+ok($found == 1, "inject panic success");
+
+# start will fail due to error in lock_twophase_recover()
+$node_primary->_update_pid(0);
+$node_primary->start;
\ No newline at end of file
-- 
2.19.1.6.gb485710b

v1-0001-Fix-a-2PC-transaction-maybe-recovered-twice_11.patchapplication/octet-streamDownload

From cae8ef31feb4c8dc8beb3c7422548955455a35c9 Mon Sep 17 00:00:00 2001
From: "suyu.cmj" <mengjuan.cmj@alibaba-inc.com>
Date: Fri, 7 Jul 2023 08:44:49 +0000
Subject: [PATCH] Fix the bug of a 2PC transaction maybe recovered twice

During recovery, a two-phase transaction should be restored
either from the disk file or from the WAL. However, a two-phase
transaction maybe restored from both way if we crashed during
doing a checkpoint. Considering that the data on disk file may
not be reliable, so in this case, we only store the transaction
data recorded in the WAL.
---
 src/backend/access/transam/twophase.c | 61 +++++++++++++++++++++------
 1 file changed, 49 insertions(+), 12 deletions(-)

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index d145836a79..629b79401c 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -2491,6 +2491,8 @@ PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 	char	   *bufptr;
 	const char *gid;
 	GlobalTransaction gxact;
+	bool addnewgxact = true;
+	int i;
 
 	Assert(LWLockHeldByMeInMode(TwoPhaseStateLock, LW_EXCLUSIVE));
 	Assert(RecoveryInProgress());
@@ -2509,15 +2511,54 @@ PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 	 * that it got added in the redo phase
 	 */
 
+	/*
+	 * During recovery, the two-phase data can be added in two ways:
+	 * 1) restored from disk file when its xid < checkPoint.nextxid,
+	 * 2) restored from the WAL when its prepare_start_lsn > checkPoint.redo,
+	 * A two-phase transaction may satisfy above two conditions if we
+	 * crashed during doing a checkpoint. Considering that the data
+	 * on disk file may not be reliable, so in this case, we only store
+	 * the transaction data recorded in the WAL.
+	 */
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction curxact = TwoPhaseState->prepXacts[i];
+
+		if (curxact->xid == hdr->xid)
+		{
+			if (curxact->ondisk && !XLogRecPtrIsInvalid(start_lsn))
+			{
+				gxact = curxact;
+				addnewgxact = false;
+				ereport(WARNING,
+						(errmsg("found duplicate two-phase transaction:%u, store data from the WAL",
+								hdr->xid)));
+				break;
+			}
+			else
+				ereport(ERROR,
+						(errmsg("found unexpected duplicate two-phase transaction:%u, check for data correctness.",
+								hdr->xid)));
+		}
+	}
+
 	/* Get a free gxact from the freelist */
-	if (TwoPhaseState->freeGXacts == NULL)
-		ereport(ERROR,
-				(errcode(ERRCODE_OUT_OF_MEMORY),
-				 errmsg("maximum number of prepared transactions reached"),
-				 errhint("Increase max_prepared_transactions (currently %d).",
-						 max_prepared_xacts)));
-	gxact = TwoPhaseState->freeGXacts;
-	TwoPhaseState->freeGXacts = gxact->next;
+	if (addnewgxact)
+	{
+		if (TwoPhaseState->freeGXacts == NULL)
+			ereport(ERROR,
+					(errcode(ERRCODE_OUT_OF_MEMORY),
+					 errmsg("maximum number of prepared transactions reached"),
+					 errhint("Increase max_prepared_transactions (currently %d).",
+							 max_prepared_xacts)));
+
+		gxact = TwoPhaseState->freeGXacts;
+		TwoPhaseState->freeGXacts = gxact->next;
+
+		/* And insert it into the active array */
+		Assert(TwoPhaseState->numPrepXacts < max_prepared_xacts);
+		TwoPhaseState->prepXacts[TwoPhaseState->numPrepXacts++] = gxact;
+	}
 
 	gxact->prepared_at = hdr->prepared_at;
 	gxact->prepare_start_lsn = start_lsn;
@@ -2530,10 +2571,6 @@ PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 	gxact->inredo = true;		/* yes, added in redo */
 	strcpy(gxact->gid, gid);
 
-	/* And insert it into the active array */
-	Assert(TwoPhaseState->numPrepXacts < max_prepared_xacts);
-	TwoPhaseState->prepXacts[TwoPhaseState->numPrepXacts++] = gxact;
-
 	if (origin_id != InvalidRepOriginId)
 	{
 		/* recover apply progress */
-- 
2.19.1.6.gb485710b

Michael Paquier

michael@paquier.xyz

over 2 years ago

In reply to: suyu.cmj (#1)

Re: Got FATAL in lock_twophase_recover() during recovery

On Tue, Jul 11, 2023 at 10:35:15AM +0800, suyu.cmj wrote:

I want to report a bug about the recovery of two-phase transaction,
in current implementation of crash recovery, there are two ways to
recover 2pc data:
1、before redo, func restoreTwoPhaseData() will restore 2pc data
those xid < ShmemVariableCache->nextXid, which is initialized from
checkPoint.nextXid;
2、during redo, func xact_redo() will add 2pc from wal;
The following scenario may cause the same 2pc transaction to be
added repeatedly, I have attached a patch for pg11 that reproduces
the error:
1、start creating checkpoint_1, checkpoint_1.redo is set as
curInsert;
2、before set checkPoint_1.nextXid, a new 2pc is prepared, suppose
the xid of this 2pc is 100, and then ShmemVariableCache->nextXid
will be advanced as 101;
3、checkPoint_1.nextXid is set as 101;
4、in CheckPointTwoPhase() of checkpoint_1, 2pc_100 won't be copied
to disk because its prepare_end_lsn > checkpoint_1.redo;
5、checkPoint_1 is finished, after checkpoint_timeout, start
creating checkpoint_2;
6、during checkpoint_2, data of 2pc_100 will be copied to disk;
7、before UpdateControlFile() of checkpoint_2, crash happened;
8、during crash recovery, redo will start from checkpoint_1, and
2pc_100 will be restored first by restoreTwoPhaseData() because
xid_100 < checkPoint_1.nextXid, which is 101;
9、because prepare_start_lsn of 2pc_100 > checkpoint_1.redo, 2pc_100
will be added again by xact_redo() during wal replay, resulting in
the same 2pc data being added twice;

It looks like you have something here. I'll try to look at it. This
is a bug, so I have removed pgsql-hackers from the CC list keeping
only pgsql-bugs as cross-list posts are not encouraged.
--
Michael