Fix primary crash continually with invalid checkpoint after promote

Started by Zhao Rui over 3 years ago, 5 messages
#1 Zhao Rui
875941708@qq.com
1 attachment(s)

A newly promoted primary may be left with an invalid checkpoint.

In the function CreateRestartPoint, the control file is updated and old WAL files are removed. But in some situations the control file is not updated while old WAL files are still removed, which produces an invalid checkpoint that points to nonexistent WAL. Key log messages: "invalid primary checkpoint record", "could not locate a valid checkpoint record".

The following timeline reproduces above situation:

tl1: the standby begins creating a restart point (triggered by time or by WAL volume).

tl2: the standby is promoted and the control file state changes to DB_IN_PRODUCTION, so the in-progress restart point skips the control file update (xlog.c:9690) but still removes old WAL files (xlog.c:9719).

tl3: the standby becomes the primary. The primary may crash before the next checkpoint completes (an OOM kill in my case), and then crashes continually with an invalid checkpoint.
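
For reference, the relevant control flow in CreateRestartPoint looks roughly like the abridged sketch below (paraphrased, not verbatim xlog.c): the control file update is guarded by a check that the server is still in archive recovery, while the later WAL removal runs unconditionally.

	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
	if (ControlFile->state == DB_IN_ARCHIVE_RECOVERY &&
		ControlFile->checkPointCopy.redo < lastCheckPoint.redo)
	{
		/* skipped once promotion has switched the state to DB_IN_PRODUCTION */
		ControlFile->checkPoint = lastCheckPointRecPtr;
		ControlFile->checkPointCopy = lastCheckPoint;
		/* ... minRecoveryPoint and shutdown-state handling elided ... */
		UpdateControlFile();
	}
	LWLockRelease(ControlFileLock);

	/* ... */

	/* still reached even when the branch above was skipped */
	RemoveOldXlogFiles(_logSegNo, RedoRecPtr, endptr);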

The attached patch reproduces the problem using a standard PostgreSQL Perl (TAP) test; you can run it with:

./configure --enable-tap-tests; make -j; make -C src/test/recovery/ check PROVE_TESTS=t/027_invalid_checkpoint_after_promote.pl

The attached patch also fixes the problem by ensuring that old WAL files are removed only after the control file has been updated.

Attachments:

0001-Fix-primary-crash-continually-with-invalid-checkpoint-after-promote.patch (application/octet-stream, charset=ISO-8859-1)
From 835abbfb2c30367e76fa68d63b9eb6a817b4eae3 Mon Sep 17 00:00:00 2001
From: Zhao Rui <875941708@qq.com>
Date: Tue, 26 Apr 2022 15:08:23 +0800
Subject: [PATCH] Fix primary crash continually with invalid checkpoint after
 promote

---
 src/backend/access/transam/xlog.c             |   8 +-
 .../t/027_invalid_checkpoint_after_promote.pl | 113 ++++++++++++++++++
 2 files changed, 120 insertions(+), 1 deletion(-)
 create mode 100644 src/test/recovery/t/027_invalid_checkpoint_after_promote.pl

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6208e123e5..7c07a19d91 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -9586,6 +9586,9 @@ CreateRestartPoint(int flags)
 	XLogRecPtr	endptr;
 	XLogSegNo	_logSegNo;
 	TimestampTz xtime;
+	bool        isControlFileUpdated;
+
+	isControlFileUpdated = false;
 
 	/* Get a local copy of the last safe checkpoint record. */
 	SpinLockAcquire(&XLogCtl->info_lck);
@@ -9717,6 +9720,7 @@ CreateRestartPoint(int flags)
 		if (flags & CHECKPOINT_IS_SHUTDOWN)
 			ControlFile->state = DB_SHUTDOWNED_IN_RECOVERY;
 		UpdateControlFile();
+		isControlFileUpdated = true;
 	}
 	LWLockRelease(ControlFileLock);
 
@@ -9768,7 +9772,9 @@ CreateRestartPoint(int flags)
 	if (RecoveryInProgress())
 		ThisTimeLineID = replayTLI;
 
-	RemoveOldXlogFiles(_logSegNo, RedoRecPtr, endptr);
+	/* We can only remove old xlog files after control file is updated. */
+	if (isControlFileUpdated)
+		RemoveOldXlogFiles(_logSegNo, RedoRecPtr, endptr);
 
 	/*
 	 * Make more log segments if needed.  (Do this after recycling old log
diff --git a/src/test/recovery/t/027_invalid_checkpoint_after_promote.pl b/src/test/recovery/t/027_invalid_checkpoint_after_promote.pl
new file mode 100644
index 0000000000..9cf6db5c4c
--- /dev/null
+++ b/src/test/recovery/t/027_invalid_checkpoint_after_promote.pl
@@ -0,0 +1,113 @@
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Time::HiRes qw(usleep nanosleep);
+use Test::More tests => 5;
+
+# initialize primary node
+my $node_primary = get_new_node('master');
+$node_primary->init(allows_streaming => 1);
+$node_primary->append_conf(
+	'postgresql.conf', q[
+checkpoint_timeout = 30s
+max_wal_size = 16GB
+log_checkpoints = on
+restart_after_crash = on
+]);
+$node_primary->start;
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# setup a standby
+my $node_standby = get_new_node('standby1');
+$node_standby->init_from_backup($node_primary, $backup_name, has_streaming => 1);
+$node_standby->start;
+
+# dummy table for the upcoming tests.
+$node_primary->safe_psql('postgres', 'checkpoint');
+$node_primary->safe_psql('postgres', 'create table test (a int, b varchar(255))');
+
+# use background psql to insert batch, just to make checkpoint a little slow.
+my $psql_timeout = IPC::Run::timer(3600);
+my ($stdin, $stdout, $stderr) = ('', '', '');
+my $psql_primary = IPC::Run::start(
+	[ 'psql', '-XAtq', '-v', 'ON_ERROR_STOP=1', '-f', '-', '-d', $node_primary->connstr('postgres') ],
+	'<',
+	\$stdin,
+	'>',
+	\$stdout,
+	'2>',
+	\$stderr,
+	$psql_timeout);
+$stdin .= q[
+INSERT INTO test SELECT i, 'aaaaaaaaaaaaaaaaaaaaaaa' from generate_series(1, 100000000) as i;
+];
+$psql_primary->pump_nb();
+
+# wait until restartpoint on standby
+my $logstart = -s $node_standby->logfile;
+my $checkpoint_start = 0;
+for (my $i = 0; $i < 3000; $i++)
+{
+    my $log = TestLib::slurp_file($node_standby->logfile, $logstart);
+	if ($log =~ m/restartpoint starting: time/)
+	{
+        $checkpoint_start = 1;
+		last;
+	}
+	usleep(100_000);
+}
+is($checkpoint_start, 1, 'restartpoint has started');
+
+# promote during restartpoint
+$node_primary->stop;
+$node_standby->promote;
+
+# wait until checkpoint on new primary
+$logstart = -s $node_standby->logfile;
+$checkpoint_start = 0;
+for (my $i = 0; $i < 3000; $i++)
+{
+    my $log = TestLib::slurp_file($node_standby->logfile, $logstart);
+	if ($log =~ m/restartpoint complete/)
+	{
+        $checkpoint_start = 1;
+		last;
+	}
+	usleep(100_000);
+}
+is($checkpoint_start, 1, 'checkpoint has started');
+
+# SIGKILL a backend so that all backends restart. Note that the previous checkpoint has not completed.
+my ($killme_stdin, $killme_stdout, $killme_stderr) = ('', '', '');
+my $killme = IPC::Run::start(
+	[ 'psql', '-XAtq', '-v', 'ON_ERROR_STOP=1', '-f', '-', '-d', $node_standby->connstr('postgres') ],
+	'<',
+	\$killme_stdin,
+	'>',
+	\$killme_stdout,
+	'2>',
+	\$killme_stderr,
+	$psql_timeout);
+$killme_stdin .= q[
+SELECT pg_backend_pid();
+];
+$killme->pump until $killme_stdout =~ /[[:digit:]]+[\r\n]$/;
+my $pid = $killme_stdout;
+chomp($pid);
+my $ret = TestLib::system_log('pg_ctl', 'kill', 'KILL', $pid);
+is($ret, 0, 'killed process with KILL');
+
+# without the fix, the server cannot restart after the crash and logs PANIC: could not locate a valid checkpoint record
+for (my $i = 0; $i < 30; $i++)
+{
+    ($ret, $stdout, $stderr) = $node_standby->psql('postgres', 'select 1');
+    last if $ret == 0;
+	sleep(1);
+}
+is($ret, 0, "psql connect success");
+is($stdout, 1, "psql select 1");
+
+$psql_primary->finish;
+$killme->finish;
\ No newline at end of file
-- 
2.32.0.3.g01195cf9f

#2 Nathan Bossart
nathandbossart@gmail.com
In reply to: Zhao Rui (#1)
Re: Fix primary crash continually with invalid checkpoint after promote

On Tue, Apr 26, 2022 at 03:16:13PM +0800, Zhao Rui wrote:

In the function CreateRestartPoint, the control file is updated and old WAL files are removed. But in some situations the control file is not updated while old WAL files are still removed, which produces an invalid checkpoint that points to nonexistent WAL. Key log messages: "invalid primary checkpoint record", "could not locate a valid checkpoint record".

I think this is the same issue tracked here: [0].

[0]: /messages/by-id/20220316.102444.2193181487576617583.horikyota.ntt@gmail.com

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

#3 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Zhao Rui (#1)
Re: Fix primary crash continually with invalid checkpoint after promote

"=?ISO-8859-1?B?WmhhbyBSdWk=?=" <875941708@qq.com> writes:

A newly promoted primary may be left with an invalid checkpoint.
In the function CreateRestartPoint, the control file is updated and old WAL files are removed. But in some situations the control file is not updated while old WAL files are still removed, which produces an invalid checkpoint that points to nonexistent WAL. Key log messages: "invalid primary checkpoint record", "could not locate a valid checkpoint record".

I believe this is the same issue being discussed here:

/messages/by-id/20220316.102444.2193181487576617583.horikyota.ntt@gmail.com

but Horiguchi-san's proposed fix looks quite different from yours.

regards, tom lane

#4 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Tom Lane (#3)
Re: Fix primary crash continually with invalid checkpoint after promote

At Tue, 26 Apr 2022 15:47:13 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in

"=?ISO-8859-1?B?WmhhbyBSdWk=?=" <875941708@qq.com> writes:

A newly promoted primary may be left with an invalid checkpoint.
In the function CreateRestartPoint, the control file is updated and old WAL files are removed. But in some situations the control file is not updated while old WAL files are still removed, which produces an invalid checkpoint that points to nonexistent WAL. Key log messages: "invalid primary checkpoint record", "could not locate a valid checkpoint record".

I believe this is the same issue being discussed here:

/messages/by-id/20220316.102444.2193181487576617583.horikyota.ntt@gmail.com

but Horiguchi-san's proposed fix looks quite different from yours.

The root cause is that CreateRestartPoint omits updating the last
checkpoint in the control file if archive recovery ends at an
unfortunate moment. So my proposal fixes that root cause.
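
To sketch the shape of such a root-cause fix (illustrative only; the actual proposal is in the linked thread): keep advancing the last-checkpoint fields in the control file even after promotion, and gate only the recovery-specific updates on DB_IN_ARCHIVE_RECOVERY, roughly:

	/* illustrative sketch only -- see the linked thread for the real patch */
	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
	if (ControlFile->checkPointCopy.redo < lastCheckPoint.redo)
	{
		/* always record the restartpoint that was actually performed */
		ControlFile->checkPoint = lastCheckPointRecPtr;
		ControlFile->checkPointCopy = lastCheckPoint;

		if (ControlFile->state == DB_IN_ARCHIVE_RECOVERY)
		{
			/* minRecoveryPoint and shutdown-state handling stay
			 * restricted to archive recovery */
			if (flags & CHECKPOINT_IS_SHUTDOWN)
				ControlFile->state = DB_SHUTDOWNED_IN_RECOVERY;
		}
		UpdateControlFile();
	}
	LWLockRelease(ControlFileLock);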

Zhao Rui's proposal is retention of WAL files according to the (wrong)
content of the control file.

Aside from the fact that it may let slots be invalidated earlier, it's
not great that an actually performed restartpoint is forgotten, which
may cause the next crash recovery to start from an already-performed
checkpoint.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#5 Michael Paquier
michael@paquier.xyz
In reply to: Kyotaro Horiguchi (#4)
Re: Fix primary crash continually with invalid checkpoint after promote

On Wed, Apr 27, 2022 at 11:24:11AM +0900, Kyotaro Horiguchi wrote:

Zhao Rui's proposal is retention of WAL files according to the (wrong)
content of the control file.

Aside from the fact that it may let slots be invalidated earlier, it's
not great that an actually performed restartpoint is forgotten, which
may cause the next crash recovery to start from an already-performed
checkpoint.

Yeah, I was analyzing this problem and took a look at what's proposed
here, and I agree that what is proposed on this thread would just do
some unnecessary work if we find ourselves in a situation where we
need to replay from a point earlier than necessary, i.e. from before
the checkpoint that should have already finished.
--
Michael