Fix crash during recovery when redo segment is missing
Hi,
In [1], Andres reported a bug where PostgreSQL crashes during recovery
if the segment containing the redo pointer does not exist. I have
attempted to address this issue and am sharing a patch for it.
The problem was that PostgreSQL did not PANIC when the redo LSN and
checkpoint LSN were in separate segments and the file containing the
redo LSN was missing, leading to a crash. Andres has provided a
detailed analysis of the behavior across different settings and
versions; please refer to [1] for more information. This issue arises
because PostgreSQL does not PANIC early on.
The issue was resolved by ensuring that the REDO location exists once
we successfully read the checkpoint record in InitWalRecovery(). This
prevents control from reaching PerformWalRecovery() unless the WAL
file containing the redo record exists. A new test script,
044_redo_segment_missing.pl, has been added to validate this. To
place the redo record in a different WAL file than the checkpoint
record, the test waits for the checkpoint start message and then
issues a pg_switch_wal(), which should happen before the checkpoint
completes. Then I crash the server, and during the restart it should
log an appropriate error indicating that it could not find the redo
location. Please let me know if there is a better way to reproduce
this behavior. I have tested and verified this with the various
scenarios Andres pointed out in [1]. Please note that this
patch does not address error checking in StartupXLOG(),
CreateCheckPoint(), etc., nor does it focus on cleaning up existing
code.
Attaching the patch. Please review and share your feedback. Thanks to
Andres for spotting the bug and providing the detailed report [1].
[1]: /messages/by-id/20231023232145.cmqe73stvivsmlhs@awork3.anarazel.de
Best Regards,
Nitin Jadhav
Azure Database for PostgreSQL
Microsoft
Attachments:
0001-Fix-crash-during-recovery-when-redo-segment-is-missi.patch (application/octet-stream)
From 5141de1ec247a2a442c1b13a6875530fe443f59f Mon Sep 17 00:00:00 2001
From: Nitin Jadhav <nitinjadhav@microsoft.com>
Date: Fri, 21 Feb 2025 10:48:15 +0000
Subject: [PATCH] Fix crash during recovery when redo segment is missing
The issue was that PostgreSQL did not PANIC when the redo LSN and
checkpoint LSN were in separate segments, and the file containing the
redo LSN was missing. This resulted in a crash.
The fix ensures the REDO location exists after reading the checkpoint
record in InitWalRecovery(). If the REDO location is missing, it logs
a PANIC error. Additionally, a new test script,
044_redo_segment_missing.pl, has been added to validate this.
---
src/backend/access/transam/xlogrecovery.c | 14 ++-
.../recovery/t/044_redo_segment_missing.pl | 103 ++++++++++++++++++
2 files changed, 115 insertions(+), 2 deletions(-)
create mode 100644 src/test/recovery/t/044_redo_segment_missing.pl
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 473de6710d..c9c341e5dc 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -780,9 +780,21 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
CheckPointTLI);
if (record != NULL)
{
+ memcpy(&checkPoint, XLogRecGetData(xlogreader), sizeof(CheckPoint));
+ wasShutdown = ((record->xl_info & ~XLR_INFO_MASK) == XLOG_CHECKPOINT_SHUTDOWN);
ereport(DEBUG1,
(errmsg_internal("checkpoint record is at %X/%X",
LSN_FORMAT_ARGS(CheckPointLoc))));
+
+ /* Make sure that REDO location exists. */
+ if (RedoStartLSN < CheckPointLoc)
+ {
+ XLogPrefetcherBeginRead(xlogprefetcher, RedoStartLSN);
+ if (!ReadRecord(xlogprefetcher, LOG, false, RedoStartTLI))
+ ereport(PANIC,
+ (errmsg("could not find redo location %X/%X referenced by checkpoint record at %X/%X",
+ LSN_FORMAT_ARGS(RedoStartLSN), LSN_FORMAT_ARGS(CheckPointLoc))));
+ }
}
else
{
@@ -796,8 +808,6 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
(errmsg("could not locate a valid checkpoint record at %X/%X",
LSN_FORMAT_ARGS(CheckPointLoc))));
}
- memcpy(&checkPoint, XLogRecGetData(xlogreader), sizeof(CheckPoint));
- wasShutdown = ((record->xl_info & ~XLR_INFO_MASK) == XLOG_CHECKPOINT_SHUTDOWN);
}
if (ArchiveRecoveryRequested)
diff --git a/src/test/recovery/t/044_redo_segment_missing.pl b/src/test/recovery/t/044_redo_segment_missing.pl
new file mode 100644
index 0000000000..a7bd9a7112
--- /dev/null
+++ b/src/test/recovery/t/044_redo_segment_missing.pl
@@ -0,0 +1,103 @@
+# Copyright (c) 2021-2025, PostgreSQL Global Development Group
+#
+# Evaluates PostgreSQL's recovery behavior when a WAL segment
+# containing the redo record is missing. It initializes a
+# PostgreSQL instance, configures it, and executes a series of
+# operations to mimic a scenario where a WAL segment is absent.
+# Then checks that PostgreSQL logs the correct error message
+# when it fails to locate the redo record.
+
+use strict;
+use warnings FATAL => 'all';
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Time::HiRes qw(usleep);
+use Test::More;
+
+my $psql_timeout = IPC::Run::timer($PostgreSQL::Test::Utils::timeout_default);
+my $node = PostgreSQL::Test::Cluster->new('testnode');
+$node->init;
+$node->start;
+
+# Set checkpoint_timeout to the minimum value (30s) to quickly create
+# a checkpoint for testing. Populate some data to ensure the checkpoint
+# operation takes enough time, allowing the execution of pg_switch_wal().
+# This will create a new WAL segment, and the checkpoint record will be
+# written to a different WAL segment.
+$node->safe_psql(
+ 'postgres',
+ q[
+ ALTER SYSTEM SET checkpoint_timeout = '30s';
+ SELECT pg_reload_conf();
+ CREATE TABLE test_table(id int);
+ INSERT INTO test_table (id) SELECT generate_series(1, 100);
+ ]
+);
+
+# Wait for the checkpoint to start
+my $logstart = -s $node->logfile;
+my $checkpoint_start = 0;
+foreach my $i (0 .. 10 * $PostgreSQL::Test::Utils::timeout_default) {
+ if ($node->log_contains("checkpoint starting: time", $logstart)) {
+ $checkpoint_start = 1;
+ last;
+ }
+ usleep(100_000);
+}
+is($checkpoint_start, 1, 'checkpoint has started');
+
+# Switch the WAL to ensure that the WAL segment containing the checkpoint
+# record is different from the one containing the redo record.
+$node->safe_psql('postgres', 'SELECT pg_switch_wal()');
+
+# Wait for the checkpoint to complete
+my $checkpoint_complete = 0;
+foreach my $i (0 .. 10 * $PostgreSQL::Test::Utils::timeout_default) {
+ if ($node->log_contains("checkpoint complete", $logstart)) {
+ $checkpoint_complete = 1;
+ last;
+ }
+ usleep(100_000);
+}
+is($checkpoint_complete, 1, 'checkpoint has completed');
+
+my $redo_lsn = $node->safe_psql('postgres', "SELECT redo_lsn FROM pg_control_checkpoint()");
+my $redo_walfile_name = $node->safe_psql('postgres', "SELECT pg_walfile_name('$redo_lsn')");
+my $checkpoint_lsn = $node->safe_psql('postgres', "SELECT checkpoint_lsn FROM pg_control_checkpoint()");
+my $checkpoint_walfile_name = $node->safe_psql('postgres', "SELECT pg_walfile_name('$checkpoint_lsn')");
+
+if ($redo_walfile_name eq $checkpoint_walfile_name) {
+ die "redo record wal file is the same as checkpoint record wal file, aborting test";
+}
+
+# Remove the WAL segment containing the redo record
+unlink $node->data_dir . "/pg_wal/$redo_walfile_name"
+ or die "Could not remove WAL file: $!";
+
+$node->stop('immediate');
+
+# Use run_log instead of node->start because this test expects
+# that the server ends with an error during recovery.
+run_log(
+ [
+ 'pg_ctl',
+ '--pgdata' => $node->data_dir,
+ '--log' => $node->logfile,
+ 'start',
+ ]);
+
+# Wait for postgres to terminate
+foreach my $i (0 .. 10 * $PostgreSQL::Test::Utils::timeout_default)
+{
+ last if !-f $node->data_dir . '/postmaster.pid';
+ usleep(100_000);
+}
+
+# Confirm that the recovery fails with an expected error
+my $logfile = slurp_file($node->logfile());
+ok( $logfile =~
+ qr/PANIC: could not find redo location .* referenced by checkpoint record at .*/,
+ "ends with PANIC because it could not find redo location"
+);
+
+done_testing();
\ No newline at end of file
--
2.43.0
The patch wasn’t applying cleanly on master, so I’ve rebased it and
also added it to the PG19‑4 CommitFest:
https://commitfest.postgresql.org/patch/6279/
Please review and share your feedback.
Best Regards,
Nitin Jadhav
Azure Database for PostgreSQL
Microsoft
Apologies, I missed attaching the patch earlier. Please find the v2
version attached.
Best Regards,
Nitin Jadhav
Azure Database for PostgreSQL
Microsoft
Attachments:
v2-0001-Fix-crash-during-recovery-when-redo-segment-is-missi.patch (application/octet-stream)
From e3887b375923a4aff4c28f99946545ffcd11b125 Mon Sep 17 00:00:00 2001
From: Nitin Jadhav <nitinjadhavpostgres@gmail.com>
Date: Thu, 4 Dec 2025 06:09:26 +0000
Subject: [PATCH] Fix crash during recovery when redo segment is missing
The issue was that PostgreSQL did not PANIC when the redo LSN and
checkpoint LSN were in separate segments, and the file containing the
redo LSN was missing. This resulted in a crash.
The fix ensures the REDO location exists after reading the checkpoint
record in InitWalRecovery(). If the REDO location is missing, it logs
a PANIC error. Additionally, a new test script,
050_redo_segment_missing.pl, has been added to validate this.
---
src/backend/access/transam/xlogrecovery.c | 14 ++-
.../recovery/t/050_redo_segment_missing.pl | 103 ++++++++++++++++++
2 files changed, 115 insertions(+), 2 deletions(-)
create mode 100644 src/test/recovery/t/050_redo_segment_missing.pl
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 21b8f179ba0..a402bfa550e 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -787,9 +787,21 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
CheckPointTLI);
if (record != NULL)
{
+ memcpy(&checkPoint, XLogRecGetData(xlogreader), sizeof(CheckPoint));
+ wasShutdown = ((record->xl_info & ~XLR_INFO_MASK) == XLOG_CHECKPOINT_SHUTDOWN);
ereport(DEBUG1,
errmsg_internal("checkpoint record is at %X/%08X",
LSN_FORMAT_ARGS(CheckPointLoc)));
+
+ /* Make sure that REDO location exists. */
+ if (RedoStartLSN < CheckPointLoc)
+ {
+ XLogPrefetcherBeginRead(xlogprefetcher, RedoStartLSN);
+ if (!ReadRecord(xlogprefetcher, LOG, false, RedoStartTLI))
+ ereport(PANIC,
+ errmsg("could not find redo location %X/%08X referenced by checkpoint record at %X/%08X",
+ LSN_FORMAT_ARGS(RedoStartLSN), LSN_FORMAT_ARGS(CheckPointLoc)));
+ }
}
else
{
@@ -803,8 +815,6 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
errmsg("could not locate a valid checkpoint record at %X/%08X",
LSN_FORMAT_ARGS(CheckPointLoc)));
}
- memcpy(&checkPoint, XLogRecGetData(xlogreader), sizeof(CheckPoint));
- wasShutdown = ((record->xl_info & ~XLR_INFO_MASK) == XLOG_CHECKPOINT_SHUTDOWN);
}
if (ArchiveRecoveryRequested)
diff --git a/src/test/recovery/t/050_redo_segment_missing.pl b/src/test/recovery/t/050_redo_segment_missing.pl
new file mode 100644
index 00000000000..a7bd9a71125
--- /dev/null
+++ b/src/test/recovery/t/050_redo_segment_missing.pl
@@ -0,0 +1,103 @@
+# Copyright (c) 2021-2025, PostgreSQL Global Development Group
+#
+# Evaluates PostgreSQL's recovery behavior when a WAL segment
+# containing the redo record is missing. It initializes a
+# PostgreSQL instance, configures it, and executes a series of
+# operations to mimic a scenario where a WAL segment is absent.
+# Then checks that PostgreSQL logs the correct error message
+# when it fails to locate the redo record.
+
+use strict;
+use warnings FATAL => 'all';
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Time::HiRes qw(usleep);
+use Test::More;
+
+my $psql_timeout = IPC::Run::timer($PostgreSQL::Test::Utils::timeout_default);
+my $node = PostgreSQL::Test::Cluster->new('testnode');
+$node->init;
+$node->start;
+
+# Set checkpoint_timeout to the minimum value (30s) to quickly create
+# a checkpoint for testing. Populate some data to ensure the checkpoint
+# operation takes enough time, allowing the execution of pg_switch_wal().
+# This will create a new WAL segment, and the checkpoint record will be
+# written to a different WAL segment.
+$node->safe_psql(
+ 'postgres',
+ q[
+ ALTER SYSTEM SET checkpoint_timeout = '30s';
+ SELECT pg_reload_conf();
+ CREATE TABLE test_table(id int);
+ INSERT INTO test_table (id) SELECT generate_series(1, 100);
+ ]
+);
+
+# Wait for the checkpoint to start
+my $logstart = -s $node->logfile;
+my $checkpoint_start = 0;
+foreach my $i (0 .. 10 * $PostgreSQL::Test::Utils::timeout_default) {
+ if ($node->log_contains("checkpoint starting: time", $logstart)) {
+ $checkpoint_start = 1;
+ last;
+ }
+ usleep(100_000);
+}
+is($checkpoint_start, 1, 'checkpoint has started');
+
+# Switch the WAL to ensure that the WAL segment containing the checkpoint
+# record is different from the one containing the redo record.
+$node->safe_psql('postgres', 'SELECT pg_switch_wal()');
+
+# Wait for the checkpoint to complete
+my $checkpoint_complete = 0;
+foreach my $i (0 .. 10 * $PostgreSQL::Test::Utils::timeout_default) {
+ if ($node->log_contains("checkpoint complete", $logstart)) {
+ $checkpoint_complete = 1;
+ last;
+ }
+ usleep(100_000);
+}
+is($checkpoint_complete, 1, 'checkpoint has completed');
+
+my $redo_lsn = $node->safe_psql('postgres', "SELECT redo_lsn FROM pg_control_checkpoint()");
+my $redo_walfile_name = $node->safe_psql('postgres', "SELECT pg_walfile_name('$redo_lsn')");
+my $checkpoint_lsn = $node->safe_psql('postgres', "SELECT checkpoint_lsn FROM pg_control_checkpoint()");
+my $checkpoint_walfile_name = $node->safe_psql('postgres', "SELECT pg_walfile_name('$checkpoint_lsn')");
+
+if ($redo_walfile_name eq $checkpoint_walfile_name) {
+ die "redo record wal file is the same as checkpoint record wal file, aborting test";
+}
+
+# Remove the WAL segment containing the redo record
+unlink $node->data_dir . "/pg_wal/$redo_walfile_name"
+ or die "Could not remove WAL file: $!";
+
+$node->stop('immediate');
+
+# Use run_log instead of node->start because this test expects
+# that the server ends with an error during recovery.
+run_log(
+ [
+ 'pg_ctl',
+ '--pgdata' => $node->data_dir,
+ '--log' => $node->logfile,
+ 'start',
+ ]);
+
+# Wait for postgres to terminate
+foreach my $i (0 .. 10 * $PostgreSQL::Test::Utils::timeout_default)
+{
+ last if !-f $node->data_dir . '/postmaster.pid';
+ usleep(100_000);
+}
+
+# Confirm that the recovery fails with an expected error
+my $logfile = slurp_file($node->logfile());
+ok( $logfile =~
+ qr/PANIC: could not find redo location .* referenced by checkpoint record at .*/,
+ "ends with PANIC because it could not find redo location"
+);
+
+done_testing();
\ No newline at end of file
--
2.43.0
Hi!
1) Please do not top-post on these lists
2) I did not get the exact reason this thread is different from [0]
3)
Your tests are using a busy loop with usleep, which nowadays is
considered bad practice. There are three such places, and I think the
first two of them can be replaced with an injection point wait (see
the sketch after the references below).
+# Wait for the checkpoint to complete
+my $checkpoint_complete = 0;
+foreach my $i (0 .. 10 * $PostgreSQL::Test::Utils::timeout_default) {
+ if ($node->log_contains("checkpoint complete", $logstart)) {
+ $checkpoint_complete = 1;
+ last;
+ }
+ usleep(100_000);
+}
4) There are comments from Robert, which are not covered [1].
[0]: /messages/by-id/20231023232145.cmqe73stvivsmlhs@awork3.anarazel.de
[1]: /messages/by-id/CA+Tgmob1x7HMcWAb=6ep2cBuWuwpT-p9E7EmQegWFu6E8nKHeg@mail.gmail.com
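For illustration, a minimal sketch of those waits without the busy
loop, relying on the wait_for_log helper from
PostgreSQL::Test::Cluster (an injection-point based wait, as suggested
above, would additionally need a point defined on the server side):
my $log_offset = -s $node->logfile;
# Block until the checkpoint start shows up in the log, or time out.
$node->wait_for_log(qr/checkpoint starting: time/, $log_offset);
# Switch WAL while the checkpoint is still in progress.
$node->safe_psql('postgres', 'SELECT pg_switch_wal()');
# Block until the checkpoint completion message shows up.
$node->wait_for_log(qr/checkpoint complete/, $log_offset);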
--
Best regards,
Kirill Reshke
On Thu, Dec 04, 2025 at 12:00:23PM +0500, Kirill Reshke wrote:
Your tests are using a busy loop with usleep, which nowadays is
considered bad practice. There are three such places, and I think the
first two of them can be replaced with an injection point wait.
(This fell off my radar, apologies about that.)
The problem with this test is not related to its use of sleeps, which
is perfectly fine to check the start or end timings of a checkpoint
depending on the contents of the logs. It has two issues.
A first issue is that the test is unnecessarily long, taking more than
30s to finish because it relies on checkpoint_timeout to kick a
checkpoint. This could use a background psql object to kick a
checkpoint and accelerate the whole thing. As it stands, the test is
sitting idle for a long time, doing nothing.
Your intuition about injection points is right, though; it points
to a second problem: the test is not reliable because we could finish
the checkpoint *before* the WAL segment is switched, and we expect the
WAL segment switch to happen while the checkpoint is processing. If
you want to make that deterministic, having a wait in the middle of
checkpointing would make the test actually test what it should. In
this case the test would randomly die on its "redo record wal file is
the same" message. That's OK to prove the point of the initial patch,
but it's not acceptable for a test that could be added to the tree.
An update of src/test/recovery/meson.build to add the new test is
also required.
Anyway, I think that the patch is overdoing it in issuing a PANIC in
this case, going backwards with the other thread from [0]: a FATAL
would be perfectly OK, like in the backup_label path, because
manipulations that leave a WAL segment missing are indeed possible, as
the posted test proves. And there have been many arguments in the
past about performing recovery without a backup_label, as well. And
that would make the test something that we could use, because it
leaves no backtraces on buildfarm hosts and no disk bloat.
If we do not re-check in InitWalRecovery() that the redo record is
around, the startup process happily goes down to PerformWalRecovery()
in an everything-is-fine mode, initializing a bunch of states, to then
crash due to a pointer dereference while attempting to read the record
from the redo LSN, which does not exist. This is not right. So,
oops. I would see an argument good enough for a backpatch when it
comes to crash recovery, because we actually don't know what has
happened in this case. Or not based on the lack of complaints on the
matter over the years.
So I'd argue about updating the patch along these lines, instead:
- Switch the PANIC to a FATAL, to inform about the pilot error. This
addresses the pointer dereference with WAL replay going crazy for the
redo LSN.
- Add a test, with an injection point after checkpoint startup, based
on a manual checkpoint done in a backgroud psql, without a
checkpoint_timeout to make the test faster.
- We could be careful first based on the lack of complaints, doing
that extra check only on HEAD, for v19.
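To make the suggested test shape concrete, here is a minimal sketch of
the TAP side, assuming a wait-capable injection point (named
"checkpoint-after-start" here purely for illustration) were added
inside CreateCheckPoint() after the redo record is established; the
attach/wakeup/wait_for_event helpers come from the injection_points
module and PostgreSQL::Test::Cluster:
$node->safe_psql('postgres', 'CREATE EXTENSION injection_points');
$node->safe_psql('postgres',
 q{SELECT injection_points_attach('checkpoint-after-start', 'wait')});
# Kick a manual checkpoint from a background session; it pauses at the
# injection point instead of relying on checkpoint_timeout.
my $bg = $node->background_psql('postgres');
$bg->query_until(
 qr/starting_checkpoint/,
 q(\echo starting_checkpoint
checkpoint;
));
$node->wait_for_event('checkpointer', 'checkpoint-after-start');
# The checkpoint is now guaranteed to be in progress: switch WAL so the
# checkpoint record lands in a different segment than the redo record,
# then release the point and wait for completion.
$node->safe_psql('postgres', 'SELECT pg_switch_wal()');
my $log_offset = -s $node->logfile;
$node->safe_psql('postgres',
 q{SELECT injection_points_wakeup('checkpoint-after-start')});
$node->wait_for_log(qr/checkpoint complete/, $log_offset);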
4) There are comments from Robert, which are not covered [1].
[0] /messages/by-id/20231023232145.cmqe73stvivsmlhs@awork3.anarazel.de
[1] /messages/by-id/CA+Tgmob1x7HMcWAb=6ep2cBuWuwpT-p9E7EmQegWFu6E8nKHeg@mail.gmail.com
I get the argument about the spaghetti code in general, however the
code in question here deals with the WAL replay initialization, for a
backup_label vs a non-backup_label path. Perhaps I'm more used to
this area than others, but it's not that much pasta to me.
I don't see a point in moving the memcpy() and the wasShutdown parts
as proposed in the patch, by the way. The PANIC would block things
in its else block. Let's limit ourselves to the extra ReadRecord()
check for the redo record when we find a checkpoint record.
I'd actually wonder if we should not lower the existing PANIC to a
FATAL if ReadCheckpointRecord() fails, as well. The result would be
the same for operators, without the need to deal with backtrace
cleanups after a crash. And we still have the error message that
tells us what's going on.
Any thoughts or opinions from others?
--
Michael
Hi,
Thanks for reviewing and sharing your feedback.
1) Please do not top-post on these lists
Apologies for that. I will make sure to follow this going forward.
2) I did not get the exact reason this thread is different from [0]
The purpose of this thread was to share the actual patch addressing
the crash scenario discussed in [0]. That said, I agree I should have
posted the patch in [0] itself. I will make sure to use existing
threads wherever possible going forward.
Best Regards,
Nitin Jadhav
Azure Database for PostgreSQL
Microsoft
Thank you for the detailed feedback.
Your tests are using a busy loop with usleep, which nowadays is
considered bad practice. There are three such places, and I think the
first two of them can be replaced with an injection point wait.
(This fell off my radar, apologies about that.)
The problem with this test is not related to its use of sleeps, which
is perfectly fine to check the start or end timings of a checkpoint
depending on the contents of the logs. It has two issues.
A first issue is that the test is unnecessarily long, taking more than
30s to finish because it relies on checkpoint_timeout to kick a
checkpoint. This could use a background psql object to kick a
checkpoint and accelerate the whole thing. As it stands, the test is
sitting idle for a long time, doing nothing.
Your intuition about injection points is right, though; it points
to a second problem: the test is not reliable because we could finish
the checkpoint *before* the WAL segment is switched, and we expect the
WAL segment switch to happen while the checkpoint is processing. If
you want to make that deterministic, having a wait in the middle of
checkpointing would make the test actually test what it should. In
this case the test would randomly die on its "redo record wal file is
the same" message. That's OK to prove the point of the initial patch,
but it's not acceptable for a test that could be added to the tree.
An update of src/test/recovery/meson.build to add the new test is
also required.
I will work on improving the test accordingly and include the changes
in the next version.
Anyway, I think that the patch is overdoing it in issuing a PANIC in
this case, going backwards with the other thread from [0]: a FATAL
would be perfectly OK, like in the backup_label path, because
manipulations that leave a WAL segment missing are indeed possible, as
the posted test proves. And there have been many arguments in the
past about performing recovery without a backup_label, as well. And
that would make the test something that we could use, because it
leaves no backtraces on buildfarm hosts and no disk bloat.
The main reason I chose PANIC is that when the ReadCheckpointRecord()
fails, we already use PANIC for the error message ‘could not locate a
valid checkpoint record…’ in the no_backup_label case, whereas the
similar flow with backup_label uses FATAL. I am not entirely sure why
this difference exists. I looked into it but couldn’t find much. If we
decide to change this patch to use FATAL for ‘could not find redo
location…’, should we also consider changing the existing PANIC to
FATAL for consistency?
4) There are comments from Robert, which are not covered [1].
[0] /messages/by-id/20231023232145.cmqe73stvivsmlhs@awork3.anarazel.de
[1] /messages/by-id/CA+Tgmob1x7HMcWAb=6ep2cBuWuwpT-p9E7EmQegWFu6E8nKHeg@mail.gmail.com
I get the argument about the spaghetti code in general, however the
code in question here deals with the WAL replay initialization, for a
backup_label vs a non-backup_label path. Perhaps I'm more used to
this area than others, but it's not that much pasta to me.
Yes. As noted in my initial email, this patch is focused solely on
fixing the crash issue. It does not address error handling in
StartupXLOG(), CreateCheckPoint(), or involve any broader code
cleanup.
I don't see a point in moving the memcpy() and the wasShutdown parts
as proposed in the patch, by the way. The PANIC would block things
in its else block. Let's limit ourselves to the extra ReadRecord()
check for the redo record when we find a checkpoint record.
I agree and will take care in the next patch.
Best Regards,
Nitin Jadhav
Azure Database for PostgreSQL
Microsoft
On Mon, Dec 08, 2025 at 12:48:52PM +0530, Nitin Jadhav wrote:
An update of src/test/recovery/meson.build to add the new test is
also required.
I will work on improving the test accordingly and include the changes
in the next version.
Cool, thanks.
The main reason I chose PANIC is that when the ReadCheckpointRecord()
fails, we already use PANIC for the error message ‘could not locate a
valid checkpoint record…’ in the no_backup_label case, whereas the
similar flow with backup_label uses FATAL. I am not entirely sure why
this difference exists. I looked into it but couldn’t find much. If we
decide to change this patch to use FATAL for ‘could not find redo
location…’, should we also consider changing the existing PANIC to
FATAL for consistency?
Using PANIC is an inherited historical artifact that has been
introduced around 4d14fe0048cf with the introduction of WAL. There
was nothing like archiving or even base backup back then. Switching
the existing surrounding one to also use a FATAL is something that
seems worth considering to me for the checkpoint record, at least
based on the pattern that there could be a driver error even if there
is no backup_label file (aka for example the case of FS-level snapshots
with one partition used for the data folder, no?).
This offers bonus points in the shape of more tests like the one you
have sent upthread. It's not something that I would backpatch as it
is a behavior change, but I'm open to seeing that as an improvement in
usability for future releases: PANIC is for cases that should never
happen for internal states, due to an internal logic error, or an OS
going crazy. Here we have a should-not-happen case triggered by a
user, and a FATAL still provides the same information about what's
wrong. Let's make such changes separate patches, of course, depending
on what we find on the way.
Yes. As noted in my initial email, this patch is focused solely on
fixing the crash issue. It does not address error handling in
StartupXLOG(), CreateCheckPoint(), or involve any broader code
cleanup.
That sounds fine to me.
--
Michael
Using PANIC is an inherited historical artifact that has been
introduced around 4d14fe0048cf with the introduction of WAL. There
was nothing like archiving or even base backup back then. Switching
the existing surrounding one to also use a FATAL is something that
seems worth considering to me for the checkpoint record, at least
based on the pattern that there could be a driver error even if there
is no backup_label file (aka for example the case of FS-level snapshots
with one partition used for the data folder, no?).
Thanks for explaining the historical context. I agree that switching
the existing PANIC to FATAL for the checkpoint record case makes
sense. I will include this change in the next patch if there are no
objections from others.
This offers bonus points in the shape of more tests like the one you
have sent upthread. It's not something that I would backpatch as it
is a behavior change, but I'm open to seeing that as an improvement in
usability for future releases: PANIC is for cases that should never
happen for internal states, due to an internal logic error, or an OS
going crazy. Here we have a should-not-happen case triggered by a
user, and a FATAL still provides the same information about what's
wrong. Let's make such changes separate patches, of course, depending
on what we find on the way.
Thanks for the suggestion. I will keep that in mind and look to add
more such tests in future.
Best Regards,
Nitin Jadhav
Azure Database for PostgreSQL
Microsoft
I have incorporated all the feedback discussed above and attached the v3 patch.
Please take a look and let me know if you have any additional feedback.
Best Regards,
Nitin Jadhav
Azure Database for PostgreSQL
Microsoft
Attachments:
v3-0001-Fix-crash-during-recovery-when-redo-segment-is-missi.patch (application/x-patch)
From ce9ef428d961a2de093be4a8d52b934be77cec60 Mon Sep 17 00:00:00 2001
From: Nitin Jadhav <nitinjadhavpostgres@gmail.com>
Date: Wed, 10 Dec 2025 10:13:35 +0000
Subject: [PATCH] Fix crash during recovery when redo segment is missing
The issue was that PostgreSQL did not handle the case when the redo LSN and
checkpoint LSN were in separate segments, and the file containing the
redo LSN was missing. This resulted in a crash.
The fix ensures the REDO location exists after reading the checkpoint
record in InitWalRecovery(). If the REDO location is missing, it logs
a FATAL error. Additionally, a new test script,
050_redo_segment_missing.pl, has been added to validate this.
This change also updates an existing error path: the previous PANIC emitted
when ReadCheckpointRecord() fails in the no backup label case is now
lowered to FATAL for consistency and improved operator experience.
---
src/backend/access/transam/xlog.c | 2 +
src/backend/access/transam/xlogrecovery.c | 12 +-
src/test/recovery/meson.build | 1 +
.../recovery/t/050_redo_segment_missing.pl | 106 ++++++++++++++++++
4 files changed, 120 insertions(+), 1 deletion(-)
create mode 100644 src/test/recovery/t/050_redo_segment_missing.pl
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 22d0a2e8c3a..55d074c5389 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7153,6 +7153,8 @@ CreateCheckPoint(int flags)
if (log_checkpoints)
LogCheckpointStart(flags, false);
+ INJECTION_POINT("checkpoint-after-start", NULL);
+
/* Update the process title */
update_checkpoint_display(flags, false, false);
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 21b8f179ba0..8671e46355f 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -790,6 +790,16 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
ereport(DEBUG1,
errmsg_internal("checkpoint record is at %X/%08X",
LSN_FORMAT_ARGS(CheckPointLoc)));
+
+ /* Make sure that REDO location exists. */
+ if (RedoStartLSN < CheckPointLoc)
+ {
+ XLogPrefetcherBeginRead(xlogprefetcher, RedoStartLSN);
+ if (!ReadRecord(xlogprefetcher, LOG, false, RedoStartTLI))
+ ereport(FATAL,
+ errmsg("could not find redo location %X/%08X referenced by checkpoint record at %X/%08X",
+ LSN_FORMAT_ARGS(RedoStartLSN), LSN_FORMAT_ARGS(CheckPointLoc)));
+ }
}
else
{
@@ -799,7 +809,7 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
* can't read the last checkpoint because this allows us to
* simplify processing around checkpoints.
*/
- ereport(PANIC,
+ ereport(FATAL,
errmsg("could not locate a valid checkpoint record at %X/%08X",
LSN_FORMAT_ARGS(CheckPointLoc)));
}
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index 523a5cd5b52..e93248bd66e 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -58,6 +58,7 @@ tests += {
't/047_checkpoint_physical_slot.pl',
't/048_vacuum_horizon_floor.pl',
't/049_wait_for_lsn.pl',
+ 't/050_redo_segment_missing.pl',
],
},
}
diff --git a/src/test/recovery/t/050_redo_segment_missing.pl b/src/test/recovery/t/050_redo_segment_missing.pl
new file mode 100644
index 00000000000..92d217299d2
--- /dev/null
+++ b/src/test/recovery/t/050_redo_segment_missing.pl
@@ -0,0 +1,106 @@
+# Copyright (c) 2021-2025, PostgreSQL Global Development Group
+#
+# Evaluates PostgreSQL's recovery behavior when a WAL segment
+# containing the redo record is missing. It initializes a
+# PostgreSQL instance, configures it, and executes a series of
+# operations to mimic a scenario where a WAL segment is absent.
+# Then checks that PostgreSQL logs the correct error message
+# when it fails to locate the redo record.
+
+use strict;
+use warnings FATAL => 'all';
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Time::HiRes qw(usleep);
+use Test::More;
+
+if ($ENV{enable_injection_points} ne 'yes')
+{
+ plan skip_all => 'Injection points not supported by this build';
+}
+
+my $psql_timeout = IPC::Run::timer($PostgreSQL::Test::Utils::timeout_default);
+my $node = PostgreSQL::Test::Cluster->new('testnode');
+$node->init;
+$node->start;
+
+# Check if the extension injection_points is available, as it may be
+# possible that this script is run with installcheck, where the module
+# would not be installed by default.
+if (!$node->check_extension('injection_points'))
+{
+ plan skip_all => 'Extension injection_points not installed';
+}
+$node->safe_psql('postgres', q(CREATE EXTENSION injection_points));
+
+# Start a psql session to run the checkpoint in the background and make
+# the test wait on the injection point so the checkpoint stops just after
+# it starts.
+my $checkpoint = $node->background_psql('postgres');
+$checkpoint->query(
+ q{select injection_points_attach('checkpoint-after-start','wait')}
+);
+$checkpoint->query_until(
+ qr/starting_checkpoint/,
+ q(\echo starting_checkpoint
+checkpoint;
+\q
+));
+
+# Wait until the checkpoint has reached the injection point.
+note('waiting for injection_point');
+$node->wait_for_event('checkpointer', 'checkpoint-after-start');
+note('injection_point is reached');
+
+# Switch the WAL to ensure that the WAL segment containing the checkpoint
+# record is different from the one containing the redo record.
+$node->safe_psql('postgres', 'SELECT pg_switch_wal()');
+
+# Continue the checkpoint and wait for its completion.
+my $log_offset = -s $node->logfile;
+$node->safe_psql('postgres',
+ q{select injection_points_wakeup('checkpoint-after-start')});
+$node->wait_for_log(qr/checkpoint complete/, $log_offset);
+
+# Retrieve the WAL file names for the redo record and checkpoint record.
+my $redo_lsn = $node->safe_psql('postgres', "SELECT redo_lsn FROM pg_control_checkpoint()");
+my $redo_walfile_name = $node->safe_psql('postgres', "SELECT pg_walfile_name('$redo_lsn')");
+my $checkpoint_lsn = $node->safe_psql('postgres', "SELECT checkpoint_lsn FROM pg_control_checkpoint()");
+my $checkpoint_walfile_name = $node->safe_psql('postgres', "SELECT pg_walfile_name('$checkpoint_lsn')");
+
+# Die if both records are in the same WAL file.
+if ($redo_walfile_name eq $checkpoint_walfile_name) {
+ die "redo record wal file is the same as checkpoint record wal file, aborting test";
+}
+
+# Remove the WAL segment containing the redo record
+unlink $node->data_dir . "/pg_wal/$redo_walfile_name"
+ or die "Could not remove WAL file: $!";
+
+$node->stop('immediate');
+
+# Use run_log instead of node->start because this test expects
+# that the server ends with an error during recovery.
+run_log(
+ [
+ 'pg_ctl',
+ '--pgdata' => $node->data_dir,
+ '--log' => $node->logfile,
+ 'start',
+ ]);
+
+# Wait for postgres to terminate
+foreach my $i (0 .. 10 * $PostgreSQL::Test::Utils::timeout_default)
+{
+ last if !-f $node->data_dir . '/postmaster.pid';
+ usleep(100_000);
+}
+
+# Confirm that the recovery fails with an expected error
+my $logfile = slurp_file($node->logfile());
+ok( $logfile =~
+ qr/FATAL: could not find redo location .* referenced by checkpoint record at .*/,
+ "ends with FATAL because it could not find redo location"
+);
+
+done_testing();
\ No newline at end of file
--
2.43.0
On Wed, Dec 10, 2025 at 04:01:28PM +0530, Nitin Jadhav wrote:
I have incorporated all the feedback discussed above and attached the v3 patch.
Please take a look and let me know if you have any additional feedback.
I have been looking at this patch, and as far as I can see this is an
old problem, with consequences that vary depending on the branches
dealt with (looked at the code and tested down to 9.2, did not go
further down).
HEAD is actually not a bad student: if the REDO record is missing, we
crash on a pointer dereference. Now, older branches have a different
story to offer, leading to a worse behavior. For example on v14, our
oldest stable branch still supported: if the REDO record is missing
the startup process thinks that no REDO is required at all because
there is no record, and skips everything until the end of recovery. I
have been lacking time today to look at the state of the branches
between v15 and v17, but I am pretty sure that we are fighting between
the pointer dereference and the v14 behavior for these as well. v18
should be the pointer issue.
FWIW, this new ReadRecord() call has given me a pause for a few hours,
where I was wondering what is the effect of fetching_ckpt = false when
recovery or standby mode is requested. At the end I think that we are
OK: if there is a restore_command and we don't have a backup_label, we
would be able to fetch the redo record from an archive, finishing
recovery even in this case where the backup_label is missing. For
example, allowing recovery for the node in the test (recovery.signal +
restore_command) while copying the missing segment to archive
completes recovery, with consistency reached at the end of the
checkpoint and all the records between the redo point and the end of
checkpoint record, then a TLI jump happens.
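For reference, a rough sketch of that manual verification in TAP form,
assuming an archive directory (the illustrative /path/to/archive below)
that still holds the segment removed from pg_wal:
# Illustrative only: /path/to/archive is assumed to contain the WAL
# segment that was removed from pg_wal.
$node->append_conf('postgresql.conf',
 q{restore_command = 'cp /path/to/archive/%f %p'});
# An empty recovery.signal requests archive recovery at the next start,
# letting the redo record be fetched back through restore_command.
open my $fh, '>', $node->data_dir . '/recovery.signal' or die $!;
close $fh;
$node->start;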
Another set of things that were wrong in the patch, found during
review:
- The TAP test with injection points was failing for two memory
allocations attempted as the point is added in a critical section,
after the REDO record is generated. There is one allocation for the
point loaded, due to the library path. There was a second one due to
the wait machinery in the library injection_points for the DSM
registry and the shmem state we need. Both of these problems can be
avoided by using two wait-based points: one
wait happens before the critical section, and is used to initialize
the shmem state. The second is loaded outside the critical section
and run inside the critical section from the cache.
- The patch was actually wrong: we need to rely on the redo LSN as
found in the checkpoint record obtained a bit earlier. There was at
least one regression due to that: recovery test 030 for the standby
node.
As a whole, this has to be backpatched, because we are just ignoring
recovery if the REDO record is simply missing while the checkpoint
record is found. For the back branches, the PANIC is actually what
I'm planning to go with, to match with what is happening when the
checkpoint record is missing. On HEAD, let's use a softer FATAL to
give a way to test this driver error moving forward. There is a
secondary argument for softening the PANIC when the checkpoint record
is missing to a FATAL, but let's discuss that separately. Hence that
would make two patches:
- Something simpler than the attached, without the test with a PANIC
for the redo record missing, for all the branches.
- A second patch lowering this PANIC to a FATAL, with the test
included, only for HEAD.
I have done a couple of tests in the CI and locally, and that was
looking stable. Attached is the result of what would happen on HEAD,
where the change in xlogrecovery.c would include the back-branch
versions.
Thoughts or comments are welcome.
--
Michael
Attachments:
v4-0001-Check-presence-of-REDO-record.patch (text/x-diff; charset=us-ascii)
From 04bda25a7d52526282f3f1a0437077c94794b5a2 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Mon, 15 Dec 2025 17:48:07 +0900
Subject: [PATCH v4] Check presence of REDO record
---
src/backend/access/transam/xlog.c | 6 +
src/backend/access/transam/xlogrecovery.c | 10 ++
src/test/recovery/meson.build | 1 +
.../recovery/t/050_redo_segment_missing.pl | 117 ++++++++++++++++++
4 files changed, 134 insertions(+)
create mode 100644 src/test/recovery/t/050_redo_segment_missing.pl
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6a5640df51af..430a38b1a216 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7001,6 +7001,10 @@ CreateCheckPoint(int flags)
*/
SyncPreCheckpoint();
+ /* Run these points outside the critical section. */
+ INJECTION_POINT("create-checkpoint-initial", NULL);
+ INJECTION_POINT_LOAD("create-checkpoint-run");
+
/*
* Use a critical section to force system panic if we have trouble.
*/
@@ -7151,6 +7155,8 @@ CreateCheckPoint(int flags)
if (log_checkpoints)
LogCheckpointStart(flags, false);
+ INJECTION_POINT_CACHED("create-checkpoint-run", NULL);
+
/* Update the process title */
update_checkpoint_display(flags, false, false);
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index ae2398d6975b..38b594d21709 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -805,6 +805,16 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
}
memcpy(&checkPoint, XLogRecGetData(xlogreader), sizeof(CheckPoint));
wasShutdown = ((record->xl_info & ~XLR_INFO_MASK) == XLOG_CHECKPOINT_SHUTDOWN);
+
+ /* Make sure that REDO location exists. */
+ if (checkPoint.redo < CheckPointLoc)
+ {
+ XLogPrefetcherBeginRead(xlogprefetcher, checkPoint.redo);
+ if (!ReadRecord(xlogprefetcher, LOG, false, checkPoint.ThisTimeLineID))
+ ereport(FATAL,
+ errmsg("could not find redo location %X/%08X referenced by checkpoint record at %X/%08X",
+ LSN_FORMAT_ARGS(checkPoint.redo), LSN_FORMAT_ARGS(CheckPointLoc)));
+ }
}
if (ArchiveRecoveryRequested)
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index 523a5cd5b527..e93248bd66e2 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -58,6 +58,7 @@ tests += {
't/047_checkpoint_physical_slot.pl',
't/048_vacuum_horizon_floor.pl',
't/049_wait_for_lsn.pl',
+ 't/050_redo_segment_missing.pl',
],
},
}
diff --git a/src/test/recovery/t/050_redo_segment_missing.pl b/src/test/recovery/t/050_redo_segment_missing.pl
new file mode 100644
index 000000000000..113477e01520
--- /dev/null
+++ b/src/test/recovery/t/050_redo_segment_missing.pl
@@ -0,0 +1,117 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+#
+# Evaluates PostgreSQL's recovery behavior when a WAL segment containing the
+# redo record is missing, with a checkpoint record located in a different
+# segment.
+
+use strict;
+use warnings FATAL => 'all';
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+if ($ENV{enable_injection_points} ne 'yes')
+{
+ plan skip_all => 'Injection points not supported by this build';
+}
+
+my $node = PostgreSQL::Test::Cluster->new('testnode');
+$node->init;
+$node->append_conf('postgresql.conf', 'log_checkpoints = on');
+$node->start;
+
+# Check if the extension injection_points is available, as it may be
+# possible that this script is run with installcheck, where the module
+# would not be installed by default.
+if (!$node->check_extension('injection_points'))
+{
+ plan skip_all => 'Extension injection_points not installed';
+}
+$node->safe_psql('postgres', q(CREATE EXTENSION injection_points));
+
+# Note that this uses two injection points based on waits, not one.
+# This may look strange, but this works as a workaround to enforce
+# all memory allocations to happen outside the critical section of a
+# checkpoint.
+# First, "create-checkpoint-initial" is run outside the critical
+# section, and is used as a way to initialize the shared memory required
+# for the wait machinery with its DSM registry.
+# Then, "create-checkpoint-run" is run inside the critical section of
+# a checkpoint, with its callback loaded outside its critical section to
+# allocate any memory required by the callback.
+$node->safe_psql('postgres',
+ q{select injection_points_attach('create-checkpoint-initial', 'wait')});
+$node->safe_psql('postgres',
+ q{select injection_points_attach('create-checkpoint-run', 'wait')});
+
+# Start a psql session to run the checkpoint in the background and make
+# the test wait on the injection point so the checkpoint stops just after
+# it starts.
+my $checkpoint = $node->background_psql('postgres');
+$checkpoint->query_until(
+ qr/starting_checkpoint/,
+ q(\echo starting_checkpoint
+checkpoint;
+));
+
+# Wait for initialization point to finish, the checkpointer is still
+# outside its critical section. Then release to reach the second
+# point.
+$node->wait_for_event('checkpointer', 'create-checkpoint-initial');
+$node->safe_psql('postgres',
+ q{select injection_points_wakeup('create-checkpoint-initial')});
+
+# Wait until the checkpoint has reached the second injection point,
+# we are now in the middle of a checkpoint running after the redo
+# record has been logged.
+$node->wait_for_event('checkpointer', 'create-checkpoint-run');
+
+# Switch the WAL segment, ensuring that the redo record will be
+# included in a different segment than the checkpoint record.
+$node->safe_psql('postgres', 'SELECT pg_switch_wal()');
+
+# Continue the checkpoint and wait for its completion.
+my $log_offset = -s $node->logfile;
+$node->safe_psql('postgres',
+ q{select injection_points_wakeup('create-checkpoint-run')});
+$node->wait_for_log(qr/checkpoint complete/, $log_offset);
+
+$checkpoint->quit;
+
+# Retrieve the WAL file names for the redo record and checkpoint record.
+my $redo_lsn = $node->safe_psql('postgres',
+ "SELECT redo_lsn FROM pg_control_checkpoint()");
+my $redo_walfile_name =
+ $node->safe_psql('postgres', "SELECT pg_walfile_name('$redo_lsn')");
+my $checkpoint_lsn = $node->safe_psql('postgres',
+ "SELECT checkpoint_lsn FROM pg_control_checkpoint()");
+my $checkpoint_walfile_name =
+ $node->safe_psql('postgres', "SELECT pg_walfile_name('$checkpoint_lsn')");
+
+# Redo record and checkpoint record should be on different segments.
+isnt($redo_walfile_name, $checkpoint_walfile_name,
+ 'redo and checkpoint records on different segments');
+
+# Remove the WAL segment containing the redo record.
+unlink $node->data_dir . "/pg_wal/$redo_walfile_name"
+ or die "Could not remove WAL file: $!";
+
+$node->stop('immediate');
+
+# Use run_log instead of node->start because this test expects
+# that the server ends with an error during recovery.
+run_log(
+ [
+ 'pg_ctl',
+ '--pgdata' => $node->data_dir,
+ '--log' => $node->logfile,
+ 'start',
+ ]);
+
+# Confirm that recovery has failed, as expected.
+my $logfile = slurp_file($node->logfile());
+ok( $logfile =~
+ qr/FATAL: could not find redo location .* referenced by checkpoint record at .*/,
+ "ends with FATAL because it could not find redo location");
+
+done_testing();
--
2.51.0
On Mon, Dec 15, 2025 at 05:48:29PM +0900, Michael Paquier wrote:
I have done a couple of tests in the CI and locally, and that was
looking stable. Attached is the result of what would happen on HEAD,
where the change in xlogrecovery.c would include the back-branch
versions.Thoughts or comments are welcome.
And done all of that, with the test added to HEAD and a backpatch down
to v14 for the main fix.
--
Michael
As a whole, this has to be backpatched, because we are just ignoring
recovery if the REDO record is simply missing while the checkpoint
record is found. For the back branches, the PANIC is actually what
I'm planning to go with, to match with what is happening when the
checkpoint record is missing. On HEAD, let's use a softer FATAL to
give a way to test this driver error moving forward. There is a
secondary argument for softening the PANIC when the checkpoint record
is missing to a FATAL, but let's discuss that separately. Hence that
would make two patches:
- Something simpler than the attached, without the test with a PANIC
for the redo record missing, for all the branches.
- A second patch lowering this PANIC to a FATAL, with the test
included, only for HEAD.
And done all of that, with the test added to HEAD and a backpatch down
to v14 for the main fix.
Thanks for fixing the test code. The plan for what goes into the back
branches and what goes into HEAD makes sense. Thanks for committing
it.
There is a
secondary argument for softening the PANIC when the checkpoint record
is missing to a FATAL, but let's discuss that separately.
I was planning to start a separate thread for this point, but since it
was a small change I had included it here earlier. I understand the
considerations involved even for these minor adjustments. I will start
a separate thread for this.
Best Regards,
Nitin Jadhav
Azure Database for PostgreSQL
Microsoft
On Tue, Dec 16, 2025 at 12:40:25PM +0530, Nitin Jadhav wrote:
I was planning to start a separate thread for this point, but since it
was a small change I had included it here earlier. I understand the
considerations involved even for these minor adjustments. I will start
a separate thread for this.
An argument that would sound in favor of a switch from PANIC to FATAL
is the testing side: if one removes the segment where the checkpoint
record resides, we crash. Of course, one should not do that, but I
have been wondering for years if it would not be a good idea to
lift that a bit and expand the in-core tests and how we expect the
startup process to deal with things. One of my lines of thought is
that the PANIC behavior is inherited from a time where we did not have
online backups and archive recovery, where such manipulations have
never been possible to start with because WAL segments had a full life
only linked to the backend in pg_wal. Perhaps others don't agree with
that, that's fine.
It would be easy enough to expand the test added by 15f68cebdcec to
check the no-checkpoint case, of course. I just did that this morning
while quickly testing various recovery patterns, which was easier than
rewriting a new script for the job. :)
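For what it's worth, the no-checkpoint variant would roughly reuse the
shape of 050_redo_segment_missing.pl, dropping the other segment
instead (a sketch, reusing the variables of that script; note that
this path currently still ends in a PANIC on HEAD):
# Remove the segment holding the checkpoint record instead of the one
# holding the redo record.
unlink $node->data_dir . "/pg_wal/$checkpoint_walfile_name"
 or die "Could not remove WAL file: $!";
$node->stop('immediate');
run_log([
 'pg_ctl',
 '--pgdata' => $node->data_dir,
 '--log' => $node->logfile,
 'start',
]);
# The message is the same one that would be kept if the level were
# lowered to FATAL.
my $logfile = slurp_file($node->logfile());
ok( $logfile =~ qr/could not locate a valid checkpoint record/,
 'recovery fails when the checkpoint segment is missing');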
--
Michael
An argument that would sound in favor of a switch from PANIC to FATAL
is the testing side: if one removes the segment where the checkpoint
record resides, we crash. Of course, one should not do that, but I
have been wondering for years if it would not be a good idea to
lift that a bit and expand the in-core tests and how we expect the
startup process to deal with things. One of my lines of thought is
that the PANIC behavior is inherited from a time where we did not have
online backups and archive recovery, where such manipulations have
never been possible to start with because WAL segments had a full life
only linked to the backend in pg_wal. Perhaps others don't agree with
that, that's fine.
Makes sense. I agree that with modern features like backups, archiving
and external WAL handling, it’s common for WAL segments to go missing
due to operational scenarios, and these cases are often recoverable.
So switching to FATAL seems appropriate.
I would prefer to discuss this in a separate thread with a more
accurate subject line so we can get more opinions. Please let me know
if it is ok or you would rather continue the discussion here.
It would be easy enough to expand the test added by 15f68cebdcec to
check the no-checkpoint case, of course. I just did that this morning
while quickly testing various recovery patterns, which was easier than
rewriting a new script for the job. :)
Thanks for this. I also tested it by adding a TAP test. I initially
planned to share it, but decided instead to start a new thread first,
gather more opinions, and then share a patch if we agree on making the
change.
Best Regards,
Nitin Jadhav
Azure Database for PostgreSQL
Microsoft
On Tue, Dec 16, 2025 at 03:31:58PM +0530, Nitin Jadhav wrote:
I would prefer to discuss this in a separate thread with a more
accurate subject line so we can get more opinions. Please let me know
if it is ok or you would rather continue the discussion here.
Yes, let's do that.
--
Michael