Fix crash during recovery when redo segment is missing
Hi,
In [1], Andres reported a bug where PostgreSQL crashes during recovery
if the segment containing the redo pointer does not exist. I have
attempted to address this issue and am sharing a patch for it.

The problem was that PostgreSQL did not PANIC when the redo LSN and
checkpoint LSN were in separate segments and the file containing the
redo LSN was missing, leading to a crash. Andres has provided a
detailed analysis of the behavior across different settings and
versions; please refer to [1] for more information. The crash occurs
because PostgreSQL does not fail early in this situation.
The issue is resolved by verifying that the redo record exists once
we have successfully read the checkpoint record in InitWalRecovery().
This prevents control from reaching PerformWalRecovery() unless the
WAL file containing the redo record exists.

A new test script, 044_redo_segment_missing.pl, has been added to
validate this. To get the redo record into a different WAL file from
the one holding the checkpoint record, I wait for the checkpoint start
message and then issue pg_switch_wal(), which should happen before the
checkpoint completes. Then I crash the server, and during the restart
it should log an appropriate error indicating that it could not find
the redo location. Please let me know if there is a better way to
reproduce this behavior. I have tested and verified this with the
various scenarios Andres pointed out in [1].

Please note that this patch does not address error checking in
StartupXLOG(), CreateCheckPoint(), etc., nor does it focus on cleaning
up existing code.
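For reference, here is roughly the shape of the extra check in
InitWalRecovery(). This is a simplified sketch only (variable names
and argument lists are trimmed); the real change is in the attached
patch:

	/*
	 * After the checkpoint record has been read successfully, make sure
	 * the redo record it points to can also be read, so that we fail
	 * here instead of crashing much later in PerformWalRecovery().
	 */
	if (checkPoint.redo < CheckPointLoc)
	{
		/* online checkpoint: redo pointer precedes the checkpoint record */
		XLogPrefetcherBeginRead(xlogprefetcher, checkPoint.redo);
		if (ReadRecord(xlogprefetcher, LOG, false, CheckPointTLI) == NULL)
			ereport(PANIC,
					(errmsg("could not find redo location %X/%X referenced by checkpoint record at %X/%X",
							LSN_FORMAT_ARGS(checkPoint.redo),
							LSN_FORMAT_ARGS(CheckPointLoc))));
	}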
Attaching the patch. Please review and share your feedback. Thanks to
Andres for spotting the bug and providing the detailed report [1].
[1]: /messages/by-id/20231023232145.cmqe73stvivsmlhs@awork3.anarazel.de
Best Regards,
Nitin Jadhav
Azure Database for PostgreSQL
Microsoft
The patch wasn’t applying cleanly on master, so I’ve rebased it and
also added it to the PG19‑4 CommitFest:
https://commitfest.postgresql.org/patch/6279/
Please review and share your feedback.
Best Regards,
Nitin Jadhav
Azure Database for PostgreSQL
Microsoft
Apologies, I missed attaching the patch earlier. Please find the v2
version attached.
Best Regards,
Nitin Jadhav
Azure Database for PostgreSQL
Microsoft
Hi!
1) Please do not top-post on these lists
2) I did not get the exact reason this thread is different from [0]
3)
Your tests are using a busy loop with usleep, which nowadays is
considered bad practice. There are three such places, and I think the
first two of them can be replaced with an injection point wait.
+# Wait for the checkpoint to complete
+my $checkpoint_complete = 0;
+foreach my $i (0 .. 10 * $PostgreSQL::Test::Utils::timeout_default) {
+	if ($node->log_contains("checkpoint complete", $logstart)) {
+		$checkpoint_complete = 1;
+		last;
+	}
+	usleep(100_000);
+}
4) There are comments from Robert, which are not covered [1].
[0]: /messages/by-id/20231023232145.cmqe73stvivsmlhs@awork3.anarazel.de
[1]: /messages/by-id/CA+Tgmob1x7HMcWAb=6ep2cBuWuwpT-p9E7EmQegWFu6E8nKHeg@mail.gmail.com
--
Best regards,
Kirill Reshke
On Thu, Dec 04, 2025 at 12:00:23PM +0500, Kirill Reshke wrote:
Your tests are using a busy loop with usleep, which nowadays is
considered bad practice. There are three such places, and I think the
first two of them can be replaced with an injection point wait.
(This fell off my radar, apologies about that.)
The problem with this test is not related to its use of sleeps, which
is perfectly fine to check the start or end timings of a checkpoint
depending on the contents of the logs. It has two issues.
The first issue is that the test is unnecessarily long, taking more
than 30s to finish because it relies on checkpoint_timeout to kick a
checkpoint. It could use a background psql object to trigger a
checkpoint and speed the whole thing up. As written, the test sits
idle for a long time, doing nothing.
Your intuition about injection points is right, though it points to a
second problem: the test is not reliable, because the checkpoint could
finish *before* the WAL segment is switched, while we expect the WAL
segment switch to happen while the checkpoint is in progress. If you
want to make that deterministic, a wait in the middle of checkpointing
would make the test actually test what it should. As it stands, the
test would randomly die on its "redo record wal file is the same"
message. That's OK to prove the point of the initial patch, but it's
not acceptable for a test that could be added to the tree.
An update of src/test/recovery/meson.build to add the new test is
also required.
Anyway, I think that the patch is overdoing it in issuing a PANIC in
this case, going backwards with respect to the other thread from [0]:
a FATAL would be perfectly OK, like in the backup_label path, because
manipulations that leave a WAL segment missing are indeed possible, as
the test posted here proves. There have been many arguments in the
past about performing recovery without a backup_label, as well. And a
FATAL would make the test something we could actually use, since it
would leave no backtraces on buildfarm hosts and no disk bloat.
If we do not re-check in InitWalRecovery() that the redo record is
around, the startup process happily goes down to PerformWalRecovery()
in an everything-is-fine mode, initializing a bunch of states, only to
crash on a pointer dereference while attempting to read the record at
the redo LSN, which does not exist. This is not right. So, oops. I
could see an argument for a backpatch when it comes to crash recovery,
because we don't actually know what has happened in that case; or
perhaps not, given the lack of complaints on the matter over the
years.
So I'd argue for updating the patch along these lines instead:
- Switch the PANIC to a FATAL, to inform about the pilot error. This
addresses the pointer dereference caused by WAL replay chasing a redo
LSN that does not exist.
- Add a test, with an injection point after checkpoint startup, based
on a manual checkpoint run in a background psql, without relying on
checkpoint_timeout, to make the test faster.
- Given the lack of complaints, we could be careful and apply that
extra check only on HEAD, for v19.
4) There are comments from Robert, which are not covered [1].
[0] /messages/by-id/20231023232145.cmqe73stvivsmlhs@awork3.anarazel.de
[1] /messages/by-id/CA+Tgmob1x7HMcWAb=6ep2cBuWuwpT-p9E7EmQegWFu6E8nKHeg@mail.gmail.com
I get the argument about the spaghetti code in general; however, the
code in question here deals with WAL replay initialization, for the
backup_label vs. non-backup_label paths. Perhaps I'm more used to
this area than others, but it's not that much pasta to me.
I don't see a point in moving the memcpy() and the wasShutdown parts
as proposed in the patch, by the way. The PANIC would block things
in its else block. Let's limit ourselves to the extra ReadRecord()
check for the redo record once we have found a checkpoint record.
I'd actually wonder if we should not lower the existing PANIC to a
FATAL if ReadCheckpointRecord() fails, as well. The result would be
the same for operators, without the need to deal with backtrace
cleanups after a crash. And we still have the error message that
tells us what's going on.
Any thoughts or opinions from others?
--
Michael
Hi,
Thanks for reviewing and sharing your feedback.
1) Please do not top-post on these lists
Apologies for that. I will make sure to follow this going forward.
2) I did not get the exact reason this thread is different from [0]
The purpose of this thread was to share the actual patch addressing
the crash scenario discussed in [0]. That said, I agree I should have
posted the patch in [0] itself. I will make sure to use existing
threads wherever possible going forward.
Best Regards,
Nitin Jadhav
Azure Database for PostgreSQL
Microsoft
Thank you for the detailed feedback.
Your tests are using a busy loop with usleep, which nowadays is
considered bad practice. There are three such places, and I think the
first two of them can be replaced with an injection point wait.

(This fell off my radar, apologies about that.)
The problem with this test is not related to its use of sleeps, which
is perfectly fine to check the start or end timings of a checkpoint
depending on the contents of the logs. It has two issues.

The first issue is that the test is unnecessarily long, taking more
than 30s to finish because it relies on checkpoint_timeout to kick a
checkpoint. It could use a background psql object to trigger a
checkpoint and speed the whole thing up. As written, the test sits
idle for a long time, doing nothing.

Your intuition about injection points is right, though it points to a
second problem: the test is not reliable, because the checkpoint could
finish *before* the WAL segment is switched, while we expect the WAL
segment switch to happen while the checkpoint is in progress. If you
want to make that deterministic, a wait in the middle of checkpointing
would make the test actually test what it should. As it stands, the
test would randomly die on its "redo record wal file is the same"
message. That's OK to prove the point of the initial patch, but it's
not acceptable for a test that could be added to the tree.

An update of src/test/recovery/meson.build to add the new test is
also required.
I will work on improving the test accordingly and include the changes
in the next version.
Anyway, I think that the patch is overdoing it in issuing a PANIC in
this case, going backwards with respect to the other thread from [0]:
a FATAL would be perfectly OK, like in the backup_label path, because
manipulations that leave a WAL segment missing are indeed possible, as
the test posted here proves. There have been many arguments in the
past about performing recovery without a backup_label, as well. And a
FATAL would make the test something we could actually use, since it
would leave no backtraces on buildfarm hosts and no disk bloat.
The main reason I chose PANIC is that when the ReadCheckpointRecord()
fails, we already use PANIC for the error message ‘could not locate a
valid checkpoint record…’ in the no_backup_label case, whereas the
similar flow with backup_label uses FATAL. I am not entirely sure why
this difference exists. I looked into it but couldn’t find much. If we
decide to change this patch to use FATAL for ‘could not find redo
location…’, should we also consider changing the existing PANIC to
FATAL for consistency?
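To make the question concrete, the consistency change I am wondering
about is essentially the following (a sketch only, with the message
abbreviated):

	/* today, when ReadCheckpointRecord() fails without a backup_label */
	ereport(PANIC,
			(errmsg("could not locate a valid checkpoint record ...")));

	/* proposed, matching the FATAL already used on the backup_label path */
	ereport(FATAL,
			(errmsg("could not locate a valid checkpoint record ...")));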
4) There are comments from Robert, which are not covered [1].
[0] /messages/by-id/20231023232145.cmqe73stvivsmlhs@awork3.anarazel.de
[1] /messages/by-id/CA+Tgmob1x7HMcWAb=6ep2cBuWuwpT-p9E7EmQegWFu6E8nKHeg@mail.gmail.com

I get the argument about the spaghetti code in general; however, the
code in question here deals with WAL replay initialization, for the
backup_label vs. non-backup_label paths. Perhaps I'm more used to
this area than others, but it's not that much pasta to me.
Yes. As noted in my initial email, this patch is focused solely on
fixing the crash issue. It does not address error handling in
StartupXLOG(), CreateCheckPoint(), or involve any broader code
cleanup.
I don't see a point in moving the memcpy() and the wasShutdown parts
as proposed in the patch, by the way. The PANIC would block things
in its else block. Let's limit ourselves to the extra ReadRecord()
check for the redo record once we have found a checkpoint record.
I agree and will take care of this in the next patch.
Best Regards,
Nitin Jadhav
Azure Database for PostgreSQL
Microsoft
On Mon, Dec 08, 2025 at 12:48:52PM +0530, Nitin Jadhav wrote:
An update of src/test/recovery/meson.build to add the new test is
also required.

I will work on improving the test accordingly and include the changes
in the next version.
Cool, thanks.
The main reason I chose PANIC is that when the ReadCheckpointRecord()
fails, we already use PANIC for the error message ‘could not locate a
valid checkpoint record…’ in the no_backup_label case, whereas the
similar flow with backup_label uses FATAL. I am not entirely sure why
this difference exists. I looked into it but couldn’t find much. If we
decide to change this patch to use FATAL for ‘could not find redo
location…’, should we also consider changing the existing PANIC to
FATAL for consistency?
Using PANIC here is an inherited historical artifact, introduced
around 4d14fe0048cf together with WAL itself. There was nothing like
archiving or even base backups back then. Switching the existing
surrounding one to also use a FATAL is something that seems worth
considering to me for the checkpoint record, at least based on the
pattern that there could be a driver error even if there is no
backup_label file (for example, FS-level snapshots with one partition
used for the data folder, no?).
This offers bonus points in the shape of more tests like the one you
have sent upthread. It's not something that I would backpatch as it
is a behavior change, but I'm open to seeing that as an improvement in
usability for future releases: PANIC is for cases that should never
happen for internal states, due to an internal logic error, or an OS
going crazy. Here we have a should-not-happen case triggered by a
user, and a FATAL still provides the same information about what's
wrong. Let's make such changes separate patches, of course, depending
on what we find on the way.
Yes. As noted in my initial email, this patch is focused solely on
fixing the crash issue. It does not address error handling in
StartupXLOG(), CreateCheckPoint(), or involve any broader code
cleanup.
That sounds fine to me.
--
Michael
Using PANIC here is an inherited historical artifact, introduced
around 4d14fe0048cf together with WAL itself. There was nothing like
archiving or even base backups back then. Switching the existing
surrounding one to also use a FATAL is something that seems worth
considering to me for the checkpoint record, at least based on the
pattern that there could be a driver error even if there is no
backup_label file (for example, FS-level snapshots with one partition
used for the data folder, no?).
Thanks for explaining the historical context. I agree that switching
the existing PANIC to FATAL for the checkpoint record case makes
sense. I will include this change in the next patch if there are no
objections from others.
This offers bonus points in the shape of more tests like the one you
have sent upthread. It's not something that I would backpatch as it
is a behavior change, but I'm open to seeing that as an improvement in
usability for future releases: PANIC is for cases that should never
happen for internal states, due to an internal logic error, or an OS
going crazy. Here we have a should-not-happen case triggered by a
user, and a FATAL still provides the same information about what's
wrong. Let's make such changes separate patches, of course, depending
on what we find on the way.
Thanks for the suggestion. I will keep that in mind and look to add
more such tests in the future.
Best Regards,
Nitin Jadhav
Azure Database for PostgreSQL
Microsoft
I have incorporated all the feedback discussed above and attached the v3 patch.
Please take a look and let me know if you have any additional feedback.
Best Regards,
Nitin Jadhav
Azure Database for PostgreSQL
Microsoft