Change checkpoint‑record‑missing PANIC to FATAL
Hi,
While working on [1], we discussed whether the redo-record-missing error
should be a PANIC or a FATAL. We concluded that FATAL is more appropriate:
it achieves the intended behavior and is consistent with the backup_label
path, which already reports FATAL in the same scenario.
However, when the checkpoint record is missing, the behavior remains
inconsistent: without a backup_label, we currently raise a PANIC; with a
backup_label, the same code path reports a FATAL. Since we have already
changed the redo-record-missing case to FATAL in 15f68ce
<https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=15f68cebdcec>,
it seems reasonable to align the checkpoint‑record‑missing case as well.
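To make the inconsistency concrete, here is a toy model of the behavior described above. It is plain Python, not PostgreSQL source; the function name and the `proposed` flag are made up for illustration only.

```python
# Toy model of the error level described in this thread; not PostgreSQL code.
def missing_checkpoint_severity(have_backup_label: bool, proposed: bool = False) -> str:
    """Return the error level raised when the checkpoint record cannot be read."""
    if have_backup_label:
        # The backup_label path already reports FATAL today.
        return "FATAL"
    # Without a backup_label: PANIC today, FATAL under this proposal.
    return "FATAL" if proposed else "PANIC"


for have_label in (False, True):
    print(
        f"backup_label={have_label}: "
        f"current={missing_checkpoint_severity(have_label)}, "
        f"proposed={missing_checkpoint_severity(have_label, proposed=True)}"
    )
```

Under the proposal both rows report the same level, which is the alignment being asked for.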
The existing PANIC dates back to an era before online backups and archive
recovery existed, when external manipulation of WAL was not expected and
such conditions were treated as internal faults. With all such features, it
is much more realistic for WAL segments to go missing due to operational
issues, and such cases are often recoverable. So switching this to FATAL
appears appropriate.
Please share your thoughts.
I am happy to share a patch including a TAP test to cover this behavior
once we agree to proceed.
[1]: /messages/by-id/CAMm1aWaaJi2w49c0RiaDBfhdCL6ztbr9m=daGqiOuVdizYWYaA@mail.gmail.com
Best Regards,
Nitin Jadhav
Azure Database for PostgreSQL
Microsoft
On Tue, Dec 16, 2025 at 04:25:37PM +0530, Nitin Jadhav wrote:
it seems reasonable to align the checkpoint‑record‑missing case as well.
The existing PANIC dates back to an era before online backups and archive
recovery existed, when external manipulation of WAL was not expected and
such conditions were treated as internal faults. With all such features, it
is much more realistic for WAL segments to go missing due to operational
issues, and such cases are often recoverable. So switching this to FATAL
appears appropriate.Please share your thoughts.
FWIW, I think that we should lift the PANIC pattern in this case, at
least to be able to provide more tests around the manipulation of WAL
segments when triggering recovery, with or without a backup_label as
much as with or without a recovery/standby.signal defined in the tree.
The PANIC pattern to blow up the backend when missing a checkpoint
record at the beginning of recovery is a historical artifact of
4d14fe0048cf. The backend has evolved a lot since, particularly with
WAL archives that came much later than that. Lowering that to a FATAL
does not imply a loss of information, just the lack of a backtrace
that can be triggered depending on how one has set up a cluster to
start (say a recovery.signal was forgotten and pg_wal/ has no
contents, etc.). And I doubt that a trace is really useful anyway
in this specific code path.
I'd love to hear the opinion of others on the matter, so if anybody
has comments, feel free.
I'd be curious to look at the amount of tests related to recovery
startup you have in mind anyway, Nitin.
--
Michael
I'd be curious to look at the amount of tests related to recovery
startup you have in mind anyway, Nitin.
Apologies for the delay.
At a high level, the recovery startup cases we want to test fall into
two main buckets:
(1) with a backup_label file and (2) without a backup_label file.
From these two situations, we can cover the following scenarios:
1) Primary crash recovery without a backup_label – Delete the WAL
segment containing the checkpoint record and try starting the server.
2) Primary crash recovery with a backup_label – Take a base backup
(which creates the backup_label), remove the checkpoint WAL segment,
and start the server with that backup directory.
3) Standby crash recovery – Stop the standby, delete the checkpoint
WAL segment, and start it again to see how standby recovery behaves.
4) PITR / archive‑recovery – Remove the checkpoint WAL segment and
start the server with a valid restore_command so it enters archive
recovery.
Tests (2) and (4) are fairly similar, so we can merge them if they
turn out to be redundant.
These are the scenarios I have in mind so far. Please let me know if
you think anything else should be added.
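Each scenario above needs to find the WAL segment that holds the checkpoint record before deleting it. As a rough sketch (Python for illustration only; the helper name is hypothetical), the segment file name can be derived from the timeline ID and the checkpoint LSN, for example the values that pg_controldata reports as "Latest checkpoint's TimeLineID" and "Latest checkpoint location". This assumes the default 16MB wal_segment_size unless another size is passed.

```python
# Hypothetical helper: map a timeline and LSN to the WAL segment file name.
def wal_segment_name(timeline: int, lsn: str,
                     wal_segment_size: int = 16 * 1024 * 1024) -> str:
    """Return the 24-character WAL file name containing the given LSN."""
    # An LSN is printed as two hex numbers: high 32 bits / low 32 bits.
    hi, lo = lsn.split("/")
    position = (int(hi, 16) << 32) | int(lo, 16)

    # Segment number, then its split into the two 8-digit fields of the name.
    segno = position // wal_segment_size
    segments_per_xlogid = 0x100000000 // wal_segment_size
    return "%08X%08X%08X" % (
        timeline,
        segno // segments_per_xlogid,
        segno % segments_per_xlogid,
    )


print(wal_segment_name(1, "0/3000028"))
```

With the inputs above this prints 000000010000000000000003, which is the file a test would remove from pg_wal/ before starting the server.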
Best Regards,
Nitin Jadhav
Azure Database for PostgreSQL
Microsoft