Add checkpoint/redo LSNs to recovery errors.

Started by David Steelealmost 2 years ago4 messages
#1David Steele
david@pgmasters.net
1 attachment(s)

Hackers,

This patch adds checkpoint/redo LSNs to recovery error messages where
they may be useful for debugging.

When backup_label is present the LSNs are already output in a log
message, but it still seems like a good idea to repeat them.

When backup_label is not present, the checkpoint LSN is not logged
unless backup recovery is in progress or the checkpoint is found. In the
case where a backup is restored but the backup_label is missing, the
checkpoint LSN is not logged so it is useful for debugging to have it in
the error message.

Regards,
-David

Attachments:

add-lsn-to-err-v1.patchtext/plain; charset=UTF-8; name=add-lsn-to-err-v1.patchDownload
From 9678d0cd928c58e4a3d038c8aa4c191f5bdddbe7 Mon Sep 17 00:00:00 2001
From: David Steele <david@pgmasters.net>
Date: Wed, 28 Feb 2024 21:45:39 +0000
Subject: Add checkpoint/redo LSNs to recovery errors.

When backup_label is present the LSNs are already output in a log message, but
it still seems like a good idea to repeat them.

When backup_label is not present, the checkpoint LSN is not logged unless backup
recovery is in progress or the checkpoint is found. In the case where a backup
is restored but the backup_label is missing, the checkpoint LSN is not logged so
it is useful for debugging to have it in the error message.
---
 src/backend/access/transam/xlogrecovery.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index d73a49b3e8..baf1765f5e 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -647,7 +647,8 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 				if (!ReadRecord(xlogprefetcher, LOG, false,
 								checkPoint.ThisTimeLineID))
 					ereport(FATAL,
-							(errmsg("could not find redo location referenced by checkpoint record"),
+							(errmsg("could not find redo location %X/%X referenced by checkpoint record at %X/%X",
+									LSN_FORMAT_ARGS(checkPoint.redo), LSN_FORMAT_ARGS(CheckPointLoc)),
 							 errhint("If you are restoring from a backup, touch \"%s/recovery.signal\" or \"%s/standby.signal\" and add required recovery options.\n"
 									 "If you are not restoring from a backup, try removing the file \"%s/backup_label\".\n"
 									 "Be careful: removing \"%s/backup_label\" will result in a corrupt cluster if restoring from a backup.",
@@ -657,7 +658,8 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 		else
 		{
 			ereport(FATAL,
-					(errmsg("could not locate required checkpoint record"),
+					(errmsg("could not locate required checkpoint record at %X/%X",
+							LSN_FORMAT_ARGS(CheckPointLoc)),
 					 errhint("If you are restoring from a backup, touch \"%s/recovery.signal\" or \"%s/standby.signal\" and add required recovery options.\n"
 							 "If you are not restoring from a backup, try removing the file \"%s/backup_label\".\n"
 							 "Be careful: removing \"%s/backup_label\" will result in a corrupt cluster if restoring from a backup.",
@@ -791,7 +793,8 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 			 * simplify processing around checkpoints.
 			 */
 			ereport(PANIC,
-					(errmsg("could not locate a valid checkpoint record")));
+					(errmsg("could not locate a valid checkpoint record at %X/%X",
+							LSN_FORMAT_ARGS(CheckPointLoc))));
 		}
 		memcpy(&checkPoint, XLogRecGetData(xlogreader), sizeof(CheckPoint));
 		wasShutdown = ((record->xl_info & ~XLR_INFO_MASK) == XLOG_CHECKPOINT_SHUTDOWN);
-- 
2.34.1

#2Michael Paquier
michael@paquier.xyz
In reply to: David Steele (#1)
Re: Add checkpoint/redo LSNs to recovery errors.

On Thu, Feb 29, 2024 at 10:53:15AM +1300, David Steele wrote:

This patch adds checkpoint/redo LSNs to recovery error messages where they
may be useful for debugging.

Thanks for following up on that!

When backup_label is not present, the checkpoint LSN is not logged unless
backup recovery is in progress or the checkpoint is found. In the case where
a backup is restored but the backup_label is missing, the checkpoint LSN is
not logged so it is useful for debugging to have it in the error message.

             ereport(PANIC,
-                    (errmsg("could not locate a valid checkpoint record")));
+                    (errmsg("could not locate a valid checkpoint record at %X/%X",
+                            LSN_FORMAT_ARGS(CheckPointLoc))));
         }

I've seen this one in the field occasionally, so that's really a
welcome addition IMO.

I've scanned a bit xlogrecovery.c, and I have spotted a couple of that
could gain more information.

ereport(PANIC,
(errmsg("invalid redo record in shutdown checkpoint")));
[...]
ereport(PANIC,
(errmsg("invalid redo in checkpoint record")));
These two could mention CheckPointLoc and checkPoint.redo.

ereport(PANIC,
(errmsg("invalid next transaction ID")));
Perhaps some XID information could be added here?

ereport(FATAL,
(errmsg("WAL ends before consistent recovery point")));
[...]
ereport(FATAL,
(errmsg("WAL ends before end of online backup"),

These two are in xlog.c, and don't mention backupStartPoint for the
first one. Perhaps there's a point in adding some information about
EndOfLog and LocalMinRecoveryPoint as well?
--
Michael

#3David Steele
david@pgmasters.net
In reply to: Michael Paquier (#2)
Re: Add checkpoint/redo LSNs to recovery errors.

On 2/29/24 16:42, Michael Paquier wrote:

On Thu, Feb 29, 2024 at 10:53:15AM +1300, David Steele wrote:

This patch adds checkpoint/redo LSNs to recovery error messages where they
may be useful for debugging.

I've scanned a bit xlogrecovery.c, and I have spotted a couple of that
could gain more information.

ereport(PANIC,
(errmsg("invalid redo record in shutdown checkpoint")));
[...]
ereport(PANIC,
(errmsg("invalid redo in checkpoint record")));
These two could mention CheckPointLoc and checkPoint.redo.

ereport(PANIC,
(errmsg("invalid next transaction ID")));
Perhaps some XID information could be added here?

ereport(FATAL,
(errmsg("WAL ends before consistent recovery point")));
[...]
ereport(FATAL,
(errmsg("WAL ends before end of online backup"),

These two are in xlog.c, and don't mention backupStartPoint for the
first one. Perhaps there's a point in adding some information about
EndOfLog and LocalMinRecoveryPoint as well?

For now I'd like to just focus on these three messages that are related
to a missing backup_label or a misconfiguration of restore_command when
backup_label is present.

No doubt there are many other recovery log messages that could be
improved, but I'd rather not bog down the patch.

Regards,
-David

#4Michael Paquier
michael@paquier.xyz
In reply to: David Steele (#3)
Re: Add checkpoint/redo LSNs to recovery errors.

On Sun, Mar 10, 2024 at 04:58:19PM +1300, David Steele wrote:

No doubt there are many other recovery log messages that could be improved,
but I'd rather not bog down the patch.

Fair argument, and these additions are useful when taken
independently. I've applied your suggestions for now.
--
Michael