Fix logging for invalid recovery timeline

Started by David Steeleabout 1 year ago16 messages
#1David Steele
david@pgbackrest.org
1 attachment(s)

Hackers,

I noticed while debugging a user issue that the checkpoint logged in

"Latest checkpoint is at %X/%X on timeline %u, but in the history of the
requested timeline, the server forked off from that timeline at %X/%X."

is being pulled from the value stored in the control file rather than
the value read from backup_label (in the case where there is a
backup_label). This only affects logging since the timeline check is
done against the checkpoint/TLI as read from backup_label.

This patch updates the checkpoint and TLI to (what I believe are) the
correct values for logging.

I think this should also be back-patched.

Regards,
-David

Attachments:

timeline-log-v1.patchtext/plain; charset=UTF-8; name=timeline-log-v1.patchDownload
From f14e30b18cde216131bd3e069ee8ecd5da3301b0 Mon Sep 17 00:00:00 2001
From: David Steele <david@pgmasters.net>
Date: Fri, 20 Dec 2024 15:13:59 +0000
Subject: Fix logging for invalid recovery timeline.

If the requested recovery timeline is not reachable the logged checkpoint and
TLI should to be the values read from backup_label when backup_label is
present.
---
 src/backend/access/transam/xlogrecovery.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index c6994b78282..99afd01bf38 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -844,13 +844,13 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 		 * tliSwitchPoint will throw an error if the checkpoint's timeline is
 		 * not in expectedTLEs at all.
 		 */
-		switchpoint = tliSwitchPoint(ControlFile->checkPointCopy.ThisTimeLineID, expectedTLEs, NULL);
+		switchpoint = tliSwitchPoint(CheckPointTLI, expectedTLEs, NULL);
 		ereport(FATAL,
 				(errmsg("requested timeline %u is not a child of this server's history",
 						recoveryTargetTLI),
 				 errdetail("Latest checkpoint is at %X/%X on timeline %u, but in the history of the requested timeline, the server forked off from that timeline at %X/%X.",
-						   LSN_FORMAT_ARGS(ControlFile->checkPoint),
-						   ControlFile->checkPointCopy.ThisTimeLineID,
+						   LSN_FORMAT_ARGS(CheckPointLoc),
+						   CheckPointTLI,
 						   LSN_FORMAT_ARGS(switchpoint))));
 	}
 
-- 
2.34.1

#2Andrey M. Borodin
x4mmm@yandex-team.ru
In reply to: David Steele (#1)
Re: Fix logging for invalid recovery timeline

On 20 Dec 2024, at 20:37, David Steele <david@pgbackrest.org> wrote:

"Latest checkpoint is at %X/%X on timeline %u, but in the history of the requested timeline, the server forked off from that timeline at %X/%X."

I think errdetai here is very hard to follow. I seem to understand what is going on after reading errmsg, but errdetai makes me uncertain.

If we call "tliSwitchPoint(CheckPointTLI, expectedTLEs, NULL);"
don't we risk to have again
ereport(ERROR,
(errmsg("requested timeline %u is not in this server's history",
tli)));
?

Best regards, Andrey Borodin.

#3David Steele
david@pgbackrest.org
In reply to: Andrey M. Borodin (#2)
Re: Fix logging for invalid recovery timeline

On 12/20/24 23:28, Andrey M. Borodin wrote:

On 20 Dec 2024, at 20:37, David Steele <david@pgbackrest.org> wrote:

"Latest checkpoint is at %X/%X on timeline %u, but in the history of the requested timeline, the server forked off from that timeline at %X/%X."

I think errdetai here is very hard to follow. I seem to understand what is going on after reading errmsg, but errdetai makes me uncertain.

Yeah, this one confuses users a lot. We see it mostly when a user
accidentally promotes a standby and that standby pushes a history file
and maybe some WAL on a new timeline, e.g. 2. The original primary
continues to make backups on the original timeline 1. At some point a
restore is required and Postgres by default wants to recover to the most
recent timeline, but timeline 2 forks from timeline 1 before the latest
backup was started so it is not accessible.

The solution is to set the target timeline to current but first the user
needs this figure out what is going on an this error message just
doesn't contain enough information to do that. I have some ideas on how
to make it better but that would probably be for HEAD only.

If we call "tliSwitchPoint(CheckPointTLI, expectedTLEs, NULL);"
don't we risk to have again
ereport(ERROR,
(errmsg("requested timeline %u is not in this server's history",
tli)));
?

I'm not sure what you mean. For primary backups CheckPointTLI will
always equal ControlFile->checkPointCopy.ThisTimeLineID so that
shouldn't be a problem. For standby backups CheckPointTLI will be <=
ControlFile->checkPointCopy.ThisTimeLineID since CheckPointTLI
represents the timeline at the start of the backup. If a route from that
timeline to the current timeline can't be found then I'd certainly
expect an error.

I'll add this patch to the January CF.

Regards,
-David

#4Benoit Lobréau
benoit.lobreau@dalibo.com
In reply to: David Steele (#3)
Re: Fix logging for invalid recovery timeline

Hi,

I think the changes make sense. Would it be helpful to add the origin of
the checkpoint record we are referring to ? (i.e. control file or backup
label).

For example:

* Latest checkpoint in the control file is at %X/%X on timeline %u,
* Checkpoint location in the backup_label file is at %X/%X on timeline %u,

The current message refers to "Latest checkpoint" which automatically
makes me think about the control file (specifically, the pg_controldata
output).

Now that I have read the code, I understand which one we are referring
to, but for someone who hasn't (or me in a few month) it might be useful ?

--
Benoit Lobréau
Consultant
http://dalibo.com

#5David Steele
david@pgbackrest.org
In reply to: Benoit Lobréau (#4)
Re: Fix logging for invalid recovery timeline

On 2/19/25 03:51, Benoit Lobréau wrote:

I think the changes make sense. Would it be helpful to add the origin of
the checkpoint record we are referring to ? (i.e. control file or backup
label).

For example:

* Latest checkpoint in the control file is at %X/%X on timeline %u,
* Checkpoint location in the backup_label file is at %X/%X on timeline %u,

I like this idea but I would prefer to get the patch committed as-is
first. The reason is that I'm hoping to see this batch-patched (since it
is a bug) and that is less likely if the message wording is change.

Your idea would be perfect going forward, though.

Regards,
-David

#6Michael Paquier
michael@paquier.xyz
In reply to: David Steele (#5)
Re: Fix logging for invalid recovery timeline

On Wed, Feb 19, 2025 at 05:35:18PM +0000, David Steele wrote:

I like this idea but I would prefer to get the patch committed as-is first.
The reason is that I'm hoping to see this batch-patched (since it is a bug)
and that is less likely if the message wording is change.

(Had this thread flagged as a TODO for some time, sorry for not
chiming in earlier.)
Yeah, that's clearly a bug in the information provided.

Your idea would be perfect going forward, though.

We have a few logs that already track this information, but perhaps
that's better to track this extra element in the FATAL if you have
log_min_messages at fatal where LOG would not show up? Feel free to
propose a separate patch if you think that this can be improved.

I thought first that this was an issue coming from 70e81861fadd, but
from what I can see the confusion is much older and comes from
ee994272ca50 that has begun mixing the data from the control and the
backup label for the checkpoint record in this FATAL message.

At a27048cbcb58, the check is done based on the copy of the checkpoint
record, and the log is generated based on the checkpoint data in the
control file. 4a92a1c3d1c3 has begun retrieving the replay TLI from
the backup label. v13 and v14 have this issue, but I'm not really
tempted to poke at the beast more than necessary as this code had a
lot of changes in the last couple of years, with few to no complaints
as far as I am aware.

Applied down to v15 where we have xlogrecovery.c, then. Thanks for
the report!
--
Michael

#7David Steele
david@pgbackrest.org
In reply to: Michael Paquier (#6)
Re: Fix logging for invalid recovery timeline

On 2/19/25 19:45, Michael Paquier wrote:

On Wed, Feb 19, 2025 at 05:35:18PM +0000, David Steele wrote:

I like this idea but I would prefer to get the patch committed as-is first.
The reason is that I'm hoping to see this batch-patched (since it is a bug)
and that is less likely if the message wording is change.

(Had this thread flagged as a TODO for some time, sorry for not
chiming in earlier.)

No worries. Thank you for having a look.

Your idea would be perfect going forward, though.

We have a few logs that already track this information, but perhaps
that's better to track this extra element in the FATAL if you have
log_min_messages at fatal where LOG would not show up? Feel free to
propose a separate patch if you think that this can be improved.

Benoit -- this was your idea. Did you want to submit a patch yourself?

At a27048cbcb58, the check is done based on the copy of the checkpoint
record, and the log is generated based on the checkpoint data in the
control file. 4a92a1c3d1c3 has begun retrieving the replay TLI from
the backup label. v13 and v14 have this issue, but I'm not really
tempted to poke at the beast more than necessary as this code had a
lot of changes in the last couple of years, with few to no complaints
as far as I am aware.

Fair enough -- this is subtle enough that I doubt almost anyone is going
to notice and the general content of the error message is not really
changed by the incorrect values.

Applied down to v15 where we have xlogrecovery.c, then. Thanks for
the report!

Thank you for the commit!

Regards,
-David

#8Benoit Lobréau
benoit.lobreau@dalibo.com
In reply to: David Steele (#7)
1 attachment(s)
Re: Fix logging for invalid recovery timeline

On 2/20/25 4:40 PM, David Steele wrote:

Benoit -- this was your idea. Did you want to submit a patch yourself?

Here is an attempt at that. I kept the wording I used above. Is it fine
to repeat the whole ereport block twice?

--
Benoit Lobréau
Consultant
http://dalibo.com

Attachments:

0001-Adjust-logging-for-invalid-recovery-timeline.patchtext/x-patch; charset=UTF-8; name=0001-Adjust-logging-for-invalid-recovery-timeline.patchDownload
From 44459bf799fca517b13aef03be952b0374463d5c Mon Sep 17 00:00:00 2001
From: benoit <benoit.lobreau@dalibo.com>
Date: Fri, 21 Feb 2025 14:09:56 +0100
Subject: [PATCH] Adjust logging for invalid recovery timeline

The original message doesn't mention where the checkpoint was read from.
It refers to the "Latest checkpoint", the terminology is used for some
records from the control file (specifically, the pg_controldata output).
The backup label uses "Checkpoint location".
---
 src/backend/access/transam/xlogrecovery.c | 23 ++++++++++++++++-------
 1 file changed, 16 insertions(+), 7 deletions(-)

diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index f234007d348..7ada300bf9b 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -846,13 +846,22 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 		 * not in expectedTLEs at all.
 		 */
 		switchpoint = tliSwitchPoint(CheckPointTLI, expectedTLEs, NULL);
-		ereport(FATAL,
-				(errmsg("requested timeline %u is not a child of this server's history",
-						recoveryTargetTLI),
-				 errdetail("Latest checkpoint is at %X/%X on timeline %u, but in the history of the requested timeline, the server forked off from that timeline at %X/%X.",
-						   LSN_FORMAT_ARGS(CheckPointLoc),
-						   CheckPointTLI,
-						   LSN_FORMAT_ARGS(switchpoint))));
+		if (haveBackupLabel)
+			ereport(FATAL,
+					(errmsg("requested timeline %u is not a child of this server's history",
+							recoveryTargetTLI),
+					 errdetail("The checkpoint location in the backup_label file is at %X/%X on timeline %u, but in the history of the requested timeline, the server forked off from that timeline at %X/%X.",
+							   LSN_FORMAT_ARGS(CheckPointLoc),
+							   CheckPointTLI,
+							   LSN_FORMAT_ARGS(switchpoint))));
+		else
+			ereport(FATAL,
+					(errmsg("requested timeline %u is not a child of this server's history",
+							recoveryTargetTLI),
+					 errdetail("The latest checkpoint in the control file is at %X/%X on timeline %u, but in the history of the requested timeline, the server forked off from that timeline at %X/%X.",
+							   LSN_FORMAT_ARGS(CheckPointLoc),
+							   CheckPointTLI,
+							   LSN_FORMAT_ARGS(switchpoint))));
 	}
 
 	/*
-- 
2.48.1

#9David Steele
david@pgbackrest.org
In reply to: Benoit Lobréau (#8)
Re: Fix logging for invalid recovery timeline

On 2/21/25 09:09, Benoit Lobréau wrote:

On 2/20/25 4:40 PM, David Steele wrote:

Benoit -- this was your idea. Did you want to submit a patch yourself?

Here is an attempt at that. I kept the wording I used above. Is it fine
to repeat the whole ereport block twice?

I think for translation purposes this is probably how it needs to be but
I wonder if we could do something like:

errdetail("Latest checkpoint in %s is at %X/%X <...>",
haveBackupLabel ? "pg_control" ? "backup_label",

I'll defer to Michael on that.

In general this patch and the new messages look good to me, though.

Regards,
-David

#10Michael Paquier
michael@paquier.xyz
In reply to: David Steele (#9)
Re: Fix logging for invalid recovery timeline

On Sat, Feb 22, 2025 at 04:17:44PM +0000, David Steele wrote:

I think for translation purposes this is probably how it needs to be but I
wonder if we could do something like:

errdetail("Latest checkpoint in %s is at %X/%X <...>",
haveBackupLabel ? "pg_control" ? "backup_label",

I'll defer to Michael on that.

I think that you have the right idea here, avoiding the duplication
of the errdetail() which feels itchy when looking at the patch. This
should have a note for translators that this field refers to a file
name. The point is that we don't lose any context if we do a "control
file" => pg_control update in the errdetail().

This would also make the proposed patch simpler.
--
Michael

#11Benoit Lobréau
benoit.lobreau@dalibo.com
In reply to: Michael Paquier (#10)
1 attachment(s)
Re: Fix logging for invalid recovery timeline

On 2/24/25 2:05 AM, Michael Paquier wrote:

I think that you have the right idea here, avoiding the duplication
of the errdetail() which feels itchy when looking at the patch.

Done this way in the attached patch.

This should have a note for translators that this field refers to a file
name.

Is there a specific way todo this?

--
Benoit Lobréau
Consultant
http://dalibo.com

Attachments:

V2_0001-Adjust-logging-for-invalid-recovery-timeline.patchtext/x-patch; charset=UTF-8; name=V2_0001-Adjust-logging-for-invalid-recovery-timeline.patchDownload
From 35dd8db22c997e78ba92f63ffca3022b435d7304 Mon Sep 17 00:00:00 2001
From: benoit <benoit.lobreau@dalibo.com>
Date: Fri, 21 Feb 2025 14:09:56 +0100
Subject: [PATCH] Adjust logging for invalid recovery timeline

The original message doesn't mention where the checkpoint was read from.
It refers to the "Latest checkpoint", the terminology is used for some
records from the control file (specifically, the pg_controldata output).
The backup label uses "Checkpoint location".
---
 src/backend/access/transam/xlogrecovery.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 915cb330295..c5160bf5258 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -849,7 +849,8 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 		ereport(FATAL,
 				(errmsg("requested timeline %u is not a child of this server's history",
 						recoveryTargetTLI),
-				 errdetail("Latest checkpoint is at %X/%X on timeline %u, but in the history of the requested timeline, the server forked off from that timeline at %X/%X.",
+				 errdetail("Latest checkpoint in %s is at %X/%X on timeline %u, but in the history of the requested timeline, the server forked off from that timeline at %X/%X.",
+						   haveBackupLabel ? "backup_label" : "pg_control",
 						   LSN_FORMAT_ARGS(CheckPointLoc),
 						   CheckPointTLI,
 						   LSN_FORMAT_ARGS(switchpoint))));
-- 
2.48.1

#12David Steele
david@pgbackrest.org
In reply to: Benoit Lobréau (#11)
Re: Fix logging for invalid recovery timeline

On 2/24/25 10:21, Benoit Lobréau wrote:

On 2/24/25 2:05 AM, Michael Paquier wrote:

I think that you have the right idea here, avoiding the duplication
of the errdetail() which feels itchy when looking at the patch.

Done this way in the attached patch.

This looks good to me.

This should have a note for translators that this field refers to a file
name.

Is there a specific way todo this?

If you search the source for translator: you'll find lots of examples.
Here's one that was pretty relevant:

/* translator: %s is a configuration file */

We could probably phrase that as:

/* translator: %s is a backup_label or pg_control file */

That note goes right above the hint, so something like:

+                                /* translator: %s is a backup_label or 
pg_control file */
+				 errdetail("Latest checkpoint in %s is at %X/%X on timeline %u, but 
in the history of the requested timeline, the server forked off from 
that timeline at %X/%X.",
+						   haveBackupLabel ? "backup_label" : "pg_control",

Regards,
-David

#13Michael Paquier
michael@paquier.xyz
In reply to: David Steele (#12)
Re: Fix logging for invalid recovery timeline

On Mon, Feb 24, 2025 at 05:30:35PM +0000, David Steele wrote:

+ /* translator: %s is a backup_label or
pg_control file */

See for example PostmasterMain() with the "/* translator: %s is a
configuration file */".

+             errdetail("Latest checkpoint in %s is at %X/%X on timeline %u, but in
the history of the requested timeline, the server forked off from that
timeline at %X/%X.",
+                     haveBackupLabel ? "backup_label" : "pg_control",

This part looks good to me.
--
Michael

#14Benoit Lobréau
benoit.lobreau@dalibo.com
In reply to: Michael Paquier (#13)
1 attachment(s)
Re: Fix logging for invalid recovery timeline

On 2/24/25 11:33 PM, Michael Paquier wrote:

On Mon, Feb 24, 2025 at 05:30:35PM +0000, David Steele wrote:

+ /* translator: %s is a backup_label or
pg_control file */

See for example PostmasterMain() with the "/* translator: %s is a
configuration file */".

Thank you Michael and David.
I never paid attention to thoses...

--
Benoit Lobréau
Consultant
http://dalibo.com

Attachments:

V3_0001-Adjust-logging-for-invalid-recovery-timeline.patchtext/x-patch; charset=UTF-8; name=V3_0001-Adjust-logging-for-invalid-recovery-timeline.patchDownload
From 9a12cad29afb86f2277bf76bc53757702715afd5 Mon Sep 17 00:00:00 2001
From: benoit <benoit.lobreau@dalibo.com>
Date: Fri, 21 Feb 2025 14:09:56 +0100
Subject: [PATCH] Adjust logging for invalid recovery timeline

The original message doesn't mention where the checkpoint was read from.
It refers to the "Latest checkpoint", the terminology is used for some
records from the control file (specifically, the pg_controldata output).
The backup label uses "Checkpoint location".
---
 src/backend/access/transam/xlogrecovery.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 915cb330295..65335e542ad 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -849,7 +849,9 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 		ereport(FATAL,
 				(errmsg("requested timeline %u is not a child of this server's history",
 						recoveryTargetTLI),
-				 errdetail("Latest checkpoint is at %X/%X on timeline %u, but in the history of the requested timeline, the server forked off from that timeline at %X/%X.",
+		/* translator: %s is a backup_label or pg_control file */
+				 errdetail("Latest checkpoint in %s is at %X/%X on timeline %u, but in the history of the requested timeline, the server forked off from that timeline at %X/%X.",
+						   haveBackupLabel ? "backup_label" : "pg_control",
 						   LSN_FORMAT_ARGS(CheckPointLoc),
 						   CheckPointTLI,
 						   LSN_FORMAT_ARGS(switchpoint))));
-- 
2.48.1

#15David Steele
david@pgbackrest.org
In reply to: Benoit Lobréau (#14)
Re: Fix logging for invalid recovery timeline

On 2/25/25 04:07, Benoit Lobréau wrote:

On 2/24/25 11:33 PM, Michael Paquier wrote:

On Mon, Feb 24, 2025 at 05:30:35PM +0000, David Steele wrote:

+                                /* translator: %s is a backup_label or
pg_control file */

See for example PostmasterMain() with the "/* translator: %s is a
configuration file */".

Thank you Michael and David.
I never paid attention to thoses...

This looks good to me.

Regards,
-David

#16Michael Paquier
michael@paquier.xyz
In reply to: David Steele (#15)
Re: Fix logging for invalid recovery timeline

On Tue, Feb 25, 2025 at 04:25:40PM +0000, David Steele wrote:

On 2/25/25 04:07, Benoit Lobréau wrote:

Thank you Michael and David.
I never paid attention to thoses...

Don't worry about that. There is a lot of Postgres-ism in this code
base, and we all keep learning stuff :D

This looks good to me.

Applied on HEAD.
--
Michael