needless complexity in StartupXLOG

Started by Robert Haas over 4 years ago, 39 messages
#1 Robert Haas
robertmhaas@gmail.com
1 attachment(s)

StartupXLOG() has code beginning around line 7900 of xlog.c that
decides, at the end of recovery, between four possible courses of
action. It either writes an end-of-recovery record, or requests a
checkpoint, or creates a checkpoint, or does nothing, depending on the
value of 3 flag variables, and also on whether we're still able to
read the last checkpoint record:

checkPointLoc = ControlFile->checkPoint;

/*
 * Confirm the last checkpoint is available for us to recover
 * from if we fail.
 */
record = ReadCheckpointRecord(xlogreader,
                              checkPointLoc, 1, false);
if (record != NULL)
{
    promoted = true;

It seems to me that ReadCheckpointRecord() should never fail here. It
should always be true, even when we're not in recovery, that the last
checkpoint record is readable. If there's ever a situation where that
is not true, even for an instant, then a crash at that point will be
unrecoverable. Now, one way that it could happen is if we've got a bug
in the code someplace that removes WAL segments too soon. However, if
we have such a bug, we should fix it. What this code does is say "oh,
I guess we removed the old checkpoint too soon, no big deal, let's
just be more aggressive about getting the next one done," which I do
not think is the right response. Aside from a bug, the only other way
I can see it happening is if someone is manually removing WAL segments
as the server is running through recovery, perhaps as some last-ditch
play to avoid running out of disk space. I don't think the server
needs to have - or should have - code to cater to such crazy
scenarios. Therefore I think that this check, at least in its current
form, is not sensible.
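
To make the four-way decision easier to see, here is a rough sketch of the
shape of that stanza (a paraphrase for illustration, not a verbatim excerpt
from xlog.c; the flag names are the ones under discussion):

if (InRecovery)
{
    if (bgwriterLaunched)
    {
        if (LocalPromoteIsTriggered &&
            ReadCheckpointRecord(xlogreader, ControlFile->checkPoint,
                                 1, false) != NULL)
        {
            promoted = true;
            CreateEndOfRecoveryRecord();         /* 1: end-of-recovery record */
        }

        if (!promoted)
            RequestCheckpoint(CHECKPOINT_END_OF_RECOVERY |
                              CHECKPOINT_IMMEDIATE |
                              CHECKPOINT_WAIT);   /* 2: request a checkpoint */
    }
    else
        CreateCheckPoint(CHECKPOINT_END_OF_RECOVERY |
                         CHECKPOINT_IMMEDIATE);   /* 3: create one ourselves */
}
/* 4: if we never entered recovery, nothing is written here at all */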

My first thought was that we should do the check unconditionally,
rather than just when bgwriterLaunched && LocalPromoteIsTriggered, and
ERROR if it fails. But then I wondered what the point of that would be
exactly. If we have such a bug -- and to the best of my knowledge
there's no evidence that we do -- there's no particular reason it
should only happen at the end of recovery. It could happen any time
the system -- or the user, or malicious space aliens -- remove files
from pg_wal, and we have no real idea about the timing of malicious
space alien activity, so doing the test here rather than anywhere else
just seems like a shot in the dark. Perhaps the most logical place
would be to move it to CreateCheckPoint() just after we remove old
xlog files, but we don't have the xlogreader there, and anyway I don't
see how it's really going to help. What bug do we expect to catch by
removing files we think we don't need and then checking that we didn't
remove the files we think we do need? That seems more like grasping at
straws than a serious attempt to make things work better.

So at the moment I am leaning toward the view that we should just
remove this check entirely, as in the attached, proposed patch.

Really, I think we should consider going further. If it's safe to
write an end-of-recovery record rather than a checkpoint, why not do
so all the time? Right now we fail to do that in the above-described
"impossible" scenario where the previous checkpoint record can't be
read, or if we're exiting archive recovery for some reason other than
a promotion request, or if we're in single-user mode, or if we're in
crash recovery. Presumably, people would like to start up the server
quickly in all of those scenarios, so the only reason not to use this
technology all the time is if we think it's safe in some scenarios and
not others. I can't think of a reason why it matters why we're exiting
archive recovery, nor can I think of a reason why it matters whether
we're in single user mode. The distinction between crash recovery and
archive recovery does seem to matter, but if anything the crash
recovery scenario seems simpler, because then there's only one
timeline involved.

I realize that conservatism may have played a role in this code ending
up looking the way that it does; someone seems to have thought it
would be better not to rely on a new idea in all cases. From my point
of view, though, it's scary to have so many cases, especially cases
that don't seem like they should ever be reached. I think that
simplifying the logic here and trying to do the same things in as many
cases as we can will lead to better robustness. Imagine if instead of
all the hairy logic we have now we just replaced this whole if
(IsInRecovery) stanza with this:

if (InRecovery)
CreateEndOfRecoveryRecord();

That would be WAY easier to reason about than the rat's nest we have
here today. Now, I am not sure what it would take to get there, but I
think that is the direction we ought to be heading.

--
Robert Haas
EDB: http://www.enterprisedb.com

Attachments:

0001-Remove-pointless-call-to-ReadCheckpointRecord.patch (application/octet-stream)
From f9f109c32d621fb0d34491993fc16669c2b96314 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Mon, 26 Jul 2021 11:18:47 -0400
Subject: [PATCH] Remove pointless call to ReadCheckpointRecord().

It should always be the case that the last checkpoint record is still
readable, because otherwise, a crash would leave us in a situation
from which we can't recover. Therefore the test removed by this patch
should always succeed. For it to fail, either there has to be a serious
bug in the code someplace, or the user has to be manually modifying
pg_wal while crash recovery is running. If it's the first one, we
should fix the bug. If it's the second one, they should stop, or
anyway they're doing so at their own risk. In neither case does
a full checkpoint instead of an end-of-recovery record seem like a
clear winner. Furthermore, rarely-taken code paths are particularly
vulnerable to bugs, so let's simplify by getting rid of this one.
---
 src/backend/access/transam/xlog.c | 34 ++++++++++---------------------
 1 file changed, 11 insertions(+), 23 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 3479402272..ac250c05cc 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7917,33 +7917,21 @@ StartupXLOG(void)
 		{
 			if (LocalPromoteIsTriggered)
 			{
-				checkPointLoc = ControlFile->checkPoint;
+				promoted = true;
 
 				/*
-				 * Confirm the last checkpoint is available for us to recover
-				 * from if we fail.
+				 * Insert a special WAL record to mark the end of recovery,
+				 * since we aren't doing a checkpoint. That means that the
+				 * checkpointer process may likely be in the middle of a
+				 * time-smoothed restartpoint and could continue to be for
+				 * minutes after this. That sounds strange, but the effect is
+				 * roughly the same and it would be stranger to try to come
+				 * out of the restartpoint and then checkpoint. We request a
+				 * checkpoint later anyway, just for safety.
 				 */
-				record = ReadCheckpointRecord(xlogreader, checkPointLoc, 1, false);
-				if (record != NULL)
-				{
-					promoted = true;
-
-					/*
-					 * Insert a special WAL record to mark the end of
-					 * recovery, since we aren't doing a checkpoint. That
-					 * means that the checkpointer process may likely be in
-					 * the middle of a time-smoothed restartpoint and could
-					 * continue to be for minutes after this. That sounds
-					 * strange, but the effect is roughly the same and it
-					 * would be stranger to try to come out of the
-					 * restartpoint and then checkpoint. We request a
-					 * checkpoint later anyway, just for safety.
-					 */
-					CreateEndOfRecoveryRecord();
-				}
+				CreateEndOfRecoveryRecord();
 			}
-
-			if (!promoted)
+			else
 				RequestCheckpoint(CHECKPOINT_END_OF_RECOVERY |
 								  CHECKPOINT_IMMEDIATE |
 								  CHECKPOINT_WAIT);
-- 
2.24.3 (Apple Git-128)

#2 Stephen Frost
sfrost@snowman.net
In reply to: Robert Haas (#1)
Re: needless complexity in StartupXLOG

Greetings,

* Robert Haas (robertmhaas@gmail.com) wrote:

So at the moment I am leaning toward the view that we should just
remove this check entirely, as in the attached, proposed patch.

Haven't dug in deeply but at least following your explanation and
reading over the patch and the code a bit, I tend to agree.

Really, I think we should consider going further. If it's safe to
write an end-of-recovery record rather than a checkpoint, why not do
so all the time? Right now we fail to do that in the above-described
"impossible" scenario where the previous checkpoint record can't be
read, or if we're exiting archive recovery for some reason other than
a promotion request, or if we're in single-user mode, or if we're in
crash recovery. Presumably, people would like to start up the server
quickly in all of those scenarios, so the only reason not to use this
technology all the time is if we think it's safe in some scenarios and
not others. I can't think of a reason why it matters why we're exiting
archive recovery, nor can I think of a reason why it matters whether
we're in single user mode. The distinction between crash recovery and
archive recovery does seem to matter, but if anything the crash
recovery scenario seems simpler, because then there's only one
timeline involved.

Yeah, tend to agree with this too ... but something I find a bit curious
is the comment:

* Insert a special WAL record to mark the end of
* recovery, since we aren't doing a checkpoint.

... immediately after setting promoted = true, and then at the end of
StartupXLOG() having:

if (promoted)
RequestCheckpoint(CHECKPOINT_FORCE);

maybe I'm missing something, but seems like that comment isn't being
terribly clear. Perhaps we aren't doing a full checkpoint *there*, but
sure looks like we're going to do one moments later regardless of
anything else since we've set promoted to true..?

I realize that conservatism may have played a role in this code ending
up looking the way that it does; someone seems to have thought it
would be better not to rely on a new idea in all cases. From my point
of view, though, it's scary to have so many cases, especially cases
that don't seem like they should ever be reached. I think that
simplifying the logic here and trying to do the same things in as many
cases as we can will lead to better robustness. Imagine if instead of
all the hairy logic we have now we just replaced this whole if
(IsInRecovery) stanza with this:

if (InRecovery)
CreateEndOfRecoveryRecord();

That would be WAY easier to reason about than the rat's nest we have
here today. Now, I am not sure what it would take to get there, but I
think that is the direction we ought to be heading.

Agreed that simpler logic is better, provided it's correct logic, of
course. Finding better ways to test all of this would be really nice.

Thanks!

Stephen

#3 Robert Haas
robertmhaas@gmail.com
In reply to: Stephen Frost (#2)
Re: needless complexity in StartupXLOG

On Mon, Jul 26, 2021 at 1:32 PM Stephen Frost <sfrost@snowman.net> wrote:

Yeah, tend to agree with this too ... but something I find a bit curious
is the comment:

* Insert a special WAL record to mark the end of
* recovery, since we aren't doing a checkpoint.

... immediately after setting promoted = true, and then at the end of
StartupXLOG() having:

if (promoted)
RequestCheckpoint(CHECKPOINT_FORCE);

maybe I'm missing something, but seems like that comment isn't being
terribly clear. Perhaps we aren't doing a full checkpoint *there*, but
sure looks like we're going to do one moments later regardless of
anything else since we've set promoted to true..?

Yep. So it's a question of whether we allow operations that might
write WAL in the meantime. When we write the checkpoint record right
here, there can't be any WAL from the new server lifetime until the
checkpoint completes. When we write an end-of-recovery record, there
can. And there could actually be quite a bit, because if we do the
checkpoint right in this section of code, it will be a fast
checkpoint, whereas in the code you quoted above, it's a spread
checkpoint, which takes a lot longer. So the question is whether it's
reasonable to give the checkpoint some time to complete or whether it
needs to be completed right now.
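
For concreteness, these are the two requests in question; both appear in the
snippets quoted in this thread, with the comments added here for illustration:

/* non-promotion path: a fast (immediate) checkpoint, and StartupXLOG()
 * waits for it, so no new WAL can be written until it completes */
RequestCheckpoint(CHECKPOINT_END_OF_RECOVERY |
                  CHECKPOINT_IMMEDIATE |
                  CHECKPOINT_WAIT);

/* promotion path: CreateEndOfRecoveryRecord() now, and then, near the end
 * of StartupXLOG(), a spread checkpoint that runs concurrently with new
 * WAL-generating activity */
if (promoted)
    RequestCheckpoint(CHECKPOINT_FORCE);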

Agreed that simpler logic is better, provided it's correct logic, of
course. Finding better ways to test all of this would be really nice.

Yeah, and there again, it's a lot easier to test if we don't have as
many cases. Now no single change is going to fix that, but the number
of flag variables in xlog.c is simply bonkers. This particular stretch
of code uses 3 of them to even decide whether to attempt the test in
question, and all of those are set in complex ways depending on the
values of still more flag variables. The comments where
bgwriterLaunched is set claim that we only do this during archive
recovery, not crash recovery, but the code depends on the value of
ArchiveRecoveryRequested, not InArchiveRecovery. So I wonder if we
can't get the bgwriter to run even during crash recovery in the
scenario described by:

* It's possible that archive recovery was requested, but we don't
* know how far we need to replay the WAL before we reach consistency.
* This can happen for example if a base backup is taken from a
* running server using an atomic filesystem snapshot, without calling
* pg_start/stop_backup. Or if you just kill a running primary server
* and put it into archive recovery by creating a recovery signal
* file.

If we ran the bgwriter all the time during crash recovery, we'd know
for sure whether that causes any problems. If we only do it like this
in certain corner cases, it's much more likely that we have bugs.
Grumble, grumble.
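
(For reference, the assignment in question is shaped roughly like the
following paraphrase, not an exact excerpt; note that the test really is on
ArchiveRecoveryRequested:)

if (ArchiveRecoveryRequested && IsUnderPostmaster)
{
    /* ... tell the postmaster it can launch checkpointer/bgwriter ... */
    SendPostmasterSignal(PMSIGNAL_RECOVERY_STARTED);
    bgwriterLaunched = true;
}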

--
Robert Haas
EDB: http://www.enterprisedb.com

#4 Stephen Frost
sfrost@snowman.net
In reply to: Robert Haas (#3)
Re: needless complexity in StartupXLOG

Greetings,

* Robert Haas (robertmhaas@gmail.com) wrote:

On Mon, Jul 26, 2021 at 1:32 PM Stephen Frost <sfrost@snowman.net> wrote:

Yeah, tend to agree with this too ... but something I find a bit curious
is the comment:

* Insert a special WAL record to mark the end of
* recovery, since we aren't doing a checkpoint.

... immediately after setting promoted = true, and then at the end of
StartupXLOG() having:

if (promoted)
RequestCheckpoint(CHECKPOINT_FORCE);

maybe I'm missing something, but seems like that comment isn't being
terribly clear. Perhaps we aren't doing a full checkpoint *there*, but
sure looks like we're going to do one moments later regardless of
anything else since we've set promoted to true..?

Yep. So it's a question of whether we allow operations that might
write WAL in the meantime. When we write the checkpoint record right
here, there can't be any WAL from the new server lifetime until the
checkpoint completes. When we write an end-of-recovery record, there
can. And there could actually be quite a bit, because if we do the
checkpoint right in this section of code, it will be a fast
checkpoint, whereas in the code you quoted above, it's a spread
checkpoint, which takes a lot longer. So the question is whether it's
reasonable to give the checkpoint some time to complete or whether it
needs to be completed right now.

All I was really trying to point out above was that the comment might be
improved upon, just so someone understands that we aren't doing a
checkpoint at this particular place, but one will be done later due to
the promotion. Maybe I'm being a bit extra with that, but that was my
thought when reading the code and the use of the promoted flag variable.

Agreed that simpler logic is better, provided it's correct logic, of
course. Finding better ways to test all of this would be really nice.

Yeah, and there again, it's a lot easier to test if we don't have as
many cases. Now no single change is going to fix that, but the number
of flag variables in xlog.c is simply bonkers. This particular stretch
of code uses 3 of them to even decide whether to attempt the test in
question, and all of those are set in complex ways depending on the
values of still more flag variables. The comments where
bgwriterLaunched is set claim that we only do this during archive
recovery, not crash recovery, but the code depends on the value of
ArchiveRecoveryRequested, not InArchiveRecovery. So I wonder if we
can't get the bgwriter to run even during crash recovery in the
scenario described by:

* It's possible that archive recovery was requested, but we don't
* know how far we need to replay the WAL before we reach consistency.
* This can happen for example if a base backup is taken from a
* running server using an atomic filesystem snapshot, without calling
* pg_start/stop_backup. Or if you just kill a running primary server
* and put it into archive recovery by creating a recovery signal
* file.

If we ran the bgwriter all the time during crash recovery, we'd know
for sure whether that causes any problems. If we only do it like this
in certain corner cases, it's much more likely that we have bugs.
Grumble, grumble.

Yeah ... not to mention that it really is just incredibly dangerous to
use such an approach for PITR. For my 2c, I'd rather we figure out a
way to prevent this than to imply that we support it when we have no way
of knowing if we actually have replayed far enough to be consistent.
That isn't to say that using snapshots for database backups isn't
possible, but it should be done in-between pg_start/stop_backup calls
which properly grab the returned info from those and store the backup
label with the snapshot, etc.

Thanks,

Stephen

#5 Justin Pryzby
pryzby@telsasoft.com
In reply to: Robert Haas (#3)
Re: needless complexity in StartupXLOG

On Mon, Jul 26, 2021 at 03:53:20PM -0400, Robert Haas wrote:

Yeah, and there again, it's a lot easier to test if we don't have as
many cases. Now no single change is going to fix that, but the number
of flag variables in xlog.c is simply bonkers. This particular stretch
of code uses 3 of them to even decide whether to attempt the test in
question, and all of those are set in complex ways depending on the
values of still more flag variables. The comments where
bgwriterLaunched is set claim that we only do this during archive
recovery, not crash recovery, but the code depends on the value of
ArchiveRecoveryRequested, not InArchiveRecovery. So I wonder if we
can't get the bgwriter to run even during crash recovery in the
scenario described by:

I'm not following along closely and maybe you're already aware of this one?
https://commitfest.postgresql.org/33/2706/
Background writer and checkpointer in crash recovery

@Thomas:
/messages/by-id/CA+TgmoYmw==TOJ6EzYb_vcjyS09NkzrVKSyBKUUyo1zBEaJASA@mail.gmail.com

* It's possible that archive recovery was requested, but we don't
* know how far we need to replay the WAL before we reach consistency.
* This can happen for example if a base backup is taken from a
* running server using an atomic filesystem snapshot, without calling
* pg_start/stop_backup. Or if you just kill a running primary server
* and put it into archive recovery by creating a recovery signal
* file.

If we ran the bgwriter all the time during crash recovery, we'd know
for sure whether that causes any problems. If we only do it like this
in certain corner cases, it's much more likely that we have bugs.
Grumble, grumble.

--
Justin

#6 Robert Haas
robertmhaas@gmail.com
In reply to: Stephen Frost (#4)
Re: needless complexity in StartupXLOG

On Mon, Jul 26, 2021 at 4:15 PM Stephen Frost <sfrost@snowman.net> wrote:

All I was really trying to point out above was that the comment might be
improved upon, just so someone understands that we aren't doing a
checkpoint at this particular place, but one will be done later due to
the promotion. Maybe I'm being a bit extra with that, but that was my
thought when reading the code and the use of the promoted flag variable.

Yeah, I agree, it confused me too, at first.

Yeah ... not to mention that it really is just incredibly dangerous to
use such an approach for PITR. For my 2c, I'd rather we figure out a
way to prevent this than to imply that we support it when we have no way
of knowing if we actually have replayed far enough to be consistent.
That isn't to say that using snapshots for database backups isn't
possible, but it should be done in-between pg_start/stop_backup calls
which properly grab the returned info from those and store the backup
label with the snapshot, etc.

My position on that is that I would not particularly recommend the
technique described here, but I would not choose to try to block it
either. That's an argument for another thread, though.

--
Robert Haas
EDB: http://www.enterprisedb.com

#7 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Stephen Frost (#4)
Re: needless complexity in StartupXLOG

At Mon, 26 Jul 2021 16:15:23 -0400, Stephen Frost <sfrost@snowman.net> wrote in

Greetings,

* Robert Haas (robertmhaas@gmail.com) wrote:

On Mon, Jul 26, 2021 at 1:32 PM Stephen Frost <sfrost@snowman.net> wrote:

Yeah, tend to agree with this too ... but something I find a bit curious
is the comment:

* Insert a special WAL record to mark the end of
* recovery, since we aren't doing a checkpoint.

... immediately after setting promoted = true, and then at the end of
StartupXLOG() having:

if (promoted)
RequestCheckpoint(CHECKPOINT_FORCE);

maybe I'm missing something, but seems like that comment isn't being
terribly clear. Perhaps we aren't doing a full checkpoint *there*, but
sure looks like we're going to do one moments later regardless of
anything else since we've set promoted to true..?

Yep. So it's a question of whether we allow operations that might
write WAL in the meantime. When we write the checkpoint record right
here, there can't be any WAL from the new server lifetime until the
checkpoint completes. When we write an end-of-recovery record, there
can. And there could actually be quite a bit, because if we do the
checkpoint right in this section of code, it will be a fast
checkpoint, whereas in the code you quoted above, it's a spread
checkpoint, which takes a lot longer. So the question is whether it's
reasonable to give the checkpoint some time to complete or whether it
needs to be completed right now.

All I was really trying to point out above was that the comment might be
improved upon, just so someone understands that we aren't doing a
checkpoint at this particular place, but one will be done later due to
the promotion. Maybe I'm being a bit extra with that, but that was my
thought when reading the code and the use of the promoted flag variable.

(I feel we don't need the last-checkpoint check, either.)

FWIW, by the way, I complained that the variable name "promoted" is a
bit confusing. Its old name was fast_promoted, which means that fast
promotion is being *requested* and ongoing. On the other hand the
current name "promoted" still means "(fast=non-fallback) promotion is
ongoing", so there was a conversation as follows.

/messages/by-id/9fdd994d-a531-a52b-7906-e1cc22701310@oss.nttdata.com

How about changing it to fallback_promotion, or some more
behavior-specific name like immediate_checkpoint_needed?

I like doEndOfRecoveryCkpt or something, but I have no strong opinion
about that flag naming. So I'm OK with immediate_checkpoint_needed
if others like that, too.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#8 Robert Haas
robertmhaas@gmail.com
In reply to: Kyotaro Horiguchi (#7)
Re: needless complexity in StartupXLOG

On Tue, Jul 27, 2021 at 9:18 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

FWIW, by the way, I complained that the variable name "promoted" is a
bit confusing. Its old name was fast_promoted, which means that fast
promotion is being *requested* and ongoing. On the other hand the
current name "promoted" still means "(fast=non-fallback) promotion is
ongoing", so there was a conversation as follows.

/messages/by-id/9fdd994d-a531-a52b-7906-e1cc22701310@oss.nttdata.com

I agree - that variable name is also not great. I am open to making
improvements in that area and in others that have been suggested on
this thread, but my immediate goal is to figure out whether anyone
objects to me committing the posted patch. If nobody comes up with a
reason why it's a bad idea in the next few days, I'll plan to move
ahead with it.

--
Robert Haas
EDB: http://www.enterprisedb.com

#9 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Robert Haas (#8)
Re: needless complexity in StartupXLOG

At Tue, 27 Jul 2021 11:03:14 -0400, Robert Haas <robertmhaas@gmail.com> wrote in

On Tue, Jul 27, 2021 at 9:18 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

FWIW, by the way, I complained that the variable name "promoted" is a
bit confusing. Its old name was fast_promoted, which means that fast
promotion is being *requested* and ongoing. On the other hand the
current name "promoted" still means "(fast=non-fallback) promotion is
ongoing", so there was a conversation as follows.

/messages/by-id/9fdd994d-a531-a52b-7906-e1cc22701310@oss.nttdata.com

I agree - that variable name is also not great. I am open to making
improvements in that area and in others that have been suggested on
this thread, but my immediate goal is to figure out whether anyone
objects to me committing the posted patch. If nobody comes up with a
reason why it's a bad idea in the next few days, I'll plan to move
ahead with it.

That's fine with me.

I still haven't found a way to lose the last checkpoint due to software
failure. Repeated promotion without having new checkpoints is safe as
expected. We don't remove WAL files unless a checkpoint completes, and
a checkpoint preserves segments back to the one containing its redo
point.

In short, I'm for it.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#10 Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#1)
Re: needless complexity in StartupXLOG

Hi,

On 2021-07-26 12:12:53 -0400, Robert Haas wrote:

My first thought was that we should do the check unconditionally,
rather than just when bgwriterLaunched && LocalPromoteIsTriggered, and
ERROR if it fails. But then I wondered what the point of that would be
exactly. If we have such a bug -- and to the best of my knowledge
there's no evidence that we do -- there's no particular reason it
should only happen at the end of recovery. It could happen any time
the system -- or the user, or malicious space aliens -- remove files
from pg_wal, and we have no real idea about the timing of malicious
space alien activity, so doing the test here rather than anywhere else
just seems like a shot in the dark.

Yea. The history of that code being added doesn't suggest that there was
a concrete issue being addressed, from what I can tell.

So at the moment I am leaning toward the view that we should just
remove this check entirely, as in the attached, proposed patch.

+1

Really, I think we should consider going further. If it's safe to
write an end-of-recovery record rather than a checkpoint, why not do
so all the time?

+many. The current split doesn't make much sense. For one, it often is a huge
issue if crash recovery takes a long time - why should we incur the cost that
we are OK avoiding during promotions? For another, end-of-recovery is a
crucial path for correctness, reducing the number of non-trivial paths is
good.

Imagine if instead of
all the hairy logic we have now we just replaced this whole if
(IsInRecovery) stanza with this:

if (InRecovery)
CreateEndOfRecoveryRecord();

That would be WAY easier to reason about than the rat's nest we have
here today. Now, I am not sure what it would take to get there, but I
think that is the direction we ought to be heading.

What are we going to do in the single user ([1]) case in this awesome future?
I guess we could just not create a checkpoint until single user mode is shut
down / creates a checkpoint for other reasons?

Greetings,

Andres Freund

[1] I really wish somebody had the energy to just remove single user and
bootstrap modes. The degree to which they increase complexity in the rest of
the system is entirely unreasonable. There's not actually any reason
bootstrapping can't happen with checkpointer et al running, it's just
autovacuum that'd need to be blocked.

#11 Julien Rouhaud
rjuju123@gmail.com
In reply to: Andres Freund (#10)
Re: needless complexity in StartupXLOG

On Thu, Jul 29, 2021 at 3:28 AM Andres Freund <andres@anarazel.de> wrote:

[1] I really wish somebody had the energy to just remove single user and
bootstrap modes. The degree to which they increase complexity in the rest of
the system is entirely unreasonable. There's not actually any reason
bootstrapping can't happen with checkpointer et al running, it's just
autovacuum that'd need to be blocked.

Any objection to adding an entry for that in the wiki TODO?

#12 Amul Sul
sulamul@gmail.com
In reply to: Robert Haas (#1)
Re: needless complexity in StartupXLOG

On Mon, Jul 26, 2021 at 9:43 PM Robert Haas <robertmhaas@gmail.com> wrote:

StartupXLOG() has code beginning around line 7900 of xlog.c that
decides, at the end of recovery, between four possible courses of
action. It either writes an end-of-recovery record, or requests a
checkpoint, or creates a checkpoint, or does nothing, depending on the
value of 3 flag variables, and also on whether we're still able to
read the last checkpoint record:

checkPointLoc = ControlFile->checkPoint;

/*
 * Confirm the last checkpoint is available for us to recover
 * from if we fail.
 */
record = ReadCheckpointRecord(xlogreader,
                              checkPointLoc, 1, false);
if (record != NULL)
{
    promoted = true;

It seems to me that ReadCheckpointRecord() should never fail here. It
should always be true, even when we're not in recovery, that the last
checkpoint record is readable. If there's ever a situation where that
is not true, even for an instant, then a crash at that point will be
unrecoverable. Now, one way that it could happen is if we've got a bug
in the code someplace that removes WAL segments too soon. However, if
we have such a bug, we should fix it. What this code does is say "oh,
I guess we removed the old checkpoint too soon, no big deal, let's
just be more aggressive about getting the next one done," which I do
not think is the right response. Aside from a bug, the only other way
I can see it happening is if someone is manually removing WAL segments
as the server is running through recovery, perhaps as some last-ditch
play to avoid running out of disk space. I don't think the server
needs to have - or should have - code to cater to such crazy
scenarios. Therefore I think that this check, at least in its current
form, is not sensible.

My first thought was that we should do the check unconditionally,
rather than just when bgwriterLaunched && LocalPromoteIsTriggered, and
ERROR if it fails. But then I wondered what the point of that would be
exactly. If we have such a bug -- and to the best of my knowledge
there's no evidence that we do -- there's no particular reason it
should only happen at the end of recovery. It could happen any time
the system -- or the user, or malicious space aliens -- remove files
from pg_wal, and we have no real idea about the timing of malicious
space alien activity, so doing the test here rather than anywhere else
just seems like a shot in the dark. Perhaps the most logical place
would be to move it to CreateCheckPoint() just after we remove old
xlog files, but we don't have the xlogreader there, and anyway I don't
see how it's really going to help. What bug do we expect to catch by
removing files we think we don't need and then checking that we didn't
remove the files we think we do need? That seems more like grasping at
straws than a serious attempt to make things work better.

So at the moment I am leaning toward the view that we should just
remove this check entirely, as in the attached, proposed patch.

Can we have an elog() fatal error or warning to make sure that the
last checkpoint is still readable? It is possible that the user
(knowingly or unknowingly) or some buggy code has removed the WAL file
containing the last checkpoint, and if so we would have a hard time
tracking that down once we start seeing unexpected behavior because of
it. Thoughts?

Really, I think we should consider going further. If it's safe to
write an end-of-recovery record rather than a checkpoint, why not do
so all the time? Right now we fail to do that in the above-described
"impossible" scenario where the previous checkpoint record can't be
read, or if we're exiting archive recovery for some reason other than
a promotion request, or if we're in single-user mode, or if we're in
crash recovery. Presumably, people would like to start up the server
quickly in all of those scenarios, so the only reason not to use this
technology all the time is if we think it's safe in some scenarios and
not others. I can't think of a reason why it matters why we're exiting
archive recovery, nor can I think of a reason why it matters whether
we're in single user mode. The distinction between crash recovery and
archive recovery does seem to matter, but if anything the crash
recovery scenario seems simpler, because then there's only one
timeline involved.

I realize that conservatism may have played a role in this code ending
up looking the way that it does; someone seems to have thought it
would be better not to rely on a new idea in all cases. From my point
of view, though, it's scary to have so many cases, especially cases
that don't seem like they should ever be reached. I think that
simplifying the logic here and trying to do the same things in as many
cases as we can will lead to better robustness. Imagine if instead of
all the hairy logic we have now we just replaced this whole if
(IsInRecovery) stanza with this:

if (InRecovery)
CreateEndOfRecoveryRecord();

+1, and do the checkpoint at the end unconditionally as we are doing
for the promotion.

Regards,
Amul

#13 Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#10)
Re: needless complexity in StartupXLOG

On Wed, Jul 28, 2021 at 3:28 PM Andres Freund <andres@anarazel.de> wrote:

Imagine if instead of
all the hairy logic we have now we just replaced this whole if
(IsInRecovery) stanza with this:

if (InRecovery)
CreateEndOfRecoveryRecord();

That would be WAY easier to reason about than the rat's nest we have
here today. Now, I am not sure what it would take to get there, but I
think that is the direction we ought to be heading.

What are we going to do in the single user ([1]) case in this awesome future?
I guess we could just not create a checkpoint until single user mode is shut
down / creates a checkpoint for other reasons?

It probably depends on who writes this thus-far-hypothetical patch,
but my thought is that we'd unconditionally request a checkpoint after
writing the end-of-recovery record, same as we do now if (promoted).
If we happened to be in single-user mode, then that checkpoint request
would turn into performing a checkpoint on the spot in the one and
only process we've got, but StartupXLOG() wouldn't really need to care
what happens under the hood after it called RequestCheckpoint().
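
A minimal sketch of that hypothetical ending of StartupXLOG(), with the
single-user behavior noted in a comment (RequestCheckpoint() already performs
the checkpoint itself when there is no checkpointer process to signal):

if (InRecovery)
    CreateEndOfRecoveryRecord();

/*
 * In a standalone backend there is no checkpointer, so RequestCheckpoint()
 * just calls CreateCheckPoint() directly; either way StartupXLOG() doesn't
 * have to care what happens under the hood.
 */
RequestCheckpoint(CHECKPOINT_FORCE);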

[1] I really wish somebody had the energy to just remove single user and
bootstrap modes. The degree to which they increase complexity in the rest of
the system is entirely unreasonable. There's not actually any reason
bootstrapping can't happen with checkpointer et al running, it's just
autovacuum that'd need to be blocked.

I don't know whether this is the way forward or not. I think a lot of
the complexity of the current system is incidental rather than
intrinsic. If I were going to work on this, I'd probably work on
trying to tidy up the code and reduce the number of places that need
to care about IsUnderPostmaster and IsPostmasterEnvironment, rather
than trying to get rid of them. I suspect that's a necessary
prerequisite step anyway, and not a trivial effort either.

--
Robert Haas
EDB: http://www.enterprisedb.com

#14 Robert Haas
robertmhaas@gmail.com
In reply to: Amul Sul (#12)
Re: needless complexity in StartupXLOG

On Thu, Jul 29, 2021 at 1:47 AM Amul Sul <sulamul@gmail.com> wrote:

Can we have an elog() fatal error or warning to make sure that the
last checkpoint is still readable? It is possible that the user
(knowingly or unknowingly) or some buggy code has removed the WAL file
containing the last checkpoint, and if so we would have a hard time
tracking that down once we start seeing unexpected behavior because of
it. Thoughts?

Sure, we could, but I don't think we should. Such crazy things can
happen any time, not just at the point where this check is happening.
It's not particularly more likely to happen here vs. any other place
where we could insert a check. Should we check everywhere, all the
time, just in case?

I realize that conservatism may have played a role in this code ending
up looking the way that it does; someone seems to have thought it
would be better not to rely on a new idea in all cases. From my point
of view, though, it's scary to have so many cases, especially cases
that don't seem like they should ever be reached. I think that
simplifying the logic here and trying to do the same things in as many
cases as we can will lead to better robustness. Imagine if instead of
all the hairy logic we have now we just replaced this whole if
(IsInRecovery) stanza with this:

if (InRecovery)
CreateEndOfRecoveryRecord();

+1, and do the checkpoint at the end unconditionally as we are doing
for the promotion.

Yeah, that was my thought, too.

--
Robert Haas
EDB: http://www.enterprisedb.com

#15 Andres Freund
andres@anarazel.de
In reply to: Julien Rouhaud (#11)
Re: needless complexity in StartupXLOG

Hi,

On 2021-07-29 12:49:19 +0800, Julien Rouhaud wrote:

On Thu, Jul 29, 2021 at 3:28 AM Andres Freund <andres@anarazel.de> wrote:

[1] I really wish somebody had the energy to just remove single user and
bootstrap modes. The degree to which they increase complexity in the rest of
the system is entirely unreasonable. There's not actually any reason
bootstrapping can't happen with checkpointer et al running, it's just
autovacuum that'd need to be blocked.

Any objection to adding an entry for that in the wiki TODO?

Not sure there's enough consensus on the idea for that. I personally
think that's a good approach to reducing relevant complexity, but I
don't know if anybody agrees...

Greetings,

Andres Freund

#16 Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#15)
Re: needless complexity in StartupXLOG

On Thu, Jul 29, 2021 at 3:16 PM Andres Freund <andres@anarazel.de> wrote:

Not sure there's enough consensus on the idea for that. I personally
think that's a good approach to reducing relevant complexity, but I
don't know if anybody agrees...

There does seem to be agreement on the proposed patch, so I have committed it.

Thanks to all for the discussion and the links to other relevant threads.

--
Robert Haas
EDB: http://www.enterprisedb.com

#17 Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#13)
1 attachment(s)
using an end-of-recovery record in all cases

On Thu, Jul 29, 2021 at 11:49 AM Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Jul 28, 2021 at 3:28 PM Andres Freund <andres@anarazel.de> wrote:

Imagine if instead of
all the hairy logic we have now we just replaced this whole if
(IsInRecovery) stanza with this:

if (InRecovery)
CreateEndOfRecoveryRecord();

That would be WAY easier to reason about than the rat's nest we have
here today. Now, I am not sure what it would take to get there, but I
think that is the direction we ought to be heading.

What are we going to do in the single user ([1]) case in this awesome future?
I guess we could just not create a checkpoint until single user mode is shut
down / creates a checkpoint for other reasons?

It probably depends on who writes this thus-far-hypothetical patch,
but my thought is that we'd unconditionally request a checkpoint after
writing the end-of-recovery record, same as we do now if (promoted).
If we happened to be in single-user mode, then that checkpoint request
would turn into performing a checkpoint on the spot in the one and
only process we've got, but StartupXLOG() wouldn't really need to care
what happens under the hood after it called RequestCheckpoint().

I decided to try writing a patch to use an end-of-recovery record
rather than a checkpoint record in all cases. The patch itself was
pretty simple but didn't work. There are two different reasons why it
didn't work, which I'll explain in a minute. I'm not sure whether
there are any other problems; these are the only two things that cause
problems with 'make check-world', but that's hardly a guarantee of
anything. Anyway, I thought it would be useful to report these issues
first and hopefully get some feedback.

The first problem I hit was that GetRunningTransactionData() does
Assert(TransactionIdIsNormal(CurrentRunningXacts->latestCompletedXid)).
While that does seem like a superficially reasonable thing to assert,
StartupXLOG() initializes latestCompletedXid by calling
TransactionIdRetreat on the value extracted from checkPoint.nextXid,
and the first-ever checkpoint that is written at the beginning of WAL
has a nextXid of 3, so when starting up from that checkpoint nextXid
is 2, which is not a normal XID. When we try to create the next
checkpoint, CreateCheckPoint() does LogStandbySnapshot() which calls
GetRunningTransactionData() and the assertion fails. In the current
code, we avoid this more or less accidentally because
LogStandbySnapshot() is skipped when starting from a shutdown
checkpoint. If we write an end-of-recovery record and then trigger a
checkpoint afterwards, then we don't avoid the problem. Although I'm
impishly tempted to suggest that we #define SecondNormalTransactionId
4 and then use that to create the first checkpoint instead of
FirstNormalTransactionId -- naturally with no comments explaining why
we're doing it -- I think the real problem is that the assertion is
just wrong. CurrentRunningXacts->latestCompletedXid shouldn't be
InvalidTransactionId or BootstrapTransactionId, but
FrozenTransactionId is a legal, if corner-case, value, at least as the
code stands today.
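
To spell the failure out, the sequence is roughly this (a sketch using the
values described above, not a literal excerpt from xlog.c or procarray.c):

/* StartupXLOG(), starting from the very first checkpoint: nextXid is 3 */
FullTransactionId latestCompletedXid = checkPoint.nextXid;
FullTransactionIdRetreat(&latestCompletedXid);  /* now 2, i.e. FrozenTransactionId */

/* later: CreateCheckPoint() -> LogStandbySnapshot() -> GetRunningTransactionData() */
Assert(TransactionIdIsNormal(XidFromFullTransactionId(latestCompletedXid)));
/* fails, because 2 < FirstNormalTransactionId (3) */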

Unfortunately we can't just relax the assertion, because the
XLOG_RUNNING_XACTS record will eventually be handed to
ProcArrayApplyRecoveryInfo() for processing ... and that function
contains a matching assertion which would in turn fail. It in turn
passes the value to MaintainLatestCompletedXidRecovery() which
contains yet another matching assertion, so the restriction to normal
XIDs here looks pretty deliberate. There are no comments, though, so
the reader is left to guess why. I see one problem:
MaintainLatestCompletedXidRecovery uses FullXidRelativeTo, which
expects a normal XID. Perhaps it's best to just dodge the entire issue
by skipping LogStandbySnapshot() if latestCompletedXid happens to be
2, but that feels like a hack, because AFAICS the real problem is that
StartupXLog() doesn't agree with the rest of the code on whether 2 is
a legal case, and maybe we ought to be storing a value that doesn't
need to be computed via TransactionIdRetreat().

The second problem I hit was a preexisting bug where, under
wal_level=minimal, redo of a "create tablespace" record can blow away
data of which we have no other copy. See
/messages/by-id/CA+TgmoaLO9ncuwvr2nN-J4VEP5XyAcy=zKiHxQzBbFRxxGxm0w@mail.gmail.com
for details. That bug happens to make src/test/recovery's
018_wal_optimize.pl fail one of the tests, because the test starts and
stops the server repeatedly, with the result that with the attached
patch, we just keep writing end-of-recovery records and never getting
time to finish a checkpoint before the next shutdown, so every test
replays the CREATE TABLESPACE record and everything that previous
tests did. The "wal_level = minimal, SET TABLESPACE commit
subtransaction" fails because it's the only one that (1) uses the
tablespace for a new table, (2) commits, and (3) runs before a
checkpoint is manually forced.

It's also worth noting that if we go this way,
CHECKPOINT_END_OF_RECOVERY should be ripped out entirely. We'd still
be triggering a checkpoint at the end of recovery, but because it
could be running concurrently with WAL-generating activity, it
wouldn't be an end-of-recovery checkpoint in the sense that we now use
that term. In particular, you couldn't assume that no other write
transactions are running at the point when this checkpoint is
performed. I haven't yet tried ripping that out and doing so might
reveal other problems.

--
Robert Haas
EDB: http://www.enterprisedb.com

Attachments:

v1-0001-Always-use-an-end-of-recovery-record-rather-than-.patch (application/octet-stream)
From 1d1e605ef0915553c6eec40ebf5e048bf2272836 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Mon, 9 Aug 2021 14:49:21 -0400
Subject: [PATCH v1] Always use an end-of-recovery record rather than a
 checkpoint.

Insert awful hack to allow LogStandbySnapshot() to not log a
standby snapshot, because it's unprepared to handle the case
where latestCompletedXid is 2, which can happen when starting
from the very first ever checkpoint.

Nuke 018_wal_optimize.pl. Otherwise, the "wal_level = minimal, SET
TABLESPACE commit subtransaction" test fails. The reason for the
failure is that CREATE TABLESPACE nukes the entire directory and
then recreates it, which is a really bad thing to do when
wal_level=minimal. Without the changes made by this patch the
problem is masked because each server restart necessarily
involves a checkpoint, so the CREATE TABLESPACE and the creation
of a table in that tablespace don't manage to happen in the same
checkpoint cycle.

This is not to be taken seriously in its current form; it's just
an illustration of (some of?) the problems that would need to be
solved to make this kind of change viable.
---
 src/backend/access/transam/xlog.c       |  63 ++--
 src/backend/postmaster/bgwriter.c       |   1 +
 src/backend/replication/slot.c          |   1 +
 src/backend/storage/ipc/standby.c       |   9 +
 src/test/recovery/t/018_wal_optimize.pl | 376 ------------------------
 5 files changed, 29 insertions(+), 421 deletions(-)
 delete mode 100644 src/test/recovery/t/018_wal_optimize.pl

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index d0ec6a834b..627d6e619d 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6503,7 +6503,6 @@ StartupXLOG(void)
 	DBState		dbstate_at_startup;
 	XLogReaderState *xlogreader;
 	XLogPageReadPrivate private;
-	bool		promoted = false;
 	struct stat st;
 
 	/*
@@ -7876,43 +7875,8 @@ StartupXLOG(void)
 
 	if (InRecovery)
 	{
-		/*
-		 * Perform a checkpoint to update all our recovery activity to disk.
-		 *
-		 * Note that we write a shutdown checkpoint rather than an on-line
-		 * one. This is not particularly critical, but since we may be
-		 * assigning a new TLI, using a shutdown checkpoint allows us to have
-		 * the rule that TLI only changes in shutdown checkpoints, which
-		 * allows some extra error checking in xlog_redo.
-		 *
-		 * In promotion, only create a lightweight end-of-recovery record
-		 * instead of a full checkpoint. A checkpoint is requested later,
-		 * after we're fully out of recovery mode and already accepting
-		 * queries.
-		 */
-		if (ArchiveRecoveryRequested && IsUnderPostmaster &&
-			LocalPromoteIsTriggered)
-		{
-			promoted = true;
-
-			/*
-			 * Insert a special WAL record to mark the end of recovery, since
-			 * we aren't doing a checkpoint. That means that the checkpointer
-			 * process may likely be in the middle of a time-smoothed
-			 * restartpoint and could continue to be for minutes after this.
-			 * That sounds strange, but the effect is roughly the same and it
-			 * would be stranger to try to come out of the restartpoint and
-			 * then checkpoint. We request a checkpoint later anyway, just for
-			 * safety.
-			 */
-			CreateEndOfRecoveryRecord();
-		}
-		else
-		{
-			RequestCheckpoint(CHECKPOINT_END_OF_RECOVERY |
-							  CHECKPOINT_IMMEDIATE |
-							  CHECKPOINT_WAIT);
-		}
+		/* Insert a special WAL record to mark the end of recovery. */
+		CreateEndOfRecoveryRecord();
 	}
 
 	if (ArchiveRecoveryRequested)
@@ -8093,13 +8057,22 @@ StartupXLOG(void)
 	WalSndWakeup();
 
 	/*
-	 * If this was a promotion, request an (online) checkpoint now. This isn't
-	 * required for consistency, but the last restartpoint might be far back,
-	 * and in case of a crash, recovering from it might take a longer than is
-	 * appropriate now that we're not in standby mode anymore.
+	 * Request an (online) checkpoint now. Note that, until this is complete,
+	 * a crash would start replay from the same WAL location we did, or from
+	 * the last restartpoint that completed. We don't want to let that
+	 * situation persist for longer than necessary, since users don't like
+	 * long recovery times. On the other hand, they also want to be able to
+	 * start doing useful work again as quickly as possible. Therefore, we
+	 * don't pass CHECKPOINT_IMMEDIATE to avoid bogging down the system.
+	 *
+	 * Note that the consequence of requesting a checkpoint here only after
+	 * we've allowed WAL writes is that a single checkpoint cycle can span
+	 * multiple server lifetimes. So for example if you want something to
+	 * happen at least once per checkpoint cycle or at most once per
+	 * checkpoint cycle, you have to consider what happens if the server
+	 * is restarted someplace in the middle.
 	 */
-	if (promoted)
-		RequestCheckpoint(CHECKPOINT_FORCE);
+	RequestCheckpoint(CHECKPOINT_FORCE);
 }
 
 /*
@@ -8914,7 +8887,7 @@ CreateCheckPoint(int flags)
 		shutdown = false;
 
 	/* sanity check */
-	if (RecoveryInProgress() && (flags & CHECKPOINT_END_OF_RECOVERY) == 0)
+	if (RecoveryInProgress())
 		elog(ERROR, "can't create a checkpoint during recovery");
 
 	/*
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 5584f4bc24..3eeacad50a 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -294,6 +294,7 @@ BackgroundWriterMain(void)
 				last_snapshot_lsn <= GetLastImportantRecPtr())
 			{
 				last_snapshot_lsn = LogStandbySnapshot();
+				Assert(!XLogRecPtrIsInvalid(last_snapshot_lsn));
 				last_snapshot_ts = now;
 			}
 		}
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 33e9acab4a..4b649d69e6 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1108,6 +1108,7 @@ ReplicationSlotReserveWal(void)
 
 			/* make sure we have enough information to start */
 			flushptr = LogStandbySnapshot();
+			Assert(!XLogRecPtrIsInvalid(flushptr));
 
 			/* and make sure it's fsynced to disk */
 			XLogFlush(flushptr);
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 077251c1a6..b718c30fa5 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -1226,6 +1226,15 @@ LogStandbySnapshot(void)
 
 	Assert(XLogStandbyInfoActive());
 
+	/*
+	 * HACK: If we do this in single user mode, we'll fail an assertion,
+	 * because until the first checkpoint completes, latestCompletedXid will
+	 * be FrozenTransactionId rather than a normal XID, and
+	 * GetRunningTransactionData() doesn't think that's OK.
+	 */
+	if (!IsUnderPostmaster)
+		return InvalidXLogRecPtr;
+
 	/*
 	 * Get details of any AccessExclusiveLocks being held at the moment.
 	 */
diff --git a/src/test/recovery/t/018_wal_optimize.pl b/src/test/recovery/t/018_wal_optimize.pl
deleted file mode 100644
index 4aa1bf8f54..0000000000
--- a/src/test/recovery/t/018_wal_optimize.pl
+++ /dev/null
@@ -1,376 +0,0 @@
-
-# Copyright (c) 2021, PostgreSQL Global Development Group
-
-# Test WAL replay when some operation has skipped WAL.
-#
-# These tests exercise code that once violated the mandate described in
-# src/backend/access/transam/README section "Skipping WAL for New
-# RelFileNode".  The tests work by committing some transactions, initiating an
-# immediate shutdown, and confirming that the expected data survives recovery.
-# For many years, individual commands made the decision to skip WAL, hence the
-# frequent appearance of COPY in these tests.
-use strict;
-use warnings;
-
-use PostgresNode;
-use TestLib;
-use Test::More tests => 34;
-
-sub check_orphan_relfilenodes
-{
-	my ($node, $test_name) = @_;
-
-	my $db_oid = $node->safe_psql('postgres',
-		"SELECT oid FROM pg_database WHERE datname = 'postgres'");
-	my $prefix               = "base/$db_oid/";
-	my $filepaths_referenced = $node->safe_psql(
-		'postgres', "
-	   SELECT pg_relation_filepath(oid) FROM pg_class
-	   WHERE reltablespace = 0 AND relpersistence <> 't' AND
-	   pg_relation_filepath(oid) IS NOT NULL;");
-	is_deeply(
-		[
-			sort(map { "$prefix$_" }
-				  grep(/^[0-9]+$/, slurp_dir($node->data_dir . "/$prefix")))
-		],
-		[ sort split /\n/, $filepaths_referenced ],
-		$test_name);
-	return;
-}
-
-# We run this same test suite for both wal_level=minimal and replica.
-sub run_wal_optimize
-{
-	my $wal_level = shift;
-
-	my $node = PostgresNode->new("node_$wal_level");
-	$node->init;
-	$node->append_conf(
-		'postgresql.conf', qq(
-wal_level = $wal_level
-max_prepared_transactions = 1
-wal_log_hints = on
-wal_skip_threshold = 0
-#wal_debug = on
-));
-	$node->start;
-
-	# Setup
-	my $tablespace_dir = $node->basedir . '/tablespace_other';
-	mkdir($tablespace_dir);
-	$tablespace_dir = TestLib::perl2host($tablespace_dir);
-	$node->safe_psql('postgres',
-		"CREATE TABLESPACE other LOCATION '$tablespace_dir';");
-
-	# Test direct truncation optimization.  No tuples.
-	$node->safe_psql(
-		'postgres', "
-		BEGIN;
-		CREATE TABLE trunc (id serial PRIMARY KEY);
-		TRUNCATE trunc;
-		COMMIT;");
-	$node->stop('immediate');
-	$node->start;
-	my $result = $node->safe_psql('postgres', "SELECT count(*) FROM trunc;");
-	is($result, qq(0), "wal_level = $wal_level, TRUNCATE with empty table");
-
-	# Test truncation with inserted tuples within the same transaction.
-	# Tuples inserted after the truncation should be seen.
-	$node->safe_psql(
-		'postgres', "
-		BEGIN;
-		CREATE TABLE trunc_ins (id serial PRIMARY KEY);
-		INSERT INTO trunc_ins VALUES (DEFAULT);
-		TRUNCATE trunc_ins;
-		INSERT INTO trunc_ins VALUES (DEFAULT);
-		COMMIT;");
-	$node->stop('immediate');
-	$node->start;
-	$result = $node->safe_psql('postgres',
-		"SELECT count(*), min(id) FROM trunc_ins;");
-	is($result, qq(1|2), "wal_level = $wal_level, TRUNCATE INSERT");
-
-	# Same for prepared transaction.
-	# Tuples inserted after the truncation should be seen.
-	$node->safe_psql(
-		'postgres', "
-		BEGIN;
-		CREATE TABLE twophase (id serial PRIMARY KEY);
-		INSERT INTO twophase VALUES (DEFAULT);
-		TRUNCATE twophase;
-		INSERT INTO twophase VALUES (DEFAULT);
-		PREPARE TRANSACTION 't';
-		COMMIT PREPARED 't';");
-	$node->stop('immediate');
-	$node->start;
-	$result = $node->safe_psql('postgres',
-		"SELECT count(*), min(id) FROM trunc_ins;");
-	is($result, qq(1|2), "wal_level = $wal_level, TRUNCATE INSERT PREPARE");
-
-	# Writing WAL at end of xact, instead of syncing.
-	$node->safe_psql(
-		'postgres', "
-		SET wal_skip_threshold = '1GB';
-		BEGIN;
-		CREATE TABLE noskip (id serial PRIMARY KEY);
-		INSERT INTO noskip (SELECT FROM generate_series(1, 20000) a) ;
-		COMMIT;");
-	$node->stop('immediate');
-	$node->start;
-	$result = $node->safe_psql('postgres', "SELECT count(*) FROM noskip;");
-	is($result, qq(20000), "wal_level = $wal_level, end-of-xact WAL");
-
-	# Data file for COPY query in subsequent tests
-	my $basedir   = $node->basedir;
-	my $copy_file = "$basedir/copy_data.txt";
-	TestLib::append_to_file(
-		$copy_file, qq(20000,30000
-20001,30001
-20002,30002));
-	$copy_file = TestLib::perl2host($copy_file);
-
-	# Test truncation with inserted tuples using both INSERT and COPY.  Tuples
-	# inserted after the truncation should be seen.
-	$node->safe_psql(
-		'postgres', "
-		BEGIN;
-		CREATE TABLE ins_trunc (id serial PRIMARY KEY, id2 int);
-		INSERT INTO ins_trunc VALUES (DEFAULT, generate_series(1,10000));
-		TRUNCATE ins_trunc;
-		INSERT INTO ins_trunc (id, id2) VALUES (DEFAULT, 10000);
-		COPY ins_trunc FROM '$copy_file' DELIMITER ',';
-		INSERT INTO ins_trunc (id, id2) VALUES (DEFAULT, 10000);
-		COMMIT;");
-	$node->stop('immediate');
-	$node->start;
-	$result = $node->safe_psql('postgres', "SELECT count(*) FROM ins_trunc;");
-	is($result, qq(5), "wal_level = $wal_level, TRUNCATE COPY INSERT");
-
-	# Test truncation with inserted tuples using COPY.  Tuples copied after
-	# the truncation should be seen.
-	$node->safe_psql(
-		'postgres', "
-		BEGIN;
-		CREATE TABLE trunc_copy (id serial PRIMARY KEY, id2 int);
-		INSERT INTO trunc_copy VALUES (DEFAULT, generate_series(1,3000));
-		TRUNCATE trunc_copy;
-		COPY trunc_copy FROM '$copy_file' DELIMITER ',';
-		COMMIT;");
-	$node->stop('immediate');
-	$node->start;
-	$result =
-	  $node->safe_psql('postgres', "SELECT count(*) FROM trunc_copy;");
-	is($result, qq(3), "wal_level = $wal_level, TRUNCATE COPY");
-
-	# Like previous test, but rollback SET TABLESPACE in a subtransaction.
-	$node->safe_psql(
-		'postgres', "
-		BEGIN;
-		CREATE TABLE spc_abort (id serial PRIMARY KEY, id2 int);
-		INSERT INTO spc_abort VALUES (DEFAULT, generate_series(1,3000));
-		TRUNCATE spc_abort;
-		SAVEPOINT s;
-		  ALTER TABLE spc_abort SET TABLESPACE other; ROLLBACK TO s;
-		COPY spc_abort FROM '$copy_file' DELIMITER ',';
-		COMMIT;");
-	$node->stop('immediate');
-	$node->start;
-	$result = $node->safe_psql('postgres', "SELECT count(*) FROM spc_abort;");
-	is($result, qq(3),
-		"wal_level = $wal_level, SET TABLESPACE abort subtransaction");
-
-	# in different subtransaction patterns
-	$node->safe_psql(
-		'postgres', "
-		BEGIN;
-		CREATE TABLE spc_commit (id serial PRIMARY KEY, id2 int);
-		INSERT INTO spc_commit VALUES (DEFAULT, generate_series(1,3000));
-		TRUNCATE spc_commit;
-		SAVEPOINT s; ALTER TABLE spc_commit SET TABLESPACE other; RELEASE s;
-		COPY spc_commit FROM '$copy_file' DELIMITER ',';
-		COMMIT;");
-	$node->stop('immediate');
-	$node->start;
-	$result =
-	  $node->safe_psql('postgres', "SELECT count(*) FROM spc_commit;");
-	is($result, qq(3),
-		"wal_level = $wal_level, SET TABLESPACE commit subtransaction");
-
-	$node->safe_psql(
-		'postgres', "
-		BEGIN;
-		CREATE TABLE spc_nest (id serial PRIMARY KEY, id2 int);
-		INSERT INTO spc_nest VALUES (DEFAULT, generate_series(1,3000));
-		TRUNCATE spc_nest;
-		SAVEPOINT s;
-			ALTER TABLE spc_nest SET TABLESPACE other;
-			SAVEPOINT s2;
-				ALTER TABLE spc_nest SET TABLESPACE pg_default;
-			ROLLBACK TO s2;
-			SAVEPOINT s2;
-				ALTER TABLE spc_nest SET TABLESPACE pg_default;
-			RELEASE s2;
-		ROLLBACK TO s;
-		COPY spc_nest FROM '$copy_file' DELIMITER ',';
-		COMMIT;");
-	$node->stop('immediate');
-	$node->start;
-	$result = $node->safe_psql('postgres', "SELECT count(*) FROM spc_nest;");
-	is($result, qq(3),
-		"wal_level = $wal_level, SET TABLESPACE nested subtransaction");
-
-	$node->safe_psql(
-		'postgres', "
-		CREATE TABLE spc_hint (id int);
-		INSERT INTO spc_hint VALUES (1);
-		BEGIN;
-		ALTER TABLE spc_hint SET TABLESPACE other;
-		CHECKPOINT;
-		SELECT * FROM spc_hint;  -- set hint bit
-		INSERT INTO spc_hint VALUES (2);
-		COMMIT;");
-	$node->stop('immediate');
-	$node->start;
-	$result = $node->safe_psql('postgres', "SELECT count(*) FROM spc_hint;");
-	is($result, qq(2), "wal_level = $wal_level, SET TABLESPACE, hint bit");
-
-	$node->safe_psql(
-		'postgres', "
-		BEGIN;
-		CREATE TABLE idx_hint (c int PRIMARY KEY);
-		SAVEPOINT q; INSERT INTO idx_hint VALUES (1); ROLLBACK TO q;
-		CHECKPOINT;
-		INSERT INTO idx_hint VALUES (1);  -- set index hint bit
-		INSERT INTO idx_hint VALUES (2);
-		COMMIT;");
-	$node->stop('immediate');
-	$node->start;
-	$result = $node->psql('postgres',);
-	my ($ret, $stdout, $stderr) =
-	  $node->psql('postgres', "INSERT INTO idx_hint VALUES (2);");
-	is($ret, qq(3), "wal_level = $wal_level, unique index LP_DEAD");
-	like(
-		$stderr,
-		qr/violates unique/,
-		"wal_level = $wal_level, unique index LP_DEAD message");
-
-	# UPDATE touches two buffers for one row.
-	$node->safe_psql(
-		'postgres', "
-		BEGIN;
-		CREATE TABLE upd (id serial PRIMARY KEY, id2 int);
-		INSERT INTO upd (id, id2) VALUES (DEFAULT, generate_series(1,10000));
-		COPY upd FROM '$copy_file' DELIMITER ',';
-		UPDATE upd SET id2 = id2 + 1;
-		DELETE FROM upd;
-		COMMIT;");
-	$node->stop('immediate');
-	$node->start;
-	$result = $node->safe_psql('postgres', "SELECT count(*) FROM upd;");
-	is($result, qq(0),
-		"wal_level = $wal_level, UPDATE touches two buffers for one row");
-
-	# Test consistency of COPY with INSERT for table created in the same
-	# transaction.
-	$node->safe_psql(
-		'postgres', "
-		BEGIN;
-		CREATE TABLE ins_copy (id serial PRIMARY KEY, id2 int);
-		INSERT INTO ins_copy VALUES (DEFAULT, 1);
-		COPY ins_copy FROM '$copy_file' DELIMITER ',';
-		COMMIT;");
-	$node->stop('immediate');
-	$node->start;
-	$result = $node->safe_psql('postgres', "SELECT count(*) FROM ins_copy;");
-	is($result, qq(4), "wal_level = $wal_level, INSERT COPY");
-
-	# Test consistency of COPY that inserts more to the same table using
-	# triggers.  If the INSERTS from the trigger go to the same block data
-	# is copied to, and the INSERTs are WAL-logged, WAL replay will fail when
-	# it tries to replay the WAL record but the "before" image doesn't match,
-	# because not all changes were WAL-logged.
-	$node->safe_psql(
-		'postgres', "
-		BEGIN;
-		CREATE TABLE ins_trig (id serial PRIMARY KEY, id2 text);
-		CREATE FUNCTION ins_trig_before_row_trig() RETURNS trigger
-		  LANGUAGE plpgsql as \$\$
-		  BEGIN
-			IF new.id2 NOT LIKE 'triggered%' THEN
-			  INSERT INTO ins_trig
-				VALUES (DEFAULT, 'triggered row before' || NEW.id2);
-			END IF;
-			RETURN NEW;
-		  END; \$\$;
-		CREATE FUNCTION ins_trig_after_row_trig() RETURNS trigger
-		  LANGUAGE plpgsql as \$\$
-		  BEGIN
-			IF new.id2 NOT LIKE 'triggered%' THEN
-			  INSERT INTO ins_trig
-				VALUES (DEFAULT, 'triggered row after' || NEW.id2);
-			END IF;
-			RETURN NEW;
-		  END; \$\$;
-		CREATE TRIGGER ins_trig_before_row_insert
-		  BEFORE INSERT ON ins_trig
-		  FOR EACH ROW EXECUTE PROCEDURE ins_trig_before_row_trig();
-		CREATE TRIGGER ins_trig_after_row_insert
-		  AFTER INSERT ON ins_trig
-		  FOR EACH ROW EXECUTE PROCEDURE ins_trig_after_row_trig();
-		COPY ins_trig FROM '$copy_file' DELIMITER ',';
-		COMMIT;");
-	$node->stop('immediate');
-	$node->start;
-	$result = $node->safe_psql('postgres', "SELECT count(*) FROM ins_trig;");
-	is($result, qq(9), "wal_level = $wal_level, COPY with INSERT triggers");
-
-	# Test consistency of INSERT, COPY and TRUNCATE in same transaction block
-	# with TRUNCATE triggers.
-	$node->safe_psql(
-		'postgres', "
-		BEGIN;
-		CREATE TABLE trunc_trig (id serial PRIMARY KEY, id2 text);
-		CREATE FUNCTION trunc_trig_before_stat_trig() RETURNS trigger
-		  LANGUAGE plpgsql as \$\$
-		  BEGIN
-			INSERT INTO trunc_trig VALUES (DEFAULT, 'triggered stat before');
-			RETURN NULL;
-		  END; \$\$;
-		CREATE FUNCTION trunc_trig_after_stat_trig() RETURNS trigger
-		  LANGUAGE plpgsql as \$\$
-		  BEGIN
-			INSERT INTO trunc_trig VALUES (DEFAULT, 'triggered stat before');
-			RETURN NULL;
-		  END; \$\$;
-		CREATE TRIGGER trunc_trig_before_stat_truncate
-		  BEFORE TRUNCATE ON trunc_trig
-		  FOR EACH STATEMENT EXECUTE PROCEDURE trunc_trig_before_stat_trig();
-		CREATE TRIGGER trunc_trig_after_stat_truncate
-		  AFTER TRUNCATE ON trunc_trig
-		  FOR EACH STATEMENT EXECUTE PROCEDURE trunc_trig_after_stat_trig();
-		INSERT INTO trunc_trig VALUES (DEFAULT, 1);
-		TRUNCATE trunc_trig;
-		COPY trunc_trig FROM '$copy_file' DELIMITER ',';
-		COMMIT;");
-	$node->stop('immediate');
-	$node->start;
-	$result =
-	  $node->safe_psql('postgres', "SELECT count(*) FROM trunc_trig;");
-	is($result, qq(4),
-		"wal_level = $wal_level, TRUNCATE COPY with TRUNCATE triggers");
-
-	# Test redo of temp table creation.
-	$node->safe_psql(
-		'postgres', "
-		CREATE TEMP TABLE temp (id serial PRIMARY KEY, id2 text);");
-	$node->stop('immediate');
-	$node->start;
-	check_orphan_relfilenodes($node,
-		"wal_level = $wal_level, no orphan relfilenode remains");
-
-	return;
-}
-
-# Run same test suite for multiple wal_level values.
-run_wal_optimize("minimal");
-run_wal_optimize("replica");
-- 
2.24.3 (Apple Git-128)

#18Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#17)
Re: using an end-of-recovery record in all cases

On Mon, Aug 9, 2021 at 3:00 PM Robert Haas <robertmhaas@gmail.com> wrote:

I decided to try writing a patch to use an end-of-recovery record
rather than a checkpoint record in all cases.

The first problem I hit was that GetRunningTransactionData() does
Assert(TransactionIdIsNormal(CurrentRunningXacts->latestCompletedXid)).

Unfortunately we can't just relax the assertion, because the
XLOG_RUNNING_XACTS record will eventually be handed to
ProcArrayApplyRecoveryInfo() for processing ... and that function
contains a matching assertion which would in turn fail. It in turn
passes the value to MaintainLatestCompletedXidRecovery() which
contains yet another matching assertion, so the restriction to normal
XIDs here looks pretty deliberate. There are no comments, though, so
the reader is left to guess why. I see one problem:
MaintainLatestCompletedXidRecovery uses FullXidRelativeTo, which
expects a normal XID. Perhaps it's best to just dodge the entire issue
by skipping LogStandbySnapshot() if latestCompletedXid happens to be
2, but that feels like a hack, because AFAICS the real problem is that
StartupXLog() doesn't agree with the rest of the code on whether 2 is
a legal case, and maybe we ought to be storing a value that doesn't
need to be computed via TransactionIdRetreat().
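
To make the failure mode concrete, here is a minimal standalone sketch of the arithmetic; the TransactionId typedef, the constants, and the macro are local copies mirroring transam.h, and the program itself is only an illustration, not code from any patch in this thread:

/* Minimal standalone sketch; constants mirror src/include/access/transam.h. */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t TransactionId;

#define InvalidTransactionId     ((TransactionId) 0)
#define BootstrapTransactionId   ((TransactionId) 1)
#define FrozenTransactionId      ((TransactionId) 2)
#define FirstNormalTransactionId ((TransactionId) 3)
#define TransactionIdIsNormal(x) ((x) >= FirstNormalTransactionId)

int
main(void)
{
	/* The first-ever checkpoint record carries nextXid = 3. */
	TransactionId nextXid = FirstNormalTransactionId;

	/* StartupXLOG initializes latestCompletedXid to nextXid - 1 ... */
	TransactionId latestCompletedXid = nextXid - 1;

	/* ... which is FrozenTransactionId (2), not a normal XID. */
	printf("latestCompletedXid = %u, normal = %d\n",
		   latestCompletedXid,
		   TransactionIdIsNormal(latestCompletedXid) ? 1 : 0);

	/* This is the kind of check that trips in GetRunningTransactionData(). */
	assert(TransactionIdIsNormal(latestCompletedXid));
	return 0;
}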

Anyone have any thoughts about this?

--
Robert Haas
EDB: http://www.enterprisedb.com

#19Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Robert Haas (#18)
Re: using an end-of-recovery record in all cases

At Thu, 2 Sep 2021 11:30:59 -0400, Robert Haas <robertmhaas@gmail.com> wrote in

On Mon, Aug 9, 2021 at 3:00 PM Robert Haas <robertmhaas@gmail.com> wrote:

I decided to try writing a patch to use an end-of-recovery record
rather than a checkpoint record in all cases.

The first problem I hit was that GetRunningTransactionData() does
Assert(TransactionIdIsNormal(CurrentRunningXacts->latestCompletedXid)).

Unfortunately we can't just relax the assertion, because the
XLOG_RUNNING_XACTS record will eventually be handed to
ProcArrayApplyRecoveryInfo() for processing ... and that function
contains a matching assertion which would in turn fail. It in turn
passes the value to MaintainLatestCompletedXidRecovery() which
contains yet another matching assertion, so the restriction to normal
XIDs here looks pretty deliberate. There are no comments, though, so
the reader is left to guess why. I see one problem:
MaintainLatestCompletedXidRecovery uses FullXidRelativeTo, which
expects a normal XID. Perhaps it's best to just dodge the entire issue
by skipping LogStandbySnapshot() if latestCompletedXid happens to be
2, but that feels like a hack, because AFAICS the real problem is that
StartupXLog() doesn't agree with the rest of the code on whether 2 is
a legal case, and maybe we ought to be storing a value that doesn't
need to be computed via TransactionIdRetreat().

Anyone have any thoughts about this?

I tried to reproduce this, but just replacing the end-of-recovery
checkpoint request with issuing an end-of-recovery record didn't cause
make check-world to fail for me. Do you have an idea of any other
requirement needed to cause that?

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#20Amul Sul
sulamul@gmail.com
In reply to: Kyotaro Horiguchi (#19)
Re: using an end-of-recovery record in all cases

On Fri, Sep 3, 2021 at 10:23 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

At Thu, 2 Sep 2021 11:30:59 -0400, Robert Haas <robertmhaas@gmail.com> wrote in

On Mon, Aug 9, 2021 at 3:00 PM Robert Haas <robertmhaas@gmail.com> wrote:

I decided to try writing a patch to use an end-of-recovery record
rather than a checkpoint record in all cases.

The first problem I hit was that GetRunningTransactionData() does
Assert(TransactionIdIsNormal(CurrentRunningXacts->latestCompletedXid)).

Unfortunately we can't just relax the assertion, because the
XLOG_RUNNING_XACTS record will eventually be handed to
ProcArrayApplyRecoveryInfo() for processing ... and that function
contains a matching assertion which would in turn fail. It in turn
passes the value to MaintainLatestCompletedXidRecovery() which
contains yet another matching assertion, so the restriction to normal
XIDs here looks pretty deliberate. There are no comments, though, so
the reader is left to guess why. I see one problem:
MaintainLatestCompletedXidRecovery uses FullXidRelativeTo, which
expects a normal XID. Perhaps it's best to just dodge the entire issue
by skipping LogStandbySnapshot() if latestCompletedXid happens to be
2, but that feels like a hack, because AFAICS the real problem is that
StartupXLog() doesn't agree with the rest of the code on whether 2 is
a legal case, and maybe we ought to be storing a value that doesn't
need to be computed via TransactionIdRetreat().

Anyone have any thoughts about this?

I tried to reproduce this, but just replacing the end-of-recovery
checkpoint request with issuing an end-of-recovery record didn't cause
make check-world to fail for me. Do you have an idea of any other
requirement needed to cause that?

You might need the following change at the end of StartupXLOG():

- if (promoted)
- RequestCheckpoint(CHECKPOINT_FORCE);
+ RequestCheckpoint(CHECKPOINT_FORCE);

Regards,
Amul

#21Robert Haas
robertmhaas@gmail.com
In reply to: Kyotaro Horiguchi (#19)
Re: using an end-of-recovery record in all cases

On Fri, Sep 3, 2021 at 12:52 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

I tried to reproduce this, but just replacing the end-of-recovery
checkpoint request with issuing an end-of-recovery record didn't cause
make check-world to fail for me. Do you have an idea of any other
requirement needed to cause that?

Did you use the patch I posted, or a different one?

--
Robert Haas
EDB: http://www.enterprisedb.com

#22Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#17)
Re: using an end-of-recovery record in all cases

On Tue, Aug 10, 2021 at 12:31 AM Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Jul 29, 2021 at 11:49 AM Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Jul 28, 2021 at 3:28 PM Andres Freund <andres@anarazel.de> wrote:

Imagine if instead of
all the hairy logic we have now we just replaced this whole if
(IsInRecovery) stanza with this:

if (InRecovery)
CreateEndOfRecoveryRecord();

That would be WAY easier to reason about than the rat's nest we have
here today. Now, I am not sure what it would take to get there, but I
think that is the direction we ought to be heading.

What are we going to do in the single user ([1]) case in this awesome future?
I guess we could just not create a checkpoint until single user mode is shut
down / creates a checkpoint for other reasons?

It probably depends on who writes this thus-far-hypothetical patch,
but my thought is that we'd unconditionally request a checkpoint after
writing the end-of-recovery record, same as we do now if (promoted).
If we happened to be in single-user mode, then that checkpoint request
would turn into performing a checkpoint on the spot in the one and
only process we've got, but StartupXLOG() wouldn't really need to care
what happens under the hood after it called RequestCheckpoint().

I decided to try writing a patch to use an end-of-recovery record
rather than a checkpoint record in all cases. The patch itself was
pretty simple but didn't work. There are two different reasons why it
didn't work, which I'll explain in a minute. I'm not sure whether
there are any other problems; these are the only two things that cause
problems with 'make check-world', but that's hardly a guarantee of
anything. Anyway, I thought it would be useful to report these issues
first and hopefully get some feedback.

The first problem I hit was that GetRunningTransactionData() does
Assert(TransactionIdIsNormal(CurrentRunningXacts->latestCompletedXid)).
While that does seem like a superficially reasonable thing to assert,
StartupXLOG() initializes latestCompletedXid by calling
TransactionIdRetreat on the value extracted from checkPoint.nextXid,
and the first-ever checkpoint that is written at the beginning of WAL
has a nextXid of 3, so when starting up from that checkpoint nextXid
is 2, which is not a normal XID. When we try to create the next
checkpoint, CreateCheckPoint() does LogStandbySnapshot() which calls
GetRunningTransactionData() and the assertion fails. In the current
code, we avoid this more or less accidentally because
LogStandbySnapshot() is skipped when starting from a shutdown
checkpoint. If we write an end-of-recovery record and then trigger a
checkpoint afterwards, then we don't avoid the problem. Although I'm
impishly tempted to suggest that we #define SecondNormalTransactionId
4 and then use that to create the first checkpoint instead of
FirstNormalTransactionId -- naturally with no comments explaining why
we're doing it -- I think the real problem is that the assertion is
just wrong. CurrentRunningXacts->latestCompletedXid shouldn't be
InvalidTransactionId or BootstrapTransactionId, but
FrozenTransactionId is a legal, if corner-case, value, at least as the
code stands today.

Unfortunately we can't just relax the assertion, because the
XLOG_RUNNING_XACTS record will eventually be handed to
ProcArrayApplyRecoveryInfo() for processing ... and that function
contains a matching assertion which would in turn fail. It in turn
passes the value to MaintainLatestCompletedXidRecovery() which
contains yet another matching assertion, so the restriction to normal
XIDs here looks pretty deliberate. There are no comments, though, so
the reader is left to guess why. I see one problem:
MaintainLatestCompletedXidRecovery uses FullXidRelativeTo, which
expects a normal XID. Perhaps it's best to just dodge the entire issue
by skipping LogStandbySnapshot() if latestCompletedXid happens to be
2,

From the above explanation, it seems like it would be better to skip
LogStandbySnapshot() for this proposal. Can't we consider skipping
LogStandbySnapshot() by passing some explicit flag, like
'recovery_checkpoint' or something along those lines?
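
A rough standalone sketch of that idea follows; the CHECKPOINT_RECOVERY name and the stub functions are hypothetical, invented only for illustration, and real code would thread the flag through CreateCheckPoint()'s existing flags bitmask instead:

#include <stdio.h>

/* The FORCE value matches the existing xlog.h definition; the RECOVERY bit
 * is an invented name for this sketch only. */
#define CHECKPOINT_FORCE     0x0008
#define CHECKPOINT_RECOVERY  0x0200

/* Stub standing in for LogStandbySnapshot(). */
static void
log_standby_snapshot(void)
{
	printf("standby snapshot logged\n");
}

/* Stub standing in for CreateCheckPoint(): skip the snapshot when this is
 * the checkpoint requested right after recovery, where latestCompletedXid
 * may still be FrozenTransactionId. */
static void
create_checkpoint(int flags)
{
	if ((flags & CHECKPOINT_RECOVERY) == 0)
		log_standby_snapshot();
	printf("checkpoint done, flags = 0x%x\n", flags);
}

int
main(void)
{
	create_checkpoint(CHECKPOINT_FORCE | CHECKPOINT_RECOVERY); /* after recovery */
	create_checkpoint(CHECKPOINT_FORCE);                       /* normal case */
	return 0;
}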

I think there is some prior discussion related to this work in another
thread, but it would be easier to follow if you could summarize it
here.

With Regards,
Amit Kapila.

#23Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Robert Haas (#18)
Re: using an end-of-recovery record in all cases

At Fri, 3 Sep 2021 15:56:35 +0530, Amul Sul <sulamul@gmail.com> wrote in

On Fri, Sep 3, 2021 at 10:23 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
You might need the following change at the end of StartupXLOG():

- if (promoted)
- RequestCheckpoint(CHECKPOINT_FORCE);
+ RequestCheckpoint(CHECKPOINT_FORCE);

At Fri, 3 Sep 2021 10:13:53 -0400, Robert Haas <robertmhaas@gmail.com> wrote in

Did you use the patch I posted, or a different one?

Thanks to both of you. That was my mistake.

But even with the patch (with the 018_..pl test file removed), I
didn't get check-world to fail. I'll retry later.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#24Amul Sul
sulamul@gmail.com
In reply to: Kyotaro Horiguchi (#23)
1 attachment(s)
Re: using an end-of-recovery record in all cases

I was trying to understand the v1 patch and found that at the end
RequestCheckpoint() is called unconditionally. I think it should be
called only if REDO has been performed; here is the relevant snippet
from the v1 patch:

  /*
- * If this was a promotion, request an (online) checkpoint now. This isn't
- * required for consistency, but the last restartpoint might be far back,
- * and in case of a crash, recovering from it might take a longer than is
- * appropriate now that we're not in standby mode anymore.
+ * Request an (online) checkpoint now. Note that, until this is complete,
+ * a crash would start replay from the same WAL location we did, or from
+ * the last restartpoint that completed. We don't want to let that
+ * situation persist for longer than necessary, since users don't like
+ * long recovery times. On the other hand, they also want to be able to
+ * start doing useful work again as quickly as possible. Therefore, we
+ * don't pass CHECKPOINT_IMMEDIATE to avoid bogging down the system.
+ *
+ * Note that the consequence of requesting a checkpoint here only after
+ * we've allowed WAL writes is that a single checkpoint cycle can span
+ * multiple server lifetimes. So for example if you want something to
+ * happen at least once per checkpoint cycle or at most once per
+ * checkpoint cycle, you have to consider what happens if the server
+ * is restarted someplace in the middle.
  */
- if (promoted)
- RequestCheckpoint(CHECKPOINT_FORCE);
+ RequestCheckpoint(CHECKPOINT_FORCE);

When I call it conditionally, as in the attached patch, I don't see
any regression failures; correct me if I am missing something here.

Regards,
Amul

Attachments:

trial.patch (application/x-patch)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index eddb13d13a7..6d48a1047c6 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6549,7 +6549,7 @@ StartupXLOG(void)
 	DBState		dbstate_at_startup;
 	XLogReaderState *xlogreader;
 	XLogPageReadPrivate private;
-	bool		promoted = false;
+	bool		written_end_of_recovery = false;
 	struct stat st;
 
 	/*
@@ -7955,45 +7955,9 @@ StartupXLOG(void)
 
 	if (InRecovery)
 	{
-		/*
-		 * Perform a checkpoint to update all our recovery activity to disk.
-		 *
-		 * Note that we write a shutdown checkpoint rather than an on-line
-		 * one. This is not particularly critical, but since we may be
-		 * assigning a new TLI, using a shutdown checkpoint allows us to have
-		 * the rule that TLI only changes in shutdown checkpoints, which
-		 * allows some extra error checking in xlog_redo.
-		 *
-		 * In promotion, only create a lightweight end-of-recovery record
-		 * instead of a full checkpoint. A checkpoint is requested later,
-		 * after we're fully out of recovery mode and already accepting
-		 * queries.
-		 */
-		if (ArchiveRecoveryRequested && IsUnderPostmaster &&
-			LocalPromoteIsTriggered)
-		{
-			promoted = true;
-
-			/*
-			 * Insert a special WAL record to mark the end of recovery, since
-			 * we aren't doing a checkpoint. That means that the checkpointer
-			 * process may likely be in the middle of a time-smoothed
-			 * restartpoint and could continue to be for minutes after this.
-			 * That sounds strange, but the effect is roughly the same and it
-			 * would be stranger to try to come out of the restartpoint and
-			 * then checkpoint. We request a checkpoint later anyway, just for
-			 * safety.
-			 */
-			CreateEndOfRecoveryRecord();
-		}
-		else
-		{
-			RequestCheckpoint(CHECKPOINT_END_OF_RECOVERY |
-							  CHECKPOINT_IMMEDIATE |
-							  CHECKPOINT_WAIT);
-		}
+		CreateEndOfRecoveryRecord();
+		written_end_of_recovery = true;
 	}
-
 	if (ArchiveRecoveryRequested)
 	{
 		/*
@@ -8177,12 +8141,9 @@ StartupXLOG(void)
 	WalSndWakeup();
 
 	/*
-	 * If this was a promotion, request an (online) checkpoint now. This isn't
-	 * required for consistency, but the last restartpoint might be far back,
-	 * and in case of a crash, recovering from it might take a longer than is
-	 * appropriate now that we're not in standby mode anymore.
+	 * TODO
 	 */
-	if (promoted)
+	if (written_end_of_recovery)
 		RequestCheckpoint(CHECKPOINT_FORCE);
 }
 
#25Robert Haas
robertmhaas@gmail.com
In reply to: Amul Sul (#24)
Re: using an end-of-recovery record in all cases

On Tue, Oct 5, 2021 at 7:44 AM Amul Sul <sulamul@gmail.com> wrote:

I was trying to understand the v1 patch and found that at the end
RequestCheckpoint() is called unconditionally. I think it should be
called only if REDO has been performed,

You're right. But I don't think we need an extra variable like this,
right? We can just test InRecovery?

--
Robert Haas
EDB: http://www.enterprisedb.com

#26Amul Sul
sulamul@gmail.com
In reply to: Robert Haas (#25)
Re: using an end-of-recovery record in all cases

On Tue, 5 Oct 2021 at 9:04 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Oct 5, 2021 at 7:44 AM Amul Sul <sulamul@gmail.com> wrote:

I was trying to understand the v1 patch and found that at the end
RequestCheckpoint() is called unconditionally. I think it should be
called only if REDO has been performed,

You're right. But I don't think we need an extra variable like this,
right? We can just test InRecovery?

No, the InRecovery flag gets cleared before this point. I think we can
use lastReplayedEndRecPtr, which you suggested in another thread.

Regards,
Amul
--
Regards,
Amul Sul
EDB: http://www.enterprisedb.com

#27Robert Haas
robertmhaas@gmail.com
In reply to: Amul Sul (#26)
Re: using an end-of-recovery record in all cases

On Tue, Oct 5, 2021 at 12:41 PM Amul Sul <sulamul@gmail.com> wrote:

No, the InRecovery flag gets cleared before this point. I think we can use lastReplayedEndRecPtr, which you suggested in another thread.

Hmm, right, that makes sense. Perhaps I should start remembering what
I said in my own emails.

--
Robert Haas
EDB: http://www.enterprisedb.com

#28Amul Sul
sulamul@gmail.com
In reply to: Robert Haas (#27)
1 attachment(s)
Re: using an end-of-recovery record in all cases

On Tue, Oct 5, 2021 at 10:42 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Oct 5, 2021 at 12:41 PM Amul Sul <sulamul@gmail.com> wrote:

No, the InRecovery flag gets cleared before this point. I think we can use lastReplayedEndRecPtr, which you suggested in another thread.

Hmm, right, that makes sense. Perhaps I should start remembering what
I said in my own emails.

Here I ended up with the attached version, in which I have dropped the
changes for the standby.c and 018_wal_optimize.pl files. Also, I am not
sure that we should have the changes for bgwriter.c and slot.c in this
patch, but I haven't touched those.

Regards,
Amul

Attachments:

v2-0001-Always-use-an-end-of-recovery-record-rather-than-.patch (application/x-patch)
From 8afbd58573f744ef742969575ecf4de7ec5d915d Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 6 Oct 2021 09:38:02 -0400
Subject: [PATCH v2] Always use an end-of-recovery record rather than a
 checkpoint.

Insert awful hack to allow LogStandbySnapshot() to not log a
standby snapshot, because it's unprepared to handle the case
where latestCompletedXid is 2, which can happen when starting
from the very first ever checkpoint.

Nuke 018_wal_optimize.pl. Otherwise, the "wal_level = minimal, SET
TABLESPACE commit subtransaction" test fails. The reason for the
failure is that CREATE TABLESPACE nukes the entire directory and
then recreates it, which is a really bad thing to do when
wal_level=minimal. Without the changes made by this patch the
problem is masked because each server restart necessarily
involves a checkpoint, so the CREATE TABLESPACE and the creation
of a table in that tablespace don't manage to happen in the same
checkpoint cycle.

This is not to be taken seriously in its current form; it's just
an illustration of (some of?) the problems that would need to be
solved to make this kind of change viable.

Robert Haas, with modifications by Amul Sul.
---
 src/backend/access/transam/xlog.c | 72 ++++++++++++-------------------
 src/backend/postmaster/bgwriter.c |  1 +
 src/backend/replication/slot.c    |  1 +
 3 files changed, 29 insertions(+), 45 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 26dcc00ac01..112de230a0d 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6549,7 +6549,6 @@ StartupXLOG(void)
 	DBState		dbstate_at_startup;
 	XLogReaderState *xlogreader;
 	XLogPageReadPrivate private;
-	bool		promoted = false;
 	struct stat st;
 
 	/*
@@ -7955,43 +7954,8 @@ StartupXLOG(void)
 
 	if (InRecovery)
 	{
-		/*
-		 * Perform a checkpoint to update all our recovery activity to disk.
-		 *
-		 * Note that we write a shutdown checkpoint rather than an on-line
-		 * one. This is not particularly critical, but since we may be
-		 * assigning a new TLI, using a shutdown checkpoint allows us to have
-		 * the rule that TLI only changes in shutdown checkpoints, which
-		 * allows some extra error checking in xlog_redo.
-		 *
-		 * In promotion, only create a lightweight end-of-recovery record
-		 * instead of a full checkpoint. A checkpoint is requested later,
-		 * after we're fully out of recovery mode and already accepting
-		 * queries.
-		 */
-		if (ArchiveRecoveryRequested && IsUnderPostmaster &&
-			LocalPromoteIsTriggered)
-		{
-			promoted = true;
-
-			/*
-			 * Insert a special WAL record to mark the end of recovery, since
-			 * we aren't doing a checkpoint. That means that the checkpointer
-			 * process may likely be in the middle of a time-smoothed
-			 * restartpoint and could continue to be for minutes after this.
-			 * That sounds strange, but the effect is roughly the same and it
-			 * would be stranger to try to come out of the restartpoint and
-			 * then checkpoint. We request a checkpoint later anyway, just for
-			 * safety.
-			 */
-			CreateEndOfRecoveryRecord();
-		}
-		else
-		{
-			RequestCheckpoint(CHECKPOINT_END_OF_RECOVERY |
-							  CHECKPOINT_IMMEDIATE |
-							  CHECKPOINT_WAIT);
-		}
+		/* Insert a special WAL record to mark the end of recovery. */
+		CreateEndOfRecoveryRecord();
 	}
 
 	if (ArchiveRecoveryRequested)
@@ -8177,12 +8141,30 @@ StartupXLOG(void)
 	WalSndWakeup();
 
 	/*
-	 * If this was a promotion, request an (online) checkpoint now. This isn't
-	 * required for consistency, but the last restartpoint might be far back,
-	 * and in case of a crash, recovering from it might take a longer than is
-	 * appropriate now that we're not in standby mode anymore.
-	 */
-	if (promoted)
+	 * Request an (online) checkpoint now. Note that, until this is complete,
+	 * a crash would start replay from the same WAL location we did, or from
+	 * the last restartpoint that completed. We don't want to let that
+	 * situation persist for longer than necessary, since users don't like
+	 * long recovery times. On the other hand, they also want to be able to
+	 * start doing useful work again as quickly as possible. Therefore, we
+	 * don't pass CHECKPOINT_IMMEDIATE to avoid bogging down the system.
+	 *
+	 * Note that the consequence of requesting a checkpoint here only after
+	 * we've allowed WAL writes is that a single checkpoint cycle can span
+	 * multiple server lifetimes. So for example if you want something to
+	 * happen at least once per checkpoint cycle or at most once per
+	 * checkpoint cycle, you have to consider what happens if the server
+	 * is restarted someplace in the middle.
+	 *
+	 * NB: We generally use the InRecovery flag both to perform the redo and
+	 * to perform the follow-up operations needed after the redo. The
+	 * following checkpoint request is one such operation, and is needed only
+	 * if redo was actually performed. But we cannot check the InRecovery
+	 * flag here because it has already been cleared; instead, we can still
+	 * tell whether redo was performed by checking lastReplayedEndRecPtr,
+	 * which gets set only when redo happens.
+	 */
+	if (!XLogRecPtrIsInvalid(XLogCtl->lastReplayedEndRecPtr))
 		RequestCheckpoint(CHECKPOINT_FORCE);
 }
 
@@ -8998,7 +8980,7 @@ CreateCheckPoint(int flags)
 		shutdown = false;
 
 	/* sanity check */
-	if (RecoveryInProgress() && (flags & CHECKPOINT_END_OF_RECOVERY) == 0)
+	if (RecoveryInProgress())
 		elog(ERROR, "can't create a checkpoint during recovery");
 
 	/*
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 5584f4bc241..3eeacad50aa 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -294,6 +294,7 @@ BackgroundWriterMain(void)
 				last_snapshot_lsn <= GetLastImportantRecPtr())
 			{
 				last_snapshot_lsn = LogStandbySnapshot();
+				Assert(!XLogRecPtrIsInvalid(last_snapshot_lsn));
 				last_snapshot_ts = now;
 			}
 		}
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 1c6c0c7ce27..0d9b1bbe293 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1120,6 +1120,7 @@ ReplicationSlotReserveWal(void)
 
 			/* make sure we have enough information to start */
 			flushptr = LogStandbySnapshot();
+			Assert(!XLogRecPtrIsInvalid(flushptr));
 
 			/* and make sure it's fsynced to disk */
 			XLogFlush(flushptr);
-- 
2.18.0

#29Amul Sul
sulamul@gmail.com
In reply to: Amul Sul (#28)
1 attachment(s)
Re: using an end-of-recovery record in all cases

On Wed, Oct 6, 2021 at 7:24 PM Amul Sul <sulamul@gmail.com> wrote:

On Tue, Oct 5, 2021 at 10:42 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Oct 5, 2021 at 12:41 PM Amul Sul <sulamul@gmail.com> wrote:

No, InRecovery flag get cleared before this point. I think, we can use lastReplayedEndRecPtr what you have suggested in other thread.

Hmm, right, that makes sense. Perhaps I should start remembering what
I said in my own emails.

Here I ended up with the attached version, in which I have dropped the
changes for the standby.c and 018_wal_optimize.pl files. Also, I am not
sure that we should have the changes for bgwriter.c and slot.c in this
patch, but I haven't touched those.

Attached is the rebased and updated version. The patch removes the
newly introduced PerformRecoveryXLogAction() function. In addition to
that, it removes the CHECKPOINT_END_OF_RECOVERY flag and its related
code. It also drops the changes for bgwriter.c and slot.c in this
patch, which seem unrelated to this work.

Regards,
Amul

Attachments:

v3-0001-Always-use-an-end-of-recovery-record-rather-than-.patch (application/x-patch)
From 7cf4d270ef8649426fb16f2594a7ed7dcf8cfa5d Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Fri, 15 Oct 2021 10:38:55 -0400
Subject: [PATCH v3] Always use an end-of-recovery record rather than a
 checkpoint.

Robert Haas, with modifications by Amul Sul.
---
 src/backend/access/transam/xlog.c     | 144 ++++++--------------------
 src/backend/postmaster/checkpointer.c |  11 +-
 src/backend/storage/buffer/bufmgr.c   |  14 ++-
 src/include/access/xlog.h             |  16 ++-
 4 files changed, 47 insertions(+), 138 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 62862255fca..c6543f35da4 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -939,7 +939,6 @@ static void UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force);
 static XLogRecord *ReadRecord(XLogReaderState *xlogreader,
 							  int emode, bool fetching_ckpt);
 static void CheckRecoveryConsistency(void);
-static bool PerformRecoveryXLogAction(void);
 static XLogRecord *ReadCheckpointRecord(XLogReaderState *xlogreader,
 										XLogRecPtr RecPtr, int whichChkpt, bool report);
 static bool rescanLatestTimeLine(void);
@@ -6634,7 +6633,6 @@ StartupXLOG(void)
 	DBState		dbstate_at_startup;
 	XLogReaderState *xlogreader;
 	XLogPageReadPrivate private;
-	bool		promoted = false;
 	struct stat st;
 
 	/*
@@ -8083,7 +8081,7 @@ StartupXLOG(void)
 	LocalXLogInsertAllowed = -1;
 
 	/*
-	 * Emit checkpoint or end-of-recovery record in XLOG, if required.
+	 * Insert a special WAL record to mark the end of recovery.
 	 *
 	 * XLogCtl->lastReplayedEndRecPtr will be a valid LSN if and only if we
 	 * entered recovery. Even if we ultimately replayed no WAL records, it will
@@ -8092,7 +8090,7 @@ StartupXLOG(void)
 	 * we reach this code.
 	 */
 	if (!XLogRecPtrIsInvalid(XLogCtl->lastReplayedEndRecPtr))
-		promoted = PerformRecoveryXLogAction();
+		CreateEndOfRecoveryRecord();
 
 	/* If this is archive recovery, perform post-recovery cleanup actions. */
 	if (ArchiveRecoveryRequested)
@@ -8156,12 +8154,22 @@ StartupXLOG(void)
 	WalSndWakeup();
 
 	/*
-	 * If this was a promotion, request an (online) checkpoint now. This isn't
-	 * required for consistency, but the last restartpoint might be far back,
-	 * and in case of a crash, recovering from it might take a longer than is
-	 * appropriate now that we're not in standby mode anymore.
+	 * Request an (online) checkpoint now. Note that, until this is complete,
+	 * a crash would start replay from the same WAL location we did, or from
+	 * the last restartpoint that completed. We don't want to let that
+	 * situation persist for longer than necessary, since users don't like
+	 * long recovery times. On the other hand, they also want to be able to
+	 * start doing useful work again as quickly as possible. Therefore, we
+	 * don't pass CHECKPOINT_IMMEDIATE to avoid bogging down the system.
+	 *
+	 * Note that the consequence of requesting a checkpoint here only after
+	 * we've allowed WAL writes is that a single checkpoint cycle can span
+	 * multiple server lifetimes. So for example if you want something to
+	 * happen at least once per checkpoint cycle or at most once per
+	 * checkpoint cycle, you have to consider what happens if the server
+	 * is restarted someplace in the middle.
 	 */
-	if (promoted)
+	if (!XLogRecPtrIsInvalid(XLogCtl->lastReplayedEndRecPtr))
 		RequestCheckpoint(CHECKPOINT_FORCE);
 }
 
@@ -8261,60 +8269,6 @@ CheckRecoveryConsistency(void)
 	}
 }
 
-/*
- * Perform whatever XLOG actions are necessary at end of REDO.
- *
- * The goal here is to make sure that we'll be able to recover properly if
- * we crash again. If we choose to write a checkpoint, we'll write a shutdown
- * checkpoint rather than an on-line one. This is not particularly critical,
- * but since we may be assigning a new TLI, using a shutdown checkpoint allows
- * us to have the rule that TLI only changes in shutdown checkpoints, which
- * allows some extra error checking in xlog_redo.
- */
-static bool
-PerformRecoveryXLogAction(void)
-{
-	bool		promoted = false;
-
-	/*
-	 * Perform a checkpoint to update all our recovery activity to disk.
-	 *
-	 * Note that we write a shutdown checkpoint rather than an on-line one. This
-	 * is not particularly critical, but since we may be assigning a new TLI,
-	 * using a shutdown checkpoint allows us to have the rule that TLI only
-	 * changes in shutdown checkpoints, which allows some extra error checking
-	 * in xlog_redo.
-	 *
-	 * In promotion, only create a lightweight end-of-recovery record instead of
-	 * a full checkpoint. A checkpoint is requested later, after we're fully out
-	 * of recovery mode and already accepting queries.
-	 */
-	if (ArchiveRecoveryRequested && IsUnderPostmaster &&
-		LocalPromoteIsTriggered)
-	{
-		promoted = true;
-
-		/*
-		 * Insert a special WAL record to mark the end of recovery, since we
-		 * aren't doing a checkpoint. That means that the checkpointer process
-		 * may likely be in the middle of a time-smoothed restartpoint and could
-		 * continue to be for minutes after this.  That sounds strange, but the
-		 * effect is roughly the same and it would be stranger to try to come
-		 * out of the restartpoint and then checkpoint. We request a checkpoint
-		 * later anyway, just for safety.
-		 */
-		CreateEndOfRecoveryRecord();
-	}
-	else
-	{
-		RequestCheckpoint(CHECKPOINT_END_OF_RECOVERY |
-						  CHECKPOINT_IMMEDIATE |
-						  CHECKPOINT_WAIT);
-	}
-
-	return promoted;
-}
-
 /*
  * Is the system still in recovery?
  *
@@ -8793,9 +8747,8 @@ LogCheckpointStart(int flags, bool restartpoint)
 	if (restartpoint)
 		ereport(LOG,
 		/* translator: the placeholders show checkpoint options */
-				(errmsg("restartpoint starting:%s%s%s%s%s%s%s%s",
+				(errmsg("restartpoint starting:%s%s%s%s%s%s%s",
 						(flags & CHECKPOINT_IS_SHUTDOWN) ? " shutdown" : "",
-						(flags & CHECKPOINT_END_OF_RECOVERY) ? " end-of-recovery" : "",
 						(flags & CHECKPOINT_IMMEDIATE) ? " immediate" : "",
 						(flags & CHECKPOINT_FORCE) ? " force" : "",
 						(flags & CHECKPOINT_WAIT) ? " wait" : "",
@@ -8805,9 +8758,8 @@ LogCheckpointStart(int flags, bool restartpoint)
 	else
 		ereport(LOG,
 		/* translator: the placeholders show checkpoint options */
-				(errmsg("checkpoint starting:%s%s%s%s%s%s%s%s",
+				(errmsg("checkpoint starting:%s%s%s%s%s%s%s",
 						(flags & CHECKPOINT_IS_SHUTDOWN) ? " shutdown" : "",
-						(flags & CHECKPOINT_END_OF_RECOVERY) ? " end-of-recovery" : "",
 						(flags & CHECKPOINT_IMMEDIATE) ? " immediate" : "",
 						(flags & CHECKPOINT_FORCE) ? " force" : "",
 						(flags & CHECKPOINT_WAIT) ? " wait" : "",
@@ -8953,13 +8905,12 @@ static void
 update_checkpoint_display(int flags, bool restartpoint, bool reset)
 {
 	/*
-	 * The status is reported only for end-of-recovery and shutdown
-	 * checkpoints or shutdown restartpoints.  Updating the ps display is
-	 * useful in those situations as it may not be possible to rely on
-	 * pg_stat_activity to see the status of the checkpointer or the startup
-	 * process.
+	 * The status is reported only for shutdown checkpoints or shutdown
+	 * restartpoints.  Updating the ps display is useful in those situations as
+	 * it may not be possible to rely on pg_stat_activity to see the status of
+	 * the checkpointer or the startup process.
 	 */
-	if ((flags & (CHECKPOINT_END_OF_RECOVERY | CHECKPOINT_IS_SHUTDOWN)) == 0)
+	if ((flags & CHECKPOINT_IS_SHUTDOWN) == 0)
 		return;
 
 	if (reset)
@@ -8968,8 +8919,7 @@ update_checkpoint_display(int flags, bool restartpoint, bool reset)
 	{
 		char		activitymsg[128];
 
-		snprintf(activitymsg, sizeof(activitymsg), "performing %s%s%s",
-				 (flags & CHECKPOINT_END_OF_RECOVERY) ? "end-of-recovery " : "",
+		snprintf(activitymsg, sizeof(activitymsg), "performing %s%s",
 				 (flags & CHECKPOINT_IS_SHUTDOWN) ? "shutdown " : "",
 				 restartpoint ? "restartpoint" : "checkpoint");
 		set_ps_display(activitymsg);
@@ -8982,12 +8932,10 @@ update_checkpoint_display(int flags, bool restartpoint, bool reset)
  *
  * flags is a bitwise OR of the following:
  *	CHECKPOINT_IS_SHUTDOWN: checkpoint is for database shutdown.
- *	CHECKPOINT_END_OF_RECOVERY: checkpoint is for end of WAL recovery.
  *	CHECKPOINT_IMMEDIATE: finish the checkpoint ASAP,
  *		ignoring checkpoint_completion_target parameter.
  *	CHECKPOINT_FORCE: force a checkpoint even if no XLOG activity has occurred
- *		since the last one (implied by CHECKPOINT_IS_SHUTDOWN or
- *		CHECKPOINT_END_OF_RECOVERY).
+ *		since the last one (implied by CHECKPOINT_IS_SHUTDOWN).
  *	CHECKPOINT_FLUSH_ALL: also flush buffers of unlogged tables.
  *
  * Note: flags contains other bits, of interest here only for logging purposes.
@@ -9021,17 +8969,11 @@ CreateCheckPoint(int flags)
 	VirtualTransactionId *vxids;
 	int			nvxids;
 
-	/*
-	 * An end-of-recovery checkpoint is really a shutdown checkpoint, just
-	 * issued at a different time.
-	 */
-	if (flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY))
-		shutdown = true;
-	else
-		shutdown = false;
+	/* Is shutdown checkpoint? */
+	shutdown = ((flags & CHECKPOINT_IS_SHUTDOWN) != 0);
 
 	/* sanity check */
-	if (RecoveryInProgress() && (flags & CHECKPOINT_END_OF_RECOVERY) == 0)
+	if (RecoveryInProgress())
 		elog(ERROR, "can't create a checkpoint during recovery");
 
 	/*
@@ -9107,8 +9049,7 @@ CreateCheckPoint(int flags)
 	 * WAL activity requiring a checkpoint, skip it.  The idea here is to
 	 * avoid inserting duplicate checkpoints when the system is idle.
 	 */
-	if ((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY |
-				  CHECKPOINT_FORCE)) == 0)
+	if ((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_FORCE)) == 0)
 	{
 		if (last_important_lsn == ControlFile->checkPoint)
 		{
@@ -9120,20 +9061,8 @@ CreateCheckPoint(int flags)
 		}
 	}
 
-	/*
-	 * An end-of-recovery checkpoint is created before anyone is allowed to
-	 * write WAL. To allow us to write the checkpoint record, temporarily
-	 * enable XLogInsertAllowed.  (This also ensures ThisTimeLineID is
-	 * initialized, which we need here and in AdvanceXLInsertBuffer.)
-	 */
-	if (flags & CHECKPOINT_END_OF_RECOVERY)
-		LocalSetXLogInsertAllowed();
-
 	checkPoint.ThisTimeLineID = ThisTimeLineID;
-	if (flags & CHECKPOINT_END_OF_RECOVERY)
-		checkPoint.PrevTimeLineID = XLogCtl->PrevTimeLineID;
-	else
-		checkPoint.PrevTimeLineID = ThisTimeLineID;
+	checkPoint.PrevTimeLineID = ThisTimeLineID;
 
 	checkPoint.fullPageWrites = Insert->fullPageWrites;
 
@@ -9300,17 +9229,10 @@ CreateCheckPoint(int flags)
 	/*
 	 * We mustn't write any new WAL after a shutdown checkpoint, or it will be
 	 * overwritten at next startup.  No-one should even try, this just allows
-	 * sanity-checking.  In the case of an end-of-recovery checkpoint, we want
-	 * to just temporarily disable writing until the system has exited
-	 * recovery.
+	 * sanity-checking.
 	 */
 	if (shutdown)
-	{
-		if (flags & CHECKPOINT_END_OF_RECOVERY)
-			LocalXLogInsertAllowed = -1;	/* return to "check" state */
-		else
-			LocalXLogInsertAllowed = 0; /* never again write WAL */
-	}
+		LocalXLogInsertAllowed = 0; /* never again write WAL */
 
 	/*
 	 * We now have ProcLastRecPtr = start of actual checkpoint record, recptr
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index be7366379d0..e7b8f5d6c33 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -404,13 +404,6 @@ CheckpointerMain(void)
 
 			ConditionVariableBroadcast(&CheckpointerShmem->start_cv);
 
-			/*
-			 * The end-of-recovery checkpoint is a real checkpoint that's
-			 * performed while we're still in recovery.
-			 */
-			if (flags & CHECKPOINT_END_OF_RECOVERY)
-				do_restartpoint = false;
-
 			/*
 			 * We will warn if (a) too soon since last checkpoint (whatever
 			 * caused it) and (b) somebody set the CHECKPOINT_CAUSE_XLOG flag
@@ -905,12 +898,10 @@ CheckpointerShmemInit(void)
  *
  * flags is a bitwise OR of the following:
  *	CHECKPOINT_IS_SHUTDOWN: checkpoint is for database shutdown.
- *	CHECKPOINT_END_OF_RECOVERY: checkpoint is for end of WAL recovery.
  *	CHECKPOINT_IMMEDIATE: finish the checkpoint ASAP,
  *		ignoring checkpoint_completion_target parameter.
  *	CHECKPOINT_FORCE: force a checkpoint even if no XLOG activity has occurred
- *		since the last one (implied by CHECKPOINT_IS_SHUTDOWN or
- *		CHECKPOINT_END_OF_RECOVERY).
+ *		since the last one (implied by CHECKPOINT_IS_SHUTDOWN)
  *	CHECKPOINT_WAIT: wait for completion before returning (otherwise,
  *		just signal checkpointer to do it, and return).
  *	CHECKPOINT_CAUSE_XLOG: checkpoint is requested due to xlog filling.
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 08ebabfe96a..9aa3d5f9194 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1925,9 +1925,8 @@ UnpinBuffer(BufferDesc *buf, bool fixOwner)
  * This is called at checkpoint time to write out all dirty shared buffers.
  * The checkpoint request flags should be passed in.  If CHECKPOINT_IMMEDIATE
  * is set, we disable delays between writes; if CHECKPOINT_IS_SHUTDOWN,
- * CHECKPOINT_END_OF_RECOVERY or CHECKPOINT_FLUSH_ALL is set, we write even
- * unlogged buffers, which are otherwise skipped.  The remaining flags
- * currently have no effect here.
+ * CHECKPOINT_FLUSH_ALL is set, we write even unlogged buffers, which are
+ * otherwise skipped.  The remaining flags currently have no effect here.
  */
 static void
 BufferSync(int flags)
@@ -1949,12 +1948,11 @@ BufferSync(int flags)
 	ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
 
 	/*
-	 * Unless this is a shutdown checkpoint or we have been explicitly told,
-	 * we write only permanent, dirty buffers.  But at shutdown or end of
-	 * recovery, we write all dirty buffers.
+	 * Unless this is a shutdown checkpoint or we have been explicitly told, we
+	 * write only permanent, dirty buffers.  But at shutdown, we write all dirty
+	 * buffers.
 	 */
-	if (!((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY |
-					CHECKPOINT_FLUSH_ALL))))
+	if (!((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_FLUSH_ALL))))
 		mask |= BM_PERMANENT;
 
 	/*
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 5e2c94a05ff..661473d4669 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -194,18 +194,16 @@ extern bool XLOG_DEBUG;
 
 /* These directly affect the behavior of CreateCheckPoint and subsidiaries */
 #define CHECKPOINT_IS_SHUTDOWN	0x0001	/* Checkpoint is for shutdown */
-#define CHECKPOINT_END_OF_RECOVERY	0x0002	/* Like shutdown checkpoint, but
-											 * issued at end of WAL recovery */
-#define CHECKPOINT_IMMEDIATE	0x0004	/* Do it without delays */
-#define CHECKPOINT_FORCE		0x0008	/* Force even if no activity */
-#define CHECKPOINT_FLUSH_ALL	0x0010	/* Flush all pages, including those
+#define CHECKPOINT_IMMEDIATE	0x0002	/* Do it without delays */
+#define CHECKPOINT_FORCE		0x0004	/* Force even if no activity */
+#define CHECKPOINT_FLUSH_ALL	0x0008	/* Flush all pages, including those
 										 * belonging to unlogged tables */
 /* These are important to RequestCheckpoint */
-#define CHECKPOINT_WAIT			0x0020	/* Wait for completion */
-#define CHECKPOINT_REQUESTED	0x0040	/* Checkpoint request has been made */
+#define CHECKPOINT_WAIT			0x0010	/* Wait for completion */
+#define CHECKPOINT_REQUESTED	0x0020	/* Checkpoint request has been made */
 /* These indicate the cause of a checkpoint request */
-#define CHECKPOINT_CAUSE_XLOG	0x0080	/* XLOG consumption */
-#define CHECKPOINT_CAUSE_TIME	0x0100	/* Elapsed time */
+#define CHECKPOINT_CAUSE_XLOG	0x0040	/* XLOG consumption */
+#define CHECKPOINT_CAUSE_TIME	0x0080	/* Elapsed time */
 
 /*
  * Flag bits for the record being inserted, set using XLogSetRecordFlags().
-- 
2.18.0

#30Julien Rouhaud
rjuju123@gmail.com
In reply to: Amul Sul (#29)
Re: using an end-of-recovery record in all cases

Hi,

On Mon, Oct 18, 2021 at 10:56:53AM +0530, Amul Sul wrote:

Attached is the rebased and updated version. The patch removes the
newly introduced PerformRecoveryXLogAction() function. In addition to
that, it removes the CHECKPOINT_END_OF_RECOVERY flag and its related
code. It also drops the changes for bgwriter.c and slot.c in this
patch, which seem unrelated to this work.

The cfbot reports that this version of the patch doesn't apply anymore:

http://cfbot.cputube.org/patch_36_3365.log
=== Applying patches on top of PostgreSQL commit ID 0c53a6658e47217ad3dd416a5543fc87c3ecfd58 ===
=== applying patch ./v3-0001-Always-use-an-end-of-recovery-record-rather-than-.patch
patching file src/backend/access/transam/xlog.c
[...]
Hunk #14 FAILED at 9061.
Hunk #15 FAILED at 9241.
2 out of 15 hunks FAILED -- saving rejects to file src/backend/access/transam/xlog.c.rej

Can you send a rebased version? In the meantime I will switch the cf entry to
Waiting on Author.

#31Robert Haas
robertmhaas@gmail.com
In reply to: Julien Rouhaud (#30)
2 attachment(s)
Re: using an end-of-recovery record in all cases

On Sat, Jan 15, 2022 at 11:52 PM Julien Rouhaud <rjuju123@gmail.com> wrote:

The cfbot reports that this version of the patch doesn't apply anymore:

Here is a new version of the patch which, unlike v1, I think is
something we could seriously consider applying (not before v16, of
course). It now removes CHECKPOINT_END_OF_RECOVERY completely, and I
attach a second patch which nukes checkPoint.PrevTimeLineID as well.

I mentioned two problems with $SUBJECT in the first email with this
subject line. One was a bug, which Noah has since fixed (thanks,
Noah). The other problem is that LogStandbySnapshot() and a bunch of
its friends expect latestCompletedXid to always be a normal XID, which
is a problem because (1) we currently set nextXid to 3 and (2) at
startup, we compute latestCompletedXid = nextXid - 1. The current code
dodges this kind of accidentally: the checkpoint that happens at
startup is a "shutdown checkpoint" and thus skips logging a standby
snapshot, since a shutdown checkpoint is a sure indicator that there
are no running transactions. With the changes, the checkpoint at
startup happens after we've started allowing write transactions, and
thus a standby snapshot needs to be logged also. In the cases where
the end-of-recovery record was already being used, the problem could
have happened already, except for the fact that those cases involve a
standby promotion, which doesn't happen during initdb. I explored a
few possible ways of solving this problem.

The first thing I considered was replacing latestCompletedXid with a
name like incompleteXidHorizon. The idea is that, where
latestCompletedXid is the highest XID that is known to have committed
or aborted, incompleteXidHorizon would be the lowest XID such that all
known commits or aborts are for lower XIDs. In effect,
incompleteXidHorizon would be latestCompletedXid + 1. Since
latestCompletedXid is always normal except when it's 2,
incompleteXidHorizon would be normal in all cases. Initially this
seemed fairly promising, but it kind of fell down when I realized that
we copy latestCompletedXid into
ComputeXidHorizonsResult.latest_completed. That seemed to me to make
the consequences of the change a bit more far-reaching than I liked.
Also, it wasn't entirely clear to me that I wouldn't be introducing
any off-by-one errors into various wraparound calculations with this
approach.

The second thing I considered was skipping LogStandbySnapshot() during
initdb. There are two ways of doing this and neither of them seem that
great to me. Something that does work is to skip LogStandbySnapshot()
when in single user mode, but there is no particular reason why WAL
generated in single user mode couldn't be replayed on a standby, so
this doesn't seem great. It's too big a hammer for what we really
want, which is just to suppress this during initdb. Another way of
approaching it is to skip it in bootstrap mode, but that actually
doesn't work: initdb then fails during the post-bootstrapping step
rather than during bootstrapping. I thought about patching around that
by forcing the code that generates checkpoint records to forcibly
advance an XID of 3 to 4, but that seemed like working around the
problem from the wrong end.

So ... I decided that the best approach here seems to be to just skip
FirstNormalTransactionId and use FirstNormalTransactionId + 1 for the
first write transaction that the cluster ever processes. That's very
simple and doesn't seem likely to break anything else. On the downside
it seems a bit grotty, but I don't see anything better, and on the
whole, I think with this approach we come out substantially ahead.
0001 removes 3 times as many lines as it inserts, and 0002 saves a few
more lines of code.
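
To make the resulting arithmetic concrete, here is a small standalone sketch; the constant is a local copy of its transam.h counterpart, and the advance helper is a simplified stand-in for FullTransactionIdAdvance(), used here only for illustration:

#include <stdint.h>
#include <stdio.h>

#define FirstNormalTransactionId ((uint32_t) 3)

/* Simplified stand-in for FullTransactionIdAdvance(): step forward, never
 * landing on one of the special XIDs below FirstNormalTransactionId. */
static void
full_xid_advance(uint64_t *fxid)
{
	(*fxid)++;
	while ((uint32_t) *fxid < FirstNormalTransactionId)
		(*fxid)++;
}

int
main(void)
{
	/* What the patched BootStrapXLOG() does: start at 3 and advance once. */
	uint64_t	starting_fxid = FirstNormalTransactionId;

	full_xid_advance(&starting_fxid);

	/* nextXid is now 4, so at startup latestCompletedXid = 3, a normal XID. */
	printf("nextXid = %llu, latestCompletedXid = %llu\n",
		   (unsigned long long) starting_fxid,
		   (unsigned long long) (starting_fxid - 1));
	return 0;
}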

Now, I still don't really know that there isn't some theoretical
difficulty here that makes this whole approach a non-starter, but I
also can't think of what it might be. If the promotion case, which has
used the end-of-recovery record for many years, is basically safe,
despite the fact that it switches TLIs, then it seems to me that the
crash recovery case, which doesn't have that complication, ought to be
safe too. But I might well be missing something, so if you see a
problem, please speak up!

--
Robert Haas
EDB: http://www.enterprisedb.com

Attachments:

v4-0001-Remove-the-end-of-recovery-checkpoint-in-all-case.patch (application/octet-stream)
From ecfe3ee237ebd3f2c9b8281aef37877dbc7f43dd Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Mon, 18 Apr 2022 15:40:12 -0400
Subject: [PATCH v4 1/2] Remove the end-of-recovery checkpoint in all cases.

For many years, we've been issuing a special end-of-recovery WAL
record rather than performing an end-of-recovery checkpoint when a
standby is promoted.  This has the advantage of speeding up promotion.
However, it's advantageous to do this at the end of crash recovery as
well, and for the same reasons.  Plus, it's actually simpler this way,
because having only one way to do something rather than two means we
need less code.

We now always request a checkpoint at the end of recovery. If we
crash again before that checkpoint completes, recovery will begin
from the previous checkpoint that we performed - or replayed. This
behavior is not new as far as archive recovery is concerned, but now
it can also happen for crash recovery.

At initdb time, we now initialize the nextXid counter to 4 rather
than to 3. The code for logging standby snapshots and handling them
on the standby can't handle 3, because we set latestCompletedXid =
nextXid - 1, and 2 is not a normal XID. This is arguably a preexisting
bug, but because we didn't log a standby snapshot when starting from
an end-of-recovery checkpoint, we happened not to ever hit the
problematic case.
---
 src/backend/access/transam/xlog.c        | 155 +++++------------------
 src/backend/postmaster/checkpointer.c    |  11 +-
 src/backend/replication/logical/decode.c |   8 +-
 src/backend/storage/buffer/bufmgr.c      |  10 +-
 src/include/access/xlog.h                |   3 +-
 src/include/access/xlog_internal.h       |   2 +-
 6 files changed, 46 insertions(+), 143 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 5eabd32cf6..6f67ffa3c3 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -665,7 +665,6 @@ static void UpdateLastRemovedPtr(char *filename);
 static void ValidateXLOGDirectoryStructure(void);
 static void CleanupBackupHistory(void);
 static void UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force);
-static bool PerformRecoveryXLogAction(void);
 static void InitControlFile(uint64 sysidentifier);
 static void WriteControlFile(void);
 static void ReadControlFile(void);
@@ -2516,8 +2515,7 @@ XLogFlush(XLogRecPtr record)
 	 * During REDO, we are reading not writing WAL.  Therefore, instead of
 	 * trying to flush the WAL, we should update minRecoveryPoint instead. We
 	 * test XLogInsertAllowed(), not InRecovery, because we need checkpointer
-	 * to act this way too, and because when it tries to write the
-	 * end-of-recovery checkpoint, it should indeed flush.
+	 * to act this way too.
 	 */
 	if (!XLogInsertAllowed())
 	{
@@ -2654,10 +2652,7 @@ XLogFlush(XLogRecPtr record)
 	 * the bad page is encountered again during recovery then we would be
 	 * unable to restart the database at all!  (This scenario actually
 	 * happened in the field several times with 7.1 releases.)	As of 8.4, bad
-	 * LSNs encountered during recovery are UpdateMinRecoveryPoint's problem;
-	 * the only time we can reach here during recovery is while flushing the
-	 * end-of-recovery checkpoint record, and we don't expect that to have a
-	 * bad LSN.
+	 * LSNs encountered during recovery are UpdateMinRecoveryPoint's problem.
 	 *
 	 * Note that for calls from xact.c, the ERROR will be promoted to PANIC
 	 * since xact.c calls this routine inside a critical section.  However,
@@ -4495,6 +4490,7 @@ BootStrapXLOG(void)
 	uint64		sysidentifier;
 	struct timeval tv;
 	pg_crc32c	crc;
+	FullTransactionId	starting_fxid;
 
 	/* allow ordinary WAL segment creation, like StartupXLOG() would */
 	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
@@ -4523,6 +4519,17 @@ BootStrapXLOG(void)
 	page = (XLogPageHeader) TYPEALIGN(XLOG_BLCKSZ, buffer);
 	memset(page, 0, XLOG_BLCKSZ);
 
+	/*
+	 * We set nextXid to FirstNormalTransactionId + 1, because when restarting
+	 * we'll set latestCompletedXid to nextXid - 1, and we want that to be a
+	 * normal XID in all cases. If it isn't, a number of functions, including
+	 * GetRunningTransactionData() and ProcArrayApplyRecoveryInfo(), will
+	 * fail assertions.
+	 */
+	starting_fxid =
+		FullTransactionIdFromEpochAndXid(0, FirstNormalTransactionId);
+	FullTransactionIdAdvance(&starting_fxid);
+
 	/*
 	 * Set up information for the initial checkpoint record
 	 *
@@ -4534,8 +4541,7 @@ BootStrapXLOG(void)
 	checkPoint.ThisTimeLineID = BootstrapTimeLineID;
 	checkPoint.PrevTimeLineID = BootstrapTimeLineID;
 	checkPoint.fullPageWrites = fullPageWrites;
-	checkPoint.nextXid =
-		FullTransactionIdFromEpochAndXid(0, FirstNormalTransactionId);
+	checkPoint.nextXid = starting_fxid;
 	checkPoint.nextOid = FirstGenbkiObjectId;
 	checkPoint.nextMulti = FirstMultiXactId;
 	checkPoint.nextMultiOffset = 0;
@@ -4890,7 +4896,6 @@ StartupXLOG(void)
 	XLogRecPtr	abortedRecPtr;
 	XLogRecPtr	missingContrecPtr;
 	TransactionId oldestActiveXID;
-	bool		promoted = false;
 
 	/*
 	 * We should have an aux process resource owner to use, and we should not
@@ -5546,10 +5551,10 @@ StartupXLOG(void)
 	UpdateFullPageWrites();
 
 	/*
-	 * Emit checkpoint or end-of-recovery record in XLOG, if required.
+	 * Emit end-of-recovery record in XLOG, if required.
 	 */
 	if (performedWalRecovery)
-		promoted = PerformRecoveryXLogAction();
+		CreateEndOfRecoveryRecord();
 
 	/*
 	 * If any of the critical GUCs have changed, log them before we allow
@@ -5611,13 +5616,12 @@ StartupXLOG(void)
 	WalSndWakeup();
 
 	/*
-	 * If this was a promotion, request an (online) checkpoint now. This isn't
-	 * required for consistency, but the last restartpoint might be far back,
-	 * and in case of a crash, recovering from it might take a longer than is
-	 * appropriate now that we're not in standby mode anymore.
+	 * Request an (online) checkpoint now. This isn't required for consistency,
+	 * but the last restartpoint might be far back, and in case of a crash,
+	 * recovering from it might take a longer than is appropriate now that
+	 * we're not in standby mode anymore.
 	 */
-	if (promoted)
-		RequestCheckpoint(CHECKPOINT_FORCE);
+	RequestCheckpoint(CHECKPOINT_FORCE);
 }
 
 /*
@@ -5689,60 +5693,6 @@ ReachedEndOfBackup(XLogRecPtr EndRecPtr, TimeLineID tli)
 	LWLockRelease(ControlFileLock);
 }
 
-/*
- * Perform whatever XLOG actions are necessary at end of REDO.
- *
- * The goal here is to make sure that we'll be able to recover properly if
- * we crash again. If we choose to write a checkpoint, we'll write a shutdown
- * checkpoint rather than an on-line one. This is not particularly critical,
- * but since we may be assigning a new TLI, using a shutdown checkpoint allows
- * us to have the rule that TLI only changes in shutdown checkpoints, which
- * allows some extra error checking in xlog_redo.
- */
-static bool
-PerformRecoveryXLogAction(void)
-{
-	bool		promoted = false;
-
-	/*
-	 * Perform a checkpoint to update all our recovery activity to disk.
-	 *
-	 * Note that we write a shutdown checkpoint rather than an on-line one.
-	 * This is not particularly critical, but since we may be assigning a new
-	 * TLI, using a shutdown checkpoint allows us to have the rule that TLI
-	 * only changes in shutdown checkpoints, which allows some extra error
-	 * checking in xlog_redo.
-	 *
-	 * In promotion, only create a lightweight end-of-recovery record instead
-	 * of a full checkpoint. A checkpoint is requested later, after we're
-	 * fully out of recovery mode and already accepting queries.
-	 */
-	if (ArchiveRecoveryRequested && IsUnderPostmaster &&
-		PromoteIsTriggered())
-	{
-		promoted = true;
-
-		/*
-		 * Insert a special WAL record to mark the end of recovery, since we
-		 * aren't doing a checkpoint. That means that the checkpointer process
-		 * may likely be in the middle of a time-smoothed restartpoint and
-		 * could continue to be for minutes after this.  That sounds strange,
-		 * but the effect is roughly the same and it would be stranger to try
-		 * to come out of the restartpoint and then checkpoint. We request a
-		 * checkpoint later anyway, just for safety.
-		 */
-		CreateEndOfRecoveryRecord();
-	}
-	else
-	{
-		RequestCheckpoint(CHECKPOINT_END_OF_RECOVERY |
-						  CHECKPOINT_IMMEDIATE |
-						  CHECKPOINT_WAIT);
-	}
-
-	return promoted;
-}
-
 /*
  * Is the system still in recovery?
  *
@@ -6053,9 +6003,8 @@ LogCheckpointStart(int flags, bool restartpoint)
 	if (restartpoint)
 		ereport(LOG,
 		/* translator: the placeholders show checkpoint options */
-				(errmsg("restartpoint starting:%s%s%s%s%s%s%s%s",
+				(errmsg("restartpoint starting:%s%s%s%s%s%s%s",
 						(flags & CHECKPOINT_IS_SHUTDOWN) ? " shutdown" : "",
-						(flags & CHECKPOINT_END_OF_RECOVERY) ? " end-of-recovery" : "",
 						(flags & CHECKPOINT_IMMEDIATE) ? " immediate" : "",
 						(flags & CHECKPOINT_FORCE) ? " force" : "",
 						(flags & CHECKPOINT_WAIT) ? " wait" : "",
@@ -6065,9 +6014,8 @@ LogCheckpointStart(int flags, bool restartpoint)
 	else
 		ereport(LOG,
 		/* translator: the placeholders show checkpoint options */
-				(errmsg("checkpoint starting:%s%s%s%s%s%s%s%s",
+				(errmsg("checkpoint starting:%s%s%s%s%s%s%s",
 						(flags & CHECKPOINT_IS_SHUTDOWN) ? " shutdown" : "",
-						(flags & CHECKPOINT_END_OF_RECOVERY) ? " end-of-recovery" : "",
 						(flags & CHECKPOINT_IMMEDIATE) ? " immediate" : "",
 						(flags & CHECKPOINT_FORCE) ? " force" : "",
 						(flags & CHECKPOINT_WAIT) ? " wait" : "",
@@ -6219,7 +6167,7 @@ update_checkpoint_display(int flags, bool restartpoint, bool reset)
 	 * pg_stat_activity to see the status of the checkpointer or the startup
 	 * process.
 	 */
-	if ((flags & (CHECKPOINT_END_OF_RECOVERY | CHECKPOINT_IS_SHUTDOWN)) == 0)
+	if ((flags & CHECKPOINT_IS_SHUTDOWN) == 0)
 		return;
 
 	if (reset)
@@ -6228,8 +6176,7 @@ update_checkpoint_display(int flags, bool restartpoint, bool reset)
 	{
 		char		activitymsg[128];
 
-		snprintf(activitymsg, sizeof(activitymsg), "performing %s%s%s",
-				 (flags & CHECKPOINT_END_OF_RECOVERY) ? "end-of-recovery " : "",
+		snprintf(activitymsg, sizeof(activitymsg), "performing %s%s",
 				 (flags & CHECKPOINT_IS_SHUTDOWN) ? "shutdown " : "",
 				 restartpoint ? "restartpoint" : "checkpoint");
 		set_ps_display(activitymsg);
@@ -6242,12 +6189,10 @@ update_checkpoint_display(int flags, bool restartpoint, bool reset)
  *
  * flags is a bitwise OR of the following:
  *	CHECKPOINT_IS_SHUTDOWN: checkpoint is for database shutdown.
- *	CHECKPOINT_END_OF_RECOVERY: checkpoint is for end of WAL recovery.
  *	CHECKPOINT_IMMEDIATE: finish the checkpoint ASAP,
  *		ignoring checkpoint_completion_target parameter.
  *	CHECKPOINT_FORCE: force a checkpoint even if no XLOG activity has occurred
- *		since the last one (implied by CHECKPOINT_IS_SHUTDOWN or
- *		CHECKPOINT_END_OF_RECOVERY).
+ *		since the last one (implied by CHECKPOINT_IS_SHUTDOWN).
  *	CHECKPOINT_FLUSH_ALL: also flush buffers of unlogged tables.
  *
  * Note: flags contains other bits, of interest here only for logging purposes.
@@ -6269,7 +6214,7 @@ update_checkpoint_display(int flags, bool restartpoint, bool reset)
 void
 CreateCheckPoint(int flags)
 {
-	bool		shutdown;
+	bool		shutdown = (flags & CHECKPOINT_IS_SHUTDOWN) != 0;
 	CheckPoint	checkPoint;
 	XLogRecPtr	recptr;
 	XLogSegNo	_logSegNo;
@@ -6280,19 +6225,9 @@ CreateCheckPoint(int flags)
 	XLogRecPtr	last_important_lsn;
 	VirtualTransactionId *vxids;
 	int			nvxids;
-	int			oldXLogAllowed = 0;
-
-	/*
-	 * An end-of-recovery checkpoint is really a shutdown checkpoint, just
-	 * issued at a different time.
-	 */
-	if (flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY))
-		shutdown = true;
-	else
-		shutdown = false;
 
 	/* sanity check */
-	if (RecoveryInProgress() && (flags & CHECKPOINT_END_OF_RECOVERY) == 0)
+	if (RecoveryInProgress())
 		elog(ERROR, "can't create a checkpoint during recovery");
 
 	/*
@@ -6358,8 +6293,7 @@ CreateCheckPoint(int flags)
 	 * WAL activity requiring a checkpoint, skip it.  The idea here is to
 	 * avoid inserting duplicate checkpoints when the system is idle.
 	 */
-	if ((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY |
-				  CHECKPOINT_FORCE)) == 0)
+	if ((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_FORCE)) == 0)
 	{
 		if (last_important_lsn == ControlFile->checkPoint)
 		{
@@ -6371,19 +6305,8 @@ CreateCheckPoint(int flags)
 		}
 	}
 
-	/*
-	 * An end-of-recovery checkpoint is created before anyone is allowed to
-	 * write WAL. To allow us to write the checkpoint record, temporarily
-	 * enable XLogInsertAllowed.
-	 */
-	if (flags & CHECKPOINT_END_OF_RECOVERY)
-		oldXLogAllowed = LocalSetXLogInsertAllowed();
-
 	checkPoint.ThisTimeLineID = XLogCtl->InsertTimeLineID;
-	if (flags & CHECKPOINT_END_OF_RECOVERY)
-		checkPoint.PrevTimeLineID = XLogCtl->PrevTimeLineID;
-	else
-		checkPoint.PrevTimeLineID = checkPoint.ThisTimeLineID;
+	checkPoint.PrevTimeLineID = checkPoint.ThisTimeLineID;
 
 	checkPoint.fullPageWrites = Insert->fullPageWrites;
 
@@ -6540,8 +6463,7 @@ CreateCheckPoint(int flags)
 	 * allows us to reconstruct the state of running transactions during
 	 * archive recovery, if required. Skip, if this info disabled.
 	 *
-	 * If we are shutting down, or Startup process is completing crash
-	 * recovery we don't need to write running xact data.
+	 * If we are shutting down, we don't need to write running xact data.
 	 */
 	if (!shutdown && XLogStandbyInfoActive())
 		LogStandbySnapshot();
@@ -6562,17 +6484,10 @@ CreateCheckPoint(int flags)
 	/*
 	 * We mustn't write any new WAL after a shutdown checkpoint, or it will be
 	 * overwritten at next startup.  No-one should even try, this just allows
-	 * sanity-checking.  In the case of an end-of-recovery checkpoint, we want
-	 * to just temporarily disable writing until the system has exited
-	 * recovery.
+	 * sanity-checking.
 	 */
 	if (shutdown)
-	{
-		if (flags & CHECKPOINT_END_OF_RECOVERY)
-			LocalXLogInsertAllowed = oldXLogAllowed;
-		else
-			LocalXLogInsertAllowed = 0; /* never again write WAL */
-	}
+		LocalXLogInsertAllowed = 0; /* never again write WAL */
 
 	/*
 	 * We now have ProcLastRecPtr = start of actual checkpoint record, recptr
@@ -7017,7 +6932,7 @@ CreateRestartPoint(int flags)
 	 * Update pg_control, using current time.  Check that it still shows
 	 * DB_IN_ARCHIVE_RECOVERY state and an older checkpoint, else do nothing;
 	 * this is a quick hack to make sure nothing really bad happens if somehow
-	 * we get here after the end-of-recovery checkpoint.
+	 * we get here after end of recovery.
 	 */
 	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
 	if (ControlFile->state == DB_IN_ARCHIVE_RECOVERY &&
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index c937c39f50..624c728d95 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -411,13 +411,6 @@ CheckpointerMain(void)
 
 			ConditionVariableBroadcast(&CheckpointerShmem->start_cv);
 
-			/*
-			 * The end-of-recovery checkpoint is a real checkpoint that's
-			 * performed while we're still in recovery.
-			 */
-			if (flags & CHECKPOINT_END_OF_RECOVERY)
-				do_restartpoint = false;
-
 			/*
 			 * We will warn if (a) too soon since last checkpoint (whatever
 			 * caused it) and (b) somebody set the CHECKPOINT_CAUSE_XLOG flag
@@ -916,12 +909,10 @@ CheckpointerShmemInit(void)
  *
  * flags is a bitwise OR of the following:
  *	CHECKPOINT_IS_SHUTDOWN: checkpoint is for database shutdown.
- *	CHECKPOINT_END_OF_RECOVERY: checkpoint is for end of WAL recovery.
  *	CHECKPOINT_IMMEDIATE: finish the checkpoint ASAP,
  *		ignoring checkpoint_completion_target parameter.
  *	CHECKPOINT_FORCE: force a checkpoint even if no XLOG activity has occurred
- *		since the last one (implied by CHECKPOINT_IS_SHUTDOWN or
- *		CHECKPOINT_END_OF_RECOVERY).
+ *		since the last one (implied by CHECKPOINT_IS_SHUTDOWN).
  *	CHECKPOINT_WAIT: wait for completion before returning (otherwise,
  *		just signal checkpointer to do it, and return).
  *	CHECKPOINT_CAUSE_XLOG: checkpoint is requested due to xlog filling.
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 6303647fe0..de6849d083 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -349,10 +349,10 @@ standby_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				 * Abort all transactions that we keep track of, that are
 				 * older than the record's oldestRunningXid. This is the most
 				 * convenient spot for doing so since, in contrast to shutdown
-				 * or end-of-recovery checkpoints, we have information about
-				 * all running transactions which includes prepared ones,
-				 * while shutdown checkpoints just know that no non-prepared
-				 * transactions are in progress.
+				 * recovery checkpoints, we have information about all running
+				 * transactions which includes prepared ones, while shutdown
+				 * checkpoints just know that no non-prepared transactions are
+				 * in progress.
 				 */
 				ReorderBufferAbortOld(ctx->reorder, running->oldestRunningXid);
 			}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index e02ea3a977..9ac42a9fd4 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1933,10 +1933,9 @@ UnpinBuffer(BufferDesc *buf, bool fixOwner)
  *
  * This is called at checkpoint time to write out all dirty shared buffers.
  * The checkpoint request flags should be passed in.  If CHECKPOINT_IMMEDIATE
- * is set, we disable delays between writes; if CHECKPOINT_IS_SHUTDOWN,
- * CHECKPOINT_END_OF_RECOVERY or CHECKPOINT_FLUSH_ALL is set, we write even
- * unlogged buffers, which are otherwise skipped.  The remaining flags
- * currently have no effect here.
+ * is set, we disable delays between writes; if CHECKPOINT_IS_SHUTDOWN
+ * or CHECKPOINT_FLUSH_ALL is set, we write even unlogged buffers, which are
+ * otherwise skipped.  The remaining flags currently have no effect here.
  */
 static void
 BufferSync(int flags)
@@ -1962,8 +1961,7 @@ BufferSync(int flags)
 	 * we write only permanent, dirty buffers.  But at shutdown or end of
 	 * recovery, we write all dirty buffers.
 	 */
-	if (!((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY |
-					CHECKPOINT_FLUSH_ALL))))
+	if (!((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_FLUSH_ALL))))
 		mask |= BM_PERMANENT;
 
 	/*
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index d9f2487a96..04091d8576 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -132,8 +132,7 @@ extern PGDLLIMPORT bool XLOG_DEBUG;
 
 /* These directly affect the behavior of CreateCheckPoint and subsidiaries */
 #define CHECKPOINT_IS_SHUTDOWN	0x0001	/* Checkpoint is for shutdown */
-#define CHECKPOINT_END_OF_RECOVERY	0x0002	/* Like shutdown checkpoint, but
-											 * issued at end of WAL recovery */
+/* 0x0002 was CHECKPOINT_END_OF_RECOVERY, now unused */
 #define CHECKPOINT_IMMEDIATE	0x0004	/* Do it without delays */
 #define CHECKPOINT_FORCE		0x0008	/* Force even if no activity */
 #define CHECKPOINT_FLUSH_ALL	0x0010	/* Flush all pages, including those
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index fae0bef8f5..b71f1eed90 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -258,7 +258,7 @@ typedef struct xl_overwrite_contrecord
 	TimestampTz overwrite_time;
 } xl_overwrite_contrecord;
 
-/* End of recovery mark, when we don't do an END_OF_RECOVERY checkpoint */
+/* End of recovery mark, and possible TLI change */
 typedef struct xl_end_of_recovery
 {
 	TimestampTz end_time;
-- 
2.24.3 (Apple Git-128)

v4-0002-Remove-previous-timeline-ID-from-checkpoint.patch (application/octet-stream)
From cb49c761f783d07067b7e7c1101a1b7aea0e6fe6 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Mon, 18 Apr 2022 16:01:13 -0400
Subject: [PATCH v4 2/2] Remove previous timeline ID from checkpoint.

Since recovery now always ends with an end-of-recovery record rather
than a checkpoint, a checkpoint never causes a timeline switch.
Therefore, it doesn't need to store a previous timeline ID and a
current timeline ID any more.

Note that this affects the signature of the SQL-callable function
pg_control_checkpoint().
---
 doc/src/sgml/func.sgml                    |  5 --
 src/backend/access/rmgrdesc/xlogdesc.c    |  3 +-
 src/backend/access/transam/xlog.c         |  2 -
 src/backend/access/transam/xlogrecovery.c | 32 +++++-------
 src/backend/utils/misc/pg_controldata.c   | 61 +++++++++++------------
 src/bin/pg_controldata/pg_controldata.c   |  2 -
 src/bin/pg_resetwal/pg_resetwal.c         |  4 --
 src/include/catalog/pg_control.h          |  4 +-
 src/include/catalog/pg_proc.dat           |  6 +--
 9 files changed, 45 insertions(+), 74 deletions(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 93ba39eff1..4c92e9e56d 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -27626,11 +27626,6 @@ SELECT collation for ('foo' COLLATE "de_DE");
        <entry><type>integer</type></entry>
       </row>
 
-      <row>
-       <entry><structfield>prev_timeline_id</structfield></entry>
-       <entry><type>integer</type></entry>
-      </row>
-
       <row>
        <entry><structfield>full_page_writes</structfield></entry>
        <entry><type>boolean</type></entry>
diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c
index c0dfea40c7..a3e29b043f 100644
--- a/src/backend/access/rmgrdesc/xlogdesc.c
+++ b/src/backend/access/rmgrdesc/xlogdesc.c
@@ -45,13 +45,12 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 		CheckPoint *checkpoint = (CheckPoint *) rec;
 
 		appendStringInfo(buf, "redo %X/%X; "
-						 "tli %u; prev tli %u; fpw %s; xid %u:%u; oid %u; multi %u; offset %u; "
+						 "tli %u; fpw %s; xid %u:%u; oid %u; multi %u; offset %u; "
 						 "oldest xid %u in DB %u; oldest multi %u in DB %u; "
 						 "oldest/newest commit timestamp xid: %u/%u; "
 						 "oldest running xid %u; %s",
 						 LSN_FORMAT_ARGS(checkpoint->redo),
 						 checkpoint->ThisTimeLineID,
-						 checkpoint->PrevTimeLineID,
 						 checkpoint->fullPageWrites ? "true" : "false",
 						 EpochFromFullTransactionId(checkpoint->nextXid),
 						 XidFromFullTransactionId(checkpoint->nextXid),
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6f67ffa3c3..c1784bb386 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4539,7 +4539,6 @@ BootStrapXLOG(void)
 	 */
 	checkPoint.redo = wal_segment_size + SizeOfXLogLongPHD;
 	checkPoint.ThisTimeLineID = BootstrapTimeLineID;
-	checkPoint.PrevTimeLineID = BootstrapTimeLineID;
 	checkPoint.fullPageWrites = fullPageWrites;
 	checkPoint.nextXid = starting_fxid;
 	checkPoint.nextOid = FirstGenbkiObjectId;
@@ -6306,7 +6305,6 @@ CreateCheckPoint(int flags)
 	}
 
 	checkPoint.ThisTimeLineID = XLogCtl->InsertTimeLineID;
-	checkPoint.PrevTimeLineID = checkPoint.ThisTimeLineID;
 
 	checkPoint.fullPageWrites = Insert->fullPageWrites;
 
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 39ef865ed9..14a3ee96a3 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1833,36 +1833,28 @@ ApplyWalRecord(XLogReaderState *xlogreader, XLogRecord *record, TimeLineID *repl
 	 */
 	if (record->xl_rmid == RM_XLOG_ID)
 	{
-		TimeLineID	newReplayTLI = *replayTLI;
-		TimeLineID	prevReplayTLI = *replayTLI;
 		uint8		info = record->xl_info & ~XLR_INFO_MASK;
 
-		if (info == XLOG_CHECKPOINT_SHUTDOWN)
-		{
-			CheckPoint	checkPoint;
-
-			memcpy(&checkPoint, XLogRecGetData(xlogreader), sizeof(CheckPoint));
-			newReplayTLI = checkPoint.ThisTimeLineID;
-			prevReplayTLI = checkPoint.PrevTimeLineID;
-		}
-		else if (info == XLOG_END_OF_RECOVERY)
+		if (info == XLOG_END_OF_RECOVERY)
 		{
 			xl_end_of_recovery xlrec;
+			TimeLineID	newReplayTLI;
+			TimeLineID	prevReplayTLI;
 
 			memcpy(&xlrec, XLogRecGetData(xlogreader), sizeof(xl_end_of_recovery));
 			newReplayTLI = xlrec.ThisTimeLineID;
 			prevReplayTLI = xlrec.PrevTimeLineID;
-		}
 
-		if (newReplayTLI != *replayTLI)
-		{
-			/* Check that it's OK to switch to this TLI */
-			checkTimeLineSwitch(xlogreader->EndRecPtr,
-								newReplayTLI, prevReplayTLI, *replayTLI);
+			if (newReplayTLI != *replayTLI)
+			{
+				/* Check that it's OK to switch to this TLI */
+				checkTimeLineSwitch(xlogreader->EndRecPtr,
+									newReplayTLI, prevReplayTLI, *replayTLI);
 
-			/* Following WAL records should be run with new TLI */
-			*replayTLI = newReplayTLI;
-			switchedTLI = true;
+				/* Following WAL records should be run with new TLI */
+				*replayTLI = newReplayTLI;
+				switchedTLI = true;
+			}
 		}
 	}
 
diff --git a/src/backend/utils/misc/pg_controldata.c b/src/backend/utils/misc/pg_controldata.c
index 781f8b8758..c30f380d07 100644
--- a/src/backend/utils/misc/pg_controldata.c
+++ b/src/backend/utils/misc/pg_controldata.c
@@ -101,33 +101,31 @@ pg_control_checkpoint(PG_FUNCTION_ARGS)
 					   TEXTOID, -1, 0);
 	TupleDescInitEntry(tupdesc, (AttrNumber) 4, "timeline_id",
 					   INT4OID, -1, 0);
-	TupleDescInitEntry(tupdesc, (AttrNumber) 5, "prev_timeline_id",
-					   INT4OID, -1, 0);
-	TupleDescInitEntry(tupdesc, (AttrNumber) 6, "full_page_writes",
+	TupleDescInitEntry(tupdesc, (AttrNumber) 5, "full_page_writes",
 					   BOOLOID, -1, 0);
-	TupleDescInitEntry(tupdesc, (AttrNumber) 7, "next_xid",
+	TupleDescInitEntry(tupdesc, (AttrNumber) 6, "next_xid",
 					   TEXTOID, -1, 0);
-	TupleDescInitEntry(tupdesc, (AttrNumber) 8, "next_oid",
+	TupleDescInitEntry(tupdesc, (AttrNumber) 7, "next_oid",
 					   OIDOID, -1, 0);
-	TupleDescInitEntry(tupdesc, (AttrNumber) 9, "next_multixact_id",
+	TupleDescInitEntry(tupdesc, (AttrNumber) 8, "next_multixact_id",
 					   XIDOID, -1, 0);
-	TupleDescInitEntry(tupdesc, (AttrNumber) 10, "next_multi_offset",
+	TupleDescInitEntry(tupdesc, (AttrNumber) 9, "next_multi_offset",
 					   XIDOID, -1, 0);
-	TupleDescInitEntry(tupdesc, (AttrNumber) 11, "oldest_xid",
+	TupleDescInitEntry(tupdesc, (AttrNumber) 10, "oldest_xid",
 					   XIDOID, -1, 0);
-	TupleDescInitEntry(tupdesc, (AttrNumber) 12, "oldest_xid_dbid",
+	TupleDescInitEntry(tupdesc, (AttrNumber) 11, "oldest_xid_dbid",
 					   OIDOID, -1, 0);
-	TupleDescInitEntry(tupdesc, (AttrNumber) 13, "oldest_active_xid",
+	TupleDescInitEntry(tupdesc, (AttrNumber) 12, "oldest_active_xid",
 					   XIDOID, -1, 0);
-	TupleDescInitEntry(tupdesc, (AttrNumber) 14, "oldest_multi_xid",
+	TupleDescInitEntry(tupdesc, (AttrNumber) 13, "oldest_multi_xid",
 					   XIDOID, -1, 0);
-	TupleDescInitEntry(tupdesc, (AttrNumber) 15, "oldest_multi_dbid",
+	TupleDescInitEntry(tupdesc, (AttrNumber) 14, "oldest_multi_dbid",
 					   OIDOID, -1, 0);
-	TupleDescInitEntry(tupdesc, (AttrNumber) 16, "oldest_commit_ts_xid",
+	TupleDescInitEntry(tupdesc, (AttrNumber) 15, "oldest_commit_ts_xid",
 					   XIDOID, -1, 0);
-	TupleDescInitEntry(tupdesc, (AttrNumber) 17, "newest_commit_ts_xid",
+	TupleDescInitEntry(tupdesc, (AttrNumber) 16, "newest_commit_ts_xid",
 					   XIDOID, -1, 0);
-	TupleDescInitEntry(tupdesc, (AttrNumber) 18, "checkpoint_time",
+	TupleDescInitEntry(tupdesc, (AttrNumber) 17, "checkpoint_time",
 					   TIMESTAMPTZOID, -1, 0);
 	tupdesc = BlessTupleDesc(tupdesc);
 
@@ -158,50 +156,47 @@ pg_control_checkpoint(PG_FUNCTION_ARGS)
 	values[3] = Int32GetDatum(ControlFile->checkPointCopy.ThisTimeLineID);
 	nulls[3] = false;
 
-	values[4] = Int32GetDatum(ControlFile->checkPointCopy.PrevTimeLineID);
+	values[4] = BoolGetDatum(ControlFile->checkPointCopy.fullPageWrites);
 	nulls[4] = false;
 
-	values[5] = BoolGetDatum(ControlFile->checkPointCopy.fullPageWrites);
-	nulls[5] = false;
-
-	values[6] = CStringGetTextDatum(psprintf("%u:%u",
+	values[5] = CStringGetTextDatum(psprintf("%u:%u",
 											 EpochFromFullTransactionId(ControlFile->checkPointCopy.nextXid),
 											 XidFromFullTransactionId(ControlFile->checkPointCopy.nextXid)));
+	nulls[5] = false;
+
+	values[6] = ObjectIdGetDatum(ControlFile->checkPointCopy.nextOid);
 	nulls[6] = false;
 
-	values[7] = ObjectIdGetDatum(ControlFile->checkPointCopy.nextOid);
+	values[7] = TransactionIdGetDatum(ControlFile->checkPointCopy.nextMulti);
 	nulls[7] = false;
 
-	values[8] = TransactionIdGetDatum(ControlFile->checkPointCopy.nextMulti);
+	values[8] = TransactionIdGetDatum(ControlFile->checkPointCopy.nextMultiOffset);
 	nulls[8] = false;
 
-	values[9] = TransactionIdGetDatum(ControlFile->checkPointCopy.nextMultiOffset);
+	values[9] = TransactionIdGetDatum(ControlFile->checkPointCopy.oldestXid);
 	nulls[9] = false;
 
-	values[10] = TransactionIdGetDatum(ControlFile->checkPointCopy.oldestXid);
+	values[10] = ObjectIdGetDatum(ControlFile->checkPointCopy.oldestXidDB);
 	nulls[10] = false;
 
-	values[11] = ObjectIdGetDatum(ControlFile->checkPointCopy.oldestXidDB);
+	values[11] = TransactionIdGetDatum(ControlFile->checkPointCopy.oldestActiveXid);
 	nulls[11] = false;
 
-	values[12] = TransactionIdGetDatum(ControlFile->checkPointCopy.oldestActiveXid);
+	values[12] = TransactionIdGetDatum(ControlFile->checkPointCopy.oldestMulti);
 	nulls[12] = false;
 
-	values[13] = TransactionIdGetDatum(ControlFile->checkPointCopy.oldestMulti);
+	values[13] = ObjectIdGetDatum(ControlFile->checkPointCopy.oldestMultiDB);
 	nulls[13] = false;
 
-	values[14] = ObjectIdGetDatum(ControlFile->checkPointCopy.oldestMultiDB);
+	values[14] = TransactionIdGetDatum(ControlFile->checkPointCopy.oldestCommitTsXid);
 	nulls[14] = false;
 
-	values[15] = TransactionIdGetDatum(ControlFile->checkPointCopy.oldestCommitTsXid);
+	values[15] = TransactionIdGetDatum(ControlFile->checkPointCopy.newestCommitTsXid);
 	nulls[15] = false;
 
-	values[16] = TransactionIdGetDatum(ControlFile->checkPointCopy.newestCommitTsXid);
+	values[16] = TimestampTzGetDatum(time_t_to_timestamptz(ControlFile->checkPointCopy.time));
 	nulls[16] = false;
 
-	values[17] = TimestampTzGetDatum(time_t_to_timestamptz(ControlFile->checkPointCopy.time));
-	nulls[17] = false;
-
 	htup = heap_form_tuple(tupdesc, values, nulls);
 
 	PG_RETURN_DATUM(HeapTupleGetDatum(htup));
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index c390ec51ce..56dc8d0031 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -243,8 +243,6 @@ main(int argc, char *argv[])
 		   xlogfilename);
 	printf(_("Latest checkpoint's TimeLineID:       %u\n"),
 		   ControlFile->checkPointCopy.ThisTimeLineID);
-	printf(_("Latest checkpoint's PrevTimeLineID:   %u\n"),
-		   ControlFile->checkPointCopy.PrevTimeLineID);
 	printf(_("Latest checkpoint's full_page_writes: %s\n"),
 		   ControlFile->checkPointCopy.fullPageWrites ? _("on") : _("off"));
 	printf(_("Latest checkpoint's NextXID:          %u:%u\n"),
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index d4772a2965..2e2f35bf5c 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -444,10 +444,7 @@ main(int argc, char *argv[])
 		ControlFile.checkPointCopy.nextMultiOffset = set_mxoff;
 
 	if (minXlogTli > ControlFile.checkPointCopy.ThisTimeLineID)
-	{
 		ControlFile.checkPointCopy.ThisTimeLineID = minXlogTli;
-		ControlFile.checkPointCopy.PrevTimeLineID = minXlogTli;
-	}
 
 	if (set_wal_segsize != 0)
 		ControlFile.xlog_seg_size = WalSegSz;
@@ -650,7 +647,6 @@ GuessControlValues(void)
 
 	ControlFile.checkPointCopy.redo = SizeOfXLogLongPHD;
 	ControlFile.checkPointCopy.ThisTimeLineID = 1;
-	ControlFile.checkPointCopy.PrevTimeLineID = 1;
 	ControlFile.checkPointCopy.fullPageWrites = false;
 	ControlFile.checkPointCopy.nextXid =
 		FullTransactionIdFromEpochAndXid(0, FirstNormalTransactionId);
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index 06368e2366..b0da5cb1b5 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -22,7 +22,7 @@
 
 
 /* Version identifier for this pg_control format */
-#define PG_CONTROL_VERSION	1300
+#define PG_CONTROL_VERSION	1600
 
 /* Nonce key length, see below */
 #define MOCK_AUTH_NONCE_LEN		32
@@ -37,8 +37,6 @@ typedef struct CheckPoint
 	XLogRecPtr	redo;			/* next RecPtr available when we began to
 								 * create CheckPoint (i.e. REDO start point) */
 	TimeLineID	ThisTimeLineID; /* current TLI */
-	TimeLineID	PrevTimeLineID; /* previous TLI, if this record begins a new
-								 * timeline (equals ThisTimeLineID otherwise) */
 	bool		fullPageWrites; /* current full_page_writes */
 	FullTransactionId nextXid;	/* next free transaction ID */
 	Oid			nextOid;		/* next free OID */
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 6d378ff785..c2bd6f0801 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -11711,9 +11711,9 @@
   descr => 'pg_controldata checkpoint state information as a function',
   proname => 'pg_control_checkpoint', provolatile => 'v',
   prorettype => 'record', proargtypes => '',
-  proallargtypes => '{pg_lsn,pg_lsn,text,int4,int4,bool,text,oid,xid,xid,xid,oid,xid,xid,oid,xid,xid,timestamptz}',
-  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
-  proargnames => '{checkpoint_lsn,redo_lsn,redo_wal_file,timeline_id,prev_timeline_id,full_page_writes,next_xid,next_oid,next_multixact_id,next_multi_offset,oldest_xid,oldest_xid_dbid,oldest_active_xid,oldest_multi_xid,oldest_multi_dbid,oldest_commit_ts_xid,newest_commit_ts_xid,checkpoint_time}',
+  proallargtypes => '{pg_lsn,pg_lsn,text,int4,bool,text,oid,xid,xid,xid,oid,xid,xid,oid,xid,xid,timestamptz}',
+  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
+  proargnames => '{checkpoint_lsn,redo_lsn,redo_wal_file,timeline_id,full_page_writes,next_xid,next_oid,next_multixact_id,next_multi_offset,oldest_xid,oldest_xid_dbid,oldest_active_xid,oldest_multi_xid,oldest_multi_dbid,oldest_commit_ts_xid,newest_commit_ts_xid,checkpoint_time}',
   prosrc => 'pg_control_checkpoint' },
 
 { oid => '3443',
-- 
2.24.3 (Apple Git-128)

#32Amul Sul
sulamul@gmail.com
In reply to: Robert Haas (#31)
Re: using an end-of-recovery record in all cases

On Tue, Apr 19, 2022 at 2:14 AM Robert Haas <robertmhaas@gmail.com> wrote:

On Sat, Jan 15, 2022 at 11:52 PM Julien Rouhaud <rjuju123@gmail.com> wrote:

The cfbot reports that this version of the patch doesn't apply anymore:

Here is a new version of the patch which, unlike v1, I think is
something we could seriously consider applying (not before v16, of
course). It now removes CHECKPOINT_END_OF_RECOVERY completely, and I
attach a second patch as well which nukes checkPoint.PrevTimeLineID as
well.

I mentioned two problems with $SUBJECT in the first email with this
subject line. One was a bug, which Noah has since fixed (thanks,
Noah). The other problem is that LogStandbySnapshot() and a bunch of
its friends expect latestCompletedXid to always be a normal XID, which
is a problem because (1) we currently set nextXid to 3 and (2) at
startup, we compute latestCompletedXid = nextXid - 1. The current code
dodges this kind of accidentally: the checkpoint that happens at
startup is a "shutdown checkpoint" and thus skips logging a standby
snapshot, since a shutdown checkpoint is a sure indicator that there
are no running transactions. With the changes, the checkpoint at
startup happens after we've started allowing write transactions, and
thus a standby snapshot needs to be logged also. In the cases where
the end-of-recovery record was already being used, the problem could
have happened already, except for the fact that those cases involve a
standby promotion, which doesn't happen during initdb. I explored a
few possible ways of solving this problem.

The first thing I considered was replacing latestCompletedXid with a
name like incompleteXidHorizon. The idea is that, where
latestCompletedXid is the highest XID that is known to have committed
or aborted, incompleteXidHorizon would be the lowest XID such that all
known commits or aborts are for lower XIDs. In effect,
incompleteXidHorizon would be latestCompletedXid + 1. Since
latestCompletedXid is always normal except when it's 2,
incompleteXidHorizon would be normal in all cases. Initially this
seemed fairly promising, but it kind of fell down when I realized that
we copy latestCompletedXid into
ComputeXidHorizonsResult.latest_completed. That seemed to me to make
the consequences of the change a bit more far-reaching than I liked.
Also, it wasn't entirely clear to me that I wouldn't be introducing
any off-by-one errors into various wraparound calculations with this
approach.

The second thing I considered was skipping LogStandbySnapshot() during
initdb. There are two ways of doing this and neither of them seem that
great to me. Something that does work is to skip LogStandbySnapshot()
when in single user mode, but there is no particular reason why WAL
generated in single user mode couldn't be replayed on a standby, so
this doesn't seem great. It's too big a hammer for what we really
want, which is just to suppress this during initdb. Another way of
approaching it is to skip it in bootstrap mode, but that actually
doesn't work: initdb then fails during the post-bootstrapping step
rather than during bootstrapping. I thought about patching around that
by forcing the code that generates checkpoint records to forcibly
advance an XID of 3 to 4, but that seemed like working around the
problem from the wrong end.

So ... I decided that the best approach here seems to be to just skip
FirstNormalTransactionId and use FirstNormalTransactionId + 1 for the
first write transaction that the cluster ever processes. That's very
simple and doesn't seem likely to break anything else. On the downside
it seems a bit grotty, but I don't see anything better, and on the
whole, I think with this approach we come out substantially ahead.
0001 removes 3 times as many lines as it inserts, and 0002 saves a few
more lines of code.

Now, I still don't really know that there isn't some theoretical
difficulty here that makes this whole approach a non-starter, but I
also can't think of what it might be. If the promotion case, which has
used the end-of-recovery record for many years, is basically safe,
despite the fact that it switches TLIs, then it seems to me that the
crash recovery case, which doesn't have that complication, ought to be
safe too. But I might well be missing something, so if you see a
problem, please speak up!

  /*
- * If this was a promotion, request an (online) checkpoint now. This isn't
- * required for consistency, but the last restartpoint might be far back,
- * and in case of a crash, recovering from it might take a longer than is
- * appropriate now that we're not in standby mode anymore.
+ * Request an (online) checkpoint now. This isn't required for consistency,
+ * but the last restartpoint might be far back, and in case of a crash,
+ * recovering from it might take a longer than is appropriate now that
+ * we're not in standby mode anymore.
  */
- if (promoted)
- RequestCheckpoint(CHECKPOINT_FORCE);
+ RequestCheckpoint(CHECKPOINT_FORCE);
 }

I think RequestCheckpoint() should be called conditionally. Why would we
need a checkpoint if we haven't been through recovery, in other words, if
we're starting up from a clean shutdown?

Regards,
Amul

#33Robert Haas
robertmhaas@gmail.com
In reply to: Amul Sul (#32)
2 attachment(s)
Re: using an end-of-recovery record in all cases

On Mon, Apr 18, 2022 at 11:56 PM Amul Sul <sulamul@gmail.com> wrote:

I think RequestCheckpoint() should be called conditionally. Why would we
need a checkpoint if we haven't been through recovery, in other words, if
we're starting up from a clean shutdown?

Good point. v5 attached.

--
Robert Haas
EDB: http://www.enterprisedb.com

Attachments:

v5-0001-Remove-the-end-of-recovery-checkpoint-in-all-case.patch (application/octet-stream)
From 59b804658ff9fc97964a9296b9612d5039d5a3f3 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Mon, 18 Apr 2022 15:40:12 -0400
Subject: [PATCH v5 1/2] Remove the end-of-recovery checkpoint in all cases.

For many years, we've been issuing a special end-of-recovery WAL
record rather than performing an end-of-recovery checkpoint when a
standby is promoted.  This has the advantage of speeding up promotion.
However, it's advantageous to do this at the end of crash recovery as
well, and for the same reasons.  Plus, it's actually simpler this way,
because having only one way to do something rather than two means we
need less code.

We now always request a checkpoint at the end of recovery. If we
crash again before that checkpoint completes, recovery will begin
from the previous checkpoint that we performed - or replayed. This
behavior is not new as far as archive recovery is concerned, but now
it can also happen for crash recovery.

At initdb time, we now initialize the nextXid counter to 4 rather
than to 3. The code for logging standby snapshots and handling them
on the standby can't handle 3, because we set latestCompletedXid =
nextXid - 1, and 2 is not a normal XID. This is arguably a preexisting
bug, but because we didn't log a standby snapshot when starting from
an end-of-recovery checkpoint, we happened not to ever hit the
problematic case.
---
 src/backend/access/transam/xlog.c        | 154 ++++++-----------------
 src/backend/postmaster/checkpointer.c    |  11 +-
 src/backend/replication/logical/decode.c |   8 +-
 src/backend/storage/buffer/bufmgr.c      |  10 +-
 src/include/access/xlog.h                |   3 +-
 src/include/access/xlog_internal.h       |   2 +-
 6 files changed, 46 insertions(+), 142 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 5eabd32cf6..420e93bed3 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -665,7 +665,6 @@ static void UpdateLastRemovedPtr(char *filename);
 static void ValidateXLOGDirectoryStructure(void);
 static void CleanupBackupHistory(void);
 static void UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force);
-static bool PerformRecoveryXLogAction(void);
 static void InitControlFile(uint64 sysidentifier);
 static void WriteControlFile(void);
 static void ReadControlFile(void);
@@ -2516,8 +2515,7 @@ XLogFlush(XLogRecPtr record)
 	 * During REDO, we are reading not writing WAL.  Therefore, instead of
 	 * trying to flush the WAL, we should update minRecoveryPoint instead. We
 	 * test XLogInsertAllowed(), not InRecovery, because we need checkpointer
-	 * to act this way too, and because when it tries to write the
-	 * end-of-recovery checkpoint, it should indeed flush.
+	 * to act this way too.
 	 */
 	if (!XLogInsertAllowed())
 	{
@@ -2654,10 +2652,7 @@ XLogFlush(XLogRecPtr record)
 	 * the bad page is encountered again during recovery then we would be
 	 * unable to restart the database at all!  (This scenario actually
 	 * happened in the field several times with 7.1 releases.)	As of 8.4, bad
-	 * LSNs encountered during recovery are UpdateMinRecoveryPoint's problem;
-	 * the only time we can reach here during recovery is while flushing the
-	 * end-of-recovery checkpoint record, and we don't expect that to have a
-	 * bad LSN.
+	 * LSNs encountered during recovery are UpdateMinRecoveryPoint's problem.
 	 *
 	 * Note that for calls from xact.c, the ERROR will be promoted to PANIC
 	 * since xact.c calls this routine inside a critical section.  However,
@@ -4495,6 +4490,7 @@ BootStrapXLOG(void)
 	uint64		sysidentifier;
 	struct timeval tv;
 	pg_crc32c	crc;
+	FullTransactionId	starting_fxid;
 
 	/* allow ordinary WAL segment creation, like StartupXLOG() would */
 	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
@@ -4523,6 +4519,17 @@ BootStrapXLOG(void)
 	page = (XLogPageHeader) TYPEALIGN(XLOG_BLCKSZ, buffer);
 	memset(page, 0, XLOG_BLCKSZ);
 
+	/*
+	 * We set nextXid to FirstNormalTransactionId + 1, because when restarting
+	 * we'll set latestCompletedXid to nextXid - 1, and we want that to be a
+	 * normal XID in all cases. If it isn't, a number of functions, including
+	 * GetRunningTransactionData() and ProcArrayApplyRecoveryInfo(), will
+	 * fail assertions.
+	 */
+	starting_fxid =
+		FullTransactionIdFromEpochAndXid(0, FirstNormalTransactionId);
+	FullTransactionIdAdvance(&starting_fxid);
+
 	/*
 	 * Set up information for the initial checkpoint record
 	 *
@@ -4534,8 +4541,7 @@ BootStrapXLOG(void)
 	checkPoint.ThisTimeLineID = BootstrapTimeLineID;
 	checkPoint.PrevTimeLineID = BootstrapTimeLineID;
 	checkPoint.fullPageWrites = fullPageWrites;
-	checkPoint.nextXid =
-		FullTransactionIdFromEpochAndXid(0, FirstNormalTransactionId);
+	checkPoint.nextXid = starting_fxid;
 	checkPoint.nextOid = FirstGenbkiObjectId;
 	checkPoint.nextMulti = FirstMultiXactId;
 	checkPoint.nextMultiOffset = 0;
@@ -4890,7 +4896,6 @@ StartupXLOG(void)
 	XLogRecPtr	abortedRecPtr;
 	XLogRecPtr	missingContrecPtr;
 	TransactionId oldestActiveXID;
-	bool		promoted = false;
 
 	/*
 	 * We should have an aux process resource owner to use, and we should not
@@ -5546,10 +5551,10 @@ StartupXLOG(void)
 	UpdateFullPageWrites();
 
 	/*
-	 * Emit checkpoint or end-of-recovery record in XLOG, if required.
+	 * Emit end-of-recovery record in XLOG, if required.
 	 */
 	if (performedWalRecovery)
-		promoted = PerformRecoveryXLogAction();
+		CreateEndOfRecoveryRecord();
 
 	/*
 	 * If any of the critical GUCs have changed, log them before we allow
@@ -5611,12 +5616,12 @@ StartupXLOG(void)
 	WalSndWakeup();
 
 	/*
-	 * If this was a promotion, request an (online) checkpoint now. This isn't
-	 * required for consistency, but the last restartpoint might be far back,
-	 * and in case of a crash, recovering from it might take a longer than is
-	 * appropriate now that we're not in standby mode anymore.
+	 * If we performed recovery, request an (online) checkpoint now. This
+	 * isn't required for consistency, but the last restartpoint might be far
+	 * back, and in case of a crash, recovering from it might take a longer
+	 * than is appropriate now that we're not in standby mode anymore.
 	 */
-	if (promoted)
+	if (performedWalRecovery)
 		RequestCheckpoint(CHECKPOINT_FORCE);
 }
 
@@ -5689,60 +5694,6 @@ ReachedEndOfBackup(XLogRecPtr EndRecPtr, TimeLineID tli)
 	LWLockRelease(ControlFileLock);
 }
 
-/*
- * Perform whatever XLOG actions are necessary at end of REDO.
- *
- * The goal here is to make sure that we'll be able to recover properly if
- * we crash again. If we choose to write a checkpoint, we'll write a shutdown
- * checkpoint rather than an on-line one. This is not particularly critical,
- * but since we may be assigning a new TLI, using a shutdown checkpoint allows
- * us to have the rule that TLI only changes in shutdown checkpoints, which
- * allows some extra error checking in xlog_redo.
- */
-static bool
-PerformRecoveryXLogAction(void)
-{
-	bool		promoted = false;
-
-	/*
-	 * Perform a checkpoint to update all our recovery activity to disk.
-	 *
-	 * Note that we write a shutdown checkpoint rather than an on-line one.
-	 * This is not particularly critical, but since we may be assigning a new
-	 * TLI, using a shutdown checkpoint allows us to have the rule that TLI
-	 * only changes in shutdown checkpoints, which allows some extra error
-	 * checking in xlog_redo.
-	 *
-	 * In promotion, only create a lightweight end-of-recovery record instead
-	 * of a full checkpoint. A checkpoint is requested later, after we're
-	 * fully out of recovery mode and already accepting queries.
-	 */
-	if (ArchiveRecoveryRequested && IsUnderPostmaster &&
-		PromoteIsTriggered())
-	{
-		promoted = true;
-
-		/*
-		 * Insert a special WAL record to mark the end of recovery, since we
-		 * aren't doing a checkpoint. That means that the checkpointer process
-		 * may likely be in the middle of a time-smoothed restartpoint and
-		 * could continue to be for minutes after this.  That sounds strange,
-		 * but the effect is roughly the same and it would be stranger to try
-		 * to come out of the restartpoint and then checkpoint. We request a
-		 * checkpoint later anyway, just for safety.
-		 */
-		CreateEndOfRecoveryRecord();
-	}
-	else
-	{
-		RequestCheckpoint(CHECKPOINT_END_OF_RECOVERY |
-						  CHECKPOINT_IMMEDIATE |
-						  CHECKPOINT_WAIT);
-	}
-
-	return promoted;
-}
-
 /*
  * Is the system still in recovery?
  *
@@ -6053,9 +6004,8 @@ LogCheckpointStart(int flags, bool restartpoint)
 	if (restartpoint)
 		ereport(LOG,
 		/* translator: the placeholders show checkpoint options */
-				(errmsg("restartpoint starting:%s%s%s%s%s%s%s%s",
+				(errmsg("restartpoint starting:%s%s%s%s%s%s%s",
 						(flags & CHECKPOINT_IS_SHUTDOWN) ? " shutdown" : "",
-						(flags & CHECKPOINT_END_OF_RECOVERY) ? " end-of-recovery" : "",
 						(flags & CHECKPOINT_IMMEDIATE) ? " immediate" : "",
 						(flags & CHECKPOINT_FORCE) ? " force" : "",
 						(flags & CHECKPOINT_WAIT) ? " wait" : "",
@@ -6065,9 +6015,8 @@ LogCheckpointStart(int flags, bool restartpoint)
 	else
 		ereport(LOG,
 		/* translator: the placeholders show checkpoint options */
-				(errmsg("checkpoint starting:%s%s%s%s%s%s%s%s",
+				(errmsg("checkpoint starting:%s%s%s%s%s%s%s",
 						(flags & CHECKPOINT_IS_SHUTDOWN) ? " shutdown" : "",
-						(flags & CHECKPOINT_END_OF_RECOVERY) ? " end-of-recovery" : "",
 						(flags & CHECKPOINT_IMMEDIATE) ? " immediate" : "",
 						(flags & CHECKPOINT_FORCE) ? " force" : "",
 						(flags & CHECKPOINT_WAIT) ? " wait" : "",
@@ -6219,7 +6168,7 @@ update_checkpoint_display(int flags, bool restartpoint, bool reset)
 	 * pg_stat_activity to see the status of the checkpointer or the startup
 	 * process.
 	 */
-	if ((flags & (CHECKPOINT_END_OF_RECOVERY | CHECKPOINT_IS_SHUTDOWN)) == 0)
+	if ((flags & CHECKPOINT_IS_SHUTDOWN) == 0)
 		return;
 
 	if (reset)
@@ -6228,8 +6177,7 @@ update_checkpoint_display(int flags, bool restartpoint, bool reset)
 	{
 		char		activitymsg[128];
 
-		snprintf(activitymsg, sizeof(activitymsg), "performing %s%s%s",
-				 (flags & CHECKPOINT_END_OF_RECOVERY) ? "end-of-recovery " : "",
+		snprintf(activitymsg, sizeof(activitymsg), "performing %s%s",
 				 (flags & CHECKPOINT_IS_SHUTDOWN) ? "shutdown " : "",
 				 restartpoint ? "restartpoint" : "checkpoint");
 		set_ps_display(activitymsg);
@@ -6242,12 +6190,10 @@ update_checkpoint_display(int flags, bool restartpoint, bool reset)
  *
  * flags is a bitwise OR of the following:
  *	CHECKPOINT_IS_SHUTDOWN: checkpoint is for database shutdown.
- *	CHECKPOINT_END_OF_RECOVERY: checkpoint is for end of WAL recovery.
  *	CHECKPOINT_IMMEDIATE: finish the checkpoint ASAP,
  *		ignoring checkpoint_completion_target parameter.
  *	CHECKPOINT_FORCE: force a checkpoint even if no XLOG activity has occurred
- *		since the last one (implied by CHECKPOINT_IS_SHUTDOWN or
- *		CHECKPOINT_END_OF_RECOVERY).
+ *		since the last one (implied by CHECKPOINT_IS_SHUTDOWN).
  *	CHECKPOINT_FLUSH_ALL: also flush buffers of unlogged tables.
  *
  * Note: flags contains other bits, of interest here only for logging purposes.
@@ -6269,7 +6215,7 @@ update_checkpoint_display(int flags, bool restartpoint, bool reset)
 void
 CreateCheckPoint(int flags)
 {
-	bool		shutdown;
+	bool		shutdown = (flags & CHECKPOINT_IS_SHUTDOWN) != 0;
 	CheckPoint	checkPoint;
 	XLogRecPtr	recptr;
 	XLogSegNo	_logSegNo;
@@ -6280,19 +6226,9 @@ CreateCheckPoint(int flags)
 	XLogRecPtr	last_important_lsn;
 	VirtualTransactionId *vxids;
 	int			nvxids;
-	int			oldXLogAllowed = 0;
-
-	/*
-	 * An end-of-recovery checkpoint is really a shutdown checkpoint, just
-	 * issued at a different time.
-	 */
-	if (flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY))
-		shutdown = true;
-	else
-		shutdown = false;
 
 	/* sanity check */
-	if (RecoveryInProgress() && (flags & CHECKPOINT_END_OF_RECOVERY) == 0)
+	if (RecoveryInProgress())
 		elog(ERROR, "can't create a checkpoint during recovery");
 
 	/*
@@ -6358,8 +6294,7 @@ CreateCheckPoint(int flags)
 	 * WAL activity requiring a checkpoint, skip it.  The idea here is to
 	 * avoid inserting duplicate checkpoints when the system is idle.
 	 */
-	if ((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY |
-				  CHECKPOINT_FORCE)) == 0)
+	if ((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_FORCE)) == 0)
 	{
 		if (last_important_lsn == ControlFile->checkPoint)
 		{
@@ -6371,19 +6306,8 @@ CreateCheckPoint(int flags)
 		}
 	}
 
-	/*
-	 * An end-of-recovery checkpoint is created before anyone is allowed to
-	 * write WAL. To allow us to write the checkpoint record, temporarily
-	 * enable XLogInsertAllowed.
-	 */
-	if (flags & CHECKPOINT_END_OF_RECOVERY)
-		oldXLogAllowed = LocalSetXLogInsertAllowed();
-
 	checkPoint.ThisTimeLineID = XLogCtl->InsertTimeLineID;
-	if (flags & CHECKPOINT_END_OF_RECOVERY)
-		checkPoint.PrevTimeLineID = XLogCtl->PrevTimeLineID;
-	else
-		checkPoint.PrevTimeLineID = checkPoint.ThisTimeLineID;
+	checkPoint.PrevTimeLineID = checkPoint.ThisTimeLineID;
 
 	checkPoint.fullPageWrites = Insert->fullPageWrites;
 
@@ -6540,8 +6464,7 @@ CreateCheckPoint(int flags)
 	 * allows us to reconstruct the state of running transactions during
 	 * archive recovery, if required. Skip, if this info disabled.
 	 *
-	 * If we are shutting down, or Startup process is completing crash
-	 * recovery we don't need to write running xact data.
+	 * If we are shutting down, we don't need to write running xact data.
 	 */
 	if (!shutdown && XLogStandbyInfoActive())
 		LogStandbySnapshot();
@@ -6562,17 +6485,10 @@ CreateCheckPoint(int flags)
 	/*
 	 * We mustn't write any new WAL after a shutdown checkpoint, or it will be
 	 * overwritten at next startup.  No-one should even try, this just allows
-	 * sanity-checking.  In the case of an end-of-recovery checkpoint, we want
-	 * to just temporarily disable writing until the system has exited
-	 * recovery.
+	 * sanity-checking.
 	 */
 	if (shutdown)
-	{
-		if (flags & CHECKPOINT_END_OF_RECOVERY)
-			LocalXLogInsertAllowed = oldXLogAllowed;
-		else
-			LocalXLogInsertAllowed = 0; /* never again write WAL */
-	}
+		LocalXLogInsertAllowed = 0; /* never again write WAL */
 
 	/*
 	 * We now have ProcLastRecPtr = start of actual checkpoint record, recptr
@@ -7017,7 +6933,7 @@ CreateRestartPoint(int flags)
 	 * Update pg_control, using current time.  Check that it still shows
 	 * DB_IN_ARCHIVE_RECOVERY state and an older checkpoint, else do nothing;
 	 * this is a quick hack to make sure nothing really bad happens if somehow
-	 * we get here after the end-of-recovery checkpoint.
+	 * we get here after end of recovery.
 	 */
 	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
 	if (ControlFile->state == DB_IN_ARCHIVE_RECOVERY &&
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index c937c39f50..624c728d95 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -411,13 +411,6 @@ CheckpointerMain(void)
 
 			ConditionVariableBroadcast(&CheckpointerShmem->start_cv);
 
-			/*
-			 * The end-of-recovery checkpoint is a real checkpoint that's
-			 * performed while we're still in recovery.
-			 */
-			if (flags & CHECKPOINT_END_OF_RECOVERY)
-				do_restartpoint = false;
-
 			/*
 			 * We will warn if (a) too soon since last checkpoint (whatever
 			 * caused it) and (b) somebody set the CHECKPOINT_CAUSE_XLOG flag
@@ -916,12 +909,10 @@ CheckpointerShmemInit(void)
  *
  * flags is a bitwise OR of the following:
  *	CHECKPOINT_IS_SHUTDOWN: checkpoint is for database shutdown.
- *	CHECKPOINT_END_OF_RECOVERY: checkpoint is for end of WAL recovery.
  *	CHECKPOINT_IMMEDIATE: finish the checkpoint ASAP,
  *		ignoring checkpoint_completion_target parameter.
  *	CHECKPOINT_FORCE: force a checkpoint even if no XLOG activity has occurred
- *		since the last one (implied by CHECKPOINT_IS_SHUTDOWN or
- *		CHECKPOINT_END_OF_RECOVERY).
+ *		since the last one (implied by CHECKPOINT_IS_SHUTDOWN).
  *	CHECKPOINT_WAIT: wait for completion before returning (otherwise,
  *		just signal checkpointer to do it, and return).
  *	CHECKPOINT_CAUSE_XLOG: checkpoint is requested due to xlog filling.
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 6303647fe0..de6849d083 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -349,10 +349,10 @@ standby_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				 * Abort all transactions that we keep track of, that are
 				 * older than the record's oldestRunningXid. This is the most
 				 * convenient spot for doing so since, in contrast to shutdown
-				 * or end-of-recovery checkpoints, we have information about
-				 * all running transactions which includes prepared ones,
-				 * while shutdown checkpoints just know that no non-prepared
-				 * transactions are in progress.
+				 * recovery checkpoints, we have information about all running
+				 * transactions which includes prepared ones, while shutdown
+				 * checkpoints just know that no non-prepared transactions are
+				 * in progress.
 				 */
 				ReorderBufferAbortOld(ctx->reorder, running->oldestRunningXid);
 			}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index e02ea3a977..9ac42a9fd4 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1933,10 +1933,9 @@ UnpinBuffer(BufferDesc *buf, bool fixOwner)
  *
  * This is called at checkpoint time to write out all dirty shared buffers.
  * The checkpoint request flags should be passed in.  If CHECKPOINT_IMMEDIATE
- * is set, we disable delays between writes; if CHECKPOINT_IS_SHUTDOWN,
- * CHECKPOINT_END_OF_RECOVERY or CHECKPOINT_FLUSH_ALL is set, we write even
- * unlogged buffers, which are otherwise skipped.  The remaining flags
- * currently have no effect here.
+ * is set, we disable delays between writes; if CHECKPOINT_IS_SHUTDOWN
+ * or CHECKPOINT_FLUSH_ALL is set, we write even unlogged buffers, which are
+ * otherwise skipped.  The remaining flags currently have no effect here.
  */
 static void
 BufferSync(int flags)
@@ -1962,8 +1961,7 @@ BufferSync(int flags)
 	 * we write only permanent, dirty buffers.  But at shutdown or end of
 	 * recovery, we write all dirty buffers.
 	 */
-	if (!((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY |
-					CHECKPOINT_FLUSH_ALL))))
+	if (!((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_FLUSH_ALL))))
 		mask |= BM_PERMANENT;
 
 	/*
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index d9f2487a96..04091d8576 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -132,8 +132,7 @@ extern PGDLLIMPORT bool XLOG_DEBUG;
 
 /* These directly affect the behavior of CreateCheckPoint and subsidiaries */
 #define CHECKPOINT_IS_SHUTDOWN	0x0001	/* Checkpoint is for shutdown */
-#define CHECKPOINT_END_OF_RECOVERY	0x0002	/* Like shutdown checkpoint, but
-											 * issued at end of WAL recovery */
+/* 0x0002 was CHECKPOINT_END_OF_RECOVERY, now unused */
 #define CHECKPOINT_IMMEDIATE	0x0004	/* Do it without delays */
 #define CHECKPOINT_FORCE		0x0008	/* Force even if no activity */
 #define CHECKPOINT_FLUSH_ALL	0x0010	/* Flush all pages, including those
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index fae0bef8f5..b71f1eed90 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -258,7 +258,7 @@ typedef struct xl_overwrite_contrecord
 	TimestampTz overwrite_time;
 } xl_overwrite_contrecord;
 
-/* End of recovery mark, when we don't do an END_OF_RECOVERY checkpoint */
+/* End of recovery mark, and possible TLI change */
 typedef struct xl_end_of_recovery
 {
 	TimestampTz end_time;
-- 
2.24.3 (Apple Git-128)

Attachment: v5-0002-Remove-previous-timeline-ID-from-checkpoint.patch (application/octet-stream)
From 77fe03bc10454ed3a00e71d1e0eeb711a3e95b9f Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Mon, 18 Apr 2022 16:01:13 -0400
Subject: [PATCH v5 2/2] Remove previous timeline ID from checkpoint.

Since recovery now always ends with an end-of-recovery record rather
than a checkpoint, a checkpoint never causes a timeline switch.
Therefore, it no longer needs to store a previous timeline ID in
addition to the current timeline ID.

Note that this affects the signature of the SQL-callable function
pg_control_checkpoint().
---
 doc/src/sgml/func.sgml                    |  5 --
 src/backend/access/rmgrdesc/xlogdesc.c    |  3 +-
 src/backend/access/transam/xlog.c         |  2 -
 src/backend/access/transam/xlogrecovery.c | 32 +++++-------
 src/backend/utils/misc/pg_controldata.c   | 61 +++++++++++------------
 src/bin/pg_controldata/pg_controldata.c   |  2 -
 src/bin/pg_resetwal/pg_resetwal.c         |  4 --
 src/include/catalog/pg_control.h          |  4 +-
 src/include/catalog/pg_proc.dat           |  6 +--
 9 files changed, 45 insertions(+), 74 deletions(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 489184a4f0..f8516b5475 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -27626,11 +27626,6 @@ SELECT collation for ('foo' COLLATE "de_DE");
        <entry><type>integer</type></entry>
       </row>
 
-      <row>
-       <entry><structfield>prev_timeline_id</structfield></entry>
-       <entry><type>integer</type></entry>
-      </row>
-
       <row>
        <entry><structfield>full_page_writes</structfield></entry>
        <entry><type>boolean</type></entry>
diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c
index c0dfea40c7..a3e29b043f 100644
--- a/src/backend/access/rmgrdesc/xlogdesc.c
+++ b/src/backend/access/rmgrdesc/xlogdesc.c
@@ -45,13 +45,12 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 		CheckPoint *checkpoint = (CheckPoint *) rec;
 
 		appendStringInfo(buf, "redo %X/%X; "
-						 "tli %u; prev tli %u; fpw %s; xid %u:%u; oid %u; multi %u; offset %u; "
+						 "tli %u; fpw %s; xid %u:%u; oid %u; multi %u; offset %u; "
 						 "oldest xid %u in DB %u; oldest multi %u in DB %u; "
 						 "oldest/newest commit timestamp xid: %u/%u; "
 						 "oldest running xid %u; %s",
 						 LSN_FORMAT_ARGS(checkpoint->redo),
 						 checkpoint->ThisTimeLineID,
-						 checkpoint->PrevTimeLineID,
 						 checkpoint->fullPageWrites ? "true" : "false",
 						 EpochFromFullTransactionId(checkpoint->nextXid),
 						 XidFromFullTransactionId(checkpoint->nextXid),
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 420e93bed3..912a913b64 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4539,7 +4539,6 @@ BootStrapXLOG(void)
 	 */
 	checkPoint.redo = wal_segment_size + SizeOfXLogLongPHD;
 	checkPoint.ThisTimeLineID = BootstrapTimeLineID;
-	checkPoint.PrevTimeLineID = BootstrapTimeLineID;
 	checkPoint.fullPageWrites = fullPageWrites;
 	checkPoint.nextXid = starting_fxid;
 	checkPoint.nextOid = FirstGenbkiObjectId;
@@ -6307,7 +6306,6 @@ CreateCheckPoint(int flags)
 	}
 
 	checkPoint.ThisTimeLineID = XLogCtl->InsertTimeLineID;
-	checkPoint.PrevTimeLineID = checkPoint.ThisTimeLineID;
 
 	checkPoint.fullPageWrites = Insert->fullPageWrites;
 
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 39ef865ed9..14a3ee96a3 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1833,36 +1833,28 @@ ApplyWalRecord(XLogReaderState *xlogreader, XLogRecord *record, TimeLineID *repl
 	 */
 	if (record->xl_rmid == RM_XLOG_ID)
 	{
-		TimeLineID	newReplayTLI = *replayTLI;
-		TimeLineID	prevReplayTLI = *replayTLI;
 		uint8		info = record->xl_info & ~XLR_INFO_MASK;
 
-		if (info == XLOG_CHECKPOINT_SHUTDOWN)
-		{
-			CheckPoint	checkPoint;
-
-			memcpy(&checkPoint, XLogRecGetData(xlogreader), sizeof(CheckPoint));
-			newReplayTLI = checkPoint.ThisTimeLineID;
-			prevReplayTLI = checkPoint.PrevTimeLineID;
-		}
-		else if (info == XLOG_END_OF_RECOVERY)
+		if (info == XLOG_END_OF_RECOVERY)
 		{
 			xl_end_of_recovery xlrec;
+			TimeLineID	newReplayTLI;
+			TimeLineID	prevReplayTLI;
 
 			memcpy(&xlrec, XLogRecGetData(xlogreader), sizeof(xl_end_of_recovery));
 			newReplayTLI = xlrec.ThisTimeLineID;
 			prevReplayTLI = xlrec.PrevTimeLineID;
-		}
 
-		if (newReplayTLI != *replayTLI)
-		{
-			/* Check that it's OK to switch to this TLI */
-			checkTimeLineSwitch(xlogreader->EndRecPtr,
-								newReplayTLI, prevReplayTLI, *replayTLI);
+			if (newReplayTLI != *replayTLI)
+			{
+				/* Check that it's OK to switch to this TLI */
+				checkTimeLineSwitch(xlogreader->EndRecPtr,
+									newReplayTLI, prevReplayTLI, *replayTLI);
 
-			/* Following WAL records should be run with new TLI */
-			*replayTLI = newReplayTLI;
-			switchedTLI = true;
+				/* Following WAL records should be run with new TLI */
+				*replayTLI = newReplayTLI;
+				switchedTLI = true;
+			}
 		}
 	}
 
diff --git a/src/backend/utils/misc/pg_controldata.c b/src/backend/utils/misc/pg_controldata.c
index 781f8b8758..c30f380d07 100644
--- a/src/backend/utils/misc/pg_controldata.c
+++ b/src/backend/utils/misc/pg_controldata.c
@@ -101,33 +101,31 @@ pg_control_checkpoint(PG_FUNCTION_ARGS)
 					   TEXTOID, -1, 0);
 	TupleDescInitEntry(tupdesc, (AttrNumber) 4, "timeline_id",
 					   INT4OID, -1, 0);
-	TupleDescInitEntry(tupdesc, (AttrNumber) 5, "prev_timeline_id",
-					   INT4OID, -1, 0);
-	TupleDescInitEntry(tupdesc, (AttrNumber) 6, "full_page_writes",
+	TupleDescInitEntry(tupdesc, (AttrNumber) 5, "full_page_writes",
 					   BOOLOID, -1, 0);
-	TupleDescInitEntry(tupdesc, (AttrNumber) 7, "next_xid",
+	TupleDescInitEntry(tupdesc, (AttrNumber) 6, "next_xid",
 					   TEXTOID, -1, 0);
-	TupleDescInitEntry(tupdesc, (AttrNumber) 8, "next_oid",
+	TupleDescInitEntry(tupdesc, (AttrNumber) 7, "next_oid",
 					   OIDOID, -1, 0);
-	TupleDescInitEntry(tupdesc, (AttrNumber) 9, "next_multixact_id",
+	TupleDescInitEntry(tupdesc, (AttrNumber) 8, "next_multixact_id",
 					   XIDOID, -1, 0);
-	TupleDescInitEntry(tupdesc, (AttrNumber) 10, "next_multi_offset",
+	TupleDescInitEntry(tupdesc, (AttrNumber) 9, "next_multi_offset",
 					   XIDOID, -1, 0);
-	TupleDescInitEntry(tupdesc, (AttrNumber) 11, "oldest_xid",
+	TupleDescInitEntry(tupdesc, (AttrNumber) 10, "oldest_xid",
 					   XIDOID, -1, 0);
-	TupleDescInitEntry(tupdesc, (AttrNumber) 12, "oldest_xid_dbid",
+	TupleDescInitEntry(tupdesc, (AttrNumber) 11, "oldest_xid_dbid",
 					   OIDOID, -1, 0);
-	TupleDescInitEntry(tupdesc, (AttrNumber) 13, "oldest_active_xid",
+	TupleDescInitEntry(tupdesc, (AttrNumber) 12, "oldest_active_xid",
 					   XIDOID, -1, 0);
-	TupleDescInitEntry(tupdesc, (AttrNumber) 14, "oldest_multi_xid",
+	TupleDescInitEntry(tupdesc, (AttrNumber) 13, "oldest_multi_xid",
 					   XIDOID, -1, 0);
-	TupleDescInitEntry(tupdesc, (AttrNumber) 15, "oldest_multi_dbid",
+	TupleDescInitEntry(tupdesc, (AttrNumber) 14, "oldest_multi_dbid",
 					   OIDOID, -1, 0);
-	TupleDescInitEntry(tupdesc, (AttrNumber) 16, "oldest_commit_ts_xid",
+	TupleDescInitEntry(tupdesc, (AttrNumber) 15, "oldest_commit_ts_xid",
 					   XIDOID, -1, 0);
-	TupleDescInitEntry(tupdesc, (AttrNumber) 17, "newest_commit_ts_xid",
+	TupleDescInitEntry(tupdesc, (AttrNumber) 16, "newest_commit_ts_xid",
 					   XIDOID, -1, 0);
-	TupleDescInitEntry(tupdesc, (AttrNumber) 18, "checkpoint_time",
+	TupleDescInitEntry(tupdesc, (AttrNumber) 17, "checkpoint_time",
 					   TIMESTAMPTZOID, -1, 0);
 	tupdesc = BlessTupleDesc(tupdesc);
 
@@ -158,50 +156,47 @@ pg_control_checkpoint(PG_FUNCTION_ARGS)
 	values[3] = Int32GetDatum(ControlFile->checkPointCopy.ThisTimeLineID);
 	nulls[3] = false;
 
-	values[4] = Int32GetDatum(ControlFile->checkPointCopy.PrevTimeLineID);
+	values[4] = BoolGetDatum(ControlFile->checkPointCopy.fullPageWrites);
 	nulls[4] = false;
 
-	values[5] = BoolGetDatum(ControlFile->checkPointCopy.fullPageWrites);
-	nulls[5] = false;
-
-	values[6] = CStringGetTextDatum(psprintf("%u:%u",
+	values[5] = CStringGetTextDatum(psprintf("%u:%u",
 											 EpochFromFullTransactionId(ControlFile->checkPointCopy.nextXid),
 											 XidFromFullTransactionId(ControlFile->checkPointCopy.nextXid)));
+	nulls[5] = false;
+
+	values[6] = ObjectIdGetDatum(ControlFile->checkPointCopy.nextOid);
 	nulls[6] = false;
 
-	values[7] = ObjectIdGetDatum(ControlFile->checkPointCopy.nextOid);
+	values[7] = TransactionIdGetDatum(ControlFile->checkPointCopy.nextMulti);
 	nulls[7] = false;
 
-	values[8] = TransactionIdGetDatum(ControlFile->checkPointCopy.nextMulti);
+	values[8] = TransactionIdGetDatum(ControlFile->checkPointCopy.nextMultiOffset);
 	nulls[8] = false;
 
-	values[9] = TransactionIdGetDatum(ControlFile->checkPointCopy.nextMultiOffset);
+	values[9] = TransactionIdGetDatum(ControlFile->checkPointCopy.oldestXid);
 	nulls[9] = false;
 
-	values[10] = TransactionIdGetDatum(ControlFile->checkPointCopy.oldestXid);
+	values[10] = ObjectIdGetDatum(ControlFile->checkPointCopy.oldestXidDB);
 	nulls[10] = false;
 
-	values[11] = ObjectIdGetDatum(ControlFile->checkPointCopy.oldestXidDB);
+	values[11] = TransactionIdGetDatum(ControlFile->checkPointCopy.oldestActiveXid);
 	nulls[11] = false;
 
-	values[12] = TransactionIdGetDatum(ControlFile->checkPointCopy.oldestActiveXid);
+	values[12] = TransactionIdGetDatum(ControlFile->checkPointCopy.oldestMulti);
 	nulls[12] = false;
 
-	values[13] = TransactionIdGetDatum(ControlFile->checkPointCopy.oldestMulti);
+	values[13] = ObjectIdGetDatum(ControlFile->checkPointCopy.oldestMultiDB);
 	nulls[13] = false;
 
-	values[14] = ObjectIdGetDatum(ControlFile->checkPointCopy.oldestMultiDB);
+	values[14] = TransactionIdGetDatum(ControlFile->checkPointCopy.oldestCommitTsXid);
 	nulls[14] = false;
 
-	values[15] = TransactionIdGetDatum(ControlFile->checkPointCopy.oldestCommitTsXid);
+	values[15] = TransactionIdGetDatum(ControlFile->checkPointCopy.newestCommitTsXid);
 	nulls[15] = false;
 
-	values[16] = TransactionIdGetDatum(ControlFile->checkPointCopy.newestCommitTsXid);
+	values[16] = TimestampTzGetDatum(time_t_to_timestamptz(ControlFile->checkPointCopy.time));
 	nulls[16] = false;
 
-	values[17] = TimestampTzGetDatum(time_t_to_timestamptz(ControlFile->checkPointCopy.time));
-	nulls[17] = false;
-
 	htup = heap_form_tuple(tupdesc, values, nulls);
 
 	PG_RETURN_DATUM(HeapTupleGetDatum(htup));
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index c390ec51ce..56dc8d0031 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -243,8 +243,6 @@ main(int argc, char *argv[])
 		   xlogfilename);
 	printf(_("Latest checkpoint's TimeLineID:       %u\n"),
 		   ControlFile->checkPointCopy.ThisTimeLineID);
-	printf(_("Latest checkpoint's PrevTimeLineID:   %u\n"),
-		   ControlFile->checkPointCopy.PrevTimeLineID);
 	printf(_("Latest checkpoint's full_page_writes: %s\n"),
 		   ControlFile->checkPointCopy.fullPageWrites ? _("on") : _("off"));
 	printf(_("Latest checkpoint's NextXID:          %u:%u\n"),
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index d4772a2965..2e2f35bf5c 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -444,10 +444,7 @@ main(int argc, char *argv[])
 		ControlFile.checkPointCopy.nextMultiOffset = set_mxoff;
 
 	if (minXlogTli > ControlFile.checkPointCopy.ThisTimeLineID)
-	{
 		ControlFile.checkPointCopy.ThisTimeLineID = minXlogTli;
-		ControlFile.checkPointCopy.PrevTimeLineID = minXlogTli;
-	}
 
 	if (set_wal_segsize != 0)
 		ControlFile.xlog_seg_size = WalSegSz;
@@ -650,7 +647,6 @@ GuessControlValues(void)
 
 	ControlFile.checkPointCopy.redo = SizeOfXLogLongPHD;
 	ControlFile.checkPointCopy.ThisTimeLineID = 1;
-	ControlFile.checkPointCopy.PrevTimeLineID = 1;
 	ControlFile.checkPointCopy.fullPageWrites = false;
 	ControlFile.checkPointCopy.nextXid =
 		FullTransactionIdFromEpochAndXid(0, FirstNormalTransactionId);
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index 06368e2366..b0da5cb1b5 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -22,7 +22,7 @@
 
 
 /* Version identifier for this pg_control format */
-#define PG_CONTROL_VERSION	1300
+#define PG_CONTROL_VERSION	1600
 
 /* Nonce key length, see below */
 #define MOCK_AUTH_NONCE_LEN		32
@@ -37,8 +37,6 @@ typedef struct CheckPoint
 	XLogRecPtr	redo;			/* next RecPtr available when we began to
 								 * create CheckPoint (i.e. REDO start point) */
 	TimeLineID	ThisTimeLineID; /* current TLI */
-	TimeLineID	PrevTimeLineID; /* previous TLI, if this record begins a new
-								 * timeline (equals ThisTimeLineID otherwise) */
 	bool		fullPageWrites; /* current full_page_writes */
 	FullTransactionId nextXid;	/* next free transaction ID */
 	Oid			nextOid;		/* next free OID */
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 6d378ff785..c2bd6f0801 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -11711,9 +11711,9 @@
   descr => 'pg_controldata checkpoint state information as a function',
   proname => 'pg_control_checkpoint', provolatile => 'v',
   prorettype => 'record', proargtypes => '',
-  proallargtypes => '{pg_lsn,pg_lsn,text,int4,int4,bool,text,oid,xid,xid,xid,oid,xid,xid,oid,xid,xid,timestamptz}',
-  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
-  proargnames => '{checkpoint_lsn,redo_lsn,redo_wal_file,timeline_id,prev_timeline_id,full_page_writes,next_xid,next_oid,next_multixact_id,next_multi_offset,oldest_xid,oldest_xid_dbid,oldest_active_xid,oldest_multi_xid,oldest_multi_dbid,oldest_commit_ts_xid,newest_commit_ts_xid,checkpoint_time}',
+  proallargtypes => '{pg_lsn,pg_lsn,text,int4,bool,text,oid,xid,xid,xid,oid,xid,xid,oid,xid,xid,timestamptz}',
+  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
+  proargnames => '{checkpoint_lsn,redo_lsn,redo_wal_file,timeline_id,full_page_writes,next_xid,next_oid,next_multixact_id,next_multi_offset,oldest_xid,oldest_xid_dbid,oldest_active_xid,oldest_multi_xid,oldest_multi_dbid,oldest_commit_ts_xid,newest_commit_ts_xid,checkpoint_time}',
   prosrc => 'pg_control_checkpoint' },
 
 { oid => '3443',
-- 
2.24.3 (Apple Git-128)

#34Nathan Bossart
nathandbossart@gmail.com
In reply to: Robert Haas (#31)
Re: using an end-of-recovery record in all cases

On Mon, Apr 18, 2022 at 04:44:03PM -0400, Robert Haas wrote:

Here is a new version of the patch which, unlike v1, I think is
something we could seriously consider applying (not before v16, of
course). It now removes CHECKPOINT_END_OF_RECOVERY completely, and I
attach a second patch as well which nukes checkPoint.PrevTimeLineID as
well.

I'd like to add a big +1 for this change. IIUC this should help with some
of the problems I've noted elsewhere [0].

I mentioned two problems with $SUBJECT in the first email with this
subject line. One was a bug, which Noah has since fixed (thanks,
Noah). The other problem is that LogStandbySnapshot() and a bunch of
its friends expect latestCompletedXid to always be a normal XID, which
is a problem because (1) we currently set nextXid to 3 and (2) at
startup, we compute latestCompletedXid = nextXid - 1. The current code
dodges this kind of accidentally: the checkpoint that happens at
startup is a "shutdown checkpoint" and thus skips logging a standby
snapshot, since a shutdown checkpoint is a sure indicator that there
are no running transactions. With the changes, the checkpoint at
startup happens after we've started allowing write transactions, and
thus a standby snapshot needs to be logged also. In the cases where
the end-of-recovery record was already being used, the problem could
have happened already, except for the fact that those cases involve a
standby promotion, which doesn't happen during initdb. I explored a
few possible ways of solving this problem.

Shouldn't latestCompletedXid be set to MaxTransactionId in this case? Or
is this related to the logic in FullTransactionIdRetreat() that avoids
skipping over the "actual" special transaction IDs?

So ... I decided that the best approach here seems to be to just skip
FirstNormalTransactionId and use FirstNormalTransactionId + 1 for the
first write transaction that the cluster ever processes. That's very
simple and doesn't seem likely to break anything else. On the downside
it seems a bit grotty, but I don't see anything better, and on the
whole, I think with this approach we come out substantially ahead.
0001 removes 3 times as many lines as it inserts, and 0002 saves a few
more lines of code.

This doesn't seem all that bad to me. It's a little hacky, but it's very
easy to understand and only happens once per initdb. I don't think it's
worth any extra complexity.

Now, I still don't really know that there isn't some theoretical
difficulty here that makes this whole approach a non-starter, but I
also can't think of what it might be. If the promotion case, which has
used the end-of-recovery record for many years, is basically safe,
despite the fact that it switches TLIs, then it seems to me that the
crash recovery case, which doesn't have that complication, ought to be
safe too. But I might well be missing something, so if you see a
problem, please speak up!

Your reasoning seems sound to me.

[0]: /messages/by-id/C1EE64B0-D4DB-40F3-98C8-0CED324D34CB@amazon.com

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

#35Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Nathan Bossart (#34)
Re: using an end-of-recovery record in all cases

At Tue, 19 Apr 2022 13:37:59 -0700, Nathan Bossart <nathandbossart@gmail.com> wrote in

On Mon, Apr 18, 2022 at 04:44:03PM -0400, Robert Haas wrote:

Here is a new version of the patch which, unlike v1, I think is
something we could seriously consider applying (not before v16, of
course). It now removes CHECKPOINT_END_OF_RECOVERY completely, and I
attach a second patch as well which nukes checkPoint.PrevTimeLineID as
well.

I'd like to add a big +1 for this change. IIUC this should help with some
of the problems I've noted elsewhere [0].

Agreed.

I mentioned two problems with $SUBJECT in the first email with this
subject line. One was a bug, which Noah has since fixed (thanks,
Noah). The other problem is that LogStandbySnapshot() and a bunch of
its friends expect latestCompletedXid to always be a normal XID, which
is a problem because (1) we currently set nextXid to 3 and (2) at
startup, we compute latestCompletedXid = nextXid - 1. The current code
dodges this kind of accidentally: the checkpoint that happens at
startup is a "shutdown checkpoint" and thus skips logging a standby
snapshot, since a shutdown checkpoint is a sure indicator that there
are no running transactions. With the changes, the checkpoint at
startup happens after we've started allowing write transactions, and
thus a standby snapshot needs to be logged also. In the cases where
the end-of-recovery record was already being used, the problem could
have happened already, except for the fact that those cases involve a
standby promotion, which doesn't happen during initdb. I explored a
few possible ways of solving this problem.

Shouldn't latestCompletedXid be set to MaxTransactionId in this case? Or
is this related to the logic in FullTransactionIdRetreat() that avoids
skipping over the "actual" special transaction IDs?

As a result, FullTransactionIdRetreat(FirstNormalFullTransactionId)
results in FrozenTransactionId, which looks odd. It seems to me it
should rather be InvalidFullTransactionId, or that case should simply
be asserted out. But incrementing the very first xid avoids all that
complexity. It is somewhat hacky but very simple and understandable.
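
To make that concrete, here is a tiny standalone sketch using the
special XID values from transam.h; the retreat logic is paraphrased
rather than copied, so treat it as an illustration only:

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Special 32-bit XID values as defined in access/transam.h */
#define FrozenTransactionId        ((uint32_t) 2)
#define FirstNormalTransactionId   ((uint32_t) 3)

/*
 * Simplified model of FullTransactionIdRetreat(): decrement, but only
 * skip the special low values once the 64-bit counter has left the
 * first epoch.  This paraphrases the real logic; it is not the actual code.
 */
static uint64_t
retreat(uint64_t fxid)
{
	fxid--;
	if (fxid < FirstNormalTransactionId)
		return fxid;            /* still in the first epoch: 0..2 survive */
	while ((uint32_t) fxid < FirstNormalTransactionId)
		fxid--;                 /* later epochs: step over 0..2 */
	return fxid;
}

int
main(void)
{
	/* Retreating from the first normal XID lands on FrozenTransactionId. */
	assert(retreat(FirstNormalTransactionId) == FrozenTransactionId);

	/* Past the first epoch, the three special values are skipped over. */
	assert((uint32_t) retreat((1ULL << 32) + FirstNormalTransactionId) == UINT32_MAX);

	printf("retreat(3) = %llu\n", (unsigned long long) retreat(3));
	return 0;
}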

So ... I decided that the best approach here seems to be to just skip
FirstNormalTransactionId and use FirstNormalTransactionId + 1 for the
first write transaction that the cluster ever processes. That's very
simple and doesn't seem likely to break anything else. On the downside
it seems a bit grotty, but I don't see anything better, and on the
whole, I think with this approach we come out substantially ahead.
0001 removes 3 times as many lines as it inserts, and 0002 saves a few
more lines of code.

This doesn't seem all that bad to me. It's a little hacky, but it's very
easy to understand and only happens once per initdb. I don't think it's
worth any extra complexity.

+1.

Now, I still don't really know that there isn't some theoretical
difficulty here that makes this whole approach a non-starter, but I
also can't think of what it might be. If the promotion case, which has
used the end-of-recovery record for many years, is basically safe,
despite the fact that it switches TLIs, then it seems to me that the
crash recovery case, which doesn't have that complication, ought to be
safe too. But I might well be missing something, so if you see a
problem, please speak up!

Your reasoning seems sound to me.

[0] /messages/by-id/C1EE64B0-D4DB-40F3-98C8-0CED324D34CB@amazon.com

FWIW, I don't find a flaw in the reasoning either.

By the way, do we need to keep CheckPoint.PrevTimeLineID? It always
has the same value as ThisTimeLineID. The most dubious part is
ApplyWalRecord, but XLOG_CHECKPOINT_SHUTDOWN no longer induces a
timeline switch.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#36Amul Sul
sulamul@gmail.com
In reply to: Robert Haas (#31)
Re: using an end-of-recovery record in all cases

On Tue, Apr 19, 2022 at 2:14 AM Robert Haas <robertmhaas@gmail.com> wrote:

On Sat, Jan 15, 2022 at 11:52 PM Julien Rouhaud <rjuju123@gmail.com> wrote:

The cfbot reports that this version of the patch doesn't apply anymore:

Here is a new version of the patch which, unlike v1, I think is
something we could seriously consider applying (not before v16, of
course). It now removes CHECKPOINT_END_OF_RECOVERY completely, and I
attach a second patch as well which nukes checkPoint.PrevTimeLineID as
well.

I mentioned two problems with $SUBJECT in the first email with this
subject line. One was a bug, which Noah has since fixed (thanks,
Noah). The other problem is that LogStandbySnapshot() and a bunch of
its friends expect latestCompletedXid to always be a normal XID, which
is a problem because (1) we currently set nextXid to 3 and (2) at
startup, we compute latestCompletedXid = nextXid - 1. The current code
dodges this kind of accidentally: the checkpoint that happens at
startup is a "shutdown checkpoint" and thus skips logging a standby
snapshot, since a shutdown checkpoint is a sure indicator that there
are no running transactions. With the changes, the checkpoint at
startup happens after we've started allowing write transactions, and
thus a standby snapshot needs to be logged also. In the cases where
the end-of-recovery record was already being used, the problem could
have happened already, except for the fact that those cases involve a
standby promotion, which doesn't happen during initdb. I explored a
few possible ways of solving this problem.

The first thing I considered was replacing latestCompletedXid with a
name like incompleteXidHorizon. The idea is that, where
latestCompletedXid is the highest XID that is known to have committed
or aborted, incompleteXidHorizon would be the lowest XID such that all
known commits or aborts are for lower XIDs. In effect,
incompleteXidHorizon would be latestCompletedXid + 1. Since
latestCompletedXid is always normal except when it's 2,
incompleteXidHorizon would be normal in all cases. Initially this
seemed fairly promising, but it kind of fell down when I realized that
we copy latestCompletedXid into
ComputeXidHorizonsResult.latest_completed. That seemed to me to make
the consequences of the change a bit more far-reaching than I liked.
Also, it wasn't entirely clear to me that I wouldn't be introducing
any off-by-one errors into various wraparound calculations with this
approach.

The second thing I considered was skipping LogStandbySnapshot() during
initdb. There are two ways of doing this and neither of them seem that
great to me. Something that does work is to skip LogStandbySnapshot()
when in single user mode, but there is no particular reason why WAL
generated in single user mode couldn't be replayed on a standby, so
this doesn't seem great. It's too big a hammer for what we really
want, which is just to suppress this during initdb. Another way of
approaching it is to skip it in bootstrap mode, but that actually
doesn't work: initdb then fails during the post-bootstrapping step
rather than during bootstrapping. I thought about patching around that
by forcing the code that generates checkpoint records to forcibly
advance an XID of 3 to 4, but that seemed like working around the
problem from the wrong end.

So ... I decided that the best approach here seems to be to just skip
FirstNormalTransactionId and use FirstNormalTransactionId + 1 for the
first write transaction that the cluster ever processes. That's very
simple and doesn't seem likely to break anything else. On the downside
it seems a bit grotty, but I don't see anything better, and on the
whole, I think with this approach we come out substantially ahead.

IIUC, the failure was something like this on initdb:

running bootstrap script ... TRAP:
FailedAssertion("TransactionIdIsNormal(CurrentRunningXacts->latestCompletedXid)",
File: "procarray.c", Line: 2892, PID: 60363)

/bin/postgres(ExceptionalCondition+0xb9)[0xb3917d]
/bin/postgres(GetRunningTransactionData+0x36c)[0x96aa26]
/bin/postgres(LogStandbySnapshot+0x64)[0x974393]
/bin/postgres(CreateCheckPoint+0x67f)[0x5928bf]
/bin/postgres(RequestCheckpoint+0x26)[0x8ca649]
/bin/postgres(StartupXLOG+0xf51)[0x591126]
/bin/postgres(InitPostgres+0x188)[0xb4f2ac]
/bin/postgres(BootstrapModeMain+0x4d3)[0x5ac6de]
/bin/postgres(main+0x275)[0x7ca72e]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f71af82d445]
/bin/postgres[0x48aae9]
child process was terminated by signal 6: Aborted
initdb: removing data directory "/inst/data"

That was happening because RequestCheckpoint() was called from StartupXLOG()
unconditionally, but with the v5 patch that is not true.

If my understanding is correct then we don't need any handling
for latestCompletedXid, at least in this patch.

Regards,
Amul

#37Robert Haas
robertmhaas@gmail.com
In reply to: Nathan Bossart (#34)
Re: using an end-of-recovery record in all cases

On Tue, Apr 19, 2022 at 4:38 PM Nathan Bossart <nathandbossart@gmail.com> wrote:

Shouldn't latestCompletedXid be set to MaxTransactionId in this case? Or
is this related to the logic in FullTransactionIdRetreat() that avoids
skipping over the "actual" special transaction IDs?

The problem here is this code:

/* also initialize latestCompletedXid, to nextXid - 1 */
LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
ShmemVariableCache->latestCompletedXid = ShmemVariableCache->nextXid;
FullTransactionIdRetreat(&ShmemVariableCache->latestCompletedXid);
LWLockRelease(ProcArrayLock);

If nextXid is 3, then latestCompletedXid gets 2. But in
GetRunningTransactionData:

Assert(TransactionIdIsNormal(CurrentRunningXacts->latestCompletedXid));
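
Plugging in the constants from transam.h makes that concrete; this is a
throwaway demonstration of the arithmetic (including the proposed "skip
FirstNormalTransactionId" bump), not code from the tree:

#include <stdint.h>
#include <stdio.h>

/* As defined in access/transam.h */
#define FrozenTransactionId        ((uint32_t) 2)
#define FirstNormalTransactionId   ((uint32_t) 3)
#define TransactionIdIsNormal(xid) ((xid) >= FirstNormalTransactionId)

int
main(void)
{
	/* First startup today: nextXid is 3, so latestCompletedXid becomes 2. */
	uint32_t	nextXid = FirstNormalTransactionId;
	uint32_t	latestCompletedXid = nextXid - 1;	/* FrozenTransactionId */

	printf("TransactionIdIsNormal(%u) = %d\n",
		   latestCompletedXid, TransactionIdIsNormal(latestCompletedXid));

	/* With the bump, the first write transaction gets XID 4 instead. */
	nextXid = FirstNormalTransactionId + 1;
	latestCompletedXid = nextXid - 1;			/* 3, which is normal */

	printf("TransactionIdIsNormal(%u) = %d\n",
		   latestCompletedXid, TransactionIdIsNormal(latestCompletedXid));
	return 0;
}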

Your reasoning seems sound to me.

I was talking with Thomas Munro yesterday and he thinks there is a
problem with relfilenode reuse here. In normal running, when a
relation is dropped, we leave behind a 0-length file until the next
checkpoint; this keeps that relfilenode from being used even if the
OID counter wraps around. If we didn't do that, then imagine that,
while running with wal_level=minimal, we drop an existing relation,
create a new relation with the same OID, load some data into it, and
crash, all within the same checkpoint cycle. We will be able to
replay the drop, but we will not be able to restore the relation
contents afterward, because at wal_level=minimal they are not logged.
Apparently, we don't create tombstone files during recovery because we
know that there will be a checkpoint at the end.

With the existing use of the end-of-recovery record, we always know
that wal_level>minimal, because we're only using it on standbys. But
with this use that wouldn't be true any more. So I guess we need to
start creating tombstone files even during recovery, or else do
something like what Dilip coded up in
/messages/by-id/CAFiTN-u=r8UTCSzu6_pnihYAtwR1=esq5sRegTEZ2tLa92fovA@mail.gmail.com
which I think would be a better solution at least in the long term.
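
To make the hazard easier to follow, here is a self-contained toy model
of the tombstone behaviour described above; the helper name and file
handling are invented for illustration and deliberately do not match
what md.c really does:

#include <fcntl.h>
#include <stdbool.h>
#include <unistd.h>

/*
 * Toy model: in normal running, dropping a relation truncates its file to
 * zero length and leaves that "tombstone" behind until the next checkpoint,
 * so the relfilenode cannot be handed out again in the meantime.  During
 * recovery the file is unlinked immediately, on the assumption that recovery
 * always ends with a checkpoint -- the assumption this patch removes.
 */
static void
drop_relation_file(const char *path, bool in_recovery)
{
	if (in_recovery)
	{
		/* Immediate unlink: this is what re-opens the reuse window once
		 * recovery no longer ends with a checkpoint. */
		unlink(path);
	}
	else
	{
		/* Leave a zero-length tombstone; a later checkpoint unlinks it. */
		int			fd = open(path, O_WRONLY | O_TRUNC);

		if (fd >= 0)
			close(fd);
	}
}

int
main(void)
{
	/* Example usage with a scratch file standing in for a relfilenode. */
	drop_relation_file("relfilenode_16384", false);	/* tombstone kept */
	drop_relation_file("relfilenode_16384", true);	/* gone at once */
	return 0;
}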

--
Robert Haas
EDB: http://www.enterprisedb.com

#38Nathan Bossart
nathandbossart@gmail.com
In reply to: Robert Haas (#37)
Re: using an end-of-recovery record in all cases

On Wed, Apr 20, 2022 at 09:26:07AM -0400, Robert Haas wrote:

I was talking with Thomas Munro yesterday and he thinks there is a
problem with relfilenode reuse here. In normal running, when a
relation is dropped, we leave behind a 0-length file until the next
checkpoint; this keeps that relfilenode from being used even if the
OID counter wraps around. If we didn't do that, then imagine that,
while running with wal_level=minimal, we drop an existing relation,
create a new relation with the same OID, load some data into it, and
crash, all within the same checkpoint cycle. We will be able to
replay the drop, but we will not be able to restore the relation
contents afterward, because at wal_level=minimal they are not logged.
Apparently, we don't create tombstone files during recovery because we
know that there will be a checkpoint at the end.

In the example you provided, won't the tombstone file already be present
before the crash? During recovery, the tombstone file will be removed, and
the new relation wouldn't use the same relfilenode anyway. I'm probably
missing something obvious here.

I do see the problem if we drop an existing relation, crash, reuse the
filenode, and then crash again (all within the same checkpoint cycle). The
first recovery would remove the tombstone file, and the second recovery
would wipe out the new relation's files.

With the existing use of the end-of-recovery record, we always know
that wal_level>minimal, because we're only using it on standbys. But
with this use that wouldn't be true any more. So I guess we need to
start creating tombstone files even during recovery, or else do
something like what Dilip coded up in
/messages/by-id/CAFiTN-u=r8UTCSzu6_pnihYAtwR1=esq5sRegTEZ2tLa92fovA@mail.gmail.com
which I think would be a better solution at least in the long term.

IMO this would be good just to reduce the branching a bit. I suppose
removing the files immediately during recovery might be an optimization in
some cases, but I am skeptical that it really makes that much of a
difference in practice.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

#39Thomas Munro
thomas.munro@gmail.com
In reply to: Nathan Bossart (#38)
Re: using an end-of-recovery record in all cases

On Thu, Apr 21, 2022 at 5:02 AM Nathan Bossart <nathandbossart@gmail.com> wrote:

I do see the problem if we drop an existing relation, crash, reuse the
filenode, and then crash again (all within the same checkpoint cycle). The
first recovery would remove the tombstone file, and the second recovery
would wipe out the new relation's files.

Right, the double-crash case is what I was worrying about. I'm not
sure, but it might even be more likely than usual that you'll reuse
the same relfilenode after the first crash, because the OID allocator
will start from the same value.