Background writer and checkpointer in crash recovery

Started by Thomas Munro over 5 years ago, 24 messages
#1Thomas Munro
thomas.munro@gmail.com
1 attachment(s)

(Forking this thread from the SLRU fsync one[1] to allow for a
separate CF entry; it's unrelated, except for being another case of
moving work off the recovery process's plate.)

Hello hackers,

Currently we don't run the bgwriter process during crash recovery.
I've CCed Simon and Heikki who established this in commit cdd46c76.
Based on that commit message, I think the bar to clear to change the
policy is to show that it's useful, and that it doesn't make crash
recovery less robust. See the other thread for some initial evidence
of usefulness from Jakub Wartak. I think it also just makes intuitive
sense that it's got to help bigger-than-buffer-pool crash recovery if
you can shift buffer eviction out of the recovery loop. As for
robustness, I suppose we could provide the option to turn it off just
in case that turns out to be useful in some scenarios, but I'm
wondering why we would expect something that we routinely run in
archive/streaming recovery to reduce robustness in only slightly
different circumstances.
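
As a rough sketch of what that shift looks like (simplified pseudocode
with invented helper names, not the actual redo code): without a
bgwriter, whenever replay needs a page that isn't cached and every
victim buffer is dirty, the startup process has to do the write itself
before it can continue.

    /* simplified crash-recovery redo loop; helper names are made up */
    while ((record = ReadNextWalRecord()) != NULL)
    {
        Buffer  buf = FindOrEvictBuffer(record_target_page(record));

        if (BufferNeededEviction(buf) && BufferIsDirty(buf))
            WriteBufferOut(buf);    /* startup process stalls on this I/O;
                                     * a bgwriter could have cleaned it already */

        ApplyRedoRecord(record, buf);
    }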

Here's an experiment-grade patch, comments welcome, though at this
stage it's primarily thoughts about the concept that I'm hoping to
solicit.

One question raised by Jakub that I don't have a great answer to right
now is whether you'd want different bgwriter settings in this scenario
for best results. I don't know, but I suspect that the answer is to
make bgwriter more adaptive rather than more configurable, and that's
an orthogonal project.

Once we had the checkpointer running, we could also consider making
the end-of-recovery checkpoint optional, or at least the wait for it
to complete. I haven't shown that in this patch, but it's just
different checkpoint request flags and possibly an end-of-recovery
record. What problems do you foresee with that? Why should we have
"fast" promotion but not "fast" crash recovery?

[1]: /messages/by-id/CA+hUKGLJ=84YT+NvhkEEDAuUtVHMfQ9i-N7k_o50JmQ6Rpj_OQ@mail.gmail.com

Attachments:

0001-Run-checkpointer-and-bgworker-in-crash-recovery.patch (text/x-patch)
From 21d7d459a6076f85c743b739f1c4ba16451b7046 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Wed, 26 Aug 2020 16:34:33 +1200
Subject: [PATCH 1/2] Run checkpointer and bgworker in crash recovery.

Relieve the startup process of some writeback and checkpointing work
during crash recovery, making it more like replication recovery.  This
wasn't done back in commit cdd46c76 because it wasn't thought to be
useful at the time.  The theory of this patch is that there may be
workloads where we can profitably move a bunch of buffer eviction work
out of the recovery process.

In order to have a bgwriter running, you also need a checkpointer.
While you could give startup and bgwriter their own backend private
pending ops table, it's nicer to merge their requests in one place.

XXX This is just an experimental prototype patch and may contain all
kinds of bugs!
---
 src/backend/access/transam/xlog.c     | 33 ++++++++++-----------------
 src/backend/postmaster/bgwriter.c     |  3 ---
 src/backend/postmaster/checkpointer.c |  3 ---
 src/backend/postmaster/postmaster.c   | 17 ++++++--------
 src/backend/storage/sync/sync.c       | 30 +++---------------------
 src/include/storage/sync.h            |  1 -
 6 files changed, 22 insertions(+), 65 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 09c01ed4ae..d8080ed4f5 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -870,9 +870,6 @@ bool		reachedConsistency = false;
 
 static bool InRedo = false;
 
-/* Have we launched bgwriter during recovery? */
-static bool bgwriterLaunched = false;
-
 /* For WALInsertLockAcquire/Release functions */
 static int	MyLockNo = 0;
 static bool holdingAllLocks = false;
@@ -7123,25 +7120,14 @@ StartupXLOG(void)
 		/* Also ensure XLogReceiptTime has a sane value */
 		XLogReceiptTime = GetCurrentTimestamp();
 
+		PublishStartupProcessInformation();
+
 		/*
 		 * Let postmaster know we've started redo now, so that it can launch
-		 * checkpointer to perform restartpoints.  We don't bother during
-		 * crash recovery as restartpoints can only be performed during
-		 * archive recovery.  And we'd like to keep crash recovery simple, to
-		 * avoid introducing bugs that could affect you when recovering after
-		 * crash.
-		 *
-		 * After this point, we can no longer assume that we're the only
-		 * process in addition to postmaster!  Also, fsync requests are
-		 * subsequently to be handled by the checkpointer, not locally.
+		 * the archiver if necessary.
 		 */
-		if (ArchiveRecoveryRequested && IsUnderPostmaster)
-		{
-			PublishStartupProcessInformation();
-			EnableSyncRequestForwarding();
+		if (IsUnderPostmaster)
 			SendPostmasterSignal(PMSIGNAL_RECOVERY_STARTED);
-			bgwriterLaunched = true;
-		}
 
 		/*
 		 * Allow read-only connections immediately if we're consistent
@@ -7729,7 +7715,7 @@ StartupXLOG(void)
 		 * after we're fully out of recovery mode and already accepting
 		 * queries.
 		 */
-		if (bgwriterLaunched)
+		if (ArchiveRecoveryRequested && IsUnderPostmaster)
 		{
 			if (LocalPromoteIsTriggered)
 			{
@@ -7764,8 +7750,13 @@ StartupXLOG(void)
 								  CHECKPOINT_IMMEDIATE |
 								  CHECKPOINT_WAIT);
 		}
+		else if (IsUnderPostmaster)
+			RequestCheckpoint(CHECKPOINT_END_OF_RECOVERY |
+							  CHECKPOINT_IMMEDIATE |
+							  CHECKPOINT_WAIT);
 		else
-			CreateCheckPoint(CHECKPOINT_END_OF_RECOVERY | CHECKPOINT_IMMEDIATE);
+			CreateCheckPoint(CHECKPOINT_END_OF_RECOVERY |
+							 CHECKPOINT_IMMEDIATE);
 	}
 
 	if (ArchiveRecoveryRequested)
@@ -11891,7 +11882,7 @@ XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
 		 * Request a restartpoint if we've replayed too much xlog since the
 		 * last one.
 		 */
-		if (bgwriterLaunched)
+		if (ArchiveRecoveryRequested && IsUnderPostmaster)
 		{
 			if (XLogCheckpointNeeded(readSegNo))
 			{
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 069e27e427..56de0bee18 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -12,9 +12,6 @@
  *
  * As of Postgres 9.2 the bgwriter no longer handles checkpoints.
  *
- * The bgwriter is started by the postmaster as soon as the startup subprocess
- * finishes, or as soon as recovery begins if we are doing archive recovery.
- * It remains alive until the postmaster commands it to terminate.
  * Normal termination is by SIGTERM, which instructs the bgwriter to exit(0).
  * Emergency termination is by SIGQUIT; like any backend, the bgwriter will
  * simply abort and exit on SIGQUIT.
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 624a3238b8..883a0df198 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -10,9 +10,6 @@
  * fill WAL segments; the checkpointer itself doesn't watch for the
  * condition.)
  *
- * The checkpointer is started by the postmaster as soon as the startup
- * subprocess finishes, or as soon as recovery begins if we are doing archive
- * recovery.  It remains alive until the postmaster commands it to terminate.
  * Normal termination is by SIGUSR2, which instructs the checkpointer to
  * execute a shutdown checkpoint and then exit(0).  (All backends must be
  * stopped before SIGUSR2 is issued!)  Emergency termination is by SIGQUIT;
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 42223c0f61..47cf1c11f7 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -1400,6 +1400,12 @@ PostmasterMain(int argc, char *argv[])
 	 */
 	AddToDataDirLockFile(LOCK_FILE_LINE_PM_STATUS, PM_STATUS_STARTING);
 
+	/* Start bgwriter and checkpointer so they can help with recovery */
+	if (CheckpointerPID == 0)
+		CheckpointerPID = StartCheckpointer();
+	if (BgWriterPID == 0)
+		BgWriterPID = StartBackgroundWriter();
+
 	/*
 	 * We're ready to rock and roll...
 	 */
@@ -1761,7 +1767,7 @@ ServerLoop(void)
 		 * fails, we'll just try again later.  Likewise for the checkpointer.
 		 */
 		if (pmState == PM_RUN || pmState == PM_RECOVERY ||
-			pmState == PM_HOT_STANDBY)
+			pmState == PM_HOT_STANDBY || pmState == PM_STARTUP)
 		{
 			if (CheckpointerPID == 0)
 				CheckpointerPID = StartCheckpointer();
@@ -5208,15 +5214,6 @@ sigusr1_handler(SIGNAL_ARGS)
 		FatalError = false;
 		AbortStartTime = 0;
 
-		/*
-		 * Crank up the background tasks.  It doesn't matter if this fails,
-		 * we'll just try again later.
-		 */
-		Assert(CheckpointerPID == 0);
-		CheckpointerPID = StartCheckpointer();
-		Assert(BgWriterPID == 0);
-		BgWriterPID = StartBackgroundWriter();
-
 		/*
 		 * Start the archiver if we're responsible for (re-)archiving received
 		 * files.
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index 3ded2cdd71..27c28b8562 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -107,10 +107,10 @@ InitSync(void)
 {
 	/*
 	 * Create pending-operations hashtable if we need it.  Currently, we need
-	 * it if we are standalone (not under a postmaster) or if we are a startup
-	 * or checkpointer auxiliary process.
+	 * it if we are standalone (not under a postmaster) or if we are a
+	 * checkpointer auxiliary process.
 	 */
-	if (!IsUnderPostmaster || AmStartupProcess() || AmCheckpointerProcess())
+	if (!IsUnderPostmaster || AmCheckpointerProcess())
 	{
 		HASHCTL		hash_ctl;
 
@@ -568,27 +568,3 @@ RegisterSyncRequest(const FileTag *ftag, SyncRequestType type,
 
 	return ret;
 }
-
-/*
- * In archive recovery, we rely on checkpointer to do fsyncs, but we will have
- * already created the pendingOps during initialization of the startup
- * process.  Calling this function drops the local pendingOps so that
- * subsequent requests will be forwarded to checkpointer.
- */
-void
-EnableSyncRequestForwarding(void)
-{
-	/* Perform any pending fsyncs we may have queued up, then drop table */
-	if (pendingOps)
-	{
-		ProcessSyncRequests();
-		hash_destroy(pendingOps);
-	}
-	pendingOps = NULL;
-
-	/*
-	 * We should not have any pending unlink requests, since mdunlink doesn't
-	 * queue unlink requests when isRedo.
-	 */
-	Assert(pendingUnlinks == NIL);
-}
diff --git a/src/include/storage/sync.h b/src/include/storage/sync.h
index e16ab8e711..9622854c71 100644
--- a/src/include/storage/sync.h
+++ b/src/include/storage/sync.h
@@ -55,7 +55,6 @@ extern void SyncPreCheckpoint(void);
 extern void SyncPostCheckpoint(void);
 extern void ProcessSyncRequests(void);
 extern void RememberSyncRequest(const FileTag *ftag, SyncRequestType type);
-extern void EnableSyncRequestForwarding(void);
 extern bool RegisterSyncRequest(const FileTag *ftag, SyncRequestType type,
 								bool retryOnError);
 
-- 
2.20.1

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Thomas Munro (#1)
Re: Background writer and checkpointer in crash recovery

Thomas Munro <thomas.munro@gmail.com> writes:

Once we had the checkpointer running, we could also consider making
the end-of-recovery checkpoint optional, or at least the wait for it
to complete. I haven't shown that in this patch, but it's just
different checkpoint request flags and possibly an end-of-recovery
record. What problems do you foresee with that? Why should we have
"fast" promotion but not "fast" crash recovery?

I think that the rationale for that had something to do with trying
to reduce the costs of repeated crashes. If you've had one crash,
you might well have another one in your near future, due to the same
problem recurring. Therefore, avoiding multiple replays of the same WAL
is attractive; and to do that we have to complete a checkpoint before
giving users a chance to crash us again.

I'm not sure what I think about your primary proposal here. On the
one hand, optimizing crash recovery ought to be pretty darn far down
our priority list; if it happens often enough for performance to be
a consideration then we have worse problems to fix. Also, at least
in theory, not running the bgwriter/checkpointer makes for fewer moving
parts and thus better odds of completing crash recovery successfully.
On the other hand, it could be argued that running without the
bgwriter/checkpointer actually makes recovery less reliable, since
that's a far less thoroughly-tested operating regime than when they're
active.

tl;dr: I think this should be much less about performance than about
whether we successfully recover or not. Maybe there's still a case for
changing in that framework, though.

regards, tom lane

#3Simon Riggs
simon@2ndquadrant.com
In reply to: Tom Lane (#2)
1 attachment(s)
Re: Background writer and checkpointer in crash recovery

On Sun, 30 Aug 2020 at 01:39, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Thomas Munro <thomas.munro@gmail.com> writes:

Once we had the checkpointer running, we could also consider making
the end-of-recovery checkpoint optional, or at least the wait for it
to complete. I haven't shown that in this patch, but it's just
different checkpoint request flags and possibly an end-of-recovery
record. What problems do you foresee with that? Why should we have
"fast" promotion but not "fast" crash recovery?

I think that the rationale for that had something to do with trying
to reduce the costs of repeated crashes. If you've had one crash,
you might well have another one in your near future, due to the same
problem recurring. Therefore, avoiding multiple replays of the same WAL
is attractive; and to do that we have to complete a checkpoint before
giving users a chance to crash us again.

Agreed. That rationale is shown in comments and in the commit message.

"We could launch it during crash recovery as well, but it seems better to keep
that codepath as simple as possible, for the sake of robustness. And it
couldn't do any restartpoints during crash recovery anyway, so it wouldn't be
that useful."

Having said that, we did raise the checkpoint_timeout by a lot, so the
situation today might be quite different. A large checkpoint_timeout
could eventually overflow shared buffers, with the right workload.

We don't have any stats to show whether this patch is worthwhile or
not, so I suggest adding the attached instrumentation patch as well so
we can see on production systems whether checkpoint_timeout is too
high by comparison with pg_stat_bgwriter. The patch is written in the
style of log_checkpoints.

--
Simon Riggs http://www.EnterpriseDB.com/

Attachments:

crash_recovery_stats.v1.patch (application/octet-stream)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index a1078a7cfc..736599d2fb 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7443,6 +7443,29 @@ StartupXLOG(void)
 					(errmsg("redo done at %X/%X system usage: %s",
 							(uint32) (ReadRecPtr >> 32), (uint32) ReadRecPtr,
 							pg_rusage_show(&ru0))));
+			if (!InArchiveRecovery)
+			{
+				if (!INSTR_TIME_IS_ZERO(pgBufferUsage.blk_read_time))
+					ereport(LOG,
+						(errmsg("crash recovery complete: wrote %ld buffers (%.1f%%) (%0.3f ms); "
+								"dirtied %ld buffers; read %ld buffers (%0.3f ms)",
+								pgBufferUsage.shared_blks_written,
+								(double) pgBufferUsage.shared_blks_written * 100 / NBuffers,
+								INSTR_TIME_GET_MILLISEC(pgBufferUsage.blk_write_time),
+								pgBufferUsage.shared_blks_dirtied,
+								pgBufferUsage.shared_blks_read,
+								INSTR_TIME_GET_MILLISEC(pgBufferUsage.blk_read_time)
+								)));
+				else
+					ereport(LOG,
+						(errmsg("crash recovery complete: wrote %ld buffers (%.1f%%); "
+								"dirtied %ld buffers; read %ld buffers",
+								pgBufferUsage.shared_blks_written,
+								(double) pgBufferUsage.shared_blks_written * 100 / NBuffers,
+								pgBufferUsage.shared_blks_dirtied,
+								pgBufferUsage.shared_blks_read
+								)));
+			}
 			xtime = GetLatestXTime();
 			if (xtime)
 				ereport(LOG,
#4Thomas Munro
thomas.munro@gmail.com
In reply to: Simon Riggs (#3)
Re: Background writer and checkpointer in crash recovery

On Wed, Nov 11, 2020 at 9:57 PM Simon Riggs <simon@2ndquadrant.com> wrote:

Having said that, we did raise the checkpoint_timeout by a lot, so the
situation today might be quite different. A large checkpoint_timeout
could eventually overflow shared buffers, with the right workload.

FWIW Jakub Wartak did manage to show a 1.64x speedup while running
crash recovery of an insert-only workload (with a variant of this
patch that I shared in another thread), albeit with aggressive tuning:

/messages/by-id/VI1PR0701MB6960EEB838D53886D8A180E3F6520@VI1PR0701MB6960.eurprd07.prod.outlook.com

We don't have any stats to show whether this patch is worthwhile or
not, so I suggest adding the attached instrumentation patch as well so
we can see on production systems whether checkpoint_timeout is too
high by comparison with pg_stat_bgwriter. The patch is written in the
style of log_checkpoints.

Very useful. I've also been wondering how to get that sort of
information in hot standby.

#5Robert Haas
robertmhaas@gmail.com
In reply to: Thomas Munro (#1)
Re: Background writer and checkpointer in crash recovery

On Sat, Aug 29, 2020 at 8:13 PM Thomas Munro <thomas.munro@gmail.com> wrote:

Currently we don't run the bgwriter process during crash recovery.
I've CCed Simon and Heikki who established this in commit cdd46c76.
Based on that commit message, I think the bar to clear to change the
policy is to show that it's useful, and that it doesn't make crash
recovery less robust. See the other thread for some initial evidence
of usefulness from Jakub Wartak. I think it also just makes intuitive
sense that it's got to help bigger-than-buffer-pool crash recovery if
you can shift buffer eviction out of the recovery loop. As for
robustness, I suppose we could provide the option to turn it off just
in case that turns out to be useful in some scenarios, but I'm
wondering why we would expect something that we routinely run in
archive/streaming recovery to reduce robustness in only slightly
different circumstances.

Here's an experiment-grade patch, comments welcome, though at this
stage it's primarily thoughts about the concept that I'm hoping to
solicit.

I think the way it works right now is stupid and the proposed change
is going in the right direction. We have ample evidence already that
handing off fsyncs to a background process is a good idea, and there's
no reason why that shouldn't be beneficial during crash recovery just
as it is at other times. But even if it somehow failed to improve
performance during recovery, there's another good reason to do this,
which is that it would make the code simpler. Having the pendingOps
stuff in the startup process in some recovery situations and in the
checkpointer in other recovery situations makes this harder to reason
about. As Tom said, the system state where bgwriter and checkpointer
are not running is an uncommon one, and is probably more likely to
have (or grow) bugs than the state where they are running.

The rat's-nest of logic introduced by the comment "Perform a
checkpoint to update all our recovery activity to disk." inside
StartupXLOG() could really do with some simplification. Right now we
have three cases: CreateEndOfRecoveryRecord(), RequestCheckpoint(),
and CreateCheckpoint(). Maybe with this change we could get it down to
just two, since RequestCheckpoint() already knows what to do about
!IsUnderPostmaster.
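
For example, something along these lines (just a sketch, not a
worked-out patch):

    if (ArchiveRecoveryRequested && IsUnderPostmaster &&
        LocalPromoteIsTriggered)
    {
        /* promotion: emit an end-of-recovery record instead of checkpointing now */
        promoted = true;
        CreateEndOfRecoveryRecord();
    }
    else
    {
        /*
         * Everything else, including single-user mode, since
         * RequestCheckpoint() performs the checkpoint itself when there is
         * no checkpointer process to ask.
         */
        RequestCheckpoint(CHECKPOINT_END_OF_RECOVERY |
                          CHECKPOINT_IMMEDIATE |
                          CHECKPOINT_WAIT);
    }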

--
Robert Haas
EDB: http://www.enterprisedb.com

#6Thomas Munro
thomas.munro@gmail.com
In reply to: Robert Haas (#5)
3 attachment(s)
Re: Background writer and checkpointer in crash recovery

On Wed, Feb 3, 2021 at 11:11 AM Robert Haas <robertmhaas@gmail.com> wrote:

I think the way it works right now is stupid and the proposed change
is going in the right direction. We have ample evidence already that
handing off fsyncs to a background process is a good idea, and there's
no reason why that shouldn't be beneficial during crash recovery just
as it is at other times. But even if it somehow failed to improve
performance during recovery, there's another good reason to do this,
which is that it would make the code simpler. Having the pendingOps
stuff in the startup process in some recovery situations and in the
checkpointer in other recovery situations makes this harder to reason
about. As Tom said, the system state where bgwriter and checkpointer
are not running is an uncommon one, and is probably more likely to
have (or grow) bugs than the state where they are running.

Yeah, it's a good argument.

The rat's-nest of logic introduced by the comment "Perform a
checkpoint to update all our recovery activity to disk." inside
StartupXLOG() could really do with some simplification. Right now we
have three cases: CreateEndOfRecoveryRecord(), RequestCheckpoint(),
and CreateCheckpoint(). Maybe with this change we could get it down to
just two, since RequestCheckpoint() already knows what to do about
!IsUnderPostmaster.

True. Done in this version.

Here's a rebase of this patch + Simon's patch to report on stats.

I also have a sketch patch to provide a GUC that turns off the
end-of-recovery checkpoint as mentioned earlier, attached (sharing
mainly because this is one of the stack of patches that Jakub was
testing for his basebackup/PITR workloads and he might want to test some
more), but I'm not proposing that for PG14. That idea is tangled up
with the "relfile tombstone" stuff I wrote about elsewhere[1], but I
haven't finished studying the full arsenal of footguns in that area
(it's something like: if we don't wait for end-of-recovery checkpoint
before allowing connections, then we'd better start creating
tombstones in recovery unless the WAL level is high enough to avoid
data eating hazards with unlogged changes and a double crash).

[1]: https://commitfest.postgresql.org/33/3030/

Attachments:

v2-0001-Run-checkpointer-and-bgworker-in-crash-recovery.patch (text/x-patch)
From 7a82c9e622ea45c981d05dff08c1d4279ec3c73b Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Wed, 26 Aug 2020 16:34:33 +1200
Subject: [PATCH v2 1/3] Run checkpointer and bgworker in crash recovery.

Relieve the startup process of some writeback and checkpointing work
during crash recovery, making it more like replication recovery.  This
wasn't done back in commit cdd46c76 out of caution, but now it seems
more conservative to have the crash recovery environment work the same
as the more commonly exercised streaming replication environment.

In order to have a bgwriter running, you also need a checkpointer.
While you could give startup and bgwriter their own backend private
pending ops table, it's nicer to merge their requests in one place.

Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Simon Riggs <simon@2ndquadrant.com>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Tested-by:  Jakub Wartak <Jakub.Wartak@tomtom.com>
Discussion: https://postgr.es/m/CA%2BhUKGJ8NRsqgkZEnsnRc2MFROBV-jCnacbYvtpptK2A9YYp9Q%40mail.gmail.com
---
 src/backend/access/transam/xlog.c     | 32 +++++++++------------------
 src/backend/postmaster/bgwriter.c     |  3 ---
 src/backend/postmaster/checkpointer.c |  3 ---
 src/backend/postmaster/postmaster.c   | 17 ++++++--------
 src/backend/storage/sync/sync.c       | 30 +++----------------------
 src/include/storage/sync.h            |  1 -
 6 files changed, 21 insertions(+), 65 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f4d1ce5dea..6e8e6cf1e4 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -880,9 +880,6 @@ bool		reachedConsistency = false;
 
 static bool InRedo = false;
 
-/* Have we launched bgwriter during recovery? */
-static bool bgwriterLaunched = false;
-
 /* For WALInsertLockAcquire/Release functions */
 static int	MyLockNo = 0;
 static bool holdingAllLocks = false;
@@ -7268,25 +7265,14 @@ StartupXLOG(void)
 		/* Also ensure XLogReceiptTime has a sane value */
 		XLogReceiptTime = GetCurrentTimestamp();
 
+		PublishStartupProcessInformation();
+
 		/*
 		 * Let postmaster know we've started redo now, so that it can launch
-		 * checkpointer to perform restartpoints.  We don't bother during
-		 * crash recovery as restartpoints can only be performed during
-		 * archive recovery.  And we'd like to keep crash recovery simple, to
-		 * avoid introducing bugs that could affect you when recovering after
-		 * crash.
-		 *
-		 * After this point, we can no longer assume that we're the only
-		 * process in addition to postmaster!  Also, fsync requests are
-		 * subsequently to be handled by the checkpointer, not locally.
+		 * the archiver if necessary.
 		 */
-		if (ArchiveRecoveryRequested && IsUnderPostmaster)
-		{
-			PublishStartupProcessInformation();
-			EnableSyncRequestForwarding();
+		if (IsUnderPostmaster)
 			SendPostmasterSignal(PMSIGNAL_RECOVERY_STARTED);
-			bgwriterLaunched = true;
-		}
 
 		/*
 		 * Allow read-only connections immediately if we're consistent
@@ -7879,7 +7865,7 @@ StartupXLOG(void)
 		 * after we're fully out of recovery mode and already accepting
 		 * queries.
 		 */
-		if (bgwriterLaunched)
+		if (ArchiveRecoveryRequested && IsUnderPostmaster)
 		{
 			if (LocalPromoteIsTriggered)
 			{
@@ -7915,7 +7901,11 @@ StartupXLOG(void)
 								  CHECKPOINT_WAIT);
 		}
 		else
-			CreateCheckPoint(CHECKPOINT_END_OF_RECOVERY | CHECKPOINT_IMMEDIATE);
+		{
+			RequestCheckpoint(CHECKPOINT_END_OF_RECOVERY |
+							  CHECKPOINT_IMMEDIATE |
+							  CHECKPOINT_WAIT);
+		}
 	}
 
 	if (ArchiveRecoveryRequested)
@@ -12123,7 +12113,7 @@ XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
 		 * Request a restartpoint if we've replayed too much xlog since the
 		 * last one.
 		 */
-		if (bgwriterLaunched)
+		if (ArchiveRecoveryRequested && IsUnderPostmaster)
 		{
 			if (XLogCheckpointNeeded(readSegNo))
 			{
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 715d5195bb..5584f4bc24 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -12,9 +12,6 @@
  *
  * As of Postgres 9.2 the bgwriter no longer handles checkpoints.
  *
- * The bgwriter is started by the postmaster as soon as the startup subprocess
- * finishes, or as soon as recovery begins if we are doing archive recovery.
- * It remains alive until the postmaster commands it to terminate.
  * Normal termination is by SIGTERM, which instructs the bgwriter to exit(0).
  * Emergency termination is by SIGQUIT; like any backend, the bgwriter will
  * simply abort and exit on SIGQUIT.
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 5907a7befc..7b7f01e1fc 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -10,9 +10,6 @@
  * fill WAL segments; the checkpointer itself doesn't watch for the
  * condition.)
  *
- * The checkpointer is started by the postmaster as soon as the startup
- * subprocess finishes, or as soon as recovery begins if we are doing archive
- * recovery.  It remains alive until the postmaster commands it to terminate.
  * Normal termination is by SIGUSR2, which instructs the checkpointer to
  * execute a shutdown checkpoint and then exit(0).  (All backends must be
  * stopped before SIGUSR2 is issued!)  Emergency termination is by SIGQUIT;
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index edab95a19e..4d5b4b9775 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -1392,6 +1392,12 @@ PostmasterMain(int argc, char *argv[])
 	 */
 	AddToDataDirLockFile(LOCK_FILE_LINE_PM_STATUS, PM_STATUS_STARTING);
 
+	/* Start bgwriter and checkpointer so they can help with recovery */
+	if (CheckpointerPID == 0)
+		CheckpointerPID = StartCheckpointer();
+	if (BgWriterPID == 0)
+		BgWriterPID = StartBackgroundWriter();
+
 	/*
 	 * We're ready to rock and roll...
 	 */
@@ -1754,7 +1760,7 @@ ServerLoop(void)
 		 * fails, we'll just try again later.  Likewise for the checkpointer.
 		 */
 		if (pmState == PM_RUN || pmState == PM_RECOVERY ||
-			pmState == PM_HOT_STANDBY)
+			pmState == PM_HOT_STANDBY || pmState == PM_STARTUP)
 		{
 			if (CheckpointerPID == 0)
 				CheckpointerPID = StartCheckpointer();
@@ -5125,15 +5131,6 @@ sigusr1_handler(SIGNAL_ARGS)
 		FatalError = false;
 		AbortStartTime = 0;
 
-		/*
-		 * Crank up the background tasks.  It doesn't matter if this fails,
-		 * we'll just try again later.
-		 */
-		Assert(CheckpointerPID == 0);
-		CheckpointerPID = StartCheckpointer();
-		Assert(BgWriterPID == 0);
-		BgWriterPID = StartBackgroundWriter();
-
 		/*
 		 * Start the archiver if we're responsible for (re-)archiving received
 		 * files.
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index 708215614d..ee18898ea2 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -129,10 +129,10 @@ InitSync(void)
 {
 	/*
 	 * Create pending-operations hashtable if we need it.  Currently, we need
-	 * it if we are standalone (not under a postmaster) or if we are a startup
-	 * or checkpointer auxiliary process.
+	 * it if we are standalone (not under a postmaster) or if we are a
+	 * checkpointer auxiliary process.
 	 */
-	if (!IsUnderPostmaster || AmStartupProcess() || AmCheckpointerProcess())
+	if (!IsUnderPostmaster || AmCheckpointerProcess())
 	{
 		HASHCTL		hash_ctl;
 
@@ -589,27 +589,3 @@ RegisterSyncRequest(const FileTag *ftag, SyncRequestType type,
 
 	return ret;
 }
-
-/*
- * In archive recovery, we rely on checkpointer to do fsyncs, but we will have
- * already created the pendingOps during initialization of the startup
- * process.  Calling this function drops the local pendingOps so that
- * subsequent requests will be forwarded to checkpointer.
- */
-void
-EnableSyncRequestForwarding(void)
-{
-	/* Perform any pending fsyncs we may have queued up, then drop table */
-	if (pendingOps)
-	{
-		ProcessSyncRequests();
-		hash_destroy(pendingOps);
-	}
-	pendingOps = NULL;
-
-	/*
-	 * We should not have any pending unlink requests, since mdunlink doesn't
-	 * queue unlink requests when isRedo.
-	 */
-	Assert(pendingUnlinks == NIL);
-}
diff --git a/src/include/storage/sync.h b/src/include/storage/sync.h
index fbdf34f762..6fd50cfa7b 100644
--- a/src/include/storage/sync.h
+++ b/src/include/storage/sync.h
@@ -60,7 +60,6 @@ extern void SyncPreCheckpoint(void);
 extern void SyncPostCheckpoint(void);
 extern void ProcessSyncRequests(void);
 extern void RememberSyncRequest(const FileTag *ftag, SyncRequestType type);
-extern void EnableSyncRequestForwarding(void);
 extern bool RegisterSyncRequest(const FileTag *ftag, SyncRequestType type,
 								bool retryOnError);
 
-- 
2.30.1

v2-0002-Log-buffer-stats-after-crash-recovery.patch (text/x-patch)
From 3568596c9e2a6331f60d8e25bff35546279a42e4 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sat, 13 Mar 2021 00:57:38 +1300
Subject: [PATCH v2 2/3] Log buffer stats after crash recovery.

Author: Simon Riggs <simon@2ndquadrant.com>
Discussion: https://postgr.es/m/CANP8%2BjLqwRoKQEvAi%3DiRhSwgL1JWNxiUhx%2BRyG-D82CE_qrDew%40mail.gmail.com
---
 src/backend/access/transam/xlog.c | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6e8e6cf1e4..b10b1b614a 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7574,6 +7574,29 @@ StartupXLOG(void)
 					(errmsg("redo done at %X/%X system usage: %s",
 							LSN_FORMAT_ARGS(ReadRecPtr),
 							pg_rusage_show(&ru0))));
+			if (!InArchiveRecovery)
+			{
+				if (!INSTR_TIME_IS_ZERO(pgBufferUsage.blk_read_time))
+					ereport(LOG,
+						(errmsg("crash recovery complete: wrote %ld buffers (%.1f%%) (%0.3f ms); "
+								"dirtied %ld buffers; read %ld buffers (%0.3f ms)",
+								pgBufferUsage.shared_blks_written,
+								(double) pgBufferUsage.shared_blks_written * 100 / NBuffers,
+								INSTR_TIME_GET_MILLISEC(pgBufferUsage.blk_write_time),
+								pgBufferUsage.shared_blks_dirtied,
+								pgBufferUsage.shared_blks_read,
+								INSTR_TIME_GET_MILLISEC(pgBufferUsage.blk_read_time)
+								)));
+				else
+					ereport(LOG,
+						(errmsg("crash recovery complete: wrote %ld buffers (%.1f%%); "
+								"dirtied %ld buffers; read %ld buffers",
+								pgBufferUsage.shared_blks_written,
+								(double) pgBufferUsage.shared_blks_written * 100 / NBuffers,
+								pgBufferUsage.shared_blks_dirtied,
+								pgBufferUsage.shared_blks_read
+								)));
+			}
 			xtime = GetLatestXTime();
 			if (xtime)
 				ereport(LOG,
-- 
2.30.1

v2-0003-Optionally-don-t-wait-for-end-of-recovery-checkpo.patch (text/x-patch)
From 6420ae1112818ed7e4f4d5ecd306ef10f8873556 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Fri, 28 Aug 2020 16:35:25 +1200
Subject: [PATCH v2 3/3] Optionally, don't wait for end-of-recovery checkpoint.

Add a GUC to specify whether you want to wait for the end-of-recovery
checkpoint to complete before continuing.  Default to the traditional
behavior, "on".

XXX This probably needs a better GUC name.  There may also be problems with
timelines; see the nearby fast promotion code.

XXX This is experimental code only!  There may be fundamental problems
with the concept.  Or not.
---
 src/backend/access/transam/xlog.c | 23 +++++++++++++++--------
 src/backend/utils/misc/guc.c      |  9 +++++++++
 src/include/access/xlog.h         |  1 +
 3 files changed, 25 insertions(+), 8 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index b10b1b614a..4c001c466b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -111,6 +111,7 @@ int			CommitSiblings = 5; /* # concurrent xacts needed to sleep */
 int			wal_retrieve_retry_interval = 5000;
 int			max_slot_wal_keep_size_mb = -1;
 bool		track_wal_io_timing = false;
+bool		end_of_recovery_checkpoint_wait = true;
 
 #ifdef WAL_DEBUG
 bool		XLOG_DEBUG = false;
@@ -7874,6 +7875,18 @@ StartupXLOG(void)
 
 	if (InRecovery)
 	{
+		int		checkpoint_flags = CHECKPOINT_IMMEDIATE;
+
+		/*
+		 * Should we wait for the end-of-recovery checkpoint to complete?
+		 * Traditionally we do this, but it may be useful to allow connections
+		 * sooner.
+		 */
+		if (end_of_recovery_checkpoint_wait)
+			checkpoint_flags |= CHECKPOINT_END_OF_RECOVERY | CHECKPOINT_WAIT;
+		else
+			checkpoint_flags |= CHECKPOINT_FORCE;
+
 		/*
 		 * Perform a checkpoint to update all our recovery activity to disk.
 		 *
@@ -7919,16 +7932,10 @@ StartupXLOG(void)
 			}
 
 			if (!promoted)
-				RequestCheckpoint(CHECKPOINT_END_OF_RECOVERY |
-								  CHECKPOINT_IMMEDIATE |
-								  CHECKPOINT_WAIT);
+				RequestCheckpoint(checkpoint_flags);
 		}
 		else
-		{
-			RequestCheckpoint(CHECKPOINT_END_OF_RECOVERY |
-							  CHECKPOINT_IMMEDIATE |
-							  CHECKPOINT_WAIT);
-		}
+			RequestCheckpoint(checkpoint_flags);
 	}
 
 	if (ArchiveRecoveryRequested)
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 855076b1fd..e59194a3f8 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2058,6 +2058,15 @@ static struct config_bool ConfigureNamesBool[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"end_of_recovery_checkpoint_wait", PGC_SIGHUP, WAL_CHECKPOINTS,
+			gettext_noop("Whether to wait for the end-of-recovery checkpoint to complete."),
+		},
+		&end_of_recovery_checkpoint_wait,
+		true,
+		NULL, NULL, NULL
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, false, NULL, NULL, NULL
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 6d384d3ce6..77b4560f4b 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -132,6 +132,7 @@ extern char *PrimaryConnInfo;
 extern char *PrimarySlotName;
 extern bool wal_receiver_create_temp_slot;
 extern bool track_wal_io_timing;
+extern bool end_of_recovery_checkpoint_wait;
 
 /* indirectly set via GUC system */
 extern TransactionId recoveryTargetXid;
-- 
2.30.1

#7Aleksander Alekseev
aleksander@timescale.com
In reply to: Thomas Munro (#6)
Re: Background writer and checkpointer in crash recovery

The following review has been posted through the commitfest application:
make installcheck-world: tested, passed
Implements feature: tested, passed
Spec compliant: tested, passed
Documentation: tested, passed

v2-0001 and v2-0002 look fine, but I don't like much the idea of introducing a new GUC in v2-0003. It's for very specific needs, which most of the users, I believe, don't care about. I suggest dealing with v2-0001 and v2-0002 first and then maybe submit and discuss v2-0003 as a separate CF entry.

Tested on macOS against master `1ec7fca8`.

The new status of this patch is: Ready for Committer

#8Robert Haas
robertmhaas@gmail.com
In reply to: Aleksander Alekseev (#7)
Re: Background writer and checkpointer in crash recovery

On Fri, Jul 30, 2021 at 4:42 AM Aleksander Alekseev
<aleksander@timescale.com> wrote:

v2-0001 and v2-0002 look fine, but I don't like much the idea of introducing a new GUC in v2-0003. It's for very specific needs, which most of the users, I believe, don't care about. I suggest dealing with v2-0001 and v2-0002 first and then maybe submit and discuss v2-0003 as a separate CF entry.

Hi!

Thanks for bumping this thread; I had forgotten all about this effort,
but having just spent a bunch of time struggling with the thicket of
cases in StartupXLOG(), I'm now feeling highly motivated to make some
more progress in simplifying things over there. I am still of the
opinion that 0001 is a good idea, and I don't have any suggestions for
how it could be improved, except perhaps that the call to
PublishStartupProcessInformation() could maybe have a one-line
comment. Thomas, are you planning to press forward with committing
this soon? If not, do you mind if I do?

Regarding Simon's 0002, I wonder why it's useful to print this
information out at the end of crash recovery but not at the end of
archive recovery. It seems to me that if the information is useful
enough to be worth printing, it's probably good to print it in both
cases. In fact, rather than adding a separate message for this
information, I think we should just change the existing "redo done at"
message to print the details Simon proposes rather than what it does
now. Currently, we get output like this:

2021-07-30 09:13:05.319 EDT [48380] LOG: redo starts at 0/23A6E18
2021-07-30 09:13:05.612 EDT [48380] LOG: redo done at 0/D0D9CE8
system usage: CPU: user: 0.13 s, system: 0.12 s, elapsed: 0.29 s

With Simon's patch, I get something like this:

2021-07-30 09:39:43.579 EDT [63702] LOG: redo starts at 0/14A2F48
2021-07-30 09:39:44.129 EDT [63702] LOG: redo done at 0/15F48230
system usage: CPU: user: 0.25 s, system: 0.25 s, elapsed: 0.55 s
2021-07-30 09:39:44.129 EDT [63702] LOG: crash recovery complete:
wrote 36517 buffers (222.9%); dirtied 52985 buffers; read 7 buffers

Now I really think that information on the number of buffers touched
and how long it took is way more useful than user and system time.
Knowing how much user and system time were spent doesn't really tell
you anything, but a count of buffers touched gives you some meaningful
idea of how much work recovery did, and whether I/O was slow. Elapsed
time you can figure out yourself from the timestamp. However, I don't
really like printing the percentage here; unlike the checkpoint case,
it can very easily be way more than a hundred percent, and I think
that will confuse people. It could be tens of thousands of percent,
really, or even more.

So my proposal is:

redo done at %X/%X: wrote %ld buffers (%0.3f ms); dirtied %ld buffers;
read %ld buffers (%0.3f ms)

Regarding 0003, I agree with Alexander's comment that a GUC doesn't
seem particularly appropriate, but I also think that the approach may
not be quite right. In the promotion case, we emit an end-of-recovery
record and then later in the code we trigger a checkpoint. In your
patch, there's no end-of-recovery checkpoint -- you just trigger a
checkpoint instead of waiting for it. I think it's probably better to
make those two cases work the same. The end-of-recovery record isn't
needed to change the TLI as it is in the promotion case, but (1) it
seems better to have fewer code paths and (2) it might be good for
debuggability.

--
Robert Haas
EDB: http://www.enterprisedb.com

#9Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#8)
Re: Background writer and checkpointer in crash recovery

Hi,

On 2021-07-30 10:16:44 -0400, Robert Haas wrote:

2021-07-30 09:39:43.579 EDT [63702] LOG: redo starts at 0/14A2F48
2021-07-30 09:39:44.129 EDT [63702] LOG: redo done at 0/15F48230
system usage: CPU: user: 0.25 s, system: 0.25 s, elapsed: 0.55 s
2021-07-30 09:39:44.129 EDT [63702] LOG: crash recovery complete:
wrote 36517 buffers (222.9%); dirtied 52985 buffers; read 7 buffers

Now I really think that information on the number of buffers touched
and how long it took is way more useful than user and system time.
Knowing how much user and system time were spent doesn't really tell
you anything, but a count of buffers touched gives you some meaningful
idea of how much work recovery did, and whether I/O was slow.

I don't agree with that? If (user+system) << wall then it is very likely
that recovery is IO bound. If system is a large percentage of wall, then
shared buffers is likely too small (or we're replacing the wrong
buffers) because you spend a lot of time copying data in/out of the
kernel page cache. If user is the majority, you're CPU bound.

Without user & system time it's much harder to figure that out - at
least for me.
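
For illustration, applying that reading to the log line quoted above:

    wait time = elapsed - (user + system) = 0.55 - (0.25 + 0.25) = 0.05 s

so only about 9% of the wall time is unaccounted for and that run was
essentially CPU bound; with the same CPU figures but, say, 5 s of
elapsed time, roughly 90% of the time would have been spent waiting,
most likely on I/O.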

In your patch, there's no end-of-recovery checkpoint -- you just
trigger a checkpoint instead of waiting for it. I think it's probably
better to make those two cases work the same. The end-of-recovery
record isn't needed to change the TLI as it is in the promotion case,
but (1) it seems better to have fewer code paths and (2) it might be
good for debuggability.

+1

In addition, the end-of-recovery record also good for e.g. hot standby,
logical decoding, etc, because it's a point where no write transactions
can be in progress...

Greetings,

Andres Freund

#10Thomas Munro
thomas.munro@gmail.com
In reply to: Robert Haas (#8)
Re: Background writer and checkpointer in crash recovery

On Sat, Jul 31, 2021 at 2:16 AM Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Jul 30, 2021 at 4:42 AM Aleksander Alekseev
<aleksander@timescale.com> wrote:

v2-0001 and v2-0002 look fine, but I don't like much the idea of introducing a new GUC in v2-0003. It's for very specific needs, which most of the users, I believe, don't care about. I suggest dealing with v2-0001 and v2-0002 first and then maybe submit and discuss v2-0003 as a separate CF entry.

Thanks.

Thanks for bumping this thread; I had forgotten all about this effort,
but having just spent a bunch of time struggling with the thicket of
cases in StartupXLOG(), I'm now feeling highly motivated to make some
more progress in simplifying things over there. I am still of the
opinion that 0001 is a good idea, and I don't have any suggestions for
how it could be improved,

That's good news, and thanks. Yes, clearly there is much more that
can be simplified here.

except perhaps that the call to
PublishStartupProcessInformation() could maybe have a one-line
comment.

Done. BTW that is temporary, as I'm planning to remove that machinery soon[1].

Thomas, are you planning to press forward with committing
this soon? If not, do you mind if I do?

I pushed 0001. Let me think about 0002, and flesh out 0003 a bit more.

[1]: /messages/by-id/CA+hUKGLYRyDaneEwz5Uya_OgFLMx5BgJfkQSD=q9HmwsfRRb-w@mail.gmail.com

#11Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#9)
Re: Background writer and checkpointer in crash recovery

On Fri, Jul 30, 2021 at 4:00 PM Andres Freund <andres@anarazel.de> wrote:

I don't agree with that? If (user+system) << wall then it is very likely
that recovery is IO bound. If system is a large percentage of wall, then
shared buffers is likely too small (or we're replacing the wrong
buffers) because you spend a lot of time copying data in/out of the
kernel page cache. If user is the majority, you're CPU bound.

Without user & system time it's much harder to figure that out - at
least for me.

Oh, that's an interesting point. At least now I'll know why I am
supposed to care about that log line the next time I see it. I guess
we could include both things, though the line might get a little long.
Or maybe there's some other subset that would make sense.

--
Robert Haas
EDB: http://www.enterprisedb.com

#12Robert Haas
robertmhaas@gmail.com
In reply to: Thomas Munro (#10)
1 attachment(s)
Re: Background writer and checkpointer in crash recovery

On Mon, Aug 2, 2021 at 1:37 AM Thomas Munro <thomas.munro@gmail.com> wrote:

I pushed 0001.

That's great. I just realized that this leaves us with identical
RequestCheckpoint() calls in two nearby places. Is there any reason
not to further simplify as in the attached?

--
Robert Haas
EDB: http://www.enterprisedb.com

Attachments:

0001-Further-simplify-a-bit-of-logic-in-StartupXLOG.patch (application/octet-stream)
From 53b48371693fed72fb9625f1af16786f92700e7b Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Mon, 2 Aug 2021 09:12:56 -0400
Subject: [PATCH] Further simplify a bit of logic in StartupXLOG().

Commit 7ff23c6d277d1d90478a51f0dd81414d343f3850 left us with two
identical cases. Collapse them.
---
 src/backend/access/transam/xlog.c | 34 +++++++++++++------------------
 1 file changed, 14 insertions(+), 20 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f84c0bb01e..8b39a2fdaa 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7890,28 +7890,22 @@ StartupXLOG(void)
 		 * after we're fully out of recovery mode and already accepting
 		 * queries.
 		 */
-		if (ArchiveRecoveryRequested && IsUnderPostmaster)
+		if (ArchiveRecoveryRequested && IsUnderPostmaster &&
+			LocalPromoteIsTriggered)
 		{
-			if (LocalPromoteIsTriggered)
-			{
-				promoted = true;
+			promoted = true;
 
-				/*
-				 * Insert a special WAL record to mark the end of recovery,
-				 * since we aren't doing a checkpoint. That means that the
-				 * checkpointer process may likely be in the middle of a
-				 * time-smoothed restartpoint and could continue to be for
-				 * minutes after this. That sounds strange, but the effect is
-				 * roughly the same and it would be stranger to try to come
-				 * out of the restartpoint and then checkpoint. We request a
-				 * checkpoint later anyway, just for safety.
-				 */
-				CreateEndOfRecoveryRecord();
-			}
-			else
-				RequestCheckpoint(CHECKPOINT_END_OF_RECOVERY |
-								  CHECKPOINT_IMMEDIATE |
-								  CHECKPOINT_WAIT);
+			/*
+			 * Insert a special WAL record to mark the end of recovery, since
+			 * we aren't doing a checkpoint. That means that the checkpointer
+			 * process may likely be in the middle of a time-smoothed
+			 * restartpoint and could continue to be for minutes after this.
+			 * That sounds strange, but the effect is roughly the same and it
+			 * would be stranger to try to come out of the restartpoint and
+			 * then checkpoint. We request a checkpoint later anyway, just for
+			 * safety.
+			 */
+			CreateEndOfRecoveryRecord();
 		}
 		else
 		{
-- 
2.24.3 (Apple Git-128)

#13Amul Sul
sulamul@gmail.com
In reply to: Robert Haas (#12)
Re: Background writer and checkpointer in crash recovery

On Mon, Aug 2, 2021 at 6:47 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Aug 2, 2021 at 1:37 AM Thomas Munro <thomas.munro@gmail.com> wrote:

I pushed 0001.

That's great. I just realized that this leaves us with identical
RequestCheckpoint() calls in two nearby places. Is there any reason
not to further simplify as in the attached?

+1, also, can we just get rid of the "promoted" flag? The only
inconvenience is we might need to check three flags instead of one to
perform the checkpoint at the end.

#14Jakub Wartak
Jakub.Wartak@tomtom.com
In reply to: Robert Haas (#11)
RE: Background writer and checkpointer in crash recovery

On Fri, Jul 30, 2021 at 4:00 PM Andres Freund <andres@anarazel.de> wrote:

I don't agree with that? If (user+system) << wall then it is very
likely that recovery is IO bound. If system is a large percentage of
wall, then shared buffers is likely too small (or we're replacing the
wrong
buffers) because you spend a lot of time copying data in/out of the
kernel page cache. If user is the majority, you're CPU bound.

Without user & system time it's much harder to figure that out - at
least for me.

Oh, that's an interesting point. At least now I'll know why I am supposed to care
about that log line the next time I see it. I guess we could include both things,
though the line might get a little long.
Or maybe there's some other subset that would make sense.

Hi Robert,

The email threads from [1] can serve as an indication that having a complete view of
resource usage (user+system+elapsed) is advantageous in different situations and
pretty platform-generic. Also, as Andres and Simon pointed out earlier, the performance
insight into crash recovery/replication is next to nothing, judging just by
looking at the currently emitted log messages. The more there is, the better, I think.

BTW, now that there's this big push for refactoring StartupXLOG(), what
frustrating^H^H^H^H^H could be done better - at least from an end-user point of view -
is that there is a lack of near-real-time cyclic messages (every 1 min?) about current status,
performance and maybe even an ETA (simplistic case; assuming it is linear).
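
A naive sketch of that linear estimate (invented variable names, just
to illustrate the idea):

    /* assume replay speed stays constant from redo start to end of WAL */
    double  elapsed = seconds_since_redo_started();   /* hypothetical helper */
    uint64  done    = replay_lsn - redo_start_lsn;
    uint64  total   = end_of_wal_lsn - redo_start_lsn;
    double  eta     = (done > 0) ? elapsed * (double) (total - done) / done : 0;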

[1]: https://commitfest.postgresql.org/30/2799/

#15Robert Haas
robertmhaas@gmail.com
In reply to: Amul Sul (#13)
Re: Background writer and checkpointer in crash recovery

On Mon, Aug 2, 2021 at 9:40 AM Amul Sul <sulamul@gmail.com> wrote:

That's great. I just realized that this leaves us with identical
RequestCheckpoint() calls in two nearby places. Is there any reason
not to further simplify as in the attached?

+1, also, can we just get rid off "promoted" flag? The only
inconvenience is we might need to check three flags instead of one to
perform the checkpoint at the end.

I'm not sure that's really a win, because if we use the same
conditional in both places then it might not be clear to somebody that
they're supposed to match.

I do think we ought to consider renaming the flag, though, because
LocalPromoteIsTriggered is already tracking whether we were promoted,
and what this flag is tracking is not quite that thing. Its real
purpose is to track whether we should request a non-immediate
checkpoint at the end of recovery, and I think we ought to name it
some way that reflects that, e.g. perform_checkpoint_later.

--
Robert Haas
EDB: http://www.enterprisedb.com

#16Robert Haas
robertmhaas@gmail.com
In reply to: Jakub Wartak (#14)
Re: Background writer and checkpointer in crash recovery

On Mon, Aug 2, 2021 at 9:51 AM Jakub Wartak <Jakub.Wartak@tomtom.com> wrote:

BTW, if you now there's this big push for refactoring StartupXLOG() then what
frustrating^H^H^H^H^H could be done better - at least from end-user point of view -
is that there is lack of near real time cyclic messages (every 1min?) about current status,
performance and maybe even ETA (simplistic case; assuming it is linear).

I agree. See /messages/by-id/CA+TgmoaHQrgDFOBwgY16XCoMtXxsrVGFB2jNCvb7-ubuEe1MGg@mail.gmail.com
and subsequent discussion.

--
Robert Haas
EDB: http://www.enterprisedb.com

#17Thomas Munro
thomas.munro@gmail.com
In reply to: Robert Haas (#12)
Re: Background writer and checkpointer in crash recovery

On Tue, Aug 3, 2021 at 1:17 AM Robert Haas <robertmhaas@gmail.com> wrote:

That's great. I just realized that this leaves us with identical
RequestCheckpoint() calls in two nearby places. Is there any reason
not to further simplify as in the attached?

LGTM.

#18Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#17)
Re: Background writer and checkpointer in crash recovery

On Tue, Aug 3, 2021 at 9:52 AM Thomas Munro <thomas.munro@gmail.com> wrote:

On Tue, Aug 3, 2021 at 1:17 AM Robert Haas <robertmhaas@gmail.com> wrote:

That's great. I just realized that this leaves us with identical
RequestCheckpoint() calls in two nearby places. Is there any reason
not to further simplify as in the attached?

LGTM.

And pushed.

#19Justin Pryzby
pryzby@telsasoft.com
In reply to: Thomas Munro (#18)
Re: Background writer and checkpointer in crash recovery

On Tue, Aug 03, 2021 at 02:19:22PM +1200, Thomas Munro wrote:

On Tue, Aug 3, 2021 at 9:52 AM Thomas Munro <thomas.munro@gmail.com> wrote:

On Tue, Aug 3, 2021 at 1:17 AM Robert Haas <robertmhaas@gmail.com> wrote:

That's great. I just realized that this leaves us with identical
RequestCheckpoint() calls in two nearby places. Is there any reason
not to further simplify as in the attached?

LGTM.

And pushed.

Gripe: this made the "ps" display worse than before.

7ff23c6d2 Run checkpointer and bgwriter in crash recovery.

A couple years ago, I complained that during the end-of-recovery
checkpoint, the "ps" display still said "recovering NNNNN", which made
it look like it was stuck on a particular WAL file.

That led to commit df9274adf, which updated the startup process's "ps"
to say "end-of-recovery checkpoint".

But since the startup process no longer does the checkpoint, it still
says:

postgres 19738 11433 5 19:33 ? 00:00:01 postgres: startup recovering 000000010000000C000000FB
postgres 19739 11433 3 19:33 ? 00:00:00 postgres: checkpointer performing end-of-recovery checkpoint

That looks inconsistent. It'd be better if the startup process's "ps"
were cleared.

--
Justin

#20Justin Pryzby
pryzby@telsasoft.com
In reply to: Justin Pryzby (#19)
Re: Background writer and checkpointer in crash recovery

On Sun, Sep 11, 2022 at 07:54:43PM -0500, Justin Pryzby wrote:

On Tue, Aug 03, 2021 at 02:19:22PM +1200, Thomas Munro wrote:

On Tue, Aug 3, 2021 at 9:52 AM Thomas Munro <thomas.munro@gmail.com> wrote:

On Tue, Aug 3, 2021 at 1:17 AM Robert Haas <robertmhaas@gmail.com> wrote:

That's great. I just realized that this leaves us with identical
RequestCheckpoint() calls in two nearby places. Is there any reason
not to further simplify as in the attached?

LGTM.

And pushed.

Gripe: this made the "ps" display worse than before.

7ff23c6d2 Run checkpointer and bgwriter in crash recovery.

A couple years ago, I complained that during the end-of-recovery
checkpoint, the "ps" display still said "recovering NNNNN", which made
it look like it was stuck on a particular WAL file.

That led to commit df9274adf, which updated the startup process's "ps"
to say "end-of-recovery checkpoint".

But since the startup process no longer does the checkpoint, it still
says:

postgres 19738 11433 5 19:33 ? 00:00:01 postgres: startup recovering 000000010000000C000000FB
postgres 19739 11433 3 19:33 ? 00:00:00 postgres: checkpointer performing end-of-recovery checkpoint

That looks inconsistent. It'd be better if the startup process's "ps"
were cleared.

Like this, maybe.

It's similar to what I suggested to consider backpatching here:
/messages/by-id/20201214032224.GA30237@telsasoft.com

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 7a710e6490d..4bf8614d45f 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5302,6 +5302,7 @@ StartupXLOG(void)
 	EndOfLogTLI = endOfRecoveryInfo->endOfLogTLI;
 	abortedRecPtr = endOfRecoveryInfo->abortedRecPtr;
 	missingContrecPtr = endOfRecoveryInfo->missingContrecPtr;
+	set_ps_display("");

 	/*
 	 * When recovering from a backup (we are in recovery, and archive recovery

#21Michael Paquier
michael@paquier.xyz
In reply to: Justin Pryzby (#20)
Re: Background writer and checkpointer in crash recovery

On Mon, Sep 12, 2022 at 09:13:11PM -0500, Justin Pryzby wrote:

Like this, maybe.

It's similar to what I suggested to consider backpatching here:
/messages/by-id/20201214032224.GA30237@telsasoft.com

At the same time, df9274adf was done because the end-of-recovery
checkpoint, potentially performed by the startup process if the checkpointer
was not up at the end of recovery, could take a long time. None of the actions
taken in the startup process when finishing recovery are that
costly, quite the contrary, as far as I know.

I am not opposed to clearing the ps display when we are at this state
of the game for the startup process, but rather than clearing it we
could switch it to something more useful that shows what's
happening. Say a simple "performing end-of-recovery actions", for
example.
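
Concretely, in the hunk Justin posted above, that would just be:

    set_ps_display("performing end-of-recovery actions");

in place of the empty string.
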
--
Michael

#22Justin Pryzby
pryzby@telsasoft.com
In reply to: Michael Paquier (#21)
Re: Background writer and checkpointer in crash recovery

On Tue, Sep 13, 2022 at 01:32:11PM +0900, Michael Paquier wrote:

On Mon, Sep 12, 2022 at 09:13:11PM -0500, Justin Pryzby wrote:

Like this, maybe.

It's similar to what I suggested to consider backpatching here:
/messages/by-id/20201214032224.GA30237@telsasoft.com

At the same time, df9274adf has been done because the end-of-recovery
checkpoint done potentially by the startup process if the checkpointer
was not up at the end of recovery would take long. Any of the actions
taken in the startup process when finishing recovery are not that
costly, on the contrary, as far as I know.

Your interpretation may be a bit different from mine.

The central problem for me was that the startup process said "recovering
NNN" after it was done recovering NNN.

To fix that, df9274adf took the approach of overwriting the existing PS
with "end-of-recovery checkpoint", which also adds some extra value...

I am not opposed to clearing the ps display when we are at this state
of the game for the startup process, but rather than clearing it we
could switch provide something more useful that shows what's
happening. Say a simple "performing end-of-recovery actions", for
example..

...but the minimal solution (which 2 years ago I suggested could've been
backpatched) is to *clear* the PS (in master branch at the time, that
would've been before also changing it to say stuff about the
checkpoint).

If we'd done that, I probably wouldn't be griping about it now, so my
preference is to *clear* the display as soon as the WAL processing is
done; further overwriting it with additional stuff can be left as a
future enhancement (like "syncing FS" that I brought up last year, and
maybe everything else that calls ereport_startup_progress()).

#23Justin Pryzby
pryzby@telsasoft.com
In reply to: Justin Pryzby (#22)
Re: Background writer and checkpointer in crash recovery

I don't know that this warrants an Open Item, but I think some fix
ought to be applied to v15, whether that happens this week or next
month.

#24Michael Paquier
michael@paquier.xyz
In reply to: Justin Pryzby (#23)
Re: Background writer and checkpointer in crash recovery

On Thu, Sep 15, 2022 at 10:19:01PM -0500, Justin Pryzby wrote:

I don't know that this warrants an Opened Item, but I think some fix
ought to be applied to v15, whether that happens this week or next
month.

With RC1 getting close, I have looked at that again and applied a
patch that resets the ps display after recovery is done, bringing us
back to the same state as PG <= 14 in the startup process.
--
Michael