Hot standby, recovery infra

Started by Heikki Linnakangas, over 17 years ago, 74 messages, hackers

heikki.linnakangas@enterprisedb.com

over 17 years ago

I've been reviewing and massaging the so-called recovery infra patch.

To recap, the goal is to:
- start background writer during (archive) recovery
- skip the shutdown checkpoint at the end of recovery. Instead, the
database is brought up immediately, and the bgwriter performs a normal
online checkpoint, while we're already accepting connections.
- keep track of when we reach a consistent point in the recovery, where
we could let read-only backends in, which is obviously required for hot
standby

The 1st and 2nd points provide some useful functionality, even without
the rest of the hot standby patch.

I've refactored the patch quite heavily, making it a lot simpler, and
over 1/3 smaller than before:

The signaling between the bgwriter and startup process during recovery
was quite complicated. The startup process periodically sent checkpoint
records to the bgwriter, so that bgwriter could perform restart points.
I've replaced that by storing the last seen checkpoint in shared
memory in xlog.c. CreateRestartPoint() picks it up from there. This
means that bgwriter can decide autonomously when to perform a restart
point, it no longer needs to be told to do so by the startup process.
Which is nice in a standby. What could happen before is that the standby
processes a checkpoint record, and decides not to make it a restartpoint
because not enough time has passed since last one. If we then get a long
idle period after that, we'd really want to make the previous checkpoint
record a restart point after all, after some time has passed. That is
what will happen now, which is a usability enhancement, although the
real motivation for this refactoring was to make the code simpler.
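
A minimal sketch of the scheme described above, not the patch's actual code: the struct, the function names and the timeout parameter are all hypothetical, and the locking a real shared-memory area needs is left out. The startup process only records the last checkpoint record it replayed; the bgwriter decides on its own, from elapsed time, whether to turn it into a restartpoint.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-in for the shared-memory area kept in xlog.c. */
typedef struct
{
	uint64_t	last_checkpoint_lsn;	/* last checkpoint record replayed; 0 = none */
	uint64_t	last_restartpoint_lsn;	/* checkpoint already used as a restartpoint */
	time_t		last_restartpoint_time; /* when that restartpoint was made */
} RecoveryShared;

/* Startup process: remember the newest checkpoint record, nothing more. */
void
remember_checkpoint(RecoveryShared *shm, uint64_t lsn)
{
	/* WAL is replayed in order, so locations never go backwards. */
	assert(lsn >= shm->last_checkpoint_lsn);
	shm->last_checkpoint_lsn = lsn;
}

/* Bgwriter: decide autonomously whether a restartpoint is due. */
bool
restartpoint_due(const RecoveryShared *shm, time_t now, int timeout_secs)
{
	if (shm->last_checkpoint_lsn == 0)
		return false;			/* no checkpoint record replayed yet */
	if (shm->last_checkpoint_lsn == shm->last_restartpoint_lsn)
		return false;			/* already made into a restartpoint */
	return (now - shm->last_restartpoint_time) >= timeout_secs;
}

/* Bgwriter: record that the remembered checkpoint is now a restartpoint. */
void
perform_restartpoint(RecoveryShared *shm, time_t now)
{
	shm->last_restartpoint_lsn = shm->last_checkpoint_lsn;
	shm->last_restartpoint_time = now;
}
```

With this shape, a checkpoint record that arrives too soon after the previous restartpoint is not lost: once enough time has passed during an idle period, restartpoint_due() starts returning true for it.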

The bgwriter is now always responsible for all checkpoints and
restartpoints (well, except in a stand-alone backend), which makes it
easier to understand what's going on, IMHO.

There was one pretty fundamental bug in the minsafestartpoint handling:
it was always set when a WAL file was opened for reading. Which means it
was also moved backwards when the recovery began by reading the WAL
segment containing last restart/checkpoint, rendering it useless for the
purpose it was designed for. Fortunately that was easy to fix. Another tiny
bug was that log_restartpoints was not respected, because it was stored
in a variable in startup process' memory, and wasn't seen by bgwriter.

One aspect that troubles me a bit is the changes in XLogFlush. I guess
we no longer have the problem that you can't start up the database if
we've read in a corrupted page from disk, because we now start up before
checkpointing. However, it does mean that if a corrupt page is read into
shared buffers, we'll never be able to checkpoint. But then again, I
guess that's already true without this patch.

I feel quite good about this patch now. Given the amount of code churn,
it requires testing, and I'll read it through one more time after
sleeping over it. Simon, do you see anything wrong with this?

(this patch is also in my git repository at git.postgresql.org, branch
recoveryinfra.)

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Simon Riggs

simon@2ndQuadrant.com

over 17 years ago

In reply to: Heikki Linnakangas (#1)

Re: Hot standby, recovery infra

On Wed, 2009-01-28 at 12:04 +0200, Heikki Linnakangas wrote:

I've been reviewing and massaging the so called recovery infra patch.

Thanks.

I feel quite good about this patch now. Given the amount of code
churn, it requires testing, and I'll read it through one more time
after sleeping over it.

There's nothing major I feel we should discuss.

The way restartpoints happen is a useful improvement, thanks.

Simon, do you see anything wrong with this?

A few minor points:

* I think we are now renaming the recovery.conf file too early. The
comment says "We have already restored all the WAL segments we need from
the archive, and we trust that they are not going to go away even if we
crash." We have, but the files overwrite each other as they arrive, so
if the last restartpoint is not in the last restored WAL file then it
will only exist in the archive. The recovery.conf is the only place
where we store the information on where the archive is and how to access
it, so by renaming it out of the way we will be unable to crash recover
until the first checkpoint is complete. So the way this was in the
original patch is the correct way to go, AFAICS.

* My original intention was to deprecate log_restartpoints, and I would
still like to do so. log_checkpoints does just as well for that. Even
less code than before...

* comment on BgWriterShmemInit() refers to CHECKPOINT_IS_STARTUP, but
the actual define is CHECKPOINT_STARTUP. Would prefer the "is" version
because it matches the IS_SHUTDOWN naming.

* In CreateCheckpoint() the if test on TruncateSubtrans() has been
removed, but the comment has not been changed (to explain why).

* PG_CONTROL_VERSION bump should be just one increment, to 844. I
deliberately had it higher to help spot mismatches earlier, and to avoid
needless patch conflicts.

So it looks pretty much ready for commit very soon.

We should continue to measure performance of recovery in the light of
these changes. I still feel that fsyncing the control file on each
XLogFileRead() will give a noticeable performance penalty, mostly
because we know doing exactly the same thing in normal running caused a
performance penalty. But that is easily changed and cannot be done with
any certainty without wider feedback, so no reason to delay code commit.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

Fujii Masao

masao.fujii@gmail.com

over 17 years ago

In reply to: Heikki Linnakangas (#1)

Re: Hot standby, recovery infra

Hi,

On Wed, Jan 28, 2009 at 7:04 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

I've been reviewing and massaging the so called recovery infra patch.

Great!

I feel quite good about this patch now. Given the amount of code churn, it
requires testing, and I'll read it through one more time after sleeping over
it. Simon, do you see anything wrong with this?

I also read this patch and found something odd. I apologize if I misread it.

@@ -507,7 +550,7 @@ CheckArchiveTimeout(void)
 	pg_time_t now;
 	pg_time_t last_time;
-	if (XLogArchiveTimeout <= 0)
+	if (XLogArchiveTimeout <= 0 || !IsRecoveryProcessingMode())

The above change destroys archive_timeout because checking the timeout
is always skipped after recovery is done.
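
To make the point concrete, here is a small standalone model of the early-return test. It does not use the real PostgreSQL functions: the first helper reproduces the condition from the hunk, and the second shows what the intended condition may have been, which is an assumption not confirmed in this message. All three function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Early-return test exactly as written in the quoted hunk. */
bool
skip_as_patched(int archive_timeout, bool in_recovery)
{
	return archive_timeout <= 0 || !in_recovery;
}

/* Assumed intent (not confirmed in the thread): skip only during recovery. */
bool
skip_assumed_intent(int archive_timeout, bool in_recovery)
{
	return archive_timeout <= 0 || in_recovery;
}

/* Walk through the case at issue: a 60 second timeout, recovery finished. */
void
demonstrate_archive_timeout_check(void)
{
	/* As patched, the check is skipped, so the timeout never fires. */
	assert(skip_as_patched(60, false));
	/* With the negation dropped, the check runs in normal operation... */
	assert(!skip_assumed_intent(60, false));
	/* ...and is skipped only while recovery is still in progress. */
	assert(skip_assumed_intent(60, true));
}
```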

@@ -355,6 +359,27 @@ BackgroundWriterMain(void)
 	 */
 	PG_SETMASK(&UnBlockSig);
+	BgWriterRecoveryMode = IsRecoveryProcessingMode();
+
+	if (BgWriterRecoveryMode)
+		elog(DEBUG1, "bgwriter starting during recovery");
+	else
+		InitXLOGAccess();

Why is InitXLOGAccess() called also here when bgwriter is started after
recovery? That is already called by AuxiliaryProcessMain().

@@ -1302,7 +1314,7 @@ ServerLoop(void)
 		 * state that prevents it, start one.  It doesn't matter if this
 		 * fails, we'll just try again later.
 		 */
-		if (BgWriterPID == 0 && pmState == PM_RUN)
+		if (BgWriterPID == 0 && (pmState == PM_RUN || pmState == PM_RECOVERY))
 			BgWriterPID = StartBackgroundWriter();

Likewise, we should try to start also the stats collector during recovery?

@@ -2103,7 +2148,8 @@ XLogFileInit(uint32 log, uint32 seg,
 		unlink(tmppath);
 	}

-	elog(DEBUG2, "done creating and filling new WAL file");
+	XLogFileName(tmppath, ThisTimeLineID, log, seg);
+	elog(DEBUG2, "done creating and filling new WAL file %s", tmppath);

This debug message is somewhat confusing, because the WAL file
represented as "tmppath" might have been already created by
previous XLogFileInit() via InstallXLogFileSegment().

regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Simon Riggs

simon@2ndQuadrant.com

over 17 years ago

In reply to: Fujii Masao (#3)

Re: Hot standby, recovery infra

On Wed, 2009-01-28 at 23:19 +0900, Fujii Masao wrote:

@@ -355,6 +359,27 @@ BackgroundWriterMain(void)
 	 */
 	PG_SETMASK(&UnBlockSig);
+	BgWriterRecoveryMode = IsRecoveryProcessingMode();
+
+	if (BgWriterRecoveryMode)
+		elog(DEBUG1, "bgwriter starting during recovery");
+	else
+		InitXLOGAccess();

Why is InitXLOGAccess() called also here when bgwriter is started after
recovery? That is already called by AuxiliaryProcessMain().

InitXLOGAccess() sets the timeline and also gets the latest record
pointer. If the bgwriter is started in recovery these values need to be
reset later. It's easier to call it twice.

@@ -1302,7 +1314,7 @@ ServerLoop(void)
 		 * state that prevents it, start one.  It doesn't matter if this
 		 * fails, we'll just try again later.
 		 */
-		if (BgWriterPID == 0 && pmState == PM_RUN)
+		if (BgWriterPID == 0 && (pmState == PM_RUN || pmState == PM_RECOVERY))
 			BgWriterPID = StartBackgroundWriter();

Likewise, we should try to start also the stats collector during recovery?

We did in the previous patch...

@@ -2103,7 +2148,8 @@ XLogFileInit(uint32 log, uint32 seg,
 		unlink(tmppath);
 	}
-	elog(DEBUG2, "done creating and filling new WAL file");
+	XLogFileName(tmppath, ThisTimeLineID, log, seg);
+	elog(DEBUG2, "done creating and filling new WAL file %s", tmppath);

This debug message is somewhat confusing, because the WAL file
represented as "tmppath" might have been already created by
previous XLogFileInit() via InstallXLogFileSegment().

I think those are just for debugging and can be removed.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

Fujii Masao

masao.fujii@gmail.com

over 17 years ago

In reply to: Simon Riggs (#4)

Re: Hot standby, recovery infra

Hi,

On Wed, Jan 28, 2009 at 11:47 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

On Wed, 2009-01-28 at 23:19 +0900, Fujii Masao wrote:
@@ -355,6 +359,27 @@ BackgroundWriterMain(void)
     */
    PG_SETMASK(&UnBlockSig);
+   BgWriterRecoveryMode = IsRecoveryProcessingMode();
+
+   if (BgWriterRecoveryMode)
+           elog(DEBUG1, "bgwriter starting during recovery");
+   else
+           InitXLOGAccess();

Why is InitXLOGAccess() called also here when bgwriter is started after
recovery? That is already called by AuxiliaryProcessMain().

InitXLOGAccess() sets the timeline and also gets the latest record
pointer. If the bgwriter is started in recovery these values need to be
reset later. It's easier to call it twice.

Right. But, InitXLOGAccess() called during main loop is enough for
that purpose.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Simon Riggs

simon@2ndQuadrant.com

over 17 years ago

In reply to: Fujii Masao (#5)

Re: Hot standby, recovery infra

On Wed, 2009-01-28 at 23:54 +0900, Fujii Masao wrote:

Why is InitXLOGAccess() called also here when bgwriter is started after
recovery? That is already called by AuxiliaryProcessMain().

InitXLOGAccess() sets the timeline and also gets the latest record
pointer. If the bgwriter is started in recovery these values need to be
reset later. It's easier to call it twice.

Right. But, InitXLOGAccess() called during main loop is enough for
that purpose.

I think the code is clearer the way it is. Otherwise you'd read
AuxiliaryProcessMain() and think the bgwriter didn't need xlog access.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
