Hot standby, recovery infra

Started by Heikki Linnakangasabout 17 years ago74 messageshackers
Jump to latest
#1Heikki Linnakangas
heikki.linnakangas@enterprisedb.com

I've been reviewing and massaging the so called recovery infra patch.

To recap, the goal is to:
- start background writer during (archive) recovery
- skip the shutdown checkpoint at the end of recovery. Instead, the
database is brought up immediately, and the bgwriter performs a normal
online checkpoint, while we're already accepting connections.
- keep track of when we reach a consistent point in the recovery, where
we could let read-only backends in. Which is obviously required for hot
standby

The 1st and 2nd points provide some useful functionality, even without
the rest of the hot standby patch.

I've refactored the patch quite heavily, making it a lot simpler, and
over 1/3 smaller than before:

The signaling between the bgwriter and startup process during recovery
was quite complicated. The startup process periodically sent checkpoint
records to the bgwriter, so that bgwriter could perform restart points.
I've replaced that by storing the last seen checkpoint in a shared
memory in xlog.c. CreateRestartPoint() picks it up from there. This
means that bgwriter can decide autonomously when to perform a restart
point, it no longer needs to be told to do so by the startup process.
Which is nice in a standby. What could happen before is that the standby
processes a checkpoint record, and decides not to make it a restartpoint
because not enough time has passed since last one. If we then get a long
idle period after that, we'd really want to make the previous checkpoint
record a restart point after all, after some time has passed. That is
what will happen now, which is a usability enhancement, although the
real motivation for this refactoring was to make the code simpler.

The bgwriter is now always responsible for all checkpoints and
restartpoints. (well, except for a stand-alone backend). Which makes it
easier to understand what's going on, IMHO.

There was one pretty fundamental bug in the minsafestartpoint handling:
it was always set when a WAL file was opened for reading. Which means it
was also moved backwards when the recovery began by reading the WAL
segment containing last restart/checkpoint, rendering it useless for the
purpose it was designed. Fortunately that was easy to fix. Another tiny
bug was that log_restartpoints was not respected, because it was stored
in a variable in startup process' memory, and wasn't seen by bgwriter.

One aspect that troubles me a bit is the changes in XLogFlush. I guess
we no longer have the problem that you can't start up the database if
we've read in a corrupted page from disk, because we now start up before
checkpointing. However, it does mean that if a corrupt page is read into
shared buffers, we'll never be able to checkpoint. But then again, I
guess that's already true without this patch.

I feel quite good about this patch now. Given the amount of code churn,
it requires testing, and I'll read it through one more time after
sleeping over it. Simon, do you see anything wrong with this?

(this patch is also in my git repository at git.postgresql.org, branch
recoveryinfra.)

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachments:

recovery-infra-dee8f65be.patchtext/x-diff; name=recovery-infra-dee8f65be.patchDownload+589-145
#2Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#1)
Re: Hot standby, recovery infra

On Wed, 2009-01-28 at 12:04 +0200, Heikki Linnakangas wrote:

I've been reviewing and massaging the so called recovery infra patch.

Thanks.

I feel quite good about this patch now. Given the amount of code
churn, it requires testing, and I'll read it through one more time
after sleeping over it.

There's nothing major I feel we should discuss.

The way restartpoints happen is a useful improvement, thanks.

Simon, do you see anything wrong with this?

Few minor points

* I think we are now renaming the recovery.conf file too early. The
comment says "We have already restored all the WAL segments we need from
the archive, and we trust that they are not going to go away even if we
crash." We have, but the files overwrite each other as they arrive, so
if the last restartpoint is not in the last restored WAL file then it
will only exist in the archive. The recovery.conf is the only place
where we store the information on where the archive is and how to access
it, so by renaming it out of the way we will be unable to crash recover
until the first checkpoint is complete. So the way this was in the
original patch is the correct way to go, AFAICS.

* my original intention was to deprecate log_restartpoints and would
still like to do so. log_checkpoints does just as well for that. Even
less code than before...

* comment on BgWriterShmemInit() refers to CHECKPOINT_IS_STARTUP, but
the actual define is CHECKPOINT_STARTUP. Would prefer the "is" version
because it matches the IS_SHUTDOWN naming.

* In CreateCheckpoint() the if test on TruncateSubtrans() has been
removed, but the comment has not been changed (to explain why).

* PG_CONTROL_VERSION bump should be just one increment, to 844. I
deliberately had it higher to help spot mismatches earlier, and to avoid
needless patch conflicts.

So it looks pretty much ready for commit very soon.

We should continue to measure performance of recovery in the light of
these changes. I still feel that fsyncing the control file on each
XLogFileRead() will give a noticeable performance penalty, mostly
because we know doing exactly the same thing in normal running caused a
performance penalty. But that is easily changed and cannot be done with
any certainty without wider feedback, so no reason to delay code commit.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

#3Fujii Masao
masao.fujii@gmail.com
In reply to: Heikki Linnakangas (#1)
Re: Hot standby, recovery infra

Hi,

On Wed, Jan 28, 2009 at 7:04 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

I've been reviewing and massaging the so called recovery infra patch.

Great!

I feel quite good about this patch now. Given the amount of code churn, it
requires testing, and I'll read it through one more time after sleeping over
it. Simon, do you see anything wrong with this?

I also read this patch and found something odd. I apologize if I misread it..

@@ -507,7 +550,7 @@ CheckArchiveTimeout(void)
pg_time_t now;
pg_time_t last_time;

-	if (XLogArchiveTimeout <= 0)
+	if (XLogArchiveTimeout <= 0 || !IsRecoveryProcessingMode())

The above change destroys archive_timeout because checking the timeout
is always skipped after recovery is done.

@@ -355,6 +359,27 @@ BackgroundWriterMain(void)
*/
PG_SETMASK(&UnBlockSig);

+	BgWriterRecoveryMode = IsRecoveryProcessingMode();
+
+	if (BgWriterRecoveryMode)
+		elog(DEBUG1, "bgwriter starting during recovery");
+	else
+		InitXLOGAccess();

Why is InitXLOGAccess() called also here when bgwriter is started after
recovery? That is already called by AuxiliaryProcessMain().

@@ -1302,7 +1314,7 @@ ServerLoop(void)
* state that prevents it, start one.  It doesn't matter if this
* fails, we'll just try again later.
*/
-		if (BgWriterPID == 0 && pmState == PM_RUN)
+		if (BgWriterPID == 0 && (pmState == PM_RUN || pmState == PM_RECOVERY))
BgWriterPID = StartBackgroundWriter();

Likewise, we should try to start also the stats collector during recovery?

@@ -2103,7 +2148,8 @@ XLogFileInit(uint32 log, uint32 seg,
unlink(tmppath);
}

-	elog(DEBUG2, "done creating and filling new WAL file");
+	XLogFileName(tmppath, ThisTimeLineID, log, seg);
+	elog(DEBUG2, "done creating and filling new WAL file %s", tmppath);

This debug message is somewhat confusing, because the WAL file
represented as "tmppath" might have been already created by
previous XLogFileInit() via InstallXLogFileSegment().

regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#4Simon Riggs
simon@2ndQuadrant.com
In reply to: Fujii Masao (#3)
Re: Hot standby, recovery infra

On Wed, 2009-01-28 at 23:19 +0900, Fujii Masao wrote:

@@ -355,6 +359,27 @@ BackgroundWriterMain(void)
*/
PG_SETMASK(&UnBlockSig);

+	BgWriterRecoveryMode = IsRecoveryProcessingMode();
+
+	if (BgWriterRecoveryMode)
+		elog(DEBUG1, "bgwriter starting during recovery");
+	else
+		InitXLOGAccess();

Why is InitXLOGAccess() called also here when bgwriter is started after
recovery? That is already called by AuxiliaryProcessMain().

InitXLOGAccess() sets the timeline and also gets the latest record
pointer. If the bgwriter is started in recovery these values need to be
reset later. It's easier to call it twice.

@@ -1302,7 +1314,7 @@ ServerLoop(void)
* state that prevents it, start one.  It doesn't matter if this
* fails, we'll just try again later.
*/
-		if (BgWriterPID == 0 && pmState == PM_RUN)
+		if (BgWriterPID == 0 && (pmState == PM_RUN || pmState == PM_RECOVERY))
BgWriterPID = StartBackgroundWriter();

Likewise, we should try to start also the stats collector during recovery?

We did in the previous patch...

@@ -2103,7 +2148,8 @@ XLogFileInit(uint32 log, uint32 seg,
unlink(tmppath);
}

-	elog(DEBUG2, "done creating and filling new WAL file");
+	XLogFileName(tmppath, ThisTimeLineID, log, seg);
+	elog(DEBUG2, "done creating and filling new WAL file %s", tmppath);

This debug message is somewhat confusing, because the WAL file
represented as "tmppath" might have been already created by
previous XLogFileInit() via InstallXLogFileSegment().

I think those are just for debugging and can be removed.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

#5Fujii Masao
masao.fujii@gmail.com
In reply to: Simon Riggs (#4)
Re: Hot standby, recovery infra

Hi,

On Wed, Jan 28, 2009 at 11:47 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

On Wed, 2009-01-28 at 23:19 +0900, Fujii Masao wrote:

@@ -355,6 +359,27 @@ BackgroundWriterMain(void)
*/
PG_SETMASK(&UnBlockSig);

+   BgWriterRecoveryMode = IsRecoveryProcessingMode();
+
+   if (BgWriterRecoveryMode)
+           elog(DEBUG1, "bgwriter starting during recovery");
+   else
+           InitXLOGAccess();

Why is InitXLOGAccess() called also here when bgwriter is started after
recovery? That is already called by AuxiliaryProcessMain().

InitXLOGAccess() sets the timeline and also gets the latest record
pointer. If the bgwriter is started in recovery these values need to be
reset later. It's easier to call it twice.

Right. But, InitXLOGAccess() called during main loop is enough for
that purpose.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#6Simon Riggs
simon@2ndQuadrant.com
In reply to: Fujii Masao (#5)
Re: Hot standby, recovery infra

On Wed, 2009-01-28 at 23:54 +0900, Fujii Masao wrote:

Why is InitXLOGAccess() called also here when bgwriter is started after
recovery? That is already called by AuxiliaryProcessMain().

InitXLOGAccess() sets the timeline and also gets the latest record
pointer. If the bgwriter is started in recovery these values need to be
reset later. It's easier to call it twice.

Right. But, InitXLOGAccess() called during main loop is enough for
that purpose.

I think the code is clearer the way it is. Otherwise you'd read
AuxiliaryProcessMain() and think the bgwriter didn't need xlog access.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

#7Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Fujii Masao (#3)
Re: Hot standby, recovery infra

Fujii Masao wrote:

On Wed, Jan 28, 2009 at 7:04 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

I feel quite good about this patch now. Given the amount of code churn, it
requires testing, and I'll read it through one more time after sleeping over
it. Simon, do you see anything wrong with this?

I also read this patch and found something odd.

Many thanks for looking into this!

@@ -507,7 +550,7 @@ CheckArchiveTimeout(void)
pg_time_t now;
pg_time_t last_time;

-	if (XLogArchiveTimeout <= 0)
+	if (XLogArchiveTimeout <= 0 || !IsRecoveryProcessingMode())

The above change destroys archive_timeout because checking the timeout
is always skipped after recovery is done.

Yep, good catch. That obviously needs to be
"IsRecoveryProcessingMode()", without the exclamation mark.

@@ -355,6 +359,27 @@ BackgroundWriterMain(void)
*/
PG_SETMASK(&UnBlockSig);

+	BgWriterRecoveryMode = IsRecoveryProcessingMode();
+
+	if (BgWriterRecoveryMode)
+		elog(DEBUG1, "bgwriter starting during recovery");
+	else
+		InitXLOGAccess();

Why is InitXLOGAccess() called also here when bgwriter is started after
recovery? That is already called by AuxiliaryProcessMain().

Oh, I didn't realize that. Agreed, it's a bit disconcerting that
InitXLOGAccess() is called twice (there was a 2nd call within the loop
in the original patch as well). Looking at InitXLOGAccess, it's harmless
to call it multiple times, but it seems better to remove the
InitXLOGAccess call from AuxiliaryProcessMain().

InitXLOGAccess() needs to be called after seeing that
IsRecoveryProcessingMode() == false, because it won't get the right
timeline id before that.

@@ -1302,7 +1314,7 @@ ServerLoop(void)
* state that prevents it, start one.  It doesn't matter if this
* fails, we'll just try again later.
*/
-		if (BgWriterPID == 0 && pmState == PM_RUN)
+		if (BgWriterPID == 0 && (pmState == PM_RUN || pmState == PM_RECOVERY))
BgWriterPID = StartBackgroundWriter();

Likewise, we should try to start also the stats collector during recovery?

Hmm, there's not much point as this patch stands, but I guess we should
in the hot standby patch, where we let backends in.

@@ -2103,7 +2148,8 @@ XLogFileInit(uint32 log, uint32 seg,
unlink(tmppath);
}

-	elog(DEBUG2, "done creating and filling new WAL file");
+	XLogFileName(tmppath, ThisTimeLineID, log, seg);
+	elog(DEBUG2, "done creating and filling new WAL file %s", tmppath);

This debug message is somewhat confusing, because the WAL file
represented as "tmppath" might have been already created by
previous XLogFileInit() via InstallXLogFileSegment().

I don't quite understand what you're saying, but I think I'll just
revert that.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#8Fujii Masao
masao.fujii@gmail.com
In reply to: Fujii Masao (#3)
Re: Hot standby, recovery infra

Hi,

On Wed, Jan 28, 2009 at 11:19 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

I feel quite good about this patch now. Given the amount of code churn, it
requires testing, and I'll read it through one more time after sleeping over
it. Simon, do you see anything wrong with this?

I also read this patch and found something odd. I apologize if I misread it..

If archive recovery fails after it reaches the last valid record
in the last unfilled WAL segment, subsequent recovery might cause
the following fatal error. This is because minSafeStartPoint indicates
the end of the last unfilled WAL segment which subsequent recovery
cannot reach. Is this bug? (I'm not sure how to fix this problem
because I don't understand yet why minSafeStartPoint is required.)

FATAL: WAL ends before end time of backup dump

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#9Fujii Masao
masao.fujii@gmail.com
In reply to: Fujii Masao (#3)
Re: Hot standby, recovery infra

Hi,

On Wed, Jan 28, 2009 at 11:19 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

I feel quite good about this patch now. Given the amount of code churn, it
requires testing, and I'll read it through one more time after sleeping over
it. Simon, do you see anything wrong with this?

I also read this patch and found something odd. I apologize if I misread it..

Sorry for my random reply.

Though this is a matter of taste, I think that it's weird that bgwriter
runs with ThisTimeLineID = 0 during recovery. This is because
XLogCtl->ThisTimeLineID is set at the end of recovery. ISTM this will
be a cause of bug in the near future, though this is harmless currently.

+	/*
+	 * XXX: Should we try to perform restartpoints if we're not in consistent
+	 * recovery? The bgwriter isn't doing it for us in that case.
+	 */

I think yes. This is helpful for the system which it takes a long time to get
a base backup, ie. it also takes a long time to reach a consistent recovery
point.

+CreateRestartPoint(int flags)

<snip>

+	 * We rely on this lock to ensure that the startup process doesn't exit
+	 * Recovery while we are half way through a restartpoint. XXX ?
*/
+	LWLockAcquire(CheckpointLock, LW_EXCLUSIVE);

Is this comment correct? CheckpointLock cannot prevent the startup process
from exiting recovery because the startup process doesn't acquire that lock.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#10Simon Riggs
simon@2ndQuadrant.com
In reply to: Fujii Masao (#9)
Re: Hot standby, recovery infra

On Thu, 2009-01-29 at 12:18 +0900, Fujii Masao wrote:

Hi,

On Wed, Jan 28, 2009 at 11:19 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

I feel quite good about this patch now. Given the amount of code churn, it
requires testing, and I'll read it through one more time after sleeping over
it. Simon, do you see anything wrong with this?

I also read this patch and found something odd. I apologize if I misread it..

Sorry for my random reply.

Though this is a matter of taste, I think that it's weird that bgwriter
runs with ThisTimeLineID = 0 during recovery. This is because
XLogCtl->ThisTimeLineID is set at the end of recovery. ISTM this will
be a cause of bug in the near future, though this is harmless currently.

It doesn't. That's exactly what InitXLogAccess() was for.

+	/*
+	 * XXX: Should we try to perform restartpoints if we're not in consistent
+	 * recovery? The bgwriter isn't doing it for us in that case.
+	 */

I think yes. This is helpful for the system which it takes a long time to get
a base backup, ie. it also takes a long time to reach a consistent recovery
point.

The original patch did this.

+CreateRestartPoint(int flags)

<snip>

+	 * We rely on this lock to ensure that the startup process doesn't exit
+	 * Recovery while we are half way through a restartpoint. XXX ?
*/
+	LWLockAcquire(CheckpointLock, LW_EXCLUSIVE);

Is this comment correct? CheckpointLock cannot prevent the startup process
from exiting recovery because the startup process doesn't acquire that lock.

The original patch acquired CheckpointLock during exitRecovery to prove
that a restartpoint was not in progress. It no longer does this, so not
sure if Heikki has found another way and the comment is wrong, or that
removing the lock request is a bug.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

#11Simon Riggs
simon@2ndQuadrant.com
In reply to: Fujii Masao (#8)
Re: Hot standby, recovery infra

On Thu, 2009-01-29 at 10:36 +0900, Fujii Masao wrote:

Hi,

On Wed, Jan 28, 2009 at 11:19 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

I feel quite good about this patch now. Given the amount of code churn, it
requires testing, and I'll read it through one more time after sleeping over
it. Simon, do you see anything wrong with this?

I also read this patch and found something odd. I apologize if I misread it..

If archive recovery fails after it reaches the last valid record
in the last unfilled WAL segment, subsequent recovery might cause
the following fatal error. This is because minSafeStartPoint indicates
the end of the last unfilled WAL segment which subsequent recovery
cannot reach. Is this bug? (I'm not sure how to fix this problem
because I don't understand yet why minSafeStartPoint is required.)

FATAL: WAL ends before end time of backup dump

I think you're right. We need a couple of changes to avoid confusing
messages.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

#12Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Simon Riggs (#10)
Re: Hot standby, recovery infra

Simon Riggs wrote:

On Thu, 2009-01-29 at 12:18 +0900, Fujii Masao wrote:

Though this is a matter of taste, I think that it's weird that bgwriter
runs with ThisTimeLineID = 0 during recovery. This is because
XLogCtl->ThisTimeLineID is set at the end of recovery. ISTM this will
be a cause of bug in the near future, though this is harmless currently.

It doesn't. That's exactly what InitXLogAccess() was for.

It does *during recovery*, before InitXLogAccess is called. Yeah, it's
harmless currently. It would be pretty hard to keep it up-to-date in
bgwriter and other processes. I think it's better to keep it at 0, which
is clearly an invalid value, than try to keep it up-to-date and risk
using an old value. TimeLineID is not used in a lot of places,
currently. It might be a good idea to add some "Assert(TimeLineID != 0)"
to places where it used.

+	/*
+	 * XXX: Should we try to perform restartpoints if we're not in consistent
+	 * recovery? The bgwriter isn't doing it for us in that case.
+	 */

I think yes. This is helpful for the system which it takes a long time to get
a base backup, ie. it also takes a long time to reach a consistent recovery
point.

The original patch did this.

Yeah, I took it out. ISTM if you restore from a base backup, you'd want
to run it until consistent recovery anyway. We can put it back in if
people feel it's helpful. There's two ways to do it: we can fire up the
bgwriter before reaching consistent recovery point, or we can do the
restartpoints in startup process before that point. I'm inclined to fire
up the bgwriter earlier. That way, bgwriter remains responsible for all
checkpoints and restartpoints, which seems clearer. I can't see any
particular reason why bgwriter startup and reaching the consistent
recovery point need to be linked together.

+CreateRestartPoint(int flags)

<snip>

+	 * We rely on this lock to ensure that the startup process doesn't exit
+	 * Recovery while we are half way through a restartpoint. XXX ?
*/
+	LWLockAcquire(CheckpointLock, LW_EXCLUSIVE);

Is this comment correct? CheckpointLock cannot prevent the startup process
from exiting recovery because the startup process doesn't acquire that lock.

The original patch acquired CheckpointLock during exitRecovery to prove
that a restartpoint was not in progress. It no longer does this, so not
sure if Heikki has found another way and the comment is wrong, or that
removing the lock request is a bug.

The comment is obsolete. There's no reason for startup process to wait
for a restartpoint to finish. Looking back at the archives and the
history of that, I got the impression that in a very early version of
the patch, startup process still created a shutdown checkpoint after
recovery. To do that, it had to interrupt an ongoing restartpoint. That
was deemed too dangerous/weird, so it was changed to wait for it to
finish instead. But since the startup process no longer creates a
shutdown checkpoint, the waiting became obsolete, right?

If there's a restartpoint in progress when we reach the end of recovery,
startup process will RequestCheckpoint as usual. That will cause
bgwriter to hurry the on-going restartpoint, skipping the usual delays,
and start the checkpoint as soon as the restartpoint is finished.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#13Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#12)
Re: Hot standby, recovery infra

On Thu, 2009-01-29 at 09:34 +0200, Heikki Linnakangas wrote:

It does *during recovery*, before InitXLogAccess is called. Yeah, it's
harmless currently. It would be pretty hard to keep it up-to-date in
bgwriter and other processes. I think it's better to keep it at 0,
which is clearly an invalid value, than try to keep it up-to-date and
risk using an old value. TimeLineID is not used in a lot of places,
currently. It might be a good idea to add some "Assert(TimeLineID !=
0)" to places where it used.

Agreed. TimeLineID is a normal-running concept used for writing WAL.
Perhaps we should even solidify the meaning of TimeLineID == 0 as
"cannot write WAL".

I see a problem there for any process that exists both before and after
recovery ends, which includes bgwriter. In that case we must not flush
WAL before recovery ends, yet afterwards we *must* flush WAL. To do that
we *must* have a valid TimeLineID set.

I would suggest we put InitXLogAccess into IsRecoveryProcessingMode(),
so if the mode changes we immediately set everything we need to allow
XLogFlush() calls to work correctly.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

#14Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Simon Riggs (#11)
Re: Hot standby, recovery infra

Simon Riggs wrote:

On Thu, 2009-01-29 at 10:36 +0900, Fujii Masao wrote:

Hi,

On Wed, Jan 28, 2009 at 11:19 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

I feel quite good about this patch now. Given the amount of code churn, it
requires testing, and I'll read it through one more time after sleeping over
it. Simon, do you see anything wrong with this?

I also read this patch and found something odd. I apologize if I misread it..

If archive recovery fails after it reaches the last valid record
in the last unfilled WAL segment, subsequent recovery might cause
the following fatal error. This is because minSafeStartPoint indicates
the end of the last unfilled WAL segment which subsequent recovery
cannot reach. Is this bug? (I'm not sure how to fix this problem
because I don't understand yet why minSafeStartPoint is required.)

FATAL: WAL ends before end time of backup dump

I think you're right. We need a couple of changes to avoid confusing
messages.

Hmm, we could update minSafeStartPoint in XLogFlush instead. That was
suggested when the idea of minSafeStartPoint was first thought of.
Updating minSafeStartPoint is analogous to flushing WAL:
minSafeStartPoint must be advanced to X before we can flush a data pgse
with LSN X. To avoid excessive controlfile updates, whenever we update
minSafeStartPoint, we can update it to the latest WAL record we've read.

Or we could simply ignore that error if we've reached minSafeStartPoint
- 1 segment, assuming that even though minSafeStartPoint is higher, we
can't have gone past the end of valid WAL records in the last segment in
previous recovery either. But that feels more fragile.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#15Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#14)
Re: Hot standby, recovery infra

On Thu, 2009-01-29 at 11:20 +0200, Heikki Linnakangas wrote:

Simon Riggs wrote:

On Thu, 2009-01-29 at 10:36 +0900, Fujii Masao wrote:

Hi,

On Wed, Jan 28, 2009 at 11:19 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

I feel quite good about this patch now. Given the amount of code churn, it
requires testing, and I'll read it through one more time after sleeping over
it. Simon, do you see anything wrong with this?

I also read this patch and found something odd. I apologize if I misread it..

If archive recovery fails after it reaches the last valid record
in the last unfilled WAL segment, subsequent recovery might cause
the following fatal error. This is because minSafeStartPoint indicates
the end of the last unfilled WAL segment which subsequent recovery
cannot reach. Is this bug? (I'm not sure how to fix this problem
because I don't understand yet why minSafeStartPoint is required.)

FATAL: WAL ends before end time of backup dump

I think you're right. We need a couple of changes to avoid confusing
messages.

Hmm, we could update minSafeStartPoint in XLogFlush instead. That was
suggested when the idea of minSafeStartPoint was first thought of.
Updating minSafeStartPoint is analogous to flushing WAL:
minSafeStartPoint must be advanced to X before we can flush a data pgse
with LSN X. To avoid excessive controlfile updates, whenever we update
minSafeStartPoint, we can update it to the latest WAL record we've read.

Or we could simply ignore that error if we've reached minSafeStartPoint
- 1 segment, assuming that even though minSafeStartPoint is higher, we
can't have gone past the end of valid WAL records in the last segment in
previous recovery either. But that feels more fragile.

My proposed fix for Fujii-san's minSafeStartPoint bug is to introduce
another control file state DB_IN_ARCHIVE_RECOVERY_BASE. This would show
that we are still recovering up to the point of the end of the base
backup. Once we reach minSafeStartPoint we then switch state to
DB_IN_ARCHIVE_RECOVERY, and set baseBackupReached boolean, which then
enables writing new minSafeStartPoints when we open new WAL files in the
future.

We then have messages only when in DB_IN_ARCHIVE_RECOVERY_BASE state

if (XLByteLT(EndOfLog, ControlFile->minRecoveryPoint) &&
ControlFile->state == DB_IN_ARCHIVE_RECOVERY_BASE)
{
if (reachedStopPoint) /* stopped because of stop request */
ereport(FATAL,
(errmsg("requested recovery stop point is before end time of
backup dump")));
else /* ran off end of WAL */
ereport(FATAL,
(errmsg("WAL ends before end time of backup dump")));
}

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

#16Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Simon Riggs (#15)
Re: Hot standby, recovery infra

Simon Riggs wrote:

My proposed fix for Fujii-san's minSafeStartPoint bug is to introduce
another control file state DB_IN_ARCHIVE_RECOVERY_BASE. This would show
that we are still recovering up to the point of the end of the base
backup. Once we reach minSafeStartPoint we then switch state to
DB_IN_ARCHIVE_RECOVERY, and set baseBackupReached boolean, which then
enables writing new minSafeStartPoints when we open new WAL files in the
future.

I don't see how that helps, the bug has nothing to with base backups. It
comes from the fact that we set minSafeStartPoint beyond the actual end
of WAL, if the last WAL segment is only partially filled (= fails CRC
check at some point). If we crash after setting minSafeStartPoint like
that, and then restart recovery, we'll get the error.

The last WAL segment could be partially filled for example because the
DBA has manually copied the last unarchived WAL segments to pg_xlog, as
we recommend in the manual.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#17Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Heikki Linnakangas (#1)
Re: Hot standby, recovery infra

It looks like if you issue a fast shutdown during recovery, postmaster
doesn't kill bgwriter.

...
LOG: restored log file "000000010000000000000028" from archive
LOG: restored log file "000000010000000000000029" from archive
LOG: consistent recovery state reached at 0/2900005C
...
LOG: restored log file "00000001000000000000002F" from archive
LOG: restored log file "000000010000000000000030" from archive
LOG: restored log file "000000010000000000000031" from archive
LOG: restored log file "000000010000000000000032" from archive
LOG: restored log file "000000010000000000000033" from archive
LOG: restartpoint starting: time
LOG: restored log file "000000010000000000000034" from archive
LOG: received fast shutdown request
LOG: restored log file "000000010000000000000035" from archive
FATAL: terminating connection due to administrator command
LOG: startup process (PID 14137) exited with exit code 1
LOG: aborting startup due to startup process failure
hlinnaka@heikkilaptop:~/pgsql$
hlinnaka@heikkilaptop:~/pgsql$ LOG: restartpoint complete: wrote 3324
buffers (5.1%); write=13.996 s, sync=2.016 s, total=16.960 s
LOG: recovery restart point at 0/3000FA14

Seems that reaper() needs to be taught that bgwriter can be active
during consistent recovery. I'll take a look at how to do that.

BTW, the message "terminating connection ..." is a bit misleading. It's
referring to the startup process, which is hardly a connection. We have
that in CVS HEAD too, so it's not something introduced by the patch, but
seems worth changing in HS, since we then let real connections in while
startup process is still running.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#18Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#16)
Re: Hot standby, recovery infra

On Thu, 2009-01-29 at 12:22 +0200, Heikki Linnakangas wrote:

Simon Riggs wrote:

My proposed fix for Fujii-san's minSafeStartPoint bug is to introduce
another control file state DB_IN_ARCHIVE_RECOVERY_BASE. This would show
that we are still recovering up to the point of the end of the base
backup. Once we reach minSafeStartPoint we then switch state to
DB_IN_ARCHIVE_RECOVERY, and set baseBackupReached boolean, which then
enables writing new minSafeStartPoints when we open new WAL files in the
future.

I don't see how that helps, the bug has nothing to with base backups.

Sorry, disagree.

It
comes from the fact that we set minSafeStartPoint beyond the actual end
of WAL, if the last WAL segment is only partially filled (= fails CRC
check at some point). If we crash after setting minSafeStartPoint like
that, and then restart recovery, we'll get the error.

Look again please. My proposal would avoid the error when it is not
relevant, yet keep it when it is (while recovering base backups).

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

#19Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Simon Riggs (#18)
Re: Hot standby, recovery infra

Simon Riggs wrote:

On Thu, 2009-01-29 at 12:22 +0200, Heikki Linnakangas wrote:

It
comes from the fact that we set minSafeStartPoint beyond the actual end
of WAL, if the last WAL segment is only partially filled (= fails CRC
check at some point). If we crash after setting minSafeStartPoint like
that, and then restart recovery, we'll get the error.

Look again please. My proposal would avoid the error when it is not
relevant, yet keep it when it is (while recovering base backups).

I fail to see what base backups have to do with this. The problem arises
in this scenario:

0. A base backup is unzipped. recovery.conf is copied in place, and the
remaining unarchived WAL segments are copied from the primary server to
pg_xlog. The last WAL segment is only partially filled. Let's say that
redo point is in WAL segment 1. The last, partial, WAL segment is 3, and
WAL ends at 0/3500000
1. postmaster is started, recovery starts.
2. WAL segment 1 is restored from archive.
3. We reach consistent recovery point
4. We restore WAL segment 2 from archive. minSafeStartPoint is advanced
to 0/3000000
5. WAL segment 2 is completely replayed, we move on to WAL segment 3. It
is not in archive, but it's found in pg_xlog. minSafeStartPoint is
advanced to 0/4000000. Note that that's beyond end of WAL.
6. At replay of WAL record 0/3200000, the recovery is interrupted. For
example, by a fast shutdown request, or crash.

Now when we restart the recovery, we will never reach minSafeStartPoint,
which is now 0/4000000, and we'll fail with the error that Fujii-san
pointed out. We're already way past the min recovery point of base
backup by then.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#20Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#19)
Re: Hot standby, recovery infra

On Thu, 2009-01-29 at 15:31 +0200, Heikki Linnakangas wrote:

Now when we restart the recovery, we will never reach
minSafeStartPoint, which is now 0/4000000, and we'll fail with the
error that Fujii-san pointed out. We're already way past the min
recovery point of base backup by then.

The problem was that we reported this error

FATAL: WAL ends before end time of backup dump

and this is inappropriate because, as you say, we are way past the min
recovery point of base backup.

If you look again at my proposal you will see that the proposal avoids
the above error by keeping track of whether we are past the point of
base backup or not. If we are still in base backup we get the error and
if we are passed it we do not.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

#21Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Simon Riggs (#20)
#22Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Heikki Linnakangas (#17)
#23Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Heikki Linnakangas (#21)
#24Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Heikki Linnakangas (#23)
#25Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#23)
#26Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#22)
#27Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#24)
#28Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#17)
#29Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Simon Riggs (#28)
#30Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Simon Riggs (#26)
#31Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Simon Riggs (#26)
#32Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#29)
#33Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#30)
#34Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#31)
#35Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Simon Riggs (#32)
#36Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Simon Riggs (#34)
#37Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#35)
#38Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#36)
#39Fujii Masao
masao.fujii@gmail.com
In reply to: Heikki Linnakangas (#31)
#40Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Fujii Masao (#39)
#41Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Simon Riggs (#2)
#42Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#41)
#43Fujii Masao
masao.fujii@gmail.com
In reply to: Heikki Linnakangas (#40)
#44Tom Lane
tgl@sss.pgh.pa.us
In reply to: Fujii Masao (#43)
#45Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Tom Lane (#44)
#46Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Simon Riggs (#42)
#47Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#46)
#48Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#46)
#49Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Simon Riggs (#48)
#50Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Simon Riggs (#47)
#51Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#49)
#52Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#50)
#53Simon Riggs
simon@2ndQuadrant.com
In reply to: Simon Riggs (#52)
#54Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Simon Riggs (#52)
#55Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#54)
#56Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Simon Riggs (#55)
#57Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#56)
#58Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Simon Riggs (#57)
#59Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#58)
#60Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Simon Riggs (#59)
#61Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#60)
#62Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Simon Riggs (#61)
#63Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#62)
#64Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Simon Riggs (#63)
#65Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#64)
#66Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Simon Riggs (#65)
#67Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#66)
#68Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Simon Riggs (#67)
#69Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#68)
#70Fujii Masao
masao.fujii@gmail.com
In reply to: Simon Riggs (#26)
#71Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Fujii Masao (#70)
#72Fujii Masao
masao.fujii@gmail.com
In reply to: Heikki Linnakangas (#71)
#73Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#71)
#74Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Simon Riggs (#73)