Streaming replication, and walsender during recovery

Started by Fujii Masao, almost 16 years ago. 35 messages.
#1 Fujii Masao
masao.fujii@gmail.com

Hi,

When I configured a cascaded standby (i.e., made an additional
standby server connect to the standby), I got the following
error, and the cascaded standby didn't start replication.

ERROR: timeline 0 of the primary does not match recovery target timeline 1

I hadn't considered that case so far. To avoid a confusing error
message, should we forbid starting a walsender during recovery and
emit a suitable message? Or should we support such a cascaded
configuration? Though I don't think the latter would be difficult to
implement, ISTM now is not the time to do it.

Thought?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Fujii Masao (#1)
Re: Streaming replication, and walsender during recovery

Fujii Masao <masao.fujii@gmail.com> writes:

When I configured a cascaded standby (i.e., made an additional
standby server connect to the standby), I got the following
error, and the cascaded standby didn't start replication.

ERROR: timeline 0 of the primary does not match recovery target timeline 1

I hadn't considered that case so far. To avoid a confusing error
message, should we forbid starting a walsender during recovery and
emit a suitable message? Or should we support such a cascaded
configuration? Though I don't think the latter would be difficult to
implement, ISTM now is not the time to do it.

It would be kind of silly to add code to forbid it if making it work
would be about the same amount of effort. I think it'd be worth looking
closer to find out what the problem is.

regards, tom lane

#3 Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#2)
Re: Streaming replication, and walsender during recovery

On Mon, 2010-01-18 at 09:31 -0500, Tom Lane wrote:

Fujii Masao <masao.fujii@gmail.com> writes:

When I configured a cascaded standby (i.e., made an additional
standby server connect to the standby), I got the following
error, and the cascaded standby didn't start replication.

ERROR: timeline 0 of the primary does not match recovery target timeline 1

I hadn't considered that case so far. To avoid a confusing error
message, should we forbid starting a walsender during recovery and
emit a suitable message? Or should we support such a cascaded
configuration? Though I don't think the latter would be difficult to
implement, ISTM now is not the time to do it.

It would be kind of silly to add code to forbid it if making it work
would be about the same amount of effort. I think it'd be worth looking
closer to find out what the problem is.

There is an ERROR, but no problem AFAICS. The tli isn't set until end of
recovery because it doesn't need to have been set yet. That shouldn't
prevent retrieving WAL data.
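[Editorial note: the timeline ID Simon mentions is baked into WAL segment file names, which is why the mismatch surfaces so early. A minimal sketch, in illustrative Python rather than PostgreSQL source, of how a segment name encodes the timeline:]

```python
def wal_segment_name(tli: int, log_id: int, seg: int) -> str:
    """Build a WAL segment file name: three 8-digit uppercase-hex
    fields -- timeline ID, log file number, and segment number."""
    return f"{tli:08X}{log_id:08X}{seg:08X}"

# A standby still in recovery has not settled its timeline yet, while
# the cascaded standby expects the recovery target timeline (here, 1):
print(wal_segment_name(1, 0, 0x3F))  # 00000001000000000000003F
```

Once recovery ends and the timeline advances, the same (log, seg) position lives under a different file name, which is why replication cannot simply continue across the switch.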

--
Simon Riggs www.2ndQuadrant.com

#4 Fujii Masao
masao.fujii@gmail.com
In reply to: Simon Riggs (#3)
1 attachment(s)
Re: Streaming replication, and walsender during recovery

On Mon, Jan 18, 2010 at 11:42 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

On Mon, 2010-01-18 at 09:31 -0500, Tom Lane wrote:

Fujii Masao <masao.fujii@gmail.com> writes:

When I configured a cascaded standby (i.e., made an additional
standby server connect to the standby), I got the following
error, and the cascaded standby didn't start replication.

  ERROR:  timeline 0 of the primary does not match recovery target timeline 1

I hadn't considered that case so far. To avoid a confusing error
message, should we forbid starting a walsender during recovery and
emit a suitable message? Or should we support such a cascaded
configuration? Though I don't think the latter would be difficult to
implement, ISTM now is not the time to do it.

It would be kind of silly to add code to forbid it if making it work
would be about the same amount of effort.  I think it'd be worth looking
closer to find out what the problem is.

There is an ERROR, but no problem AFAICS. The tli isn't set until end of
recovery because it doesn't need to have been set yet. That shouldn't
prevent retrieving WAL data.

OK. Here is the patch, which supports a walsender process during recovery:

* Change walsender so as to send the WAL written by the walreceiver
if it has been started during recovery.
* Kill the walsenders started during recovery at the end of recovery
because replication cannot survive the change of timeline ID.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachments:

walsender_during_recovery_0119.patch (text/x-patch; charset=US-ASCII)
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***************
*** 6384,6389 **** StartupXLOG(void)
--- 6384,6397 ----
  		xlogctl->SharedRecoveryInProgress = false;
  		SpinLockRelease(&xlogctl->info_lck);
  	}
+ 
+ 	/*
+ 	 * Kill the walsender processes which were started during recovery
+ 	 * since they cannot survive the change of timeline ID at the end of
+ 	 * an archive recovery. Here is the right place to do that because
+ 	 * new 'cascaded' walsender will not be started from here on.
+ 	 */
+ 	ShutdownCascadedWalSnds();
  }
  
  /*
***************
*** 6666,6671 **** GetWriteRecPtr(void)
--- 6674,6682 ----
  	volatile XLogCtlData *xlogctl = XLogCtl;
  	XLogRecPtr	recptr;
  
+ 	if (LocalRecoveryInProgress)
+ 		return GetWalRcvWriteRecPtr();
+ 
  	SpinLockAcquire(&xlogctl->info_lck);
  	recptr = xlogctl->LogwrtResult.Write;
  	SpinLockRelease(&xlogctl->info_lck);
*** a/src/backend/replication/walsender.c
--- b/src/backend/replication/walsender.c
***************
*** 491,496 **** InitWalSnd(void)
--- 491,505 ----
  				(errcode(ERRCODE_TOO_MANY_CONNECTIONS),
  				 errmsg("sorry, too many standbys already")));
  
+ 	/*
+ 	 * Use the recovery target timeline ID during recovery.
+ 	 */
+ 	if (RecoveryInProgress())
+ 	{
+ 		MyWalSnd->cascaded = true;
+ 		ThisTimeLineID = GetRecoveryTargetTLI();
+ 	}
+ 
  	/* Arrange to clean up at walsender exit */
  	on_shmem_exit(WalSndKill, 0);
  }
***************
*** 506,511 **** WalSndKill(int code, Datum arg)
--- 515,521 ----
  	 * for this.
  	 */
  	MyWalSnd->pid = 0;
+ 	MyWalSnd->cascaded = false;
  
  	/* WalSnd struct isn't mine anymore */
  	MyWalSnd = NULL;
***************
*** 848,850 **** GetOldestWALSendPointer(void)
--- 858,880 ----
  	}
  	return oldest;
  }
+ 
+ /*
+  * Stop only the cascaded walsender processes.
+  */
+ void
+ ShutdownCascadedWalSnds(void)
+ {
+ 	int	i;
+ 
+ 	for (i = 0; i < MaxWalSenders; i++)
+ 	{
+ 		/* use volatile pointer to prevent code rearrangement */
+ 		volatile WalSnd	*walsnd = &WalSndCtl->walsnds[i];
+ 		pid_t	walsndpid;
+ 
+ 		walsndpid = walsnd->pid;
+ 		if (walsndpid != 0 && walsnd->cascaded)
+ 			kill(walsndpid, SIGTERM);
+ 	}
+ }
*** a/src/include/replication/walsender.h
--- b/src/include/replication/walsender.h
***************
*** 22,27 **** typedef struct WalSnd
--- 22,28 ----
  {
  	pid_t	pid;		/* this walsender's process id, or 0 */
  	XLogRecPtr sentPtr;	/* WAL has been sent up to this point */
+ 	bool	cascaded;	/* this walsender is started during recovery? */
  
  	slock_t	mutex;		/* locks shared variables shown above */
  } WalSnd;
***************
*** 45,49 **** extern void WalSndSignals(void);
--- 46,51 ----
  extern Size WalSndShmemSize(void);
  extern void WalSndShmemInit(void);
  extern XLogRecPtr GetOldestWALSendPointer(void);
+ extern void ShutdownCascadedWalSnds(void);
  
  #endif	/* _WALSENDER_H */
#5 Simon Riggs
simon@2ndQuadrant.com
In reply to: Fujii Masao (#4)
Re: Streaming replication, and walsender during recovery

On Tue, 2010-01-19 at 15:04 +0900, Fujii Masao wrote:

There is an ERROR, but no problem AFAICS. The tli isn't set until end of
recovery because it doesn't need to have been set yet. That shouldn't
prevent retrieving WAL data.

OK. Here is the patch, which supports a walsender process during recovery:

* Change walsender so as to send the WAL written by the walreceiver
if it has been started during recovery.
* Kill the walsenders started during recovery at the end of recovery
because replication cannot survive the change of timeline ID.

Good patch.

I think we need to add a longer comment explaining the tli issues. I
agree with your handling of them.

It would be useful to have the ps display differentiate between multiple
walsenders, and in this case have it indicate cascading also.

--
Simon Riggs www.2ndQuadrant.com

#6 Fujii Masao
masao.fujii@gmail.com
In reply to: Simon Riggs (#5)
Re: Streaming replication, and walsender during recovery

On Tue, Jan 19, 2010 at 4:41 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

It would be useful to have the ps display differentiate between multiple
walsenders, and in this case have it indicate cascading also.

Since a normal walsender and a "cascading" one will not be running
at the same time, I don't think it's worth adding that label to
the PS display. Am I missing something?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#7 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Fujii Masao (#4)
Re: Streaming replication, and walsender during recovery

Fujii Masao wrote:

OK. Here is the patch which supports a walsender process during recovery;

* Change walsender so as to send the WAL written by the walreceiver
if it has been started during recovery.
* Kill the walsenders started during recovery at the end of recovery
because replication cannot survive the change of timeline ID.

I think there's a race condition at the end of recovery. When the
shutdown checkpoint is written, with new TLI, doesn't a cascading
walsender try to send that to the standby as soon as it's flushed to
disk? But it won't find it in the WAL segment with the old TLI that it's
reading.

Also, when segments are restored from the archive, using
restore_command, the cascading walsender won't find them because they're
not written in pg_xlog like normal WAL segments.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#8 Fujii Masao
masao.fujii@gmail.com
In reply to: Heikki Linnakangas (#7)
Re: Streaming replication, and walsender during recovery

On Thu, Jan 28, 2010 at 4:47 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

I think there's a race condition at the end of recovery. When the
shutdown checkpoint is written, with new TLI, doesn't a cascading
walsender try to send that to the standby as soon as it's flushed to
disk? But it won't find it in the WAL segment with the old TLI that it's
reading.

Right. But I don't think such a shutdown checkpoint record needs to be
sent by a cascading walsender. I think such a walsender only needs to
exit, without regard to the WAL segment with the new TLI.

Also, when segments are restored from the archive, using
restore_command, the cascading walsender won't find them because they're
not written in pg_xlog like normal WAL segments.

Yeah, I need to adjust my approach to the recent 'xlog-refactor' change.
The archived file needs to be restored without a name change, and remain
in pg_xlog until the bgwriter has recycled it.

But that change would make xlog.c even more complicated. Should we
postpone the 'cascading walsender' feature to v9.1 and, in v9.0, just
forbid walsender from being started during recovery?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#9 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Fujii Masao (#8)
Re: Streaming replication, and walsender during recovery

Fujii Masao wrote:

On Thu, Jan 28, 2010 at 4:47 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

I think there's a race condition at the end of recovery. When the
shutdown checkpoint is written, with new TLI, doesn't a cascading
walsender try to send that to the standby as soon as it's flushed to
disk? But it won't find it in the WAL segment with the old TLI that it's
reading.

Right. But I don't think such a shutdown checkpoint record needs to be
sent by a cascading walsender. I think such a walsender only needs to
exit, without regard to the WAL segment with the new TLI.

Also, when segments are restored from the archive, using
restore_command, the cascading walsender won't find them because they're
not written in pg_xlog like normal WAL segments.

Yeah, I need to adjust my approach to the recent 'xlog-refactor' change.
The archived file needs to be restored without a name change, and remain
in pg_xlog until the bgwriter has recycled it.

I guess you could just say that it's working as designed, and WAL files
restored from archive can't be streamed. Presumably the cascaded slave
can find them in the archive too. But it is pretty weird, doesn't feel
right.

This reminds me of something I've been pondering anyway. Currently,
restore_command copies the restored WAL segment as pg_xlog/RECOVERYXLOG
instead of the usual 00000... filename. That avoids overwriting any
pre-existing WAL segments in pg_xlog, which may still contain useful
data. Using the same filename over and over also means that we don't
need to worry about deleting old log files during archive recovery.

The downside in standby mode is that once the standby has restored
segment X from the archive and is restarted, it must find X in the
archive again or it won't be able to start up. The archive had better
be a directory on the same host.

Streaming Replication, however, took another approach. It does overwrite
any existing files in pg_xlog, we do need to worry about deleting old
files, and if the master goes down, we can always find files we've
already streamed in pg_xlog, so the standby can recover even if the
master can't be contacted anymore.

That's a bit inconsistent, and causes the problem that a cascading
walsender won't find the files restored from archive.

How about restoring/streaming files to a new directory, say
pg_xlog/restored/, with the real filenames? At least in standby_mode;
it's probably best to keep the current behavior for PITR. That would
feel cleaner: you could easily tell apart files originating from the
server itself from those copied from the master.
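[Editorial note: to make the trade-off concrete, here is a sketch in illustrative Python, with made-up archive paths, of how the server expands a restore_command and why the restored file loses its real segment name under the current scheme:]

```python
def expand_restore_command(template: str, f: str, p: str) -> str:
    """Substitute %f (the segment file name being requested) and
    %p (the path to restore it to) into the restore_command string."""
    return template.replace("%f", f).replace("%p", p)

# Current behavior: the destination is the fixed name RECOVERYXLOG,
# so a cascading walsender can't find the segment under its real name.
print(expand_restore_command("cp /mnt/archive/%f %p",
                             "000000010000000000000042",
                             "pg_xlog/RECOVERYXLOG"))
# Proposed: restore under the real name, in a dedicated subdirectory.
print(expand_restore_command("cp /mnt/archive/%f %p",
                             "000000010000000000000042",
                             "pg_xlog/restored/000000010000000000000042"))
```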

But that change would make xlog.c even more complicated. Should we
postpone the 'cascading walsender' feature to v9.1 and, in v9.0, just
forbid walsender from being started during recovery?

That's certainly the simplest solution...

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#10 Fujii Masao
masao.fujii@gmail.com
In reply to: Heikki Linnakangas (#9)
Re: Streaming replication, and walsender during recovery

On Thu, Jan 28, 2010 at 7:43 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

How about restoring/streaming files to a new directory, say
pg_xlog/restored/, with the real filenames? At least in standby_mode,
probably best to keep the current behavior in PITR. That would feel more
clean, you could easily tell apart files originating from the server
itself and those copied from the master.

When a WAL file with the same name exists in the archive, in pg_xlog,
and in pg_xlog/restored/, which directory should we recover it from?
I'm not sure that we can always make the right decision about that.

How about just making restore_command copy the WAL files under their
normal names (e.g., 0000...) instead of to pg_xlog/RECOVERYXLOG?
Though we'd need to worry about deleting them, we can easily leave
that task to the bgwriter.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#11 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Fujii Masao (#10)
Re: Streaming replication, and walsender during recovery

Fujii Masao <masao.fujii@gmail.com> writes:

How about just making restore_command copy the WAL files under their
normal names (e.g., 0000...) instead of to pg_xlog/RECOVERYXLOG?
Though we'd need to worry about deleting them, we can easily leave
that task to the bgwriter.

The reason for doing it that way was to limit disk space usage during
a long restore.  I'm not convinced we can leave the task to the bgwriter
--- it shouldn't be deleting anything at that point.

regards, tom lane

#12 Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#11)
Re: Streaming replication, and walsender during recovery

On Thu, 2010-01-28 at 10:48 -0500, Tom Lane wrote:

Fujii Masao <masao.fujii@gmail.com> writes:

How about just making restore_command copy the WAL files under their
normal names (e.g., 0000...) instead of to pg_xlog/RECOVERYXLOG?
Though we'd need to worry about deleting them, we can easily leave
that task to the bgwriter.

The reason for doing it that way was to limit disk space usage during
a long restore.  I'm not convinced we can leave the task to the bgwriter
--- it shouldn't be deleting anything at that point.

I think "bgwriter" means RemoveOldXlogFiles(), which would normally
clear down files at checkpoint. If that was added to the end of
RecoveryRestartPoint() to do roughly the same job then it could
potentially work.

However, since not every checkpoint is a restartpoint, we might easily
end up with significantly more WAL files on the standby than would
normally be there on a primary. Not sure if that is an
issue in this case, but we can't just assume we can store all files
needed to restart the standby on the standby itself, in all cases. That
might be an argument to add a restartpoint_segments parameter, so we can
trigger restartpoints on WAL volume as well as time. But even that would
not put an absolute limit on the number of WAL files.
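[Editorial note: the volume-based trigger Simon suggests could look roughly like this; illustrative Python, and restartpoint_segments is a hypothetical parameter, not an existing GUC:]

```python
def restartpoint_due(elapsed_s: float, segs_since_last: int,
                     checkpoint_timeout_s: int = 300,
                     restartpoint_segments: int = 3) -> bool:
    """Trigger a restartpoint on elapsed time OR on WAL volume,
    whichever limit is reached first."""
    return (elapsed_s >= checkpoint_timeout_s
            or segs_since_last >= restartpoint_segments)

print(restartpoint_due(10, 5))   # True: segment limit reached
print(restartpoint_due(10, 1))   # False: neither limit reached
print(restartpoint_due(400, 0))  # True: timeout reached
```

As Simon notes, even a rule like this only bounds how often restartpoints happen; it does not put an absolute cap on the number of WAL files retained.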

I'm keen to allow cascading in 9.0. If you pull both synch rep and
cascading you're not offering much that isn't already there.

--
Simon Riggs www.2ndQuadrant.com

#13 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#12)
Re: Streaming replication, and walsender during recovery

Simon Riggs <simon@2ndQuadrant.com> writes:

I'm keen to allow cascading in 9.0. If you pull both synch rep and
cascading you're not offering much that isn't already there.

FWIW, I don't agree with that prioritization in the least. Cascading
is something we could leave till 9.1, or even later, and hardly anyone
would care. We have much more important problems to be spending our
effort on right now.

regards, tom lane

#14 Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#13)
Re: Streaming replication, and walsender during recovery

On Thu, Jan 28, 2010 at 11:41 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Simon Riggs <simon@2ndQuadrant.com> writes:

I'm keen to allow cascading in 9.0. If you pull both synch rep and
cascading you're not offering much that isn't already there.

FWIW, I don't agree with that prioritization in the least.  Cascading
is something we could leave till 9.1, or even later, and hardly anyone
would care.  We have much more important problems to be spending our
effort on right now.

I agree. According to
http://wiki.postgresql.org/wiki/Hot_Standby_TODO , the only must-fix
issues that remain prior to beta are (1) implementing the new VACUUM
FULL for system relations, and (2) some documentation improvements.
It's a little early to be worrying about docs, but shouldn't we be
trying to get the VACUUM FULL problems cleaned up first, and then look
at what else we have time to address?

As regards the remaining items for streaming replication at:

http://wiki.postgresql.org/wiki/Streaming_Replication#v9.0

...ISTM the most important issues are (1) fixing win32 and (2) adding
a message type header, followed by (3) fixing pg_xlogfile_name() and
(4) redefining smart shutdown in standby mode.

If we fix the must-fix issues first, we can still decide to delay the
release to fix the would-like-to-fix issues, or not. The other way
around doesn't work.

...Robert

#15 Simon Riggs
simon@2ndQuadrant.com
In reply to: Robert Haas (#14)
Re: Streaming replication, and walsender during recovery

On Thu, 2010-01-28 at 12:09 -0500, Robert Haas wrote:

I agree. According to
http://wiki.postgresql.org/wiki/Hot_Standby_TODO , the only must-fix
issues that remain prior to beta are (1) implementing the new VACUUM
FULL for system relations, and (2) some documentation improvements.
It's a little early to be worrying about docs, but shouldn't we be
trying to get the VACUUM FULL problems cleaned up first, and then look
at what else we have time to address?

Please don't confuse different issues. The fact that I have work to do
still is irrelevant to what other people should do on other features.

--
Simon Riggs www.2ndQuadrant.com

#16 Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#13)
Re: Streaming replication, and walsender during recovery

On Thu, 2010-01-28 at 11:41 -0500, Tom Lane wrote:

Simon Riggs <simon@2ndQuadrant.com> writes:

I'm keen to allow cascading in 9.0. If you pull both synch rep and
cascading you're not offering much that isn't already there.

FWIW, I don't agree with that prioritization in the least. Cascading
is something we could leave till 9.1, or even later, and

Not what you said just a few days ago.

hardly anyone would care.

Unfortunately, I think you're very wrong on that specific point.

We have much more important problems to be spending our
effort on right now.

I'm a little worried the feature set of streaming rep isn't any better
than what we have already. If we're going to destabilise the code, we
really should be adding some features as well.

--
Simon Riggs www.2ndQuadrant.com

#17 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#16)
Re: Streaming replication, and walsender during recovery

Simon Riggs <simon@2ndQuadrant.com> writes:

On Thu, 2010-01-28 at 11:41 -0500, Tom Lane wrote:

FWIW, I don't agree with that prioritization in the least. Cascading
is something we could leave till 9.1, or even later, and

Not what you said just a few days ago.

Me? I don't recall having said a word about cascading before.

I'm a little worried the feature set of streaming rep isn't any better
than what we have already.

Nonsense. Getting rid of the WAL-segment-based shipping delays is a
quantum improvement --- it means we actually have something approaching
real-time replication, which was really impractical before. Whether you
can feed slaves indirectly is just a minor administration detail. Yeah,
I know in some situations it could be helpful for performance, but
it's not even in the same ballpark of must-have-ness.

(Anyway, the argument that it's important for performance is pure
speculation AFAIK, untainted by any actual measurements. Given the lack
of optimization of WAL replay, it seems entirely possible that the last
thing you want to burden a slave with is sourcing data to more slaves.)

regards, tom lane

#18 Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#17)
Re: Streaming replication, and walsender during recovery

On Thu, 2010-01-28 at 13:05 -0500, Tom Lane wrote:

Simon Riggs <simon@2ndQuadrant.com> writes:

On Thu, 2010-01-28 at 11:41 -0500, Tom Lane wrote:

FWIW, I don't agree with that prioritization in the least. Cascading
is something we could leave till 9.1, or even later, and

Not what you said just a few days ago.

Me? I don't recall having said a word about cascading before.

Top of this thread.

I'm a little worried the feature set of streaming rep isn't any better
than what we have already.

Nonsense. Getting rid of the WAL-segment-based shipping delays is a
quantum improvement --- it means we actually have something approaching
real-time replication, which was really impractical before. Whether you
can feed slaves indirectly is just a minor administration detail. Yeah,
I know in some situations it could be helpful for performance, but
it's not even in the same ballpark of must-have-ness.

FWIW, streaming has been possible and actively used since 8.2.

(Anyway, the argument that it's important for performance is pure
speculation AFAIK, untainted by any actual measurements. Given the lack
of optimization of WAL replay, it seems entirely possible that the last
thing you want to burden a slave with is sourcing data to more slaves.)

Separate processes, separate CPUs, no problem. If WAL replay used more
CPUs you might be right, but it doesn't yet, so same argument opposite
conclusion.

--
Simon Riggs www.2ndQuadrant.com

#19 Greg Smith
greg@2ndquadrant.com
In reply to: Tom Lane (#17)
Re: Streaming replication, and walsender during recovery

Tom Lane wrote:

(Anyway, the argument that it's important for performance is pure
speculation AFAIK, untainted by any actual measurements. Given the lack
of optimization of WAL replay, it seems entirely possible that the last
thing you want to burden a slave with is sourcing data to more slaves.)

On any typical production hardware, the work of WAL replay is going to
leave at least one (and probably more) CPUs idle, and have plenty of
network resources to spare too because it's just shuffling WAL in/out
rather than dealing with so many complicated client conversations. And
the thing you want to redistribute--the WAL file--is practically
guaranteed to be sitting in the OS cache at the point where you'd be
doing it, so no disk use either. You'll disrupt a little bit of
memory/CPU cache, sure, but that's about it as far as leeching resources
from the main replay in order to support the secondary slave. I'll
measure it fully the next time I have one setup to give some hard
numbers, I've never seen it rise to the point where it was worth
worrying about before to bother.

Anyway, I think what Simon was trying to suggest was that it's possible
right now to ship partial WAL files over as they advance, if you monitor
pg_xlogfile_name_offset and are willing to coordinate copying chunks
over. That basic idea is even built already--the Skytools walmgr deals
with partial WALs for example. Having all that built into the server
with a nicer UI is awesome, but it's been possible to build something
with the same basic feature set since 8.2. Getting that going with a
chain of downstream slaves is not so easy though, so there's something
that I think would be unique to the 9.0 implementation.
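[Editorial note: the polling scheme Greg describes can be sketched as follows; illustrative Python, where the file path and offset stand in for what pg_xlogfile_name_offset() would report on a real server:]

```python
import os
import tempfile

def ship_new_bytes(path: str, last_offset: int, send) -> int:
    """Read everything written past last_offset and hand it to send();
    return the new offset to poll from next time."""
    with open(path, "rb") as f:
        f.seek(last_offset)
        chunk = f.read()
    if chunk:
        send(chunk)
    return last_offset + len(chunk)

# Demo against an ordinary file standing in for a partial WAL segment:
fd, path = tempfile.mkstemp()
os.write(fd, b"wal-bytes-so-far")
shipped = []
offset = ship_new_bytes(path, 0, shipped.append)
os.write(fd, b"+more")            # the primary writes more WAL
offset = ship_new_bytes(path, offset, shipped.append)
os.close(fd)
os.remove(path)
print(b"".join(shipped))          # b'wal-bytes-so-far+more'
```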

--
Greg Smith 2ndQuadrant Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com www.2ndQuadrant.com

#20 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Tom Lane (#11)
Re: Streaming replication, and walsender during recovery

Tom Lane wrote:

Fujii Masao <masao.fujii@gmail.com> writes:

How about just making restore_command copy the WAL files under their
normal names (e.g., 0000...) instead of to pg_xlog/RECOVERYXLOG?
Though we'd need to worry about deleting them, we can easily leave
that task to the bgwriter.

The reason for doing it that way was to limit disk space usage during
a long restore.  I'm not convinced we can leave the task to the bgwriter
--- it shouldn't be deleting anything at that point.

That has been changed already. In standby mode, bgwriter does delete old
WAL files when it performs a restartpoint. Otherwise the streamed WAL
files will keep accumulating and eventually fill the disk.

It works as it is, but having a sandbox dedicated for restored/streamed
files in pg_xlog/restored, instead of messing with pg_xlog directly,
would make me feel a bit easier about it. There's less potential for
damage in case of bugs if they're separate.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#21 Joshua D. Drake
jd@commandprompt.com
In reply to: Tom Lane (#17)
Re: Streaming replication, and walsender during recovery

On Thu, 2010-01-28 at 13:05 -0500, Tom Lane wrote:

I'm a little worried the feature set of streaming rep isn't any better
than what we have already.

Nonsense. Getting rid of the WAL-segment-based shipping delays is a
quantum improvement --- it means we actually have something approaching
real-time replication, which was really impractical before.

SR does not give us anything like replication. Replication implies an
ability to read from the slave. That is HS-only territory.

From what I read on the wiki, SR doesn't give us anything that PostgreSQL
+ PITRTools doesn't already give us. And PITRTools works as far back as
8.1 (although I would suggest 8.2+).

One thing I am unclear on is whether, with SR, the entire log must be
written before it streams to the slaves. If the entire log does not need
to be written, then that is one up on PITRTools, in that we have to wait
for archive_command to execute.

(Anyway, the argument that it's important for performance is pure
speculation AFAIK, untainted by any actual measurements. Given the lack
of optimization of WAL replay, it seems entirely possible that the last
thing you want to burden a slave with is sourcing data to more slaves.)

I agree. WAL replay as a whole is a bottleneck. As it stands now (I
don't know about 8.5), replay is a large bottleneck in keeping the
warm standby up to date.

Sincerely,

Joshua D. Drake

--
PostgreSQL.org Major Contributor
Command Prompt, Inc: http://www.commandprompt.com/ - 503.667.4564
Consulting, Training, Support, Custom Development, Engineering
Respect is earned, not gained through arbitrary and repetitive use or Mr. or Sir.

#22 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Simon Riggs (#16)
Re: Streaming replication, and walsender during recovery

Simon Riggs wrote:

I'm a little worried the feature set of streaming rep isn't any better
than what we have already.

Huh? Are you thinking of the "Record-based Log Shipping" described in
the manual, using a program to poll pg_xlogfile_name_offset() in a tight
loop, as a replacement for streaming replication? First of all, that
requires a big chunk of custom development, so it's a bit of a stretch
to say we have it already. Secondly, with that method, the standby will
still be replaying the WAL one file at a time, which makes a difference
with Hot Standby.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#23 Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#22)
Re: Streaming replication, and walsender during recovery

On Thu, 2010-01-28 at 20:49 +0200, Heikki Linnakangas wrote:

Simon Riggs wrote:

I'm a little worried the feature set of streaming rep isn't any better
than what we have already.

Huh? Are you thinking of the "Record-based Log Shipping" described in
the manual, using a program to poll pg_xlogfile_name_offset() in a tight
loop, as a replacement for streaming replication? First of all, that
requires a big chunk of custom development, so it's a bit of a stretch
to say we have it already.

It's been part of Skytools for years now...

Secondly, with that method, the standby will
still be replaying the WAL one file at a time, which makes a difference
with Hot Standby.

I'm not attempting to diss Streaming Rep, or anyone involved. What has
been done is good internal work. I am pointing out and requesting that
we should have a little more added before we stop for this release.

--
Simon Riggs www.2ndQuadrant.com

#24 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Simon Riggs (#12)
Re: Streaming replication, and walsender during recovery

Simon Riggs wrote:

On Thu, 2010-01-28 at 10:48 -0500, Tom Lane wrote:

Fujii Masao <masao.fujii@gmail.com> writes:

How about just making restore_command copy the WAL files under their
normal names (e.g., 0000...) instead of to pg_xlog/RECOVERYXLOG?
Though we'd need to worry about deleting them, we can easily leave
that task to the bgwriter.

The reason for doing it that way was to limit disk space usage during
a long restore.  I'm not convinced we can leave the task to the bgwriter
--- it shouldn't be deleting anything at that point.

I think "bgwriter" means RemoveOldXlogFiles(), which would normally
clear down files at checkpoint. If that was added to the end of
RecoveryRestartPoint() to do roughly the same job then it could
potentially work.

SR added a RemoveOldXLogFiles() call to CreateRestartPoint().

(Since 8.4, RecoveryRestartPoint() just writes the location of the
checkpoint record in shared memory, but doesn't actually perform the
restartpoint; bgwriter does that in CreateRestartPoint()).
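A rough sketch of that division of labour, with hypothetical names, no locking, and none of the real bookkeeping that xlog.c does against XLogCtl shared memory:

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* Simplified stand-in for the restartpoint state kept in XLogCtl shared memory. */
typedef struct RestartPointState
{
    bool    checkpoint_pending;     /* startup process saw a new checkpoint record */
    long    last_checkpoint_redo;   /* redo location of that record (stand-in for an LSN) */
    time_t  last_restartpoint;      /* when bgwriter last performed a restartpoint */
} RestartPointState;

/*
 * Startup-process side: since 8.4, RecoveryRestartPoint() only remembers the
 * checkpoint record it just replayed; it does not flush buffers itself.
 */
void
recovery_restart_point(RestartPointState *st, long redo)
{
    st->last_checkpoint_redo = redo;
    st->checkpoint_pending = true;
}

/*
 * Bgwriter side: CreateRestartPoint() actually performs the restartpoint, but
 * only if a checkpoint record has been remembered and checkpoint_timeout has
 * elapsed since the previous restartpoint.
 */
bool
create_restart_point(RestartPointState *st, time_t now, int timeout_secs)
{
    if (!st->checkpoint_pending)
        return false;           /* nothing to base a restartpoint on */
    if (now - st->last_restartpoint < timeout_secs)
        return false;           /* too soon since the last restartpoint */

    /* ... flush dirty buffers, update pg_control, recycle old WAL here ... */
    st->checkpoint_pending = false;
    st->last_restartpoint = now;
    return true;
}
```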

However, since not every checkpoint is a restartpoint we might easily
end up with significantly more WAL files on the standby than would
normally be there when it would be a primary. Not sure if that is an
issue in this case, but we can't just assume we can store all files
needed to restart the standby on the standby itself, in all cases. That
might be an argument to add a restartpoint_segments parameter, so we can
trigger restartpoints on WAL volume as well as time. But even that would
not put an absolute limit on the number of WAL files.

I think it is a pretty important safety feature that we keep all the WAL
around that's needed to recover the standby. To avoid out-of-disk-space
situation, it's probably enough in practice to set checkpoint_timeout
small enough in the standby to trigger restartpoints often enough.

At the moment, we do retain streamed WAL as long as it's needed, but not
the WAL restored from archive.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#25Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#24)
Re: Streaming replication, and walsender during recovery

On Thu, 2010-01-28 at 21:00 +0200, Heikki Linnakangas wrote:

However, since not every checkpoint is a restartpoint we might easily
end up with significantly more WAL files on the standby than would
normally be there when it would be a primary. Not sure if that is an
issue in this case, but we can't just assume we can store all files
needed to restart the standby on the standby itself, in all cases. That
might be an argument to add a restartpoint_segments parameter, so we can
trigger restartpoints on WAL volume as well as time. But even that would
not put an absolute limit on the number of WAL files.

I think it is a pretty important safety feature that we keep all the
WAL around that's needed to recover the standby. To avoid
out-of-disk-space situation, it's probably enough in practice to set
checkpoint_timeout small enough in the standby to trigger
restartpoints often enough.

Hmm, I'm sorry but that's bogus. Retaining so much WAL that we are
strongly in danger of blowing disk space is not what I would call a
safety feature. Since there is no way to control or restrain the number
of files for certain, that approach seems fatally flawed. Reducing
checkpoint_timeout is the opposite of what you would want to do for
performance.

--
Simon Riggs www.2ndQuadrant.com

#26Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Joshua D. Drake (#21)
Re: Streaming replication, and walsender during recovery

Joshua D. Drake wrote:

...if with SR the entire log must be written before it streams to the slaves.

No.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#27Josh Berkus
josh@agliodbs.com
In reply to: Simon Riggs (#25)
Re: Streaming replication, and walsender during recovery

Guys,

Hmm, I'm sorry but that's bogus. Retaining so much WAL that we are
strongly in danger of blowing disk space is not what I would call a
safety feature. Since there is no way to control or restrain the number
of files for certain, that approach seems fatally flawed. Reducing
checkpoint_timeout is the opposite of what you would want to do for
performance.

Which WAL are we talking about here? There's 3 copies to worry about:

1) master WAL
2) the archive copy of WAL
3) slave WAL

--Josh Berkus

#28Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Simon Riggs (#25)
Re: Streaming replication, and walsender during recovery

Simon Riggs wrote:

On Thu, 2010-01-28 at 21:00 +0200, Heikki Linnakangas wrote:

I think it is a pretty important safety feature that we keep all the
WAL around that's needed to recover the standby. To avoid
out-of-disk-space situation, it's probably enough in practice to set
checkpoint_timeout small enough in the standby to trigger
restartpoints often enough.

Hmm, I'm sorry but that's bogus. Retaining so much WAL that we are
strongly in danger of blowing disk space is not what I would call a
safety feature. Since there is no way to control or restrain the number
of files for certain, that approach seems fatally flawed.

The other alternative is to refuse to recover if the master can't be
contacted to stream the missing WAL again. Surely that's worse.

Note that we don't have any hard limits on WAL disk usage in general.
For example, if archiving stops working for some reason, you'll
accumulate WAL in the master until it runs out of disk space.

Reducing
checkpoint_timeout is the opposite of what you would want to do for
performance.

Well, make sure you have enough disk space for a higher setting then. It
doesn't seem that hard.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#29Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#28)
Re: Streaming replication, and walsender during recovery

On Fri, 2010-01-29 at 09:49 +0200, Heikki Linnakangas wrote:

Simon Riggs wrote:

On Thu, 2010-01-28 at 21:00 +0200, Heikki Linnakangas wrote:

I think it is a pretty important safety feature that we keep all the
WAL around that's needed to recover the standby. To avoid
out-of-disk-space situation, it's probably enough in practice to set
checkpoint_timeout small enough in the standby to trigger
restartpoints often enough.

Hmm, I'm sorry but that's bogus. Retaining so much WAL that we are
strongly in danger of blowing disk space is not what I would call a
safety feature. Since there is no way to control or restrain the number
of files for certain, that approach seems fatally flawed.

The other alternative is to refuse to recover if the master can't be
contacted to stream the missing WAL again. Surely that's worse.

What is the behaviour of the standby if it hits a disk full error while
receiving WAL? Hopefully it stops receiving WAL and then clears enough
disk space to allow it to receive from archive instead? Yet stays up to
allow queries to continue?

--
Simon Riggs www.2ndQuadrant.com

#30Fujii Masao
masao.fujii@gmail.com
In reply to: Simon Riggs (#25)
Re: Streaming replication, and walsender during recovery

On Fri, Jan 29, 2010 at 4:13 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

Hmm, I'm sorry but that's bogus. Retaining so much WAL that we are
strongly in danger of blowing disk space is not what I would call a
safety feature. Since there is no way to control or restrain the number
of files for certain, that approach seems fatally flawed. Reducing
checkpoint_timeout is the opposite of what you would want to do for
performance.

Why do you worry about that only in the standby? The primary (i.e.,
postgres in the normal mode) has been in the same situation until now.

But usually the cycle of restartpoints is longer than that of
checkpoints, because a restartpoint occurs only when the checkpoint
record has been replayed AND checkpoint_timeout has been reached.
So WAL files can accumulate more easily in the standby.

To improve the situation, I think that we need to use
checkpoint_segments/checkpoint_timeout as a trigger of restartpoint,
regardless of the checkpoint record. Though I'm not sure whether that
is possible or whether it should be included in v9.0.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#31Simon Riggs
simon@2ndQuadrant.com
In reply to: Fujii Masao (#30)
Re: Streaming replication, and walsender during recovery

On Fri, 2010-01-29 at 17:31 +0900, Fujii Masao wrote:

On Fri, Jan 29, 2010 at 4:13 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

Hmm, I'm sorry but that's bogus. Retaining so much WAL that we are
strongly in danger of blowing disk space is not what I would call a
safety feature. Since there is no way to control or restrain the number
of files for certain, that approach seems fatally flawed. Reducing
checkpoint_timeout is the opposite of what you would want to do for
performance.

Why do you worry about that only in the standby?

I don't. The "safety feature" we just added makes it much more likely
that this will happen on the standby.

To improve the situation, I think that we need to use
checkpoint_segments/checkpoint_timeout as a trigger of restartpoint,
regardless of the checkpoint record. Though I'm not sure whether that
is possible or whether it should be included in v9.0.

Yes, that is a simple change. I think it is needed now.

--
Simon Riggs www.2ndQuadrant.com

#32Fujii Masao
masao.fujii@gmail.com
In reply to: Simon Riggs (#31)
Re: Streaming replication, and walsender during recovery

On Fri, Jan 29, 2010 at 5:41 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

To improve the situation, I think that we need to use
checkpoint_segments/checkpoint_timeout as a trigger of restartpoint,
regardless of the checkpoint record. Though I'm not sure whether that
is possible or whether it should be included in v9.0.

Yes, that is a simple change. I think it is needed now.

On second thought, it's difficult to force a restartpoint without
a checkpoint record. Recovery always needs to start from a
checkpoint's redo location; otherwise a torn page might result
because the full-page image has not been replayed yet. So a
restartpoint cannot start without a checkpoint record.

But at least we might have to change the bgwriter to use not only
checkpoint_timeout but also checkpoint_segments as a trigger for
restartpoints. That would be useful for people who want to control
the checkpoint cycle using only checkpoint_segments.
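A minimal sketch of that dual trigger, using hypothetical names (the real decision would live in the bgwriter loop and read shared state under a lock):

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/*
 * Hypothetical sketch: trigger a restartpoint when EITHER checkpoint_timeout
 * has elapsed OR checkpoint_segments worth of WAL has been replayed since the
 * last restartpoint -- but only if a checkpoint record has actually been
 * replayed, since recovery must restart from a redo location.
 */
bool
should_trigger_restartpoint(bool checkpoint_record_replayed,
                            time_t now, time_t last_restartpoint,
                            int checkpoint_timeout_secs,
                            long segments_since_last,
                            int checkpoint_segments)
{
    if (!checkpoint_record_replayed)
        return false;   /* no safe redo location to restart from */
    if (now - last_restartpoint >= checkpoint_timeout_secs)
        return true;    /* time-based trigger (existing behaviour) */
    if (segments_since_last >= checkpoint_segments)
        return true;    /* proposed volume-based trigger */
    return false;
}
```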

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#33Fujii Masao
masao.fujii@gmail.com
In reply to: Fujii Masao (#1)
1 attachment(s)
Re: Streaming replication, and walsender during recovery

On Mon, Jan 18, 2010 at 2:19 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

When I configured a cascaded standby (i.e, made the additional
standby server connect to the standby), I got the following
errors, and a cascaded standby didn't start replication.

 ERROR:  timeline 0 of the primary does not match recovery target timeline 1

I didn't care about that case so far. To avoid a confusing error
message, we should forbid a startup of walsender during recovery,
and emit a suitable message? Or support such cascade-configuration?
Though I don't think that the latter is difficult to be implemented,
ISTM it's not the time to do that now.

We reached a consensus that the cascading standby feature should be
postponed to v9.1 or later. But when we mistakenly make the standby
connect to another standby, the following confusing message is still
emitted.

FATAL: timeline 0 of the primary does not match recovery target timeline 1

How about emitting the following message instead? Here is the patch.

FATAL: recovery is in progress
HINT: cannot accept the standby server during recovery.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachments:

forbid_cascading_standby_0218.patch (text/x-patch)
*** a/src/backend/replication/walsender.c
--- b/src/backend/replication/walsender.c
***************
*** 119,124 **** WalSenderMain(void)
--- 119,130 ----
  				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
  				 errmsg("must be superuser to start walsender")));
  
+ 	if (RecoveryInProgress())
+ 		ereport(FATAL,
+ 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ 				 errmsg("recovery is in progress"),
+ 				 errhint("cannot accept the standby server during recovery.")));
+ 
  	/* Create a per-walsender data structure in shared memory */
  	InitWalSnd();
  
#34Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Fujii Masao (#33)
Re: Streaming replication, and walsender during recovery

Fujii Masao wrote:

On Mon, Jan 18, 2010 at 2:19 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

When I configured a cascaded standby (i.e, made the additional
standby server connect to the standby), I got the following
errors, and a cascaded standby didn't start replication.

ERROR: timeline 0 of the primary does not match recovery target timeline 1

I didn't care about that case so far. To avoid a confusing error
message, we should forbid a startup of walsender during recovery,
and emit a suitable message? Or support such cascade-configuration?
Though I don't think that the latter is difficult to be implemented,
ISTM it's not the time to do that now.

We reached a consensus that the cascading standby feature should be
postponed to v9.1 or later. But when we mistakenly make the standby
connect to another standby, the following confusing message is still
emitted.

FATAL: timeline 0 of the primary does not match recovery target timeline 1

How about emitting the following message instead? Here is the patch.

FATAL: recovery is in progress
HINT: cannot accept the standby server during recovery.

Committed. I edited the message and error code a bit:

ereport(FATAL,
        (errcode(ERRCODE_CANNOT_CONNECT_NOW),
         errmsg("recovery is still in progress, can't accept WAL streaming connections")));

ERRCODE_CANNOT_CONNECT_NOW is what we use when the system is shutting
down etc, so that that seems appropriate.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#35Fujii Masao
masao.fujii@gmail.com
In reply to: Heikki Linnakangas (#34)
Re: Streaming replication, and walsender during recovery

On Tue, Mar 16, 2010 at 6:11 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

Committed. I edited the message and error code a bit:

ereport(FATAL,
        (errcode(ERRCODE_CANNOT_CONNECT_NOW),
         errmsg("recovery is still in progress, can't accept WAL streaming connections")));

ERRCODE_CANNOT_CONNECT_NOW is what we use when the system is shutting
down etc, so that that seems appropriate.

Thanks! I agree that ERRCODE_CANNOT_CONNECT_NOW is more suitable.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center