Streaming Replication: checkpoint_segments and wal_keep_segments on standby
The parameters checkpoint_segments and wal_keep_segments set the maximum number of WAL segments. When the maximum number is reached, either
(1) the segments are deleted/recycled,
or (2) once the time set by checkpoint_timeout has elapsed, a checkpoint is performed and, if possible, a deletion/recycling is done.
This is the mechanism on the active side of a db server. On the standby side, however, unused transferred segments are only deleted when the checkpoint_timeout mechanism (2) runs.
Is this correct behaviour, or is it an error?
I have observed (checkpoint_segments set to 3, wal_keep_segments set to 10, and checkpoint_timeout set to 30min) that in my stress test the disk usage on the standby side grows up to 2GB of xlog segments, whereas on the active side only ~60MB of xlog files are present (we have patched the xlog file size to 4MB). One way to prevent this is to decrease checkpoint_timeout to a low value (30sec); however, this has the disadvantage that checkpoints are executed frequently on the active side, which can influence performance. Another possibility is to have different postgresql.conf files on the active and standby sides, but this is not our preferred solution.
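For reference, the test configuration described above corresponds roughly to this postgresql.conf fragment (the 4MB segment size is a local compile-time patch, not a setting, so it does not appear here):

checkpoint_segments = 3       # master checkpoints after ~3 segments of WAL
wal_keep_segments = 10        # keep at least 10 old segments for standbys
checkpoint_timeout = 30min    # time-based checkpoint/restartpoint interval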
Best Regards/mfG
Ingo Sander
=========================================================
Nokia Siemens Networks GmbH &Co. KG
NWS EP CP SVSS Platform Tech Support DE
St.-Martin-Str. 76
D-81541 München
*Tel.: +49-89-515938390
*ingo.sander@nsn.com
On Thu, May 27, 2010 at 10:13 PM, Sander, Ingo (NSN - DE/Munich)
<ingo.sander@nsn.com> wrote:
The parameters checkpoint_segments and wal_keep_segments set the maximum number of WAL segments. When the maximum number is reached, either
(1) the segments are deleted/recycled,
or (2) once the time set by checkpoint_timeout has elapsed, a checkpoint is performed and, if possible, a deletion/recycling is done.
This is the mechanism on the active side of a db server. On the standby side, however, unused transferred segments are only deleted when the checkpoint_timeout mechanism (2) runs.
Is this correct behaviour, or is it an error?
I have observed (checkpoint_segments set to 3, wal_keep_segments set to 10, and checkpoint_timeout set to 30min) that in my stress test the disk usage on the standby side grows up to 2GB of xlog segments, whereas on the active side only ~60MB of xlog files are present (we have patched the xlog file size to 4MB). One way to prevent this is to decrease checkpoint_timeout to a low value (30sec); however, this has the disadvantage that checkpoints are executed frequently on the active side, which can influence performance. Another possibility is to have different postgresql.conf files on the active and standby sides, but this is not our preferred solution.
I guess this happens because checkpoints occur far less frequently on the
standby than on the master. On the master, a checkpoint occurs for every
consumption of three segments because of "checkpoint_segments = 3". On the
standby, by contrast, only checkpoint_timeout has an effect, so a checkpoint
occurs every 30 minutes because of "checkpoint_timeout = 30min".

Should the walreceiver signal the bgwriter to start a checkpoint once it has
received more than checkpoint_segments WAL files, as in normal processing?
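For reference, the "normal processing" check mentioned here lives in XLogWrite() in xlog.c; the pre-patch 9.0-era logic (also visible in the patch context later in this thread) looks like this:

	if (IsUnderPostmaster &&
		XLogCheckpointNeeded())
	{
		/* Our copy of RedoRecPtr may be stale; update it and recheck. */
		(void) GetRedoRecPtr();
		if (XLogCheckpointNeeded())
			RequestCheckpoint(CHECKPOINT_CAUSE_XLOG);
	}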
Regards,
--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Thu, May 27, 2010 at 10:09 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
On Thu, May 27, 2010 at 10:13 PM, Sander, Ingo (NSN - DE/Munich)
<ingo.sander@nsn.com> wrote:
[...]

I guess this happens because checkpoints occur far less frequently on the
standby than on the master. On the master, a checkpoint occurs for every
consumption of three segments because of "checkpoint_segments = 3". On the
standby, by contrast, only checkpoint_timeout has an effect, so a checkpoint
occurs every 30 minutes because of "checkpoint_timeout = 30min".

Should the walreceiver signal the bgwriter to start a checkpoint once it has
received more than checkpoint_segments WAL files, as in normal processing?
Is this also an issue when using log shipping, or just with SR?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
On Thu, May 27, 2010 at 11:13 PM, Robert Haas <robertmhaas@gmail.com> wrote:
[...]
Should the walreceiver signal the bgwriter to start a checkpoint once it has
received more than checkpoint_segments WAL files, as in normal processing?

Is this also an issue when using log shipping, or just with SR?
When using log shipping, checkpoint_segments never triggers a
checkpoint. So recovery after the standby crashes might take unexpectedly
long, since the redo starting point might be old.

But in file-based log shipping, WAL files don't accumulate in the
pg_xlog directory on the standby, so even if checkpoints are very
infrequent, pg_xlog will not fill up with many WAL files. That
accumulation occurs only when using SR.
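For context, the two modes contrasted here are selected in the standby's recovery.conf; a minimal sketch, with placeholder host and archive-path values:

standby_mode = 'on'
# Streaming replication: received WAL accumulates in the standby's pg_xlog
primary_conninfo = 'host=master.example.com port=5432'
# File-based log shipping instead: segments are restored one at a time
# restore_command = 'cp /mnt/archive/%f "%p"'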
If we should avoid infrequent checkpoints themselves, rather than just the
accumulation of WAL files, then the bgwriter rather than the walreceiver
should check whether we've consumed too much WAL, I think. Thoughts?
Regards,
--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Both nodes (active and standby) have the same configuration parameters.
The observed effect also happens if checkpoint_timeout is decreased.
The problem seems to be that on the standby no checkpoints are written,
and only the checkpoint_timeout mechanism is active.
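A quick way to see this (assuming log_checkpoints reports restartpoints on the standby the way it reports checkpoints in normal operation; the log line below is from memory of the 9.0-era sources):

log_checkpoints = on          # in the standby's postgresql.conf

The standby's log then shows only time-triggered restartpoints, e.g.
LOG:  restartpoint starting: time
and never an xlog-triggered one ("restartpoint starting: xlog").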
Regards
Ingo
-----Original Message-----
From: ext Fujii Masao [mailto:masao.fujii@gmail.com]
Sent: Thursday, May 27, 2010 4:10 PM
To: Sander, Ingo (NSN - DE/Munich)
Cc: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Streaming Replication: Checkpoint_segment and
wal_keep_segments on standby
On Thu, May 27, 2010 at 10:13 PM, Sander, Ingo (NSN - DE/Munich)
<ingo.sander@nsn.com> wrote:
[...]

I guess this happens because checkpoints occur far less frequently on the
standby than on the master. On the master, a checkpoint occurs for every
consumption of three segments because of "checkpoint_segments = 3". On the
standby, by contrast, only checkpoint_timeout has an effect, so a checkpoint
occurs every 30 minutes because of "checkpoint_timeout = 30min".

Should the walreceiver signal the bgwriter to start a checkpoint once it has
received more than checkpoint_segments WAL files, as in normal processing?
Regards,
--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Fri, May 28, 2010 at 11:12 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
On Thu, May 27, 2010 at 11:13 PM, Robert Haas <robertmhaas@gmail.com> wrote:
[...]

If we should avoid infrequent checkpoints themselves, rather than just the
accumulation of WAL files, then the bgwriter rather than the walreceiver
should check whether we've consumed too much WAL, I think. Thoughts?
I've attached a patch that changes the startup process so that it signals the
bgwriter to perform a restartpoint if we've already replayed too many WAL
files. This makes checkpoint_segments trigger restartpoints.

Is this patch worth applying for 9.0? If not, I'll add it to the next CF.
Regards,
--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Attachments:
checkpoint_segments_during_recovery_v1.patch (text/x-diff)
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
*************** static bool reachedMinRecoveryPoint = false;
*** 508,513 ****
--- 508,516 ----
static bool InRedo = false;
+ /* We've already launched bgwriter to perform restartpoint? */
+ static bool bgwriterLaunched = false;
+
/*
* Information logged when we detect a change in one of the parameters
* important for Hot Standby.
*************** static void CheckPointGuts(XLogRecPtr checkPointRedo, int flags);
*** 550,555 ****
--- 553,559 ----
static bool XLogCheckBuffer(XLogRecData *rdata, bool doPageWrites,
XLogRecPtr *lsn, BkpBlock *bkpb);
static bool AdvanceXLInsertBuffer(bool new_segment);
+ static bool XLogCheckpointNeeded(uint32 logid, uint32 logseg);
static void XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch);
static bool InstallXLogFileSegment(uint32 *log, uint32 *seg, char *tmppath,
bool find_free, int *max_advance,
*************** AdvanceXLInsertBuffer(bool new_segment)
*** 1554,1567 ****
/*
* Check whether we've consumed enough xlog space that a checkpoint is needed.
*
! * Caller must have just finished filling the open log file (so that
! * openLogId/openLogSeg are valid). We measure the distance from RedoRecPtr
! * to the open log file and see if that exceeds CheckPointSegments.
*
* Note: it is caller's responsibility that RedoRecPtr is up-to-date.
*/
static bool
! XLogCheckpointNeeded(void)
{
/*
* A straight computation of segment number could overflow 32 bits. Rather
--- 1558,1571 ----
/*
* Check whether we've consumed enough xlog space that a checkpoint is needed.
*
! * Caller must have just finished filling or reading the log file (so that
! * the given logid/logseg are valid). We measure the distance from RedoRecPtr
! * to the log file and see if that exceeds CheckPointSegments.
*
* Note: it is caller's responsibility that RedoRecPtr is up-to-date.
*/
static bool
! XLogCheckpointNeeded(uint32 logid, uint32 logseg)
{
/*
* A straight computation of segment number could overflow 32 bits. Rather
*************** XLogCheckpointNeeded(void)
*** 1577,1584 ****
old_segno = (RedoRecPtr.xlogid % XLogSegSize) * XLogSegsPerFile +
(RedoRecPtr.xrecoff / XLogSegSize);
old_highbits = RedoRecPtr.xlogid / XLogSegSize;
! new_segno = (openLogId % XLogSegSize) * XLogSegsPerFile + openLogSeg;
! new_highbits = openLogId / XLogSegSize;
if (new_highbits != old_highbits ||
new_segno >= old_segno + (uint32) (CheckPointSegments - 1))
return true;
--- 1581,1588 ----
old_segno = (RedoRecPtr.xlogid % XLogSegSize) * XLogSegsPerFile +
(RedoRecPtr.xrecoff / XLogSegSize);
old_highbits = RedoRecPtr.xlogid / XLogSegSize;
! new_segno = (logid % XLogSegSize) * XLogSegsPerFile + logseg;
! new_highbits = logid / XLogSegSize;
if (new_highbits != old_highbits ||
new_segno >= old_segno + (uint32) (CheckPointSegments - 1))
return true;
*************** XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch)
*** 1782,1791 ****
* update RedoRecPtr and recheck.
*/
if (IsUnderPostmaster &&
! XLogCheckpointNeeded())
{
(void) GetRedoRecPtr();
! if (XLogCheckpointNeeded())
RequestCheckpoint(CHECKPOINT_CAUSE_XLOG);
}
}
--- 1786,1795 ----
* update RedoRecPtr and recheck.
*/
if (IsUnderPostmaster &&
! XLogCheckpointNeeded(openLogId, openLogSeg))
{
(void) GetRedoRecPtr();
! if (XLogCheckpointNeeded(openLogId, openLogSeg))
RequestCheckpoint(CHECKPOINT_CAUSE_XLOG);
}
}
*************** StartupXLOG(void)
*** 5643,5649 ****
XLogRecord *record;
uint32 freespace;
TransactionId oldestActiveXID;
- bool bgwriterLaunched = false;
/*
* Read control file and check XLOG status looks valid.
--- 5647,5652 ----
*************** XLogPageRead(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt,
*** 9185,9190 ****
--- 9188,9207 ----
*/
if (readFile >= 0 && !XLByteInSeg(*RecPtr, readId, readSeg))
{
+ /*
+ * Signal bgwriter to start a restartpoint if we've replayed too
+ * much xlog since the last one.
+ */
+ if (bgwriterLaunched)
+ {
+ if (XLogCheckpointNeeded(readId, readSeg))
+ {
+ (void) GetRedoRecPtr();
+ if (XLogCheckpointNeeded(readId, readSeg))
+ RequestCheckpoint(CHECKPOINT_CAUSE_XLOG);
+ }
+ }
+
close(readFile);
readFile = -1;
readSource = 0;
On 30/05/10 06:04, Fujii Masao wrote:
On Fri, May 28, 2010 at 11:12 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
[...]

I've attached a patch that changes the startup process so that it signals the
bgwriter to perform a restartpoint if we've already replayed too many WAL
files. This makes checkpoint_segments trigger restartpoints.
The central question is whether checkpoint_segments should trigger
restartpoints or not. When PITR and restartpoints were introduced, the
answer was "no", on the grounds that when you're doing recovery you're
presumably replaying the logs much faster than they were generated, and
you don't want to slow down the recovery by checkpointing too often.
Now that we have the bgwriter active during recovery, and streaming
replication retains the streamed WAL files so that we risk running
out of disk space with a long checkpoint_timeout, it's time to reconsider
that.
I think we have three options:
1) Leave it as it is, checkpoint_segments doesn't do anything during
recovery/standby mode
2) Change it so that checkpoint_segments does take effect during
recovery/standby
3) Change it so that checkpoint_segments takes effect during streaming
replication, but not during recovery otherwise
I'm leaning towards 3): it still seems reasonable not to slow down
recovery when recovering from archive, but the potential for running out
of disk space warrants doing 3).
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Mon, May 31, 2010 at 6:37 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
The central question is whether checkpoint_segments should trigger
restartpoints or not. When PITR and restartpoints were introduced, the
answer was "no", on the grounds that when you're doing recovery you're
presumably replaying the logs much faster than they were generated, and you
don't want to slow down the recovery by checkpointing too often.
Right.
Now that we have the bgwriter active during recovery, and streaming
replication retains the streamed WAL files so that we risk running out of
disk space with a long checkpoint_timeout, it's time to reconsider that.

I think we have three options:

1) Leave it as it is, checkpoint_segments doesn't do anything during
recovery/standby mode

2) Change it so that checkpoint_segments does take effect during
recovery/standby

3) Change it so that checkpoint_segments takes effect during streaming
replication, but not during recovery otherwise

I'm leaning towards 3): it still seems reasonable not to slow down recovery
when recovering from archive, but the potential for running out of disk
space warrants doing 3).
3) makes sense. But how about 4)?
4) Change it so that checkpoint_segments takes effect in standby mode,
but not during recovery otherwise
This would also lessen the time required to restart the standby in the
file-based log shipping case. Of course, there is a tradeoff between
replay speed during recovery and the time needed to restart the standby.
Regards,
--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
The central question is whether checkpoint_segments should trigger
restartpoints or not. When PITR and restartpoints were introduced, the
answer was "no", on the grounds that when you're doing recovery you're
presumably replaying the logs much faster than they were generated, and
you don't want to slow down the recovery by checkpointing too often.
Now that we have the bgwriter active during recovery, and streaming
replication retains the streamed WAL files so that we risk running
out of disk space with a long checkpoint_timeout, it's time to reconsider
that.
I think we have three options:
What about
(4) pay some attention to the actual elapsed time since the last
restart point?
All the others seem like kluges that are relying on hard-wired rules
that are hoped to achieve something like a time-based checkpoint.
regards, tom lane
On 31/05/10 18:14, Tom Lane wrote:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
[...]

What about

(4) pay some attention to the actual elapsed time since the last
restart point?

All the others seem like kluges that are relying on hard-wired rules
that are hoped to achieve something like a time-based checkpoint.
Huh? We already do time-based restartpoints, there's nothing wrong with
that logic AFAIK. The problem that started this thread is that we don't
do WAL-space consumption based restartpoints, i.e. checkpoint_segments
does nothing in standby mode.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Mon, May 31, 2010 at 7:17 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
4) Change it so that checkpoint_segments takes effect in standby mode,
but not during recovery otherwise
I revised the patch to achieve 4). This will enable checkpoint_segments
to trigger a restartpoint like checkpoint_timeout already does, in
standby mode (i.e., streaming replication or file-based log shipping).
Regards,
--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Attachments:
checkpoint_segments_during_recovery_v2.patch (application/octet-stream)
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
*************** static bool reachedMinRecoveryPoint = false;
*** 508,513 ****
--- 508,516 ----
static bool InRedo = false;
+ /* We've already launched bgwriter to perform restartpoint? */
+ static bool bgwriterLaunched = false;
+
/*
* Information logged when we detect a change in one of the parameters
* important for Hot Standby.
*************** static void CheckPointGuts(XLogRecPtr checkPointRedo, int flags);
*** 550,555 ****
--- 553,559 ----
static bool XLogCheckBuffer(XLogRecData *rdata, bool doPageWrites,
XLogRecPtr *lsn, BkpBlock *bkpb);
static bool AdvanceXLInsertBuffer(bool new_segment);
+ static bool XLogCheckpointNeeded(uint32 logid, uint32 logseg);
static void XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch);
static bool InstallXLogFileSegment(uint32 *log, uint32 *seg, char *tmppath,
bool find_free, int *max_advance,
*************** AdvanceXLInsertBuffer(bool new_segment)
*** 1554,1567 ****
/*
* Check whether we've consumed enough xlog space that a checkpoint is needed.
*
! * Caller must have just finished filling the open log file (so that
! * openLogId/openLogSeg are valid). We measure the distance from RedoRecPtr
! * to the open log file and see if that exceeds CheckPointSegments.
*
* Note: it is caller's responsibility that RedoRecPtr is up-to-date.
*/
static bool
! XLogCheckpointNeeded(void)
{
/*
* A straight computation of segment number could overflow 32 bits. Rather
--- 1558,1571 ----
/*
* Check whether we've consumed enough xlog space that a checkpoint is needed.
*
! * Caller must have just finished filling or reading the log file (so that
! * the given logid/logseg are valid). We measure the distance from RedoRecPtr
! * to the log file and see if that exceeds CheckPointSegments.
*
* Note: it is caller's responsibility that RedoRecPtr is up-to-date.
*/
static bool
! XLogCheckpointNeeded(uint32 logid, uint32 logseg)
{
/*
* A straight computation of segment number could overflow 32 bits. Rather
*************** XLogCheckpointNeeded(void)
*** 1577,1584 ****
old_segno = (RedoRecPtr.xlogid % XLogSegSize) * XLogSegsPerFile +
(RedoRecPtr.xrecoff / XLogSegSize);
old_highbits = RedoRecPtr.xlogid / XLogSegSize;
! new_segno = (openLogId % XLogSegSize) * XLogSegsPerFile + openLogSeg;
! new_highbits = openLogId / XLogSegSize;
if (new_highbits != old_highbits ||
new_segno >= old_segno + (uint32) (CheckPointSegments - 1))
return true;
--- 1581,1588 ----
old_segno = (RedoRecPtr.xlogid % XLogSegSize) * XLogSegsPerFile +
(RedoRecPtr.xrecoff / XLogSegSize);
old_highbits = RedoRecPtr.xlogid / XLogSegSize;
! new_segno = (logid % XLogSegSize) * XLogSegsPerFile + logseg;
! new_highbits = logid / XLogSegSize;
if (new_highbits != old_highbits ||
new_segno >= old_segno + (uint32) (CheckPointSegments - 1))
return true;
*************** XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch)
*** 1782,1791 ****
* update RedoRecPtr and recheck.
*/
if (IsUnderPostmaster &&
! XLogCheckpointNeeded())
{
(void) GetRedoRecPtr();
! if (XLogCheckpointNeeded())
RequestCheckpoint(CHECKPOINT_CAUSE_XLOG);
}
}
--- 1786,1795 ----
* update RedoRecPtr and recheck.
*/
if (IsUnderPostmaster &&
! XLogCheckpointNeeded(openLogId, openLogSeg))
{
(void) GetRedoRecPtr();
! if (XLogCheckpointNeeded(openLogId, openLogSeg))
RequestCheckpoint(CHECKPOINT_CAUSE_XLOG);
}
}
*************** StartupXLOG(void)
*** 5643,5649 ****
XLogRecord *record;
uint32 freespace;
TransactionId oldestActiveXID;
- bool bgwriterLaunched = false;
/*
* Read control file and check XLOG status looks valid.
--- 5647,5652 ----
*************** XLogPageRead(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt,
*** 9185,9190 ****
--- 9188,9207 ----
*/
if (readFile >= 0 && !XLByteInSeg(*RecPtr, readId, readSeg))
{
+ /*
+ * Signal bgwriter to start a restartpoint if we've replayed too
+ * much xlog since the last one.
+ */
+ if (StandbyMode && bgwriterLaunched)
+ {
+ if (XLogCheckpointNeeded(readId, readSeg))
+ {
+ (void) GetRedoRecPtr();
+ if (XLogCheckpointNeeded(readId, readSeg))
+ RequestCheckpoint(CHECKPOINT_CAUSE_XLOG);
+ }
+ }
+
close(readFile);
readFile = -1;
readSource = 0;
*** a/src/backend/replication/walreceiver.c
--- b/src/backend/replication/walreceiver.c
*************** XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
*** 505,517 ****
buf += byteswritten;
LogstreamResult.Write = recptr;
-
- /*
- * XXX: Should we signal bgwriter to start a restartpoint if we've
- * consumed too much xlog since the last one, like in normal
- * processing? But this is not worth doing unless a restartpoint can
- * be created independently from a checkpoint record.
- */
}
}
--- 505,510 ----
On 02/06/10 06:23, Fujii Masao wrote:
On Mon, May 31, 2010 at 7:17 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
4) Change it so that checkpoint_segments takes effect in standby mode,
but not during recovery otherwise

I revised the patch to achieve 4). This will enable checkpoint_segments
to trigger a restartpoint like checkpoint_timeout already does, in
standby mode (i.e., streaming replication or file-based log shipping).
Hmm, XLogCtl->Insert.RedoRecPtr is not updated during recovery, so this
doesn't work.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Wed, Jun 2, 2010 at 8:40 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
On 02/06/10 06:23, Fujii Masao wrote:
[...]

Hmm, XLogCtl->Insert.RedoRecPtr is not updated during recovery, so this
doesn't work.
Oops! I revised the patch, which changes CreateRestartPoint() so that
it updates XLogCtl->Insert.RedoRecPtr.
Regards,
--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Attachments:
checkpoint_segments_during_recovery_v3.patch (application/octet-stream)
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
*************** static bool reachedMinRecoveryPoint = false;
*** 508,513 ****
--- 508,516 ----
static bool InRedo = false;
+ /* We've already launched bgwriter to perform restartpoint? */
+ static bool bgwriterLaunched = false;
+
/*
* Information logged when we detect a change in one of the parameters
* important for Hot Standby.
*************** static void CheckPointGuts(XLogRecPtr checkPointRedo, int flags);
*** 550,555 ****
--- 553,559 ----
static bool XLogCheckBuffer(XLogRecData *rdata, bool doPageWrites,
XLogRecPtr *lsn, BkpBlock *bkpb);
static bool AdvanceXLInsertBuffer(bool new_segment);
+ static bool XLogCheckpointNeeded(uint32 logid, uint32 logseg);
static void XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch);
static bool InstallXLogFileSegment(uint32 *log, uint32 *seg, char *tmppath,
bool find_free, int *max_advance,
*************** AdvanceXLInsertBuffer(bool new_segment)
*** 1554,1567 ****
/*
* Check whether we've consumed enough xlog space that a checkpoint is needed.
*
! * Caller must have just finished filling the open log file (so that
! * openLogId/openLogSeg are valid). We measure the distance from RedoRecPtr
! * to the open log file and see if that exceeds CheckPointSegments.
*
* Note: it is caller's responsibility that RedoRecPtr is up-to-date.
*/
static bool
! XLogCheckpointNeeded(void)
{
/*
* A straight computation of segment number could overflow 32 bits. Rather
--- 1558,1571 ----
/*
* Check whether we've consumed enough xlog space that a checkpoint is needed.
*
! * Caller must have just finished filling or reading the log file (so that
! * the given logid/logseg are valid). We measure the distance from RedoRecPtr
! * to the log file and see if that exceeds CheckPointSegments.
*
* Note: it is caller's responsibility that RedoRecPtr is up-to-date.
*/
static bool
! XLogCheckpointNeeded(uint32 logid, uint32 logseg)
{
/*
* A straight computation of segment number could overflow 32 bits. Rather
*************** XLogCheckpointNeeded(void)
*** 1577,1584 ****
old_segno = (RedoRecPtr.xlogid % XLogSegSize) * XLogSegsPerFile +
(RedoRecPtr.xrecoff / XLogSegSize);
old_highbits = RedoRecPtr.xlogid / XLogSegSize;
! new_segno = (openLogId % XLogSegSize) * XLogSegsPerFile + openLogSeg;
! new_highbits = openLogId / XLogSegSize;
if (new_highbits != old_highbits ||
new_segno >= old_segno + (uint32) (CheckPointSegments - 1))
return true;
--- 1581,1588 ----
old_segno = (RedoRecPtr.xlogid % XLogSegSize) * XLogSegsPerFile +
(RedoRecPtr.xrecoff / XLogSegSize);
old_highbits = RedoRecPtr.xlogid / XLogSegSize;
! new_segno = (logid % XLogSegSize) * XLogSegsPerFile + logseg;
! new_highbits = logid / XLogSegSize;
if (new_highbits != old_highbits ||
new_segno >= old_segno + (uint32) (CheckPointSegments - 1))
return true;
*************** XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch)
*** 1782,1791 ****
* update RedoRecPtr and recheck.
*/
if (IsUnderPostmaster &&
! XLogCheckpointNeeded())
{
(void) GetRedoRecPtr();
! if (XLogCheckpointNeeded())
RequestCheckpoint(CHECKPOINT_CAUSE_XLOG);
}
}
--- 1786,1795 ----
* update RedoRecPtr and recheck.
*/
if (IsUnderPostmaster &&
! XLogCheckpointNeeded(openLogId, openLogSeg))
{
(void) GetRedoRecPtr();
! if (XLogCheckpointNeeded(openLogId, openLogSeg))
RequestCheckpoint(CHECKPOINT_CAUSE_XLOG);
}
}
*************** StartupXLOG(void)
*** 5641,5647 ****
XLogRecord *record;
uint32 freespace;
TransactionId oldestActiveXID;
- bool bgwriterLaunched = false;
/*
* Read control file and check XLOG status looks valid.
--- 5645,5650 ----
*************** CreateRestartPoint(int flags)
*** 7552,7557 ****
--- 7555,7570 ----
return false;
}
+ /*
+ * Update the shared RedoRecPtr for the startup process to request
+ * a future restartpoint according to checkpoint_segments. We don't
+ * need to hold the insert lock here since there should be no other
+ * processes updating it during recovery.
+ */
+ SpinLockAcquire(&xlogctl->info_lck);
+ xlogctl->Insert.RedoRecPtr = lastCheckPoint.redo;
+ SpinLockRelease(&xlogctl->info_lck);
+
if (log_checkpoints)
{
/*
*************** XLogPageRead(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt,
*** 9183,9188 ****
--- 9196,9215 ----
*/
if (readFile >= 0 && !XLByteInSeg(*RecPtr, readId, readSeg))
{
+ /*
+ * Signal bgwriter to start a restartpoint if we've replayed too
+ * much xlog since the last one.
+ */
+ if (StandbyMode && bgwriterLaunched)
+ {
+ if (XLogCheckpointNeeded(readId, readSeg))
+ {
+ (void) GetRedoRecPtr();
+ if (XLogCheckpointNeeded(readId, readSeg))
+ RequestCheckpoint(CHECKPOINT_CAUSE_XLOG);
+ }
+ }
+
close(readFile);
readFile = -1;
readSource = 0;
*** a/src/backend/replication/walreceiver.c
--- b/src/backend/replication/walreceiver.c
*************** XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
*** 505,517 ****
buf += byteswritten;
LogstreamResult.Write = recptr;
-
- /*
- * XXX: Should we signal bgwriter to start a restartpoint if we've
- * consumed too much xlog since the last one, like in normal
- * processing? But this is not worth doing unless a restartpoint can
- * be created independently from a checkpoint record.
- */
}
}
--- 505,510 ----
On Wed, Jun 2, 2010 at 10:24 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
On Wed, Jun 2, 2010 at 8:40 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
[...]

Oops! I revised the patch, which changes CreateRestartPoint() so that
it updates XLogCtl->Insert.RedoRecPtr.
This is one of the open items. Please review the patch I submitted, and
please feel free to comment!
Regards,
--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On 09/06/10 05:26, Fujii Masao wrote:
On Wed, Jun 2, 2010 at 10:24 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
[...]

This is one of the open items. Please review the patch I submitted, and
please feel free to comment!
Ok, committed with some cosmetic changes.
I thought hard about whether we should do this at all, since the original
decision to do only time-based restartpoints was deliberate. I concluded
that the tradeoffs have changed enough since then to make this reasonable.
We now perform restartpoints in the bgwriter, so replay continues while the
restartpoint is being performed, making it less disruptive than it used to
be; and secondly, SR stores the streamed WAL files in pg_xlog, making it
important to perform restartpoints often enough to clean them up and avoid
running out of disk space.
BTW, should there be doc changes for this? I didn't find anything
explaining how restartpoints are triggered; we should add a paragraph
somewhere.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Thu, Jun 10, 2010 at 12:09 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
Ok, committed with some cosmetic changes.
Thanks!
BTW, should there be doc changes for this? I didn't find anything explaining
how restartpoints are triggered; we should add a paragraph somewhere.
+1
What about the attached patch?
Regards,
--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Attachments:
trigger_restartpoint_doc_v1.patch (application/octet-stream)
*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
*************** SET ENABLE_SEQSCAN TO OFF;
*** 1902,1907 ****
--- 1902,1908 ----
for standby purposes, and the number of old WAL segments available
for standbys is determined based only on the location of the previous
checkpoint and status of WAL archiving.
+ This parameter has no effect on a restartpoint.
This parameter can only be set in the <filename>postgresql.conf</>
file or on the server command line.
</para>
*** a/doc/src/sgml/wal.sgml
--- b/doc/src/sgml/wal.sgml
***************
*** 424,429 ****
--- 424,430 ----
<para>
There will always be at least one WAL segment file, and will normally
not be more than (2 + <varname>checkpoint_completion_target</varname>) * <varname>checkpoint_segments</varname> + 1
+ or <varname>checkpoint_segments</> + <xref linkend="guc-wal-keep-segments"> + 1
files. Each segment file is normally 16 MB (though this size can be
altered when building the server). You can use this to estimate space
requirements for <acronym>WAL</acronym>.
***************
*** 436,441 ****
--- 437,458 ----
</para>
<para>
+ In archive recovery or standby mode, the server periodically performs
+ <firstterm>restartpoints</><indexterm><primary>restartpoint</></>
+ which are similar to checkpoints in normal operation: the server forces
+ all its state to disk, updates the <filename>pg_control</> file to
+ indicate that the already-processed WAL data need not be scanned again,
+ and then recycles old log segment files if they are in the
+ <filename>pg_xlog</> directory. Note that this recycling is not affected
+ by <varname>wal_keep_segments</> at all. A restartpoint is triggered,
+ if at least one checkpoint record has been replayed since the last
+ restartpoint, every <varname>checkpoint_timeout</> seconds, or every
+ <varname>checkpoint_segments</> log segments only in standby mode,
+ whichever comes first. In log shipping case, the checkpoint interval
+ on the standby is normally smaller than that on the master.
+ </para>
+
+ <para>
There are two commonly used internal <acronym>WAL</acronym> functions:
<function>LogInsert</function> and <function>LogFlush</function>.
<function>LogInsert</function> is used to place a new record into
On 10/06/10 09:14, Fujii Masao wrote:
On Thu, Jun 10, 2010 at 12:09 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
BTW, should there be doc changes for this? I didn't find anything explaining
how restartpoints are triggered; we should add a paragraph somewhere.

+1
What about the attached patch?
(description of wal_keep_segments)

*** 1902,1907 **** SET ENABLE_SEQSCAN TO OFF;
--- 1902,1908 ----
      for standby purposes, and the number of old WAL segments available
      for standbys is determined based only on the location of the previous
      checkpoint and status of WAL archiving.
+     This parameter has no effect on a restartpoint.
      This parameter can only be set in the <filename>postgresql.conf</>
      file or on the server command line.
    </para>
Hmm, I wonder if wal_keep_segments should take effect during recovery
too? We don't support cascading slaves, but if you have two slaves
connected to one master (without an archive), and you perform failover
to one of them, without wal_keep_segments the 2nd slave might not find
all the files it needs in the new master. Then again, that won't work
without an archive anyway, because we error out at a TLI mismatch in
replication. Seems like this is 9.1 material.
*** a/doc/src/sgml/wal.sgml
--- b/doc/src/sgml/wal.sgml
***************
*** 424,429 ****
--- 424,430 ----
    <para>
     There will always be at least one WAL segment file, and will normally
     not be more than (2 + <varname>checkpoint_completion_target</varname>) * <varname>checkpoint_segments</varname> + 1
+    or <varname>checkpoint_segments</> + <xref linkend="guc-wal-keep-segments"> + 1
     files. Each segment file is normally 16 MB (though this size can be
     altered when building the server). You can use this to estimate space
     requirements for <acronym>WAL</acronym>.
That's not true; wal_keep_segments is the minimum number of files retained,
independently of checkpoint_segments. The correct formula is
max((2 + checkpoint_completion_target) * checkpoint_segments, wal_keep_segments).
    <para>
+    In archive recovery or standby mode, the server periodically performs
+    <firstterm>restartpoints</><indexterm><primary>restartpoint</></>
+    which are similar to checkpoints in normal operation: the server forces
+    all its state to disk, updates the <filename>pg_control</> file to
+    indicate that the already-processed WAL data need not be scanned again,
+    and then recycles old log segment files if they are in the
+    <filename>pg_xlog</> directory. Note that this recycling is not affected
+    by <varname>wal_keep_segments</> at all. A restartpoint is triggered,
+    if at least one checkpoint record has been replayed since the last
+    restartpoint, every <varname>checkpoint_timeout</> seconds, or every
+    <varname>checkpoint_segments</> log segments only in standby mode,
+    whichever comes first. [...]
That last sentence is a bit unclear. How about:
A restartpoint is triggered if at least one checkpoint record has been
replayed and <varname>checkpoint_timeout</> seconds have passed since
last restartpoint. In standby mode, a restartpoint is also triggered if
<varname>checkpoint_segments</> log segments have been replayed since
last restartpoint and at least one checkpoint record has been replayed
since.
... In log shipping case, the checkpoint interval
+    on the standby is normally smaller than that on the master.
+   </para>
What does that mean? Restartpoints can't be performed more frequently
than checkpoints in the master because restartpoints can only be
performed at checkpoint records.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Thu, Jun 10, 2010 at 7:19 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
--- 1902,1908 ----
[...]
+     This parameter has no effect on a restartpoint.
[...]

Hmm, I wonder if wal_keep_segments should take effect during recovery too?
We don't support cascading slaves, but if you have two slaves connected to
one master (without an archive), and you perform failover to one of them,
without wal_keep_segments the 2nd slave might not find all the files it
needs in the new master. Then again, that won't work without an archive
anyway, because we error out at a TLI mismatch in replication. Seems like
this is 9.1 material.
Yep; since SR currently cannot get over a gap in the TLI, it's not worth
having wal_keep_segments take effect during recovery.
[...] requirements for <acronym>WAL</acronym>.

That's not true; wal_keep_segments is the minimum number of files retained,
independently of checkpoint_segments. The correct formula is
max((2 + checkpoint_completion_target) * checkpoint_segments, wal_keep_segments).
You mean that the maximum number of WAL files is:
max {
(2 + checkpoint_completion_target) * checkpoint_segments,
wal_keep_segments
}
Just after a checkpoint removes old WAL files, there might be wal_keep_segments
WAL files. Additionally, checkpoint_segments WAL files might be generated before
the subsequent checkpoint removes old WAL files. So I think that the maximum
number is
max {
(2 + checkpoint_completion_target) * checkpoint_segments,
wal_keep_segments + checkpoint_segments
}
Am I missing something?
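(For illustration only, not from the original mail: with the settings from the start of this thread, checkpoint_segments = 3 and wal_keep_segments = 10, plus the default checkpoint_completion_target = 0.5, the two candidates evaluate to (2 + 0.5) * 3 = 7.5 and 10 + 3 = 13, so the second formula predicts up to 13 retained segments on the master; at the 4MB patched segment size that is about 52MB, the same ballpark as the ~60MB Ingo observed.)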
[...] whichever comes first.

That last sentence is a bit unclear. How about:
A restartpoint is triggered if at least one checkpoint record has been
replayed and <varname>checkpoint_timeout</> seconds have passed since last
restartpoint. In standby mode, a restartpoint is also triggered if
<varname>checkpoint_segments</> log segments have been replayed since last
restartpoint and at least one checkpoint record has been replayed since.
Thanks! Seems good.
... In log shipping case, the checkpoint interval
+    on the standby is normally smaller than that on the master.
+   </para>

What does that mean? Restartpoints can't be performed more frequently than
checkpoints in the master because restartpoints can only be performed at
checkpoint records.
Yes, that's what I meant.
Regards,
--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Did these changes ever get into the docs? I don't think so.
---------------------------------------------------------------------------
Fujii Masao wrote:
[quoted: the exchange upthread between Heikki Linnakangas and Fujii Masao
about wal_keep_segments, the maximum number of WAL files in pg_xlog, and
the restartpoint documentation wording]
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ None of us is going to be here forever. +
On Thu, Jul 1, 2010 at 11:39 AM, Bruce Momjian <bruce@momjian.us> wrote:
Did these changes ever get into the docs? I don't think so.
Thanks for reminding me. I attached the updated patch.
That last sentence is a bit unclear. How about:
A restartpoint is triggered if at least one checkpoint record has been
replayed and <varname>checkpoint_timeout</> seconds have passed since last
restartpoint. In standby mode, a restartpoint is also triggered if
<varname>checkpoint_segments</> log segments have been replayed since last
restartpoint and at least one checkpoint record has been replayed since.
... In log shipping case, the checkpoint interval
+    on the standby is normally smaller than that on the master.
+   </para>

What does that mean? Restartpoints can't be performed more frequently than
checkpoints in the master because restartpoints can only be performed at
checkpoint records.
I adopted Heikki's sentences.
[...]
That's not true; wal_keep_segments is the minimum number of files retained,
independently of checkpoint_segments. The correct formula is
max((2 + checkpoint_completion_target) * checkpoint_segments, wal_keep_segments).

You mean that the maximum number of WAL files is:

max {
(2 + checkpoint_completion_target) * checkpoint_segments,
wal_keep_segments
}

Just after a checkpoint removes old WAL files, there might be wal_keep_segments
WAL files. Additionally, checkpoint_segments WAL files might be generated before
the subsequent checkpoint removes old WAL files. So I think that the maximum
number is

max {
(2 + checkpoint_completion_target) * checkpoint_segments,
wal_keep_segments + checkpoint_segments
}

Am I missing something?
I've left this part as it is. Before committing the patch, we need to check
whether my reasoning is correct.
Regards,
--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Attachments:
trigger_restartpoint_doc_v2.patch (application/octet-stream)
*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
*************** SET ENABLE_SEQSCAN TO OFF;
*** 1905,1910 ****
--- 1905,1911 ----
for standby purposes, and the number of old WAL segments available
for standbys is determined based only on the location of the previous
checkpoint and status of WAL archiving.
+ This parameter has no effect on a restartpoint.
This parameter can only be set in the <filename>postgresql.conf</>
file or on the server command line.
</para>
*** a/doc/src/sgml/wal.sgml
--- b/doc/src/sgml/wal.sgml
***************
*** 424,429 ****
--- 424,430 ----
<para>
There will always be at least one WAL segment file, and will normally
not be more than (2 + <varname>checkpoint_completion_target</varname>) * <varname>checkpoint_segments</varname> + 1
+ or <varname>checkpoint_segments</> + <xref linkend="guc-wal-keep-segments"> + 1
files. Each segment file is normally 16 MB (though this size can be
altered when building the server). You can use this to estimate space
requirements for <acronym>WAL</acronym>.
***************
*** 436,441 ****
--- 437,461 ----
</para>
<para>
+ In archive recovery or standby mode, the server periodically performs
+ <firstterm>restartpoints</><indexterm><primary>restartpoint</></>
+ which are similar to checkpoints in normal operation: the server forces
+ all its state to disk, updates the <filename>pg_control</> file to
+ indicate that the already-processed WAL data need not be scanned again,
+ and then recycles old log segment files if they are in the
+ <filename>pg_xlog</> directory. Note that this recycling is not affected
+ by <varname>wal_keep_segments</> at all. A restartpoint is triggered
+ if at least one checkpoint record has been replayed and
+ <varname>checkpoint_timeout</> seconds have passed since last restartpoint.
+ In standby mode, a restartpoint is also triggered if
+ <varname>checkpoint_segments</> log segments have been replayed since
+ last restartpoint and at least one checkpoint record has been replayed
+ since. In log shipping case, restartpoints can't be performed more
+ frequently than checkpoints in the master because restartpoints can only
+ be performed at checkpoint records.
+ </para>
+
+ <para>
There are two commonly used internal <acronym>WAL</acronym> functions:
<function>LogInsert</function> and <function>LogFlush</function>.
<function>LogInsert</function> is used to place a new record into
On Thu, Jul 1, 2010 at 1:09 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
Thanks for reminding me. I attached the updated patch.
This patch has been left uncommitted for half a month. Is no one interested
in the patch?
The patch adds documentation about the relationship between restartpoints
and the checkpoint_segments parameter.
Regards,
--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On 16/07/10 11:13, Fujii Masao wrote:
On Thu, Jul 1, 2010 at 1:09 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
Thanks for reminding me. I attached the updated patch.

This patch has been left uncommitted for half a month. Is no one interested
in the patch?
Sorry for the lack of interest ;-)
The patch adds documentation about the relationship between restartpoints
and the checkpoint_segments parameter.
Thanks, committed with minor editorialization.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Sat, Jul 17, 2010 at 4:22 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
The patch adds documentation about the relationship between restartpoints
and the checkpoint_segments parameter.

Thanks, committed with minor editorialization.
Thanks.
There will always be at least one WAL segment file, and will normally
not be more than (2 + <varname>checkpoint_completion_target</varname>) * <varname>checkpoint_segments</varname> + 1
+ or <varname>checkpoint_segments</> + <xref linkend="guc-wal-keep-segments"> + 1
files. Each segment file is normally 16 MB (though this size can be
altered when building the server). You can use this to estimate space
requirements for <acronym>WAL</acronym>.
Sorry, I was wrong here. The correct formula is:
(2 + checkpoint_completion_target) * checkpoint_segments +
wal_keep_segments + 1
The attached patch fixes this. I've also attached a PDF file which
illustrates the proof of the formula.
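(Worked example, for illustration: with the default 16MB segment size and, say, checkpoint_segments = 3, checkpoint_completion_target = 0.5, and wal_keep_segments = 10, the formula gives (2 + 0.5) * 3 + 10 + 1 = 18.5, i.e. up to about 19 segment files, or roughly 300MB of pg_xlog.)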
Regards,
--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Attachments:
num_of_wal_formula_v1.patch (application/octet-stream)
*** a/doc/src/sgml/wal.sgml
--- b/doc/src/sgml/wal.sgml
***************
*** 448,455 ****
<para>
There will always be at least one WAL segment file, and will normally
! not be more than (2 + <varname>checkpoint_completion_target</varname>) * <varname>checkpoint_segments</varname> + 1
! or <varname>checkpoint_segments</> + <xref linkend="guc-wal-keep-segments"> + 1
files. Each segment file is normally 16 MB (though this size can be
altered when building the server). You can use this to estimate space
requirements for <acronym>WAL</acronym>.
--- 448,454 ----
<para>
There will always be at least one WAL segment file, and will normally
! not be more than (2 + <varname>checkpoint_completion_target</varname>) * <varname>checkpoint_segments</varname> + <xref linkend="guc-wal-keep-segments"> + 1
files. Each segment file is normally 16 MB (though this size can be
altered when building the server). You can use this to estimate space
requirements for <acronym>WAL</acronym>.