[BUG] Checkpointer on hot standby runs without looking checkpoint_segments

Started by Kyotaro HORIGUCHI · over 13 years ago · 37 messages
#1 Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
2 attachment(s)

Hello, this is a bug report and a patch for it.

The first patch in the attachments is for 9.2dev and the next one
is for 9.1.3.

On the current 9.2dev, IsCheckpointOnSchedule@checkpointer.c does
not check progress against the WAL segments written. This makes the
checkpointer always run at the pace dictated by checkpoint_timeout,
regardless of the WAL advance rate.

This leads to an unexpected imbalance in the number of WAL segment
files between the master and the standby(s) when WAL advances at a
high rate. What is worse, the master has a much higher chance of
removing some WAL segments before the standby receives them.

XLogPageRead()@xlog.c triggers a checkpoint by referring to WAL
segment advance, so I think this is a bug in the 9.1 bgwriter as
well. The attached patches fix it on 9.2dev and 9.1.3 respectively.

In the version backported to 9.1.3, bgwriter.c is modified instead
of 9.2's checkpointer.c, and GetWalRcvWriteRecPtr() is used as the
equivalent of 9.2's GetStandbyFlushRecPtr().

By the way, GetStandbyFlushRecPtr() acquires a spinlock internally.
It might be enough to read XLogCtl->recoveryLastRecPtr without the
lock to make a rough estimate, but I can't tell whether that is
safe or not. The same discussion applies to GetWalRcvWriteRecPtr()
on 9.1.3.

However, it seems to work fine in a simple test.
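For reference, the progress check the patch generalizes (the elapsed_xlogs computation in IsCheckpointOnSchedule()) can be sketched as the following self-contained fragment; the struct and constants are simplified stand-ins for the real xlog_internal.h definitions, with illustrative values:

```c
#include <stdint.h>

/* Simplified stand-in for the pre-9.3 two-part XLogRecPtr. */
typedef struct XLogRecPtr
{
	uint32_t	xlogid;		/* log file #, 4 GB of WAL each */
	uint32_t	xrecoff;	/* byte offset within the log file */
} XLogRecPtr;

/* Illustrative values: 16 MB segments, 255 usable segments per log file. */
#define XLogSegSize		(16 * 1024 * 1024)
#define XLogSegsPerFile	(0xFFFFFFFFU / XLogSegSize)

/*
 * Fraction of checkpoint_segments worth of WAL consumed between 'start'
 * (the location noted when the checkpoint began) and 'recptr' (the
 * current insert, or on a standby replay, location).  The scheduler
 * compares this against its 'progress' fraction.
 */
static double
elapsed_xlogs(XLogRecPtr recptr, XLogRecPtr start, int checkpoint_segments)
{
	return (((double) (int32_t) (recptr.xlogid - start.xlogid)) * XLogSegsPerFile +
			((double) recptr.xrecoff - (double) start.xrecoff) / XLogSegSize) /
		checkpoint_segments;
}
```

The int32 cast on the xlogid difference keeps the estimate sane even when the segment counter wraps between the two samples.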

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

== My e-mail address has been changed since Apr. 1, 2012.

Attachments:

standby_checkpoint_segments_9.2dev_fix_20120416.patch (text/x-patch; charset=us-ascii)
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index c9473f7..f86e9b9 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -491,8 +491,8 @@ CheckpointerMain(void)
 			 * Initialize checkpointer-private variables used during checkpoint.
 			 */
 			ckpt_active = true;
-			if (!do_restartpoint)
-				ckpt_start_recptr = GetInsertRecPtr();
+			ckpt_start_recptr =
+				do_restartpoint ? GetStandbyFlushRecPtr() : GetInsertRecPtr();
 			ckpt_start_time = now;
 			ckpt_cached_elapsed = 0;
 
@@ -731,28 +731,29 @@ IsCheckpointOnSchedule(double progress)
 		return false;
 
 	/*
-	 * Check progress against WAL segments written and checkpoint_segments.
+	 * Check progress against WAL segments written, or replayed for
+	 * hot standby, and checkpoint_segments.
 	 *
 	 * We compare the current WAL insert location against the location
-	 * computed before calling CreateCheckPoint. The code in XLogInsert that
-	 * actually triggers a checkpoint when checkpoint_segments is exceeded
-	 * compares against RedoRecptr, so this is not completely accurate.
-	 * However, it's good enough for our purposes, we're only calculating an
-	 * estimate anyway.
+	 * computed before calling CreateCheckPoint. The code in
+	 * XLogInsert that actually triggers a checkpoint when
+	 * checkpoint_segments is exceeded compares against RedoRecptr.
+	 * Similarly, we consult WAL flush location instead on hot
+	 * standbys and XLogPageRead compares it aganst RedoRecptr, too.
+	 * Altough these are not completely accurate, it's good enough for
+	 * our purposes, we're only calculating an estimate anyway.
 	 */
-	if (!RecoveryInProgress())
+	
+	recptr = RecoveryInProgress() ? GetStandbyFlushRecPtr() : GetInsertRecPtr();
+	elapsed_xlogs =
+		(((double) (int32) (recptr.xlogid - ckpt_start_recptr.xlogid)) * XLogSegsPerFile +
+		 ((double) recptr.xrecoff - (double) ckpt_start_recptr.xrecoff) / XLogSegSize) /
+		CheckPointSegments;
+
+	if (progress < elapsed_xlogs)
 	{
-		recptr = GetInsertRecPtr();
-		elapsed_xlogs =
-			(((double) (int32) (recptr.xlogid - ckpt_start_recptr.xlogid)) * XLogSegsPerFile +
-			 ((double) recptr.xrecoff - (double) ckpt_start_recptr.xrecoff) / XLogSegSize) /
-			CheckPointSegments;
-
-		if (progress < elapsed_xlogs)
-		{
-			ckpt_cached_elapsed = elapsed_xlogs;
-			return false;
-		}
+		ckpt_cached_elapsed = elapsed_xlogs;
+		return false;
 	}
 
 	/*
standby_checkpoint_segments_9.1.3_fix_20120416.patch (text/x-patch; charset=us-ascii)
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 5643ec8..0ce9945 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -56,6 +56,7 @@
 #include "pgstat.h"
 #include "postmaster/bgwriter.h"
 #include "replication/syncrep.h"
+#include "replication/walreceiver.h"
 #include "storage/bufmgr.h"
 #include "storage/fd.h"
 #include "storage/ipc.h"
@@ -489,8 +490,8 @@ BackgroundWriterMain(void)
 			 * Initialize bgwriter-private variables used during checkpoint.
 			 */
 			ckpt_active = true;
-			if (!do_restartpoint)
-				ckpt_start_recptr = GetInsertRecPtr();
+			ckpt_start_recptr =	do_restartpoint ?
+				GetWalRcvWriteRecPtr(NULL) : GetInsertRecPtr();
 			ckpt_start_time = now;
 			ckpt_cached_elapsed = 0;
 
@@ -764,30 +765,32 @@ IsCheckpointOnSchedule(double progress)
 		return false;
 
 	/*
-	 * Check progress against WAL segments written and checkpoint_segments.
+	 * Check progress against WAL segments written, or replayed for
+	 * hot standby, and checkpoint_segments.
 	 *
 	 * We compare the current WAL insert location against the location
-	 * computed before calling CreateCheckPoint. The code in XLogInsert that
-	 * actually triggers a checkpoint when checkpoint_segments is exceeded
-	 * compares against RedoRecptr, so this is not completely accurate.
-	 * However, it's good enough for our purposes, we're only calculating an
-	 * estimate anyway.
+	 * computed before calling CreateCheckPoint. The code in
+	 * XLogInsert that actually triggers a checkpoint when
+	 * checkpoint_segments is exceeded compares against RedoRecptr.
+	 * Similarly, we consult WAL write location instead on hot
+	 * standbys and XLogPageRead compares it aganst RedoRecptr, too.
+	 * Altough these are not completely accurate, it's good enough for
+	 * our purposes, we're only calculating an estimate anyway.
 	 */
-	if (!RecoveryInProgress())
+	
+	recptr =
+		RecoveryInProgress() ? GetWalRcvWriteRecPtr(NULL) : GetInsertRecPtr();
+	elapsed_xlogs =
+		(((double) (int32) (recptr.xlogid - ckpt_start_recptr.xlogid)) * XLogSegsPerFile +
+		 ((double) recptr.xrecoff - (double) ckpt_start_recptr.xrecoff) / XLogSegSize) /
+		CheckPointSegments;
+	
+	if (progress < elapsed_xlogs)
 	{
-		recptr = GetInsertRecPtr();
-		elapsed_xlogs =
-			(((double) (int32) (recptr.xlogid - ckpt_start_recptr.xlogid)) * XLogSegsPerFile +
-			 ((double) recptr.xrecoff - (double) ckpt_start_recptr.xrecoff) / XLogSegSize) /
-			CheckPointSegments;
-
-		if (progress < elapsed_xlogs)
-		{
-			ckpt_cached_elapsed = elapsed_xlogs;
-			return false;
-		}
+		ckpt_cached_elapsed = elapsed_xlogs;
+		return false;
 	}
-
+	
 	/*
 	 * Check progress against time elapsed and checkpoint_timeout.
 	 */
#2 Simon Riggs
simon@2ndQuadrant.com
In reply to: Kyotaro HORIGUCHI (#1)
Re: [BUG] Checkpointer on hot standby runs without looking checkpoint_segments

On Mon, Apr 16, 2012 at 1:05 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

Hello, this is bug report and a patch for it.

The first patch in the attachments is for 9.2dev and next one is
for 9.1.3.

On the current 9.2dev, IsCheckpointOnSchedule@checkpointer.c does
not check against WAL segments written. This makes checkpointer
always run at the speed according to checkpoint_timeout
regardless of WAL advancing rate.

Thanks, I'll take a look.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#3 Fujii Masao
masao.fujii@gmail.com
In reply to: Kyotaro HORIGUCHI (#1)
Re: [BUG] Checkpointer on hot standby runs without looking checkpoint_segments

On Mon, Apr 16, 2012 at 9:05 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

In the backported version to 9.1.3, bgwriter.c is modified
instead of checkpointer.c in 9.2. And GetWalRcvWriteRecPtr() is
used as the equivalent of GetStandbyFlushRecPtr() in 9.2.

In 9.2, GetXLogReplayRecPtr() should be used instead of
GetStandbyFlushRecPtr(). A restartpoint is scheduled to finish
before the next restartpoint is triggered, and a restartpoint is
triggered if too many WAL files have been replayed since the last
one. So a restartpoint should be scheduled according to the replay
location, not the receive location.

-	 * computed before calling CreateCheckPoint. The code in XLogInsert that
-	 * actually triggers a checkpoint when checkpoint_segments is exceeded
-	 * compares against RedoRecptr, so this is not completely accurate.
-	 * However, it's good enough for our purposes, we're only calculating an
-	 * estimate anyway.
+	 * computed before calling CreateCheckPoint. The code in
+	 * XLogInsert that actually triggers a checkpoint when
+	 * checkpoint_segments is exceeded compares against RedoRecptr.
+	 * Similarly, we consult WAL flush location instead on hot
+	 * standbys and XLogPageRead compares it aganst RedoRecptr, too.
+	 * Altough these are not completely accurate, it's good enough for
+	 * our purposes, we're only calculating an estimate anyway.

I think that basically it's better not to change the comments (i.e., not to add
the line feed) if their contents are the same as previous ones, to highlight
what you actually changed in the patch.

Typo: RedoRecptr should be RedoRecPtr?

Regards,

--
Fujii Masao

#4 Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Fujii Masao (#3)
Re: [BUG] Checkpointer on hot standby runs without looking checkpoint_segments

Hello, thank you for the comment.

In the backported version to 9.1.3, bgwriter.c is modified
instead of checkpointer.c in 9.2. And GetWalRcvWriteRecPtr() is
used as the equivalent of GetStandbyFlushRecPtr() in 9.2.

In 9,2, GetXLogReplayRecPtr() should be used instead of
GetStandbyFlushRecPtr(). A restartpoint is scheduled to finish
before next restartpoint is triggered. A restartpoint is
triggered if too much WAL files have been replayed since last
restartpoint. So a restartpoint should be scheduled according
to the replay location not the receive location.

I basically agree with that. But I got confused when looking into
GetStandbyFlushRecPtr().

| if (XLByteLT(receivePtr, replayPtr))
| return XLByteLT(replayPtr, restorePtr) ? restorePtr : replayPtr;
| else
| return XLByteLT(receivePtr, restorePtr) ? restorePtr : receivePtr;

This seems to imply that receivePtr may be behind replayPtr. I
don't understand what condition causes that, but the bottom line, I
think, is that a restartpoint should be based on WAL that is surely
synced. So I chose GetStandbyFlushRecPtr() to read the location.

If receivePtr/restorePtr always precede or equal replayPtr, I would
prefer GetXLogReplayRecPtr() as you suggest.

(And perhaps a comment about the ordering of these pointers should
be supplied for that part.)

I think that basically it's better not to change the comments
(i.e., not to add the line feed) if their contents are the same
as previous ones, to highlight what you actually changed in the
patch.

Hmm. It is a matter of priority between keeping the patch compact,
to highlight the change, and keeping the resulting comment
consistent. I think the latter takes precedence over the former.

That said, I could have found a better-balanced wording. But more
than that, I found the fill-column used for this comment to be too
short...

Typo: RedoRecptr should be RedoRecPtr?

I think that's right. I unconsciously carried that spelling over
from the original comment.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

== My e-mail address has been changed since Apr. 1, 2012.

#5 Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Kyotaro HORIGUCHI (#4)
2 attachment(s)
Re: [BUG] Checkpointer on hot standby runs without looking checkpoint_segments

This is a new version of the patch. I replaced
GetStandbyFlushRecPtr() with GetXLogReplayRecPtr() for checking
checkpoint progress, following Fujii's suggestion.

The first one is for 9.2dev, and the second is the backported 9.1.3 version.

===
By the way, I took a close look around that area.

I agree with it basically. But I've get confused to look into
GetStandbyFlushRecPtr().

| if (XLByteLT(receivePtr, replayPtr))
| return XLByteLT(replayPtr, restorePtr) ? restorePtr : replayPtr;
| else
| return XLByteLT(receivePtr, restorePtr) ? restorePtr : receivePtr;

- receivePtr seems to be updated just after syncing received xlog,
- replayPtr is updated just BEFORE the xlog_redo operation,
- restorePtr is updated AFTER xlog_redo(), and
- replayPtr does not seem to exceed receivePtr.

These seem quite reasonable, and together they yield the following
ordering:

restorePtr <= replayPtr <= receivePtr

But XLByteLT(receivePtr, replayPtr) should then never return true
under the condition above. Is something wrong in my assumptions?

Anyway, I take it that you are saying the location returned by
GetXLogReplayRecPtr() is always flushed.
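To make the confusion concrete, here is a minimal, self-contained model of the quoted selection logic; XLByteLT and the record-pointer struct are simplified stand-ins for the real definitions, re-declared only for illustration:

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for the pre-9.3 two-part XLogRecPtr. */
typedef struct XLogRecPtr
{
	uint32_t	xlogid;
	uint32_t	xrecoff;
} XLogRecPtr;

/* XLByteLT(a, b): true iff location a is strictly before location b. */
static bool
XLByteLT(XLogRecPtr a, XLogRecPtr b)
{
	return a.xlogid < b.xlogid ||
		(a.xlogid == b.xlogid && a.xrecoff < b.xrecoff);
}

/*
 * Model of the quoted code from GetStandbyFlushRecPtr(): pick the later
 * of receivePtr and replayPtr, then clamp it up to restorePtr -- i.e.
 * effectively the maximum of the three locations.
 */
static XLogRecPtr
standby_flush_ptr(XLogRecPtr receivePtr, XLogRecPtr replayPtr,
				  XLogRecPtr restorePtr)
{
	if (XLByteLT(receivePtr, replayPtr))
		return XLByteLT(replayPtr, restorePtr) ? restorePtr : replayPtr;
	else
		return XLByteLT(receivePtr, restorePtr) ? restorePtr : receivePtr;
}
```

Under the assumed ordering restorePtr <= replayPtr <= receivePtr this always returns receivePtr, and the first branch would be dead code; the branch only matters if replay can run past the streamed receive location.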

--
Kyotaro Horiguchi
NTT Open Source Software Center

== My e-mail address has been changed since Apr. 1, 2012.

Attachments:

standby_checkpoint_segments_9.2dev_fix_20120417.patch (text/x-patch; charset=us-ascii)
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index c9473f7..c2fafbf 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -491,8 +491,8 @@ CheckpointerMain(void)
 			 * Initialize checkpointer-private variables used during checkpoint.
 			 */
 			ckpt_active = true;
-			if (!do_restartpoint)
-				ckpt_start_recptr = GetInsertRecPtr();
+			ckpt_start_recptr =
+				do_restartpoint ? GetXLogReplayRecPtr(NULL) : GetInsertRecPtr();
 			ckpt_start_time = now;
 			ckpt_cached_elapsed = 0;
 
@@ -731,28 +731,30 @@ IsCheckpointOnSchedule(double progress)
 		return false;
 
 	/*
-	 * Check progress against WAL segments written and checkpoint_segments.
+	 * Check progress against WAL segments written, or replayed for
+	 * hot standby, and checkpoint_segments.
 	 *
 	 * We compare the current WAL insert location against the location
-	 * computed before calling CreateCheckPoint. The code in XLogInsert that
-	 * actually triggers a checkpoint when checkpoint_segments is exceeded
-	 * compares against RedoRecptr, so this is not completely accurate.
-	 * However, it's good enough for our purposes, we're only calculating an
-	 * estimate anyway.
+	 * computed before calling CreateCheckPoint. The code in
+	 * XLogInsert that actually triggers a checkpoint when
+	 * checkpoint_segments is exceeded compares against RedoRecPtr.
+	 * Similarly, we consult WAL replay location instead on hot
+	 * standbys and XLogPageRead compares it aganst RedoRecPtr, too.
+	 * Altough these are not completely accurate, it's good enough for
+	 * our purposes, we're only calculating an estimate anyway.
 	 */
-	if (!RecoveryInProgress())
+	
+	recptr =
+		RecoveryInProgress() ? GetXLogReplayRecPtr(NULL) : GetInsertRecPtr();
+	elapsed_xlogs =
+		(((double) (int32) (recptr.xlogid - ckpt_start_recptr.xlogid)) * XLogSegsPerFile +
+		 ((double) recptr.xrecoff - (double) ckpt_start_recptr.xrecoff) / XLogSegSize) /
+		CheckPointSegments;
+
+	if (progress < elapsed_xlogs)
 	{
-		recptr = GetInsertRecPtr();
-		elapsed_xlogs =
-			(((double) (int32) (recptr.xlogid - ckpt_start_recptr.xlogid)) * XLogSegsPerFile +
-			 ((double) recptr.xrecoff - (double) ckpt_start_recptr.xrecoff) / XLogSegSize) /
-			CheckPointSegments;
-
-		if (progress < elapsed_xlogs)
-		{
-			ckpt_cached_elapsed = elapsed_xlogs;
-			return false;
-		}
+		ckpt_cached_elapsed = elapsed_xlogs;
+		return false;
 	}
 
 	/*
standby_checkpoint_segments_9.1.3_fix_20120417.patch (text/x-patch; charset=us-ascii)
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 5643ec8..a272866 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -56,6 +56,7 @@
 #include "pgstat.h"
 #include "postmaster/bgwriter.h"
 #include "replication/syncrep.h"
+#include "replication/walreceiver.h"
 #include "storage/bufmgr.h"
 #include "storage/fd.h"
 #include "storage/ipc.h"
@@ -489,8 +490,8 @@ BackgroundWriterMain(void)
 			 * Initialize bgwriter-private variables used during checkpoint.
 			 */
 			ckpt_active = true;
-			if (!do_restartpoint)
-				ckpt_start_recptr = GetInsertRecPtr();
+			ckpt_start_recptr =	do_restartpoint ?
+				GetXLogReplayRecPtr() : GetInsertRecPtr();
 			ckpt_start_time = now;
 			ckpt_cached_elapsed = 0;
 
@@ -764,30 +765,32 @@ IsCheckpointOnSchedule(double progress)
 		return false;
 
 	/*
-	 * Check progress against WAL segments written and checkpoint_segments.
+	 * Check progress against WAL segments written, or replayed for
+	 * hot standby, and checkpoint_segments.
 	 *
 	 * We compare the current WAL insert location against the location
-	 * computed before calling CreateCheckPoint. The code in XLogInsert that
-	 * actually triggers a checkpoint when checkpoint_segments is exceeded
-	 * compares against RedoRecptr, so this is not completely accurate.
-	 * However, it's good enough for our purposes, we're only calculating an
-	 * estimate anyway.
+	 * computed before calling CreateCheckPoint. The code in
+	 * XLogInsert that actually triggers a checkpoint when
+	 * checkpoint_segments is exceeded compares against RedoRecPtr.
+	 * Similarly, we consult WAL replay location instead on hot
+	 * standbys and XLogPageRead compares it aganst RedoRecPtr, too.
+	 * Altough these are not completely accurate, it's good enough for
+	 * our purposes, we're only calculating an estimate anyway.
 	 */
-	if (!RecoveryInProgress())
+	
+	recptr =
+		RecoveryInProgress() ? GetXLogReplayRecPtr() : GetInsertRecPtr();
+	elapsed_xlogs =
+		(((double) (int32) (recptr.xlogid - ckpt_start_recptr.xlogid)) * XLogSegsPerFile +
+		 ((double) recptr.xrecoff - (double) ckpt_start_recptr.xrecoff) / XLogSegSize) /
+		CheckPointSegments;
+	
+	if (progress < elapsed_xlogs)
 	{
-		recptr = GetInsertRecPtr();
-		elapsed_xlogs =
-			(((double) (int32) (recptr.xlogid - ckpt_start_recptr.xlogid)) * XLogSegsPerFile +
-			 ((double) recptr.xrecoff - (double) ckpt_start_recptr.xrecoff) / XLogSegSize) /
-			CheckPointSegments;
-
-		if (progress < elapsed_xlogs)
-		{
-			ckpt_cached_elapsed = elapsed_xlogs;
-			return false;
-		}
+		ckpt_cached_elapsed = elapsed_xlogs;
+		return false;
 	}
-
+	
 	/*
 	 * Check progress against time elapsed and checkpoint_timeout.
 	 */
#6 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Kyotaro HORIGUCHI (#5)
Re: [BUG] Checkpointer on hot standby runs without looking checkpoint_segments

On 17.04.2012 09:50, Kyotaro HORIGUCHI wrote:

This is new version of the patch.
I replaced GetStandbyFlushRecPtr with GetXLogReplayRecPtr to
check progress of checkpoint following Fujii's sugestion.

The reason we haven't historically obeyed checkpoint_segments during
recovery is that it slows down the recovery unnecessarily if you're
restoring from a backup and you replay, say, one week's worth of WAL
files. If for example you have checkpoint_segments=10 and
checkpoint_timeout='15 mins' in the server that generated the WAL, you
would be constantly performing a restartpoint if you trigger one every
10 segments.

You could argue that you should obey checkpoint_segments in a standby
server that's caught up with the master, but AFAICS the patch doesn't
try to detect that.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#7 Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Heikki Linnakangas (#6)
Re: [BUG] Checkpointer on hot standby runs without looking checkpoint_segments

Hello,

The reason we haven't historically obeyed checkpoint_segments
during recovery is that it slows down the recovery
unnecessarily if you're restoring from a backup and you replay,

The variable StandbyMode is false during archive recovery, so no
checkpoint is triggered by segments then.

xlog.c:10026 (in some version around 9.2)
| /*
| * Request a restartpoint if we've replayed too much
| * xlog since the last one.
| */
| if (StandbyMode && bgwriterLaunched)
| {
| if (XLogCheckpointNeeded(readId, readSeg))

You could argue that you should obey checkpoint_segments in a
standby server that's caught up with the master, but AFAICS the
patch doesn't try to detect that.

Concerning checkpoints, there seems to be no need for the standby to
know whether it has caught up with its master or not. Checkpoints
are already kicked off by checkpoint_timeout regardless of the
sync_state.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

== My e-mail address has been changed since Apr. 1, 2012.

#8 Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Kyotaro HORIGUCHI (#7)
Re: [BUG] Checkpointer on hot standby runs without looking checkpoint_segments

Sorry, I wrote something wrong.

The reason we haven't historically obeyed checkpoint_segments
during recovery is that it slows down the recovery
unnecessarily if you're restoring from a backup and you replay,

The variable StandbyMode is false on archive recovery, so no
checkpoint triggerred during then.

Nevertheless, checkpoints will still be triggered by
checkpoint_timeout and run at the possibly higher speed governed by
checkpoint_segments. That is undesirable behavior from such a point
of view.

But I think referring to checkpoint_segments in such a case should
be inhibited, and I suppose that is possible by using StandbyMode in
IsCheckpointOnSchedule().

I will correct the patch later.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

== My e-mail address has been changed since Apr. 1, 2012.

#9 Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Kyotaro HORIGUCHI (#8)
1 attachment(s)
Re: [BUG] Checkpointer on hot standby runs without looking checkpoint_segments

Hello, attached to this message is a patch which has not been
tested; it is just to show the approach.

On Tue, Apr 17, 2012 at 9:38 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

But I think referring checkpoint_segment on such case should be
inhibited, and I suppose it is possible using StandbyMode in
IsCheckpointOnSchedule(), I suppose.

I will correct the patch later.

Hmm. StandbyMode is a local variable which cannot be accessed in
the checkpointer. But WalRcvInProgress(), which shows whether the
WAL receiver is running, seems usable to ENABLE governing progress
by checkpoint_segments.

| IsCheckpointOnSchedule(double progress)
| {
....
| /*
| * Inhibit governing progress by segments in archive recovery.
| */
| recovery_in_progress = RecoveryInProgress();
| if (!recovery_in_progress || WalRcvInProgress())
| {
| recptr = recovery_in_progress ? GetXLogReplayRecPtr(NULL) :
| GetInsertRecPtr();

How about this?

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

== My e-mail address has been changed since Apr. 1, 2012.

Attachments:

standby_checkpoint_segments_9.2dev_fix_20120417v2.patch (application/octet-stream)
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index c9473f7..c0883b1 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -47,6 +47,7 @@
 #include "pgstat.h"
 #include "postmaster/bgwriter.h"
 #include "replication/syncrep.h"
+#include "replication/walreceiver.h"
 #include "storage/bufmgr.h"
 #include "storage/ipc.h"
 #include "storage/lwlock.h"
@@ -491,8 +492,8 @@ CheckpointerMain(void)
 			 * Initialize checkpointer-private variables used during checkpoint.
 			 */
 			ckpt_active = true;
-			if (!do_restartpoint)
-				ckpt_start_recptr = GetInsertRecPtr();
+			ckpt_start_recptr =
+				do_restartpoint ? GetXLogReplayRecPtr(NULL) : GetInsertRecPtr();
 			ckpt_start_time = now;
 			ckpt_cached_elapsed = 0;
 
@@ -715,6 +716,7 @@ IsCheckpointOnSchedule(double progress)
 	struct timeval now;
 	double		elapsed_xlogs,
 				elapsed_time;
+	bool        recovery_in_progress;
 
 	Assert(ckpt_active);
 
@@ -731,18 +733,27 @@ IsCheckpointOnSchedule(double progress)
 		return false;
 
 	/*
-	 * Check progress against WAL segments written and checkpoint_segments.
+	 * Check progress against WAL segments written, or replayed for
+	 * hot standby, and checkpoint_segments.
 	 *
 	 * We compare the current WAL insert location against the location
-	 * computed before calling CreateCheckPoint. The code in XLogInsert that
-	 * actually triggers a checkpoint when checkpoint_segments is exceeded
-	 * compares against RedoRecptr, so this is not completely accurate.
-	 * However, it's good enough for our purposes, we're only calculating an
-	 * estimate anyway.
+	 * computed before calling CreateCheckPoint. The code in
+	 * XLogInsert that actually triggers a checkpoint when
+	 * checkpoint_segments is exceeded compares against RedoRecPtr.
+	 * Similarly, we consult WAL replay location instead on hot
+	 * standbys and XLogPageRead compares it aganst RedoRecPtr, too.
+	 * Altough these are not completely accurate, it's good enough for
+	 * our purposes, we're only calculating an estimate anyway.
+	 */
+
+	/*
+	 * Inhibit governing progress by segments in archive recovery.
 	 */
-	if (!RecoveryInProgress())
+	recovery_in_progress = RecoveryInProgress();
+	if (!recovery_in_progress || WalRcvInProgress())
 	{
-		recptr = GetInsertRecPtr();
+		recptr = recovery_in_progress ? GetXLogReplayRecPtr(NULL) :
+			GetInsertRecPtr();
 		elapsed_xlogs =
 			(((double) (int32) (recptr.xlogid - ckpt_start_recptr.xlogid)) * XLogSegsPerFile +
 			 ((double) recptr.xrecoff - (double) ckpt_start_recptr.xrecoff) / XLogSegSize) /
#10 Fujii Masao
masao.fujii@gmail.com
In reply to: Kyotaro HORIGUCHI (#5)
Re: [BUG] Checkpointer on hot standby runs without looking checkpoint_segments

On Tue, Apr 17, 2012 at 3:50 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

These seems quite reasonable. These conditions make following
conditional expression.

    restorePtr <= replayPtr <= receivePtr

But XLByteLT(recievePtr, replayPtr) this should not return true
under the condition above.. Something wrong in my assumption?

When walreceiver is not running, i.e., when the startup process
reads the WAL files from the archive, the replay location can get
ahead of the receive location.

Regards,

--
Fujii Masao

#11 Fujii Masao
masao.fujii@gmail.com
In reply to: Kyotaro HORIGUCHI (#9)
Re: [BUG] Checkpointer on hot standby runs without looking checkpoint_segments

On Tue, Apr 17, 2012 at 11:50 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

Hmm. StandbyMode is a local variable which cannot be accessed in
checkpointer. But WalRcvInProgress() which shows if wal receiver
is running seems to be usable to ENABLE governing progress by
checkpoint_segments.

Even when walreceiver is not running and WAL files are read from
the archive, checkpoint_segments can trigger a restartpoint. In
this case, ISTM a restartpoint should be scheduled according to
checkpoint_segments, so I don't think that checking
WalRcvInProgress() for that purpose is the right thing. Instead,
what about sharing the StandbyMode flag among processes via shared
memory, like XLogCtl?

Regards,

--
Fujii Masao

#12 Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Fujii Masao (#11)
Re: [BUG] Checkpointer on hot standby runs without looking checkpoint_segments

Hmm. StandbyMode is a local variable which cannot be accessed in
checkpointer. But WalRcvInProgress() which shows if wal receiver
is running seems to be usable to ENABLE governing progress by
checkpoint_segments.

Even when walreceiver is not running and WAL files are read from the archive,
checkpoint_segments can trigger a restartpoint. In this case, ISTM a
restartpoint
should be scheduled according to checkpoint_segments, so I don't think that
checking WalRcvInProgress() for that purpose is right thing. Instead, what about
sharing StandbyMode flag among processes via shared memory like XLogCtl?

I tried that at first. But I suppose the requirement here is 'if
the segments being read come via the replication stream, enable
throttling by checkpoint_segments', and WalRcvInProgress() seems to
fit that check. Plus, adding SharedStartupStandbyMode into
XLogCtlData seems to come with some annoyances which would not pay
off.

By the way, do you have some advise about GetStandbyFlushRecPtr()
and the order of the locations? I'm embarrassed with that...

I agree with it basically. But I've get confused to look into
GetStandbyFlushRecPtr().

| if (XLByteLT(receivePtr, replayPtr))
| return XLByteLT(replayPtr, restorePtr) ? restorePtr : replayPtr;
| else
| return XLByteLT(receivePtr, restorePtr) ? restorePtr : receivePtr;

- receivePtr seems always updated just after syncing received xlog.
- replayPtr is updated just BEFORE xlog_redo operation, and
- restorePtr is updated AFTER xlog_redo().
- And, replayPtr seems not exceeds receivePtr.

These seems quite reasonable. These conditions make following
conditional expression.

restorePtr <= replayPtr <= receivePtr

But XLByteLT(recievePtr, replayPtr) this should not return true
under the condition above.. Something wrong in my assumption?

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

== My e-mail address has been changed since Apr. 1, 2012.

#13 Fujii Masao
masao.fujii@gmail.com
In reply to: Kyotaro HORIGUCHI (#12)
Re: [BUG] Checkpointer on hot standby runs without looking checkpoint_segments

On Wed, Apr 18, 2012 at 10:22 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

I tried that at first. But I suppose the requirement here is 'if
reading segments comes via replication stream, enable throttling
by checkpoint_segments.' and WalRcvInProgress() seems fit to
check that.

If so, what if replication is terminated and restarted repeatedly
while a restartpoint is running? It would sometimes obey and
sometimes ignore checkpoint_segments, which seems like strange
behavior.

Plus, adding SharedStartupStandbyMode into
XLogCtlData seems accompanied with some annoyances which would
not pay.

Hmm... what are you worried about? I don't think that sharing the variable
via XLogCtl is so difficult. Please see the code to share archiveCleanupCommand
from the startup process to the checkpointer. It looks very simple.
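For illustration only, that sharing pattern could be sketched roughly like this. All names here are hypothetical stand-ins; in the real server the struct would live in shared memory, and reads and writes of the flag would follow xlog.c's existing conventions:

```c
#include <stdbool.h>

/*
 * Hypothetical sketch of publishing the startup process's StandbyMode
 * through an XLogCtl-like shared struct.  A plain static stands in for
 * shared memory here; since the flag is set once by the startup process
 * before others read it, no locking is modeled.
 */
typedef struct XLogCtlSketch
{
	bool		SharedStandbyMode;	/* set by startup, read by checkpointer */
} XLogCtlSketch;

static XLogCtlSketch XLogCtlStub;

/* Startup process: publish its local StandbyMode value. */
static void
PublishStandbyMode(bool standby_mode)
{
	XLogCtlStub.SharedStandbyMode = standby_mode;
}

/* Checkpointer: ask whether the startup process is in standby mode. */
static bool
IsStandbyModeSketch(void)
{
	return XLogCtlStub.SharedStandbyMode;
}
```

This mirrors how archiveCleanupCommand is handed from the startup process to the checkpointer, as Fujii notes: one writer, simple readers.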

By the way, do you have some advise about GetStandbyFlushRecPtr()
and the order of the locations? I'm embarrassed with that...

Sorry. I could not parse this.... Could you explain this again?

Regards,

--
Fujii Masao

#14 Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Fujii Masao (#13)
Re: [BUG] Checkpointer on hot standby runs without looking checkpoint_segments

I am convinced that the current patch has a problem, and will come
up with a new patch later.

====

I tried that at first. But I suppose the requirement here is 'if
reading segments comes via replication stream, enable throttling
by checkpoint_segments.' and WalRcvInProgress() seems fit to
check that.

If so, what if replication is terminated and restarted repeatedly while
a restartpoint is running? It sometimes obeys and sometimes ignores
checkpoint_segments. Which seems strange behavior.

I see your point, and agree that this is something that should not happen.

Plus, adding SharedStartupStandbyMode into
XLogCtlData seems accompanied with some annoyances which would
not pay.

Hmm... what are you worried about? I don't think that sharing the variable
via XLogCtl is so difficult. Please see the code to share archiveCleanupCommand
from the startup process to the checkpointer. It looks very simple.

The mechanism itself is nothing complex. I was afraid of putting so
many variables with similar meanings side by side in shared memory.
But I am now convinced that the additional shared variable is
preferable because it makes the logic and behavior clear and sane.

I will come up with an updated patch soon.

By the way, do you have any advice about GetStandbyFlushRecPtr()
and the order of the locations? I'm puzzled by that...

Sorry, I could not parse this... Could you explain it again?

My point is,

- Is the assumption correct that "restorePtr <= replayPtr <= receivePtr"?

- If it is correct, what does the code below in GetStandbyFlushRecPtr() mean?

if (XLByteLT(receivePtr, replayPtr))

- Or, if it is wrong, what situation could break the
expression "restorePtr <= replayPtr <= receivePtr"?
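For readers following along, a minimal sketch of the comparison in question, using hypothetical stand-ins for the pre-9.3 split WAL location (the real XLogRecPtr and XLByteLT live in xlogdefs.h; the names below are illustrative):

```c
#include <stdint.h>

/* Hypothetical stand-in for the 9.1/9.2-era split WAL location. */
typedef struct
{
	uint32_t	xlogid;			/* log file number */
	uint32_t	xrecoff;		/* byte offset within the log file */
} XLogRecPtrSketch;

/* Mirrors the XLByteLT() macro: is location a strictly before b? */
static int
xlbyte_lt(XLogRecPtrSketch a, XLogRecPtrSketch b)
{
	return a.xlogid < b.xlogid ||
		(a.xlogid == b.xlogid && a.xrecoff < b.xrecoff);
}
```

Under the assumed invariant restorePtr <= replayPtr <= receivePtr, xlbyte_lt(receivePtr, replayPtr) could never hold; as discussed later in the thread, it can hold when replay advances from the archive while no walreceiver is running.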

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

== My e-mail address has been changed since Apr. 1, 2012.

#15Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Kyotaro HORIGUCHI (#14)
1 attachment(s)
Re: [BUG] Checkpointer on hot standby runs without looking checkpoint_segments

Hello, this is new version of standby checkpoint_segments patch.

- xlog.c: Make StandbyMode shared.

- checkpointer.c: Use IsStandbyMode() to check if postmaster is
under standby mode.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

== My e-mail address has been changed since Apr. 1, 2012.

Attachments:

standby_checkpoint_segments_9.2dev_fix_20120419.patch (text/x-patch; charset=us-ascii)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 8d0aabf..2457840 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -177,6 +177,12 @@ static bool LocalRecoveryInProgress = true;
 static bool LocalHotStandbyActive = false;
 
 /*
+ * Local copy of SharedIsStandbyMode variable.  True actually means "not known,
+ * need to check the shared state".
+ */
+static bool LocalIsStandbyMode = true;
+
+/*
  * Local state for XLogInsertAllowed():
  *		1: unconditionally allowed to insert XLOG
  *		0: unconditionally not allowed to insert XLOG
@@ -206,7 +212,6 @@ static TimestampTz recoveryTargetTime;
 static char *recoveryTargetName;
 
 /* options taken from recovery.conf for XLOG streaming */
-static bool StandbyMode = false;
 static char *PrimaryConnInfo = NULL;
 static char *TriggerFile = NULL;
 
@@ -427,6 +432,11 @@ typedef struct XLogCtlData
 	bool		SharedHotStandbyActive;
 
 	/*
+	 * SharedIsStandbyMode indicates if we are running in standby mode.
+	 */
+	bool		SharedIsStandbyMode;
+
+	/*
 	 * recoveryWakeupLatch is used to wake up the startup process to continue
 	 * WAL replay, if it is waiting for WAL to arrive or failover trigger file
 	 * to appear.
@@ -619,6 +629,7 @@ static void SetLatestXTime(TimestampTz xtime);
 static void SetCurrentChunkStartTime(TimestampTz xtime);
 static void CheckRequiredParameterValues(void);
 static void XLogReportParameters(void);
+static void ExitStandbyMode(void);
 static void LocalSetXLogInsertAllowed(void);
 static void CheckPointGuts(XLogRecPtr checkPointRedo, int flags);
 static void KeepLogSeg(XLogRecPtr recptr, uint32 *logId, uint32 *logSeg);
@@ -3115,7 +3126,7 @@ RestoreArchivedFile(char *path, const char *xlogfname,
 				 * incorrectly conclude we've reached the end of WAL and we're
 				 * done recovering ...
 				 */
-				if (StandbyMode && stat_buf.st_size < expectedSize)
+				if (IsStandbyMode() && stat_buf.st_size < expectedSize)
 					elevel = DEBUG1;
 				else
 					elevel = FATAL;
@@ -4072,7 +4083,7 @@ next_record_is_invalid:
 	}
 
 	/* In standby-mode, keep trying */
-	if (StandbyMode)
+	if (IsStandbyMode())
 		goto retry;
 	else
 		return NULL;
@@ -5098,6 +5109,7 @@ XLOGShmemInit(void)
 	XLogCtl->XLogCacheBlck = XLOGbuffers - 1;
 	XLogCtl->SharedRecoveryInProgress = true;
 	XLogCtl->SharedHotStandbyActive = false;
+	XLogCtl->SharedIsStandbyMode = true;
 	XLogCtl->Insert.currpage = (XLogPageHeader) (XLogCtl->pages);
 	SpinLockInit(&XLogCtl->info_lck);
 	InitSharedLatch(&XLogCtl->recoveryWakeupLatch);
@@ -5289,6 +5301,7 @@ readRecoveryCommandFile(void)
 	FILE	   *fd;
 	TimeLineID	rtli = 0;
 	bool		rtliGiven = false;
+	bool        standby_mode = false;
 	ConfigVariable *item,
 			   *head = NULL,
 			   *tail = NULL;
@@ -5439,13 +5452,14 @@ readRecoveryCommandFile(void)
 		}
 		else if (strcmp(item->name, "standby_mode") == 0)
 		{
-			if (!parse_bool(item->value, &StandbyMode))
+			if (!parse_bool(item->value, &standby_mode))
 				ereport(ERROR,
 						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 						 errmsg("parameter \"%s\" requires a Boolean value",
 								"standby_mode")));
 			ereport(DEBUG2,
 					(errmsg_internal("standby_mode = '%s'", item->value)));
+
 		}
 		else if (strcmp(item->name, "primary_conninfo") == 0)
 		{
@@ -5470,7 +5484,7 @@ readRecoveryCommandFile(void)
 	/*
 	 * Check for compulsory parameters
 	 */
-	if (StandbyMode)
+	if (standby_mode)
 	{
 		if (PrimaryConnInfo == NULL && recoveryRestoreCommand == NULL)
 			ereport(WARNING,
@@ -5480,6 +5494,7 @@ readRecoveryCommandFile(void)
 	}
 	else
 	{
+		ExitStandbyMode();
 		if (recoveryRestoreCommand == NULL)
 			ereport(FATAL,
 					(errmsg("recovery command file \"%s\" must specify restore_command when standby mode is not enabled",
@@ -6086,7 +6101,7 @@ StartupXLOG(void)
 
 	if (InArchiveRecovery)
 	{
-		if (StandbyMode)
+		if (IsStandbyMode())
 			ereport(LOG,
 					(errmsg("entering standby mode")));
 		else if (recoveryTarget == RECOVERY_TARGET_XID)
@@ -6110,7 +6125,7 @@ StartupXLOG(void)
 	 * Take ownership of the wakeup latch if we're going to sleep during
 	 * recovery.
 	 */
-	if (StandbyMode)
+	if (IsStandbyMode())
 		OwnLatch(&XLogCtl->recoveryWakeupLatch);
 
 	if (read_backup_label(&checkPointLoc, &backupEndRequired,
@@ -6169,7 +6184,7 @@ StartupXLOG(void)
 					(errmsg("checkpoint record is at %X/%X",
 							checkPointLoc.xlogid, checkPointLoc.xrecoff)));
 		}
-		else if (StandbyMode)
+		else if (IsStandbyMode())
 		{
 			/*
 			 * The last valid checkpoint record required for a streaming
@@ -6683,7 +6698,7 @@ StartupXLOG(void)
 	 * We don't need the latch anymore. It's not strictly necessary to disown
 	 * it, but let's do it for the sake of tidiness.
 	 */
-	if (StandbyMode)
+	if (IsStandbyMode())
 		DisownLatch(&XLogCtl->recoveryWakeupLatch);
 
 	/*
@@ -6691,7 +6706,7 @@ StartupXLOG(void)
 	 * recovery to force fetching the files (which would be required at end of
 	 * recovery, e.g., timeline history file) from archive or pg_xlog.
 	 */
-	StandbyMode = false;
+	ExitStandbyMode();
 
 	/*
 	 * Re-fetch the last valid or last applied record, so we can identify the
@@ -7096,7 +7111,7 @@ RecoveryInProgress(void)
  * since normal backends won't ever be able to connect until this returns
  * true. Postmaster knows this by way of signal, not via shared memory.
  *
- * Unlike testing standbyState, this works in any process that's connected to
+ * Unlike testing InRecovery, this works in any process that's connected to
  * shared memory.
  */
 bool
@@ -7124,6 +7139,53 @@ HotStandbyActive(void)
 }
 
 /*
+ * Are we running in standby mode?
+ *
+ * Unlike testing InRecovery, this works in any process that's connected to
+ * shared memory.
+ */
+bool
+IsStandbyMode(void)
+{
+	/*
+	 * We check shared state each time only until exiting standby mode. We
+	 * can't re-enter standby mode, so there's no need to keep checking after
+	 * the shared variable has once been seen false.
+	 */
+	if (!LocalIsStandbyMode)
+		return false;
+	else
+	{
+		/* use volatile pointer to prevent code rearrangement */
+		volatile XLogCtlData *xlogctl = XLogCtl;
+		
+		/* spinlock is essential on machines with weak memory ordering! */
+		SpinLockAcquire(&xlogctl->info_lck);
+		LocalIsStandbyMode = xlogctl->SharedIsStandbyMode;
+		SpinLockRelease(&xlogctl->info_lck);
+
+		return LocalIsStandbyMode;
+	}
+
+}
+
+
+/*
+ * Inform the processes connected to shared memory that we exit standby mode.
+ */
+static void
+ExitStandbyMode()
+{
+	/* use volatile pointer to prevent code rearrangement */
+	volatile XLogCtlData *xlogctl = XLogCtl;
+
+	/* spinlock is essential on machines with weak memory ordering! */
+	SpinLockAcquire(&xlogctl->info_lck);
+	LocalIsStandbyMode = xlogctl->SharedIsStandbyMode = false;
+	SpinLockRelease(&xlogctl->info_lck);
+}
+
+/*
  * Is this process allowed to insert new WAL records?
  *
  * Ordinarily this is essentially equivalent to !RecoveryInProgress().
@@ -10026,7 +10088,7 @@ XLogPageRead(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt,
 		 * Request a restartpoint if we've replayed too much
 		 * xlog since the last one.
 		 */
-		if (StandbyMode && bgwriterLaunched)
+		if (IsStandbyMode() && bgwriterLaunched)
 		{
 			if (XLogCheckpointNeeded(readId, readSeg))
 			{
@@ -10048,7 +10110,7 @@ retry:
 	if (readFile < 0 ||
 		(readSource == XLOG_FROM_STREAM && !XLByteLT(*RecPtr, receivedUpto)))
 	{
-		if (StandbyMode)
+		if (IsStandbyMode())
 		{
 			/*
 			 * In standby mode, wait for the requested record to become
@@ -10362,7 +10424,7 @@ next_record_is_invalid:
 	readSource = 0;
 
 	/* In standby-mode, keep trying */
-	if (StandbyMode)
+	if (IsStandbyMode())
 		goto retry;
 	else
 		return false;
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index c9473f7..f91bd52 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -491,8 +491,8 @@ CheckpointerMain(void)
 			 * Initialize checkpointer-private variables used during checkpoint.
 			 */
 			ckpt_active = true;
-			if (!do_restartpoint)
-				ckpt_start_recptr = GetInsertRecPtr();
+			ckpt_start_recptr =
+				do_restartpoint ? GetXLogReplayRecPtr(NULL) : GetInsertRecPtr();
 			ckpt_start_time = now;
 			ckpt_cached_elapsed = 0;
 
@@ -715,6 +715,7 @@ IsCheckpointOnSchedule(double progress)
 	struct timeval now;
 	double		elapsed_xlogs,
 				elapsed_time;
+	bool        recovery_in_progress;
 
 	Assert(ckpt_active);
 
@@ -731,18 +732,27 @@ IsCheckpointOnSchedule(double progress)
 		return false;
 
 	/*
-	 * Check progress against WAL segments written and checkpoint_segments.
+	 * Check progress against WAL segments written, or replayed for
+	 * hot standby, and checkpoint_segments.
 	 *
 	 * We compare the current WAL insert location against the location
-	 * computed before calling CreateCheckPoint. The code in XLogInsert that
-	 * actually triggers a checkpoint when checkpoint_segments is exceeded
-	 * compares against RedoRecptr, so this is not completely accurate.
-	 * However, it's good enough for our purposes, we're only calculating an
-	 * estimate anyway.
+	 * computed before calling CreateCheckPoint. The code in
+	 * XLogInsert that actually triggers a checkpoint when
+	 * checkpoint_segments is exceeded compares against RedoRecPtr.
+	 * Similarly, we consult WAL replay location instead on hot
+	 * standbys and XLogPageRead compares it against RedoRecPtr, too.
+	 * Although these are not completely accurate, it's good enough for
+	 * our purposes, we're only calculating an estimate anyway.
+	 */
+
+	/*
+	 * Inhibit governing progress by segments in archive recovery.
 	 */
-	if (!RecoveryInProgress())
+	recovery_in_progress = RecoveryInProgress();
+	if (!recovery_in_progress || IsStandbyMode())
 	{
-		recptr = GetInsertRecPtr();
+		recptr = recovery_in_progress ? GetXLogReplayRecPtr(NULL) :
+			GetInsertRecPtr();
 		elapsed_xlogs =
 			(((double) (int32) (recptr.xlogid - ckpt_start_recptr.xlogid)) * XLogSegsPerFile +
 			 ((double) recptr.xrecoff - (double) ckpt_start_recptr.xrecoff) / XLogSegSize) /
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index f8aecef..329119b 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -285,6 +285,7 @@ extern void issue_xlog_fsync(int fd, uint32 log, uint32 seg);
 
 extern bool RecoveryInProgress(void);
 extern bool HotStandbyActive(void);
+extern bool IsStandbyMode(void);
 extern bool XLogInsertAllowed(void);
 extern void GetXLogReceiptTime(TimestampTz *rtime, bool *fromStream);
 extern XLogRecPtr GetXLogReplayRecPtr(XLogRecPtr *restoreLastRecPtr);
#16Fujii Masao
masao.fujii@gmail.com
In reply to: Kyotaro HORIGUCHI (#15)
1 attachment(s)
Re: [BUG] Checkpointer on hot standby runs without looking checkpoint_segments

On Thu, Apr 19, 2012 at 2:20 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

Hello, this is new version of standby checkpoint_segments patch.

Thanks for the patch!

 - xlog.c: Make StandbyMode shared.

 - checkpointer.c: Use IsStandbyMode() to check if postmaster is
  under standby mode.

IsStandbyMode() looks like overkill to me. The standby mode flag is forcibly
turned off at the end of recovery, but that change doesn't need to be shared
with the checkpointer process; IOW, the shared flag doesn't need to change
after startup, like XLogCtl->archiveCleanupCommand, I think. So we can
simplify the code that shares the flag with the checkpointer. See the attached
patch (though not tested yet).

The comments in checkpointer.c seem to need more revision. For
example,

+ * XLogInsert that actually triggers a checkpoint when

Currently a checkpoint is triggered by XLogWrite (not XLogInsert), so the above
needs to be corrected.
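For context, the on-schedule test the patches modify estimates progress as the WAL distance covered since the checkpoint started, divided by checkpoint_segments. A sketch of that arithmetic under the same split-LSN assumption (the types and constants below are illustrative stand-ins, not the build-time XLogSegSize and XLogSegsPerFile values):

```c
#include <stdint.h>

/* Hypothetical stand-in for the split WAL location. */
typedef struct
{
	uint32_t	xlogid;			/* log file number */
	uint32_t	xrecoff;		/* byte offset within the log file */
} XLogRecPtrSketch;

/* Illustrative stand-ins for XLogSegSize and XLogSegsPerFile. */
#define XLOG_SEG_SIZE		(16 * 1024 * 1024)
#define XLOG_SEGS_PER_FILE	255

/*
 * How many checkpoint_segments' worth of WAL lie between start and cur?
 * Mirrors the elapsed_xlogs computation in IsCheckpointOnSchedule().
 */
static double
elapsed_xlogs_sketch(XLogRecPtrSketch start, XLogRecPtrSketch cur,
					 int checkpoint_segments)
{
	return (((double) (int32_t) (cur.xlogid - start.xlogid)) * XLOG_SEGS_PER_FILE +
			((double) cur.xrecoff - (double) start.xrecoff) / XLOG_SEG_SIZE) /
		checkpoint_segments;
}
```

On a primary, cur comes from GetInsertRecPtr(); the point of the patch is that during recovery it should come from the replay location instead, so the same pacing applies to restartpoints.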

Regards,

--
Fujii Masao

Attachments:

standby_checkpoint_segments_9.2dev_fix_20120423.patch (application/octet-stream)
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***************
*** 427,432 **** typedef struct XLogCtlData
--- 427,443 ----
  	bool		SharedHotStandbyActive;
  
  	/*
+ 	 * standbyMode is read from recovery.conf but needs to be in shared
+ 	 * memory so that the checkpointer process can access it. Note that
+ 	 * this is not the shared copy of StandbyMode. StandbyMode is forcibly
+ 	 * turned off at the end of recovery, but this doesn't change accordingly
+ 	 * because the checkpointer doesn't need to see the change of the mode
+ 	 * for the current use. Which means that this doesn't change after
+ 	 * startup so we need no lock to access this.
+ 	 */
+ 	bool		standbyMode;
+ 
+ 	/*
  	 * recoveryWakeupLatch is used to wake up the startup process to continue
  	 * WAL replay, if it is waiting for WAL to arrive or failover trigger file
  	 * to appear.
***************
*** 5446,5451 **** readRecoveryCommandFile(void)
--- 5457,5463 ----
  								"standby_mode")));
  			ereport(DEBUG2,
  					(errmsg_internal("standby_mode = '%s'", item->value)));
+ 
  		}
  		else if (strcmp(item->name, "primary_conninfo") == 0)
  		{
***************
*** 6075,6085 **** StartupXLOG(void)
  						ControlFile->checkPointCopy.ThisTimeLineID)));
  
  	/*
! 	 * Save the selected recovery target timeline ID and
  	 * archive_cleanup_command in shared memory so that other processes can
  	 * see them
  	 */
  	XLogCtl->RecoveryTargetTLI = recoveryTargetTLI;
  	strncpy(XLogCtl->archiveCleanupCommand,
  			archiveCleanupCommand ? archiveCleanupCommand : "",
  			sizeof(XLogCtl->archiveCleanupCommand));
--- 6087,6098 ----
  						ControlFile->checkPointCopy.ThisTimeLineID)));
  
  	/*
! 	 * Save the selected recovery target timeline ID, standby_mode and
  	 * archive_cleanup_command in shared memory so that other processes can
  	 * see them
  	 */
  	XLogCtl->RecoveryTargetTLI = recoveryTargetTLI;
+ 	XLogCtl->standbyMode = StandbyMode;
  	strncpy(XLogCtl->archiveCleanupCommand,
  			archiveCleanupCommand ? archiveCleanupCommand : "",
  			sizeof(XLogCtl->archiveCleanupCommand));
***************
*** 7096,7102 **** RecoveryInProgress(void)
   * since normal backends won't ever be able to connect until this returns
   * true. Postmaster knows this by way of signal, not via shared memory.
   *
!  * Unlike testing standbyState, this works in any process that's connected to
   * shared memory.
   */
  bool
--- 7109,7115 ----
   * since normal backends won't ever be able to connect until this returns
   * true. Postmaster knows this by way of signal, not via shared memory.
   *
!  * Unlike testing InRecovery, this works in any process that's connected to
   * shared memory.
   */
  bool
***************
*** 7124,7129 **** HotStandbyActive(void)
--- 7137,7152 ----
  }
  
  /*
+  * Is the system in standby mode?
+  */
+ bool
+ IsStandbyMode(void)
+ {
+ 	/* XLogCtl->standbyMode doesn't change so we need no lock to copy it */
+ 	return XLogCtl->standbyMode;
+ }
+ 
+ /*
   * Is this process allowed to insert new WAL records?
   *
   * Ordinarily this is essentially equivalent to !RecoveryInProgress().
*** a/src/backend/postmaster/checkpointer.c
--- b/src/backend/postmaster/checkpointer.c
***************
*** 491,498 **** CheckpointerMain(void)
  			 * Initialize checkpointer-private variables used during checkpoint.
  			 */
  			ckpt_active = true;
! 			if (!do_restartpoint)
! 				ckpt_start_recptr = GetInsertRecPtr();
  			ckpt_start_time = now;
  			ckpt_cached_elapsed = 0;
  
--- 491,498 ----
  			 * Initialize checkpointer-private variables used during checkpoint.
  			 */
  			ckpt_active = true;
! 			ckpt_start_recptr =
! 				do_restartpoint ? GetXLogReplayRecPtr(NULL) : GetInsertRecPtr();
  			ckpt_start_time = now;
  			ckpt_cached_elapsed = 0;
  
***************
*** 715,720 **** IsCheckpointOnSchedule(double progress)
--- 715,721 ----
  	struct timeval now;
  	double		elapsed_xlogs,
  				elapsed_time;
+ 	bool        recovery_in_progress;
  
  	Assert(ckpt_active);
  
***************
*** 731,748 **** IsCheckpointOnSchedule(double progress)
  		return false;
  
  	/*
! 	 * Check progress against WAL segments written and checkpoint_segments.
  	 *
  	 * We compare the current WAL insert location against the location
! 	 * computed before calling CreateCheckPoint. The code in XLogInsert that
! 	 * actually triggers a checkpoint when checkpoint_segments is exceeded
! 	 * compares against RedoRecptr, so this is not completely accurate.
! 	 * However, it's good enough for our purposes, we're only calculating an
! 	 * estimate anyway.
  	 */
! 	if (!RecoveryInProgress())
  	{
! 		recptr = GetInsertRecPtr();
  		elapsed_xlogs =
  			(((double) (int32) (recptr.xlogid - ckpt_start_recptr.xlogid)) * XLogSegsPerFile +
  			 ((double) recptr.xrecoff - (double) ckpt_start_recptr.xrecoff) / XLogSegSize) /
--- 732,758 ----
  		return false;
  
  	/*
! 	 * Check progress against WAL segments written, or replayed for
! 	 * hot standby, and checkpoint_segments.
  	 *
  	 * We compare the current WAL insert location against the location
! 	 * computed before calling CreateCheckPoint. The code in
! 	 * XLogInsert that actually triggers a checkpoint when
! 	 * checkpoint_segments is exceeded compares against RedoRecPtr.
! 	 * Similarly, we consult WAL replay location instead on hot
!	 * standbys and XLogPageRead compares it against RedoRecPtr, too.
!	 * Although these are not completely accurate, it's good enough for
! 	 * our purposes, we're only calculating an estimate anyway.
! 	 */
! 
! 	/*
! 	 * Inhibit governing progress by segments in archive recovery.
  	 */
! 	recovery_in_progress = RecoveryInProgress();
! 	if (!recovery_in_progress || IsStandbyMode())
  	{
! 		recptr = recovery_in_progress ? GetXLogReplayRecPtr(NULL) :
! 			GetInsertRecPtr();
  		elapsed_xlogs =
  			(((double) (int32) (recptr.xlogid - ckpt_start_recptr.xlogid)) * XLogSegsPerFile +
  			 ((double) recptr.xrecoff - (double) ckpt_start_recptr.xrecoff) / XLogSegSize) /
*** a/src/include/access/xlog.h
--- b/src/include/access/xlog.h
***************
*** 285,290 **** extern void issue_xlog_fsync(int fd, uint32 log, uint32 seg);
--- 285,291 ----
  
  extern bool RecoveryInProgress(void);
  extern bool HotStandbyActive(void);
+ extern bool IsStandbyMode(void);
  extern bool XLogInsertAllowed(void);
  extern void GetXLogReceiptTime(TimestampTz *rtime, bool *fromStream);
  extern XLogRecPtr GetXLogReplayRecPtr(XLogRecPtr *restoreLastRecPtr);
#17Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Fujii Masao (#16)
Re: [BUG] Checkpointer on hot standby runs without looking checkpoint_segments

Hello,

 - xlog.c: Make StandbyMode shared.

 - checkpointer.c: Use IsStandbyMode() to check if postmaster is
  under standby mode.

IsStandbyMode() looks like overkill to me. The standby mode flag is forcibly
turned off at the end of recovery, but that change doesn't need to be shared
with the checkpointer process; IOW, the shared flag doesn't need to change
after startup, like XLogCtl->archiveCleanupCommand, I think. So we can
simplify the code that shares the flag with the checkpointer. See the attached
patch (though not tested yet).

Hmm. My understanding is that the spinlock and the volatile
qualification of the pointer when reading shared memory serve two
purposes: the spinlock secures memory consistency on SMP machines
with weak memory ordering, and the volatile pointer keeps the
compiler from over-optimizing the access.

You removed both of them in the patch.

If we are allowed to tolerate a temporary lack of coherence in
shared memory there, the spinlock could be removed. But the
possibility of reading garbage by accessing standbyMode through
the non-volatile XLogCtl pointer itself does not seem tolerable.
What do you think about that?
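The pattern under discussion can be made concrete with a single-process sketch; a pthread mutex stands in for info_lck, and the names are illustrative, not the actual xlog.c code:

```c
#include <pthread.h>
#include <stdbool.h>

/* Stand-in for XLogCtlData; the mutex plays the role of info_lck. */
typedef struct
{
	pthread_mutex_t info_lck;
	bool		SharedIsStandbyMode;
} XLogCtlSketch;

static XLogCtlSketch ctl = { PTHREAD_MUTEX_INITIALIZER, true };

/* Local cache: true means "not known yet, check the shared state". */
static bool LocalIsStandbyMode = true;

/*
 * Mirrors the patch's IsStandbyMode(): once the shared flag has been
 * seen false it can never become true again, so the lock is taken
 * only while we are still (possibly) in standby mode.
 */
static bool
is_standby_mode(void)
{
	if (!LocalIsStandbyMode)
		return false;

	pthread_mutex_lock(&ctl.info_lck);
	LocalIsStandbyMode = ctl.SharedIsStandbyMode;
	pthread_mutex_unlock(&ctl.info_lck);
	return LocalIsStandbyMode;
}

/* Mirrors ExitStandbyMode(): flips the shared flag under the lock. */
static void
exit_standby_mode(void)
{
	pthread_mutex_lock(&ctl.info_lck);
	LocalIsStandbyMode = ctl.SharedIsStandbyMode = false;
	pthread_mutex_unlock(&ctl.info_lck);
}
```

The unresolved question above is whether dropping the lock (and keeping only a volatile read) is safe for a flag that changes at most once.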

The comments in checkpointer.c seem to need more revision. For
example,

+ * XLogInsert that actually triggers a checkpoint when

Currently a checkpoint is triggered by XLogWrite (not XLogInsert), so the above
needs to be corrected.

I will be careful about such outdated descriptions.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

== My e-mail address has been changed since Apr. 1, 2012.

#18Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Fujii Masao (#16)
Re: [BUG] Checkpointer on hot standby runs without looking checkpoint_segments

On 23.04.2012 02:59, Fujii Masao wrote:

On Thu, Apr 19, 2012 at 2:20 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

Hello, this is new version of standby checkpoint_segments patch.

Thanks for the patch!

This still makes catching up in standby mode slower, as you get many
more restartpoints. The reason for ignoring checkpoint_segments during
recovery was to avoid that. Maybe it's still better than what we have
currently, I'm not sure, but at least it needs to be discussed. Would be
good to do some performance testing of recovery with various
checkpoint_segments and _timeout settings, with and without this patch.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#19Fujii Masao
masao.fujii@gmail.com
In reply to: Heikki Linnakangas (#18)
Re: [BUG] Checkpointer on hot standby runs without looking checkpoint_segments

On Tue, Apr 24, 2012 at 3:53 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

On 23.04.2012 02:59, Fujii Masao wrote:

On Thu, Apr 19, 2012 at 2:20 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp>  wrote:

Hello, this is new version of standby checkpoint_segments patch.

Thanks for the patch!

This still makes catching up in standby mode slower, as you get many more
restartpoints. The reason for ignoring checkpoint_segments during recovery
was to avoid that. Maybe it's still better than what we have currently, I'm
not sure, but at least it needs to be discussed.

I see your point. Agreed.

Another aspect of this problem is that if we ignore checkpoint_segments during
recovery, a restartpoint could take a long time, which prevents WAL files from
being removed from pg_xlog for a while, and that might cause the disk to fill
up with WAL files. This trouble is unlikely to happen in 9.1 or before because
archived WAL files are always restored under a temporary name. OTOH, in 9.2,
the cascading replication patch changed the recovery logic so that archived
WAL files are restored under their correct names, so the number of WAL files in
pg_xlog keeps increasing until a restartpoint removes them. The disk is more
likely to fill up in 9.2.

To alleviate the above problem, at least we might have to change the recovery
logic so that archived WAL files are restored under a temporary name when
cascading replication is not enabled (i.e., !standby_mode || !hot_standby ||
max_wal_senders <= 0). Or we might have to remove each restored WAL file
after replaying it and before opening the next one, without waiting for
a restartpoint to remove the restored WAL files. Thoughts?

Regards,

--
Fujii Masao

#20Fujii Masao
masao.fujii@gmail.com
In reply to: Kyotaro HORIGUCHI (#17)
Re: [BUG] Checkpointer on hot standby runs without looking checkpoint_segments

On Tue, Apr 24, 2012 at 3:47 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

Hello,

 - xlog.c: Make StandbyMode shared.

 - checkpointer.c: Use IsStandbyMode() to check if postmaster is
  under standby mode.

IsStandbyMode() looks like overkill to me. The standby mode flag is forcibly
turned off at the end of recovery, but that change doesn't need to be shared
with the checkpointer process; IOW, the shared flag doesn't need to change
after startup, like XLogCtl->archiveCleanupCommand, I think. So we can
simplify the code that shares the flag with the checkpointer. See the attached
patch (though not tested yet).

Hmm. My understanding is that the spinlock and the volatile
qualification of the pointer when reading shared memory serve two
purposes: the spinlock secures memory consistency on SMP machines
with weak memory ordering, and the volatile pointer keeps the
compiler from over-optimizing the access.

You removed both of them in the patch.

If we are allowed to tolerate a temporary lack of coherence in
shared memory there, the spinlock could be removed. But the
possibility of reading garbage by accessing standbyMode through
the non-volatile XLogCtl pointer itself does not seem tolerable.
What do you think about that?

I'm not sure we really need to worry about that for such a shared variable,
which doesn't change once it's been initialized at the start of recovery.
Anyway, if we really do need to worry about it, we would also need to protect
the shared variables RecoveryTargetTLI and archiveCleanupCommand with
the spinlock, because they are in the same situation as standbyMode.

Regards,

--
Fujii Masao

#21Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Fujii Masao (#20)
Re: [BUG] Checkpointer on hot standby runs without looking checkpoint_segments

Hello,

At Wed, 25 Apr 2012 02:31:24 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in <CAHGQGwE1OvB=HLcahLeL5oP66sxsfsGMgwU3MqAAwZ_Vr=xh1w@mail.gmail.com>

If we are allowed to tolerate a temporary lack of coherence in
shared memory there, the spinlock could be removed. But the
possibility of reading garbage by accessing standbyMode through
the non-volatile XLogCtl pointer itself does not seem tolerable.
What do you think about that?

I'm not sure we really need to worry about that for such a shared variable,
which doesn't change once it's been initialized at the start of recovery.
Anyway, if we really do need to worry about it, we would also need to protect
the shared variables RecoveryTargetTLI and archiveCleanupCommand with
the spinlock, because they are in the same situation as standbyMode.

From I said that the former (spinlock) could be dropped, but the
latter (read as volatile) should be needed.

From the viewpoint of maintainability, it may be preferable for
the manner of reading shared memory to be uniform across the
whole source code, absent particular reasons otherwise.

Concerning this point, I think I will keep the volatile access,
drop the spinlock, and put a comment there instead.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

== My e-mail address has been changed since Apr. 1, 2012.

#22Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Heikki Linnakangas (#18)
Re: [BUG] Checkpointer on hot standby runs without looking checkpoint_segments

Thank you for the suggestion.

This still makes catching up in standby mode slower, as you get
many more restartpoints. The reason for ignoring
checkpoint_segments during recovery was to avoid that.

I may be misunderstanding something there, or misreading your intention.

My understanding is that a standby creates no WAL archive and
cannot recover from a WAL archive, and that both the master and
the standby keep WAL segments for only about two maximum
checkpoint_timeout intervals (about 2 * 1h) plus a little more.

Could you please tell me whether the above is correct?

If you meant crash recovery by the word 'recovery', there is no
more than 2+ hours' worth of WAL, far less than days or weeks.

Otherwise, if you meant archive recovery, this patch does not
change the behavior of archive recovery, as far as I intended.
This patch is intended to change the behavior of a standby fed
by WAL shipping.

If that is correct and the patch works correctly, I hope your
concern below disappears. And if it is not correct, I *MUST*
avoid such negative impacts on anything outside the target,
namely governing checkpoint progress on a standby receiving WAL
shipped from its master.

Maybe it's still better than what we have currently, I'm not
sure, but at least it needs to be discussed. Would be good to
do some performance testing of recovery with various
checkpoint_segments and _timeout settings, with and without
this patch.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

== My e-mail address has been changed since Apr. 1, 2012.

#23Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Kyotaro HORIGUCHI (#21)
Re: [BUG] Checkpointer on hot standby runs without looking checkpoint_segments

Sorry for broken message.

From I said that the former (spinlock) could be dropped, but the

latter (read as volatile) should be needed.

I said that the former (the spinlock) could be dropped from the
viewpoint of functionality, but the latter (the volatile read)
should be kept.

From the viewpoint of maintainability, it may be preferable for
the manner of reading shared memory to be uniform across the
whole source code, absent particular reasons otherwise.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

== My e-mail address has been changed since Apr. 1, 2012.

#24Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Fujii Masao (#19)
Re: [BUG] Checkpointer on hot standby runs without looking checkpoint_segments

Hello, I've returned from my overseas trip for just over a week.

# and I'll leave Japan again after this...

    restorePtr <= replayPtr <= receivePtr

But XLByteLT(receivePtr, replayPtr) should not return true
under the condition above. Is something wrong in my assumption?

When walreceiver is not running, i.e., the startup process reads the WAL files
from the archival area, the replay location would get bigger than the
receive one.

I had overlooked that the standby's startup process reads from
the archive first, and only then streams WAL. But the current
patch enables progress governing based on checkpoint_segments
even during archive recovery on the standby.

At Wed, 25 Apr 2012 02:20:37 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in <CAHGQGwFSo5WFptNALxmE-ozRQq6Kk24XgTYbvJjHdYtJf9fdOg@mail.gmail.com>

To alleviate the above problem, at least we might have to change
the recovery logic so that archived WAL files are restored under
a temporary name when cascading replication is not enabled
(i.e., !standby_mode || !hot_standby || max_wal_senders <= 0).
Or we might have to remove each restored WAL file after
replaying it and before opening the next one, without waiting
for a restartpoint to remove the restored WAL files. Thoughts?

I think that is beyond a bug fix. Furthermore, this is not a
problem of the speed of restartpoint progression, I suppose. If
so, it should be handled as a separate problem.

Putting aside the problem of the vast amount of copied WALs, I
suppose the remaining problem is the needless restartpoint
acceleration caused during archive restoration on the standby.
I will try to resolve that problem. Is that OK?

Thinking about the so-many-copied-WAL problem, IMHO, using a
temporary name only for the non-cascading case is not a good
solution, because it complicates the code and regresses the
behavior while solving only half of the problem. I don't yet
understand cascading replication well enough, but I suppose
erasing WAL files as they go out of use (by some logic I haven't
found yet) is promising.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

== My e-mail address has been changed since Apr. 1, 2012.

#25Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Kyotaro HORIGUCHI (#24)
1 attachment(s)
Re: [BUG] Checkpointer on hot standby runs without looking checkpoint_segments

Hello,

On Sun, May 13, 2012 at 10:38 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

I had overlooked that the standby's startup process reads from the
archive first, and only then streamed WAL. But the current patch
enables progress governing based on checkpoint_segments even during
archive recovery on the standby.

I focused on WalRcvInProgress again to achieve this. The problem with
using it was its intermittent nature, but using IsStandbyMode to avoid
that turned out to lead to another problem.

Now the introduced function WalRcvStarted() uses WalRcvInProgress() to
tell the caller whether the WAL receiver has ever been started,
regardless of its current status.

This avoids accelerated checkpoints before the WAL receiver starts,
while checkpoints during WAL streaming are still governed by
checkpoint_segments.

I have no definite answer to the criticism that checkpoint_segments
should be ignored even during WAL streaming. Should we allow
checkpoint_segments to be null to signal no check against it?
Introduce another boolean GUC variable that tells us to ignore
checkpoint_segments during WAL streaming? Or something else?

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

standby_checkpoint_segments_9.2dev_fix_20120516.patchapplication/octet-stream; name=standby_checkpoint_segments_9.2dev_fix_20120516.patchDownload
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index c9473f7..babb874 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -46,6 +46,7 @@
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/bgwriter.h"
+#include "replication/walreceiver.h"
 #include "replication/syncrep.h"
 #include "storage/bufmgr.h"
 #include "storage/ipc.h"
@@ -157,6 +158,8 @@ static bool am_checkpointer = false;
 
 static bool ckpt_active = false;
 
+static bool walreceiver_started = false;
+
 /* these values are valid when ckpt_active is true: */
 static pg_time_t ckpt_start_time;
 static XLogRecPtr ckpt_start_recptr;
@@ -168,6 +171,7 @@ static pg_time_t last_xlog_switch_time;
 /* Prototypes for private functions */
 
 static void CheckArchiveTimeout(void);
+static bool WalRcvStarted(void);
 static bool IsCheckpointOnSchedule(double progress);
 static bool ImmediateCheckpointRequested(void);
 static bool CompactCheckpointerRequestQueue(void);
@@ -491,8 +495,8 @@ CheckpointerMain(void)
 			 * Initialize checkpointer-private variables used during checkpoint.
 			 */
 			ckpt_active = true;
-			if (!do_restartpoint)
-				ckpt_start_recptr = GetInsertRecPtr();
+			ckpt_start_recptr =
+				do_restartpoint ? GetXLogReplayRecPtr(NULL) : GetInsertRecPtr();
 			ckpt_start_time = now;
 			ckpt_cached_elapsed = 0;
 
@@ -638,6 +642,25 @@ ImmediateCheckpointRequested(void)
 }
 
 /*
+ * WalRcvStarted -- Return whether WAL is coming via streaming
+ *
+ * WalRcvInProgress() simply indicates whether the wal receiver is running
+ * at the moment this function is called.  That is not suitable for
+ * determining the current source of WAL.  WalRcvStarted() returns false
+ * until the wal receiver is first found running, and true thereafter,
+ * regardless of whether the wal receiver is actually running at the moment.
+ */
+bool
+WalRcvStarted()
+{
+	if (!walreceiver_started && WalRcvInProgress())
+		walreceiver_started = true;
+	
+	return walreceiver_started;
+}
+
+
+/*
  * CheckpointWriteDelay -- control rate of checkpoint
  *
  * This function is called after each page write performed by BufferSync().
@@ -715,6 +738,7 @@ IsCheckpointOnSchedule(double progress)
 	struct timeval now;
 	double		elapsed_xlogs,
 				elapsed_time;
+	bool        recovery_in_progress;
 
 	Assert(ckpt_active);
 
@@ -731,18 +755,27 @@ IsCheckpointOnSchedule(double progress)
 		return false;
 
 	/*
-	 * Check progress against WAL segments written and checkpoint_segments.
+	 * Check progress against WAL segments written, or replayed for
+	 * hot standby, and checkpoint_segments.
 	 *
 	 * We compare the current WAL insert location against the location
-	 * computed before calling CreateCheckPoint. The code in XLogInsert that
-	 * actually triggers a checkpoint when checkpoint_segments is exceeded
-	 * compares against RedoRecptr, so this is not completely accurate.
-	 * However, it's good enough for our purposes, we're only calculating an
-	 * estimate anyway.
+	 * computed before calling CreateCheckPoint. The code in
+	 * XLogInsert that actually triggers a checkpoint when
+	 * checkpoint_segments is exceeded compares against RedoRecPtr.
+	 * Similarly, we consult WAL replay location instead on hot
+	 * standbys and XLogPageRead compares it against RedoRecPtr, too.
+	 * Although these are not completely accurate, it's good enough for
+	 * our purposes, we're only calculating an estimate anyway.
+	 */
+
+	/*
+	 * Inhibit governing progress by segments in archive recovery.
 	 */
-	if (!RecoveryInProgress())
+	recovery_in_progress = RecoveryInProgress();
+	if (!recovery_in_progress || WalRcvStarted())
 	{
-		recptr = GetInsertRecPtr();
+		recptr = recovery_in_progress ? GetXLogReplayRecPtr(NULL) :
+			GetInsertRecPtr();
 		elapsed_xlogs =
 			(((double) (int32) (recptr.xlogid - ckpt_start_recptr.xlogid)) * XLogSegsPerFile +
 			 ((double) recptr.xrecoff - (double) ckpt_start_recptr.xrecoff) / XLogSegSize) /
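
For reference, the elapsed_xlogs arithmetic at the end of the patch above can be modeled in Python. This is a rough sketch assuming 9.2-era defaults (16 MB segments, a two-field XLogRecPtr, and a final division by checkpoint_segments as in the surrounding IsCheckpointOnSchedule logic); the names are illustrative, not actual backend symbols:

```python
XLOG_SEG_SIZE = 16 * 1024 * 1024                    # default WAL segment size
XLOG_SEGS_PER_FILE = 0x100000000 // XLOG_SEG_SIZE   # segments per xlogid

def elapsed_xlogs(start, now, checkpoint_segments):
    """Fraction of checkpoint_segments consumed since checkpoint start.

    start and now are (xlogid, xrecoff) pairs, mirroring the 9.2-era
    two-field XLogRecPtr used in the patch above.
    """
    segs = ((now[0] - start[0]) * XLOG_SEGS_PER_FILE
            + (now[1] - start[1]) / XLOG_SEG_SIZE)
    return segs / checkpoint_segments

# With checkpoint_segments = 16, having advanced 8 segments since the
# checkpoint started means we are halfway through the WAL-based budget.
print(elapsed_xlogs((0, 0), (0, 8 * XLOG_SEG_SIZE), 16))   # → 0.5
```

A checkpoint is "on schedule" when this fraction stays below the requested progress, so a fast WAL advance rate pushes checkpoints to complete sooner.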
#26Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Kyotaro HORIGUCHI (#25)
1 attachment(s)
Checkpointer on hot standby runs without looking checkpoint_segments

Hello, I am restarting this patch for this CF.

The requirement for this patch is as follows.

- What I want is similar checkpoint-progression behavior between the
master and (hot-)standby. Specifically, checkpoints during streaming
replication should run at the speed governed by checkpoint_segments.
The point of this patch is to avoid an unexpectedly large number of
WAL segments accumulating on the standby side. (Plus, increasing the
chance to skip the recovery-end checkpoint with my other patch.)

- This patch shouldn't affect archive recovery (excluding
streaming). Checkpoint activity while recovering from a WAL archive
(precisely, during archive recovery with no WAL receiver launched) is
throttled to the checkpoint_timeout level, as before.

- It might be better if the acceleration could be inhibited, but this
patch does not have that feature. Is it needed?

After considering the past discussion and the other patch I'm going to
put on the table, the outline of this patch is as follows.

- Check whether we are under streaming replication with the new
function WalRcvStarted(), which tells whether the WAL receiver has
been launched so far.

- The implementation of WalRcvStarted() has changed from the previous
one. The flag is now set in WalReceiverMain, at the point where
walRcvState becomes WALRCV_RUNNING. The previous implementation,
which read WalRcvInProgress(), was useless for my other patch.

- Determine whether to delay the checkpoint using
GetXLogReplayRecPtr() instead of GetInsertRecPtr() when
WalRcvStarted() returns true.
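
The scheduling decision in the outline above can be sketched as a small Python model. This is a simplification of IsCheckpointOnSchedule's gating, not the patch's actual C code; the names are illustrative:

```python
def wal_progress_applies(in_recovery, walrcv_started):
    """Should the WAL-segment criterion throttle this (re)start point?

    On the master it always applies; during recovery it applies only
    once a WAL receiver has ever been launched (streaming replication),
    so plain archive recovery keeps the timeout-only pacing.
    """
    return (not in_recovery) or walrcv_started

def progress_lsn(in_recovery, replay_lsn, insert_lsn):
    # A standby measures progress by the replay location, the master by
    # the insert location (GetXLogReplayRecPtr vs. GetInsertRecPtr).
    return replay_lsn if in_recovery else insert_lsn
```

So a standby in plain archive recovery never enters the WAL-based check, while a streaming standby paces restartpoints by how fast replay advances.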

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

== My e-mail address has been changed since Apr. 1, 2012.

Attachments:

chkpt_segs_for_standby_20120608_1.patchtext/x-patch; charset=us-asciiDownload
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 6aeade9..cb2509a 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -46,6 +46,7 @@
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/bgwriter.h"
+#include "replication/walreceiver.h"
 #include "replication/syncrep.h"
 #include "storage/bufmgr.h"
 #include "storage/ipc.h"
@@ -493,8 +494,8 @@ CheckpointerMain(void)
 			 * Initialize checkpointer-private variables used during checkpoint
 			 */
 			ckpt_active = true;
-			if (!do_restartpoint)
-				ckpt_start_recptr = GetInsertRecPtr();
+			ckpt_start_recptr =
+				do_restartpoint ? GetXLogReplayRecPtr(NULL) : GetInsertRecPtr();
 			ckpt_start_time = now;
 			ckpt_cached_elapsed = 0;
 
@@ -747,6 +748,7 @@ IsCheckpointOnSchedule(double progress)
 	struct timeval now;
 	double		elapsed_xlogs,
 				elapsed_time;
+	bool        recovery_in_progress;
 
 	Assert(ckpt_active);
 
@@ -763,18 +765,26 @@ IsCheckpointOnSchedule(double progress)
 		return false;
 
 	/*
-	 * Check progress against WAL segments written and checkpoint_segments.
+	 * Check progress against WAL segments written, or replayed for
+	 * hot standby, and checkpoint_segments.
 	 *
 	 * We compare the current WAL insert location against the location
 	 * computed before calling CreateCheckPoint. The code in XLogInsert that
 	 * actually triggers a checkpoint when checkpoint_segments is exceeded
-	 * compares against RedoRecptr, so this is not completely accurate.
-	 * However, it's good enough for our purposes, we're only calculating an
-	 * estimate anyway.
+	 * compares against RedoRecPtr.  Similarly, we consult WAL replay location
+	 * instead on hot standbys and XLogPageRead compares it against RedoRecPtr,
+	 * too.  Although these are not completely accurate, it's good enough for
+	 * our purposes, we're only calculating an estimate anyway.
+	 */
+
+	/*
+	 * Inhibit governing progress by segments in archive recovery.
 	 */
-	if (!RecoveryInProgress())
+	recovery_in_progress = RecoveryInProgress();
+	if (!recovery_in_progress || WalRcvStarted())
 	{
-		recptr = GetInsertRecPtr();
+		recptr = recovery_in_progress ? GetXLogReplayRecPtr(NULL) :
+			GetInsertRecPtr();
 		elapsed_xlogs =
 			(((double) (int32) (recptr.xlogid - ckpt_start_recptr.xlogid)) * XLogSegsPerFile +
 			 ((double) recptr.xrecoff - (double) ckpt_start_recptr.xrecoff) / XLogSegSize) /
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index d63ff29..7d57ad7 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -215,6 +215,7 @@ WalReceiverMain(void)
 	/* Advertise our PID so that the startup process can kill us */
 	walrcv->pid = MyProcPid;
 	walrcv->walRcvState = WALRCV_RUNNING;
+	walrcv->started = true;
 
 	/* Fetch information required to start streaming */
 	strlcpy(conninfo, (char *) walrcv->conninfo, MAXCONNINFO);
diff --git a/src/backend/replication/walreceiverfuncs.c b/src/backend/replication/walreceiverfuncs.c
index f8dd523..c3b26e9 100644
--- a/src/backend/replication/walreceiverfuncs.c
+++ b/src/backend/replication/walreceiverfuncs.c
@@ -31,6 +31,7 @@
 #include "utils/timestamp.h"
 
 WalRcvData *WalRcv = NULL;
+static bool localWalRcvStarted = false;
 
 /*
  * How long to wait for walreceiver to start up after requesting
@@ -167,6 +168,25 @@ ShutdownWalRcv(void)
 }
 
 /*
+ * Returns true if the WAL receiver has ever been launched, regardless of
+ * its current state.
+ */
+bool
+WalRcvStarted(void)
+{
+	/* WalRcv->started changes one way throughout the server life */
+	if (!localWalRcvStarted)
+	{
+		volatile WalRcvData *walrcv = WalRcv;
+
+		SpinLockAcquire(&walrcv->mutex);
+		localWalRcvStarted = walrcv->started;
+		SpinLockRelease(&walrcv->mutex);
+	}
+	return localWalRcvStarted;
+}
+
+/*
  * Request postmaster to start walreceiver.
  *
  * recptr indicates the position where streaming should begin, and conninfo
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 68c8647..24901be 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -53,6 +53,7 @@ typedef struct
 	 */
 	pid_t		pid;
 	WalRcvState walRcvState;
+	bool        started;
 	pg_time_t	startTime;
 
 	/*
@@ -116,6 +117,7 @@ extern Size WalRcvShmemSize(void);
 extern void WalRcvShmemInit(void);
 extern void ShutdownWalRcv(void);
 extern bool WalRcvInProgress(void);
+extern bool WalRcvStarted(void);
 extern void RequestXLogStreaming(XLogRecPtr recptr, const char *conninfo);
 extern XLogRecPtr GetWalRcvWriteRecPtr(XLogRecPtr *latestChunkStart);
 extern int GetReplicationApplyDelay(void);
#27Simon Riggs
simon@2ndQuadrant.com
In reply to: Kyotaro HORIGUCHI (#26)
Re: Checkpointer on hot standby runs without looking checkpoint_segments

On 8 June 2012 09:14, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:

The requirement for this patch is as follows.

- What I want is similar checkpoint-progression behavior between the
 master and (hot-)standby. Specifically, checkpoints during streaming
 replication should run at the speed governed by checkpoint_segments.
 The point of this patch is to avoid an unexpectedly large number of
 WAL segments accumulating on the standby side. (Plus, increasing the
 chance to skip the recovery-end checkpoint with my other patch.)

Since we want wal_keep_segments number of WAL files on master (and
because of cascading, on standby also), I don't see any purpose to
triggering more frequent checkpoints just so we can hit a magic number
that is most often set wrong.

ISTM that we should avoid triggering a checkpoint on the master if
checkpoint_segments is less than wal_keep_segments. Such checkpoints
serve no purpose because we don't actually limit and recycle the WAL
files and all it does is slow people down.

Also, I don't believe that throwing more checkpoints makes it more
likely we can skip shutdown checkpoints at failover.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#28Robert Haas
robertmhaas@gmail.com
In reply to: Simon Riggs (#27)
Re: Checkpointer on hot standby runs without looking checkpoint_segments

On Fri, Jun 8, 2012 at 5:02 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

On 8 June 2012 09:14, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:

The requirement for this patch is as follows.

- What I want is similar checkpoint-progression behavior between the
 master and (hot-)standby. Specifically, checkpoints during streaming
 replication should run at the speed governed by checkpoint_segments.
 The point of this patch is to avoid an unexpectedly large number of
 WAL segments accumulating on the standby side. (Plus, increasing the
 chance to skip the recovery-end checkpoint with my other patch.)

Since we want wal_keep_segments number of WAL files on master (and
because of cascading, on standby also), I don't see any purpose to
triggering more frequent checkpoints just so we can hit a magic number
that is most often set wrong.

This is a good point. Right now, if you set checkpoint_segments to a
large value, we retain lots of old WAL segments even when the system
is idle (cf. XLOGfileslop). I think we could be smarter about that.
I'm not sure what the exact algorithm should be, but right now users
are forced to choose between setting checkpoint_segments very large to
achieve optimum write performance and setting it small to conserve
disk space.
What would be much better, IMHO, is if the number of retained
segments could ratchet down when the system is idle, eventually
reaching a state where we keep only one segment beyond the one
currently in use.

For example, suppose I have checkpoint_timeout=10min and
checkpoint_segments=300. If, five minutes into the ten-minute
checkpoint interval, I've only used 10 WAL segments, then I probably
am not going to need another 290 of them in the remaining five
minutes. We ought to keep, say, 20 in that case (number we expect to
need * 2, similar to bgwriter_lru_multiplier) and delete the rest.
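
As a purely illustrative sketch of the heuristic described here (the function and the exact formula are my assumptions, drawn only from the "expect to need * 2" analogy above, not from any actual proposal):

```python
def segments_to_retain(used_so_far, fraction_elapsed, checkpoint_segments,
                       multiplier=2.0):
    """WAL segments worth keeping for the rest of the checkpoint interval.

    used_so_far: WAL segments consumed so far in this interval
    fraction_elapsed: portion of checkpoint_timeout already elapsed (0..1]
    """
    # Project how many more segments this interval will need, keep a
    # cushion over that projection, and never exceed the configured ceiling.
    expected_remaining = used_so_far * (1 - fraction_elapsed) / fraction_elapsed
    return min(checkpoint_segments, int(expected_remaining * multiplier))

# Ten segments used halfway through the interval, checkpoint_segments
# = 300: retain about 20 instead of 300.
print(segments_to_retain(10, 0.5, 300))   # → 20
```

On a busy system the projection stays high and the ceiling still governs, so the ratcheting only reclaims space while the system is genuinely quiet.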

If we did that, people could set checkpoint_segments much higher to
handle periods of peak load without continuously consuming large
amounts of space with old, useless WAL segments. It doesn't end up
working very well anyway because the old WAL segments are no longer in
cache by the time we go to overwrite them.

ISTM that we should avoid triggering a checkpoint on the master if
checkpoint_segments is less than wal_keep_segments. Such checkpoints
serve no purpose because we don't actually limit and recycle the WAL
files and all it does is slow people down.

On the other hand, I emphatically disagree with this, for the same
reasons as on the other thread. Getting data down to disk provides a
greater measure of safety than having it in memory. Making
checkpoint_segments not force a checkpoint is no better than making
checkpoint_timeout not force a checkpoint.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#29Simon Riggs
simon@2ndQuadrant.com
In reply to: Robert Haas (#28)
Re: Checkpointer on hot standby runs without looking checkpoint_segments

On 8 June 2012 14:47, Robert Haas <robertmhaas@gmail.com> wrote:

ISTM that we should avoid triggering a checkpoint on the master if
checkpoint_segments is less than wal_keep_segments. Such checkpoints
serve no purpose because we don't actually limit and recycle the WAL
files and all it does is slow people down.

On the other hand, I emphatically disagree with this, for the same
reasons as on the other thread.  Getting data down to disk provides a
greater measure of safety than having it in memory.  Making
checkpoint_segments not force a checkpoint is no better than making
checkpoint_timeout not force a checkpoint.

Not sure which bit you are disagreeing with. I have no suggested
change to checkpoint_timeout.

What I'm saying is that forcing a checkpoint to save space, when we
aren't going to save space, makes no sense.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#30Robert Haas
robertmhaas@gmail.com
In reply to: Simon Riggs (#29)
Re: Checkpointer on hot standby runs without looking checkpoint_segments

On Fri, Jun 8, 2012 at 9:58 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

On 8 June 2012 14:47, Robert Haas <robertmhaas@gmail.com> wrote:

ISTM that we should avoid triggering a checkpoint on the master if
checkpoint_segments is less than wal_keep_segments. Such checkpoints
serve no purpose because we don't actually limit and recycle the WAL
files and all it does is slow people down.

On the other hand, I emphatically disagree with this, for the same
reasons as on the other thread.  Getting data down to disk provides a
greater measure of safety than having it in memory.  Making
checkpoint_segments not force a checkpoint is no better than making
checkpoint_timeout not force a checkpoint.

Not sure which bit you are disagreeing with. I have no suggested
change to checkpoint_timeout.

You already made it not a hard timeout. We have another nearby thread
discussing why I don't like that.

What I'm saying is that forcing a checkpoint to save space, when we
aren't going to save space, makes no sense.

We are also forcing a checkpoint to limit recovery time and data loss
potential, not just to save space.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#31Simon Riggs
simon@2ndQuadrant.com
In reply to: Robert Haas (#30)
Re: Checkpointer on hot standby runs without looking checkpoint_segments

On 8 June 2012 15:21, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Jun 8, 2012 at 9:58 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

On 8 June 2012 14:47, Robert Haas <robertmhaas@gmail.com> wrote:

ISTM that we should avoid triggering a checkpoint on the master if
checkpoint_segments is less than wal_keep_segments. Such checkpoints
serve no purpose because we don't actually limit and recycle the WAL
files and all it does is slow people down.

On the other hand, I emphatically disagree with this, for the same
reasons as on the other thread.  Getting data down to disk provides a
greater measure of safety than having it in memory.  Making
checkpoint_segments not force a checkpoint is no better than making
checkpoint_timeout not force a checkpoint.

Not sure which bit you are disagreeing with. I have no suggested
change to checkpoint_timeout.

You already made it not a hard timeout.  We have another nearby thread
discussing why I don't like that.

What I'm saying is that forcing a checkpoint to save space, when we
aren't going to save space, makes no sense.

We are also forcing a checkpoint to limit recovery time and data loss
potential, not just to save space.

Nothing I've said on this thread is related to the other thread.
Please don't confuse matters.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#32Florian Pflug
fgp@phlo.org
In reply to: Robert Haas (#28)
Re: Checkpointer on hot standby runs without looking checkpoint_segments

On Jun8, 2012, at 15:47 , Robert Haas wrote:

On Fri, Jun 8, 2012 at 5:02 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

On 8 June 2012 09:14, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:

The requirement for this patch is as follows.

- What I want is similar checkpoint-progression behavior between the
master and (hot-)standby. Specifically, checkpoints during streaming
replication should run at the speed governed by checkpoint_segments.
The point of this patch is to avoid an unexpectedly large number of
WAL segments accumulating on the standby side. (Plus, increasing the
chance to skip the recovery-end checkpoint with my other patch.)

Since we want wal_keep_segments number of WAL files on master (and
because of cascading, on standby also), I don't see any purpose to
triggering more frequent checkpoints just so we can hit a magic number
that is most often set wrong.

This is a good point. Right now, if you set checkpoint_segments to a
large value, we retain lots of old WAL segments even when the system
is idle (cf. XLOGfileslop). I think we could be smarter about that.
I'm not sure what the exact algorithm should be, but right now users
are forced between setting checkpoint_segments very large to achieve
optimum write performance and setting it small to conserve disk space.
What would be much better, IMHO, is if the number of retained
segments could ratchet down when the system is idle, eventually
reaching a state where we keep only one segment beyond the one
currently in use.

I'm a bit sceptical about this. It seems to me that you wouldn't actually
be able to do anything useful with the conserved space, since postgres
could re-claim it at any time. At which point it'd better be available,
or your whole cluster comes to a screeching halt...

best regards,
Florian Pflug

#33Robert Haas
robertmhaas@gmail.com
In reply to: Florian Pflug (#32)
Re: Checkpointer on hot standby runs without looking checkpoint_segments

On Fri, Jun 8, 2012 at 1:01 PM, Florian Pflug <fgp@phlo.org> wrote:

On Jun8, 2012, at 15:47 , Robert Haas wrote:

On Fri, Jun 8, 2012 at 5:02 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

On 8 June 2012 09:14, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:

The requirement for this patch is as follows.

- What I want is similar checkpoint-progression behavior between the
 master and (hot-)standby. Specifically, checkpoints during streaming
 replication should run at the speed governed by checkpoint_segments.
 The point of this patch is to avoid an unexpectedly large number of
 WAL segments accumulating on the standby side. (Plus, increasing the
 chance to skip the recovery-end checkpoint with my other patch.)

Since we want wal_keep_segments number of WAL files on master (and
because of cascading, on standby also), I don't see any purpose to
triggering more frequent checkpoints just so we can hit a magic number
that is most often set wrong.

This is a good point.  Right now, if you set checkpoint_segments to a
large value, we retain lots of old WAL segments even when the system
is idle (cf. XLOGfileslop).  I think we could be smarter about that.
I'm not sure what the exact algorithm should be, but right now users
are forced to choose between setting checkpoint_segments very large to
achieve optimum write performance and setting it small to conserve
disk space.
What would be much better, IMHO, is if the number of retained
segments could ratchet down when the system is idle, eventually
reaching a state where we keep only one segment beyond the one
currently in use.

I'm a bit sceptical about this. It seems to me that you wouldn't actually
be able to do anything useful with the conserved space, since postgres
could re-claim it at any time. At which point it'd better be available,
or your whole cluster comes to a screeching halt...

Well, the issue for me is elasticity. Right now we ship with
checkpoint_segments=3. That causes terrible performance on many
real-world workloads. But say we ship with checkpoint_segments = 100,
which is a far better setting from a performance point of view. Then
pg_xlog space utilization will eventually grow to more than 3 GB, even
on a low-velocity system where they don't improve performance. I'm
not sure whether it's useful for the number of checkpoint segments to
vary dramatically on a single system, but I do think it would be very
nice if we could ship with a less conservative default without eating
up so much disk space. Maybe there's a better way of going about
that, but I agree with Simon's point that the setting is often wrong.
Frequently it's too low; sometimes it's too high; occasionally it's
got both problems simultaneously. If you have another idea on how to
improve this, I'm all ears.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#34Simon Riggs
simon@2ndQuadrant.com
In reply to: Florian Pflug (#32)
Re: Checkpointer on hot standby runs without looking checkpoint_segments

On 8 June 2012 18:01, Florian Pflug <fgp@phlo.org> wrote:

What would be much better, IMHO, is if the number of retained
segments could ratchet down when the system is idle, eventually
reaching a state where we keep only one segment beyond the one
currently in use.

I'm a bit sceptical about this. It seems to me that you wouldn't actually
be able to do anything useful with the conserved space, since postgres
could re-claim it at any time. At which point it'd better be available,
or your whole cluster comes to a screeching halt...

Agreed, I can't really see why you'd want to save space when the
database is slow at the expense of robustness and reliability when the
database speeds up.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#35Simon Riggs
simon@2ndQuadrant.com
In reply to: Kyotaro HORIGUCHI (#26)
Re: Checkpointer on hot standby runs without looking checkpoint_segments

On 8 June 2012 09:14, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:

Hello, I will make this patch start again for this CF.

The requirement for this patch is as follows.

- What I want is similar checkpoint-progression behavior between the
 master and (hot-)standby. Specifically, checkpoints during streaming
 replication should run at the speed governed by checkpoint_segments.
 The point of this patch is to avoid an unexpectedly large number of
 WAL segments accumulating on the standby side. (Plus, increasing the
 chance to skip the recovery-end checkpoint with my other patch.)

I think we need to be clearer about this:

I reject this patch and am marking it Rejected in the CF manager.

The "increase chance to skip recovery end checkpoint" is completely
gone as a reason to do this (see other thread).

Plus the premise that we want more restartpoints is wrong, with
reasons explained by Heikki, in detail, months ago.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#36Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Simon Riggs (#35)
Re: Checkpointer on hot standby runs without looking checkpoint_segments

Ok, I agree to drop this patch from this CF.

- What I want is similar checkpoint-progression behavior between the
 master and (hot-)standby. Specifically, checkpoints during streaming
 replication should run at the speed governed by checkpoint_segments.
 The point of this patch is to avoid an unexpectedly large number of
 WAL segments accumulating on the standby side. (Plus, increasing the
 chance to skip the recovery-end checkpoint with my other patch.)

I think we need to be clearer about this:

I reject this patch and am marking it Rejected in the CF manager.

The "increase chance to skip recovery end checkpoint" is completely
gone as a reason to do this (see other thread).

I don't know why you are so opposed to decreasing the checkpoint
interval, given that I proposed to do so only during WAL streaming
(and additionally allowing it to be optional); we have another,
sufficient reason to want *to be able* to do so. Smoothing disk usage
is not a trivial issue for some users - I suppose (or hope?) for quite
a few users. This is a matter not only of resource limits but also of
operational management.

Plus the premise that we want more restartpoints is wrong, with
reasons explained by Heikki, in detail, months ago.

Yes, it may make catching up in standby mode slower, but that seems to
be a trade-off between resources and performance.

But since I agree that this is not a critical issue, and with the
opinion that checkpoint planning should somehow be smarter about
balancing disk usage and performance, I accept the judgement to drop
this.

Thank you.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

== My e-mail address has been changed since Apr. 1, 2012.

#37Robert Haas
robertmhaas@gmail.com
In reply to: Kyotaro HORIGUCHI (#36)
Re: Checkpointer on hot standby runs without looking checkpoint_segments

On Mon, Jul 2, 2012 at 5:08 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

I don't know why you are so opposed to decreasing the checkpoint
interval, given that I proposed to do so only during WAL streaming
(and additionally allowing it to be optional); we have another,
sufficient reason to want *to be able* to do so. Smoothing disk usage
is not a trivial issue for some users - I suppose (or hope?) for quite
a few users. This is a matter not only of resource limits but also of
operational management.

I agree. I think there should be a way to do this. I see the problem
with the proposed patch, but I think that means we need a better idea,
not that the problem is unimportant.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company