[BUG] Checkpointer on hot standby runs without looking checkpoint_segments
Hello, this is bug report and a patch for it.
The first patch in the attachments is for 9.2dev and next one is
for 9.1.3.
On the current 9.2dev, IsCheckpointOnSchedule@checkpointer.c does
not check against WAL segments written. This makes checkpointer
always run at the speed according to checkpoint_timeout
regardless of WAL advancing rate.
This leads to unexpected imbalance in the numbers of WAL segment
files between the master and the standby(s) for high advance rate
of WALs. And what is worse, the master would have much higher
chance to remove some WAL segments before the standby receives
them.
XLogPageRead()@xlog.c triggers checkpoint referring to WAL
segment advance. So I think this is a bug of bgwriter in 9.1. The
attached patches fix that on 9.2dev and 9.1.3 respctively.
In the backported version to 9.1.3, bgwriter.c is modified
instead of checkpointer.c in 9.2. And GetWalRcvWriteRecPtr() is
used as the equivalent of GetStandbyFlushRecPtr() in 9.2.
By the way, GetStandbyFlushRecPtr() acquires spin lock within. It
might be enough to read XLogCtl->recoveryLastRecPtr without lock
to make rough estimation, but I can't tell it is safe or
not. Same discussion could be for GetWalRcvWriteRecPtr() on
9.1.3.
However, it seems to work fine on a simple test.
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
== My e-mail address has been changed since Apr. 1, 2012.
On Mon, Apr 16, 2012 at 1:05 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
Hello, this is bug report and a patch for it.
The first patch in the attachments is for 9.2dev and next one is
for 9.1.3.On the current 9.2dev, IsCheckpointOnSchedule@checkpointer.c does
not check against WAL segments written. This makes checkpointer
always run at the speed according to checkpoint_timeout
regardless of WAL advancing rate.
Thanks, I'll take a look.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Apr 16, 2012 at 9:05 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
In the backported version to 9.1.3, bgwriter.c is modified
instead of checkpointer.c in 9.2. And GetWalRcvWriteRecPtr() is
used as the equivalent of GetStandbyFlushRecPtr() in 9.2.
In 9,2, GetXLogReplayRecPtr() should be used instead of GetStandbyFlushRecPtr().
A restartpoint is scheduled to finish before next restartpoint is
triggered. A restartpoint
is triggered if too much WAL files have been replayed since last
restartpoint. So
a restartpoint should be scheduled according to the replay location
not the receive
location.
- * computed before calling CreateCheckPoint. The code in XLogInsert that
- * actually triggers a checkpoint when checkpoint_segments is exceeded
- * compares against RedoRecptr, so this is not completely accurate.
- * However, it's good enough for our purposes, we're only calculating an
- * estimate anyway.
+ * computed before calling CreateCheckPoint. The code in
+ * XLogInsert that actually triggers a checkpoint when
+ * checkpoint_segments is exceeded compares against RedoRecptr.
+ * Similarly, we consult WAL flush location instead on hot
+ * standbys and XLogPageRead compares it aganst RedoRecptr, too.
+ * Altough these are not completely accurate, it's good enough for
+ * our purposes, we're only calculating an estimate anyway.
I think that basically it's better not to change the comments (i.e., not to add
the line feed) if their contents are the same as previous ones, to highlight
what you actually changed in the patch.
Typo: RedoRecptr should be RedoRecPtr?
Regards,
--
Fujii Masao
Hello, thank you for comment.
In the backported version to 9.1.3, bgwriter.c is modified
instead of checkpointer.c in 9.2. And GetWalRcvWriteRecPtr() is
used as the equivalent of GetStandbyFlushRecPtr() in 9.2.In 9,2, GetXLogReplayRecPtr() should be used instead of
GetStandbyFlushRecPtr(). A restartpoint is scheduled to finish
before next restartpoint is triggered. A restartpoint is
triggered if too much WAL files have been replayed since last
restartpoint. So a restartpoint should be scheduled according
to the replay location not the receive location.
I agree with it basically. But I've get confused to look into
GetStandbyFlushRecPtr().
| if (XLByteLT(receivePtr, replayPtr))
| return XLByteLT(replayPtr, restorePtr) ? restorePtr : replayPtr;
| else
| return XLByteLT(receivePtr, restorePtr) ? restorePtr : receivePtr;
This seems imply receivePtr may be behind replayPtr. I don't
understand what condition makes it but anyway the bottom line I
think is that a restartpoint should be based on WALs surely
synced. So I choosed GetStandbyFlushRecPtr() to read the
location.
If receivePtr/restorePtr always precede or are equal to
replayPtr, I prefer GetXLogReplayRecPtr() as you suggest.
(And some comment about the order among these pointers might
should be supplied for the part)
I think that basically it's better not to change the comments
(i.e., not to add the line feed) if their contents are the same
as previous ones, to highlight what you actually changed in the
patch.
Hmm. It is a priority matter between pointing up in or
compactness of a patch and consistency in outcome of that. I
think the latter takes precedence over the former.
Altough, I could have found a description on better balance. But
more than that, I've found fill-column for this comment be too
short...
Typo: RedoRecptr should be RedoRecPtr?
I think that's right. I've unconsciously brought that spelling
from the orignal comment.
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
== My e-mail address has been changed since Apr. 1, 2012.
This is new version of the patch.
I replaced GetStandbyFlushRecPtr with GetXLogReplayRecPtr to
check progress of checkpoint following Fujii's sugestion.
The first one is for 9.2dev, and the second is 9.1.3 backported version.
===
By the way, I took a close look around there,
I agree with it basically. But I've get confused to look into
GetStandbyFlushRecPtr().| if (XLByteLT(receivePtr, replayPtr))
| return XLByteLT(replayPtr, restorePtr) ? restorePtr : replayPtr;
| else
| return XLByteLT(receivePtr, restorePtr) ? restorePtr : receivePtr;
- receivePtr seems always updated just after syncing received xlog.
- replayPtr is updated just BEFORE xlog_redo operation, and
- restorePtr is updated AFTER xlog_redo().
- And, replayPtr seems not exceeds receivePtr.
These seems quite reasonable. These conditions make following
conditional expression.
restorePtr <= replayPtr <= receivePtr
But XLByteLT(recievePtr, replayPtr) this should not return true
under the condition above.. Something wrong in my assumption?
Anyway, I understand here that you say the location returned by
GetXLogReplayRecPtr() is always flushed.
--
Kyotaro Horiguchi
NTT Open Source Software Center
== My e-mail address has been changed since Apr. 1, 2012.
On 17.04.2012 09:50, Kyotaro HORIGUCHI wrote:
This is new version of the patch.
I replaced GetStandbyFlushRecPtr with GetXLogReplayRecPtr to
check progress of checkpoint following Fujii's sugestion.
The reason we haven't historically obeyed checkpoint_segments during
recovery is that it slows down the recovery unnecessarily if you're
restoring from a backup and you replay, say, one week's worth of WAL
files. If for example you have checkpoint_segments=10 and
checkpoint_timeout='15 mins' in the server that generated the WAL, you
would be constantly performing a restartpoint if you trigger one every
10 segments.
You could argue that you should obey checkpoint_segments in a standby
server that's caught up with the master, but AFAICS the patch doesn't
try to detect that.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Hello,
The reason we haven't historically obeyed checkpoint_segments
during recovery is that it slows down the recovery
unnecessarily if you're restoring from a backup and you replay,
The variable StandbyMode is false on archive recovery, so no
checkpoint triggerred during then.
xlog.c:10026 (in some version around 9.2)
| /*
| * Request a restartpoint if we've replayed too much
| * xlog since the last one.
| */
| if (StandbyMode && bgwriterLaunched)
| {
| if (XLogCheckpointNeeded(readId, readSeg))
You could argue that you should obey checkpoint_segments in a
standby server that's caught up with the master, but AFAICS the
patch doesn't try to detect that.
Concerning checkpoint, there seems no need for the standby to know
whether it has been caught up with its master or not. Checkpoint has
been already kicked by checkpoint_timeout regardless of the
sync_state.
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
== My e-mail address has been changed since Apr. 1, 2012.
Sorry, I've wrote something wrong.
The reason we haven't historically obeyed checkpoint_segments
during recovery is that it slows down the recovery
unnecessarily if you're restoring from a backup and you replay,The variable StandbyMode is false on archive recovery, so no
checkpoint triggerred during then.
Nevertheless, checkpoints will be triggered by checkpoint_timeout and
run at the maybe higher speed governed by checkpoint_segments. This is
undesirable behavior from such a point of view.
But I think referring checkpoint_segment on such case should be
inhibited, and I suppose it is possible using StandbyMode in
IsCheckpointOnSchedule(), I suppose.
I will correct the patch later.
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
== My e-mail address has been changed since Apr. 1, 2012.
Hello, this message is attached with the patch which did not
tested. That is for show the way.
On Tue, Apr 17, 2012 at 9:38 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
But I think referring checkpoint_segment on such case should be
inhibited, and I suppose it is possible using StandbyMode in
IsCheckpointOnSchedule(), I suppose.I will correct the patch later.
Hmm. StandbyMode is a local variable which cannot be accessed in
checkpointer. But WalRcvInProgress() which shows if wal receiver
is running seems to be usable to ENABLE governing progress by
checkpoint_segments.
| IsCheckpointOnSchedule(double progress)
| {
....
| /*
| * Inhibit governing progress by segments in archive recovery.
| */
| recovery_in_progress = RecoveryInProgress();
| if (!recovery_in_progress || WalRcvInProgress())
| {
| recptr = recovery_in_progress ? GetXLogReplayRecPtr(NULL) :
| GetInsertRecPtr();
How about this?
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
== My e-mail address has been changed since Apr. 1, 2012.
Attachments:
standby_checkpoint_segments_9.2dev_fix_20120417v2.patchapplication/octet-stream; name=standby_checkpoint_segments_9.2dev_fix_20120417v2.patchDownload+21-10
On Tue, Apr 17, 2012 at 3:50 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
These seems quite reasonable. These conditions make following
conditional expression.restorePtr <= replayPtr <= receivePtr
But XLByteLT(recievePtr, replayPtr) this should not return true
under the condition above.. Something wrong in my assumption?
When walreceiver is not running, i.e., the startup process reads the WAL files
from the archival area, the replay location would get bigger than the
receive one.
Regards,
--
Fujii Masao
On Tue, Apr 17, 2012 at 11:50 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
Hmm. StandbyMode is a local variable which cannot be accessed in
checkpointer. But WalRcvInProgress() which shows if wal receiver
is running seems to be usable to ENABLE governing progress by
checkpoint_segments.
Even when walreceiver is not running and WAL files are read from the archive,
checkpoint_segments can trigger a restartpoint. In this case, ISTM a
restartpoint
should be scheduled according to checkpoint_segments, so I don't think that
checking WalRcvInProgress() for that purpose is right thing. Instead, what about
sharing StandbyMode flag among processes via shared memory like XLogCtl?
Regards,
--
Fujii Masao
Hmm. StandbyMode is a local variable which cannot be accessed in
checkpointer. But WalRcvInProgress() which shows if wal receiver
is running seems to be usable to ENABLE governing progress by
checkpoint_segments.Even when walreceiver is not running and WAL files are read from the archive,
checkpoint_segments can trigger a restartpoint. In this case, ISTM a
restartpoint
should be scheduled according to checkpoint_segments, so I don't think that
checking WalRcvInProgress() for that purpose is right thing. Instead, what about
sharing StandbyMode flag among processes via shared memory like XLogCtl?
I tried that at first. But I suppose the requirement here is 'if
reading segments comes via replication stream, enable throttling
by checkpoint_segments.' and WalRcvInProgress() seems fit to
check that. Plus, adding SharedStartupStandbyMode into
XLogCtlData seems accompanied with some annoyances which would
not pay.
By the way, do you have some advise about GetStandbyFlushRecPtr()
and the order of the locations? I'm embarrassed with that...
I agree with it basically. But I've get confused to look into
GetStandbyFlushRecPtr().| if (XLByteLT(receivePtr, replayPtr))
| return XLByteLT(replayPtr, restorePtr) ? restorePtr : replayPtr;
| else
| return XLByteLT(receivePtr, restorePtr) ? restorePtr : receivePtr;- receivePtr seems always updated just after syncing received xlog.
- replayPtr is updated just BEFORE xlog_redo operation, and
- restorePtr is updated AFTER xlog_redo().
- And, replayPtr seems not exceeds receivePtr.These seems quite reasonable. These conditions make following
conditional expression.restorePtr <= replayPtr <= receivePtr
But XLByteLT(recievePtr, replayPtr) this should not return true
under the condition above.. Something wrong in my assumption?
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
== My e-mail address has been changed since Apr. 1, 2012.
On Wed, Apr 18, 2012 at 10:22 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
I tried that at first. But I suppose the requirement here is 'if
reading segments comes via replication stream, enable throttling
by checkpoint_segments.' and WalRcvInProgress() seems fit to
check that.
If so, what if replication is terminated and restarted repeatedly while
a restartpoint is running? It sometimes obeys and sometimes ignores
checkpoint_segments. Which seems strange behavior.
Plus, adding SharedStartupStandbyMode into
XLogCtlData seems accompanied with some annoyances which would
not pay.
Hmm... what are you worried about? I don't think that sharing the variable
via XLogCtl is so difficult. Please see the code to share archiveCleanupCommand
from the startup process to the checkpointer. It looks very simple.
By the way, do you have some advise about GetStandbyFlushRecPtr()
and the order of the locations? I'm embarrassed with that...
Sorry. I could not parse this.... Could you explain this again?
Regards,
--
Fujii Masao
I convinced that current patch has a problem, and will come up
with the new patch later.
====
I tried that at first. But I suppose the requirement here is 'if
reading segments comes via replication stream, enable throttling
by checkpoint_segments.' and WalRcvInProgress() seems fit to
check that.If so, what if replication is terminated and restarted repeatedly while
a restartpoint is running? It sometimes obeys and sometimes ignores
checkpoint_segments. Which seems strange behavior.
I see your point. And agree on that is something not should be.
Plus, adding SharedStartupStandbyMode into
XLogCtlData seems accompanied with some annoyances which would
not pay.Hmm... what are you worried about? I don't think that sharing the variable
via XLogCtl is so difficult. Please see the code to share archiveCleanupCommand
from the startup process to the checkpointer. It looks very simple.
The mechanism has nothing complex. I've been afraid of making so
many variables with similar meaning sitting side by side on
shared memory. But I convinced that additional shared variable is
preferable because it makes the logic and behavior clear and
sane.
I will come up with updated patch soon.
By the way, do you have some advise about GetStandbyFlushRecPtr()
and the order of the locations? I'm embarrassed with that...Sorry. I could not parse this.... Could you explain this again?
My point is,
- Is the assumption correct that "restorePtr <= replayPtr <= receivePtr"?
- If correct, what the code in GetStandbyFlushRecPtr() showing
below means?
if (XLByteLT(receivePtr, replayPtr))
- Or if wrong, what situation would take place to break the
expression "restorePtr <= replayPtr <= receivePtr"?
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
== My e-mail address has been changed since Apr. 1, 2012.
Hello, this is new version of standby checkpoint_segments patch.
- xlog.c: Make StandbyMode shared.
- checkpointer.c: Use IsStandbyMode() to check if postmaster is
under standby mode.
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
== My e-mail address has been changed since Apr. 1, 2012.
Attachments:
standby_checkpoint_segments_9.2dev_fix_20120419.patchtext/x-patch; charset=us-asciiDownload+97-24
On Thu, Apr 19, 2012 at 2:20 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
Hello, this is new version of standby checkpoint_segments patch.
Thanks for the patch!
- xlog.c: Make StandbyMode shared.
- checkpointer.c: Use IsStandbyMode() to check if postmaster is
under standby mode.
IsStandbyMode() looks overkill to me. The standby mode flag is forcibly
turned off at the end of recovery, but its change doesn't need to be shared
to the checkpointer process, IOW, the shared flag doesn't need to change
since startup like XLogCtl->archiveCleanupCommand, I think. So we can
simplify the code to share the flag to the checkpointer. See the attached
patch (though not tested yet).
The comments in checkpointer.c seems to need to be revised more. For
example,
+ * XLogInsert that actually triggers a checkpoint when
Currently a checkpoint is triggered by XLogWrite (not XLogInsert), the above
needs to be corrected.
Regards,
--
Fujii Masao
Attachments:
standby_checkpoint_segments_9.2dev_fix_20120423.patchapplication/octet-stream; name=standby_checkpoint_segments_9.2dev_fix_20120423.patchDownload+58-33
Hello,
- xlog.c: Make StandbyMode shared.
- checkpointer.c: Use IsStandbyMode() to check if postmaster is
under standby mode.IsStandbyMode() looks overkill to me. The standby mode flag is forcibly
turned off at the end of recovery, but its change doesn't need to be shared
to the checkpointer process, IOW, the shared flag doesn't need to change
since startup like XLogCtl->archiveCleanupCommand, I think. So we can
simplify the code to share the flag to the checkpointer. See the attached
patch (though not tested yet).
Hmm. I understood that the aim of the spinlock and volatil'ize of
the pointer in reading shared memory is to secure the memory
consistency on SMPs with weak memory consistency and to make
compiler help from over-optimization for non-volatile pointer
respectively.
You removed both of them in the patch.
If we are allowed to be tolerant of the temporary lack of
coherence in shared memory there, the spinlock could be removed.
But the possibility to read garbage by using XLogCtl itself to
access standbyMode does not seem to be tolerable. What do you
think about that?
The comments in checkpointer.c seems to need to be revised more. For
example,+ * XLogInsert that actually triggers a checkpoint when
Currently a checkpoint is triggered by XLogWrite (not XLogInsert), the above
needs to be corrected.
I will be carefull for such outdated description.
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
== My e-mail address has been changed since Apr. 1, 2012.
On 23.04.2012 02:59, Fujii Masao wrote:
On Thu, Apr 19, 2012 at 2:20 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:Hello, this is new version of standby checkpoint_segments patch.
Thanks for the patch!
This still makes catching up in standby mode slower, as you get many
more restartpoints. The reason for ignoring checkpoint_segments during
recovery was to avoid that. Maybe it's still better than what we have
currently, I'm not sure, but at least it needs to be discussed. Would be
good to do some performance testing of recovery with various
checkpoint_segments and _timeout settings, with and without this patch.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Tue, Apr 24, 2012 at 3:53 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
On 23.04.2012 02:59, Fujii Masao wrote:
On Thu, Apr 19, 2012 at 2:20 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:Hello, this is new version of standby checkpoint_segments patch.
Thanks for the patch!
This still makes catching up in standby mode slower, as you get many more
restartpoints. The reason for ignoring checkpoint_segments during recovery
was to avoid that. Maybe it's still better than what we have currently, I'm
not sure, but at least it needs to be discussed.
I see your point. Agreed.
Another aspect of this problem is that if we ignore checkpoint_segments during
recovery, a restartpoint would take long time, and which prevents WAL files from
being removed from pg_xlog for a while. Which might cause the disk to fill up
with WAL files. This trouble is unlikely to happen in 9.1 or before because the
archived WAL files are always restored with a temporary name. OTOH, in 9.2,
cascading replication patch changed the recovery logic so that the archived
WAL files are restored with the correct name, so the number of WAL files in
pg_xlog keeps increasing until a restartpoint removes them. The disk is more
likely to fill up, in 9.2.
To alleviate the above problem, at least we might have to change the recovery
logic so that the archived WAL files are restored with a temporary name,
if cascading replication is not enabled (i.e., !standby_mode || !hot_standby ||
max_wal_senders <= 0). Or we might have to remove the restored WAL file
after replaying it and before opening the next one, without waiting for
a restartpoint to remove the restored WAL files. Thought?
Regards,
--
Fujii Masao
On Tue, Apr 24, 2012 at 3:47 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
Hello,
- xlog.c: Make StandbyMode shared.
- checkpointer.c: Use IsStandbyMode() to check if postmaster is
under standby mode.IsStandbyMode() looks overkill to me. The standby mode flag is forcibly
turned off at the end of recovery, but its change doesn't need to be shared
to the checkpointer process, IOW, the shared flag doesn't need to change
since startup like XLogCtl->archiveCleanupCommand, I think. So we can
simplify the code to share the flag to the checkpointer. See the attached
patch (though not tested yet).Hmm. I understood that the aim of the spinlock and volatil'ize of
the pointer in reading shared memory is to secure the memory
consistency on SMPs with weak memory consistency and to make
compiler help from over-optimization for non-volatile pointer
respectively.You removed both of them in the patch.
If we are allowed to be tolerant of the temporary lack of
coherence in shared memory there, the spinlock could be removed.
But the possibility to read garbage by using XLogCtl itself to
access standbyMode does not seem to be tolerable. What do you
think about that?
I'm not sure if we really need to worry about that for such shared variable
that doesn't change since it's been initialized at the start of recovery.
Anyway, if we really need to worry about that, we need to protect the
shared variable RecoveryTargetTLI and archiveCleanupCommand with
the spinlock because they are in the same situation as standbyMode.
Regards,
--
Fujii Masao