checkpoints are duplicated even while the system is idle
Hi,
While the system is idle, we skip duplicate checkpoints for good
reasons. But when wal_level is set to hot_standby, I found that
checkpoints are wrongly duplicated even while the system is idle.
The cause is that an XLOG_RUNNING_XACTS WAL record always
follows the CHECKPOINT one when wal_level is set to hot_standby.
So the subsequent checkpoint wrongly concludes that a record (i.e.,
the XLOG_RUNNING_XACTS record) has been inserted since the start of
the last checkpoint, that the system is not idle, and that this
checkpoint cannot be skipped. Is this intentional behavior? Or a bug?
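To make the failure mode concrete, here is roughly what the idle test in
CreateCheckPoint() amounts to (a simplified sketch with made-up types and
a made-up record-length constant, not the real xlog.c code):

/*
 * Sketch of the current skip-duplicate-checkpoint test.  The checkpoint is
 * skipped only if nothing at all has been inserted since the last one.
 */
#include <stdbool.h>
#include <stdint.h>

typedef struct
{
	uint32_t	xlogid;		/* log file id */
	uint32_t	xrecoff;	/* byte offset within the log file */
} LSN;

/* stand-in for MAXALIGN(SizeOfXLogRecord + sizeof(CheckPoint)) */
#define CKPT_RECORD_LEN 88

static bool
checkpoint_skippable(LSN cur_insert, LSN last_ckpt, LSN last_redo)
{
	/*
	 * The current insert position must sit exactly at the end of the last
	 * checkpoint record, and that checkpoint's redo pointer must point at
	 * itself.
	 */
	return cur_insert.xlogid == last_ckpt.xlogid &&
		cur_insert.xrecoff == last_ckpt.xrecoff + CKPT_RECORD_LEN &&
		last_ckpt.xlogid == last_redo.xlogid &&
		last_ckpt.xrecoff == last_redo.xrecoff;
}

With wal_level = hot_standby, the XLOG_RUNNING_XACTS record written right
after the checkpoint pushes the insert position past the end of the
checkpoint record, so this test never succeeds even on an idle system.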
Regards,
--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Wed, Oct 5, 2011 at 1:19 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
While the system is idle, we skip duplicate checkpoints for good
reasons. But when wal_level is set to hot_standby, I found that
checkpoints are wrongly duplicated even while the system is idle.
The cause is that an XLOG_RUNNING_XACTS WAL record always
follows the CHECKPOINT one when wal_level is set to hot_standby.
So the subsequent checkpoint wrongly concludes that a record (i.e.,
the XLOG_RUNNING_XACTS record) has been inserted since the start of
the last checkpoint, that the system is not idle, and that this
checkpoint cannot be skipped. Is this intentional behavior? Or a bug?
If we can eliminate it in some cases where it isn't necessary, that
seems like a good thing, but I'm not sure I have a good handle on when
it is or isn't necessary.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Oct 5, 2011 at 6:19 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
While the system is idle, we skip duplicate checkpoints for good
reasons. But when wal_level is set to hot_standby, I found that
checkpoints are wrongly duplicated even while the system is idle.
The cause is that an XLOG_RUNNING_XACTS WAL record always
follows the CHECKPOINT one when wal_level is set to hot_standby.
So the subsequent checkpoint wrongly concludes that a record (i.e.,
the XLOG_RUNNING_XACTS record) has been inserted since the start of
the last checkpoint, that the system is not idle, and that this
checkpoint cannot be skipped. Is this intentional behavior? Or a bug?
I think it is avoidable behaviour, but not a bug.
Thinking some more about this, IMHO it is possible to improve the
situation greatly by looking again at the true purpose of
checkpoints. Checkpoints exist to minimise the time taken during crash
recovery, and to serve as starting points for backups/archive recoveries.
The current idea is that if there has been no activity then we skip
checkpoint. But all it takes is a single WAL record and off we go with
another checkpoint. If there hasn't been much WAL activity, there is
not much point in having another checkpoint record since there is
little if any time to be saved in recovery.
So why not avoid checkpoints until we have written at least 1 WAL file
worth of data? That way checkpoint records are always in different
files, so we are safer with regard to primary and secondary checkpoint
records. Would that mean that in some cases dirty data stays in
shared buffers for days or weeks? No, because the bgwriter would clean
it - but even if it did, so what? Recovery will still be incredibly
quick, which is the whole point.
Testing whether we're in a different segment is easy, and much simpler
than wriggling around trying to fix the problem you mention directly.
Patch attached.
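For anyone reading along without xlog_internal.h to hand, the test in the
attached patch boils down to segment arithmetic, roughly like this (an
illustrative sketch with simplified types, not the real XLByteToSeg
macros; it assumes the default 16MB segment size):

#include <stdbool.h>
#include <stdint.h>

#define XLOG_SEG_SIZE (16 * 1024 * 1024)	/* assumed default */

typedef struct
{
	uint32_t	xlogid;		/* log file id */
	uint32_t	xrecoff;	/* byte offset within the log file */
} LSN;

/* two WAL positions lie in the same segment file iff both parts match */
static bool
same_wal_segment(LSN a, LSN b)
{
	return a.xlogid == b.xlogid &&
		a.xrecoff / XLOG_SEG_SIZE == b.xrecoff / XLOG_SEG_SIZE;
}

So the checkpoint is skipped for as long as the insert pointer has not yet
left the WAL file that holds the previous checkpoint.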
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
spaced_checkpoints.v1.patch (application/octet-stream)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index befb507..fbd3fd2 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7563,6 +7563,10 @@ CreateCheckPoint(int flags)
uint32 freespace;
uint32 _logId;
uint32 _logSeg;
+ uint32 insert_logId;
+ uint32 insert_logSeg;
+ uint32 ckpt_logId;
+ uint32 ckpt_logSeg;
TransactionId *inCommitXids;
int nInCommit;
@@ -7629,19 +7633,14 @@ CreateCheckPoint(int flags)
LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
/*
- * If this isn't a shutdown or forced checkpoint, and we have not inserted
- * any XLOG records since the start of the last checkpoint, skip the
+ * If this isn't a shutdown or forced checkpoint, and we have not switched
+ * to the next WAL file since the start of the last checkpoint, skip the
* checkpoint. The idea here is to avoid inserting duplicate checkpoints
* when the system is idle. That wastes log space, and more importantly it
* exposes us to possible loss of both current and previous checkpoint
* records if the machine crashes just as we're writing the update.
* (Perhaps it'd make even more sense to checkpoint only when the previous
* checkpoint record is in a different xlog page?)
- *
- * We have to make two tests to determine that nothing has happened since
- * the start of the last checkpoint: current insertion point must match
- * the end of the last checkpoint record, and its redo pointer must point
- * to itself.
*/
if ((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY |
CHECKPOINT_FORCE)) == 0)
@@ -7649,13 +7648,10 @@ CreateCheckPoint(int flags)
XLogRecPtr curInsert;
INSERT_RECPTR(curInsert, Insert, Insert->curridx);
- if (curInsert.xlogid == ControlFile->checkPoint.xlogid &&
- curInsert.xrecoff == ControlFile->checkPoint.xrecoff +
- MAXALIGN(SizeOfXLogRecord + sizeof(CheckPoint)) &&
- ControlFile->checkPoint.xlogid ==
- ControlFile->checkPointCopy.redo.xlogid &&
- ControlFile->checkPoint.xrecoff ==
- ControlFile->checkPointCopy.redo.xrecoff)
+ XLByteToSeg(curInsert, insert_logId, insert_logSeg);
+ XLByteToSeg(ControlFile->checkPoint, ckpt_logId, ckpt_logSeg);
+ if (insert_logId == ckpt_logId &&
+ insert_logSeg == ckpt_logSeg)
{
LWLockRelease(WALInsertLock);
LWLockRelease(CheckpointLock);
Simon Riggs <simon@2ndQuadrant.com> writes:
The current idea is that if there has been no activity then we skip
checkpoint. But all it takes is a single WAL record and off we go with
another checkpoint. If there hasn't been much WAL activity, there is
not much point in having another checkpoint record since there is
little if any time to be saved in recovery.
So why not avoid checkpoints until we have written at least 1 WAL file
worth of data?
+1, but I think you need to compare to the last checkpoint's REDO
pointer, not to the position of the checkpoint record itself.
Otherwise, the argument falls down if there was a lot of activity
during the last checkpoint (which is not unlikely in these days of
spread checkpoints).
Also I think the comment needs more extensive revision than you gave it.
regards, tom lane
On Thu, Oct 6, 2011 at 12:06 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Simon Riggs <simon@2ndQuadrant.com> writes:
The current idea is that if there has been no activity then we skip
checkpoint. But all it takes is a single WAL record and off we go with
another checkpoint. If there hasn't been much WAL activity, there is
not much point in having another checkpoint record since there is
little if any time to be saved in recovery.
So why not avoid checkpoints until we have written at least 1 WAL file
worth of data?
+1, but I think you need to compare to the last checkpoint's REDO
pointer, not to the position of the checkpoint record itself.
Otherwise, the argument falls down if there was a lot of activity
during the last checkpoint (which is not unlikely in these days of
spread checkpoints).
Also I think the comment needs more extensive revision than you gave it.
If we go with this approach, we presumably also need to update the
documentation, especially for checkpoint_timeout (which will no longer
be a hard limit on the time between checkpoints).
I'm not entirely sure I understand the rationale, though. I mean, if
very little has happened since the last checkpoint, then the
checkpoint will be very cheap. In the totally degenerate case Fujii
Masao is reporting, where absolutely nothing has happened, it should
be basically free. We'll loop through a whole bunch of things, decide
there's nothing to fsync, and call it a day.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
I'm not entirely sure I understand the rationale, though. I mean, if
very little has happened since the last checkpoint, then the
checkpoint will be very cheap. In the totally degenerate case Fujii
Masao is reporting, where absolutely nothing has happened, it should
be basically free. We'll loop through a whole bunch of things, decide
there's nothing to fsync, and call it a day.
I think the point is that a totally idle database should not continue to
emit WAL, not even at a slow rate. There are also power-consumption
objections to allowing the checkpoint process to fire up to no purpose.
The larger issue is that we should not only be concerned with optimizing
for high load. Doing nothing when there's nothing to be done is an
important part of good data-center citizenship. See also the ongoing
work to avoid unnecessary process wakeups.
regards, tom lane
On Thu, Oct 6, 2011 at 12:44 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
I'm not entirely sure I understand the rationale, though. I mean, if
very little has happened since the last checkpoint, then the
checkpoint will be very cheap. In the totally degenerate case Fujii
Masao is reporting, where absolutely nothing has happened, it should
be basically free. We'll loop through a whole bunch of things, decide
there's nothing to fsync, and call it a day.
I think the point is that a totally idle database should not continue to
emit WAL, not even at a slow rate. There are also power-consumption
objections to allowing the checkpoint process to fire up to no purpose.
Hmm, OK. I still think it's a little funny to say that
checkpoint_timeout will force a checkpoint every N minutes except when
it doesn't, but maybe there's no real harm in that as long as we
document it properly.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Oct 6, 2011 at 5:06 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Simon Riggs <simon@2ndQuadrant.com> writes:
The current idea is that if there has been no activity then we skip
checkpoint. But all it takes is a single WAL record and off we go with
another checkpoint. If there hasn't been much WAL activity, there is
not much point in having another checkpoint record since there is
little if any time to be saved in recovery.
So why not avoid checkpoints until we have written at least 1 WAL file
worth of data?
+1, but I think you need to compare to the last checkpoint's REDO
pointer, not to the position of the checkpoint record itself.
Otherwise, the argument falls down if there was a lot of activity
during the last checkpoint (which is not unlikely in these days of
spread checkpoints).
Agreed, that's better.
Also I think the comment needs more extensive revision than you gave it.
OK, will do.
New version attached. Docs later.
Do we want this backpatched? If so, suggest just 9.1 and 9.0?
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
spaced_checkpoints.v2.patch (application/octet-stream)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index befb507..8f30ac8 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7563,6 +7563,10 @@ CreateCheckPoint(int flags)
uint32 freespace;
uint32 _logId;
uint32 _logSeg;
+ uint32 redo_logId;
+ uint32 redo_logSeg;
+ uint32 insert_logId;
+ uint32 insert_logSeg;
TransactionId *inCommitXids;
int nInCommit;
@@ -7629,8 +7633,8 @@ CreateCheckPoint(int flags)
LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
/*
- * If this isn't a shutdown or forced checkpoint, and we have not inserted
- * any XLOG records since the start of the last checkpoint, skip the
+ * If this isn't a shutdown or forced checkpoint, and we have not switched
+ * to the next WAL file since the start of the last checkpoint, skip the
* checkpoint. The idea here is to avoid inserting duplicate checkpoints
* when the system is idle. That wastes log space, and more importantly it
* exposes us to possible loss of both current and previous checkpoint
@@ -7638,10 +7642,11 @@ CreateCheckPoint(int flags)
* (Perhaps it'd make even more sense to checkpoint only when the previous
* checkpoint record is in a different xlog page?)
*
- * We have to make two tests to determine that nothing has happened since
- * the start of the last checkpoint: current insertion point must match
- * the end of the last checkpoint record, and its redo pointer must point
- * to itself.
+ * While holding the WALInsertLock we find the current WAL insertion point
+ * and compare that with the starting point of the last checkpoint, which
+ * is the redo pointer. We use the redo pointer because the start and end
+ * points of a checkpoint can be hundreds of files apart on large systems
+ * when checkpoint writes are spread out over time.
*/
if ((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY |
CHECKPOINT_FORCE)) == 0)
@@ -7649,13 +7654,10 @@ CreateCheckPoint(int flags)
XLogRecPtr curInsert;
INSERT_RECPTR(curInsert, Insert, Insert->curridx);
- if (curInsert.xlogid == ControlFile->checkPoint.xlogid &&
- curInsert.xrecoff == ControlFile->checkPoint.xrecoff +
- MAXALIGN(SizeOfXLogRecord + sizeof(CheckPoint)) &&
- ControlFile->checkPoint.xlogid ==
- ControlFile->checkPointCopy.redo.xlogid &&
- ControlFile->checkPoint.xrecoff ==
- ControlFile->checkPointCopy.redo.xrecoff)
+ XLByteToSeg(curInsert, insert_logId, insert_logSeg);
+ XLByteToSeg(ControlFile->checkPointCopy.redo, redo_logId, redo_logSeg);
+ if (insert_logId == redo_logId &&
+ insert_logSeg == redo_logSeg)
{
LWLockRelease(WALInsertLock);
LWLockRelease(CheckpointLock);
Robert Haas <robertmhaas@gmail.com> wrote:
Tom Lane <tgl@sss.pgh.pa.us> wrote:
I think the point is that a totally idle database should not
continue to emit WAL, not even at a slow rate. There are also
power-consumption objections to allowing the checkpoint process
to fire up to no purpose.
Hmm, OK. I still think it's a little funny to say that
checkpoint_timeout will force a checkpoint every N minutes except
when it doesn't, but maybe there's no real harm in that as long as
we document it properly.
What will be the best way to determine, from looking only at a
standby, whether replication is healthy and progressing? Going back
to the 8.1 warm standby days we have run pg_controldata and compared
the "Time of latest checkpoint" to current time; I'm hoping there's
a better way now, so that we can drop that kludge. If not, I would
like a way to kick out a checkpoint from the master at some finite
and configurable interval for monitoring purposes. I'm all for
having a nice, sharp fillet knife, but if that's not available,
please don't take away this rock I chipped at until it had an
edge.
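For what it's worth, the kludge amounts to something like this (an
illustrative sketch only; the data directory path, pg_controldata's label
text, and its timestamp format are all assumptions that vary by version,
locale, and installation):

#define _XOPEN_SOURCE 700		/* for strptime() and popen() */
#include <stdio.h>
#include <string.h>
#include <time.h>

int
main(void)
{
	FILE	   *fp = popen("pg_controldata /var/lib/pgsql/data", "r");
	char		line[1024];
	time_t		ckpt = 0;

	if (fp == NULL)
		return 2;
	while (fgets(line, sizeof(line), fp) != NULL)
	{
		const char *label = "Time of latest checkpoint:";

		if (strncmp(line, label, strlen(label)) == 0)
		{
			struct tm	tm = {0};

			/* assumed timestamp layout, e.g. "Thu Oct  6 12:34:56 2011" */
			if (strptime(line + strlen(label), " %a %b %d %H:%M:%S %Y", &tm) != NULL)
				ckpt = mktime(&tm);
		}
	}
	pclose(fp);
	/* complain if the latest checkpoint is older than a few checkpoint_timeouts */
	if (ckpt == 0 || time(NULL) - ckpt > 3 * 300)
	{
		fprintf(stderr, "replication may be stalled (or the master is truly idle)\n");
		return 1;
	}
	return 0;
}

Under the current behaviour a healthy but idle master still refreshes that
timestamp every checkpoint_timeout; with the proposed change it would not.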
Most likely there's something there which I've missed, but it's
really nice to be able to tell the difference between a quiet master
and broken replication.
-Kevin
Simon Riggs <simon@2ndquadrant.com> writes:
Do we want this backpatched? If so, suggest just 9.1 and 9.0?
-1 for backpatching; it's more an improvement than a bug fix.
In any case, I think we still need to respond to the point Kevin made
about how to tell an idle master from broken replication. Right now,
you will get at least a few bytes of data every checkpoint_timeout
seconds. If we change this, you won't.
I'm inclined to think that the way to deal with that is not to force out
useless WAL data, but to add some sort of explicit "I'm alive" heartbeat
signal to the walsender/walreceiver protocol. The hard part of that is
to figure out how to expose it where you can see it on the slave side
--- or do we have a status view that could handle that?
regards, tom lane
Robert Haas <robertmhaas@gmail.com> writes:
On Thu, Oct 6, 2011 at 12:44 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I think the point is that a totally idle database should not continue to
emit WAL, not even at a slow rate. There are also power-consumption
objections to allowing the checkpoint process to fire up to no purpose.
Hmm, OK. I still think it's a little funny to say that
checkpoint_timeout will force a checkpoint every N minutes except when
it doesn't, but maybe there's no real harm in that as long as we
document it properly.
Well ... if we think that it's sane to only checkpoint once per WAL
segment, maybe we should just take out checkpoint_timeout.
We'd need some other mechanism to address replication use-cases, but see
my comments to Simon's followup patch just now.
regards, tom lane
On Thu, Oct 6, 2011 at 6:56 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Simon Riggs <simon@2ndquadrant.com> writes:
Do we want this backpatched? If so, suggest just 9.1 and 9.0?
-1 for backpatching; it's more an improvement than a bug fix.
OK, works for me.
In any case, I think we still need to respond to the point Kevin made
about how to tell an idle master from broken replication. Right now,
you will get at least a few bytes of data every checkpoint_timeout
seconds. If we change this, you won't.
I'm inclined to think that the way to deal with that is not to force out
useless WAL data, but to add some sort of explicit "I'm alive" heartbeat
signal to the walsender/walreceiver protocol. The hard part of that is
to figure out how to expose it where you can see it on the slave side
--- or do we have a status view that could handle that?
Different but related issue and yes, am on it, and yes, the way you just said.
I foresee a function that tells you the delay based on a protocol
message of 'k' for keepalive.
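Purely as an illustration of the shape that might take (nothing below is
an agreed wire format; every field name is invented):

#include <stdint.h>

/* hypothetical keepalive message sent periodically by the walsender */
typedef struct WalSndKeepaliveSketch
{
	char		msgtype;	/* 'k' */
	uint64_t	walEnd;		/* master's current WAL send position */
	int64_t		sendTime;	/* master's clock, microseconds since some epoch */
} WalSndKeepaliveSketch;

/* standby side: delay the monitoring function would report */
static int64_t
keepalive_delay(int64_t standby_now_usec, const WalSndKeepaliveSketch *msg)
{
	return standby_now_usec - msg->sendTime;
}

That would let the standby report how long ago it last heard from the
master, independently of WAL volume.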
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 06.10.2011 20:58, Tom Lane wrote:
Robert Haas <robertmhaas@gmail.com> writes:
On Thu, Oct 6, 2011 at 12:44 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I think the point is that a totally idle database should not continue to
emit WAL, not even at a slow rate. There are also power-consumption
objections to allowing the checkpoint process to fire up to no purpose.
Hmm, OK. I still think it's a little funny to say that
checkpoint_timeout will force a checkpoint every N minutes except when
it doesn't, but maybe there's no real harm in that as long as we
document it properly.
Well ... if we think that it's sane to only checkpoint once per WAL
segment, maybe we should just take out checkpoint_timeout.
Huh? Surely not, in my mind checkpoint_timeout is the primary way of
controlling checkpoints, and checkpoint_segments you just set "high
enough" so that you never reach it.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Thu, Oct 6, 2011 at 1:56 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Simon Riggs <simon@2ndquadrant.com> writes:
Do we want this backpatched? If so, suggest just 9.1 and 9.0?
-1 for backpatching; it's more an improvement than a bug fix.
In any case, I think we still need to respond to the point Kevin made
about how to tell an idle master from broken replication. Right now,
you will get at least a few bytes of data every checkpoint_timeout
seconds. If we change this, you won't.
I'm inclined to think that the way to deal with that is not to force out
useless WAL data, but to add some sort of explicit "I'm alive" heartbeat
signal to the walsender/walreceiver protocol. The hard part of that is
to figure out how to expose it where you can see it on the slave side
--- or do we have a status view that could handle that?
As of 9.1, we already have something very much like this, in the
opposite direction. See wal_receiver_status_interval and
replication_timeout. I bet we could adapt that slightly to work in
the other direction, too. But that'll only work with streaming
replication - do we care about the WAL shipping case?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
On Thu, Oct 6, 2011 at 1:56 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I'm inclined to think that the way to deal with that is not to force out
useless WAL data, but to add some sort of explicit "I'm alive" heartbeat
signal to the walsender/walreceiver protocol. The hard part of that is
to figure out how to expose it where you can see it on the slave side
--- or do we have a status view that could handle that?
As of 9.1, we already have something very much like this, in the
opposite direction. See wal_receiver_status_interval and
replication_timeout. I bet we could adapt that slightly to work in
the other direction, too. But that'll only work with streaming
replication - do we care about the WAL shipping case?
I can't get excited about the WAL-shipping case. The artifact that we'll
generate a checkpoint record every few minutes does not create enough
WAL volume for WAL-shipping to reliably generate a heartbeat signal.
It'd be way too long between filling up segments if that were the only
WAL data being generated.
regards, tom lane
On Thu, Oct 6, 2011 at 2:26 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
On Thu, Oct 6, 2011 at 1:56 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I'm inclined to think that the way to deal with that is not to force out
useless WAL data, but to add some sort of explicit "I'm alive" heartbeat
signal to the walsender/walreceiver protocol. The hard part of that is
to figure out how to expose it where you can see it on the slave side
--- or do we have a status view that could handle that?
As of 9.1, we already have something very much like this, in the
opposite direction. See wal_receiver_status_interval and
replication_timeout. I bet we could adapt that slightly to work in
the other direction, too. But that'll only work with streaming
replication - do we care about the WAL shipping case?
I can't get excited about the WAL-shipping case. The artifact that we'll
generate a checkpoint record every few minutes does not create enough
WAL volume for WAL-shipping to reliably generate a heartbeat signal.
It'd be way too long between filling up segments if that were the only
WAL data being generated.
Depends how you set archive_timeout, I think.
Anyway, I'm not saying we *have* to care about that case. I'm just
asking whether we do, before we go start changing things.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Simon Riggs <simon@2ndQuadrant.com> wrote:
I foresee a function that tells you the delay based on a protocol
message of 'k' for keepalive.
If the delay you mention is basically a "ping" time or something
similar, that would answer the need I've been on about. We need to
know, based on access to the replica, that the replication system is
alive and well even with an idle master, since it can otherwise be hard
to distinguish an idle master from a failed replication system
(broken connection, misconfigured replication, etc.).
Right now there is a periodic checkpoint which flows through the WAL
and affects the pg_controldata report on the replica -- we've been
using that for monitoring. Any sort of heartbeat or ping which
provides sign-of-life on the connection, accessible on the replica,
should do for our purposes. If it only works over streaming
replication on a hot standby, that's OK -- we plan to be running
everything that way before 9.2 comes out, as long as we can
materialize "traditional" WAL files on the receiving end from the SR
stream.
-1 on changing the checkpoint behavior before 9.2, though.
-Kevin
On Thu, Oct 6, 2011 at 7:18 PM, Robert Haas <robertmhaas@gmail.com> wrote:
As of 9.1, we already have something very much like this, in the
opposite direction.
Yes Robert, I wrote it.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services