standby apply lag on inactive servers
A customer of ours complained that if you have an inactive primary,
monitoring the apply lag on a standby reports monotonically increasing
lag. The reason for this is that the apply lag is only updated on
COMMIT records, which of course don't occur in inactive servers.
But CHECKPOINT records do occur, so the WAL insert pointer continues to
move forward, which is what causes the spurious lag.
(I think newer releases are protected from this problem because they
don't emit checkpoints during periods of inactivity. I didn't verify
this.)
This patch fixes the problem by using the checkpoint timestamp to update
the lag tracker in the standby. This requires a little change in where
this update is invoked, because previously it was done only for the XACT
rmgr; this makes the patch a little bigger than it should.
--
�lvaro Herrera PostgreSQL Expert, https://www.2ndQuadrant.com/
Attachments:
0001-Use-checkpoint-timestamp-to-update-apply-lag-tracker.patchtext/x-diff; charset=us-asciiDownload
From 859cd9db48cdde68c7767359e7cdae5b72a3011f Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: Fri, 10 Jan 2020 10:37:44 -0300
Subject: [PATCH] Use checkpoint timestamp to update apply lag tracker
Previously we only used transaction commit records, which makes the wal
pointer in standby lag behind when the primary is idle.
---
src/backend/access/transam/xlog.c | 37 ++++++++++++++++++++++---------
1 file changed, 26 insertions(+), 11 deletions(-)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 7f4f784c0e..59679b31ec 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5562,8 +5562,8 @@ exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog)
*
* If the record contains a timestamp, returns true, and saves the timestamp
* in *recordXtime. If the record type has no timestamp, returns false.
- * Currently, only transaction commit/abort records and restore points contain
- * timestamps.
+ * Currently, only transaction commit/abort records, restore points and
+ * checkpoints contain timestamps.
*/
static bool
getRecordTimestamp(XLogReaderState *record, TimestampTz *recordXtime)
@@ -5577,6 +5577,13 @@ getRecordTimestamp(XLogReaderState *record, TimestampTz *recordXtime)
*recordXtime = ((xl_restore_point *) XLogRecGetData(record))->rp_time;
return true;
}
+ if (rmid == RM_XLOG_ID && (info == XLOG_CHECKPOINT_ONLINE ||
+ info == XLOG_CHECKPOINT_SHUTDOWN))
+ {
+ *recordXtime =
+ time_t_to_timestamptz(((CheckPoint *) XLogRecGetData(record))->time);
+ return true;
+ }
if (rmid == RM_XACT_ID && (xact_info == XLOG_XACT_COMMIT ||
xact_info == XLOG_XACT_COMMIT_PREPARED))
{
@@ -5745,9 +5752,6 @@ recoveryStopsBefore(XLogReaderState *record)
/*
* Same as recoveryStopsBefore, but called after applying the record.
- *
- * We also track the timestamp of the latest applied COMMIT/ABORT
- * record in XLogCtl->recoveryLastXTime.
*/
static bool
recoveryStopsAfter(XLogReaderState *record)
@@ -5755,7 +5759,6 @@ recoveryStopsAfter(XLogReaderState *record)
uint8 info;
uint8 xact_info;
uint8 rmid;
- TimestampTz recordXtime;
/*
* Ignore recovery target settings when not in archive recovery (meaning
@@ -5823,10 +5826,6 @@ recoveryStopsAfter(XLogReaderState *record)
{
TransactionId recordXid;
- /* Update the last applied transaction timestamp */
- if (getRecordTimestamp(record, &recordXtime))
- SetLatestXTime(recordXtime);
-
/* Extract the XID of the committed/aborted transaction */
if (xact_info == XLOG_XACT_COMMIT_PREPARED)
{
@@ -5865,7 +5864,7 @@ recoveryStopsAfter(XLogReaderState *record)
{
recoveryStopAfter = true;
recoveryStopXid = recordXid;
- recoveryStopTime = recordXtime;
+ getRecordTimestamp(record, &recoveryStopTime);
recoveryStopLSN = InvalidXLogRecPtr;
recoveryStopName[0] = '\0';
@@ -6049,6 +6048,19 @@ recoveryApplyDelay(XLogReaderState *record)
return true;
}
+/*
+ * Track the timestamp of the latest applied time-bearing WAL record in
+ * XLogCtl->recoveryLastXTime.
+ */
+static void
+updateRecoveryXTime(XLogReaderState *record)
+{
+ TimestampTz recordXtime;
+
+ if (getRecordTimestamp(record, &recordXtime))
+ SetLatestXTime(recordXtime);
+}
+
/*
* Save timestamp of latest processed commit/abort record.
*
@@ -7255,6 +7267,9 @@ StartupXLOG(void)
WalSndWakeup();
}
+ /* update last recovery time */
+ updateRecoveryXTime(xlogreader);
+
/* Exit loop if we reached inclusive recovery target */
if (recoveryStopsAfter(xlogreader))
{
--
2.20.1
On 2020-Jan-10, Alvaro Herrera wrote:
A customer of ours complained that if you have an inactive primary,
monitoring the apply lag on a standby reports monotonically increasing
lag. The reason for this is that the apply lag is only updated on
COMMIT records, which of course don't occur in inactive servers.
But CHECKPOINT records do occur, so the WAL insert pointer continues to
move forward, which is what causes the spurious lag.(I think newer releases are protected from this problem because they
don't emit checkpoints during periods of inactivity. I didn't verify
this.)This patch fixes the problem by using the checkpoint timestamp to update
the lag tracker in the standby. This requires a little change in where
this update is invoked, because previously it was done only for the XACT
rmgr; this makes the patch a little bigger than it should.
Here's a version of the patch that applies to current master. It does
fix the problem that CHECKPOINT wal records are not considered when
determining time-of-latest-record.
However, it does *not* fix the monitoring problem I mentioned (which
relied on comparing pg_last_xact_replay_timestamp() to 'now()') ...
because commit 6ef2eba3f57f (pg10) made an idle server not emit
checkpoint records anymore. That is, my parenthical remark was
completely wrong: the new versions not only are "protected", but also
this fix doesn't fix them. Luckily, the way to fix monitoring for
servers of versions 10 and later is to use the new replay_lag (etc)
columns in pg_stat_replication, commit 6912acc04f0b (also pg10).
I am inclined to apply this to all branches unless there are strong
objections, because the current code seems pretty arbitrary anyway.
--
�lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments:
v2-0001-Use-CheckPoint-time-to-update-latest-recovery-tim.patchtext/x-diff; charset=iso-8859-1Download
From 35216a217af6c3452e9cd70b10963c238bcee9ff Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: Mon, 27 Jan 2020 14:43:17 -0300
Subject: [PATCH v2] Use CheckPoint->time to update latest recovery timestamp
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Previously, only timestamps carried by transaction commit/abort and
recovery labels were used, but there's no reason to ignore checkpoint's.
Also, rearrange things so that the timestamp is updated by StartupXLOG's
loop rather than recoveryStopsAfter, which was an odd choice.
Author: Ãlvaro Herrera
Reviewed-by:
Discussion: https://postgr.es/m/20200110140828.GA6228@alvherre.pgsql
---
src/backend/access/transam/xlog.c | 37 ++++++++++++++++++++++---------
1 file changed, 26 insertions(+), 11 deletions(-)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 882d5e8a73..0bc3d8cec5 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5560,8 +5560,8 @@ exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog)
*
* If the record contains a timestamp, returns true, and saves the timestamp
* in *recordXtime. If the record type has no timestamp, returns false.
- * Currently, only transaction commit/abort records and restore points contain
- * timestamps.
+ * Currently, only transaction commit/abort records, restore points and
+ * checkpoints contain timestamps.
*/
static bool
getRecordTimestamp(XLogReaderState *record, TimestampTz *recordXtime)
@@ -5575,6 +5575,13 @@ getRecordTimestamp(XLogReaderState *record, TimestampTz *recordXtime)
*recordXtime = ((xl_restore_point *) XLogRecGetData(record))->rp_time;
return true;
}
+ if (rmid == RM_XLOG_ID && (info == XLOG_CHECKPOINT_ONLINE ||
+ info == XLOG_CHECKPOINT_SHUTDOWN))
+ {
+ *recordXtime =
+ time_t_to_timestamptz(((CheckPoint *) XLogRecGetData(record))->time);
+ return true;
+ }
if (rmid == RM_XACT_ID && (xact_info == XLOG_XACT_COMMIT ||
xact_info == XLOG_XACT_COMMIT_PREPARED))
{
@@ -5743,9 +5750,6 @@ recoveryStopsBefore(XLogReaderState *record)
/*
* Same as recoveryStopsBefore, but called after applying the record.
- *
- * We also track the timestamp of the latest applied COMMIT/ABORT
- * record in XLogCtl->recoveryLastXTime.
*/
static bool
recoveryStopsAfter(XLogReaderState *record)
@@ -5753,7 +5757,6 @@ recoveryStopsAfter(XLogReaderState *record)
uint8 info;
uint8 xact_info;
uint8 rmid;
- TimestampTz recordXtime;
/*
* Ignore recovery target settings when not in archive recovery (meaning
@@ -5821,10 +5824,6 @@ recoveryStopsAfter(XLogReaderState *record)
{
TransactionId recordXid;
- /* Update the last applied transaction timestamp */
- if (getRecordTimestamp(record, &recordXtime))
- SetLatestXTime(recordXtime);
-
/* Extract the XID of the committed/aborted transaction */
if (xact_info == XLOG_XACT_COMMIT_PREPARED)
{
@@ -5863,7 +5862,7 @@ recoveryStopsAfter(XLogReaderState *record)
{
recoveryStopAfter = true;
recoveryStopXid = recordXid;
- recoveryStopTime = recordXtime;
+ getRecordTimestamp(record, &recoveryStopTime);
recoveryStopLSN = InvalidXLogRecPtr;
recoveryStopName[0] = '\0';
@@ -6047,6 +6046,19 @@ recoveryApplyDelay(XLogReaderState *record)
return true;
}
+/*
+ * Track the timestamp of the latest applied time-bearing WAL record in
+ * XLogCtl->recoveryLastXTime.
+ */
+static void
+updateRecoveryXTime(XLogReaderState *record)
+{
+ TimestampTz recordXtime;
+
+ if (getRecordTimestamp(record, &recordXtime))
+ SetLatestXTime(recordXtime);
+}
+
/*
* Save timestamp of latest processed commit/abort record.
*
@@ -7255,6 +7267,9 @@ StartupXLOG(void)
WalSndWakeup();
}
+ /* update last recovery time */
+ updateRecoveryXTime(xlogreader);
+
/* Exit loop if we reached inclusive recovery target */
if (recoveryStopsAfter(xlogreader))
{
--
2.20.1
Actually looking again, getRecordTimestamp is looking pretty strange.
It looks much more natural by using nested switch/case blocks, as with
this diff. I think the compiler does a better job this way too.
--
�lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hello.
At Mon, 27 Jan 2020 18:06:27 -0300, Alvaro Herrera <alvherre@2ndquadrant.com> wrote in
Actually looking again, getRecordTimestamp is looking pretty strange.
It looks much more natural by using nested switch/case blocks, as with
this diff. I think the compiler does a better job this way too.
Agreed. Anyway I looked the latest version.
The aim of the patch seem reasonable. XLOG_END_OF_RECOVERY and
XLOG_XACT_PREPARE also have a timestamp but it doesn't help much. (But
could be included for consistency.)
The timestamp of a checkpoint record is the start time of a checkpoint
(and doesn't have subseconds part, but it's not a problem.). That
means that the timestamp is usually far behind the time at the record
has been inserted. As the result the stored timestamp can go back by a
significant internval. I don't think that causes an actual problem but
the movement looks wierd as the result of
pg_last_xact_replay_timestamp().
Asides from the backward movement, a timestamp from other than
commit/abort records in recvoeryLastXTime affects the following code.
xlog.c:7329: (similar code exists at line 9332)
ereport(LOG,
(errmsg("redo done at %X/%X",
(uint32) (ReadRecPtr >> 32), (uint32) ReadRecPtr)));
xtime = GetLatestXTime();
if (xtime)
ereport(LOG,
(errmsg("last completed transaction was at log time %s",
timestamptz_to_str(xtime))));
This code assumes (and the name GetLatestXTime() suggests, I first
noticed that here..) that the timestamp comes from commit/abort logs,
so otherwise it shows a wrong timestamp. We shouldn't update the
variable by other than that kind of records.
If (I don't think that comes true..) we set the timestamp from other
than that kind of record, the names and the comments of the functions
should be changed.
/*
* Save timestamp of latest processed commit/abort record.
*
* We keep this in XLogCtl, not a simple static variable, so that it can be
* seen by processes other than the startup process. Note in particular
* that CreateRestartPoint is executed in the checkpointer.
*/
static void
SetLatestXTime(TimestampTz xtime)
...
/*
* Fetch timestamp of latest processed commit/abort record.
*/
TimestampTz
GetLatestXTime(void)
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On 2020-Jan-28, Kyotaro Horiguchi wrote:
At Mon, 27 Jan 2020 18:06:27 -0300, Alvaro Herrera <alvherre@2ndquadrant.com> wrote in
Actually looking again, getRecordTimestamp is looking pretty strange.
It looks much more natural by using nested switch/case blocks, as with
this diff. I think the compiler does a better job this way too.Agreed. Anyway I looked the latest version.
The aim of the patch seem reasonable. XLOG_END_OF_RECOVERY and
XLOG_XACT_PREPARE also have a timestamp but it doesn't help much. (But
could be included for consistency.)
Hello, thanks for looking.
The timestamp of a checkpoint record is the start time of a checkpoint
(and doesn't have subseconds part, but it's not a problem.). That
means that the timestamp is usually far behind the time at the record
has been inserted. As the result the stored timestamp can go back by a
significant internval. I don't think that causes an actual problem but
the movement looks wierd as the result of
pg_last_xact_replay_timestamp().
Ouch ... yeah, it should be set only if it doesn't go backwards.
xlog.c:7329: (similar code exists at line 9332)
ereport(LOG,
(errmsg("redo done at %X/%X",
(uint32) (ReadRecPtr >> 32), (uint32) ReadRecPtr)));
xtime = GetLatestXTime();
if (xtime)
ereport(LOG,
(errmsg("last completed transaction was at log time %s",
timestamptz_to_str(xtime))));This code assumes (and the name GetLatestXTime() suggests, I first
noticed that here..) that the timestamp comes from commit/abort logs,
so otherwise it shows a wrong timestamp. We shouldn't update the
variable by other than that kind of records.
Hmm, that's terrible. GetLatestXTime() being displayed user-visibly for
"last transaction completion" but having it include unrelated things
such as restore points is terrible. One idea is to should split it in
two: one exclusively for transaction commit/abort, and another for all
WAL activity. That way, the former can be used for that message, and
the latter for standby replay reports. However, that might be
overengineering, if the only thing that the former is that one LOG
message; instead changing the log message to state that the time is for
other activity, as you suggest, is simpler and has no downside that I
can see.
--
�lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2020-Jan-27, Alvaro Herrera wrote:
Actually looking again, getRecordTimestamp is looking pretty strange.
It looks much more natural by using nested switch/case blocks, as with
this diff. I think the compiler does a better job this way too.
I hadn't noticed I forgot to attach the diff here :-(
--
�lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments:
rewrite-getRecordTimestamp.patchtext/x-diff; charset=us-asciiDownload
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 0bc3d8cec5..1a165701ae 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5566,33 +5566,43 @@ exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog)
static bool
getRecordTimestamp(XLogReaderState *record, TimestampTz *recordXtime)
{
- uint8 info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
- uint8 xact_info = info & XLOG_XACT_OPMASK;
- uint8 rmid = XLogRecGetRmid(record);
+ RmgrId rmid = XLogRecGetRmid(record);
+ uint8 info;
- if (rmid == RM_XLOG_ID && info == XLOG_RESTORE_POINT)
+ switch (rmid)
{
- *recordXtime = ((xl_restore_point *) XLogRecGetData(record))->rp_time;
- return true;
- }
- if (rmid == RM_XLOG_ID && (info == XLOG_CHECKPOINT_ONLINE ||
- info == XLOG_CHECKPOINT_SHUTDOWN))
- {
- *recordXtime =
- time_t_to_timestamptz(((CheckPoint *) XLogRecGetData(record))->time);
- return true;
- }
- if (rmid == RM_XACT_ID && (xact_info == XLOG_XACT_COMMIT ||
- xact_info == XLOG_XACT_COMMIT_PREPARED))
- {
- *recordXtime = ((xl_xact_commit *) XLogRecGetData(record))->xact_time;
- return true;
- }
- if (rmid == RM_XACT_ID && (xact_info == XLOG_XACT_ABORT ||
- xact_info == XLOG_XACT_ABORT_PREPARED))
- {
- *recordXtime = ((xl_xact_abort *) XLogRecGetData(record))->xact_time;
- return true;
+ case RM_XLOG_ID:
+ info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+ switch (info)
+ {
+ case XLOG_RESTORE_POINT:
+ *recordXtime = ((xl_restore_point *) XLogRecGetData(record))->rp_time;
+ return true;
+ case XLOG_CHECKPOINT_ONLINE:
+ case XLOG_CHECKPOINT_SHUTDOWN:
+ *recordXtime =
+ time_t_to_timestamptz(((CheckPoint *) XLogRecGetData(record))->time);
+ return true;
+ }
+ break;
+
+ case RM_XACT_ID:
+ info = XLogRecGetInfo(record) & XLOG_XACT_OPMASK;
+
+ switch (info)
+ {
+ case XLOG_XACT_COMMIT:
+ case XLOG_XACT_COMMIT_PREPARED:
+ *recordXtime = ((xl_xact_commit *) XLogRecGetData(record))->xact_time;
+ return true;
+
+ case XLOG_XACT_ABORT:
+ case XLOG_XACT_ABORT_PREPARED:
+ *recordXtime = ((xl_xact_abort *) XLogRecGetData(record))->xact_time;
+ return true;
+ }
+ break;
}
return false;
}
At Tue, 28 Jan 2020 11:18:50 -0300, Alvaro Herrera <alvherre@2ndquadrant.com> wrote in
On 2020-Jan-27, Alvaro Herrera wrote:
Actually looking again, getRecordTimestamp is looking pretty strange.
It looks much more natural by using nested switch/case blocks, as with
this diff. I think the compiler does a better job this way too.I hadn't noticed I forgot to attach the diff here :-(
Yeay, that patch bases the apply-lag patch:) And contains
XLOG_CHECKPOINT_*. But otherwise looks good to me.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
At Tue, 28 Jan 2020 11:18:14 -0300, Alvaro Herrera <alvherre@2ndquadrant.com> wrote in
xlog.c:7329: (similar code exists at line 9332)
ereport(LOG,
(errmsg("redo done at %X/%X",
(uint32) (ReadRecPtr >> 32), (uint32) ReadRecPtr)));
xtime = GetLatestXTime();
if (xtime)
ereport(LOG,
(errmsg("last completed transaction was at log time %s",
timestamptz_to_str(xtime))));This code assumes (and the name GetLatestXTime() suggests, I first
noticed that here..) that the timestamp comes from commit/abort logs,
so otherwise it shows a wrong timestamp. We shouldn't update the
variable by other than that kind of records.Hmm, that's terrible. GetLatestXTime() being displayed user-visibly for
"last transaction completion" but having it include unrelated things
such as restore points is terrible. One idea is to should split it in
two: one exclusively for transaction commit/abort, and another for all
WAL activity. That way, the former can be used for that message, and
the latter for standby replay reports. However, that might be
overengineering, if the only thing that the former is that one LOG
message; instead changing the log message to state that the time is for
other activity, as you suggest, is simpler and has no downside that I
can see.
Perhaps we can use ControlData->checkPointCopy.time instead. It misses
checkpoint records intermittently but works in general.
But as more significant issue, nowadays PostgreSQL doesn't run a
checkpoint if it is really inactive (that is, if no "important" WAL
records have issued.).
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On 2020-Jan-29, Kyotaro Horiguchi wrote:
But as more significant issue, nowadays PostgreSQL doesn't run a
checkpoint if it is really inactive (that is, if no "important" WAL
records have issued.).
Yeah, I mentioned this in message
<20200127203419.GA15216@alvherre.pgsql>. The solution for monitoring
purposes is to use the new "lag" columns in pg_stat_replication. But
that's not available in older releases (prior to 10), so this change is
still useful.
--
�lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2020-Jan-28, Kyotaro Horiguchi wrote:
The aim of the patch seem reasonable. XLOG_END_OF_RECOVERY and
XLOG_XACT_PREPARE also have a timestamp but it doesn't help much. (But
could be included for consistency.)
Hmm I think I should definitely include those.
The timestamp of a checkpoint record is the start time of a checkpoint
(and doesn't have subseconds part, but it's not a problem.). That
means that the timestamp is usually far behind the time at the record
has been inserted. As the result the stored timestamp can go back by a
significant internval. I don't think that causes an actual problem but
the movement looks wierd as the result of
pg_last_xact_replay_timestamp().
A problem I see with this is that setting the timestamp is done with
XLogCtl->lock held; since now we need to make the update conditional,
we'd have to read the current value, compare with the checkpoint time,
then set, all with the spinlock held. That seems way too expensive.
A compromise might be to do the compare only when it's done for
checkpoint. These occur seldom enough that it shouldn't be a problem
(as opposed to commit records, which can be very frequent).
Asides from the backward movement, a timestamp from other than
commit/abort records in recvoeryLastXTime affects the following code.xlog.c:7329: (similar code exists at line 9332)
ereport(LOG,
(errmsg("redo done at %X/%X",
(uint32) (ReadRecPtr >> 32), (uint32) ReadRecPtr)));
xtime = GetLatestXTime();
if (xtime)
ereport(LOG,
(errmsg("last completed transaction was at log time %s",
timestamptz_to_str(xtime))));This code assumes (and the name GetLatestXTime() suggests, I first
noticed that here..) that the timestamp comes from commit/abort logs,
so otherwise it shows a wrong timestamp. We shouldn't update the
variable by other than that kind of records.
Thinking about this some more, I think we should do keep the message the
same backbranches (avoid breaking anything that might be reading the log
-- a remote but not inexistent possibility), and adjust the message in
master to be something like "last timestamped WAL activity at time %s",
and document that it means commit, abort, restore label, checkpoint.
--
�lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
At Wed, 29 Jan 2020 19:11:31 -0300, Alvaro Herrera <alvherre@2ndquadrant.com> wrote in
On 2020-Jan-28, Kyotaro Horiguchi wrote:
The aim of the patch seem reasonable. XLOG_END_OF_RECOVERY and
XLOG_XACT_PREPARE also have a timestamp but it doesn't help much. (But
could be included for consistency.)Hmm I think I should definitely include those.
I agree to that, given the following change of log messages.
The timestamp of a checkpoint record is the start time of a checkpoint
(and doesn't have subseconds part, but it's not a problem.). That
means that the timestamp is usually far behind the time at the record
has been inserted. As the result the stored timestamp can go back by a
significant internval. I don't think that causes an actual problem but
the movement looks wierd as the result of
pg_last_xact_replay_timestamp().A problem I see with this is that setting the timestamp is done with
XLogCtl->lock held; since now we need to make the update conditional,
we'd have to read the current value, compare with the checkpoint time,
then set, all with the spinlock held. That seems way too expensive.A compromise might be to do the compare only when it's done for
checkpoint. These occur seldom enough that it shouldn't be a problem
(as opposed to commit records, which can be very frequent).
I think we don't need to that, given the following change.
Asides from the backward movement, a timestamp from other than
commit/abort records in recvoeryLastXTime affects the following code.xlog.c:7329: (similar code exists at line 9332)
ereport(LOG,
(errmsg("redo done at %X/%X",
(uint32) (ReadRecPtr >> 32), (uint32) ReadRecPtr)));
xtime = GetLatestXTime();
if (xtime)
ereport(LOG,
(errmsg("last completed transaction was at log time %s",
timestamptz_to_str(xtime))));This code assumes (and the name GetLatestXTime() suggests, I first
noticed that here..) that the timestamp comes from commit/abort logs,
so otherwise it shows a wrong timestamp. We shouldn't update the
variable by other than that kind of records.Thinking about this some more, I think we should do keep the message the
same backbranches (avoid breaking anything that might be reading the log
-- a remote but not inexistent possibility), and adjust the message in
master to be something like "last timestamped WAL activity at time %s",
and document that it means commit, abort, restore label, checkpoint.
Agreed about backbranches. I'd like to preserve the word "transaction"
as it is more familiar to users. How about something like the follows?
"transactions are completed up to log time %s"
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On 2020-Jan-30, Kyotaro Horiguchi wrote:
Agreed about backbranches. I'd like to preserve the word "transaction"
as it is more familiar to users. How about something like the follows?"transactions are completed up to log time %s"
That's a good point. I used the phrase "transaction activity", which
seems sufficiently explicit to me.
So, the attached is the one for master; in back branches I would use the
same (plus minor conflict fixes), except that I would drop the message
wording changes.
Thanks for the reviews so far,
--
�lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments:
v3-0001-Use-CheckPoint-time-to-update-latest-recovery-tim.patchtext/x-diff; charset=iso-8859-1Download
From 003e56aca7cbcbcae144c82ea2a62855abe4b8d2 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: Thu, 30 Jan 2020 17:39:25 -0300
Subject: [PATCH v3] Use CheckPoint->time to update latest recovery timestamp
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Previously, only timestamps carried by transaction commit/abort and
restore labels were used, but there's no reason to ignore the ones in
checkpoint, end-of-recovery and transaction prepare records.
Also, rearrange things so that the timestamp is updated by StartupXLOG's
loop rather than recoveryStopsAfter, which was an odd choice.
Adjust LOG message wording to make it clear that it's not just
transaction commit/abort. It wasn't true before, but after this commit
it's even less so.
Author: Ãlvaro Herrera
Reviewed-by: Kyotaro Horiguchi
Discussion: https://postgr.es/m/20200110140828.GA6228@alvherre.pgsql
---
src/backend/access/transam/xlog.c | 132 ++++++++++++++++++++++--------
1 file changed, 96 insertions(+), 36 deletions(-)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 3813eadfb4..ca38cb17d5 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -863,6 +863,7 @@ static bool recoveryStopsAfter(XLogReaderState *record);
static void recoveryPausesHere(void);
static bool recoveryApplyDelay(XLogReaderState *record);
static void SetLatestXTime(TimestampTz xtime);
+static void SetLatestXTimeConditional(TimestampTz xtime);
static void SetCurrentChunkStartTime(TimestampTz xtime);
static void CheckRequiredParameterValues(void);
static void XLogReportParameters(void);
@@ -5560,32 +5561,66 @@ exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog)
*
* If the record contains a timestamp, returns true, and saves the timestamp
* in *recordXtime. If the record type has no timestamp, returns false.
- * Currently, only transaction commit/abort records and restore points contain
- * timestamps.
+ * Currently, only transaction prepare/commit/abort records, restore points,
+ * end-of-recovery and checkpoints contain timestamps.
+ *
+ * If canGoBack is not NULL, it's set to true if the timestamp can have a time
+ * far in the past. This currently only happens for checkpoint records
+ * (because the timestamp they carry is that of the start of the checkpoint.)
*/
static bool
-getRecordTimestamp(XLogReaderState *record, TimestampTz *recordXtime)
+getRecordTimestamp(XLogReaderState *record, TimestampTz *recordXtime,
+ bool *canGoBack)
{
- uint8 info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
- uint8 xact_info = info & XLOG_XACT_OPMASK;
- uint8 rmid = XLogRecGetRmid(record);
+ RmgrId rmid = XLogRecGetRmid(record);
+ uint8 info;
- if (rmid == RM_XLOG_ID && info == XLOG_RESTORE_POINT)
+ if (canGoBack)
+ *canGoBack = false;
+
+ switch (rmid)
{
- *recordXtime = ((xl_restore_point *) XLogRecGetData(record))->rp_time;
- return true;
- }
- if (rmid == RM_XACT_ID && (xact_info == XLOG_XACT_COMMIT ||
- xact_info == XLOG_XACT_COMMIT_PREPARED))
- {
- *recordXtime = ((xl_xact_commit *) XLogRecGetData(record))->xact_time;
- return true;
- }
- if (rmid == RM_XACT_ID && (xact_info == XLOG_XACT_ABORT ||
- xact_info == XLOG_XACT_ABORT_PREPARED))
- {
- *recordXtime = ((xl_xact_abort *) XLogRecGetData(record))->xact_time;
- return true;
+ case RM_XLOG_ID:
+ info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+ switch (info)
+ {
+ case XLOG_RESTORE_POINT:
+ *recordXtime = ((xl_restore_point *) XLogRecGetData(record))->rp_time;
+ return true;
+ case XLOG_END_OF_RECOVERY:
+ *recordXtime = ((xl_end_of_recovery *) XLogRecGetData(record))->end_time;
+ return true;
+ case XLOG_CHECKPOINT_ONLINE:
+ case XLOG_CHECKPOINT_SHUTDOWN:
+ if (canGoBack)
+ *canGoBack = true;
+ *recordXtime =
+ time_t_to_timestamptz(((CheckPoint *) XLogRecGetData(record))->time);
+ return true;
+ }
+ break;
+
+ case RM_XACT_ID:
+ info = XLogRecGetInfo(record) & XLOG_XACT_OPMASK;
+
+ switch (info)
+ {
+ case XLOG_XACT_PREPARE:
+ *recordXtime = ((xl_xact_prepare *) XLogRecGetData(record))->prepared_at;
+ return true;
+
+ case XLOG_XACT_COMMIT:
+ case XLOG_XACT_COMMIT_PREPARED:
+ *recordXtime = ((xl_xact_commit *) XLogRecGetData(record))->xact_time;
+ return true;
+
+ case XLOG_XACT_ABORT:
+ case XLOG_XACT_ABORT_PREPARED:
+ *recordXtime = ((xl_xact_abort *) XLogRecGetData(record))->xact_time;
+ return true;
+ }
+ break;
}
return false;
}
@@ -5701,7 +5736,7 @@ recoveryStopsBefore(XLogReaderState *record)
}
if (recoveryTarget == RECOVERY_TARGET_TIME &&
- getRecordTimestamp(record, &recordXtime))
+ getRecordTimestamp(record, &recordXtime, NULL))
{
/*
* There can be many transactions that share the same commit time, so
@@ -5743,9 +5778,6 @@ recoveryStopsBefore(XLogReaderState *record)
/*
* Same as recoveryStopsBefore, but called after applying the record.
- *
- * We also track the timestamp of the latest applied COMMIT/ABORT
- * record in XLogCtl->recoveryLastXTime.
*/
static bool
recoveryStopsAfter(XLogReaderState *record)
@@ -5753,7 +5785,6 @@ recoveryStopsAfter(XLogReaderState *record)
uint8 info;
uint8 xact_info;
uint8 rmid;
- TimestampTz recordXtime;
/*
* Ignore recovery target settings when not in archive recovery (meaning
@@ -5781,7 +5812,7 @@ recoveryStopsAfter(XLogReaderState *record)
recoveryStopAfter = true;
recoveryStopXid = InvalidTransactionId;
recoveryStopLSN = InvalidXLogRecPtr;
- (void) getRecordTimestamp(record, &recoveryStopTime);
+ (void) getRecordTimestamp(record, &recoveryStopTime, NULL);
strlcpy(recoveryStopName, recordRestorePointData->rp_name, MAXFNAMELEN);
ereport(LOG,
@@ -5821,10 +5852,6 @@ recoveryStopsAfter(XLogReaderState *record)
{
TransactionId recordXid;
- /* Update the last applied transaction timestamp */
- if (getRecordTimestamp(record, &recordXtime))
- SetLatestXTime(recordXtime);
-
/* Extract the XID of the committed/aborted transaction */
if (xact_info == XLOG_XACT_COMMIT_PREPARED)
{
@@ -5863,7 +5890,7 @@ recoveryStopsAfter(XLogReaderState *record)
{
recoveryStopAfter = true;
recoveryStopXid = recordXid;
- recoveryStopTime = recordXtime;
+ getRecordTimestamp(record, &recoveryStopTime, NULL);
recoveryStopLSN = InvalidXLogRecPtr;
recoveryStopName[0] = '\0';
@@ -6000,7 +6027,7 @@ recoveryApplyDelay(XLogReaderState *record)
xact_info != XLOG_XACT_COMMIT_PREPARED)
return false;
- if (!getRecordTimestamp(record, &xtime))
+ if (!getRecordTimestamp(record, &xtime, NULL))
return false;
delayUntil = TimestampTzPlusMilliseconds(xtime, recovery_min_apply_delay);
@@ -6047,6 +6074,26 @@ recoveryApplyDelay(XLogReaderState *record)
return true;
}
+/*
+ * Track the timestamp of the latest applied time-bearing WAL record in
+ * XLogCtl->recoveryLastXTime, being careful with records whose timestamps
+ * can be far in the past.
+ */
+static void
+updateRecoveryXTime(XLogReaderState *record)
+{
+ TimestampTz recordXtime;
+ bool canGoBack;
+
+ if (getRecordTimestamp(record, &recordXtime, &canGoBack))
+ {
+ if (canGoBack)
+ SetLatestXTimeConditional(recordXtime);
+ else
+ SetLatestXTime(recordXtime);
+ }
+}
+
/*
* Save timestamp of latest processed commit/abort record.
*
@@ -6062,8 +6109,18 @@ SetLatestXTime(TimestampTz xtime)
SpinLockRelease(&XLogCtl->info_lck);
}
+/* as above, but do nothing if timestamp is in the past */
+static void
+SetLatestXTimeConditional(TimestampTz xtime)
+{
+ SpinLockAcquire(&XLogCtl->info_lck);
+ if (xtime > XLogCtl->recoveryLastXTime)
+ XLogCtl->recoveryLastXTime = xtime;
+ SpinLockRelease(&XLogCtl->info_lck);
+}
+
/*
- * Fetch timestamp of latest processed commit/abort record.
+ * Fetch timestamp of latest processed timestamped record.
*/
TimestampTz
GetLatestXTime(void)
@@ -7255,6 +7312,9 @@ StartupXLOG(void)
WalSndWakeup();
}
+ /* update last recovery time */
+ updateRecoveryXTime(xlogreader);
+
/* Exit loop if we reached inclusive recovery target */
if (recoveryStopsAfter(xlogreader))
{
@@ -7317,7 +7377,7 @@ StartupXLOG(void)
xtime = GetLatestXTime();
if (xtime)
ereport(LOG,
- (errmsg("last completed transaction was at log time %s",
+ (errmsg("last transaction activity was at log time %s",
timestamptz_to_str(xtime))));
InRedo = false;
@@ -9317,7 +9377,7 @@ CreateRestartPoint(int flags)
ereport((log_checkpoints ? LOG : DEBUG2),
(errmsg("recovery restart point at %X/%X",
(uint32) (lastCheckPoint.redo >> 32), (uint32) lastCheckPoint.redo),
- xtime ? errdetail("Last completed transaction was at log time %s.",
+ xtime ? errdetail("Last transaction activity was at log time %s.",
timestamptz_to_str(xtime)) : 0));
LWLockRelease(CheckpointLock);
--
2.20.1
At Thu, 30 Jan 2020 17:45:36 -0300, Alvaro Herrera <alvherre@2ndquadrant.com> wrote in
On 2020-Jan-30, Kyotaro Horiguchi wrote:
Agreed about backbranches. I'd like to preserve the word "transaction"
as it is more familiar to users. How about something like the follows?"transactions are completed up to log time %s"
That's a good point. I used the phrase "transaction activity", which
seems sufficiently explicit to me.So, the attached is the one for master; in back branches I would use the
same (plus minor conflict fixes), except that I would drop the message
wording changes.Thanks for the reviews so far,
My pleasure.
regads.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On 2020/01/31 5:45, Alvaro Herrera wrote:
On 2020-Jan-30, Kyotaro Horiguchi wrote:
Agreed about backbranches. I'd like to preserve the word "transaction"
as it is more familiar to users. How about something like the follows?"transactions are completed up to log time %s"
That's a good point. I used the phrase "transaction activity", which
seems sufficiently explicit to me.So, the attached is the one for master; in back branches I would use the
same (plus minor conflict fixes), except that I would drop the message
wording changes.
You're thinking to apply this change to the back branches? Sorry
if my understanding is not right. But I don't think that back-patch
is ok because it changes the documented existing behavior
of pg_last_xact_replay_timestamp(). So it looks like the behavior
change not a bug fix.
Regards,
--
Fujii Masao
NTT DATA CORPORATION
Advanced Platform Technology Group
Research and Development Headquarters
On 2020-Jan-31, Fujii Masao wrote:
You're thinking to apply this change to the back branches? Sorry
if my understanding is not right. But I don't think that back-patch
is ok because it changes the documented existing behavior
of pg_last_xact_replay_timestamp(). So it looks like the behavior
change not a bug fix.
Yeah, I am thinking in backpatching it. The documented behavior is
already not what the code does. Do you have a situation where this
change would break something? If so, can you please explain what it is?
I think (and I said it upthread) a 100% complete fix involves tracking
two timestamps rather than one. I was thinking that that would be too
invasive because it changes XLogCtlData shmem struct ... but that struct
is private to xlog.c, so I think it's fine to change the struct. The
problem though is that the user-visible change that I want to achieve is
pg_last_xact_replay_timestamp(), and it would be obviously wrong to use
the new XLogCtlData field rather than the existing one, as that would be
a behavior change in the same sense that you're now complaining about.
So I would achieve nothing.
--
�lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2020/01/31 22:40, Alvaro Herrera wrote:
On 2020-Jan-31, Fujii Masao wrote:
You're thinking to apply this change to the back branches? Sorry
if my understanding is not right. But I don't think that back-patch
is ok because it changes the documented existing behavior
of pg_last_xact_replay_timestamp(). So it looks like the behavior
change not a bug fix.Yeah, I am thinking in backpatching it. The documented behavior is
already not what the code does.
Maybe you thought this because getRecordTimestamp() extracts the
timestamp from even WAL record of a restore point? That is, you're
concerned about that pg_last_xact_replay_timestamp() returns the
timestamp of not only commit/abort record but also restore point one.
Right?
As far as I read the code, this problem doesn't occur because
SetLatestXTime() is called only for commit/abort records, in
recoveryStopsAfter(). No?
Do you have a situation where this
change would break something? If so, can you please explain what it is?
For example, use the return value of pg_last_xact_replay_timestamp()
(and also the timestamp in the log message output at the end of
recovery) as a HINT when setting recovery_target_time later.
Use it to compare with the timestamp retrieved from the master server,
in order to monitor the replication delay.
Regards,
--
Fujii Masao
NTT DATA CORPORATION
Advanced Platform Technology Group
Research and Development Headquarters
On 2020-Jan-31, Fujii Masao wrote:
On 2020/01/31 22:40, Alvaro Herrera wrote:
On 2020-Jan-31, Fujii Masao wrote:
You're thinking to apply this change to the back branches? Sorry
if my understanding is not right. But I don't think that back-patch
is ok because it changes the documented existing behavior
of pg_last_xact_replay_timestamp(). So it looks like the behavior
change not a bug fix.Yeah, I am thinking in backpatching it. The documented behavior is
already not what the code does.Maybe you thought this because getRecordTimestamp() extracts the
timestamp from even WAL record of a restore point? That is, you're
concerned about that pg_last_xact_replay_timestamp() returns the
timestamp of not only commit/abort record but also restore point one.
Right?
right.
As far as I read the code, this problem doesn't occur because
SetLatestXTime() is called only for commit/abort records, in
recoveryStopsAfter(). No?
... uh, wow, you're right about that too. IMO this is extremely
fragile, easy to break, and under-documented. But you're right, there's
no bug there at present.
Do you have a situation where this
change would break something? If so, can you please explain what it is?For example, use the return value of pg_last_xact_replay_timestamp()
(and also the timestamp in the log message output at the end of
recovery) as a HINT when setting recovery_target_time later.
Hmm.
I'm not sure how you would use it in that way. I mean, I understand how
it *can* be used that way, but it seems too fragile to be done in
practice, in a scenario that's not just laboratory games.
Use it to compare with the timestamp retrieved from the master server,
in order to monitor the replication delay.
That's precisely the use case that I'm aiming at. The timestamp
currently is not useful because this usage breaks when the primary is
inactive (no COMMIT records occur). During such periods of inactivity,
CHECKPOINT records would keep the "last xtime" current. This has
actually happened in a production setting, it's not a thought
experiment.
--
�lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2020/01/31 23:47, Alvaro Herrera wrote:
On 2020-Jan-31, Fujii Masao wrote:
On 2020/01/31 22:40, Alvaro Herrera wrote:
On 2020-Jan-31, Fujii Masao wrote:
You're thinking to apply this change to the back branches? Sorry
if my understanding is not right. But I don't think that back-patch
is ok because it changes the documented existing behavior
of pg_last_xact_replay_timestamp(). So it looks like the behavior
change not a bug fix.Yeah, I am thinking in backpatching it. The documented behavior is
already not what the code does.Maybe you thought this because getRecordTimestamp() extracts the
timestamp from even WAL record of a restore point? That is, you're
concerned about that pg_last_xact_replay_timestamp() returns the
timestamp of not only commit/abort record but also restore point one.
Right?right.
As far as I read the code, this problem doesn't occur because
SetLatestXTime() is called only for commit/abort records, in
recoveryStopsAfter(). No?... uh, wow, you're right about that too. IMO this is extremely
fragile, easy to break, and under-documented.
Yeah, it's worth improving the code.
But you're right, there's
no bug there at present.Do you have a situation where this
change would break something? If so, can you please explain what it is?For example, use the return value of pg_last_xact_replay_timestamp()
(and also the timestamp in the log message output at the end of
recovery) as a HINT when setting recovery_target_time later.Hmm.
I'm not sure how you would use it in that way. I mean, I understand how
it *can* be used that way, but it seems too fragile to be done in
practice, in a scenario that's not just laboratory games.Use it to compare with the timestamp retrieved from the master server,
in order to monitor the replication delay.That's precisely the use case that I'm aiming at. The timestamp
currently is not useful because this usage breaks when the primary is
inactive (no COMMIT records occur). During such periods of inactivity,
CHECKPOINT records would keep the "last xtime" current. This has
actually happened in a production setting, it's not a thought
experiment.
I've heard that someone periodically generates dummy tiny
transactions (say, every minute), as a band-aid solution,
to avoid inactive primary. Of course, this is not a perfect solution.
The idea that I proposed previously was to introduce
pg_last_xact_insert_timestamp() [1]/messages/by-id/CAHGQGwF3ZjfuNEj5ka683KU5rQUBtSWtqFq7g1X0g34o+JXWBw@mail.gmail.com into core. This function returns
the timestamp of commit / abort records in *primary* side.
So we can retrieve that timestamp from the primary (e.g., by using dblink)
and compare its result with pg_last_xact_replay_timestamp() to
calculate the delay in the standby.
Another idea is to include the commit / abort timestamp in
primary-keepalive-message that periodially sent from the primary
to the standby. Then if we introduce the function returning
that timestamp, in the standby side, we can easily compare
the commit / abort timestamps taken from both primary and
standby, in the standby.
[1]: /messages/by-id/CAHGQGwF3ZjfuNEj5ka683KU5rQUBtSWtqFq7g1X0g34o+JXWBw@mail.gmail.com
Regards,
--
Fujii Masao
NTT DATA CORPORATION
Advanced Platform Technology Group
Research and Development Headquarters