standby apply lag on inactive servers
A customer of ours complained that if you have an inactive primary,
monitoring the apply lag on a standby reports monotonically increasing
lag. The reason for this is that the apply lag is only updated on
COMMIT records, which of course don't occur in inactive servers.
But CHECKPOINT records do occur, so the WAL insert pointer continues to
move forward, which is what causes the spurious lag.
(I think newer releases are protected from this problem because they
don't emit checkpoints during periods of inactivity. I didn't verify
this.)
This patch fixes the problem by using the checkpoint timestamp to update
the lag tracker in the standby. This requires a little change in where
this update is invoked, because previously it was done only for the XACT
rmgr; this makes the patch a little bigger than it should.
--
�lvaro Herrera PostgreSQL Expert, https://www.2ndQuadrant.com/
Attachments:
0001-Use-checkpoint-timestamp-to-update-apply-lag-tracker.patchtext/x-diff; charset=us-asciiDownload+26-12
On 2020-Jan-10, Alvaro Herrera wrote:
A customer of ours complained that if you have an inactive primary,
monitoring the apply lag on a standby reports monotonically increasing
lag. The reason for this is that the apply lag is only updated on
COMMIT records, which of course don't occur in inactive servers.
But CHECKPOINT records do occur, so the WAL insert pointer continues to
move forward, which is what causes the spurious lag.(I think newer releases are protected from this problem because they
don't emit checkpoints during periods of inactivity. I didn't verify
this.)This patch fixes the problem by using the checkpoint timestamp to update
the lag tracker in the standby. This requires a little change in where
this update is invoked, because previously it was done only for the XACT
rmgr; this makes the patch a little bigger than it should.
Here's a version of the patch that applies to current master. It does
fix the problem that CHECKPOINT wal records are not considered when
determining time-of-latest-record.
However, it does *not* fix the monitoring problem I mentioned (which
relied on comparing pg_last_xact_replay_timestamp() to 'now()') ...
because commit 6ef2eba3f57f (pg10) made an idle server not emit
checkpoint records anymore. That is, my parenthical remark was
completely wrong: the new versions not only are "protected", but also
this fix doesn't fix them. Luckily, the way to fix monitoring for
servers of versions 10 and later is to use the new replay_lag (etc)
columns in pg_stat_replication, commit 6912acc04f0b (also pg10).
I am inclined to apply this to all branches unless there are strong
objections, because the current code seems pretty arbitrary anyway.
--
�lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments:
v2-0001-Use-CheckPoint-time-to-update-latest-recovery-tim.patchtext/x-diff; charset=iso-8859-1Download+26-12
Actually looking again, getRecordTimestamp is looking pretty strange.
It looks much more natural by using nested switch/case blocks, as with
this diff. I think the compiler does a better job this way too.
--
�lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hello.
At Mon, 27 Jan 2020 18:06:27 -0300, Alvaro Herrera <alvherre@2ndquadrant.com> wrote in
Actually looking again, getRecordTimestamp is looking pretty strange.
It looks much more natural by using nested switch/case blocks, as with
this diff. I think the compiler does a better job this way too.
Agreed. Anyway I looked the latest version.
The aim of the patch seem reasonable. XLOG_END_OF_RECOVERY and
XLOG_XACT_PREPARE also have a timestamp but it doesn't help much. (But
could be included for consistency.)
The timestamp of a checkpoint record is the start time of a checkpoint
(and doesn't have subseconds part, but it's not a problem.). That
means that the timestamp is usually far behind the time at the record
has been inserted. As the result the stored timestamp can go back by a
significant internval. I don't think that causes an actual problem but
the movement looks wierd as the result of
pg_last_xact_replay_timestamp().
Asides from the backward movement, a timestamp from other than
commit/abort records in recvoeryLastXTime affects the following code.
xlog.c:7329: (similar code exists at line 9332)
ereport(LOG,
(errmsg("redo done at %X/%X",
(uint32) (ReadRecPtr >> 32), (uint32) ReadRecPtr)));
xtime = GetLatestXTime();
if (xtime)
ereport(LOG,
(errmsg("last completed transaction was at log time %s",
timestamptz_to_str(xtime))));
This code assumes (and the name GetLatestXTime() suggests, I first
noticed that here..) that the timestamp comes from commit/abort logs,
so otherwise it shows a wrong timestamp. We shouldn't update the
variable by other than that kind of records.
If (I don't think that comes true..) we set the timestamp from other
than that kind of record, the names and the comments of the functions
should be changed.
/*
* Save timestamp of latest processed commit/abort record.
*
* We keep this in XLogCtl, not a simple static variable, so that it can be
* seen by processes other than the startup process. Note in particular
* that CreateRestartPoint is executed in the checkpointer.
*/
static void
SetLatestXTime(TimestampTz xtime)
...
/*
* Fetch timestamp of latest processed commit/abort record.
*/
TimestampTz
GetLatestXTime(void)
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On 2020-Jan-28, Kyotaro Horiguchi wrote:
At Mon, 27 Jan 2020 18:06:27 -0300, Alvaro Herrera <alvherre@2ndquadrant.com> wrote in
Actually looking again, getRecordTimestamp is looking pretty strange.
It looks much more natural by using nested switch/case blocks, as with
this diff. I think the compiler does a better job this way too.Agreed. Anyway I looked the latest version.
The aim of the patch seem reasonable. XLOG_END_OF_RECOVERY and
XLOG_XACT_PREPARE also have a timestamp but it doesn't help much. (But
could be included for consistency.)
Hello, thanks for looking.
The timestamp of a checkpoint record is the start time of a checkpoint
(and doesn't have subseconds part, but it's not a problem.). That
means that the timestamp is usually far behind the time at the record
has been inserted. As the result the stored timestamp can go back by a
significant internval. I don't think that causes an actual problem but
the movement looks wierd as the result of
pg_last_xact_replay_timestamp().
Ouch ... yeah, it should be set only if it doesn't go backwards.
xlog.c:7329: (similar code exists at line 9332)
ereport(LOG,
(errmsg("redo done at %X/%X",
(uint32) (ReadRecPtr >> 32), (uint32) ReadRecPtr)));
xtime = GetLatestXTime();
if (xtime)
ereport(LOG,
(errmsg("last completed transaction was at log time %s",
timestamptz_to_str(xtime))));This code assumes (and the name GetLatestXTime() suggests, I first
noticed that here..) that the timestamp comes from commit/abort logs,
so otherwise it shows a wrong timestamp. We shouldn't update the
variable by other than that kind of records.
Hmm, that's terrible. GetLatestXTime() being displayed user-visibly for
"last transaction completion" but having it include unrelated things
such as restore points is terrible. One idea is to should split it in
two: one exclusively for transaction commit/abort, and another for all
WAL activity. That way, the former can be used for that message, and
the latter for standby replay reports. However, that might be
overengineering, if the only thing that the former is that one LOG
message; instead changing the log message to state that the time is for
other activity, as you suggest, is simpler and has no downside that I
can see.
--
�lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2020-Jan-27, Alvaro Herrera wrote:
Actually looking again, getRecordTimestamp is looking pretty strange.
It looks much more natural by using nested switch/case blocks, as with
this diff. I think the compiler does a better job this way too.
I hadn't noticed I forgot to attach the diff here :-(
--
�lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments:
rewrite-getRecordTimestamp.patchtext/x-diff; charset=us-asciiDownload+35-25
At Tue, 28 Jan 2020 11:18:50 -0300, Alvaro Herrera <alvherre@2ndquadrant.com> wrote in
On 2020-Jan-27, Alvaro Herrera wrote:
Actually looking again, getRecordTimestamp is looking pretty strange.
It looks much more natural by using nested switch/case blocks, as with
this diff. I think the compiler does a better job this way too.I hadn't noticed I forgot to attach the diff here :-(
Yeay, that patch bases the apply-lag patch:) And contains
XLOG_CHECKPOINT_*. But otherwise looks good to me.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
At Tue, 28 Jan 2020 11:18:14 -0300, Alvaro Herrera <alvherre@2ndquadrant.com> wrote in
xlog.c:7329: (similar code exists at line 9332)
ereport(LOG,
(errmsg("redo done at %X/%X",
(uint32) (ReadRecPtr >> 32), (uint32) ReadRecPtr)));
xtime = GetLatestXTime();
if (xtime)
ereport(LOG,
(errmsg("last completed transaction was at log time %s",
timestamptz_to_str(xtime))));This code assumes (and the name GetLatestXTime() suggests, I first
noticed that here..) that the timestamp comes from commit/abort logs,
so otherwise it shows a wrong timestamp. We shouldn't update the
variable by other than that kind of records.Hmm, that's terrible. GetLatestXTime() being displayed user-visibly for
"last transaction completion" but having it include unrelated things
such as restore points is terrible. One idea is to should split it in
two: one exclusively for transaction commit/abort, and another for all
WAL activity. That way, the former can be used for that message, and
the latter for standby replay reports. However, that might be
overengineering, if the only thing that the former is that one LOG
message; instead changing the log message to state that the time is for
other activity, as you suggest, is simpler and has no downside that I
can see.
Perhaps we can use ControlData->checkPointCopy.time instead. It misses
checkpoint records intermittently but works in general.
But as more significant issue, nowadays PostgreSQL doesn't run a
checkpoint if it is really inactive (that is, if no "important" WAL
records have issued.).
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On 2020-Jan-29, Kyotaro Horiguchi wrote:
But as more significant issue, nowadays PostgreSQL doesn't run a
checkpoint if it is really inactive (that is, if no "important" WAL
records have issued.).
Yeah, I mentioned this in message
<20200127203419.GA15216@alvherre.pgsql>. The solution for monitoring
purposes is to use the new "lag" columns in pg_stat_replication. But
that's not available in older releases (prior to 10), so this change is
still useful.
--
�lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2020-Jan-28, Kyotaro Horiguchi wrote:
The aim of the patch seem reasonable. XLOG_END_OF_RECOVERY and
XLOG_XACT_PREPARE also have a timestamp but it doesn't help much. (But
could be included for consistency.)
Hmm I think I should definitely include those.
The timestamp of a checkpoint record is the start time of a checkpoint
(and doesn't have subseconds part, but it's not a problem.). That
means that the timestamp is usually far behind the time at the record
has been inserted. As the result the stored timestamp can go back by a
significant internval. I don't think that causes an actual problem but
the movement looks wierd as the result of
pg_last_xact_replay_timestamp().
A problem I see with this is that setting the timestamp is done with
XLogCtl->lock held; since now we need to make the update conditional,
we'd have to read the current value, compare with the checkpoint time,
then set, all with the spinlock held. That seems way too expensive.
A compromise might be to do the compare only when it's done for
checkpoint. These occur seldom enough that it shouldn't be a problem
(as opposed to commit records, which can be very frequent).
Asides from the backward movement, a timestamp from other than
commit/abort records in recvoeryLastXTime affects the following code.xlog.c:7329: (similar code exists at line 9332)
ereport(LOG,
(errmsg("redo done at %X/%X",
(uint32) (ReadRecPtr >> 32), (uint32) ReadRecPtr)));
xtime = GetLatestXTime();
if (xtime)
ereport(LOG,
(errmsg("last completed transaction was at log time %s",
timestamptz_to_str(xtime))));This code assumes (and the name GetLatestXTime() suggests, I first
noticed that here..) that the timestamp comes from commit/abort logs,
so otherwise it shows a wrong timestamp. We shouldn't update the
variable by other than that kind of records.
Thinking about this some more, I think we should do keep the message the
same backbranches (avoid breaking anything that might be reading the log
-- a remote but not inexistent possibility), and adjust the message in
master to be something like "last timestamped WAL activity at time %s",
and document that it means commit, abort, restore label, checkpoint.
--
�lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
At Wed, 29 Jan 2020 19:11:31 -0300, Alvaro Herrera <alvherre@2ndquadrant.com> wrote in
On 2020-Jan-28, Kyotaro Horiguchi wrote:
The aim of the patch seem reasonable. XLOG_END_OF_RECOVERY and
XLOG_XACT_PREPARE also have a timestamp but it doesn't help much. (But
could be included for consistency.)Hmm I think I should definitely include those.
I agree to that, given the following change of log messages.
The timestamp of a checkpoint record is the start time of a checkpoint
(and doesn't have subseconds part, but it's not a problem.). That
means that the timestamp is usually far behind the time at the record
has been inserted. As the result the stored timestamp can go back by a
significant internval. I don't think that causes an actual problem but
the movement looks wierd as the result of
pg_last_xact_replay_timestamp().A problem I see with this is that setting the timestamp is done with
XLogCtl->lock held; since now we need to make the update conditional,
we'd have to read the current value, compare with the checkpoint time,
then set, all with the spinlock held. That seems way too expensive.A compromise might be to do the compare only when it's done for
checkpoint. These occur seldom enough that it shouldn't be a problem
(as opposed to commit records, which can be very frequent).
I think we don't need to that, given the following change.
Asides from the backward movement, a timestamp from other than
commit/abort records in recvoeryLastXTime affects the following code.xlog.c:7329: (similar code exists at line 9332)
ereport(LOG,
(errmsg("redo done at %X/%X",
(uint32) (ReadRecPtr >> 32), (uint32) ReadRecPtr)));
xtime = GetLatestXTime();
if (xtime)
ereport(LOG,
(errmsg("last completed transaction was at log time %s",
timestamptz_to_str(xtime))));This code assumes (and the name GetLatestXTime() suggests, I first
noticed that here..) that the timestamp comes from commit/abort logs,
so otherwise it shows a wrong timestamp. We shouldn't update the
variable by other than that kind of records.Thinking about this some more, I think we should do keep the message the
same backbranches (avoid breaking anything that might be reading the log
-- a remote but not inexistent possibility), and adjust the message in
master to be something like "last timestamped WAL activity at time %s",
and document that it means commit, abort, restore label, checkpoint.
Agreed about backbranches. I'd like to preserve the word "transaction"
as it is more familiar to users. How about something like the follows?
"transactions are completed up to log time %s"
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On 2020-Jan-30, Kyotaro Horiguchi wrote:
Agreed about backbranches. I'd like to preserve the word "transaction"
as it is more familiar to users. How about something like the follows?"transactions are completed up to log time %s"
That's a good point. I used the phrase "transaction activity", which
seems sufficiently explicit to me.
So, the attached is the one for master; in back branches I would use the
same (plus minor conflict fixes), except that I would drop the message
wording changes.
Thanks for the reviews so far,
--
�lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments:
v3-0001-Use-CheckPoint-time-to-update-latest-recovery-tim.patchtext/x-diff; charset=iso-8859-1Download+96-37
At Thu, 30 Jan 2020 17:45:36 -0300, Alvaro Herrera <alvherre@2ndquadrant.com> wrote in
On 2020-Jan-30, Kyotaro Horiguchi wrote:
Agreed about backbranches. I'd like to preserve the word "transaction"
as it is more familiar to users. How about something like the follows?"transactions are completed up to log time %s"
That's a good point. I used the phrase "transaction activity", which
seems sufficiently explicit to me.So, the attached is the one for master; in back branches I would use the
same (plus minor conflict fixes), except that I would drop the message
wording changes.Thanks for the reviews so far,
My pleasure.
regads.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On 2020/01/31 5:45, Alvaro Herrera wrote:
On 2020-Jan-30, Kyotaro Horiguchi wrote:
Agreed about backbranches. I'd like to preserve the word "transaction"
as it is more familiar to users. How about something like the follows?"transactions are completed up to log time %s"
That's a good point. I used the phrase "transaction activity", which
seems sufficiently explicit to me.So, the attached is the one for master; in back branches I would use the
same (plus minor conflict fixes), except that I would drop the message
wording changes.
You're thinking to apply this change to the back branches? Sorry
if my understanding is not right. But I don't think that back-patch
is ok because it changes the documented existing behavior
of pg_last_xact_replay_timestamp(). So it looks like the behavior
change not a bug fix.
Regards,
--
Fujii Masao
NTT DATA CORPORATION
Advanced Platform Technology Group
Research and Development Headquarters
On 2020-Jan-31, Fujii Masao wrote:
You're thinking to apply this change to the back branches? Sorry
if my understanding is not right. But I don't think that back-patch
is ok because it changes the documented existing behavior
of pg_last_xact_replay_timestamp(). So it looks like the behavior
change not a bug fix.
Yeah, I am thinking in backpatching it. The documented behavior is
already not what the code does. Do you have a situation where this
change would break something? If so, can you please explain what it is?
I think (and I said it upthread) a 100% complete fix involves tracking
two timestamps rather than one. I was thinking that that would be too
invasive because it changes XLogCtlData shmem struct ... but that struct
is private to xlog.c, so I think it's fine to change the struct. The
problem though is that the user-visible change that I want to achieve is
pg_last_xact_replay_timestamp(), and it would be obviously wrong to use
the new XLogCtlData field rather than the existing one, as that would be
a behavior change in the same sense that you're now complaining about.
So I would achieve nothing.
--
�lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2020/01/31 22:40, Alvaro Herrera wrote:
On 2020-Jan-31, Fujii Masao wrote:
You're thinking to apply this change to the back branches? Sorry
if my understanding is not right. But I don't think that back-patch
is ok because it changes the documented existing behavior
of pg_last_xact_replay_timestamp(). So it looks like the behavior
change not a bug fix.Yeah, I am thinking in backpatching it. The documented behavior is
already not what the code does.
Maybe you thought this because getRecordTimestamp() extracts the
timestamp from even WAL record of a restore point? That is, you're
concerned about that pg_last_xact_replay_timestamp() returns the
timestamp of not only commit/abort record but also restore point one.
Right?
As far as I read the code, this problem doesn't occur because
SetLatestXTime() is called only for commit/abort records, in
recoveryStopsAfter(). No?
Do you have a situation where this
change would break something? If so, can you please explain what it is?
For example, use the return value of pg_last_xact_replay_timestamp()
(and also the timestamp in the log message output at the end of
recovery) as a HINT when setting recovery_target_time later.
Use it to compare with the timestamp retrieved from the master server,
in order to monitor the replication delay.
Regards,
--
Fujii Masao
NTT DATA CORPORATION
Advanced Platform Technology Group
Research and Development Headquarters
On 2020-Jan-31, Fujii Masao wrote:
On 2020/01/31 22:40, Alvaro Herrera wrote:
On 2020-Jan-31, Fujii Masao wrote:
You're thinking to apply this change to the back branches? Sorry
if my understanding is not right. But I don't think that back-patch
is ok because it changes the documented existing behavior
of pg_last_xact_replay_timestamp(). So it looks like the behavior
change not a bug fix.Yeah, I am thinking in backpatching it. The documented behavior is
already not what the code does.Maybe you thought this because getRecordTimestamp() extracts the
timestamp from even WAL record of a restore point? That is, you're
concerned about that pg_last_xact_replay_timestamp() returns the
timestamp of not only commit/abort record but also restore point one.
Right?
right.
As far as I read the code, this problem doesn't occur because
SetLatestXTime() is called only for commit/abort records, in
recoveryStopsAfter(). No?
... uh, wow, you're right about that too. IMO this is extremely
fragile, easy to break, and under-documented. But you're right, there's
no bug there at present.
Do you have a situation where this
change would break something? If so, can you please explain what it is?For example, use the return value of pg_last_xact_replay_timestamp()
(and also the timestamp in the log message output at the end of
recovery) as a HINT when setting recovery_target_time later.
Hmm.
I'm not sure how you would use it in that way. I mean, I understand how
it *can* be used that way, but it seems too fragile to be done in
practice, in a scenario that's not just laboratory games.
Use it to compare with the timestamp retrieved from the master server,
in order to monitor the replication delay.
That's precisely the use case that I'm aiming at. The timestamp
currently is not useful because this usage breaks when the primary is
inactive (no COMMIT records occur). During such periods of inactivity,
CHECKPOINT records would keep the "last xtime" current. This has
actually happened in a production setting, it's not a thought
experiment.
--
�lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2020/01/31 23:47, Alvaro Herrera wrote:
On 2020-Jan-31, Fujii Masao wrote:
On 2020/01/31 22:40, Alvaro Herrera wrote:
On 2020-Jan-31, Fujii Masao wrote:
You're thinking to apply this change to the back branches? Sorry
if my understanding is not right. But I don't think that back-patch
is ok because it changes the documented existing behavior
of pg_last_xact_replay_timestamp(). So it looks like the behavior
change not a bug fix.Yeah, I am thinking in backpatching it. The documented behavior is
already not what the code does.Maybe you thought this because getRecordTimestamp() extracts the
timestamp from even WAL record of a restore point? That is, you're
concerned about that pg_last_xact_replay_timestamp() returns the
timestamp of not only commit/abort record but also restore point one.
Right?right.
As far as I read the code, this problem doesn't occur because
SetLatestXTime() is called only for commit/abort records, in
recoveryStopsAfter(). No?... uh, wow, you're right about that too. IMO this is extremely
fragile, easy to break, and under-documented.
Yeah, it's worth improving the code.
But you're right, there's
no bug there at present.Do you have a situation where this
change would break something? If so, can you please explain what it is?For example, use the return value of pg_last_xact_replay_timestamp()
(and also the timestamp in the log message output at the end of
recovery) as a HINT when setting recovery_target_time later.Hmm.
I'm not sure how you would use it in that way. I mean, I understand how
it *can* be used that way, but it seems too fragile to be done in
practice, in a scenario that's not just laboratory games.Use it to compare with the timestamp retrieved from the master server,
in order to monitor the replication delay.That's precisely the use case that I'm aiming at. The timestamp
currently is not useful because this usage breaks when the primary is
inactive (no COMMIT records occur). During such periods of inactivity,
CHECKPOINT records would keep the "last xtime" current. This has
actually happened in a production setting, it's not a thought
experiment.
I've heard that someone periodically generates dummy tiny
transactions (say, every minute), as a band-aid solution,
to avoid inactive primary. Of course, this is not a perfect solution.
The idea that I proposed previously was to introduce
pg_last_xact_insert_timestamp() [1]/messages/by-id/CAHGQGwF3ZjfuNEj5ka683KU5rQUBtSWtqFq7g1X0g34o+JXWBw@mail.gmail.com into core. This function returns
the timestamp of commit / abort records in *primary* side.
So we can retrieve that timestamp from the primary (e.g., by using dblink)
and compare its result with pg_last_xact_replay_timestamp() to
calculate the delay in the standby.
Another idea is to include the commit / abort timestamp in
primary-keepalive-message that periodially sent from the primary
to the standby. Then if we introduce the function returning
that timestamp, in the standby side, we can easily compare
the commit / abort timestamps taken from both primary and
standby, in the standby.
[1]: /messages/by-id/CAHGQGwF3ZjfuNEj5ka683KU5rQUBtSWtqFq7g1X0g34o+JXWBw@mail.gmail.com
Regards,
--
Fujii Masao
NTT DATA CORPORATION
Advanced Platform Technology Group
Research and Development Headquarters