Measuring replay lag
Hi hackers,
Here is a new version of my patch to add a replay_lag column to the
pg_stat_replication view (originally proposed as part of a larger
patch set for 9.6[1]), like this:
postgres=# select application_name, replay_lag from pg_stat_replication;
┌──────────────────┬─────────────────┐
│ application_name │ replay_lag │
├──────────────────┼─────────────────┤
│ replica1 │ 00:00:00.595382 │
│ replica2 │ 00:00:00.598448 │
│ replica3 │ 00:00:00.541597 │
│ replica4 │ 00:00:00.551227 │
└──────────────────┴─────────────────┘
(4 rows)
It works by taking advantage of the { time, end-of-WAL } samples that
sending servers already include in message headers to standbys. That
seems to provide a pretty good proxy for when the WAL was written, if
you ignore messages where the LSN hasn't advanced. The patch
introduces a new GUC replay_lag_sample_interval, defaulting to 1s, to
control how often the standby should record these timestamped LSNs
into a small circular buffer. When its recovery process eventually
replays a timestamped LSN, the timestamp is sent back to the upstream
server in a new reply message field. The value visible in
pg_stat_replication.replay_lag can then be updated by comparing with
the current time.
Compared to the usual techniques people use to estimate replay lag,
this approach has the following advantages:
1. The lag is measured in time, not LSN difference.
2. The lag time is computed using two observations of a single
server's clock, so there is no clock skew.
3. The lag is updated even between commits (during large data loads etc).
In the previous version I was effectively showing the ping time
between the servers during idle times when the standby was fully
caught up because there was nothing happening. I decided that was not
useful information and that it's more newsworthy and interesting to
see the estimated replay lag for the most recent real replayed
activity, so I changed that.
In the last thread[1], Robert Haas wrote:
Well, one problem with this is that you can't put a loop inside of a
spinlock-protected critical section.
Fixed.
In general, I think this is a pretty reasonable way of attacking this
problem, but I'd say it's significantly under-commented. Where should
someone go to get a general overview of this mechanism? The answer is
not "at place XXX within the patch". (I think it might merit some
more extensive documentation, too, although I'm not exactly sure what
that should look like.)
I have added lots of comments.
When you overflow the buffer, you could thin it out in a smarter way,
like by throwing away every other entry instead of the oldest one. I
guess you'd need to be careful how you coded that, though, because
replaying an entry with a timestamp invalidates some of the saved
entries without formally throwing them out.
Done, by overwriting the newest sample rather than the oldest if the
buffer is full. That seems to give pretty reasonable degradation,
effectively lowering the sampling rate, without any complicated buffer
or rate management code.
Conceivably, 0002 could be split into two patches, one of which
computes "stupid replay lag" considering only records that naturally
carry timestamps, and a second adding the circular buffer to handle
the case where much time passes without finding such a record.
I contemplated this but decided that it'd be best to use ONLY samples
from walsender headers, and never use the time stamps from commit
records for this. If we use times from commit records, then a
cascading sending server will not be able to compute the difference in
time without introducing clock skew (not to mention the difficulty of
combining timestamps from two sources if we try to do both). I
figured that it's better to have a value that shows a cascading
sender->standby->cascading sender round trip time that is free of
clock skew, than a master->cascading sender->standby->cascading sender
incomplete round trip that includes clock skew.
By the same reasoning I decided against introducing a new periodic WAL
record or field from the master to hold extra time stamps in between
commits to do this, in favour of the buffered transient timestamp
approach I took in this patch. That said, I can see there are
arguments for doing it with extra periodic WAL timestamps, if people
don't think it'd be too invasive to mess with the WAL for this, and
don't care about cascading standbys giving skewed readings. One
advantage would be that persistent WAL timestamps would still be able
to provide lag estimates if a standby has been down for a while and
was catching up, and this approach can't until it's caught up due to
lack of buffered transient timestamps. Thoughts?
I plan to post a new "causal reads" patch at some point which will
depend on this, but in any case I think this is a useful feature on
its own. I'd be grateful for any feedback, flames, better ideas etc.
Thanks for reading.
[1]: /messages/by-id/CAEepm=31yndQ7S5RdGofoGz1yQ-cteMrePR2JLf9gWGzxKcV7w@mail.gmail.com
--
Thomas Munro
http://www.enterprisedb.com
Attachment: replay-lag-v12.patch (application/octet-stream, +341 -24)
On Wed, Oct 26, 2016 at 7:34 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
Hi hackers,
Here is a new version of my patch to add a replay_lag column to the
pg_stat_replication view (originally proposed as part of a larger
patch set for 9.6[1]), like this:
Thank you for working on this!
postgres=# select application_name, replay_lag from pg_stat_replication;
┌──────────────────┬─────────────────┐
│ application_name │ replay_lag │
├──────────────────┼─────────────────┤
│ replica1 │ 00:00:00.595382 │
│ replica2 │ 00:00:00.598448 │
│ replica3 │ 00:00:00.541597 │
│ replica4 │ 00:00:00.551227 │
└──────────────────┴─────────────────┘
(4 rows)
It works by taking advantage of the { time, end-of-WAL } samples that
sending servers already include in message headers to standbys. That
seems to provide a pretty good proxy for when the WAL was written, if
you ignore messages where the LSN hasn't advanced. The patch
introduces a new GUC replay_lag_sample_interval, defaulting to 1s, to
control how often the standby should record these timestamped LSNs
into a small circular buffer. When its recovery process eventually
replays a timestamped LSN, the timestamp is sent back to the upstream
server in a new reply message field. The value visible in
pg_stat_replication.replay_lag can then be updated by comparing with
the current time.
replay_lag_sample_interval is 1s by default but I got 1000s from the SHOW command.
postgres(1:36789)=# show replay_lag_sample_interval ;
replay_lag_sample_interval
----------------------------
1000s
(1 row)
Also, when I set replay_lag_sample_interval = 500ms, I got 0 from the SHOW command.
postgres(1:99850)=# select name, setting, applied from
pg_file_settings where name = 'replay_lag_sample_interval';
name | setting | applied
----------------------------+---------+---------
replay_lag_sample_interval | 500ms | t
(1 row)
postgres(1:99850)=# show replay_lag_sample_interval ;
replay_lag_sample_interval
----------------------------
0
(1 row)
Compared to the usual techniques people use to estimate replay lag,
this approach has the following advantages:
1. The lag is measured in time, not LSN difference.
2. The lag time is computed using two observations of a single
server's clock, so there is no clock skew.
3. The lag is updated even between commits (during large data loads etc).
I agree with this approach.
In the previous version I was effectively showing the ping time
between the servers during idle times when the standby was fully
caught up because there was nothing happening. I decided that was not
useful information and that it's more newsworthy and interesting to
see the estimated replay lag for the most recent real replayed
activity, so I changed that.
In the last thread[1], Robert Haas wrote:
Well, one problem with this is that you can't put a loop inside of a
spinlock-protected critical section.
Fixed.
In general, I think this is a pretty reasonable way of attacking this
problem, but I'd say it's significantly under-commented. Where should
someone go to get a general overview of this mechanism? The answer is
not "at place XXX within the patch". (I think it might merit some
more extensive documentation, too, although I'm not exactly sure what
that should look like.)
I have added lots of comments.
When you overflow the buffer, you could thin it out in a smarter way,
like by throwing away every other entry instead of the oldest one. I
guess you'd need to be careful how you coded that, though, because
replaying an entry with a timestamp invalidates some of the saved
entries without formally throwing them out.
Done, by overwriting the newest sample rather than the oldest if the
buffer is full. That seems to give pretty reasonable degradation,
effectively lowering the sampling rate, without any complicated buffer
or rate management code.
Conceivably, 0002 could be split into two patches, one of which
computes "stupid replay lag" considering only records that naturally
carry timestamps, and a second adding the circular buffer to handle
the case where much time passes without finding such a record.
I contemplated this but decided that it'd be best to use ONLY samples
from walsender headers, and never use the time stamps from commit
records for this. If we use times from commit records, then a
cascading sending server will not be able to compute the difference in
time without introducing clock skew (not to mention the difficulty of
combining timestamps from two sources if we try to do both). I
figured that it's better to have a value that shows a cascading
sender->standby->cascading sender round trip time that is free of
clock skew, than a master->cascading sender->standby->cascading sender
incomplete round trip that includes clock skew.
By the same reasoning I decided against introducing a new periodic WAL
record or field from the master to hold extra time stamps in between
commits to do this, in favour of the buffered transient timestamp
approach I took in this patch.
I think that you need to change sendFeedback() in pg_recvlogical.c and
receivexlog.c as well.
That said, I can see there are
arguments for doing it with extra periodic WAL timestamps, if people
don't think it'd be too invasive to mess with the WAL for this, and
don't care about cascading standbys giving skewed readings. One
advantage would be that persistent WAL timestamps would still be able
to provide lag estimates if a standby has been down for a while and
was catching up, and this approach can't until it's caught up due to
lack of buffered transient timestamps. Thoughts?
I plan to post a new "causal reads" patch at some point which will
depend on this, but in any case I think this is a useful feature on
its own. I'd be grateful for any feedback, flames, better ideas etc.
Thanks for reading.
[1]: /messages/by-id/CAEepm=31yndQ7S5RdGofoGz1yQ-cteMrePR2JLf9gWGzxKcV7w@mail.gmail.com
Regards,
--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Tue, Nov 8, 2016 at 2:35 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
replay_lag_sample_interval is 1s by default but I got 1000s from the SHOW command.
postgres(1:36789)=# show replay_lag_sample_interval ;
replay_lag_sample_interval
----------------------------
1000s
(1 row)
Oops, fixed.
1. The lag is measured in time, not LSN difference.
2. The lag time is computed using two observations of a single
server's clock, so there is no clock skew.
3. The lag is updated even between commits (during large data loads etc).
I agree with this approach.
Thanks for the feedback.
I think that you need to change sendFeedback() in pg_recvlogical.c and
receivexlog.c as well.
Right, fixed.
Thanks very much for testing! New version attached. I will add this
to the next CF.
--
Thomas Munro
http://www.enterprisedb.com
Attachment: replay-lag-v13.patch (application/octet-stream, +347 -26)
On 11/22/16 4:27 AM, Thomas Munro wrote:
Thanks very much for testing! New version attached. I will add this
to the next CF.
I don't see it there yet.
--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 26 October 2016 at 11:34, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
It works by taking advantage of the { time, end-of-WAL } samples that
sending servers already include in message headers to standbys. That
seems to provide a pretty good proxy for when the WAL was written, if
you ignore messages where the LSN hasn't advanced. The patch
introduces a new GUC replay_lag_sample_interval, defaulting to 1s, to
control how often the standby should record these timestamped LSNs
into a small circular buffer. When its recovery process eventually
replays a timestamped LSN, the timestamp is sent back to the upstream
server in a new reply message field. The value visible in
pg_stat_replication.replay_lag can then be updated by comparing with
the current time.
Why not just send back the lag as calculated by max_standby_streaming_delay?
I.e. at the end of replay of each chunk record the current delay in
shmem, then send it back periodically.
If we have two methods of calculation it would be confusing.
Admittedly the approach here is the same one I advocated some years
back when Robert and I were discussing time delayed standbys.
Compared to the usual techniques people use to estimate replay lag,
this approach has the following advantages:
1. The lag is measured in time, not LSN difference.
2. The lag time is computed using two observations of a single
server's clock, so there is no clock skew.
3. The lag is updated even between commits (during large data loads etc).
Yes, good reasons.
In the previous version I was effectively showing the ping time
between the servers during idle times when the standby was fully
caught up because there was nothing happening. I decided that was not
useful information and that it's more newsworthy and interesting to
see the estimated replay lag for the most recent real replayed
activity, so I changed that.
In the last thread[1], Robert Haas wrote:
Well, one problem with this is that you can't put a loop inside of a
spinlock-protected critical section.
Fixed.
In general, I think this is a pretty reasonable way of attacking this
problem, but I'd say it's significantly under-commented. Where should
someone go to get a general overview of this mechanism? The answer is
not "at place XXX within the patch". (I think it might merit some
more extensive documentation, too, although I'm not exactly sure what
that should look like.)
I have added lots of comments.
When you overflow the buffer, you could thin it out in a smarter way,
like by throwing away every other entry instead of the oldest one. I
guess you'd need to be careful how you coded that, though, because
replaying an entry with a timestamp invalidates some of the saved
entries without formally throwing them out.
Done, by overwriting the newest sample rather than the oldest if the
buffer is full. That seems to give pretty reasonable degradation,
effectively lowering the sampling rate, without any complicated buffer
or rate management code.
Conceivably, 0002 could be split into two patches, one of which
computes "stupid replay lag" considering only records that naturally
carry timestamps, and a second adding the circular buffer to handle
the case where much time passes without finding such a record.
I contemplated this but decided that it'd be best to use ONLY samples
from walsender headers, and never use the time stamps from commit
records for this. If we use times from commit records, then a
cascading sending server will not be able to compute the difference in
time without introducing clock skew (not to mention the difficulty of
combining timestamps from two sources if we try to do both). I
figured that it's better to have a value that shows a cascading
sender->standby->cascading sender round trip time that is free of
clock skew, than a master->cascading sender->standby->cascading sender
incomplete round trip that includes clock skew.
By the same reasoning I decided against introducing a new periodic WAL
record or field from the master to hold extra time stamps in between
commits to do this, in favour of the buffered transient timestamp
approach I took in this patch. That said, I can see there are
arguments for doing it with extra periodic WAL timestamps, if people
don't think it'd be too invasive to mess with the WAL for this, and
don't care about cascading standbys giving skewed readings. One
advantage would be that persistent WAL timestamps would still be able
to provide lag estimates if a standby has been down for a while and
was catching up, and this approach can't until it's caught up due to
lack of buffered transient timestamps. Thoughts?
-1 for adding anything to the WAL for this.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Dec 19, 2016 at 4:03 PM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
On 11/22/16 4:27 AM, Thomas Munro wrote:
Thanks very much for testing! New version attached. I will add this
to the next CF.
I don't see it there yet.
Thanks for the reminder. Added here: https://commitfest.postgresql.org/12/920/
Here's a rebased patch.
--
Thomas Munro
http://www.enterprisedb.com
Attachment: replay-lag-v14.patch (application/octet-stream, +347 -26)
On Mon, Dec 19, 2016 at 10:46 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
On 26 October 2016 at 11:34, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
It works by taking advantage of the { time, end-of-WAL } samples that
sending servers already include in message headers to standbys. That
seems to provide a pretty good proxy for when the WAL was written, if
you ignore messages where the LSN hasn't advanced. The patch
introduces a new GUC replay_lag_sample_interval, defaulting to 1s, to
control how often the standby should record these timestamped LSNs
into a small circular buffer. When its recovery process eventually
replays a timestamped LSN, the timestamp is sent back to the upstream
server in a new reply message field. The value visible in
pg_stat_replication.replay_lag can then be updated by comparing with
the current time.
Why not just send back the lag as calculated by max_standby_streaming_delay?
I.e. at the end of replay of each chunk record the current delay in
shmem, then send it back periodically.
If we have two methods of calculation it would be confusing.
Hmm. If I understand correctly, GetStandbyLimitTime is measuring
something a bit different: it computes how long it has been since the
recovery process received the chunk that it's currently trying to
apply, most interestingly in the case where we are waiting due to
conflicts. It doesn't include time in walsender, on the network, in
walreceiver, or writing and flushing and reading before it arrives in
the recovery process. If I'm reading it correctly, it only updates
XLogReceiptTime when it's completely caught up applying all earlier
chunks, so when it falls behind, its measure of lag has a growing-only
phase and a reset that can only be triggered by catching up to the
latest chunk. That seems OK for its intended purpose of putting a cap
on the delay introduced by conflicts. But that's not what I'm trying
to provide here.
The purpose of this proposal is to show the replay_lag as judged by
the sending server: in the case of a primary server, that is an
indication of how long commits done here will take to show up to users
over there, and how long COMMIT with synchronous_commit = remote_apply
will take to come back. It measures the WAL's whole journey, and does so in a
smooth way that shows accurate information even if the standby never
quite catches up during long periods.
Example 1: Suppose I have two servers right next each other, and the
primary server has periods of high activity which exceed the standby's
replay rate, perhaps because of slower/busier hardware, or because of
conflicts with other queries, or because our single-core 'REDO' can't
always keep up with multi-core 'DO'. By the logic of
max_standby_streaming_delay, if it never catches up to the latest
chunk but remains a fluctuating number of chunks behind, then AIUI the
standby will compute a constantly increasing lag. By my logic, the
primary will tell you quite accurately how far behind the standby's
recovery is at regular intervals, showing replay_lag fluctuating up
and down as appropriate, even if it never quite catches up. It can do
that because it has a buffer full of regularly spaced out samples to
work through, and even if you exceed the buffer size (8192 seconds'
worth by default) it'll just increase the interval between samples.
Example 2: Suppose I have servers on opposite sides of the world with
a ping time of 300ms. By the logic used for
max_standby_streaming_delay, the lag computed by the standby would be
close to zero when there is no concurrent activity to conflict with.
I don't think that's what users other than the recovery-conflict
resolution code want to know. By my logic, replay_lag computed by the
primary would show 300ms + a tiny bit more, which is how long it takes
for committed transactions to be visible to user queries on the
standby and for us to know that that is the case. That's interesting
because it tells you how long synchronous_commit = remote_apply would
make you wait (if that server is waited for according to syncrep
config).
In summary, the max_standby_streaming_delay approach only measures
activity inside the recovery process on the standby, and only uses a
single variable for timestamp tracking, so although it's semi-related
it's not what I wanted to show.
(I suppose there might be an argument that max_standby_streaming_delay
should also track received-on-standby-time for each sampled LSN in a
circular buffer, and then use that information to implement
max_standby_streaming_delay more fairly. We only need to cancel
queries that conflict with WAL records that have truly been waiting
max_standby_streaming_delay since receive time, instead of cancelling
everything that conflicts with recovery until we're caught up to the
last chunk as we do today as soon as max_standby_streaming_delay is
exceeded while trying to apply *any* WAL record. This may not make
any sense or be worth doing, just an idea...)
Admittedly the approach here is the same one I advocated some years
back when Robert and I were discussing time delayed standbys.
That is reassuring!
--
Thomas Munro
http://www.enterprisedb.com
On Mon, Dec 19, 2016 at 8:13 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
On Mon, Dec 19, 2016 at 4:03 PM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
On 11/22/16 4:27 AM, Thomas Munro wrote:
Thanks very much for testing! New version attached. I will add this
to the next CF.
I don't see it there yet.
Thanks for the reminder. Added here: https://commitfest.postgresql.org/12/920/
Here's a rebased patch.
I agree that the capability to measure the remote_apply lag is very useful.
Also I want to measure the remote_write and remote_flush lags, for example,
in order to diagnose the cause of replication lag.
For that, what about maintaining the pairs of send-timestamp and LSN in
*sender side* instead of receiver side? That is, walsender adds the pairs
of send-timestamp and LSN into the buffer every sampling period.
Whenever walsender receives the write, flush and apply locations from
walreceiver, it calculates the write, flush and apply lags by comparing
the received and stored LSN and comparing the current timestamp and
stored send-timestamp.
As a bonus of this approach, we don't need to add the field into the replay
message that walreceiver can very frequently send back. Which might be
helpful in terms of networking overhead.
Regards,
--
Fujii Masao
On Thu, Dec 22, 2016 at 2:14 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
I agree that the capability to measure the remote_apply lag is very useful.
Also I want to measure the remote_write and remote_flush lags, for example,
in order to diagnose the cause of replication lag.
Good idea. I will think about how to make that work. There was a
proposal to make writing and flushing independent[1]. I'd like that
to go in. Then the write_lag and flush_lag could diverge
significantly, and it would be nice to be able to see that effect as
time (though you could already see it with LSN positions).
For that, what about maintaining the pairs of send-timestamp and LSN in
*sender side* instead of receiver side? That is, walsender adds the pairs
of send-timestamp and LSN into the buffer every sampling period.
Whenever walsender receives the write, flush and apply locations from
walreceiver, it calculates the write, flush and apply lags by comparing
the received and stored LSN and comparing the current timestamp and
stored send-timestamp.
I thought about that too, but I couldn't figure out how to make the
sampling work. If the primary is choosing (LSN, time) pairs to store
in a buffer, and the standby is sending replies at times of its
choosing (when wal_receiver_status_interval has been exceeded), then
you can't accurately measure anything.
You could fix that by making the standby send a reply *every time* it
applies some WAL (like it does for transactions committing with
synchronous_commit = remote_apply, though that is only for commit
records), but then we'd be generating a lot of recovery->walreceiver
communication and standby->primary network traffic, even for people
who don't otherwise need it. It seems unacceptable.
Or you could fix that by setting the XACT_COMPLETION_APPLY_FEEDBACK
bit in the xl_xinfo.xinfo for selected transactions, as a way to ask
the standby to send a reply when that commit record is applied, but
that only works for commit records. One of my goals was to be able to
report lag accurately even between commits (very large data load
transactions etc).
Or you could fix that by sending a list of 'interesting LSNs' to the
standby, as a way to ask it to send a reply when those LSNs are
applied. Then you'd need a circular buffer of (LSN, time) pairs in
the primary AND a circular buffer of LSNs in the standby to remember
which locations should generate a reply. This doesn't seem to be an
improvement.
That's why I thought that the standby should have the (LSN, time)
buffer: it decides which samples to record in its buffer, using LSN
and time provided by the sending server, and then it can send replies
at exactly the right times. The LSNs don't have to be commit records,
they're just arbitrary points in the WAL stream which we attach
timestamps to. IPC and network overhead is minimised, and accuracy is
maximised.
As a bonus of this approach, we don't need to add the field into the replay
message that walreceiver can very frequently send back. Which might be
helpful in terms of networking overhead.
For the record, these replies are only sent approximately every
replay_lag_sample_interval (with variation depending on replay speed)
and are only 42 bytes with the new field added.
[1]: /messages/by-id/CA+U5nMJifauXvVbx=v3UbYbHO3Jw2rdT4haL6CCooEDM5=4ASQ@mail.gmail.com
--
Thomas Munro
http://www.enterprisedb.com
On Thu, Dec 22, 2016 at 10:14 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
On Thu, Dec 22, 2016 at 2:14 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
I agree that the capability to measure the remote_apply lag is very useful.
Also I want to measure the remote_write and remote_flush lags, for example,
in order to diagnose the cause of replication lag.
Good idea. I will think about how to make that work.
Here is an experimental version that reports the write, flush and
apply lag separately as requested. This is done with three separate
(lsn, timestamp) buffers on the standby side. The GUC is now called
replication_lag_sample_interval. Not tested much yet.
postgres=# select application_name, write_lag, flush_lag, replay_lag
from pg_stat_replication ;
application_name | write_lag | flush_lag | replay_lag
------------------+-----------------+-----------------+-----------------
replica1 | 00:00:00.032408 | 00:00:00.032409 | 00:00:00.697858
replica2 | 00:00:00.032579 | 00:00:00.03258 | 00:00:00.551125
replica3 | 00:00:00.033686 | 00:00:00.033687 | 00:00:00.670571
replica4 | 00:00:00.032861 | 00:00:00.032862 | 00:00:00.521902
(4 rows)
--
Thomas Munro
http://www.enterprisedb.com
Attachment: replay-lag-v15.patch (application/octet-stream, +533 -27)
On Thu, Dec 29, 2016 at 1:28 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
On Thu, Dec 22, 2016 at 10:14 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
On Thu, Dec 22, 2016 at 2:14 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
I agree that the capability to measure the remote_apply lag is very useful.
Also I want to measure the remote_write and remote_flush lags, for example,
in order to diagnose the cause of replication lag.
Good idea. I will think about how to make that work.
Here is an experimental version that reports the write, flush and
apply lag separately as requested. This is done with three separate
(lsn, timestamp) buffers on the standby side. The GUC is now called
replication_lag_sample_interval. Not tested much yet.
Here is a new version that is slightly refactored and fixes a problem
with stale samples after periods of idleness.
--
Thomas Munro
http://www.enterprisedb.com
Attachment: replay-lag-v16.patch (application/octet-stream, +536 -27)
On 21 December 2016 at 21:14, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
On Thu, Dec 22, 2016 at 2:14 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
I agree that the capability to measure the remote_apply lag is very useful.
Also I want to measure the remote_write and remote_flush lags, for example,
in order to diagnose the cause of replication lag.
Good idea. I will think about how to make that work. There was a
proposal to make writing and flushing independent[1]. I'd like that
to go in. Then the write_lag and flush_lag could diverge
significantly, and it would be nice to be able to see that effect as
time (though you could already see it with LSN positions).
I think it has a much better chance now that the replies from apply
are OK. Will check in this release, but not now.
For that, what about maintaining the pairs of send-timestamp and LSN in
*sender side* instead of receiver side? That is, walsender adds the pairs
of send-timestamp and LSN into the buffer every sampling period.
Whenever walsender receives the write, flush and apply locations from
walreceiver, it calculates the write, flush and apply lags by comparing
the received and stored LSN and comparing the current timestamp and
stored send-timestamp.
I thought about that too, but I couldn't figure out how to make the
sampling work. If the primary is choosing (LSN, time) pairs to store
in a buffer, and the standby is sending replies at times of its
choosing (when wal_receiver_status_interval has been exceeded), then
you can't accurately measure anything.
Skipping adding the line delay to this was very specifically excluded
by Tom, so that clock disparity between servers is not included.
If the balance of opinion is in favour of including a measure of
complete roundtrip time then I'm OK with that.
You could fix that by making the standby send a reply *every time* it
applies some WAL (like it does for transactions committing with
synchronous_commit = remote_apply, though that is only for commit
records), but then we'd be generating a lot of recovery->walreceiver
communication and standby->primary network traffic, even for people
who don't otherwise need it. It seems unacceptable.
I don't see why that would be unacceptable. If we do it for
remote_apply, why not also do it for other modes? Whatever the
reasoning was for remote_apply should work for other modes. I should
add it was originally designed to be that way by me, so must have been
changed later.
This seems like a bug to me now that I look harder. The docs for
wal_receiver_status_interval say "Updates are sent each time the
write or flush positions change, or at least as often as specified by
this parameter." But it doesn't do that, as I think it should.
Or you could fix that by setting the XACT_COMPLETION_APPLY_FEEDBACK
bit in the xl_xinfo.xinfo for selected transactions, as a way to ask
the standby to send a reply when that commit record is applied, but
that only works for commit records. One of my goals was to be able to
report lag accurately even between commits (very large data load
transactions etc).
As we said, we do have keepalive records we could use for that.
Or you could fix that by sending a list of 'interesting LSNs' to the
standby, as a way to ask it to send a reply when those LSNs are
applied. Then you'd need a circular buffer of (LSN, time) pairs in
the primary AND a circular buffer of LSNs in the standby to remember
which locations should generate a reply. This doesn't seem to be an
improvement.
That's why I thought that the standby should have the (LSN, time)
buffer: it decides which samples to record in its buffer, using LSN
and time provided by the sending server, and then it can send replies
at exactly the right times. The LSNs don't have to be commit records,
they're just arbitrary points in the WAL stream which we attach
timestamps to. IPC and network overhead is minimised, and accuracy is
maximised.
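The standby-side scheme described here — a small circular buffer of (LSN, time) samples taken from incoming message headers, retired as recovery replays past them — might be sketched roughly like this. All identifiers below are illustrative, not the patch's actual ones:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;
typedef int64_t TimestampTz;

#define LAG_BUFFER_SIZE 8       /* illustrative; the patch's buffer is larger */

typedef struct
{
    XLogRecPtr  lsn;            /* end-of-WAL position from a sender message header */
    TimestampTz time;           /* sender's clock when that header was sent */
} LagSample;

static LagSample buffer[LAG_BUFFER_SIZE];
static int write_head = 0;
static int read_head = 0;
static int count = 0;

/* Record a (LSN, time) pair from an incoming message header.  When the
 * buffer is full, drop the oldest sample: this lowers the effective
 * sample rate instead of blocking the WAL receiver. */
static void
lag_sample_record(XLogRecPtr lsn, TimestampTz time)
{
    if (count == LAG_BUFFER_SIZE)
    {
        read_head = (read_head + 1) % LAG_BUFFER_SIZE;
        count--;
    }
    buffer[write_head].lsn = lsn;
    buffer[write_head].time = time;
    write_head = (write_head + 1) % LAG_BUFFER_SIZE;
    count++;
}

/* When recovery has replayed up to 'replayed', pop every sample at or
 * before that LSN and report the newest matching timestamp, which the
 * walreceiver then sends back upstream in its next reply message.
 * Returns true if any sample was retired. */
static bool
lag_sample_retire(XLogRecPtr replayed, TimestampTz *time)
{
    bool found = false;

    while (count > 0 && buffer[read_head].lsn <= replayed)
    {
        *time = buffer[read_head].time;
        found = true;
        read_head = (read_head + 1) % LAG_BUFFER_SIZE;
        count--;
    }
    return found;
}
```

The sender then subtracts the fed-back timestamp from its current clock, so both readings come from the one clock and skew never enters the picture.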
I'm dubious of keeping standby-side state, but I will review the patch.
As a bonus of this approach, we don't need to add the field into the replay
message that walreceiver can very frequently send back. Which might be
helpful in terms of networking overhead.
For the record, these replies are only sent approximately every
replay_lag_sample_interval (with variation depending on replay speed)
and are only 42 bytes with the new field added.
[1] /messages/by-id/CA+U5nMJifauXvVbx=v3UbYbHO3Jw2rdT4haL6CCooEDM5=4ASQ@mail.gmail.com
We have time to make any changes to allow this to be applied in this release.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Wed, Jan 4, 2017 at 1:06 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
On 21 December 2016 at 21:14, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
I thought about that too, but I couldn't figure out how to make the
sampling work. If the primary is choosing (LSN, time) pairs to store
in a buffer, and the standby is sending replies at times of its
choosing (when wal_receiver_status_interval has been exceeded), then
you can't accurately measure anything.
Skipping adding the line delay to this was very specifically excluded
by Tom, so that clock disparity between servers is not included.
If the balance of opinion is in favour of including a measure of
complete roundtrip time then I'm OK with that.
I deliberately included the network round trip for two reasons:
1. The three lag numbers tell you how long syncrep would take to
return control at the three levels remote_write, on, remote_apply.
2. The time arithmetic is all done on the primary side using two
observations of its single system clock, avoiding any discussion of
clock differences between servers.
You can always subtract half the ping time from these numbers later if
you really want to (replay_lag - (write_lag / 2) may be a cheap proxy
for a lag time that doesn't include the return network leg, and still
doesn't introduce clock difference error). I am strongly of the
opinion that time measurements made by a single observer are better
data to start from.
You could fix that by making the standby send a reply *every time* it
applies some WAL (like it does for transactions committing with
synchronous_commit = remote_apply, though that is only for commit
records), but then we'd be generating a lot of recovery->walreceiver
communication and standby->primary network traffic, even for people
who don't otherwise need it. It seems unacceptable.
I don't see why that would be unacceptable. If we do it for
remote_apply, why not also do it for other modes? Whatever the
reasoning was for remote_apply should work for other modes. I should
add it was originally designed to be that way by me, so must have been
changed later.
You can achieve that with this patch by setting
replication_lag_sample_interval to 0.
The patch streams (time-right-now, end-of-wal) to the standby in every
outgoing message, and then sees how long it takes for those timestamps
to be fed back to it. The standby feeds them back immediately as soon
as it writes, flushes and applies those WAL positions. I figured it
would be silly if every message from the primary caused the standby
to generate three replies just for a monitoring feature,
so I introduced the GUC replication_lag_sample_interval to rate-limit
that. I don't think there's much point in setting it lower than 1s:
how often will you look at pg_stat_replication?
That's why I thought that the standby should have the (LSN, time)
buffer: it decides which samples to record in its buffer, using LSN
and time provided by the sending server, and then it can send replies
at exactly the right times. The LSNs don't have to be commit records,
they're just arbitrary points in the WAL stream which we attach
timestamps to. IPC and network overhead is minimised, and accuracy is
maximised.
I'm dubious of keeping standby-side state, but I will review the patch.
Thanks!
The only standby-side state is the three buffers of (LSN, time) that
haven't been written/flushed/applied yet. I don't see how that can be
avoided, except by inserting extra periodic timestamps into the WAL
itself, which has already been rejected.
--
Thomas Munro
http://www.enterprisedb.com
On Wed, Jan 4, 2017 at 12:22 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
The patch streams (time-right-now, end-of-wal) to the standby in every
outgoing message, and then sees how long it takes for those timestamps
to be fed back to it.
Correction: we already stream (time-right-now, end-of-wal) to the
standby in every outgoing message. The patch introduces a new use of
that information by feeding them back upstream.
--
Thomas Munro
http://www.enterprisedb.com
On Wed, Jan 4, 2017 at 12:22 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
(replay_lag - (write_lag / 2) may be a cheap proxy
for a lag time that doesn't include the return network leg, and still
doesn't introduce clock difference error)
(Upon reflection it's a terrible proxy for that because of the mix of
write/flush work done by WAL receiver today, but would improve
dramatically if the WAL writer were doing the flushing. An even better
proxy might involve also tracking receive_lag, which doesn't include
the write() syscall. My real point is that there are ways to work
backwards from the two-way round trip time to get other estimates, but
no good ways to undo the damage that would be done to the data if we
started using two systems' clocks.)
--
Thomas Munro
http://www.enterprisedb.com
On 3 January 2017 at 23:22, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
I don't see why that would be unacceptable. If we do it for
remote_apply, why not also do it for other modes? Whatever the
reasoning was for remote_apply should work for other modes. I should
add it was originally designed to be that way by me, so must have been
changed later.
You can achieve that with this patch by setting
replication_lag_sample_interval to 0.
I wonder why you ignored my mention of the bug in the current mechanism?
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Jan 4, 2017 at 8:58 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
On 3 January 2017 at 23:22, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
I don't see why that would be unacceptable. If we do it for
remote_apply, why not also do it for other modes? Whatever the
reasoning was for remote_apply should work for other modes. I should
add it was originally designed to be that way by me, so must have been
changed later.
You can achieve that with this patch by setting
replication_lag_sample_interval to 0.
I wonder why you ignored my mention of the bug in the current mechanism?
I didn't have an opinion on that yet, but looking now I think there is
no bug: I was wrong about the current reply frequency. This comment
above XLogWalRcvSendReply confused me:
* If 'force' is not set, the message is only sent if enough time has
* passed since last status update to reach wal_receiver_status_interval.
Actually it's sent if 'force' is set, enough time has passed, or
either of the write or flush positions has moved. So we're already
sending replies after every write and flush, as you said we should.
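The reply-send rule read out of XLogWalRcvSendReply above boils down to a three-way condition. A hypothetical sketch, with illustrative parameter names rather than the function's actual ones:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;
typedef int64_t TimestampTz;    /* microseconds, as in PostgreSQL */

/* A reply goes out if the caller forces it, if either the write or the
 * flush position has advanced since the last reply, or if
 * wal_receiver_status_interval (in seconds) has elapsed. */
static bool
should_send_reply(bool force,
                  XLogRecPtr writePtr, XLogRecPtr lastWritePtr,
                  XLogRecPtr flushPtr, XLogRecPtr lastFlushPtr,
                  TimestampTz now, TimestampTz lastSendTime,
                  int wal_receiver_status_interval)
{
    if (force)
        return true;            /* caller demands an immediate reply */
    if (writePtr != lastWritePtr || flushPtr != lastFlushPtr)
        return true;            /* write or flush position moved */
    /* otherwise, only once the status interval has elapsed */
    return now - lastSendTime >=
        (TimestampTz) wal_receiver_status_interval * 1000000;
}
```

So the comment quoted above only describes the fallback branch; the position-moved branches already give the per-write and per-flush replies discussed here.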
So perhaps I should get rid of that replication_lag_sample_interval
GUC and send back apply timestamps frequently, as you were saying. It
would add up to a third more replies.
The effective sample rate would still be lowered when the fixed sized
buffers fill up and samples have to be dropped, and that'd be more
likely without that GUC. With the GUC, it doesn't start happening
until lag reaches XLOG_TIMESTAMP_BUFFER_SIZE *
replication_lag_sample_interval = ~2 hours with defaults, whereas
without rate limiting you might only need to get
XLOG_TIMESTAMP_BUFFER_SIZE 'w' messages behind before we start
dropping samples. Maybe that's perfectly OK, I'm not sure.
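The capacity arithmetic above, as a trivial illustration. XLOG_TIMESTAMP_BUFFER_SIZE's real value is in the patch; 7200 below is a made-up figure chosen only so that 7200 samples at the default 1s interval give the ~2 hours mentioned:

```c
#include <assert.h>

/* With rate limiting, at most one sample is recorded per
 * replication_lag_sample_interval, so a buffer of N entries spans
 * N * interval seconds of lag before old samples must be dropped. */
static int
lag_before_dropping_secs(int buffer_entries, int sample_interval_secs)
{
    return buffer_entries * sample_interval_secs;
}
```

Without the GUC, each 'w' message can consume an entry, so the same buffer may fill after only XLOG_TIMESTAMP_BUFFER_SIZE messages of backlog.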
--
Thomas Munro
http://www.enterprisedb.com
On Thu, Jan 5, 2017 at 12:03 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
So perhaps I should get rid of that replication_lag_sample_interval
GUC and send back apply timestamps frequently, as you were saying. It
would add up to a third more replies.
Oops, of course I meant to say up to 50% more replies...
--
Thomas Munro
http://www.enterprisedb.com
On Thu, Dec 22, 2016 at 6:14 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
On Thu, Dec 22, 2016 at 2:14 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
I agree that the capability to measure the remote_apply lag is very useful.
Also I want to measure the remote_write and remote_flush lags, for example,
in order to diagnose the cause of replication lag.
Good idea. I will think about how to make that work. There was a
proposal to make writing and flushing independent[1]. I'd like that
to go in. Then the write_lag and flush_lag could diverge
significantly, and it would be nice to be able to see that effect as
time (though you could already see it with LSN positions).
For that, what about maintaining the pairs of send-timestamp and LSN in
*sender side* instead of receiver side? That is, walsender adds the pairs
of send-timestamp and LSN into the buffer every sampling period.
Whenever walsender receives the write, flush and apply locations from
walreceiver, it calculates the write, flush and apply lags by comparing
the received and stored LSN and comparing the current timestamp and
stored send-timestamp.
I thought about that too, but I couldn't figure out how to make the
sampling work. If the primary is choosing (LSN, time) pairs to store
in a buffer, and the standby is sending replies at times of its
choosing (when wal_receiver_status_interval has been exceeded), then
you can't accurately measure anything.
Yeah, even though the primary stores (100, 2017-01-17 00:00:00) as the pair of
(LSN, timestamp), for example, the standby may not send back the reply for
LSN 100 itself. The primary may receive the reply for larger LSN like 200,
instead. So the measurement of the lag in the primary side would not be so
accurate.
But we can calculate the "sync rep" lag by comparing the stored timestamp of
LSN 100 and the timestamp when the reply for LSN 200 is received. In sync rep,
since the transaction waiting for LSN 100 to be replicated is actually released
after the reply for LSN 200 is received, the above calculated lag is basically
accurate as sync rep lag.
Therefore I'm still thinking that it's better to maintain the pairs of LSN
and timestamp in the *primary* side. Thought?
Regards,
--
Fujii Masao
On Tue, Jan 17, 2017 at 7:45 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
On Thu, Dec 22, 2016 at 6:14 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
On Thu, Dec 22, 2016 at 2:14 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
I agree that the capability to measure the remote_apply lag is very useful.
Also I want to measure the remote_write and remote_flush lags, for example,
in order to diagnose the cause of replication lag.
Good idea. I will think about how to make that work. There was a
proposal to make writing and flushing independent[1]. I'd like that
to go in. Then the write_lag and flush_lag could diverge
significantly, and it would be nice to be able to see that effect as
time (though you could already see it with LSN positions).
For that, what about maintaining the pairs of send-timestamp and LSN in
*sender side* instead of receiver side? That is, walsender adds the pairs
of send-timestamp and LSN into the buffer every sampling period.
Whenever walsender receives the write, flush and apply locations from
walreceiver, it calculates the write, flush and apply lags by comparing
the received and stored LSN and comparing the current timestamp and
stored send-timestamp.
I thought about that too, but I couldn't figure out how to make the
sampling work. If the primary is choosing (LSN, time) pairs to store
in a buffer, and the standby is sending replies at times of its
choosing (when wal_receiver_status_interval has been exceeded), then
you can't accurately measure anything.
Yeah, even though the primary stores (100, 2017-01-17 00:00:00) as the pair of
(LSN, timestamp), for example, the standby may not send back the reply for
LSN 100 itself. The primary may receive the reply for larger LSN like 200,
instead. So the measurement of the lag in the primary side would not be so
accurate.
But we can calculate the "sync rep" lag by comparing the stored timestamp of
LSN 100 and the timestamp when the reply for LSN 200 is received. In sync rep,
since the transaction waiting for LSN 100 to be replicated is actually released
after the reply for LSN 200 is received, the above calculated lag is basically
accurate as sync rep lag.
Therefore I'm still thinking that it's better to maintain the pairs of LSN
and timestamp in the *primary* side. Thought?
Ok. I see that there is a new compelling reason to move the ring
buffer to the sender side: then I think lag tracking will work
automatically for the new logical replication that just landed on
master. I will try it that way. Thanks for the feedback!
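A rough sketch of the sender-side variant being agreed to here: walsender records (LSN, send-time) pairs, and on each standby reply retires every sample at or below the reported position, so a reply for LSN 200 also settles the sample stored for LSN 100, matching the sync-rep release behaviour Fujii describes. All names below are hypothetical:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;
typedef int64_t TimestampTz;

#define SAMPLE_BUFFER_SIZE 8    /* illustrative size only */

typedef struct
{
    XLogRecPtr  lsn;            /* end-of-WAL when the message was sent */
    TimestampTz sent_at;        /* primary's clock at that moment */
} SentSample;

static SentSample samples[SAMPLE_BUFFER_SIZE];
static int head = 0, tail = 0, nsamples = 0;

/* Called by walsender once per sampling period: remember what was sent
 * and when, overwriting the oldest entry if the buffer is full. */
static void
record_sent(XLogRecPtr lsn, TimestampTz now)
{
    if (nsamples == SAMPLE_BUFFER_SIZE)
    {
        tail = (tail + 1) % SAMPLE_BUFFER_SIZE;
        nsamples--;
    }
    samples[head].lsn = lsn;
    samples[head].sent_at = now;
    head = (head + 1) % SAMPLE_BUFFER_SIZE;
    nsamples++;
}

/* On receiving a standby reply reporting position 'reported' at local
 * time 'now', retire every sample at or below that LSN and compute the
 * lag from the newest one.  Both timestamps come from the primary's
 * clock, so clock skew never enters the calculation. */
static bool
compute_lag(XLogRecPtr reported, TimestampTz now, TimestampTz *lag)
{
    bool found = false;

    while (nsamples > 0 && samples[tail].lsn <= reported)
    {
        *lag = now - samples[tail].sent_at;
        found = true;
        tail = (tail + 1) % SAMPLE_BUFFER_SIZE;
        nsamples--;
    }
    return found;
}
```

Run once per reply for each of the write, flush and apply positions, this yields the three lag columns without any standby-side state, which is what makes it attractive for logical replication too.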
--
Thomas Munro
http://www.enterprisedb.com