Metric to calculate WAL size left to transfer to Standby
Hello All,
We have a PostgreSQL 10.12 cluster of servers in two different data
centers. Off lately, in the case of a large WAL generation, we are seeing
replication delay between the master and the standby server. These delays
have off lately been there for an unusually long time. I was wondering if
we have any metric that can calculate the amount (size) of WAL transfer
left between master and standby?
PS: We have ensured we have upgraded our firewalls for better speed
transfer.
Any help on how to figure out the slowness in the WAL transfer would
be much appreciated.
Thanks,
Viral Shah
On Wed, 2021-04-14 at 17:50 -0400, Viral Shah wrote:
We have a PostgreSQL 10.12 cluster of servers in two different data centers.
Off lately, in the case of a large WAL generation, we are seeing replication
delay between the master and the standby server. These delays have off lately
been there for an unusually long time. I was wondering if we have any metric
that can calculate the amount (size) of WAL transfer left between master and
standby?PS: We have ensured we have upgraded our firewalls for better speed transfer.
Any help on how to figure out the slowness in the WAL transfer would be much appreciated.
SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), flush_lsn) AS transfer_lag,
pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag
FROM pg_stat_replication;
If both are delayed, it might be that the network cannot cope.
If only the second number is delayed, you have replication conflicts
with queries on the standby.
Yours,
Laurenz Albe
--
Cybertec | https://www.cybertec-postgresql.com
Hello Laurenz,
Thank you so much for sending the query. It was exactly what I needed. I
just made 1 modification to beautify the transfer and replay lag and I can
see the size in bytes.
SELECT application_name,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), flush_lsn)) AS
transfer_lag,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn))
AS replay_lag
FROM pg_stat_replication;
I am now using zabbix to constantly monitor them and notify myself if
it breaches a certain threshold.
Thanks again!
Best,
Viral Shah
Nodal Exchange LLC
On Thu, Apr 15, 2021 at 8:10 AM Laurenz Albe <laurenz.albe@cybertec.at>
wrote:
Show quoted text
On Wed, 2021-04-14 at 17:50 -0400, Viral Shah wrote:
We have a PostgreSQL 10.12 cluster of servers in two different data
centers.
Off lately, in the case of a large WAL generation, we are seeing
replication
delay between the master and the standby server. These delays have off
lately
been there for an unusually long time. I was wondering if we have any
metric
that can calculate the amount (size) of WAL transfer left between
master and
standby?
PS: We have ensured we have upgraded our firewalls for better speed
transfer.
Any help on how to figure out the slowness in the WAL transfer would be
much appreciated.
SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), flush_lsn) AS transfer_lag,
pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag
FROM pg_stat_replication;If both are delayed, it might be that the network cannot cope.
If only the second number is delayed, you have replication conflicts
with queries on the standby.Yours,
Laurenz Albe
--
Cybertec | https://www.cybertec-postgresql.com
Hello Laurenz,
The above metric works fine for my primary server. However, We have a
cascading setup in our production system. This particular query doesn't
work for my intermediate server which is standby to the primary server but
also a master to one more standby server. We get the following error:
prod=# SELECT round(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)
/ (1024 :: NUMERIC ^ 2), 2 ) AS replay_lag FROM pg_stat_replication
where application_name like 'rtv%';
ERROR: recovery is in progress
HINT: WAL control functions cannot be executed during recovery.
I wanted to also track the transfer/replay lag between the intermediate
server and the final standby. I narrowed down the problem with
pg_current_wal_lsn() procedure. Any idea on how to mitigate this? or any
alternative to the pg_current_wal_lsn() procedure that I can use to get
transfer lag?
Thanks,
Viral Shah
On Fri, Apr 16, 2021 at 7:52 PM Viral Shah <vshah@nodalexchange.com> wrote:
Show quoted text
Hello Laurenz,
Thank you so much for sending the query. It was exactly what I needed. I
just made 1 modification to beautify the transfer and replay lag and I can
see the size in bytes.SELECT application_name, pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), flush_lsn)) AS transfer_lag,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) AS replay_lag
FROM pg_stat_replication;I am now using zabbix to constantly monitor them and notify myself if it breaches a certain threshold.
Thanks again!
Best,
Viral Shah
Nodal Exchange LLC
On Thu, Apr 15, 2021 at 8:10 AM Laurenz Albe <laurenz.albe@cybertec.at>
wrote:On Wed, 2021-04-14 at 17:50 -0400, Viral Shah wrote:
We have a PostgreSQL 10.12 cluster of servers in two different data
centers.
Off lately, in the case of a large WAL generation, we are seeing
replication
delay between the master and the standby server. These delays have off
lately
been there for an unusually long time. I was wondering if we have any
metric
that can calculate the amount (size) of WAL transfer left between
master and
standby?
PS: We have ensured we have upgraded our firewalls for better speed
transfer.
Any help on how to figure out the slowness in the WAL transfer would be
much appreciated.
SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), flush_lsn) AS transfer_lag,
pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag
FROM pg_stat_replication;If both are delayed, it might be that the network cannot cope.
If only the second number is delayed, you have replication conflicts
with queries on the standby.Yours,
Laurenz Albe
--
Cybertec | https://www.cybertec-postgresql.com