Metric to calculate WAL size left to transfer to Standby

Started by Viral Shahalmost 5 years ago4 messagesgeneral
Jump to latest
#1Viral Shah
vshah@nodalexchange.com

Hello All,

We have a PostgreSQL 10.12 cluster of servers in two different data
centers. Off lately, in the case of a large WAL generation, we are seeing
replication delay between the master and the standby server. These delays
have off lately been there for an unusually long time. I was wondering if
we have any metric that can calculate the amount (size) of WAL transfer
left between master and standby?

PS: We have ensured we have upgraded our firewalls for better speed
transfer.

Any help on how to figure out the slowness in the WAL transfer would
be much appreciated.

Thanks,
Viral Shah

#2Laurenz Albe
laurenz.albe@cybertec.at
In reply to: Viral Shah (#1)
Re: Metric to calculate WAL size left to transfer to Standby

On Wed, 2021-04-14 at 17:50 -0400, Viral Shah wrote:

We have a PostgreSQL 10.12 cluster of servers in two different data centers.
Off lately, in the case of a large WAL generation, we are seeing replication
delay between the master and the standby server. These delays have off lately
been there for an unusually long time. I was wondering if we have any metric
that can calculate the amount (size) of WAL transfer left between master and
standby?

PS: We have ensured we have upgraded our firewalls for better speed transfer.

Any help on how to figure out the slowness in the WAL transfer would be much appreciated.

SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), flush_lsn) AS transfer_lag,
pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag
FROM pg_stat_replication;

If both are delayed, it might be that the network cannot cope.

If only the second number is delayed, you have replication conflicts
with queries on the standby.

Yours,
Laurenz Albe
--
Cybertec | https://www.cybertec-postgresql.com

#3Viral Shah
vshah@nodalexchange.com
In reply to: Laurenz Albe (#2)
Re: Metric to calculate WAL size left to transfer to Standby

Hello Laurenz,

Thank you so much for sending the query. It was exactly what I needed. I
just made 1 modification to beautify the transfer and replay lag and I can
see the size in bytes.

SELECT application_name,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), flush_lsn)) AS
transfer_lag,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn))
AS replay_lag
FROM pg_stat_replication;

I am now using zabbix to constantly monitor them and notify myself if
it breaches a certain threshold.

Thanks again!

Best,

Viral Shah

Nodal Exchange LLC

On Thu, Apr 15, 2021 at 8:10 AM Laurenz Albe <laurenz.albe@cybertec.at>
wrote:

Show quoted text

On Wed, 2021-04-14 at 17:50 -0400, Viral Shah wrote:

We have a PostgreSQL 10.12 cluster of servers in two different data

centers.

Off lately, in the case of a large WAL generation, we are seeing

replication

delay between the master and the standby server. These delays have off

lately

been there for an unusually long time. I was wondering if we have any

metric

that can calculate the amount (size) of WAL transfer left between

master and

standby?

PS: We have ensured we have upgraded our firewalls for better speed

transfer.

Any help on how to figure out the slowness in the WAL transfer would be

much appreciated.

SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), flush_lsn) AS transfer_lag,
pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag
FROM pg_stat_replication;

If both are delayed, it might be that the network cannot cope.

If only the second number is delayed, you have replication conflicts
with queries on the standby.

Yours,
Laurenz Albe
--
Cybertec | https://www.cybertec-postgresql.com

#4Viral Shah
vshah@nodalexchange.com
In reply to: Viral Shah (#3)
Re: Metric to calculate WAL size left to transfer to Standby

Hello Laurenz,

The above metric works fine for my primary server. However, We have a
cascading setup in our production system. This particular query doesn't
work for my intermediate server which is standby to the primary server but
also a master to one more standby server. We get the following error:

prod=# SELECT round(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)
/ (1024 :: NUMERIC ^ 2), 2 ) AS replay_lag FROM pg_stat_replication
where application_name like 'rtv%';
ERROR: recovery is in progress
HINT: WAL control functions cannot be executed during recovery.

I wanted to also track the transfer/replay lag between the intermediate
server and the final standby. I narrowed down the problem with
pg_current_wal_lsn() procedure. Any idea on how to mitigate this? or any
alternative to the pg_current_wal_lsn() procedure that I can use to get
transfer lag?

Thanks,
Viral Shah

On Fri, Apr 16, 2021 at 7:52 PM Viral Shah <vshah@nodalexchange.com> wrote:

Show quoted text

Hello Laurenz,

Thank you so much for sending the query. It was exactly what I needed. I
just made 1 modification to beautify the transfer and replay lag and I can
see the size in bytes.

SELECT application_name, pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), flush_lsn)) AS transfer_lag,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) AS replay_lag
FROM pg_stat_replication;

I am now using zabbix to constantly monitor them and notify myself if it breaches a certain threshold.

Thanks again!

Best,

Viral Shah

Nodal Exchange LLC

On Thu, Apr 15, 2021 at 8:10 AM Laurenz Albe <laurenz.albe@cybertec.at>
wrote:

On Wed, 2021-04-14 at 17:50 -0400, Viral Shah wrote:

We have a PostgreSQL 10.12 cluster of servers in two different data

centers.

Off lately, in the case of a large WAL generation, we are seeing

replication

delay between the master and the standby server. These delays have off

lately

been there for an unusually long time. I was wondering if we have any

metric

that can calculate the amount (size) of WAL transfer left between

master and

standby?

PS: We have ensured we have upgraded our firewalls for better speed

transfer.

Any help on how to figure out the slowness in the WAL transfer would be

much appreciated.

SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), flush_lsn) AS transfer_lag,
pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag
FROM pg_stat_replication;

If both are delayed, it might be that the network cannot cope.

If only the second number is delayed, you have replication conflicts
with queries on the standby.

Yours,
Laurenz Albe
--
Cybertec | https://www.cybertec-postgresql.com