Postgres Logical Replication - how to see what subscriber is doing with received data?

Started by Michael Jaskiewiczover 1 year ago6 messagesgeneral

Michael Jaskiewicz

mjaskiewicz@ghx.com

over 1 year ago

I've got two Postgres 13 databases on AWS RDS.

* One is a master, the other a slave using logical replication.
* Replication has fallen behind by about 350Gb.
* The slave was maxed out in terms of CPU for the past four days because of some jobs that were ongoing so I'm not sure what logical replication was able to replicate during that time.
* I killed those jobs and now CPU on the master and slave are both low.
* I look at the subscriber via `select * from pg_stat_subscription;` and see that latest_end_lsn is advancing albeit very slowly.
* The publisher says write/flush/replay lags are all 13 minutes behind but it's been like that for most of the day.
* I see no errors in the logs on either the publisher or subscriber outside of some simple SQL errors that users have been making.
* CloudWatch reports low CPU utilization, low I/O, and low network.

Is there anything I can do here? Previously I set wal_receiver_timeout timeout to 0 because I had replication issues, and that helped things. I wish I had some visibility here to get any kind of confidence that it's going to pull through, but other than these lsn values and database logs, I'm not sure what to check.

Sincerely,
mj

shaheedhaque@gmail.com

over 1 year ago

In reply to: Michael Jaskiewicz (#1)

Re: Postgres Logical Replication - how to see what subscriber is doing with received data?

Since nobody more knowledgeable has replied...

I'm very interested in this area and still surprised that there is no
official/convenient/standard way to approach this (see
/messages/by-id/CAHAc2jdAHvp7tFZBP37awcth=T3h5WXCN9KjZOvuTNJaAAC_hg@mail.gmail.com
).

Based partly on that thread, I ended up with a script that connects to both
ends of the replication, and basically loops while comparing the counts in
each table.

On Fri, 30 Aug 2024, 12:38 Michael Jaskiewicz, <mjaskiewicz@ghx.com> wrote:

Show quoted text

I've got two Postgres 13 databases on AWS RDS.

- One is a master, the other a slave using logical replication.
- Replication has fallen behind by about 350Gb.
- The slave was maxed out in terms of CPU for the past four days
because of some jobs that were ongoing so I'm not sure what logical
replication was able to replicate during that time.
- I killed those jobs and now CPU on the master and slave are both low.
- I look at the subscriber via `select * from pg_stat_subscription;`
and see that latest_end_lsn is advancing albeit very slowly.
- The publisher says write/flush/replay lags are all 13 minutes behind
but it's been like that for most of the day.
- I see no errors in the logs on either the publisher or subscriber
outside of some simple SQL errors that users have been making.
- CloudWatch reports low CPU utilization, low I/O, and low network.

Is there anything I can do here? Previously I set wal_receiver_timeout
timeout to 0 because I had replication issues, and that helped things. I
wish I had *some* visibility here to get any kind of confidence that it's
going to pull through, but other than these lsn values and database logs,
I'm not sure what to check.

Sincerely,

mj

mmikram@gmail.com

over 1 year ago

In reply to: Shaheed Haque (#2)

Re: Postgres Logical Replication - how to see what subscriber is doing with received data?

Hi Shaheed,

Maybe these considerations could help you or give any hint to the problem ?

Check if wal_receiver_timeout being set to 0 could potentially cause
issues, like not detecting network issues quickly enough. Consider
re-evaluating this setting if you see connection issues.

If you notice that some data is missing on subscriber then could you
increase max_slot_wal_keep_size on publisher so that WALs are not deleted
until they are applied on subscriber.

Do you have flexibility to increase max_worker_processes and
max_logical_replication_workers, work_mem and maintenance_work_mem on
subscriber (In case bottleneck exists on subscriber)

If there's significant lag, consider whether it might be more efficient to
drop the subscription and re-initialize it from scratch using a new base
backup, depending on the data volume and how long it might take for the
existing replication to catch up.

Regards,
Muhammad Ikram

On Sun, Sep 1, 2024 at 9:22 PM Shaheed Haque <shaheedhaque@gmail.com> wrote:

Since nobody more knowledgeable has replied...

I'm very interested in this area and still surprised that there is no
official/convenient/standard way to approach this (see
/messages/by-id/CAHAc2jdAHvp7tFZBP37awcth=T3h5WXCN9KjZOvuTNJaAAC_hg@mail.gmail.com
).

Based partly on that thread, I ended up with a script that connects to
both ends of the replication, and basically loops while comparing the
counts in each table.

On Fri, 30 Aug 2024, 12:38 Michael Jaskiewicz, <mjaskiewicz@ghx.com>
wrote:

I've got two Postgres 13 databases on AWS RDS.

- One is a master, the other a slave using logical replication.
- Replication has fallen behind by about 350Gb.
- The slave was maxed out in terms of CPU for the past four days
because of some jobs that were ongoing so I'm not sure what logical
replication was able to replicate during that time.
- I killed those jobs and now CPU on the master and slave are both
low.
- I look at the subscriber via `select * from pg_stat_subscription;`
and see that latest_end_lsn is advancing albeit very slowly.
- The publisher says write/flush/replay lags are all 13 minutes
behind but it's been like that for most of the day.
- I see no errors in the logs on either the publisher or subscriber
outside of some simple SQL errors that users have been making.
- CloudWatch reports low CPU utilization, low I/O, and low network.

Is there anything I can do here? Previously I set wal_receiver_timeout
timeout to 0 because I had replication issues, and that helped things. I
wish I had *some* visibility here to get any kind of confidence that
it's going to pull through, but other than these lsn values and database
logs, I'm not sure what to check.

Sincerely,

mj

--
Muhammad Ikram

shaheedhaque@gmail.com

over 1 year ago

In reply to: Muhammad Ikram (#3)

Re: Postgres Logical Replication - how to see what subscriber is doing with received data?

Hi Muhammad,

On Mon, 2 Sep 2024, 07:08 Muhammad Ikram, <mmikram@gmail.com> wrote:

Hi Shaheed,

Maybe these considerations could help you or give any hint to the problem ?

Check if wal_receiver_timeout being set to 0 could potentially cause
issues, like not detecting network issues quickly enough. Consider
re-evaluating this setting if you see connection issues.

If you notice that some data is missing on subscriber then could you
increase max_slot_wal_keep_size on publisher so that WALs are not deleted
until they are applied on subscriber.

Do you have flexibility to increase max_worker_processes and
max_logical_replication_workers, work_mem and maintenance_work_mem on
subscriber (In case bottleneck exists on subscriber)

If there's significant lag, consider whether it might be more efficient to
drop the subscription and re-initialize it from scratch using a new base
backup, depending on the data volume and how long it might take for the
existing replication to catch up.

Thanks for the kind hints, I'll certainly look into those.

My main interest however was with the "visibility" question, i.e. to get an
understanding of the gap between the two ends of a replication slot,
ideally in human terms (e.g. tables x records).

I understand the difficulties of trying to produce a meaningful metric that
spans two (or more) systems but let's be honest, trying to diagnose which
knobs to tweak (whether in application, PG, the OS or the network) is
basically black magic when all we really have is a pair of opaque LSNs.

Show quoted text

Regards,
Muhammad Ikram

On Sun, Sep 1, 2024 at 9:22 PM Shaheed Haque <shaheedhaque@gmail.com>
wrote:

Since nobody more knowledgeable has replied...

I'm very interested in this area and still surprised that there is no
official/convenient/standard way to approach this (see
/messages/by-id/CAHAc2jdAHvp7tFZBP37awcth=T3h5WXCN9KjZOvuTNJaAAC_hg@mail.gmail.com
).

Based partly on that thread, I ended up with a script that connects to
both ends of the replication, and basically loops while comparing the
counts in each table.

On Fri, 30 Aug 2024, 12:38 Michael Jaskiewicz, <mjaskiewicz@ghx.com>
wrote:

I've got two Postgres 13 databases on AWS RDS.

- One is a master, the other a slave using logical replication.
- Replication has fallen behind by about 350Gb.
- The slave was maxed out in terms of CPU for the past four days
because of some jobs that were ongoing so I'm not sure what logical
replication was able to replicate during that time.
- I killed those jobs and now CPU on the master and slave are both
low.
- I look at the subscriber via `select * from pg_stat_subscription;`
and see that latest_end_lsn is advancing albeit very slowly.
- The publisher says write/flush/replay lags are all 13 minutes
behind but it's been like that for most of the day.
- I see no errors in the logs on either the publisher or subscriber
outside of some simple SQL errors that users have been making.
- CloudWatch reports low CPU utilization, low I/O, and low network.

Is there anything I can do here? Previously I set wal_receiver_timeout
timeout to 0 because I had replication issues, and that helped things. I
wish I had *some* visibility here to get any kind of confidence that
it's going to pull through, but other than these lsn values and database
logs, I'm not sure what to check.

Sincerely,

mj

--
Muhammad Ikram

mmikram@gmail.com

over 1 year ago

In reply to: Shaheed Haque (#4)

Re: Postgres Logical Replication - how to see what subscriber is doing with received data?

Hi Shaheed,
I think you must have already analyzed the outcome of queries
on pg_replication_slots, pg_current_wal_lsn(), pg_stat_subscription etc. I
could find a query SELECT
pg_size_pretty(pg_wal_lsn_diff('<publisher_restart_lsn>',
'<subscriber_replayed_lsn>'));

As a side note if you want to see what has been applied to subscribers vs
what exists on publisher then here is something from my previous
experience. We used to have a Data Validation tool for checking tables/rows
across publisher/subscriber. We also used pg_dump for another tool that was
meant for making copies of schemas.

Regards,
Muhammad Ikram

On Mon, Sep 2, 2024 at 12:42 PM Shaheed Haque <shaheedhaque@gmail.com>
wrote:

Hi Muhammad,

On Mon, 2 Sep 2024, 07:08 Muhammad Ikram, <mmikram@gmail.com> wrote:

Hi Shaheed,

Maybe these considerations could help you or give any hint to the problem
?

Check if wal_receiver_timeout being set to 0 could potentially cause
issues, like not detecting network issues quickly enough. Consider
re-evaluating this setting if you see connection issues.

If you notice that some data is missing on subscriber then could you
increase max_slot_wal_keep_size on publisher so that WALs are not deleted
until they are applied on subscriber.

Do you have flexibility to increase max_worker_processes and
max_logical_replication_workers, work_mem and maintenance_work_mem on
subscriber (In case bottleneck exists on subscriber)

If there's significant lag, consider whether it might be more efficient
to drop the subscription and re-initialize it from scratch using a new base
backup, depending on the data volume and how long it might take for the
existing replication to catch up.

Thanks for the kind hints, I'll certainly look into those.

My main interest however was with the "visibility" question, i.e. to get
an understanding of the gap between the two ends of a replication slot,
ideally in human terms (e.g. tables x records).

I understand the difficulties of trying to produce a meaningful metric
that spans two (or more) systems but let's be honest, trying to diagnose
which knobs to tweak (whether in application, PG, the OS or the network) is
basically black magic when all we really have is a pair of opaque LSNs.

Regards,
Muhammad Ikram

On Sun, Sep 1, 2024 at 9:22 PM Shaheed Haque <shaheedhaque@gmail.com>
wrote:

Since nobody more knowledgeable has replied...

I'm very interested in this area and still surprised that there is no
official/convenient/standard way to approach this (see
/messages/by-id/CAHAc2jdAHvp7tFZBP37awcth=T3h5WXCN9KjZOvuTNJaAAC_hg@mail.gmail.com
).

Based partly on that thread, I ended up with a script that connects to
both ends of the replication, and basically loops while comparing the
counts in each table.

On Fri, 30 Aug 2024, 12:38 Michael Jaskiewicz, <mjaskiewicz@ghx.com>
wrote:

I've got two Postgres 13 databases on AWS RDS.

- One is a master, the other a slave using logical replication.
- Replication has fallen behind by about 350Gb.
- The slave was maxed out in terms of CPU for the past four days
because of some jobs that were ongoing so I'm not sure what logical
replication was able to replicate during that time.
- I killed those jobs and now CPU on the master and slave are both
low.
- I look at the subscriber via `select * from
pg_stat_subscription;` and see that latest_end_lsn is advancing albeit very
slowly.
- The publisher says write/flush/replay lags are all 13 minutes
behind but it's been like that for most of the day.
- I see no errors in the logs on either the publisher or subscriber
outside of some simple SQL errors that users have been making.
- CloudWatch reports low CPU utilization, low I/O, and low network.

Is there anything I can do here? Previously I set wal_receiver_timeout
timeout to 0 because I had replication issues, and that helped things. I
wish I had *some* visibility here to get any kind of confidence that
it's going to pull through, but other than these lsn values and database
logs, I'm not sure what to check.

Sincerely,

mj

--
Muhammad Ikram

--
Muhammad Ikram

shaheedhaque@gmail.com

over 1 year ago

In reply to: Muhammad Ikram (#5)

Re: Postgres Logical Replication - how to see what subscriber is doing with received data?

Hi Muhammad,

On Mon, 2 Sep 2024, 09:45 Muhammad Ikram, <mmikram@gmail.com> wrote:

Hi Shaheed,
I think you must have already analyzed the outcome of queries
on pg_replication_slots, pg_current_wal_lsn(), pg_stat_subscription etc. I
could find a query SELECT
pg_size_pretty(pg_wal_lsn_diff('<publisher_restart_lsn>',
'<subscriber_replayed_lsn>'));

Yes. My point is that it is hard to go from byte numbers to table entries.

Aps a side note if you want to see what has been applied to subscribers vs

what exists on publisher then here is something from my previous
experience. We used to have a Data Validation tool for checking tables/rows
across publisher/subscriber.

Ack. That's pretty much what I had to build.

We also used pg_dump for another tool that was meant for making copies of

schemas.

I'm somewhat fortunate to have a simple use case where all I am doing is a
copy of the "old" deployment to a "new" deployment such that when the two
ends are in close sync, I can freeze traffic to the old deployment, pause
for any final catchup, and then run a Django migration on the new, before
switching on the new (thereby minimising the down time for the app).

What I found by just looking at LSN numbers was that the database LSN were
close but NOT the same. Once I built the tool, I was able to see which
tables were still in play, and saw that some previously overlooked
background timers were expiring, causing the activity.

Net result: the LSNs can tell you if you are not in sync, but not the
reason why. (Again, I understand that row counts worked for me, but might
not work for others).

Thanks for your kind help and pointers!

Regards,

Show quoted text

Muhammad Ikram

On Mon, Sep 2, 2024 at 12:42 PM Shaheed Haque <shaheedhaque@gmail.com>
wrote:

Hi Muhammad,

On Mon, 2 Sep 2024, 07:08 Muhammad Ikram, <mmikram@gmail.com> wrote:

Hi Shaheed,

Maybe these considerations could help you or give any hint to the
problem ?

Check if wal_receiver_timeout being set to 0 could potentially cause
issues, like not detecting network issues quickly enough. Consider
re-evaluating this setting if you see connection issues.

If you notice that some data is missing on subscriber then could you
increase max_slot_wal_keep_size on publisher so that WALs are not deleted
until they are applied on subscriber.

Do you have flexibility to increase max_worker_processes and
max_logical_replication_workers, work_mem and maintenance_work_mem on
subscriber (In case bottleneck exists on subscriber)

If there's significant lag, consider whether it might be more efficient
to drop the subscription and re-initialize it from scratch using a new base
backup, depending on the data volume and how long it might take for the
existing replication to catch up.

Thanks for the kind hints, I'll certainly look into those.

My main interest however was with the "visibility" question, i.e. to get
an understanding of the gap between the two ends of a replication slot,
ideally in human terms (e.g. tables x records).

I understand the difficulties of trying to produce a meaningful metric
that spans two (or more) systems but let's be honest, trying to diagnose
which knobs to tweak (whether in application, PG, the OS or the network) is
basically black magic when all we really have is a pair of opaque LSNs.

Regards,
Muhammad Ikram

On Sun, Sep 1, 2024 at 9:22 PM Shaheed Haque <shaheedhaque@gmail.com>
wrote:

Since nobody more knowledgeable has replied...

I'm very interested in this area and still surprised that there is no
official/convenient/standard way to approach this (see
/messages/by-id/CAHAc2jdAHvp7tFZBP37awcth=T3h5WXCN9KjZOvuTNJaAAC_hg@mail.gmail.com
).

Based partly on that thread, I ended up with a script that connects to
both ends of the replication, and basically loops while comparing the
counts in each table.

On Fri, 30 Aug 2024, 12:38 Michael Jaskiewicz, <mjaskiewicz@ghx.com>
wrote:

I've got two Postgres 13 databases on AWS RDS.

- One is a master, the other a slave using logical replication.
- Replication has fallen behind by about 350Gb.
- The slave was maxed out in terms of CPU for the past four days
because of some jobs that were ongoing so I'm not sure what logical
replication was able to replicate during that time.
- I killed those jobs and now CPU on the master and slave are both
low.
- I look at the subscriber via `select * from
pg_stat_subscription;` and see that latest_end_lsn is advancing albeit very
slowly.
- The publisher says write/flush/replay lags are all 13 minutes
behind but it's been like that for most of the day.
- I see no errors in the logs on either the publisher or
subscriber outside of some simple SQL errors that users have been making.
- CloudWatch reports low CPU utilization, low I/O, and low
network.

Is there anything I can do here? Previously I set wal_receiver_timeout
timeout to 0 because I had replication issues, and that helped things. I
wish I had *some* visibility here to get any kind of confidence that
it's going to pull through, but other than these lsn values and database
logs, I'm not sure what to check.

Sincerely,

mj

--
Muhammad Ikram

--
Muhammad Ikram