Logical Replication Delay
Hi Team,
We have configured bidirectional replication (but traffic can only flow in
one direction) between two data centers (distance: 1000 km, maximum network
latency: 100 ms), with an application peak of 700 TPS (transactions per
second).
We can handle up to 500 TPS without observing any lag between the two data
centers. However, when TPS increases beyond that, we see a WAL backlog of
over 100 GB (initially it was 1 TB, but after tuning it was reduced to
100 GB). During peak times, WAL is generated at a rate of 4 GB per minute.
All transactions (Tx) take less than 200 ms, with a maximum of 1 second at
times (no long-running transactions).
*Here are the configured parameters and resources:*
- *OS*: Ubuntu
- *RAM*: 376 GB
- *CPU*: 64 cores
- *Swap*: 32 GB
- *PostgreSQL Version*: 16.4 (each side has 3 nodes with Patroni and
etcd configured)
- *DB Size*: 15 TB
*Parameters configured on both sides:*
Name                                          Setting      Unit
log_replication_commands                      off
logical_decoding_work_mem                     524288       kB
max_logical_replication_workers               16
max_parallel_apply_workers_per_subscription   2
max_replication_slots                         20
max_sync_workers_per_subscription             2
max_wal_senders                               20
max_worker_processes                          40
wal_level                                     logical
wal_receiver_timeout                          600000       ms
wal_segment_size                              1073741824   B
wal_sender_timeout                            600000       ms
*Optimizations applied:*
1. Vacuum freeze is managed during off-hours; no aggressive vacuum is
triggered during business hours.
2. Converted a few tables to unlogged.
3. Removed unwanted tables from publication.
4. Partitioned all large tables.
*Pending:*
1. Turning off/tuning autovacuum parameters to avoid triggering during
business hours.
*Not possible:* We are running all tables in a single publication, and it
is not possible to separate them.
I would greatly appreciate any suggestions you may have to help avoid
logical replication delays, whether through tuning database or operating
system parameters, or any other recommendations.
--
Thanks & Regards,
Ram.
Hi Ramakrishna,
4 GB of WAL generated per minute is a lot. I would expect the replay on the
subscriber to lag behind because it is a single process. PostgreSQL 16 can
create parallel apply workers for large transactions; however, if there is a
flood of small transactions touching many tables, the single logical
replication (LR) apply worker is going to fall behind.
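For what it's worth, those parallel workers only help with large transactions
that are streamed while still in progress; enabling that on an existing
subscription looks roughly like this (the subscription name is made up):

    -- Hypothetical subscription name. streaming = parallel (PostgreSQL 16+)
    -- lets large in-progress transactions be applied by parallel workers,
    -- bounded by max_parallel_apply_workers_per_subscription.
    ALTER SUBSCRIPTION dc2_sub SET (streaming = parallel);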
The only option is to split the LR setup into multiple publications and
subscriptions, as a single worker can't keep up.
What is the justification for not splitting the tables across multiple
publications and subscriptions?
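What I have in mind is roughly the sketch below; every table, publication,
and connection name in it is hypothetical:

    -- Give write-heavy tables their own publication so they no longer
    -- queue behind everything else.
    CREATE PUBLICATION pub_hot FOR TABLE orders, order_items;
    CREATE PUBLICATION pub_rest FOR TABLE customers, invoices, audit_log;

    -- On the subscriber: each subscription gets its own apply worker.
    CREATE SUBSCRIPTION sub_hot
        CONNECTION 'host=dc1.example dbname=appdb user=replicator'
        PUBLICATION pub_hot
        WITH (copy_data = false);   -- assumes the tables are already in sync
    CREATE SUBSCRIPTION sub_rest
        CONNECTION 'host=dc1.example dbname=appdb user=replicator'
        PUBLICATION pub_rest
        WITH (copy_data = false);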
Additional items to check:
Make sure the subscriber is using binary mode; this avoids an encoding step.
https://www.postgresql.org/docs/current/sql-createsubscription.html
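On an existing subscription that is roughly (again, the subscription name is
made up):

    -- Binary transfer skips the text encoding/decoding step on both ends
    -- (supported since PostgreSQL 14).
    ALTER SUBSCRIPTION dc2_sub SET (binary = true);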
Avoid the use of REPLICA IDENTITY FULL on the publisher. If you do use
IDENTITY FULL, make sure the subscriber table's replica identity is set to a
qualifying unique index (in previous versions of PG the publisher and
subscriber identities had to match).
REPLICA IDENTITY FULL increases the size of the WAL and the work the
publisher and subscriber have to do.
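A quick way to check and fix this; the table and index names below are
hypothetical:

    -- Which tables are set to REPLICA IDENTITY FULL ('f')?
    SELECT relname, relreplident
    FROM pg_class
    WHERE relreplident = 'f' AND relkind IN ('r', 'p');

    -- Point the identity at a qualifying unique index instead.
    ALTER TABLE orders REPLICA IDENTITY USING INDEX orders_order_no_key;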
Hope this helps.
Hi Justin,
Thank you for your suggestions and detailed insights.
Due to certain business constraints, we are unable to split the tables into
parallel publications. All of the tables involved are primary key tables,
which adds further complexity to separating them into multiple publications.
That said, we truly appreciate your recommendations regarding the use of
binary mode and reviewing the use of REPLICA IDENTITY FULL. We will ensure
that the subscriber is operating in binary mode and will recheck the
identity setup to minimize WAL size and overhead.
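As a first check on our side, we plan to run something like the following
(a sketch; catalog queries only):

    -- Is the subscription applying in binary mode (subbinary = t)?
    SELECT subname, subbinary, substream FROM pg_subscription;

    -- Any tables still on REPLICA IDENTITY FULL?
    SELECT relname FROM pg_class WHERE relreplident = 'f';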
Regards,
Ram.
Hi Ramakrishna,
I am not following the reasoning for not separating the tables into
different publications and subscriptions. I set up logical replication all
the time in many different environments; one of the audits I perform before
deploying LR is looking at pg_stat_all_tables and the WAL creation rate. If
either is above a certain threshold, the tables are split up into different
publications and subscriptions.
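The audit itself is nothing fancy; a rough sketch:

    -- Rank tables by write activity since the last stats reset; the top few
    -- are candidates for their own publication/subscription.
    SELECT relname, n_tup_ins + n_tup_upd + n_tup_del AS writes
    FROM pg_stat_all_tables
    ORDER BY writes DESC
    LIMIT 20;

    -- WAL volume since the last stats reset (PostgreSQL 14+); sample it
    -- twice a few minutes apart to get a rate.
    SELECT wal_bytes, stats_reset FROM pg_stat_wal;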
If the concern is FK constraints being violated: LR does not validate those,
so it's not a concern.
If the concern is triggers, a trigger has to be set to ALWAYS or REPLICA for
LR to execute it. I can see one LR worker getting ahead of other workers and
trying to reference rows that do not exist yet; that can be dealt with by
making sure the tables a trigger references are in the same publication and
subscription.
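For reference, marking such a trigger looks like this (table and trigger
names are hypothetical); by default, triggers do not fire for rows applied
by the LR worker:

    -- Fires for both local and replicated changes.
    ALTER TABLE orders ENABLE ALWAYS TRIGGER trg_orders_audit;
    -- Fires only while changes are being applied by replication.
    ALTER TABLE orders ENABLE REPLICA TRIGGER trg_orders_queue;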
Thanks
Justin
On Sat, Sep 21, 2024 at 3:08 PM Ramakrishna m <ram.pgdb@gmail.com> wrote:
I would greatly appreciate any suggestions you may have to help avoid
logical replication delays, whether through tuning database or operating
system parameters, or any other recommendations
In addition to the things already answered:
* What is the use case for logical replication? I assume your local
replicas are able to keep up just fine.
* Check the nature of the work for problems, e.g. ORM doing
unnecessary/redundant updates, maintaining indexes that are not really
needed
* Looks like your wal_segment_size was boosted to 1GB. What drove that
change?
* Yes, autovacuum could affect things - make sure
log_autovacuum_min_duration is set (a minimal sketch follows below)
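Something like this (the threshold is just an example) will get every slow
autovacuum/autoanalyze run into the logs:

    -- Log any autovacuum/autoanalyze run that takes longer than 10 seconds.
    ALTER SYSTEM SET log_autovacuum_min_duration = '10s';
    SELECT pg_reload_conf();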
Cheers,
Greg
Hi Justin and Greg,
Thank you for your input and recommendations. We understand your point
regarding separating the tables into different publications and
subscriptions.
However, due to certain business constraints, we are unable to implement
this approach at the moment.
*We are planning to set up logical replication from a standby to another
server. When the primary goes down, there is no issue as the standby
becomes the primary and the logical slots are already present. However,
when the standby goes down, these slots are not copied to the third node or
the primary by Patroni. Is there an option available to handle this
scenario? *
Regards,
Ram.
Hi Ram,
On 29 Sep 2024, at 12:29, Ramakrishna m <ram.pgdb@gmail.com> wrote:
*We are planning to set up logical replication from a standby to another
server. When the primary goes down, there is no issue as the standby
becomes the primary and the logical slots are already present. However,
when the standby goes down, these slots are not copied to the third node or
the primary by Patroni. Is there an option available to handle this
scenario? *
You could take a look at the pg_failover_slots extension (
https://www.enterprisedb.com/docs/pg_extensions/pg_failover_slots/); it is
aimed exactly at cloning the slot information to a standby.
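A minimal sketch of enabling it, assuming the extension packages are already
installed on every node (slot-filtering parameters are described in the
documentation linked above):

    -- Requires a server restart; append to any existing
    -- shared_preload_libraries value rather than replacing it.
    ALTER SYSTEM SET shared_preload_libraries = 'pg_failover_slots';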
Best,
giovanni