WAL accumulating, Logical Replication pg 13

Started by Willy-Bas Loosalmost 5 years ago11 messagesgeneral

willybas@gmail.com

almost 5 years ago

Hi , I'm upgrading a 1.5TB database from postgres 9.3 to postgres 13 on
Debian 10. This is now in an Acceptance stage (DTAP). I have encountered a
problem: the WAL is not being deleted. I now have 1.4 TB of WAL in pg_wal
and my disks are getting full. The oldest WAL file is 18 days old.
I use Logical Replication from the new cluster to another new cluster with
1 subscriber and 1 subscription.

I've had problems with diskspace on this server, with postgres crashing
because of it, then added more diskspace and postgres recovered. This
doesn't seem to be a problem now.

The *publication* has the options publish = 'insert, update, delete,
truncate', publish_via_partition_root = false
The *subscription* has the options connect = true, enabled = true,
create_slot = false, slot_name = my_pub1, synchronous_commit = 'off'

The log on the publisher says:
2021-05-25 21:25:18.973 CEST [4584] user@db LOG: starting logical decoding
for slot "my_pub1"
2021-05-25 21:25:18.973 CEST [4584] user@db DETAIL: Streaming transactions
committing after D6/A82B5FE0, reading WAL from D4/908BC268.
2021-05-25 21:25:18.973 CEST [4584] user@db LOG: logical decoding found
consistent point at D4/908BC268
2021-05-25 21:25:18.973 CEST [4584] user@db DETAIL: There are no running
transactions.
2021-05-25 21:29:49.456 CEST [4614] user@db ERROR: replication slot
"my_pub1" is active for PID 4584
2021-05-25 21:29:54.474 CEST [4615] user@db ERROR: replication slot
"my_pub1" is active for PID 4584

And on the subscriber:
2021-05-28 21:23:46.702 CEST [40039] LOG: logical replication apply worker
for subscription "my_pub1" has started
2021-05-28 21:23:46.712 CEST [40039] ERROR: could not start WAL streaming:
ERROR: replication slot "my_pub1" is active for PID 730
2021-05-28 21:23:46.714 CEST [19794] LOG: background worker "logical
replication worker" (PID 40039) exited with exit code 1

The postgres settings on the *publisher* are:
max_connections = 100 # (change requires restart)
tcp_keepalives_idle = 120 # TCP_KEEPIDLE, in seconds;
shared_buffers = 50GB # min 128kB
work_mem = 1GB # min 64kB
maintenance_work_mem = 10GB # min 1MB
logical_decoding_work_mem = 5GB # min 64kB
dynamic_shared_memory_type = posix # the default is the first option
max_worker_processes = 20 # (change requires restart)
max_parallel_maintenance_workers = 10 # taken from max_parallel_workers
max_parallel_workers_per_gather = 5 # taken from max_parallel_workers
max_parallel_workers = 15 # maximum number of max_worker_processes that
wal_level = logical # minimal, replica, or logical
max_wal_size = 1GB
min_wal_size = 80MB
#archive_mode = off
max_wal_senders = 10 # max number of walsender processes
wal_sender_timeout = 1min # in milliseconds; 0 disables
max_replication_slots = 7 # max number of replication slots

On postgres settings on the *subscriber*:
max_connections = 100 # (change requires restart)
tcp_keepalives_idle = 120 # TCP_KEEPIDLE, in seconds;
shared_buffers = 25GB # min 128kB
work_mem = 1GB # min 64kB
maintenance_work_mem = 10GB # min 1MB
logical_decoding_work_mem = 5GB # min 64kB
dynamic_shared_memory_type = posix # the default is the first option
max_worker_processes = 20 # (change requires restart)
max_parallel_maintenance_workers = 10 # taken from max_parallel_workers
max_parallel_workers_per_gather = 5 # taken from max_parallel_workers
max_parallel_workers = 15 # maximum number of max_worker_processes that
wal_level = logical # minimal, replica, or logical
max_wal_size = 3GB
min_wal_size = 80MB
#archive_mode = off
wal_receiver_timeout = 1min # time that receiver waits for
max_logical_replication_workers = 10 # taken from max_worker_processes
max_sync_workers_per_subscription = 5 # taken from
max_logical_replication_workers

I've tried increasing wal_sender_timeout and wal_receiver_timeout to 10
minutes each, but this had no positive effect.

Some advice would be helpful
--
Willy-Bas Loos

Vijaykumar Jain

vijaykumarjain.github@gmail.com

almost 5 years ago

In reply to: Willy-Bas Loos (#1)

Re: WAL accumulating, Logical Replication pg 13

I am not too sure with 9.3
i tried an upgrade from 9.6 to 11 using logical replication (pg_logical
extension)

one thing to note.
logical replication initiates a copy from a snapshot, then changes from
then on.

I had a very high insert rate on my source tables (v9.6) and the
destination (v11) could not keep up (it had tons of indexes when I copied
the schema) and it took around a day as the table had around 12 indexes.

So at the destination(v11), I dropped all but the primary index for each
table, started subscription and when it was almost caught up, rebuilt the
index on the destination concurrently.
it completed in 4-5 hours without stopping the source.
migration completed in a few mins :)

not sure if this would help, but just FYI.

On Sat, 29 May 2021 at 01:36, Willy-Bas Loos <willybas@gmail.com> wrote:

Hi , I'm upgrading a 1.5TB database from postgres 9.3 to postgres 13 on
Debian 10. This is now in an Acceptance stage (DTAP). I have encountered a
problem: the WAL is not being deleted. I now have 1.4 TB of WAL in pg_wal
and my disks are getting full. The oldest WAL file is 18 days old.
I use Logical Replication from the new cluster to another new cluster with
1 subscriber and 1 subscription.

pg_stat_subscription tells me all recent timestamps.
and this:
db=# select * from pg_replication_slots;
-[ RECORD 1 ]-------+-------------
slot_name | my_pub1
plugin | pgoutput
slot_type | logical
datoid | 16401
database | db
temporary | f
active | t
active_pid | 9480
xmin |
catalog_xmin | 269168
restart_lsn | D4/908BC268
confirmed_flush_lsn | E1/25BF5710
wal_status | extended
safe_wal_size |

I've had problems with diskspace on this server, with postgres crashing
because of it, then added more diskspace and postgres recovered. This
doesn't seem to be a problem now.

The *publication* has the options publish = 'insert, update, delete,
truncate', publish_via_partition_root = false
The *subscription* has the options connect = true, enabled = true,
create_slot = false, slot_name = my_pub1, synchronous_commit = 'off'

The log on the publisher says:
2021-05-25 21:25:18.973 CEST [4584] user@db LOG: starting logical
decoding for slot "my_pub1"
2021-05-25 21:25:18.973 CEST [4584] user@db DETAIL: Streaming
transactions committing after D6/A82B5FE0, reading WAL from D4/908BC268.
2021-05-25 21:25:18.973 CEST [4584] user@db LOG: logical decoding found
consistent point at D4/908BC268
2021-05-25 21:25:18.973 CEST [4584] user@db DETAIL: There are no running
transactions.
2021-05-25 21:29:49.456 CEST [4614] user@db ERROR: replication slot
"my_pub1" is active for PID 4584
2021-05-25 21:29:54.474 CEST [4615] user@db ERROR: replication slot
"my_pub1" is active for PID 4584

And on the subscriber:
2021-05-28 21:23:46.702 CEST [40039] LOG: logical replication apply
worker for subscription "my_pub1" has started
2021-05-28 21:23:46.712 CEST [40039] ERROR: could not start WAL
streaming: ERROR: replication slot "my_pub1" is active for PID 730
2021-05-28 21:23:46.714 CEST [19794] LOG: background worker "logical
replication worker" (PID 40039) exited with exit code 1

The postgres settings on the *publisher* are:
max_connections = 100 # (change requires restart)
tcp_keepalives_idle = 120 # TCP_KEEPIDLE, in seconds;
shared_buffers = 50GB # min 128kB
work_mem = 1GB # min 64kB
maintenance_work_mem = 10GB # min 1MB
logical_decoding_work_mem = 5GB # min 64kB
dynamic_shared_memory_type = posix # the default is the first option
max_worker_processes = 20 # (change requires restart)
max_parallel_maintenance_workers = 10 # taken from max_parallel_workers
max_parallel_workers_per_gather = 5 # taken from max_parallel_workers
max_parallel_workers = 15 # maximum number of max_worker_processes that
wal_level = logical # minimal, replica, or logical
max_wal_size = 1GB
min_wal_size = 80MB
#archive_mode = off
max_wal_senders = 10 # max number of walsender processes
wal_sender_timeout = 1min # in milliseconds; 0 disables
max_replication_slots = 7 # max number of replication slots

On postgres settings on the *subscriber*:
max_connections = 100 # (change requires restart)
tcp_keepalives_idle = 120 # TCP_KEEPIDLE, in seconds;
shared_buffers = 25GB # min 128kB
work_mem = 1GB # min 64kB
maintenance_work_mem = 10GB # min 1MB
logical_decoding_work_mem = 5GB # min 64kB
dynamic_shared_memory_type = posix # the default is the first option
max_worker_processes = 20 # (change requires restart)
max_parallel_maintenance_workers = 10 # taken from max_parallel_workers
max_parallel_workers_per_gather = 5 # taken from max_parallel_workers
max_parallel_workers = 15 # maximum number of max_worker_processes that
wal_level = logical # minimal, replica, or logical
max_wal_size = 3GB
min_wal_size = 80MB
#archive_mode = off
wal_receiver_timeout = 1min # time that receiver waits for
max_logical_replication_workers = 10 # taken from max_worker_processes
max_sync_workers_per_subscription = 5 # taken from
max_logical_replication_workers

I've tried increasing wal_sender_timeout and wal_receiver_timeout to 10
minutes each, but this had no positive effect.

Some advice would be helpful
--
Willy-Bas Loos

--
Thanks,
Vijay
Mumbai, India

Willy-Bas Loos

willybas@gmail.com

almost 5 years ago

In reply to: Vijaykumar Jain (#2)

Re: WAL accumulating, Logical Replication pg 13

Yeah, indexes could slow things down, thanks. Btw I'm not using logical
replication for the upgrade, that's not supported for 9.3.
It was more complicated but that's beside the point.

I could just delete the publication and all that belongs to it and start
over. But since I'm trying out logical replication, I would like to be more
in control than that. It's there anything that I can dig into to find out
why the WAL is accumulating?

Op vr 28 mei 2021 22:20 schreef Vijaykumar Jain <
vijaykumarjain.github@gmail.com>:

Show quoted text

I am not too sure with 9.3
i tried an upgrade from 9.6 to 11 using logical replication (pg_logical
extension)

one thing to note.
logical replication initiates a copy from a snapshot, then changes from
then on.

I had a very high insert rate on my source tables (v9.6) and the
destination (v11) could not keep up (it had tons of indexes when I copied
the schema) and it took around a day as the table had around 12 indexes.

So at the destination(v11), I dropped all but the primary index for each
table, started subscription and when it was almost caught up, rebuilt the
index on the destination concurrently.
it completed in 4-5 hours without stopping the source.
migration completed in a few mins :)

not sure if this would help, but just FYI.

On Sat, 29 May 2021 at 01:36, Willy-Bas Loos <willybas@gmail.com> wrote:

Hi , I'm upgrading a 1.5TB database from postgres 9.3 to postgres 13 on
Debian 10. This is now in an Acceptance stage (DTAP). I have encountered a
problem: the WAL is not being deleted. I now have 1.4 TB of WAL in pg_wal
and my disks are getting full. The oldest WAL file is 18 days old.
I use Logical Replication from the new cluster to another new cluster
with 1 subscriber and 1 subscription.

pg_stat_subscription tells me all recent timestamps.
and this:
db=# select * from pg_replication_slots;
-[ RECORD 1 ]-------+-------------
slot_name | my_pub1
plugin | pgoutput
slot_type | logical
datoid | 16401
database | db
temporary | f
active | t
active_pid | 9480
xmin |
catalog_xmin | 269168
restart_lsn | D4/908BC268
confirmed_flush_lsn | E1/25BF5710
wal_status | extended
safe_wal_size |

I've had problems with diskspace on this server, with postgres crashing
because of it, then added more diskspace and postgres recovered. This
doesn't seem to be a problem now.

The *publication* has the options publish = 'insert, update, delete,
truncate', publish_via_partition_root = false
The *subscription* has the options connect = true, enabled = true,
create_slot = false, slot_name = my_pub1, synchronous_commit = 'off'

The log on the publisher says:
2021-05-25 21:25:18.973 CEST [4584] user@db LOG: starting logical
decoding for slot "my_pub1"
2021-05-25 21:25:18.973 CEST [4584] user@db DETAIL: Streaming
transactions committing after D6/A82B5FE0, reading WAL from D4/908BC268.
2021-05-25 21:25:18.973 CEST [4584] user@db LOG: logical decoding found
consistent point at D4/908BC268
2021-05-25 21:25:18.973 CEST [4584] user@db DETAIL: There are no
running transactions.
2021-05-25 21:29:49.456 CEST [4614] user@db ERROR: replication slot
"my_pub1" is active for PID 4584
2021-05-25 21:29:54.474 CEST [4615] user@db ERROR: replication slot
"my_pub1" is active for PID 4584

And on the subscriber:
2021-05-28 21:23:46.702 CEST [40039] LOG: logical replication apply
worker for subscription "my_pub1" has started
2021-05-28 21:23:46.712 CEST [40039] ERROR: could not start WAL
streaming: ERROR: replication slot "my_pub1" is active for PID 730
2021-05-28 21:23:46.714 CEST [19794] LOG: background worker "logical
replication worker" (PID 40039) exited with exit code 1

The postgres settings on the *publisher* are:
max_connections = 100 # (change requires restart)
tcp_keepalives_idle = 120 # TCP_KEEPIDLE, in seconds;
shared_buffers = 50GB # min 128kB
work_mem = 1GB # min 64kB
maintenance_work_mem = 10GB # min 1MB
logical_decoding_work_mem = 5GB # min 64kB
dynamic_shared_memory_type = posix # the default is the first option
max_worker_processes = 20 # (change requires restart)
max_parallel_maintenance_workers = 10 # taken from max_parallel_workers
max_parallel_workers_per_gather = 5 # taken from max_parallel_workers
max_parallel_workers = 15 # maximum number of max_worker_processes that
wal_level = logical # minimal, replica, or logical
max_wal_size = 1GB
min_wal_size = 80MB
#archive_mode = off
max_wal_senders = 10 # max number of walsender processes
wal_sender_timeout = 1min # in milliseconds; 0 disables
max_replication_slots = 7 # max number of replication slots

On postgres settings on the *subscriber*:
max_connections = 100 # (change requires restart)
tcp_keepalives_idle = 120 # TCP_KEEPIDLE, in seconds;
shared_buffers = 25GB # min 128kB
work_mem = 1GB # min 64kB
maintenance_work_mem = 10GB # min 1MB
logical_decoding_work_mem = 5GB # min 64kB
dynamic_shared_memory_type = posix # the default is the first option
max_worker_processes = 20 # (change requires restart)
max_parallel_maintenance_workers = 10 # taken from max_parallel_workers
max_parallel_workers_per_gather = 5 # taken from max_parallel_workers
max_parallel_workers = 15 # maximum number of max_worker_processes that
wal_level = logical # minimal, replica, or logical
max_wal_size = 3GB
min_wal_size = 80MB
#archive_mode = off
wal_receiver_timeout = 1min # time that receiver waits for
max_logical_replication_workers = 10 # taken from max_worker_processes
max_sync_workers_per_subscription = 5 # taken from
max_logical_replication_workers

I've tried increasing wal_sender_timeout and wal_receiver_timeout to 10
minutes each, but this had no positive effect.

Some advice would be helpful
--
Willy-Bas Loos

--
Thanks,
Vijay
Mumbai, India

Vijaykumar Jain

vijaykumarjain.github@gmail.com

almost 5 years ago

In reply to: Willy-Bas Loos (#3)

Re: WAL accumulating, Logical Replication pg 13

WAL can be built up for reasons like
1) if you have an inactive replication slot. I mean you had a streaming
replica which was turned off, but you did not remote the slot from primary.
2) Do you have archiving enabled? Are the archiving commands running fine ?
if just the archiving is broken, then you can manually run archive cleanup
provided, replication is all caught up fine.

3) logical replication can be broken for multiple reasons, like conflicts
where the subscriber already has the data which primary wants to push. it
will not proceed until the conflicts are resolved.
4) poor connectivity or the computer/network resources not able to keep up
with the load, can result in WAL pile up.

there are many blogs around logical replication issues, but when it was new
in pg10, I read this.
Recovery use cases for Logical Replication in PostgreSQL 10 | by Konstantin
Evteev | AvitoTech | Medium
<https://medium.com/avitotech/recovery-use-cases-for-logical-replication-in-postgresql-10-a1e6bab03072>

btw,
how are you doing logical replication with 9.3 ? using a pglogical
extension ?
I can try many things, but it would be wrong to make assumptions since i
did not work with 9.3
for ex.
Bug fix: Using ExecCopySlot during multi insert by bdrouvotAWS · Pull
Request #295 · 2ndQuadrant/pglogical (github.com)
<https://github.com/2ndQuadrant/pglogical/pull/295>
there are many issues posted here that may be relevant to your setup.

On Sat, 29 May 2021 at 19:22, Willy-Bas Loos <willybas@gmail.com> wrote:

Yeah, indexes could slow things down, thanks. Btw I'm not using logical
replication for the upgrade, that's not supported for 9.3.
It was more complicated but that's beside the point.

I could just delete the publication and all that belongs to it and start
over. But since I'm trying out logical replication, I would like to be more
in control than that. It's there anything that I can dig into to find out
why the WAL is accumulating?

Op vr 28 mei 2021 22:20 schreef Vijaykumar Jain <
vijaykumarjain.github@gmail.com>:

I am not too sure with 9.3
i tried an upgrade from 9.6 to 11 using logical replication (pg_logical
extension)

one thing to note.
logical replication initiates a copy from a snapshot, then changes from
then on.

I had a very high insert rate on my source tables (v9.6) and the
destination (v11) could not keep up (it had tons of indexes when I copied
the schema) and it took around a day as the table had around 12 indexes.

So at the destination(v11), I dropped all but the primary index for each
table, started subscription and when it was almost caught up, rebuilt the
index on the destination concurrently.
it completed in 4-5 hours without stopping the source.
migration completed in a few mins :)

not sure if this would help, but just FYI.

On Sat, 29 May 2021 at 01:36, Willy-Bas Loos <willybas@gmail.com> wrote:

Hi , I'm upgrading a 1.5TB database from postgres 9.3 to postgres 13 on
Debian 10. This is now in an Acceptance stage (DTAP). I have encountered a
problem: the WAL is not being deleted. I now have 1.4 TB of WAL in pg_wal
and my disks are getting full. The oldest WAL file is 18 days old.
I use Logical Replication from the new cluster to another new cluster
with 1 subscriber and 1 subscription.

pg_stat_subscription tells me all recent timestamps.
and this:
db=# select * from pg_replication_slots;
-[ RECORD 1 ]-------+-------------
slot_name | my_pub1
plugin | pgoutput
slot_type | logical
datoid | 16401
database | db
temporary | f
active | t
active_pid | 9480
xmin |
catalog_xmin | 269168
restart_lsn | D4/908BC268
confirmed_flush_lsn | E1/25BF5710
wal_status | extended
safe_wal_size |

I've had problems with diskspace on this server, with postgres crashing
because of it, then added more diskspace and postgres recovered. This
doesn't seem to be a problem now.

The *publication* has the options publish = 'insert, update, delete,
truncate', publish_via_partition_root = false
The *subscription* has the options connect = true, enabled = true,
create_slot = false, slot_name = my_pub1, synchronous_commit = 'off'

The log on the publisher says:
2021-05-25 21:25:18.973 CEST [4584] user@db LOG: starting logical
decoding for slot "my_pub1"
2021-05-25 21:25:18.973 CEST [4584] user@db DETAIL: Streaming
transactions committing after D6/A82B5FE0, reading WAL from D4/908BC268.
2021-05-25 21:25:18.973 CEST [4584] user@db LOG: logical decoding
found consistent point at D4/908BC268
2021-05-25 21:25:18.973 CEST [4584] user@db DETAIL: There are no
running transactions.
2021-05-25 21:29:49.456 CEST [4614] user@db ERROR: replication slot
"my_pub1" is active for PID 4584
2021-05-25 21:29:54.474 CEST [4615] user@db ERROR: replication slot
"my_pub1" is active for PID 4584

And on the subscriber:
2021-05-28 21:23:46.702 CEST [40039] LOG: logical replication apply
worker for subscription "my_pub1" has started
2021-05-28 21:23:46.712 CEST [40039] ERROR: could not start WAL
streaming: ERROR: replication slot "my_pub1" is active for PID 730
2021-05-28 21:23:46.714 CEST [19794] LOG: background worker "logical
replication worker" (PID 40039) exited with exit code 1

The postgres settings on the *publisher* are:
max_connections = 100 # (change requires restart)
tcp_keepalives_idle = 120 # TCP_KEEPIDLE, in seconds;
shared_buffers = 50GB # min 128kB
work_mem = 1GB # min 64kB
maintenance_work_mem = 10GB # min 1MB
logical_decoding_work_mem = 5GB # min 64kB
dynamic_shared_memory_type = posix # the default is the first option
max_worker_processes = 20 # (change requires restart)
max_parallel_maintenance_workers = 10 # taken from max_parallel_workers
max_parallel_workers_per_gather = 5 # taken from max_parallel_workers
max_parallel_workers = 15 # maximum number of max_worker_processes that
wal_level = logical # minimal, replica, or logical
max_wal_size = 1GB
min_wal_size = 80MB
#archive_mode = off
max_wal_senders = 10 # max number of walsender processes
wal_sender_timeout = 1min # in milliseconds; 0 disables
max_replication_slots = 7 # max number of replication slots

On postgres settings on the *subscriber*:
max_connections = 100 # (change requires restart)
tcp_keepalives_idle = 120 # TCP_KEEPIDLE, in seconds;
shared_buffers = 25GB # min 128kB
work_mem = 1GB # min 64kB
maintenance_work_mem = 10GB # min 1MB
logical_decoding_work_mem = 5GB # min 64kB
dynamic_shared_memory_type = posix # the default is the first option
max_worker_processes = 20 # (change requires restart)
max_parallel_maintenance_workers = 10 # taken from max_parallel_workers
max_parallel_workers_per_gather = 5 # taken from max_parallel_workers
max_parallel_workers = 15 # maximum number of max_worker_processes that
wal_level = logical # minimal, replica, or logical
max_wal_size = 3GB
min_wal_size = 80MB
#archive_mode = off
wal_receiver_timeout = 1min # time that receiver waits for
max_logical_replication_workers = 10 # taken from max_worker_processes
max_sync_workers_per_subscription = 5 # taken from
max_logical_replication_workers

I've tried increasing wal_sender_timeout and wal_receiver_timeout to 10
minutes each, but this had no positive effect.

Some advice would be helpful
--
Willy-Bas Loos

--
Thanks,
Vijay
Mumbai, India

--
Thanks,
Vijay
Mumbai, India

Willy-Bas Loos

willybas@gmail.com

almost 5 years ago

In reply to: Vijaykumar Jain (#4)

Re: WAL accumulating, Logical Replication pg 13

Thank you for elaborating those possible causes and for the suggestions you
made.
1) if you have an inactive replication slot.
There is only 1 replication slot and it is active. So that is not the issue.

2) Do you have archiving enabled?
No, i never turned it on and so this is in the settings of both publisher
and subscriber: #archive_mode = off (and show archive_mode; tells me the
same)

3) logical replication can be broken for multiple reasons, like conflicts
where the subscriber already has the data which primary wants to push. it
will not proceed until the conflicts are resolved.
That would have been in the log, but there isn't any conflict in the log.
Only the messages that i posted with my first message.

4) poor connectivity or the computer/network resources not able to keep up
with the load, can result in WAL pile up.
This would be strange since there is a 10Gbps connection within the same
rack. But it could theoretically be malfunctioning or the performance on
the subscriber could be too low.
If any of this is the case, shouldn't that be visible in
pg_stat_subscription ?

Thanks for the article, it's interesting to see how they transitioned from
londiste, even if the article is about pglogical, not logical replication
in the postgres core.
I was using Londiste to transfer the data to the new server and minimize
downtime, so the article might come in handy.
I prepared by reading the documentation, which is very straightforward.

btw,
how are you doing logical replication with 9.3 ? using a pglogical

extension ?
No, I'm not using logical replication in postgres 9.3 . Only on postgres 13.
About the link to the bug reports: Thanks for the suggestion. But first I'd
like to get some better grip on what is going on before searching for bugs.

Still, any help will be much appreciated

On Sat, May 29, 2021 at 5:16 PM Vijaykumar Jain <
vijaykumarjain.github@gmail.com> wrote:

WAL can be built up for reasons like
1) if you have an inactive replication slot. I mean you had a streaming
replica which was turned off, but you did not remote the slot from primary.
2) Do you have archiving enabled? Are the archiving commands running fine
? if just the archiving is broken, then you can manually run
archive cleanup provided, replication is all caught up fine.

3) logical replication can be broken for multiple reasons, like conflicts
where the subscriber already has the data which primary wants to push. it
will not proceed until the conflicts are resolved.
4) poor connectivity or the computer/network resources not able to keep up
with the load, can result in WAL pile up.

there are many blogs around logical replication issues, but when it was
new in pg10, I read this.
Recovery use cases for Logical Replication in PostgreSQL 10 | by
Konstantin Evteev | AvitoTech | Medium
<https://medium.com/avitotech/recovery-use-cases-for-logical-replication-in-postgresql-10-a1e6bab03072>

btw,
how are you doing logical replication with 9.3 ? using a pglogical
extension ?
I can try many things, but it would be wrong to make assumptions since i
did not work with 9.3
for ex.
Bug fix: Using ExecCopySlot during multi insert by bdrouvotAWS · Pull
Request #295 · 2ndQuadrant/pglogical (github.com)
<https://github.com/2ndQuadrant/pglogical/pull/295>
there are many issues posted here that may be relevant to your setup.

On Sat, 29 May 2021 at 19:22, Willy-Bas Loos <willybas@gmail.com> wrote:

Yeah, indexes could slow things down, thanks. Btw I'm not using logical
replication for the upgrade, that's not supported for 9.3.
It was more complicated but that's beside the point.

I could just delete the publication and all that belongs to it and start
over. But since I'm trying out logical replication, I would like to be more
in control than that. It's there anything that I can dig into to find out
why the WAL is accumulating?

Op vr 28 mei 2021 22:20 schreef Vijaykumar Jain <
vijaykumarjain.github@gmail.com>:

I am not too sure with 9.3
i tried an upgrade from 9.6 to 11 using logical replication (pg_logical
extension)

one thing to note.
logical replication initiates a copy from a snapshot, then changes from
then on.

I had a very high insert rate on my source tables (v9.6) and the
destination (v11) could not keep up (it had tons of indexes when I copied
the schema) and it took around a day as the table had around 12 indexes.

So at the destination(v11), I dropped all but the primary index for each
table, started subscription and when it was almost caught up, rebuilt the
index on the destination concurrently.
it completed in 4-5 hours without stopping the source.
migration completed in a few mins :)

not sure if this would help, but just FYI.

On Sat, 29 May 2021 at 01:36, Willy-Bas Loos <willybas@gmail.com> wrote:

Hi , I'm upgrading a 1.5TB database from postgres 9.3 to postgres 13 on
Debian 10. This is now in an Acceptance stage (DTAP). I have encountered a
problem: the WAL is not being deleted. I now have 1.4 TB of WAL in pg_wal
and my disks are getting full. The oldest WAL file is 18 days old.
I use Logical Replication from the new cluster to another new cluster
with 1 subscriber and 1 subscription.

pg_stat_subscription tells me all recent timestamps.
and this:
db=# select * from pg_replication_slots;
-[ RECORD 1 ]-------+-------------
slot_name | my_pub1
plugin | pgoutput
slot_type | logical
datoid | 16401
database | db
temporary | f
active | t
active_pid | 9480
xmin |
catalog_xmin | 269168
restart_lsn | D4/908BC268
confirmed_flush_lsn | E1/25BF5710
wal_status | extended
safe_wal_size |

I've had problems with diskspace on this server, with postgres crashing
because of it, then added more diskspace and postgres recovered. This
doesn't seem to be a problem now.

The *publication* has the options publish = 'insert, update, delete,
truncate', publish_via_partition_root = false
The *subscription* has the options connect = true, enabled = true,
create_slot = false, slot_name = my_pub1, synchronous_commit = 'off'

The log on the publisher says:
2021-05-25 21:25:18.973 CEST [4584] user@db LOG: starting logical
decoding for slot "my_pub1"
2021-05-25 21:25:18.973 CEST [4584] user@db DETAIL: Streaming
transactions committing after D6/A82B5FE0, reading WAL from D4/908BC268.
2021-05-25 21:25:18.973 CEST [4584] user@db LOG: logical decoding
found consistent point at D4/908BC268
2021-05-25 21:25:18.973 CEST [4584] user@db DETAIL: There are no
running transactions.
2021-05-25 21:29:49.456 CEST [4614] user@db ERROR: replication slot
"my_pub1" is active for PID 4584
2021-05-25 21:29:54.474 CEST [4615] user@db ERROR: replication slot
"my_pub1" is active for PID 4584

And on the subscriber:
2021-05-28 21:23:46.702 CEST [40039] LOG: logical replication apply
worker for subscription "my_pub1" has started
2021-05-28 21:23:46.712 CEST [40039] ERROR: could not start WAL
streaming: ERROR: replication slot "my_pub1" is active for PID 730
2021-05-28 21:23:46.714 CEST [19794] LOG: background worker "logical
replication worker" (PID 40039) exited with exit code 1

The postgres settings on the *publisher* are:
max_connections = 100 # (change requires restart)
tcp_keepalives_idle = 120 # TCP_KEEPIDLE, in seconds;
shared_buffers = 50GB # min 128kB
work_mem = 1GB # min 64kB
maintenance_work_mem = 10GB # min 1MB
logical_decoding_work_mem = 5GB # min 64kB
dynamic_shared_memory_type = posix # the default is the first option
max_worker_processes = 20 # (change requires restart)
max_parallel_maintenance_workers = 10 # taken from max_parallel_workers
max_parallel_workers_per_gather = 5 # taken from max_parallel_workers
max_parallel_workers = 15 # maximum number of max_worker_processes that
wal_level = logical # minimal, replica, or logical
max_wal_size = 1GB
min_wal_size = 80MB
#archive_mode = off
max_wal_senders = 10 # max number of walsender processes
wal_sender_timeout = 1min # in milliseconds; 0 disables
max_replication_slots = 7 # max number of replication slots

On postgres settings on the *subscriber*:
max_connections = 100 # (change requires restart)
tcp_keepalives_idle = 120 # TCP_KEEPIDLE, in seconds;
shared_buffers = 25GB # min 128kB
work_mem = 1GB # min 64kB
maintenance_work_mem = 10GB # min 1MB
logical_decoding_work_mem = 5GB # min 64kB
dynamic_shared_memory_type = posix # the default is the first option
max_worker_processes = 20 # (change requires restart)
max_parallel_maintenance_workers = 10 # taken from max_parallel_workers
max_parallel_workers_per_gather = 5 # taken from max_parallel_workers
max_parallel_workers = 15 # maximum number of max_worker_processes that
wal_level = logical # minimal, replica, or logical
max_wal_size = 3GB
min_wal_size = 80MB
#archive_mode = off
wal_receiver_timeout = 1min # time that receiver waits for
max_logical_replication_workers = 10 # taken from max_worker_processes
max_sync_workers_per_subscription = 5 # taken from
max_logical_replication_workers

I've tried increasing wal_sender_timeout and wal_receiver_timeout to 10
minutes each, but this had no positive effect.

Some advice would be helpful
--
Willy-Bas Loos

--
Thanks,
Vijay
Mumbai, India

--
Thanks,
Vijay
Mumbai, India

--
Willy-Bas Loos

Vijaykumar Jain

vijaykumarjain.github@gmail.com

almost 5 years ago

In reply to: Willy-Bas Loos (#5)

Re: WAL accumulating, Logical Replication pg 13

So I got it all wrong it seems :)
You upgraded to pg13 fine? , but while on pg13 you have issues with logical
replication ?

There is a path in the postgresql source user subscription folder iirc
which covers various logical replication scenarios.
That may help you just in case.

I have tried all known ways in which replication breaks like above and once
I resolve conflicts it starts fine.
I’ll try to explore more scenarios.
Pardon my link to pglogical. I misunderstood.

On Mon, 31 May 2021 at 7:25 PM Willy-Bas Loos <willybas@gmail.com> wrote:

Thank you for elaborating those possible causes and for the suggestions
you made.
1) if you have an inactive replication slot.
There is only 1 replication slot and it is active. So that is not the
issue.

2) Do you have archiving enabled?
No, i never turned it on and so this is in the settings of both publisher
and subscriber: #archive_mode = off (and show archive_mode; tells me the
same)

3) logical replication can be broken for multiple reasons, like conflicts
where the subscriber already has the data which primary wants to push. it
will not proceed until the conflicts are resolved.
That would have been in the log, but there isn't any conflict in the log.
Only the messages that i posted with my first message.

4) poor connectivity or the computer/network resources not able to keep up
with the load, can result in WAL pile up.
This would be strange since there is a 10Gbps connection within the same
rack. But it could theoretically be malfunctioning or the performance on
the subscriber could be too low.
If any of this is the case, shouldn't that be visible in
pg_stat_subscription ?

Thanks for the article, it's interesting to see how they transitioned from
londiste, even if the article is about pglogical, not logical replication
in the postgres core.
I was using Londiste to transfer the data to the new server and minimize
downtime, so the article might come in handy.
I prepared by reading the documentation, which is very straightforward.

btw,
how are you doing logical replication with 9.3 ? using a pglogical

extension ?
No, I'm not using logical replication in postgres 9.3 . Only on postgres
13.
About the link to the bug reports: Thanks for the suggestion. But first
I'd like to get some better grip on what is going on before searching for
bugs.

Still, any help will be much appreciated

On Sat, May 29, 2021 at 5:16 PM Vijaykumar Jain <
vijaykumarjain.github@gmail.com> wrote:

WAL can be built up for reasons like
1) if you have an inactive replication slot. I mean you had a streaming
replica which was turned off, but you did not remote the slot from primary.
2) Do you have archiving enabled? Are the archiving commands running fine
? if just the archiving is broken, then you can manually run
archive cleanup provided, replication is all caught up fine.

3) logical replication can be broken for multiple reasons, like conflicts
where the subscriber already has the data which primary wants to push. it
will not proceed until the conflicts are resolved.
4) poor connectivity or the computer/network resources not able to keep
up with the load, can result in WAL pile up.

there are many blogs around logical replication issues, but when it was
new in pg10, I read this.
Recovery use cases for Logical Replication in PostgreSQL 10 | by
Konstantin Evteev | AvitoTech | Medium
<https://medium.com/avitotech/recovery-use-cases-for-logical-replication-in-postgresql-10-a1e6bab03072>

btw,
how are you doing logical replication with 9.3 ? using a pglogical
extension ?
I can try many things, but it would be wrong to make assumptions since i
did not work with 9.3
for ex.
Bug fix: Using ExecCopySlot during multi insert by bdrouvotAWS · Pull
Request #295 · 2ndQuadrant/pglogical (github.com)
<https://github.com/2ndQuadrant/pglogical/pull/295>
there are many issues posted here that may be relevant to your setup.

On Sat, 29 May 2021 at 19:22, Willy-Bas Loos <willybas@gmail.com> wrote:

Yeah, indexes could slow things down, thanks. Btw I'm not using logical
replication for the upgrade, that's not supported for 9.3.
It was more complicated but that's beside the point.

I could just delete the publication and all that belongs to it and start
over. But since I'm trying out logical replication, I would like to be more
in control than that. It's there anything that I can dig into to find out
why the WAL is accumulating?

Op vr 28 mei 2021 22:20 schreef Vijaykumar Jain <
vijaykumarjain.github@gmail.com>:

I am not too sure with 9.3
i tried an upgrade from 9.6 to 11 using logical replication (pg_logical
extension)

one thing to note.
logical replication initiates a copy from a snapshot, then changes from
then on.

I had a very high insert rate on my source tables (v9.6) and the
destination (v11) could not keep up (it had tons of indexes when I copied
the schema) and it took around a day as the table had around 12 indexes.

So at the destination(v11), I dropped all but the primary index for
each table, started subscription and when it was almost caught up, rebuilt
the index on the destination concurrently.
it completed in 4-5 hours without stopping the source.
migration completed in a few mins :)

not sure if this would help, but just FYI.

On Sat, 29 May 2021 at 01:36, Willy-Bas Loos <willybas@gmail.com>
wrote:

Hi , I'm upgrading a 1.5TB database from postgres 9.3 to postgres 13
on Debian 10. This is now in an Acceptance stage (DTAP). I have encountered
a problem: the WAL is not being deleted. I now have 1.4 TB of WAL in pg_wal
and my disks are getting full. The oldest WAL file is 18 days old.
I use Logical Replication from the new cluster to another new cluster
with 1 subscriber and 1 subscription.

pg_stat_subscription tells me all recent timestamps.
and this:
db=# select * from pg_replication_slots;
-[ RECORD 1 ]-------+-------------
slot_name | my_pub1
plugin | pgoutput
slot_type | logical
datoid | 16401
database | db
temporary | f
active | t
active_pid | 9480
xmin |
catalog_xmin | 269168
restart_lsn | D4/908BC268
confirmed_flush_lsn | E1/25BF5710
wal_status | extended
safe_wal_size |

I've had problems with diskspace on this server, with postgres
crashing because of it, then added more diskspace and postgres recovered.
This doesn't seem to be a problem now.

The *publication* has the options publish = 'insert, update, delete,
truncate', publish_via_partition_root = false
The *subscription* has the options connect = true, enabled = true,
create_slot = false, slot_name = my_pub1, synchronous_commit = 'off'

The log on the publisher says:
2021-05-25 21:25:18.973 CEST [4584] user@db LOG: starting logical
decoding for slot "my_pub1"
2021-05-25 21:25:18.973 CEST [4584] user@db DETAIL: Streaming
transactions committing after D6/A82B5FE0, reading WAL from D4/908BC268.
2021-05-25 21:25:18.973 CEST [4584] user@db LOG: logical decoding
found consistent point at D4/908BC268
2021-05-25 21:25:18.973 CEST [4584] user@db DETAIL: There are no
running transactions.
2021-05-25 21:29:49.456 CEST [4614] user@db ERROR: replication slot
"my_pub1" is active for PID 4584
2021-05-25 21:29:54.474 CEST [4615] user@db ERROR: replication slot
"my_pub1" is active for PID 4584

And on the subscriber:
2021-05-28 21:23:46.702 CEST [40039] LOG: logical replication apply
worker for subscription "my_pub1" has started
2021-05-28 21:23:46.712 CEST [40039] ERROR: could not start WAL
streaming: ERROR: replication slot "my_pub1" is active for PID 730
2021-05-28 21:23:46.714 CEST [19794] LOG: background worker "logical
replication worker" (PID 40039) exited with exit code 1

The postgres settings on the *publisher* are:
max_connections = 100 # (change requires restart)
tcp_keepalives_idle = 120 # TCP_KEEPIDLE, in seconds;
shared_buffers = 50GB # min 128kB
work_mem = 1GB # min 64kB
maintenance_work_mem = 10GB # min 1MB
logical_decoding_work_mem = 5GB # min 64kB
dynamic_shared_memory_type = posix # the default is the first option
max_worker_processes = 20 # (change requires restart)
max_parallel_maintenance_workers = 10 # taken from max_parallel_workers
max_parallel_workers_per_gather = 5 # taken from max_parallel_workers
max_parallel_workers = 15 # maximum number of max_worker_processes that
wal_level = logical # minimal, replica, or logical
max_wal_size = 1GB
min_wal_size = 80MB
#archive_mode = off
max_wal_senders = 10 # max number of walsender processes
wal_sender_timeout = 1min # in milliseconds; 0 disables
max_replication_slots = 7 # max number of replication slots

On postgres settings on the *subscriber*:
max_connections = 100 # (change requires restart)
tcp_keepalives_idle = 120 # TCP_KEEPIDLE, in seconds;
shared_buffers = 25GB # min 128kB
work_mem = 1GB # min 64kB
maintenance_work_mem = 10GB # min 1MB
logical_decoding_work_mem = 5GB # min 64kB
dynamic_shared_memory_type = posix # the default is the first option
max_worker_processes = 20 # (change requires restart)
max_parallel_maintenance_workers = 10 # taken from max_parallel_workers
max_parallel_workers_per_gather = 5 # taken from max_parallel_workers
max_parallel_workers = 15 # maximum number of max_worker_processes that
wal_level = logical # minimal, replica, or logical
max_wal_size = 3GB
min_wal_size = 80MB
#archive_mode = off
wal_receiver_timeout = 1min # time that receiver waits for
max_logical_replication_workers = 10 # taken from max_worker_processes
max_sync_workers_per_subscription = 5 # taken from
max_logical_replication_workers

I've tried increasing wal_sender_timeout and wal_receiver_timeout to
10 minutes each, but this had no positive effect.

Some advice would be helpful
--
Willy-Bas Loos

--
Thanks,
Vijay
Mumbai, India

--
Thanks,
Vijay
Mumbai, India

--
Willy-Bas Loos

--
Thanks,
Vijay
Mumbai, India

Willy-Bas Loos

willybas@gmail.com

almost 5 years ago

In reply to: Vijaykumar Jain (#6)

Re: WAL accumulating, Logical Replication pg 13

On Mon, May 31, 2021 at 4:24 PM Vijaykumar Jain <
vijaykumarjain.github@gmail.com> wrote:

So I got it all wrong it seems :)

Thank you for taking the time to help me!

You upgraded to pg13 fine? , but while on pg13 you have issues with logical

replication ?

Yes, the upgrade went fine. So here are some details:
I already had londiste running on postgres 9.3, but londiste wouldn't run
on Debian 10
So i first made the new server Debian 9 with postgres 9.6 and i started
replicating with londiste from 9.3 to 9.6
When all was ready, i stopped the replication to the 9.6 server and deleted
all londiste & pgq content with drop schema cascade.
Then I upgraded the server to Debian 10. Then i user pg_upgrade to upgrade
from postgres 9.6 to 13. (PostGIS versions were kept compatible).
Then I added logical replication and a third server as a subscriber.

I was going to write that replication is working fine (since the table
contains a lot of data and there are no conflicts in the log), but it turns
out that it isn't.
The subscriber is behind and It looks like there hasn't been any incoming
data after the initial data synchronization.
So at least now i know that the WAL is being retained with a reason. The
connection is working properly (via psql anyway)

I will also look into how to diagnose this from the system tables, e.g.
substracting LSN's to get some quantitative measure for the lag.

There is a path in the postgresql source user subscription folder iirc
which covers various logical replication scenarios.
That may help you just in case.

OK, so comments in the source code you mean?

Vijaykumar Jain

vijaykumarjain.github@gmail.com

almost 5 years ago

In reply to: Willy-Bas Loos (#7)

Re: WAL accumulating, Logical Replication pg 13

This part.
If you can read perl :),

https://github.com/postgres/postgres/tree/master/src/test/subscription/t

On Mon, May 31, 2021, 9:02 PM Willy-Bas Loos <willybas@gmail.com> wrote:

Show quoted text

On Mon, May 31, 2021 at 4:24 PM Vijaykumar Jain <
vijaykumarjain.github@gmail.com> wrote:

So I got it all wrong it seems :)

Thank you for taking the time to help me!

You upgraded to pg13 fine? , but while on pg13 you have issues with

logical replication ?

Yes, the upgrade went fine. So here are some details:
I already had londiste running on postgres 9.3, but londiste wouldn't run
on Debian 10
So i first made the new server Debian 9 with postgres 9.6 and i started
replicating with londiste from 9.3 to 9.6
When all was ready, i stopped the replication to the 9.6 server and
deleted all londiste & pgq content with drop schema cascade.
Then I upgraded the server to Debian 10. Then i user pg_upgrade to
upgrade from postgres 9.6 to 13. (PostGIS versions were kept compatible).
Then I added logical replication and a third server as a subscriber.

I was going to write that replication is working fine (since the table
contains a lot of data and there are no conflicts in the log), but it turns
out that it isn't.
The subscriber is behind and It looks like there hasn't been any incoming
data after the initial data synchronization.
So at least now i know that the WAL is being retained with a reason. The
connection is working properly (via psql anyway)

I will also look into how to diagnose this from the system tables, e.g.
substracting LSN's to get some quantitative measure for the lag.

There is a path in the postgresql source user subscription folder iirc
which covers various logical replication scenarios.
That may help you just in case.

OK, so comments in the source code you mean?

Tomas Pospisek

tpo2@sourcepole.ch

almost 5 years ago

In reply to: Willy-Bas Loos (#7)

Re: WAL accumulating, Logical Replication pg 13

Hi Willy-Bas Loos,

On 31.05.21 17:32, Willy-Bas Loos wrote:

On Mon, May 31, 2021 at 4:24 PM Vijaykumar Jain
<vijaykumarjain.github@gmail.com
<mailto:vijaykumarjain.github@gmail.com>> wrote:

So I got it all wrong it seems :)

Thank you for taking the time to help me!

You upgraded to pg13 fine? , but while on pg13 you have issues with
logical replication ?

Yes, the upgrade went fine. So here are some details:
I already had londiste running on postgres 9.3, but londiste wouldn't
run on Debian 10
So i first made the new server Debian 9 with postgres 9.6 and i started
replicating with londiste from 9.3 to 9.6
When all was ready, i stopped the replication to the 9.6 server and
deleted all londiste & pgq content with drop schema cascade.
Then I upgraded the server to Debian 10. Then i user pg_upgrade to
upgrade from postgres 9.6 to 13. (PostGIS versions were kept compatible).
Then I added logical replication and a third server as a subscriber.

I was going to write that replication is working fine (since the table
contains a lot of data and there are no conflicts in the log), but it
turns out that it isn't.
The subscriber is behind and It looks like there hasn't been any
incoming data after the initial data synchronization.
So at least now i know that the WAL is being retained with a reason. The
connection is working properly (via psql anyway)

I once maybe had a similar problem due to some ports that were needed
for replication being firewalled off or respectively the master having
the wrong IP address of the old master (now standby server) or such.

There was absolutely no word anywhere in any log about the problem I was
just seeing the new postgres master not starting up after hours and
hours of waiting after a failover. I somehow found out about the
required port being blocked (I don't remember - maybe seing the
unanswered SYNs in tcpdump? Or via ufw log entries?).

Show quoted text

I will also look into how to diagnose this from the system tables, e.g.
substracting LSN's to get some quantitative measure for the lag.

There is a path in the postgresql source user subscription folder
iirc which covers various logical replication scenarios.
That may help you just in case.

OK, so comments in the source code you mean?

#10

Willy-Bas Loos

willybas@gmail.com

almost 5 years ago

In reply to: Tomas Pospisek (#9)

Re: WAL accumulating, Logical Replication pg 13

Hi, I was going to follow up on this one, sorry for the long silence.
The replication is working fine now, and I have no idea what the problem
was. Not cool.
If I find out, I will let you know.

On Mon, May 31, 2021 at 6:06 PM Tomas Pospisek <tpo2@sourcepole.ch> wrote:

Hi Willy-Bas Loos,

On 31.05.21 17:32, Willy-Bas Loos wrote:

On Mon, May 31, 2021 at 4:24 PM Vijaykumar Jain
<vijaykumarjain.github@gmail.com
<mailto:vijaykumarjain.github@gmail.com>> wrote:

So I got it all wrong it seems :)

Thank you for taking the time to help me!

You upgraded to pg13 fine? , but while on pg13 you have issues with
logical replication ?

Yes, the upgrade went fine. So here are some details:
I already had londiste running on postgres 9.3, but londiste wouldn't
run on Debian 10
So i first made the new server Debian 9 with postgres 9.6 and i started
replicating with londiste from 9.3 to 9.6
When all was ready, i stopped the replication to the 9.6 server and
deleted all londiste & pgq content with drop schema cascade.
Then I upgraded the server to Debian 10. Then i user pg_upgrade to
upgrade from postgres 9.6 to 13. (PostGIS versions were kept compatible).
Then I added logical replication and a third server as a subscriber.

I was going to write that replication is working fine (since the table
contains a lot of data and there are no conflicts in the log), but it
turns out that it isn't.
The subscriber is behind and It looks like there hasn't been any
incoming data after the initial data synchronization.
So at least now i know that the WAL is being retained with a reason. The
connection is working properly (via psql anyway)

I once maybe had a similar problem due to some ports that were needed
for replication being firewalled off or respectively the master having
the wrong IP address of the old master (now standby server) or such.

There was absolutely no word anywhere in any log about the problem I was
just seeing the new postgres master not starting up after hours and
hours of waiting after a failover. I somehow found out about the
required port being blocked (I don't remember - maybe seing the
unanswered SYNs in tcpdump? Or via ufw log entries?).

I will also look into how to diagnose this from the system tables, e.g.
substracting LSN's to get some quantitative measure for the lag.

There is a path in the postgresql source user subscription folder
iirc which covers various logical replication scenarios.
That may help you just in case.

OK, so comments in the source code you mean?

--
Willy-Bas Loos

#11

Willy-Bas Loos

willybas@gmail.com

over 4 years ago

In reply to: Willy-Bas Loos (#10)

Re: WAL accumulating, Logical Replication pg 13

Hi, here's an update on this old thread.
I've found the problem, it was because of long transactions.
I reproduced the problem when I added a second publication and another
postgres cluster with a subscription to it (on acceptance still).

This query came in handy to show me what was happening:

select r.pid, a.state as query_state, a.wait_event, r.application_name,
r.state as replication_state,
-- now()-a.query_start as query_time,
r.write_lag,
pg_size_pretty(pg_wal_lsn_diff(r.sent_lsn,r.write_lsn)) as write_lag2,
r.flush_lag,
pg_size_pretty(pg_wal_lsn_diff(r.sent_lsn,r.flush_lsn)) as flush_lag2,
r.replay_lag,
pg_size_pretty(pg_wal_lsn_diff(r.sent_lsn,r.replay_lsn)) as replay_lag2,
r.client_addr, a.backend_start, a.state_change, a.query_start,
now()
from pg_stat_replication r
left join pg_stat_activity a on a.pid = r.pid
order by r.application_name;

What i saw was that replication wold "start", then enter a "catchup" state
and detect about 5GB of data, wich it would then process, so i would see
the "lag_..." numbers decline as i repeated that query.
When the number hit 0 or a slightly negative number, it would take a minute
(literally) and then the process was repeated. BTW I was seeing negative
numbers at some point so i swapped the values (e.g. r.write_lsn,
r.sent_lsn) and changed them back later. So I think that the 5GB was
actually -5GB, strangely. During the minute wait at a near-zero lag, the
query state would be "ReorderBufferWrite".

In the log there was an important extra line that I hadn't noticed before,
because it doesn't occur as often as all the lines about the workers
exiting in an error.
2021-07-14 14:04:58.110 CEST [22782] ERROR: terminating logical replication
worker due to timeout 2021-07-14 14:04:58.112 CEST [3720] LOG: background
worker "logical replication worker" (PID 22782) exited with exit code 1

I was already suspecting long transactions to be the problem, so I thought
that the timeout might be the problem.
At that point I changed the wal_receiver_timeout option to 0 in the
postgresql.conf file and the above behaviour ceased.
Instead I saw the lag values increase rapidly, with about 100GB per hour
(not a high traffic database).
There were long transactions on the subscriber database, but merely killing
those precesses wasn't enough because they would soon be started again.
I had to disable the job that started them for about 15 minutes, that did
the job. After that, those long transactions were not a problem anymore.

A guess at what is causing this:
The long transactions on the subscriber node are only a problem during the
initial snapshot phase. It uses a transaction and needs heavy locks on the
tables that are to be replicated.
The initial snapshot was cancelled because it didn't end successfully
within the wal_receiver_timeout, so it was restarted later.
When i disabled the timeout, all the WAL on te server was retained. There
were still periodical retries, there was no pending request for a lock. Or
it would have gotten one the moment I killed that long transaction.
But every time that the snapshot process was retried, there was already a
new long transaction preventing the lock for the snapshot.
I had to prevent long transactions for long enough so that the initial
snapshot could get a lock on the subscriber. However, I don't know the
length of the timeout that defines how long a transaction can be without
disturbing the snapshot.
Please correct me where I'm wrong.

Cheers,

Willy Loos

On Fri, Jun 11, 2021 at 9:51 AM Willy-Bas Loos <willybas@gmail.com> wrote:

Hi, I was going to follow up on this one, sorry for the long silence.
The replication is working fine now, and I have no idea what the problem
was. Not cool.
If I find out, I will let you know.

On Mon, May 31, 2021 at 6:06 PM Tomas Pospisek <tpo2@sourcepole.ch> wrote:

Hi Willy-Bas Loos,

On 31.05.21 17:32, Willy-Bas Loos wrote:

On Mon, May 31, 2021 at 4:24 PM Vijaykumar Jain
<vijaykumarjain.github@gmail.com
<mailto:vijaykumarjain.github@gmail.com>> wrote:

So I got it all wrong it seems :)

Thank you for taking the time to help me!

You upgraded to pg13 fine? , but while on pg13 you have issues with
logical replication ?

Yes, the upgrade went fine. So here are some details:
I already had londiste running on postgres 9.3, but londiste wouldn't
run on Debian 10
So i first made the new server Debian 9 with postgres 9.6 and i started
replicating with londiste from 9.3 to 9.6
When all was ready, i stopped the replication to the 9.6 server and
deleted all londiste & pgq content with drop schema cascade.
Then I upgraded the server to Debian 10. Then i user pg_upgrade to
upgrade from postgres 9.6 to 13. (PostGIS versions were kept

compatible).

Then I added logical replication and a third server as a subscriber.

I was going to write that replication is working fine (since the table
contains a lot of data and there are no conflicts in the log), but it
turns out that it isn't.
The subscriber is behind and It looks like there hasn't been any
incoming data after the initial data synchronization.
So at least now i know that the WAL is being retained with a reason.

The

connection is working properly (via psql anyway)

I once maybe had a similar problem due to some ports that were needed
for replication being firewalled off or respectively the master having
the wrong IP address of the old master (now standby server) or such.

There was absolutely no word anywhere in any log about the problem I was
just seeing the new postgres master not starting up after hours and
hours of waiting after a failover. I somehow found out about the
required port being blocked (I don't remember - maybe seing the
unanswered SYNs in tcpdump? Or via ufw log entries?).

I will also look into how to diagnose this from the system tables, e.g.
substracting LSN's to get some quantitative measure for the lag.

There is a path in the postgresql source user subscription folder
iirc which covers various logical replication scenarios.
That may help you just in case.

OK, so comments in the source code you mean?

--
Willy-Bas Loos

--
Willy-Bas Loos