Logical replication halted due to "this slot has been invalidated because it exceeded the maximum reserved size."
Hello lists,
We have two PostgreSQL 14.3 instances on Red Hat Linux 8.5, running
two different databases on VMs, with logical replication between the databases.
Recently we have been bitten by a recurring issue in these databases.
As we use logical replication between these two systems, we have
had to rebuild logical replication by dropping the subscription, as
that will drop the logical replication slot on the primary. The issue
then does not occur for some time, but it comes back.
This time it recurred 3 days after logical replication was rebuilt.
Last time it took 2-3 months until it repeated.
We started to get these errors on the standby:
2022-11-29 10:21:06.940 EET [1698404] ERROR: could not start WAL streaming: ERROR: cannot read from logical replication slot "logs"
DETAIL: This slot has been invalidated because it exceeded the maximum reserved size.
2022-11-29 10:21:06.942 EET [1698368] LOG: background worker "logical replication worker" (PID 1698404) exited with exit code 1
This happens even though max_slot_wal_keep_size is -1, which according to the manual should mean there is no limit:
If max_slot_wal_keep_size is -1 (the default), replication slots may retain an unlimited amount of WAL files
https://postgresqlco.nf/doc/en/param/max_slot_wal_keep_size/
psql (14.3)
Type "help" for help.
postgres=# show max_slot_wal_keep_size;
max_slot_wal_keep_size
------------------------
-1
(1 row)
Why is this happening? Is this a bug in PG 14.3?
Our fix for the time being is:
DB=# alter subscription logs disable;
ALTER SUBSCRIPTION
SN4ReportingDB=# drop subscription logs;
NOTICE: dropped replication slot "logs" on publisher
DROP SUBSCRIPTION
SN4ReportingDB=# create subscription logs connection 'dbname=DBNAMAE host=192.168.1.1 port=5000 user=postgres' publication log pub with (copy_data=false);
NOTICE: created replication slot "logs" on publisher
CREATE SUBSCRIPTION
But as this is a live system with a terabyte of data, some data will be lost unless we rebuild the whole replication from scratch, and this is not
bearable!
Any advice?
Regards,
Viljo Hakala
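
For anyone chasing the same error, the state of the slot as seen on the publisher is a useful first check. The following is only a minimal sketch, assuming PostgreSQL 13 or later (where pg_replication_slots has the wal_status and safe_wal_size columns) and that it is run on the publishing instance, not on the subscriber:

```sql
-- Run on the publisher (the instance that owns the slot).
-- wal_status = 'lost' means the slot has been invalidated.
-- safe_wal_size is NULL for lost slots and when max_slot_wal_keep_size is -1.
SELECT slot_name,
       active,
       wal_status,
       safe_wal_size,
       restart_lsn,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots
WHERE slot_type = 'logical';
```

A non-NULL safe_wal_size on a healthy slot would suggest that a limit is in effect on that instance, whatever SHOW reports elsewhere.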
On 2022-Nov-29, Viljo Hakala wrote:
We have two PostgreSQL 14.3 on Red Hat Linux 8.5 running two
different databases on VMs, where logical replication is used between
databases.
Hmm, I don't see any bug fixes in the commit log that would match this.
I only looked after May 9th 2022, which is 14.3's tag date.
2022-11-29 10:21:06.940 EET [1698404] ERROR: could not start WAL streaming: ERROR: cannot read from logical replication slot "logs"
DETAIL: This slot has been invalidated because it exceeded the maximum reserved size.
2022-11-29 10:21:06.942 EET [1698368] LOG: background worker "logical replication worker" (PID 1698404) exited with exit code 1
Even though according to the manual max_slot_wal_keep_size is -1 and should not have a limit?
You should see earlier messages about the slot being invalidated, during
the previous checkpoint -- and potentially the walsender being
signalled, if it was running. Can you spot those?
postgres=# show max_slot_wal_keep_size;
max_slot_wal_keep_size
------------------------
-1
The only explanation of this behavior that doesn't involve a bug is that
this parameter was set to a nonzero value, then set to disabled, but the
checkpointer failed to notice the change.
--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
"This is what I like so much about PostgreSQL. Most of the surprises
are of the "oh wow! That's cool" Not the "oh shit!" kind. :)"
Scott Marlowe, http://archives.postgresql.org/pgsql-admin/2008-10/msg00152.php
Hello
The only explanation of this behavior that doesn't involve a bug is that
this parameter was set to a nonzero value, then set to disabled, but the
checkpointer failed to notice the change.
Or max_slot_wal_keep_size is really set on the publication database, but only the subscription database config was checked.
regards, Sergei
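
The point about which instance was checked can be verified with queries along these lines. This is a sketch only, using the standard pg_settings and pg_file_settings views, and it assumes a superuser (or similarly privileged) session on the publisher, since the source file columns are hidden from ordinary users:

```sql
-- Run on the publisher: the value in effect for this session and where it came from.
SELECT name, setting, unit, source, sourcefile, sourceline
FROM pg_settings
WHERE name = 'max_slot_wal_keep_size';

-- Entries for the parameter in the configuration files,
-- including ones that have not been applied yet.
SELECT sourcefile, sourceline, setting, applied, error
FROM pg_file_settings
WHERE name = 'max_slot_wal_keep_size';
```

Note that pg_settings reflects what the current backend sees, which is not necessarily what the checkpointer process is using. If the file contents and the running value disagree, SELECT pg_reload_conf(); asks all server processes to re-read the configuration.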
Hi,
Previously we had -1 as the default on both of these instances, and that caused the same kind of issue.
We then set it to the maximum value, and have since changed it back to -1.
The logical replication was rebuilt from scratch over the weekend, and the issue occurred only 2-3 days later.
So we believe this is a bug. There seems to have been an issue in 13 that got fixed, but maybe this is a regression?
Regards,
Viljo
From: Sergei Kornilov <sk@zsrv.org>
Date: Tuesday, 29 November 2022 at 15:16
To: Alvaro Herrera <alvherre@alvh.no-ip.org>
Cc: pgsql-bugs@lists.postgresql.org <pgsql-bugs@lists.postgresql.org>, Viljo Hakala <Viljo.Hakala@advania.com>
Subject: Re: Logical replication halted due to "this slot has been invalidated because it exceeded the maximum reserved size."
Hello
The only explanation of this behavior that doesn't involve a bug is that
this parameter was set to a nonzero value, then set to disabled, but the
checkpointer failed to notice the change.
Or max_slot_wal_keep_size is really set on the publication database, but only the subscription database config was checked.
regards, Sergei