Logical replication halted due to "this slot has been invalidated because it exceeded the maximum reserved size."

Started by Viljo Hakala over 3 years ago, 5 messages, bugs
#1Viljo Hakala
Viljo.Hakala@advania.com

Hello lists,

We have two PostgreSQL 14.3 instances on Red Hat Linux 8.5, running two different databases on VMs, with logical replication between the databases.

Recently we have been bitten by a recurring issue in the databases. As we use logical replication between these two systems, we have had to rebuild logical replication by dropping the subscription, as that will drop the logical replication slot on the primary. The issue then does not occur for some time, but it will repeat.

This time it repeated 3 days after logical replication was rebuilt. Last time it took 2-3 months until it repeated.

We started to get these warnings on the standby:

2022-11-29 10:21:06.940 EET [1698404] ERROR: could not start WAL streaming: ERROR: cannot read from logical replication slot "logs"
DETAIL: This slot has been invalidated because it exceeded the maximum reserved size.
2022-11-29 10:21:06.942 EET [1698368] LOG: background worker "logical replication worker" (PID 1698404) exited with exit code 1

This happens even though max_slot_wal_keep_size is -1, which according to the manual should mean there is no limit:

If max_slot_wal_keep_size is -1 (the default), replication slots may retain an unlimited amount of WAL files

https://postgresqlco.nf/doc/en/param/max_slot_wal_keep_size/

psql (14.3)
Type "help" for help.

postgres=# show max_slot_wal_keep_size;
 max_slot_wal_keep_size
------------------------
 -1
(1 row)

Why is this happening? Is this a bug in PG 14.3?

Our fix for the time being is:

DB=# alter subscription logs disable;
ALTER SUBSCRIPTION
SN4ReportingDB=# drop subscription logs;
NOTICE: dropped replication slot "logs" on publisher
DROP SUBSCRIPTION
SN4ReportingDB=# create subscription logs connection 'dbname=DBNAMAE host=192.168.1.1 port=5000 user=postgres' publication log pub with (copy_data=false);
NOTICE: created replication slot "logs" on publisher
CREATE SUBSCRIPTION

But as this is a live system with a terabyte of data, some data will be lost unless we rebuild the whole replication from scratch, and this is not bearable!

Any advice?

Regards,

Viljo Hakala

#3Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Viljo Hakala (#1)
Re: Logical replication halted due to "this slot has been invalidated because it exceeded the maximum reserved size."

On 2022-Nov-29, Viljo Hakala wrote:

We have two PostgreSQL 14.3 on Red Hat Linux 8.5 running two
different databases on VMs, where logical replication is used between
databases.

Hmm, I don't see any bug fixes in the commit log that would match this.
I only looked after May 9th 2022, which is 14.3's tag date.

2022-11-29 10:21:06.940 EET [1698404] ERROR: could not start WAL streaming: ERROR: cannot read from logical replication slot "logs"
DETAIL: This slot has been invalidated because it exceeded the maximum reserved size.
2022-11-29 10:21:06.942 EET [1698368] LOG: background worker "logical replication worker" (PID 1698404) exited with exit code 1

Even though according to manual the max_slot_wal_keep_size is -1 and should not have a limit?

You should see earlier messages about the slot being invalidated, during
the previous checkpoint -- and potentially the walsender being
signalled, if it was running. Can you spot those?
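
Alongside the log messages, the slot's current state can be read from the pg_replication_slots view. The query below is only a sketch and was not posted in the thread. It assumes the slot is named "logs" as in the error above, and that it is run on the publisher. The wal_status and safe_wal_size columns are available in PostgreSQL 13 and later.

```sql
-- Run on the publisher (the server that owns the slot), not on the subscriber.
SELECT slot_name,
       active,
       restart_lsn,
       wal_status,      -- 'lost' means the slot has been invalidated
       safe_wal_size,   -- NULL for lost slots and when max_slot_wal_keep_size is -1
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots
WHERE slot_name = 'logs';   -- slot name taken from the error message
```

A non-NULL safe_wal_size on a healthy slot would suggest that a limit is in effect on that server, whatever SHOW reports elsewhere.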

postgres=# show max_slot_wal_keep_size;

max_slot_wal_keep_size
------------------------
-1

The only explanation of this behavior that doesn't involve a bug is that
this parameter was set to a nonzero value, then set to disabled, but the
checkpointer failed to notice the change.

--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
"This is what I like so much about PostgreSQL. Most of the surprises
are of the "oh wow! That's cool" Not the "oh shit!" kind. :)"
Scott Marlowe, http://archives.postgresql.org/pgsql-admin/2008-10/msg00152.php

#4Sergei Kornilov
sk@zsrv.org
In reply to: Alvaro Herrera (#3)
Re: Logical replication halted due to "this slot has been invalidated because it exceeded the maximum reserved size."

Hello

The only explanation of this behavior that doesn't involve a bug is that
this parameter was set to a nonzero value, then set to disabled, but the
checkpointer failed to notice the change.

Or max_slot_wal_keep_size is really set on the publication database, but only the subscription database config was checked.
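
A minimal sketch of that check, assuming it is run in a session connected to the publisher: pg_settings shows both the effective value and where it was set. The two extra parameter names are included only for context and are not mentioned in the thread.

```sql
-- Run on the publisher. SHOW on the subscriber says nothing about the
-- server that actually holds the replication slot.
SELECT name,
       setting,
       unit,
       source,       -- e.g. 'default' or 'configuration file'
       sourcefile,   -- visible only to sufficiently privileged roles
       sourceline
FROM pg_settings
WHERE name IN ('max_slot_wal_keep_size', 'wal_keep_size', 'max_wal_size');
```

If source is anything other than 'default' for max_slot_wal_keep_size, the value was set explicitly on that server.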

regards, Sergei

#5Viljo Hakala
Viljo.Hakala@advania.com
In reply to: Sergei Kornilov (#4)
VS: Re:Logical replication halted due to "this slot has been invalidated because it exceeded the maximum reserved size."

Hi,

Previously we had the default of -1 on both of these instances, and that caused the same kind of issue.

We had then set it to the maximum value. We have now changed it back to -1.

The logical replication was rebuilt from scratch over the weekend, and the issue occurred only 2-3 days later.

So we believe this is a bug. There seems to have been an issue in 13 that got fixed, but maybe this is a regression?

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=866237a6fa01a128325df41ad39b41ea3363c9a9

Regards,
Viljo

From: Sergei Kornilov <sk@zsrv.org>
Date: Tuesday, 29 November 2022 at 15:16
To: Alvaro Herrera <alvherre@alvh.no-ip.org>
Cc: pgsql-bugs@lists.postgresql.org <pgsql-bugs@lists.postgresql.org>, Viljo Hakala <Viljo.Hakala@advania.com>
Subject: Re: Logical replication halted due to "this slot has been invalidated because it exceeded the maximum reserved size."

Hello

The only explanation of this behavior that doesn't involve a bug is that
this parameter was set to a nonzero value, then set to disabled, but the
checkpointer failed to notice the change.

Or max_slot_wal_keep_size is really set on the publication database, but only the subscription database config was checked.

regards, Sergei