Catalog_xmin is not advanced when a logical slot is lost

Started by sirisha chamarthiabout 3 years ago17 messages
#1sirisha chamarthi
sirichamarthi22@gmail.com
1 attachment(s)

Hi Hackers,

forking this thread from the discussion [1]/messages/by-id/CAKrAKeW-sGqvkw-2zKuVYiVv=EOG4LEqJn01RJPsHfS2rQGYng@mail.gmail.com as suggested by Amit.

Catalog_xmin is not advanced when a logical slot is invalidated (lost)
until the invalidated slot is dropped. This patch ignores invalidated slots
while computing the oldest xmin. Attached a small patch to address this and
the output after the patch is as shown below.

postgres=# select * from pg_replication_slots;
slot_name | plugin | slot_type | datoid | database | temporary |
active | active_pid | xmin | catalog_xmin | restart_lsn |
confirmed_flush_lsn | wal_status | safe_wal_size | two_phase
-----------+---------------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------+------------+---------------+-----------
s2 | test_decoding | logical | 5 | postgres | f | f
| | | 771 | 0/30466368 | 0/304663A0
| reserved | 28903824 | f
(1 row)

postgres=# create table t2(c int, c1 char(100));
CREATE TABLE
postgres=# drop table t2;
DROP TABLE
postgres=# vacuum pg_class;
VACUUM
postgres=# select n_dead_tup from pg_stat_all_tables where relname =
'pg_class';
n_dead_tup
------------
2
(1 row)

postgres=# select * from pg_stat_replication;
pid | usesysid | usename | application_name | client_addr |
client_hostname | client_port | backend_start | backend_xmin | state |
sent_lsn | write_lsn | flush_lsn | replay_lsn | write_lag | flush_lag |
replay_lag | sync_pri
ority | sync_state | reply_time
-----+----------+---------+------------------+-------------+-----------------+-------------+---------------+--------------+-------+----------+-----------+-----------+------------+-----------+-----------+------------+---------
------+------------+------------
(0 rows)

postgres=# insert into t1 select * from t1;
INSERT 0 2097152
postgres=# checkpoint;
CHECKPOINT
postgres=# select * from pg_replication_slots;
slot_name | plugin | slot_type | datoid | database | temporary |
active | active_pid | xmin | catalog_xmin | restart_lsn |
confirmed_flush_lsn | wal_status | safe_wal_size | two_phase
-----------+---------------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------+------------+---------------+-----------
s2 | test_decoding | logical | 5 | postgres | f | f
| | | 771 | | 0/304663A0
| lost | | f
(1 row)

postgres=# vacuum pg_class;
VACUUM
postgres=# select n_dead_tup from pg_stat_all_tables where relname =
'pg_class';
n_dead_tup
------------
0
(1 row)

[1]: /messages/by-id/CAKrAKeW-sGqvkw-2zKuVYiVv=EOG4LEqJn01RJPsHfS2rQGYng@mail.gmail.com
/messages/by-id/CAKrAKeW-sGqvkw-2zKuVYiVv=EOG4LEqJn01RJPsHfS2rQGYng@mail.gmail.com

Thanks,
Sirisha

Attachments:

0001-Ignore-invalidated-slots-while-computing-the-oldest-.patchapplication/octet-stream; name=0001-Ignore-invalidated-slots-while-computing-the-oldest-.patchDownload
From dcd84981e456512dcd9c995fb4fe0df86bbe0d72 Mon Sep 17 00:00:00 2001
From: root <root@pgvm.rlsumirojk0etd4qpjbaa2afce.tx.internal.cloudapp.net>
Date: Thu, 10 Nov 2022 10:18:24 +0000
Subject: [PATCH] Ignore invalidated slots while computing the oldest xmin

Catalog_xmin will not advance when a slot is invalidated until the invalidated slot is dropped.
This patch ignores invalidated slots while computing the oldest xmin.
---
 src/backend/replication/slot.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index c036a2c37b..2617563a09 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -847,6 +847,7 @@ ReplicationSlotsComputeRequiredXmin(bool already_locked)
 		ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i];
 		TransactionId effective_xmin;
 		TransactionId effective_catalog_xmin;
+		TransactionId invalidated_at_lsn;
 
 		if (!s->in_use)
 			continue;
@@ -854,8 +855,13 @@ ReplicationSlotsComputeRequiredXmin(bool already_locked)
 		SpinLockAcquire(&s->mutex);
 		effective_xmin = s->effective_xmin;
 		effective_catalog_xmin = s->effective_catalog_xmin;
+		invalidated_at_lsn = s->data.invalidated_at;
 		SpinLockRelease(&s->mutex);
 
+		/* ignore invalid slots while computing the oldest xmin */
+		if (TransactionIdIsValid(invalidated_at_lsn))
+			continue;
+
 		/* check the data xmin */
 		if (TransactionIdIsValid(effective_xmin) &&
 			(!TransactionIdIsValid(agg_xmin) ||
-- 
2.25.1

#2Alvaro Herrera
alvherre@alvh.no-ip.org
In reply to: sirisha chamarthi (#1)
Re: Catalog_xmin is not advanced when a logical slot is lost

On 2022-Nov-20, sirisha chamarthi wrote:

Hi Hackers,

forking this thread from the discussion [1] as suggested by Amit.

Catalog_xmin is not advanced when a logical slot is invalidated (lost)
until the invalidated slot is dropped. This patch ignores invalidated slots
while computing the oldest xmin. Attached a small patch to address this and
the output after the patch is as shown below.

Oh wow, that's bad :-( I'll get it patched immediately.

--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
"Here's a general engineering tip: if the non-fun part is too complex for you
to figure out, that might indicate the fun part is too ambitious." (John Naylor)
/messages/by-id/CAFBsxsG4OWHBbSDM=sSeXrQGOtkPiOEOuME4yD7Ce41NtaAD9g@mail.gmail.com

#3Ashutosh Bapat
ashutosh.bapat.oss@gmail.com
In reply to: Alvaro Herrera (#2)
Re: Catalog_xmin is not advanced when a logical slot is lost

Hi Sirisha,

Thanks for identifying the bug and the solution. Some review comments inlined.

On Mon, Nov 21, 2022 at 2:49 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

On 2022-Nov-20, sirisha chamarthi wrote:

Hi Hackers,

forking this thread from the discussion [1] as suggested by Amit.

Catalog_xmin is not advanced when a logical slot is invalidated (lost)
until the invalidated slot is dropped. This patch ignores invalidated slots
while computing the oldest xmin. Attached a small patch to address this and
the output after the patch is as shown below.

Oh wow, that's bad :-( I'll get it patched immediately.

+        /* ignore invalid slots while computing the oldest xmin */
+        if (TransactionIdIsValid(invalidated_at_lsn))
+            continue;

I think the condition should be

if (!XLogRecPtrIsInvalid(invalidated_at_lsn)) LSN and XID are
different data types.

and to be inline with pg_get_replication_slots()
361 if (XLogRecPtrIsInvalid(slot_contents.data.restart_lsn) &&
362 !XLogRecPtrIsInvalid(slot_contents.data.invalidated_at))
363 walstate = WALAVAIL_REMOVED;

we should also check restart_lsn.

I would write this as

bool invalidated_slot = false;

then under spinlock
invalidated_slot = XLogRecPtrIsInvalid(s->data.restart_lsn) &&
!XLogRecPtrIsInvalid(s->data.invalidated_at);

if (invalidated_slot)
continue.

--
Best Wishes,
Ashutosh Bapat

#4Alvaro Herrera
alvherre@alvh.no-ip.org
In reply to: Ashutosh Bapat (#3)
Re: Catalog_xmin is not advanced when a logical slot is lost

On 2022-Nov-21, Ashutosh Bapat wrote:

I think the condition should be

if (!XLogRecPtrIsInvalid(invalidated_at_lsn)) LSN and XID are
different data types.

Yeah, this bit is wrong. I agree with your suggestion to just keep a
boolean flag, as we don't need more than that.

and to be inline with pg_get_replication_slots()
361 if (XLogRecPtrIsInvalid(slot_contents.data.restart_lsn) &&
362 !XLogRecPtrIsInvalid(slot_contents.data.invalidated_at))
363 walstate = WALAVAIL_REMOVED;

we should also check restart_lsn.

Hmm, I'm not sure about this one. I'm not sure why we check both in
pg_get_replication_slots. I suppose we didn't want to ignore a slot
only if it had a non-zero invalidated_at in case it was accidental (say,
we initialize a slot as valid, but forget to zero-out the invalidated_at
value); but I think that's pretty much useless. This is only changed
with the spinlock held, so it's not like you can see partially-set
state.

In fact, as I recall we could replace invalidated_at in
ReplicationSlotPersistentData with a boolean "invalidated" flag, and
leave restart_lsn alone when invalidated. IIRC the only reason we
didn't do it that way was that we feared some code might observe some
valid value in restart_lsn without noticing that it belonged to an
invalidate slot. (Which is exactly what happened now, except with a
different field.)

--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/

#5Ashutosh Bapat
ashutosh.bapat.oss@gmail.com
In reply to: Alvaro Herrera (#4)
Re: Catalog_xmin is not advanced when a logical slot is lost

On Mon, Nov 21, 2022 at 5:39 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

On 2022-Nov-21, Ashutosh Bapat wrote:

I think the condition should be

if (!XLogRecPtrIsInvalid(invalidated_at_lsn)) LSN and XID are
different data types.

Yeah, this bit is wrong. I agree with your suggestion to just keep a
boolean flag, as we don't need more than that.

and to be inline with pg_get_replication_slots()
361 if (XLogRecPtrIsInvalid(slot_contents.data.restart_lsn) &&
362 !XLogRecPtrIsInvalid(slot_contents.data.invalidated_at))
363 walstate = WALAVAIL_REMOVED;

we should also check restart_lsn.

Hmm, I'm not sure about this one. I'm not sure why we check both in
pg_get_replication_slots. I suppose we didn't want to ignore a slot
only if it had a non-zero invalidated_at in case it was accidental (say,
we initialize a slot as valid, but forget to zero-out the invalidated_at
value); but I think that's pretty much useless. This is only changed
with the spinlock held, so it's not like you can see partially-set
state.

In fact, as I recall we could replace invalidated_at in
ReplicationSlotPersistentData with a boolean "invalidated" flag, and
leave restart_lsn alone when invalidated. IIRC the only reason we
didn't do it that way was that we feared some code might observe some
valid value in restart_lsn without noticing that it belonged to an
invalidate slot. (Which is exactly what happened now, except with a
different field.)

Maybe. In that case pg_get_replication_slots() should be changed. We
should use the same criteria to decide whether a slot is invalidated
or not at all the places.
I am a fan of stricter, all-assumption-covering conditions. In case we
don't want to check restart_lsn, an Assert might be useful to validate
our assumption.

--
Best Wishes,
Ashutosh Bapat

#6Alvaro Herrera
alvherre@alvh.no-ip.org
In reply to: Ashutosh Bapat (#5)
Re: Catalog_xmin is not advanced when a logical slot is lost

On 2022-Nov-21, Ashutosh Bapat wrote:

Maybe. In that case pg_get_replication_slots() should be changed. We
should use the same criteria to decide whether a slot is invalidated
or not at all the places.

Right.

I am a fan of stricter, all-assumption-covering conditions. In case we
don't want to check restart_lsn, an Assert might be useful to validate
our assumption.

Agreed. I'll throw in an assert.

--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/

#7sirisha chamarthi
sirichamarthi22@gmail.com
In reply to: Alvaro Herrera (#6)
1 attachment(s)
Re: Catalog_xmin is not advanced when a logical slot is lost

Thanks Alvaro, Ashutosh for your comments.

On Mon, Nov 21, 2022 at 6:20 AM Alvaro Herrera <alvherre@alvh.no-ip.org>
wrote:

On 2022-Nov-21, Ashutosh Bapat wrote:

Maybe. In that case pg_get_replication_slots() should be changed. We
should use the same criteria to decide whether a slot is invalidated
or not at all the places.

Right.

Agreed.

I am a fan of stricter, all-assumption-covering conditions. In case we
don't want to check restart_lsn, an Assert might be useful to validate
our assumption.

Agreed. I'll throw in an assert.

Changed this in the patch to throw an assert.

Show quoted text

--
Álvaro Herrera PostgreSQL Developer —
https://www.EnterpriseDB.com/

Attachments:

0002-Ignore-invalidated-slots-while-computing-the-oldest-.patchapplication/octet-stream; name=0002-Ignore-invalidated-slots-while-computing-the-oldest-.patchDownload
From 405bf08190b0d712145c66f445949b18ff0fa430 Mon Sep 17 00:00:00 2001
From: root <root@pgvm.rlsumirojk0etd4qpjbaa2afce.tx.internal.cloudapp.net>
Date: Mon, 21 Nov 2022 15:13:25 +0000
Subject: [PATCH] Ignore invalidated slots while computing the oldest xmin

Catalog_xmin will not advance when a slot is invalidated until the invalidated slot is dropped.
This patch ignores invalidated slots while computing the oldest xmin.
---
 src/backend/replication/slot.c      | 6 ++++++
 src/backend/replication/slotfuncs.c | 8 ++++----
 2 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index c036a2c37b..f015ceca51 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -847,6 +847,7 @@ ReplicationSlotsComputeRequiredXmin(bool already_locked)
 		ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i];
 		TransactionId effective_xmin;
 		TransactionId effective_catalog_xmin;
+		bool is_invalidated;
 
 		if (!s->in_use)
 			continue;
@@ -854,8 +855,13 @@ ReplicationSlotsComputeRequiredXmin(bool already_locked)
 		SpinLockAcquire(&s->mutex);
 		effective_xmin = s->effective_xmin;
 		effective_catalog_xmin = s->effective_catalog_xmin;
+		is_invalidated = !XLogRecPtrIsInvalid(s->data.invalidated_at);
 		SpinLockRelease(&s->mutex);
 
+		/* ignore invalid slots while computing the oldest xmin */
+		if (is_invalidated)
+			continue;
+
 		/* check the data xmin */
 		if (TransactionIdIsValid(effective_xmin) &&
 			(!TransactionIdIsValid(agg_xmin) ||
diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index 16a3527903..5cf0c8e94b 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -318,11 +318,10 @@ pg_get_replication_slots(PG_FUNCTION_ARGS)
 		 * certain that the slot has been invalidated.  Otherwise, test
 		 * availability from restart_lsn.
 		 */
-		if (XLogRecPtrIsInvalid(slot_contents.data.restart_lsn) &&
-			!XLogRecPtrIsInvalid(slot_contents.data.invalidated_at))
-			walstate = WALAVAIL_REMOVED;
-		else
+		if (XLogRecPtrIsInvalid(slot_contents.data.invalidated_at))
 			walstate = GetWALAvailability(slot_contents.data.restart_lsn);
+		else
+			walstate = WALAVAIL_REMOVED;
 
 		switch (walstate)
 		{
@@ -369,6 +368,7 @@ pg_get_replication_slots(PG_FUNCTION_ARGS)
 						break;
 					}
 				}
+				Assert(XLogRecPtrIsInvalid(slot_contents.data.restart_lsn));
 				values[i++] = CStringGetTextDatum("lost");
 				break;
 		}
-- 
2.25.1

#8Alvaro Herrera
alvherre@alvh.no-ip.org
In reply to: sirisha chamarthi (#7)
Re: Catalog_xmin is not advanced when a logical slot is lost

On 2022-Nov-21, sirisha chamarthi wrote:

I am a fan of stricter, all-assumption-covering conditions. In case we
don't want to check restart_lsn, an Assert might be useful to validate
our assumption.

Agreed. I'll throw in an assert.

Changed this in the patch to throw an assert.

Thank you. I had pushed mine for CirrusCI to test, and it failed the
assert I added in slot.c:
https://cirrus-ci.com/build/4786354503548928
Not yet sure why, looking into it.

You didn't add any asserts to the slot.c code.

In slotfuncs.c, I'm not sure I want to assert anything about restart_lsn
in any cases other than when invalidated_at is set. In other words, I
prefer this coding in pg_get_replication_slots:

if (!XLogRecPtrIsInvalid(slot_contents.data.invalidated_at))
{
Assert(XLogRecPtrIsInvalid(slot_contents.data.restart_lsn));
walstate = WALAVAIL_REMOVED;
}
else
walstate = GetWALAvailability(slot_contents.data.restart_lsn);

Your proposal is doing this:

switch (walstate)
{
[...]
case WALAVAIL_REMOVED:

if (!XLogRecPtrIsInvalid(slot_contents.data.restart_lsn))
{
[...]
if (pid != 0)
[...] break;
}
Assert(XLogRecPtrIsInvalid(slot_contents.data.restart_lsn));

which sounds like it could be hit if the replica is connected to the
slot.

--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/

#9sirisha chamarthi
sirichamarthi22@gmail.com
In reply to: Alvaro Herrera (#8)
Re: Catalog_xmin is not advanced when a logical slot is lost

On Mon, Nov 21, 2022 at 8:05 AM Alvaro Herrera <alvherre@alvh.no-ip.org>
wrote:

On 2022-Nov-21, sirisha chamarthi wrote:

I am a fan of stricter, all-assumption-covering conditions. In case

we

don't want to check restart_lsn, an Assert might be useful to

validate

our assumption.

Agreed. I'll throw in an assert.

Changed this in the patch to throw an assert.

Thank you. I had pushed mine for CirrusCI to test, and it failed the
assert I added in slot.c:
https://cirrus-ci.com/build/4786354503548928
Not yet sure why, looking into it.

Can this be because restart_lsn is not set to InvalidXLogRecPtr for the
physical slots? My repro is as follows:

select pg_create_physical_replication_slot('s5');
// Load some data to invalidate slot
postgres@pgvm:~$ /usr/local/pgsql/bin/pg_receivewal -S s5 -D .
pg_receivewal: error: unexpected termination of replication stream: ERROR:
requested WAL segment 0000000100000000000000EB has already been removed
pg_receivewal: disconnected; waiting 5 seconds to try again
pg_receivewal: error: unexpected termination of replication stream: ERROR:
requested WAL segment 0000000100000000000000EB has already been removed
pg_receivewal: disconnected; waiting 5 seconds to try again
^Cpostgres@pgvm:~$ /usr/local/pgsql/bin/psql
psql (16devel)
Type "help" for help.

postgres=# select * from pg_replication_slots;
slot_name | plugin | slot_type | datoid | database | temporary |
active | active_pid | xmin | catalog_xmin | restart_lsn |
confirmed_flush_lsn | wal_status | safe_wal_size | two_phase
-----------+---------------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------+------------+---------------+-----------
s3 | test_decoding | logical | 5 | postgres | f | f
| | | 769 | | 0/A992E7D0
| lost | | f
s5 | | physical | | | f | f
| | | | 0/EB000000 |
| lost | | f

Show quoted text

--
Álvaro Herrera PostgreSQL Developer —
https://www.EnterpriseDB.com/

#10Alvaro Herrera
alvherre@alvh.no-ip.org
In reply to: sirisha chamarthi (#9)
Re: Catalog_xmin is not advanced when a logical slot is lost

On 2022-Nov-21, sirisha chamarthi wrote:

On Mon, Nov 21, 2022 at 8:05 AM Alvaro Herrera <alvherre@alvh.no-ip.org>
wrote:

Thank you. I had pushed mine for CirrusCI to test, and it failed the
assert I added in slot.c:
https://cirrus-ci.com/build/4786354503548928
Not yet sure why, looking into it.

Can this be because restart_lsn is not set to InvalidXLogRecPtr for the
physical slots?

Hmm, that makes no sense. Is that yet another bug? Looking.

--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
"No es bueno caminar con un hombre muerto"

#11sirisha chamarthi
sirichamarthi22@gmail.com
In reply to: Alvaro Herrera (#10)
Re: Catalog_xmin is not advanced when a logical slot is lost

On Mon, Nov 21, 2022 at 9:12 AM Alvaro Herrera <alvherre@alvh.no-ip.org>
wrote:

On 2022-Nov-21, sirisha chamarthi wrote:

On Mon, Nov 21, 2022 at 8:05 AM Alvaro Herrera <alvherre@alvh.no-ip.org>
wrote:

Thank you. I had pushed mine for CirrusCI to test, and it failed the
assert I added in slot.c:
https://cirrus-ci.com/build/4786354503548928
Not yet sure why, looking into it.

Can this be because restart_lsn is not set to InvalidXLogRecPtr for the
physical slots?

Hmm, that makes no sense. Is that yet another bug? Looking.

It appears to be. wal_sender is setting restart_lsn to a valid LSN even
when the slot is invalidated.

postgres=# select pg_Create_physical_replication_slot('s1');
pg_create_physical_replication_slot
-------------------------------------
(s1,)
(1 row)

postgres=# select * from pg_replication_slots;
slot_name | plugin | slot_type | datoid | database | temporary | active |
active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn |
wal_status | safe_wal_size | two_phase
-----------+--------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------+------------+---------------+-----------
s1 | | physical | | | f | f |
| | | | |
| -8254390272 | f
(1 row)

postgres=# checkpoint;
CHECKPOINT
postgres=# select * from pg_replication_slots;
slot_name | plugin | slot_type | datoid | database | temporary | active |
active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn |
wal_status | safe_wal_size | two_phase
-----------+--------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------+------------+---------------+-----------
s1 | | physical | | | f | f |
| | | | |
| -8374095064 | f
(1 row)

postgres=# \q
postgres@pgvm:~$ /usr/local/pgsql/bin/pg_receivewal -S s1 -D .
pg_receivewal: error: unexpected termination of replication stream: ERROR:
requested WAL segment 0000000100000000000000EB has already been removed
pg_receivewal: disconnected; waiting 5 seconds to try again
^Cpostgres@pgvm:~$ /usr/local/pgsql/bin/psql
psql (16devel)
Type "help" for help.

postgres=# select * from pg_replication_slots;
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
The connection to the server was lost. Attempting reset: Failed.
!?> ^C
!?>

In the log:
2022-11-21 17:31:48.159 UTC [3953664] STATEMENT: START_REPLICATION SLOT
"s1" 0/EB000000 TIMELINE 1
TRAP: failed Assert("XLogRecPtrIsInvalid(slot_contents.data.restart_lsn)"),
File: "slotfuncs.c", Line: 371, PID: 3953707

Show quoted text

--
Álvaro Herrera 48°01'N 7°57'E —
https://www.EnterpriseDB.com/
"No es bueno caminar con un hombre muerto"

#12Alvaro Herrera
alvherre@alvh.no-ip.org
In reply to: sirisha chamarthi (#11)
Re: Catalog_xmin is not advanced when a logical slot is lost

On 2022-Nov-21, sirisha chamarthi wrote:

It appears to be. wal_sender is setting restart_lsn to a valid LSN even
when the slot is invalidated.

postgres@pgvm:~$ /usr/local/pgsql/bin/pg_receivewal -S s1 -D .
pg_receivewal: error: unexpected termination of replication stream: ERROR:
requested WAL segment 0000000100000000000000EB has already been removed
pg_receivewal: disconnected; waiting 5 seconds to try again
^Cpostgres@pgvm:~$ /usr/local/pgsql/bin/psql
psql (16devel)
Type "help" for help.

postgres=# select * from pg_replication_slots;
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.

Whoa, I cannot reproduce this :-(

--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
"Java is clearly an example of money oriented programming" (A. Stepanov)

#13sirisha chamarthi
sirichamarthi22@gmail.com
In reply to: Alvaro Herrera (#12)
Re: Catalog_xmin is not advanced when a logical slot is lost

On Mon, Nov 21, 2022 at 10:11 AM Alvaro Herrera <alvherre@alvh.no-ip.org>
wrote:

On 2022-Nov-21, sirisha chamarthi wrote:

It appears to be. wal_sender is setting restart_lsn to a valid LSN even
when the slot is invalidated.

postgres@pgvm:~$ /usr/local/pgsql/bin/pg_receivewal -S s1 -D .
pg_receivewal: error: unexpected termination of replication stream:

ERROR:

requested WAL segment 0000000100000000000000EB has already been removed
pg_receivewal: disconnected; waiting 5 seconds to try again
^Cpostgres@pgvm:~$ /usr/local/pgsql/bin/psql
psql (16devel)
Type "help" for help.

postgres=# select * from pg_replication_slots;
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.

Whoa, I cannot reproduce this :-(

I have a old .partial file in the data directory to reproduce this.

postgres=# select * from pg_replication_slots;
slot_name | plugin | slot_type | datoid | database | temporary | active |
active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn |
wal_status | safe_wal_size | two_phase
-----------+--------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------+------------+---------------+-----------
s2 | | physical | | | f | f |
| | | 2/DC000000 | | lost
| | f
(1 row)

postgres=# \q
postgres@pgvm:~$ ls
0000000100000002000000D8 0000000100000002000000D9
0000000100000002000000DA 0000000100000002000000DB
0000000100000002000000DC.partial

Show quoted text

--
Álvaro Herrera Breisgau, Deutschland —
https://www.EnterpriseDB.com/
"Java is clearly an example of money oriented programming" (A. Stepanov)

#14sirisha chamarthi
sirichamarthi22@gmail.com
In reply to: sirisha chamarthi (#13)
Re: Catalog_xmin is not advanced when a logical slot is lost

On Mon, Nov 21, 2022 at 10:40 AM sirisha chamarthi <
sirichamarthi22@gmail.com> wrote:

On Mon, Nov 21, 2022 at 10:11 AM Alvaro Herrera <alvherre@alvh.no-ip.org>
wrote:

On 2022-Nov-21, sirisha chamarthi wrote:

It appears to be. wal_sender is setting restart_lsn to a valid LSN even
when the slot is invalidated.

postgres@pgvm:~$ /usr/local/pgsql/bin/pg_receivewal -S s1 -D .
pg_receivewal: error: unexpected termination of replication stream:

ERROR:

requested WAL segment 0000000100000000000000EB has already been removed
pg_receivewal: disconnected; waiting 5 seconds to try again
^Cpostgres@pgvm:~$ /usr/local/pgsql/bin/psql
psql (16devel)
Type "help" for help.

postgres=# select * from pg_replication_slots;
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.

Whoa, I cannot reproduce this :-(

I have a old .partial file in the data directory to reproduce this.

postgres=# select * from pg_replication_slots;
slot_name | plugin | slot_type | datoid | database | temporary | active |
active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn |
wal_status | safe_wal_size | two_phase

-----------+--------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------+------------+---------------+-----------
s2 | | physical | | | f | f |
| | | 2/DC000000 | | lost
| | f
(1 row)

postgres=# \q
postgres@pgvm:~$ ls
0000000100000002000000D8 0000000100000002000000D9
0000000100000002000000DA 0000000100000002000000DB
0000000100000002000000DC.partial

Just to be clear, it was hitting the assert I added in the slotfuncs.c but
not in the code you mentioned. Apologies for the confusion. Also it appears
in the above case I mentioned, the slot is not invalidated yet as the
checkpointer did not run though the state says it is lost.

Show quoted text

--
Álvaro Herrera Breisgau, Deutschland —
https://www.EnterpriseDB.com/
"Java is clearly an example of money oriented programming" (A. Stepanov)

#15Alvaro Herrera
alvherre@alvh.no-ip.org
In reply to: sirisha chamarthi (#13)
Re: Catalog_xmin is not advanced when a logical slot is lost

On 2022-Nov-21, sirisha chamarthi wrote:

I have a old .partial file in the data directory to reproduce this.

I don't think the .partial file is in itself important. But I think
this whole thing is a distraction. I managed to reproduce it
eventually, by messing with the slot and WAL at random, and my
conclusion is that we shouldn't mess with this at all for this bugfix.
Instead I'm going to do what Ashutosh mentioned at the start, which is
to verify both the restart_lsn and the invalidated_at, when deciding
whether to ignore the slot.

It seems to me that there is a bigger mess here, considering that we use
the effective_xmin in some places and the other xmin (the one that's
saved to disk) in others. I have no patience for trying to disentangle
that at this point, though.

--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
"Having your biases confirmed independently is how scientific progress is
made, and hence made our great society what it is today" (Mary Gardiner)

#16sirisha chamarthi
sirichamarthi22@gmail.com
In reply to: Alvaro Herrera (#15)
Re: Catalog_xmin is not advanced when a logical slot is lost

On Mon, Nov 21, 2022 at 10:56 AM Alvaro Herrera <alvherre@alvh.no-ip.org>
wrote:

On 2022-Nov-21, sirisha chamarthi wrote:

I have a old .partial file in the data directory to reproduce this.

I don't think the .partial file is in itself important. But I think
this whole thing is a distraction.

Yes, sorry for the confusion.

I managed to reproduce it
eventually, by messing with the slot and WAL at random, and my
conclusion is that we shouldn't mess with this at all for this bugfix.

Agreed.

Instead I'm going to do what Ashutosh mentioned at the start, which is
to verify both the restart_lsn and the invalidated_at, when deciding
whether to ignore the slot.

Sounds good to me. Thanks!

Show quoted text

It seems to me that there is a bigger mess here, considering that we use
the effective_xmin in some places and the other xmin (the one that's
saved to disk) in others. I have no patience for trying to disentangle
that at this point, though.

--
Álvaro Herrera PostgreSQL Developer —
https://www.EnterpriseDB.com/
"Having your biases confirmed independently is how scientific progress is
made, and hence made our great society what it is today" (Mary Gardiner)

#17Alvaro Herrera
alvherre@alvh.no-ip.org
In reply to: sirisha chamarthi (#16)
Re: Catalog_xmin is not advanced when a logical slot is lost

On 2022-Nov-21, sirisha chamarthi wrote:

On Mon, Nov 21, 2022 at 10:56 AM Alvaro Herrera <alvherre@alvh.no-ip.org>
wrote:

Instead I'm going to do what Ashutosh mentioned at the start, which is
to verify both the restart_lsn and the invalidated_at, when deciding
whether to ignore the slot.

Sounds good to me. Thanks!

Done now. I also a new elog(DEBUG1), which I think makes the issue a
bit easier notice.

I think it would be even better if we reset the underlying data from
effective_catalog_xmin ... even with this patch, we show a non-zero
value for a slot in status "lost" (and we ignore it when computing the
overall xmin), which I think is quite confusing. But we can do that in
master only.

Thanks for reporting this issue.

--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/