logical replication: restart_lsn can go backwards (and more), seems broken since 9.4

Started by Tomas Vondra, about 1 year ago, 42 messages
#1 Tomas Vondra
tomas@vondra.me
5 attachment(s)

Hi,

I've been investigating some issues reported by users, related to
logical replication unexpectedly breaking with messages like:

LOG: invalidating slot "s" because its restart_lsn X/Y exceeds
max_slot_wal_keep_size

which is pretty confusing, because the system has that GUC set to -1 (so
disabled, there should be no limit). Or a message like this:

ERROR: requested WAL segment 000...0AA has already been removed

which is a bit less confusing, but still puzzling because replication
slots are meant to prevent exactly this.

I speculated there's some sort of rare race condition in how we advance
the slot LSN values. I didn't know where to start, so I gave up on
trying to understand the whole complex code. Instead, I wrote a simple
stress test that:

1) sets up a cluster (primary + 1+ logical replicas)

2) runs pgbench on the primary, checks replicas/slots

3) randomly restarts the nodes (both primary/replicas) with either fast
or immediate mode, with random short pauses

4) runs checkpoints in a tight while loop

This is based on the observations that in past reports the issues seem
to happen only with logical replication, shortly after reconnect (e.g.
after connection reset etc.). And the slots are invalidated / WAL
removed during a checkpoint, so frequent checkpoints should make it
easier to hit ...

Attached are a couple of scripts running this - they're not particularly
pretty and may need some tweaking to make them work on your machine
(run.sh is the script to run).

And unfortunately, this started to fail pretty quickly. The short story is
that it's not difficult to get into a situation where restart_lsn for a
slot moves backwards, so something like this can happen:

1) slot restart_lsn moves forward to LSN A

2) checkpoint happens, updates min required LSN for slots, recycles
segments it considers unnecessary (up to LSN A)

3) slot restart_lsn moves backwards to LSN B (where B < A)

4) kaboom

This would perfectly explain all the symptoms I mentioned earlier. The
max_slot_wal_keep_size reference is confusing, but AFAIK it's just based
on the (reasonable) belief that LSNs can't move backwards, and so the only
reason for restart_lsn being before the min required LSN is that this GUC
kicked in. But the assumption does not hold :-(
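
To be clear why the message blames the GUC even when it's disabled: the
invalidation check (roughly paraphrased from InvalidatePossiblyObsoleteSlot
in slot.c, not the exact code) only compares the slot's restart_lsn against
the oldest LSN the checkpointer decided to keep, and there's only one
message for every way a slot can end up behind that point:

    /* roughly paraphrased, not the exact code */
    if (restart_lsn == InvalidXLogRecPtr || restart_lsn >= oldestLSN)
        break;    /* slot is fine, nothing to invalidate */

    /*
     * The slot is behind the WAL the checkpointer keeps. The only cause
     * the code knows about is max_slot_wal_keep_size, so that's what it
     * reports - even if the real cause is restart_lsn moving backwards.
     */
    ereport(LOG,
            (errmsg("invalidating slot \"%s\" because its restart_lsn %X/%X exceeds max_slot_wal_keep_size",
                    NameStr(slotname),
                    LSN_FORMAT_ARGS(restart_lsn))));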

Now, why/how can this happen?

I kept adding elog(LOG) messages to all places updating LSNs for a
slot, and asserts that fail if data.restart_lsn moves backwards. See the
attached 0001 patch. An example of a failure (for the walsender backend
that triggered the assert) looks like this:

344.139 LOG: starting logical decoding for slot "s1"

344.139 DETAIL: Streaming transactions committing after 1/E97EAC30,
reading WAL from 1/E96FB4D0.

344.140 LOG: logical decoding found consistent point at 1/E96FB4D0

344.140 DETAIL: Logical decoding will begin using saved snapshot.

344.140 LOG: LogicalConfirmReceivedLocation 1/E9865398

344.140 LOG: LogicalConfirmReceivedLocation updating
data.restart_lsn to 1/E979D4C8 (from 1/E96FB4D0)
candidate_restart_valid 0/0 (from 1/E9865398)
candidate_restart_lsn 0/0 (from 1/E979D4C8)

344.145 LOG: LogicalIncreaseRestartDecodingForSlot
restart_decoding_lsn 1/E96FB4D0

344.145 LOG: LogicalIncreaseRestartDecodingForSlot s1
candidate_restart_valid_lsn 1/E979D4C8 (0/0)
candidate_restart_lsn 1/E96FB4D0 (0/0)

344.147 LOG: LogicalIncreaseRestartDecodingForSlot
restart_decoding_lsn 1/E979D4C8

344.156 LOG: LogicalIncreaseXminForSlot
candidate_catalog_xmin 30699
candidate_xmin_lsn 1/E993AD68 (0/0)
...
344.235 LOG: LogicalIncreaseRestartDecodingForSlot
restart_decoding_lsn 1/E9F33AF8

344.240 LOG: LogicalConfirmReceivedLocation 1/E9DCCD78

344.240 LOG: LogicalConfirmReceivedLocation updating
data.restart_lsn to 1/E96FB4D0 (from 1/E979D4C8)
candidate_restart_valid 0/0 (from 1/E979D4C8)
candidate_restart_lsn 0/0 (from 1/E96FB4D0)

345.536 LOG: server process (PID 2518127) was terminated by
signal 6: Aborted

We start decoding at 1/E96FB4D0, and right after startup we get a
confirmation, and LogicalConfirmReceivedLocation updates restart_lsn to
1/E979D4C8.

But then LogicalIncreaseRestartDecodingForSlot comes along, and stores
the restart_decoding_lsn 1/E96FB4D0 (which is the initial restart_lsn)
into candidate_restart_lsn.

And then a little bit later we get another confirmation, we call
LogicalConfirmReceivedLocation which propagates candidate_restart_lsn
into data.restart_lsn.
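
For reference, the relevant bit of LogicalConfirmReceivedLocation
(roughly, with the locking and the other fields left out) simply installs
whatever candidate is pending, with no check that it's actually ahead of
the current value:

    if (MyReplicationSlot->candidate_restart_valid != InvalidXLogRecPtr &&
        MyReplicationSlot->candidate_restart_valid <= lsn)
    {
        Assert(MyReplicationSlot->candidate_restart_lsn != InvalidXLogRecPtr);

        /* the candidate becomes the new restart_lsn, even if it's older */
        MyReplicationSlot->data.restart_lsn = MyReplicationSlot->candidate_restart_lsn;
        MyReplicationSlot->candidate_restart_lsn = InvalidXLogRecPtr;
        MyReplicationSlot->candidate_restart_valid = InvalidXLogRecPtr;
        updated_restart = true;
    }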

This is how restart_lsn moves backwards, causing issues. I've reproduced
this on PG14 and current master, but I believe the issue exists since
the introduction of logical replication in 9.4 :-(

I'm not claiming this is the only way how this can happen, but all the
cases I've seen in my stress testing look like this. Moreover, I'm not
claiming this is the only LSN field that can move backwards like this.
It seems to me various other candidate_ fields have the same issue, but
may have consequences other than discarding "unnecessary" WAL.

I've been reminded of this [1] thread from 2022. I'm 99% sure it's the
same issue - it happened shortly after a reconnect, etc. And it seems to
me Masahiko-san was right about the causes in [2]. I don't think the fix
suggested in that message (changing the branch to "else if") can work,
though. At least it did not really help in my testing.

And I'm not sure it'd fix all the issues - it only affects restart_lsn,
but AFAICS the same issue (LSNs moving backwards) can happen for the
other LSN slot field (candidate_xmin_lsn).

I don't see how the backwards move could be valid for any of those
fields, nor for "propagating" the candidate values into restart_lsn.

But there's no protection against the backward moves - so either it's
considered to be OK (which seems incorrect), or it was not expected to
happen in practice.

The 0001 patch adds asserts preventing those backward moves on all the
fields. This means it fails with an abort earlier, even before a checkpoint
gets a chance to invalidate the slot or remove the segment.

The 0002 part replaces the asserts with elog(LOG), and instead tweaks the
updates to do Max(old,new) to prevent the backward moves. With this I'm
no longer able to reproduce the issue - and there are a lot of LOG
messages about the (prevented) backward moves.

Unfortunately, I'm not sure this is quite correct. Consider, for example,
this:

    slot->candidate_catalog_xmin = xmin;
    slot->candidate_xmin_lsn = Max(slot->candidate_xmin_lsn,
                                   current_lsn);

I suspect this means the fields could get "out of sync" - e.g. a call with
a newer xmin but an older current_lsn would pair the new
candidate_catalog_xmin with the old (higher) candidate_xmin_lsn, a
combination no single call ever produced. Not sure what could break
because of this.

I have to admit the "protocol" for the candidate fields (and how the
values propagate) is not very clear to me. And AFAICS it's not really
described/explained anywhere :-(

Note: While working on this, I realized PG14 and PG15 need the fix
eb27d3dc8887, which was backpatched only to 16+. But I hit that on 14
too during testing. I already pinged Daniel about this, but you may need
to cherry-pick that commit before testing until he has time for it.

regards

[1]: /messages/by-id/Yz2hivgyjS1RfMKs@depesz.com

[2]: /messages/by-id/CAD21AoBVhYnGBuW_o=wEGgTp01qiHNAx1a14b1X9kFXmuBe=sg@mail.gmail.com

--
Tomas Vondra

Attachments:

logrep-script.tgz (application/x-compressed-tar)
vmaster-0001-WIP-add-elog-messages-and-asserts.patch (text/x-patch)
From 8cd1b0472f1189be9ba19455df50fd1cc1f730e5 Mon Sep 17 00:00:00 2001
From: tomas <tomas>
Date: Wed, 6 Nov 2024 12:41:25 +0100
Subject: [PATCH vnew 1/2] WIP: add elog messages and asserts

shows how the LSNs move, protects against move backwards
---
 src/backend/access/transam/xlog.c           | 34 ++++++++++++
 src/backend/replication/logical/logical.c   | 60 +++++++++++++++++++++
 src/backend/replication/logical/snapbuild.c |  6 +++
 src/backend/replication/slot.c              | 17 +++++-
 4 files changed, 115 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f1a795bba9f..ba9415b9149 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2782,9 +2782,17 @@ XLogSetAsyncXactLSN(XLogRecPtr asyncXactLSN)
 void
 XLogSetReplicationSlotMinimumLSN(XLogRecPtr lsn)
 {
+	XLogRecPtr	oldlsn;
+
 	SpinLockAcquire(&XLogCtl->info_lck);
+	oldlsn = XLogCtl->replicationSlotMinLSN;
 	XLogCtl->replicationSlotMinLSN = lsn;
 	SpinLockRelease(&XLogCtl->info_lck);
+
+	elog(LOG, "XLogSetReplicationSlotMinimumLSN set %X/%X (old %X/%X)", LSN_FORMAT_ARGS(lsn), LSN_FORMAT_ARGS(oldlsn));
+
+	/* no backward moves */
+	Assert(lsn >= oldlsn);
 }
 
 
@@ -2801,6 +2809,8 @@ XLogGetReplicationSlotMinimumLSN(void)
 	retval = XLogCtl->replicationSlotMinLSN;
 	SpinLockRelease(&XLogCtl->info_lck);
 
+	elog(LOG, "XLogGetReplicationSlotMinimumLSN got %X/%X", LSN_FORMAT_ARGS(retval));
+
 	return retval;
 }
 
@@ -9462,12 +9472,19 @@ CreateCheckPoint(int flags)
 	if (PriorRedoPtr != InvalidXLogRecPtr)
 		UpdateCheckPointDistanceEstimate(RedoRecPtr - PriorRedoPtr);
 
+	elog(LOG, "CreateCheckPoint / slot invalidation start / RedoRecPtr %X/%X recptr %X/%X _logSegNo %lu",
+		LSN_FORMAT_ARGS(RedoRecPtr), LSN_FORMAT_ARGS(recptr), _logSegNo);
+
 	/*
 	 * Delete old log files, those no longer needed for last checkpoint to
 	 * prevent the disk holding the xlog from growing full.
 	 */
 	XLByteToSeg(RedoRecPtr, _logSegNo, wal_segment_size);
 	KeepLogSeg(recptr, &_logSegNo);
+
+	elog(LOG, "CreateCheckPoint / slot invalidation after KeepLogSeg / RedoRecPtr %X/%X recptr %X/%X _logSegNo %lu",
+		LSN_FORMAT_ARGS(RedoRecPtr), LSN_FORMAT_ARGS(recptr), _logSegNo);
+
 	if (InvalidateObsoleteReplicationSlots(_logSegNo))
 	{
 		/*
@@ -9476,10 +9493,15 @@ CreateCheckPoint(int flags)
 		 */
 		XLByteToSeg(RedoRecPtr, _logSegNo, wal_segment_size);
 		KeepLogSeg(recptr, &_logSegNo);
+		elog(LOG, "CreateCheckPoint / slot invalidation recalculate / RedoRecPtr %X/%X recptr %X/%X _logSegNo %lu",
+			LSN_FORMAT_ARGS(RedoRecPtr), LSN_FORMAT_ARGS(recptr), _logSegNo);
 	}
 	_logSegNo--;
 	RemoveOldXlogFiles(_logSegNo, RedoRecPtr, recptr);
 
+	elog(LOG, "CreateCheckPoint / slot invalidation done / RedoRecPtr %X/%X recptr %X/%X _logSegNo %lu",
+		LSN_FORMAT_ARGS(RedoRecPtr), LSN_FORMAT_ARGS(recptr), _logSegNo);
+
 	/*
 	 * Make more log segments if needed.  (Do this after recycling old log
 	 * segments, since that may supply some of the needed files.)
@@ -9877,6 +9899,10 @@ CreateRestartPoint(int flags)
 	replayPtr = GetXLogReplayRecPtr(&replayTLI);
 	endptr = (receivePtr < replayPtr) ? replayPtr : receivePtr;
 	KeepLogSeg(endptr, &_logSegNo);
+
+	elog(LOG, "CreateRestartPoint / slot invalidation start / RedoRecPtr %X/%X receivePtr %X/%X replayPtr %X/%X endptr %X/%X _logSegNo %lu",
+		LSN_FORMAT_ARGS(RedoRecPtr), LSN_FORMAT_ARGS(receivePtr), LSN_FORMAT_ARGS(replayPtr), LSN_FORMAT_ARGS(endptr), _logSegNo);
+
 	if (InvalidateObsoleteReplicationSlots(_logSegNo))
 	{
 		/*
@@ -9885,9 +9911,15 @@ CreateRestartPoint(int flags)
 		 */
 		XLByteToSeg(RedoRecPtr, _logSegNo, wal_segment_size);
 		KeepLogSeg(endptr, &_logSegNo);
+
+		elog(LOG, "CreateRestartPoint / slot invalidation recalculate / RedoRecPtr %X/%X endptr %X/%X _logSegNo %lu",
+			LSN_FORMAT_ARGS(RedoRecPtr), LSN_FORMAT_ARGS(endptr), _logSegNo);
 	}
 	_logSegNo--;
 
+	elog(LOG, "CreateRestartPoint / slot invalidation done / RedoRecPtr %X/%X endptr %X/%X _logSegNo %lu",
+		LSN_FORMAT_ARGS(RedoRecPtr), LSN_FORMAT_ARGS(endptr), _logSegNo);
+
 	/*
 	 * Try to recycle segments on a useful timeline. If we've been promoted
 	 * since the beginning of this restartpoint, use the new timeline chosen
@@ -10109,6 +10141,8 @@ KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo)
 		}
 	}
 
+	elog(LOG, "KeepLogSeg recptr %X/%X logSegNo %lu segno %lu", LSN_FORMAT_ARGS(recptr), *logSegNo, segno);
+
 	/* don't delete WAL segments newer than the calculated segment */
 	if (segno < *logSegNo)
 		*logSegNo = segno;
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 2f7b2c85d9b..16fed53063b 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -375,9 +375,16 @@ CreateInitDecodingContext(const char *plugin,
 		ReplicationSlotReserveWal();
 	else
 	{
+		XLogRecPtr oldlsn;
 		SpinLockAcquire(&slot->mutex);
+		oldlsn = slot->data.restart_lsn;
 		slot->data.restart_lsn = restart_lsn;
 		SpinLockRelease(&slot->mutex);
+
+		elog(LOG, "CreateInitDecodingContext updated restart_lsn %X/%X (old %X/%X) for slot %s",
+			LSN_FORMAT_ARGS(restart_lsn),
+			LSN_FORMAT_ARGS(oldlsn),
+			NameStr(slot->data.name));
 	}
 
 	/* ----
@@ -1598,6 +1605,13 @@ LogicalIncreaseXminForSlot(XLogRecPtr current_lsn, TransactionId xmin)
 	 */
 	else if (current_lsn <= slot->data.confirmed_flush)
 	{
+		elog(LOG, "LogicalIncreaseXminForSlot candidate_catalog_xmin %u candidate_xmin_lsn %X/%X (%X/%X)",
+			  xmin, LSN_FORMAT_ARGS(current_lsn),
+			  LSN_FORMAT_ARGS(slot->candidate_xmin_lsn));
+
+		/* don't allow the LSN to go backwards */
+		Assert(slot->candidate_xmin_lsn <= current_lsn);
+
 		slot->candidate_catalog_xmin = xmin;
 		slot->candidate_xmin_lsn = current_lsn;
 
@@ -1611,6 +1625,13 @@ LogicalIncreaseXminForSlot(XLogRecPtr current_lsn, TransactionId xmin)
 	 */
 	else if (slot->candidate_xmin_lsn == InvalidXLogRecPtr)
 	{
+		elog(LOG, "LogicalIncreaseXminForSlot candidate_catalog_xmin %u candidate_xmin_lsn %X/%X (%X/%X)",
+			  xmin, LSN_FORMAT_ARGS(current_lsn),
+			  LSN_FORMAT_ARGS(slot->candidate_xmin_lsn));
+
+		/* don't allow the LSN to go backwards */
+		Assert(slot->candidate_xmin_lsn <= current_lsn);
+
 		slot->candidate_catalog_xmin = xmin;
 		slot->candidate_xmin_lsn = current_lsn;
 	}
@@ -1653,6 +1674,18 @@ LogicalIncreaseRestartDecodingForSlot(XLogRecPtr current_lsn, XLogRecPtr restart
 	 */
 	else if (current_lsn <= slot->data.confirmed_flush)
 	{
+		elog(LOG, "LogicalIncreaseRestartDecodingForSlot %s candidate_restart_valid_lsn %X/%X (%X/%X) candidate_restart_lsn %X/%X (%X/%X)",
+			NameStr(slot->data.name),
+			LSN_FORMAT_ARGS(current_lsn), LSN_FORMAT_ARGS(slot->candidate_restart_valid),
+			LSN_FORMAT_ARGS(restart_lsn), LSN_FORMAT_ARGS(slot->candidate_restart_lsn));
+
+		/* don't allow LSNs to go back */
+		Assert(slot->candidate_restart_valid <= current_lsn);
+		Assert(slot->candidate_restart_lsn <= restart_lsn);
+
+		/* also don't even consider going back for actual restart_lsn */
+		Assert(slot->data.restart_lsn <= restart_lsn);
+
 		slot->candidate_restart_valid = current_lsn;
 		slot->candidate_restart_lsn = restart_lsn;
 
@@ -1667,6 +1700,21 @@ LogicalIncreaseRestartDecodingForSlot(XLogRecPtr current_lsn, XLogRecPtr restart
 	 */
 	if (slot->candidate_restart_valid == InvalidXLogRecPtr)
 	{
+		elog(LOG, "LogicalIncreaseRestartDecodingForSlot %s candidate_restart_valid_lsn %X/%X (%X/%X) candidate_restart_lsn %X/%X (%X/%X)",
+			  NameStr(slot->data.name),
+			  LSN_FORMAT_ARGS(current_lsn), LSN_FORMAT_ARGS(slot->candidate_restart_valid),
+			  LSN_FORMAT_ARGS(restart_lsn), LSN_FORMAT_ARGS(slot->candidate_restart_lsn));
+
+		/* don't allow LSNs to go back */
+		Assert(slot->candidate_restart_valid <= current_lsn);
+		Assert(slot->candidate_restart_lsn <= restart_lsn);
+
+		/*
+		 * also don't even consider going back for actual restart_lsn
+		 * XXX This is the first assert we hit, called from snapbuild.c: LogicalIncreaseRestartDecodingForSlot(lsn, txn->restart_decoding_lsn);
+		 */
+		Assert(slot->data.restart_lsn <= restart_lsn);
+
 		slot->candidate_restart_valid = current_lsn;
 		slot->candidate_restart_lsn = restart_lsn;
 		SpinLockRelease(&slot->mutex);
@@ -1707,6 +1755,8 @@ LogicalConfirmReceivedLocation(XLogRecPtr lsn)
 {
 	Assert(lsn != InvalidXLogRecPtr);
 
+	elog(LOG, "LogicalConfirmReceivedLocation %X/%X", LSN_FORMAT_ARGS(lsn));
+
 	/* Do an unlocked check for candidate_lsn first. */
 	if (MyReplicationSlot->candidate_xmin_lsn != InvalidXLogRecPtr ||
 		MyReplicationSlot->candidate_restart_valid != InvalidXLogRecPtr)
@@ -1746,6 +1796,15 @@ LogicalConfirmReceivedLocation(XLogRecPtr lsn)
 		{
 			Assert(MyReplicationSlot->candidate_restart_lsn != InvalidXLogRecPtr);
 
+			elog(LOG, "LogicalConfirmReceivedLocation updating data.restart_lsn to %X/%X (from %X/%X) candidate_restart_valid 0/0 (from %X/%X) candidate_restart_lsn 0/0 (from %X/%X)",
+				LSN_FORMAT_ARGS(MyReplicationSlot->candidate_restart_lsn),
+				LSN_FORMAT_ARGS(MyReplicationSlot->data.restart_lsn),
+				LSN_FORMAT_ARGS(MyReplicationSlot->candidate_restart_valid),
+				LSN_FORMAT_ARGS(MyReplicationSlot->candidate_restart_lsn));
+
+			/* don't allow the LSN to go backwards */
+			Assert(MyReplicationSlot->candidate_restart_lsn >= MyReplicationSlot->data.restart_lsn);
+
 			MyReplicationSlot->data.restart_lsn = MyReplicationSlot->candidate_restart_lsn;
 			MyReplicationSlot->candidate_restart_lsn = InvalidXLogRecPtr;
 			MyReplicationSlot->candidate_restart_valid = InvalidXLogRecPtr;
@@ -1774,6 +1833,7 @@ LogicalConfirmReceivedLocation(XLogRecPtr lsn)
 			MyReplicationSlot->effective_catalog_xmin = MyReplicationSlot->data.catalog_xmin;
 			SpinLockRelease(&MyReplicationSlot->mutex);
 
+			elog(LOG, "LogicalConfirmReceivedLocation calculating required LSN/xmin after update");
 			ReplicationSlotsComputeRequiredXmin(false);
 			ReplicationSlotsComputeRequiredLSN();
 		}
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 7b761510f1e..05184516704 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -1270,7 +1270,10 @@ SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn, xl_running_xact
 	 * anything because we hadn't reached a consistent state yet.
 	 */
 	if (txn != NULL && txn->restart_decoding_lsn != InvalidXLogRecPtr)
+	{
+		elog(LOG, "LogicalIncreaseRestartDecodingForSlot restart_decoding_lsn %X/%X", LSN_FORMAT_ARGS(txn->restart_decoding_lsn));
 		LogicalIncreaseRestartDecodingForSlot(lsn, txn->restart_decoding_lsn);
+	}
 
 	/*
 	 * No in-progress transaction, can reuse the last serialized snapshot if
@@ -1279,8 +1282,11 @@ SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn, xl_running_xact
 	else if (txn == NULL &&
 			 builder->reorder->current_restart_decoding_lsn != InvalidXLogRecPtr &&
 			 builder->last_serialized_snapshot != InvalidXLogRecPtr)
+	{
+		elog(LOG, "LogicalIncreaseRestartDecodingForSlot last_serialized_snapshot %X/%X", LSN_FORMAT_ARGS(builder->last_serialized_snapshot));
 		LogicalIncreaseRestartDecodingForSlot(lsn,
 											  builder->last_serialized_snapshot);
+	}
 }
 
 
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 037a347cba0..8a7aa257293 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -827,6 +827,8 @@ ReplicationSlotsComputeRequiredLSN(void)
 
 	Assert(ReplicationSlotCtl != NULL);
 
+	elog(LOG, "ReplicationSlotsComputeRequiredLSN / start");
+
 	LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
 	for (i = 0; i < max_replication_slots; i++)
 	{
@@ -840,6 +842,8 @@ ReplicationSlotsComputeRequiredLSN(void)
 		restart_lsn = s->data.restart_lsn;
 		SpinLockRelease(&s->mutex);
 
+		elog(LOG, "ReplicationSlotsComputeRequiredLSN found slot %s with restart_lsn %X/%X", NameStr(s->data.name), LSN_FORMAT_ARGS(restart_lsn));
+
 		if (restart_lsn != InvalidXLogRecPtr &&
 			(min_required == InvalidXLogRecPtr ||
 			 restart_lsn < min_required))
@@ -847,6 +851,8 @@ ReplicationSlotsComputeRequiredLSN(void)
 	}
 	LWLockRelease(ReplicationSlotControlLock);
 
+	elog(LOG, "ReplicationSlotsComputeRequiredLSN / done");
+
 	XLogSetReplicationSlotMinimumLSN(min_required);
 }
 
@@ -1285,9 +1291,9 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlot *s, XLogRecPtr oldestLSN,
 			ReplicationSlotRelease();
 
 			ereport(LOG,
-					(errmsg("invalidating slot \"%s\" because its restart_lsn %X/%X exceeds max_slot_wal_keep_size",
+					(errmsg("invalidating slot \"%s\" because its restart_lsn %X/%X exceeds max_slot_wal_keep_size (%d)",
 							NameStr(slotname),
-							LSN_FORMAT_ARGS(restart_lsn))));
+							LSN_FORMAT_ARGS(restart_lsn), max_slot_wal_keep_size_mb)));
 
 			/* done with this slot for now */
 			break;
@@ -1315,6 +1321,9 @@ InvalidateObsoleteReplicationSlots(XLogSegNo oldestSegno)
 
 	XLogSegNoOffsetToRecPtr(oldestSegno, 0, wal_segment_size, oldestLSN);
 
+	elog(LOG, "InvalidateObsoleteReplicationSlots oldestSegno %lu oldestLSN %X/%X",
+		oldestSegno, LSN_FORMAT_ARGS(oldestLSN));
+
 restart:
 	LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
 	for (int i = 0; i < max_replication_slots; i++)
@@ -1327,6 +1336,7 @@ restart:
 		if (InvalidatePossiblyObsoleteSlot(s, oldestLSN, &invalidated))
 		{
 			/* if the lock was released, start from scratch */
+			elog(LOG, "InvalidateObsoleteReplicationSlots restart invalidated=%d", invalidated);
 			goto restart;
 		}
 	}
@@ -1337,6 +1347,7 @@ restart:
 	 */
 	if (invalidated)
 	{
+		elog(LOG, "InvalidateObsoleteReplicationSlots slot invalidated, recalculate required");
 		ReplicationSlotsComputeRequiredXmin(false);
 		ReplicationSlotsComputeRequiredLSN();
 	}
@@ -1854,6 +1865,8 @@ RestoreSlotFromDisk(const char *name)
 		slot->candidate_restart_lsn = InvalidXLogRecPtr;
 		slot->candidate_restart_valid = InvalidXLogRecPtr;
 
+		elog(LOG, "RestoreSlotFromDisk restart_lsn %X/%X candidate_xmin_lsn 0/0 candidate_restart_lsn 0/0 candidate_restart_valid 0/0", LSN_FORMAT_ARGS(slot->data.restart_lsn));
+
 		slot->in_use = true;
 		slot->active_pid = 0;
 
-- 
2.39.5

vmaster-0002-WIP-change-asserts-to-elog-defense.patch (text/x-patch)
From 81981f60949ab4c0fb9418b17dfcc37c0aa063f1 Mon Sep 17 00:00:00 2001
From: tomas <tomas>
Date: Wed, 6 Nov 2024 12:46:16 +0100
Subject: [PATCH vnew 2/2] WIP: change asserts to elog + defense

When updating LSNs, use Max() with the preceding value to protect against
moves backwards.
---
 src/backend/replication/logical/logical.c | 57 ++++++++++++-----------
 1 file changed, 31 insertions(+), 26 deletions(-)

diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 16fed53063b..988f4add977 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -1609,11 +1609,12 @@ LogicalIncreaseXminForSlot(XLogRecPtr current_lsn, TransactionId xmin)
 			  xmin, LSN_FORMAT_ARGS(current_lsn),
 			  LSN_FORMAT_ARGS(slot->candidate_xmin_lsn));
 
-		/* don't allow the LSN to go backwards */
-		Assert(slot->candidate_xmin_lsn <= current_lsn);
+		if (slot->candidate_xmin_lsn > current_lsn)
+			elog(LOG, "ASSERT: LogicalIncreaseXminForSlot: slot->candidate_xmin_lsn > current_lsn");
 
+		/* XXX doesn't the Max() put the fields out of sync? */
 		slot->candidate_catalog_xmin = xmin;
-		slot->candidate_xmin_lsn = current_lsn;
+		slot->candidate_xmin_lsn = Max(slot->candidate_xmin_lsn, current_lsn);
 
 		/* our candidate can directly be used */
 		updated_xmin = true;
@@ -1629,11 +1630,12 @@ LogicalIncreaseXminForSlot(XLogRecPtr current_lsn, TransactionId xmin)
 			  xmin, LSN_FORMAT_ARGS(current_lsn),
 			  LSN_FORMAT_ARGS(slot->candidate_xmin_lsn));
 
-		/* don't allow the LSN to go backwards */
-		Assert(slot->candidate_xmin_lsn <= current_lsn);
+		if (slot->candidate_xmin_lsn > current_lsn)
+			elog(LOG, "ASSERT: LogicalIncreaseXminForSlot: slot->candidate_xmin_lsn > current_lsn");
 
+		/* XXX doesn't the Max() put the fields out of sync? */
 		slot->candidate_catalog_xmin = xmin;
-		slot->candidate_xmin_lsn = current_lsn;
+		slot->candidate_xmin_lsn = Max(slot->candidate_xmin_lsn, current_lsn);
 	}
 	SpinLockRelease(&slot->mutex);
 
@@ -1679,15 +1681,17 @@ LogicalIncreaseRestartDecodingForSlot(XLogRecPtr current_lsn, XLogRecPtr restart
 			LSN_FORMAT_ARGS(current_lsn), LSN_FORMAT_ARGS(slot->candidate_restart_valid),
 			LSN_FORMAT_ARGS(restart_lsn), LSN_FORMAT_ARGS(slot->candidate_restart_lsn));
 
-		/* don't allow LSNs to go back */
-		Assert(slot->candidate_restart_valid <= current_lsn);
-		Assert(slot->candidate_restart_lsn <= restart_lsn);
+		if (slot->candidate_restart_valid > current_lsn)
+			elog(LOG, "ASSERT: LogicalIncreaseRestartDecodingForSlot: slot->candidate_restart_valid > current_lsn");
 
-		/* also don't even consider going back for actual restart_lsn */
-		Assert(slot->data.restart_lsn <= restart_lsn);
+		if (slot->candidate_restart_lsn > restart_lsn)
+			elog(LOG, "ASSERT: LogicalIncreaseRestartDecodingForSlot: slot->candidate_restart_lsn > restart_lsn");
 
-		slot->candidate_restart_valid = current_lsn;
-		slot->candidate_restart_lsn = restart_lsn;
+		if (slot->data.restart_lsn > restart_lsn)
+			elog(LOG, "ASSERT: LogicalIncreaseRestartDecodingForSlot: slot->data.restart_lsn > restart_lsn");
+
+		slot->candidate_restart_valid = Max(slot->candidate_restart_valid, current_lsn);
+		slot->candidate_restart_lsn = Max(slot->data.restart_lsn, Max(slot->candidate_restart_lsn, restart_lsn));
 
 		/* our candidate can directly be used */
 		updated_lsn = true;
@@ -1705,18 +1709,18 @@ LogicalIncreaseRestartDecodingForSlot(XLogRecPtr current_lsn, XLogRecPtr restart
 			  LSN_FORMAT_ARGS(current_lsn), LSN_FORMAT_ARGS(slot->candidate_restart_valid),
 			  LSN_FORMAT_ARGS(restart_lsn), LSN_FORMAT_ARGS(slot->candidate_restart_lsn));
 
-		/* don't allow LSNs to go back */
-		Assert(slot->candidate_restart_valid <= current_lsn);
-		Assert(slot->candidate_restart_lsn <= restart_lsn);
+		if (slot->candidate_restart_valid > current_lsn)
+			elog(LOG, "ASSERT: LogicalIncreaseRestartDecodingForSlot: slot->candidate_restart_valid > current_lsn");
 
-		/*
-		 * also don't even consider going back for actual restart_lsn
-		 * XXX This is the first assert we hit, called from snapbuild.c: LogicalIncreaseRestartDecodingForSlot(lsn, txn->restart_decoding_lsn);
-		 */
-		Assert(slot->data.restart_lsn <= restart_lsn);
+		if (slot->candidate_restart_lsn > restart_lsn)
+			elog(LOG, "ASSERT: LogicalIncreaseRestartDecodingForSlot: slot->candidate_restart_lsn > restart_lsn");
+
+		if (slot->data.restart_lsn > restart_lsn)
+			elog(LOG, "ASSERT: LogicalIncreaseRestartDecodingForSlot: slot->data.restart_lsn > restart_lsn");
+
+		slot->candidate_restart_valid = Max(slot->candidate_restart_valid, current_lsn);
+		slot->candidate_restart_lsn = Max(slot->data.restart_lsn, Max(slot->candidate_restart_lsn, restart_lsn));
 
-		slot->candidate_restart_valid = current_lsn;
-		slot->candidate_restart_lsn = restart_lsn;
 		SpinLockRelease(&slot->mutex);
 
 		elog(DEBUG1, "got new restart lsn %X/%X at %X/%X",
@@ -1802,10 +1806,11 @@ LogicalConfirmReceivedLocation(XLogRecPtr lsn)
 				LSN_FORMAT_ARGS(MyReplicationSlot->candidate_restart_valid),
 				LSN_FORMAT_ARGS(MyReplicationSlot->candidate_restart_lsn));
 
-			/* don't allow the LSN to go backwards */
-			Assert(MyReplicationSlot->candidate_restart_lsn >= MyReplicationSlot->data.restart_lsn);
+			if (MyReplicationSlot->candidate_restart_lsn < MyReplicationSlot->data.restart_lsn)
+				elog(LOG, "ASSERT: LogicalIncreaseRestartDecodingForSlot: slot->data.restart_lsn > MyReplicationSlot->candidate_restart_lsn");
 
-			MyReplicationSlot->data.restart_lsn = MyReplicationSlot->candidate_restart_lsn;
+			MyReplicationSlot->data.restart_lsn = Max(MyReplicationSlot->data.restart_lsn,
+								  MyReplicationSlot->candidate_restart_lsn);
 			MyReplicationSlot->candidate_restart_lsn = InvalidXLogRecPtr;
 			MyReplicationSlot->candidate_restart_valid = InvalidXLogRecPtr;
 			updated_restart = true;
-- 
2.39.5

v14-0001-WIP-add-elog-messages-and-asserts.patch (text/x-patch)
From 8cd1b0472f1189be9ba19455df50fd1cc1f730e5 Mon Sep 17 00:00:00 2001
From: tomas <tomas>
Date: Wed, 6 Nov 2024 12:41:25 +0100
Subject: [PATCH v14 1/2] WIP: add elog messages and asserts

shows how the LSNs move, protects against move backwards
---
 src/backend/access/transam/xlog.c           | 34 ++++++++++++
 src/backend/replication/logical/logical.c   | 60 +++++++++++++++++++++
 src/backend/replication/logical/snapbuild.c |  6 +++
 src/backend/replication/slot.c              | 17 +++++-
 4 files changed, 115 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f1a795bba9f..ba9415b9149 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2782,9 +2782,17 @@ XLogSetAsyncXactLSN(XLogRecPtr asyncXactLSN)
 void
 XLogSetReplicationSlotMinimumLSN(XLogRecPtr lsn)
 {
+	XLogRecPtr	oldlsn;
+
 	SpinLockAcquire(&XLogCtl->info_lck);
+	oldlsn = XLogCtl->replicationSlotMinLSN;
 	XLogCtl->replicationSlotMinLSN = lsn;
 	SpinLockRelease(&XLogCtl->info_lck);
+
+	elog(LOG, "XLogSetReplicationSlotMinimumLSN set %X/%X (old %X/%X)", LSN_FORMAT_ARGS(lsn), LSN_FORMAT_ARGS(oldlsn));
+
+	/* no backward moves */
+	Assert(lsn >= oldlsn);
 }
 
 
@@ -2801,6 +2809,8 @@ XLogGetReplicationSlotMinimumLSN(void)
 	retval = XLogCtl->replicationSlotMinLSN;
 	SpinLockRelease(&XLogCtl->info_lck);
 
+	elog(LOG, "XLogGetReplicationSlotMinimumLSN got %X/%X", LSN_FORMAT_ARGS(retval));
+
 	return retval;
 }
 
@@ -9462,12 +9472,19 @@ CreateCheckPoint(int flags)
 	if (PriorRedoPtr != InvalidXLogRecPtr)
 		UpdateCheckPointDistanceEstimate(RedoRecPtr - PriorRedoPtr);
 
+	elog(LOG, "CreateCheckPoint / slot invalidation start / RedoRecPtr %X/%X recptr %X/%X _logSegNo %lu",
+		LSN_FORMAT_ARGS(RedoRecPtr), LSN_FORMAT_ARGS(recptr), _logSegNo);
+
 	/*
 	 * Delete old log files, those no longer needed for last checkpoint to
 	 * prevent the disk holding the xlog from growing full.
 	 */
 	XLByteToSeg(RedoRecPtr, _logSegNo, wal_segment_size);
 	KeepLogSeg(recptr, &_logSegNo);
+
+	elog(LOG, "CreateCheckPoint / slot invalidation after KeepLogSeg / RedoRecPtr %X/%X recptr %X/%X _logSegNo %lu",
+		LSN_FORMAT_ARGS(RedoRecPtr), LSN_FORMAT_ARGS(recptr), _logSegNo);
+
 	if (InvalidateObsoleteReplicationSlots(_logSegNo))
 	{
 		/*
@@ -9476,10 +9493,15 @@ CreateCheckPoint(int flags)
 		 */
 		XLByteToSeg(RedoRecPtr, _logSegNo, wal_segment_size);
 		KeepLogSeg(recptr, &_logSegNo);
+		elog(LOG, "CreateCheckPoint / slot invalidation recalculate / RedoRecPtr %X/%X recptr %X/%X _logSegNo %lu",
+			LSN_FORMAT_ARGS(RedoRecPtr), LSN_FORMAT_ARGS(recptr), _logSegNo);
 	}
 	_logSegNo--;
 	RemoveOldXlogFiles(_logSegNo, RedoRecPtr, recptr);
 
+	elog(LOG, "CreateCheckPoint / slot invalidation done / RedoRecPtr %X/%X recptr %X/%X _logSegNo %lu",
+		LSN_FORMAT_ARGS(RedoRecPtr), LSN_FORMAT_ARGS(recptr), _logSegNo);
+
 	/*
 	 * Make more log segments if needed.  (Do this after recycling old log
 	 * segments, since that may supply some of the needed files.)
@@ -9877,6 +9899,10 @@ CreateRestartPoint(int flags)
 	replayPtr = GetXLogReplayRecPtr(&replayTLI);
 	endptr = (receivePtr < replayPtr) ? replayPtr : receivePtr;
 	KeepLogSeg(endptr, &_logSegNo);
+
+	elog(LOG, "CreateRestartPoint / slot invalidation start / RedoRecPtr %X/%X receivePtr %X/%X replayPtr %X/%X endptr %X/%X _logSegNo %lu",
+		LSN_FORMAT_ARGS(RedoRecPtr), LSN_FORMAT_ARGS(receivePtr), LSN_FORMAT_ARGS(replayPtr), LSN_FORMAT_ARGS(endptr), _logSegNo);
+
 	if (InvalidateObsoleteReplicationSlots(_logSegNo))
 	{
 		/*
@@ -9885,9 +9911,15 @@ CreateRestartPoint(int flags)
 		 */
 		XLByteToSeg(RedoRecPtr, _logSegNo, wal_segment_size);
 		KeepLogSeg(endptr, &_logSegNo);
+
+		elog(LOG, "CreateRestartPoint / slot invalidation recalculate / RedoRecPtr %X/%X endptr %X/%X _logSegNo %lu",
+			LSN_FORMAT_ARGS(RedoRecPtr), LSN_FORMAT_ARGS(endptr), _logSegNo);
 	}
 	_logSegNo--;
 
+	elog(LOG, "CreateRestartPoint / slot invalidation done / RedoRecPtr %X/%X endptr %X/%X _logSegNo %lu",
+		LSN_FORMAT_ARGS(RedoRecPtr), LSN_FORMAT_ARGS(endptr), _logSegNo);
+
 	/*
 	 * Try to recycle segments on a useful timeline. If we've been promoted
 	 * since the beginning of this restartpoint, use the new timeline chosen
@@ -10109,6 +10141,8 @@ KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo)
 		}
 	}
 
+	elog(LOG, "KeepLogSeg recptr %X/%X logSegNo %lu segno %lu", LSN_FORMAT_ARGS(recptr), *logSegNo, segno);
+
 	/* don't delete WAL segments newer than the calculated segment */
 	if (segno < *logSegNo)
 		*logSegNo = segno;
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 2f7b2c85d9b..16fed53063b 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -375,9 +375,16 @@ CreateInitDecodingContext(const char *plugin,
 		ReplicationSlotReserveWal();
 	else
 	{
+		XLogRecPtr oldlsn;
 		SpinLockAcquire(&slot->mutex);
+		oldlsn = slot->data.restart_lsn;
 		slot->data.restart_lsn = restart_lsn;
 		SpinLockRelease(&slot->mutex);
+
+		elog(LOG, "CreateInitDecodingContext updated restart_lsn %X/%X (old %X/%X) for slot %s",
+			LSN_FORMAT_ARGS(restart_lsn),
+			LSN_FORMAT_ARGS(oldlsn),
+			NameStr(slot->data.name));
 	}
 
 	/* ----
@@ -1598,6 +1605,13 @@ LogicalIncreaseXminForSlot(XLogRecPtr current_lsn, TransactionId xmin)
 	 */
 	else if (current_lsn <= slot->data.confirmed_flush)
 	{
+		elog(LOG, "LogicalIncreaseXminForSlot candidate_catalog_xmin %u candidate_xmin_lsn %X/%X (%X/%X)",
+			  xmin, LSN_FORMAT_ARGS(current_lsn),
+			  LSN_FORMAT_ARGS(slot->candidate_xmin_lsn));
+
+		/* don't allow the LSN to go backwards */
+		Assert(slot->candidate_xmin_lsn <= current_lsn);
+
 		slot->candidate_catalog_xmin = xmin;
 		slot->candidate_xmin_lsn = current_lsn;
 
@@ -1611,6 +1625,13 @@ LogicalIncreaseXminForSlot(XLogRecPtr current_lsn, TransactionId xmin)
 	 */
 	else if (slot->candidate_xmin_lsn == InvalidXLogRecPtr)
 	{
+		elog(LOG, "LogicalIncreaseXminForSlot candidate_catalog_xmin %u candidate_xmin_lsn %X/%X (%X/%X)",
+			  xmin, LSN_FORMAT_ARGS(current_lsn),
+			  LSN_FORMAT_ARGS(slot->candidate_xmin_lsn));
+
+		/* don't allow the LSN to go backwards */
+		Assert(slot->candidate_xmin_lsn <= current_lsn);
+
 		slot->candidate_catalog_xmin = xmin;
 		slot->candidate_xmin_lsn = current_lsn;
 	}
@@ -1653,6 +1674,18 @@ LogicalIncreaseRestartDecodingForSlot(XLogRecPtr current_lsn, XLogRecPtr restart
 	 */
 	else if (current_lsn <= slot->data.confirmed_flush)
 	{
+		elog(LOG, "LogicalIncreaseRestartDecodingForSlot %s candidate_restart_valid_lsn %X/%X (%X/%X) candidate_restart_lsn %X/%X (%X/%X)",
+			NameStr(slot->data.name),
+			LSN_FORMAT_ARGS(current_lsn), LSN_FORMAT_ARGS(slot->candidate_restart_valid),
+			LSN_FORMAT_ARGS(restart_lsn), LSN_FORMAT_ARGS(slot->candidate_restart_lsn));
+
+		/* don't allow LSNs to go back */
+		Assert(slot->candidate_restart_valid <= current_lsn);
+		Assert(slot->candidate_restart_lsn <= restart_lsn);
+
+		/* also don't even consider going back for actual restart_lsn */
+		Assert(slot->data.restart_lsn <= restart_lsn);
+
 		slot->candidate_restart_valid = current_lsn;
 		slot->candidate_restart_lsn = restart_lsn;
 
@@ -1667,6 +1700,21 @@ LogicalIncreaseRestartDecodingForSlot(XLogRecPtr current_lsn, XLogRecPtr restart
 	 */
 	if (slot->candidate_restart_valid == InvalidXLogRecPtr)
 	{
+		elog(LOG, "LogicalIncreaseRestartDecodingForSlot %s candidate_restart_valid_lsn %X/%X (%X/%X) candidate_restart_lsn %X/%X (%X/%X)",
+			  NameStr(slot->data.name),
+			  LSN_FORMAT_ARGS(current_lsn), LSN_FORMAT_ARGS(slot->candidate_restart_valid),
+			  LSN_FORMAT_ARGS(restart_lsn), LSN_FORMAT_ARGS(slot->candidate_restart_lsn));
+
+		/* don't allow LSNs to go back */
+		Assert(slot->candidate_restart_valid <= current_lsn);
+		Assert(slot->candidate_restart_lsn <= restart_lsn);
+
+		/*
+		 * also don't even consider going back for actual restart_lsn
+		 * XXX This is the first assert we hit, called from snapbuild.c: LogicalIncreaseRestartDecodingForSlot(lsn, txn->restart_decoding_lsn);
+		 */
+		Assert(slot->data.restart_lsn <= restart_lsn);
+
 		slot->candidate_restart_valid = current_lsn;
 		slot->candidate_restart_lsn = restart_lsn;
 		SpinLockRelease(&slot->mutex);
@@ -1707,6 +1755,8 @@ LogicalConfirmReceivedLocation(XLogRecPtr lsn)
 {
 	Assert(lsn != InvalidXLogRecPtr);
 
+	elog(LOG, "LogicalConfirmReceivedLocation %X/%X", LSN_FORMAT_ARGS(lsn));
+
 	/* Do an unlocked check for candidate_lsn first. */
 	if (MyReplicationSlot->candidate_xmin_lsn != InvalidXLogRecPtr ||
 		MyReplicationSlot->candidate_restart_valid != InvalidXLogRecPtr)
@@ -1746,6 +1796,15 @@ LogicalConfirmReceivedLocation(XLogRecPtr lsn)
 		{
 			Assert(MyReplicationSlot->candidate_restart_lsn != InvalidXLogRecPtr);
 
+			elog(LOG, "LogicalConfirmReceivedLocation updating data.restart_lsn to %X/%X (from %X/%X) candidate_restart_valid 0/0 (from %X/%X) candidate_restart_lsn 0/0 (from %X/%X)",
+				LSN_FORMAT_ARGS(MyReplicationSlot->candidate_restart_lsn),
+				LSN_FORMAT_ARGS(MyReplicationSlot->data.restart_lsn),
+				LSN_FORMAT_ARGS(MyReplicationSlot->candidate_restart_valid),
+				LSN_FORMAT_ARGS(MyReplicationSlot->candidate_restart_lsn));
+
+			/* don't allow the LSN to go backwards */
+			Assert(MyReplicationSlot->candidate_restart_lsn >= MyReplicationSlot->data.restart_lsn);
+
 			MyReplicationSlot->data.restart_lsn = MyReplicationSlot->candidate_restart_lsn;
 			MyReplicationSlot->candidate_restart_lsn = InvalidXLogRecPtr;
 			MyReplicationSlot->candidate_restart_valid = InvalidXLogRecPtr;
@@ -1774,6 +1833,7 @@ LogicalConfirmReceivedLocation(XLogRecPtr lsn)
 			MyReplicationSlot->effective_catalog_xmin = MyReplicationSlot->data.catalog_xmin;
 			SpinLockRelease(&MyReplicationSlot->mutex);
 
+			elog(LOG, "LogicalConfirmReceivedLocation calculating required LSN/xmin after update");
 			ReplicationSlotsComputeRequiredXmin(false);
 			ReplicationSlotsComputeRequiredLSN();
 		}
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 7b761510f1e..05184516704 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -1270,7 +1270,10 @@ SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn, xl_running_xact
 	 * anything because we hadn't reached a consistent state yet.
 	 */
 	if (txn != NULL && txn->restart_decoding_lsn != InvalidXLogRecPtr)
+	{
+		elog(LOG, "LogicalIncreaseRestartDecodingForSlot restart_decoding_lsn %X/%X", LSN_FORMAT_ARGS(txn->restart_decoding_lsn));
 		LogicalIncreaseRestartDecodingForSlot(lsn, txn->restart_decoding_lsn);
+	}
 
 	/*
 	 * No in-progress transaction, can reuse the last serialized snapshot if
@@ -1279,8 +1282,11 @@ SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn, xl_running_xact
 	else if (txn == NULL &&
 			 builder->reorder->current_restart_decoding_lsn != InvalidXLogRecPtr &&
 			 builder->last_serialized_snapshot != InvalidXLogRecPtr)
+	{
+		elog(LOG, "LogicalIncreaseRestartDecodingForSlot last_serialized_snapshot %X/%X", LSN_FORMAT_ARGS(builder->last_serialized_snapshot));
 		LogicalIncreaseRestartDecodingForSlot(lsn,
 											  builder->last_serialized_snapshot);
+	}
 }
 
 
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 037a347cba0..8a7aa257293 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -827,6 +827,8 @@ ReplicationSlotsComputeRequiredLSN(void)
 
 	Assert(ReplicationSlotCtl != NULL);
 
+	elog(LOG, "ReplicationSlotsComputeRequiredLSN / start");
+
 	LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
 	for (i = 0; i < max_replication_slots; i++)
 	{
@@ -840,6 +842,8 @@ ReplicationSlotsComputeRequiredLSN(void)
 		restart_lsn = s->data.restart_lsn;
 		SpinLockRelease(&s->mutex);
 
+		elog(LOG, "ReplicationSlotsComputeRequiredLSN found slot %s with restart_lsn %X/%X", NameStr(s->data.name), LSN_FORMAT_ARGS(restart_lsn));
+
 		if (restart_lsn != InvalidXLogRecPtr &&
 			(min_required == InvalidXLogRecPtr ||
 			 restart_lsn < min_required))
@@ -847,6 +851,8 @@ ReplicationSlotsComputeRequiredLSN(void)
 	}
 	LWLockRelease(ReplicationSlotControlLock);
 
+	elog(LOG, "ReplicationSlotsComputeRequiredLSN / done");
+
 	XLogSetReplicationSlotMinimumLSN(min_required);
 }
 
@@ -1285,9 +1291,9 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlot *s, XLogRecPtr oldestLSN,
 			ReplicationSlotRelease();
 
 			ereport(LOG,
-					(errmsg("invalidating slot \"%s\" because its restart_lsn %X/%X exceeds max_slot_wal_keep_size",
+					(errmsg("invalidating slot \"%s\" because its restart_lsn %X/%X exceeds max_slot_wal_keep_size (%d)",
 							NameStr(slotname),
-							LSN_FORMAT_ARGS(restart_lsn))));
+							LSN_FORMAT_ARGS(restart_lsn), max_slot_wal_keep_size_mb)));
 
 			/* done with this slot for now */
 			break;
@@ -1315,6 +1321,9 @@ InvalidateObsoleteReplicationSlots(XLogSegNo oldestSegno)
 
 	XLogSegNoOffsetToRecPtr(oldestSegno, 0, wal_segment_size, oldestLSN);
 
+	elog(LOG, "InvalidateObsoleteReplicationSlots oldestSegno %lu oldestLSN %X/%X",
+		oldestSegno, LSN_FORMAT_ARGS(oldestLSN));
+
 restart:
 	LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
 	for (int i = 0; i < max_replication_slots; i++)
@@ -1327,6 +1336,7 @@ restart:
 		if (InvalidatePossiblyObsoleteSlot(s, oldestLSN, &invalidated))
 		{
 			/* if the lock was released, start from scratch */
+			elog(LOG, "InvalidateObsoleteReplicationSlots restart invalidated=%d", invalidated);
 			goto restart;
 		}
 	}
@@ -1337,6 +1347,7 @@ restart:
 	 */
 	if (invalidated)
 	{
+		elog(LOG, "InvalidateObsoleteReplicationSlots slot invalidated, recalculate required");
 		ReplicationSlotsComputeRequiredXmin(false);
 		ReplicationSlotsComputeRequiredLSN();
 	}
@@ -1854,6 +1865,8 @@ RestoreSlotFromDisk(const char *name)
 		slot->candidate_restart_lsn = InvalidXLogRecPtr;
 		slot->candidate_restart_valid = InvalidXLogRecPtr;
 
+		elog(LOG, "RestoreSlotFromDisk restart_lsn %X/%X candidate_xmin_lsn 0/0 candidate_restart_lsn 0/0 candidate_restart_valid 0/0", LSN_FORMAT_ARGS(slot->data.restart_lsn));
+
 		slot->in_use = true;
 		slot->active_pid = 0;
 
-- 
2.39.5

v14-0002-WIP-change-asserts-to-elog-defense.patch (text/x-patch)
From 81981f60949ab4c0fb9418b17dfcc37c0aa063f1 Mon Sep 17 00:00:00 2001
From: tomas <tomas>
Date: Wed, 6 Nov 2024 12:46:16 +0100
Subject: [PATCH v14 2/2] WIP: change asserts to elog + defense

When updating LSNs, use Max() with the preceding value to protect against
moves backwards.
---
 src/backend/replication/logical/logical.c | 57 ++++++++++++-----------
 1 file changed, 31 insertions(+), 26 deletions(-)

diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 16fed53063b..988f4add977 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -1609,11 +1609,12 @@ LogicalIncreaseXminForSlot(XLogRecPtr current_lsn, TransactionId xmin)
 			  xmin, LSN_FORMAT_ARGS(current_lsn),
 			  LSN_FORMAT_ARGS(slot->candidate_xmin_lsn));
 
-		/* don't allow the LSN to go backwards */
-		Assert(slot->candidate_xmin_lsn <= current_lsn);
+		if (slot->candidate_xmin_lsn > current_lsn)
+			elog(LOG, "ASSERT: LogicalIncreaseXminForSlot: slot->candidate_xmin_lsn > current_lsn");
 
+		/* XXX doesn't the Max() put the fields out of sync? */
 		slot->candidate_catalog_xmin = xmin;
-		slot->candidate_xmin_lsn = current_lsn;
+		slot->candidate_xmin_lsn = Max(slot->candidate_xmin_lsn, current_lsn);
 
 		/* our candidate can directly be used */
 		updated_xmin = true;
@@ -1629,11 +1630,12 @@ LogicalIncreaseXminForSlot(XLogRecPtr current_lsn, TransactionId xmin)
 			  xmin, LSN_FORMAT_ARGS(current_lsn),
 			  LSN_FORMAT_ARGS(slot->candidate_xmin_lsn));
 
-		/* don't allow the LSN to go backwards */
-		Assert(slot->candidate_xmin_lsn <= current_lsn);
+		if (slot->candidate_xmin_lsn > current_lsn)
+			elog(LOG, "ASSERT: LogicalIncreaseXminForSlot: slot->candidate_xmin_lsn > current_lsn");
 
+		/* XXX doesn't the Max() put the fields out of sync? */
 		slot->candidate_catalog_xmin = xmin;
-		slot->candidate_xmin_lsn = current_lsn;
+		slot->candidate_xmin_lsn = Max(slot->candidate_xmin_lsn, current_lsn);
 	}
 	SpinLockRelease(&slot->mutex);
 
@@ -1679,15 +1681,17 @@ LogicalIncreaseRestartDecodingForSlot(XLogRecPtr current_lsn, XLogRecPtr restart
 			LSN_FORMAT_ARGS(current_lsn), LSN_FORMAT_ARGS(slot->candidate_restart_valid),
 			LSN_FORMAT_ARGS(restart_lsn), LSN_FORMAT_ARGS(slot->candidate_restart_lsn));
 
-		/* don't allow LSNs to go back */
-		Assert(slot->candidate_restart_valid <= current_lsn);
-		Assert(slot->candidate_restart_lsn <= restart_lsn);
+		if (slot->candidate_restart_valid > current_lsn)
+			elog(LOG, "ASSERT: LogicalIncreaseRestartDecodingForSlot: slot->candidate_restart_valid > current_lsn");
 
-		/* also don't even consider going back for actual restart_lsn */
-		Assert(slot->data.restart_lsn <= restart_lsn);
+		if (slot->candidate_restart_lsn > restart_lsn)
+			elog(LOG, "ASSERT: LogicalIncreaseRestartDecodingForSlot: slot->candidate_restart_lsn > restart_lsn");
 
-		slot->candidate_restart_valid = current_lsn;
-		slot->candidate_restart_lsn = restart_lsn;
+		if (slot->data.restart_lsn > restart_lsn)
+			elog(LOG, "ASSERT: LogicalIncreaseRestartDecodingForSlot: slot->data.restart_lsn > restart_lsn");
+
+		slot->candidate_restart_valid = Max(slot->candidate_restart_valid, current_lsn);
+		slot->candidate_restart_lsn = Max(slot->data.restart_lsn, Max(slot->candidate_restart_lsn, restart_lsn));
 
 		/* our candidate can directly be used */
 		updated_lsn = true;
@@ -1705,18 +1709,18 @@ LogicalIncreaseRestartDecodingForSlot(XLogRecPtr current_lsn, XLogRecPtr restart
 			  LSN_FORMAT_ARGS(current_lsn), LSN_FORMAT_ARGS(slot->candidate_restart_valid),
 			  LSN_FORMAT_ARGS(restart_lsn), LSN_FORMAT_ARGS(slot->candidate_restart_lsn));
 
-		/* don't allow LSNs to go back */
-		Assert(slot->candidate_restart_valid <= current_lsn);
-		Assert(slot->candidate_restart_lsn <= restart_lsn);
+		if (slot->candidate_restart_valid > current_lsn)
+			elog(LOG, "ASSERT: LogicalIncreaseRestartDecodingForSlot: slot->candidate_restart_valid > current_lsn");
 
-		/*
-		 * also don't even consider going back for actual restart_lsn
-		 * XXX This is the first assert we hit, called from snapbuild.c: LogicalIncreaseRestartDecodingForSlot(lsn, txn->restart_decoding_lsn);
-		 */
-		Assert(slot->data.restart_lsn <= restart_lsn);
+		if (slot->candidate_restart_lsn > restart_lsn)
+			elog(LOG, "ASSERT: LogicalIncreaseRestartDecodingForSlot: slot->candidate_restart_lsn > restart_lsn");
+
+		if (slot->data.restart_lsn > restart_lsn)
+			elog(LOG, "ASSERT: LogicalIncreaseRestartDecodingForSlot: slot->data.restart_lsn > restart_lsn");
+
+		slot->candidate_restart_valid = Max(slot->candidate_restart_valid, current_lsn);
+		slot->candidate_restart_lsn = Max(slot->data.restart_lsn, Max(slot->candidate_restart_lsn, restart_lsn));
 
-		slot->candidate_restart_valid = current_lsn;
-		slot->candidate_restart_lsn = restart_lsn;
 		SpinLockRelease(&slot->mutex);
 
 		elog(DEBUG1, "got new restart lsn %X/%X at %X/%X",
@@ -1802,10 +1806,11 @@ LogicalConfirmReceivedLocation(XLogRecPtr lsn)
 				LSN_FORMAT_ARGS(MyReplicationSlot->candidate_restart_valid),
 				LSN_FORMAT_ARGS(MyReplicationSlot->candidate_restart_lsn));
 
-			/* don't allow the LSN to go backwards */
-			Assert(MyReplicationSlot->candidate_restart_lsn >= MyReplicationSlot->data.restart_lsn);
+			if (MyReplicationSlot->candidate_restart_lsn < MyReplicationSlot->data.restart_lsn)
+				elog(LOG, "ASSERT: LogicalIncreaseRestartDecodingForSlot: slot->data.restart_lsn > MyReplicationSlot->candidate_restart_lsn");
 
-			MyReplicationSlot->data.restart_lsn = MyReplicationSlot->candidate_restart_lsn;
+			MyReplicationSlot->data.restart_lsn = Max(MyReplicationSlot->data.restart_lsn,
+								  MyReplicationSlot->candidate_restart_lsn);
 			MyReplicationSlot->candidate_restart_lsn = InvalidXLogRecPtr;
 			MyReplicationSlot->candidate_restart_valid = InvalidXLogRecPtr;
 			updated_restart = true;
-- 
2.39.5

#2 Tomas Vondra
tomas@vondra.me
In reply to: Tomas Vondra (#1)
Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4

Hi,

I kept investigating this, but I haven't made much progress. I still
don't understand why it would be OK to move any of the LSN fields
backwards - certainly for fields like confirmed_flush or restart_lsn.

I did a simple experiment - added asserts to the couple of places in
logical.c updating the LSN fields, checking the value only increases.
But then I simply ran make check-world instead of the stress test.

And that actually fails too - 040_standby_failover_slots_sync.pl triggers
this:

{
    SpinLockAcquire(&MyReplicationSlot->mutex);
    Assert(MyReplicationSlot->data.confirmed_flush <= lsn);
    MyReplicationSlot->data.confirmed_flush = lsn;
    SpinLockRelease(&MyReplicationSlot->mutex);
}

So this moves confirmed_flush back, albeit only by a tiny amount (I've
seen a ~56 byte difference). I don't have an example of this causing an
issue in practice, but I note that CheckPointReplicationSlots does this:

if (is_shutdown && SlotIsLogical(s))
{
    SpinLockAcquire(&s->mutex);

    if (s->data.invalidated == RS_INVAL_NONE &&
        s->data.confirmed_flush > s->last_saved_confirmed_flush)
    {
        s->just_dirtied = true;
        s->dirty = true;
    }
    SpinLockRelease(&s->mutex);
}

to determine if a slot needs to be flushed to disk during a checkpoint. So
I guess it's possible we save a slot to disk at some LSN, then
confirmed_flush moves backward, and we fail to sync the slot to disk.

But I don't have a reproducer for this ...

I also noticed a strange difference between LogicalIncreaseXminForSlot
and LogicalIncreaseRestartDecodingForSlot.

The structure of LogicalIncreaseXminForSlot looks like this:

if (TransactionIdPrecedesOrEquals(xmin, slot->data.catalog_xmin))
{
}
else if (current_lsn <= slot->data.confirmed_flush)
{
    ... update candidate fields ...
}
else if (slot->candidate_xmin_lsn == InvalidXLogRecPtr)
{
    ... update candidate fields ...
}

while LogicalIncreaseRestartDecodingForSlot looks like this:

if (restart_lsn <= slot->data.restart_lsn)
{
}
else if (current_lsn <= slot->data.confirmed_flush)
{
    ... update candidate fields ...
}

if (slot->candidate_restart_valid == InvalidXLogRecPtr)
{
    ... update candidate fields ...
}

Notice that LogicalIncreaseXminForSlot has the third block guarded by
"else if", while LogicalIncreaseRestartDecodingForSlot has a plain "if".
Isn't that a bit suspicious, considering the functions do the same thing,
just for different fields? I don't know if this is dangerous; the comments
suggest it may just waste extra effort after reconnect.
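
For completeness, the fix suggested in the 2022 thread is essentially
turning that last "if" into an "else if", so the structure matches
LogicalIncreaseXminForSlot - a sketch of what that would look like, not
an actual patch:

if (restart_lsn <= slot->data.restart_lsn)
{
}
else if (current_lsn <= slot->data.confirmed_flush)
{
    ... update candidate fields ...
}
else if (slot->candidate_restart_valid == InvalidXLogRecPtr)
{
    ... update candidate fields ...
}

As mentioned in the first message, that alone did not make the backward
moves go away in my stress testing.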

regards

--
Tomas Vondra

#3 Ashutosh Bapat
ashutosh.bapat.oss@gmail.com
In reply to: Tomas Vondra (#1)
Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4

On Wed, Nov 6, 2024 at 8:54 PM Tomas Vondra <tomas@vondra.me> wrote:

Hi,

I've been investigating some issues reported by users, related to
logical replication unexpectedly breaking with messages like:

LOG: invalidating slot "s" because its restart_lsn X/Y exceeds
max_slot_wal_keep_size

which is pretty confusing, because the system has that GUC set to -1 (so
disabled, there should be no limit). Or a message like this:

ERROR: requested WAL segment 000...0AA has already been removed

which is a bit less confusing, but still puzzling because replication
slots are meant to prevent exactly this.

I speculated there's some sort of rare race condition, in how we advance
the slot LSN values. I didn't know where to start, so I gave up on
trying to understand the whole complex code. Instead, I wrote a simple
stress test that

1) sets up a cluster (primary + 1+ logical replicas)

2) runs pgbench on the primary, checks replicas/slots

3) randomly restarts the nodes (both primary/replicas) with either fast
and immediate mode, with random short pauses

4) runs checkpoints in tight while loop

This is based on the observations that in past reports the issues seem
to happen only with logical replication, shortly after reconnect (e.g.
after connection reset etc.). And the slots are invalidated / WAL
removed during a checkpoint, so frequent checkpoints should make it
easier to hit ...

Attached is a couple scripts running this - it's not particularly pretty
and may need some tweak to make it work on your machine (run.sh is the
script to run).

And unfortunately, this started to fail pretty quick. The short story is
that it's not difficult to get into a situation where restart_lsn for a
slot moves backwards, so something like this can happen:

1) slot restart_lsn moves forward to LSN A

2) checkpoint happens, updates min required LSN for slots, recycles
segments it considers unnecessary (up to LSN A)

3) slot restart_lsn moves backwards to LSN B (where B < A)

4) kaboom

This would perfectly explain all the symptoms I mentioned earlier. The
max_slot_wal_keep_size reference is confusing, but AFAIK it's just based
on (reasonable) belief that LSN can't move backwards, and so the only
reason for restart_lsn being before min required LSN is that this GUC
kicked in. But the assumption does not hold :-(

Now, why/how can this happen?

I kept adding a elog(LOG) messages to all places updating LSNs for a
slot, and asserts to fail if data.restart_lsn moves backwards. See the
attached 0001 patch. An example for a failure (for the walsended backend
that triggered the assert) looks like this:

344.139 LOG: starting logical decoding for slot "s1"

344.139 DETAIL: Streaming transactions committing after 1/E97EAC30,
reading WAL from 1/E96FB4D0.

344.140 LOG: logical decoding found consistent point at 1/E96FB4D0

344.140 DETAIL: Logical decoding will begin using saved snapshot.

344.140 LOG: LogicalConfirmReceivedLocation 1/E9865398

344.140 LOG: LogicalConfirmReceivedLocation updating
data.restart_lsn to 1/E979D4C8 (from 1/E96FB4D0)
candidate_restart_valid 0/0 (from 1/E9865398)
candidate_restart_lsn 0/0 (from 1/E979D4C8)

344.145 LOG: LogicalIncreaseRestartDecodingForSlot
restart_decoding_lsn 1/E96FB4D0

344.145 LOG: LogicalIncreaseRestartDecodingForSlot s1
candidate_restart_valid_lsn 1/E979D4C8 (0/0)
candidate_restart_lsn 1/E96FB4D0 (0/0)

344.147 LOG: LogicalIncreaseRestartDecodingForSlot
restart_decoding_lsn 1/E979D4C8

344.156 LOG: LogicalIncreaseXminForSlot
candidate_catalog_xmin 30699
candidate_xmin_lsn 1/E993AD68 (0/0)
...
344.235 LOG: LogicalIncreaseRestartDecodingForSlot
restart_decoding_lsn 1/E9F33AF8

344.240 LOG: LogicalConfirmReceivedLocation 1/E9DCCD78

344.240 LOG: LogicalConfirmReceivedLocation updating
data.restart_lsn to 1/E96FB4D0 (from 1/E979D4C8)
candidate_restart_valid 0/0 (from 1/E979D4C8)
candidate_restart_lsn 0/0 (from 1/E96FB4D0)

345.536 LOG: server process (PID 2518127) was terminated by
signal 6: Aborted

We start decoding at 1/E96FB4D0, and right after startup we get a
confirmation, and LogicalConfirmReceivedLocation updates restart_lsn to
1/E979D4C8.

But then LogicalIncreaseRestartDecodingForSlot comes along, and stores
the restart_decoding_lsn 1/E96FB4D0 (which is the initial restart_lsn)
into candidate_restart_lsn.

And then a little bit later we get another confirmation, we call
LogicalConfirmReceivedLocation which propagates candidate_restart_lsn
into data.restart_lsn.

This is how restart_lsn moves backwards, causing issues. I've reproduced
this on PG14 and current master, but I believe the issue exists since
the introduction of logical replication in 9.4 :-(

I'm not claiming this is the only way how this can happen, but all the
cases I've seen in my stress testing look like this. Moreover, I'm not
claiming this is the only LSN field that can move backwards like this.
It seems to me various other candidate_ fields have the same issue, but
may have consequences other than discarding "unnecessary" WAL.

I've been reminded of this [1] thread from 2022. I'm 99% sure it's the
same issue - it happened shortly after a reconnect, etc. And it seems to
me Masahiko-san was right about the causes in [2]. I don't think the fix
suggested in that message (changing the branch to "else if") can work,
though. At least it did not really help in my testing.

And I'm not sure it'd fix all the issues - it only affects restart_lsn,
but AFAICS the same issue (LSNs moving backwards) can happen for the
other LSN slot field (candidate_xmin_lsn).

After examining the code before reading [2], I came to the same
conclusion as Masahiko-san in [2]. We install candidate_restart_lsn
based on the running transaction record whose LSN is between
restart_lsn and confirmed_flush_lsn, since candidate_restart_valid of
such candidates would be less than any confirmed_flush_lsn received
from downstream. I am surprised that the fix suggested by Masahiko-san
didn't work, though. The fix also fixes the asymmetry between
LogicalIncreaseXminForSlot and LogicalIncreaseRestartDecodingForSlot
that you have pointed out in your next email. What behaviour do you
see with that fix applied?

[1]: /messages/by-id/Yz2hivgyjS1RfMKs@depesz.com
[2]: /messages/by-id/CAD21AoBVhYnGBuW_o=wEGgTp01qiHNAx1a14b1X9kFXmuBe=sg@mail.gmail.com

--
Best Wishes,
Ashutosh Bapat

#4Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Tomas Vondra (#2)
Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4

Hi,

Thank you for investigating this issue.

On Thu, Nov 7, 2024 at 10:40 AM Tomas Vondra <tomas@vondra.me> wrote:

Hi,

I kept investigating this, but I haven't made much progress. I still
don't understand why would it be OK to move any of the LSN fields
backwards - certainly for fields like confirm_flush or restart_lsn.

I did a simple experiment - added asserts to the couple places in
logical.c updating the the LSN fields, checking the value is increased.
But then I simply ran make check-world, instead of the stress test.

And that actually fails too, 040_standby_failover_slots_sync.pl triggers
this

{
SpinLockAcquire(&MyReplicationSlot->mutex);
Assert(MyReplicationSlot->data.confirmed_flush <= lsn);
MyReplicationSlot->data.confirmed_flush = lsn;
SpinLockRelease(&MyReplicationSlot->mutex);
}

So this moves confirm_flush back, albeit only by a tiny amount (I've
seen ~56 byte difference). I don't have an example of this causing an
issue in practice, but I note that CheckPointReplicationSlots does this:

if (is_shutdown && SlotIsLogical(s))
{
SpinLockAcquire(&s->mutex);

if (s->data.invalidated == RS_INVAL_NONE &&
s->data.confirmed_flush > s->last_saved_confirmed_flush)
{
s->just_dirtied = true;
s->dirty = true;
}
SpinLockRelease(&s->mutex);
}

to determine if a slot needs to be flushed to disk during checkpoint. So
I guess it's possible we save a slot to disk at some LSN, then the
confirm_flush moves backward, and we fail to sync the slot to disk.

But I don't have a reproducer for this ...

I also noticed a strange difference between LogicalIncreaseXminForSlot
and LogicalIncreaseRestartDecodingForSlot.

The structure of LogicalIncreaseXminForSlot looks like this:

if (TransactionIdPrecedesOrEquals(xmin, slot->data.catalog_xmin))
{
}
else if (current_lsn <= slot->data.confirmed_flush)
{
... update candidate fields ...
}
else if (slot->candidate_xmin_lsn == InvalidXLogRecPtr)
{
... update candidate fields ...
}

while LogicalIncreaseRestartDecodingForSlot looks like this:

if (restart_lsn <= slot->data.restart_lsn)
{
}
else if (current_lsn <= slot->data.confirmed_flush)
{
... update candidate fields ...
}

if (slot->candidate_restart_valid == InvalidXLogRecPtr)
{
... update candidate fields ...
}

Notice that LogicalIncreaseXminForSlot has the third block guarded by
"else if", while LogicalIncreaseRestartDecodingForSlot has "if". Isn't
that a bit suspicious, considering the functions do the same thing, just
for different fields? I don't know if this is dangerous, the comments
suggest it may just waste extra effort after reconnect.

I also suspected this point. I still need to investigate if this
suspicion is related to the issue, but I find this code in
LogicalIncreaseRestartDecodingForSlot() dangerous.

We update slot's restart_lsn based on candidate_lsn and
candidate_valid upon receiving a feedback message from a subscriber,
then clear both fields. Therefore, this code in
LogicalIncreaseRestartDecodingForSlot() means that it sets an
arbitrary LSN to candidate_restart_lsn after updating slot's
restart_lsn.

I think an LSN older than the slot's restart_lsn can be passed to
LogicalIncreaseRestartDecodingForSlot() as restart_lsn, for example
after logical decoding restarts. The scenario I shared on another
thread was that after updating the slot's restart_lsn (upon feedback
from a subscriber) based on the candidate_restart_lsn and
candidate_restart_valid that remained in the slot, we might call
LogicalIncreaseRestartDecodingForSlot() when decoding a RUNNING_XACTS
record whose LSN is older than the slot's new restart_lsn. In this
case, we end up passing an LSN older than the new restart_lsn to
LogicalIncreaseRestartDecodingForSlot(), and that LSN is set as
candidate_restart_lsn. My hypothesis is that we wanted to prevent such
a case with the first if block:

/* don't overwrite if have a newer restart lsn */
if (restart_lsn <= slot->data.restart_lsn)
{
}
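In other words, a simplified view of the current control flow (branches
and locking trimmed down) shows why that guard does not actually help:

    if (restart_lsn <= slot->data.restart_lsn)
    {
        /* guard matches, but nothing gets skipped ... */
    }
    else if (current_lsn <= slot->data.confirmed_flush)
    {
        /* install candidate */
    }

    /* ... because this is a plain "if", not "else if", a stale
     * restart_lsn can still be installed as a candidate here */
    if (slot->candidate_restart_valid == InvalidXLogRecPtr)
    {
        slot->candidate_restart_valid = current_lsn;
        slot->candidate_restart_lsn = restart_lsn;
    }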

Regards,

[1]: /messages/by-id/CAD21AoBG2OSDOFTtpPtQ7fx5Vt8p3dS5hPAv28CBSC6z2kHx-g@mail.gmail.com

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#5Tomas Vondra
tomas@vondra.me
In reply to: Ashutosh Bapat (#3)
2 attachment(s)
Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4

On 11/8/24 15:57, Ashutosh Bapat wrote:

...

After examining the code before reading [2], I came to the same
conclusion as Masahiko-san in [2]. We install candidate_restart_lsn
based on the running transaction record whose LSN is between
restart_lsn and confirmed_flush_lsn. Since candidate_restart_valid of
such candidates would be lesser than any confirmed_flush_lsn received
from downstream. I am surprised that the fix suggested by Masahiko-san
didn't work though. The fix also fix the asymmetry, between
LogicalIncreaseXminForSlot and LogicalIncreaseRestartDecodingForSlot,
that you have pointed out in your next email. What behaviour do you
see with that fix applied?

[1] /messages/by-id/Yz2hivgyjS1RfMKs@depesz.com
[2] /messages/by-id/CAD21AoBVhYnGBuW_o=wEGgTp01qiHNAx1a14b1X9kFXmuBe=sg@mail.gmail.com

I read that message (and the earlier discussion) multiple times while
investigating the issue, and TBH it's not very clear to me what the
conclusion is :-(

There's some discussion about whether the candidate fields should be
reset on release or not. There are even claims that it might be
legitimate to not reset the fields and update the restart_lsn. Using
such "stale" LSN values seems rather suspicious to me, but I don't have
proof that it's incorrect. FWIW, this lack of clarity is why I mentioned
that the policy/contract for the candidate fields is not explained anywhere.

That being said, I gave that fix a try - see the attached 0001 patch. It
tweaks LogicalIncreaseRestartDecodingForSlot (it needs a bit more care
because of the spinlock), and it adds a couple asserts to make sure the
data.restart_lsn never moves back.

And indeed, with this my stress test script does not crash anymore.

But is that really correct? The lack of failure in one specific test
does not really prove that. And then again - why should it be OK for the
other candidate fields to move backwards? Isn't that suspicious? It sure
seems counter-intuitive to me, and I'm not sure the code expects that.

So in 0002 I added a couple more asserts to make sure the LSN fields
only move forward, and those *do* continue to fail, and in some cases
the amount by which the fields move back is pretty significant
(multiple megabytes).

Maybe it's fine if this "backwards move" never propagates to e.g.
"restart_lsn", not sure. And I'm not sure which other fields should not
move backwards (what about data.confirm_flush for example?).

regards

--
Tomas Vondra

Attachments:

0001-restart_lsn-backwards-move-fix-and-asserts.patch (text/x-patch)
From a81c50bbb8b7384a48752316468c59ab5faf9114 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Sat, 9 Nov 2024 12:14:19 +0100
Subject: [PATCH 1/2] restart_lsn backwards move - fix and asserts

---
 src/backend/replication/logical/logical.c | 13 ++++++++++++-
 src/backend/replication/slot.c            |  1 +
 2 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 3fe1774a1e9..28d70e732d5 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -391,6 +391,7 @@ CreateInitDecodingContext(const char *plugin,
 	else
 	{
 		SpinLockAcquire(&slot->mutex);
+		Assert(slot->data.restart_lsn <= restart_lsn);
 		slot->data.restart_lsn = restart_lsn;
 		SpinLockRelease(&slot->mutex);
 	}
@@ -1762,6 +1763,7 @@ LogicalIncreaseRestartDecodingForSlot(XLogRecPtr current_lsn, XLogRecPtr restart
 {
 	bool		updated_lsn = false;
 	ReplicationSlot *slot;
+	bool		spin_released = false;
 
 	slot = MyReplicationSlot;
 
@@ -1794,12 +1796,14 @@ LogicalIncreaseRestartDecodingForSlot(XLogRecPtr current_lsn, XLogRecPtr restart
 	 * might never end up updating if the receiver acks too slowly. A missed
 	 * value here will just cause some extra effort after reconnecting.
 	 */
-	if (slot->candidate_restart_valid == InvalidXLogRecPtr)
+	else if (slot->candidate_restart_valid == InvalidXLogRecPtr)
 	{
 		slot->candidate_restart_valid = current_lsn;
 		slot->candidate_restart_lsn = restart_lsn;
 		SpinLockRelease(&slot->mutex);
 
+		spin_released = true;
+
 		elog(DEBUG1, "got new restart lsn %X/%X at %X/%X",
 			 LSN_FORMAT_ARGS(restart_lsn),
 			 LSN_FORMAT_ARGS(current_lsn));
@@ -1815,6 +1819,8 @@ LogicalIncreaseRestartDecodingForSlot(XLogRecPtr current_lsn, XLogRecPtr restart
 		confirmed_flush = slot->data.confirmed_flush;
 		SpinLockRelease(&slot->mutex);
 
+		spin_released = true;
+
 		elog(DEBUG1, "failed to increase restart lsn: proposed %X/%X, after %X/%X, current candidate %X/%X, current after %X/%X, flushed up to %X/%X",
 			 LSN_FORMAT_ARGS(restart_lsn),
 			 LSN_FORMAT_ARGS(current_lsn),
@@ -1823,6 +1829,9 @@ LogicalIncreaseRestartDecodingForSlot(XLogRecPtr current_lsn, XLogRecPtr restart
 			 LSN_FORMAT_ARGS(confirmed_flush));
 	}
 
+	if (!spin_released)
+		SpinLockRelease(&slot->mutex);
+
 	/* candidates are already valid with the current flush position, apply */
 	if (updated_lsn)
 		LogicalConfirmReceivedLocation(slot->data.confirmed_flush);
@@ -1875,6 +1884,8 @@ LogicalConfirmReceivedLocation(XLogRecPtr lsn)
 		{
 			Assert(MyReplicationSlot->candidate_restart_lsn != InvalidXLogRecPtr);
 
+			Assert(MyReplicationSlot->data.restart_lsn <= MyReplicationSlot->candidate_restart_lsn);
+
 			MyReplicationSlot->data.restart_lsn = MyReplicationSlot->candidate_restart_lsn;
 			MyReplicationSlot->candidate_restart_lsn = InvalidXLogRecPtr;
 			MyReplicationSlot->candidate_restart_valid = InvalidXLogRecPtr;
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 6828100cf1a..0f38ccc8353 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1468,6 +1468,7 @@ ReplicationSlotReserveWal(void)
 			restart_lsn = GetXLogInsertRecPtr();
 
 		SpinLockAcquire(&slot->mutex);
+		Assert(slot->data.restart_lsn <= restart_lsn);
 		slot->data.restart_lsn = restart_lsn;
 		SpinLockRelease(&slot->mutex);
 
-- 
2.47.0

0002-asserts-for-candidate-lsn-fields.patch (text/x-patch)
From 380dcc26d5b4db75233938130057d66547ddeae5 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Sat, 9 Nov 2024 12:14:49 +0100
Subject: [PATCH 2/2] asserts for candidate lsn fields

---
 src/backend/replication/logical/logical.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 28d70e732d5..fe030d147c6 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -1718,6 +1718,8 @@ LogicalIncreaseXminForSlot(XLogRecPtr current_lsn, TransactionId xmin)
 	 */
 	else if (current_lsn <= slot->data.confirmed_flush)
 	{
+		Assert(slot->candidate_xmin_lsn <= current_lsn);
+
 		slot->candidate_catalog_xmin = xmin;
 		slot->candidate_xmin_lsn = current_lsn;
 
@@ -1731,6 +1733,8 @@ LogicalIncreaseXminForSlot(XLogRecPtr current_lsn, TransactionId xmin)
 	 */
 	else if (slot->candidate_xmin_lsn == InvalidXLogRecPtr)
 	{
+		Assert(slot->candidate_xmin_lsn <= current_lsn);
+
 		slot->candidate_catalog_xmin = xmin;
 		slot->candidate_xmin_lsn = current_lsn;
 
@@ -1784,6 +1788,9 @@ LogicalIncreaseRestartDecodingForSlot(XLogRecPtr current_lsn, XLogRecPtr restart
 	 */
 	else if (current_lsn <= slot->data.confirmed_flush)
 	{
+		Assert(slot->candidate_restart_valid <= current_lsn);
+		Assert(slot->candidate_restart_lsn <= restart_lsn);
+
 		slot->candidate_restart_valid = current_lsn;
 		slot->candidate_restart_lsn = restart_lsn;
 
@@ -1798,6 +1805,9 @@ LogicalIncreaseRestartDecodingForSlot(XLogRecPtr current_lsn, XLogRecPtr restart
 	 */
 	else if (slot->candidate_restart_valid == InvalidXLogRecPtr)
 	{
+		Assert(slot->candidate_restart_valid <= current_lsn);
+		Assert(slot->candidate_restart_lsn <= restart_lsn);
+
 		slot->candidate_restart_valid = current_lsn;
 		slot->candidate_restart_lsn = restart_lsn;
 		SpinLockRelease(&slot->mutex);
-- 
2.47.0

#6Tomas Vondra
tomas@vondra.me
In reply to: Masahiko Sawada (#4)
Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4

On 11/8/24 19:25, Masahiko Sawada wrote:

Hi,

Thank you for investigating this issue.

On Thu, Nov 7, 2024 at 10:40 AM Tomas Vondra <tomas@vondra.me> wrote:

Hi,

I kept investigating this, but I haven't made much progress. I still
don't understand why would it be OK to move any of the LSN fields
backwards - certainly for fields like confirm_flush or restart_lsn.

I did a simple experiment - added asserts to the couple places in
logical.c updating the the LSN fields, checking the value is increased.
But then I simply ran make check-world, instead of the stress test.

And that actually fails too, 040_standby_failover_slots_sync.pl triggers
this

{
SpinLockAcquire(&MyReplicationSlot->mutex);
Assert(MyReplicationSlot->data.confirmed_flush <= lsn);
MyReplicationSlot->data.confirmed_flush = lsn;
SpinLockRelease(&MyReplicationSlot->mutex);
}

So this moves confirm_flush back, albeit only by a tiny amount (I've
seen ~56 byte difference). I don't have an example of this causing an
issue in practice, but I note that CheckPointReplicationSlots does this:

if (is_shutdown && SlotIsLogical(s))
{
SpinLockAcquire(&s->mutex);

if (s->data.invalidated == RS_INVAL_NONE &&
s->data.confirmed_flush > s->last_saved_confirmed_flush)
{
s->just_dirtied = true;
s->dirty = true;
}
SpinLockRelease(&s->mutex);
}

to determine if a slot needs to be flushed to disk during checkpoint. So
I guess it's possible we save a slot to disk at some LSN, then the
confirm_flush moves backward, and we fail to sync the slot to disk.

But I don't have a reproducer for this ...

I also noticed a strange difference between LogicalIncreaseXminForSlot
and LogicalIncreaseRestartDecodingForSlot.

The structure of LogicalIncreaseXminForSlot looks like this:

if (TransactionIdPrecedesOrEquals(xmin, slot->data.catalog_xmin))
{
}
else if (current_lsn <= slot->data.confirmed_flush)
{
... update candidate fields ...
}
else if (slot->candidate_xmin_lsn == InvalidXLogRecPtr)
{
... update candidate fields ...
}

while LogicalIncreaseRestartDecodingForSlot looks like this:

if (restart_lsn <= slot->data.restart_lsn)
{
}
else if (current_lsn <= slot->data.confirmed_flush)
{
... update candidate fields ...
}

if (slot->candidate_restart_valid == InvalidXLogRecPtr)
{
... update candidate fields ...
}

Notice that LogicalIncreaseXminForSlot has the third block guarded by
"else if", while LogicalIncreaseRestartDecodingForSlot has "if". Isn't
that a bit suspicious, considering the functions do the same thing, just
for different fields? I don't know if this is dangerous, the comments
suggest it may just waste extra effort after reconnect.

I also suspected this point. I still need to investigate if this
suspicion is related to the issue but I find this code in
LogicalIncreaseRestartDecodingForSlot() is dangerous.

We update slot's restart_lsn based on candidate_lsn and
candidate_valid upon receiving a feedback message from a subscriber,
then clear both fields. Therefore, this code in
LogicalIncreaseRestartDecodingForSlot() means that it sets an
arbitrary LSN to candidate_restart_lsn after updating slot's
restart_lsn.

I think an LSN older than slot's restart_lsn can be passed to
LogicalIncreaseRestartDecodingForSlot() as restart_lsn for example
after logical decoding restarts; My scenario I shared on another
thread was that after updating slot's restart_lsn (upon feedback from
a subscriber) based on both candidate_restart_lsn and
candidate_restart_valid that are remained in the slot, we might call
LogicalIncreaseRestartDecodingForSlot() when decoding a RUNNING_XACTS
record whose LSN is older than the slot's new restart_lsn. In this
case, we end up passing an LSN older than the new restart_lsn to
LogicalIncreaseRestartDecodingForSlot(), and that LSN is set to
candidate_restart_lsn.

Right, I believe that matches my observations. I only see the issues
after (unexpected) restarts, say due to network issues, but chances are
regular reconnects have the same problem.

My hypothesis is that we wanted to prevent such
case by the first if block:

/* don't overwrite if have a newer restart lsn */
if (restart_lsn <= slot->data.restart_lsn)
{
}

Yeah, that condition / comment seems to say exactly that.

Do you plan / expect to work on fixing this? It seems you proposed the
right fix in that old thread, but it's been inactive since 2023/02 :-(

regards

--
Tomas Vondra

#7Ashutosh Bapat
ashutosh.bapat.oss@gmail.com
In reply to: Tomas Vondra (#5)
Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4

On Sat, Nov 9, 2024 at 5:05 PM Tomas Vondra <tomas@vondra.me> wrote:

On 11/8/24 15:57, Ashutosh Bapat wrote:

...

After examining the code before reading [2], I came to the same
conclusion as Masahiko-san in [2]. We install candidate_restart_lsn
based on the running transaction record whose LSN is between
restart_lsn and confirmed_flush_lsn. Since candidate_restart_valid of
such candidates would be lesser than any confirmed_flush_lsn received
from downstream. I am surprised that the fix suggested by Masahiko-san
didn't work though. The fix also fix the asymmetry, between
LogicalIncreaseXminForSlot and LogicalIncreaseRestartDecodingForSlot,
that you have pointed out in your next email. What behaviour do you
see with that fix applied?

[1] /messages/by-id/Yz2hivgyjS1RfMKs@depesz.com
[2] /messages/by-id/CAD21AoBVhYnGBuW_o=wEGgTp01qiHNAx1a14b1X9kFXmuBe=sg@mail.gmail.com

I read that message (and the earlier discussion multiple times) while
investigating the issue, and TBH it's not very clear to me what the
conclusion is :-(

There's some discussion about whether the candidate fields should be
reset on release or not. There are even claims that it might be
legitimate to not reset the fields and update the restart_lsn. Using
such "stale" LSN values seems rather suspicious to me, but I don't have
a proof that it's incorrect. FWIW this unclarity is what I mentioned the
policy/contract for candidate fields is not explained anywhere.

That being said, I gave that fix a try - see the attached 0001 patch. It
tweaks LogicalIncreaseRestartDecodingForSlot (it needs a bit more care
because of the spinlock), and it adds a couple asserts to make sure the
data.restart_lsn never moves back.

And indeed, with this my stress test script does not crash anymore.

:)

I think the problem is about processing older running transactions
record and setting data.restart_lsn based on the candidates those
records produce. But what is not clear to me is how come a newer
candidate_restart_lsn is available immediately upon WAL sender
restart. I.e. in the sequence of events you mentioned in your first
email
1. 344.139 LOG: starting logical decoding for slot "s1"

2. 344.139 DETAIL: Streaming transactions committing after 1/E97EAC30,
reading WAL from 1/E96FB4D0.

3. 344.140 LOG: logical decoding found consistent point at 1/E96FB4D0

4. 344.140 DETAIL: Logical decoding will begin using saved snapshot.

5. 344.140 LOG: LogicalConfirmReceivedLocation 1/E9865398

6. 344.140 LOG: LogicalConfirmReceivedLocation updating
data.restart_lsn to 1/E979D4C8 (from 1/E96FB4D0)
candidate_restart_valid 0/0 (from 1/E9865398)
candidate_restart_lsn 0/0 (from 1/E979D4C8)

how were candidate_restart_lsn = 1/E979D4C8 and candidate_restart_valid
= 1/E9865398 set in the ReplicationSlot right after the WAL sender
restart? It means it must have read and processed a running transactions
record at 1/E9865398. If that's true, how come it went back to a running
transactions WAL record at 1/E979D4C8? It should be reading WAL records
sequentially, hence read 1/E979D4C8 first and then 1/E9865398.

344.145 LOG: LogicalIncreaseRestartDecodingForSlot s1
candidate_restart_valid_lsn 1/E979D4C8 (0/0)
candidate_restart_lsn 1/E96FB4D0 (0/0)

--
Best Wishes,
Ashutosh Bapat

#8Tomas Vondra
tomas@vondra.me
In reply to: Ashutosh Bapat (#7)
Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4

On 11/11/24 14:51, Ashutosh Bapat wrote:

...

I think the problem is about processing older running transactions
record and setting data.restart_lsn based on the candidates those
records produce. But what is not clear to me is how come a newer
candidate_restart_lsn is available immediately upon WAL sender
restart. I.e. in the sequence of events you mentioned in your first
email
1. 344.139 LOG: starting logical decoding for slot "s1"

2. 344.139 DETAIL: Streaming transactions committing after 1/E97EAC30,
reading WAL from 1/E96FB4D0.

3. 344.140 LOG: logical decoding found consistent point at 1/E96FB4D0

4. 344.140 DETAIL: Logical decoding will begin using saved snapshot.

5. 344.140 LOG: LogicalConfirmReceivedLocation 1/E9865398

6. 344.140 LOG: LogicalConfirmReceivedLocation updating
data.restart_lsn to 1/E979D4C8 (from 1/E96FB4D0)
candidate_restart_valid 0/0 (from 1/E9865398)
candidate_restart_lsn 0/0 (from 1/E979D4C8)

how did candidate_restart_lsn = 1/E979D4C8 and candidate_restart_valid
= 1/E9865398 were set in ReplicationSlot after WAL sender? It means it
must have read and processed running transaction record at 1/E9865398.
If that's true, how come it went back to a running transactions WAL
record at 1/E979D4C8? It should be reading WAL records sequentially,
hence read 1/E979D4C8 first then 1/E9865398.

344.145 LOG: LogicalIncreaseRestartDecodingForSlot s1
candidate_restart_valid_lsn 1/E979D4C8 (0/0)
candidate_restart_lsn 1/E96FB4D0 (0/0)

Those are good questions, but IIUC that's explained by this comment from
Masahiko-san's analysis [1]:

Thinking about the root cause more, it seems to me that the root cause
is not the fact that candidate_xxx values are not cleared when being
released.

In the scenario I reproduced, after restarting the logical decoding,
the walsender sets the restart_lsn to a candidate_restart_lsn left in
the slot upon receiving the ack from the subscriber. ...

If this is correct, then what happens is:

1) replication is running, at some point we set candidate LSN to B

2) something breaks, causing reconnect with restart LSN A (< B)

3) we still have the candidate LSN B in memory, and after receiving
some confirmation we set it as restart_lsn

4) we get to decode the RUNNING_XACTS, which moves restart_lsn back

If this analysis is correct, I think it's rather suspicious we don't
reset the candidate fields on restart. Can those "old" values ever be
valid? But I haven't tried resetting them.

Also, this is why I'm not entirely sure just tweaking the conditions in
LogicalIncreaseRestartDecodingForSlot is quite correct. Maybe it fixes
this particular issue, but maybe the right fix would be to reset the
candidate fields on reconnect? And this change would be just hiding the
actual problem. I haven't tried this.

[1]: /messages/by-id/CAD21AoBVhYnGBuW_o=wEGgTp01qiHNAx1a14b1X9kFXmuBe=sg@mail.gmail.com

--
Tomas Vondra

#9Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Tomas Vondra (#6)
Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4

On Sat, Nov 9, 2024 at 3:45 AM Tomas Vondra <tomas@vondra.me> wrote:

On 11/8/24 19:25, Masahiko Sawada wrote:

Hi,

Thank you for investigating this issue.

On Thu, Nov 7, 2024 at 10:40 AM Tomas Vondra <tomas@vondra.me> wrote:

Hi,

I kept investigating this, but I haven't made much progress. I still
don't understand why would it be OK to move any of the LSN fields
backwards - certainly for fields like confirm_flush or restart_lsn.

I did a simple experiment - added asserts to the couple places in
logical.c updating the the LSN fields, checking the value is increased.
But then I simply ran make check-world, instead of the stress test.

And that actually fails too, 040_standby_failover_slots_sync.pl triggers
this

{
SpinLockAcquire(&MyReplicationSlot->mutex);
Assert(MyReplicationSlot->data.confirmed_flush <= lsn);
MyReplicationSlot->data.confirmed_flush = lsn;
SpinLockRelease(&MyReplicationSlot->mutex);
}

So this moves confirm_flush back, albeit only by a tiny amount (I've
seen ~56 byte difference). I don't have an example of this causing an
issue in practice, but I note that CheckPointReplicationSlots does this:

if (is_shutdown && SlotIsLogical(s))
{
SpinLockAcquire(&s->mutex);

if (s->data.invalidated == RS_INVAL_NONE &&
s->data.confirmed_flush > s->last_saved_confirmed_flush)
{
s->just_dirtied = true;
s->dirty = true;
}
SpinLockRelease(&s->mutex);
}

to determine if a slot needs to be flushed to disk during checkpoint. So
I guess it's possible we save a slot to disk at some LSN, then the
confirm_flush moves backward, and we fail to sync the slot to disk.

But I don't have a reproducer for this ...

I also noticed a strange difference between LogicalIncreaseXminForSlot
and LogicalIncreaseRestartDecodingForSlot.

The structure of LogicalIncreaseXminForSlot looks like this:

if (TransactionIdPrecedesOrEquals(xmin, slot->data.catalog_xmin))
{
}
else if (current_lsn <= slot->data.confirmed_flush)
{
... update candidate fields ...
}
else if (slot->candidate_xmin_lsn == InvalidXLogRecPtr)
{
... update candidate fields ...
}

while LogicalIncreaseRestartDecodingForSlot looks like this:

if (restart_lsn <= slot->data.restart_lsn)
{
}
else if (current_lsn <= slot->data.confirmed_flush)
{
... update candidate fields ...
}

if (slot->candidate_restart_valid == InvalidXLogRecPtr)
{
... update candidate fields ...
}

Notice that LogicalIncreaseXminForSlot has the third block guarded by
"else if", while LogicalIncreaseRestartDecodingForSlot has "if". Isn't
that a bit suspicious, considering the functions do the same thing, just
for different fields? I don't know if this is dangerous, the comments
suggest it may just waste extra effort after reconnect.

I also suspected this point. I still need to investigate if this
suspicion is related to the issue but I find this code in
LogicalIncreaseRestartDecodingForSlot() is dangerous.

We update slot's restart_lsn based on candidate_lsn and
candidate_valid upon receiving a feedback message from a subscriber,
then clear both fields. Therefore, this code in
LogicalIncreaseRestartDecodingForSlot() means that it sets an
arbitrary LSN to candidate_restart_lsn after updating slot's
restart_lsn.

I think an LSN older than slot's restart_lsn can be passed to
LogicalIncreaseRestartDecodingForSlot() as restart_lsn for example
after logical decoding restarts; My scenario I shared on another
thread was that after updating slot's restart_lsn (upon feedback from
a subscriber) based on both candidate_restart_lsn and
candidate_restart_valid that are remained in the slot, we might call
LogicalIncreaseRestartDecodingForSlot() when decoding a RUNNING_XACTS
record whose LSN is older than the slot's new restart_lsn. In this
case, we end up passing an LSN older than the new restart_lsn to
LogicalIncreaseRestartDecodingForSlot(), and that LSN is set to
candidate_restart_lsn.

Right, I believe that matches my observations. I only see the issues
after (unexpected) restarts, say due to network issues, but chances are
regular reconnects have the same problem.

My hypothesis is that we wanted to prevent such
case by the first if block:

/* don't overwrite if have a newer restart lsn */
if (restart_lsn <= slot->data.restart_lsn)
{
}

Yeah, that condition / comment seems to say exactly that.

Do you plan / expect to work on fixing this? It seems you proposed the
right fix in that old thread, but it's been inactive since 2023/02 :-(

I'm happy to work on this fix. At that time, I was unsure if my fix
was really correct and there was no further discussion.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#10Tomas Vondra
tomas@vondra.me
In reply to: Tomas Vondra (#8)
5 attachment(s)
Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4

On 11/11/24 15:17, Tomas Vondra wrote:

On 11/11/24 14:51, Ashutosh Bapat wrote:

...

I think the problem is about processing older running transactions
record and setting data.restart_lsn based on the candidates those
records produce. But what is not clear to me is how come a newer
candidate_restart_lsn is available immediately upon WAL sender
restart. I.e. in the sequence of events you mentioned in your first
email
1. 344.139 LOG: starting logical decoding for slot "s1"

2. 344.139 DETAIL: Streaming transactions committing after 1/E97EAC30,
reading WAL from 1/E96FB4D0.

3. 344.140 LOG: logical decoding found consistent point at 1/E96FB4D0

4. 344.140 DETAIL: Logical decoding will begin using saved snapshot.

5. 344.140 LOG: LogicalConfirmReceivedLocation 1/E9865398

6. 344.140 LOG: LogicalConfirmReceivedLocation updating
data.restart_lsn to 1/E979D4C8 (from 1/E96FB4D0)
candidate_restart_valid 0/0 (from 1/E9865398)
candidate_restart_lsn 0/0 (from 1/E979D4C8)

how did candidate_restart_lsn = 1/E979D4C8 and candidate_restart_valid
= 1/E9865398 were set in ReplicationSlot after WAL sender? It means it
must have read and processed running transaction record at 1/E9865398.
If that's true, how come it went back to a running transactions WAL
record at 1/E979D4C8? It should be reading WAL records sequentially,
hence read 1/E979D4C8 first then 1/E9865398.

344.145 LOG: LogicalIncreaseRestartDecodingForSlot s1
candidate_restart_valid_lsn 1/E979D4C8 (0/0)
candidate_restart_lsn 1/E96FB4D0 (0/0)

Those are good questions, but IIUC that's explained by this comment from
Masahiko-san's analysis [1]:

Thinking about the root cause more, it seems to me that the root cause
is not the fact that candidate_xxx values are not cleared when being
released.

In the scenario I reproduced, after restarting the logical decoding,
the walsender sets the restart_lsn to a candidate_restart_lsn left in
the slot upon receiving the ack from the subscriber. ...

If this is correct, then what happens is:

1) replication is running, at some point we set candidate LSN to B

2) something breaks, causing reconnect with restart LSN A (< B)

3) we still have the candidate LSN B in memory, and after receiving
some confirmation we set it as restart_lsn

4) we get to decode the RUNNING_XACTS, which moves restart_lsn back

If this analysis is correct, I think it's rather suspicious we don't
reset the candidate fields on restart. Can those "old" values ever be
valid? But I haven't tried resetting them.

To clarify this a bit, I mean something like in the attached 0003 patch.

The reasoning is that after ReplicationSlotAcquire() we should get the
slot in the same state as if we just read it from disk. Because why not?
Why should the result be different from what we'd get if the primary
restarted right before the reconnect?

Parts 0001 and 0002 add a couple asserts to prevent backwards move for
both the restart_lsn and the various candidate LSN fields.

Both the 0003 and 0004 patches (applied separately) seem to fix crashes
in my stress test, and none of the asserts from 0001+0002 seem to fail.
I'm not sure if we need both fixes or just one of them.

But neither of those fixes prevents backwards move for confirmed_flush
LSN, as enforced by asserts in the 0005 patch. I don't know if this
assert is incorrect or not. It seems natural that once we get a
confirmation for some LSN, we can't move before that position, but I'm
not sure about that. Maybe it's too strict.

regards

--
Tomas Vondra

Attachments:

0001-asserts-on-restart_lsn-backwards-move.patch (text/x-patch)
From be7ba029036613df80562cc735a3040d6f760a90 Mon Sep 17 00:00:00 2001
From: tomas <tomas>
Date: Sat, 9 Nov 2024 12:09:04 +0100
Subject: [PATCH 1/5] asserts on restart_lsn backwards move

---
 src/backend/replication/logical/logical.c | 3 +++
 src/backend/replication/slot.c            | 2 ++
 2 files changed, 5 insertions(+)

diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 2f7b2c85d9b..d76889664a8 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -376,6 +376,7 @@ CreateInitDecodingContext(const char *plugin,
 	else
 	{
 		SpinLockAcquire(&slot->mutex);
+		Assert(slot->data.restart_lsn <= restart_lsn);
 		slot->data.restart_lsn = restart_lsn;
 		SpinLockRelease(&slot->mutex);
 	}
@@ -1746,6 +1747,8 @@ LogicalConfirmReceivedLocation(XLogRecPtr lsn)
 		{
 			Assert(MyReplicationSlot->candidate_restart_lsn != InvalidXLogRecPtr);
 
+			Assert(MyReplicationSlot->data.restart_lsn <= MyReplicationSlot->candidate_restart_lsn);
+
 			MyReplicationSlot->data.restart_lsn = MyReplicationSlot->candidate_restart_lsn;
 			MyReplicationSlot->candidate_restart_lsn = InvalidXLogRecPtr;
 			MyReplicationSlot->candidate_restart_valid = InvalidXLogRecPtr;
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 037a347cba0..01e854b4486 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1109,6 +1109,7 @@ ReplicationSlotReserveWal(void)
 			/* start at current insert position */
 			restart_lsn = GetXLogInsertRecPtr();
 			SpinLockAcquire(&slot->mutex);
+			Assert(slot->data.restart_lsn <= restart_lsn);
 			slot->data.restart_lsn = restart_lsn;
 			SpinLockRelease(&slot->mutex);
 
@@ -1122,6 +1123,7 @@ ReplicationSlotReserveWal(void)
 		{
 			restart_lsn = GetRedoRecPtr();
 			SpinLockAcquire(&slot->mutex);
+			Assert(slot->data.restart_lsn <= restart_lsn);
 			slot->data.restart_lsn = restart_lsn;
 			SpinLockRelease(&slot->mutex);
 		}
-- 
2.39.5

0002-asserts-for-candidate-lsn-fields.patch (text/x-patch)
From 39024b07c5bc8c3f43fba7a1368368c64a5b6f14 Mon Sep 17 00:00:00 2001
From: tomas <tomas>
Date: Sat, 9 Nov 2024 12:09:42 +0100
Subject: [PATCH 2/5] asserts for candidate lsn fields

---
 src/backend/replication/logical/logical.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index d76889664a8..8fe77f8c463 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -1599,6 +1599,8 @@ LogicalIncreaseXminForSlot(XLogRecPtr current_lsn, TransactionId xmin)
 	 */
 	else if (current_lsn <= slot->data.confirmed_flush)
 	{
+		Assert(slot->candidate_xmin_lsn <= current_lsn);
+
 		slot->candidate_catalog_xmin = xmin;
 		slot->candidate_xmin_lsn = current_lsn;
 
@@ -1612,6 +1614,8 @@ LogicalIncreaseXminForSlot(XLogRecPtr current_lsn, TransactionId xmin)
 	 */
 	else if (slot->candidate_xmin_lsn == InvalidXLogRecPtr)
 	{
+		Assert(slot->candidate_xmin_lsn <= current_lsn);
+
 		slot->candidate_catalog_xmin = xmin;
 		slot->candidate_xmin_lsn = current_lsn;
 	}
@@ -1654,6 +1658,9 @@ LogicalIncreaseRestartDecodingForSlot(XLogRecPtr current_lsn, XLogRecPtr restart
 	 */
 	else if (current_lsn <= slot->data.confirmed_flush)
 	{
+		Assert(slot->candidate_restart_valid <= current_lsn);
+		Assert(slot->candidate_restart_lsn <= restart_lsn);
+
 		slot->candidate_restart_valid = current_lsn;
 		slot->candidate_restart_lsn = restart_lsn;
 
@@ -1668,6 +1675,9 @@ LogicalIncreaseRestartDecodingForSlot(XLogRecPtr current_lsn, XLogRecPtr restart
 	 */
 	if (slot->candidate_restart_valid == InvalidXLogRecPtr)
 	{
+		Assert(slot->candidate_restart_valid <= current_lsn);
+		Assert(slot->candidate_restart_lsn <= restart_lsn);
+
 		slot->candidate_restart_valid = current_lsn;
 		slot->candidate_restart_lsn = restart_lsn;
 		SpinLockRelease(&slot->mutex);
-- 
2.39.5

0003-reset-slot-in-acquire.patch (text/x-patch)
From 2437880b414e61cbfd02dca7a5d70ca9853c8a51 Mon Sep 17 00:00:00 2001
From: tomas <tomas>
Date: Mon, 11 Nov 2024 20:52:01 +0100
Subject: [PATCH 3/5] reset slot in acquire

---
 src/backend/replication/slot.c | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 01e854b4486..c398b91f40c 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -456,6 +456,30 @@ retry:
 
 	/* We made this slot active, so it's ours now. */
 	MyReplicationSlot = s;
+
+	/*
+	 * Reset the time since the slot has become inactive as the slot is active
+	 * now.
+	 */
+	SpinLockAcquire(&s->mutex);
+
+	/*
+	 * ReplicationSlotCreate would do this:
+	 * s->effective_xmin = InvalidTransactionId;
+	 * s->effective_catalog_xmin = InvalidTransactionId;
+	 * s->last_saved_confirmed_flush = InvalidXLogRecPtr;
+	 *
+	 * But we do this more like RestoreSlotFromDisk, as if we loaded the
+	 * slot from disk.
+	 */
+	s->effective_xmin = s->data.xmin;
+	s->effective_catalog_xmin = s->data.catalog_xmin;
+
+	s->candidate_catalog_xmin = InvalidTransactionId;
+	s->candidate_xmin_lsn = InvalidXLogRecPtr;
+	s->candidate_restart_valid = InvalidXLogRecPtr;
+	s->candidate_restart_lsn = InvalidXLogRecPtr;
+	SpinLockRelease(&s->mutex);
 }
 
 /*
-- 
2.39.5

0004-fix-LogicalIncreaseRestartDecodingForSlot.patch (text/x-patch)
From e3028d078b680b007126c6941a49f3335147b775 Mon Sep 17 00:00:00 2001
From: tomas <tomas>
Date: Mon, 11 Nov 2024 21:00:07 +0100
Subject: [PATCH 4/5] fix LogicalIncreaseRestartDecodingForSlot

---
 src/backend/replication/logical/logical.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 8fe77f8c463..40ac9f43ce2 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -1638,6 +1638,7 @@ LogicalIncreaseRestartDecodingForSlot(XLogRecPtr current_lsn, XLogRecPtr restart
 {
 	bool		updated_lsn = false;
 	ReplicationSlot *slot;
+	bool		spin_released = false;
 
 	slot = MyReplicationSlot;
 
@@ -1673,7 +1674,7 @@ LogicalIncreaseRestartDecodingForSlot(XLogRecPtr current_lsn, XLogRecPtr restart
 	 * might never end up updating if the receiver acks too slowly. A missed
 	 * value here will just cause some extra effort after reconnecting.
 	 */
-	if (slot->candidate_restart_valid == InvalidXLogRecPtr)
+	else if (slot->candidate_restart_valid == InvalidXLogRecPtr)
 	{
 		Assert(slot->candidate_restart_valid <= current_lsn);
 		Assert(slot->candidate_restart_lsn <= restart_lsn);
@@ -1682,6 +1683,8 @@ LogicalIncreaseRestartDecodingForSlot(XLogRecPtr current_lsn, XLogRecPtr restart
 		slot->candidate_restart_lsn = restart_lsn;
 		SpinLockRelease(&slot->mutex);
 
+		spin_released = true;
+
 		elog(DEBUG1, "got new restart lsn %X/%X at %X/%X",
 			 LSN_FORMAT_ARGS(restart_lsn),
 			 LSN_FORMAT_ARGS(current_lsn));
@@ -1697,6 +1700,8 @@ LogicalIncreaseRestartDecodingForSlot(XLogRecPtr current_lsn, XLogRecPtr restart
 		confirmed_flush = slot->data.confirmed_flush;
 		SpinLockRelease(&slot->mutex);
 
+		spin_released = true;
+
 		elog(DEBUG1, "failed to increase restart lsn: proposed %X/%X, after %X/%X, current candidate %X/%X, current after %X/%X, flushed up to %X/%X",
 			 LSN_FORMAT_ARGS(restart_lsn),
 			 LSN_FORMAT_ARGS(current_lsn),
@@ -1705,6 +1710,9 @@ LogicalIncreaseRestartDecodingForSlot(XLogRecPtr current_lsn, XLogRecPtr restart
 			 LSN_FORMAT_ARGS(confirmed_flush));
 	}
 
+	if (!spin_released)
+		SpinLockRelease(&slot->mutex);
+
 	/* candidates are already valid with the current flush position, apply */
 	if (updated_lsn)
 		LogicalConfirmReceivedLocation(slot->data.confirmed_flush);
-- 
2.39.5

0005-confirmed_flush-asserts.patch (text/x-patch)
From 017a1fd52a125b4551a22444ae4d67889f20a0ac Mon Sep 17 00:00:00 2001
From: tomas <tomas>
Date: Mon, 11 Nov 2024 20:31:02 +0100
Subject: [PATCH 5/5] confirmed_flush asserts

---
 src/backend/replication/logical/logical.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 40ac9f43ce2..73d6149afde 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -605,7 +605,9 @@ DecodingContextFindStartpoint(LogicalDecodingContext *ctx)
 	}
 
 	SpinLockAcquire(&slot->mutex);
+	Assert(slot->data.confirmed_flush <= ctx->reader->EndRecPtr);
 	slot->data.confirmed_flush = ctx->reader->EndRecPtr;
+	Assert(slot->data.initial_consistent_point <= ctx->reader->EndRecPtr);
 	slot->data.initial_consistent_point = ctx->reader->EndRecPtr;
 	SpinLockRelease(&slot->mutex);
 }
@@ -1735,6 +1737,7 @@ LogicalConfirmReceivedLocation(XLogRecPtr lsn)
 
 		SpinLockAcquire(&MyReplicationSlot->mutex);
 
+		Assert(MyReplicationSlot->data.confirmed_flush <= lsn);
 		MyReplicationSlot->data.confirmed_flush = lsn;
 
 		/* if we're past the location required for bumping xmin, do so */
@@ -1802,6 +1805,7 @@ LogicalConfirmReceivedLocation(XLogRecPtr lsn)
 	else
 	{
 		SpinLockAcquire(&MyReplicationSlot->mutex);
+		Assert(MyReplicationSlot->data.confirmed_flush <= lsn);
 		MyReplicationSlot->data.confirmed_flush = lsn;
 		SpinLockRelease(&MyReplicationSlot->mutex);
 	}
-- 
2.39.5

#11Tomas Vondra
tomas@vondra.me
In reply to: Masahiko Sawada (#9)
Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4

On 11/11/24 21:56, Masahiko Sawada wrote:

...

My hypothesis is that we wanted to prevent such
case by the first if block:

/* don't overwrite if have a newer restart lsn */
if (restart_lsn <= slot->data.restart_lsn)
{
}

Yeah, that condition / comment seems to say exactly that.

Do you plan / expect to work on fixing this? It seems you proposed the
right fix in that old thread, but it's been inactive since 2023/02 :-(

I'm happy to work on this fix. At that time, I was unsure if my fix
was really correct and there was no further discussion.

Thanks. I'm not sure about the correctness either, but I think it's
clear the issue is real, and it's not difficult to reproduce it.

regards

--
Tomas Vondra

#12Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Tomas Vondra (#8)
Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4

On Mon, Nov 11, 2024 at 6:17 AM Tomas Vondra <tomas@vondra.me> wrote:

If this analysis is correct, I think it's rather suspicious we don't
reset the candidate fields on restart. Can those "old" values ever be
valid? But I haven't tried resetting them.

I had the same question. IIRC resetting them also fixes the
problem [1]. However, I got a comment from Alvaro [2]:

Hmm, interesting -- I was studying some other bug recently involving the
xmin of a slot that had been invalidated and I remember wondering if
these "candidate" fields were being properly ignored when the slot is
marked not in use; but I didn't check. Are you sure that resetting them
when the slot is released is the appropriate thing to do? I mean,
shouldn't they be kept set while the slot is in use, and only reset if
we destroy it?

Which made me re-investigate the issue, and I thought that we don't
necessarily need to clear these candidate values in memory on releasing
a slot, as long as we're careful when updating restart_lsn. Keeping them
seems a bit more efficient, for example when restarting from a very old
point. Of course, even if we reset them on releasing a slot, it would
work perfectly, since it's the same as restarting logical decoding after
a server restart. I think LogicalIncreaseRestartDecodingForSlot() should
be fixed as it does not seem to be working as expected, but I don't have
proof that we should either keep or reset them on releasing a slot.

Regards,

[1]: /messages/by-id/CAD21AoBG2OSDOFTtpPtQ7fx5Vt8p3dS5hPAv28CBSC6z2kHx-g@mail.gmail.com
[2]: /messages/by-id/20230206152209.yglmntznhcmaueyn@alvherre.pgsql

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#13Tomas Vondra
tomas@vondra.me
In reply to: Masahiko Sawada (#12)
Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4

On 11/11/24 23:41, Masahiko Sawada wrote:

On Mon, Nov 11, 2024 at 6:17 AM Tomas Vondra <tomas@vondra.me> wrote:

If this analysis is correct, I think it's rather suspicious we don't
reset the candidate fields on restart. Can those "old" values ever be
valid? But I haven't tried resetting them.

I had the same question. IIRC resetting them also fixes the
problem[1]. However, I got a comment from Alvaro[2]:

Hmm, interesting -- I was studying some other bug recently involving the
xmin of a slot that had been invalidated and I remember wondering if
these "candidate" fields were being properly ignored when the slot is
marked not in use; but I didn't check. Are you sure that resetting them
when the slot is released is the appropriate thing to do? I mean,
shouldn't they be kept set while the slot is in use, and only reset if
we destroy it?

Which made me re-investigate the issue and thought that it doesn't
necessarily need to clear these candidate values in memory on
releasing a slot as long as we're carefully updating restart_lsn.

Not sure, but maybe it'd be useful to ask the opposite question. Why
shouldn't it be correct to reset the fields, which essentially puts the
slot into the same state as if it was just read from disk? That also
discards all these values, and we can't rely on accidentally keeping
some important info in memory (because if the instance restarts
we'd lose that).

But this reminds me that the patch I shared earlier today resets the
slot in the ReplicationSlotAcquire() function, but I guess that's not
quite correct. It probably should be in the "release" path.
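For illustration only, moving the same resets from the 0003 patch into
the release path might look roughly like this (an untested sketch, using
the existing field names; where exactly in ReplicationSlotRelease() it
would belong is an open question):

    /* sketch: forget any in-memory candidate values when releasing the
     * slot, as if it had just been restored from disk */
    SpinLockAcquire(&slot->mutex);
    slot->candidate_catalog_xmin = InvalidTransactionId;
    slot->candidate_xmin_lsn = InvalidXLogRecPtr;
    slot->candidate_restart_valid = InvalidXLogRecPtr;
    slot->candidate_restart_lsn = InvalidXLogRecPtr;
    SpinLockRelease(&slot->mutex);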

Which seems a bit efficient for example when restarting from a very
old point. Of course, even if we reset them on releasing a slot, it
would perfectly work since it's the same as restarting logical
decoding with a server restart.

I find the "efficiency" argument a bit odd. It'd be fine if we had a
correct behavior to start with, but we don't have that ... Also, I'm not
quite sure why exactly it would be more efficient.

And how likely is this in practice? It seems to me that
performance-sensitive cases can't do reconnects very often anyway,
that's inherently inefficient. No?

I think
LogicalIncreaseRestartDecodingForSlot() should be fixed as it seems
not to be working expectedly, but I could not have proof that we
should either keep or reset them on releasing a slot.

Not sure. Chances are we need both fixes, if only to make
LogicalIncreaseRestartDecodingForSlot more like the other function.

regards

--
Tomas Vondra

#14Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Tomas Vondra (#10)
Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4

On Mon, Nov 11, 2024 at 2:08 PM Tomas Vondra <tomas@vondra.me> wrote:

But neither of those fixes prevents backwards move for confirmed_flush
LSN, as enforced by asserts in the 0005 patch. I don't know if this
assert is incorrect or now. It seems natural that once we get a
confirmation for some LSN, we can't move before that position, but I'm
not sure about that. Maybe it's too strict.

Hmm, I'm concerned that it might be another problem. I think there are
some cases where a subscriber sends a flush position older than the
slot's confirmed_flush in a feedback message. But it seems dangerous to
always accept it as the new confirmed_flush value. It could happen that
confirmed_flush gets set to an LSN older than restart_lsn.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#15Ashutosh Bapat
ashutosh.bapat.oss@gmail.com
In reply to: Masahiko Sawada (#14)
Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4

On Tue, Nov 12, 2024 at 12:02 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Mon, Nov 11, 2024 at 2:08 PM Tomas Vondra <tomas@vondra.me> wrote:

But neither of those fixes prevents backwards move for confirmed_flush
LSN, as enforced by asserts in the 0005 patch. I don't know if this
assert is incorrect or now. It seems natural that once we get a
confirmation for some LSN, we can't move before that position, but I'm
not sure about that. Maybe it's too strict.

Hmm, I'm concerned that it might be another problem. I think there are
some cases where a subscriber sends a flush position older than slot's
confirmed_flush as a feedback message. But it seems to be dangerous if
we always accept it as a new confirmed_flush value. It could happen
that confirm_flush could be set to a LSN older than restart_lsn.

If the confirmed_flush LSN moves backwards, it means the transactions
which were thought to be replicated earlier are no longer considered
to be replicated. This means that the restart_lsn of the slot needs to
be at least as far back as the oldest of the starting points of those
transactions. Thus the slot's restart_lsn has to be pushed further back,
and that WAL may not be available anymore. There's a similar issue with
catalog_xmin: the older catalog rows may have been removed. Another
problem is that we may send some transactions twice, which might cause
trouble downstream. So I agree that the confirmed_flush LSN should not
move backwards. IIRC, if the downstream sends an older confirmed_flush
in the START_REPLICATION message, the WAL sender does not consider it
and instead uses the one in the replication slot.
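Just to illustrate the kind of guard that implies on the sending side
(a sketch only, not the actual walsender code; whether such a clamp is
the right fix is exactly what's being discussed here):

    /* sketch: refuse to move confirmed_flush backwards when applying
     * feedback from the downstream */
    SpinLockAcquire(&MyReplicationSlot->mutex);
    if (lsn > MyReplicationSlot->data.confirmed_flush)
        MyReplicationSlot->data.confirmed_flush = lsn;
    SpinLockRelease(&MyReplicationSlot->mutex);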

--
Best Wishes,
Ashutosh Bapat

#16Ashutosh Bapat
ashutosh.bapat.oss@gmail.com
In reply to: Tomas Vondra (#13)
Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4

On Tue, Nov 12, 2024 at 4:54 AM Tomas Vondra <tomas@vondra.me> wrote:

On 11/11/24 23:41, Masahiko Sawada wrote:

On Mon, Nov 11, 2024 at 6:17 AM Tomas Vondra <tomas@vondra.me> wrote:

Which made me re-investigate the issue and thought that it doesn't
necessarily need to clear these candidate values in memory on
releasing a slot as long as we're carefully updating restart_lsn.

Not sure, but maybe it'd be useful to ask the opposite question. Why
shouldn't it be correct to reset the fields, which essentially puts the
slot into the same state as if it was just read from disk? That also
discards all these values, and we can't rely on accidentally keeping
something important info in memory (because if the instance restarts
we'd lose that).

But this reminds me that the patch I shared earlier today resets the
slot in the ReplicationSlotAcquire() function, but I guess that's not
quite correct. It probably should be in the "release" path.

Which seems a bit efficient for example when restarting from a very
old point. Of course, even if we reset them on releasing a slot, it
would perfectly work since it's the same as restarting logical
decoding with a server restart.

I find the "efficiency" argument a bit odd. It'd be fine if we had a
correct behavior to start with, but we don't have that ... Also, I'm not
quite sure why exactly would it be more efficient?

And how likely is this in practice? It seems to me that
performance-sensitive cases can't do reconnects very often anyway,
that's inherently inefficient. No?

I think
LogicalIncreaseRestartDecodingForSlot() should be fixed as it seems
not to be working expectedly, but I could not have proof that we
should either keep or reset them on releasing a slot.

Not sure. Chances are we need both fixes, if only to make
LogicalIncreaseRestartDecodingForSlot more like the other function.

Thanks a lot for pointing to Masahiko's analysis. I missed that part
when I read that thread. Sorry.

A candidate_restart_lsn and candidate_restart_valid pair just tells us
that we may set the slot's data.restart_lsn to candidate_restart_lsn
once the downstream confirms an LSN >= candidate_restart_valid. That
pair can never become inaccurate. It may get stale, but never
inaccurate. So wiping those fields from the ReplicationSlot is
unnecessary.

What should ideally happen is that we ignore candidates produced by
older running transactions WAL records after a WAL sender restart. This
is in line with what logical replication does with transactions
committed before the slot's confirmed_flush_lsn - those are ignored. But
the criteria for ignoring running transactions records are slightly
different from those for transactions. If we ignore a
candidate_restart_lsn which has candidate_restart_valid <=
confirmed_flush_lsn, we might lose some opportunity to advance
data.restart_lsn. Instead we should ignore any candidate_restart_lsn
<= data.restart_lsn, especially before the WAL sender finds the first
change to send downstream. We can do that in
SnapBuildProcessRunningXacts() by accessing MyReplicationSlot, taking a
lock on it, and then comparing data.restart_lsn with
txn->restart_decoding_lsn before calling
LogicalIncreaseRestartDecodingForSlot(). But then
LogicalIncreaseRestartDecodingForSlot() would be doing the same anyway
after applying your patch 0004. The only downside of 0004 is that the
logic to ignore candidates produced by a running transactions record is
not clearly visible in SnapBuildProcessRunningXacts(). For a transaction
which is ignored, the logic to ignore it is visible in DecodeCommit() or
DecodeAbort(), where people are likely to look for that logic. We may
add a comment to that effect in SnapBuildProcessRunningXacts().
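A very rough sketch of that alternative check (purely illustrative; the
surrounding SnapBuildProcessRunningXacts() logic is elided, and
current_lsn / candidate_restart_lsn here are placeholders for the LSN of
the running-xacts record and the LSN that would be proposed, e.g.
txn->restart_decoding_lsn):

    XLogRecPtr  slot_restart_lsn;

    /* sketch: read the slot's current restart_lsn under the spinlock
     * (current_lsn and candidate_restart_lsn are placeholders, see above) */
    SpinLockAcquire(&MyReplicationSlot->mutex);
    slot_restart_lsn = MyReplicationSlot->data.restart_lsn;
    SpinLockRelease(&MyReplicationSlot->mutex);

    /* only propose candidates that can actually advance restart_lsn */
    if (candidate_restart_lsn > slot_restart_lsn)
        LogicalIncreaseRestartDecodingForSlot(current_lsn, candidate_restart_lsn);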

--
Best Wishes,
Ashutosh Bapat

#17Tomas Vondra
tomas@vondra.me
In reply to: Ashutosh Bapat (#16)
Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4

On 11/12/24 10:37, Ashutosh Bapat wrote:

On Tue, Nov 12, 2024 at 4:54 AM Tomas Vondra <tomas@vondra.me> wrote:

On 11/11/24 23:41, Masahiko Sawada wrote:

On Mon, Nov 11, 2024 at 6:17 AM Tomas Vondra <tomas@vondra.me> wrote:

Which made me re-investigate the issue and thought that it doesn't
necessarily need to clear these candidate values in memory on
releasing a slot as long as we're carefully updating restart_lsn.

Not sure, but maybe it'd be useful to ask the opposite question. Why
shouldn't it be correct to reset the fields, which essentially puts the
slot into the same state as if it was just read from disk? That also
discards all these values, and we can't rely on accidentally keeping
something important info in memory (because if the instance restarts
we'd lose that).

But this reminds me that the patch I shared earlier today resets the
slot in the ReplicationSlotAcquire() function, but I guess that's not
quite correct. It probably should be in the "release" path.

Which seems a bit efficient for example when restarting from a very
old point. Of course, even if we reset them on releasing a slot, it
would perfectly work since it's the same as restarting logical
decoding with a server restart.

I find the "efficiency" argument a bit odd. It'd be fine if we had a
correct behavior to start with, but we don't have that ... Also, I'm not
quite sure why exactly would it be more efficient?

And how likely is this in practice? It seems to me that
performance-sensitive cases can't do reconnects very often anyway,
that's inherently inefficient. No?

I think
LogicalIncreaseRestartDecodingForSlot() should be fixed as it seems
not to be working expectedly, but I could not have proof that we
should either keep or reset them on releasing a slot.

Not sure. Chances are we need both fixes, if only to make
LogicalIncreaseRestartDecodingForSlot more like the other function.

Thanks a lot for pointing to Masahiko's analysis. I missed that part
when I read that thread. Sorry.

No need to apologize. The discussion is complex and spread over a rather
long time period.

A candidate_restart_lsn and candidate_restart_valid pair just tells
that we may set slot's data.restart_lsn to candidate_restart_lsn when
the downstream confirms an LSN >= candidate_restart_valid. That pair
can never get inaccurate. It may get stale but never inaccurate. So
wiping those fields from ReplicationSlot is unnecessary.

Isn't this issue a proof that those fields *can* get inaccurate? Or what
do you mean by "stale but not inaccurate"?

What should ideally happen is we should ignore candidates produced by
older running transactions WAL records after WAL sender restart. This
is inline with what logical replication does with transactions
committed before slot's confirmed_flush_lsn - those are ignored. But
the criteria for ignoring running transactions records is slightly
different from that for transactions. If we ignore
candidate_restart_lsn which has candidate_restart_valid <=
confirmed_flush_lsn, we might lose some opportunity to advance
data.restart_lsn. Instead we should ignore any candidate_restart_lsn
<= data.restart_lsn especially before WAL sender finds first change to
send downstream. We can do that in SnapBuildProcessRunningXacts() by
accessing MyReplicationSlot, taking lock on it and then comparing
data.restart_lsn with txn->restart_decoding_lsn before calling
LogicalIncreaseRestartDecodingForSlot(). But then
LogicalIncreaseRestartDecodingForSlot() would be doing the same anyway
after applying your patch 0004. The only downside of 0004 is that the
logic to ignore candidates produced by a running transactions record
is not clearly visible in SnapBuildProcessRunningXacts(). For a
transaction which is ignored the logic to ignore the transaction is
visible in DecodeCommit() or DecodeAbort() - where people are likely
to look for that logic. We may add a comment to that effect in
SnapBuildProcessRunningXacts().

I have thought about just doing something like:

slot->data.restart_lsn = Max(slot->data.restart_lsn, new_lsn);

and similar for the other LSN fields. And it did resolve the issue at
hand, of course. But it seems sloppy, and I'm worried it might easily
mask actual issues in other cases.

I'm still of the opinion that (with the exception of a reconnect), these
places should not need to deal with values that go backwards. It should
work just fine without the Max(), and we should add Asserts() to check
that it's always a higher LSN.
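
For illustration, a minimal sketch of such a check at the (hypothetical)
point where a new value is about to be installed - the variable
new_restart_lsn is made up here, this is not an actual patch from this
thread:

/*
 * Hypothetical guard: once the slot has a valid restart_lsn, any value we
 * are about to install must not be older than it.
 */
Assert(slot->data.restart_lsn == InvalidXLogRecPtr ||
       new_restart_lsn >= slot->data.restart_lsn);
slot->data.restart_lsn = new_restart_lsn;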

For the reconnect, I think it's a bit as if the primary restarted right
before the reconnect. That could happen anyway, and we need to handle
that correctly - if not, we have yet another issue, IMHO. And with the
restart it's the same as writing the slot to disk and reading it back,
which also doesn't retain most of the fields. So it seems cleaner to do
the same thing and just reset the various fields.

I haven't thought about making SnapBuildProcessRunningXacts() more
complex to consider this stuff. But I doubt we'd like to be accessing
slots from that code - it has nothing to do with slots. If anything,
tweaking LogicalIncreaseRestartDecodingForSlot() seems more appropriate,
but I'm still wondering if the current coding was intentional and we're
just missing why it was written like this.

There's also the question of backpatching - the simpler the better, and
I think just resetting the fields wins in this regard. The main
question is whether it's correct - I think it is. I'm not too worried
about efficiency, on the grounds that this should not matter
very often (only after an unexpected restart). Correctness > efficiency.

regards

--
Tomas Vondra

#18Ashutosh Bapat
ashutosh.bapat.oss@gmail.com
In reply to: Tomas Vondra (#17)
Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4

On Tue, Nov 12, 2024 at 4:55 PM Tomas Vondra <tomas@vondra.me> wrote:

A candidate_restart_lsn and candidate_restart_valid pair just tells
that we may set slot's data.restart_lsn to candidate_restart_lsn when
the downstream confirms an LSN >= candidate_restart_valid. That pair
can never get inaccurate. It may get stale but never inaccurate. So
wiping those fields from ReplicationSlot is unnecessary.

Isn't this issue a proof that those fields *can* get inaccurate? Or what
do you mean by "stale but not inaccurate"?

While processing a running transactions log record RT1, the WAL
sender gathers that the slot's data.restart_lsn can be set to
candidate_restart_lsn1 once confirmed_flush_lsn is past
candidate_restart_valid1. In that sense it's accurate irrespective of
when those candidates are generated. But when the next running
transactions record RT2 is processed, a new pair of
candidate_restart_lsn2 and candidate_restart_valid2 is produced
(candidate_restart_valid2 > candidate_restart_valid1). Thus if
confirmed_flush_lsn >= candidate_restart_valid2, it's better to set
data.restart_lsn = candidate_restart_lsn2 instead of
candidate_restart_lsn1. In that sense the first pair becomes stale. If
we could maintain all such pairs in a sorted fashion, we would be able
to set data.restart_lsn to the latest candidate_restart_lsn for a
received confirmed_flush_lsn and remove all the pairs up to and including
the latest one after setting data.restart_lsn. But we don't maintain such
a list, hence the business of resetting candidate_restart_lsn and
candidate_restart_valid to indicate that we have consumed the previous
pair and are ready for a new one. We don't keep updating
candidate_restart_lsn and candidate_restart_valid to avoid chasing a
moving target and never updating data.restart_lsn.
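
To make that concrete, the consume-and-reset step happens when a
confirmation arrives. Roughly (a simplified paraphrase of
LogicalConfirmReceivedLocation(), omitting the xmin handling, dirtying
and persistence):

SpinLockAcquire(&MyReplicationSlot->mutex);
MyReplicationSlot->data.confirmed_flush = lsn;

/*
 * The downstream confirmed an LSN past candidate_restart_valid, so the
 * pending candidate can be applied and then cleared, making room for
 * the next pair.
 */
if (MyReplicationSlot->candidate_restart_valid != InvalidXLogRecPtr &&
    MyReplicationSlot->candidate_restart_valid <= lsn)
{
    MyReplicationSlot->data.restart_lsn = MyReplicationSlot->candidate_restart_lsn;
    MyReplicationSlot->candidate_restart_lsn = InvalidXLogRecPtr;
    MyReplicationSlot->candidate_restart_valid = InvalidXLogRecPtr;
}
SpinLockRelease(&MyReplicationSlot->mutex);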

What should ideally happen is we should ignore candidates produced by
older running transactions WAL records after WAL sender restart. This
is inline with what logical replication does with transactions
committed before slot's confirmed_flush_lsn - those are ignored. But
the criteria for ignoring running transactions records is slightly
different from that for transactions. If we ignore
candidate_restart_lsn which has candidate_restart_valid <=
confirmed_flush_lsn, we might lose some opportunity to advance
data.restart_lsn. Instead we should ignore any candidate_restart_lsn
<= data.restart_lsn especially before WAL sender finds first change to
send downstream. We can do that in SnapBuildProcessRunningXacts() by
accessing MyReplicationSlot, taking lock on it and then comparing
data.restart_lsn with txn->restart_decoding_lsn before calling
LogicalIncreaseRestartDecodingForSlot(). But then
LogicalIncreaseRestartDecodingForSlot() would be doing the same anyway
after applying your patch 0004. The only downside of 0004 is that the
logic to ignore candidates produced by a running transactions record
is not clearly visible in SnapBuildProcessRunningXacts(). For a
transaction which is ignored the logic to ignore the transaction is
visible in DecodeCommit() or DecodeAbort() - where people are likely
to look for that logic. We may add a comment to that effect in
SnapBuildProcessRunningXacts().

I have thought about just doing something like:

slot->data.restart_lsn = Max(slot->data.restart_lsn, new_lsn);

and similar for the other LSN fields. And it did resolve the issue at
hand, of course. But it seems sloppy, and I'm worried it might easily
mask actual issues in other cases.

Agreed.

I'm still of the opinion that (with the exception of a reconnect), these
places should not need to deal with values that go backwards. It should
work just fine without the Max(), and we should add Asserts() to check
that it's always a higher LSN.

Agreed.

I haven't thought about making SnapBuildProcessRunningXacts() more
complex to consider this stuff. But I doubt we'd like to be accessing
slots from that code - it has nothing to do with slots. If anything,
tweaking LogicalIncreaseRestartDecodingForSlot() seems more appropriate,

Agree.

but I'm still wondering if the current coding was intentional and we're
just missing why it was written like this.

Interestingly, the asymmetry between the functions was added in the
same commit b89e151054a05f0f6d356ca52e3b725dd0505e53. I doubt it was
intentional; the comments at both places say the same thing. This
problem is probably as old as that commit.

For the reconnect, I think it's a bit as if the primary restarted right
before the reconnect. That could happen anyway, and we need to handle
that correctly - if not, we have yet another issue, IMHO. And with the
restart it's the same as writing the slot to disk and reading it back,
which also doesn't retain most of the fields. So it seems cleaner to do
the same thing and just reset the various fields.

There's also the question of backpatching - the simpler the better, and
this I think just resetting the fields wins in this regard. The main
question is whether it's correct - I think it is. I'm not too worried
about efficiency very much, on the grounds that this should not matter
very often (only after unexpected restart). Correctness > efficiency.

If the slot's restart_lsn is very old before disconnect we will lose
an opportunity to update the restart_lsn and thus release some
resources earlier. However, that opportunity is only for a short
duration. On a fast enough machine the data.restart_lsn will be
updated anyway after processing all running transactions WAL records.
So I am also of the opinion that the argument of efficiency
won't stand here. But I doubt that "not resetting" is wrong. It looks
more intentional to me than the asymmetry between
LogicalIncreaseRestartDecodingForSlot and LogicalIncreaseXminForSlot.
This is our chance to settle that asymmetry forever :D.

I don't have a strong argument against resetting candidates in slots
at SlotRelease. However, resetting effective_xmin and
effective_catalog_xmin may not look good. The old values may have
been used to advance the xmin horizon. I have not closely read the code to
know whether the on-disk xmin and effective_xmin are always in sync
when computing xmin horizon or not.

--
Best Wishes,
Ashutosh Bapat

#19Amit Kapila
amit.kapila16@gmail.com
In reply to: Tomas Vondra (#17)
Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4

On Tue, Nov 12, 2024 at 4:55 PM Tomas Vondra <tomas@vondra.me> wrote:

There's also the question of backpatching - the simpler the better, and
this I think just resetting the fields wins in this regard. The main
question is whether it's correct - I think it is. I'm not too worried
about efficiency very much, on the grounds that this should not matter
very often (only after unexpected restart). Correctness > efficiency.

Sure, but what is wrong with changing
LogicalIncreaseRestartDecodingForSlot's "if
(slot->candidate_restart_valid == InvalidXLogRecPtr)" to "else if
(slot->candidate_restart_valid == InvalidXLogRecPtr)"? My previous
analysis [1][2] on similar issue also leads to that conclusion. Then
later Sawada-San's email [3] also leads to the same solution. I know
that the same has been discussed in this thread and we are primarily
worried about whether we are missing some case that needs the current
code in LogicalIncreaseRestartDecodingForSlot(). It is always possible
that all who have analyzed are missing some point but I feel the
chances are less. I vote to fix LogicalIncreaseRestartDecodingForSlot.

Now, we have at least some theory for not clearing the candidate_restart_*
values, which is to allow advancing restart_lsn earlier when we get a
confirmation from the subscriber. Now, your theory that walsender
exits should be rare, so this doesn't impact much, is also true, but OTOH
why change something that can work more efficiently, provided we fix
LogicalIncreaseRestartDecodingForSlot as per our analysis?

[1]: /messages/by-id/CAA4eK1JvyWHzMwhO9jzPquctE_ha6bz3EkB3KE6qQJx63StErQ@mail.gmail.com
[2]: /messages/by-id/CAA4eK1KcnTvwrVqmpRTEMpyarBeTxwr8KA+kaveQOiqJ0zYsXA@mail.gmail.com
[3]: /messages/by-id/CAD21AoBVhYnGBuW_o=wEGgTp01qiHNAx1a14b1X9kFXmuBe=sg@mail.gmail.com
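
A simplified sketch of the branch structure in question (paraphrased, not
the actual source; the middle branch's condition is recalled from memory,
and the real diff appears later in this thread):

SpinLockAcquire(&slot->mutex);

/* don't overwrite if we have a newer restart lsn */
if (restart_lsn <= slot->data.restart_lsn)
{
    /* nothing to do */
}
else if (current_lsn <= slot->data.confirmed_flush)
{
    /* candidate is already confirmed, can be used directly */
    slot->candidate_restart_valid = current_lsn;
    slot->candidate_restart_lsn = restart_lsn;
}
/*
 * Previously a plain "if", so an older restart_lsn could still be installed
 * as a candidate; the proposed fix makes it an "else if", matching
 * LogicalIncreaseXminForSlot().
 */
else if (slot->candidate_restart_valid == InvalidXLogRecPtr)
{
    slot->candidate_restart_valid = current_lsn;
    slot->candidate_restart_lsn = restart_lsn;
}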

--
With Regards,
Amit Kapila.

#20Tomas Vondra
tomas@vondra.me
In reply to: Amit Kapila (#19)
Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4

On 11/12/24 13:19, Amit Kapila wrote:

On Tue, Nov 12, 2024 at 4:55 PM Tomas Vondra <tomas@vondra.me> wrote:

There's also the question of backpatching - the simpler the better, and
this I think just resetting the fields wins in this regard. The main
question is whether it's correct - I think it is. I'm not too worried
about efficiency very much, on the grounds that this should not matter
very often (only after unexpected restart). Correctness > efficiency.

Sure, but what is wrong with changing
LogicalIncreaseRestartDecodingForSlot's "if
(slot->candidate_restart_valid == InvalidXLogRecPtr)" to "else if
(slot->candidate_restart_valid == InvalidXLogRecPtr)"? My previous
analysis [1][2] on similar issue also leads to that conclusion. Then
later Sawada-San's email [3] also leads to the same solution. I know
that the same has been discussed in this thread and we are primarily
worried about whether we are missing some case that needs the current
code in LogicalIncreaseRestartDecodingForSlot(). It is always possible
that all who have analyzed are missing some point but I feel the
chances are less. I vote to fix LogicalIncreaseRestartDecodingForSlot.

I'm not opposed to adjusting LogicalIncreaseRestartDecodingForSlot().
The difference from LogicalIncreaseXminForSlot() seems accidental, in
which case fixing it seems desirable.

Now, we have at least some theory for not clearing the candidate_restart_*
values, which is to allow advancing restart_lsn earlier when we get a
confirmation from the subscriber. Now, your theory that walsender
exits should be rare, so this doesn't impact much, is also true, but OTOH
why change something that can work more efficiently, provided we fix
LogicalIncreaseRestartDecodingForSlot as per our analysis?

Because with how it works now it's impossible to check if the LSN values
are correct. We get an LSN value that's before what we already have in
the slot, and it might be either fine after a restart, or a bug (perhaps
on the subscriber side) causing all kinds of strange issues.

I'm all in favor of making stuff more efficient, but only if it doesn't
harm reliability. And that does seem to be happening here, because this
(not resetting + inability to have strict checks on the LSN updates)
seems to be one of the reasons why we have this bug since 9.4.

Sure, maybe fixing LogicalIncreaseRestartDecodingForSlot() is enough to
fix this particular case. But I'd be happier if we could also add
asserts checking the LSN advances, to detect similar issues that we may
be unaware of yet.

regards

--
Tomas Vondra

#21Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Ashutosh Bapat (#18)
Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4

On Tue, Nov 12, 2024 at 4:08 AM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:

On Tue, Nov 12, 2024 at 4:55 PM Tomas Vondra <tomas@vondra.me> wrote:

but I'm still wondering if the current coding was intentional and we're
just missing why it was written like this.

Interestingly, the asymmetry between the functions is added in the
same commit b89e151054a05f0f6d356ca52e3b725dd0505e53. I doubt it was
intentional; the comments at both places say the same thing. This
problem is probably as old as that commit.

FYI the asymmetry appears to have been present since the first version
of the patch that gave LogicalIncreaseRestartDecodingForSlot()
this form[1].

For the reconnect, I think it's a bit as if the primary restarted right
before the reconnect. That could happen anyway, and we need to handle
that correctly - if not, we have yet another issue, IMHO. And with the
restart it's the same as writing the slot to disk and reading it back,
which also doesn't retain most of the fields. So it seems cleaner to do
the same thing and just reset the various fields.

There's also the question of backpatching - the simpler the better, and
this I think just resetting the fields wins in this regard. The main
question is whether it's correct - I think it is. I'm not too worried
about efficiency very much, on the grounds that this should not matter
very often (only after unexpected restart). Correctness > efficiency.

If the slot's restart_lsn is very old before disconnect we will lose
an opportunity to update the restart_lsn and thus release some
resources earlier. However, that opportunity is only for a short
duration. On a fast enough machine the data.restart_lsn will be
updated anyway after processing all running transactions wal records
anyway. So I am also of the opinion that the argument of efficiency
won't stand here. But I doubt if "not resetting" is wrong. It looks
more intentional to me than the asymmetry between
LogicalIncreaseRestartDecodingForSlot and LogicalIncreaseXminForSlot.

In addition to the asymmetry between the two functions, the reason why I
think we should fix LogicalIncreaseRestartDecodingForSlot() is the
fact that it accepts any LSN as a candidate restart_lsn, which is
dangerous and incorrect to me even if there wasn't the asymmetry
issue.

Regards,

[1]: /messages/by-id/20131212000839.GA22776@awork2.anarazel.de

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#22Ashutosh Bapat
ashutosh.bapat.oss@gmail.com
In reply to: Tomas Vondra (#17)
Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4

On Tue, Nov 12, 2024 at 4:55 PM Tomas Vondra <tomas@vondra.me> wrote:

On 11/12/24 10:37, Ashutosh Bapat wrote:

On Tue, Nov 12, 2024 at 4:54 AM Tomas Vondra <tomas@vondra.me> wrote:

On 11/11/24 23:41, Masahiko Sawada wrote:

On Mon, Nov 11, 2024 at 6:17 AM Tomas Vondra <tomas@vondra.me> wrote:

Which made me re-investigate the issue and thought that it doesn't
necessarily need to clear these candidate values in memory on
releasing a slot as long as we're carefully updating restart_lsn.

Not sure, but maybe it'd be useful to ask the opposite question. Why
shouldn't it be correct to reset the fields, which essentially puts the
slot into the same state as if it was just read from disk? That also
discards all these values, and we can't rely on accidentally keeping
something important info in memory (because if the instance restarts
we'd lose that).

But this reminds me that the patch I shared earlier today resets the
slot in the ReplicationSlotAcquire() function, but I guess that's not
quite correct. It probably should be in the "release" path.

Which seems a bit efficient for example when restarting from a very
old point. Of course, even if we reset them on releasing a slot, it
would perfectly work since it's the same as restarting logical
decoding with a server restart.

I find the "efficiency" argument a bit odd. It'd be fine if we had a
correct behavior to start with, but we don't have that ... Also, I'm not
quite sure why exactly would it be more efficient?

And how likely is this in practice? It seems to me that
performance-sensitive cases can't do reconnects very often anyway,
that's inherently inefficient. No?

I think
LogicalIncreaseRestartDecodingForSlot() should be fixed as it seems
not to be working expectedly, but I could not have proof that we
should either keep or reset them on releasing a slot.

Not sure. Chances are we need both fixes, if only to make
LogicalIncreaseRestartDecodingForSlot more like the other function.

Thanks a lot for pointing to Masahiko's analysis. I missed that part
when I read that thread. Sorry.

No need to apologize. The discussion is complex and spread over a rather
long time period.

A candidate_restart_lsn and candidate_restart_valid pair just tells
that we may set slot's data.restart_lsn to candidate_restart_lsn when
the downstream confirms an LSN >= candidate_restart_valid. That pair
can never get inaccurate. It may get stale but never inaccurate. So
wiping those fields from ReplicationSlot is unnecessary.

Isn't this issue a proof that those fields *can* get inaccurate? Or what
do you mean by "stale but not inaccurate"?

What should ideally happen is we should ignore candidates produced by
older running transactions WAL records after WAL sender restart. This
is inline with what logical replication does with transactions
committed before slot's confirmed_flush_lsn - those are ignored. But
the criteria for ignoring running transactions records is slightly
different from that for transactions. If we ignore
candidate_restart_lsn which has candidate_restart_valid <=
confirmed_flush_lsn, we might lose some opportunity to advance
data.restart_lsn. Instead we should ignore any candidate_restart_lsn
<= data.restart_lsn especially before WAL sender finds first change to
send downstream. We can do that in SnapBuildProcessRunningXacts() by
accessing MyReplicationSlot, taking lock on it and then comparing
data.restart_lsn with txn->restart_decoding_lsn before calling
LogicalIncreaseRestartDecodingForSlot(). But then
LogicalIncreaseRestartDecodingForSlot() would be doing the same anyway
after applying your patch 0004. The only downside of 0004 is that the
logic to ignore candidates produced by a running transactions record
is not clearly visible in SnapBuildProcessRunningXacts(). For a
transaction which is ignored the logic to ignore the transaction is
visible in DecodeCommit() or DecodeAbort() - where people are likely
to look for that logic. We may add a comment to that effect in
SnapBuildProcessRunningXacts().

I have thought about just doing something like:

slot->data.restart_lsn = Max(slot->data.restart_lsn, new_lsn);

and similar for the other LSN fields. And it did resolve the issue at
hand, of course. But it seems sloppy, and I'm worried it might easily
mask actual issues in other cases.

I'm still of the opinion that (with the exception of a reconnect), these
places should not need to deal with values that go backwards. It should
work just fine without the Max(), and we should add Asserts() to check
that it's always a higher LSN.

For the reconnect, I think it's a bit as if the primary restarted right
before the reconnect. That could happen anyway, and we need to handle
that correctly - if not, we have yet another issue, IMHO. And with the
restart it's the same as writing the slot to disk and reading it back,
which also doesn't retain most of the fields. So it seems cleaner to do
the same thing and just reset the various fields.

I haven't thought about making SnapBuildProcessRunningXacts() more
complex to consider this stuff. But I doubt we'd like to be accessing
slots from that code - it has nothing to do with slots.

Here's a way we can fix SnapBuildProcessRunningXacts() similar to
DecodeCommit(). DecodeCommit() uses SnapBuildXactNeedsSkip() to decide
whether a given transaction should be decoded or not.

/*
 * Should the contents of transaction ending at 'ptr' be decoded?
 */
bool
SnapBuildXactNeedsSkip(SnapBuild *builder, XLogRecPtr ptr)
{
    return ptr < builder->start_decoding_at;
}

Similar to SnapBuild::start_decoding_at, we could maintain a field
SnapBuild::start_reading_at set to the LSN from which the WAL sender would
start reading WAL. If the candidate_restart_lsn produced by a running
transactions WAL record is less than SnapBuild::start_reading_at,
SnapBuildProcessRunningXacts() won't call
LogicalIncreaseRestartDecodingForSlot() with that candidate LSN. We
won't access the slot here, and the solution will be in line with
DecodeCommit(), which skips the transactions.
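
A rough sketch of that idea, assuming a new SnapBuild::start_reading_at
field and a hypothetical helper named here
SnapBuildRestartCandidateNeedsSkip() (neither exists today):

/*
 * Hypothetical helper, analogous to SnapBuildXactNeedsSkip(): should a
 * candidate restart LSN produced by a running-xacts record be ignored?
 */
static bool
SnapBuildRestartCandidateNeedsSkip(SnapBuild *builder, XLogRecPtr candidate)
{
    return candidate < builder->start_reading_at;
}

/* in SnapBuildProcessRunningXacts(), before proposing the candidate: */
if (!SnapBuildRestartCandidateNeedsSkip(builder, txn->restart_decoding_lsn))
    LogicalIncreaseRestartDecodingForSlot(lsn, txn->restart_decoding_lsn);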

--
Best Wishes,
Ashutosh Bapat

#23Amit Kapila
amit.kapila16@gmail.com
In reply to: Tomas Vondra (#20)
Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4

On Tue, Nov 12, 2024 at 6:29 PM Tomas Vondra <tomas@vondra.me> wrote:

Sure, maybe fixing LogicalIncreaseRestartDecodingForSlot() is enough to
fix this particular case. But I'd be happier if we could also add
asserts checking the LSN advances, to detect similar issues that we may
be unaware of yet.

As most of us lean towards fixing
LogicalIncreaseRestartDecodingForSlot(), let's fix that in the HEAD
and back branches. Separately we can consider other asserts just for
HEAD that you think will make the code robust and help avoid such bugs
in the future.

--
With Regards,
Amit Kapila.

#24Amit Kapila
amit.kapila16@gmail.com
In reply to: Ashutosh Bapat (#15)
Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4

On Tue, Nov 12, 2024 at 12:43 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:

On Tue, Nov 12, 2024 at 12:02 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Mon, Nov 11, 2024 at 2:08 PM Tomas Vondra <tomas@vondra.me> wrote:

But neither of those fixes prevents backwards move for confirmed_flush
LSN, as enforced by asserts in the 0005 patch. I don't know if this
assert is incorrect or now. It seems natural that once we get a
confirmation for some LSN, we can't move before that position, but I'm
not sure about that. Maybe it's too strict.

Hmm, I'm concerned that it might be another problem. I think there are
some cases where a subscriber sends a flush position older than slot's
confirmed_flush as a feedback message.

Right, it can happen for cases where subscribers doesn't have to do
anything (for example DDLs) like I have explained in one of my emails
[1]

But it seems to be dangerous if
we always accept it as a new confirmed_flush value. It could happen
that confirm_flush could be set to a LSN older than restart_lsn.

Possible, though I haven't tried to reproduce such a case. But, will
it create any issues? I don't know if there is any benefit in allowing
to move confirmed_flush LSN backward. AFAIR, we don't allow such
backward values to persist. They will temporarily be in memory. I
think as a separate patch we should prevent it from moving backward.

If confirmed_flush LSN moves backwards, it means the transactions
which were thought to be replicated earlier are no longer considered
to be replicated. This means that the restart_lsn of the slot needs to
be at least far back as the oldest of starting points of those
transactions. Thus restart_lsn of slot has to be pushed further back.

I don't see a reason to move restart_lsn backward. Why do you think so?

[1]: /messages/by-id/CAA4eK1+zWQwOe5G8zCYGvErnaXh5+Dbyg_A1Z3uywSf_4=T0UA@mail.gmail.com

--
With Regards,
Amit Kapila.

#25Tomas Vondra
tomas@vondra.me
In reply to: Amit Kapila (#24)
Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4

On 11/13/24 11:59, Amit Kapila wrote:

On Tue, Nov 12, 2024 at 12:43 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:

On Tue, Nov 12, 2024 at 12:02 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Mon, Nov 11, 2024 at 2:08 PM Tomas Vondra <tomas@vondra.me> wrote:

But neither of those fixes prevents backwards move for confirmed_flush
LSN, as enforced by asserts in the 0005 patch. I don't know if this
assert is incorrect or now. It seems natural that once we get a
confirmation for some LSN, we can't move before that position, but I'm
not sure about that. Maybe it's too strict.

Hmm, I'm concerned that it might be another problem. I think there are
some cases where a subscriber sends a flush position older than slot's
confirmed_flush as a feedback message.

Right, it can happen for cases where subscribers doesn't have to do
anything (for example DDLs) like I have explained in one of my emails
[1]

Thanks. I admit not being entirely familiar with all the details, but
doesn't that email explain more "Why it currently happens?" rather than
"Is this what should be happening?"

Sure, the subscriber needs to confirm changes for which nothing needs to
be done, like DDL. But isn't there a better way to do that, rather than
allowing confirmed_lsn to go backwards?

But it seems to be dangerous if
we always accept it as a new confirmed_flush value. It could happen
that confirm_flush could be set to a LSN older than restart_lsn.

Possible, though I haven't tried to reproduce such a case. But, will
it create any issues? I don't know if there is any benefit in allowing
to move confirmed_flush LSN backward. AFAIR, we don't allow such
backward values to persist. They will temporarily be in memory. I
think as a separate patch we should prevent it from moving backward.

If confirmed_flush LSN moves backwards, it means the transactions
which were thought to be replicated earlier are no longer considered
to be replicated. This means that the restart_lsn of the slot needs to
be at least far back as the oldest of starting points of those
transactions. Thus restart_lsn of slot has to be pushed further back.

I don't see a reason to move restart_lsn backward. Why do you think so?

I think what Ashutosh is saying is that if confirmed_flush is allowed to
move backwards, that may result in start_lsn moving backwards too. And
we need to be able to decode all transactions committed since start_lsn,
so if start_lsn moves backwards, maybe restart_lsn needs to move
backwards too. I have no idea if confirmed_flush/start_lsn can move
backwards enough to require restart_lsn to move, though.
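
For illustration only, the relationships being discussed could be spelled
out as hypothetical checks at the point where a confirmation is applied
(not an actual patch from this thread):

/*
 * Hypothetical invariants: a confirmation should not move confirmed_flush
 * backwards, and restart_lsn has to stay old enough that every transaction
 * committed after confirmed_flush can still be decoded.
 */
Assert(lsn >= slot->data.confirmed_flush);
Assert(slot->data.restart_lsn <= slot->data.confirmed_flush);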

Anyway, these discussions are a good illustration why I think allowing
these LSNs to move backwards is a problem. It either causes bugs (like
with breaking replication slots) and/or it makes the reasoning about
correct behavior much harder.

regards

--
Tomas Vondra

#26Tomas Vondra
tomas@vondra.me
In reply to: Amit Kapila (#23)
Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4

On 11/13/24 10:38, Amit Kapila wrote:

On Tue, Nov 12, 2024 at 6:29 PM Tomas Vondra <tomas@vondra.me> wrote:

Sure, maybe fixing LogicalIncreaseRestartDecodingForSlot() is enough to
fix this particular case. But I'd be happier if we could also add
asserts checking the LSN advances, to detect similar issues that we may
be unaware of yet.

As most of us lean towards fixing
LogicalIncreaseRestartDecodingForSlot(), let's fix that in the HEAD
and back branches. Separately we can consider other asserts just for
HEAD that you think will make the code robust and help avoid such bugs
in the future.

+1 to that

regards

--
Tomas Vondra

#27Tomas Vondra
tomas@vondra.me
In reply to: Ashutosh Bapat (#22)
Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4

On 11/13/24 05:38, Ashutosh Bapat wrote:

...

Here's way we can fix SnapBuildProcessRunningXacts() similar to
DecodeCommit(). DecodeCommit() uses SnapBuildXactNeedsSkip() to decide
whether a given transaction should be decoded or not.
/*
* Should the contents of transaction ending at 'ptr' be decoded?
*/
bool
SnapBuildXactNeedsSkip(SnapBuild *builder, XLogRecPtr ptr)
{
return ptr < builder->start_decoding_at;
}

Similar to SnapBuild::start_decoding_at we could maintain a field
SnapBuild::start_reading_at to the LSN from which the WAL sender would
start reading WAL. If candidate_restart_lsn produced by a running
transactions WAL record is less than SnapBuild::start_reading_at,
SnapBuildProcessRunningXacts() won't call
LogicalIncreaseRestartDecodingForSlot() with that candiate LSN. We
won't access the slot here and the solution will be inline with
DecodeCommit() which skips the transactions.

Could you maybe write a patch doing this? That would allow proper
testing etc.

regards

--
Tomas Vondra

#28Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Tomas Vondra (#26)
Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4

On Wed, Nov 13, 2024 at 3:53 AM Tomas Vondra <tomas@vondra.me> wrote:

On 11/13/24 10:38, Amit Kapila wrote:

On Tue, Nov 12, 2024 at 6:29 PM Tomas Vondra <tomas@vondra.me> wrote:

Sure, maybe fixing LogicalIncreaseRestartDecodingForSlot() is enough to
fix this particular case. But I'd be happier if we could also add
asserts checking the LSN advances, to detect similar issues that we may
be unaware of yet.

As most of us lean towards fixing
LogicalIncreaseRestartDecodingForSlot(), let's fix that in the HEAD
and back branches. Separately we can consider other asserts just for
HEAD that you think will make the code robust and help avoid such bugs
in the future.

+1 to that

+1.

Tomas, are you going to update the patch you shared before[1] or shall I?

Regards,

[1]: /messages/by-id/abf794c4-c459-4fed-84d9-968c4f0e2052@vondra.me (0004-fix-LogicalIncreaseRestartDecodingForSlot.patch)

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#29Tomas Vondra
tomas@vondra.me
In reply to: Masahiko Sawada (#28)
Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4

On 11/14/24 00:33, Masahiko Sawada wrote:

On Wed, Nov 13, 2024 at 3:53 AM Tomas Vondra <tomas@vondra.me> wrote:

On 11/13/24 10:38, Amit Kapila wrote:

On Tue, Nov 12, 2024 at 6:29 PM Tomas Vondra <tomas@vondra.me> wrote:

Sure, maybe fixing LogicalIncreaseRestartDecodingForSlot() is enough to
fix this particular case. But I'd be happier if we could also add
asserts checking the LSN advances, to detect similar issues that we may
be unaware of yet.

As most of us lean towards fixing
LogicalIncreaseRestartDecodingForSlot(), let's fix that in the HEAD
and back branches. Separately we can consider other asserts just for
HEAD that you think will make the code robust and help avoid such bugs
in the future.

+1 to that

+1.

Tomas, are you going to update the patch you shared before[1] or shall I?

Please feel free to take over. I'm busy with some other stuff and the
initial analysis was done by you anyway.

regards

--
Tomas Vondra

#30Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Tomas Vondra (#29)
1 attachment(s)
Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4

On Wed, Nov 13, 2024 at 3:47 PM Tomas Vondra <tomas@vondra.me> wrote:

On 11/14/24 00:33, Masahiko Sawada wrote:

On Wed, Nov 13, 2024 at 3:53 AM Tomas Vondra <tomas@vondra.me> wrote:

On 11/13/24 10:38, Amit Kapila wrote:

On Tue, Nov 12, 2024 at 6:29 PM Tomas Vondra <tomas@vondra.me> wrote:

Sure, maybe fixing LogicalIncreaseRestartDecodingForSlot() is enough to
fix this particular case. But I'd be happier if we could also add
asserts checking the LSN advances, to detect similar issues that we may
be unaware of yet.

As most of us lean towards fixing
LogicalIncreaseRestartDecodingForSlot(), let's fix that in the HEAD
and back branches. Separately we can consider other asserts just for
HEAD that you think will make the code robust and help avoid such bugs
in the future.

+1 to that

+1.

Tomas, are you going to update the patch you shared before[1] or shall I?

Please feel free to take over. I'm busy with some other stuff and the
initial analysis was done by you anyway.

Sure. I've attached the updated patch. I just added the commit message.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachments:

v1-0001-Fix-a-possibility-of-logical-replication-slot-s-r.patch (application/octet-stream)
From c7f4297363c376b66169533d540e390d0932e1c6 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <msawada@postgresql.orig>
Date: Wed, 13 Nov 2024 14:58:01 -0800
Subject: [PATCH v1] Fix a possibility of logical replication slot's
 restart_lsn going backwards.

Previously LogicalIncreaseRestartDecodingForSlot() accidentally
accepted any LSN as the candidate_lsn and candidate_valid after the
restart_lsn of the replication slot was updated, so it potentially
caused the restart_lsn to move backwards.

A scenario where this could happen in logical replication is: after a
logical replication restart, based on previous candidate_lsn and
candidate_valid values in memory, the restart_lsn advances upon
receiving a subscriber acknowledgment. Then, logical decoding restarts
from an older point, setting candidate_lsn and candidate_valid based
on an old RUNNING_XACTS record. Subsequent subscriber acknowledgments
then update the restart_lsn to an LSN older than the current value.

In the reported case, after WAL files were removed by a checkpoint,
the retreated restart_lsn prevented logical replication from
restarting due to missing WAL segments.

This change essentially modifies the 'if' condition to 'else if'
condition within the function. The previous code had an asymmetry in
this regard compared to LogicalIncreaseXminForSlot(), which does
almost the same thing for different fields.

The WAL removal issue was reported by Hubert Depesz Lubaczewski.

Backpatch to all supported versions, since the bug exists since 9.4
where logical decoding was introduced.

Author: Tomas Vondra
Reviewed-by: Ashutosh Bapat, Amit Kapila, Masahiko Sawada
Discussion: https://postgr.es/m/Yz2hivgyjS1RfMKs%40depesz.com
Discussion: https://postgr.es/m/85fff40e-148b-4e86-b921-b4b846289132%40vondra.me
Backpatch-through: 13
---
 src/backend/replication/logical/logical.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 3fe1774a1e..9437ee3fb4 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -1761,6 +1761,7 @@ void
 LogicalIncreaseRestartDecodingForSlot(XLogRecPtr current_lsn, XLogRecPtr restart_lsn)
 {
 	bool		updated_lsn = false;
+	bool		spin_released = false;
 	ReplicationSlot *slot;
 
 	slot = MyReplicationSlot;
@@ -1794,12 +1795,14 @@ LogicalIncreaseRestartDecodingForSlot(XLogRecPtr current_lsn, XLogRecPtr restart
 	 * might never end up updating if the receiver acks too slowly. A missed
 	 * value here will just cause some extra effort after reconnecting.
 	 */
-	if (slot->candidate_restart_valid == InvalidXLogRecPtr)
+	else if (slot->candidate_restart_valid == InvalidXLogRecPtr)
 	{
 		slot->candidate_restart_valid = current_lsn;
 		slot->candidate_restart_lsn = restart_lsn;
 		SpinLockRelease(&slot->mutex);
 
+		spin_released = true;
+
 		elog(DEBUG1, "got new restart lsn %X/%X at %X/%X",
 			 LSN_FORMAT_ARGS(restart_lsn),
 			 LSN_FORMAT_ARGS(current_lsn));
@@ -1815,6 +1818,8 @@ LogicalIncreaseRestartDecodingForSlot(XLogRecPtr current_lsn, XLogRecPtr restart
 		confirmed_flush = slot->data.confirmed_flush;
 		SpinLockRelease(&slot->mutex);
 
+		spin_released = true;
+
 		elog(DEBUG1, "failed to increase restart lsn: proposed %X/%X, after %X/%X, current candidate %X/%X, current after %X/%X, flushed up to %X/%X",
 			 LSN_FORMAT_ARGS(restart_lsn),
 			 LSN_FORMAT_ARGS(current_lsn),
@@ -1823,6 +1828,9 @@ LogicalIncreaseRestartDecodingForSlot(XLogRecPtr current_lsn, XLogRecPtr restart
 			 LSN_FORMAT_ARGS(confirmed_flush));
 	}
 
+	if (!spin_released)
+		SpinLockRelease(&slot->mutex);
+
 	/* candidates are already valid with the current flush position, apply */
 	if (updated_lsn)
 		LogicalConfirmReceivedLocation(slot->data.confirmed_flush);
-- 
2.43.5

#31Ashutosh Bapat
ashutosh.bapat.oss@gmail.com
In reply to: Tomas Vondra (#27)
Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4

On Wed, Nov 13, 2024 at 5:25 PM Tomas Vondra <tomas@vondra.me> wrote:

On 11/13/24 05:38, Ashutosh Bapat wrote:

...

Here's way we can fix SnapBuildProcessRunningXacts() similar to
DecodeCommit(). DecodeCommit() uses SnapBuildXactNeedsSkip() to decide
whether a given transaction should be decoded or not.
/*
* Should the contents of transaction ending at 'ptr' be decoded?
*/
bool
SnapBuildXactNeedsSkip(SnapBuild *builder, XLogRecPtr ptr)
{
return ptr < builder->start_decoding_at;
}

Similar to SnapBuild::start_decoding_at we could maintain a field
SnapBuild::start_reading_at to the LSN from which the WAL sender would
start reading WAL. If candidate_restart_lsn produced by a running
transactions WAL record is less than SnapBuild::start_reading_at,
SnapBuildProcessRunningXacts() won't call
LogicalIncreaseRestartDecodingForSlot() with that candiate LSN. We
won't access the slot here and the solution will be inline with
DecodeCommit() which skips the transactions.

Could you maybe write a patch doing this? That would allow proper
testing etc.

Here's a quick and dirty patch which describes the idea. I didn't get
time to implement code to move SnapBuild::restart_lsn if
SnapBuild::start_decoding_at moves forward while building the initial
snapshot. I am not sure whether that's necessary either.

I have added three elogs to see if the logic is working as expected. I
see two of the elogs in the patch in the server log when I run tests from
tests/subscription and tests/recovery. But I do not see the third one.
That either means that the situation causing the bug is not covered by
those tests or that the fix is not triggered. If you run your reproduction
and still see the crashes, please provide the output of those elog
messages along with the rest of the elogs you have added.

--
Best Wishes,
Ashutosh Bapat

#32Michael Paquier
michael@paquier.xyz
In reply to: Ashutosh Bapat (#31)
Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4

On Thu, Nov 14, 2024 at 11:45:56AM +0530, Ashutosh Bapat wrote:

Here's a quick and dirty patch which describes the idea. I didn't get
time to implement code to move SnapBuild::restart_lsn if
SnapBuild::start_decoding_at moves forward while building initial
snapshot. I am not sure whether that's necessary either.

I have added three elogs to see if the logic is working as expected. I
see two of the elogs in patch in the server log when I run tests from
tests/subscription and tests/recovery. But I do not see the third one.
That either means that the situation causing the bug is not covered by
those tests or the fix is not triggered. If you run your reproduction
and still see the crashes please provide the output of those elog
messages along with the rest of the elogs you have added.

Forgot the attachment, perhaps?
--
Michael

#33Ashutosh Bapat
ashutosh.bapat.oss@gmail.com
In reply to: Michael Paquier (#32)
1 attachment(s)
Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4

Thanks for noticing it. Sorry. Attached.

On Thu, Nov 14, 2024 at 12:04 PM Michael Paquier <michael@paquier.xyz> wrote:

On Thu, Nov 14, 2024 at 11:45:56AM +0530, Ashutosh Bapat wrote:

Here's a quick and dirty patch which describes the idea. I didn't get
time to implement code to move SnapBuild::restart_lsn if
SnapBuild::start_decoding_at moves forward while building initial
snapshot. I am not sure whether that's necessary either.

I have added three elogs to see if the logic is working as expected. I
see two of the elogs in patch in the server log when I run tests from
tests/subscription and tests/recovery. But I do not see the third one.
That either means that the situation causing the bug is not covered by
those tests or the fix is not triggered. If you run your reproduction
and still see the crashes please provide the output of those elog
messages along with the rest of the elogs you have added.

Forgot the attachment, perhaps?
--
Michael

--
Best Wishes,
Ashutosh Bapat

Attachments:

ignore_running_xact_candidates.patch (text/x-patch; charset=US-ASCII)
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 3fe1774a1e9..e42fe52de39 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -147,7 +147,7 @@ CheckLogicalDecodingRequirements(void)
  * CreateDecodingContext() performing common tasks.
  */
 static LogicalDecodingContext *
-StartupDecodingContext(List *output_plugin_options,
+StartupDecodingContext(List *output_plugin_options, XLogRecPtr restart_lsn,
 					   XLogRecPtr start_lsn,
 					   TransactionId xmin_horizon,
 					   bool need_full_snapshot,
@@ -212,7 +212,7 @@ StartupDecodingContext(List *output_plugin_options,
 
 	ctx->reorder = ReorderBufferAllocate();
 	ctx->snapshot_builder =
-		AllocateSnapshotBuilder(ctx->reorder, xmin_horizon, start_lsn,
+		AllocateSnapshotBuilder(ctx->reorder, xmin_horizon, restart_lsn, start_lsn,
 								need_full_snapshot, in_create, slot->data.two_phase_at);
 
 	ctx->reorder->private_data = ctx;
@@ -438,7 +438,7 @@ CreateInitDecodingContext(const char *plugin,
 	ReplicationSlotMarkDirty();
 	ReplicationSlotSave();
 
-	ctx = StartupDecodingContext(NIL, restart_lsn, xmin_horizon,
+	ctx = StartupDecodingContext(NIL, restart_lsn, restart_lsn, xmin_horizon,
 								 need_full_snapshot, false, true,
 								 xl_routine, prepare_write, do_write,
 								 update_progress);
@@ -591,7 +591,7 @@ CreateDecodingContext(XLogRecPtr start_lsn,
 		start_lsn = slot->data.confirmed_flush;
 	}
 
-	ctx = StartupDecodingContext(output_plugin_options,
+	ctx = StartupDecodingContext(output_plugin_options, slot->data.restart_lsn,
 								 start_lsn, InvalidTransactionId, false,
 								 fast_forward, false, xl_routine, prepare_write,
 								 do_write, update_progress);
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index a6a4da32668..62f92b51917 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -184,6 +184,7 @@ static void SnapBuildRestoreContents(int fd, char *dest, Size size, const char *
 SnapBuild *
 AllocateSnapshotBuilder(ReorderBuffer *reorder,
 						TransactionId xmin_horizon,
+						XLogRecPtr restart_lsn,
 						XLogRecPtr start_lsn,
 						bool need_full_snapshot,
 						bool in_slot_creation,
@@ -217,6 +218,7 @@ AllocateSnapshotBuilder(ReorderBuffer *reorder,
 
 	builder->initial_xmin_horizon = xmin_horizon;
 	builder->start_decoding_at = start_lsn;
+	builder->restart_lsn = restart_lsn;
 	builder->in_slot_creation = in_slot_creation;
 	builder->building_full_snapshot = need_full_snapshot;
 	builder->two_phase_at = two_phase_at;
@@ -1163,7 +1165,17 @@ SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn, xl_running_xact
 	 * anything because we hadn't reached a consistent state yet.
 	 */
 	if (txn != NULL && txn->restart_decoding_lsn != InvalidXLogRecPtr)
-		LogicalIncreaseRestartDecodingForSlot(lsn, txn->restart_decoding_lsn);
+	{
+		elog(LOG, "running xact record at %X/%X, builder::restart_lsn = %X/%X, restart_decoding_lsn of txn (%d): %X/%X",
+			LSN_FORMAT_ARGS(lsn), LSN_FORMAT_ARGS(builder->restart_lsn), txn->xid, LSN_FORMAT_ARGS(txn->restart_decoding_lsn));
+
+		if (txn->restart_decoding_lsn >= builder->restart_lsn)
+			LogicalIncreaseRestartDecodingForSlot(lsn, txn->restart_decoding_lsn);
+		else
+			elog(LOG, "skipped candidates generated by running transaction record at %X/%X",
+				LSN_FORMAT_ARGS(lsn));
+
+	}
 
 	/*
 	 * No in-progress transaction, can reuse the last serialized snapshot if
@@ -1172,8 +1184,18 @@ SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn, xl_running_xact
 	else if (txn == NULL &&
 			 builder->reorder->current_restart_decoding_lsn != InvalidXLogRecPtr &&
 			 builder->last_serialized_snapshot != InvalidXLogRecPtr)
-		LogicalIncreaseRestartDecodingForSlot(lsn,
+	{
+
+		elog(LOG, "running xact record at %X/%X, last serialized snapshot (LSN) %X/%X, builder::restart_lsn %X/%X, builder::rb::current_restart_decoding_lsn %X/%X",
+			LSN_FORMAT_ARGS(lsn), LSN_FORMAT_ARGS(builder->last_serialized_snapshot), LSN_FORMAT_ARGS(builder->restart_lsn), LSN_FORMAT_ARGS(builder->reorder->current_restart_decoding_lsn));
+
+		if (builder->last_serialized_snapshot >= builder->restart_lsn)
+			LogicalIncreaseRestartDecodingForSlot(lsn,
 											  builder->last_serialized_snapshot);
+		else
+			elog(LOG, "skipped candidates generated by running transaction record at %X/%X",
+					LSN_FORMAT_ARGS(lsn));
+	}
 }
 
 
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index 3c1454df993..f9a5cc7bab2 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -64,7 +64,8 @@ struct xl_running_xacts;
 extern void CheckPointSnapBuild(void);
 
 extern SnapBuild *AllocateSnapshotBuilder(struct ReorderBuffer *reorder,
-										  TransactionId xmin_horizon, XLogRecPtr start_lsn,
+										  TransactionId xmin_horizon, XLogRecPtr restart_lsn,
+										  XLogRecPtr start_lsn,
 										  bool need_full_snapshot,
 										  bool in_slot_creation,
 										  XLogRecPtr two_phase_at);
diff --git a/src/include/replication/snapbuild_internal.h b/src/include/replication/snapbuild_internal.h
index 1e295c75076..9d912ea87bb 100644
--- a/src/include/replication/snapbuild_internal.h
+++ b/src/include/replication/snapbuild_internal.h
@@ -43,6 +43,13 @@ struct SnapBuild
 	 */
 	XLogRecPtr	start_decoding_at;
 
+	/*
+	 * LSN from which the WAL reader will start reading WAL. Ignore
+	 * candidate_restart_lsn before this values produced by running transaction
+	 * log records.
+	 */
+	XLogRecPtr restart_lsn;
+
 	/*
 	 * LSN at which two-phase decoding was enabled or LSN at which we found a
 	 * consistent point at the time of slot creation.
#34Amit Kapila
amit.kapila16@gmail.com
In reply to: Tomas Vondra (#25)
Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4

On Wed, Nov 13, 2024 at 5:23 PM Tomas Vondra <tomas@vondra.me> wrote:

On 11/13/24 11:59, Amit Kapila wrote:

On Tue, Nov 12, 2024 at 12:43 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:

On Tue, Nov 12, 2024 at 12:02 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Mon, Nov 11, 2024 at 2:08 PM Tomas Vondra <tomas@vondra.me> wrote:

But neither of those fixes prevents backwards move for confirmed_flush
LSN, as enforced by asserts in the 0005 patch. I don't know if this
assert is incorrect or now. It seems natural that once we get a
confirmation for some LSN, we can't move before that position, but I'm
not sure about that. Maybe it's too strict.

Hmm, I'm concerned that it might be another problem. I think there are
some cases where a subscriber sends a flush position older than slot's
confirmed_flush as a feedback message.

Right, it can happen for cases where subscribers doesn't have to do
anything (for example DDLs) like I have explained in one of my emails
[1]

Thanks. I admit not being entirely familiar with all the details, but
doesn't that email explain more "Why it currently happens?" rather than
"Is this what should be happening?"

Right.

Sure, the subscriber needs to confirm changes for which nothing needs to
be done, like DDL. But isn't there a better way to do that, rather than
allowing confirmed_lsn to go backwards?

Firstly, I agree that we should try to find ways to avoid
confirmed_lsn going backward. We can try to explore solutions on both
the publisher and subscriber side.

But it seems to be dangerous if
we always accept it as a new confirmed_flush value. It could happen
that confirm_flush could be set to a LSN older than restart_lsn.

Possible, though I haven't tried to reproduce such a case. But, will
it create any issues? I don't know if there is any benefit in allowing
to move confirmed_flush LSN backward. AFAIR, we don't allow such
backward values to persist. They will temporarily be in memory. I
think as a separate patch we should prevent it from moving backward.

If confirmed_flush LSN moves backwards, it means the transactions
which were thought to be replicated earlier are no longer considered
to be replicated. This means that the restart_lsn of the slot needs to
be at least far back as the oldest of starting points of those
transactions. Thus restart_lsn of slot has to be pushed further back.

I don't see a reason to move restart_lsn backward. Why do you think so?

I think what Ashutosh is saying that if confirmed_flush is allowed to
move backwards, that may result in start_lsn moving backwards too. And
we need to be able to decode all transactions committed since start_lsn,
so if start_lsn moves backwards, maybe restart_lsn needs to move
backwards too. I have no idea if confirmed_flush/start_lsn can move
backwards enough to require restart_lsn to move, though.

Anyway, these discussions are a good illustration why I think allowing
these LSNs to move backwards is a problem. It either causes bugs (like
with breaking replication slots) and/or it makes the reasoning about
correct behavior much harder.

Right, I think one needs to come up with some reproducible scenarios
where these cause any kind of problem or inefficiency. Then, we can
discuss the solutions accordingly. I mean to say that someone has to
put effort into making a somewhat more solid case for changing this code,
because it may not be a good idea to change something just based on
some theory, unless it is just adding some assertions.

--
With Regards,
Amit Kapila.

#35Amit Kapila
amit.kapila16@gmail.com
In reply to: Masahiko Sawada (#30)
Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4

On Thu, Nov 14, 2024 at 7:08 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Sure. I've attached the updated patch. I just added the commit message.

@@ -1815,6 +1818,8 @@ LogicalIncreaseRestartDecodingForSlot(XLogRecPtr
current_lsn, XLogRecPtr restart
confirmed_flush = slot->data.confirmed_flush;
SpinLockRelease(&slot->mutex);

+ spin_released = true;
+
elog(DEBUG1, "failed to increase restart lsn: proposed %X/%X, after
%X/%X, current candidate %X/%X, current after %X/%X, flushed up to
%X/%X",
LSN_FORMAT_ARGS(restart_lsn),
LSN_FORMAT_ARGS(current_lsn),
@@ -1823,6 +1828,9 @@ LogicalIncreaseRestartDecodingForSlot(XLogRecPtr
current_lsn, XLogRecPtr restart
LSN_FORMAT_ARGS(confirmed_flush));
}

+ if (!spin_released)
+ SpinLockRelease(&slot->mutex);

This coding pattern looks odd to me. We can consider releasing the
spinlock in the other two if/else if checks. I understand it is a
matter of individual preference, so if you and/or others prefer the
current way, that is also fine with me. Other than this, the patch
looks good to me.

--
With Regards,
Amit Kapila.

#36Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Amit Kapila (#35)
1 attachment(s)
Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4

On Thu, Nov 14, 2024 at 10:16 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Nov 14, 2024 at 7:08 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Sure. I've attached the updated patch. I just added the commit message.

@@ -1815,6 +1818,8 @@ LogicalIncreaseRestartDecodingForSlot(XLogRecPtr
current_lsn, XLogRecPtr restart
confirmed_flush = slot->data.confirmed_flush;
SpinLockRelease(&slot->mutex);

+ spin_released = true;
+
elog(DEBUG1, "failed to increase restart lsn: proposed %X/%X, after
%X/%X, current candidate %X/%X, current after %X/%X, flushed up to
%X/%X",
LSN_FORMAT_ARGS(restart_lsn),
LSN_FORMAT_ARGS(current_lsn),
@@ -1823,6 +1828,9 @@ LogicalIncreaseRestartDecodingForSlot(XLogRecPtr
current_lsn, XLogRecPtr restart
LSN_FORMAT_ARGS(confirmed_flush));
}

+ if (!spin_released)
+ SpinLockRelease(&slot->mutex);

This coding pattern looks odd to me. We can consider releasing
spinlock in the other two if/else if checks. I understand it is a
matter of individual preference, so, if you and or others prefer the
current way, that is also fine with me. Other than this, the patch
looks good to me.

Indeed, I prefer your idea. I've attached the updated patch. I'll push
it early next week unless there are further comments.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachments:

v2-0001-Fix-a-possibility-of-logical-replication-slot-s-r.patch (application/octet-stream)
From 3bb02cd66a1ae6204a14e58f34b4f4a07d1dae93 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <msawada@postgresql.orig>
Date: Wed, 13 Nov 2024 14:58:01 -0800
Subject: [PATCH v2] Fix a possibility of logical replication slot's
 restart_lsn going backwards.

Previously LogicalIncreaseRestartDecodingForSlot() accidentally
accepted any LSN as the candidate_lsn and candidate_valid after the
restart_lsn of the replication slot was updated, so it potentially
caused the restart_lsn to move backwards.

A scenario where this could happen in logical replication is: after a
logical replication restart, based on previous candidate_lsn and
candidate_valid values in memory, the restart_lsn advances upon
receiving a subscriber acknowledgment. Then, logical decoding restarts
from an older point, setting candidate_lsn and candidate_valid based
on an old RUNNING_XACTS record. Subsequent subscriber acknowledgments
then update the restart_lsn to an LSN older than the current value.

In the reported case, after WAL files were removed by a checkpoint,
the retreated restart_lsn prevented logical replication from
restarting due to missing WAL segments.

This change essentially modifies the 'if' condition to 'else if'
condition within the function. The previous code had an asymmetry in
this regard compared to LogicalIncreaseXminForSlot(), which does
almost the same thing for different fields.

The WAL removal issue was reported by Hubert Depesz Lubaczewski.

Backpatch to all supported versions, since the bug exists since 9.4
where logical decoding was introduced.

Reviewed-by: Tomas Vondra, Ashutosh Bapat, Amit Kapila
Discussion: https://postgr.es/m/Yz2hivgyjS1RfMKs%40depesz.com
Discussion: https://postgr.es/m/85fff40e-148b-4e86-b921-b4b846289132%40vondra.me
Backpatch-through: 13
---
 src/backend/replication/logical/logical.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 3fe1774a1e..e941bb491d 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -1774,6 +1774,7 @@ LogicalIncreaseRestartDecodingForSlot(XLogRecPtr current_lsn, XLogRecPtr restart
 	/* don't overwrite if have a newer restart lsn */
 	if (restart_lsn <= slot->data.restart_lsn)
 	{
+		SpinLockRelease(&slot->mutex);
 	}
 
 	/*
@@ -1784,6 +1785,7 @@ LogicalIncreaseRestartDecodingForSlot(XLogRecPtr current_lsn, XLogRecPtr restart
 	{
 		slot->candidate_restart_valid = current_lsn;
 		slot->candidate_restart_lsn = restart_lsn;
+		SpinLockRelease(&slot->mutex);
 
 		/* our candidate can directly be used */
 		updated_lsn = true;
@@ -1794,7 +1796,7 @@ LogicalIncreaseRestartDecodingForSlot(XLogRecPtr current_lsn, XLogRecPtr restart
 	 * might never end up updating if the receiver acks too slowly. A missed
 	 * value here will just cause some extra effort after reconnecting.
 	 */
-	if (slot->candidate_restart_valid == InvalidXLogRecPtr)
+	else if (slot->candidate_restart_valid == InvalidXLogRecPtr)
 	{
 		slot->candidate_restart_valid = current_lsn;
 		slot->candidate_restart_lsn = restart_lsn;
-- 
2.43.5

#37Tomas Vondra
tomas@vondra.me
In reply to: Masahiko Sawada (#36)
Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4

On 11/15/24 18:40, Masahiko Sawada wrote:

On Thu, Nov 14, 2024 at 10:16 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Nov 14, 2024 at 7:08 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Sure. I've attached the updated patch. I just added the commit message.

@@ -1815,6 +1818,8 @@ LogicalIncreaseRestartDecodingForSlot(XLogRecPtr
current_lsn, XLogRecPtr restart
confirmed_flush = slot->data.confirmed_flush;
SpinLockRelease(&slot->mutex);

+ spin_released = true;
+
elog(DEBUG1, "failed to increase restart lsn: proposed %X/%X, after
%X/%X, current candidate %X/%X, current after %X/%X, flushed up to
%X/%X",
LSN_FORMAT_ARGS(restart_lsn),
LSN_FORMAT_ARGS(current_lsn),
@@ -1823,6 +1828,9 @@ LogicalIncreaseRestartDecodingForSlot(XLogRecPtr
current_lsn, XLogRecPtr restart
LSN_FORMAT_ARGS(confirmed_flush));
}

+ if (!spin_released)
+ SpinLockRelease(&slot->mutex);

This coding pattern looks odd to me. We can consider releasing the
spinlock in the other two if/else if checks. I understand it is a
matter of individual preference, so, if you and/or others prefer the
current way, that is also fine with me. Other than this, the patch
looks good to me.

Indeed, I prefer your idea. I've attached the updated patch. I'll push
it early next week unless there are further comments.

I'm not particularly attached to how I did this in my WIP patch, it was
simply the simplest way to make it work for experimentation. I'd imagine
it'd be best to just mirror how LogicalIncreaseXminForSlot() does this.
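
For reference, that function keeps the whole decision in one
if / else if chain under the spinlock, releases it exactly once, and
defers the logging until after the release - roughly this shape
(simplified, from memory):

	SpinLockAcquire(&slot->mutex);

	/* don't overwrite if we already have a newer xmin */
	if (TransactionIdPrecedesOrEquals(xmin, slot->data.catalog_xmin))
	{
	}
	/* the client already confirmed up to this point, candidate can be used directly */
	else if (current_lsn <= slot->data.confirmed_flush)
	{
		slot->candidate_catalog_xmin = xmin;
		slot->candidate_xmin_lsn = current_lsn;
		updated_xmin = true;
	}
	/* only take a new candidate if the previous one has been applied */
	else if (slot->candidate_xmin_lsn == InvalidXLogRecPtr)
	{
		slot->candidate_catalog_xmin = xmin;
		slot->candidate_xmin_lsn = current_lsn;
		got_new_xmin = true;
	}

	SpinLockRelease(&slot->mutex);

	if (got_new_xmin)
		elog(DEBUG1, "got new catalog xmin %u at %X/%X", xmin,
			 LSN_FORMAT_ARGS(current_lsn));

	/* and if updated_xmin was set, the candidate is applied right away
	 * via LogicalConfirmReceivedLocation() */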

regards

--
Tomas Vondra

#38Tomas Vondra
tomas@vondra.me
In reply to: Amit Kapila (#34)
Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4

On 11/15/24 04:26, Amit Kapila wrote:

On Wed, Nov 13, 2024 at 5:23 PM Tomas Vondra <tomas@vondra.me> wrote:

On 11/13/24 11:59, Amit Kapila wrote:

On Tue, Nov 12, 2024 at 12:43 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:

On Tue, Nov 12, 2024 at 12:02 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Mon, Nov 11, 2024 at 2:08 PM Tomas Vondra <tomas@vondra.me> wrote:

But neither of those fixes prevents backwards move for confirmed_flush
LSN, as enforced by asserts in the 0005 patch. I don't know if this
assert is incorrect or not. It seems natural that once we get a
confirmation for some LSN, we can't move before that position, but I'm
not sure about that. Maybe it's too strict.

Hmm, I'm concerned that it might be another problem. I think there are
some cases where a subscriber sends a flush position older than slot's
confirmed_flush as a feedback message.

Right, it can happen for cases where the subscriber doesn't have to do
anything (for example DDLs), as I have explained in one of my emails
[1]

Thanks. I admit not being entirely familiar with all the details, but
doesn't that email explain more "Why it currently happens?" rather than
"Is this what should be happening?"

Right.

Sure, the subscriber needs to confirm changes for which nothing needs to
be done, like DDL. But isn't there a better way to do that, rather than
allowing confirmed_lsn to go backwards?

Firstly, I agree that we should try to find ways to avoid
confirmed_lsn going backward. We can try to explore solutions on both
the publisher and subscriber side.

Good that we agree.

I'm not sure how we could do this on the subscriber side. Or more
precisely, I see "LSNs don't go backwards" as an invariant that the
publisher can enforce for all subscribers (with whatever plugin).

And if a subscriber responds in a way that contradicts the invariant,
I'd treat that as a subscriber bug. I don't think we should rely on the
subscriber to do the right thing - we have little control over that, and
when things break (say, WAL gets removed too early), people generally
point at the publisher.
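
To be concrete, enforcing that on the publisher side might be as simple
as treating a confirmation that would move confirmed_flush backwards as
a no-op - just a hypothetical sketch to illustrate, not something I'm
proposing verbatim:

	/*
	 * Hypothetical guard at the start of LogicalConfirmReceivedLocation():
	 * ignore confirmations older than what the slot already recorded, so a
	 * misbehaving subscriber can't move confirmed_flush backwards. (The
	 * owning walsender is the only writer of confirmed_flush, so an
	 * unlocked read should be fine here.)
	 */
	if (lsn <= MyReplicationSlot->data.confirmed_flush)
		return;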

I think what Ashutosh is saying is that if confirmed_flush is allowed to
move backwards, that may result in start_lsn moving backwards too. And
we need to be able to decode all transactions committed since start_lsn,
so if start_lsn moves backwards, maybe restart_lsn needs to move
backwards too. I have no idea if confirmed_flush/start_lsn can move
backwards enough to require restart_lsn to move, though.

Anyway, these discussions are a good illustration of why I think allowing
these LSNs to move backwards is a problem. It either causes bugs (like
breaking replication slots) and/or it makes the reasoning about
correct behavior much harder.

Right, I think one needs to come up with some reproducible scenarios
where these cause any kind of problem or inefficiency. Then, we can
discuss the solutions accordingly. I mean to say that someone has to
put effort into making a bit more solid case for changing this code
because it may not be a good idea to change something just based on
some theory unless it is just adding some assertions.

I don't know, but isn't this a bit backwards? I understand we don't want
to just recklessly change long-standing code, but I think it's reasonable
to enforce sensible invariants like "LSNs can't go backwards" to ensure
correct behavior. And then if there's a performance/efficiency problem
and someone proposes a fix, it's up to that patch to prove it's correct.

Here we seem to have code no one is quite sure is correct, but we're
asking for reproducible scenarios proving the existence of a bug. The
bugs can be quite subtle / hard to reproduce, as evidenced by the
restart_lsn issue present since 9.4. Maybe this should be the other way
around, i.e. make it "strict" and then prove that (a) relaxing the check
is still correct and (b) it actually has other benefits.

That being said, I don't have a clear idea how to change this.

So +1 to at least introducing the asserts.

regards

--
Tomas Vondra

#39Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Tomas Vondra (#37)
Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4

On Fri, Nov 15, 2024 at 9:48 AM Tomas Vondra <tomas@vondra.me> wrote:

On 11/15/24 18:40, Masahiko Sawada wrote:

On Thu, Nov 14, 2024 at 10:16 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Nov 14, 2024 at 7:08 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Sure. I've attached the updated patch. I just added the commit message.

@@ -1815,6 +1818,8 @@ LogicalIncreaseRestartDecodingForSlot(XLogRecPtr
current_lsn, XLogRecPtr restart
confirmed_flush = slot->data.confirmed_flush;
SpinLockRelease(&slot->mutex);

+ spin_released = true;
+
elog(DEBUG1, "failed to increase restart lsn: proposed %X/%X, after
%X/%X, current candidate %X/%X, current after %X/%X, flushed up to
%X/%X",
LSN_FORMAT_ARGS(restart_lsn),
LSN_FORMAT_ARGS(current_lsn),
@@ -1823,6 +1828,9 @@ LogicalIncreaseRestartDecodingForSlot(XLogRecPtr
current_lsn, XLogRecPtr restart
LSN_FORMAT_ARGS(confirmed_flush));
}

+ if (!spin_released)
+ SpinLockRelease(&slot->mutex);

This coding pattern looks odd to me. We can consider releasing the
spinlock in the other two if/else if checks. I understand it is a
matter of individual preference, so, if you and/or others prefer the
current way, that is also fine with me. Other than this, the patch
looks good to me.

Indeed, I prefer your idea. I've attached the updated patch. I'll push
it early next week unless there are further comments.

I'm not particularly attached to how I did this in my WIP patch, it was
simply the simplest way to make it work for experimentation. I'd imagine
it'd be best to just mirror how LogicalIncreaseXminForSlot() does this.

I was going to push it next Monday, but we're going to have
out-of-cycle minor releases next week and we don't have to wait for the
February minor releases given that we already agreed on the current
fix. So I pushed the fix.

Looking at the buildfarm, it seems that animal alligator started
complaining about a build error:

plancat.c: In function ‘get_relation_info’:
plancat.c:331:54: error: assignment to ‘void (*)(void)’ from
incompatible pointer type ‘amcostestimate_function’ {aka ‘void
(*)(struct PlannerInfo *, struct IndexPath *, double, double *,
double *, double *, double *, double *)’}
[-Wincompatible-pointer-types]
331 | info->amcostestimate = amroutine->amcostestimate;
| ^
make[4]: *** [<builtin>: plancat.o] Error 1

I think that it's not related to this fix or recent commits, but rather
caused by changes happening in gcc [1] (it's using a nightly gcc build).

From a success log, 5f28e6b [2]:

configure: using compiler=gcc (GCC) 15.0.0 20241115 (experimental)

On the other hand, from a failure log [3]:

configure: using compiler=gcc (GCC) 15.0.0 20241116 (experimental)

FYI it started to report build errors with the 20241116 build also on
v12, where this fix is not pushed.

Regards,

[1]: likely https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=55e3bd376b2214e200fa76d12b67ff259b06c212
[2]: https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=alligator&dt=2024-11-16%2000%3A20%3A53&stg=configure
[3]: https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=alligator&dt=2024-11-16%2002%3A14%3A05&stg=configure

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#40Tomas Vondra
tomas@vondra.me
In reply to: Masahiko Sawada (#39)
Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4

On 11/16/24 08:26, Masahiko Sawada wrote:

On Fri, Nov 15, 2024 at 9:48 AM Tomas Vondra <tomas@vondra.me> wrote:

On 11/15/24 18:40, Masahiko Sawada wrote:

On Thu, Nov 14, 2024 at 10:16 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Nov 14, 2024 at 7:08 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Sure. I've attached the updated patch. I just added the commit message.

@@ -1815,6 +1818,8 @@ LogicalIncreaseRestartDecodingForSlot(XLogRecPtr
current_lsn, XLogRecPtr restart
confirmed_flush = slot->data.confirmed_flush;
SpinLockRelease(&slot->mutex);

+ spin_released = true;
+
elog(DEBUG1, "failed to increase restart lsn: proposed %X/%X, after
%X/%X, current candidate %X/%X, current after %X/%X, flushed up to
%X/%X",
LSN_FORMAT_ARGS(restart_lsn),
LSN_FORMAT_ARGS(current_lsn),
@@ -1823,6 +1828,9 @@ LogicalIncreaseRestartDecodingForSlot(XLogRecPtr
current_lsn, XLogRecPtr restart
LSN_FORMAT_ARGS(confirmed_flush));
}

+ if (!spin_released)
+ SpinLockRelease(&slot->mutex);

This coding pattern looks odd to me. We can consider releasing the
spinlock in the other two if/else if checks. I understand it is a
matter of individual preference, so, if you and/or others prefer the
current way, that is also fine with me. Other than this, the patch
looks good to me.

Indeed, I prefer your idea. I've attached the updated patch. I'll push
it early next week unless there are further comments.

I'm not particularly attached to how I did this in my WIP patch, it was
simply the simplest way to make it work for experimentation. I'd imagine
it'd be best to just mirror how LogicalIncreaseXminForSlot() does this.

I was going to push it next Monday, but we're going to have
out-of-cycle minor releases next week and we don't have to wait for the
February minor releases given that we already agreed on the current
fix. So I pushed the fix.

Thanks. I see you only backpatched to 13, but I believe 12 will be
rewrapped too. So maybe backpatch to 12 too?

regards

--
Tomas Vondra

#41Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tomas Vondra (#40)
Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4

Tomas Vondra <tomas@vondra.me> writes:

Thanks. I see you only backpatched to 13, but I believe 12 will be
rewrapped too. So maybe backpatch to 12 too?

Please don't. The release team already discussed this and determined
that we want to push 12.22 with only the ROLE-regression fix. v12
is not "back in support" for any other purpose.

regards, tom lane

#42Tomas Vondra
tomas@vondra.me
In reply to: Tomas Vondra (#40)
1 attachment(s)
Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4

OK,

Now that the fix is committed, shall we introduce some of the asserts? I
believe there's agreement that restart_lsn shouldn't move backwards,
so I propose the attached patch.

I still think that not resetting the fields when releasing the slot and
allowing the values to move backwards is rather suspicious. But I don't
have any reproducer demonstrating an issue (beyond just hitting an
assert). Perhaps it's correct, but in that case it'd be good to add a
comment explaining why that's the case. Sadly, it has to be written by
someone else - I've been unable to form a justification for why it's OK :-(

regards

--
Tomas Vondra

Attachments:

0001-asserts-restart_lsn.patch (text/x-patch; charset=UTF-8)
From 25678fbbef96965dfb54387dacbe979c920b84e8 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Tue, 19 Nov 2024 16:21:57 +0100
Subject: [PATCH vasserts 1/3] asserts: restart_lsn

---
 src/backend/replication/logical/logical.c | 3 +++
 src/backend/replication/slot.c            | 1 +
 2 files changed, 4 insertions(+)

diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index e941bb491d8..7fb9f7bd0fa 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -391,6 +391,7 @@ CreateInitDecodingContext(const char *plugin,
 	else
 	{
 		SpinLockAcquire(&slot->mutex);
+		Assert(slot->data.restart_lsn <= restart_lsn);
 		slot->data.restart_lsn = restart_lsn;
 		SpinLockRelease(&slot->mutex);
 	}
@@ -1877,6 +1878,8 @@ LogicalConfirmReceivedLocation(XLogRecPtr lsn)
 		{
 			Assert(MyReplicationSlot->candidate_restart_lsn != InvalidXLogRecPtr);
 
+			Assert(MyReplicationSlot->data.restart_lsn <= MyReplicationSlot->candidate_restart_lsn);
+
 			MyReplicationSlot->data.restart_lsn = MyReplicationSlot->candidate_restart_lsn;
 			MyReplicationSlot->candidate_restart_lsn = InvalidXLogRecPtr;
 			MyReplicationSlot->candidate_restart_valid = InvalidXLogRecPtr;
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 6828100cf1a..0f38ccc8353 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1468,6 +1468,7 @@ ReplicationSlotReserveWal(void)
 			restart_lsn = GetXLogInsertRecPtr();
 
 		SpinLockAcquire(&slot->mutex);
+		Assert(slot->data.restart_lsn <= restart_lsn);
 		slot->data.restart_lsn = restart_lsn;
 		SpinLockRelease(&slot->mutex);
 
-- 
2.47.0