Fix slot synchronization with two_phase decoding enabled

amit.kapila16@gmail.com

12 months ago

In reply to: Zhijie Hou (Fujitsu) (#1)

Re: Fix slot synchronization with two_phase decoding enabled

On Tue, Mar 25, 2025 at 11:05 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

Hi,

When testing the slot synchronization with logical replication slots that
enabled two_phase decoding, I found that transactions prepared before two-phase
decoding is enabled may fail to replicate to the subscriber after being
committed on a promoted standby following a failover.

To reproduce this issue, please follow these steps (also detailed in the
attached TAP test, v1-0001):

1. sub: create a subscription with (two_phase = false)
2. primary (pub): prepare a txn A.
3. sub: alter subscription set (two_phase = true) and wait for the logical slot to
be synced to standby.
4. primary (pub): stop primary, promote the standby and let the subscriber use
the promoted standby as publisher.
5. promoted standby (pub): COMMIT PREPARED A;
6. sub: the apply worker will report the following ERROR because it didn't
receive the PREPARE.
ERROR: prepared transaction with identifier "pg_gid_16387_752" does not exist

I think the root cause of this issue is that the two_phase_at field of the
slot, which indicates the LSN from which two-phase decoding is enabled (used to
prevent duplicate data transmission for prepared transactions), is not
synchronized to the standby server.

In step 3, transaction A is not immediately replicated because it occurred
before enabling two-phase decoding. Thus, the prepared transaction should only
be replicated after decoding the final COMMIT PREPARED, as referenced in
ReorderBufferFinishPrepared(). However, due to the invalid two_phase_at on the
standby, the prepared transaction fails to send at that time.

This problem arises after the support for altering the two-phase option
(1462aad).

Thanks for the report and patch. I'll look into it.

--
With Regards,
Amit Kapila.

amit.kapila16@gmail.com

12 months ago

In reply to: Amit Kapila (#2)

Re: Fix slot synchronization with two_phase decoding enabled

On Tue, Mar 25, 2025 at 12:14 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Mar 25, 2025 at 11:05 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

Hi,

When testing the slot synchronization with logical replication slots that
enabled two_phase decoding, I found that transactions prepared before two-phase
decoding is enabled may fail to replicate to the subscriber after being
committed on a promoted standby following a failover.

To reproduce this issue, please follow these steps (also detailed in the
attached TAP test, v1-0001):

1. sub: create a subscription with (two_phase = false)
2. primary (pub): prepare a txn A.
3. sub: alter subscription set (two_phase = true) and wait for the logical slot to
be synced to standby.
4. primary (pub): stop primary, promote the standby and let the subscriber use
the promoted standby as publisher.
5. promoted standby (pub): COMMIT PREPARED A;
6. sub: the apply worker will report the following ERROR because it didn't
receive the PREPARE.
ERROR: prepared transaction with identifier "pg_gid_16387_752" does not exist

I think the root cause of this issue is that the two_phase_at field of the
slot, which indicates the LSN from which two-phase decoding is enabled (used to
prevent duplicate data transmission for prepared transactions), is not
synchronized to the standby server.

In step 3, transaction A is not immediately replicated because it occurred
before enabling two-phase decoding. Thus, the prepared transaction should only
be replicated after decoding the final COMMIT PREPARED, as referenced in
ReorderBufferFinishPrepared(). However, due to the invalid two_phase_at on the
standby, the prepared transaction fails to send at that time.

This problem arises after the support for altering the two-phase option
(1462aad).

I suspect that this can happen in PG17 as well, but I need to think
more about it to make a reproducible test case.

In the meantime, I have a few minor comments on the proposed patches:
1.
##################################################
# Promote the standby1 to primary. Confirm that:
# a) the slot 'lsub1_slot' and 'snap_test_slot' are retained on the new primary
# b) logical replication for regress_mysub1 is resumed successfully
after failover
# c) changes can be consumed from the synced slot 'snap_test_slot'
##################################################
-$standby1->start;
$primary->wait_for_replay_catchup($standby1);

# Capture the time before the standby is promoted
@@ -885,6 +940,15 @@ $standby1->wait_for_catchup('regress_mysub1');
is($subscriber1->safe_psql('postgres', q{SELECT count(*) FROM tab_int;}),
"20", 'data replicated from the new primary');

+# Commit the prepared transaction
+$standby1->safe_psql('postgres',
+ "COMMIT PREPARED 'test_twophase_slotsync';");
+$standby1->wait_for_catchup('regress_mysub1');
+
+# Confirm that the prepared transaction is replicated to the subscriber
+is($subscriber1->safe_psql('postgres', q{SELECT count(*) FROM tab_int;}),
+ "21", 'prepared data replicated from the new primary');

The commentary above this test should include information about
verifying the replication of previously prepared transactions after
promotion. Also, it would be better if confirm the commit prepared
before confirming the new Insert is replicated after promotion.

2.
@@ -249,6 +250,7 @@ update_local_synced_slot(RemoteSlot *remote_slot,
Oid remote_dbid,
  SpinLockAcquire(&slot->mutex);
  slot->data.restart_lsn = remote_slot->restart_lsn;
  slot->data.confirmed_flush = remote_slot->confirmed_lsn;
+ slot->data.two_phase_at = remote_slot->two_phase_at;

Why do we need to update the two_phase_at here when the patch does it
later in this function when local and remote values don't match?

--
With Regards,
Amit Kapila.

houzj.fnst@fujitsu.com

11 months ago

In reply to: Amit Kapila (#3)

RE: Fix slot synchronization with two_phase decoding enabled

On Thu, Mar 27, 2025 at 2:29 PM Amit Kapila wrote:

On Tue, Mar 25, 2025 at 12:1 PM Amit Kapila <amit.kapila16@gmail.com>
wrote:

On Tue, Mar 25, 2025 at 11:05 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

Hi,

When testing the slot synchronization with logical replication slots that
enabled two_phase decoding, I found that transactions prepared before

two-phase

decoding is enabled may fail to replicate to the subscriber after being
committed on a promoted standby following a failover.

To reproduce this issue, please follow these steps (also detailed in the
attached TAP test, v1-0001):

1. sub: create a subscription with (two_phase = false)
2. primary (pub): prepare a txn A.
3. sub: alter subscription set (two_phase = true) and wait for the logical

slot to

be synced to standby.
4. primary (pub): stop primary, promote the standby and let the subscriber

use

the promoted standby as publisher.
5. promoted standby (pub): COMMIT PREPARED A;
6. sub: the apply worker will report the following ERROR because it didn't
receive the PREPARE.
ERROR: prepared transaction with identifier "pg_gid_16387_752"

does not exist

I think the root cause of this issue is that the two_phase_at field of the
slot, which indicates the LSN from which two-phase decoding is enabled

(used to

prevent duplicate data transmission for prepared transactions), is not
synchronized to the standby server.

In step 3, transaction A is not immediately replicated because it occurred
before enabling two-phase decoding. Thus, the prepared transaction

should only

be replicated after decoding the final COMMIT PREPARED, as referenced

in

ReorderBufferFinishPrepared(). However, due to the invalid two_phase_at

on the

standby, the prepared transaction fails to send at that time.

This problem arises after the support for altering the two-phase option
(1462aad).

I suspect that this can happen in PG17 as well, but I need to think
more about it to make a reproducible test case.

After further analysis, I was able to reproduce the same issue [1]- pub: created a slot 'sub' with two_phase=false, then prepared a transaction - pub: after some activity, advanced the confirmed_flush_lsn of 'sub', so it is greater than prepared txn lsn. - sub: create subscription with (slot_name='sub', create_slot=false, failover = true, two_phase=true, copy_data=false); two_phase_at will be set to the same as confirmed_flush_lsn which is greater than the prepared transaction. - stop the primary and promote the standby. - commit the prepared transaction on standby, the following error will be reported on subscriber: in
PG 17.

However, since the proposed fix requires catalog changes and the issue is not a
security risk significant enough to justify changing the catalog in back
branches, we cannot back-patch the same solution. Following off-list
discussions with Amit and Kuroda-san, we are considering disallowing enabling
failover and two-phase decoding together for a replication slot, as suggested
in attachment 0002.

Another idea considered is to prevent the slot that enables two-phase decoding
from being synced to standby. IOW, this means displaying the failover field as
false in the view, if there is any possibility that transactions prepared
before the two_phase_at position exist (e.g., if restart_lsn is less than
two_phase_at). However, implementing this change would require additional
explanations to users for this new behavior, which seems tricky.

In the meantime, I have a few minor comments on the proposed patches:
1.
##################################################
# Promote the standby1 to primary. Confirm that:
# a) the slot 'lsub1_slot' and 'snap_test_slot' are retained on the new primary
# b) logical replication for regress_mysub1 is resumed successfully
after failover
# c) changes can be consumed from the synced slot 'snap_test_slot'
##################################################
-$standby1->start;
$primary->wait_for_replay_catchup($standby1);

# Capture the time before the standby is promoted
@@ -885,6 +940,15 @@ $standby1->wait_for_catchup('regress_mysub1');
is($subscriber1->safe_psql('postgres', q{SELECT count(*) FROM tab_int;}),
"20", 'data replicated from the new primary');
+# Commit the prepared transaction
+$standby1->safe_psql('postgres',
+ "COMMIT PREPARED 'test_twophase_slotsync';");
+$standby1->wait_for_catchup('regress_mysub1');
+
+# Confirm that the prepared transaction is replicated to the subscriber
+is($subscriber1->safe_psql('postgres', q{SELECT count(*) FROM tab_int;}),
+ "21", 'prepared data replicated from the new primary');
The commentary above this test should include information about
verifying the replication of previously prepared transactions after
promotion. Also, it would be better if confirm the commit prepared
before confirming the new Insert is replicated after promotion.
2.
@@ -249,6 +250,7 @@ update_local_synced_slot(RemoteSlot *remote_slot,
Oid remote_dbid,
SpinLockAcquire(&slot->mutex);
slot->data.restart_lsn = remote_slot->restart_lsn;
slot->data.confirmed_flush = remote_slot->confirmed_lsn;
+ slot->data.two_phase_at = remote_slot->two_phase_at;
Why do we need to update the two_phase_at here when the patch does it
later in this function when local and remote values don't match?

Thanks for the comments, they have been addressed in V2.

[1]: - pub: created a slot 'sub' with two_phase=false, then prepared a transaction - pub: after some activity, advanced the confirmed_flush_lsn of 'sub', so it is greater than prepared txn lsn. - sub: create subscription with (slot_name='sub', create_slot=false, failover = true, two_phase=true, copy_data=false); two_phase_at will be set to the same as confirmed_flush_lsn which is greater than the prepared transaction. - stop the primary and promote the standby. - commit the prepared transaction on standby, the following error will be reported on subscriber:
- pub: created a slot 'sub' with two_phase=false, then prepared a transaction
- pub: after some activity, advanced the confirmed_flush_lsn of 'sub', so it is
greater than prepared txn lsn.
- sub: create subscription with (slot_name='sub', create_slot=false, failover =
true, two_phase=true, copy_data=false); two_phase_at will be set to the same
as confirmed_flush_lsn which is greater than the prepared transaction.
- stop the primary and promote the standby.
- commit the prepared transaction on standby, the following error will be
reported on subscriber:

LOG: logical replication apply worker for subscription "sub2" has started
ERROR: prepared transaction with identifier "pg_gid_16398_764" does not exist.

Best Regards,
Hou zj

[1]: https://www.postgresql.org/docs/9.6/release-9-6-4.html

amit.kapila16@gmail.com

11 months ago

In reply to: Zhijie Hou (Fujitsu) (#4)

Re: Fix slot synchronization with two_phase decoding enabled

On Mon, Mar 31, 2025 at 5:04 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

On Thu, Mar 27, 2025 at 2:29 PM Amit Kapila wrote:

I suspect that this can happen in PG17 as well, but I need to think
more about it to make a reproducible test case.

After further analysis, I was able to reproduce the same issue [1] in
PG 17.

However, since the proposed fix requires catalog changes and the issue is not a
security risk significant enough to justify changing the catalog in back
branches, we cannot back-patch the same solution.

Agreed. In the past, as in commit b6e39ca92e, we have backported a
catalog-modifying commit, but that is for a CVE. Users need to follow
manual steps as explained in 9.6.4 release notes [1]https://www.postgresql.org/docs/9.6/release-9-6-4.html, which would be
cumbersome for them. This is not a security issue, so we shouldn't
backpatch a catalog modifying commit following past.

Following off-list
discussions with Amit and Kuroda-san, we are considering disallowing enabling
failover and two-phase decoding together for a replication slot, as suggested
in attachment 0002.

Another idea considered is to prevent the slot that enables two-phase decoding
from being synced to standby. IOW, this means displaying the failover field as
false in the view, if there is any possibility that transactions prepared
before the two_phase_at position exist (e.g., if restart_lsn is less than
two_phase_at). However, implementing this change would require additional
explanations to users for this new behavior, which seems tricky.

I find it tricky to explain to users. We need to say that sometimes
the slots won't be synced even if the failover is set to true. Users
can verify that by checking slot properties on the publisher. Also, on
the subscriber, the failover flag in the subscription may still be
true unless we do more engineering to make it false. So, I prefer to
simply disallow setting failover and two_phase together. We need to
recommend to users in release notes for 17 that they need to disable
failover for subscriptions where two_phase is enabled or re-create the
subscriptions with two_phase=false and failover=true. Users may not
like it, but I think it is better than getting a complaint that after
promotion of standby the data to subscriber is not getting replicated.

--
With Regards,
Amit Kapila.

amit.kapila16@gmail.com

11 months ago

In reply to: Zhijie Hou (Fujitsu) (#4)

Re: Fix slot synchronization with two_phase decoding enabled

On Mon, Mar 31, 2025 at 5:04 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

Thanks for the comments, they have been addressed in V2.

In the PG17 patch, we only have a check preventing failover and
two_phase in CreateSubscription. Don't we need a similar check for
AlterSubscription?

Apart from the above, I have a few suggestions for changes in docs,
error messages, and comments for both versions. See attached.

--
With Regards,
Amit Kapila.

houzj.fnst@fujitsu.com

11 months ago

In reply to: Amit Kapila (#6)

RE: Fix slot synchronization with two_phase decoding enabled

On Tue, Apr 1, 2025 at 2:09 PM Amit Kapila wrote:

On Mon, Mar 31, 2025 at 5:0 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

Thanks for the comments, they have been addressed in V2.

In the PG17 patch, we only have a check preventing failover and
two_phase in CreateSubscription. Don't we need a similar check for AlterSubscription?

Apart from the above, I have a few suggestions for changes in docs,
error messages, and comments for both versions. See attached.

Thanks for the comments.

Here is the V3 patch set which addressed all the comments.

I also added a testcase for ALTER SUB in 0002 as suggested by
Kuroda-san off-list.

Additionally, I found that a test failed in PG17 following this
patch because it involved creating a subscription with both failover and
two-phase enabled. Since that test was designed to test the upgrade of
replication slots and is not directly related to failover functionality, I
decided to disable the failover option for test case.

Best Regards,
Hou zj

amit.kapila16@gmail.com

11 months ago

In reply to: Zhijie Hou (Fujitsu) (#7)

Re: Fix slot synchronization with two_phase decoding enabled

On Tue, Apr 1, 2025 at 4:28 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

Here is the V3 patch set which addressed all the comments.

Comment 0n 0001
<literal>NULL</literal> for logical slots where
+       <structfield>two_phase</structfield> is false and physical slots.
+      </para></entry>

change above to:
<literal>NULL</literal> for logical slots where
<structfield>two_phase</structfield> is false and for physical slots.

Comment on 0002
+# Create a subscription with two_phase enabled
+$subscriber1->safe_psql('postgres',
+ "CREATE SUBSCRIPTION regress_mysub2 CONNECTION '$publisher_connstr'
PUBLICATION regress_mypub WITH (slot_name = lsub1_slot, create_slot =
false, enabled = false, two_phase = true);"
+);
+
+# Enable failover for the subscription
+($result, $stdout, $stderr) = $subscriber1->psql('postgres',
+ "ALTER SUBSCRIPTION regress_mysub2 SET (failover = true)");
+ok( $stderr =~ /ERROR:  cannot enable failover for a two_phase
enabled subscription/,
+ "Enabling failover is not allowed for a two_phase enabled subscription");

Is there a need for this test to be in .pl file? Can't we add it in .sql file?

Apart from the above, I have made minor modifications to the PG17
patch in the attached.

--
With Regards,
Amit Kapila.

houzj.fnst@fujitsu.com

11 months ago

In reply to: Amit Kapila (#8)

RE: Fix slot synchronization with two_phase decoding enabled

On Wed, Apr 2, 2025 at 12:41 PM Amit Kapila wrote:

On Tue, Apr 1, 2025 at 4:28 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com>
wrote:

Here is the V3 patch set which addressed all the comments.
Comment 0n 0001
<literal>NULL</literal> for logical slots where
+       <structfield>two_phase</structfield> is false and physical slots.
+      </para></entry>
change above to:
<literal>NULL</literal> for logical slots where
<structfield>two_phase</structfield> is false and for physical slots.

Changed.

Comment on 0002
+# Create a subscription with two_phase enabled 
+$subscriber1->safe_psql('postgres',
+ "CREATE SUBSCRIPTION regress_mysub2 CONNECTION
'$publisher_connstr'
PUBLICATION regress_mypub WITH (slot_name = lsub1_slot, create_slot = 
false, enabled = false, two_phase = true);"
+);
+
+# Enable failover for the subscription ($result, $stdout, $stderr) = 
+$subscriber1->psql('postgres',  "ALTER SUBSCRIPTION regress_mysub2 
+SET (failover = true)"); ok( $stderr =~
+/ERROR:  cannot enable failover for a two_phase
enabled subscription/,
+ "Enabling failover is not allowed for a two_phase enabled 
+ subscription");

Is there a need for this test to be in .pl file? Can't we add it in .sql file?

Right, I moved the test into subscription.sql

Apart from the above, I have made minor modifications to the PG17
patch in the attached.

Here is V4 patch set which addressed above comments and passed
pgindent test.

Best Regards,
Hou zj

#10

sawada.mshk@gmail.com

11 months ago

In reply to: Zhijie Hou (Fujitsu) (#4)

Re: Fix slot synchronization with two_phase decoding enabled

On Mon, Mar 31, 2025 at 4:34 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

On Thu, Mar 27, 2025 at 2:29 PM Amit Kapila wrote:

On Tue, Mar 25, 2025 at 12:1 PM Amit Kapila <amit.kapila16@gmail.com>
wrote:

On Tue, Mar 25, 2025 at 11:05 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

Hi,

When testing the slot synchronization with logical replication slots that
enabled two_phase decoding, I found that transactions prepared before

two-phase

decoding is enabled may fail to replicate to the subscriber after being
committed on a promoted standby following a failover.

To reproduce this issue, please follow these steps (also detailed in the
attached TAP test, v1-0001):

1. sub: create a subscription with (two_phase = false)
2. primary (pub): prepare a txn A.
3. sub: alter subscription set (two_phase = true) and wait for the logical

slot to

be synced to standby.
4. primary (pub): stop primary, promote the standby and let the subscriber

use

the promoted standby as publisher.
5. promoted standby (pub): COMMIT PREPARED A;
6. sub: the apply worker will report the following ERROR because it didn't
receive the PREPARE.
ERROR: prepared transaction with identifier "pg_gid_16387_752"

does not exist

I think the root cause of this issue is that the two_phase_at field of the
slot, which indicates the LSN from which two-phase decoding is enabled

(used to

prevent duplicate data transmission for prepared transactions), is not
synchronized to the standby server.

In step 3, transaction A is not immediately replicated because it occurred
before enabling two-phase decoding. Thus, the prepared transaction

should only

be replicated after decoding the final COMMIT PREPARED, as referenced

in

ReorderBufferFinishPrepared(). However, due to the invalid two_phase_at

on the

standby, the prepared transaction fails to send at that time.

This problem arises after the support for altering the two-phase option
(1462aad).

I suspect that this can happen in PG17 as well, but I need to think
more about it to make a reproducible test case.

After further analysis, I was able to reproduce the same issue [1] in
PG 17.

[1]
- pub: created a slot 'sub' with two_phase=false, then prepared a transaction
- pub: after some activity, advanced the confirmed_flush_lsn of 'sub', so it is
greater than prepared txn lsn.
- sub: create subscription with (slot_name='sub', create_slot=false, failover =
true, two_phase=true, copy_data=false); two_phase_at will be set to the same
as confirmed_flush_lsn which is greater than the prepared transaction.
- stop the primary and promote the standby.
- commit the prepared transaction on standby, the following error will be
reported on subscriber:

It seems to require elaborate steps to reproduce this issue in v17. I
wonder if we could somehow narrow down the cases that we want to
prohibit. The patch for v17 disallows CREATE SUBSCRIPTION to enable
both two_phase and failover, but I guess that it's still safe if it
also creates the replication slot (e.g., create_slot is true). If my
understanding is right, we can allow users to specify both fields if
CRETE SUBSCRIPTION creates the slot, and we don't need to disallow
that in ReplicationSlotCreate().

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#11

houzj.fnst@fujitsu.com

11 months ago

In reply to: Masahiko Sawada (#10)

RE: Fix slot synchronization with two_phase decoding enabled

On Wed, Apr 2, 2025 at 3:45 PM Masahiko Sawada wrote:

Hi,

On Mon, Mar 31, 2025 at 4:3 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

On Thu, Mar 27, 2025 at 2:29 PM Amit Kapila wrote:

On Tue, Mar 25, 2025 at 12:1 PM Amit Kapila
<amit.kapila16@gmail.com>
wrote:

On Tue, Mar 25, 2025 at 11:05 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

Hi,

When testing the slot synchronization with logical replication
slots that enabled two_phase decoding, I found that transactions
prepared before

two-phase

decoding is enabled may fail to replicate to the subscriber
after being committed on a promoted standby following a failover.

To reproduce this issue, please follow these steps (also
detailed in the attached TAP test, v1-0001):

1. sub: create a subscription with (two_phase = false) 2.
primary (pub): prepare a txn A.
3. sub: alter subscription set (two_phase = true) and wait for
the logical

slot to

be synced to standby.
4. primary (pub): stop primary, promote the standby and let the
subscriber

use

the promoted standby as publisher.
5. promoted standby (pub): COMMIT PREPARED A; 6. sub: the apply
worker will report the following ERROR because it didn't
receive the PREPARE.
ERROR: prepared transaction with identifier "pg_gid_16387_752"

does not exist

I think the root cause of this issue is that the two_phase_at
field of the slot, which indicates the LSN from which two-phase
decoding is enabled

(used to

prevent duplicate data transmission for prepared transactions),
is not synchronized to the standby server.

In step 3, transaction A is not immediately replicated because
it occurred before enabling two-phase decoding. Thus, the
prepared transaction

should only

be replicated after decoding the final COMMIT PREPARED, as
referenced

in

ReorderBufferFinishPrepared(). However, due to the invalid
two_phase_at

on the

standby, the prepared transaction fails to send at that time.

This problem arises after the support for altering the two-phase
option (1462aad).

I suspect that this can happen in PG17 as well, but I need to think
more about it to make a reproducible test case.

After further analysis, I was able to reproduce the same issue [1] in
PG 17.

[1]
- pub: created a slot 'sub' with two_phase=false, then prepared a
transaction
- pub: after some activity, advanced the confirmed_flush_lsn of 'sub', so it is
greater than prepared txn lsn.
- sub: create subscription with (slot_name='sub', create_slot=false,
failover = true, two_phase=true, copy_data=false); two_phase_at will
be set to the same as confirmed_flush_lsn which is greater than the

prepared transaction.

- stop the primary and promote the standby.
- commit the prepared transaction on standby, the following error will be
reported on subscriber:

It seems to require elaborate steps to reproduce this issue in v17. I wonder if we
could somehow narrow down the cases that we want to prohibit. The patch for
v17 disallows CREATE SUBSCRIPTION to enable both two_phase and failover,
but I guess that it's still safe if it also creates the replication slot (e.g.,
create_slot is true). If my understanding is right, we can allow users to specify
both fields if CRETE SUBSCRIPTION creates the slot, and we don't need to
disallow that in ReplicationSlotCreate().

Thanks for reviewing the steps.

The current reproducer aims for simplicity; however, I think it's possible to reproduce
the issue even with create_slot = true, although it requires the help of a
debugger and additional steps. But as long as there are transactions prepared
before the two_phase_at position, and they are skipped due to checks in
ReorderBufferFinishPrepared() (refer to comments[1]/* * It is possible that this transaction is not decoded at prepare time * either because by that time we didn't have a consistent snapshot, or * two_phase was not enabled, or it was decoded earlier but we have * restarted. We only need to send the prepare if it was not decoded * earlier. We don't need to decode the xact for aborts if it is not done * already. */ for why we skip sending
prepared transaction), the issue can be reproduced.

For instance, when creating a subscription with (copy_data=true, failover=true,
two_phase=true), the slot's two_phase setting starts as false and shifts to
true after table sync (refer to [2]/* * Even if two_phase is set, don't create the slot with * two-phase enabled. Will enable it once all the tables are * synced and ready. This avoids race-conditions like prepared * transactions being skipped due to changes not being applied * due to checks in should_apply_changes_for_rel() when * tablesync for the corresponding tables are in progress. See * comments atop worker.c. * * Note that if tables were specified but copy_data is false * then it is safe to enable two_phase up-front because those * tables are already initially in READY state. When the * subscription has no tables, we leave the twophase state as * PENDING, to allow ALTER SUBSCRIPTION ... REFRESH * PUBLICATION to work. */ Best Regards, Hou zj for related comments). During this period,
if a user prepares a transaction where the prepare LSN is less than the
two_phase_at, the same problem could happen.

Similarly, when setting up a subscription with (copy_data=false, failover=true,
two_phase=true), although two_phase is initially set to true and we wait for
running transactions to finish when building consistent snapshot, a race
condition may still exist: If the snapshot state reaches FULL_SNAPSHOT, it
won't check running transactions further (see SnapBuildFindSnapshot() for
specifics), if a user prepares a transaction at this point, it's possible for
the prepare LSN to be less than the LSN marking the snapshot's consistent
state, causing the same issue.

[1]: /* * It is possible that this transaction is not decoded at prepare time * either because by that time we didn't have a consistent snapshot, or * two_phase was not enabled, or it was decoded earlier but we have * restarted. We only need to send the prepare if it was not decoded * earlier. We don't need to decode the xact for aborts if it is not done * already. */

/*
* It is possible that this transaction is not decoded at prepare time
* either because by that time we didn't have a consistent snapshot, or
* two_phase was not enabled, or it was decoded earlier but we have
* restarted. We only need to send the prepare if it was not decoded
* earlier. We don't need to decode the xact for aborts if it is not done
* already.
*/

[2]: /* * Even if two_phase is set, don't create the slot with * two-phase enabled. Will enable it once all the tables are * synced and ready. This avoids race-conditions like prepared * transactions being skipped due to changes not being applied * due to checks in should_apply_changes_for_rel() when * tablesync for the corresponding tables are in progress. See * comments atop worker.c. * * Note that if tables were specified but copy_data is false * then it is safe to enable two_phase up-front because those * tables are already initially in READY state. When the * subscription has no tables, we leave the twophase state as * PENDING, to allow ALTER SUBSCRIPTION ... REFRESH * PUBLICATION to work. */ Best Regards, Hou zj

/*
* Even if two_phase is set, don't create the slot with
* two-phase enabled. Will enable it once all the tables are
* synced and ready. This avoids race-conditions like prepared
* transactions being skipped due to changes not being applied
* due to checks in should_apply_changes_for_rel() when
* tablesync for the corresponding tables are in progress. See
* comments atop worker.c.
*
* Note that if tables were specified but copy_data is false
* then it is safe to enable two_phase up-front because those
* tables are already initially in READY state. When the
* subscription has no tables, we leave the twophase state as
* PENDING, to allow ALTER SUBSCRIPTION ... REFRESH
* PUBLICATION to work.
*/
Best Regards,
Hou zj

#12

sawada.mshk@gmail.com

11 months ago

In reply to: Zhijie Hou (Fujitsu) (#11)

Re: Fix slot synchronization with two_phase decoding enabled

On Wed, Apr 2, 2025 at 6:33 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

On Wed, Apr 2, 2025 at 3:45 PM Masahiko Sawada wrote:

Hi,

On Mon, Mar 31, 2025 at 4:3 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

On Thu, Mar 27, 2025 at 2:29 PM Amit Kapila wrote:

On Tue, Mar 25, 2025 at 12:1 PM Amit Kapila
<amit.kapila16@gmail.com>
wrote:

On Tue, Mar 25, 2025 at 11:05 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

Hi,

When testing the slot synchronization with logical replication
slots that enabled two_phase decoding, I found that transactions
prepared before

two-phase

decoding is enabled may fail to replicate to the subscriber
after being committed on a promoted standby following a failover.

To reproduce this issue, please follow these steps (also
detailed in the attached TAP test, v1-0001):

1. sub: create a subscription with (two_phase = false) 2.
primary (pub): prepare a txn A.
3. sub: alter subscription set (two_phase = true) and wait for
the logical

slot to

be synced to standby.
4. primary (pub): stop primary, promote the standby and let the
subscriber

use

the promoted standby as publisher.
5. promoted standby (pub): COMMIT PREPARED A; 6. sub: the apply
worker will report the following ERROR because it didn't
receive the PREPARE.
ERROR: prepared transaction with identifier "pg_gid_16387_752"

does not exist

I think the root cause of this issue is that the two_phase_at
field of the slot, which indicates the LSN from which two-phase
decoding is enabled

(used to

prevent duplicate data transmission for prepared transactions),
is not synchronized to the standby server.

In step 3, transaction A is not immediately replicated because
it occurred before enabling two-phase decoding. Thus, the
prepared transaction

should only

be replicated after decoding the final COMMIT PREPARED, as
referenced

in

ReorderBufferFinishPrepared(). However, due to the invalid
two_phase_at

on the

standby, the prepared transaction fails to send at that time.

This problem arises after the support for altering the two-phase
option (1462aad).

I suspect that this can happen in PG17 as well, but I need to think
more about it to make a reproducible test case.

After further analysis, I was able to reproduce the same issue [1] in
PG 17.

[1]
- pub: created a slot 'sub' with two_phase=false, then prepared a
transaction
- pub: after some activity, advanced the confirmed_flush_lsn of 'sub', so it is
greater than prepared txn lsn.
- sub: create subscription with (slot_name='sub', create_slot=false,
failover = true, two_phase=true, copy_data=false); two_phase_at will
be set to the same as confirmed_flush_lsn which is greater than the

prepared transaction.

- stop the primary and promote the standby.
- commit the prepared transaction on standby, the following error will be
reported on subscriber:

It seems to require elaborate steps to reproduce this issue in v17. I wonder if we
could somehow narrow down the cases that we want to prohibit. The patch for
v17 disallows CREATE SUBSCRIPTION to enable both two_phase and failover,
but I guess that it's still safe if it also creates the replication slot (e.g.,
create_slot is true). If my understanding is right, we can allow users to specify
both fields if CRETE SUBSCRIPTION creates the slot, and we don't need to
disallow that in ReplicationSlotCreate().

Thanks for reviewing the steps.

The current reproducer aims for simplicity; however, I think it's possible to reproduce
the issue even with create_slot = true, although it requires the help of a
debugger and additional steps. But as long as there are transactions prepared
before the two_phase_at position, and they are skipped due to checks in
ReorderBufferFinishPrepared() (refer to comments[1] for why we skip sending
prepared transaction), the issue can be reproduced.

For instance, when creating a subscription with (copy_data=true, failover=true,
two_phase=true), the slot's two_phase setting starts as false and shifts to
true after table sync (refer to [2] for related comments). During this period,
if a user prepares a transaction where the prepare LSN is less than the
two_phase_at, the same problem could happen.

Similarly, when setting up a subscription with (copy_data=false, failover=true,
two_phase=true), although two_phase is initially set to true and we wait for
running transactions to finish when building consistent snapshot, a race
condition may still exist: If the snapshot state reaches FULL_SNAPSHOT, it
won't check running transactions further (see SnapBuildFindSnapshot() for
specifics), if a user prepares a transaction at this point, it's possible for
the prepare LSN to be less than the LSN marking the snapshot's consistent
state, causing the same issue.

Thank you for the explanation! I agree that the issue happens in these cases.

As another idea, I wonder if we could somehow defer to make the synced
slot as 'sync-ready' until we can ensure that the slot doesn't have
any transactions that are prepared before the point of enabling
two_phase. For example, when the slotsync worker fetches the remote
slot, it remembers the confirmed_flush_lsn (say LSN-1) if the local
slot's two_phase becomes true or the local slot is newly created with
enabling two_phase, and then it makes the slot 'sync-ready' once it
confirmed that the slot's restart_lsn passed LSN-1. Does it work?

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#13

[1]: /messages/by-id/OS0PR01MB57161D9BB5409F229564957994AD2@OS0PR01MB5716.jpnprd01.prod.outlook.com

houzj.fnst@fujitsu.com

11 months ago

In reply to: Masahiko Sawada (#12)

RE: Fix slot synchronization with two_phase decoding enabled

On Thu, Apr 3, 2025 at 3:30 AM Masahiko Sawada wrote:

On Wed, Apr 2, 2025 at 6:33 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

On Wed, Apr 2, 2025 at 3:45 PM Masahiko Sawada wrote:

Hi,

On Mon, Mar 31, 2025 at 4:3 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

After further analysis, I was able to reproduce the same issue [1] in
PG 17.

[1]
- pub: created a slot 'sub' with two_phase=false, then prepared a
transaction
- pub: after some activity, advanced the confirmed_flush_lsn of 'sub', so

it is

greater than prepared txn lsn.
- sub: create subscription with (slot_name='sub', create_slot=false,
failover = true, two_phase=true, copy_data=false); two_phase_at will
be set to the same as confirmed_flush_lsn which is greater than the

prepared transaction.

- stop the primary and promote the standby.
- commit the prepared transaction on standby, the following error will be
reported on subscriber:

It seems to require elaborate steps to reproduce this issue in v17. I wonder

if we

could somehow narrow down the cases that we want to prohibit. The patch

for

v17 disallows CREATE SUBSCRIPTION to enable both two_phase and

failover,

but I guess that it's still safe if it also creates the replication slot (e.g.,
create_slot is true). If my understanding is right, we can allow users to

specify

both fields if CRETE SUBSCRIPTION creates the slot, and we don't need to
disallow that in ReplicationSlotCreate().

Thanks for reviewing the steps.

The current reproducer aims for simplicity; however, I think it's possible to

reproduce

the issue even with create_slot = true, although it requires the help of a
debugger and additional steps. But as long as there are transactions

prepared

before the two_phase_at position, and they are skipped due to checks in
ReorderBufferFinishPrepared() (refer to comments[1] for why we skip

sending

prepared transaction), the issue can be reproduced.

For instance, when creating a subscription with (copy_data=true,

failover=true,

two_phase=true), the slot's two_phase setting starts as false and shifts to
true after table sync (refer to [2] for related comments). During this period,
if a user prepares a transaction where the prepare LSN is less than the
two_phase_at, the same problem could happen.

Similarly, when setting up a subscription with (copy_data=false,

failover=true,

two_phase=true), although two_phase is initially set to true and we wait for
running transactions to finish when building consistent snapshot, a race
condition may still exist: If the snapshot state reaches FULL_SNAPSHOT, it
won't check running transactions further (see SnapBuildFindSnapshot() for
specifics), if a user prepares a transaction at this point, it's possible for
the prepare LSN to be less than the LSN marking the snapshot's consistent
state, causing the same issue.

Thank you for the explanation! I agree that the issue happens in these cases.

As another idea, I wonder if we could somehow defer to make the synced
slot as 'sync-ready' until we can ensure that the slot doesn't have
any transactions that are prepared before the point of enabling
two_phase. For example, when the slotsync worker fetches the remote
slot, it remembers the confirmed_flush_lsn (say LSN-1) if the local
slot's two_phase becomes true or the local slot is newly created with
enabling two_phase, and then it makes the slot 'sync-ready' once it
confirmed that the slot's restart_lsn passed LSN-1. Does it work?

Thanks for the idea!

We considered a similar approach in [1]/messages/by-id/OS0PR01MB57161D9BB5409F229564957994AD2@OS0PR01MB5716.jpnprd01.prod.outlook.com to confirm there is no prepared
transactions before two_phase_at, but the issue is that when the two_phase flag
is switched from 'false' to 'true' (as in the case with (copy_data=true,
failover=true, two_phase=true)). In this case, the slot may have already been
marked as sync-ready before the two_phase flag is enabled, as slotsync is
unaware of potential future changes to the two_phase flag. Then the slot have
to revert to sync-not-ready after the two_phase flag change, which could be
difficult for users to understand.

What do you think ?

Best Regards,
Hou zj

#14

amit.kapila16@gmail.com

11 months ago

In reply to: Zhijie Hou (Fujitsu) (#13)

Re: Fix slot synchronization with two_phase decoding enabled

On Thu, Apr 3, 2025 at 7:50 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

On Thu, Apr 3, 2025 at 3:30 AM Masahiko Sawada wrote:

On Wed, Apr 2, 2025 at 6:33 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

Thank you for the explanation! I agree that the issue happens in these cases.

As another idea, I wonder if we could somehow defer to make the synced
slot as 'sync-ready' until we can ensure that the slot doesn't have
any transactions that are prepared before the point of enabling
two_phase. For example, when the slotsync worker fetches the remote
slot, it remembers the confirmed_flush_lsn (say LSN-1) if the local
slot's two_phase becomes true or the local slot is newly created with
enabling two_phase, and then it makes the slot 'sync-ready' once it
confirmed that the slot's restart_lsn passed LSN-1. Does it work?

Thanks for the idea!

We considered a similar approach in [1] to confirm there is no prepared
transactions before two_phase_at, but the issue is that when the two_phase flag
is switched from 'false' to 'true' (as in the case with (copy_data=true,
failover=true, two_phase=true)). In this case, the slot may have already been
marked as sync-ready before the two_phase flag is enabled, as slotsync is
unaware of potential future changes to the two_phase flag.

This can happen because when copy_data is true, tablesync can take a
long time to complete the sync and in the meantime, slot without a
two_phase flag would have been synced to standby. Such a slot would be
marked as sync-ready even if we follow the calculation proposed by
Sawada-san. Note that we enable two_phase once all the tables are in
ready state (See run_apply_worker() and comments atop worker.c
(TWO_PHASE TRANSACTIONS)).

--
With Regards,
Amit Kapila.

#15

sawada.mshk@gmail.com

11 months ago

In reply to: Amit Kapila (#14)

Re: Fix slot synchronization with two_phase decoding enabled

On Wed, Apr 2, 2025 at 7:58 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Apr 3, 2025 at 7:50 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

On Thu, Apr 3, 2025 at 3:30 AM Masahiko Sawada wrote:

On Wed, Apr 2, 2025 at 6:33 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

Thank you for the explanation! I agree that the issue happens in these cases.

As another idea, I wonder if we could somehow defer to make the synced
slot as 'sync-ready' until we can ensure that the slot doesn't have
any transactions that are prepared before the point of enabling
two_phase. For example, when the slotsync worker fetches the remote
slot, it remembers the confirmed_flush_lsn (say LSN-1) if the local
slot's two_phase becomes true or the local slot is newly created with
enabling two_phase, and then it makes the slot 'sync-ready' once it
confirmed that the slot's restart_lsn passed LSN-1. Does it work?

Thanks for the idea!

We considered a similar approach in [1] to confirm there is no prepared
transactions before two_phase_at, but the issue is that when the two_phase flag
is switched from 'false' to 'true' (as in the case with (copy_data=true,
failover=true, two_phase=true)). In this case, the slot may have already been
marked as sync-ready before the two_phase flag is enabled, as slotsync is
unaware of potential future changes to the two_phase flag.

This can happen because when copy_data is true, tablesync can take a
long time to complete the sync and in the meantime, slot without a
two_phase flag would have been synced to standby. Such a slot would be
marked as sync-ready even if we follow the calculation proposed by
Sawada-san. Note that we enable two_phase once all the tables are in
ready state (See run_apply_worker() and comments atop worker.c
(TWO_PHASE TRANSACTIONS)).

Right. It doesn't make sense to make the slot not-sync-ready and then
back to sync-ready.

While I agree with the approach for HEAD and it seems difficult to
find a solution, I'm concerned that disallowing to use both failover
and two_phase in a minor release would affect users much. Users who
are already using that combination might end up needing to re-think
their system architecture. So I'm trying to narrow down use cases
where we're going to prohibit or to find workarounds.

If we agree with the fix for HEAD, we can push the fix for HEAD first,
which would be better to be done sooner as it needs to bump the
catversion. We can discuss the ideas and workarounds for v17 later.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#16

amit.kapila16@gmail.com

11 months ago

In reply to: Masahiko Sawada (#15)

Re: Fix slot synchronization with two_phase decoding enabled

On Thu, Apr 3, 2025 at 11:08 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Wed, Apr 2, 2025 at 7:58 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Apr 3, 2025 at 7:50 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

On Thu, Apr 3, 2025 at 3:30 AM Masahiko Sawada wrote:

On Wed, Apr 2, 2025 at 6:33 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

Thank you for the explanation! I agree that the issue happens in these cases.

As another idea, I wonder if we could somehow defer to make the synced
slot as 'sync-ready' until we can ensure that the slot doesn't have
any transactions that are prepared before the point of enabling
two_phase. For example, when the slotsync worker fetches the remote
slot, it remembers the confirmed_flush_lsn (say LSN-1) if the local
slot's two_phase becomes true or the local slot is newly created with
enabling two_phase, and then it makes the slot 'sync-ready' once it
confirmed that the slot's restart_lsn passed LSN-1. Does it work?

Thanks for the idea!

We considered a similar approach in [1] to confirm there is no prepared
transactions before two_phase_at, but the issue is that when the two_phase flag
is switched from 'false' to 'true' (as in the case with (copy_data=true,
failover=true, two_phase=true)). In this case, the slot may have already been
marked as sync-ready before the two_phase flag is enabled, as slotsync is
unaware of potential future changes to the two_phase flag.

This can happen because when copy_data is true, tablesync can take a
long time to complete the sync and in the meantime, slot without a
two_phase flag would have been synced to standby. Such a slot would be
marked as sync-ready even if we follow the calculation proposed by
Sawada-san. Note that we enable two_phase once all the tables are in
ready state (See run_apply_worker() and comments atop worker.c
(TWO_PHASE TRANSACTIONS)).

Right. It doesn't make sense to make the slot not-sync-ready and then
back to sync-ready.

While I agree with the approach for HEAD and it seems difficult to
find a solution, I'm concerned that disallowing to use both failover
and two_phase in a minor release would affect users much. Users who
are already using that combination might end up needing to re-think
their system architecture. So I'm trying to narrow down use cases
where we're going to prohibit or to find workarounds.

If we agree with the fix for HEAD, we can push the fix for HEAD first,
which would be better to be done sooner as it needs to bump the
catversion. We can discuss the ideas and workarounds for v17 later.

Thanks, I'll push the patch for HEAD and then keep thinking if we have
a better way to deal with the problem in 17. BTW, the problem for 17
can happen in a much narrower set of cases as explained in the emails
above.

--
With Regards,
Amit Kapila.

#17

houzj.fnst@fujitsu.com

11 months ago

In reply to: Masahiko Sawada (#15)

RE: Fix slot synchronization with two_phase decoding enabled

On Thu, Apr 3, 2025 at 1:38 PM Masahiko Sawada wrote:

On Wed, Apr 2, 2025 at 7:58 PM Amit Kapila <amit.kapila16@gmail.com>
wrote:

On Thu, Apr 3, 2025 at 7:50 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

On Thu, Apr 3, 2025 at 3:30 AM Masahiko Sawada wrote:

On Wed, Apr 2, 2025 at 6:33 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

Thank you for the explanation! I agree that the issue happens in these

cases.

Just to share the steps to reproduce the issue when creating the subscription
With (create_slot=true, failover = true, two_phase=true, copy_data=false) for reference.

- pub: begin a txn A. This is to stop the slot from building a consistent
snapshot immediately.
- sub: create subscription with (slot_name='sub', create_slot=true, failover =
true, two_phase=true, copy_data=false);
- pub: Attach to the walsender that will create the slot and add a checkpoint
in SnapBuildFindSnapshot() -> (builder->state = SNAPBUILD_FULL_SNAPSHOT;).
For now, the state should be BUILDING_SNAPSHOT.
- pub: begin another txn B and commit txn A. The snapshot should reach
FULL_SNAPSHOT now.
- pub: prepared a txn C and then commit txn B.
- pub: release the checkpoint in SnapBuildFindSnapshot(), then it should reach
the SNAPBUILD_CONSISTENT. Now we have a prepared txn whose prepare_lsn is
less than the two_phase_at.
- stop the primary and promote the standby.
- sub: alter the subscription to use the new primary as the publisher.
- commit the prepared transaction on new primary, the following error will be
reported on subscriber:

LOG: logical replication apply worker for subscription "sub" has started
ERROR: prepared transaction with identifier "pg_gid_16398_764" does not exist.

As another idea, I wonder if we could somehow defer to make the
synced slot as 'sync-ready' until we can ensure that the slot
doesn't have any transactions that are prepared before the point
of enabling two_phase. For example, when the slotsync worker
fetches the remote slot, it remembers the confirmed_flush_lsn (say
LSN-1) if the local slot's two_phase becomes true or the local
slot is newly created with enabling two_phase, and then it makes
the slot 'sync-ready' once it confirmed that the slot's restart_lsn passed

LSN-1. Does it work?

Thanks for the idea!

We considered a similar approach in [1] to confirm there is no
prepared transactions before two_phase_at, but the issue is that
when the two_phase flag is switched from 'false' to 'true' (as in
the case with (copy_data=true, failover=true, two_phase=true)). In
this case, the slot may have already been marked as sync-ready
before the two_phase flag is enabled, as slotsync is unaware of potential

future changes to the two_phase flag.

This can happen because when copy_data is true, tablesync can take a
long time to complete the sync and in the meantime, slot without a
two_phase flag would have been synced to standby. Such a slot would be
marked as sync-ready even if we follow the calculation proposed by
Sawada-san. Note that we enable two_phase once all the tables are in
ready state (See run_apply_worker() and comments atop worker.c
(TWO_PHASE TRANSACTIONS)).

Right. It doesn't make sense to make the slot not-sync-ready and then back to
sync-ready.

While I agree with the approach for HEAD and it seems difficult to find a
solution, I'm concerned that disallowing to use both failover and two_phase in
a minor release would affect users much. Users who are already using that
combination might end up needing to re-think their system architecture. So I'm
trying to narrow down use cases where we're going to prohibit or to find
workarounds.

If we agree with the fix for HEAD, we can push the fix for HEAD first, which
would be better to be done sooner as it needs to bump the catversion. We can
discuss the ideas and workarounds for v17 later.

Agreed. I will think more on it. One workaround could be skipping the prepared
transaction when such an issue arises, followed by manually replicating the
skipped changes.

Best Regards,
Hou zj

#18

houzj.fnst@fujitsu.com

11 months ago

In reply to: Zhijie Hou (Fujitsu) (#17)

RE: Fix slot synchronization with two_phase decoding enabled

On Thu, Apr 3, 2025 at 3:16 PM Zhijie Hou (Fujitsu) wrote:

On Thu, Apr 3, 2025 at 1:38 PM Masahiko Sawada wrote:

On Wed, Apr 2, 2025 at 7:58 PM Amit Kapila <amit.kapila16@gmail.com>
wrote:

On Thu, Apr 3, 2025 at 7:50 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

On Thu, Apr 3, 2025 at 3:30 AM Masahiko Sawada wrote:

On Wed, Apr 2, 2025 at 6:33 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

Thank you for the explanation! I agree that the issue happens in
these

cases.

As another idea, I wonder if we could somehow defer to make the
synced slot as 'sync-ready' until we can ensure that the slot
doesn't have any transactions that are prepared before the point
of enabling two_phase. For example, when the slotsync worker
fetches the remote slot, it remembers the confirmed_flush_lsn
(say
LSN-1) if the local slot's two_phase becomes true or the local
slot is newly created with enabling two_phase, and then it makes
the slot 'sync-ready' once it confirmed that the slot's
restart_lsn passed

LSN-1. Does it work?

Thanks for the idea!

We considered a similar approach in [1] to confirm there is no
prepared transactions before two_phase_at, but the issue is that
when the two_phase flag is switched from 'false' to 'true' (as in
the case with (copy_data=true, failover=true, two_phase=true)). In
this case, the slot may have already been marked as sync-ready
before the two_phase flag is enabled, as slotsync is unaware of
potential

future changes to the two_phase flag.

This can happen because when copy_data is true, tablesync can take a
long time to complete the sync and in the meantime, slot without a
two_phase flag would have been synced to standby. Such a slot would
be marked as sync-ready even if we follow the calculation proposed
by Sawada-san. Note that we enable two_phase once all the tables are
in ready state (See run_apply_worker() and comments atop worker.c
(TWO_PHASE TRANSACTIONS)).

Right. It doesn't make sense to make the slot not-sync-ready and then
back to sync-ready.

While I agree with the approach for HEAD and it seems difficult to
find a solution, I'm concerned that disallowing to use both failover
and two_phase in a minor release would affect users much. Users who
are already using that combination might end up needing to re-think
their system architecture. So I'm trying to narrow down use cases
where we're going to prohibit or to find workarounds.

We analyzed more for the backbranch fix, and here is a summary of different
approaches that could be used for PG17.

----------
Approach 1
----------

This is the original approach implemented in V4 patch.

In this approach, we entirely disallow enabling both failover and the two-phase
feature together for a replication slot or subscription.

pros:

This restriction is simple to implement and easy for users to comprehend.

cons:

Users would be unable to use the two-phase feature in conjunction with
failover.

Following the upgrade to a new release with this fix, existing subscriptions
that have both failover and two-phase enabled would require manual re-creation,
which is time-consuming.

----------
Approach 2
----------

Instead of disallowing the use of two-phase and failover together, a more
flexible strategy could be only restrict failover for slots with two-phase
enabled when there's a possibility of existing prepared transactions before the
two_phase_at that are not yet replicated. During slot creation with two-phase
and failover, we could check for any decoded prepared transactions when
determining the decoding start point (DecodingContextFindStartpoint). For
subsequent attempts to alter failover to true, we ensure that two_phase_at is
less than restart_lsn, indicating that all prepared transactions have been
committed and replicated, thus the bug would not happen.

pros:

This method minimizes restrictions for users. Especially during slot creation
with (two_phase=on, failover=on), as it’s uncommon for transactions to prepare
during consistent snapshot creation, the restriction becomes almost
unnoticeable.

Users are not always forced to re-create subscriptions post-upgrade.

cons:

The logic involved for (restart_lsn > two_phase_at) might be less intuitive for
users.

After upgrading, it's recommended that users verify whether two_phase_at for a
source slot is less than restart_lsn of the synced slot. If it isn't, users
should be careful when using the synced slot on a standby. This might be
complex to understand.

----------
Approach 3
----------

This approach is similar to Approach 2 but simplifies some aspects by avoiding
checks for prepared transactions during slot creation. It disallows enabling
both failover and twophase during slot creation, but permits altering failover
to true once ensured that no prepared transaction exists (restart_lsn >
two_phase_at).

pros:

Like Approach 2, this reduces user restrictions compared to completely
disallowing failover and two-phase usage together.

Users are not always forced to re-create subscriptions post-upgrade.

cons:

User still could not enable failover and twophase when creating a slot.

Similar to approach 1, the check (restart_lsn > two_phase_at) might be less
intuitive for users.

Best Regards,
Hou zj

#19

nisha.moond412@gmail.com

11 months ago

In reply to: Zhijie Hou (Fujitsu) (#18)

Re: Fix slot synchronization with two_phase decoding enabled

HI,

Please find the patches attached for all three approaches.

On Wed, Apr 9, 2025 at 10:45 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

On Thu, Apr 3, 2025 at 3:16 PM Zhijie Hou (Fujitsu) wrote:

On Thu, Apr 3, 2025 at 1:38 PM Masahiko Sawada wrote:

On Wed, Apr 2, 2025 at 7:58 PM Amit Kapila <amit.kapila16@gmail.com>
wrote:

On Thu, Apr 3, 2025 at 7:50 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

On Thu, Apr 3, 2025 at 3:30 AM Masahiko Sawada wrote:

On Wed, Apr 2, 2025 at 6:33 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

Thank you for the explanation! I agree that the issue happens in
these

cases.

As another idea, I wonder if we could somehow defer to make the
synced slot as 'sync-ready' until we can ensure that the slot
doesn't have any transactions that are prepared before the point
of enabling two_phase. For example, when the slotsync worker
fetches the remote slot, it remembers the confirmed_flush_lsn
(say
LSN-1) if the local slot's two_phase becomes true or the local
slot is newly created with enabling two_phase, and then it makes
the slot 'sync-ready' once it confirmed that the slot's
restart_lsn passed

LSN-1. Does it work?

Thanks for the idea!

We considered a similar approach in [1] to confirm there is no
prepared transactions before two_phase_at, but the issue is that
when the two_phase flag is switched from 'false' to 'true' (as in
the case with (copy_data=true, failover=true, two_phase=true)). In
this case, the slot may have already been marked as sync-ready
before the two_phase flag is enabled, as slotsync is unaware of
potential

future changes to the two_phase flag.

This can happen because when copy_data is true, tablesync can take a
long time to complete the sync and in the meantime, slot without a
two_phase flag would have been synced to standby. Such a slot would
be marked as sync-ready even if we follow the calculation proposed
by Sawada-san. Note that we enable two_phase once all the tables are
in ready state (See run_apply_worker() and comments atop worker.c
(TWO_PHASE TRANSACTIONS)).

Right. It doesn't make sense to make the slot not-sync-ready and then
back to sync-ready.

While I agree with the approach for HEAD and it seems difficult to
find a solution, I'm concerned that disallowing to use both failover
and two_phase in a minor release would affect users much. Users who
are already using that combination might end up needing to re-think
their system architecture. So I'm trying to narrow down use cases
where we're going to prohibit or to find workarounds.

We analyzed more for the backbranch fix, and here is a summary of different
approaches that could be used for PG17.

----------
Approach 1
----------

This is the original approach implemented in V4 patch.

In this approach, we entirely disallow enabling both failover and the two-phase
feature together for a replication slot or subscription.

pros:

This restriction is simple to implement and easy for users to comprehend.

cons:

Users would be unable to use the two-phase feature in conjunction with
failover.

Following the upgrade to a new release with this fix, existing subscriptions
that have both failover and two-phase enabled would require manual re-creation,
which is time-consuming.

Patch "v5_aprch_1-0001" implements the above Approach 1.

----------
Approach 2
----------

Instead of disallowing the use of two-phase and failover together, a more
flexible strategy could be only restrict failover for slots with two-phase
enabled when there's a possibility of existing prepared transactions before the
two_phase_at that are not yet replicated. During slot creation with two-phase
and failover, we could check for any decoded prepared transactions when
determining the decoding start point (DecodingContextFindStartpoint). For
subsequent attempts to alter failover to true, we ensure that two_phase_at is
less than restart_lsn, indicating that all prepared transactions have been
committed and replicated, thus the bug would not happen.

pros:

This method minimizes restrictions for users. Especially during slot creation
with (two_phase=on, failover=on), as it’s uncommon for transactions to prepare
during consistent snapshot creation, the restriction becomes almost
unnoticeable.

Users are not always forced to re-create subscriptions post-upgrade.

cons:

The logic involved for (restart_lsn > two_phase_at) might be less intuitive for
users.

After upgrading, it's recommended that users verify whether two_phase_at for a
source slot is less than restart_lsn of the synced slot. If it isn't, users
should be careful when using the synced slot on a standby. This might be
complex to understand.

Patch "v5_aprch_2-0001" implements the above Approach 2.

----------
Approach 3
----------

This approach is similar to Approach 2 but simplifies some aspects by avoiding
checks for prepared transactions during slot creation. It disallows enabling
both failover and twophase during slot creation, but permits altering failover
to true once ensured that no prepared transaction exists (restart_lsn >
two_phase_at).

pros:

Like Approach 2, this reduces user restrictions compared to completely
disallowing failover and two-phase usage together.

Users are not always forced to re-create subscriptions post-upgrade.

cons:

User still could not enable failover and twophase when creating a slot.

Similar to approach 1, the check (restart_lsn > two_phase_at) might be less
intuitive for users.

After upgrading, it's recommended that users verify whether two_phase_at for a
source slot is less than restart_lsn of the synced slot. If it isn't, users
should be careful when using the synced slot on a standby. This might be
complex to understand.

Patch "v5_aprch_3-0001" implements the above Approach 3.

Thanks Hou-san for implementing approach-2 and providing the patch.

--
Thanks,
Nisha

#20

houzj.fnst@fujitsu.com

11 months ago

In reply to: Amit Kapila (#16)

RE: Fix slot synchronization with two_phase decoding enabled

Hi,

The recently added tests for slotsync on two-phase enabled slots failed[1] due
to a broken assertion triggered while decoding the COMMIT PREPARED record on
the promoted standby.

-----
Analysis
-----

if ((txn->final_lsn < two_phase_at) && is_commit)
{
/*
* txn must have been marked as a prepared transaction and skipped but
* not sent a prepare. Also, the prepare info must have been updated
* in txn even if we skip prepare.
*/
Assert((txn->txn_flags & RBTXN_PREPARE_STATUS_MASK) ==
(RBTXN_IS_PREPARED | RBTXN_SKIPPED_PREPARE));

I think the issue stem from the PREPARED transaction, which was skipped on the
primary, not being skipped on the promoted standby. The root cause appears to
be the latest 'confirmed_flush' (0/6005118) of the source slot not being synced
to the standby. This results in the confirmed_flush (0/6004F90) of the synced
slot being less than the synced two_phase_at field (0/6005118). Consequently,
the PREPARE record's LSN exceeds the synced confirmed_flush during
decoding, causing it to be incorrectly decoded and sent to the subscriber. Thus, when
decoding COMMIT PREPARED, the PREPARE record is found before two_phase_at but
wasn't skipped.

-- The LOGs that proves the confirmed_flush is not synced:

decoding on original slot (primary):
LOG: 0/6004F90 has been already streamed, forwarding to 0/6005118
...
DETAIL: Streaming transactions committing after 0/6005118, reading WAL from 0/60049C8.

decoding on synced slot (promted standby) - failed run:
DETAIL: Streaming transactions committing after 0/6004F90, reading WAL from 0/60049C8.

decoding on synced slot (promted standby) - success run:
LOG: 0/6004F90 has been already streamed, forwarding to 0/6005118
...
DETAIL: Streaming transactions committing after 0/6005118, reading WAL from 0/60049C8.
--

The logs above clearly show that during the test failure, the starting position
(6004F90) was less than that of a successful run. Additionally, it was lower
than the starting position of the remote slot on the primary, indicating a
failure to synchronize confirmed_lsn.

The reason confirmed_flush isn't synced should stem from the following check,
which skip the syncing of the confirmed_flush value. I also reproduced the
assertion failure by creating a scenario that leads to sync omission due to
this check:

/*
* Don't overwrite if we already have a newer catalog_xmin and
* restart_lsn.
*/
if (remote_slot->restart_lsn < slot->data.restart_lsn ||
TransactionIdPrecedes(remote_slot->catalog_xmin,
slot->data.catalog_xmin))
{

This check triggered because the synced catalog_xmin surpassed that of the
source slot (It can be seen from the log that the restart_lsn was synced to the
same value) ). This situation arises due to several factors:

On the publisher(primary), the slot may retain an older catalog_xmin and
restart_lsn because it is being consumed by a walsender. In this scenario, if
the apply worker does not immediately confirm that changes have been flushed,
the walsender advances the catalog_xmin/restart_lsn slowly. It sets an old
value for candidate_catalog_xmin/candidate_restart_lsn and would not update
them until the apply worker confirms via LogicalConfirmReceivedLocation.

However, the slotsync worker has the opportunity to begin WAL reading from a
more recent point, potentially advancing catalog_xmin/restart_lsn to newer
values.

And it's also possible to advance only catalog_xmin while keep restart_lsn
unchanged, by starting a transaction before the running_xact record. In this
case, it would pass builder->last_serialized_snapshot as restart_lsn to
LogicalIncreaseRestartDecodingForSlot(), but last_serialized_snapshot would
point to the position of the last running_xact record instead of the current
one, or the value would be NULL if no snapshot has been serialized before, so
it would not advance restart_lsn. The advancement of catalog_xmin is not
blocked by running transaction, as it would either use the XID of the running
transaction or the oldestRunningXid as candidate.

-----
Reproduction
-----

Accompanying this analysis is a script to reproduce the issue. In these
scripts, two running_xacts were logged post-slot creation on the publisher.
Below is a detailed progression of restart_lsn/catalog_xmin advancement
observed on both the publisher and standby:

The process is : slot creation -> log running_xact -> prepare a txn -> log
running_xact

The process of advancing the catalog_xmin/restart_lsn on publisher:

slot creation
...
1. 0/301B880 - running_xacts (no running txn, oldestrunningxid = 755)

got new catalog xmin 755 at 0/301B880
LOG: got new restart lsn 0/301B880 at 0/301B880
2. 0/301B8B8 - PREPAPRE a TRANSACTION.
3. 0/301BA20 - running_xacts (no running txn, oldestrunningxid = 756)

failed to increase restart lsn: proposed 0/301B880, after 0/301BA20,
current candidate 0/301B880, current after 0/301B880, flushed up to
0/301B810

final slot data:

catalog_xmin | 755
restart_lsn | 0/301B880
confirmed_flush_lsn | 0/301BAC8

The process of advancing the catalog_xmin/restart_lsn on standby (during slot
sync):

slot creation
...
1. 0/301B880 - running_xacts (no running txn, oldestrunningxid = 755)

building consistent snapshot, so will skip advancing the catalog_xmin/restart_lsn
...
2. 0/301BA20 - running_xacts (no running txn, oldestrunningxid = 756)

got new catalog xmin 756 at 0/301BA20

cannot get new restart_lsn as running txn exists and didn't serialize
snapshot yet.

first synced slot data:

catalog_xmin | 756
restart_lsn | 0/301B880
confirmed_flush_lsn | 0/301BAC8

The second slot sync cycle would skip updating confirmed_lsn because
catalog_xmin is newer.

-----
Fix
-----

I think we should keep the confirmed_flush even if the previous synced
restart_lsn/catalog_xmin is newer. Attachments include a patch for the same.

Best Regards,
Hou zj

#21

houzj.fnst@fujitsu.com

11 months ago

In reply to: Zhijie Hou (Fujitsu) (#20)

#22

amit.kapila16@gmail.com

11 months ago

In reply to: Zhijie Hou (Fujitsu) (#20)

#23

sawada.mshk@gmail.com

11 months ago

In reply to: Zhijie Hou (Fujitsu) (#18)

#24

houzj.fnst@fujitsu.com

11 months ago

In reply to: Masahiko Sawada (#23)

#25

amit.kapila16@gmail.com

11 months ago

In reply to: Amit Kapila (#22)

#26

houzj.fnst@fujitsu.com

11 months ago

In reply to: Amit Kapila (#25)

#27

shveta.malik@gmail.com

11 months ago

In reply to: Nisha Moond (#19)

#28

amit.kapila16@gmail.com

11 months ago

In reply to: Zhijie Hou (Fujitsu) (#24)

#29

nisha.moond412@gmail.com

11 months ago

In reply to: shveta malik (#27)

#30

sawada.mshk@gmail.com

11 months ago

In reply to: Amit Kapila (#28)

#31

amit.kapila16@gmail.com

11 months ago

In reply to: Masahiko Sawada (#30)

#32

nisha.moond412@gmail.com

11 months ago

In reply to: Amit Kapila (#31)

#33

shveta.malik@gmail.com

11 months ago

In reply to: Nisha Moond (#32)

#34

sawada.mshk@gmail.com

11 months ago

In reply to: Nisha Moond (#32)

#35

sawada.mshk@gmail.com

11 months ago

In reply to: Masahiko Sawada (#34)

#36

amit.kapila16@gmail.com

11 months ago

In reply to: Masahiko Sawada (#35)

#37

nisha.moond412@gmail.com

11 months ago

In reply to: shveta malik (#33)

#38

sawada.mshk@gmail.com

11 months ago

In reply to: Amit Kapila (#36)

#39

amit.kapila16@gmail.com

11 months ago

In reply to: Masahiko Sawada (#38)

#40

shveta.malik@gmail.com

11 months ago

In reply to: Nisha Moond (#37)

#41

houzj.fnst@fujitsu.com

11 months ago

In reply to: Amit Kapila (#22)

#42

shveta.malik@gmail.com

11 months ago

In reply to: Zhijie Hou (Fujitsu) (#41)

#43

nisha.moond412@gmail.com

11 months ago

In reply to: shveta malik (#40)

#44

houzj.fnst@fujitsu.com

11 months ago

In reply to: shveta malik (#42)

#45

sawada.mshk@gmail.com

11 months ago

In reply to: Amit Kapila (#39)

#46

sawada.mshk@gmail.com

11 months ago

In reply to: Zhijie Hou (Fujitsu) (#44)

#47

amit.kapila16@gmail.com

11 months ago

In reply to: Masahiko Sawada (#45)

#48

houzj.fnst@fujitsu.com

11 months ago

In reply to: Zhijie Hou (Fujitsu) (#44)

#49

shveta.malik@gmail.com

11 months ago

In reply to: Nisha Moond (#43)

#50

amit.kapila16@gmail.com

11 months ago

In reply to: Masahiko Sawada (#46)

#51

nisha.moond412@gmail.com

11 months ago

In reply to: shveta malik (#49)

#52

sawada.mshk@gmail.com

10 months ago

In reply to: Nisha Moond (#51)

#53

houzj.fnst@fujitsu.com

10 months ago

In reply to: Amit Kapila (#50)

#54

sawada.mshk@gmail.com

10 months ago

In reply to: Zhijie Hou (Fujitsu) (#53)

#55

nisha.moond412@gmail.com

10 months ago

In reply to: Masahiko Sawada (#52)

#56

shveta.malik@gmail.com

10 months ago

In reply to: Nisha Moond (#55)

#57

Tom Lane

tgl@sss.pgh.pa.us

10 months ago

In reply to: shveta malik (#56)

#58

sawada.mshk@gmail.com

10 months ago

In reply to: Tom Lane (#57)

#59

nisha.moond412@gmail.com

10 months ago

In reply to: shveta malik (#56)

#60

amit.kapila16@gmail.com

10 months ago

In reply to: Masahiko Sawada (#58)

#61

sawada.mshk@gmail.com

10 months ago

In reply to: Amit Kapila (#60)

#62

amit.kapila16@gmail.com

10 months ago

In reply to: Masahiko Sawada (#61)

#63

sawada.mshk@gmail.com

10 months ago

In reply to: Nisha Moond (#59)

#64

houzj.fnst@fujitsu.com

10 months ago

In reply to: Amit Kapila (#60)

#65

shveta.malik@gmail.com

10 months ago

In reply to: Nisha Moond (#59)

#66

nisha.moond412@gmail.com

10 months ago

In reply to: shveta malik (#65)

#67

shveta.malik@gmail.com

10 months ago

In reply to: Nisha Moond (#66)

#68

houzj.fnst@fujitsu.com

10 months ago

In reply to: Zhijie Hou (Fujitsu) (#64)

#69

houzj.fnst@fujitsu.com

10 months ago

In reply to: Zhijie Hou (Fujitsu) (#68)

#70

nisha.moond412@gmail.com

10 months ago

In reply to: shveta malik (#67)

#71

amit.kapila16@gmail.com

10 months ago

In reply to: Zhijie Hou (Fujitsu) (#64)

#72

shveta.malik@gmail.com

10 months ago

In reply to: Amit Kapila (#71)

#73

nisha.moond412@gmail.com

10 months ago

In reply to: Zhijie Hou (Fujitsu) (#64)

#74

sawada.mshk@gmail.com

10 months ago

In reply to: Nisha Moond (#70)

#75

sawada.mshk@gmail.com

10 months ago

In reply to: Masahiko Sawada (#74)

#76

nisha.moond412@gmail.com

10 months ago

In reply to: Masahiko Sawada (#75)

#77

sawada.mshk@gmail.com

10 months ago

In reply to: Nisha Moond (#76)

#78

nisha.moond412@gmail.com

10 months ago

In reply to: Masahiko Sawada (#77)

#79

sawada.mshk@gmail.com

10 months ago

In reply to: Nisha Moond (#78)

#80

nisha.moond412@gmail.com

9 months ago

In reply to: Masahiko Sawada (#79)

#81

amit.kapila16@gmail.com

9 months ago

In reply to: Nisha Moond (#80)

#82

nisha.moond412@gmail.com

9 months ago

In reply to: Amit Kapila (#81)

#83

sawada.mshk@gmail.com

9 months ago

In reply to: Amit Kapila (#81)

#84

sawada.mshk@gmail.com

9 months ago

In reply to: Masahiko Sawada (#83)

#85

dilipbalaut@gmail.com

9 months ago

In reply to: Nisha Moond (#82)

#86

dilipbalaut@gmail.com

9 months ago

In reply to: Dilip Kumar (#85)

#87

amit.kapila16@gmail.com

9 months ago

In reply to: Dilip Kumar (#85)

#88

dilipbalaut@gmail.com

9 months ago

In reply to: Amit Kapila (#87)

#89

nisha.moond412@gmail.com

9 months ago

In reply to: Masahiko Sawada (#83)

#90

nisha.moond412@gmail.com

9 months ago

In reply to: Dilip Kumar (#86)

#91

dilipbalaut@gmail.com

9 months ago

In reply to: Amit Kapila (#87)

#92

shveta.malik@gmail.com

9 months ago

In reply to: Nisha Moond (#90)

#93

Ajin Cherian

itsajin@gmail.com

9 months ago

In reply to: Nisha Moond (#90)

#94

sawada.mshk@gmail.com

9 months ago

In reply to: Nisha Moond (#90)

#95

Peter Smith

smithpb2250@gmail.com

9 months ago

In reply to: Ajin Cherian (#93)

#96

shveta.malik@gmail.com

9 months ago

In reply to: Masahiko Sawada (#94)

#97

nisha.moond412@gmail.com

9 months ago

In reply to: shveta malik (#92)

#98

nisha.moond412@gmail.com

9 months ago

In reply to: Peter Smith (#95)

#99

sawada.mshk@gmail.com

9 months ago

In reply to: shveta malik (#96)

#100