pgsql: Preserve conflict-relevant data during logical replication.

Started by Amit Kapila6 months ago8 messages
#1Amit Kapila
akapila@postgresql.org

Preserve conflict-relevant data during logical replication.

Logical replication requires reliable conflict detection to maintain data
consistency across nodes. To achieve this, we must prevent premature
removal of tuples deleted by other origins and their associated commit_ts
data by VACUUM, which could otherwise lead to incorrect conflict reporting
and resolution.

This patch introduces a mechanism to retain deleted tuples on the
subscriber during the application of concurrent transactions from remote
nodes. Retaining these tuples allows us to correctly ignore concurrent
updates to the same tuple. Without this, an UPDATE might be misinterpreted
as an INSERT during resolutions due to the absence of the original tuple.

Additionally, we ensure that origin metadata is not prematurely removed by
vacuum freeze, which is essential for detecting update_origin_differs and
delete_origin_differs conflicts.

To support this, a new replication slot named pg_conflict_detection is
created and maintained by the launcher on the subscriber. Each apply
worker tracks its own non-removable transaction ID, which the launcher
aggregates to determine the appropriate xmin for the slot, thereby
retaining necessary tuples.

Conflict information retention (deleted tuples and commit_ts) can be
enabled per subscription via the retain_conflict_info option. This is
disabled by default to avoid unnecessary overhead for configurations that
do not require conflict resolution or logging.

During upgrades, if any subscription on the old cluster has
retain_conflict_info enabled, a conflict detection slot will be created to
protect relevant tuples from deletion when the new cluster starts.

This is a foundational work to correctly detect update_deleted conflict
which will be done in a follow-up patch.

Author: Zhijie Hou <houzj.fnst@fujitsu.com>
Reviewed-by: shveta malik <shveta.malik@gmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Nisha Moond <nisha.moond412@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Discussion: /messages/by-id/OS0PR01MB5716BE80DAEB0EE2A6A5D1F5949D2@OS0PR01MB5716.jpnprd01.prod.outlook.com

Branch
------
master

Details
-------
https://git.postgresql.org/pg/commitdiff/228c3708685542d34e6f02c74240656327a5c622

Modified Files
--------------
doc/src/sgml/catalogs.sgml | 11 +
doc/src/sgml/config.sgml | 2 +
doc/src/sgml/func.sgml | 16 +-
doc/src/sgml/logical-replication.sgml | 32 ++
doc/src/sgml/protocol.sgml | 88 +++
doc/src/sgml/ref/alter_subscription.sgml | 18 +-
doc/src/sgml/ref/create_subscription.sgml | 87 ++-
src/backend/access/transam/twophase.c | 32 +-
src/backend/access/transam/xact.c | 18 +-
src/backend/access/transam/xlog.c | 2 +-
src/backend/access/transam/xlogrecovery.c | 2 +-
src/backend/catalog/pg_subscription.c | 1 +
src/backend/catalog/system_views.sql | 3 +-
src/backend/commands/subscriptioncmds.c | 400 +++++++++++--
.../replication/logical/applyparallelworker.c | 3 +-
src/backend/replication/logical/launcher.c | 228 +++++++-
src/backend/replication/logical/reorderbuffer.c | 2 +-
src/backend/replication/logical/tablesync.c | 3 +-
src/backend/replication/logical/worker.c | 623 ++++++++++++++++++++-
src/backend/replication/slot.c | 48 +-
src/backend/replication/walsender.c | 60 ++
src/backend/storage/ipc/procarray.c | 20 +-
src/backend/utils/adt/pg_upgrade_support.c | 19 +
src/bin/pg_dump/pg_dump.c | 18 +-
src/bin/pg_dump/pg_dump.h | 1 +
src/bin/pg_upgrade/check.c | 96 +++-
src/bin/pg_upgrade/info.c | 25 +-
src/bin/pg_upgrade/pg_upgrade.c | 60 +-
src/bin/pg_upgrade/pg_upgrade.h | 4 +-
src/bin/pg_upgrade/t/004_subscription.pl | 85 ++-
src/bin/psql/describe.c | 6 +-
src/bin/psql/tab-complete.in.c | 10 +-
src/include/catalog/catversion.h | 2 +-
src/include/catalog/pg_proc.dat | 4 +
src/include/catalog/pg_subscription.h | 5 +
src/include/commands/subscriptioncmds.h | 5 +
src/include/replication/logicallauncher.h | 3 +
src/include/replication/slot.h | 11 +-
src/include/replication/worker_internal.h | 13 +-
src/include/storage/proc.h | 8 +
src/include/storage/procarray.h | 3 +-
src/test/regress/expected/subscription.out | 168 +++---
src/test/regress/sql/subscription.sql | 11 +
src/test/subscription/t/035_conflicts.pl | 195 ++++++-
src/tools/pgindent/typedefs.list | 2 +
45 files changed, 2233 insertions(+), 220 deletions(-)

#2Michael Paquier
michael@paquier.xyz
In reply to: Amit Kapila (#1)
Re: pgsql: Preserve conflict-relevant data during logical replication.

On Wed, Jul 23, 2025 at 03:35:06AM +0000, Amit Kapila wrote:

Preserve conflict-relevant data during logical replication.

Logical replication requires reliable conflict detection to maintain data
consistency across nodes. To achieve this, we must prevent premature
removal of tuples deleted by other origins and their associated commit_ts
data by VACUUM, which could otherwise lead to incorrect conflict reporting
and resolution.

Some of the tests added by this commit are causing blurps in the CI:
https://cfbot.cputube.org/highlights/all.html

Example of one job (triggered it once myself with a separate patch):
https://cirrus-ci.com/task/5397273342902272
[21:15:03.131](0.028s) not ok 7 - altering retain_dead_tuples is
allowed for disabled subscription
[21:18:48.295](225.164s) # poll_query_until timed out executing this
query:
[21:18:48.296](0.000s) not ok 8 - the xmin value of slot
'pg_conflict_detection' is valid on Node A
[21:18:48.306](0.010s) not ok 9 - warn of the possibility of receiving
changes from origins other than the publisher
[21:18:48.412](0.037s) not ok 11 - the deleted column is non-removable
[21:22:29.286](220.874s) # poll_query_until timed out executing this
query:
[21:22:29.287](0.000s) not ok 12 - the xmin value of slot
'pg_conflict_detection' is updated on Node A
[21:22:29.297](0.010s) not ok 13 - the deleted column is removed

The failures happen on FreeBSD as far as I know, that enforces some
rules not used elsewhere:
-c debug_copy_parse_plan_trees=on
-c debug_write_read_parse_plan_trees=on
-c debug_raw_expression_coverage_test=on
-c debug_parallel_query=regress
[...]
CPPFLAGS: -DRELCACHE_FORCE_RELEASE -DENFORCE_REGRESSION_TEST_NAME_RESTRICTIONS

I suspect that reusing these options would help in reproducing the
problem. These are not commonly used in buildfarm animals, reducing
the friction to make the instability show up.
--
Michael

#3Zhijie Hou (Fujitsu)
houzj.fnst@fujitsu.com
In reply to: Michael Paquier (#2)
1 attachment(s)
RE: pgsql: Preserve conflict-relevant data during logical replication.

On Thursday, July 24, 2025 9:25 AM Michael Paquier <michael@paquier.xyz> wrote:

On Wed, Jul 23, 2025 at 03:35:06AM +0000, Amit Kapila wrote:

Preserve conflict-relevant data during logical replication.

Logical replication requires reliable conflict detection to maintain
data consistency across nodes. To achieve this, we must prevent
premature removal of tuples deleted by other origins and their
associated commit_ts data by VACUUM, which could otherwise lead to
incorrect conflict reporting and resolution.

Some of the tests added by this commit are causing blurps in the CI:
https://cfbot.cputube.org/highlights/all.html

Example of one job (triggered it once myself with a separate patch):
https://cirrus-ci.com/task/5397273342902272
[21:15:03.131](0.028s) not ok 7 - altering retain_dead_tuples is allowed for
disabled subscription
[21:18:48.295](225.164s) # poll_query_until timed out executing this
query:
[21:18:48.296](0.000s) not ok 8 - the xmin value of slot 'pg_conflict_detection'
is valid on Node A
[21:18:48.306](0.010s) not ok 9 - warn of the possibility of receiving changes
from origins other than the publisher
[21:18:48.412](0.037s) not ok 11 - the deleted column is non-removable
[21:22:29.286](220.874s) # poll_query_until timed out executing this
query:
[21:22:29.287](0.000s) not ok 12 - the xmin value of slot 'pg_conflict_detection'
is updated on Node A
[21:22:29.297](0.010s) not ok 13 - the deleted column is removed

The failures happen on FreeBSD as far as I know, that enforces some rules not
used elsewhere:
-c debug_copy_parse_plan_trees=on
-c debug_write_read_parse_plan_trees=on
-c debug_raw_expression_coverage_test=on
-c debug_parallel_query=regress
[...]
CPPFLAGS: -DRELCACHE_FORCE_RELEASE
-DENFORCE_REGRESSION_TEST_NAME_RESTRICTIONS

I suspect that reusing these options would help in reproducing the problem.
These are not commonly used in buildfarm animals, reducing the friction to
make the instability show up.

Thanks for reporting the issue!

I confirmed that the test to enable the retain_dead_tuples option for a
disabled subscription failed due to the apply worker for that subscription still
running, which caused the all subsequent tests to fail. To resolve this issue,
we need to ensure the apply worker has stopped when disabling the subscription.

2025-07-23 21:15:03.128 UTC client backend[39133] 035_conflicts.pl LOG: statement: ALTER SUBSCRIPTION tap_sub_a_b SET (retain_dead_tuples = true);
2025-07-23 21:15:03.128 UTC client backend[39133] 035_conflicts.pl ERROR: cannot alter retain_dead_tuples when logical replication worker is still running

Attached is a patch to address this problem. Apart from the reported failure,
there's another place where we did not wait for the worker to stop after
disabling the subscription. Although this hasn't resulted in a test failure so
far, I added wait logic for it in the patch as well for safety.

Best Regards,
Hou zj

Attachments:

0001-Fix-Cfbot-failure.patchapplication/octet-stream; name=0001-Fix-Cfbot-failure.patchDownload
From 15f907b4a980bdad80fae86146b143721854f500 Mon Sep 17 00:00:00 2001
From: Zhijie Hou <houzj.fnst@fujitsu.com>
Date: Thu, 24 Jul 2025 10:44:12 +0800
Subject: [PATCH] Fix Cfbot failure

The test failed to enable the retain_dead_tuples option for a disabled
subscription failed due to the apply worker for that subscription still running.
To resolve this issue, ensure the apply worker has stopped after disabling the
subscription.
---
 src/test/subscription/t/035_conflicts.pl | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/src/test/subscription/t/035_conflicts.pl b/src/test/subscription/t/035_conflicts.pl
index 7458d7fba7e..976d53a870e 100644
--- a/src/test/subscription/t/035_conflicts.pl
+++ b/src/test/subscription/t/035_conflicts.pl
@@ -228,6 +228,11 @@ ok( $stderr =~
 # Disable the subscription
 $node_A->psql('postgres', "ALTER SUBSCRIPTION $subname_AB DISABLE;");
 
+# Wait for the apply worker to stop
+$node_A->poll_query_until('postgres',
+	"SELECT count(*) = 0 FROM pg_stat_activity WHERE backend_type = 'logical replication apply worker'"
+);
+
 # Enable retain_dead_tuples for disabled subscription
 ($cmdret, $stdout, $stderr) = $node_A->psql('postgres',
 	"ALTER SUBSCRIPTION $subname_AB SET (retain_dead_tuples = true);");
@@ -278,6 +283,11 @@ is($result, qq(1|1
 # Disable the logical replication from node B to node A
 $node_A->safe_psql('postgres', "ALTER SUBSCRIPTION $subname_AB DISABLE");
 
+# Wait for the apply worker to stop
+$node_A->poll_query_until('postgres',
+	"SELECT count(*) = 0 FROM pg_stat_activity WHERE backend_type = 'logical replication apply worker'"
+);
+
 $node_B->safe_psql('postgres', "UPDATE tab SET b = 3 WHERE a = 1;");
 $node_A->safe_psql('postgres', "DELETE FROM tab WHERE a = 1;");
 
-- 
2.50.1.windows.1

#4Amit Kapila
amit.kapila16@gmail.com
In reply to: Zhijie Hou (Fujitsu) (#3)
Re: pgsql: Preserve conflict-relevant data during logical replication.

On Thu, Jul 24, 2025 at 8:27 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

On Thursday, July 24, 2025 9:25 AM Michael Paquier <michael@paquier.xyz> wrote:

On Wed, Jul 23, 2025 at 03:35:06AM +0000, Amit Kapila wrote:

Preserve conflict-relevant data during logical replication.

Logical replication requires reliable conflict detection to maintain
data consistency across nodes. To achieve this, we must prevent
premature removal of tuples deleted by other origins and their
associated commit_ts data by VACUUM, which could otherwise lead to
incorrect conflict reporting and resolution.

Some of the tests added by this commit are causing blurps in the CI:
https://cfbot.cputube.org/highlights/all.html

...

I confirmed that the test to enable the retain_dead_tuples option for a
disabled subscription failed due to the apply worker for that subscription still
running, which caused the all subsequent tests to fail. To resolve this issue,
we need to ensure the apply worker has stopped when disabling the subscription.

2025-07-23 21:15:03.128 UTC client backend[39133] 035_conflicts.pl LOG: statement: ALTER SUBSCRIPTION tap_sub_a_b SET (retain_dead_tuples = true);
2025-07-23 21:15:03.128 UTC client backend[39133] 035_conflicts.pl ERROR: cannot alter retain_dead_tuples when logical replication worker is still running

Attached is a patch to address this problem. Apart from the reported failure,
there's another place where we did not wait for the worker to stop after
disabling the subscription. Although this hasn't resulted in a test failure so
far, I added wait logic for it in the patch as well for safety.

The fix looks good to me. I'll push your patch in sometime.

--
With Regards,
Amit Kapila.

#5Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#4)
Re: pgsql: Preserve conflict-relevant data during logical replication.

On Wed, Jul 23, 2025 at 11:44 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

The fix looks good to me. I'll push your patch in sometime.

The tests in this patch are insufficient to prove that this logic
works properly. I tried with this patch:

diff --git a/src/backend/access/transam/twophase.c
b/src/backend/access/transam/twophase.c
index 7918176fc58..7c1d0ef07b5 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -2325,6 +2325,16 @@ RecordTransactionCommitPrepared(TransactionId xid,
     TimestampTz committs;
     bool        replorigin;
+    /*
+     * Note it is important to set committs value after marking ourselves as
+     * in the commit critical section (DELAY_CHKPT_IN_COMMIT). This is because
+     * we want to ensure all transactions that have acquired commit timestamp
+     * are finished before we allow the logical replication client to advance
+     * its xid which is used to hold back dead rows for conflict detection.
+     * See comments atop worker.c.
+     */
+    committs = GetCurrentTimestamp();
+
     /*
      * Are we using the replication origins feature?  Or, in other words, are
      * we replaying remote actions?
@@ -2344,16 +2354,6 @@ RecordTransactionCommitPrepared(TransactionId xid,
      */
     pg_write_barrier();

- /*
- * Note it is important to set committs value after marking ourselves as
- * in the commit critical section (DELAY_CHKPT_IN_COMMIT). This is because
- * we want to ensure all transactions that have acquired commit timestamp
- * are finished before we allow the logical replication client to advance
- * its xid which is used to hold back dead rows for conflict detection.
- * See comments atop worker.c.
- */
- committs = GetCurrentTimestamp();
-
/*
* Emit the XLOG commit record. Note that we mark 2PC commits as
* potentially having AccessExclusiveLocks since we don't know whether or

If it is in fact important to acquire the commit timestamp after
setting delayChkptFlags, you'd hope this would lead to a test failure,
but it doesn't for me. I understand it probably requires an injection
point to be certain of hitting the race condition, but I think that
would be worth doing. Otherwise, if something gets broken here by
accident, it might be a long time before anyone notices.

--
Robert Haas
EDB: http://www.enterprisedb.com

#6Zhijie Hou (Fujitsu)
houzj.fnst@fujitsu.com
In reply to: Robert Haas (#5)
RE: pgsql: Preserve conflict-relevant data during logical replication.

On Tuesday, September 2, 2025 11:03 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Jul 23, 2025 at 11:44 PM Amit Kapila <amit.kapila16@gmail.com>
wrote:

The fix looks good to me. I'll push your patch in sometime.

The tests in this patch are insufficient to prove that this logic works properly. I
tried with this patch:

diff --git a/src/backend/access/transam/twophase.c
b/src/backend/access/transam/twophase.c
index 7918176fc58..7c1d0ef07b5 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -2325,6 +2325,16 @@ RecordTransactionCommitPrepared(TransactionId
xid,
TimestampTz committs;
bool        replorigin;
+    /*
+     * Note it is important to set committs value after marking ourselves as
+     * in the commit critical section (DELAY_CHKPT_IN_COMMIT). This is
because
+     * we want to ensure all transactions that have acquired commit
timestamp
+     * are finished before we allow the logical replication client to advance
+     * its xid which is used to hold back dead rows for conflict detection.
+     * See comments atop worker.c.
+     */
+    committs = GetCurrentTimestamp();
+
/*
* Are we using the replication origins feature?  Or, in other words, are
* we replaying remote actions?
@@ -2344,16 +2354,6 @@ RecordTransactionCommitPrepared(TransactionId
xid,
*/
pg_write_barrier();

- /*
- * Note it is important to set committs value after marking ourselves as
- * in the commit critical section (DELAY_CHKPT_IN_COMMIT). This is
because
- * we want to ensure all transactions that have acquired commit
timestamp
- * are finished before we allow the logical replication client to advance
- * its xid which is used to hold back dead rows for conflict detection.
- * See comments atop worker.c.
- */
- committs = GetCurrentTimestamp();
-
/*
* Emit the XLOG commit record. Note that we mark 2PC commits as
* potentially having AccessExclusiveLocks since we don't know whether
or

If it is in fact important to acquire the commit timestamp after setting
delayChkptFlags, you'd hope this would lead to a test failure, but it doesn't for
me. I understand it probably requires an injection point to be certain of hitting
the race condition, but I think that would be worth doing. Otherwise, if
something gets broken here by accident, it might be a long time before anyone
notices.

Thanks for pointing it out!

I agree that adding a test is valuable to mitigate the risk of future code
changes. We added a similar safeguard for the RecordTransactionCommit() function
by adding Assert(xactStopTimestamp == 0) after marking the DELAY_CHKPT_xxx flag,
and did not do any precautionary check for RecordTransactionCommitPrepared as
the code to acquire the timestamp and setting flag was close by and had explicit
comments.

I'll prepare a test case that uses the injection point and share it in the
original thread.

Best Regards,
Hou zj

#7Amit Kapila
amit.kapila16@gmail.com
In reply to: Zhijie Hou (Fujitsu) (#6)
Re: pgsql: Preserve conflict-relevant data during logical replication.

On Wed, Sep 3, 2025 at 5:10 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

On Tuesday, September 2, 2025 11:03 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Jul 23, 2025 at 11:44 PM Amit Kapila <amit.kapila16@gmail.com>
wrote:

If it is in fact important to acquire the commit timestamp after setting
delayChkptFlags, you'd hope this would lead to a test failure, but it doesn't for
me. I understand it probably requires an injection point to be certain of hitting
the race condition, but I think that would be worth doing. Otherwise, if
something gets broken here by accident, it might be a long time before anyone
notices.

Thanks for pointing it out!

I agree that adding a test is valuable to mitigate the risk of future code
changes. We added a similar safeguard for the RecordTransactionCommit() function
by adding Assert(xactStopTimestamp == 0) after marking the DELAY_CHKPT_xxx flag,
and did not do any precautionary check for RecordTransactionCommitPrepared as
the code to acquire the timestamp and setting flag was close by and had explicit
comments.

I'll prepare a test case that uses the injection point and share it in the
original thread.

The test for this case is added in commit
6456c6e2c4ad1cf9752e09cce37bfcfe2190c5e0.

--
With Regards,
Amit Kapila.

#8Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#7)
Re: pgsql: Preserve conflict-relevant data during logical replication.

On Wed, Sep 10, 2025 at 5:11 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

The test for this case is added in commit
6456c6e2c4ad1cf9752e09cce37bfcfe2190c5e0.

Thanks!

--
Robert Haas
EDB: http://www.enterprisedb.com