Orphaned records in pg_replication_origin_status after subscription drop
Hi all,
(Added Amit K. in CC.)
Some colleagues have reported that it is possible to end up with
orphaned records in pg_replication_origin_status, as an effect of
table synchronization workers missing some cleanup actions around
replorigin_drop_by_name() when the subscription that spawned these
workers (which themselves create origins through replorigin_create()
called in LogicalRepSyncTableStart()) is dropped.
Please find attached a small script, courtesy of Tenglong Gu and
Daisuke Higuchi, who have been digging into all that.
This problem has existed since ce0fdbfe9722, which added the creation
of a replication origin for table synchronization workers, so it
affects 14 up to HEAD. One issue with these origin states is that they
are sticky: a restart of the node that held the subscription keeps
them around, because they get flushed in replorigin_checkpoint. If my
understanding is right, this also means that the origin slots are
still tracked as in use, implying that replorigin_session_setup()
would not be able to reuse them.
So, in summary (please correct me if necessary), DropSubscription()
attempts a round of cleanup for all the replication origins with a
loop over ReplicationOriginNameForLogicalRep(), and while in this case
we attempt to drop the origins created by the workers that are tracked
as orphaned, the syscache lookup of REPLORIGIDENT fails within the
DROP SUBSCRIPTION because replorigin_by_name() cannot find an entry:
the replication origin is already gone, but its state has persisted in
memory. Then, why would the replication origin already be gone with
its state in memory not yet cleaned up? Well, the problematic test
case shows that the subscription is dropped while the spawned
tablesync workers do the initial table copy, and the replication
origin is created in the same transaction as the COPY. DROP
SUBSCRIPTION tells the tablesync workers to stop, they stop with the
COPY failing, and the origin is not seen as something that exists by
the session that drops the subscription. The replication state in
memory goes out of sync.
So it looks to me like we have a code path where the replication
origin is dropped but we neglect to reset the replication state in
memory, leading to bloat of the available origin slots. worker.c does
things differently: a small transaction is used for the origin
creation, hence we would never really see the replication state in
memory going out of sync with the replorigin catalog.
One solution would be for tablesync.c to do the same as worker.c: we
could use a short transaction that sets the memory state and creates
the replication origin, then move to a second transaction for the
COPY. A second solution, slightly more complicated, is to create some
specific logic to reset the progress state of the origin that's been
created in the transaction that runs the initial COPY, which I guess
should be in the shape of a transaction abort callback. All these
symptoms are pointing out that it is not a good idea, IMO, to expect a
replication origin to exist when dropping a subscription when an
origin has been created in the context of a transaction that can take
a very long time to run, aka the initial tablesync's COPY, so using a
short transaction feels like a good thing to do here.
I have to admit that this code has become much more complicated over
the last couple of years, hence I may not have the full context of
how these cleanup actions should be achieved. They clearly should
happen though, but I am not completely sure which solution would be
better than the other. There may be an entirely different solution,
as well.
Note that it would not be complicated to create a regression test
based on an injection point: waiting during the COPY of a tablesync
worker would be enough while issuing a DROP SUBSCRIPTION.
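For reference, the scenario can also be sketched manually like this
(all object names here are hypothetical, and without an injection
point the window is racy: the DROP has to land while the tablesync
COPY is still running):

```sql
-- Publisher: a table large enough that the initial COPY takes a while.
CREATE TABLE big_tab (id int PRIMARY KEY, payload text);
INSERT INTO big_tab
  SELECT g, repeat('x', 1000) FROM generate_series(1, 1000000) g;
CREATE PUBLICATION pub_test FOR TABLE big_tab;

-- Subscriber:
CREATE TABLE big_tab (id int PRIMARY KEY, payload text);
CREATE SUBSCRIPTION sub_test
  CONNECTION 'host=publisher dbname=postgres'
  PUBLICATION pub_test;

-- While the tablesync worker is still doing its initial COPY:
DROP SUBSCRIPTION sub_test;

-- Without a fix, the in-memory state of the tablesync origin can
-- survive the drop and still show up here:
SELECT * FROM pg_replication_origin_status;
```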
So, thoughts or comments?
--
Michael
On Fri, Dec 19, 2025 at 10:42 AM Michael Paquier <michael@paquier.xyz> wrote:
One solution would be for tablesync.c to do the same as worker.c: we
could use a short transaction that sets the memory state and creates
the replication origin, then move to a second transaction for the
COPY. A second solution, slightly more complicated, is to create some
specific logic to reset the progress state of the origin that's been
created in the transaction that runs the initial COPY, which I guess
should be in the shape of a transaction abort callback. All these
symptoms are pointing out that it is not a good idea, IMO, to expect a
replication origin to exist when dropping a subscription when an
origin has been created in the context of a transaction that can take
a very long time to run, aka the initial tablesync's COPY, so using a
short transaction feels like a good thing to do here.
Yeah, either of these ways would be okay. The only minor downside of
creating the replication origin in a separate transaction is that the
other operations on the origin still need to be done at their current
place, so the origin handling will look a bit scattered. So, I have a
slight preference for the PG_ENSURE_ERROR_CLEANUP solution, where the
callback should clear the origin state via replorigin_state_clear().
--
With Regards,
Amit Kapila.
On Fri, Dec 19, 2025 at 2:18 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Fri, Dec 19, 2025 at 10:42 AM Michael Paquier <michael@paquier.xyz> wrote:
So it looks like to me that we are missing a code path where the
replication origin is dropped but we just ignore to reset the
replication state in memory, leading to bloat of the origin slots
available. worker.c does things differently: a small transaction is
used for the origin creation, hence we would never really see the
replication state memory going out-of-sync with the replorigin
catalog.
IIUC a similar issue happens also when the tablesync worker simply
errors out for some reason (e.g., conflicts), but in this case the
tablesync worker would retry from scratch. While the replication
origin record is deleted during the rollback, the origin slot is still
marked in-use but not acquired by anyone. After the tablesync worker
restarts, it would get a new origin ID and might re-use the orphaned
slot if the origin ID matches.
One solution would be for tablesync.c to do the same as worker.c: we
could use a short transaction that sets the memory state and creates
the replication origin, then move to a second transaction for the
COPY.
Also, the tablesync worker should accept the case where the
replication origin already exists in the error cases, instead of
raising an error:

    ereport(ERROR,
            (errcode(ERRCODE_DUPLICATE_OBJECT),
             errmsg("replication origin \"%s\" already exists",
                    originname)));
A second solution, slightly more complicated, is to create some
specific logic to reset the progress state of the origin that's been
created in the transaction that runs the initial COPY, which I guess
should be in the shape of a transaction abort callback.
I find that it's too much of a change for backpatching.
So, I have a slight
preference towards PG_ENSURE_ERROR_CLEANUP solution where the callback
should clear origin state via replorigin_state_clear.
It would work too. Or I think we can do a similar thing in
replorigin_reset() for tablesync workers that are in the
SUBREL_STATE_DATASYNC state. Both ways require exposing
replorigin_state_clear(), though.
I am slightly leaning towards the idea of using a short transaction,
because the tablesync worker would then do things closer to how the
apply workers do them, and it would also fix the odd behavior that
pg_replication_origin_status shows NULL in the external_id column for
these origins while the COPY is being executed. But we need to verify
if it's really okay to reuse the existing origin, instead of raising
an error, in the case where the tablesync worker retries the data
copy.
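On the NULL external_id point, a quick way to spot such slots, if I am
reading the view's columns correctly, could be:

```sql
-- Origin slots that exist in shared memory but have no visible catalog
-- entry show a NULL external_id; any such row that remains after the
-- subscription is gone is one of the orphaned slots discussed here.
SELECT local_id, external_id, remote_lsn, local_lsn
  FROM pg_replication_origin_status
 WHERE external_id IS NULL;
```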
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Fri, Dec 19, 2025 at 01:56:38PM -0800, Masahiko Sawada wrote:
It would work too. Or I think we can do a similar thing in
replorigin_reset() for tablesync workers who are in
SUBREL_STATE_DATASYNC state. Both ways require exposing
replorigin_state_clear(), though.
Exposing replorigin_state_clear() feels a bit weird here for something
that's reserved for origin.c only, though it would still be OK..
I am slightly leaning towards the idea of using a short transaction
because the tablesync worker would do things closer to the apply
workers and it would also fix the odd behavior that
pg_replication_origin_status shows NULL in the external_id column for
the origins while the COPY is being executed. But we need to verify if
it's really okay to reuse the existing origin instead of raising an
error in case where the tablesync worker retries to the data copy.
I also lean towards more consistency between the worker and tablesync
code, not less. Hence I'd prefer a short transaction at the beginning
of the tablesync startup, so that we can just rely on the DROP
SUBSCRIPTION code path to ensure that the origin is removed from the
catalogs and cleaned up in memory.
The case of pg_replication_origin_status showing NULL while the COPY
is executing is an interesting point; I didn't think about it
yesterday, but yes, that seems like a good thing to have anyway.
--
Michael
On Sat, Dec 20, 2025 at 3:27 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
I am slightly leaning towards the idea of using a short transaction
because the tablesync worker would do things closer to the apply
workers and it would also fix the odd behavior that
pg_replication_origin_status shows NULL in the external_id column for
the origins while the COPY is being executed. But we need to verify if
it's really okay to reuse the existing origin instead of raising an
error in case where the tablesync worker retries to the data copy.
As of today, I can't think of a case where, next time (restart after
an error), we won't generate the same origin_name, but I think this
will add a dependency that the origin name must be generated the same
way each time. Also, moving just the replorigin_create() part (leaving
other things like the origin advance at their current location) would
need some explanation in the comments, namely that it has a dependency
on DropSubscription() and its cleanup. Anyway, if we want to go with
creating the origin in a transaction separate from the COPY, then we
should change a few comments in the code, like: "It is possible that
the origin is not yet created for tablesync worker, this can happen
for the states before SUBREL_STATE_FINISHEDCOPY."
--
With Regards,
Amit Kapila.
On Sat, Dec 20, 2025 at 02:55:15PM +0530, Amit Kapila wrote:
As of today, I can't think of a case where next time (restart after
error) we won't generate the same origin_name but I think this will
add the dependency that each time the origin name should be generated
the same.
ReplicationOriginNameForLogicalRep() would generate the origin name as
pg_suboid_relid, so we would always reuse the same origin name on
restart as long as the subscription is not recreated, wouldn't we?
Also, moving just repl_origin_create part (leaving other
things like origin_advance at its location) would need some
explanation in comments which is that it has some dependency on
DropSubscription and cleanup. Anyway, if we want to go with creating
origin in a separate transaction than COPY, then we should change few
comments like: "It is possible that the origin is not yet created for
tablesync worker, this can happen for the states before
SUBREL_STATE_FINISHEDCOPY." in the code.
Hmm. So... Do you think that it would be OK to just create a new
transaction before we open the transaction that locks
MyLogicalRepWorker->relid (the one that opens a BEGIN READ ONLY on the
remote side)? And I guess that we would just move the
replorigin_by_name(), replorigin_create() and the ERROR into this new
transaction? Setting up replorigin_session_setup() and
replorigin_session_origin could then be delayed until we have done the
replorigin_advance() in the BEGIN READ ONLY transaction? By that I
mean to leave the replorigin_advance() position untouched. I have
studied this code quite a bit. I "think" that something along these
lines would work, but I am not 100% sure if things are OK based on the
relstate we store in each of these phases. If you have an argument
that invalidates these lines, please feel free!
--
Michael
On Monday, December 22, 2025 7:01 AM Michael Paquier <michael@paquier.xyz> wrote:
Hmm. So... Do you think that it should be OK to just create a new
transaction before we open the transaction that locks
MyLogicalRepWorker->relid (one that opens a BEGIN READ ONLY on the
remote side)? And I guess that we would just move the
replorigin_by_name(), replorigin_create() and ERROR into this new
transaction? Setting up replorigin_session_setup() and
replorigin_session_origin could then be delayed until we have done the
replorigin_advance() in the BEGIN READ ONLY transaction? By that I
mean to leave the replorigin_advance() position untouched.
I think the solution you outlined works. One nitpick is that, instead of
starting a new transaction, we could create the origin within the same
transaction that updates the DATASYNC states, thereby ensuring the origin
information is available as soon as the catalog is updated. I think the bug
won't happen as long as the origin is created in a transaction separate
from the one setting up the shared memory state.
Additionally, if we relocate the origin creation code, we need to remove the
ERROR report. This is because the origin may already exist if a table sync
restarts due to an ERROR during the initial COPY. This should be safe since the
origin is created with the reserved name "pg_xx," preventing users from creating
another origin with the same name.
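To illustrate the reserved-name point (the origin name below is made
up, and the exact error wording may differ between versions):

```sql
-- Names starting with "pg_" are reserved for internally-created
-- replication origins, so a user cannot create one that would collide
-- with a tablesync origin such as pg_16394_16402:
SELECT pg_replication_origin_create('pg_16394_16402');
-- fails with an ERROR reporting that the origin name is reserved
```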
I'm attaching a small patch for reference. To verify the fix, I've
added a simple test in 004_sync.pl, which was introduced along with
ce0fdbfe9722. This test can reproduce the issue (without the fix) by
intentionally causing the initial COPY to fail.
Best Regards,
Hou zj
Attachments:
v1-0001-Fix-orphaned-origin-in-shared-memory-after-subscr.patch (application/octet-stream)
From 0ba8ae921ed4807c0eb5e8c7909e160547c5e678 Mon Sep 17 00:00:00 2001
From: Zhijie Hou <houzj.fnst@fujitsu.com>
Date: Fri, 19 Dec 2025 15:25:50 +0800
Subject: [PATCH v1] Fix orphaned origin in shared memory after subscription
drop
Currently, we store replication origin information in both the catalog and
shared memory (once the origin is set up). During a transaction, if a
replication origin is created and then set up, but the transaction subsequently
aborts, only the catalog changes are rolled back, leaving an orphaned origin in
shared memory. This prevents others from reusing the position.
This issue is more likely to happen during table synchronization, where the
replication origin is created within the same transaction that performs the
initial copy. If the COPY operation fails, the origin in shared memory remains
uncleared.
To address this, the commit moves origin creation to the end of a separate
transaction preceding the one executing the initial COPY. This change minimizes
the chances of a transaction abort occurring immediately after origin creation.
---
src/backend/commands/subscriptioncmds.c | 10 ++---
src/backend/replication/logical/tablesync.c | 50 ++++++++++-----------
src/test/subscription/t/004_sync.pl | 7 +++
3 files changed, 35 insertions(+), 32 deletions(-)
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index abbcaff0838..921cd9674f1 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -1099,10 +1099,10 @@ AlterSubscription_refresh(Subscription *sub, bool copy_data,
*
* It is possible that the origin is not yet created for
* tablesync worker, this can happen for the states before
- * SUBREL_STATE_FINISHEDCOPY. The tablesync worker or
- * apply worker can also concurrently try to drop the
- * origin and by this time the origin might be already
- * removed. For these reasons, passing missing_ok = true.
+ * SUBREL_STATE_DATASYNC. The tablesync worker or apply
+ * worker can also concurrently try to drop the origin and
+ * by this time the origin might be already removed. For
+ * these reasons, passing missing_ok = true.
*/
ReplicationOriginNameForLogicalRep(sub->oid, relid, originname,
sizeof(originname));
@@ -2174,7 +2174,7 @@ DropSubscription(DropSubscriptionStmt *stmt, bool isTopLevel)
*
* It is possible that the origin is not yet created for tablesync
* worker so passing missing_ok = true. This can happen for the states
- * before SUBREL_STATE_FINISHEDCOPY.
+ * before SUBREL_STATE_DATASYNC.
*/
ReplicationOriginNameForLogicalRep(subid, relid, originname,
sizeof(originname));
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 6bb0cbeedad..79d2758d02c 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -1340,6 +1340,17 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
MyLogicalRepWorker->relstate,
MyLogicalRepWorker->relstate_lsn,
false);
+
+ /*
+ * Create the replication origin in a separate transaction from the one
+ * that sets up the origin in shared memory. This prevents the risk that
+ * changes to the origin in shared memory cannot be rolled back if the
+ * transaction aborts.
+ */
+ originid = replorigin_by_name(originname, true);
+ if (!OidIsValid(originid))
+ originid = replorigin_create(originname);
+
CommitTransactionCommand();
pgstat_report_stat(true);
@@ -1378,38 +1389,23 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
MySubscription->failover,
CRS_USE_SNAPSHOT, origin_startpos);
+ /*
+ * Advance the origin to the LSN got from walrcv_create_slot. This is WAL
+ * logged for the purpose of recovery. Locks are to prevent the
+ * replication origin from vanishing while advancing.
+ */
+ LockRelationOid(ReplicationOriginRelationId, RowExclusiveLock);
+ replorigin_advance(originid, *origin_startpos, InvalidXLogRecPtr,
+ true /* go backward */ , true /* WAL log */ );
+ UnlockRelationOid(ReplicationOriginRelationId, RowExclusiveLock);
+
/*
* Setup replication origin tracking. The purpose of doing this before the
* copy is to avoid doing the copy again due to any error in setting up
* origin tracking.
*/
- originid = replorigin_by_name(originname, true);
- if (!OidIsValid(originid))
- {
- /*
- * Origin tracking does not exist, so create it now.
- *
- * Then advance to the LSN got from walrcv_create_slot. This is WAL
- * logged for the purpose of recovery. Locks are to prevent the
- * replication origin from vanishing while advancing.
- */
- originid = replorigin_create(originname);
-
- LockRelationOid(ReplicationOriginRelationId, RowExclusiveLock);
- replorigin_advance(originid, *origin_startpos, InvalidXLogRecPtr,
- true /* go backward */ , true /* WAL log */ );
- UnlockRelationOid(ReplicationOriginRelationId, RowExclusiveLock);
-
- replorigin_session_setup(originid, 0);
- replorigin_session_origin = originid;
- }
- else
- {
- ereport(ERROR,
- (errcode(ERRCODE_DUPLICATE_OBJECT),
- errmsg("replication origin \"%s\" already exists",
- originname)));
- }
+ replorigin_session_setup(originid, 0);
+ replorigin_session_origin = originid;
/*
* If the user did not opt to run as the owner of the subscription
diff --git a/src/test/subscription/t/004_sync.pl b/src/test/subscription/t/004_sync.pl
index d5eac05a3b3..5efd7e116f0 100644
--- a/src/test/subscription/t/004_sync.pl
+++ b/src/test/subscription/t/004_sync.pl
@@ -172,6 +172,13 @@ ok( $node_publisher->poll_query_until(
'postgres', 'SELECT count(*) = 0 FROM pg_replication_slots'),
'DROP SUBSCRIPTION during error can clean up the slots on the publisher');
+# After dropping the subscription, all replication origins, whether created by
+# an apply worker or table sync worker, should have been cleaned up.
+$result = $node_subscriber->safe_psql('postgres',
+ "SELECT count(*) FROM pg_replication_origin_status");
+is($result, qq(0),
+ 'all replication origins have been cleaned up');
+
$node_subscriber->stop('fast');
$node_publisher->stop('fast');
--
2.31.1
On Mon, Dec 22, 2025 at 4:31 AM Michael Paquier <michael@paquier.xyz> wrote:
On Sat, Dec 20, 2025 at 02:55:15PM +0530, Amit Kapila wrote:
As of today, I can't think of a case where next time (restart after
error) we won't generate the same origin_name but I think this will
add the dependency that each time the origin name should be generated
the same.

ReplicationOriginNameForLogicalRep() would generate the origin name as
pg_suboid_relid, so we would always reuse the same origin name on
restart as long as the subscription is not recreated, wouldn't we?
Yes. I had thought about whether there is any way the relid or subid
could change between restarts of the tablesync worker, but I can't
think of any. So, it sounds safe.
--
With Regards,
Amit Kapila.
On Mon, Dec 22, 2025 at 10:16 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
I think the solution you outlined works. One nitpick is that, instead
of starting a new transaction, we could create the origin within the
same transaction that updates the DATASYNC states, thereby ensuring
the origin information is available as soon as the catalog is updated.
I think the bug won't happen as long as the origin is created in a
transaction separate from the one setting up the shared memory state.
Agreed. But please update the comment (/* Update the state and make it
visible to others. */) just before that transaction to reflect that
the origin will also be created in this transaction.
Additionally, if we relocate the origin creation code, we need to remove the
ERROR report. This is because the origin may already exist if a table sync
restarts due to an ERROR during the initial COPY. This should be safe since the
origin is created with the reserved name "pg_xx," preventing users from creating
another origin with the same name.
Right.
+ /*
+ * Advance the origin to the LSN got from walrcv_create_slot. This is WAL
+ * logged for the purpose of recovery. Locks are to prevent the
+ * replication origin from vanishing while advancing.
+ */
+ LockRelationOid(ReplicationOriginRelationId, RowExclusiveLock);
+ replorigin_advance(originid, *origin_startpos, InvalidXLogRecPtr,
+ true /* go backward */ , true /* WAL log */ );
+ UnlockRelationOid(ReplicationOriginRelationId, RowExclusiveLock);
+
/*
* Setup replication origin tracking. The purpose of doing this before the
* copy is to avoid doing the copy again due to any error in setting up
* origin tracking.
*/
Shouldn't the second comment starting from "Setup replication origin
.." be merged with the previous one because it is also intended for
the previous code change?
--
With Regards,
Amit Kapila.
On Monday, December 22, 2025 2:09 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Mon, Dec 22, 2025 at 10:16 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:On Monday, December 22, 2025 7:01 AM Michael Paquier
<michael@paquier.xyz> wrote:
On Sat, Dec 20, 2025 at 02:55:15PM +0530, Amit Kapila wrote:
As of today, I can't think of a case where next time (restart
after
error) we won't generate the same origin_name but I think this
will add the dependency that each time the origin name should be
generated the same.ReplicationOriginNameForLogicalRep() would generate the origin name
as pg_suboid_relid, so we would always reuse the same origin name on
restart as long as the subscription is not recreated, wouldn't we?Also, moving just repl_origin_create part (leaving other things
like origin_advance at its location) would need some explanation
in comments which is that it has some dependency on
DropSubscription and cleanup. Anyway, if we want to go with
creating origin in a separate transaction than COPY, then we
should change few comments like: "It is possible that the origin
is not yet created for tablesync worker, this can happen for the
states before SUBREL_STATE_FINISHEDCOPY." in the code.Hmm. So... Do you think that it should be OK to just create a new
transaction before we open the transaction that locks
MyLogicalRepWorker->relid (one that opens a BEGIN READ ONLY on the
remote side)? And I guess that we would just move the
replorigin_by_name(),
replorigin_create() and ERROR into this new transaction? Setting up
replorigin_session_setup() and replorigin_session_origin could then
be delayed until we have done the
replorigin_advance() in BEGIN READ ONLY transaction? By that I mean
to leave the replorigin_advance() position untouched. I have
studied this code quite a bit. I "think" that something along these
lines would work, but I am not 100% sure if things are OK based on the
relstate we store in each of these phases. If you have an argument that
invalidates these lines, please feel free!
I think the solution you outlined works. One nitpick is that, instead
of starting a new transaction, we could create the origin within the
same transaction that updates the DATASYNC states, thereby ensuring
the origin information is available as soon as the catalog is updated.
I think the bug won't happen as long as the origin is created in a
transaction separate from the one setting up the shared memory state.

Agreed. But please update the comment (/* Update the state and make it
visible to others. */) just before that transaction to reflect that origin will also
be created in this transaction.
Updated.
Additionally, if we relocate the origin creation code, we need to
remove the ERROR report. This is because the origin may already exist
if a table sync restarts due to an ERROR during the initial COPY. This
should be safe since the origin is created with the reserved name
"pg_xx," preventing users from creating another origin with the same name.

Right.
+	/*
+	 * Advance the origin to the LSN got from walrcv_create_slot. This is WAL
+	 * logged for the purpose of recovery. Locks are to prevent the
+	 * replication origin from vanishing while advancing.
+	 */
+	LockRelationOid(ReplicationOriginRelationId, RowExclusiveLock);
+	replorigin_advance(originid, *origin_startpos, InvalidXLogRecPtr,
+					   true /* go backward */ , true /* WAL log */ );
+	UnlockRelationOid(ReplicationOriginRelationId, RowExclusiveLock);
+
	/*
	 * Setup replication origin tracking. The purpose of doing this before the
	 * copy is to avoid doing the copy again due to any error in setting up
	 * origin tracking.
	 */

Shouldn't the second comment starting from "Setup replication origin .." be
merged with the previous one because it is also intended for the previous
code change?
Agreed and merged.
Here is the V2 patch which addressed above comments.
Best Regards,
Hou zj
Attachments:
v2-0001-Fix-orphaned-origin-in-shared-memory-after-subscr.patch (application/octet-stream)
From 630ad065e680d950950cc656fdfab768ae5af0e8 Mon Sep 17 00:00:00 2001
From: Zhijie Hou <houzj.fnst@fujitsu.com>
Date: Fri, 19 Dec 2025 15:25:50 +0800
Subject: [PATCH v2] Fix orphaned origin in shared memory after subscription
drop
Currently, we store replication origin information in both the catalog and
shared memory (once the origin is set up). During a transaction, if a
replication origin is created and then set up, but the transaction subsequently
aborts, only the catalog changes are rolled back, leaving an orphaned origin in
shared memory. This is confusing because users can still see orphaned records
in pg_replication_origin_status even after dropping the subscription, and it
prevents other origins from reusing the shared-memory slot.
This issue is more likely to happen during table synchronization, where the
replication origin is created within the same transaction that performs the
initial copy. If the COPY operation fails, the origin in shared memory remains
uncleared.
To address this, the commit moves origin creation to the end of a separate
transaction preceding the one executing the initial COPY. This prevents the
risk that changes to the origin in shared memory cannot be rolled back if the
transaction aborts.
---
src/backend/commands/subscriptioncmds.c | 10 ++--
src/backend/replication/logical/tablesync.c | 58 ++++++++++-----------
src/test/subscription/t/004_sync.pl | 7 +++
3 files changed, 40 insertions(+), 35 deletions(-)
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index abbcaff0838..921cd9674f1 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -1099,10 +1099,10 @@ AlterSubscription_refresh(Subscription *sub, bool copy_data,
*
* It is possible that the origin is not yet created for
* tablesync worker, this can happen for the states before
- * SUBREL_STATE_FINISHEDCOPY. The tablesync worker or
- * apply worker can also concurrently try to drop the
- * origin and by this time the origin might be already
- * removed. For these reasons, passing missing_ok = true.
+ * SUBREL_STATE_DATASYNC. The tablesync worker or apply
+ * worker can also concurrently try to drop the origin and
+ * by this time the origin might be already removed. For
+ * these reasons, passing missing_ok = true.
*/
ReplicationOriginNameForLogicalRep(sub->oid, relid, originname,
sizeof(originname));
@@ -2174,7 +2174,7 @@ DropSubscription(DropSubscriptionStmt *stmt, bool isTopLevel)
*
* It is possible that the origin is not yet created for tablesync
* worker so passing missing_ok = true. This can happen for the states
- * before SUBREL_STATE_FINISHEDCOPY.
+ * before SUBREL_STATE_DATASYNC.
*/
ReplicationOriginNameForLogicalRep(subid, relid, originname,
sizeof(originname));
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 6bb0cbeedad..2522e372036 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -1333,13 +1333,27 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
MyLogicalRepWorker->relstate_lsn = InvalidXLogRecPtr;
SpinLockRelease(&MyLogicalRepWorker->relmutex);
- /* Update the state and make it visible to others. */
+ /*
+ * Update the state, create the replication origin, and make them visible
+ * to others.
+ */
StartTransactionCommand();
UpdateSubscriptionRelState(MyLogicalRepWorker->subid,
MyLogicalRepWorker->relid,
MyLogicalRepWorker->relstate,
MyLogicalRepWorker->relstate_lsn,
false);
+
+ /*
+ * Create the replication origin in a separate transaction from the one
+ * that sets up the origin in shared memory. This prevents the risk that
+ * changes to the origin in shared memory cannot be rolled back if the
+ * transaction aborts.
+ */
+ originid = replorigin_by_name(originname, true);
+ if (!OidIsValid(originid))
+ originid = replorigin_create(originname);
+
CommitTransactionCommand();
pgstat_report_stat(true);
@@ -1379,37 +1393,21 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
CRS_USE_SNAPSHOT, origin_startpos);
/*
- * Setup replication origin tracking. The purpose of doing this before the
- * copy is to avoid doing the copy again due to any error in setting up
- * origin tracking.
+ * Advance the origin to the LSN got from walrcv_create_slot and then set
+ * up the origin. The advancement is WAL logged for the purpose of
+ * recovery. Locks are to prevent the replication origin from vanishing
+ * while advancing.
+ *
+ * The purpose of doing these before the copy is to avoid doing the copy
+ * again due to any error in advancing or setting up origin tracking.
*/
- originid = replorigin_by_name(originname, true);
- if (!OidIsValid(originid))
- {
- /*
- * Origin tracking does not exist, so create it now.
- *
- * Then advance to the LSN got from walrcv_create_slot. This is WAL
- * logged for the purpose of recovery. Locks are to prevent the
- * replication origin from vanishing while advancing.
- */
- originid = replorigin_create(originname);
-
- LockRelationOid(ReplicationOriginRelationId, RowExclusiveLock);
- replorigin_advance(originid, *origin_startpos, InvalidXLogRecPtr,
- true /* go backward */ , true /* WAL log */ );
- UnlockRelationOid(ReplicationOriginRelationId, RowExclusiveLock);
+ LockRelationOid(ReplicationOriginRelationId, RowExclusiveLock);
+ replorigin_advance(originid, *origin_startpos, InvalidXLogRecPtr,
+ true /* go backward */ , true /* WAL log */ );
+ UnlockRelationOid(ReplicationOriginRelationId, RowExclusiveLock);
- replorigin_session_setup(originid, 0);
- replorigin_session_origin = originid;
- }
- else
- {
- ereport(ERROR,
- (errcode(ERRCODE_DUPLICATE_OBJECT),
- errmsg("replication origin \"%s\" already exists",
- originname)));
- }
+ replorigin_session_setup(originid, 0);
+ replorigin_session_origin = originid;
/*
* If the user did not opt to run as the owner of the subscription
diff --git a/src/test/subscription/t/004_sync.pl b/src/test/subscription/t/004_sync.pl
index d5eac05a3b3..5efd7e116f0 100644
--- a/src/test/subscription/t/004_sync.pl
+++ b/src/test/subscription/t/004_sync.pl
@@ -172,6 +172,13 @@ ok( $node_publisher->poll_query_until(
'postgres', 'SELECT count(*) = 0 FROM pg_replication_slots'),
'DROP SUBSCRIPTION during error can clean up the slots on the publisher');
+# After dropping the subscription, all replication origins, whether created by
+# an apply worker or table sync worker, should have been cleaned up.
+$result = $node_subscriber->safe_psql('postgres',
+ "SELECT count(*) FROM pg_replication_origin_status");
+is($result, qq(0),
+ 'all replication origins have been cleaned up');
+
$node_subscriber->stop('fast');
$node_publisher->stop('fast');
--
2.31.1
On Mon, Dec 22, 2025 at 2:15 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
Thank you for making the patch! The patch looks good to me.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Mon, Dec 22, 2025 at 01:22:38PM -0800, Masahiko Sawada wrote:
Thank you for making the patch! The patch looks good to me.
Creating the origin at the end of the same transaction that sets the
state to SUBREL_STATE_DATASYNC seems sensible here. I'll spend a
couple of extra hours playing with all that across all the branches,
see if I can wrap it. This includes some more error injection to
cross-check the state of all these transactions with the states in
the catalogs while we drop the subscription.
The test addition is interesting, nice. I didn't notice that it would
be possible to use this trick in 004_sync.pl.
Also, I have double-checked the names of the folks who have reported
the bug, giving the following list for the commit logs (in case I
don't finish this stuff, feel free to use that):
Daisuke Higuchi <higudai@amazon.com>
Tenglong Gu <brucegu@amazon.com>
--
Michael
On Tue, Dec 23, 2025 at 08:21:39AM +0900, Michael Paquier wrote:
Creating the origin at the end of the same transaction that sets the
state to SUBREL_STATE_DATASYNC seems sensible here. I'll spend a
couple of extra hours playing with all that across all the branches,
see if I can wrap it. This includes some more error injection to
cross-check the state of all these transactions with the states in
the catalogs while we drop the subscription.
While looking at the state of the code across the six branches where
we need to fix this, there were two points that have been slightly
sticky on my mind:
1) check_old_cluster_subscription_state() in pg_upgrade's check.c,
where we have a set of comments dealing with the reasons why only the
initial and ready states are allowed for the transfers of the relation
data. The patch only makes the origin visible in the catalogs for one
extra state, DATASYNC now, meaning that we have nothing to care about.
I was wondering about the comment of DATASYNC being slightly incorrect
now because it only mentions a replication slot. Do you think that we
should adjust that as well to mention the case of origins, knowing
that their names are also based on subscription OIDs whose value
change across upgrades? That would not apply for relation IDs as
these are fixed, but this feels a bit inexact now for the branches
where this code applies.
2) 88f488319bac and f6c5edb8abca, that have slightly changed the
replication origin logic in v16~. Still, after looking at the code
for a couple of hours, as well as testing the DROP SUBSCRIPTION
stability while enforcing ERRORs in the tablesync worker to play with
the shmem state vs the catalog state, I could not find a hole in the
v15 and v14 code caused by the fact that we have the catalog state for
the origin available one state earlier. At the end I think that we
are OK here in light of these two commits.
The comment could be improved, but I don't see that as a reason to delay
fixing the issue, so I have applied the fix down to v14 with a couple of
conflicts handled on the way.
--
Michael
On Tue, Dec 23, 2025 at 12:51 PM Michael Paquier <michael@paquier.xyz> wrote:
On Tue, Dec 23, 2025 at 08:21:39AM +0900, Michael Paquier wrote:
Creating the origin at the end of the same transaction that sets the
state to SUBREL_STATE_DATASYNC seems sensible here. I'll spend a
couple of extra hours playing with all that across all the branches,
see if I can wrap it. This includes some more error injection to
cross-check the state of all these transactions with the states in
the catalogs while we drop the subscription.

While looking at the state of the code across the six branches where
we need to fix this, there were two points that have been slightly
sticky on my mind:
1) check_old_cluster_subscription_state() in pg_upgrade's check.c,
where we have a set of comments dealing with the reasons why only the
initial and ready states are allowed for the transfers of the relation
data. The patch only makes the origin visible in the catalogs for one
extra state, DATASYNC now, meaning that we have nothing to care about.
I was wondering about the comment of DATASYNC being slightly incorrect
now because it only mentions a replication slot. Do you think that we
should adjust that as well to mention the case of origins, knowing
that their names are also based on subscription OIDs whose value
change across upgrades? That would not apply for relation IDs as
these are fixed, but this feels a bit inexact now for the branches
where this code applies.
You are right. How about the attached patch to make it match the current code?
--
With Regards,
Amit Kapila.
Attachments:
fix_comment_1.patch (application/octet-stream)
diff --git a/src/bin/pg_upgrade/check.c b/src/bin/pg_upgrade/check.c
index 9cdeb15bd51..c77ebeff883 100644
--- a/src/bin/pg_upgrade/check.c
+++ b/src/bin/pg_upgrade/check.c
@@ -2383,9 +2383,9 @@ check_old_cluster_subscription_state(void)
* states listed below are not supported:
*
* a) SUBREL_STATE_DATASYNC: A relation upgraded while in this state would
- * retain a replication slot, which could not be dropped by the sync
- * worker spawned after the upgrade because the subscription ID used for
- * the slot name won't match anymore.
+ * retain a replication slot and origin. The sync worker spawned after the
+ * upgrade cannot drop them because the subscription ID used for the slot
+ * and origin name no longer matches.
*
* b) SUBREL_STATE_SYNCDONE: A relation upgraded while in this state would
* retain the replication origin when there is a failure in tablesync