Deadlock between logrep apply worker and tablesync worker

Started by Tom Lane · almost 3 years ago · 23 messages
#1 Tom Lane
tgl@sss.pgh.pa.us
1 attachment(s)

On my machine, the src/test/subscription/t/002_types.pl test
usually takes right about 1.5 seconds:

$ time make check PROVE_FLAGS=--timer PROVE_TESTS=t/002_types.pl
...
[14:22:12] t/002_types.pl .. ok 1550 ms ( 0.00 usr 0.00 sys + 0.70 cusr 0.25 csys = 0.95 CPU)
[14:22:13]

I noticed however that sometimes (at least one try in ten, for me)
it takes about 2.5 seconds:

[14:22:16] t/002_types.pl .. ok 2591 ms ( 0.00 usr 0.00 sys + 0.69 cusr 0.28 csys = 0.97 CPU)
[14:22:18]

and I've even seen 3.5 seconds. I dug into this and eventually
identified the cause: it's a deadlock between a subscription's apply
worker and a tablesync worker that it's spawned. Sometimes the apply
worker calls wait_for_relation_state_change (to wait for the tablesync
worker to finish) while it's holding a lock on pg_replication_origin.
If that's the case, then when the tablesync worker reaches
process_syncing_tables_for_sync it is able to perform
UpdateSubscriptionRelState and reach the transaction commit below
that; but when it tries to do replorigin_drop_by_name a little further
down, it blocks on acquiring ExclusiveLock on pg_replication_origin.
So we have an undetected deadlock. We escape that because
wait_for_relation_state_change has a 1-second timeout, after which
it rechecks GetSubscriptionRelState and is able to see the committed
relation state change; so it continues, and eventually releases its
transaction and the lock, permitting the tablesync worker to finish.
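A rough sketch of the interleaving described above (simplified; not a
copy of the actual source):

    /*
     * apply worker                         tablesync worker
     * ------------                         ----------------
     * replorigin_drop_by_name()
     *   takes ExclusiveLock on
     *   pg_replication_origin,
     *   held until commit
     * wait_for_relation_state_change()     UpdateSubscriptionRelState()
     *   (same transaction, lock held)      commit -- state is now SYNCDONE
     *                                      replorigin_drop_by_name()
     *                                        blocks on the ExclusiveLock
     * ... 1-second timeout fires,
     * GetSubscriptionRelState() sees the
     * committed SYNCDONE state, the loop
     * exits, and the eventual commit
     * releases the lock ...                lock granted, worker finishes
     */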

I've not tracked down the exact circumstances in which the apply
worker ends up holding a problematic lock, but it seems likely
that it corresponds to cases where its main loop has itself called
replorigin_drop_by_name, a bit further up, for some other concurrent
tablesync operation. (In all the cases I've traced through, the apply
worker is herding multiple tablesync workers when this happens.)

I experimented with having the apply worker release its locks
before waiting for the tablesync worker, as attached. This passes
check-world and it seems to eliminate the test runtime instability,
but I wonder whether it's semantically correct. This whole business
of taking table-wide ExclusiveLock on pg_replication_origin looks
like a horrid kluge that we should try to get rid of, not least
because I don't see any clear documentation of what hazard it's
trying to prevent.

Another thing that has a bad smell about it is the fact that
process_syncing_tables_for_sync uses two transactions in the first
place. There's a comment there claiming that it's for crash safety,
but I can't help suspecting it's really because this case becomes a
hard deadlock without that mid-function commit.

It's not great in any case that the apply worker can move on in
the belief that the tablesync worker is done when in fact the latter
still has catalog state updates to make. And I wonder what we're
doing with having both of them calling replorigin_drop_by_name
... shouldn't that responsibility belong to just one of them?

So I think this whole area deserves a hard look, and I'm not at
all sure that what's attached is a good solution.

regards, tom lane

Attachments:

hack-to-prevent-logrep-worker-deadlocks.patch (text/x-diff)
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 4647837b82..bb45c2107f 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -539,15 +539,27 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 					LWLockRelease(LogicalRepWorkerLock);
 
 					/*
-					 * Enter busy loop and wait for synchronization worker to
-					 * reach expected state (or die trying).
+					 * If we have a transaction, we must commit it to release
+					 * any locks we have (it's quite likely we hold lock on
+					 * pg_replication_origin, which the sync worker will need
+					 * to update).  Then start a new transaction so we can
+					 * examine catalog state.
 					 */
-					if (!started_tx)
+					if (started_tx)
+					{
+						CommitTransactionCommand();
+						StartTransactionCommand();
+					}
+					else
 					{
 						StartTransactionCommand();
 						started_tx = true;
 					}
 
+					/*
+					 * Enter busy loop and wait for synchronization worker to
+					 * reach expected state (or die trying).
+					 */
 					wait_for_relation_state_change(rstate->relid,
 												   SUBREL_STATE_SYNCDONE);
 				}
#2 Amit Kapila
amit.kapila16@gmail.com
In reply to: Tom Lane (#1)
Re: Deadlock between logrep apply worker and tablesync worker

On Mon, Jan 23, 2023 at 1:29 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

On my machine, the src/test/subscription/t/002_types.pl test
usually takes right about 1.5 seconds:

$ time make check PROVE_FLAGS=--timer PROVE_TESTS=t/002_types.pl
...
[14:22:12] t/002_types.pl .. ok 1550 ms ( 0.00 usr 0.00 sys + 0.70 cusr 0.25 csys = 0.95 CPU)
[14:22:13]

I noticed however that sometimes (at least one try in ten, for me)
it takes about 2.5 seconds:

[14:22:16] t/002_types.pl .. ok 2591 ms ( 0.00 usr 0.00 sys + 0.69 cusr 0.28 csys = 0.97 CPU)
[14:22:18]

and I've even seen 3.5 seconds. I dug into this and eventually
identified the cause: it's a deadlock between a subscription's apply
worker and a tablesync worker that it's spawned. Sometimes the apply
worker calls wait_for_relation_state_change (to wait for the tablesync
worker to finish) while it's holding a lock on pg_replication_origin.
If that's the case, then when the tablesync worker reaches
process_syncing_tables_for_sync it is able to perform
UpdateSubscriptionRelState and reach the transaction commit below
that; but when it tries to do replorigin_drop_by_name a little further
down, it blocks on acquiring ExclusiveLock on pg_replication_origin.
So we have an undetected deadlock. We escape that because
wait_for_relation_state_change has a 1-second timeout, after which
it rechecks GetSubscriptionRelState and is able to see the committed
relation state change; so it continues, and eventually releases its
transaction and the lock, permitting the tablesync worker to finish.

I've not tracked down the exact circumstances in which the apply
worker ends up holding a problematic lock, but it seems likely
that it corresponds to cases where its main loop has itself called
replorigin_drop_by_name, a bit further up, for some other concurrent
tablesync operation. (In all the cases I've traced through, the apply
worker is herding multiple tablesync workers when this happens.)

I experimented with having the apply worker release its locks
before waiting for the tablesync worker, as attached.

I don't see any problem with your proposed change, but I was wondering
whether it would be better to commit the transaction and release the
locks immediately after performing the replication origin drop. That
would minimize the amount of time the transaction holds the lock.
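A minimal, untested fragment of that idea (the surrounding apply-worker
code, including the originname variable, is assumed from
process_syncing_tables_for_apply):

    /* after dropping the origin of a finished tablesync worker ... */
    replorigin_drop_by_name(originname, true /* missing_ok */, false /* nowait */);

    /*
     * Commit right away so the ExclusiveLock on pg_replication_origin is
     * released immediately, then open a fresh transaction for the rest
     * of the catalog work.
     */
    CommitTransactionCommand();
    StartTransactionCommand();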

This passes
check-world and it seems to eliminate the test runtime instability,
but I wonder whether it's semantically correct. This whole business
of taking table-wide ExclusiveLock on pg_replication_origin looks
like a horrid kluge that we should try to get rid of, not least
because I don't see any clear documentation of what hazard it's
trying to prevent.

IIRC, this is done to prevent concurrent drops of an origin, say via the
exposed API pg_replication_origin_drop(). See the discussion in [1]
related to it. If we want, we can optimize it so that we acquire the
lock on the specific origin, as mentioned in the comments in
replorigin_drop_by_name(), but it was not clear that this operation
would be frequent enough to justify that.

[1]: /messages/by-id/CAHut+PuW8DWV5fskkMWWMqzt-x7RPcNQOtJQBp6SdwyRghCk7A@mail.gmail.com

--
With Regards,
Amit Kapila.

#3 Amit Kapila
amit.kapila16@gmail.com
In reply to: Tom Lane (#1)
Re: Deadlock between logrep apply worker and tablesync worker

On Mon, Jan 23, 2023 at 1:29 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Another thing that has a bad smell about it is the fact that
process_syncing_tables_for_sync uses two transactions in the first
place. There's a comment there claiming that it's for crash safety,
but I can't help suspecting it's really because this case becomes a
hard deadlock without that mid-function commit.

It's not great in any case that the apply worker can move on in
the belief that the tablesync worker is done when in fact the latter
still has catalog state updates to make. And I wonder what we're
doing with having both of them calling replorigin_drop_by_name
... shouldn't that responsibility belong to just one of them?

Originally, it was being dropped in one place only (via the tablesync
worker), but we found a race condition, as mentioned in the comments in
process_syncing_tables_for_sync() before the start of the second
transaction, which led to this change. See the report and discussion
about that race condition in the email [1].

[1]: /messages/by-id/CAD21AoAw0Oofi4kiDpJBOwpYyBBBkJj=sLUOn4Gd2GjUAKG-fw@mail.gmail.com

--
With Regards,
Amit Kapila.

#4 vignesh C
vignesh21@gmail.com
In reply to: Amit Kapila (#2)
1 attachment(s)
Re: Deadlock between logrep apply worker and tablesync worker

On Mon, 23 Jan 2023 at 10:52, Amit Kapila <amit.kapila16@gmail.com> wrote:

IIRC, this is done to prevent concurrent drops of origin drop say by
exposed API pg_replication_origin_drop(). See the discussion in [1]
related to it. If we want we can optimize it so that we can acquire
the lock on the specific origin as mentioned in comments
replorigin_drop_by_name() but it was not clear that this operation
would be frequent enough.

Here is an attached patch to lock the replication origin record using
LockSharedObject instead of locking the pg_replication_origin relation
in ExclusiveLock mode. With this change, a tablesync worker waits only
if it is trying to drop the same replication origin that has already
been dropped by the apply worker; the other tablesync workers will be
able to drop their replication origins without any wait.

Regards,
Vignesh

Attachments:

0001-Lock-the-replication-origin-record-instead-of-lockin.patch (text/x-patch)
From 83aa8b6a0c7b4de42c1bf7aa6cfd863830a86c76 Mon Sep 17 00:00:00 2001
From: Vignesh C <vignesh21@gmail.com>
Date: Fri, 27 Jan 2023 15:17:09 +0530
Subject: [PATCH] Lock the replication origin record instead of locking the
 pg_replication_origin relation.

Lock the replication origin record instead of locking the
pg_replication_origin relation.
---
 src/backend/replication/logical/origin.c | 37 ++++++++++++++++--------
 1 file changed, 25 insertions(+), 12 deletions(-)

diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index b754c43840..049a2547d3 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -417,6 +417,21 @@ restart:
 	CommandCounterIncrement();
 }
 
+static bool
+replorigin_exists(RepOriginId roident)
+{
+	HeapTuple	tuple;
+
+	tuple = SearchSysCache1(REPLORIGIDENT, ObjectIdGetDatum((Oid) roident));
+	if (HeapTupleIsValid(tuple))
+	{
+		ReleaseSysCache(tuple);
+		return true;
+	}
+
+	return false;
+}
+
 /*
  * Drop replication origin (by name).
  *
@@ -430,23 +445,21 @@ replorigin_drop_by_name(const char *name, bool missing_ok, bool nowait)
 
 	Assert(IsTransactionState());
 
-	/*
-	 * To interlock against concurrent drops, we hold ExclusiveLock on
-	 * pg_replication_origin till xact commit.
-	 *
-	 * XXX We can optimize this by acquiring the lock on a specific origin by
-	 * using LockSharedObject if required. However, for that, we first to
-	 * acquire a lock on ReplicationOriginRelationId, get the origin_id, lock
-	 * the specific origin and then re-check if the origin still exists.
-	 */
-	rel = table_open(ReplicationOriginRelationId, ExclusiveLock);
+	rel = table_open(ReplicationOriginRelationId, RowExclusiveLock);
 
 	roident = replorigin_by_name(name, missing_ok);
 
-	if (OidIsValid(roident))
+	/*
+	 * Lock the origin to prevent concurrent drops. We keep the lock until the
+	 * end of transaction.
+	 */
+	LockSharedObject(ReplicationOriginRelationId, roident, 0,
+					 AccessExclusiveLock);
+
+	/* Drop the replication origin if it has not been dropped already */
+	if (replorigin_exists(roident))
 		replorigin_drop_guts(rel, roident, nowait);
 
-	/* We keep the lock on pg_replication_origin until commit */
 	table_close(rel, NoLock);
 }
 
-- 
2.34.1

#5 Amit Kapila
amit.kapila16@gmail.com
In reply to: vignesh C (#4)
Re: Deadlock between logrep apply worker and tablesync worker

On Fri, Jan 27, 2023 at 3:45 PM vignesh C <vignesh21@gmail.com> wrote:

On Mon, 23 Jan 2023 at 10:52, Amit Kapila <amit.kapila16@gmail.com> wrote:

IIRC, this is done to prevent concurrent drops of origin drop say by
exposed API pg_replication_origin_drop(). See the discussion in [1]
related to it. If we want we can optimize it so that we can acquire
the lock on the specific origin as mentioned in comments
replorigin_drop_by_name() but it was not clear that this operation
would be frequent enough.

Here is an attached patch to lock the replication origin record using
LockSharedObject instead of locking pg_replication_origin relation in
ExclusiveLock mode. Now tablesync worker will wait only if the
tablesync worker is trying to drop the same replication origin which
has already been dropped by the apply worker, the other tablesync
workers will be able to successfully drop the replication origin
without any wait.

There is code in the function replorigin_drop_guts() that performs the
same check as the newly introduced replorigin_exists(). Can we reuse
this function there?

Also, it would be good if you can share the numbers for different runs
of "src/test/subscription/t/002_types.pl" before and after the patch.

--
With Regards,
Amit Kapila.

#6 houzj.fnst@fujitsu.com
houzj.fnst@fujitsu.com
In reply to: Amit Kapila (#5)
RE: Deadlock between logrep apply worker and tablesync worker

On Friday, January 27, 2023 8:16 PM Amit Kapila <amit.kapila16@gmail.com>

On Fri, Jan 27, 2023 at 3:45 PM vignesh C <vignesh21@gmail.com> wrote:

On Mon, 23 Jan 2023 at 10:52, Amit Kapila <amit.kapila16@gmail.com>

wrote:

IIRC, this is done to prevent concurrent drops of origin drop say by
exposed API pg_replication_origin_drop(). See the discussion in [1]
related to it. If we want we can optimize it so that we can acquire
the lock on the specific origin as mentioned in comments
replorigin_drop_by_name() but it was not clear that this operation
would be frequent enough.

Here is an attached patch to lock the replication origin record using
LockSharedObject instead of locking pg_replication_origin relation in
ExclusiveLock mode. Now tablesync worker will wait only if the
tablesync worker is trying to drop the same replication origin which
has already been dropped by the apply worker, the other tablesync
workers will be able to successfully drop the replication origin
without any wait.

There is a code in the function replorigin_drop_guts() that uses the
functionality introduced by replorigin_exists(). Can we reuse this function for
the same?

Maybe we can use SearchSysCacheExists1 to check the existence instead of
adding a new function.
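For example (a minimal sketch, reusing the syscache identifier already
used in origin.c):

    if (SearchSysCacheExists1(REPLORIGIDENT, ObjectIdGetDatum(roident)))
        replorigin_drop_guts(rel, roident, nowait);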

One comment about the patch.

@@ -430,23 +445,21 @@ replorigin_drop_by_name(const char *name, bool missing_ok, bool nowait)
...
+	/* Drop the replication origin if it has not been dropped already */
+	if (replorigin_exists(roident))
 		replorigin_drop_guts(rel, roident, nowait);

If the caller passes missing_ok as false, should we report an ERROR here
instead of silently returning?

Best Regards,
Hou zj

#7 Amit Kapila
amit.kapila16@gmail.com
In reply to: houzj.fnst@fujitsu.com (#6)
Re: Deadlock between logrep apply worker and tablesync worker

On Sat, Jan 28, 2023 at 9:36 AM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:

On Friday, January 27, 2023 8:16 PM Amit Kapila <amit.kapila16@gmail.com>

On Fri, Jan 27, 2023 at 3:45 PM vignesh C <vignesh21@gmail.com> wrote:

On Mon, 23 Jan 2023 at 10:52, Amit Kapila <amit.kapila16@gmail.com>

wrote:

IIRC, this is done to prevent concurrent drops of origin drop say by
exposed API pg_replication_origin_drop(). See the discussion in [1]
related to it. If we want we can optimize it so that we can acquire
the lock on the specific origin as mentioned in comments
replorigin_drop_by_name() but it was not clear that this operation
would be frequent enough.

Here is an attached patch to lock the replication origin record using
LockSharedObject instead of locking pg_replication_origin relation in
ExclusiveLock mode. Now tablesync worker will wait only if the
tablesync worker is trying to drop the same replication origin which
has already been dropped by the apply worker, the other tablesync
workers will be able to successfully drop the replication origin
without any wait.

There is a code in the function replorigin_drop_guts() that uses the
functionality introduced by replorigin_exists(). Can we reuse this function for
the same?

Maybe we can use SearchSysCacheExists1 to check the existence instead of
adding a new function.

Yeah, I think that would be better.

One comment about the patch.

@@ -430,23 +445,21 @@ replorigin_drop_by_name(const char *name, bool missing_ok, bool nowait)
...
+       /* Drop the replication origin if it has not been dropped already */
+       if (replorigin_exists(roident))
replorigin_drop_guts(rel, roident, nowait);

If developer pass missing_ok as false, should we report an ERROR here
instead of silently return ?

One thing that looks a bit odd is that we will anyway have a similar
check in replorigin_drop_guts(), which is a static function called from
only one place; so, will it be required to check in both places?

--
With Regards,
Amit Kapila.

#8 vignesh C
vignesh21@gmail.com
In reply to: Amit Kapila (#7)
Re: Deadlock between logrep apply worker and tablesync worker

On Sat, 28 Jan 2023 at 11:26, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Sat, Jan 28, 2023 at 9:36 AM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:

On Friday, January 27, 2023 8:16 PM Amit Kapila <amit.kapila16@gmail.com>

On Fri, Jan 27, 2023 at 3:45 PM vignesh C <vignesh21@gmail.com> wrote:

On Mon, 23 Jan 2023 at 10:52, Amit Kapila <amit.kapila16@gmail.com>

wrote:

IIRC, this is done to prevent concurrent drops of origin drop say by
exposed API pg_replication_origin_drop(). See the discussion in [1]
related to it. If we want we can optimize it so that we can acquire
the lock on the specific origin as mentioned in comments
replorigin_drop_by_name() but it was not clear that this operation
would be frequent enough.

Here is an attached patch to lock the replication origin record using
LockSharedObject instead of locking pg_replication_origin relation in
ExclusiveLock mode. Now tablesync worker will wait only if the
tablesync worker is trying to drop the same replication origin which
has already been dropped by the apply worker, the other tablesync
workers will be able to successfully drop the replication origin
without any wait.

There is a code in the function replorigin_drop_guts() that uses the
functionality introduced by replorigin_exists(). Can we reuse this function for
the same?

Maybe we can use SearchSysCacheExists1 to check the existence instead of
adding a new function.

Yeah, I think that would be better.

One comment about the patch.

@@ -430,23 +445,21 @@ replorigin_drop_by_name(const char *name, bool missing_ok, bool nowait)
...
+       /* Drop the replication origin if it has not been dropped already */
+       if (replorigin_exists(roident))
replorigin_drop_guts(rel, roident, nowait);

If developer pass missing_ok as false, should we report an ERROR here
instead of silently return ?

One thing that looks a bit odd is that we will anyway have a similar
check in replorigin_drop_guts() which is a static function and called
from only one place, so, will it be required to check at both places?

There is a possibility that the initial check in replorigin_drop_by_name
to verify that the replication origin exists succeeds, but that either
the tablesync worker or the apply worker later drops the replication
origin. So it is better to check again before calling
replorigin_drop_guts. Ideally the tuple should still be valid in
replorigin_drop_guts, but can we keep the check as it is, to maintain
consistency before calling CatalogTupleDelete?

Regards,
Vignesh

#9 vignesh C
vignesh21@gmail.com
In reply to: Amit Kapila (#5)
Re: Deadlock between logrep apply worker and tablesync worker

On Fri, 27 Jan 2023 at 17:46, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Jan 27, 2023 at 3:45 PM vignesh C <vignesh21@gmail.com> wrote:

On Mon, 23 Jan 2023 at 10:52, Amit Kapila <amit.kapila16@gmail.com> wrote:

IIRC, this is done to prevent concurrent drops of origin drop say by
exposed API pg_replication_origin_drop(). See the discussion in [1]
related to it. If we want we can optimize it so that we can acquire
the lock on the specific origin as mentioned in comments
replorigin_drop_by_name() but it was not clear that this operation
would be frequent enough.

Here is an attached patch to lock the replication origin record using
LockSharedObject instead of locking pg_replication_origin relation in
ExclusiveLock mode. Now tablesync worker will wait only if the
tablesync worker is trying to drop the same replication origin which
has already been dropped by the apply worker, the other tablesync
workers will be able to successfully drop the replication origin
without any wait.

There is a code in the function replorigin_drop_guts() that uses the
functionality introduced by replorigin_exists(). Can we reuse this
function for the same?

Also, it would be good if you can share the numbers for different runs
of "src/test/subscription/t/002_types.pl" before and after the patch.

Using only Tom Lane's fix, I noticed that the execution time varies
between 3.6 and 4.4 seconds.
Using only the "Changing to LockSharedObject" fix, I noticed that the
execution time varies between 3.8 and 4.6 seconds.
Using the combined fix (Tom Lane's fix + changing to LockSharedObject),
I noticed that the execution time varies between 3.5 and 3.8 seconds.
I felt both changes will be required, as the combination also handles
the scenario where both the apply worker and the tablesync worker try
to drop the same replication origin.

The execution results for the same:
With only Tom Lane's fix:
[12:25:32] t/002_types.pl .. ok 3604 ms ( 0.00 usr 0.00 sys + 2.26 cusr 0.37 csys = 2.63 CPU)
[12:25:48] t/002_types.pl .. ok 3788 ms ( 0.00 usr 0.00 sys + 2.24 cusr 0.39 csys = 2.63 CPU)
[12:26:01] t/002_types.pl .. ok 3783 ms ( 0.00 usr 0.00 sys + 2.42 cusr 0.37 csys = 2.79 CPU)
[12:26:14] t/002_types.pl .. ok 3845 ms ( 0.00 usr 0.00 sys + 2.38 cusr 0.44 csys = 2.82 CPU)
[12:26:29] t/002_types.pl .. ok 3923 ms ( 0.00 usr 0.00 sys + 2.54 cusr 0.39 csys = 2.93 CPU)
[12:26:42] t/002_types.pl .. ok 4416 ms ( 0.00 usr 0.00 sys + 2.73 cusr 0.48 csys = 3.21 CPU)
[12:26:55] t/002_types.pl .. ok 4310 ms ( 0.00 usr 0.00 sys + 2.62 cusr 0.39 csys = 3.01 CPU)
[12:27:09] t/002_types.pl .. ok 4168 ms ( 0.00 usr 0.00 sys + 2.67 cusr 0.46 csys = 3.13 CPU)
[12:27:21] t/002_types.pl .. ok 4167 ms ( 0.00 usr 0.00 sys + 2.46 cusr 0.53 csys = 2.99 CPU)
[12:27:34] t/002_types.pl .. ok 4144 ms ( 0.00 usr 0.00 sys + 2.59 cusr 0.41 csys = 3.00 CPU)
[12:27:46] t/002_types.pl .. ok 3982 ms ( 0.00 usr 0.00 sys + 2.52 cusr 0.41 csys = 2.93 CPU)
[12:28:03] t/002_types.pl .. ok 4190 ms ( 0.01 usr 0.00 sys + 2.67 cusr 0.46 csys = 3.14 CPU)

With only "Changing to LockSharedObject" fix:
[12:33:02] t/002_types.pl .. ok 3815 ms ( 0.00 usr 0.00 sys + 2.30 cusr 0.38 csys = 2.68 CPU)
[12:33:16] t/002_types.pl .. ok 4295 ms ( 0.00 usr 0.00 sys + 2.66 cusr 0.42 csys = 3.08 CPU)
[12:33:31] t/002_types.pl .. ok 4270 ms ( 0.00 usr 0.00 sys + 2.72 cusr 0.44 csys = 3.16 CPU)
[12:33:44] t/002_types.pl .. ok 4460 ms ( 0.00 usr 0.00 sys + 2.78 cusr 0.45 csys = 3.23 CPU)
[12:33:58] t/002_types.pl .. ok 4340 ms ( 0.01 usr 0.00 sys + 2.67 cusr 0.45 csys = 3.13 CPU)
[12:34:11] t/002_types.pl .. ok 4142 ms ( 0.00 usr 0.00 sys + 2.58 cusr 0.42 csys = 3.00 CPU)
[12:34:24] t/002_types.pl .. ok 4459 ms ( 0.00 usr 0.00 sys + 2.76 cusr 0.49 csys = 3.25 CPU)
[12:34:38] t/002_types.pl .. ok 4427 ms ( 0.00 usr 0.00 sys + 2.68 cusr 0.48 csys = 3.16 CPU)
[12:35:10] t/002_types.pl .. ok 4642 ms ( 0.00 usr 0.00 sys + 2.84 cusr 0.55 csys = 3.39 CPU)
[12:35:22] t/002_types.pl .. ok 4047 ms ( 0.01 usr 0.00 sys + 2.49 cusr 0.46 csys = 2.96 CPU)
[12:35:32] t/002_types.pl .. ok 4505 ms ( 0.01 usr 0.00 sys + 2.90 cusr 0.45 csys = 3.36 CPU)
[12:36:03] t/002_types.pl .. ok 4088 ms ( 0.00 usr 0.00 sys + 2.51 cusr 0.42 csys = 2.93 CPU)

002_types with combination of Tom Lane's and "Changing to LockSharedObject" fix:
[10:22:04] t/002_types.pl .. ok 3730 ms ( 0.00 usr 0.00 sys + 2.30 cusr 0.41 csys = 2.71 CPU)
[10:23:40] t/002_types.pl .. ok 3666 ms ( 0.00 usr 0.00 sys + 2.16 cusr 0.42 csys = 2.58 CPU)
[10:23:31] t/002_types.pl .. ok 3665 ms ( 0.00 usr 0.00 sys + 2.31 cusr 0.40 csys = 2.71 CPU)
[10:23:23] t/002_types.pl .. ok 3500 ms ( 0.00 usr 0.00 sys + 2.20 cusr 0.36 csys = 2.56 CPU)
[10:23:14] t/002_types.pl .. ok 3704 ms ( 0.00 usr 0.00 sys + 2.36 cusr 0.35 csys = 2.71 CPU)
[10:23:05] t/002_types.pl .. ok 3594 ms ( 0.00 usr 0.00 sys + 2.32 cusr 0.31 csys = 2.63 CPU)
[10:24:10] t/002_types.pl .. ok 3702 ms ( 0.00 usr 0.00 sys + 2.27 cusr 0.42 csys = 2.69 CPU)
[10:24:22] t/002_types.pl .. ok 3741 ms ( 0.00 usr 0.00 sys + 2.39 cusr 0.36 csys = 2.75 CPU)
[10:24:38] t/002_types.pl .. ok 3676 ms ( 0.00 usr 0.00 sys + 2.28 cusr 0.43 csys = 2.71 CPU)
[10:24:50] t/002_types.pl .. ok 3843 ms ( 0.00 usr 0.00 sys + 2.36 cusr 0.43 csys = 2.79 CPU)
[10:25:03] t/002_types.pl .. ok 3710 ms ( 0.00 usr 0.00 sys + 2.30 cusr 0.36 csys = 2.66 CPU)
[10:25:12] t/002_types.pl .. ok 3695 ms ( 0.00 usr 0.00 sys + 2.34 cusr 0.35 csys = 2.69 CPU)

002_types on HEAD:
[10:31:05] t/002_types.pl .. ok 5687 ms ( 0.00 usr 0.00 sys + 2.35 cusr 0.45 csys = 2.80 CPU)
[10:31:31] t/002_types.pl .. ok 6815 ms ( 0.00 usr 0.00 sys + 2.61 cusr 0.43 csys = 3.04 CPU)
[10:31:47] t/002_types.pl .. ok 5561 ms ( 0.00 usr 0.00 sys + 2.24 cusr 0.47 csys = 2.71 CPU)
[10:32:05] t/002_types.pl .. ok 4542 ms ( 0.00 usr 0.00 sys + 2.27 cusr 0.39 csys = 2.66 CPU)
[10:32:20] t/002_types.pl .. ok 3663 ms ( 0.00 usr 0.00 sys + 2.30 cusr 0.38 csys = 2.68 CPU)
[10:32:33] t/002_types.pl .. ok 3627 ms ( 0.00 usr 0.00 sys + 2.27 cusr 0.32 csys = 2.59 CPU)
[10:32:45] t/002_types.pl .. ok 3808 ms ( 0.00 usr 0.00 sys + 2.41 cusr 0.39 csys = 2.80 CPU)
[10:32:59] t/002_types.pl .. ok 4536 ms ( 0.00 usr 0.00 sys + 2.24 cusr 0.38 csys = 2.62 CPU)
[10:33:13] t/002_types.pl .. ok 3638 ms ( 0.00 usr 0.00 sys + 2.25 cusr 0.41 csys = 2.66 CPU)
[10:33:35] t/002_types.pl .. ok 4796 ms ( 0.00 usr 0.00 sys + 2.38 cusr 0.38 csys = 2.76 CPU)
[10:33:51] t/002_types.pl .. ok 4695 ms ( 0.00 usr 0.00 sys + 2.40 cusr 0.37 csys = 2.77 CPU)
[10:34:06] t/002_types.pl .. ok 5738 ms ( 0.00 usr 0.00 sys + 2.44 cusr 0.43 csys = 2.87 CPU)

Regards,
Vignesh

#10 Amit Kapila
amit.kapila16@gmail.com
In reply to: vignesh C (#8)
Re: Deadlock between logrep apply worker and tablesync worker

On Mon, Jan 30, 2023 at 9:20 AM vignesh C <vignesh21@gmail.com> wrote:

On Sat, 28 Jan 2023 at 11:26, Amit Kapila <amit.kapila16@gmail.com> wrote:

One thing that looks a bit odd is that we will anyway have a similar
check in replorigin_drop_guts() which is a static function and called
from only one place, so, will it be required to check at both places?

There is a possibility that the initial check to verify if replication
origin exists in replorigin_drop_by_name was successful but later one
of either table sync worker or apply worker process might have dropped
the replication origin,

Won't locking on the particular origin prevent concurrent drops? IIUC,
the drop happens after the patch acquires the lock on the origin.

--
With Regards,
Amit Kapila.

#11 houzj.fnst@fujitsu.com
houzj.fnst@fujitsu.com
In reply to: Amit Kapila (#10)
RE: Deadlock between logrep apply worker and tablesync worker

On Monday, January 30, 2023 2:32 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Jan 30, 2023 at 9:20 AM vignesh C <vignesh21@gmail.com> wrote:

On Sat, 28 Jan 2023 at 11:26, Amit Kapila <amit.kapila16@gmail.com> wrote:

One thing that looks a bit odd is that we will anyway have a similar
check in replorigin_drop_guts() which is a static function and
called from only one place, so, will it be required to check at both places?

There is a possibility that the initial check to verify if replication
origin exists in replorigin_drop_by_name was successful but later one
of either table sync worker or apply worker process might have dropped
the replication origin,

Won't locking on the particular origin prevent concurrent drops? IIUC, the
drop happens after the patch acquires the lock on the origin.

Yes, I think the existence check in replorigin_drop_guts is unnecessary,
as we already lock the origin before that. The check in
replorigin_drop_guts is a custom check after calling SearchSysCache1 to
get the tuple, but that error should not happen since no concurrent drop
can be performed.

To make it simpler, one idea is to move the code that gets the tuple
from the system cache into replorigin_drop_by_name(). After locking the
origin, we can try to get the tuple and do the existence check, and we
can reuse this tuple to perform the origin delete. With this approach we
only need to check the origin's existence once, after locking. BTW, if
we do this, then we'd better rename replorigin_drop_guts() to something
like replorigin_state_clear(), as the function then only clears the
in-memory information.

The code could be like:

-------
replorigin_drop_by_name(const char *name, bool missing_ok, bool nowait)
...
    /*
     * Lock the origin to prevent concurrent drops. We keep the lock until
     * the end of transaction.
     */
    LockSharedObject(ReplicationOriginRelationId, roident, 0,
                     AccessExclusiveLock);

    tuple = SearchSysCache1(REPLORIGIDENT, ObjectIdGetDatum(roident));
    if (!HeapTupleIsValid(tuple))
    {
        if (!missing_ok)
            elog(ERROR, "cache lookup failed for replication origin with ID %d",
                 roident);

        return;
    }

    replorigin_state_clear(rel, roident, nowait);

    /*
     * Now, we can delete the catalog entry.
     */
    CatalogTupleDelete(rel, &tuple->t_self);
    ReleaseSysCache(tuple);

    CommandCounterIncrement();
...
-------

Best Regards,
Hou zj

#12 vignesh C
vignesh21@gmail.com
In reply to: houzj.fnst@fujitsu.com (#11)
Re: Deadlock between logrep apply worker and tablesync worker

On Mon, 30 Jan 2023 at 13:00, houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:

On Monday, January 30, 2023 2:32 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Jan 30, 2023 at 9:20 AM vignesh C <vignesh21@gmail.com> wrote:

On Sat, 28 Jan 2023 at 11:26, Amit Kapila <amit.kapila16@gmail.com> wrote:

One thing that looks a bit odd is that we will anyway have a similar
check in replorigin_drop_guts() which is a static function and
called from only one place, so, will it be required to check at both places?

There is a possibility that the initial check to verify if replication
origin exists in replorigin_drop_by_name was successful but later one
of either table sync worker or apply worker process might have dropped
the replication origin,

Won't locking on the particular origin prevent concurrent drops? IIUC, the
drop happens after the patch acquires the lock on the origin.

Yes, I think the existence check in replorigin_drop_guts is unnecessary as we
already lock the origin before that. I think the check in replorigin_drop_guts
is a custom check after calling SearchSysCache1 to get the tuple, but the error
should not happen as no concurrent drop can be performed.

This scenario is possible while creating a subscription: the apply
worker will try to drop the replication origin if the state is
SUBREL_STATE_SYNCDONE. The tablesync worker sets the state to
SUBREL_STATE_SYNCDONE and updates the relation state before calling
replorigin_drop_by_name. Since the tablesync worker has committed that
transaction, the state is visible to the apply worker, which will then
try to drop the replication origin in parallel. There is a race
condition here: one of the two processes (tablesync worker or apply
worker) acquires the lock and drops the replication origin; the other
gets the lock only after the first has dropped the origin and committed
its transaction. The second process then tries to drop the replication
origin once it acquires the lock and gets the error (from
replorigin_drop_guts): cache lookup failed for replication origin with
ID. So a concurrent drop is possible in this case.
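A rough timeline of that race (simplified sketch):

    /*
     * tablesync worker                     apply worker
     * ----------------                     ------------
     * set relstate = SYNCDONE, commit
     *                                      sees SYNCDONE
     * replorigin_drop_by_name()            replorigin_drop_by_name()
     *   LockSharedObject()  -- granted       LockSharedObject()  -- waits
     *   drop origin, commit
     *                                        lock granted, but the origin
     *                                        is already gone; without a
     *                                        recheck we error out with
     *                                        "cache lookup failed ..."
     */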

Regards,
Vignesh

#13 vignesh C
vignesh21@gmail.com
In reply to: houzj.fnst@fujitsu.com (#11)
Re: Deadlock between logrep apply worker and tablesync worker

On Mon, 30 Jan 2023 at 13:00, houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:

On Monday, January 30, 2023 2:32 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Jan 30, 2023 at 9:20 AM vignesh C <vignesh21@gmail.com> wrote:

On Sat, 28 Jan 2023 at 11:26, Amit Kapila <amit.kapila16@gmail.com> wrote:

One thing that looks a bit odd is that we will anyway have a similar
check in replorigin_drop_guts() which is a static function and
called from only one place, so, will it be required to check at both places?

There is a possibility that the initial check to verify if replication
origin exists in replorigin_drop_by_name was successful but later one
of either table sync worker or apply worker process might have dropped
the replication origin,

Won't locking on the particular origin prevent concurrent drops? IIUC, the
drop happens after the patch acquires the lock on the origin.

Yes, I think the existence check in replorigin_drop_guts is unnecessary as we
already lock the origin before that. I think the check in replorigin_drop_guts
is a custom check after calling SearchSysCache1 to get the tuple, but the error
should not happen as no concurrent drop can be performed.

To make it simpler, one idea is to move the code that getting the tuple from
system cache to the replorigin_drop_by_name(). After locking the origin, we
can try to get the tuple and do the existence check, and we can reuse
this tuple to perform origin delete. In this approach we only need to check
origin existence once after locking. BTW, if we do this, then we'd better rename the
replorigin_drop_guts() to something like replorigin_state_clear() as the function
only clear the in-memory information after that.

The code could be like:

-------
replorigin_drop_by_name(const char *name, bool missing_ok, bool nowait)
...
/*
* Lock the origin to prevent concurrent drops. We keep the lock until the
* end of transaction.
*/
LockSharedObject(ReplicationOriginRelationId, roident, 0,
AccessExclusiveLock);

tuple = SearchSysCache1(REPLORIGIDENT, ObjectIdGetDatum(roident));
if (!HeapTupleIsValid(tuple))
{
if (!missing_ok)
elog(ERROR, "cache lookup failed for replication origin with ID %d",
roident);

return;
}

replorigin_state_clear(rel, roident, nowait);

/*
* Now, we can delete the catalog entry.
*/
CatalogTupleDelete(rel, &tuple->t_self);
ReleaseSysCache(tuple);

CommandCounterIncrement();
...

+1 for this change as it removes the redundant check. I will post an
updated version with this change.

Regards,
Vignesh

#14 vignesh C
vignesh21@gmail.com
In reply to: houzj.fnst@fujitsu.com (#11)
1 attachment(s)
Re: Deadlock between logrep apply worker and tablesync worker

On Mon, 30 Jan 2023 at 13:00, houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:

On Monday, January 30, 2023 2:32 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Jan 30, 2023 at 9:20 AM vignesh C <vignesh21@gmail.com> wrote:

On Sat, 28 Jan 2023 at 11:26, Amit Kapila <amit.kapila16@gmail.com> wrote:

One thing that looks a bit odd is that we will anyway have a similar
check in replorigin_drop_guts() which is a static function and
called from only one place, so, will it be required to check at both places?

There is a possibility that the initial check to verify if replication
origin exists in replorigin_drop_by_name was successful but later one
of either table sync worker or apply worker process might have dropped
the replication origin,

Won't locking on the particular origin prevent concurrent drops? IIUC, the
drop happens after the patch acquires the lock on the origin.

Yes, I think the existence check in replorigin_drop_guts is unnecessary as we
already lock the origin before that. I think the check in replorigin_drop_guts
is a custom check after calling SearchSysCache1 to get the tuple, but the error
should not happen as no concurrent drop can be performed.

To make it simpler, one idea is to move the code that getting the tuple from
system cache to the replorigin_drop_by_name(). After locking the origin, we
can try to get the tuple and do the existence check, and we can reuse
this tuple to perform origin delete. In this approach we only need to check
origin existence once after locking. BTW, if we do this, then we'd better rename the
replorigin_drop_guts() to something like replorigin_state_clear() as the function
only clear the in-memory information after that.

The code could be like:

-------
replorigin_drop_by_name(const char *name, bool missing_ok, bool nowait)
...
/*
* Lock the origin to prevent concurrent drops. We keep the lock until the
* end of transaction.
*/
LockSharedObject(ReplicationOriginRelationId, roident, 0,
AccessExclusiveLock);

tuple = SearchSysCache1(REPLORIGIDENT, ObjectIdGetDatum(roident));
if (!HeapTupleIsValid(tuple))
{
if (!missing_ok)
elog(ERROR, "cache lookup failed for replication origin with ID %d",
roident);

return;
}

replorigin_state_clear(rel, roident, nowait);

/*
* Now, we can delete the catalog entry.
*/
CatalogTupleDelete(rel, &tuple->t_self);
ReleaseSysCache(tuple);

CommandCounterIncrement();

The attached updated patch implements these changes.

Regards,
Vignesh

Attachments:

v2-0001-Lock-the-replication-origin-record-instead-of-loc.patch (text/x-patch)
From dc2e9acc6a55772896117cf3b88e4189f994a82d Mon Sep 17 00:00:00 2001
From: Vignesh C <vignesh21@gmail.com>
Date: Fri, 27 Jan 2023 15:17:09 +0530
Subject: [PATCH v2] Lock the replication origin record instead of locking the
 pg_replication_origin relation.

Lock the replication origin record instead of locking the
pg_replication_origin relation.
---
 src/backend/replication/logical/origin.c | 56 ++++++++++++------------
 1 file changed, 27 insertions(+), 29 deletions(-)

diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index b754c43840..3e360cf41e 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -338,16 +338,14 @@ replorigin_create(const char *roname)
  * Helper function to drop a replication origin.
  */
 static void
-replorigin_drop_guts(Relation rel, RepOriginId roident, bool nowait)
+replorigin_state_clear(RepOriginId roident, bool nowait)
 {
-	HeapTuple	tuple;
 	int			i;
 
 	/*
-	 * First, clean up the slot state info, if there is any matching slot.
+	 * Clean up the slot state info, if there is any matching slot.
 	 */
 restart:
-	tuple = NULL;
 	LWLockAcquire(ReplicationOriginLock, LW_EXCLUSIVE);
 
 	for (i = 0; i < max_replication_slots; i++)
@@ -402,19 +400,6 @@ restart:
 	}
 	LWLockRelease(ReplicationOriginLock);
 	ConditionVariableCancelSleep();
-
-	/*
-	 * Now, we can delete the catalog entry.
-	 */
-	tuple = SearchSysCache1(REPLORIGIDENT, ObjectIdGetDatum(roident));
-	if (!HeapTupleIsValid(tuple))
-		elog(ERROR, "cache lookup failed for replication origin with ID %d",
-			 roident);
-
-	CatalogTupleDelete(rel, &tuple->t_self);
-	ReleaseSysCache(tuple);
-
-	CommandCounterIncrement();
 }
 
 /*
@@ -427,24 +412,37 @@ replorigin_drop_by_name(const char *name, bool missing_ok, bool nowait)
 {
 	RepOriginId roident;
 	Relation	rel;
+	HeapTuple	tuple;
 
 	Assert(IsTransactionState());
 
-	/*
-	 * To interlock against concurrent drops, we hold ExclusiveLock on
-	 * pg_replication_origin till xact commit.
-	 *
-	 * XXX We can optimize this by acquiring the lock on a specific origin by
-	 * using LockSharedObject if required. However, for that, we first to
-	 * acquire a lock on ReplicationOriginRelationId, get the origin_id, lock
-	 * the specific origin and then re-check if the origin still exists.
-	 */
-	rel = table_open(ReplicationOriginRelationId, ExclusiveLock);
+	rel = table_open(ReplicationOriginRelationId, RowExclusiveLock);
 
 	roident = replorigin_by_name(name, missing_ok);
 
-	if (OidIsValid(roident))
-		replorigin_drop_guts(rel, roident, nowait);
+	/* Lock the origin to prevent concurrent drops */
+	LockSharedObject(ReplicationOriginRelationId, roident, 0,
+					 AccessExclusiveLock);
+
+	tuple = SearchSysCache1(REPLORIGIDENT, ObjectIdGetDatum(roident));
+	if (!HeapTupleIsValid(tuple))
+	{
+		if (!missing_ok)
+			elog(ERROR, "cache lookup failed for replication origin with ID %d",
+				 roident);
+
+		return;
+	}
+
+	replorigin_state_clear(roident, nowait);
+
+	/*
+	 * Now, we can delete the catalog entry.
+	 */
+	CatalogTupleDelete(rel, &tuple->t_self);
+	ReleaseSysCache(tuple);
+
+	CommandCounterIncrement();
 
 	/* We keep the lock on pg_replication_origin until commit */
 	table_close(rel, NoLock);
-- 
2.34.1

#15 vignesh C
vignesh21@gmail.com
In reply to: vignesh C (#14)
1 attachment(s)
Re: Deadlock between logrep apply worker and tablesync worker

On Mon, 30 Jan 2023 at 17:30, vignesh C <vignesh21@gmail.com> wrote:

On Mon, 30 Jan 2023 at 13:00, houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:

On Monday, January 30, 2023 2:32 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Jan 30, 2023 at 9:20 AM vignesh C <vignesh21@gmail.com> wrote:

On Sat, 28 Jan 2023 at 11:26, Amit Kapila <amit.kapila16@gmail.com> wrote:

One thing that looks a bit odd is that we will anyway have a similar
check in replorigin_drop_guts() which is a static function and
called from only one place, so, will it be required to check at both places?

There is a possibility that the initial check to verify if replication
origin exists in replorigin_drop_by_name was successful but later one
of either table sync worker or apply worker process might have dropped
the replication origin,

Won't locking on the particular origin prevent concurrent drops? IIUC, the
drop happens after the patch acquires the lock on the origin.

Yes, I think the existence check in replorigin_drop_guts is unnecessary as we
already lock the origin before that. I think the check in replorigin_drop_guts
is a custom check after calling SearchSysCache1 to get the tuple, but the error
should not happen as no concurrent drop can be performed.

To make it simpler, one idea is to move the code that getting the tuple from
system cache to the replorigin_drop_by_name(). After locking the origin, we
can try to get the tuple and do the existence check, and we can reuse
this tuple to perform origin delete. In this approach we only need to check
origin existence once after locking. BTW, if we do this, then we'd better rename the
replorigin_drop_guts() to something like replorigin_state_clear() as the function
only clear the in-memory information after that.

The code could be like:

-------
replorigin_drop_by_name(const char *name, bool missing_ok, bool nowait)
...
/*
* Lock the origin to prevent concurrent drops. We keep the lock until the
* end of transaction.
*/
LockSharedObject(ReplicationOriginRelationId, roident, 0,
AccessExclusiveLock);

tuple = SearchSysCache1(REPLORIGIDENT, ObjectIdGetDatum(roident));
if (!HeapTupleIsValid(tuple))
{
if (!missing_ok)
elog(ERROR, "cache lookup failed for replication origin with ID %d",
roident);

return;
}

replorigin_state_clear(rel, roident, nowait);

/*
* Now, we can delete the catalog entry.
*/
CatalogTupleDelete(rel, &tuple->t_self);
ReleaseSysCache(tuple);

CommandCounterIncrement();

The attached updated patch has the changes to handle the same.

I had not merged one of my local changes; please find the updated patch
including that change. Sorry for missing it.

Regards,
Vignesh

Attachments:

v3-0001-Lock-the-replication-origin-record-instead-of-loc.patch (text/x-patch)
From 5ed769a1ed3b25e0a19ad4b235df8a4140870635 Mon Sep 17 00:00:00 2001
From: Vignesh C <vignesh21@gmail.com>
Date: Fri, 27 Jan 2023 15:17:09 +0530
Subject: [PATCH v3] Lock the replication origin record instead of locking the
 pg_replication_origin relation.

Lock the replication origin record instead of locking the
pg_replication_origin relation.
---
 src/backend/replication/logical/origin.c | 57 ++++++++++++------------
 1 file changed, 28 insertions(+), 29 deletions(-)

diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index b754c43840..5837818ecf 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -338,16 +338,14 @@ replorigin_create(const char *roname)
  * Helper function to drop a replication origin.
  */
 static void
-replorigin_drop_guts(Relation rel, RepOriginId roident, bool nowait)
+replorigin_state_clear(RepOriginId roident, bool nowait)
 {
-	HeapTuple	tuple;
 	int			i;
 
 	/*
-	 * First, clean up the slot state info, if there is any matching slot.
+	 * Clean up the slot state info, if there is any matching slot.
 	 */
 restart:
-	tuple = NULL;
 	LWLockAcquire(ReplicationOriginLock, LW_EXCLUSIVE);
 
 	for (i = 0; i < max_replication_slots; i++)
@@ -402,19 +400,6 @@ restart:
 	}
 	LWLockRelease(ReplicationOriginLock);
 	ConditionVariableCancelSleep();
-
-	/*
-	 * Now, we can delete the catalog entry.
-	 */
-	tuple = SearchSysCache1(REPLORIGIDENT, ObjectIdGetDatum(roident));
-	if (!HeapTupleIsValid(tuple))
-		elog(ERROR, "cache lookup failed for replication origin with ID %d",
-			 roident);
-
-	CatalogTupleDelete(rel, &tuple->t_self);
-	ReleaseSysCache(tuple);
-
-	CommandCounterIncrement();
 }
 
 /*
@@ -427,24 +412,38 @@ replorigin_drop_by_name(const char *name, bool missing_ok, bool nowait)
 {
 	RepOriginId roident;
 	Relation	rel;
+	HeapTuple	tuple;
 
 	Assert(IsTransactionState());
 
-	/*
-	 * To interlock against concurrent drops, we hold ExclusiveLock on
-	 * pg_replication_origin till xact commit.
-	 *
-	 * XXX We can optimize this by acquiring the lock on a specific origin by
-	 * using LockSharedObject if required. However, for that, we first to
-	 * acquire a lock on ReplicationOriginRelationId, get the origin_id, lock
-	 * the specific origin and then re-check if the origin still exists.
-	 */
-	rel = table_open(ReplicationOriginRelationId, ExclusiveLock);
+	rel = table_open(ReplicationOriginRelationId, RowExclusiveLock);
 
 	roident = replorigin_by_name(name, missing_ok);
 
-	if (OidIsValid(roident))
-		replorigin_drop_guts(rel, roident, nowait);
+	/* Lock the origin to prevent concurrent drops */
+	LockSharedObject(ReplicationOriginRelationId, roident, 0,
+					 AccessExclusiveLock);
+
+	tuple = SearchSysCache1(REPLORIGIDENT, ObjectIdGetDatum(roident));
+	if (!HeapTupleIsValid(tuple))
+	{
+		if (!missing_ok)
+			elog(ERROR, "cache lookup failed for replication origin with ID %d",
+				 roident);
+
+		table_close(rel, NoLock);
+		return;
+	}
+
+	replorigin_state_clear(roident, nowait);
+
+	/*
+	 * Now, we can delete the catalog entry.
+	 */
+	CatalogTupleDelete(rel, &tuple->t_self);
+	ReleaseSysCache(tuple);
+
+	CommandCounterIncrement();
 
 	/* We keep the lock on pg_replication_origin until commit */
 	table_close(rel, NoLock);
-- 
2.34.1

#16 houzj.fnst@fujitsu.com
houzj.fnst@fujitsu.com
In reply to: vignesh C (#15)
RE: Deadlock between logrep apply worker and tablesync worker

On Tuesday, January 31, 2023 1:07 AM vignesh C <vignesh21@gmail.com> wrote:

On Mon, 30 Jan 2023 at 17:30, vignesh C <vignesh21@gmail.com> wrote:

On Mon, 30 Jan 2023 at 13:00, houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:

On Monday, January 30, 2023 2:32 PM Amit Kapila

<amit.kapila16@gmail.com> wrote:

On Mon, Jan 30, 2023 at 9:20 AM vignesh C <vignesh21@gmail.com>

wrote:

On Sat, 28 Jan 2023 at 11:26, Amit Kapila <amit.kapila16@gmail.com>

wrote:

One thing that looks a bit odd is that we will anyway have a
similar check in replorigin_drop_guts() which is a static
function and called from only one place, so, will it be required to

check at both places?

There is a possibility that the initial check to verify if
replication origin exists in replorigin_drop_by_name was
successful but later one of either table sync worker or apply
worker process might have dropped the replication origin,

Won't locking on the particular origin prevent concurrent drops?
IIUC, the drop happens after the patch acquires the lock on the origin.

Yes, I think the existence check in replorigin_drop_guts is
unnecessary as we already lock the origin before that. I think the
check in replorigin_drop_guts is a custom check after calling
SearchSysCache1 to get the tuple, but the error should not happen as no

concurrent drop can be performed.

To make it simpler, one idea is to move the code that getting the
tuple from system cache to the replorigin_drop_by_name(). After
locking the origin, we can try to get the tuple and do the existence
check, and we can reuse this tuple to perform origin delete. In this
approach we only need to check origin existence once after locking.
BTW, if we do this, then we'd better rename the
replorigin_drop_guts() to something like replorigin_state_clear() as
the function only clear the in-memory information after that.

The attached updated patch has the changes to handle the same.

I had not merged one of the local changes that was present, please find the
updated patch including that change. Sorry for missing that change.

I also tried to measure the time of "src/test/subscription/t/002_types.pl"
before and after the patch (changing the lock level) and Tom's patch
(splitting the transaction), like what Vignesh has shared on -hackers.

I ran each case about 100 times. Tom's patch and the lock-level patch
behave similarly on my machines [1].

HEAD: 3426 ~ 6425 ms
HEAD + Tom: 3404 ~ 3462 ms
HEAD + Vignesh: 3419 ~ 3474 ms
HEAD + Tom + Vignesh: 3408 ~ 3454 ms

Even apart from the reduction in test time, reducing the lock level and
locking the specific object also helps reduce lock contention, which
benefits users (of the exposed function), the tablesync workers, and the
apply worker. So, I think pushing the patch to change the lock level
makes sense.

And the patch looks good to me.

While on it, after pushing the patch, I think there is another case that
might also be worth improving: the tablesync and apply workers trying to
drop the same origin, which can cause some delay. This is a different
case from the deadlock, so I feel we can try to improve it in a separate
patch.

[1]: CentOS 8.2, 128G RAM, 40 processors Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz

Best Regards,
Hou zj

#17 Amit Kapila
amit.kapila16@gmail.com
In reply to: houzj.fnst@fujitsu.com (#16)
1 attachment(s)
Re: Deadlock between logrep apply worker and tablesync worker

On Thu, Feb 2, 2023 at 12:05 PM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:

On Tuesday, January 31, 2023 1:07 AM vignesh C <vignesh21@gmail.com> wrote:

On Mon, 30 Jan 2023 at 17:30, vignesh C <vignesh21@gmail.com> wrote:

I also tried to test the time of "src/test/subscription/t/002_types.pl"
before and after the patch(change the lock level) and Tom's patch(split
transaction) like what Vignesh has shared on -hackers.

I run about 100 times for each case. Tom's and the lock level patch
behave similarly on my machines[1].

HEAD: 3426 ~ 6425 ms
HEAD + Tom: 3404 ~ 3462 ms
HEAD + Vignesh: 3419 ~ 3474 ms
HEAD + Tom + Vignesh: 3408 ~ 3454 ms

Even apart from the testing time reduction, reducing the lock level and lock
the specific object can also help improve the lock contention which user(that
use the exposed function) , table sync worker and apply worker can also benefit
from it. So, I think pushing the patch to change the lock level makes sense.

And the patch looks good to me.

Thanks for the tests. I also see a reduction in test time variability
with Vignesh's patch. I think we can release the locks in case the
origin is concurrently dropped as in the attached patch. I am planning
to commit this patch tomorrow unless there are more comments or
objections.

While on it, after pushing the patch, I think there is another case might also
worth to be improved, that is the table sync and apply worker try to drop the
same origin which might cause some delay. This is another case(different from
the deadlock), so I feel we can try to improve this in another patch.

Right, I think that case could be addressed by Tom's patch to some
extent, but I am thinking we should also try to analyze whether we can
completely avoid the need to remove origins from both processes. One
idea could be to introduce another relstate, something like
PRE_SYNCDONE, and set it in a separate transaction before we set the
state to SYNCDONE and remove the slot and origin in the tablesync
worker. Then, if the tablesync worker errors out for some reason during
the second transaction, it can remove the slot and origin after restart
by checking the state. However, it would add another relstate, which may
not be the best way to address this problem. Anyway, that can be
accomplished as a separate patch.
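A very rough, purely hypothetical sketch of the tablesync-side sequence
(the PRE_SYNCDONE state and the variable names are invented here for
illustration; subid, relid, originname, and current_lsn are assumed from
the surrounding tablesync code):

/* Hypothetical new relstate, for illustration only. */
#define SUBREL_STATE_PRE_SYNCDONE   'p'

    /* First transaction: advertise that the sync is essentially done. */
    UpdateSubscriptionRelState(subid, relid, SUBREL_STATE_PRE_SYNCDONE,
                               current_lsn);
    CommitTransactionCommand();

    /*
     * Second transaction: the tablesync worker alone drops its origin and
     * then marks the relation SYNCDONE.  If we error out or crash in here,
     * a restarted tablesync worker would see PRE_SYNCDONE and redo this
     * cleanup itself, so the apply worker would never need to drop the
     * origin.
     */
    StartTransactionCommand();
    replorigin_drop_by_name(originname, true /* missing_ok */, false /* nowait */);
    UpdateSubscriptionRelState(subid, relid, SUBREL_STATE_SYNCDONE,
                               current_lsn);
    CommitTransactionCommand();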

--
With Regards,
Amit Kapila.

Attachments:

v4-0001-Optimize-the-origin-drop-functionality.patch (application/octet-stream)
From e77d5a38103fb5a9a64d60df6126f07393b52f0c Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Thu, 2 Feb 2023 14:24:23 +0530
Subject: [PATCH v4] Optimize the origin drop functionality.

To interlock against concurrent drops, we use to hold ExclusiveLock on
pg_replication_origin till xact commit. This blocks even concurrent drops
of different origins by tablesync workers. So, instead, lock the specific
origin to interlock against concurrent drops.

This reduces the test time variability in src/test/subscription where
multiple tables are being synced.

Author: Vignesh C
Reviewed-by: Hou Zhijie, Amit Kapila
Discussion: https://postgr.es/m/1412708.1674417574@sss.pgh.pa.us
---
 src/backend/replication/logical/origin.c | 62 +++++++++++++-----------
 1 file changed, 33 insertions(+), 29 deletions(-)

diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index b754c43840..2c04c8707d 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -338,16 +338,14 @@ replorigin_create(const char *roname)
  * Helper function to drop a replication origin.
  */
 static void
-replorigin_drop_guts(Relation rel, RepOriginId roident, bool nowait)
+replorigin_state_clear(RepOriginId roident, bool nowait)
 {
-	HeapTuple	tuple;
 	int			i;
 
 	/*
-	 * First, clean up the slot state info, if there is any matching slot.
+	 * Clean up the slot state info, if there is any matching slot.
 	 */
 restart:
-	tuple = NULL;
 	LWLockAcquire(ReplicationOriginLock, LW_EXCLUSIVE);
 
 	for (i = 0; i < max_replication_slots; i++)
@@ -402,19 +400,6 @@ restart:
 	}
 	LWLockRelease(ReplicationOriginLock);
 	ConditionVariableCancelSleep();
-
-	/*
-	 * Now, we can delete the catalog entry.
-	 */
-	tuple = SearchSysCache1(REPLORIGIDENT, ObjectIdGetDatum(roident));
-	if (!HeapTupleIsValid(tuple))
-		elog(ERROR, "cache lookup failed for replication origin with ID %d",
-			 roident);
-
-	CatalogTupleDelete(rel, &tuple->t_self);
-	ReleaseSysCache(tuple);
-
-	CommandCounterIncrement();
 }
 
 /*
@@ -427,24 +412,43 @@ replorigin_drop_by_name(const char *name, bool missing_ok, bool nowait)
 {
 	RepOriginId roident;
 	Relation	rel;
+	HeapTuple	tuple;
 
 	Assert(IsTransactionState());
 
-	/*
-	 * To interlock against concurrent drops, we hold ExclusiveLock on
-	 * pg_replication_origin till xact commit.
-	 *
-	 * XXX We can optimize this by acquiring the lock on a specific origin by
-	 * using LockSharedObject if required. However, for that, we first to
-	 * acquire a lock on ReplicationOriginRelationId, get the origin_id, lock
-	 * the specific origin and then re-check if the origin still exists.
-	 */
-	rel = table_open(ReplicationOriginRelationId, ExclusiveLock);
+	rel = table_open(ReplicationOriginRelationId, RowExclusiveLock);
 
 	roident = replorigin_by_name(name, missing_ok);
 
-	if (OidIsValid(roident))
-		replorigin_drop_guts(rel, roident, nowait);
+	/* Lock the origin to prevent concurrent drops. */
+	LockSharedObject(ReplicationOriginRelationId, roident, 0,
+					 AccessExclusiveLock);
+
+	tuple = SearchSysCache1(REPLORIGIDENT, ObjectIdGetDatum(roident));
+	if (!HeapTupleIsValid(tuple))
+	{
+		if (!missing_ok)
+			elog(ERROR, "cache lookup failed for replication origin with ID %d",
+				 roident);
+
+		/*
+		 * We don't need to retain the locks if the origin is already dropped.
+		 */
+		UnlockSharedObject(ReplicationOriginRelationId, roident, 0,
+						   AccessExclusiveLock);
+		table_close(rel, RowExclusiveLock);
+		return;
+	}
+
+	replorigin_state_clear(roident, nowait);
+
+	/*
+	 * Now, we can delete the catalog entry.
+	 */
+	CatalogTupleDelete(rel, &tuple->t_self);
+	ReleaseSysCache(tuple);
+
+	CommandCounterIncrement();
 
 	/* We keep the lock on pg_replication_origin until commit */
 	table_close(rel, NoLock);
-- 
2.28.0.windows.1

#18Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#17)
Re: Deadlock between logrep apply worker and tablesync worker

On Thu, Feb 2, 2023 at 4:51 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Thanks for the tests. I also see a reduction in test time variability
with Vignesh's patch. I think we can release the locks in case the
origin is concurrently dropped as in the attached patch. I am planning
to commit this patch tomorrow unless there are more comments or
objections.

Pushed.

--
With Regards,
Amit Kapila.

#19houzj.fnst@fujitsu.com
houzj.fnst@fujitsu.com
In reply to: Amit Kapila (#17)
1 attachment(s)
RE: Deadlock between logrep apply worker and tablesync worker

On Thursday, February 2, 2023 7:21 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Feb 2, 2023 at 12:05 PM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:

On Tuesday, January 31, 2023 1:07 AM vignesh C <vignesh21@gmail.com>

wrote:

On Mon, 30 Jan 2023 at 17:30, vignesh C <vignesh21@gmail.com> wrote:

I also tried to test the time of "src/test/subscription/t/002_types.pl"
before and after the patch(change the lock level) and Tom's
patch(split
transaction) like what Vignesh has shared on -hackers.

I run about 100 times for each case. Tom's and the lock level patch
behave similarly on my machines[1].

HEAD: 3426 ~ 6425 ms
HEAD + Tom: 3404 ~ 3462 ms
HEAD + Vignesh: 3419 ~ 3474 ms
HEAD + Tom + Vignesh: 3408 ~ 3454 ms

Even apart from the testing time reduction, reducing the lock level
and lock the specific object can also help improve the lock contention
which user(that use the exposed function) , table sync worker and
apply worker can also benefit from it. So, I think pushing the patch to change

the lock level makes sense.

And the patch looks good to me.

Thanks for the tests. I also see a reduction in test time variability with Vignesh's
patch. I think we can release the locks in case the origin is concurrently
dropped as in the attached patch. I am planning to commit this patch
tomorrow unless there are more comments or objections.

While on it, after pushing the patch, I think there is another case
might also worth to be improved, that is the table sync and apply
worker try to drop the same origin which might cause some delay. This
is another case(different from the deadlock), so I feel we can try to improve

this in another patch.

Right, I think that case could be addressed by Tom's patch to some extent but
I am thinking we should also try to analyze if we can completely avoid the need
to remove origins from both processes. One idea could be to introduce
another relstate something like PRE_SYNCDONE and set it in a separate
transaction before we set the state as SYNCDONE and remove the slot and
origin in tablesync worker.
Now, if the tablesync worker errors out due to some reason during the second
transaction, it can remove the slot and origin after restart by checking the state.
However, it would add another relstate which may not be the best way to
address this problem. Anyway, that can be accomplished as a separate patch.

Here is an attempt to achieve the same.
Basically, the patch removes the code that drops the origin in the apply
worker, and adds a new state PRE_SYNCDONE that is reached after synchronization
has finished in front of apply (sublsn set), but before dropping the origin and
the other final cleanup. The tablesync worker will restart and redo the cleanup
if it fails after reaching the new state. Besides, since changes can already be
applied to the table in the PRE_SYNCDONE state, I also modified the check in
should_apply_changes_for_rel(). Some other conditions for the origin drop in
the subscription commands were adjusted in this patch as well.

Best Regards,
Hou zj

Attachments:

0001-Avoid-dropping-origins-from-both-apply-and-tablesync.patchapplication/octet-stream; name=0001-Avoid-dropping-origins-from-both-apply-and-tablesync.patchDownload
From 624f31ab769e137999a739f400b36af939319c4f Mon Sep 17 00:00:00 2001
From: Hou Zhijie <houzj.fnst@cn.fujitsu.com>
Date: Fri, 3 Feb 2023 10:55:55 +0800
Subject: [PATCH] Avoid dropping origins from both apply and tablesync worker

Currently, both the apply worker and tablesync worker try to drop the origin.
This can cause the synchronization to take longer due to lock contention.

Previously, we allowed the apply worker to drop the origin to avoid the case
that the tablesync worker fails to the origin(due to crash). In this case we
don't restart the tablesync worker, and the apply worker can clean the origin.

To improve this, we introduce a new relstate SUBREL_STATE_PRE_SYNCDONE which
will be set after synchronization finished in front of apply (sublsn set), but
before dropping the origin and other final cleanups. The apply worker will
restart tablesync worker if the relstate is SUBREL_STATE_PRE_SYNCDONE. This
way, even if the tablesync worker error out in the transaction that tries to
drop the origin, the apply worker will restart the tablesync worker to redo the
cleanup(for origin and other stuff) and then directly exit.

---
 doc/src/sgml/catalogs.sgml                  |   7 +-
 src/backend/commands/subscriptioncmds.c     |  25 ++--
 src/backend/replication/logical/tablesync.c | 193 +++++++++++++++-------------
 src/backend/replication/logical/worker.c    |   3 +-
 src/include/catalog/pg_subscription_rel.h   |   6 +-
 5 files changed, 125 insertions(+), 109 deletions(-)

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index c1e4048..3d9491a 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -8071,7 +8071,8 @@ SCRAM-SHA-256$<replaceable>&lt;iteration count&gt;</replaceable>:<replaceable>&l
        <literal>i</literal> = initialize,
        <literal>d</literal> = data is being copied,
        <literal>f</literal> = finished table copy,
-       <literal>s</literal> = synchronized,
+       <literal>p</literal> = synchronized but not yet cleaned up,
+       <literal>s</literal> = synchronization done,
        <literal>r</literal> = ready (normal replication)
       </para></entry>
      </row>
@@ -8082,8 +8083,8 @@ SCRAM-SHA-256$<replaceable>&lt;iteration count&gt;</replaceable>:<replaceable>&l
       </para>
       <para>
        Remote LSN of the state change used for synchronization coordination
-       when in <literal>s</literal> or <literal>r</literal> states,
-       otherwise null
+       when in <literal>p</literal>, <literal>s</literal> or
+       <literal>r</literal> states, otherwise null
       </para></entry>
      </row>
     </tbody>
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 464db6d..a53f9d1 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -929,10 +929,10 @@ AlterSubscription_refresh(Subscription *sub, bool copy_data,
 				logicalrep_worker_stop(sub->oid, relid);
 
 				/*
-				 * For READY state, we would have already dropped the
-				 * tablesync origin.
+				 * For READY state and SYNCDONE state, we would have already
+				 * dropped the tablesync origin.
 				 */
-				if (state != SUBREL_STATE_READY)
+				if (state != SUBREL_STATE_READY && state != SUBREL_STATE_SYNCDONE)
 				{
 					char		originname[NAMEDATALEN];
 
@@ -940,11 +940,8 @@ AlterSubscription_refresh(Subscription *sub, bool copy_data,
 					 * Drop the tablesync's origin tracking if exists.
 					 *
 					 * It is possible that the origin is not yet created for
-					 * tablesync worker, this can happen for the states before
-					 * SUBREL_STATE_FINISHEDCOPY. The tablesync worker or
-					 * apply worker can also concurrently try to drop the
-					 * origin and by this time the origin might be already
-					 * removed. For these reasons, passing missing_ok = true.
+					 * tablesync worker so passing missing_ok = true. This can
+					 * happen for the states before SUBREL_STATE_FINISHEDCOPY.
 					 */
 					ReplicationOriginNameForLogicalRep(sub->oid, relid, originname,
 													   sizeof(originname));
@@ -1536,13 +1533,19 @@ DropSubscription(DropSubscriptionStmt *stmt, bool isTopLevel)
 		/*
 		 * Drop the tablesync's origin tracking if exists.
 		 *
+		 * For SYNCDONE/READY states, the tablesync origin tracking is known
+		 * to have already been dropped by the tablesync worker.
+		 *
 		 * It is possible that the origin is not yet created for tablesync
 		 * worker so passing missing_ok = true. This can happen for the states
 		 * before SUBREL_STATE_FINISHEDCOPY.
 		 */
-		ReplicationOriginNameForLogicalRep(subid, relid, originname,
-										   sizeof(originname));
-		replorigin_drop_by_name(originname, true, false);
+		if (rstate->state != SUBREL_STATE_SYNCDONE)
+		{
+			ReplicationOriginNameForLogicalRep(subid, relid, originname,
+											   sizeof(originname));
+			replorigin_drop_by_name(originname, true, false);
+		}
 	}
 
 	/* Clean up dependencies */
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 07eea50..9762ff5 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -274,6 +274,82 @@ invalidate_syncing_table_states(Datum arg, int cacheid, uint32 hashvalue)
 }
 
 /*
+ * Update the state of the table to SUBREL_STATE_SYNCDONE and cleanup the
+ * tablesync slot and drop the tablesync's origin tracking.
+ */
+static void
+finish_synchronization(bool restart_after_crash)
+{
+	char		syncslotname[NAMEDATALEN] = {0};
+	char		originname[NAMEDATALEN] = {0};
+
+	StartTransactionCommand();
+
+	SpinLockAcquire(&MyLogicalRepWorker->relmutex);
+	Assert(MyLogicalRepWorker->relstate == SUBREL_STATE_PRE_SYNCDONE);
+	MyLogicalRepWorker->relstate = SUBREL_STATE_SYNCDONE;
+	SpinLockRelease(&MyLogicalRepWorker->relmutex);
+
+	UpdateSubscriptionRelState(MyLogicalRepWorker->subid,
+							   MyLogicalRepWorker->relid,
+							   MyLogicalRepWorker->relstate,
+							   MyLogicalRepWorker->relstate_lsn);
+
+	ReplicationOriginNameForLogicalRep(MyLogicalRepWorker->subid,
+									   MyLogicalRepWorker->relid,
+									   originname,
+									   sizeof(originname));
+
+	/*
+	 * Resetting the origin session removes the ownership of the slot. This is
+	 * needed to allow the origin to be dropped. But note that if the tablesync
+	 * worker restarts after crash. we don't need to reset the origin because
+	 * it has not been setup yet.
+	 */
+	if (!restart_after_crash)
+	{
+		replorigin_session_reset();
+		replorigin_session_origin = InvalidRepOriginId;
+		replorigin_session_origin_lsn = InvalidXLogRecPtr;
+		replorigin_session_origin_timestamp = 0;
+	}
+
+	/*
+	 * We expect that origin must be present. The concurrent operations
+	 * that remove origin like a refresh for the subscription take an
+	 * access exclusive lock on pg_subscription which prevent the previous
+	 * operation to update the rel state to SUBREL_STATE_SYNCDONE to
+	 * succeed.
+	 */
+	replorigin_drop_by_name(originname, false, false);
+
+	/*
+	 * Cleanup the tablesync slot.
+	 *
+	 * This has to be done after updating the state because otherwise if
+	 * there is an error while doing the database operations we won't be
+	 * able to rollback dropped slot.
+	 */
+	ReplicationSlotNameForTablesync(MyLogicalRepWorker->subid,
+									MyLogicalRepWorker->relid,
+									syncslotname,
+									sizeof(syncslotname));
+
+	/*
+	 * Normally, It is important to give an error if we are unable to drop the
+	 * slot, otherwise, it won't be dropped till the corresponding subscription
+	 * is dropped. So passing missing_ok = false. But if the tablesync worker
+	 * restarts after crash, the slot may have been dropped, so we allow
+	 * missing_ok = true for the drop.
+	 */
+	ReplicationSlotDropAtPubNode(LogRepWorkerWalRcvConn, syncslotname,
+								 restart_after_crash);
+
+	finish_sync_worker();
+}
+
+
+/*
  * Handle table synchronization cooperation from the synchronization
  * worker.
  *
@@ -284,18 +360,15 @@ invalidate_syncing_table_states(Datum arg, int cacheid, uint32 hashvalue)
 static void
 process_syncing_tables_for_sync(XLogRecPtr current_lsn)
 {
+	TimeLineID	tli;
+
 	SpinLockAcquire(&MyLogicalRepWorker->relmutex);
 
 	if (MyLogicalRepWorker->relstate == SUBREL_STATE_CATCHUP &&
 		current_lsn >= MyLogicalRepWorker->relstate_lsn)
 	{
-		TimeLineID	tli;
-		char		syncslotname[NAMEDATALEN] = {0};
-		char		originname[NAMEDATALEN] = {0};
-
-		MyLogicalRepWorker->relstate = SUBREL_STATE_SYNCDONE;
+		MyLogicalRepWorker->relstate = SUBREL_STATE_PRE_SYNCDONE;
 		MyLogicalRepWorker->relstate_lsn = current_lsn;
-
 		SpinLockRelease(&MyLogicalRepWorker->relmutex);
 
 		/*
@@ -304,80 +377,26 @@ process_syncing_tables_for_sync(XLogRecPtr current_lsn)
 		if (!IsTransactionState())
 			StartTransactionCommand();
 
+		/*
+		 * Set the state to PRE_SYNCDONE so that if the an error occurs before
+		 * setting the state to SYNCDONE the restarted tablesync worker can
+		 * exit via the fast path without starting streaming again.
+		 */
 		UpdateSubscriptionRelState(MyLogicalRepWorker->subid,
 								   MyLogicalRepWorker->relid,
 								   MyLogicalRepWorker->relstate,
 								   MyLogicalRepWorker->relstate_lsn);
 
-		/*
-		 * End streaming so that LogRepWorkerWalRcvConn can be used to drop
-		 * the slot.
-		 */
-		walrcv_endstreaming(LogRepWorkerWalRcvConn, &tli);
-
-		/*
-		 * Cleanup the tablesync slot.
-		 *
-		 * This has to be done after updating the state because otherwise if
-		 * there is an error while doing the database operations we won't be
-		 * able to rollback dropped slot.
-		 */
-		ReplicationSlotNameForTablesync(MyLogicalRepWorker->subid,
-										MyLogicalRepWorker->relid,
-										syncslotname,
-										sizeof(syncslotname));
-
-		/*
-		 * It is important to give an error if we are unable to drop the slot,
-		 * otherwise, it won't be dropped till the corresponding subscription
-		 * is dropped. So passing missing_ok = false.
-		 */
-		ReplicationSlotDropAtPubNode(LogRepWorkerWalRcvConn, syncslotname, false);
-
 		CommitTransactionCommand();
 		pgstat_report_stat(false);
 
 		/*
-		 * Start a new transaction to clean up the tablesync origin tracking.
-		 * This transaction will be ended within the finish_sync_worker().
-		 * Now, even, if we fail to remove this here, the apply worker will
-		 * ensure to clean it up afterward.
-		 *
-		 * We need to do this after the table state is set to SYNCDONE.
-		 * Otherwise, if an error occurs while performing the database
-		 * operation, the worker will be restarted and the in-memory state of
-		 * replication progress (remote_lsn) won't be rolled-back which would
-		 * have been cleared before restart. So, the restarted worker will use
-		 * invalid replication progress state resulting in replay of
-		 * transactions that have already been applied.
-		 */
-		StartTransactionCommand();
-
-		ReplicationOriginNameForLogicalRep(MyLogicalRepWorker->subid,
-										   MyLogicalRepWorker->relid,
-										   originname,
-										   sizeof(originname));
-
-		/*
-		 * Resetting the origin session removes the ownership of the slot.
-		 * This is needed to allow the origin to be dropped.
-		 */
-		replorigin_session_reset();
-		replorigin_session_origin = InvalidRepOriginId;
-		replorigin_session_origin_lsn = InvalidXLogRecPtr;
-		replorigin_session_origin_timestamp = 0;
-
-		/*
-		 * Drop the tablesync's origin tracking if exists.
-		 *
-		 * There is a chance that the user is concurrently performing refresh
-		 * for the subscription where we remove the table state and its origin
-		 * or the apply worker would have removed this origin. So passing
-		 * missing_ok = true.
+		 * End streaming so that LogRepWorkerWalRcvConn can be used to drop
+		 * the slot.
 		 */
-		replorigin_drop_by_name(originname, true, false);
+		walrcv_endstreaming(LogRepWorkerWalRcvConn, &tli);
 
-		finish_sync_worker();
+		finish_synchronization(false);
 	}
 	else
 		SpinLockRelease(&MyLogicalRepWorker->relmutex);
@@ -463,8 +482,6 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 			 */
 			if (current_lsn >= rstate->lsn)
 			{
-				char		originname[NAMEDATALEN];
-
 				rstate->state = SUBREL_STATE_READY;
 				rstate->lsn = current_lsn;
 				if (!started_tx)
@@ -473,26 +490,7 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 					started_tx = true;
 				}
 
-				/*
-				 * Remove the tablesync origin tracking if exists.
-				 *
-				 * There is a chance that the user is concurrently performing
-				 * refresh for the subscription where we remove the table
-				 * state and its origin or the tablesync worker would have
-				 * already removed this origin. We can't rely on tablesync
-				 * worker to remove the origin tracking as if there is any
-				 * error while dropping we won't restart it to drop the
-				 * origin. So passing missing_ok = true.
-				 */
-				ReplicationOriginNameForLogicalRep(MyLogicalRepWorker->subid,
-												   rstate->relid,
-												   originname,
-												   sizeof(originname));
-				replorigin_drop_by_name(originname, true, false);
-
-				/*
-				 * Update the state to READY only after the origin cleanup.
-				 */
+				/* Update the state to READY. */
 				UpdateSubscriptionRelState(MyLogicalRepWorker->subid,
 										   rstate->relid, rstate->state,
 										   rstate->lsn);
@@ -1283,7 +1281,8 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
 
 	Assert(MyLogicalRepWorker->relstate == SUBREL_STATE_INIT ||
 		   MyLogicalRepWorker->relstate == SUBREL_STATE_DATASYNC ||
-		   MyLogicalRepWorker->relstate == SUBREL_STATE_FINISHEDCOPY);
+		   MyLogicalRepWorker->relstate == SUBREL_STATE_FINISHEDCOPY ||
+		   MyLogicalRepWorker->relstate == SUBREL_STATE_PRE_SYNCDONE);
 
 	/* Assign the origin tracking record name. */
 	ReplicationOriginNameForLogicalRep(MySubscription->oid,
@@ -1327,6 +1326,16 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
 
 		goto copy_table_done;
 	}
+	else if (MyLogicalRepWorker->relstate == SUBREL_STATE_PRE_SYNCDONE)
+	{
+		/*
+		 * The table synchronization has finished in front of apply (sublsn
+		 * set), but the tablesync worker then crashed before setting the state
+		 * to SYNCDONE. So, we only need to perform the final cleanup, set the
+		 * state to SYNCDONE, then exit.
+		 */
+		finish_synchronization(true);
+	}
 
 	SpinLockAcquire(&MyLogicalRepWorker->relmutex);
 	MyLogicalRepWorker->relstate = SUBREL_STATE_DATASYNC;
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index cfb2ab6..9629372 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -510,7 +510,8 @@ should_apply_changes_for_rel(LogicalRepRelMapEntry *rel)
 	}
 	else
 		return (rel->state == SUBREL_STATE_READY ||
-				(rel->state == SUBREL_STATE_SYNCDONE &&
+				((rel->state == SUBREL_STATE_PRE_SYNCDONE ||
+				 rel->state == SUBREL_STATE_SYNCDONE) &&
 				 rel->statelsn <= remote_final_lsn));
 }
 
diff --git a/src/include/catalog/pg_subscription_rel.h b/src/include/catalog/pg_subscription_rel.h
index 60a2bcc..99a4e09 100644
--- a/src/include/catalog/pg_subscription_rel.h
+++ b/src/include/catalog/pg_subscription_rel.h
@@ -62,8 +62,10 @@ DECLARE_UNIQUE_INDEX_PKEY(pg_subscription_rel_srrelid_srsubid_index, 6117, Subsc
 									 * NULL) */
 #define SUBREL_STATE_FINISHEDCOPY 'f'	/* tablesync copy phase is completed
 										 * (sublsn NULL) */
-#define SUBREL_STATE_SYNCDONE	's' /* synchronization finished in front of
-									 * apply (sublsn set) */
+#define SUBREL_STATE_PRE_SYNCDONE	'p' /* synchronization finished in front of
+										 * apply (sublsn set), but the final
+										 * cleanup has not yet been performed */
+#define SUBREL_STATE_SYNCDONE	's' /* synchronization complete */
 #define SUBREL_STATE_READY		'r' /* ready (sublsn set) */
 
 /* These are never stored in the catalog, we only use them for IPC. */
-- 
2.7.2.windows.1

#20Peter Smith
smithpb2250@gmail.com
In reply to: houzj.fnst@fujitsu.com (#19)
Re: Deadlock between logrep apply worker and tablesync worker

On Fri, Feb 3, 2023 at 6:58 PM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:

...

Right, I think that case could be addressed by Tom's patch to some extent but
I am thinking we should also try to analyze if we can completely avoid the need
to remove origins from both processes. One idea could be to introduce
another relstate something like PRE_SYNCDONE and set it in a separate
transaction before we set the state as SYNCDONE and remove the slot and
origin in tablesync worker.
Now, if the tablesync worker errors out due to some reason during the second
transaction, it can remove the slot and origin after restart by checking the state.
However, it would add another relstate which may not be the best way to
address this problem. Anyway, that can be accomplished as a separate patch.

Here is an attempt to achieve the same.
Basically, the patch removes the code that drop the origin in apply worker. And
add a new state PRE_SYNCDONE after synchronization finished in front of apply
(sublsn set), but before dropping the origin and other final cleanups. The
tablesync will restart and redo the cleanup if it failed after reaching the new
state. Besides, since the changes can already be applied on the table in
PRE_SYNCDONE state, so I also modified the check in
should_apply_changes_for_rel(). And some other conditions for the origin drop
in subscription commands are were adjusted in this patch.

Here are some review comments for the 0001 patch

======
General Comment

0.
The idea of using the extra relstate for clean-up seems OK, but the
implementation of the new state in this patch appears misordered and
misnamed to me.

The state name should indicate what it is doing (PRE_SYNCDONE is
meaningless). The patch describes in several places that this state
means "synchronized, but not yet cleaned up"; therefore IMO the
SYNCDONE state should come *before* this new state. And since this new
state is for "cleanup", let's call it something like that.

To summarize, I don’t think the meaning of SYNCDONE should be touched.
SYNCDONE means the synchronization is done, same as before. And your
new "cleanup" state belongs directly *after* that. IMO it should be
like this:

1. STATE_INIT
2. STATE_DATASYNC
3. STATE_FINISHEDCOPY
4. STATE_SYNCDONE
5. STATE_CLEANUP <-- new relstate
6. STATE_READY

Of course, this is going to impact almost every aspect of the patch,
but I think everything will be basically the same as you have it now
-- only all the state names and comments need to be adjusted according
to the above.
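
For example, the pg_subscription_rel state codes might end up looking
something like this (the constant name and the 'c' letter for the new state
are just placeholders; the others are the existing codes):

#define SUBREL_STATE_INIT		'i'	/* initializing */
#define SUBREL_STATE_DATASYNC		'd'	/* data is being copied */
#define SUBREL_STATE_FINISHEDCOPY	'f'	/* tablesync copy phase is completed */
#define SUBREL_STATE_SYNCDONE		's'	/* synchronization done (meaning unchanged) */
#define SUBREL_STATE_CLEANUP		'c'	/* new: slot/origin cleanup is done */
#define SUBREL_STATE_READY		'r'	/* ready (normal replication) */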

======
Commit Message

1.
Previously, we allowed the apply worker to drop the origin to avoid the case
that the tablesync worker fails to the origin(due to crash). In this case we
don't restart the tablesync worker, and the apply worker can clean the origin.

~

There seem to be some words missing in this paragraph.

SUGGESTION
Previously, we allowed the apply worker to drop the origin as a way to
recover from the scenario where the tablesync worker failed to drop it
(due to crash).

~~~

2.
To improve this, we introduce a new relstate SUBREL_STATE_PRE_SYNCDONE which
will be set after synchronization finished in front of apply (sublsn set), but
before dropping the origin and other final cleanups. The apply worker will
restart tablesync worker if the relstate is SUBREL_STATE_PRE_SYNCDONE. This
way, even if the tablesync worker error out in the transaction that tries to
drop the origin, the apply worker will restart the tablesync worker to redo the
cleanup(for origin and other stuff) and then directly exit.

~

2a.
This is going to be impacted by my "General Comment". Notice how you
again describe it as "will be set after synchronization finished" --
more evidence that the new CLEANUP state should directly follow the
SYNCDONE state.

2b.
"error out” --> "encounters an error"

2c.
"cleanup(for origin" --> space before the "("

======
doc/src/sgml/catalogs.sgml

3.
@@ -8071,7 +8071,8 @@ SCRAM-SHA-256$<replaceable>&lt;iteration
count&gt;</replaceable>:<replaceable>&l
        <literal>i</literal> = initialize,
        <literal>d</literal> = data is being copied,
        <literal>f</literal> = finished table copy,
-       <literal>s</literal> = synchronized,
+       <literal>p</literal> = synchronized but not yet cleaned up,
+       <literal>s</literal> = synchronization done,
        <literal>r</literal> = ready (normal replication)
       </para></entry>
      </row>
@@ -8082,8 +8083,8 @@ SCRAM-SHA-256$<replaceable>&lt;iteration
count&gt;</replaceable>:<replaceable>&l
       </para>
       <para>
        Remote LSN of the state change used for synchronization coordination
-       when in <literal>s</literal> or <literal>r</literal> states,
-       otherwise null
+       when in <literal>p</literal>, <literal>s</literal> or
+       <literal>r</literal> states, otherwise null
       </para></entry>
      </row>
     </tbody>

This state order and choice of the letter are impacted by my "General Comment".

IMO it should be more like this:

State code: i = initialize, d = data is being copied, f = finished
table copy, s = synchronization done, c = clean-up done, r = ready
(normal replication)

======
src/backend/commands/subscriptioncmds.c

4. AlterSubscription_refresh

Some adjustments are needed according to my "General Comment".

~~~

5. DropSubscription

Some adjustments are needed according to my "General Comment".

======
src/backend/replication/logical/tablesync.c

6.

+ * Update the state of the table to SUBREL_STATE_SYNCDONE and cleanup the
+ * tablesync slot and drop the tablesync's origin tracking.
+ */
+static void
+finish_synchronization(bool restart_after_crash)

6a.
Suggest calling this function something like 'cleanup_after_synchronization'

~

6b.
Some adjustments to states and comments are needed according to my
"General Comment".

~~~

7. process_syncing_tables_for_sync

- MyLogicalRepWorker->relstate = SUBREL_STATE_SYNCDONE;
+ MyLogicalRepWorker->relstate = SUBREL_STATE_PRE_SYNCDONE;
  MyLogicalRepWorker->relstate_lsn = current_lsn;

This should just be setting SUBREL_STATE_SYNCDONE how it previously did.

Other states/comments in this function to change according to my
"General Comments".

~

8.
if (rstate->state == SUBREL_STATE_SYNCDONE)
{
/*
* Apply has caught up to the position where the table sync has
* finished. Mark the table as ready so that the apply will just
* continue to replicate it normally.
*/

That should now be checking for SUBREL_STATE_CLEANUPDONE according to
my "General Comment".

~~~

9. process_syncing_tables_for_apply

Some adjustments to states and comments are needed according to my
"General Comment".

~~~

10. LogicalRepSyncTableStart

Some adjustments to states and comments are needed according to my
"General Comment".

======
src/backend/replication/logical/worker.c

11. should_apply_changes_for_rel

Some adjustments to states according to my "General Comment".

======
src/include/catalog/pg_subscription_rel.h

12.
@@ -62,8 +62,10 @@
DECLARE_UNIQUE_INDEX_PKEY(pg_subscription_rel_srrelid_srsubid_index,
6117, Subsc
  * NULL) */
 #define SUBREL_STATE_FINISHEDCOPY 'f' /* tablesync copy phase is completed
  * (sublsn NULL) */
-#define SUBREL_STATE_SYNCDONE 's' /* synchronization finished in front of
- * apply (sublsn set) */
+#define SUBREL_STATE_PRE_SYNCDONE 'p' /* synchronization finished in front of
+ * apply (sublsn set), but the final
+ * cleanup has not yet been performed */
+#define SUBREL_STATE_SYNCDONE 's' /* synchronization complete */
 #define SUBREL_STATE_READY 'r' /* ready (sublsn set) */

Some adjustments to states and comments are needed according to my
"General Comment".

------
Kind Regards,
Peter Smith.
Fujitsu Australia

#21houzj.fnst@fujitsu.com
houzj.fnst@fujitsu.com
In reply to: Peter Smith (#20)
RE: Deadlock between logrep apply worker and tablesync worker

On Tuesday, February 7, 2023 12:12 PM Peter Smith <smithpb2250@gmail.com> wrote:

On Fri, Feb 3, 2023 at 6:58 PM houzj.fnst@fujitsu.com <houzj.fnst@fujitsu.com>
wrote:

...

Right, I think that case could be addressed by Tom's patch to some
extent but I am thinking we should also try to analyze if we can
completely avoid the need to remove origins from both processes. One
idea could be to introduce another relstate something like
PRE_SYNCDONE and set it in a separate transaction before we set the
state as SYNCDONE and remove the slot and origin in tablesync worker.
Now, if the tablesync worker errors out due to some reason during
the second transaction, it can remove the slot and origin after restart by

checking the state.

However, it would add another relstate which may not be the best way
to address this problem. Anyway, that can be accomplished as a separate

patch.

Here is an attempt to achieve the same.
Basically, the patch removes the code that drop the origin in apply
worker. And add a new state PRE_SYNCDONE after synchronization
finished in front of apply (sublsn set), but before dropping the
origin and other final cleanups. The tablesync will restart and redo
the cleanup if it failed after reaching the new state. Besides, since
the changes can already be applied on the table in PRE_SYNCDONE state,
so I also modified the check in should_apply_changes_for_rel(). And
some other conditions for the origin drop in subscription commands are

were adjusted in this patch.

Here are some review comments for the 0001 patch

======
General Comment

0.
The idea of using the extra relstate for clean-up seems OK, but the
implementation of the new state in this patch appears misordered and
misnamed to me.

The state name should indicate what it is doing (PRE_SYNCDONE is
meaningless). The patch describes in several places that this state means
"synchronized, but not yet cleaned up" therefore IMO it means the SYNCDONE
state should be *before* this new state. And since this new state is for
"cleanup" then let's call it something like that.

To summarize, I don’t think the meaning of SYNCDONE should be touched.
SYNCDONE means the synchronization is done, same as before. And your new
"cleanup" state belongs directly *after* that. IMO it should be like this:

1. STATE_INIT
2. STATE_DATASYNC
3. STATE_FINISHEDCOPY
4. STATE_SYNCDONE
5. STATE_CLEANUP <-- new relstate
6. STATE_READY

Of course, this is going to impact almost every aspect of the patch, but I think
everything will be basically the same as you have it now
-- only all the state names and comments need to be adjusted according to the
above.

Although I agree that CLEANUP is easier to understand, I am a bit concerned
that the changes would be somewhat invasive.

If we add a CLEANUP state at the end as suggested, it will change the meaning
of the existing SYNCDONE state: before the change it means both the data sync
and the cleanup have been done, but after the change it would only mean the
data sync is over. This also means all the current C code that treats SYNCDONE
as the final state of table sync would need to be changed. Moreover, it's
common for users to query the relation state from pg_subscription_rel to
identify whether the table sync of a table is finished (e.g. by checking
relstate IN ('r', 's')), but if we add a new state (CLEANUP) as the final
state, then all such queries would need to be changed to check something like
relstate IN ('r', 'x') (where 'x' is the new cleanup state).

Best Regards,
Hou zj

#22Peter Smith
smithpb2250@gmail.com
In reply to: houzj.fnst@fujitsu.com (#21)
Re: Deadlock between logrep apply worker and tablesync worker

On Tue, Feb 7, 2023 at 6:46 PM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:

On Tuesday, February 7, 2023 12:12 PM Peter Smith <smithpb2250@gmail.com> wrote:

On Fri, Feb 3, 2023 at 6:58 PM houzj.fnst@fujitsu.com <houzj.fnst@fujitsu.com>
wrote:

...

Right, I think that case could be addressed by Tom's patch to some
extent but I am thinking we should also try to analyze if we can
completely avoid the need to remove origins from both processes. One
idea could be to introduce another relstate something like
PRE_SYNCDONE and set it in a separate transaction before we set the
state as SYNCDONE and remove the slot and origin in tablesync worker.
Now, if the tablesync worker errors out due to some reason during
the second transaction, it can remove the slot and origin after restart by

checking the state.

However, it would add another relstate which may not be the best way
to address this problem. Anyway, that can be accomplished as a separate

patch.

Here is an attempt to achieve the same.
Basically, the patch removes the code that drop the origin in apply
worker. And add a new state PRE_SYNCDONE after synchronization
finished in front of apply (sublsn set), but before dropping the
origin and other final cleanups. The tablesync will restart and redo
the cleanup if it failed after reaching the new state. Besides, since
the changes can already be applied on the table in PRE_SYNCDONE state,
so I also modified the check in should_apply_changes_for_rel(). And
some other conditions for the origin drop in subscription commands are

were adjusted in this patch.

Here are some review comments for the 0001 patch

======
General Comment

0.
The idea of using the extra relstate for clean-up seems OK, but the
implementation of the new state in this patch appears misordered and
misnamed to me.

The state name should indicate what it is doing (PRE_SYNCDONE is
meaningless). The patch describes in several places that this state means
"synchronized, but not yet cleaned up" therefore IMO it means the SYNCDONE
state should be *before* this new state. And since this new state is for
"cleanup" then let's call it something like that.

To summarize, I don’t think the meaning of SYNCDONE should be touched.
SYNCDONE means the synchronization is done, same as before. And your new
"cleanup" state belongs directly *after* that. IMO it should be like this:

1. STATE_INIT
2. STATE_DATASYNC
3. STATE_FINISHEDCOPY
4. STATE_SYNCDONE
5. STATE_CLEANUP <-- new relstate
6. STATE_READY

Of course, this is going to impact almost every aspect of the patch, but I think
everything will be basically the same as you have it now
-- only all the state names and comments need to be adjusted according to the
above.

Although I agree the CLEANUP is easier to understand, but I am a bit concerned
that the changes would be a bit invasive.

If we add a CLEANUP state at the end as suggested, it will change the meaning
of the existing SYNCDONE state, before the change it means both data sync and
cleanup have been done, but after the change it only mean the data sync is
over. This also means all the current C codes that considered the SYNCDONE as
the final state of table sync will need to be changed. Moreover, it's common
for user to query the relation state from pg_subscription_rel to identify if
the table sync of a table is finished(e.g. check relstate IN ('r', 's')), but
if we add a new state(CLEANUP) as the final state, then all these SQLs would
need to be changed as they need to check like relstate IN ('r', 'x'(new cleanup
state)).

IIUC, you are saying that we still want to keep the SYNCDONE state as
the last state before READY mainly because otherwise there is too much
impact on user/test SQL that is currently checking those ('s','r')
states.

OTOH, in the current 0001 patch you had SUBREL_STATE_PRE_SYNCDONE
meaning "synchronized but not yet cleaned up" (that's verbatim from
your PGDOCS). And there is C code where you are checking
SUBREL_STATE_PRE_SYNCDONE and essentially giving the state before
SYNCDONE equal status to SYNCDONE (e.g. should_apply_changes_for_rel
seemed to be doing this).

It seems to be trying to have an each-way bet...

~~~

But I think there may be an easy way out of this problem:

Current HEAD
1. STATE_INIT 'i'
2. STATE_DATASYNC 'd'
3. STATE_FINISHEDCOPY 'f'
4. STATE_SYNCDONE 's'
5. STATE_READY 'r'

The patch 0001
1. STATE_INIT 'i'
2. STATE_DATASYNC 'd'
3. STATE_FINISHEDCOPY 'f'
4. STATE_PRESYNCDONE 'p' <-- new relstate
5. STATE_SYNCDONE 's'
6. STATE_READY 'r'

My previous suggestion (which you acknowledge is easier to understand,
but might cause hassles for existing SQL)
1. STATE_INIT 'i'
2. STATE_DATASYNC 'd'
3. STATE_FINISHEDCOPY 'f'
4. STATE_SYNCDONE 's'
5. STATE_CLEANUP 'x' <-- new relstate
6. STATE_READY 'r'

SUGGESTED (hack to solve everything?)
1. STATE_INIT 'i'
2. STATE_DATASYNC 'd'
3. STATE_FINISHEDCOPY 'f'
4. STATE_SYNCDONE_PRE_CLEANUP 'x' <-- change the char code for this
existing relstate (was SYNCDONE 's')
5. STATE_SYNCDONE_WITH_CLEANUP 's' <-- new relstate using 's'
6. STATE_READY 'r'

By commandeering the 's' flag for the new CLEANUP state, no existing
user code or test code needs to change - IIUC everything will work the
same as before.
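
In header terms, that hack would amount to something like this (the constant
names are only placeholders; the point is just that 's' keeps meaning "fully
done"):

#define SUBREL_STATE_SYNCDONE_PRE_CLEANUP	'x'	/* sync finished; slot/origin
							 * cleanup still pending */
#define SUBREL_STATE_SYNCDONE_WITH_CLEANUP	's'	/* sync finished and cleaned
							 * up (keeps the existing 's') */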

~

Hmmm -- in hindsight, perhaps I have gone around in a big circle, and
the solution I am describing here is almost exactly the same as your
patch 0001, only with better names for the relstates.

Thoughts?

------
Kind Regards,
Peter Smith.
Fujitsu Australia

#23Peter Smith
smithpb2250@gmail.com
In reply to: houzj.fnst@fujitsu.com (#19)
Re: Deadlock between logrep apply worker and tablesync worker

On Fri, Feb 3, 2023 at 6:58 PM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:

On Thursday, February 2, 2023 7:21 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Feb 2, 2023 at 12:05 PM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:

On Tuesday, January 31, 2023 1:07 AM vignesh C <vignesh21@gmail.com>

wrote:

On Mon, 30 Jan 2023 at 17:30, vignesh C <vignesh21@gmail.com> wrote:

I also tried to test the time of "src/test/subscription/t/002_types.pl"
before and after the patch(change the lock level) and Tom's
patch(split
transaction) like what Vignesh has shared on -hackers.

I run about 100 times for each case. Tom's and the lock level patch
behave similarly on my machines[1].

HEAD: 3426 ~ 6425 ms
HEAD + Tom: 3404 ~ 3462 ms
HEAD + Vignesh: 3419 ~ 3474 ms
HEAD + Tom + Vignesh: 3408 ~ 3454 ms

Even apart from the testing time reduction, reducing the lock level
and lock the specific object can also help improve the lock contention
which user(that use the exposed function) , table sync worker and
apply worker can also benefit from it. So, I think pushing the patch to change

the lock level makes sense.

And the patch looks good to me.

Thanks for the tests. I also see a reduction in test time variability with Vignesh's
patch. I think we can release the locks in case the origin is concurrently
dropped as in the attached patch. I am planning to commit this patch
tomorrow unless there are more comments or objections.

While on it, after pushing the patch, I think there is another case
might also worth to be improved, that is the table sync and apply
worker try to drop the same origin which might cause some delay. This
is another case(different from the deadlock), so I feel we can try to improve

this in another patch.

Right, I think that case could be addressed by Tom's patch to some extent but
I am thinking we should also try to analyze if we can completely avoid the need
to remove origins from both processes. One idea could be to introduce
another relstate something like PRE_SYNCDONE and set it in a separate
transaction before we set the state as SYNCDONE and remove the slot and
origin in tablesync worker.
Now, if the tablesync worker errors out due to some reason during the second
transaction, it can remove the slot and origin after restart by checking the state.
However, it would add another relstate which may not be the best way to
address this problem. Anyway, that can be accomplished as a separate patch.

Here is an attempt to achieve the same.
Basically, the patch removes the code that drop the origin in apply worker. And
add a new state PRE_SYNCDONE after synchronization finished in front of apply
(sublsn set), but before dropping the origin and other final cleanups. The
tablesync will restart and redo the cleanup if it failed after reaching the new
state. Besides, since the changes can already be applied on the table in
PRE_SYNCDONE state, so I also modified the check in
should_apply_changes_for_rel(). And some other conditions for the origin drop
in subscription commands are were adjusted in this patch.

BTW, tablesync.c has a large file header comment that describes the
relstates in detail, including some examples. So this patch needs to
include modifications to that comment.

------
Kind Regards,
Peter Smith.
Fujitsu Australia.