Cannot shutdown subscriber after DROP SUBSCRIPTION

Started by Kyotaro HORIGUCHIalmost 9 years ago17 messages
#1Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
1 attachment(s)

Hello, while looking another bug, I found that standby cannot
shutdown after DROP SUBSCRIPTION.

standby=# CREATE SUBSCRPTION sub1 ...
standby=# ....
standby=# DROP SUBSCRIPTION sub1;

Ctrl-C to the standby fails to work. ApplyLauncherMain is waiting
LogicalRepLauncherLock forever.

The culprit is DropSbuscription. It acquires
LogicalRepLauncherLock but never releases.

The attached patch fixes it. Most part of the fucntion is now
enclosed by PG_TRY-CATCH since some functions can throw
exceptions.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

fix_block_by_drop_subscription_v1.patchtext/x-patch; charset=us-asciiDownload
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 5de9999..f07143e 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -511,51 +511,67 @@ DropSubscription(DropSubscriptionStmt *stmt)
 	/* Protect against launcher restarting the worker. */
 	LWLockAcquire(LogicalRepLauncherLock, LW_EXCLUSIVE);
 
-	/* Kill the apply worker so that the slot becomes accessible. */
-	logicalrep_worker_stop(subid);
-
-	/* Remove the origin tracking if exists. */
-	snprintf(originname, sizeof(originname), "pg_%u", subid);
-	originid = replorigin_by_name(originname, true);
-	if (originid != InvalidRepOriginId)
-		replorigin_drop(originid);
-
-	/* If the user asked to not drop the slot, we are done mow.*/
-	if (!stmt->drop_slot)
-	{
-		heap_close(rel, NoLock);
-		return;
-	}
-
 	/*
-	 * Otherwise drop the replication slot at the publisher node using
-	 * the replication connection.
+	 * replorigin_drop can throw an exception. we must release
+	 * LogicalRepLauncherLock for the case.
 	 */
-	load_file("libpqwalreceiver", false);
+	PG_TRY();
+	{
+		/* Kill the apply worker so that the slot becomes accessible. */
+		logicalrep_worker_stop(subid);
 
-	initStringInfo(&cmd);
-	appendStringInfo(&cmd, "DROP_REPLICATION_SLOT \"%s\"", slotname);
+		/* Remove the origin tracking if exists. */
+		snprintf(originname, sizeof(originname), "pg_%u", subid);
+		originid = replorigin_by_name(originname, true);
+		if (originid != InvalidRepOriginId)
+			replorigin_drop(originid);
 
-	wrconn = walrcv_connect(conninfo, true, subname, &err);
-	if (wrconn == NULL)
-		ereport(ERROR,
-				(errmsg("could not connect to publisher when attempting to "
-						"drop the replication slot \"%s\"", slotname),
-				 errdetail("The error was: %s", err)));
+		/* If the user asked to not drop the slot, we are done now.*/
+		if (stmt->drop_slot)
+		{
+			/*
+			 * Otherwise drop the replication slot at the publisher node using
+			 * the replication connection.
+			 */
+			load_file("libpqwalreceiver", false);
+
+			initStringInfo(&cmd);
+			appendStringInfo(&cmd, "DROP_REPLICATION_SLOT \"%s\"", slotname);
+
+			wrconn = walrcv_connect(conninfo, true, subname, &err);
+			if (wrconn == NULL || !walrcv_command(wrconn, cmd.data, &err))
+			{
+				if (wrconn == NULL)
+					ereport(ERROR,
+							(errmsg("could not connect to publisher when "
+									"attempting to drop the "
+									"replication slot \"%s\"", slotname),
+							 errdetail("The error was: %s", err)));
 
-	if (!walrcv_command(wrconn, cmd.data, &err))
-		ereport(ERROR,
-				(errmsg("could not drop the replication slot \"%s\" on publisher",
-						slotname),
-				 errdetail("The error was: %s", err)));
-	else
-		ereport(NOTICE,
-				(errmsg("dropped replication slot \"%s\" on publisher",
-						slotname)));
+				ereport(ERROR,
+						(errmsg("could not drop the replication slot \"%s\" on publisher",
+								slotname),
+						 errdetail("The error was: %s", err)));
+			}
 
-	walrcv_disconnect(wrconn);
+			ereport(NOTICE,
+					(errmsg("dropped replication slot \"%s\" on publisher",
+							slotname)));
 
-	pfree(cmd.data);
+			pfree(cmd.data);
+		}
+	}
+	PG_CATCH();
+	{
+		LWLockRelease(LogicalRepLauncherLock);
+		PG_RE_THROW();
+	}
+	PG_END_TRY();
+
+	LWLockRelease(LogicalRepLauncherLock);
+
+	if (wrconn)
+		walrcv_disconnect(wrconn);
 
 	heap_close(rel, NoLock);
 }
#2Fujii Masao
masao.fujii@gmail.com
In reply to: Kyotaro HORIGUCHI (#1)
Re: Cannot shutdown subscriber after DROP SUBSCRIPTION

On Wed, Feb 1, 2017 at 5:36 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

Hello, while looking another bug, I found that standby cannot
shutdown after DROP SUBSCRIPTION.

standby=# CREATE SUBSCRPTION sub1 ...
standby=# ....
standby=# DROP SUBSCRIPTION sub1;

Ctrl-C to the standby fails to work. ApplyLauncherMain is waiting
LogicalRepLauncherLock forever.

The culprit is DropSbuscription. It acquires
LogicalRepLauncherLock but never releases.

The attached patch fixes it. Most part of the fucntion is now
enclosed by PG_TRY-CATCH since some functions can throw
exceptions.

The lwlock would be released when an exception occurs, so I don't think
that TRY-CATCH is necessary here. Or it's necessary for another reason?

Regards,

--
Fujii Masao

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3Michael Paquier
michael.paquier@gmail.com
In reply to: Fujii Masao (#2)
Re: Cannot shutdown subscriber after DROP SUBSCRIPTION

On Thu, Feb 2, 2017 at 2:14 AM, Fujii Masao <masao.fujii@gmail.com> wrote:

The lwlock would be released when an exception occurs, so I don't think
that TRY-CATCH is necessary here. Or it's necessary for another reason?

+    PG_CATCH();
+    {
+        LWLockRelease(LogicalRepLauncherLock);
+        PG_RE_THROW();
+    }
+    PG_END_TRY();
Just to do that, a TRY/CATCH block looks like an overkill to me. Why
not just call LWLockRelease in the ERROR and return code paths?
-- 
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#4Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Michael Paquier (#3)
1 attachment(s)
Re: Cannot shutdown subscriber after DROP SUBSCRIPTION

At Thu, 2 Feb 2017 08:46:11 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in <CAB7nPqR6VQ7aiKck1Ao3_mPVvn4v4ZKnJFq2oawFqpaePHd18A@mail.gmail.com>

On Thu, Feb 2, 2017 at 2:14 AM, Fujii Masao <masao.fujii@gmail.com> wrote:

The lwlock would be released when an exception occurs, so I don't think
that TRY-CATCH is necessary here. Or it's necessary for another reason?

+    PG_CATCH();
+    {
+        LWLockRelease(LogicalRepLauncherLock);
+        PG_RE_THROW();
+    }
+    PG_END_TRY();
Just to do that, a TRY/CATCH block looks like an overkill to me. Why
not just call LWLockRelease in the ERROR and return code paths?

I though the same first. The modification at the "if (wrconn =="
is the remains of that. It is reverted inthe attached patch.

Then, the reason for the TRY-CATCH cluase is that I found that
some functions called from there can throw exceptions.

logicalrep_worker_stop and replorigin_drop have ereport in its path.
load_library apparently can throw exception.
(walrcv_(libpq_) functions don't seeem to.)

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

0001-Fix-DROP-SUBSCRIPTION-s-lock-leak.patchtext/x-patch; charset=us-asciiDownload
From d0ca653bb2aa776742a2e7a697b02794b1ad66d9 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 2 Feb 2017 11:33:40 +0900
Subject: [PATCH] Fix DROP SUBSCRIPTION's lock leak.

DROP SUBSCRIPTION acquires a lock on LogicalRepLauncherLock but never
releases. This fixes it.
---
 src/backend/commands/subscriptioncmds.c | 86 ++++++++++++++++++---------------
 1 file changed, 48 insertions(+), 38 deletions(-)

diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 5de9999..223eea4 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -511,52 +511,62 @@ DropSubscription(DropSubscriptionStmt *stmt)
 	/* Protect against launcher restarting the worker. */
 	LWLockAcquire(LogicalRepLauncherLock, LW_EXCLUSIVE);
 
-	/* Kill the apply worker so that the slot becomes accessible. */
-	logicalrep_worker_stop(subid);
-
-	/* Remove the origin tracking if exists. */
-	snprintf(originname, sizeof(originname), "pg_%u", subid);
-	originid = replorigin_by_name(originname, true);
-	if (originid != InvalidRepOriginId)
-		replorigin_drop(originid);
-
-	/* If the user asked to not drop the slot, we are done mow.*/
-	if (!stmt->drop_slot)
-	{
-		heap_close(rel, NoLock);
-		return;
-	}
-
 	/*
-	 * Otherwise drop the replication slot at the publisher node using
-	 * the replication connection.
+	 * Some functions called here can throw exceptions. we must release
+	 * LogicalRepLauncherLock for the case.
 	 */
-	load_file("libpqwalreceiver", false);
+	PG_TRY();
+	{
+		/* Kill the apply worker so that the slot becomes accessible. */
+		logicalrep_worker_stop(subid);
 
-	initStringInfo(&cmd);
-	appendStringInfo(&cmd, "DROP_REPLICATION_SLOT \"%s\"", slotname);
+		/* Remove the origin tracking if exists. */
+		snprintf(originname, sizeof(originname), "pg_%u", subid);
+		originid = replorigin_by_name(originname, true);
+		if (originid != InvalidRepOriginId)
+			replorigin_drop(originid);
 
-	wrconn = walrcv_connect(conninfo, true, subname, &err);
-	if (wrconn == NULL)
-		ereport(ERROR,
-				(errmsg("could not connect to publisher when attempting to "
-						"drop the replication slot \"%s\"", slotname),
-				 errdetail("The error was: %s", err)));
+		/* Do the follwoing only if the user asked to actually drop the slot */
+		if (stmt->drop_slot)
+		{
+			/*
+			 * Drop the replication slot at the publisher node using the
+			 * replication connection.
+			 */
+			load_file("libpqwalreceiver", false);
 
-	if (!walrcv_command(wrconn, cmd.data, &err))
-		ereport(ERROR,
-				(errmsg("could not drop the replication slot \"%s\" on publisher",
-						slotname),
-				 errdetail("The error was: %s", err)));
-	else
-		ereport(NOTICE,
-				(errmsg("dropped replication slot \"%s\" on publisher",
-						slotname)));
+			initStringInfo(&cmd);
+			appendStringInfo(&cmd, "DROP_REPLICATION_SLOT \"%s\"", slotname);
+
+			wrconn = walrcv_connect(conninfo, true, subname, &err);
+			if (wrconn == NULL)
+				ereport(ERROR,
+						(errmsg("could not connect to publisher when attempting to "
+								"drop the replication slot \"%s\"", slotname),
+						 errdetail("The error was: %s", err)));
 
-	walrcv_disconnect(wrconn);
+			else if (!walrcv_command(wrconn, cmd.data, &err))
+				ereport(ERROR,
+						(errmsg("could not drop the replication slot \"%s\" on publisher",
+								slotname),
+						 errdetail("The error was: %s", err)));
+
+			ereport(NOTICE,
+					(errmsg("dropped replication slot \"%s\" on publisher",
+							slotname)));
 
-	pfree(cmd.data);
+			walrcv_disconnect(wrconn);
+			pfree(cmd.data);
+		}
+	}
+	PG_CATCH();
+	{
+		LWLockRelease(LogicalRepLauncherLock);
+		PG_RE_THROW();
+	}
+	PG_END_TRY();
 
+	LWLockRelease(LogicalRepLauncherLock);
 	heap_close(rel, NoLock);
 }
 
-- 
2.9.2

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Kyotaro HORIGUCHI (#4)
Re: Cannot shutdown subscriber after DROP SUBSCRIPTION

Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> writes:

Then, the reason for the TRY-CATCH cluase is that I found that
some functions called from there can throw exceptions.

Yes, but all LWLocks should be released by normal error recovery.
It should not be necessary for this code to clean that up by hand.
If it were necessary, there would be TRY-CATCH around every single
LWLockAcquire in the backend, and we'd have an unreadable and
unmaintainable system. Please don't add a TRY-CATCH unless it's
*necessary* -- and you haven't explained why this one is.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#6Michael Paquier
michael.paquier@gmail.com
In reply to: Tom Lane (#5)
1 attachment(s)
Re: Cannot shutdown subscriber after DROP SUBSCRIPTION

On Thu, Feb 2, 2017 at 2:13 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> writes:

Then, the reason for the TRY-CATCH cluase is that I found that
some functions called from there can throw exceptions.

Yes, but all LWLocks should be released by normal error recovery.
It should not be necessary for this code to clean that up by hand.
If it were necessary, there would be TRY-CATCH around every single
LWLockAcquire in the backend, and we'd have an unreadable and
unmaintainable system. Please don't add a TRY-CATCH unless it's
*necessary* -- and you haven't explained why this one is.

Putting hands into the code and at the problem, I can see that
dropping a subscription on a node makes it unresponsive in case of a
stop. And that's just because calls to LWLockRelease are missing as in
the patch attached. A try/catch problem should not be necessary.
--
Michael

Attachments:

drop-subs-locks.patchapplication/octet-stream; name=drop-subs-locks.patchDownload
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 5de999928f..b38beba659 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -524,6 +524,7 @@ DropSubscription(DropSubscriptionStmt *stmt)
 	if (!stmt->drop_slot)
 	{
 		heap_close(rel, NoLock);
+		LWLockRelease(LogicalRepLauncherLock);
 		return;
 	}
 
@@ -558,6 +559,7 @@ DropSubscription(DropSubscriptionStmt *stmt)
 	pfree(cmd.data);
 
 	heap_close(rel, NoLock);
+	LWLockRelease(LogicalRepLauncherLock);
 }
 
 /*
#7Fujii Masao
masao.fujii@gmail.com
In reply to: Michael Paquier (#6)
Re: Cannot shutdown subscriber after DROP SUBSCRIPTION

On Thu, Feb 2, 2017 at 2:36 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Thu, Feb 2, 2017 at 2:13 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> writes:

Then, the reason for the TRY-CATCH cluase is that I found that
some functions called from there can throw exceptions.

Yes, but all LWLocks should be released by normal error recovery.
It should not be necessary for this code to clean that up by hand.
If it were necessary, there would be TRY-CATCH around every single
LWLockAcquire in the backend, and we'd have an unreadable and
unmaintainable system. Please don't add a TRY-CATCH unless it's
*necessary* -- and you haven't explained why this one is.

Yes.

Putting hands into the code and at the problem, I can see that
dropping a subscription on a node makes it unresponsive in case of a
stop. And that's just because calls to LWLockRelease are missing as in
the patch attached. A try/catch problem should not be necessary.

Thanks for the patch!

With the patch, LogicalRepLauncherLock is released at the end of
DropSubscription(). But ISTM that the lock should be released just after
logicalrep_worker_stop() and there is no need to protect the removal of
replication slot with the lock.

/*
* If we found worker but it does not have proc set it is starting up,
* wait for it to finish and then kill it.
*/
while (worker && !worker->proc)
{

ISTM that the above loop in logicalrep_worker_stop() is not necessary
because LogicalRepLauncherLock ensures that the above condition is
always false. Thought? Am I missing something?

If the above condition is true, which means that there is the worker slot
having the "subid" of the worker to kill, but its "proc" has not been set yet.
Without LogicalRepLauncherLock, this situation can happen after "subid"
is set by the launcher and before "proc" is set by the worker. But
LogicalRepLauncherLock protects those operations, so logicalrep_worker_stop()
called while holding the lock should always think the above condition is false.

Regards,

--
Fujii Masao

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#8Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Fujii Masao (#7)
1 attachment(s)
Re: Cannot shutdown subscriber after DROP SUBSCRIPTION

At Fri, 3 Feb 2017 01:02:47 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in <CAHGQGwHqQVHmQ7wM=eLNnp1_oy-GVSSAcaJXWjE4nc2twSqXRg@mail.gmail.com>

On Thu, Feb 2, 2017 at 2:36 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Thu, Feb 2, 2017 at 2:13 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> writes:

Then, the reason for the TRY-CATCH cluase is that I found that
some functions called from there can throw exceptions.

Yes, but all LWLocks should be released by normal error recovery.
It should not be necessary for this code to clean that up by hand.
If it were necessary, there would be TRY-CATCH around every single
LWLockAcquire in the backend, and we'd have an unreadable and
unmaintainable system. Please don't add a TRY-CATCH unless it's
*necessary* -- and you haven't explained why this one is.

Yes.

Thank you for the suggestion. I minunderstood that.

Putting hands into the code and at the problem, I can see that
dropping a subscription on a node makes it unresponsive in case of a
stop. And that's just because calls to LWLockRelease are missing as in
the patch attached. A try/catch problem should not be necessary.

Thanks for the patch!

With the patch, LogicalRepLauncherLock is released at the end of
DropSubscription(). But ISTM that the lock should be released just after
logicalrep_worker_stop() and there is no need to protect the removal of
replication slot with the lock.

That's true. logicalrep_worker_stop returns after confirmig that
worker->proc is cleard, so no false relaunch cannot be caused.
After all, logicalrep_worker_stop is surrounded by
LWLockAcquire/Relase pair. So it can be moved into the funciton
and make the lock secrion to be more narrower.

/*
* If we found worker but it does not have proc set it is starting up,
* wait for it to finish and then kill it.
*/
while (worker && !worker->proc)
{

ISTM that the above loop in logicalrep_worker_stop() is not necessary
because LogicalRepLauncherLock ensures that the above condition is
always false. Thought? Am I missing something?

The lock exists only to keep the launcher from starting a
worker. Creating a subscription and starting a worker for the
slot run independently.

If the above condition is true, which means that there is the worker slot
having the "subid" of the worker to kill, but its "proc" has not been set yet.

Yes. The situation happens after launcher sets subid and before
ApplyWorkerMain attaches the slot. The lock doesn't protect the
section. If someone can drop a subscription just after its
creation, it happens.

Without LogicalRepLauncherLock, this situation can happen after "subid"
is set by the launcher and before "proc" is set by the worker. But
LogicalRepLauncherLock protects those operations, so logicalrep_worker_stop()
called while holding the lock should always think the above condition is false.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

0001-Refactor-the-lock-section-for-subscription-worker-te.patchtext/x-patch; charset=us-asciiDownload
From 490add168b7cf63ec584e75f6a7f79efc08ae200 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Fri, 3 Feb 2017 10:31:01 +0900
Subject: [PATCH] Refactor the lock section for subscription worker termination

---
 src/backend/commands/subscriptioncmds.c    | 3 ---
 src/backend/replication/logical/launcher.c | 5 +++++
 2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 3b70807..67c587c 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -508,9 +508,6 @@ DropSubscription(DropSubscriptionStmt *stmt)
 	/* Clean up dependencies */
 	deleteSharedDependencyRecordsFor(SubscriptionRelationId, subid, 0);
 
-	/* Protect against launcher restarting the worker. */
-	LWLockAcquire(LogicalRepLauncherLock, LW_EXCLUSIVE);
-
 	/* Kill the apply worker so that the slot becomes accessible. */
 	logicalrep_worker_stop(subid);
 
diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index d222cff..233be06 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -368,6 +368,9 @@ logicalrep_worker_stop(Oid subid)
 			break;
 	}
 
+	/* Block the lauchner not to restart this worker */
+	LWLockAcquire(LogicalRepLauncherLock);
+
 	/* Now terminate the worker ... */
 	kill(worker->proc->pid, SIGTERM);
 	LWLockRelease(LogicalRepWorkerLock);
@@ -398,6 +401,8 @@ logicalrep_worker_stop(Oid subid)
 
 		ResetLatch(&MyProc->procLatch);
 	}
+
+	LWLockRelease(LogicalRepLauncherLock);
 }
 
 /*
-- 
2.9.2

#9Fujii Masao
masao.fujii@gmail.com
In reply to: Kyotaro HORIGUCHI (#8)
Re: Cannot shutdown subscriber after DROP SUBSCRIPTION

On Fri, Feb 3, 2017 at 10:56 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

At Fri, 3 Feb 2017 01:02:47 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in <CAHGQGwHqQVHmQ7wM=eLNnp1_oy-GVSSAcaJXWjE4nc2twSqXRg@mail.gmail.com>

On Thu, Feb 2, 2017 at 2:36 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Thu, Feb 2, 2017 at 2:13 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> writes:

Then, the reason for the TRY-CATCH cluase is that I found that
some functions called from there can throw exceptions.

Yes, but all LWLocks should be released by normal error recovery.
It should not be necessary for this code to clean that up by hand.
If it were necessary, there would be TRY-CATCH around every single
LWLockAcquire in the backend, and we'd have an unreadable and
unmaintainable system. Please don't add a TRY-CATCH unless it's
*necessary* -- and you haven't explained why this one is.

Yes.

Thank you for the suggestion. I minunderstood that.

Putting hands into the code and at the problem, I can see that
dropping a subscription on a node makes it unresponsive in case of a
stop. And that's just because calls to LWLockRelease are missing as in
the patch attached. A try/catch problem should not be necessary.

Thanks for the patch!

With the patch, LogicalRepLauncherLock is released at the end of
DropSubscription(). But ISTM that the lock should be released just after
logicalrep_worker_stop() and there is no need to protect the removal of
replication slot with the lock.

That's true. logicalrep_worker_stop returns after confirmig that
worker->proc is cleard, so no false relaunch cannot be caused.
After all, logicalrep_worker_stop is surrounded by
LWLockAcquire/Relase pair. So it can be moved into the funciton
and make the lock secrion to be more narrower.

/*
* If we found worker but it does not have proc set it is starting up,
* wait for it to finish and then kill it.
*/
while (worker && !worker->proc)
{

ISTM that the above loop in logicalrep_worker_stop() is not necessary
because LogicalRepLauncherLock ensures that the above condition is
always false. Thought? Am I missing something?

The lock exists only to keep the launcher from starting a
worker. Creating a subscription and starting a worker for the
slot run independently.

If the above condition is true, which means that there is the worker slot
having the "subid" of the worker to kill, but its "proc" has not been set yet.

Yes. The situation happens after launcher sets subid and before
ApplyWorkerMain attaches the slot. The lock doesn't protect the
section.

No. logicalrep_worker_launch() calls WaitForReplicationWorkerAttach()
and waits for the worker to attach to the slot. Then LogicalRepLauncherLock
is released. So both "subid" and "proc" should be set while the lock is being
held.

Regards,

--
Fujii Masao

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#10Fujii Masao
masao.fujii@gmail.com
In reply to: Fujii Masao (#9)
Re: Cannot shutdown subscriber after DROP SUBSCRIPTION

On Sat, Feb 4, 2017 at 12:49 AM, Fujii Masao <masao.fujii@gmail.com> wrote:

On Fri, Feb 3, 2017 at 10:56 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

At Fri, 3 Feb 2017 01:02:47 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in <CAHGQGwHqQVHmQ7wM=eLNnp1_oy-GVSSAcaJXWjE4nc2twSqXRg@mail.gmail.com>

On Thu, Feb 2, 2017 at 2:36 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Thu, Feb 2, 2017 at 2:13 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> writes:

Then, the reason for the TRY-CATCH cluase is that I found that
some functions called from there can throw exceptions.

Yes, but all LWLocks should be released by normal error recovery.
It should not be necessary for this code to clean that up by hand.
If it were necessary, there would be TRY-CATCH around every single
LWLockAcquire in the backend, and we'd have an unreadable and
unmaintainable system. Please don't add a TRY-CATCH unless it's
*necessary* -- and you haven't explained why this one is.

Yes.

Thank you for the suggestion. I minunderstood that.

Putting hands into the code and at the problem, I can see that
dropping a subscription on a node makes it unresponsive in case of a
stop. And that's just because calls to LWLockRelease are missing as in
the patch attached. A try/catch problem should not be necessary.

Thanks for the patch!

With the patch, LogicalRepLauncherLock is released at the end of
DropSubscription(). But ISTM that the lock should be released just after
logicalrep_worker_stop() and there is no need to protect the removal of
replication slot with the lock.

That's true. logicalrep_worker_stop returns after confirmig that
worker->proc is cleard, so no false relaunch cannot be caused.
After all, logicalrep_worker_stop is surrounded by
LWLockAcquire/Relase pair. So it can be moved into the funciton
and make the lock secrion to be more narrower.

If we do this, Assert(LWLockHeldByMe(LogicalRepLauncherLock)) should be
removed and the comment for logicalrep_worker_stop() should be updated.

Your approach may cause the deadlock. The launcher takes LogicalRepWorkerLock
while holding LogicalRepLauncherLock. OTOH, with your approach,
logicalrep_worker_stop() takes LogicalRepLauncherLock while holding
LogicalRepWorkerLock.

Therefore I pushed the simple patch which adds LWLockRelease() just after
logicalrep_worker_stop().

Another problem that I found while reading the code is that the launcher can
start up the worker with the subscription that DROP SUBSCRIPTION just removed.
That is, DROP SUBSCRIPTION removes the target entry from pg_subscription,
but the launcher can see it and start new worker until the transaction for
DROP has been committed.

To fix this issue, I think that DROP SUBSCRIPTION should take
AccessExclusiveLock on pg_subscription, instead of RowExclusiveLock,
so that the launcher cannot see the entry to be being removed.

Regards,

--
Fujii Masao

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#11Petr Jelinek
petr.jelinek@2ndquadrant.com
In reply to: Fujii Masao (#10)
Re: Cannot shutdown subscriber after DROP SUBSCRIPTION

On 03/02/17 19:38, Fujii Masao wrote:

On Sat, Feb 4, 2017 at 12:49 AM, Fujii Masao <masao.fujii@gmail.com> wrote:

On Fri, Feb 3, 2017 at 10:56 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

At Fri, 3 Feb 2017 01:02:47 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in <CAHGQGwHqQVHmQ7wM=eLNnp1_oy-GVSSAcaJXWjE4nc2twSqXRg@mail.gmail.com>

On Thu, Feb 2, 2017 at 2:36 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Thu, Feb 2, 2017 at 2:13 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> writes:

Then, the reason for the TRY-CATCH cluase is that I found that
some functions called from there can throw exceptions.

Yes, but all LWLocks should be released by normal error recovery.
It should not be necessary for this code to clean that up by hand.
If it were necessary, there would be TRY-CATCH around every single
LWLockAcquire in the backend, and we'd have an unreadable and
unmaintainable system. Please don't add a TRY-CATCH unless it's
*necessary* -- and you haven't explained why this one is.

Yes.

Thank you for the suggestion. I minunderstood that.

Putting hands into the code and at the problem, I can see that
dropping a subscription on a node makes it unresponsive in case of a
stop. And that's just because calls to LWLockRelease are missing as in
the patch attached. A try/catch problem should not be necessary.

Thanks for the patch!

With the patch, LogicalRepLauncherLock is released at the end of
DropSubscription(). But ISTM that the lock should be released just after
logicalrep_worker_stop() and there is no need to protect the removal of
replication slot with the lock.

That's true. logicalrep_worker_stop returns after confirmig that
worker->proc is cleard, so no false relaunch cannot be caused.
After all, logicalrep_worker_stop is surrounded by
LWLockAcquire/Relase pair. So it can be moved into the funciton
and make the lock secrion to be more narrower.

If we do this, Assert(LWLockHeldByMe(LogicalRepLauncherLock)) should be
removed and the comment for logicalrep_worker_stop() should be updated.

Your approach may cause the deadlock. The launcher takes LogicalRepWorkerLock
while holding LogicalRepLauncherLock. OTOH, with your approach,
logicalrep_worker_stop() takes LogicalRepLauncherLock while holding
LogicalRepWorkerLock.

Therefore I pushed the simple patch which adds LWLockRelease() just after
logicalrep_worker_stop().

Another problem that I found while reading the code is that the launcher can
start up the worker with the subscription that DROP SUBSCRIPTION just removed.
That is, DROP SUBSCRIPTION removes the target entry from pg_subscription,
but the launcher can see it and start new worker until the transaction for
DROP has been committed.

That was the reason why DropSubscription didn't release the lock in the
first place. It was supposed to be released at the end of the
transaction though.

To fix this issue, I think that DROP SUBSCRIPTION should take
AccessExclusiveLock on pg_subscription, instead of RowExclusiveLock,
so that the launcher cannot see the entry to be being removed.

The whole point of having LogicalRepLauncherLock is to avoid having to
do this, so if we do this we could probably get rid of it.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#12Fujii Masao
masao.fujii@gmail.com
In reply to: Petr Jelinek (#11)
1 attachment(s)
Re: Cannot shutdown subscriber after DROP SUBSCRIPTION

On Sun, Feb 5, 2017 at 5:11 AM, Petr Jelinek
<petr.jelinek@2ndquadrant.com> wrote:

On 03/02/17 19:38, Fujii Masao wrote:

On Sat, Feb 4, 2017 at 12:49 AM, Fujii Masao <masao.fujii@gmail.com> wrote:

On Fri, Feb 3, 2017 at 10:56 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

At Fri, 3 Feb 2017 01:02:47 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in <CAHGQGwHqQVHmQ7wM=eLNnp1_oy-GVSSAcaJXWjE4nc2twSqXRg@mail.gmail.com>

On Thu, Feb 2, 2017 at 2:36 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Thu, Feb 2, 2017 at 2:13 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> writes:

Then, the reason for the TRY-CATCH cluase is that I found that
some functions called from there can throw exceptions.

Yes, but all LWLocks should be released by normal error recovery.
It should not be necessary for this code to clean that up by hand.
If it were necessary, there would be TRY-CATCH around every single
LWLockAcquire in the backend, and we'd have an unreadable and
unmaintainable system. Please don't add a TRY-CATCH unless it's
*necessary* -- and you haven't explained why this one is.

Yes.

Thank you for the suggestion. I minunderstood that.

Putting hands into the code and at the problem, I can see that
dropping a subscription on a node makes it unresponsive in case of a
stop. And that's just because calls to LWLockRelease are missing as in
the patch attached. A try/catch problem should not be necessary.

Thanks for the patch!

With the patch, LogicalRepLauncherLock is released at the end of
DropSubscription(). But ISTM that the lock should be released just after
logicalrep_worker_stop() and there is no need to protect the removal of
replication slot with the lock.

That's true. logicalrep_worker_stop returns after confirmig that
worker->proc is cleard, so no false relaunch cannot be caused.
After all, logicalrep_worker_stop is surrounded by
LWLockAcquire/Relase pair. So it can be moved into the funciton
and make the lock secrion to be more narrower.

If we do this, Assert(LWLockHeldByMe(LogicalRepLauncherLock)) should be
removed and the comment for logicalrep_worker_stop() should be updated.

Your approach may cause the deadlock. The launcher takes LogicalRepWorkerLock
while holding LogicalRepLauncherLock. OTOH, with your approach,
logicalrep_worker_stop() takes LogicalRepLauncherLock while holding
LogicalRepWorkerLock.

Therefore I pushed the simple patch which adds LWLockRelease() just after
logicalrep_worker_stop().

Another problem that I found while reading the code is that the launcher can
start up the worker with the subscription that DROP SUBSCRIPTION just removed.
That is, DROP SUBSCRIPTION removes the target entry from pg_subscription,
but the launcher can see it and start new worker until the transaction for
DROP has been committed.

That was the reason why DropSubscription didn't release the lock in the
first place. It was supposed to be released at the end of the
transaction though.

OK, I understood why you used the lock in that way. But using LWLock
for that purpose is not valid.

To fix this issue, I think that DROP SUBSCRIPTION should take
AccessExclusiveLock on pg_subscription, instead of RowExclusiveLock,
so that the launcher cannot see the entry to be being removed.

The whole point of having LogicalRepLauncherLock is to avoid having to
do this, so if we do this we could probably get rid of it.

Yes, let's remove LogicalRepLauncherLock and lock pg_subscription
with AccessExclusive mode at the beginning of DROP SUBSCRIPTION.
Attached patch does this.

Regards,

--
Fujii Masao

Attachments:

bugfix.patchapplication/octet-stream; name=bugfix.patchDownload
*** a/src/backend/commands/subscriptioncmds.c
--- b/src/backend/commands/subscriptioncmds.c
***************
*** 441,447 **** DropSubscription(DropSubscriptionStmt *stmt)
  	WalReceiverConn	   *wrconn = NULL;
  	StringInfoData		cmd;
  
! 	rel = heap_open(SubscriptionRelationId, RowExclusiveLock);
  
  	tup = SearchSysCache2(SUBSCRIPTIONNAME, MyDatabaseId,
  						  CStringGetDatum(stmt->subname));
--- 441,452 ----
  	WalReceiverConn	   *wrconn = NULL;
  	StringInfoData		cmd;
  
! 	/*
! 	 * Lock pg_subscription with AccessExclusiveLock to ensure
! 	 * that the launcher doesn't restart new worker during dropping
! 	 * the subscription
! 	 */
! 	rel = heap_open(SubscriptionRelationId, AccessExclusiveLock);
  
  	tup = SearchSysCache2(SUBSCRIPTIONNAME, MyDatabaseId,
  						  CStringGetDatum(stmt->subname));
***************
*** 508,521 **** DropSubscription(DropSubscriptionStmt *stmt)
  	/* Clean up dependencies */
  	deleteSharedDependencyRecordsFor(SubscriptionRelationId, subid, 0);
  
- 	/* Protect against launcher restarting the worker. */
- 	LWLockAcquire(LogicalRepLauncherLock, LW_EXCLUSIVE);
- 
  	/* Kill the apply worker so that the slot becomes accessible. */
  	logicalrep_worker_stop(subid);
  
- 	LWLockRelease(LogicalRepLauncherLock);
- 
  	/* Remove the origin tracking if exists. */
  	snprintf(originname, sizeof(originname), "pg_%u", subid);
  	originid = replorigin_by_name(originname, true);
--- 513,521 ----
*** a/src/backend/replication/logical/launcher.c
--- b/src/backend/replication/logical/launcher.c
***************
*** 305,321 **** logicalrep_worker_launch(Oid dbid, Oid subid, const char *subname, Oid userid)
  /*
   * Stop the logical replication worker and wait until it detaches from the
   * slot.
-  *
-  * The caller must hold LogicalRepLauncherLock to ensure that new workers are
-  * not being started during this function call.
   */
  void
  logicalrep_worker_stop(Oid subid)
  {
  	LogicalRepWorker *worker;
  
- 	Assert(LWLockHeldByMe(LogicalRepLauncherLock));
- 
  	LWLockAcquire(LogicalRepWorkerLock, LW_SHARED);
  
  	worker = logicalrep_worker_find(subid);
--- 305,316 ----
***************
*** 602,610 **** ApplyLauncherMain(Datum main_arg)
  										   ALLOCSET_DEFAULT_MAXSIZE);
  			oldctx = MemoryContextSwitchTo(subctx);
  
- 			/* Block any concurrent DROP SUBSCRIPTION. */
- 			LWLockAcquire(LogicalRepLauncherLock, LW_EXCLUSIVE);
- 
  			/* search for subscriptions to start or stop. */
  			sublist = get_subscription_list();
  
--- 597,602 ----
***************
*** 628,635 **** ApplyLauncherMain(Datum main_arg)
  				}
  			}
  
- 			LWLockRelease(LogicalRepLauncherLock);
- 
  			/* Switch back to original memory context. */
  			MemoryContextSwitchTo(oldctx);
  			/* Clean the temporary memory. */
--- 620,625 ----
*** a/src/backend/storage/lmgr/lwlocknames.txt
--- b/src/backend/storage/lmgr/lwlocknames.txt
***************
*** 48,52 **** ReplicationOriginLock				40
  MultiXactTruncationLock				41
  OldSnapshotTimeMapLock				42
  BackendRandomLock					43
! LogicalRepLauncherLock				44
! LogicalRepWorkerLock				45
--- 48,51 ----
  MultiXactTruncationLock				41
  OldSnapshotTimeMapLock				42
  BackendRandomLock					43
! LogicalRepWorkerLock				44
#13Petr Jelinek
petr.jelinek@2ndquadrant.com
In reply to: Fujii Masao (#12)
Re: Cannot shutdown subscriber after DROP SUBSCRIPTION

On 06/02/17 17:33, Fujii Masao wrote:

On Sun, Feb 5, 2017 at 5:11 AM, Petr Jelinek
<petr.jelinek@2ndquadrant.com> wrote:

On 03/02/17 19:38, Fujii Masao wrote:

On Sat, Feb 4, 2017 at 12:49 AM, Fujii Masao <masao.fujii@gmail.com> wrote:

On Fri, Feb 3, 2017 at 10:56 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

At Fri, 3 Feb 2017 01:02:47 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in <CAHGQGwHqQVHmQ7wM=eLNnp1_oy-GVSSAcaJXWjE4nc2twSqXRg@mail.gmail.com>

On Thu, Feb 2, 2017 at 2:36 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Thu, Feb 2, 2017 at 2:13 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> writes:

Then, the reason for the TRY-CATCH cluase is that I found that
some functions called from there can throw exceptions.

Yes, but all LWLocks should be released by normal error recovery.
It should not be necessary for this code to clean that up by hand.
If it were necessary, there would be TRY-CATCH around every single
LWLockAcquire in the backend, and we'd have an unreadable and
unmaintainable system. Please don't add a TRY-CATCH unless it's
*necessary* -- and you haven't explained why this one is.

Yes.

Thank you for the suggestion. I minunderstood that.

Putting hands into the code and at the problem, I can see that
dropping a subscription on a node makes it unresponsive in case of a
stop. And that's just because calls to LWLockRelease are missing as in
the patch attached. A try/catch problem should not be necessary.

Thanks for the patch!

With the patch, LogicalRepLauncherLock is released at the end of
DropSubscription(). But ISTM that the lock should be released just after
logicalrep_worker_stop() and there is no need to protect the removal of
replication slot with the lock.

That's true. logicalrep_worker_stop returns after confirmig that
worker->proc is cleard, so no false relaunch cannot be caused.
After all, logicalrep_worker_stop is surrounded by
LWLockAcquire/Relase pair. So it can be moved into the funciton
and make the lock secrion to be more narrower.

If we do this, Assert(LWLockHeldByMe(LogicalRepLauncherLock)) should be
removed and the comment for logicalrep_worker_stop() should be updated.

Your approach may cause the deadlock. The launcher takes LogicalRepWorkerLock
while holding LogicalRepLauncherLock. OTOH, with your approach,
logicalrep_worker_stop() takes LogicalRepLauncherLock while holding
LogicalRepWorkerLock.

Therefore I pushed the simple patch which adds LWLockRelease() just after
logicalrep_worker_stop().

Another problem that I found while reading the code is that the launcher can
start up the worker with the subscription that DROP SUBSCRIPTION just removed.
That is, DROP SUBSCRIPTION removes the target entry from pg_subscription,
but the launcher can see it and start new worker until the transaction for
DROP has been committed.

That was the reason why DropSubscription didn't release the lock in the
first place. It was supposed to be released at the end of the
transaction though.

OK, I understood why you used the lock in that way. But using LWLock
for that purpose is not valid.

Yeah, I just tried to avoid what we are doing now really hard :)

To fix this issue, I think that DROP SUBSCRIPTION should take
AccessExclusiveLock on pg_subscription, instead of RowExclusiveLock,
so that the launcher cannot see the entry to be being removed.

The whole point of having LogicalRepLauncherLock is to avoid having to
do this, so if we do this we could probably get rid of it.

Yes, let's remove LogicalRepLauncherLock and lock pg_subscription
with AccessExclusive mode at the beginning of DROP SUBSCRIPTION.
Attached patch does this.

Okay, looks reasonable to me.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#14Fujii Masao
masao.fujii@gmail.com
In reply to: Petr Jelinek (#13)
Re: Cannot shutdown subscriber after DROP SUBSCRIPTION

On Tue, Feb 7, 2017 at 2:13 AM, Petr Jelinek
<petr.jelinek@2ndquadrant.com> wrote:

On 06/02/17 17:33, Fujii Masao wrote:

On Sun, Feb 5, 2017 at 5:11 AM, Petr Jelinek
<petr.jelinek@2ndquadrant.com> wrote:

On 03/02/17 19:38, Fujii Masao wrote:

On Sat, Feb 4, 2017 at 12:49 AM, Fujii Masao <masao.fujii@gmail.com> wrote:

On Fri, Feb 3, 2017 at 10:56 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

At Fri, 3 Feb 2017 01:02:47 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in <CAHGQGwHqQVHmQ7wM=eLNnp1_oy-GVSSAcaJXWjE4nc2twSqXRg@mail.gmail.com>

On Thu, Feb 2, 2017 at 2:36 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Thu, Feb 2, 2017 at 2:13 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> writes:

Then, the reason for the TRY-CATCH cluase is that I found that
some functions called from there can throw exceptions.

Yes, but all LWLocks should be released by normal error recovery.
It should not be necessary for this code to clean that up by hand.
If it were necessary, there would be TRY-CATCH around every single
LWLockAcquire in the backend, and we'd have an unreadable and
unmaintainable system. Please don't add a TRY-CATCH unless it's
*necessary* -- and you haven't explained why this one is.

Yes.

Thank you for the suggestion. I minunderstood that.

Putting hands into the code and at the problem, I can see that
dropping a subscription on a node makes it unresponsive in case of a
stop. And that's just because calls to LWLockRelease are missing as in
the patch attached. A try/catch problem should not be necessary.

Thanks for the patch!

With the patch, LogicalRepLauncherLock is released at the end of
DropSubscription(). But ISTM that the lock should be released just after
logicalrep_worker_stop() and there is no need to protect the removal of
replication slot with the lock.

That's true. logicalrep_worker_stop returns after confirmig that
worker->proc is cleard, so no false relaunch cannot be caused.
After all, logicalrep_worker_stop is surrounded by
LWLockAcquire/Relase pair. So it can be moved into the funciton
and make the lock secrion to be more narrower.

If we do this, Assert(LWLockHeldByMe(LogicalRepLauncherLock)) should be
removed and the comment for logicalrep_worker_stop() should be updated.

Your approach may cause the deadlock. The launcher takes LogicalRepWorkerLock
while holding LogicalRepLauncherLock. OTOH, with your approach,
logicalrep_worker_stop() takes LogicalRepLauncherLock while holding
LogicalRepWorkerLock.

Therefore I pushed the simple patch which adds LWLockRelease() just after
logicalrep_worker_stop().

Another problem that I found while reading the code is that the launcher can
start up the worker with the subscription that DROP SUBSCRIPTION just removed.
That is, DROP SUBSCRIPTION removes the target entry from pg_subscription,
but the launcher can see it and start new worker until the transaction for
DROP has been committed.

That was the reason why DropSubscription didn't release the lock in the
first place. It was supposed to be released at the end of the
transaction though.

OK, I understood why you used the lock in that way. But using LWLock
for that purpose is not valid.

Yeah, I just tried to avoid what we are doing now really hard :)

To fix this issue, I think that DROP SUBSCRIPTION should take
AccessExclusiveLock on pg_subscription, instead of RowExclusiveLock,
so that the launcher cannot see the entry to be being removed.

The whole point of having LogicalRepLauncherLock is to avoid having to
do this, so if we do this we could probably get rid of it.

Yes, let's remove LogicalRepLauncherLock and lock pg_subscription
with AccessExclusive mode at the beginning of DROP SUBSCRIPTION.
Attached patch does this.

Okay, looks reasonable to me.

Thanks for the review!
But ISMT that I should suspend committing the patch until we fix the issue
that Sawada reported in other thread. That bugfix may change the related
code and design very much.
/messages/by-id/CAD21AoD+VO93zZ4ZQtZQb-jZ_wMko3OgGdx1MXO4T+8q_zHDDA@mail.gmail.com

Regards,

--
Fujii Masao

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#15Robert Haas
robertmhaas@gmail.com
In reply to: Petr Jelinek (#11)
Re: Cannot shutdown subscriber after DROP SUBSCRIPTION

On Sat, Feb 4, 2017 at 3:11 PM, Petr Jelinek
<petr.jelinek@2ndquadrant.com> wrote:

That was the reason why DropSubscription didn't release the lock in the
first place. It was supposed to be released at the end of the
transaction though.

Holding an LWLock until end-of-transaction is a phenomenally bad idea,
both because you lose interruptibility and because of the deadlock
risk.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#16Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Fujii Masao (#14)
Re: Cannot shutdown subscriber after DROP SUBSCRIPTION

On Wed, Feb 8, 2017 at 12:04 AM, Fujii Masao <masao.fujii@gmail.com> wrote:

On Tue, Feb 7, 2017 at 2:13 AM, Petr Jelinek
<petr.jelinek@2ndquadrant.com> wrote:

On 06/02/17 17:33, Fujii Masao wrote:

On Sun, Feb 5, 2017 at 5:11 AM, Petr Jelinek
<petr.jelinek@2ndquadrant.com> wrote:

On 03/02/17 19:38, Fujii Masao wrote:

On Sat, Feb 4, 2017 at 12:49 AM, Fujii Masao <masao.fujii@gmail.com> wrote:

On Fri, Feb 3, 2017 at 10:56 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

At Fri, 3 Feb 2017 01:02:47 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in <CAHGQGwHqQVHmQ7wM=eLNnp1_oy-GVSSAcaJXWjE4nc2twSqXRg@mail.gmail.com>

On Thu, Feb 2, 2017 at 2:36 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Thu, Feb 2, 2017 at 2:13 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> writes:

Then, the reason for the TRY-CATCH cluase is that I found that
some functions called from there can throw exceptions.

Yes, but all LWLocks should be released by normal error recovery.
It should not be necessary for this code to clean that up by hand.
If it were necessary, there would be TRY-CATCH around every single
LWLockAcquire in the backend, and we'd have an unreadable and
unmaintainable system. Please don't add a TRY-CATCH unless it's
*necessary* -- and you haven't explained why this one is.

Yes.

Thank you for the suggestion. I minunderstood that.

Putting hands into the code and at the problem, I can see that
dropping a subscription on a node makes it unresponsive in case of a
stop. And that's just because calls to LWLockRelease are missing as in
the patch attached. A try/catch problem should not be necessary.

Thanks for the patch!

With the patch, LogicalRepLauncherLock is released at the end of
DropSubscription(). But ISTM that the lock should be released just after
logicalrep_worker_stop() and there is no need to protect the removal of
replication slot with the lock.

That's true. logicalrep_worker_stop returns after confirmig that
worker->proc is cleard, so no false relaunch cannot be caused.
After all, logicalrep_worker_stop is surrounded by
LWLockAcquire/Relase pair. So it can be moved into the funciton
and make the lock secrion to be more narrower.

If we do this, Assert(LWLockHeldByMe(LogicalRepLauncherLock)) should be
removed and the comment for logicalrep_worker_stop() should be updated.

Your approach may cause the deadlock. The launcher takes LogicalRepWorkerLock
while holding LogicalRepLauncherLock. OTOH, with your approach,
logicalrep_worker_stop() takes LogicalRepLauncherLock while holding
LogicalRepWorkerLock.

Therefore I pushed the simple patch which adds LWLockRelease() just after
logicalrep_worker_stop().

Another problem that I found while reading the code is that the launcher can
start up the worker with the subscription that DROP SUBSCRIPTION just removed.
That is, DROP SUBSCRIPTION removes the target entry from pg_subscription,
but the launcher can see it and start new worker until the transaction for
DROP has been committed.

That was the reason why DropSubscription didn't release the lock in the
first place. It was supposed to be released at the end of the
transaction though.

OK, I understood why you used the lock in that way. But using LWLock
for that purpose is not valid.

Yeah, I just tried to avoid what we are doing now really hard :)

To fix this issue, I think that DROP SUBSCRIPTION should take
AccessExclusiveLock on pg_subscription, instead of RowExclusiveLock,
so that the launcher cannot see the entry to be being removed.

The whole point of having LogicalRepLauncherLock is to avoid having to
do this, so if we do this we could probably get rid of it.

Yes, let's remove LogicalRepLauncherLock and lock pg_subscription
with AccessExclusive mode at the beginning of DROP SUBSCRIPTION.
Attached patch does this.

Okay, looks reasonable to me.

Thanks for the review!
But ISMT that I should suspend committing the patch until we fix the issue
that Sawada reported in other thread. That bugfix may change the related
code and design very much.
/messages/by-id/CAD21AoD+VO93zZ4ZQtZQb-jZ_wMko3OgGdx1MXO4T+8q_zHDDA@mail.gmail.com

That patch has been committed. And this issue still happens. Should we
add this to the open item list so it doesn't get missed?

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#17Fujii Masao
masao.fujii@gmail.com
In reply to: Masahiko Sawada (#16)
Re: Cannot shutdown subscriber after DROP SUBSCRIPTION

On Wed, Mar 8, 2017 at 9:05 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Wed, Feb 8, 2017 at 12:04 AM, Fujii Masao <masao.fujii@gmail.com> wrote:

On Tue, Feb 7, 2017 at 2:13 AM, Petr Jelinek
<petr.jelinek@2ndquadrant.com> wrote:

On 06/02/17 17:33, Fujii Masao wrote:

On Sun, Feb 5, 2017 at 5:11 AM, Petr Jelinek
<petr.jelinek@2ndquadrant.com> wrote:

On 03/02/17 19:38, Fujii Masao wrote:

On Sat, Feb 4, 2017 at 12:49 AM, Fujii Masao <masao.fujii@gmail.com> wrote:

On Fri, Feb 3, 2017 at 10:56 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

At Fri, 3 Feb 2017 01:02:47 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in <CAHGQGwHqQVHmQ7wM=eLNnp1_oy-GVSSAcaJXWjE4nc2twSqXRg@mail.gmail.com>

On Thu, Feb 2, 2017 at 2:36 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Thu, Feb 2, 2017 at 2:13 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> writes:

Then, the reason for the TRY-CATCH cluase is that I found that
some functions called from there can throw exceptions.

Yes, but all LWLocks should be released by normal error recovery.
It should not be necessary for this code to clean that up by hand.
If it were necessary, there would be TRY-CATCH around every single
LWLockAcquire in the backend, and we'd have an unreadable and
unmaintainable system. Please don't add a TRY-CATCH unless it's
*necessary* -- and you haven't explained why this one is.

Yes.

Thank you for the suggestion. I minunderstood that.

Putting hands into the code and at the problem, I can see that
dropping a subscription on a node makes it unresponsive in case of a
stop. And that's just because calls to LWLockRelease are missing as in
the patch attached. A try/catch problem should not be necessary.

Thanks for the patch!

With the patch, LogicalRepLauncherLock is released at the end of
DropSubscription(). But ISTM that the lock should be released just after
logicalrep_worker_stop() and there is no need to protect the removal of
replication slot with the lock.

That's true. logicalrep_worker_stop returns after confirmig that
worker->proc is cleard, so no false relaunch cannot be caused.
After all, logicalrep_worker_stop is surrounded by
LWLockAcquire/Relase pair. So it can be moved into the funciton
and make the lock secrion to be more narrower.

If we do this, Assert(LWLockHeldByMe(LogicalRepLauncherLock)) should be
removed and the comment for logicalrep_worker_stop() should be updated.

Your approach may cause the deadlock. The launcher takes LogicalRepWorkerLock
while holding LogicalRepLauncherLock. OTOH, with your approach,
logicalrep_worker_stop() takes LogicalRepLauncherLock while holding
LogicalRepWorkerLock.

Therefore I pushed the simple patch which adds LWLockRelease() just after
logicalrep_worker_stop().

Another problem that I found while reading the code is that the launcher can
start up the worker with the subscription that DROP SUBSCRIPTION just removed.
That is, DROP SUBSCRIPTION removes the target entry from pg_subscription,
but the launcher can see it and start new worker until the transaction for
DROP has been committed.

That was the reason why DropSubscription didn't release the lock in the
first place. It was supposed to be released at the end of the
transaction though.

OK, I understood why you used the lock in that way. But using LWLock
for that purpose is not valid.

Yeah, I just tried to avoid what we are doing now really hard :)

To fix this issue, I think that DROP SUBSCRIPTION should take
AccessExclusiveLock on pg_subscription, instead of RowExclusiveLock,
so that the launcher cannot see the entry to be being removed.

The whole point of having LogicalRepLauncherLock is to avoid having to
do this, so if we do this we could probably get rid of it.

Yes, let's remove LogicalRepLauncherLock and lock pg_subscription
with AccessExclusive mode at the beginning of DROP SUBSCRIPTION.
Attached patch does this.

Okay, looks reasonable to me.

Thanks for the review!
But ISMT that I should suspend committing the patch until we fix the issue
that Sawada reported in other thread. That bugfix may change the related
code and design very much.
/messages/by-id/CAD21AoD+VO93zZ4ZQtZQb-jZ_wMko3OgGdx1MXO4T+8q_zHDDA@mail.gmail.com

That patch has been committed. And this issue still happens. Should we
add this to the open item list so it doesn't get missed?

Thanks for ping. Pushed the patch.

Regards,

--
Fujii Masao

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers