Improving the latch handling between logical replication launcher and worker processes.

Started by vignesh C, over 1 year ago, 20 messages
#1vignesh C
vignesh21@gmail.com

Hi,

Currently the launcher's latch is used for the following: a) worker
process attach, b) worker process exit, and c) subscription creation.
Since this same latch is used for multiple cases, the launcher process
cannot handle concurrent scenarios like: a) the launcher started a new
apply worker and is waiting for the apply worker to attach, and b) a
CREATE SUBSCRIPTION for sub2 sends the launcher a wakeup signal. In
this scenario, both of them will set the launcher's latch, but the
launcher cannot identify that both operations have occurred: 1) the
worker has attached, and 2) a subscription has been created and an
apply worker should be started. As a result, the apply worker for the
newly created subscription does not get started immediately and only
gets started after the timeout of 180 seconds.
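
To make the interleaving concrete, here is a minimal sketch of the
pattern (not the actual source; helper names like start_apply_workers()
and worker_attached() are hypothetical), showing how resetting the same
latch inside the nested wait can swallow the CREATE SUBSCRIPTION wakeup:

/* Outer loop, ApplyLauncherMain-style: relies on MyLatch for wakeups. */
for (;;)
{
    start_apply_workers();      /* may end up in the nested wait below */

    (void) WaitLatch(MyLatch,
                     WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                     DEFAULT_NAPTIME_PER_CYCLE,     /* 180 seconds */
                     WAIT_EVENT_LOGICAL_LAUNCHER_MAIN);
    ResetLatch(MyLatch);
}

/* Nested wait, WaitForReplicationWorkerAttach-style, on the SAME MyLatch. */
while (!worker_attached())
{
    int     rc = WaitLatch(MyLatch,
                           WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                           10L, WAIT_EVENT_BGWORKER_STARTUP);

    if (rc & WL_LATCH_SET)
        ResetLatch(MyLatch);    /* this also clears a pending CREATE SUBSCRIPTION wakeup */
}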

I have started a new thread for this based on suggestions at [1].

We could improve this by one of the following:
a) Introduce a new latch to handle worker attach and exit.
b) Add a new GUC launcher_retry_time which gives more flexibility to
users, as suggested by Amit at [1]. Before 5a3a953,
wal_retrieve_retry_interval played a similar role to the suggested new
GUC launcher_retry_time, e.g. even if a worker was launched, the
launcher only waited wal_retrieve_retry_interval before the next round.
c) Don't reset the latch at worker attach and allow launcher main to
identify and handle it. For this there is a patch v6-0002 available at [2].

I'm not sure which approach is better in this case. I was wondering
if we should add a new latch to handle worker attach and exit.
Thoughts?

[1]: /messages/by-id/CAA4eK1KR29XfBi5rObgV06xcBLn7y+LCuxcSMdKUkKZK740L2w@mail.gmail.com
[2]: /messages/by-id/CALDaNm10R7L0Dxq+-J=Pp3AfM_yaokpbhECvJ69QiGH8-jQquw@mail.gmail.com

Regards,
Vignesh

#2Zhijie Hou (Fujitsu)
houzj.fnst@fujitsu.com
In reply to: vignesh C (#1)
RE: Improving the latch handling between logical replication launcher and worker processes.

On Thursday, April 25, 2024 4:59 PM vignesh C <vignesh21@gmail.com> wrote:

Hi,

Currently the launcher's latch is used for the following: a) worker process attach
b) worker process exit and c) subscription creation.
Since this same latch is used for multiple cases, the launcher process is not able
to handle concurrent scenarios like: a) Launcher started a new apply worker and
waiting for apply worker to attach and b) create subscription sub2 sending
launcher wake up signal. In this scenario, both of them will set latch of the
launcher process, the launcher process is not able to identify that both
operations have occurred 1) worker is attached 2) subscription is created and
apply worker should be started. As a result the apply worker does not get
started for the new subscription created immediately and gets started after the
timeout of 180 seconds.

I have started a new thread for this based on suggestions at [1].

a) Introduce a new latch to handle worker attach and exit.

I found the startup process also uses two latches (see recoveryWakeupLatch) for
different purposes, so maybe this is OK. But note that both the logical launcher
and the apply worker call logicalrep_worker_launch(); if we only add one new
latch, both of them will wait on the same new Latch, and the launcher may
consume the SetLatch that should have been consumed by the apply worker
(although it's rare).

b) Add a new GUC launcher_retry_time which gives more flexibility to users as
suggested by Amit at [1]. Before 5a3a953, the wal_retrieve_retry_interval plays
a similar role as the suggested new GUC launcher_retry_time, e.g. even if a
worker is launched, the launcher only wait wal_retrieve_retry_interval time
before next round.

IIUC, the issue does not happen frequently, and may not be noticeable where
apply workers aren't restarted often. So, I am slightly unsure whether it's
worth adding a new GUC.

c) Don't reset the latch at worker attach and allow launcher main to identify and
handle it. For this there is a patch v6-0002 available at [2].

This seems simple. I found we are doing something similar in
RegisterSyncRequest() and WalSummarizerMain().
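
For illustration only (this is not the actual v6-0002 patch, and the
condition name is hypothetical), the shape of option c) in
WaitForReplicationWorkerAttach() would roughly be a purely timeout-driven
poll that leaves any pending latch wakeup for ApplyLauncherMain to consume:

while (!worker_attached_or_gone())
{
    /* Note: no WL_LATCH_SET and no ResetLatch() here. */
    (void) WaitLatch(MyLatch,
                     WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                     10L, WAIT_EVENT_BGWORKER_STARTUP);
    CHECK_FOR_INTERRUPTS();
}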

Best Regards,
Hou zj

#3Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: vignesh C (#1)
RE: Improving the latch handling between logical replication launcher and worker processes.

Dear Vignesh,

Thanks for raising this idea!

a) Introduce a new latch to handle worker attach and exit.

Just to confirm - there are three wait events for launchers, so I feel we may be
able to create latches per wait event. Is there a reason to introduce
"a" latch?

b) Add a new GUC launcher_retry_time which gives more flexibility to
users as suggested by Amit at [1]. Before 5a3a953, the
wal_retrieve_retry_interval plays a similar role as the suggested new
GUC launcher_retry_time, e.g. even if a worker is launched, the
launcher only wait wal_retrieve_retry_interval time before next round.

Hmm. My concern is how users would estimate the value. Maybe the default
would be 3 min, but should users change it? If so, to what? I think even
if it becomes tunable, they cannot control it well.

c) Don't reset the latch at worker attach and allow launcher main to
identify and handle it. For this there is a patch v6-0002 available at
[2].

Does this mean that you want to remove ResetLatch() from
WaitForReplicationWorkerAttach()? If so, what about this scenario:

1) the launcher is waiting for the worker to attach in
WaitForReplicationWorkerAttach(), and 2) a subscription is created before
the attach completes. In this case, the launcher will become un-sleepable
because the latch is set but won't be reset. It may waste CPU time.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/

#4Peter Smith
smithpb2250@gmail.com
In reply to: vignesh C (#1)
Re: Improving the latch handling between logical replication launcher and worker processes.

On Thu, Apr 25, 2024 at 6:59 PM vignesh C <vignesh21@gmail.com> wrote:

...

a) Introduce a new latch to handle worker attach and exit.

IIUC there is normally only the “shared” latch or a “process local”
latch, i.e. AFAICT it is quite uncommon to invent new latches like
option a is proposing. I did not see any examples of making new
latches (e.g. MyLatchA, MyLatchB, MyLatchC) in the PostgreSQL source
other than that ‘recoveryWakeupLatch’ mentioned above by Hou-san. So
this option could be OK, but OTOH since there is hardly any precedent,
maybe that should be taken as an indication to avoid doing it this way
(???)

b) Add a new GUC launcher_retry_time which gives more flexibility to
users as suggested by Amit at [1].

I'm not sure that introducing a new GUC is a good option because this
seems a rare problem to have -- so it will be hard to tune since it
will be difficult to know you even have this problem and then
difficult to know that it is fixed. Anyway, this made me consider more
what the WaitLatch timeout value should be. Here are some thoughts:

Idea 1)

I was wondering where that DEFAULT_NAPTIME_PER_CYCLE value of 180
seconds came from, or was that just a made-up number? AFAICT it just
came into existence in the first pub/sub commit [1], but there is no
explanation for why 180s was chosen. Anyway, I assume a low value
(5s?) would be bad because it incurs unacceptable CPU usage, right?
But if 180s is too long and 5s is too short, then what does a “good”
number even look like? E.g., if 60s is deemed OK, then is there any
harm in just defining DEFAULT_NAPTIME_PER_CYCLE to be 60s and leaving
it at that?
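
For reference, that would amount to a one-line change of the existing
define in launcher.c (60s here is only the hypothetical number from the
paragraph above):

#define DEFAULT_NAPTIME_PER_CYCLE   60000L      /* was 180000L, i.e. 180s */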

Idea 2)

Another idea could be to use a “dynamic timeout” in the WaitLatch of
ApplyLauncherMain. Let me try to explain my thought bubble:
- If the preceding foreach(lc, sublist) loop launched any workers then
the WaitLatch timeout can be a much shorter 10s
- If the preceding foreach(lc, sublist) loop did NOT launch any
workers (this would be the most common case) then the WaitLatch
timeout can be the usual 180s.

IIUC this strategy will still give any recently launched workers
enough time to attach to shared memory, but it also means that any
concurrent CREATE SUBSCRIPTION will be addressed within 10s instead
of 180s. Maybe this is sufficient to make an already rare problem
insignificant.
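
A rough sketch of Idea 2 in ApplyLauncherMain (the flag name is mine,
and the launch condition is abbreviated to a comment):

/* After the foreach(lc, sublist) loop: */
long        wait_time = DEFAULT_NAPTIME_PER_CYCLE;  /* normally 180s */

if (launched_any_worker_this_cycle)   /* hypothetical flag, set where
                                       * logicalrep_worker_launch() is called */
    wait_time = 10000L;               /* 10s: enough to attach, then re-check */

rc = WaitLatch(MyLatch,
               WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
               wait_time, WAIT_EVENT_LOGICAL_LAUNCHER_MAIN);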

c) Don't reset the latch at worker attach and allow launcher main to
identify and handle it. For this there is a patch v6-0002 available at
[2].

This option c seems the easiest. Can you explain what the drawbacks
of using this approach are?

======
[1]: https://github.com/postgres/postgres/commit/665d1fad99e7b11678b0d5fa24d2898424243cd6#diff-127f8eb009151ec548d14c877a57a89d67da59e35ea09189411aed529c6341bf

Kind Regards,
Peter Smith.
Fujitsu Australia

#5vignesh C
vignesh21@gmail.com
In reply to: Hayato Kuroda (Fujitsu) (#3)
1 attachment(s)
Re: Improving the latch handling between logical replication launcher and worker processes.

On Fri, 10 May 2024 at 07:39, Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:

Dear Vignesh,

Thanks for raising idea!

a) Introduce a new latch to handle worker attach and exit.

Just to confirm - there are three wait events for launchers, so I feel we may be
able to create latches per wait event. Is there a reason to introduce
"a" latch?

One latch is enough; we can use the new latch for both worker start
and worker exit. The other existing latch can be used for other
purposes. Something like the attached patch.

Regards,
Vignesh

Attachments:

v1-0001-Improving-the-latch-handling-between-logical-repl.patch (text/x-patch; charset=US-ASCII)
From 072cfefd717a0c7c496a5ca66e9239ccb808ece6 Mon Sep 17 00:00:00 2001
From: Vignesh C <vignesh21@gmail.com>
Date: Wed, 29 May 2024 14:55:49 +0530
Subject: [PATCH v1] Improving the latch handling between logical replication
 launcher and the worker processes.

Currently the launcher's latch is used for the following: a) worker
process attach b) worker process exit and c) subscription creation.
Since this same latch is used for multiple cases, the launcher process
is not able to handle concurrent scenarios like: a) Launcher started a
new apply worker and waiting for apply worker to attach and b) create
subscription sub2 sending launcher wake up signal. Fixed it by having
a different latch for worker start and exit.
---
 .../replication/logical/applyparallelworker.c |  1 +
 src/backend/replication/logical/launcher.c    | 19 ++++++++++++++-----
 src/backend/replication/logical/worker.c      |  1 +
 src/include/replication/worker_internal.h     |  6 ++++++
 4 files changed, 22 insertions(+), 5 deletions(-)

diff --git a/src/backend/replication/logical/applyparallelworker.c b/src/backend/replication/logical/applyparallelworker.c
index e7f7d4c5e4..62c3791ceb 100644
--- a/src/backend/replication/logical/applyparallelworker.c
+++ b/src/backend/replication/logical/applyparallelworker.c
@@ -912,6 +912,7 @@ ParallelApplyWorkerMain(Datum main_arg)
 	 * the uninitialized memory queue.
 	 */
 	logicalrep_worker_attach(worker_slot);
+	SetLatch(&MyLogicalRepWorker->launcherWakeupLatch);
 
 	/*
 	 * Register the shutdown callback after we are attached to the worker
diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index 27c3a91fb7..74bfdb9b6e 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -187,6 +187,7 @@ WaitForReplicationWorkerAttach(LogicalRepWorker *worker,
 	BgwHandleStatus status;
 	int			rc;
 
+	OwnLatch(&worker->launcherWakeupLatch);
 	for (;;)
 	{
 		pid_t		pid;
@@ -199,6 +200,7 @@ WaitForReplicationWorkerAttach(LogicalRepWorker *worker,
 		if (!worker->in_use || worker->proc)
 		{
 			LWLockRelease(LogicalRepWorkerLock);
+			DisownLatch(&worker->launcherWakeupLatch);
 			return worker->in_use;
 		}
 
@@ -214,6 +216,7 @@ WaitForReplicationWorkerAttach(LogicalRepWorker *worker,
 			if (generation == worker->generation)
 				logicalrep_worker_cleanup(worker);
 			LWLockRelease(LogicalRepWorkerLock);
+			DisownLatch(&worker->launcherWakeupLatch);
 			return false;
 		}
 
@@ -221,13 +224,13 @@ WaitForReplicationWorkerAttach(LogicalRepWorker *worker,
 		 * We need timeout because we generally don't get notified via latch
 		 * about the worker attach.  But we don't expect to have to wait long.
 		 */
-		rc = WaitLatch(MyLatch,
+		rc = WaitLatch(&worker->launcherWakeupLatch,
 					   WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
 					   10L, WAIT_EVENT_BGWORKER_STARTUP);
 
 		if (rc & WL_LATCH_SET)
 		{
-			ResetLatch(MyLatch);
+			ResetLatch(&worker->launcherWakeupLatch);
 			CHECK_FOR_INTERRUPTS();
 		}
 	}
@@ -573,6 +576,8 @@ logicalrep_worker_stop_internal(LogicalRepWorker *worker, int signo)
 			break;
 	}
 
+	OwnLatch(&worker->launcherWakeupLatch);
+
 	/* Now terminate the worker ... */
 	kill(worker->proc->pid, signo);
 
@@ -583,18 +588,21 @@ logicalrep_worker_stop_internal(LogicalRepWorker *worker, int signo)
 
 		/* is it gone? */
 		if (!worker->proc || worker->generation != generation)
+		{
+			DisownLatch(&worker->launcherWakeupLatch);
 			break;
+		}
 
 		LWLockRelease(LogicalRepWorkerLock);
 
 		/* Wait a bit --- we don't expect to have to wait long. */
-		rc = WaitLatch(MyLatch,
+		rc = WaitLatch(&worker->launcherWakeupLatch,
 					   WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
 					   10L, WAIT_EVENT_BGWORKER_SHUTDOWN);
 
 		if (rc & WL_LATCH_SET)
 		{
-			ResetLatch(MyLatch);
+			ResetLatch(&worker->launcherWakeupLatch);
 			CHECK_FOR_INTERRUPTS();
 		}
 
@@ -837,7 +845,7 @@ logicalrep_worker_onexit(int code, Datum arg)
 	if (!InitializingApplyWorker)
 		LockReleaseAll(DEFAULT_LOCKMETHOD, true);
 
-	ApplyLauncherWakeup();
+	SetLatch(&MyLogicalRepWorker->launcherWakeupLatch);
 }
 
 /*
@@ -976,6 +984,7 @@ ApplyLauncherShmemInit(void)
 
 			memset(worker, 0, sizeof(LogicalRepWorker));
 			SpinLockInit(&worker->relmutex);
+			InitSharedLatch(&worker->launcherWakeupLatch);
 		}
 	}
 }
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index b5a80fe3e8..33b1336136 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -4652,6 +4652,7 @@ SetupApplyOrSyncWorker(int worker_slot)
 	pqsignal(SIGHUP, SignalHandlerForConfigReload);
 	pqsignal(SIGTERM, die);
 	BackgroundWorkerUnblockSignals();
+	SetLatch(&MyLogicalRepWorker->launcherWakeupLatch);
 
 	/*
 	 * We don't currently need any ResourceOwner in a walreceiver process, but
diff --git a/src/include/replication/worker_internal.h b/src/include/replication/worker_internal.h
index 515aefd519..7cc70a1b9e 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -77,6 +77,12 @@ typedef struct LogicalRepWorker
 	 */
 	FileSet    *stream_fileset;
 
+	/*
+	 * launcherWakeupLatch is used to wake up the launcher process after the
+	 * worker process is attached.
+	 */
+	Latch		launcherWakeupLatch;
+
 	/*
 	 * PID of leader apply worker if this slot is used for a parallel apply
 	 * worker, InvalidPid otherwise.
-- 
2.34.1

#6vignesh C
vignesh21@gmail.com
In reply to: Peter Smith (#4)
Re: Improving the latch handling between logical replication launcher and worker processes.

On Wed, 29 May 2024 at 10:41, Peter Smith <smithpb2250@gmail.com> wrote:

On Thu, Apr 25, 2024 at 6:59 PM vignesh C <vignesh21@gmail.com> wrote:

c) Don't reset the latch at worker attach and allow launcher main to
identify and handle it. For this there is a patch v6-0002 available at
[2].

This option c seems the easiest. Can you explain what are the
drawbacks of using this approach?

This solution will resolve the issue. However, one drawback to
consider is that because we're not resetting the latch, in this
scenario, the launcher process will need to perform an additional
round of acquiring subscription details and determining whether the
worker should start, regardless of any changes in subscriptions.

Regards,
Vignesh

#7Peter Smith
smithpb2250@gmail.com
In reply to: vignesh C (#6)
Re: Improving the latch handling between logical replication launcher and worker processes.

On Wed, May 29, 2024 at 7:53 PM vignesh C <vignesh21@gmail.com> wrote:

On Wed, 29 May 2024 at 10:41, Peter Smith <smithpb2250@gmail.com> wrote:

On Thu, Apr 25, 2024 at 6:59 PM vignesh C <vignesh21@gmail.com> wrote:

c) Don't reset the latch at worker attach and allow launcher main to
identify and handle it. For this there is a patch v6-0002 available at
[2].

This option c seems the easiest. Can you explain what are the
drawbacks of using this approach?

This solution will resolve the issue. However, one drawback to
consider is that because we're not resetting the latch, in this
scenario, the launcher process will need to perform an additional
round of acquiring subscription details and determining whether the
worker should start, regardless of any changes in subscriptions.

Hmm. IIUC the WaitLatch of the Launcher.WaitForReplicationWorkerAttach
was not expecting to get notified.

e.g.1. The WaitLatch comment in the function says so:
/*
* We need timeout because we generally don't get notified via latch
* about the worker attach. But we don't expect to have to wait long.
*/

e.g.2 The logicalrep_worker_attach() function (which is AFAIK what
WaitForReplicationWorkerAttach was waiting for) is not doing any
SetLatch. So that matches what the comment said.

~~~

AFAICT the original problem reported by this thread happened because
the SetLatch (from CREATE SUBSCRIPTION) has been inadvertently gobbled
by the WaitForReplicationWorkerAttach.WaitLatch/ResetLatch which BTW
wasn't expecting to be notified at all.

~~~

Your option c removes the ResetLatch done by WaitForReplicationWorkerAttach:

You said above that one drawback is "the launcher process will need to
perform an additional round of acquiring subscription details and
determining whether the worker should start, regardless of any changes
in subscriptions"

I think you mean that if some CREATE SUBSCRIPTION (i.e. SetLatch) happens
during the attaching of other workers, then the latch would (now, after
option c) remain set, and so the WaitLatch of ApplyLauncherMain would
be notified and/or return immediately, causing an immediate
re-iteration of the "foreach(lc, sublist)" loop.

But I don't understand why that is a problem.

a) I didn't know what you meant "regardless of any changes in
subscriptions" because I think the troublesome SetLatch originated
from the CREATE SUBSCRIPTION and so there *is* a change to
subscriptions.

b) We will need to repeat that sublist loop anyway to start the worker
for the new CREATE SUBSCRIPTION, and we need to do that at the
earliest opportunity because the whole point of the SetLatch is so the
CREATE SUBSCRIPTION worker can get started promptly. So the earlier we
do that the better, right?

c) AFAICT there is no danger of accidentally trying to start workers
that are still in the middle of starting (from the previous
iteration), because those cases should be guarded by the
ApplyLauncherGetWorkerStartTime logic.

~~

To summarise, I felt removing the ResetLatch and the WL_LATCH_SET
bitset (like your v6-0002 patch does) is not only an easy way of
fixing the problem reported by this thread, but also it now makes that
WaitForReplicationWorkerAttach code behave like the comment ("we
generally don't get notified via latch about the worker attach") is
saying.

(Unless there are some other problems with it that I can't see)

======
Kind Regards,
Peter Smith.
Fujitsu Australia

#8vignesh C
vignesh21@gmail.com
In reply to: Peter Smith (#7)
Re: Improving the latch handling between logical replication launcher and worker processes.

On Thu, 30 May 2024 at 08:46, Peter Smith <smithpb2250@gmail.com> wrote:

On Wed, May 29, 2024 at 7:53 PM vignesh C <vignesh21@gmail.com> wrote:

On Wed, 29 May 2024 at 10:41, Peter Smith <smithpb2250@gmail.com> wrote:

On Thu, Apr 25, 2024 at 6:59 PM vignesh C <vignesh21@gmail.com> wrote:

c) Don't reset the latch at worker attach and allow launcher main to
identify and handle it. For this there is a patch v6-0002 available at
[2].

This option c seems the easiest. Can you explain what are the
drawbacks of using this approach?

This solution will resolve the issue. However, one drawback to
consider is that because we're not resetting the latch, in this
scenario, the launcher process will need to perform an additional
round of acquiring subscription details and determining whether the
worker should start, regardless of any changes in subscriptions.

Hmm. IIUC the WaitLatch of the Launcher.WaitForReplicationWorkerAttach
was not expecting to get notified.

e.g.1. The WaitList comment in the function says so:
/*
* We need timeout because we generally don't get notified via latch
* about the worker attach. But we don't expect to have to wait long.
*/

e.g.2 The logicalrep_worker_attach() function (which is AFAIK what
WaitForReplicationWorkerAttach was waiting for) is not doing any
SetLatch. So that matches what the comment said.

~~~

AFAICT the original problem reported by this thread happened because
the SetLatch (from CREATE SUBSCRIPTION) has been inadvertently gobbled
by the WaitForReplicationWorkerAttach.WaitLatch/ResetLatch which BTW
wasn't expecting to be notified at all.

~~~

Your option c removes the ResetLatch done by WaitForReplicationWorkerAttach:

You said above that one drawback is "the launcher process will need to
perform an additional round of acquiring subscription details and
determining whether the worker should start, regardless of any changes
in subscriptions"

I think you mean if some CREATE SUBSCRIPTION (i.e. SetLatch) happens
during the attaching of other workers then the latch would (now after
option c) remain set and so the WaitLatch of ApplyLauncherMain would
be notified and/or return immediately end causing an immediate
re-iteration of the "foreach(lc, sublist)" loop.

But I don't understand why that is a problem.

a) I didn't know what you meant "regardless of any changes in
subscriptions" because I think the troublesome SetLatch originated
from the CREATE SUBSCRIPTION and so there *is* a change to
subscriptions.

The process of setting the latch unfolds as follows: upon creation of
a new subscription, the launcher process requests the postmaster to
start a new apply worker process. The postmaster then starts the apply
worker process and sends a SIGUSR1 signal to the launcher process
(this is done from do_start_bgworker & ReportBackgroundWorkerPID).
Upon receiving this signal, the launcher process sets its latch.
Now, there are two potential scenarios:
a) Concurrent creation of another subscription: in this situation, the
launcher traverses the subscription list, detects the newly created
subscription, and starts a new apply worker for it. This is ok.
b) No concurrent subscription creation: in this case, since the latch
was not reset (it remains set), the launcher iterates through the
subscription list and finds that there are no new subscriptions. This
is the additional check.
I'm talking about the second scenario, where no subscription is
concurrently created. In this case, because the latch was not reset,
we perform an additional pass over the subscription list. There is no
problem with this.
This additional check can occur in the existing code too, if
WaitForReplicationWorkerAttach returns from the initial if check, i.e.
if the worker has already attached when that check happens.
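
To restate that wakeup path in code terms (simplified; only the
bgw_notify_pid part is the point here, the rest of the BackgroundWorker
setup is elided):

/* Launcher side, in logicalrep_worker_launch(): ask to be signalled. */
BackgroundWorker        bgw;
BackgroundWorkerHandle *bgw_handle;

memset(&bgw, 0, sizeof(bgw));
/* ... bgw_name, bgw_library_name, bgw_function_name, flags, etc. ... */
bgw.bgw_notify_pid = MyProcPid;

RegisterDynamicBackgroundWorker(&bgw, &bgw_handle);

/*
 * Postmaster side: do_start_bgworker()/ReportBackgroundWorkerPID() send
 * SIGUSR1 to bgw_notify_pid; the launcher's SIGUSR1 handler ends with
 * SetLatch(MyLatch) - the same latch that CREATE SUBSCRIPTION sets.
 */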

Regards,
Vignesh

#9Heikki Linnakangas
hlinnaka@iki.fi
In reply to: vignesh C (#5)
Re: Improving the latch handling between logical replication launcher and worker processes.

I don't quite understand the problem we're trying to fix:

Currently the launcher's latch is used for the following: a) worker
process attach b) worker process exit and c) subscription creation.
Since this same latch is used for multiple cases, the launcher process
is not able to handle concurrent scenarios like: a) Launcher started a
new apply worker and waiting for apply worker to attach and b) create
subscription sub2 sending launcher wake up signal. In this scenario,
both of them will set latch of the launcher process, the launcher
process is not able to identify that both operations have occurred 1)
worker is attached 2) subscription is created and apply worker should
be started. As a result the apply worker does not get started for the
new subscription created immediately and gets started after the
timeout of 180 seconds.

I don't see how we could miss a notification. Yes, both events will set
the same latch. Why is that a problem? The loop will see that the new
worker has attached, and that the new subscription has been created, and
process both events. Right?

--
Heikki Linnakangas
Neon (https://neon.tech)

#10vignesh C
vignesh21@gmail.com
In reply to: Heikki Linnakangas (#9)
Re: Improving the latch handling between logical replication launcher and worker processes.

On Thu, 4 Jul 2024 at 16:52, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

I'm don't quite understand the problem we're trying to fix:

Currently the launcher's latch is used for the following: a) worker
process attach b) worker process exit and c) subscription creation.
Since this same latch is used for multiple cases, the launcher process
is not able to handle concurrent scenarios like: a) Launcher started a
new apply worker and waiting for apply worker to attach and b) create
subscription sub2 sending launcher wake up signal. In this scenario,
both of them will set latch of the launcher process, the launcher
process is not able to identify that both operations have occurred 1)
worker is attached 2) subscription is created and apply worker should
be started. As a result the apply worker does not get started for the
new subscription created immediately and gets started after the
timeout of 180 seconds.

I don't see how we could miss a notification. Yes, both events will set
the same latch. Why is that a problem?

The launcher process waits for the apply worker to attach in the
WaitForReplicationWorkerAttach function. The launcher's latch gets
set concurrently both for another create subscription and for the
apply worker having attached. The launcher detects that the latch is
set while waiting in WaitForReplicationWorkerAttach, resets the latch,
and proceeds to the main loop, where it waits for
DEFAULT_NAPTIME_PER_CYCLE (as the latch has already been reset).
Further details are provided below.

The loop will see that the new

worker has attached, and that the new subscription has been created, and
process both events. Right?

Since the latch is reset at WaitForReplicationWorkerAttach, it skips
processing the create subscription event.

Slightly detailing further:
In the scenario when we execute two concurrent create subscription
commands, first CREATE SUBSCRIPTION sub1, followed immediately by
CREATE SUBSCRIPTION sub2.
In a few random scenarios, the following events may unfold:
After the first create subscription command (sub1), the backend will
set the launcher's latch because of ApplyLauncherWakeupAtCommit.
Subsequently, the launcher process will reset the latch and identify
the addition of a new subscription in the pg_subscription list. The
launcher process will then request the postmaster to start the apply
worker background process (sub1) and to notify the launcher once the
apply worker (sub1) has been started. The launcher will then wait for
the apply worker (sub1) to attach in the WaitForReplicationWorkerAttach
function.
Meanwhile, the second CREATE SUBSCRIPTION command (sub2), which was
executed concurrently, also sets the launcher's latch (because of
ApplyLauncherWakeupAtCommit).
While the launcher remains in the WaitForReplicationWorkerAttach
function waiting for the apply worker of sub1 to start, the postmaster
will start the apply worker for subscription sub1 and send a SIGUSR1
signal to the launcher process via ReportBackgroundWorkerPID. Upon
receiving this signal, the launcher process will then set its latch.

Now, the launcher's latch has been set concurrently after the apply
worker for sub1 is started and the execution of the CREATE
SUBSCRIPTION sub2 command.

At this juncture, the launcher, which had been awaiting the attachment
of the apply worker, detects that the latch is set and proceeds to
reset it. The reset operation of the latch occurs within the following
section of code in WaitForReplicationWorkerAttach:
...
rc = WaitLatch(MyLatch,
WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
10L, WAIT_EVENT_BGWORKER_STARTUP);

if (rc & WL_LATCH_SET)
{
ResetLatch(MyLatch);
CHECK_FOR_INTERRUPTS();
}
...

After resetting the latch here, the activation signal intended to
start the apply worker for subscription sub2 is no longer present. The
launcher will return to the ApplyLauncherMain function, where it will
await the DEFAULT_NAPTIME_PER_CYCLE, which is 180 seconds, before
processing the create subscription request (i.e., creating a new apply
worker for sub2).
The issue arises from the latch being reset in
WaitForReplicationWorkerAttach, which can occasionally delay the
synchronization of table data for the subscription.
Another concurrency scenario, as mentioned by Alexander Lakhin at [3],
involves "apply worker exiting with failure and concurrent create
subscription." The underlying cause of both issues is the same: the
reset of the latch in WaitForReplicationWorkerAttach.
Even though there is no failure, such issues contribute to a delay of
180 seconds in the execution of buildfarm tests and local tests.

This was noticed in the buildfarm, where 027_nosuperuser.pl took more
than 180 seconds, at [2]:
...
[21:11:21] t/025_rep_changes_for_schema.pl .... ok 3340 ms ( 0.00 usr 0.00 sys + 0.43 cusr 0.23 csys = 0.66 CPU)
[21:11:25] t/026_stats.pl ..................... ok 4499 ms ( 0.00 usr 0.00 sys + 0.42 cusr 0.21 csys = 0.63 CPU)
[21:14:31] t/027_nosuperuser.pl ............... ok 185457 ms ( 0.01 usr 0.00 sys + 5.91 cusr 2.68 csys = 8.60 CPU)
[21:14:35] t/028_row_filter.pl ................ ok 3998 ms ( 0.01 usr 0.00 sys + 0.86 cusr 0.34 csys = 1.21 CPU)
...

And also by Robert at [3].

[1]: /messages/by-id/858a7622-2c81-1687-d1df-1322dfcb2e72@gmail.com
[2]: https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=snakefly&dt=2024-02-01%2020%3A34%3A03&stg=subscription-check
[3]: /messages/by-id/CA+TgmoZkj=A39i4obKXADMhzJW=6dyGq-C1aGfb+jUy9XvxwYA@mail.gmail.com

Regards,
Vignesh

#11Heikki Linnakangas
hlinnaka@iki.fi
In reply to: vignesh C (#10)
1 attachment(s)
Re: Improving the latch handling between logical replication launcher and worker processes.

On 05/07/2024 14:07, vignesh C wrote:

On Thu, 4 Jul 2024 at 16:52, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

I'm don't quite understand the problem we're trying to fix:

Currently the launcher's latch is used for the following: a) worker
process attach b) worker process exit and c) subscription creation.
Since this same latch is used for multiple cases, the launcher process
is not able to handle concurrent scenarios like: a) Launcher started a
new apply worker and waiting for apply worker to attach and b) create
subscription sub2 sending launcher wake up signal. In this scenario,
both of them will set latch of the launcher process, the launcher
process is not able to identify that both operations have occurred 1)
worker is attached 2) subscription is created and apply worker should
be started. As a result the apply worker does not get started for the
new subscription created immediately and gets started after the
timeout of 180 seconds.

I don't see how we could miss a notification. Yes, both events will set
the same latch. Why is that a problem?

The launcher process waits for the apply worker to attach at
WaitForReplicationWorkerAttach function. The launcher's latch is
getting set concurrently for another create subscription and apply
worker attached. The launcher now detects the latch is set while
waiting at WaitForReplicationWorkerAttach, it resets the latch and
proceed to the main loop and waits for DEFAULT_NAPTIME_PER_CYCLE(as
the latch has already been reset). Further details are provided below.

The loop will see that the new

worker has attached, and that the new subscription has been created, and
process both events. Right?

Since the latch is reset at WaitForReplicationWorkerAttach, it skips
processing the create subscription event.

Slightly detailing further:
In the scenario when we execute two concurrent create subscription
commands, first CREATE SUBSCRIPTION sub1, followed immediately by
CREATE SUBSCRIPTION sub2.
In a few random scenarios, the following events may unfold:
After the first create subscription command(sub1), the backend will
set the launcher's latch because of ApplyLauncherWakeupAtCommit.
Subsequently, the launcher process will reset the latch and identify
the addition of a new subscription in the pg_subscription list. The
launcher process will proceed to request the postmaster to start the
apply worker background process (sub1) and request the postmaster to
notify the launcher once the apply worker(sub1) has been started.
Launcher will then wait for the apply worker(sub1) to attach in the
WaitForReplicationWorkerAttach function.
Meanwhile, the second CREATE SUBSCRIPTION command (sub2) which was
executed concurrently, also set the launcher's latch(because of
ApplyLauncherWakeupAtCommit).
Simultaneously when the launcher remains in the
WaitForReplicationWorkerAttach function waiting for apply worker of
sub1 to start, the postmaster will start the apply worker for
subscription sub1 and send a SIGUSR1 signal to the launcher process
via ReportBackgroundWorkerPID. Upon receiving this signal, the
launcher process will then set its latch.

Now, the launcher's latch has been set concurrently after the apply
worker for sub1 is started and the execution of the CREATE
SUBSCRIPTION sub2 command.

At this juncture, the launcher, which had been awaiting the attachment
of the apply worker, detects that the latch is set and proceeds to
reset it. The reset operation of the latch occurs within the following
section of code in WaitForReplicationWorkerAttach:
...
rc = WaitLatch(MyLatch,
WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
10L, WAIT_EVENT_BGWORKER_STARTUP);

if (rc & WL_LATCH_SET)
{
ResetLatch(MyLatch);
CHECK_FOR_INTERRUPTS();
}
...

After resetting the latch here, the activation signal intended to
start the apply worker for subscription sub2 is no longer present. The
launcher will return to the ApplyLauncherMain function, where it will
await the DEFAULT_NAPTIME_PER_CYCLE, which is 180 seconds, before
processing the create subscription request (i.e., creating a new apply
worker for sub2).
The issue arises from the latch being reset in
WaitForReplicationWorkerAttach, which can occasionally delay the
synchronization of table data for the subscription.

Ok, I see it now. Thanks for the explanation. So the problem isn't using
the same latch for different purposes per se. It's that we're trying to
use it in a nested fashion, resetting it in the inner loop.

Looking at the proposed patch more closely:

@@ -221,13 +224,13 @@ WaitForReplicationWorkerAttach(LogicalRepWorker *worker,
* We need timeout because we generally don't get notified via latch
* about the worker attach.  But we don't expect to have to wait long.
*/
-		rc = WaitLatch(MyLatch,
+		rc = WaitLatch(&worker->launcherWakeupLatch,
WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
10L, WAIT_EVENT_BGWORKER_STARTUP);
if (rc & WL_LATCH_SET)
{
-			ResetLatch(MyLatch);
+			ResetLatch(&worker->launcherWakeupLatch);
CHECK_FOR_INTERRUPTS();
}
}

The comment here is now outdated, right? The patch adds an explicit
SetLatch() call to ParallelApplyWorkerMain(), to notify the launcher
about the attachment.

Is the launcherWakeupLatch field protected by some lock, to protect
which process owns it? OwnLatch() is called in
logicalrep_worker_stop_internal() without holding a lock for example. Is
there a guarantee that only one process can do that at the same time?

What happens if a process never calls DisownLatch(), e.g. because it
bailed out with an error while waiting for the worker to attach or stop?

As an alternative, smaller fix, I think we could do the attached. It
forces the launcher's main loop to do another iteration, if it calls
logicalrep_worker_launch(). That extra iteration should pick up any
missed notifications.

Your solution with an additional latch seems better because it makes
WaitForReplicationWorkerAttach() react more quickly, without the 10 s
wait. I'm surprised we have that in the first place, 10 s seems like a
pretty long time to wait for a parallel apply worker to start. Why was
that ever OK?

--
Heikki Linnakangas
Neon (https://neon.tech)

Attachments:

logical-launcher-nested-latch-issue.patch (text/x-patch; charset=UTF-8)
diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index 27c3a91fb7..2ca9fb22e8 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -1149,6 +1149,7 @@ ApplyLauncherMain(Datum main_arg)
 		MemoryContext subctx;
 		MemoryContext oldctx;
 		long		wait_time = DEFAULT_NAPTIME_PER_CYCLE;
+		bool		launched_something = false;
 
 		CHECK_FOR_INTERRUPTS();
 
@@ -1201,6 +1202,8 @@ ApplyLauncherMain(Datum main_arg)
 										 sub->dbid, sub->oid, sub->name,
 										 sub->owner, InvalidOid,
 										 DSM_HANDLE_INVALID);
+				/* remember that we launched something; see comment below */
+				launched_something = true;
 			}
 			else
 			{
@@ -1214,6 +1217,17 @@ ApplyLauncherMain(Datum main_arg)
 		/* Clean the temporary memory. */
 		MemoryContextDelete(subctx);
 
+		/*
+		 * Launching a worker uses MyLatch to wait for the worker to attach,
+		 * resetting it. We use the same latch to be signalled about
+		 * subscription changes and woerkers exiting, so we might have missed
+		 * some notifications, if those events happened concurrently. To work
+		 * around that, do another iteration immediately, if we launched
+		 * anything in this iteration.
+		 */
+		if (launched_something)
+			continue;
+
 		/* Wait for more work. */
 		rc = WaitLatch(MyLatch,
 					   WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
#12Amit Kapila
amit.kapila16@gmail.com
In reply to: Heikki Linnakangas (#11)
Re: Improving the latch handling between logical replication launcher and worker processes.

On Fri, Jul 5, 2024 at 6:38 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Your solution with an additional latch seems better because it makes
WaitForReplicationWorkerAttach() react more quickly, without the 10 s
wait. I'm surprised we have that in the first place, 10 s seems like a
pretty long time to wait for a parallel apply worker to start. Why was
that ever OK?

Isn't the call waiting for 10 milliseconds? The comment atop
WaitLatch ("The "timeout" is given in milliseconds ...") indicates that
the timeout is in milliseconds.

--
With Regards,
Amit Kapila.

#13vignesh C
vignesh21@gmail.com
In reply to: Heikki Linnakangas (#11)
Re: Improving the latch handling between logical replication launcher and worker processes.

On Fri, 5 Jul 2024 at 18:38, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 05/07/2024 14:07, vignesh C wrote:

On Thu, 4 Jul 2024 at 16:52, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

I'm don't quite understand the problem we're trying to fix:

Currently the launcher's latch is used for the following: a) worker
process attach b) worker process exit and c) subscription creation.
Since this same latch is used for multiple cases, the launcher process
is not able to handle concurrent scenarios like: a) Launcher started a
new apply worker and waiting for apply worker to attach and b) create
subscription sub2 sending launcher wake up signal. In this scenario,
both of them will set latch of the launcher process, the launcher
process is not able to identify that both operations have occurred 1)
worker is attached 2) subscription is created and apply worker should
be started. As a result the apply worker does not get started for the
new subscription created immediately and gets started after the
timeout of 180 seconds.

I don't see how we could miss a notification. Yes, both events will set
the same latch. Why is that a problem?

The launcher process waits for the apply worker to attach at
WaitForReplicationWorkerAttach function. The launcher's latch is
getting set concurrently for another create subscription and apply
worker attached. The launcher now detects the latch is set while
waiting at WaitForReplicationWorkerAttach, it resets the latch and
proceed to the main loop and waits for DEFAULT_NAPTIME_PER_CYCLE(as
the latch has already been reset). Further details are provided below.

The loop will see that the new

worker has attached, and that the new subscription has been created, and
process both events. Right?

Since the latch is reset at WaitForReplicationWorkerAttach, it skips
processing the create subscription event.

Slightly detailing further:
In the scenario when we execute two concurrent create subscription
commands, first CREATE SUBSCRIPTION sub1, followed immediately by
CREATE SUBSCRIPTION sub2.
In a few random scenarios, the following events may unfold:
After the first create subscription command(sub1), the backend will
set the launcher's latch because of ApplyLauncherWakeupAtCommit.
Subsequently, the launcher process will reset the latch and identify
the addition of a new subscription in the pg_subscription list. The
launcher process will proceed to request the postmaster to start the
apply worker background process (sub1) and request the postmaster to
notify the launcher once the apply worker(sub1) has been started.
Launcher will then wait for the apply worker(sub1) to attach in the
WaitForReplicationWorkerAttach function.
Meanwhile, the second CREATE SUBSCRIPTION command (sub2) which was
executed concurrently, also set the launcher's latch(because of
ApplyLauncherWakeupAtCommit).
Simultaneously when the launcher remains in the
WaitForReplicationWorkerAttach function waiting for apply worker of
sub1 to start, the postmaster will start the apply worker for
subscription sub1 and send a SIGUSR1 signal to the launcher process
via ReportBackgroundWorkerPID. Upon receiving this signal, the
launcher process will then set its latch.

Now, the launcher's latch has been set concurrently after the apply
worker for sub1 is started and the execution of the CREATE
SUBSCRIPTION sub2 command.

At this juncture, the launcher, which had been awaiting the attachment
of the apply worker, detects that the latch is set and proceeds to
reset it. The reset operation of the latch occurs within the following
section of code in WaitForReplicationWorkerAttach:
...
rc = WaitLatch(MyLatch,
WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
10L, WAIT_EVENT_BGWORKER_STARTUP);

if (rc & WL_LATCH_SET)
{
ResetLatch(MyLatch);
CHECK_FOR_INTERRUPTS();
}
...

After resetting the latch here, the activation signal intended to
start the apply worker for subscription sub2 is no longer present. The
launcher will return to the ApplyLauncherMain function, where it will
await the DEFAULT_NAPTIME_PER_CYCLE, which is 180 seconds, before
processing the create subscription request (i.e., creating a new apply
worker for sub2).
The issue arises from the latch being reset in
WaitForReplicationWorkerAttach, which can occasionally delay the
synchronization of table data for the subscription.

Ok, I see it now. Thanks for the explanation. So the problem isn't using
the same latch for different purposes per se. It's that we're trying to
use it in a nested fashion, resetting it in the inner loop.

Looking at the proposed patch more closely:

@@ -221,13 +224,13 @@ WaitForReplicationWorkerAttach(LogicalRepWorker *worker,
* We need timeout because we generally don't get notified via latch
* about the worker attach.  But we don't expect to have to wait long.
*/
-             rc = WaitLatch(MyLatch,
+             rc = WaitLatch(&worker->launcherWakeupLatch,
WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
10L, WAIT_EVENT_BGWORKER_STARTUP);
if (rc & WL_LATCH_SET)
{
-                     ResetLatch(MyLatch);
+                     ResetLatch(&worker->launcherWakeupLatch);
CHECK_FOR_INTERRUPTS();
}
}

The comment here is now outdated, right? The patch adds an explicit
SetLatch() call to ParallelApplyWorkerMain(), to notify the launcher
about the attachment.

I will update the comment for this.

Is the launcherWakeupLatch field protected by some lock, to protect
which process owns it? OwnLatch() is called in
logicalrep_worker_stop_internal() without holding a lock for example. Is
there a guarantee that only one process can do that at the same time?

I have analyzed a few scenarios, I will analyze the remaining
scenarios and update on this.

What happens if a process never calls DisownLatch(), e.g. because it
bailed out with an error while waiting for the worker to attach or stop?

I tried a lot of scenarios by erroring out after the OwnLatch() call
and did not hit the panic code of OwnLatch yet. I will try a few more
scenarios that I have in mind, see if we can hit the PANIC in
OwnLatch, and update on this.

As an alternative, smaller fix, I think we could do the attached. It
forces the launcher's main loop to do another iteration, if it calls
logicalrep_worker_launch(). That extra iteration should pick up any
missed notifications.

This also works. I had another solution along similar lines, proposed
at [1], where we don't reset the latch in the inner loop. I'm not sure
which solution we should proceed with. Thoughts?

Your solution with an additional latch seems better because it makes
WaitForReplicationWorkerAttach() react more quickly, without the 10 s
wait. I'm surprised we have that in the first place, 10 s seems like a
pretty long time to wait for a parallel apply worker to start. Why was
that ever OK?

Here 10 means 10 milliseconds, so it just waits for 10 milliseconds
before going to the next iteration.

[1]: /messages/by-id/CALDaNm10R7L0Dxq+-J=Pp3AfM_yaokpbhECvJ69QiGH8-jQquw@mail.gmail.com

Regards,
Vignesh

#14Amit Kapila
amit.kapila16@gmail.com
In reply to: vignesh C (#13)
Re: Improving the latch handling between logical replication launcher and worker processes.

On Mon, Jul 8, 2024 at 5:47 PM vignesh C <vignesh21@gmail.com> wrote:

On Fri, 5 Jul 2024 at 18:38, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

As an alternative, smaller fix, I think we could do the attached. It
forces the launcher's main loop to do another iteration, if it calls
logicalrep_worker_launch(). That extra iteration should pick up any
missed notifications.

This also works.

The minor drawback would be that in many cases the extra iteration
would not lead to anything meaningful.

--
With Regards,
Amit Kapila.

#15vignesh C
vignesh21@gmail.com
In reply to: vignesh C (#13)
Re: Improving the latch handling between logical replication launcher and worker processes.

On Mon, 8 Jul 2024 at 17:46, vignesh C <vignesh21@gmail.com> wrote:

On Fri, 5 Jul 2024 at 18:38, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 05/07/2024 14:07, vignesh C wrote:

On Thu, 4 Jul 2024 at 16:52, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

I'm don't quite understand the problem we're trying to fix:

Currently the launcher's latch is used for the following: a) worker
process attach b) worker process exit and c) subscription creation.
Since this same latch is used for multiple cases, the launcher process
is not able to handle concurrent scenarios like: a) Launcher started a
new apply worker and waiting for apply worker to attach and b) create
subscription sub2 sending launcher wake up signal. In this scenario,
both of them will set latch of the launcher process, the launcher
process is not able to identify that both operations have occurred 1)
worker is attached 2) subscription is created and apply worker should
be started. As a result the apply worker does not get started for the
new subscription created immediately and gets started after the
timeout of 180 seconds.

I don't see how we could miss a notification. Yes, both events will set
the same latch. Why is that a problem?

The launcher process waits for the apply worker to attach at
WaitForReplicationWorkerAttach function. The launcher's latch is
getting set concurrently for another create subscription and apply
worker attached. The launcher now detects the latch is set while
waiting at WaitForReplicationWorkerAttach, it resets the latch and
proceed to the main loop and waits for DEFAULT_NAPTIME_PER_CYCLE(as
the latch has already been reset). Further details are provided below.

The loop will see that the new

worker has attached, and that the new subscription has been created, and
process both events. Right?

Since the latch is reset at WaitForReplicationWorkerAttach, it skips
processing the create subscription event.

Slightly detailing further:
In the scenario when we execute two concurrent create subscription
commands, first CREATE SUBSCRIPTION sub1, followed immediately by
CREATE SUBSCRIPTION sub2.
In a few random scenarios, the following events may unfold:
After the first create subscription command(sub1), the backend will
set the launcher's latch because of ApplyLauncherWakeupAtCommit.
Subsequently, the launcher process will reset the latch and identify
the addition of a new subscription in the pg_subscription list. The
launcher process will proceed to request the postmaster to start the
apply worker background process (sub1) and request the postmaster to
notify the launcher once the apply worker(sub1) has been started.
Launcher will then wait for the apply worker(sub1) to attach in the
WaitForReplicationWorkerAttach function.
Meanwhile, the second CREATE SUBSCRIPTION command (sub2) which was
executed concurrently, also set the launcher's latch(because of
ApplyLauncherWakeupAtCommit).
Simultaneously when the launcher remains in the
WaitForReplicationWorkerAttach function waiting for apply worker of
sub1 to start, the postmaster will start the apply worker for
subscription sub1 and send a SIGUSR1 signal to the launcher process
via ReportBackgroundWorkerPID. Upon receiving this signal, the
launcher process will then set its latch.

Now, the launcher's latch has been set concurrently after the apply
worker for sub1 is started and the execution of the CREATE
SUBSCRIPTION sub2 command.

At this juncture, the launcher, which had been awaiting the attachment
of the apply worker, detects that the latch is set and proceeds to
reset it. The reset operation of the latch occurs within the following
section of code in WaitForReplicationWorkerAttach:
...
rc = WaitLatch(MyLatch,
WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
10L, WAIT_EVENT_BGWORKER_STARTUP);

if (rc & WL_LATCH_SET)
{
ResetLatch(MyLatch);
CHECK_FOR_INTERRUPTS();
}
...

After resetting the latch here, the activation signal intended to
start the apply worker for subscription sub2 is no longer present. The
launcher will return to the ApplyLauncherMain function, where it will
await the DEFAULT_NAPTIME_PER_CYCLE, which is 180 seconds, before
processing the create subscription request (i.e., creating a new apply
worker for sub2).
The issue arises from the latch being reset in
WaitForReplicationWorkerAttach, which can occasionally delay the
synchronization of table data for the subscription.

Ok, I see it now. Thanks for the explanation. So the problem isn't using
the same latch for different purposes per se. It's that we're trying to
use it in a nested fashion, resetting it in the inner loop.

Looking at the proposed patch more closely:

@@ -221,13 +224,13 @@ WaitForReplicationWorkerAttach(LogicalRepWorker *worker,
* We need timeout because we generally don't get notified via latch
* about the worker attach.  But we don't expect to have to wait long.
*/
-             rc = WaitLatch(MyLatch,
+             rc = WaitLatch(&worker->launcherWakeupLatch,
WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
10L, WAIT_EVENT_BGWORKER_STARTUP);
if (rc & WL_LATCH_SET)
{
-                     ResetLatch(MyLatch);
+                     ResetLatch(&worker->launcherWakeupLatch);
CHECK_FOR_INTERRUPTS();
}
}

The comment here is now outdated, right? The patch adds an explicit
SetLatch() call to ParallelApplyWorkerMain(), to notify the launcher
about the attachment.

I will update the comment for this.

Is the launcherWakeupLatch field protected by some lock, to protect
which process owns it? OwnLatch() is called in
logicalrep_worker_stop_internal() without holding a lock for example. Is
there a guarantee that only one process can do that at the same time?

I have analyzed a few scenarios, I will analyze the remaining
scenarios and update on this.

What happens if a process never calls DisownLatch(), e.g. because it
bailed out with an error while waiting for the worker to attach or stop?

I tried a lot of scenarios by erroring out after ownlatch call and did
not hit the panic code of OwnLatch yet. I will try a few more
scenarios that I have in mind and see if we can hit the PANIC in
OwnLatch and update on this.

I could encounter the PANIC from OwnLatch by having a lot of create
subscriptions doing a table sync with a limited number of slots. This
and the above scenario are slightly related. I need some more time to
prepare the fix.

Regards,
Vignesh

#16vignesh C
vignesh21@gmail.com
In reply to: Heikki Linnakangas (#11)
1 attachment(s)
Re: Improving the latch handling between logical replication launcher and worker processes.

On Fri, 5 Jul 2024 at 18:38, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 05/07/2024 14:07, vignesh C wrote:

On Thu, 4 Jul 2024 at 16:52, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

I'm don't quite understand the problem we're trying to fix:

Currently the launcher's latch is used for the following: a) worker
process attach b) worker process exit and c) subscription creation.
Since this same latch is used for multiple cases, the launcher process
is not able to handle concurrent scenarios like: a) Launcher started a
new apply worker and waiting for apply worker to attach and b) create
subscription sub2 sending launcher wake up signal. In this scenario,
both of them will set latch of the launcher process, the launcher
process is not able to identify that both operations have occurred 1)
worker is attached 2) subscription is created and apply worker should
be started. As a result the apply worker does not get started for the
new subscription created immediately and gets started after the
timeout of 180 seconds.

I don't see how we could miss a notification. Yes, both events will set
the same latch. Why is that a problem?

The launcher process waits for the apply worker to attach at
WaitForReplicationWorkerAttach function. The launcher's latch is
getting set concurrently for another create subscription and apply
worker attached. The launcher now detects the latch is set while
waiting at WaitForReplicationWorkerAttach, it resets the latch and
proceed to the main loop and waits for DEFAULT_NAPTIME_PER_CYCLE(as
the latch has already been reset). Further details are provided below.

The loop will see that the new

worker has attached, and that the new subscription has been created, and
process both events. Right?

Since the latch is reset at WaitForReplicationWorkerAttach, it skips
processing the create subscription event.

To detail this a bit further:
Consider the scenario where we execute two concurrent create
subscription commands: first CREATE SUBSCRIPTION sub1, followed
immediately by CREATE SUBSCRIPTION sub2.
In a few random runs, the following events may unfold:
After the first create subscription command (sub1), the backend will
set the launcher's latch because of ApplyLauncherWakeupAtCommit.
Subsequently, the launcher process will reset the latch and identify
the addition of a new subscription in the pg_subscription list. The
launcher process will proceed to request the postmaster to start the
apply worker background process (sub1) and request the postmaster to
notify the launcher once the apply worker(sub1) has been started.
Launcher will then wait for the apply worker(sub1) to attach in the
WaitForReplicationWorkerAttach function.
Meanwhile, the second CREATE SUBSCRIPTION command (sub2), which was
executed concurrently, also sets the launcher's latch (because of
ApplyLauncherWakeupAtCommit).
While the launcher remains in the WaitForReplicationWorkerAttach
function waiting for the apply worker of sub1 to start, the postmaster
will start the apply worker for subscription sub1 and send a SIGUSR1
signal to the launcher process via ReportBackgroundWorkerPID. Upon
receiving this signal, the launcher process will then set its latch.

Now the launcher's latch has been set concurrently, both by the start
of the apply worker for sub1 and by the execution of the CREATE
SUBSCRIPTION sub2 command.

At this juncture, the launcher, which had been awaiting the attachment
of the apply worker, detects that the latch is set and proceeds to
reset it. The reset operation of the latch occurs within the following
section of code in WaitForReplicationWorkerAttach:
...
rc = WaitLatch(MyLatch,
WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
10L, WAIT_EVENT_BGWORKER_STARTUP);

if (rc & WL_LATCH_SET)
{
ResetLatch(MyLatch);
CHECK_FOR_INTERRUPTS();
}
...

After resetting the latch here, the activation signal intended to
start the apply worker for subscription sub2 is no longer present. The
launcher will return to the ApplyLauncherMain function, where it will
await the DEFAULT_NAPTIME_PER_CYCLE, which is 180 seconds, before
processing the create subscription request (i.e., creating a new apply
worker for sub2).
The issue arises from the latch being reset in
WaitForReplicationWorkerAttach, which can occasionally delay the
synchronization of table data for the subscription.

Ok, I see it now. Thanks for the explanation. So the problem isn't using
the same latch for different purposes per se. It's that we're trying to
use it in a nested fashion, resetting it in the inner loop.
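
To spell that out, the shape of the problem is roughly this (heavily
simplified; start_missing_workers() and worker_attached are just
placeholders for the real logic):

/* outer loop, ApplyLauncherMain() */
for (;;)
{
    start_missing_workers();    /* may enter the inner wait below */

    rc = WaitLatch(MyLatch, WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                   DEFAULT_NAPTIME_PER_CYCLE, WAIT_EVENT_LOGICAL_LAUNCHER_MAIN);
    if (rc & WL_LATCH_SET)
        ResetLatch(MyLatch);
}

/* inner wait, WaitForReplicationWorkerAttach() */
while (!worker_attached)
{
    rc = WaitLatch(MyLatch, WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                   10L, WAIT_EVENT_BGWORKER_STARTUP);
    if (rc & WL_LATCH_SET)
        ResetLatch(MyLatch);    /* also consumes a wakeup that was meant
                                 * for the outer loop, e.g. from a new
                                 * CREATE SUBSCRIPTION */
}

The outer loop then sleeps for the full naptime even though there is
already work pending.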

Looking at the proposed patch more closely:

@@ -221,13 +224,13 @@ WaitForReplicationWorkerAttach(LogicalRepWorker *worker,
* We need timeout because we generally don't get notified via latch
* about the worker attach.  But we don't expect to have to wait long.
*/
-             rc = WaitLatch(MyLatch,
+             rc = WaitLatch(&worker->launcherWakeupLatch,
WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
10L, WAIT_EVENT_BGWORKER_STARTUP);
if (rc & WL_LATCH_SET)
{
-                     ResetLatch(MyLatch);
+                     ResetLatch(&worker->launcherWakeupLatch);
CHECK_FOR_INTERRUPTS();
}
}

The comment here is now outdated, right? The patch adds an explicit
SetLatch() call to ParallelApplyWorkerMain(), to notify the launcher
about the attachment.

I have removed these comments.

Is the launcherWakeupLatch field protected by some lock, to protect
which process owns it? OwnLatch() is called in
logicalrep_worker_stop_internal() without holding a lock for example. Is
there a guarantee that only one process can do that at the same time?

Added a lock to prevent concurrent access.

What happens if a process never calls DisownLatch(), e.g. because it
bailed out with an error while waiting for the worker to attach or stop?

This will not happen now, as we call OwnLatch and DisownLatch just
before the WaitLatch call and not in other cases.

The attached v2 patch has the changes for the same.

Regards,
Vignesh

Attachments:

v2-0001-Improved-latch-handling-between-the-logical-repli.patchtext/x-patch; charset=US-ASCII; name=v2-0001-Improved-latch-handling-between-the-logical-repli.patchDownload
From 89b7feb30cb70898595366c0203d420c9502c7b6 Mon Sep 17 00:00:00 2001
From: Vignesh C <vignesh21@gmail.com>
Date: Wed, 29 May 2024 14:55:49 +0530
Subject: [PATCH v2] Improved latch handling between the logical replication
 launcher and worker processes.

The current implementation uses a single latch for multiple purposes:
a) attaching worker processes, b) handling worker process exits, and
c) creating subscriptions. This overlap causes issues in concurrent scenarios,
such as when the launcher starts a new apply worker and is waiting for it to
attach, while simultaneously a new subscription creation request arrives.

This commit fixes the problem by introducing a new wakeupLatch. This latch
will be set once the logical replication processes (apply worker,
parallel apply worker, or tablesync worker) are started. Consequently,
bgw_notify_pid is no longer required, as wakeupLatch will handle notifications
when a logical replication worker process attaches and will signal the
launcher or apply worker upon process exit from logicalrep_worker_onexit.
---
 .../replication/logical/applyparallelworker.c |  1 +
 src/backend/replication/logical/launcher.c    | 93 +++++++++++--------
 src/backend/replication/logical/worker.c      |  1 +
 src/include/replication/worker_internal.h     |  7 ++
 4 files changed, 62 insertions(+), 40 deletions(-)

diff --git a/src/backend/replication/logical/applyparallelworker.c b/src/backend/replication/logical/applyparallelworker.c
index e7f7d4c5e4..2fb4d63ebc 100644
--- a/src/backend/replication/logical/applyparallelworker.c
+++ b/src/backend/replication/logical/applyparallelworker.c
@@ -912,6 +912,7 @@ ParallelApplyWorkerMain(Datum main_arg)
 	 * the uninitialized memory queue.
 	 */
 	logicalrep_worker_attach(worker_slot);
+	SetLatch(&MyLogicalRepWorker->wakeupLatch);
 
 	/*
 	 * Register the shutdown callback after we are attached to the worker
diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index c566d50a07..dea41b7ea1 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -171,6 +171,35 @@ get_subscription_list(void)
 	return res;
 }
 
+/*
+ * Wait for wakeupLatch to be set or a timeout.
+ */
+static void
+WaitForWakeupLatch(LogicalRepWorker *worker, uint32 wait_event_info)
+{
+	int			rc;
+
+	SpinLockAcquire(&worker->wakeupmutex);
+	OwnLatch(&worker->wakeupLatch);
+
+	rc = WaitLatch(&worker->wakeupLatch,
+					WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+					1000L, wait_event_info);
+
+	if (rc & WL_LATCH_SET)
+	{
+		ResetLatch(&worker->wakeupLatch);
+		DisownLatch(&worker->wakeupLatch);
+		SpinLockRelease(&worker->wakeupmutex);
+		CHECK_FOR_INTERRUPTS();
+	}
+	else
+	{
+		DisownLatch(&worker->wakeupLatch);
+		SpinLockRelease(&worker->wakeupmutex);
+	}
+}
+
 /*
  * Wait for a background worker to start up and attach to the shmem context.
  *
@@ -185,7 +214,6 @@ WaitForReplicationWorkerAttach(LogicalRepWorker *worker,
 							   BackgroundWorkerHandle *handle)
 {
 	BgwHandleStatus status;
-	int			rc;
 
 	for (;;)
 	{
@@ -217,19 +245,7 @@ WaitForReplicationWorkerAttach(LogicalRepWorker *worker,
 			return false;
 		}
 
-		/*
-		 * We need timeout because we generally don't get notified via latch
-		 * about the worker attach.  But we don't expect to have to wait long.
-		 */
-		rc = WaitLatch(MyLatch,
-					   WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
-					   10L, WAIT_EVENT_BGWORKER_STARTUP);
-
-		if (rc & WL_LATCH_SET)
-		{
-			ResetLatch(MyLatch);
-			CHECK_FOR_INTERRUPTS();
-		}
+		WaitForWakeupLatch(worker, WAIT_EVENT_BGWORKER_STARTUP);
 	}
 }
 
@@ -456,6 +472,12 @@ retry:
 	worker->reply_lsn = InvalidXLogRecPtr;
 	TIMESTAMP_NOBEGIN(worker->reply_time);
 
+	SpinLockAcquire(&worker->wakeupmutex);
+	OwnLatch(&worker->wakeupLatch);
+	ResetLatch(&worker->wakeupLatch);
+	DisownLatch(&worker->wakeupLatch);
+	SpinLockRelease(&worker->wakeupmutex);
+
 	/* Before releasing lock, remember generation for future identification. */
 	generation = worker->generation;
 
@@ -503,7 +525,7 @@ retry:
 	}
 
 	bgw.bgw_restart_time = BGW_NEVER_RESTART;
-	bgw.bgw_notify_pid = MyProcPid;
+	bgw.bgw_notify_pid = 0;
 	bgw.bgw_main_arg = Int32GetDatum(slot);
 
 	if (!RegisterDynamicBackgroundWorker(&bgw, &bgw_handle))
@@ -548,20 +570,9 @@ logicalrep_worker_stop_internal(LogicalRepWorker *worker, int signo)
 	 */
 	while (worker->in_use && !worker->proc)
 	{
-		int			rc;
-
 		LWLockRelease(LogicalRepWorkerLock);
 
-		/* Wait a bit --- we don't expect to have to wait long. */
-		rc = WaitLatch(MyLatch,
-					   WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
-					   10L, WAIT_EVENT_BGWORKER_STARTUP);
-
-		if (rc & WL_LATCH_SET)
-		{
-			ResetLatch(MyLatch);
-			CHECK_FOR_INTERRUPTS();
-		}
+		WaitForWakeupLatch(worker, WAIT_EVENT_BGWORKER_STARTUP);
 
 		/* Recheck worker status. */
 		LWLockAcquire(LogicalRepWorkerLock, LW_SHARED);
@@ -585,24 +596,13 @@ logicalrep_worker_stop_internal(LogicalRepWorker *worker, int signo)
 	/* ... and wait for it to die. */
 	for (;;)
 	{
-		int			rc;
-
 		/* is it gone? */
 		if (!worker->proc || worker->generation != generation)
 			break;
 
 		LWLockRelease(LogicalRepWorkerLock);
 
-		/* Wait a bit --- we don't expect to have to wait long. */
-		rc = WaitLatch(MyLatch,
-					   WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
-					   10L, WAIT_EVENT_BGWORKER_SHUTDOWN);
-
-		if (rc & WL_LATCH_SET)
-		{
-			ResetLatch(MyLatch);
-			CHECK_FOR_INTERRUPTS();
-		}
+		WaitForWakeupLatch(worker, WAIT_EVENT_BGWORKER_SHUTDOWN);
 
 		LWLockAcquire(LogicalRepWorkerLock, LW_SHARED);
 	}
@@ -823,6 +823,8 @@ logicalrep_launcher_onexit(int code, Datum arg)
 static void
 logicalrep_worker_onexit(int code, Datum arg)
 {
+	LogicalRepWorkerType type = MyLogicalRepWorker->type;
+
 	/* Disconnect gracefully from the remote side. */
 	if (LogRepWorkerWalRcvConn)
 		walrcv_disconnect(LogRepWorkerWalRcvConn);
@@ -843,7 +845,16 @@ logicalrep_worker_onexit(int code, Datum arg)
 	if (!InitializingApplyWorker)
 		LockReleaseAll(DEFAULT_LOCKMETHOD, true);
 
-	ApplyLauncherWakeup();
+	SetLatch(&MyLogicalRepWorker->wakeupLatch);
+
+	/*
+	 * Since we don't set bgw_notify_pid, send an exit signal to the parent
+	 * process.
+	 */
+	if (type == WORKERTYPE_APPLY)
+		ApplyLauncherWakeup();
+	else
+		logicalrep_worker_wakeup(MyLogicalRepWorker->subid, InvalidOid);
 }
 
 /*
@@ -982,6 +993,8 @@ ApplyLauncherShmemInit(void)
 
 			memset(worker, 0, sizeof(LogicalRepWorker));
 			SpinLockInit(&worker->relmutex);
+			InitSharedLatch(&worker->wakeupLatch);
+			SpinLockInit(&worker->wakeupmutex);
 		}
 	}
 }
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 0fb577d328..15cfe5693e 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -4737,6 +4737,7 @@ SetupApplyOrSyncWorker(int worker_slot)
 	pqsignal(SIGHUP, SignalHandlerForConfigReload);
 	pqsignal(SIGTERM, die);
 	BackgroundWorkerUnblockSignals();
+	SetLatch(&MyLogicalRepWorker->wakeupLatch);
 
 	/*
 	 * We don't currently need any ResourceOwner in a walreceiver process, but
diff --git a/src/include/replication/worker_internal.h b/src/include/replication/worker_internal.h
index 9646261d7e..61f63dc1a3 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -77,6 +77,13 @@ typedef struct LogicalRepWorker
 	 */
 	FileSet    *stream_fileset;
 
+	/*
+	 * Used to wake up the launcher/apply worker after the logical replication
+	 * process is attached/exited.
+	 */
+	Latch		wakeupLatch;
+	slock_t		wakeupmutex;
+
 	/*
 	 * PID of leader apply worker if this slot is used for a parallel apply
 	 * worker, InvalidPid otherwise.
-- 
2.34.1

#17Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: vignesh C (#16)
Re: Improving the latch handling between logical replication launcher and worker processes.

At Tue, 3 Sep 2024 09:10:07 +0530, vignesh C <vignesh21@gmail.com> wrote in

The attached v2 version patch has the changes for the same.

Sorry for jumping in at this point. I've just reviewed the latest
patch (v2), and the frequent Own/Disown-Latch operations caught my
attention. Additionally, handling multiple concurrently operating
trigger sources with nested latch waits seems bug-prone, which I’d
prefer to avoid from both a readability and safety perspective.

With that in mind, I’d like to suggest an alternative approach. I may
not be fully aware of all the previous discussions, so apologies if
this idea has already been considered and dismissed.

Currently, WaitForReplicationWorkerAttach() and
logicalrep_worker_stop_internal() wait on a latch after verifying the
desired state. This ensures that even if there are spurious or missed
wakeups, they won't cause issues. In contrast, ApplyLauncherMain()
enters a latch wait without checking the desired state
first. Consequently, if another process sets the latch to wake up the
main loop while the former two functions are waiting, that wakeup
could be missed. If my understanding is correct, the problem lies in
ApplyLauncherMain() not checking the expected state before beginning
to wait on the latch. There is no issue with waiting if the state
hasn't been satisfied yet.

So, I propose that ApplyLauncherMain() should check the condition that
triggers a main loop wakeup before calling WaitLatch(). To do this, we
could add a flag in LogicalRepCtxStruct to signal that the main loop
has immediate tasks to handle. ApplyLauncherWakeup() would set this
flag before sending SIGUSR1. In turn, ApplyLauncherMain() would check
this flag before calling WaitLatch() and skip the WaitLatch() call if
the flag is set.
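
In rough terms, something like this (an untested sketch; the flag name
is only for illustration):

/* add to LogicalRepCtxStruct: bool launcher_has_pending_work; */

static void
ApplyLauncherWakeup(void)
{
    if (LogicalRepCtx->launcher_pid != 0)
    {
        LogicalRepCtx->launcher_has_pending_work = true;
        kill(LogicalRepCtx->launcher_pid, SIGUSR1);
    }
}

/* in each iteration of ApplyLauncherMain() */
LogicalRepCtx->launcher_has_pending_work = false;   /* clear before scanning */
/* ... scan pg_subscription and start any missing workers ... */
if (!LogicalRepCtx->launcher_has_pending_work)
{
    /* only sleep if no new work arrived while we were busy */
    rc = WaitLatch(MyLatch, WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                   wait_time, WAIT_EVENT_LOGICAL_LAUNCHER_MAIN);
    if (rc & WL_LATCH_SET)
        ResetLatch(MyLatch);
}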

I think this approach could solve the issue without adding
complexity. What do you think?

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#18vignesh C
vignesh21@gmail.com
In reply to: Kyotaro Horiguchi (#17)
1 attachment(s)
Re: Improving the latch handling between logical replication launcher and worker processes.

On Wed, 4 Sept 2024 at 08:32, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:

At Tue, 3 Sep 2024 09:10:07 +0530, vignesh C <vignesh21@gmail.com> wrote in

The attached v2 version patch has the changes for the same.

Sorry for jumping in at this point. I've just reviewed the latest
patch (v2), and the frequent Own/Disown-Latch operations caught my
attention. Additionally, handling multiple concurrently operating
trigger sources with nested latch waits seems bug-prone, which I’d
prefer to avoid from both a readability and safety perspective.

With that in mind, I’d like to suggest an alternative approach. I may
not be fully aware of all the previous discussions, so apologies if
this idea has already been considered and dismissed.

Currently, WaitForReplicationWorkerAttach() and
logicalrep_worker_stop_internal() wait on a latch after verifying the
desired state. This ensures that even if there are spurious or missed
wakeups, they won't cause issues. In contrast, ApplyLauncherMain()
enters a latch wait without checking the desired state
first. Consequently, if another process sets the latch to wake up the
main loop while the former two functions are waiting, that wakeup
could be missed. If my understanding is correct, the problem lies in
ApplyLauncherMain() not checking the expected state before beginning
to wait on the latch. There is no issue with waiting if the state
hasn't been satisfied yet.

So, I propose that ApplyLauncherMain() should check the condition that
triggers a main loop wakeup before calling WaitLatch(). To do this, we
could add a flag in LogicalRepCtxStruct to signal that the main loop
has immediate tasks to handle. ApplyLauncherWakeup() would set this
flag before sending SIGUSR1. In turn, ApplyLauncherMain() would check
this flag before calling WaitLatch() and skip the WaitLatch() call if
the flag is set.

I think this approach could solve the issue without adding
complexity. What do you think?

I agree that this approach is simpler than the other approach. How
about something like the attached patch to handle this?

Regards,
Vignesh

Attachments:

v3-0001-Resolve-a-problem-where-detection-of-concurrent-s.patchtext/x-patch; charset=US-ASCII; name=v3-0001-Resolve-a-problem-where-detection-of-concurrent-s.patchDownload
From 3efa4ea7dff937f60d16664ec4e3a79f798a72ab Mon Sep 17 00:00:00 2001
From: Vignesh C <vignesh21@gmail.com>
Date: Wed, 4 Sep 2024 16:39:07 +0530
Subject: [PATCH v3] Resolve a problem where detection of concurrent
 subscription creation was skipped.

The current implementation uses a single latch for multiple purposes:
a) attaching worker processes, b) handling worker process exits, and
c) creating subscriptions. This overlap causes issues in concurrent scenarios,
such as when the launcher starts a new apply worker and is waiting for it to
attach, while simultaneously a new subscription creation request arrives.

This commit addresses the issue by adding a new launcher_reload_sub flag
to the LogicalRepCtxStruct. Although the latch will be reset once the
worker is attached, the launcher can still detect concurrent operations
using this flag. The flag will be activated during new subscription
creation or modification, as well as when an apply worker exits. The
launcher will then check this flag in its main loop, allowing it to
reload the subscription list and initiate the necessary apply workers.
---
 src/backend/replication/logical/launcher.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index e5fdca8bbf..92f1253b39 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -62,6 +62,9 @@ typedef struct LogicalRepCtxStruct
 	dsa_handle	last_start_dsa;
 	dshash_table_handle last_start_dsh;
 
+	/* Flag to reload the subscription list and start worker if required */
+	bool launcher_reload_sub;
+
 	/* Background workers. */
 	LogicalRepWorker workers[FLEXIBLE_ARRAY_MEMBER];
 } LogicalRepCtxStruct;
@@ -1118,7 +1121,10 @@ static void
 ApplyLauncherWakeup(void)
 {
 	if (LogicalRepCtx->launcher_pid != 0)
+	{
+		LogicalRepCtx->launcher_reload_sub = true;
 		kill(LogicalRepCtx->launcher_pid, SIGUSR1);
+	}
 }
 
 /*
@@ -1164,6 +1170,8 @@ ApplyLauncherMain(Datum main_arg)
 									   ALLOCSET_DEFAULT_SIZES);
 		oldctx = MemoryContextSwitchTo(subctx);
 
+		LogicalRepCtx->launcher_reload_sub = false;
+
 		/* Start any missing workers for enabled subscriptions. */
 		sublist = get_subscription_list();
 		foreach(lc, sublist)
@@ -1220,6 +1228,9 @@ ApplyLauncherMain(Datum main_arg)
 		/* Clean the temporary memory. */
 		MemoryContextDelete(subctx);
 
+		if (LogicalRepCtx->launcher_reload_sub)
+			continue;
+
 		/* Wait for more work. */
 		rc = WaitLatch(MyLatch,
 					   WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
-- 
2.34.1

#19Heikki Linnakangas
hlinnaka@iki.fi
In reply to: vignesh C (#18)
Re: Improving the latch handling between logical replication launcher and worker processes.

On 04/09/2024 14:24, vignesh C wrote:

On Wed, 4 Sept 2024 at 08:32, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:

I think this approach could solve the issue without adding
complexity. What do you think?

I agree that this approach is more simple than the other approach. How
about something like the attached patch to handle the same.

I haven't looked at these new patches from the last few days, but please
also note the work at
/messages/by-id/476672e7-62f1-4cab-a822-f3a8e949dd3f@iki.fi.
If those "interrupts" patches are committed, this is pretty
straightforward to fix by using a separate interrupt bit for this, as
the patch on that thread does.

--
Heikki Linnakangas
Neon (https://neon.tech)

#20Alexander Lakhin
exclusion@gmail.com
In reply to: Heikki Linnakangas (#19)
Re: Improving the latch handling between logical replication launcher and worker processes.

Hello,

04.09.2024 16:53, Heikki Linnakangas wrote:

On 04/09/2024 14:24, vignesh C wrote:

I agree that this approach is more simple than the other approach. How
about something like the attached patch to handle the same.

I haven't looked at these new patches from the last few days, but please also note the work at
/messages/by-id/476672e7-62f1-4cab-a822-f3a8e949dd3f@iki.fi. If those "interrupts" patches are
committed, this is pretty straightforward to fix by using a separate interrupt bit for this, as the patch on that
thread does.

I'd also like to add that this issue leads to buildfarm test failures,
because of the race condition between
#define DEFAULT_NAPTIME_PER_CYCLE 180000L
and
$timeout_default = 180

That is, in the situation described above ("the apply worker does not
get started for the new subscription created immediately and gets
started after the timeout of 180 seconds"), 014_binary.pl can fail if
wait_for_log()'s 180-second timeout elapses sooner:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=snakefly&dt=2025-02-09%2011%3A45%3A05

I reproduced this failure locally when running 50 014_binary tests in
parallel and got failures on iterations 4, 14, 10.

But with PG_TEST_TIMEOUT_DEFAULT=190, 30 iterations passed for me (8 of
them took 180+ seconds).

Best regards,
Alexander Lakhin
Neon (https://neon.tech)