Logrep launcher race conditions leading to slow tests

Started by Tom Lane, 7 months ago, 8 messages
#1 Tom Lane
tgl@sss.pgh.pa.us
1 attachment(s)

I've been annoyed for awhile because, while a parallel check-world
run usually takes a bit over a minute on my machine, sometimes it
takes between three and four minutes. I was finally able to
track down what is happening, and it's this: sometimes one or
another of the src/test/subscription tests takes an extra three
minutes because the logical replication launcher is sleeping
instead of launching the next task. It eventually reaches its
hard-wired maximum wait of DEFAULT_NAPTIME_PER_CYCLE (3min),
wakes up and notices it has something to do, and then we're
on our merry way again. I'm not sure how often this is a problem
in the real world, but it happens often enough to be annoying
during development.

There are two distinct bugs involved:

1. WaitForReplicationWorkerAttach sometimes has to clear a process
latch event so that it can keep waiting for the worker to launch.
It neglects to set the latch again, allowing ApplyLauncherMain
to miss events (see the sketch below).

2. ApplyLauncherMain ignores a failure return from
logicalrep_worker_launch, which is bad because (unless it has
another worker launch pending) it will then sleep for
DEFAULT_NAPTIME_PER_CYCLE before reconsidering. What it ought to do
is try again after wal_retrieve_retry_interval. This situation can
arise in resource-exhaustion cases (cf. the early exits in
logicalrep_worker_launch), but what's more likely in the regression
tests is that the worker stops due to some error condition before
WaitForReplicationWorkerAttach sees it attached, which is then duly
reported as a failure.
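
To make the interaction behind point 1 concrete, here is a heavily
simplified sketch (not the actual code; launch_pending_workers() is
just a stand-in for the launcher's per-subscription work):

    /* ApplyLauncherMain, simplified: the launcher's outer wait loop */
    for (;;)
    {
        launch_pending_workers();   /* ends up in WaitForReplicationWorkerAttach */

        /* If the inner loop below already consumed the latch event, this
         * sleeps for the full wait_time (up to DEFAULT_NAPTIME_PER_CYCLE). */
        (void) WaitLatch(MyLatch,
                         WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                         wait_time, WAIT_EVENT_LOGICAL_LAUNCHER_MAIN);
        ResetLatch(MyLatch);
    }

    /* WaitForReplicationWorkerAttach, simplified: the inner wait loop */
    for (;;)
    {
        /* ... check whether the worker has attached or died ... */
        rc = WaitLatch(MyLatch,
                       WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                       10L, WAIT_EVENT_BGWORKER_STARTUP);
        if (rc & WL_LATCH_SET)
        {
            ResetLatch(MyLatch);    /* consumes a wakeup that may have been
                                     * meant for the outer loop ... */
            CHECK_FOR_INTERRUPTS(); /* ... but only interrupt conditions get
                                     * serviced here; without a SetLatch()
                                     * before returning, the event is lost */
        }
    }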

It's possible to make the test slowness extremely reproducible
with this change, which widens the race condition window for
both problems:

diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index 1c3c051403d..724e82bcdc1 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -214,7 +214,7 @@ WaitForReplicationWorkerAttach(LogicalRepWorker *worker,
 		 */
 		rc = WaitLatch(MyLatch,
 					   WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
-					   10L, WAIT_EVENT_BGWORKER_STARTUP);
+					   1000L, WAIT_EVENT_BGWORKER_STARTUP);

 		if (rc & WL_LATCH_SET)
 		{

I don't recommend that as a permanent change, but it's helpful
for testing the attached patch.

In the attached, I made two other non-cosmetic changes:

3. In WaitForReplicationWorkerAttach, capture worker->in_use
before not after releasing LogicalRepWorkerLock. Maybe there
is a reason why that's not a dangerous race condition, but
it sure is un-obvious to me.

4. In process_syncing_tables_for_apply (the other caller of
logicalrep_worker_launch), it seems okay to ignore the
result of logicalrep_worker_launch, but I think it should
fill hentry->last_start_time before not after the call.
Otherwise we might be changing a hashtable entry that's
no longer relevant to this worker. I'm not sure exactly
where the failed worker will be cleaned-up-after, but
it could very easily be out of the system entirely before
logicalrep_worker_launch returns.

Barring objections, I plan to apply and back-patch this.

regards, tom lane

Attachments:

fix-logrep-launcher-race-conditions.patch (text/x-diff; charset=us-ascii)
diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index 1c3c051403d..14d8efbd25b 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -175,12 +175,14 @@ WaitForReplicationWorkerAttach(LogicalRepWorker *worker,
 							   uint16 generation,
 							   BackgroundWorkerHandle *handle)
 {
-	BgwHandleStatus status;
-	int			rc;
+	bool		result = false;
+	bool		dropped_latch = false;
 
 	for (;;)
 	{
+		BgwHandleStatus status;
 		pid_t		pid;
+		int			rc;
 
 		CHECK_FOR_INTERRUPTS();
 
@@ -189,8 +191,9 @@ WaitForReplicationWorkerAttach(LogicalRepWorker *worker,
 		/* Worker either died or has started. Return false if died. */
 		if (!worker->in_use || worker->proc)
 		{
+			result = worker->in_use;
 			LWLockRelease(LogicalRepWorkerLock);
-			return worker->in_use;
+			break;
 		}
 
 		LWLockRelease(LogicalRepWorkerLock);
@@ -205,7 +208,7 @@ WaitForReplicationWorkerAttach(LogicalRepWorker *worker,
 			if (generation == worker->generation)
 				logicalrep_worker_cleanup(worker);
 			LWLockRelease(LogicalRepWorkerLock);
-			return false;
+			break;				/* result is already false */
 		}
 
 		/*
@@ -220,8 +223,18 @@ WaitForReplicationWorkerAttach(LogicalRepWorker *worker,
 		{
 			ResetLatch(MyLatch);
 			CHECK_FOR_INTERRUPTS();
+			dropped_latch = true;
 		}
 	}
+
+	/*
+	 * If we had to clear a latch event in order to wait, be sure to restore
+	 * it before exiting.  Otherwise caller may miss events.
+	 */
+	if (dropped_latch)
+		SetLatch(MyLatch);
+
+	return result;
 }
 
 /*
@@ -1194,10 +1207,21 @@ ApplyLauncherMain(Datum main_arg)
 				(elapsed = TimestampDifferenceMilliseconds(last_start, now)) >= wal_retrieve_retry_interval)
 			{
 				ApplyLauncherSetWorkerStartTime(sub->oid, now);
-				logicalrep_worker_launch(WORKERTYPE_APPLY,
-										 sub->dbid, sub->oid, sub->name,
-										 sub->owner, InvalidOid,
-										 DSM_HANDLE_INVALID);
+				if (!logicalrep_worker_launch(WORKERTYPE_APPLY,
+											  sub->dbid, sub->oid, sub->name,
+											  sub->owner, InvalidOid,
+											  DSM_HANDLE_INVALID))
+				{
+					/*
+					 * We get here either if we failed to launch a worker
+					 * (perhaps for resource-exhaustion reasons) or if we
+					 * launched one but it immediately quit.  Either way, it
+					 * seems appropriate to try again after
+					 * wal_retrieve_retry_interval.
+					 */
+					wait_time = Min(wait_time,
+									wal_retrieve_retry_interval);
+				}
 			}
 			else
 			{
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 8e1e8762f62..b679156e3d8 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -603,6 +603,12 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 						TimestampDifferenceExceeds(hentry->last_start_time, now,
 												   wal_retrieve_retry_interval))
 					{
+						/*
+						 * Set the last_start_time even if we fail to start
+						 * the worker, so that we won't retry until
+						 * wal_retrieve_retry_interval has elapsed.
+						 */
+						hentry->last_start_time = now;
 						logicalrep_worker_launch(WORKERTYPE_TABLESYNC,
 												 MyLogicalRepWorker->dbid,
 												 MySubscription->oid,
@@ -610,7 +616,6 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 												 MyLogicalRepWorker->userid,
 												 rstate->relid,
 												 DSM_HANDLE_INVALID);
-						hentry->last_start_time = now;
 					}
 				}
 			}
#2 Amit Kapila
amit.kapila16@gmail.com
In reply to: Tom Lane (#1)
Re: Logrep launcher race conditions leading to slow tests

On Tue, Jun 24, 2025 at 5:26 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

I've been annoyed for awhile because, while a parallel check-world
run usually takes a bit over a minute on my machine, sometimes it
takes between three and four minutes. I was finally able to
track down what is happening, and it's this: sometimes one or
another of the src/test/subscription tests takes an extra three
minutes because the logical replication launcher is sleeping
instead of launching the next task. It eventually reaches its
hard-wired maximum wait of DEFAULT_NAPTIME_PER_CYCLE (3min),
wakes up and notices it has something to do, and then we're
on our merry way again. I'm not sure how often this is a problem
in the real world, but it happens often enough to be annoying
during development.

There are two distinct bugs involved:

1. WaitForReplicationWorkerAttach sometimes has to clear a process
latch event so that it can keep waiting for the worker to launch.
It neglects to set the latch again, allowing ApplyLauncherMain
to miss events.

There was a previous discussion about fixing this behavior. Heikki has
proposed a similar fix for this, but at the caller; see the patch
attached in email [1]. The comments in his patch are more explicit
about the kind of events that are missed. Then there is also a
discussion of solving it by having two latches, one for worker exits
and the other for subscription changes. The slight advantage of the
two-latch approach is probably that it avoids extra looping over the
subscriptions when nothing has changed. Either your proposal or
Heikki's seems good enough to fix this issue.

2. ApplyLauncherMain ignores a failure return from
logicalrep_worker_launch, which is bad because (unless it has
another worker launch pending) it will then sleep for
DEFAULT_NAPTIME_PER_CYCLE before reconsidering. What it ought to do
is try again after wal_retrieve_retry_interval. This situation can
arise in resource-exhaustion cases (cf. the early exits in
logicalrep_worker_launch), but what's more likely in the regression
tests is that the worker stops due to some error condition before
WaitForReplicationWorkerAttach sees it attached, which is then duly
reported as a failure.

LGTM.

It's possible to make the test slowness extremely reproducible
with this change, which widens the race condition window for
both problems:

diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index 1c3c051403d..724e82bcdc1 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -214,7 +214,7 @@ WaitForReplicationWorkerAttach(LogicalRepWorker *worker,
*/
rc = WaitLatch(MyLatch,
WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
-                                          10L, WAIT_EVENT_BGWORKER_STARTUP);
+                                          1000L, WAIT_EVENT_BGWORKER_STARTUP);

if (rc & WL_LATCH_SET)
{

I don't recommend that as a permanent change, but it's helpful
for testing the attached patch.

In the attached, I made two other non-cosmetic changes:

3. In WaitForReplicationWorkerAttach, capture worker->in_use
before not after releasing LogicalRepWorkerLock. Maybe there
is a reason why that's not a dangerous race condition, but
it sure is un-obvious to me.

I also can't think of a reason to use the current coding. Your change
looks good to me.

4. In process_syncing_tables_for_apply (the other caller of
logicalrep_worker_launch), it seems okay to ignore the
result of logicalrep_worker_launch, but I think it should
fill hentry->last_start_time before not after the call.

With this, won't we end up retrying to launch the worker sooner if the
launch took time, but still failed to launch the worker?

[1]: /messages/by-id/ff0663d9-8011-420f-a169-efbf57327cb5@iki.fi

--
With Regards,
Amit Kapila.

#3 Ashutosh Bapat
ashutosh.bapat.oss@gmail.com
In reply to: Tom Lane (#1)
Re: Logrep launcher race conditions leading to slow tests

On Tue, Jun 24, 2025 at 5:26 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

I've been annoyed for awhile because, while a parallel check-world
run usually takes a bit over a minute on my machine, sometimes it
takes between three and four minutes. I was finally able to
track down what is happening, and it's this: sometimes one or
another of the src/test/subscription tests takes an extra three
minutes because the logical replication launcher is sleeping
instead of launching the next task. It eventually reaches its
hard-wired maximum wait of DEFAULT_NAPTIME_PER_CYCLE (3min),
wakes up and notices it has something to do, and then we're
on our merry way again. I'm not sure how often this is a problem
in the real world, but it happens often enough to be annoying
during development.

There are two distinct bugs involved:

1. WaitForReplicationWorkerAttach sometimes has to clear a process
latch event so that it can keep waiting for the worker to launch.
It neglects to set the latch again, allowing ApplyLauncherMain
to miss events.

Agreed.

2. ApplyLauncherMain ignores a failure return from
logicalrep_worker_launch, which is bad because (unless it has
another worker launch pending) it will then sleep for
DEFAULT_NAPTIME_PER_CYCLE before reconsidering. What it ought to do
is try again after wal_retrieve_retry_interval. This situation can
arise in resource-exhaustion cases (cf. the early exits in
logicalrep_worker_launch), but what's more likely in the regression
tests is that the worker stops due to some error condition before
WaitForReplicationWorkerAttach sees it attached, which is then duly
reported as a failure.

Agreed.

It's possible to make the test slowness extremely reproducible
with this change, which widens the race condition window for
both problems:

diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index 1c3c051403d..724e82bcdc1 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -214,7 +214,7 @@ WaitForReplicationWorkerAttach(LogicalRepWorker *worker,
*/
rc = WaitLatch(MyLatch,
WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
-                                          10L, WAIT_EVENT_BGWORKER_STARTUP);
+                                          1000L, WAIT_EVENT_BGWORKER_STARTUP);

if (rc & WL_LATCH_SET)
{

I don't recommend that as a permanent change, but it's helpful
for testing the attached patch.

Thanks. That makes reproduction easier.

In the attached, I made two other non-cosmetic changes:

3. In WaitForReplicationWorkerAttach, capture worker->in_use
before not after releasing LogicalRepWorkerLock. Maybe there
is a reason why that's not a dangerous race condition, but
it sure is un-obvious to me.

Looking at the code, this does seem dangerous. We take that early-return
branch when worker->in_use is false, which means the slot is free. But
by the time we release LogicalRepWorkerLock and re-read
worker->in_use, the slot may have been filled by some other worker, so
we would return true instead of false, and the caller would falsely
assume that its worker was successfully launched. That by itself might
cause the launcher to not start the worker again. That's possibly rare
but not impossible.

+1 for this change.

4. In process_syncing_tables_for_apply (the other caller of
logicalrep_worker_launch), it seems okay to ignore the
result of logicalrep_worker_launch, but I think it should
fill hentry->last_start_time before not after the call.
Otherwise we might be changing a hashtable entry that's
no longer relevant to this worker.

A hash entry is associated with a table, not the worker. If the
worker fails to launch, the entry records the time when a worker
launch for that table was attempted, so that the next attempt can be
well-spaced in time. I am not able to follow your last statement:
what is the entry's relevance to the worker?

But your change makes this code similar to ApplyLauncherMain(), which
deals with subscriptions. +1 for the consistency.

--
Best Wishes,
Ashutosh Bapat

#4 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Amit Kapila (#2)
Re: Logrep launcher race conditions leading to slow tests

Amit Kapila <amit.kapila16@gmail.com> writes:

On Tue, Jun 24, 2025 at 5:26 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

1. WaitForReplicationWorkerAttach sometimes has to clear a process
latch event so that it can keep waiting for the worker to launch.
It neglects to set the latch again, allowing ApplyLauncherMain
to miss events.

There was a previous discussion to fix this behavior. Heikki has
proposed a similar fix for this, but at the caller. See the patch
attached in email [1].

Ah, thank you for that link. I vaguely recalled that we'd discussed
this strange behavior before, but I did not realize that anyone had
diagnosed a cause. I don't much like any of the patches proposed
in that thread though --- they seem overcomplicated or off-point.

I do not think we can switch to having two latches here. The
only reason WaitForReplicationWorkerAttach needs to pay attention
to the process latch at all is that it needs to service
CHECK_FOR_INTERRUPTS() conditions in a timely fashion, which is
also true in the ApplyLauncherMain loop. We can't realistically
expect other processes to signal different latches depending on
where the launcher is waiting, so those cases have to be triggered
by the same latch.

However, WaitForReplicationWorkerAttach can't service any
latch-setting conditions other than those managed by
CHECK_FOR_INTERRUPTS(). So if CHECK_FOR_INTERRUPTS() returns,
we have some other triggering condition, which is the business
of some outer code level to deal with. Having it re-set the latch
to allow that to happen promptly after it returns seems like a
pretty straightforward answer to me.

I do note Heikki's concern about whether we could get rid of the
sleep-and-retry looping in WaitForReplicationWorkerAttach in
favor of getting signaled somehow, and I agree with that as a
possible future improvement. But I don't especially see why that
demands another latch; in fact, unless we want to teach WaitLatch
to be able to wait on more than one latch, it *can't* be a
separate latch from the one that receives CHECK_FOR_INTERRUPTS()
conditions.

4. In process_syncing_tables_for_apply (the other caller of
logicalrep_worker_launch), it seems okay to ignore the
result of logicalrep_worker_launch, but I think it should
fill hentry->last_start_time before not after the call.

With this, won't we end up retrying to launch the worker sooner if the
launch took time, but still failed to launch the worker?

That code already does update last_start_time unconditionally, and
I think that's the right behavior for the same reason that it's
right for ApplyLauncherMain to do ApplyLauncherSetWorkerStartTime
whether or not logicalrep_worker_launch succeeds. If the worker
launch fails, we don't want to retry instantly, we want to wait
wal_retrieve_retry_interval before retrying. My desire to change
this code is just based on the idea that it's not clear what else
if anything looks at this hashtable, and by the time that
logicalrep_worker_launch returns the system state could be a lot
different. (For instance, the worker could have started and
failed already.) So, just as in ApplyLauncherMain, I'd rather
store the start time before calling logicalrep_worker_launch.

BTW, it strikes me that if we're going to leave
process_syncing_tables_for_apply() ignoring the result of
logicalrep_worker_launch, it'd be smart to insert an explicit
(void) cast to show that that's intentional. Otherwise Coverity
is likely to complain about how we're ignoring the result in
one place and not the other.
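
That is, roughly (untested, just to show the shape of it):

	(void) logicalrep_worker_launch(WORKERTYPE_TABLESYNC,
									MyLogicalRepWorker->dbid,
									MySubscription->oid,
									MySubscription->name,
									MyLogicalRepWorker->userid,
									rstate->relid,
									DSM_HANDLE_INVALID);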

regards, tom lane

#5 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Ashutosh Bapat (#3)
Re: Logrep launcher race conditions leading to slow tests

Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> writes:

On Tue, Jun 24, 2025 at 5:26 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

4. In process_syncing_tables_for_apply (the other caller of
logicalrep_worker_launch), it seems okay to ignore the
result of logicalrep_worker_launch, but I think it should
fill hentry->last_start_time before not after the call.
Otherwise we might be changing a hashtable entry that's
no longer relevant to this worker.

A hash entry is associated with a table, not the worker. If the
worker fails to launch, the entry records the time when a worker
launch for that table was attempted, so that the next attempt can be
well-spaced in time. I am not able to follow your last statement:
what is the entry's relevance to the worker?

But your change makes this code similar to ApplyLauncherMain(), which
deals with subscriptions. +1 for the consistency.

Yeah, mainly I want to make it look more like ApplyLauncherMain().
It's true right now that nothing outside this process will touch that
hash table, so it doesn't matter which way we do it. But if we were
to switch that table to being shared state, this'd be an unsafe order
of operations for the same reasons it'd be wrong to do it like that in
ApplyLauncherMain().

regards, tom lane

#6 Amit Kapila
amit.kapila16@gmail.com
In reply to: Tom Lane (#4)
Re: Logrep launcher race conditions leading to slow tests

On Tue, Jun 24, 2025 at 9:53 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Amit Kapila <amit.kapila16@gmail.com> writes:

On Tue, Jun 24, 2025 at 5:26 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

1. WaitForReplicationWorkerAttach sometimes has to clear a process
latch event so that it can keep waiting for the worker to launch.
It neglects to set the latch again, allowing ApplyLauncherMain
to miss events.

There was a previous discussion to fix this behavior. Heikki has
proposed a similar fix for this, but at the caller. See the patch
attached in email [1].

Ah, thank you for that link. I vaguely recalled that we'd discussed
this strange behavior before, but I did not realize that anyone had
diagnosed a cause. I don't much like any of the patches proposed
in that thread though --- they seem overcomplicated or off-point.

I do not think we can switch to having two latches here. The
only reason WaitForReplicationWorkerAttach needs to pay attention
to the process latch at all is that it needs to service
CHECK_FOR_INTERRUPTS() conditions in a timely fashion, which is
also true in the ApplyLauncherMain loop. We can't realistically
expect other processes to signal different latches depending on
where the launcher is waiting, so those cases have to be triggered
by the same latch.

However, WaitForReplicationWorkerAttach can't service any
latch-setting conditions other than those managed by
CHECK_FOR_INTERRUPTS(). So if CHECK_FOR_INTERRUPTS() returns,
we have some other triggering condition, which is the business
of some outer code level to deal with. Having it re-set the latch
to allow that to happen promptly after it returns seems like a
pretty straightforward answer to me.

I do note Heikki's concern about whether we could get rid of the
sleep-and-retry looping in WaitForReplicationWorkerAttach in
favor of getting signaled somehow, and I agree with that as a
possible future improvement. But I don't especially see why that
demands another latch; in fact, unless we want to teach WaitLatch
to be able to wait on more than one latch, it *can't* be a
separate latch from the one that receives CHECK_FOR_INTERRUPTS()
conditions.

I agree that we don't need another latch, and I think your patch
solves it in the best possible way. The only minor point, if you
think it makes sense to change, is the following comment:
+ /*
+ * If we had to clear a latch event in order to wait, be sure to restore
+ * it before exiting.  Otherwise caller may miss events.
+ */
+ if (dropped_latch)
..

The part of the comment "Otherwise caller may miss events." is clear
to me, but I'm not sure it would be equally easy for everyone to
understand what other events the comment is talking about here.
Something along the lines of what Heikki wrote, "We use the same
latch to be signalled about subscription changes and workers exiting,
so we might have missed some notifications, if those events happened
concurrently.", is more specific.
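
Perhaps something like this (just illustrating the kind of wording,
not tested):

	/*
	 * If we had to clear a latch event in order to wait, be sure to restore
	 * it before exiting.  Otherwise the caller may miss notifications that
	 * arrived via the same latch while we were waiting, such as subscription
	 * changes or workers exiting.
	 */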

4. In process_syncing_tables_for_apply (the other caller of
logicalrep_worker_launch), it seems okay to ignore the
result of logicalrep_worker_launch, but I think it should
fill hentry->last_start_time before not after the call.

With this, won't we end up retrying to launch the worker sooner if the
launch took time, but still failed to launch the worker?

That code already does update last_start_time unconditionally, and
I think that's the right behavior for the same reason that it's
right for ApplyLauncherMain to do ApplyLauncherSetWorkerStartTime
whether or not logicalrep_worker_launch succeeds. If the worker
launch fails, we don't want to retry instantly, we want to wait
wal_retrieve_retry_interval before retrying. My desire to change
this code is just based on the idea that it's not clear what else
if anything looks at this hashtable, and by the time that
logicalrep_worker_launch returns the system state could be a lot
different. (For instance, the worker could have started and
failed already.) So, just as in ApplyLauncherMain, I'd rather
store the start time before calling logicalrep_worker_launch.

BTW, it strikes me that if we're going to leave
process_syncing_tables_for_apply() ignoring the result of
logicalrep_worker_launch, it'd be smart to insert an explicit
(void) cast to show that that's intentional. Otherwise Coverity
is likely to complain about how we're ignoring the result in
one place and not the other.

Sounds reasonable.

--
With Regards,
Amit Kapila.

#7 Dilip Kumar
dilipbalaut@gmail.com
In reply to: Tom Lane (#4)
Re: Logrep launcher race conditions leading to slow tests

On Tue, Jun 24, 2025 at 9:53 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Amit Kapila <amit.kapila16@gmail.com> writes:

On Tue, Jun 24, 2025 at 5:26 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

1. WaitForReplicationWorkerAttach sometimes has to clear a process
latch event so that it can keep waiting for the worker to launch.
It neglects to set the latch again, allowing ApplyLauncherMain
to miss events.

There was a previous discussion to fix this behavior. Heikki has
proposed a similar fix for this, but at the caller. See the patch
attached in email [1].

Ah, thank you for that link. I vaguely recalled that we'd discussed
this strange behavior before, but I did not realize that anyone had
diagnosed a cause. I don't much like any of the patches proposed
in that thread though --- they seem overcomplicated or off-point.

I do not think we can switch to having two latches here. The
only reason WaitForReplicationWorkerAttach needs to pay attention
to the process latch at all is that it needs to service
CHECK_FOR_INTERRUPTS() conditions in a timely fashion, which is
also true in the ApplyLauncherMain loop. We can't realistically
expect other processes to signal different latches depending on
where the launcher is waiting, so those cases have to be triggered
by the same latch.

However, WaitForReplicationWorkerAttach can't service any
latch-setting conditions other than those managed by
CHECK_FOR_INTERRUPTS(). So if CHECK_FOR_INTERRUPTS() returns,
we have some other triggering condition, which is the business
of some outer code level to deal with. Having it re-set the latch
to allow that to happen promptly after it returns seems like a
pretty straightforward answer to me.

Yeah, that makes sense; we cannot simply wait on another latch that
is not managed by CHECK_FOR_INTERRUPTS(), and IMHO the proposed patch
is a simple way of resolving this issue.

--
Regards,
Dilip Kumar
Google

#8 Ashutosh Bapat
ashutosh.bapat.oss@gmail.com
In reply to: Tom Lane (#5)
Re: Logrep launcher race conditions leading to slow tests

On Tue, Jun 24, 2025 at 9:53 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

BTW, it strikes me that if we're going to leave
process_syncing_tables_for_apply() ignoring the result of
logicalrep_worker_launch, it'd be smart to insert an explicit
(void) cast to show that that's intentional. Otherwise Coverity
is likely to complain about how we're ignoring the result in
one place and not the other.

+1.

On Tue, Jun 24, 2025 at 10:03 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> writes:

On Tue, Jun 24, 2025 at 5:26 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

4. In process_syncing_tables_for_apply (the other caller of
logicalrep_worker_launch), it seems okay to ignore the
result of logicalrep_worker_launch, but I think it should
fill hentry->last_start_time before not after the call.
Otherwise we might be changing a hashtable entry that's
no longer relevant to this worker.

A hash entry is associated with a table, not the worker. If the
worker fails to launch, the entry records the time when a worker
launch for that table was attempted, so that the next attempt can be
well-spaced in time. I am not able to follow your last statement:
what is the entry's relevance to the worker?

But your change makes this code similar to ApplyLauncherMain(), which
deals with subscriptions. +1 for the consistency.

Yeah, mainly I want to make it look more like ApplyLauncherMain().
It's true right now that nothing outside this process will touch that
hash table, so it doesn't matter which way we do it. But if we were
to switch that table to being shared state, this'd be an unsafe order
of operations for the same reasons it'd be wrong to do it like that in
ApplyLauncherMain().

Makes sense to fix it proactively.

--
Best Wishes,
Ashutosh Bapat