subscription/015_stream sometimes breaks

Started by Thomas Munro over 2 years ago · 17 messages
#1 Thomas Munro
thomas.munro@gmail.com

Hi,

A couple of times a day, cfbot reports an error like this:

https://cirrus-ci.com/task/6424286882168832

I didn't study it closely but it looks like there might be a second
deadlock, after the one that is expected by the test? Examples from
the past couple of weeks:

cfbot=> select test.task_id, task.task_name, task.created as time
          from test join task using (task_id)
          where suite = 'subscription' and name = '015_stream'
            and result = 'ERROR' and task.created > now() - interval '14 days'
          order by task.created desc;
task_id | task_name
------------------+------------------------------------------------
6600867550330880 | Windows - Server 2019, VS 2019 - Meson & ninja
6222795470798848 | Windows - Server 2019, VS 2019 - Meson & ninja
6162088322662400 | macOS - Ventura - Meson
5014781862608896 | Windows - Server 2019, VS 2019 - Meson & ninja
6424286882168832 | Windows - Server 2019, VS 2019 - Meson & ninja
6683705222103040 | Windows - Server 2019, VS 2019 - Meson & ninja
5881665076068352 | Windows - Server 2019, VS 2019 - Meson & ninja
5865929054093312 | Windows - Server 2019, VS 2019 - Meson & ninja
5855144995192832 | Windows - Server 2019, VS 2019 - Meson & ninja
6071567994585088 | Windows - Server 2019, VS 2019 - Meson & ninja
5311312343859200 | Windows - Server 2019, VS 2019 - Meson & ninja
4986100071006208 | FreeBSD - 13 - Meson
6302301388800000 | Windows - Server 2019, VS 2019 - Meson & ninja
4554627119579136 | Windows - Server 2019, VS 2019 - Meson & ninja
6106090807492608 | Windows - Server 2019, VS 2019 - Meson & ninja
5190113534148608 | Windows - Server 2019, VS 2019 - Meson & ninja
6452324697112576 | Windows - Server 2019, VS 2019 - Meson & ninja
4610228927332352 | Windows - Server 2019, VS 2019 - Meson & ninja
4928567608344576 | Windows - Server 2019, VS 2019 - Meson & ninja
5705451157848064 | FreeBSD - 13 - Meson
5952066133164032 | Windows - Server 2019, VS 2019 - Meson & ninja
5341101565935616 | Windows - Server 2019, VS 2019 - Meson & ninja
6751165837213696 | Windows - Server 2019, VS 2019 - Meson & ninja
4624168109473792 | Windows - Server 2019, VS 2019 - Meson & ninja
6730863963013120 | Windows - Server 2019, VS 2019 - Meson & ninja
6174140269330432 | Windows - Server 2019, VS 2019 - Meson & ninja
4637318561136640 | Windows - Server 2019, VS 2019 - Meson & ninja
4535300303618048 | Windows - Server 2019, VS 2019 - Meson & ninja
5672693542944768 | FreeBSD - 13 - Meson
6087381225308160 | Windows - Server 2019, VS 2019 - Meson & ninja
6098413217906688 | Windows - Server 2019, VS 2019 - Meson & ninja
6130601380544512 | Windows - Server 2019, VS 2019 - Meson & ninja
5546054284738560 | Windows - Server 2019, VS 2019 - Meson & ninja
6674258676416512 | Windows - Server 2019, VS 2019 - Meson & ninja
4571976740634624 | FreeBSD - 13 - Meson
6566515328155648 | Windows - Server 2019, VS 2019 - Meson & ninja
6576879084240896 | Windows - Server 2019, VS 2019 - Meson & ninja
5295804827566080 | Windows - Server 2019, VS 2019 - Meson & ninja
6426387188285440 | Windows - Server 2019, VS 2019 - Meson & ninja
4763275859066880 | Windows - Server 2019, VS 2019 - Meson & ninja
6137227240013824 | Windows - Server 2019, VS 2019 - Meson & ninja
5185063273365504 | Windows - Server 2019, VS 2019 - Meson & ninja
6542656449282048 | Windows - Server 2019, VS 2019 - Meson & ninja
4874919171850240 | Windows - Server 2019, VS 2019 - Meson & ninja
6531290556530688 | Windows - Server 2019, VS 2019 - Meson & ninja
5377848232378368 | FreeBSD - 13 - Meson
6436049925177344 | FreeBSD - 13 - Meson
6057679748071424 | Windows - Server 2019, VS 2019 - Meson & ninja
5534694867992576 | Windows - Server 2019, VS 2019 - Meson & ninja
4831311429369856 | Windows - Server 2019, VS 2019 - Meson & ninja
4704271531245568 | macOS - Ventura - Meson
5297047549509632 | Windows - Server 2019, VS 2019 - Meson & ninja
5836487120388096 | macOS - Ventura - Meson
6527459915464704 | FreeBSD - 13 - Meson
4985483743199232 | FreeBSD - 13 - Meson
4583651082502144 | Linux - Debian Bullseye - Meson
5498444756811776 | FreeBSD - 13 - Meson
5146601035923456 | Windows - Server 2019, VS 2019 - Meson & ninja
5709550989344768 | macOS - Ventura - Meson
6357936616767488 | FreeBSD - 13 - Meson
(60 rows)

#2 Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#1)
Re: subscription/015_stream sometimes breaks

On Wed, Aug 23, 2023 at 8:21 AM Thomas Munro <thomas.munro@gmail.com> wrote:

I didn't study it closely but it looks like there might be a second
deadlock, after the one that is expected by the test? Examples from
the past couple of weeks:

I should add, it's not correlated with the patches that cfbot is
testing, and it's the most frequent failure for which that is the
case.

suite | name | distinct_patches | errors
--------------+------------+------------------+--------
subscription | 015_stream | 47 | 61

#3 Zhijie Hou (Fujitsu)
houzj.fnst@fujitsu.com
In reply to: Thomas Munro (#2)
RE: subscription/015_stream sometimes breaks

On Wednesday, August 23, 2023 4:55 AM Thomas Munro <thomas.munro@gmail.com> wrote:

On Wed, Aug 23, 2023 at 8:21 AM Thomas Munro
<thomas.munro@gmail.com> wrote:

I didn't study it closely but it looks like there might be a second
deadlock, after the one that is expected by the test? Examples from
the past couple of weeks:

I should add, it's not correlated with the patches that cfbot is testing, and it's
the most frequent failure for which that is the case.

suite | name | distinct_patches | errors
--------------+------------+------------------+--------
subscription | 015_stream | 47 | 61

Thanks for reporting!
I am researching the failure and will share my analysis.

Best Regards,
Hou zj

#4 vignesh C
vignesh21@gmail.com
In reply to: Thomas Munro (#2)
Re: subscription/015_stream sometimes breaks

On Wed, 23 Aug 2023 at 02:25, Thomas Munro <thomas.munro@gmail.com> wrote:

On Wed, Aug 23, 2023 at 8:21 AM Thomas Munro <thomas.munro@gmail.com> wrote:

I didn't study it closely but it looks like there might be a second
deadlock, after the one that is expected by the test? Examples from
the past couple of weeks:

I should add, it's not correlated with the patches that cfbot is
testing, and it's the most frequent failure for which that is the
case.

suite | name | distinct_patches | errors
--------------+------------+------------------+--------
subscription | 015_stream | 47 | 61

I had noticed that it is failing because of a segmentation fault:
2023-08-22 19:07:22.403 UTC [3823023][logical replication parallel
worker][4/44:767] FATAL: terminating logical replication worker due
to administrator command
2023-08-22 19:07:22.403 UTC [3823023][logical replication parallel
worker][4/44:767] CONTEXT: processing remote data for replication
origin "pg_16397" during message type "STREAM STOP" in transaction 748
2023-08-22 19:07:22.404 UTC [3819892][postmaster][:0] DEBUG:
unregistering background worker "logical replication parallel apply
worker for subscription 16397"
2023-08-22 19:07:22.404 UTC [3819892][postmaster][:0] LOG: background
worker "logical replication parallel worker" (PID 3823455) exited with
exit code 1
2023-08-22 19:07:22.404 UTC [3819892][postmaster][:0] DEBUG:
unregistering background worker "logical replication parallel apply
worker for subscription 16397"
2023-08-22 19:07:22.404 UTC [3819892][postmaster][:0] LOG: background
worker "logical replication parallel worker" (PID 3823023) exited with
exit code 1
2023-08-22 19:07:22.419 UTC [3819892][postmaster][:0] LOG: background
worker "logical replication apply worker" (PID 3822876) was terminated
by signal 11: Segmentation fault

The stack trace for the same, generated at [1], is:
Core was generated by `postgres: subscriber: logical replication apply
worker for subscription 16397 '.
Program terminated with signal SIGSEGV, Segmentation fault.

warning: Section `.reg-xstate/3822876' in core file too small.
#0 0x00000000007b461e in logicalrep_worker_stop_internal
(worker=<optimized out>, signo=<optimized out>) at
/home/bf/bf-build/dragonet/HEAD/pgsql.build/../pgsql/src/backend/replication/logical/launcher.c:583
583 kill(worker->proc->pid, signo);
#0 0x00000000007b461e in logicalrep_worker_stop_internal
(worker=<optimized out>, signo=<optimized out>) at
/home/bf/bf-build/dragonet/HEAD/pgsql.build/../pgsql/src/backend/replication/logical/launcher.c:583
#1 0x00000000007b565a in logicalrep_worker_detach () at
/home/bf/bf-build/dragonet/HEAD/pgsql.build/../pgsql/src/backend/replication/logical/launcher.c:774
#2 0x00000000007b49ff in logicalrep_worker_onexit (code=<optimized
out>, arg=<optimized out>) at
/home/bf/bf-build/dragonet/HEAD/pgsql.build/../pgsql/src/backend/replication/logical/launcher.c:829
#3 0x00000000008034c5 in shmem_exit (code=<optimized out>) at
/home/bf/bf-build/dragonet/HEAD/pgsql.build/../pgsql/src/backend/storage/ipc/ipc.c:239
#4 0x00000000008033dc in proc_exit_prepare (code=1) at
/home/bf/bf-build/dragonet/HEAD/pgsql.build/../pgsql/src/backend/storage/ipc/ipc.c:194
#5 0x000000000080333d in proc_exit (code=1) at
/home/bf/bf-build/dragonet/HEAD/pgsql.build/../pgsql/src/backend/storage/ipc/ipc.c:107
#6 0x0000000000797068 in StartBackgroundWorker () at
/home/bf/bf-build/dragonet/HEAD/pgsql.build/../pgsql/src/backend/postmaster/bgworker.c:827
#7 0x000000000079f257 in do_start_bgworker (rw=0x284e750) at
/home/bf/bf-build/dragonet/HEAD/pgsql.build/../pgsql/src/backend/postmaster/postmaster.c:5734
#8 0x000000000079b541 in maybe_start_bgworkers () at
/home/bf/bf-build/dragonet/HEAD/pgsql.build/../pgsql/src/backend/postmaster/postmaster.c:5958
#9 0x000000000079cb51 in process_pm_pmsignal () at
/home/bf/bf-build/dragonet/HEAD/pgsql.build/../pgsql/src/backend/postmaster/postmaster.c:5121
#10 0x000000000079b6bb in ServerLoop () at
/home/bf/bf-build/dragonet/HEAD/pgsql.build/../pgsql/src/backend/postmaster/postmaster.c:1769
#11 0x000000000079aaa5 in PostmasterMain (argc=4, argv=<optimized
out>) at /home/bf/bf-build/dragonet/HEAD/pgsql.build/../pgsql/src/backend/postmaster/postmaster.c:1462
#12 0x00000000006d82a0 in main (argc=4, argv=0x27e3fd0) at
/home/bf/bf-build/dragonet/HEAD/pgsql.build/../pgsql/src/backend/main/main.c:198
$1 = {si_signo = 11, si_errno = 0, si_code = 1, _sifields = {_pad =
{64, 0 <repeats 27 times>}, _kill = {si_pid = 64, si_uid = 0}, _timer
= {si_tid = 64, si_overrun = 0, si_sigval = {sival_int = 0, sival_ptr
= 0x0}}, _rt = {si_pid = 64, si_uid = 0, si_sigval = {sival_int = 0,
sival_ptr = 0x0}}, _sigchld = {si_pid = 64, si_uid = 0, si_status = 0,
si_utime = 0, si_stime = 0}, _sigfault = {si_addr = 0x40, _addr_lsb =
0, _addr_bnd = {_lower = 0x0, _upper = 0x0}}, _sigpoll = {si_band =
64, si_fd = 0}, _sigsys = {_call_addr = 0x40, _syscall = 0, _arch =
0}}}

[1]: https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=dragonet&dt=2023-08-22%2018%3A56%3A04&stg=subscription-check

Regards,
Vignesh

#5 Zhijie Hou (Fujitsu)
houzj.fnst@fujitsu.com
In reply to: Zhijie Hou (Fujitsu) (#3)
1 attachment(s)
RE: subscription/015_stream sometimes breaks

-----Original Message-----
From: Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com>
Sent: Wednesday, August 23, 2023 10:27 AM
To: Thomas Munro <thomas.munro@gmail.com>
Cc: Amit Kapila <amit.kapila16@gmail.com>; pgsql-hackers
<pgsql-hackers@postgresql.org>
Subject: RE: subscription/015_stream sometimes breaks

On Wednesday, August 23, 2023 4:55 AM Thomas Munro
<thomas.munro@gmail.com> wrote:

On Wed, Aug 23, 2023 at 8:21 AM Thomas Munro

<thomas.munro@gmail.com>

wrote:

I didn't study it closely but it looks like there might be a second
deadlock, after the one that is expected by the test? Examples from
the past couple of weeks:

I should add, it's not correlated with the patches that cfbot is
testing, and it's the most frequent failure for which that is the case.

suite | name | distinct_patches | errors
--------------+------------+------------------+--------
subscription | 015_stream | 47 | 61

Thanks for reporting!
I am researching the failure and will share my analysis.

Hi,

After an off-list discussion with Amit, we figured out the reason.
From the crash log, I can see that the apply worker crashed when accessing
worker->proc, so I think it's because worker->proc had already been released.

577: /* Now terminate the worker ... */
578: kill(worker->proc->pid, signo);
579:
580: /* ... and wait for it to die. */

Normally, this should not happen because we take a lock on LogicalRepWorkerLock
when shutting down all the parallel workers [1], which prevents concurrent
workers from freeing the worker info. But in logicalrep_worker_stop_internal(),
when stopping parallel worker #1, we briefly release the lock, and at that point
it's possible that another parallel worker #2, which reported an ERROR, shuts
down by itself and frees its worker->proc. So when we try to stop that parallel
worker #2 in the next round, we don't realize it has already exited, and thus
access invalid memory (worker->proc).

[1]:
LWLockAcquire(LogicalRepWorkerLock, LW_SHARED);

workers = logicalrep_workers_find(MyLogicalRepWorker->subid, true);
foreach(lc, workers)
{
LogicalRepWorker *w = (LogicalRepWorker *) lfirst(lc);

** if (isParallelApplyWorker(w))
logicalrep_worker_stop_internal(w, SIGTERM);
}
--
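
To make the problematic interleaving concrete, here is a simplified timeline
(a sketch reconstructed from the analysis above, not taken verbatim from the
code or the log):

/* Leader, inside logicalrep_worker_stop_internal() for parallel worker #1: */
LWLockRelease(LogicalRepWorkerLock);    /* lock dropped while waiting for #1 to die */

/* Meanwhile, parallel worker #2 errors out and cleans itself up: */
logicalrep_worker_cleanup(w2);          /* sets w2->proc = NULL, w2->in_use = false */

/* Leader, next iteration of the foreach loop, now looking at worker #2: */
if (isParallelApplyWorker(w2))          /* still true: w2->type was never reset */
    logicalrep_worker_stop_internal(w2, SIGTERM);   /* kill(w2->proc->pid, ...) dereferences NULL */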

The bug appeared after commit 2a8b40e, where isParallelApplyWorker() started to
check the worker->type, but we forgot to reset the worker type at worker exit
time. So, even after worker #2 has shut down, the worker type still looks valid
and we try to stop it again.

Previously, isParallelApplyWorker() used the worker->leader_pid, which is reset
when the worker exits, so the "if (isParallelApplyWorker(w))" check would not
pass in this case and we would not try to stop worker #2.
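
For reference, the difference boils down to this (simplified; the pre-2a8b40e
definition is paraphrased from the description above, not copied from the tree):

/* Before commit 2a8b40e: becomes false once leader_pid is reset at exit */
#define isParallelApplyWorker(worker) ((worker)->leader_pid != InvalidPid)

/* After commit 2a8b40e: stays true, because worker->type is never reset */
#define isParallelApplyWorker(worker) ((worker)->type == WORKERTYPE_PARALLEL_APPLY)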

To fix it, I think we need to reset the worker type at exit as well.
Attached is a patch that does this. I am also testing it locally
to see if there are other issues here.

Best Regards,
Hou Zhijie

Attachments:

0001-Reset-worker-type-at-exit-time.patch (application/octet-stream)
From 9e9c79e311d2065afafd54572d9177efbe922577 Mon Sep 17 00:00:00 2001
From: Hou Zhijie <houzj.fnst@cn.fujitsu.com>
Date: Wed, 23 Aug 2023 13:44:12 +0800
Subject: [PATCH] Reset worker type at exit time

Commit 2a8b40e introduces the worker type field for logical replication
workers, but forgot to reset the type when the worker exits. This leads us to
possibly recognize a stopped worker as a valid logical replication worker.

---
 src/backend/replication/logical/launcher.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index 72e44d5a02..48f88439d8 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -793,6 +793,7 @@ logicalrep_worker_cleanup(LogicalRepWorker *worker)
 {
 	Assert(LWLockHeldByMeInMode(LogicalRepWorkerLock, LW_EXCLUSIVE));
 
+	worker->type = WORKERTYPE_UNKNOWN;
 	worker->in_use = false;
 	worker->proc = NULL;
 	worker->dbid = InvalidOid;
-- 
2.30.0.windows.2

#6 Alvaro Herrera
alvherre@alvh.no-ip.org
In reply to: Zhijie Hou (Fujitsu) (#5)
3 attachment(s)
Re: subscription/015_stream sometimes breaks

On 2023-Aug-23, Zhijie Hou (Fujitsu) wrote:

[1]:
LWLockAcquire(LogicalRepWorkerLock, LW_SHARED);

workers = logicalrep_workers_find(MyLogicalRepWorker->subid, true);
foreach(lc, workers)
{
LogicalRepWorker *w = (LogicalRepWorker *) lfirst(lc);

** if (isParallelApplyWorker(w))
logicalrep_worker_stop_internal(w, SIGTERM);
}

Hmm, I think if worker->in_use is false, we shouldn't consult the rest
of the struct at all, so I propose to add the attached 0001 as a minimal
fix.

In fact, I'd go further and propose that if we do take that stance, then
we don't need to clear out the contents of this struct at all, so let's
not. That's 0002.

And the reason 0002 does not remove the zeroing of ->proc is that the
test gets stuck when I do that, and the reason for that looks to be
some shoddy coding in WaitForReplicationWorkerAttach, so I propose we
change that too, as in 0003.

Thoughts?

--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/

Attachments:

0001-Don-t-use-LogicalRepWorker-until-in_use-is-verified.patch (text/x-diff; charset=us-ascii)
From a28775fc5900cd740556a864e1826ceda268b794 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: Wed, 23 Aug 2023 09:25:35 +0200
Subject: [PATCH 1/3] Don't use LogicalRepWorker until ->in_use is verified

---
 src/backend/replication/logical/launcher.c | 4 ++--
 src/include/replication/worker_internal.h  | 8 ++++++--
 2 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index 72e44d5a02..ec5156aa41 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -862,7 +862,7 @@ logicalrep_sync_worker_count(Oid subid)
 	{
 		LogicalRepWorker *w = &LogicalRepCtx->workers[i];
 
-		if (w->subid == subid && isTablesyncWorker(w))
+		if (isTablesyncWorker(w) && w->subid == subid)
 			res++;
 	}
 
@@ -889,7 +889,7 @@ logicalrep_pa_worker_count(Oid subid)
 	{
 		LogicalRepWorker *w = &LogicalRepCtx->workers[i];
 
-		if (w->subid == subid && isParallelApplyWorker(w))
+		if (isParallelApplyWorker(w) && w->subid == subid)
 			res++;
 	}
 
diff --git a/src/include/replication/worker_internal.h b/src/include/replication/worker_internal.h
index a428663859..8f4bed0958 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -327,8 +327,10 @@ extern void pa_decr_and_wait_stream_block(void);
 extern void pa_xact_finish(ParallelApplyWorkerInfo *winfo,
 						   XLogRecPtr remote_lsn);
 
-#define isParallelApplyWorker(worker) ((worker)->type == WORKERTYPE_PARALLEL_APPLY)
-#define isTablesyncWorker(worker) ((worker)->type == WORKERTYPE_TABLESYNC)
+#define isParallelApplyWorker(worker) ((worker)->in_use && \
+									   (worker)->type == WORKERTYPE_PARALLEL_APPLY)
+#define isTablesyncWorker(worker) ((worker)->in_use && \
+								   (worker)->type == WORKERTYPE_TABLESYNC)
 
 static inline bool
 am_tablesync_worker(void)
@@ -339,12 +341,14 @@ am_tablesync_worker(void)
 static inline bool
 am_leader_apply_worker(void)
 {
+	Assert(MyLogicalRepWorker->in_use);
 	return (MyLogicalRepWorker->type == WORKERTYPE_APPLY);
 }
 
 static inline bool
 am_parallel_apply_worker(void)
 {
+	Assert(MyLogicalRepWorker->in_use);
 	return isParallelApplyWorker(MyLogicalRepWorker);
 }
 
-- 
2.30.2

0002-don-t-clean-up-unnecessarily.patch (text/x-diff; charset=us-ascii)
From 36336e6d8ef2913def59a53b6bc083a40181a85a Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: Wed, 23 Aug 2023 09:53:51 +0200
Subject: [PATCH 2/3] don't clean up unnecessarily

---
 src/backend/replication/logical/launcher.c | 6 ------
 1 file changed, 6 deletions(-)

diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index ec5156aa41..f9d6ebf2d8 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -795,12 +795,6 @@ logicalrep_worker_cleanup(LogicalRepWorker *worker)
 
 	worker->in_use = false;
 	worker->proc = NULL;
-	worker->dbid = InvalidOid;
-	worker->userid = InvalidOid;
-	worker->subid = InvalidOid;
-	worker->relid = InvalidOid;
-	worker->leader_pid = InvalidPid;
-	worker->parallel_apply = false;
 }
 
 /*
-- 
2.30.2

0003-Don-t-rely-on-proc-being-NULL-either.patch (text/x-diff; charset=us-ascii)
From 9dced5b0898be22004d442cb2893848663f967ba Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: Wed, 23 Aug 2023 09:55:13 +0200
Subject: [PATCH 3/3] Don't rely on proc being NULL either

---
 src/backend/replication/logical/launcher.c | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index f9d6ebf2d8..8bfceb5c27 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -201,11 +201,18 @@ WaitForReplicationWorkerAttach(LogicalRepWorker *worker,
 
 		LWLockAcquire(LogicalRepWorkerLock, LW_SHARED);
 
-		/* Worker either died or has started. Return false if died. */
-		if (!worker->in_use || worker->proc)
+		/* Not started? */
+		if (!worker->in_use)
 		{
 			LWLockRelease(LogicalRepWorkerLock);
-			return worker->in_use;
+			return false;
+		}
+
+		/* Already going? */
+		if (worker->proc)
+		{
+			LWLockRelease(LogicalRepWorkerLock);
+			return true;
 		}
 
 		LWLockRelease(LogicalRepWorkerLock);
@@ -794,7 +801,6 @@ logicalrep_worker_cleanup(LogicalRepWorker *worker)
 	Assert(LWLockHeldByMeInMode(LogicalRepWorkerLock, LW_EXCLUSIVE));
 
 	worker->in_use = false;
-	worker->proc = NULL;
 }
 
 /*
-- 
2.30.2

#7 Alvaro Herrera
alvherre@alvh.no-ip.org
In reply to: Alvaro Herrera (#6)
Re: subscription/015_stream sometimes breaks

On 2023-Aug-23, Alvaro Herrera wrote:

And the reason 0002 does not remove the zeroing of ->proc is that the
test gets stuck when I do that, and the reason for that looks to be
some shoddy coding in WaitForReplicationWorkerAttach, so I propose we
change that too, as in 0003.

Hmm, actually the test got stuck when I ran it repeatedly with this
0003. I guess there might be other places that depend on ->proc being
set to NULL on exit.
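
One such place is the shutdown wait loop in logicalrep_worker_stop_internal()
itself, which in rough outline (a simplified sketch of the launcher.c logic,
not the exact code) spins until ->proc is observed as cleared:

/* ... and wait for it to die (simplified sketch) */
for (;;)
{
	/* is it gone yet? */
	if (!worker->proc || worker->generation != generation)
		break;

	LWLockRelease(LogicalRepWorkerLock);

	/* wait a bit, then recheck under the lock */
	WaitLatch(MyLatch, WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
			  10L, WAIT_EVENT_BGWORKER_SHUTDOWN);
	ResetLatch(MyLatch);

	LWLockAcquire(LogicalRepWorkerLock, LW_SHARED);
}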

--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
"Most hackers will be perfectly comfortable conceptualizing users as entropy
sources, so let's move on." (Nathaniel Smith)

#8 Amit Kapila
amit.kapila16@gmail.com
In reply to: Alvaro Herrera (#6)
Re: subscription/015_stream sometimes breaks

On Wed, Aug 23, 2023 at 1:31 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

On 2023-Aug-23, Zhijie Hou (Fujitsu) wrote:

[1]:
LWLockAcquire(LogicalRepWorkerLock, LW_SHARED);

workers = logicalrep_workers_find(MyLogicalRepWorker->subid, true);
foreach(lc, workers)
{
LogicalRepWorker *w = (LogicalRepWorker *) lfirst(lc);

** if (isParallelApplyWorker(w))
logicalrep_worker_stop_internal(w, SIGTERM);
}

Hmm, I think if worker->in_use is false, we shouldn't consult the rest
of the struct at all, so I propose to add the attached 0001 as a minimal
fix.

I think that way we may need to add the check for in_use before
accessing each of the LogicalRepWorker struct fields or form some rule
about which fields (or places) are okay to access without checking
in_use field.

In fact, I'd go further and propose that if we do take that stance, then
we don't need to clear out the contents of this struct at all, so let's
not. That's 0002.

And the reason 0002 does not remove the zeroing of ->proc is that the
test gets stuck when I do that, and the reason for that looks to be
some shoddy coding in WaitForReplicationWorkerAttach, so I propose we
change that too, as in 0003.

Personally, I think we should consider this change (0002 and 0003) separately.

--
With Regards,
Amit Kapila.

#9 Alvaro Herrera
alvherre@alvh.no-ip.org
In reply to: Amit Kapila (#8)
Re: subscription/015_stream sometimes breaks

On 2023-Aug-24, Amit Kapila wrote:

On Wed, Aug 23, 2023 at 1:31 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

Hmm, I think if worker->in_use is false, we shouldn't consult the rest
of the struct at all, so I propose to add the attached 0001 as a minimal
fix.

I think that way we may need to add the check for in_use before
accessing each of the LogicalRepWorker struct fields or form some rule
about which fields (or places) are okay to access without checking
in_use field.

As far as I realize, we have that rule already. It's only a few
relatively new places that have broken it. I understand that the in_use
concept comes from the one of the same name in ReplicationSlot, except
that it is not at all documented in worker_internal.h.

So I propose we do both: apply Zhijie's patch and my 0001 now; and
somebody gets to document the locking design for LogicalRepWorker.

In fact, I'd go further and propose that if we do take that stance, then
we don't need to clear out the contents of this struct at all, so let's
not. That's 0002.

Personally, I think we should consider this change (0002 and 0003) separately.

I agree. I'd maybe even retract them.

--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/

#10 Amit Kapila
amit.kapila16@gmail.com
In reply to: Alvaro Herrera (#9)
Re: subscription/015_stream sometimes breaks

On Thu, Aug 24, 2023 at 1:20 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

On 2023-Aug-24, Amit Kapila wrote:

On Wed, Aug 23, 2023 at 1:31 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

Hmm, I think if worker->in_use is false, we shouldn't consult the rest
of the struct at all, so I propose to add the attached 0001 as a minimal
fix.

I think that way we may need to add the check for in_use before
accessing each of the LogicalRepWorker struct fields or form some rule
about which fields (or places) are okay to access without checking
in_use field.

As far as I realize, we have that rule already. It's only a few
relatively new places that have broken it. I understand that the in_use
concept comes from the one of the same name in ReplicationSlot, except
that it is not at all documented in worker_internal.h.

So I propose we do both: apply Zhijie's patch and my 0001 now; and
somebody gets to document the locking design for LogicalRepWorker.

Agreed.

--
With Regards,
Amit Kapila.

#11 Peter Smith
smithpb2250@gmail.com
In reply to: Amit Kapila (#10)
Re: subscription/015_stream sometimes breaks

On Thu, Aug 24, 2023 at 8:18 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Aug 24, 2023 at 1:20 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

On 2023-Aug-24, Amit Kapila wrote:

On Wed, Aug 23, 2023 at 1:31 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

Hmm, I think if worker->in_use is false, we shouldn't consult the rest
of the struct at all, so I propose to add the attached 0001 as a minimal
fix.

I think that way we may need to add the check for in_use before
accessing each of the LogicalRepWorker struct fields or form some rule
about which fields (or places) are okay to access without checking
in_use field.

As far as I realize, we have that rule already. It's only a few
relatively new places that have broken it. I understand that the in_use
concept comes from the one of the same name in ReplicationSlot, except
that it is not at all documented in worker_internal.h.

So I propose we do both: apply Zhijie's patch and my 0001 now; and
somebody gets to document the locking design for LogicalRepWorker.

Agreed.

Both of these patches (Hou-san's expedient resetting of the worker
type, Alvaro's 0001 putting the 'in_use' check within the isXXXWorker
type macros) appear to be blending the concept of "type" with whether
the worker is "alive" or not, which I am not sure is a good thing. IMO
the type is the type forever, so I felt type should get assigned only
once when the worker is "born". For example, a dead horse is still a
horse.

------
Kind Regards,
Peter Smith.
Fujitsu Australia

#12 Amit Kapila
amit.kapila16@gmail.com
In reply to: Peter Smith (#11)
Re: subscription/015_stream sometimes breaks

On Fri, Aug 25, 2023 at 9:09 AM Peter Smith <smithpb2250@gmail.com> wrote:

On Thu, Aug 24, 2023 at 8:18 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Aug 24, 2023 at 1:20 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

On 2023-Aug-24, Amit Kapila wrote:

On Wed, Aug 23, 2023 at 1:31 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

Hmm, I think if worker->in_use is false, we shouldn't consult the rest
of the struct at all, so I propose to add the attached 0001 as a minimal
fix.

I think that way we may need to add the check for in_use before
accessing each of the LogicalRepWorker struct fields or form some rule
about which fields (or places) are okay to access without checking
in_use field.

As far as I realize, we have that rule already. It's only a few
relatively new places that have broken it. I understand that the in_use
concept comes from the one of the same name in ReplicationSlot, except
that it is not at all documented in worker_internal.h.

So I propose we do both: apply Zhijie's patch and my 0001 now; and
somebody gets to document the locking design for LogicalRepWorker.

Agreed.

Both of these patches (Hou-san's expedient resetting of the worker
type, Alvaro's 0001 putting the 'in_use' check within the isXXXWorker
type macros) appear to be blending the concept of "type" with whether
the worker is "alive" or not, which I am not sure is a good thing. IMO
the type is the type forever, so I felt type should get assigned only
once when the worker is "born". For example, a dead horse is still a
horse.

I think it is important to have an alive check before accessing the
worker type, as we do for some of the other fields. For example,
see the usage of the in_use flag in the function logicalrep_worker_find().
The checks for parallel apply workers don't consider the in_use flag,
whereas the checks for other worker types test in_use first.
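
For illustration, the relevant part of logicalrep_worker_find() looks
roughly like this (simplified from launcher.c):

/* Search for an attached worker for the given subscription. */
for (i = 0; i < max_logical_replication_workers; i++)
{
	LogicalRepWorker *w = &LogicalRepCtx->workers[i];

	/* in_use is tested first; the other fields are only meaningful if it is set */
	if (w->in_use && w->subid == subid && w->relid == relid &&
		(!only_running || w->proc))
	{
		res = w;
		break;
	}
}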

--
With Regards,
Amit Kapila.

#13 Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#10)
Re: subscription/015_stream sometimes breaks

On Thu, Aug 24, 2023 at 3:48 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Aug 24, 2023 at 1:20 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

On 2023-Aug-24, Amit Kapila wrote:

On Wed, Aug 23, 2023 at 1:31 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

Hmm, I think if worker->in_use is false, we shouldn't consult the rest
of the struct at all, so I propose to add the attached 0001 as a minimal
fix.

I think that way we may need to add the check for in_use before
accessing each of the LogicalRepWorker struct fields or form some rule
about which fields (or places) are okay to access without checking
in_use field.

As far as I realize, we have that rule already. It's only a few
relatively new places that have broken it. I understand that the in_use
concept comes from the one of the same name in ReplicationSlot, except
that it is not at all documented in worker_internal.h.

So I propose we do both: apply Zhijie's patch and my 0001 now; and
somebody gets to document the locking design for LogicalRepWorker.

Agreed.

Pushed both the patches.

--
With Regards,
Amit Kapila.

#14 Peter Smith
smithpb2250@gmail.com
In reply to: Amit Kapila (#13)
1 attachment(s)
Re: subscription/015_stream sometimes breaks

On Fri, Aug 25, 2023 at 8:15 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Aug 24, 2023 at 3:48 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Aug 24, 2023 at 1:20 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

On 2023-Aug-24, Amit Kapila wrote:

On Wed, Aug 23, 2023 at 1:31 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

Hmm, I think if worker->in_use is false, we shouldn't consult the rest
of the struct at all, so I propose to add the attached 0001 as a minimal
fix.

I think that way we may need to add the check for in_use before
accessing each of the LogicalRepWorker struct fields or form some rule
about which fields (or places) are okay to access without checking
in_use field.

As far as I realize, we have that rule already. It's only a few
relatively new places that have broken it. I understand that the in_use
concept comes from the one of the same name in ReplicationSlot, except
that it is not at all documented in worker_internal.h.

So I propose we do both: apply Zhijie's patch and my 0001 now; and
somebody gets to document the locking design for LogicalRepWorker.

Agreed.

Pushed both the patches.

IMO there are inconsistencies in the second patch that was pushed.

1. In the am_xxx functions, why is there Assert 'in_use' only for the
APPLY / PARALLEL_APPLY workers but not for TABLESYNC workers?

2. In the am_xxx functions there is now Assert 'in_use', so why are we
still using macros to check again what we already asserted is not
possible? (Or, if the checking overkill was a deliberate choice then
why is there no isLeaderApplyWorker macro?)

~

PSA a small patch to address these.

------
Kind Regards,
Peter Smith.
Fujitsu Australia

Attachments:

v1-0001-Fix-am_xxx-function-Asserts.patch (application/octet-stream)
From abe9f4c56d931e12f6e8a9e3d1c1d88fff7e4149 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Mon, 28 Aug 2023 09:29:50 +1000
Subject: [PATCH v1] Fix am_xxx function Asserts

---
 src/include/replication/worker_internal.h | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/src/include/replication/worker_internal.h b/src/include/replication/worker_internal.h
index 8f4bed0..dd94543 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -335,7 +335,8 @@ extern void pa_xact_finish(ParallelApplyWorkerInfo *winfo,
 static inline bool
 am_tablesync_worker(void)
 {
-	return isTablesyncWorker(MyLogicalRepWorker);
+	Assert(MyLogicalRepWorker->in_use);
+	return (MyLogicalRepWorker->type == WORKERTYPE_TABLESYNC);
 }
 
 static inline bool
@@ -349,7 +350,7 @@ static inline bool
 am_parallel_apply_worker(void)
 {
 	Assert(MyLogicalRepWorker->in_use);
-	return isParallelApplyWorker(MyLogicalRepWorker);
+	return (MyLogicalRepWorker->type == WORKERTYPE_PARALLEL_APPLY);
 }
 
 #endif							/* WORKER_INTERNAL_H */
-- 
1.8.3.1

#15 Amit Kapila
amit.kapila16@gmail.com
In reply to: Peter Smith (#14)
Re: subscription/015_stream sometimes breaks

On Mon, Aug 28, 2023 at 5:35 AM Peter Smith <smithpb2250@gmail.com> wrote:

On Fri, Aug 25, 2023 at 8:15 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

IMO there are inconsistencies in the second patch that was pushed.

1. In the am_xxx functions, why is there Assert 'in_use' only for the
APPLY / PARALLEL_APPLY workers but not for TABLESYNC workers?

2. In the am_xxx functions there is now Assert 'in_use', so why are we
still using macros to check again what we already asserted is not
possible? (Or, if the checking overkill was a deliberate choice then
why is there no isLeaderApplyWorker macro?)

~

PSA a small patch to address these.

I find your suggestions reasonable. Alvaro, do you have any comments?

--
With Regards,
Amit Kapila.

#16 Alvaro Herrera
alvherre@alvh.no-ip.org
In reply to: Amit Kapila (#15)
Re: subscription/015_stream sometimes breaks

On 2023-Aug-29, Amit Kapila wrote:

On Mon, Aug 28, 2023 at 5:35 AM Peter Smith <smithpb2250@gmail.com> wrote:

On Fri, Aug 25, 2023 at 8:15 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

IMO there are inconsistencies in the second patch that was pushed.

I find your suggestions reasonable. Alvaro, do you have any comments?

Well, my main comment is that at this point I'm not sure these
isFooWorker() macros are worth their salt. It looks like we could
replace their uses with direct type comparisons at their callsites and
remove them, with no loss of readability. The am_sth_worker() functions
are probably a bit more useful, though some of the callsites could end
up better if replaced with a straight type comparison.
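
For example, the callsite discussed earlier in this thread would then
read (a sketch of the suggestion, not an actual patch):

/* today, via the macro (which now also tests in_use) */
if (isParallelApplyWorker(w))
	logicalrep_worker_stop_internal(w, SIGTERM);

/* with the macro removed: a straight comparison at the callsite */
if (w->in_use && w->type == WORKERTYPE_PARALLEL_APPLY)
	logicalrep_worker_stop_internal(w, SIGTERM);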

All in all, I don't disagree with Peter's suggestions, but this is
pretty much in "meh" territory for me.

--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/

#17 Peter Smith
smithpb2250@gmail.com
In reply to: Alvaro Herrera (#16)
Re: subscription/015_stream sometimes breaks

On Tue, Aug 29, 2023 at 10:35 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

On 2023-Aug-29, Amit Kapila wrote:

On Mon, Aug 28, 2023 at 5:35 AM Peter Smith <smithpb2250@gmail.com> wrote:

On Fri, Aug 25, 2023 at 8:15 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

IMO there are inconsistencies in the second patch that was pushed.

I find your suggestions reasonable. Alvaro, do you have any comments?

Well, my main comment is that at this point I'm not sure these
isFooWorker() macros are worth their salt. It looks like we could
replace their uses with direct type comparisons in their callsites and
remove them, with no loss of readability. The am_sth_worker() are
probably a bit more useful, though some of the callsites could end up
better if replaced with straight type comparison.

All in all, I don't disagree with Peter's suggestions, but this is
pretty much in "meh" territory for me.

I had written a small non-functional patch [1] to address some macro
inconsistencies introduced by a prior patch of this thread.

It received initial feedback from Amit ("I find your suggestions
reasonable") and from Alvaro ("I don't disagree with Peter's
suggestions") but then nothing further happened. I also created a CF
entry https://commitfest.postgresql.org/46/4570/ for it.

AFAIK my patch is still valid, but after 4 months of no activity it
seems there is no interest in pushing it, so I am withdrawing the CF
entry.

======
[1]: /messages/by-id/CAHut+PuwaF4Sb41pWQk69d2WO_ZJQpj-_2JkQvP=1jwozUpcCQ@mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia