WaitEventSetWaitBlock() can still hang on Windows due to connection reset

Started by Alexander Lakhin, about 1 year ago, 3 messages
#1 Alexander Lakhin
exclusion@gmail.com
3 attachment(s)

Hello hackers,

[ a follow-up to [1], separated to focus on this issue ]

Let me bring your attention to a failure [2], produced by drongo last week,
with the following diagnostics:
[10:30:30.954](2.110s) ok 5 - compare primary and standby dumps
### Restarting node "primary"
# Running: pg_ctl -w -D
C:\\prog\\bf\\root\\HEAD\\pgsql.build/testrun/recovery/027_stream_regress\\data/t_027_stream_regress_primary_data/pgdata
-l C:\\prog\\bf\\root\\HEAD\\pgsql.build/testrun/recovery/027_stream_regress\\log/027_stream_regress_primary.log restart
waiting for server to shut down.... done
server stopped
waiting for server to start.... done
server started
# Postmaster PID for node "primary" is 8908
Waiting for replication conn standby_1's replay_lsn to pass 0/158C8B98 on primary
[10:41:32.115](661.161s) # poll_query_until timed out executing this query:
# SELECT '0/158C8B98' <= replay_lsn AND state = 'streaming'
#          FROM pg_catalog.pg_stat_replication
#          WHERE application_name IN ('standby_1', 'walreceiver')
# expecting this output:
# t
# last actual query output:
#

027_stream_regress_standby_1.log contains:
2024-10-14 10:30:28.483 UTC [4320:12] 027_stream_regress.pl LOG: disconnection: session time: 0:00:03.793 user=pgrunner
database=postgres host=127.0.0.1 port=61748
2024-10-14 10:30:31.442 UTC [8468:2] LOG:  replication terminated by primary server
2024-10-14 10:30:31.442 UTC [8468:3] DETAIL:  End of WAL reached on timeline 1 at 0/158C8B98.
2024-10-14 10:30:31.442 UTC [8468:4] FATAL:  could not send end-of-streaming message to primary: server closed the
connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
    no COPY in progress
2024-10-14 10:30:31.443 UTC [5452:7] LOG:  invalid resource manager ID 101 at 0/158C8B98
2024-10-14 10:35:06.986 UTC [8648:21] LOG:  restartpoint starting: time
2024-10-14 10:35:06.991 UTC [8648:22] LOG:  restartpoint complete: wrote 0 buffers (0.0%), wrote 1 SLRU buffers; 0 WAL
file(s) added, 0 removed, 1 recycled; write=0.001 s, sync=0.001 s, total=0.005 s; sync files=0, longest=0.000 s,
average=0.000 s; distance=15336 kB, estimate=69375 kB; lsn=0/158C8B20, redo lsn=0/158C8B20
2024-10-14 10:35:06.991 UTC [8648:23] LOG:  recovery restart point at 0/158C8B20
2024-10-14 10:35:06.991 UTC [8648:24] DETAIL:  Last completed transaction was at log time 2024-10-14 10:30:24.820804+00.
2024-10-14 10:41:32.510 UTC [4220:4] LOG:  received immediate shutdown request

(That is, the primary was restarted, but the standby never reconnected to it;
it kept waiting for something...)

$node_primary->restart was added to 027_stream_regress.pl on 2024-06-27
(with 0844b3968), which can explain why this is the first noticed failure
of this kind.

I managed to reproduce this failure locally, then reduced the 027 test
to just restarting the primary in a loop, and with debug logging added (please
find it attached) I could see what makes the standby hang:
2024-10-22 08:42:20.016 UTC [6652] DETAIL:  End of WAL reached on timeline 1 at 0/5D14C98.
2024-10-22 08:42:20.016 UTC [6652] FATAL:  could not send end-of-streaming message to primary: server closed the
connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
    no COPY in progress
2024-10-22 08:42:20.016 UTC [21268] LOG:  invalid record length at 0/5D14C98: expected at least 24, got 0
2024-10-22 08:42:20.088 UTC [8256] LOG:  !!!WalReceiverMain| before walrcv_connect
!!!libpqrcv_connect| before loop
!!!libpqrcv_connect| before WaitLatchOrSocket(io_flag: 64)
!!!libpqrcv_connect| after WaitLatchOrSocket, rc: 1
!!!libpqrcv_connect| before WaitLatchOrSocket(io_flag: 64)
    !!!WaitEventSetWaitBlock| before WaitForMultipleObjects
    !!!WaitEventSetWaitBlock| WaitForMultipleObjects returned: 3
!!!libpqrcv_connect| after WaitLatchOrSocket, rc: 64
!!!libpqrcv_connect| before WaitLatchOrSocket(io_flag: 4)
!!!libpqrcv_connect| after WaitLatchOrSocket, rc: 4
!!!libpqrcv_connect| before WaitLatchOrSocket(io_flag: 2)
    !!!WaitEventSetWaitBlock| WL_SOCKET_READABLE, cur_event->fd: 844, WSAGetLastError(): 10054
    !!!WaitEventSetWaitBlock| before WaitForMultipleObjects
2024-10-22 08:44:49.399 UTC [15352] LOG:  restartpoint starting: time
...

For a previous, successful reconnect, by contrast, we can see:
2024-10-22 08:42:17.275 UTC [6652] LOG:  !!!WalReceiverMain| before walrcv_connect
!!!libpqrcv_connect| before loop
!!!libpqrcv_connect| before WaitLatchOrSocket(io_flag: 64)
!!!libpqrcv_connect| after WaitLatchOrSocket, rc: 1
!!!libpqrcv_connect| before WaitLatchOrSocket(io_flag: 64)
    !!!WaitEventSetWaitBlock| before WaitForMultipleObjects
    !!!WaitEventSetWaitBlock| WaitForMultipleObjects returned: 3
!!!libpqrcv_connect| after WaitLatchOrSocket, rc: 64
!!!libpqrcv_connect| before WaitLatchOrSocket(io_flag: 4)
!!!libpqrcv_connect| after WaitLatchOrSocket, rc: 4
!!!libpqrcv_connect| before WaitLatchOrSocket(io_flag: 2)
    !!!WaitEventSetWaitBlock| WL_SOCKET_READABLE, cur_event->fd: 844, WSAGetLastError(): 10035
    !!!WaitEventSetWaitBlock| before WaitForMultipleObjects
    !!!WaitEventSetWaitBlock| WaitForMultipleObjects returned: 3
!!!libpqrcv_connect| after WaitLatchOrSocket, rc: 2
!!!libpqrcv_connect| before WaitLatchOrSocket(io_flag: 4)
!!!libpqrcv_connect| after WaitLatchOrSocket, rc: 4
!!!libpqrcv_connect| before WaitLatchOrSocket(io_flag: 2)
    !!!WaitEventSetWaitBlock| WL_SOCKET_READABLE, cur_event->fd: 844, WSAGetLastError(): 10035
    !!!WaitEventSetWaitBlock| before WaitForMultipleObjects
    !!!WaitEventSetWaitBlock| WaitForMultipleObjects returned: 3
!!!libpqrcv_connect| after WaitLatchOrSocket, rc: 2
!!!libpqrcv_connect| before WaitLatchOrSocket(io_flag: 2)
    !!!WaitEventSetWaitBlock| WL_SOCKET_READABLE, cur_event->fd: 844, WSAGetLastError(): 10035
    !!!WaitEventSetWaitBlock| before WaitForMultipleObjects
    !!!WaitEventSetWaitBlock| WaitForMultipleObjects returned: 3
!!!libpqrcv_connect| after WaitLatchOrSocket, rc: 2
!!!libpqrcv_connect| before WaitLatchOrSocket(io_flag: 2)
    !!!WaitEventSetWaitBlock| WL_SOCKET_READABLE, cur_event->fd: 844, WSAGetLastError(): 10035
    !!!WaitEventSetWaitBlock| before WaitForMultipleObjects
    !!!WaitEventSetWaitBlock| WaitForMultipleObjects returned: 3
!!!libpqrcv_connect| after WaitLatchOrSocket, rc: 2
!!!libpqrcv_connect| after loop

(The full log is attached too.)
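
For reading the traces: io_flag and rc are bit masks of WL_* wait-event flags.
Assuming the stock definitions in src/include/storage/latch.h, the values seen
above decode as follows:

```c
/* Wait-event flags from src/include/storage/latch.h (stock values) */
#define WL_LATCH_SET          (1 << 0)	/* rc = 1 */
#define WL_SOCKET_READABLE    (1 << 1)	/* io_flag / rc = 2 */
#define WL_SOCKET_WRITEABLE   (1 << 2)	/* io_flag / rc = 4 */
#ifdef WIN32
#define WL_SOCKET_CONNECTED   (1 << 6)	/* io_flag / rc = 64 (Windows only) */
#endif
```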

In other words, when WSARecv() fails with WSAECONNRESET (10054), the subsequent
WaitForMultipleObjects() call still hangs; with WSAEWOULDBLOCK (10035), as in
the successful reconnect above, it wakes up as expected.
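
If the culprit is indeed the readable-socket probe in WaitEventSetWaitBlock()
reporting the socket only when the MSG_PEEK WSARecv() succeeds, one possible
direction (just a sketch under that assumption, not a tested patch;
socket_has_input() is a made-up name for illustration) would be to also report
WL_SOCKET_READABLE when the probe fails with a hard connection error, so the
caller sees the reset instead of blocking in WaitForMultipleObjects(), which,
per the trace above, never returns for the reset socket:

```c
#include <winsock2.h>
#include <stdbool.h>

/*
 * Simplified, stand-alone paraphrase of the Windows readable-socket probe
 * (not the actual latch.c source).  The observed hang corresponds to the
 * WSAECONNRESET branch: today the code falls through to
 * WaitForMultipleObjects(), which never returns for the reset socket.
 */
static bool
socket_has_input(SOCKET s)
{
	char		c;
	WSABUF		buf;
	DWORD		received = 0;
	DWORD		flags = MSG_PEEK;

	buf.buf = &c;
	buf.len = 1;

	if (WSARecv(s, &buf, 1, &received, &flags, NULL, NULL) == 0)
		return true;			/* data (or a graceful EOF) is pending */

	switch (WSAGetLastError())
	{
		case WSAEWOULDBLOCK:
			return false;		/* nothing to read yet: OK to block and wait */
		case WSAECONNRESET:
		case WSAECONNABORTED:
		case WSAENETRESET:
			return true;		/* hypothetical fix: report the socket as
								 * readable so recv()/PQconnectPoll() can see
								 * the reset instead of blocking forever */
		default:
			return true;		/* be conservative about other errors too */
	}
}
```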

I reproduced this on a8458f508~1 and on 97d891010, so the issue itself is
not new...

[1]: /messages/by-id/CA+hUKGKn+xWjaEnQ+iYYs7STG7_nNP5w5VMb1vSvQfAkkOxOUA@mail.gmail.com
[2]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2024-10-14%2010%3A08%3A17

Best regards,
Alexander

Attachments:

099_primary_restart.pl (application/x-perl)
099_primary_restart_standby_1.log (text/x-log)
walreceiver-win-socket-debugging.patch (text/x-patch), contents below:
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 97f957cd87..9acf524e8f 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -221,6 +221,7 @@ libpqrcv_connect(const char *conninfo, bool replication, bool logical,
 	 * Per spec for PQconnectPoll, first wait till socket is write-ready.
 	 */
 	status = PGRES_POLLING_WRITING;
+fprintf(stderr, "!!!libpqrcv_connect| before loop\n");
 	do
 	{
 		int			io_flag;
@@ -236,11 +237,16 @@ libpqrcv_connect(const char *conninfo, bool replication, bool logical,
 		else
 			io_flag = WL_SOCKET_WRITEABLE;
 
+fprintf(stderr, "!!!libpqrcv_connect| before WaitLatchOrSocket(io_flag: %d)\n",
+io_flag);
+debug_latch = true;
 		rc = WaitLatchOrSocket(MyLatch,
 							   WL_EXIT_ON_PM_DEATH | WL_LATCH_SET | io_flag,
 							   PQsocket(conn->streamConn),
 							   0,
 							   WAIT_EVENT_LIBPQWALRECEIVER_CONNECT);
+debug_latch = false;
+fprintf(stderr, "!!!libpqrcv_connect| after WaitLatchOrSocket, rc: %d\n", rc);
 
 		/* Interrupted? */
 		if (rc & WL_LATCH_SET)
@@ -253,6 +267,7 @@ libpqrcv_connect(const char *conninfo, bool replication, bool logical,
 		if (rc & io_flag)
 			status = PQconnectPoll(conn->streamConn);
 	} while (status != PGRES_POLLING_OK && status != PGRES_POLLING_FAILED);
+fprintf(stderr, "!!!libpqrcv_connect| after loop\n");
 
 	if (PQstatus(conn->streamConn) != CONNECTION_OK)
 		goto bad_connection_errmsg;
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index a27aee63de..e14d8cbd59 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -300,12 +300,14 @@ WalReceiverMain(char *startup_data, size_t startup_data_len)
 
 	/* Establish the connection to the primary for XLOG streaming */
 	appname = cluster_name[0] ? cluster_name : "walreceiver";
+elog(LOG, "!!!WalReceiverMain| before walrcv_connect");
 	wrconn = walrcv_connect(conninfo, true, false, false, appname, &err);
 	if (!wrconn)
 		ereport(ERROR,
 				(errcode(ERRCODE_CONNECTION_FAILURE),
 				 errmsg("streaming replication receiver \"%s\" could not connect to the primary server: %s",
 						appname, err)));
+elog(LOG, "!!!WalReceiverMain| wrconn: %p", wrconn);
 
 	/*
 	 * Save user-visible connection string.  This clobbers the original
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index 608eb66abe..e6506cf6a3 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -98,6 +98,8 @@
 #endif
 #endif
 
+bool debug_latch = false;
+
 /* typedef in latch.h */
 struct WaitEventSet
 {
@@ -2029,6 +2034,8 @@ WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
 				occurred_events->fd = cur_event->fd;
 				return 1;
 			}
+if (debug_latch)
+fprintf(stderr, "\t!!!WaitEventSetWaitBlock| WL_SOCKET_READABLE, cur_event->fd: %d, WSAGetLastError(): %d\n", cur_event->fd, WSAGetLastError());
 		}
 
 		/*
@@ -2069,8 +2076,12 @@ WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
 	 *
 	 * Need to wait for ->nevents + 1, because signal handle is in [0].
 	 */
+if (debug_latch)
+fprintf(stderr, "\t!!!WaitEventSetWaitBlock| before WaitForMultipleObjects\n");
 	rc = WaitForMultipleObjects(set->nevents + 1, set->handles, FALSE,
 								cur_timeout);
+if (debug_latch)
+fprintf(stderr, "\t!!!WaitEventSetWaitBlock| WaitForMultipleObjects returned: %d\n", rc);
 
 	/* Check return code */
 	if (rc == WAIT_FAILED)
diff --git a/src/include/storage/latch.h b/src/include/storage/latch.h
index 7e194d536f..5387afdedc 100644
--- a/src/include/storage/latch.h
+++ b/src/include/storage/latch.h
@@ -193,4 +193,5 @@ extern void InitializeLatchWaitSet(void);
 extern int	GetNumRegisteredWaitEvents(WaitEventSet *set);
 extern bool WaitEventSetCanReportClosed(void);
 
+extern PGDLLIMPORT bool debug_latch;
 #endif							/* LATCH_H */
#2 Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Alexander Lakhin (#1)
RE: WaitEventSetWaitBlock() can still hang on Windows due to connection reset

Dear Alexander,

While analyzing the BF failure [1], I noticed that the same issue may have
happened here, meaning the apply worker was waiting for something. According to
the log, the apply worker (PID 2820) got stuck, so logical replication could
not restart.

Regress log:
```
### Restarting node "pub"
# Running: pg_ctl --wait --pgdata C:\\prog\\bf\\root\\HEAD\\pgsql.build/...
waiting for server to shut down.... done
server stopped
waiting for server to start.... done
server started
# Postmaster PID for node "pub" is 980
timed out waiting for match: (?^:Streaming transactions committing after ([A-F0-9]+/[A-F0-9]+), ...
```

Subscriber log:
```
2025-04-12 05:08:44.630 UTC [2820:1] LOG: logical replication apply worker for subscription "sub" has started
2025-04-12 05:08:44.642 UTC [5652:6] LOG: background worker "logical replication apply worker" (PID 6344) exited with exit code 1
2025-04-12 05:13:27.352 UTC [3988:1] LOG: checkpoint starting: time
2025-04-12 05:13:36.825 UTC [3988:2] LOG: checkpoint complete: wrote 62 buffers ...
2025-04-12 05:15:01.265 UTC [5652:7] LOG: received immediate shutdown request
2025-04-12 05:15:01.353 UTC [5652:8] LOG: database system is shut down
```

Publisher log:
```
2025-04-12 05:08:44.634 UTC [1112:7] LOG: database system is shut down
2025-04-12 05:08:45.685 UTC [980:1] LOG: starting PostgreSQL 18devel on...
2025-04-12 05:08:45.687 UTC [980:2] LOG: listening on IPv4 address "127.0.0.1", port 18057
2025-04-12 05:08:46.225 UTC [4392:1] LOG: database system was shut down at 2025-04-12 05:08:43 UTC
2025-04-12 05:08:46.319 UTC [980:3] LOG: database system is ready to accept connections
2025-04-12 05:15:00.408 UTC [980:4] LOG: received immediate shutdown request
2025-04-12 05:15:00.942 UTC [980:5] LOG: database system is shut down
```

Now the issue has been reported for both physical and logical replication,
but I suspect it can happen for any application.

[1]: https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=drongo&dt=2025-04-12%2003%3A59%3A38&stg=recovery-check

Best regards,
Hayato Kuroda
FUJITSU LIMITED

#3 Alexander Lakhin
exclusion@gmail.com
In reply to: Hayato Kuroda (Fujitsu) (#2)
Re: WaitEventSetWaitBlock() can still hang on Windows due to connection reset

Dear Kuroda-san,

17.04.2025 10:10, Hayato Kuroda (Fujitsu) wrote:

> Dear Alexander,
>
> While analyzing the BF failure [1], I noticed that the same issue may have
> happened here, meaning the apply worker was waiting for something. According
> to the log, the apply worker (PID 2820) got stuck, so logical replication
> could not restart.
>
> ...
>
> Now the issue has been reported for both physical and logical replication,
> but I suspect it can happen for any application.
>
> [1]: https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=drongo&dt=2025-04-12%2003%3A59%3A38&stg=recovery-check

Thank you for the analysis!
I think you're right, it's the same issue, so I've added the link to the
wiki page.

Best regards,
Alexander Lakhin
Neon (https://neon.tech)