Yet another way for pg_ctl stop to fail on Windows

Started by Alexander Lakhinover 1 year ago4 messages

exclusion@gmail.com

over 1 year ago

Hello hackers,

While trying to reproduce a recent fairywren (a Windows animal) failure,
I ran amcheck/amcheck/003_cic_2pc in parallel inside a slowed-down
VM and came across another issue:
### Stopping node "CIC_2PC_test" using mode fast
# Running: pg_ctl -D C:\src\postgresql\build/testrun/amcheck_17/003_cic_2pc\data/t_003_cic_2pc_CIC_2PC_test_data/pgdata
-m fast stop
waiting for server to shut down..... failed
pg_ctl: server does not shut down
# pg_ctl stop failed: 256
# Postmaster PID for node "CIC_2PC_test" is 6048
[08:24:52.915](12.792s) Bail out! pg_ctl stop failed

So "pg_ctl stop" failed due to not a timeout, but some other reason.

With extra logging added, I got:
### Stopping node "CIC_2PC_test" using mode fast
# Running: pg_ctl -D C:\src\postgresql\build/testrun/amcheck_3/003_cic_2pc\data/t_003_cic_2pc_CIC_2PC_test_data/pgdata
-m fast stop
waiting for server to shut down......!!!pgkill| GetLastError(): 231
postmaster (9596) died untimely? res: -1, errno: 22
failed

Thus, CallNamedPipe() in pgkill() returned ERROR_PIPE_BUSY (All pipe
instances are busy) and it was handled as an unexpected error.
(The error code 231 returned 10 times out of 10 failures of this ilk for
me.)

Noah, what do you think of handling this error in line with handling of
ERROR_BROKEN_PIPE and ERROR_BAD_PIPE (which was done in 0ea1f2a3a)?

I tried the following change:
        switch (GetLastError())
        {
                case ERROR_BROKEN_PIPE:
                case ERROR_BAD_PIPE:
+               case ERROR_PIPE_BUSY:
and saw no issues.

The reason I'd like to bring your attention to the issue (if you don't
mind), is that it's impossible to understand the reason of such false
failure if it happens in the buildfarm/CI.

Best regards,
Alexander

Noah Misch

noah@leadboat.com

over 1 year ago

In reply to: Alexander Lakhin (#1)

Re: Yet another way for pg_ctl stop to fail on Windows

On Sat, Sep 07, 2024 at 03:00:00PM +0300, Alexander Lakhin wrote:

With extra logging added, I got:
### Stopping node "CIC_2PC_test" using mode fast
# Running: pg_ctl -D C:\src\postgresql\build/testrun/amcheck_3/003_cic_2pc\data/t_003_cic_2pc_CIC_2PC_test_data/pgdata
-m fast stop
waiting for server to shut down......!!!pgkill| GetLastError(): 231
postmaster (9596) died untimely? res: -1, errno: 22
ï¿½failed

Thus, CallNamedPipe() in pgkill() returned ERROR_PIPE_BUSY (All pipe
instances are busy) and it was handled as an unexpected error.
(The error code 231 returned 10 times out of 10 failures of this ilk for
me.)

Thanks for discovering that.

Noah, what do you think of handling this error in line with handling of
ERROR_BROKEN_PIPE and ERROR_BAD_PIPE (which was done in 0ea1f2a3a)?

I tried the following change:
ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ switch (GetLastError())
ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ {
ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ case ERROR_BROKEN_PIPE:
ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ case ERROR_BAD_PIPE:
+ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ case ERROR_PIPE_BUSY:
and saw no issues.

That would be a strict improvement over returning EINVAL like we do today. We
do use PIPE_UNLIMITED_INSTANCES, so I expect the causes of ERROR_PIPE_BUSY are
process exit and ENOMEM-like situations. While that change is the best thing
if the process is exiting, it could silently drop the signal in ENOMEM-like
situations. Consider the following alternative. If sig==0, just return 0
like you propose, because the process isn't completely gone. Otherwise, sleep
and retry the signal, like pgwin32_open_handle() retries after certain errors.
What do you think of that?

Alexander Lakhin

exclusion@gmail.com

over 1 year ago

In reply to: Noah Misch (#2)

2 attachment(s)

Re: Yet another way for pg_ctl stop to fail on Windows

07.09.2024 21:11, Noah Misch wrote:

Noah, what do you think of handling this error in line with handling of
ERROR_BROKEN_PIPE and ERROR_BAD_PIPE (which was done in 0ea1f2a3a)?

I tried the following change:
        switch (GetLastError())
        {
                case ERROR_BROKEN_PIPE:
                case ERROR_BAD_PIPE:
+               case ERROR_PIPE_BUSY:
and saw no issues.

That would be a strict improvement over returning EINVAL like we do today. We
do use PIPE_UNLIMITED_INSTANCES, so I expect the causes of ERROR_PIPE_BUSY are
process exit and ENOMEM-like situations. While that change is the best thing
if the process is exiting, it could silently drop the signal in ENOMEM-like
situations. Consider the following alternative. If sig==0, just return 0
like you propose, because the process isn't completely gone. Otherwise, sleep
and retry the signal, like pgwin32_open_handle() retries after certain errors.
What do you think of that?

Thank you for your attention to the issue!

I agree with your approach. It looks like Microsoft recommends to loop on
ERROR_PIPE_BUSY: [1]https://learn.microsoft.com/en-us/windows/win32/ipc/named-pipe-client (they say "Calling CallNamedPipe is equivalent to
calling the CreateFile ..." at [2]https://learn.microsoft.com/en-us/windows/win32/api/winbase/nf-winbase-callnamedpipea).

So if we aim to not only fix "pg_ctl stop", but to make pgkill() robust,
it's the way to go, IMHO. I'm not sure about an infinite loop they show,
I'd vote for a loop with the same characteristics as in
pgwin32_open_handle().

I've managed to trigger ERROR_PIPE_BUSY with "pg_ctl reload", when running
20 TAP tests (see attached) in parallel (with 20 threads executing
$node->reload() in each), and with the kill() call inside do_reload()
repeated x100 (see the modification patch attached too):
...
# Running: pg_ctl -D .../099_test_pgkill\data/t_099_test_pgkill_node_data/pgdata reload
### Reloading node "node"
# Running: pg_ctl -D .../099_test_pgkill\data/t_099_test_pgkill_node_data/pgdata reload
[13:41:46.850](2.400s) # 18
server signaled
server signaled
server signaled
server signaled
server signaled
server signaled
server signaled
server signaled
!!!pgkill| GetLastError(): 231
pg_ctl: could not send reload signal (PID: 3912), iteration: 81, res: -1, errno: 22: Invalid argument
server signaled
[13:41:52.594](5.744s) # 19
...

[1]: https://learn.microsoft.com/en-us/windows/win32/ipc/named-pipe-client
[2]: https://learn.microsoft.com/en-us/windows/win32/api/winbase/nf-winbase-callnamedpipea

Best regards,
Alexander

Attachments:

pgkill-debugging.patchtext/x-patch; charset=UTF-8; name=pgkill-debugging.patchDownload

diff --git a/src/bin/pg_ctl/pg_ctl.c b/src/bin/pg_ctl/pg_ctl.c
index e7e878c22f..cab6eb9472 100644
--- a/src/bin/pg_ctl/pg_ctl.c
+++ b/src/bin/pg_ctl/pg_ctl.c
@@ -1167,12 +1174,17 @@ do_reload(void)
 		exit(1);
 	}
 
-	if (kill(pid, sig) != 0)
+for (int i = 1; i <= 100; i++)
+{
+int res, en;
+	if ((res = kill(pid, sig)) != 0)
 	{
-		write_stderr(_("%s: could not send reload signal (PID: %d): %m\n"),
-					 progname, (int) pid);
+en = errno;
+		write_stderr(_("%s: could not send reload signal (PID: %d), iteration: %d, res: %d, errno: %d: %m\n"),
+					 progname, (int) pid, i, res, en);
 		exit(1);
 	}
+}
 
 	print_msg(_("server signaled\n"));
 }
diff --git a/src/port/kill.c b/src/port/kill.c
index 412c2f19c1..b3d2c00796 100644
--- a/src/port/kill.c
+++ b/src/port/kill.c
@@ -70,6 +70,7 @@ pgkill(int pid, int sig)
 		return 0;
 	}
 
+int le = GetLastError();
 	switch (GetLastError())
 	{
 		case ERROR_BROKEN_PIPE:
@@ -89,6 +90,7 @@ pgkill(int pid, int sig)
 			errno = EPERM;
 			return -1;
 		default:
+fprintf(stderr, "!!!pgkill| GetLastError(): %d\n", le);
 			errno = EINVAL;		/* unexpected */
 			return -1;
 	}

099_test_pgkill.plapplication/x-perl; name=099_test_pgkill.plDownload

Noah Misch

noah@leadboat.com

over 1 year ago

In reply to: Alexander Lakhin (#3)

Re: Yet another way for pg_ctl stop to fail on Windows

On Sun, Sep 08, 2024 at 06:00:00PM +0300, Alexander Lakhin wrote:

07.09.2024 21:11, Noah Misch wrote:

Noah, what do you think of handling this error in line with handling of
ERROR_BROKEN_PIPE and ERROR_BAD_PIPE (which was done in 0ea1f2a3a)?

I tried the following change:
ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ switch (GetLastError())
ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ {
ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ case ERROR_BROKEN_PIPE:
ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ case ERROR_BAD_PIPE:
+ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ case ERROR_PIPE_BUSY:
and saw no issues.

That would be a strict improvement over returning EINVAL like we do today. We
do use PIPE_UNLIMITED_INSTANCES, so I expect the causes of ERROR_PIPE_BUSY are
process exit and ENOMEM-like situations. While that change is the best thing
if the process is exiting, it could silently drop the signal in ENOMEM-like
situations. Consider the following alternative. If sig==0, just return 0
like you propose, because the process isn't completely gone. Otherwise, sleep
and retry the signal, like pgwin32_open_handle() retries after certain errors.
What do you think of that?

I agree with your approach. It looks like Microsoft recommends to loop on
ERROR_PIPE_BUSY: [1] (they say "Calling CallNamedPipe is equivalent to
calling the CreateFile ..." at [2]).

I see Microsoft suggests WaitNamedPipeA() as opposed to just polling.
WaitNamedPipeA() should be more responsive. Given how rare this has been, it
likely doesn't matter whether we use WaitNamedPipeA() or polling. I'd lean
toward whichever makes the code simpler, probably polling.

So if we aim to not only fix "pg_ctl stop", but to make pgkill() robust,
it's the way to go, IMHO. I'm not sure about an infinite loop they show,
I'd vote for a loop with the same characteristics as in
pgwin32_open_handle().

I agree with bounding the total time of each kill(), like
pgwin32_open_handle() does for open().

Show quoted text

[1] https://learn.microsoft.com/en-us/windows/win32/ipc/named-pipe-client
[2] https://learn.microsoft.com/en-us/windows/win32/api/winbase/nf-winbase-callnamedpipea