REL_15_STABLE: pgbench tests randomly failing on CI, Windows only

Started by Thomas Munroover 2 years ago6 messages
#1Thomas Munro
thomas.munro@gmail.com

Hi,

Somehow these tests have recently become unstable and have failed a few times:

https://github.com/postgres/postgres/commits/REL_15_STABLE

The failures are like:

[22:32:26.722] # Failed test 'pgbench simple update stdout
/(?^:builtin: simple update)/'
[22:32:26.722] # at t/001_pgbench_with_server.pl line 119.
[22:32:26.722] # 'pgbench (15.4)
[22:32:26.722] # '
[22:32:26.722] # doesn't match '(?^:builtin: simple update)'

#2Noah Misch
noah@leadboat.com
In reply to: Thomas Munro (#1)
1 attachment(s)
Re: REL_15_STABLE: pgbench tests randomly failing on CI, Windows only

On Mon, Sep 04, 2023 at 03:18:40PM +1200, Thomas Munro wrote:

Somehow these tests have recently become unstable and have failed a few times:

https://github.com/postgres/postgres/commits/REL_15_STABLE

The failures are like:

[22:32:26.722] # Failed test 'pgbench simple update stdout
/(?^:builtin: simple update)/'
[22:32:26.722] # at t/001_pgbench_with_server.pl line 119.
[22:32:26.722] # 'pgbench (15.4)
[22:32:26.722] # '
[22:32:26.722] # doesn't match '(?^:builtin: simple update)'

Fun. That's a test of "pgbench -C". The test harness isn't reporting
pgbench's stderr, so I hacked things to get that and the actual file
descriptor values being assigned. The failure case gets "pgbench: error: too
many client connections for select()" in stderr, from this pgbench.c function:

static void
add_socket_to_set(socket_set *sa, int fd, int idx)
{
if (fd < 0 || fd >= FD_SETSIZE)
{
/*
* Doing a hard exit here is a bit grotty, but it doesn't seem worth
* complicating the API to make it less grotty.
*/
pg_fatal("too many client connections for select()");
}
FD_SET(fd, &sa->fds);
if (fd > sa->maxfd)
sa->maxfd = fd;
}

The "fd >= FD_SETSIZE" check is irrelevant on Windows. See comments in the
attached patch; in brief, Windows assigns FDs and uses FD_SETSIZE differently.
The first associated failure was commit dea12a1 (2023-08-03); as a doc commit,
it's an innocent victim. Bisect blamed 8488bab "ci: Use windows VMs instead
of windows containers" (2023-02), long before the failures began. I'll guess
some 2023-08 Windows update or reconfiguration altered file descriptor
assignment, hence the onset of failures. In my tests of v16, the highest file
descriptor was 948. I could make v16 fail by changing --client=5 to
--client=90 in the test. With the attached patch and --client=90, v16 peaked
at file descriptor 2040.

Thanks,
nm

P.S. Later, we should change test code so the pgbench stderr can't grow an
extra line without that line appearing in test logs.

Attachments:

w32-FD_SETSIZE-guards-v1.patchtext/plain; charset=us-asciiDownload
Author:     Noah Misch <noah@leadboat.com>
Commit:     Noah Misch <noah@leadboat.com>

    Don't spuriously report FD_SETSIZE exhaustion on Windows.
    
    Starting on 2023-08-03, this intermittently terminated a "pgbench -C"
    test in CI.  It could affect a high-client-count "pgbench" without "-C".
    While parallel reindexdb and vacuumdb reach the same problematic check,
    sufficient client count and/or connection turnover is less plausible for
    them.  Given the lack of examples from the buildfarm or from manual
    builds, reproducing this must entail rare operating system
    configurations.  Also correct the associated error message, which was
    wrong for non-Windows and did not comply with the style guide.
    Back-patch to v12, where the pgbench check first appeared.  While v11
    vacuumdb has the problematic check, reaching it with typical vacuumdb
    usage is implausible.
    
    Reviewed by FIXME.
    
    Discussion: https://postgr.es/m/CA+hUKG+JwvTNdcyJTriy9BbtzF1veSRQ=9M_ZKFn9_LqE7Kp7Q@mail.gmail.com

diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index e3919395..a1f566c 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -7837,14 +7837,22 @@ clear_socket_set(socket_set *sa)
 static void
 add_socket_to_set(socket_set *sa, int fd, int idx)
 {
+	/* See connect_slot() for background on this code. */
+#ifdef WIN32
+	if (sa->fds.fd_count + 1 >= FD_SETSIZE)
+	{
+		pg_log_error("too many concurrent database clients for this platform: %d",
+					 sa->fds.fd_count + 1);
+		exit(1);
+	}
+#else
 	if (fd < 0 || fd >= FD_SETSIZE)
 	{
-		/*
-		 * Doing a hard exit here is a bit grotty, but it doesn't seem worth
-		 * complicating the API to make it less grotty.
-		 */
-		pg_fatal("too many client connections for select()");
+		pg_log_error("socket file descriptor out of range: %d", fd);
+		pg_log_error_hint("Try fewer concurrent database clients.");
+		exit(1);
 	}
+#endif
 	FD_SET(fd, &sa->fds);
 	if (fd > sa->maxfd)
 		sa->maxfd = fd;
diff --git a/src/fe_utils/parallel_slot.c b/src/fe_utils/parallel_slot.c
index aca238b..2ee4b6d 100644
--- a/src/fe_utils/parallel_slot.c
+++ b/src/fe_utils/parallel_slot.c
@@ -295,8 +295,40 @@ connect_slot(ParallelSlotArray *sa, int slotno, const char *dbname)
 	slot->connection = connectDatabase(sa->cparams, sa->progname, sa->echo, false, true);
 	sa->cparams->override_dbname = old_override;
 
-	if (PQsocket(slot->connection) >= FD_SETSIZE)
-		pg_fatal("too many jobs for this platform");
+	/*
+	 * POSIX defines FD_SETSIZE as the highest file descriptor acceptable to
+	 * FD_SET() and allied macros.  Windows defines it as a ceiling on the
+	 * count of file descriptors in the set, not a ceiling on the value of
+	 * each file descriptor; see
+	 * https://learn.microsoft.com/en-us/windows/win32/api/winsock2/nf-winsock2-select
+	 * and
+	 * https://learn.microsoft.com/en-us/windows/win32/api/winsock/ns-winsock-fd_set.
+	 * We can't ignore that, because Windows starts file descriptors at a
+	 * higher value, delays reuse, and skips values.  With less than ten
+	 * concurrent file descriptors, opened and closed rapidly, one can reach
+	 * file descriptor 1024.
+	 *
+	 * Doing a hard exit here is a bit grotty, but it doesn't seem worth
+	 * complicating the API to make it less grotty.
+	 */
+#ifdef WIN32
+	if (slotno >= FD_SETSIZE)
+	{
+		pg_log_error("too many jobs for this platform: %d", slotno);
+		exit(1);
+	}
+#else
+	{
+		int			fd = PQsocket(slot->connection);
+
+		if (fd >= FD_SETSIZE)
+		{
+			pg_log_error("socket file descriptor out of range: %d", fd);
+			pg_log_error_hint("Try fewer jobs.");
+			exit(1);
+		}
+	}
+#endif
 
 	/* Setup the connection using the supplied command, if any. */
 	if (sa->initcmd)
#3Thomas Munro
thomas.munro@gmail.com
In reply to: Noah Misch (#2)
Re: REL_15_STABLE: pgbench tests randomly failing on CI, Windows only

On Mon, Oct 9, 2023 at 3:25 PM Noah Misch <noah@leadboat.com> wrote:

The "fd >= FD_SETSIZE" check is irrelevant on Windows. See comments in the
attached patch; in brief, Windows assigns FDs and uses FD_SETSIZE differently.
The first associated failure was commit dea12a1 (2023-08-03); as a doc commit,
it's an innocent victim. Bisect blamed 8488bab "ci: Use windows VMs instead
of windows containers" (2023-02), long before the failures began. I'll guess
some 2023-08 Windows update or reconfiguration altered file descriptor
assignment, hence the onset of failures. In my tests of v16, the highest file
descriptor was 948. I could make v16 fail by changing --client=5 to
--client=90 in the test. With the attached patch and --client=90, v16 peaked
at file descriptor 2040.

Ohhh. Thanks for figuring that out. I'd never noticed that quirk. I
didn't/can't test it but the patch looks reasonable after reading the
referenced docs. Maybe instead of just "out of range" I'd say "out of
range for select()" since otherwise it might seem a little baffling --
what range?

Random water cooler speculation about future ideas: I wonder
whether/when we can eventually ditch select() and use WSAPoll() for
this on Windows, which is supposed to be like poll(). There's a
comment explaining that we prefer select() because it has a higher
resolution sleep than poll() (us vs ms), so we wouldn't want to use
poll() on Unixen, but we know that Windows doesn't even use high
resolution timers for any user space APIs anyway so that's just not a
concern on that platform. The usual reason nobody ever uses WSAPoll()
is because the Curl guys told[1]https://daniel.haxx.se/blog/2012/10/10/wsapoll-is-broken/ everyone that it's terrible in a way
that would quite specifically break our usage. But I wonder, because
the documentation now says "As of Windows 10 version 2004, when a TCP
socket fails to connect, (POLLHUP | POLLERR | POLLWRNORM) is
indicated", it *sounds* like it might have been fixed ~3 years ago?
One idea would be to hide it inside WaitEventSet, and let WaitEventSet
be used in front end code (we couldn't use the
WaitForMultipleObjects() version because it's hard-limited to 64
events, but in front end code we don't need latches and other stuff,
so we could have a sockets-only version with WSAPoll()). The idea of
using WaitEventSet for pgbench on Unix was already mentioned a couple
of times by others for general scalability reasons -- that way we
could finish up using a better kernel interface on all supported
platforms. We'd probably want to add high resolution time-out support
(I already posted a patch for that somewhere), or we'd be back to 1ms
timeouts.

[1]: https://daniel.haxx.se/blog/2012/10/10/wsapoll-is-broken/

#4Noah Misch
noah@leadboat.com
In reply to: Thomas Munro (#3)
Re: REL_15_STABLE: pgbench tests randomly failing on CI, Windows only

On Mon, Oct 09, 2023 at 04:22:52PM +1300, Thomas Munro wrote:

On Mon, Oct 9, 2023 at 3:25 PM Noah Misch <noah@leadboat.com> wrote:

The "fd >= FD_SETSIZE" check is irrelevant on Windows. See comments in the
attached patch; in brief, Windows assigns FDs and uses FD_SETSIZE differently.
The first associated failure was commit dea12a1 (2023-08-03); as a doc commit,
it's an innocent victim. Bisect blamed 8488bab "ci: Use windows VMs instead
of windows containers" (2023-02), long before the failures began. I'll guess
some 2023-08 Windows update or reconfiguration altered file descriptor
assignment, hence the onset of failures. In my tests of v16, the highest file
descriptor was 948. I could make v16 fail by changing --client=5 to
--client=90 in the test. With the attached patch and --client=90, v16 peaked
at file descriptor 2040.

Ohhh. Thanks for figuring that out. I'd never noticed that quirk. I
didn't/can't test it but the patch looks reasonable after reading the
referenced docs.

For what it's worth, I did all that testing through CI, using hacks like
https://cirrus-ci.com/task/5352974499708928 to get the relevant information.
I didn't bother trying non-CI, since buildfarm animals aren't failing.

Maybe instead of just "out of range" I'd say "out of
range for select()" since otherwise it might seem a little baffling --
what range?

I was trying to follow this from the style guide:

Avoid mentioning called function names, either; instead say what the code was trying to do:
BAD: open() failed: %m
BETTER: could not open file %s: %m

I didn't think of any phrasing that clearly explained things without the
reader consulting the code. I considered these:

"socket file descriptor out of range: %d" [what range?]
"socket file descriptor out of range for select(): %d" [style guide violation]
"socket file descriptor out of range for checking whether ready for reading: %d" [?]
"socket file descriptor out of range for synchronous I/O multiplexing: %d" [term from POSIX]

Random water cooler speculation about future ideas: I wonder
whether/when we can eventually ditch select() and use WSAPoll() for
this on Windows, which is supposed to be like poll(). There's a
comment explaining that we prefer select() because it has a higher
resolution sleep than poll() (us vs ms), so we wouldn't want to use
poll() on Unixen, but we know that Windows doesn't even use high
resolution timers for any user space APIs anyway so that's just not a
concern on that platform. The usual reason nobody ever uses WSAPoll()
is because the Curl guys told[1] everyone that it's terrible in a way
that would quite specifically break our usage. But I wonder, because
the documentation now says "As of Windows 10 version 2004, when a TCP
socket fails to connect, (POLLHUP | POLLERR | POLLWRNORM) is
indicated", it *sounds* like it might have been fixed ~3 years ago?
One idea would be to hide it inside WaitEventSet, and let WaitEventSet
be used in front end code (we couldn't use the
WaitForMultipleObjects() version because it's hard-limited to 64
events, but in front end code we don't need latches and other stuff,
so we could have a sockets-only version with WSAPoll()). The idea of
using WaitEventSet for pgbench on Unix was already mentioned a couple
of times by others for general scalability reasons -- that way we
could finish up using a better kernel interface on all supported
platforms. We'd probably want to add high resolution time-out support
(I already posted a patch for that somewhere), or we'd be back to 1ms
timeouts.

[1] https://daniel.haxx.se/blog/2012/10/10/wsapoll-is-broken/

Interesting. I have no current knowledge to add there.

#5David G. Johnston
david.g.johnston@gmail.com
In reply to: Noah Misch (#4)
Re: REL_15_STABLE: pgbench tests randomly failing on CI, Windows only

On Sun, Oct 8, 2023 at 9:10 PM Noah Misch <noah@leadboat.com> wrote:

I didn't think of any phrasing that clearly explained things without the
reader consulting the code. I considered these:

"socket file descriptor out of range: %d" [what range?]

Quick drive-by...but it seems that < 0 is a distinctly different problem
than exceeding FD_SETSIZE. I'm unsure what would cause the former but the
error for the later seems like:

error: "Requested socket file descriptor %d exceeds connection limit of
%d", fd, FD_SETSIZE-1
hint: Reduce the requested number of concurrent connections

In short, the concept of range doesn't seem applicable here. There is an
error state at the max, and some invalid system state condition where the
position within a set is somehow negative. These should be separated -
with the < 0 check happening first.

And apparently this condition isn't applicable when dealing with jobs in
connect_slot? Also, I see that for connections we immediately issue FD_SET
so this check on the boundary of the file descriptor makes sense. But the
remaining code in connect_slot doesn't involve FD_SET so the test for the
file descriptor falling within its maximum seems to be coming out of
nowhere. Likely this is all good, and the lack of symmetry is just due to
the natural progressive development of the code, but it stands out to the
uninitiated looking at this patch.

David J.

#6Noah Misch
noah@leadboat.com
In reply to: David G. Johnston (#5)
Re: REL_15_STABLE: pgbench tests randomly failing on CI, Windows only

On Sun, Oct 08, 2023 at 10:00:03PM -0700, David G. Johnston wrote:

On Sun, Oct 8, 2023 at 9:10 PM Noah Misch <noah@leadboat.com> wrote:

I didn't think of any phrasing that clearly explained things without the
reader consulting the code. I considered these:

I'm leaning toward:

"socket file descriptor out of range for select(): %d" [style guide violation]

A true style guide purist might bury it in a detail message:

ERROR: socket file descriptor out of range: %d
DETAIL: select() accepts file descriptors from 0 to 1023, inclusive, in this build.
HINT: Try fewer concurrent database clients.

"socket file descriptor out of range: %d" [what range?]

Quick drive-by...but it seems that < 0 is a distinctly different problem
than exceeding FD_SETSIZE. I'm unsure what would cause the former but the
error for the later seems like:

error: "Requested socket file descriptor %d exceeds connection limit of
%d", fd, FD_SETSIZE-1
hint: Reduce the requested number of concurrent connections

In short, the concept of range doesn't seem applicable here. There is an
error state at the max, and some invalid system state condition where the
position within a set is somehow negative. These should be separated -
with the < 0 check happening first.

I view it as: the range of select()-able file descriptors is [0,FD_SETSIZE).
Negative is out of range.

And apparently this condition isn't applicable when dealing with jobs in
connect_slot?

For both pgbench.c and parallel_slot.c, there are sufficient negative-FD
checks elsewhere in the file. Ideally, either both files would have redundant
checks, or neither file would. I didn't feel the need to mess with that part
of the status quo.

Also, I see that for connections we immediately issue FD_SET
so this check on the boundary of the file descriptor makes sense. But the
remaining code in connect_slot doesn't involve FD_SET so the test for the
file descriptor falling within its maximum seems to be coming out of
nowhere. Likely this is all good, and the lack of symmetry is just due to
the natural progressive development of the code, but it stands out to the
uninitiated looking at this patch.

True. The approach in parallel_slot.c is to check the FD number each time it
opens a socket, after which its loop with FD_SET() doesn't recheck. That's a
bit more efficient than the pgbench.c way, because each file may call FD_SET()
many times per socket. Again, I didn't mess with that part of the status quo.