BackgroundPsql swallowing errors on windows

Started by Andres Freund · about 1 year ago · 40 messages
#1 Andres Freund
andres@anarazel.de

Hi,

One of the remaining tasks for AIO was to convert the tests to be proper tap
tests. I did that and was thanked by the tests fairly randomly failing on
windows. Which test fails changes from run to run.

The symptom is that BackgroundPsql->query() sometimes would simply swallow
errors that were clearly generated by the backend. Which then would cause
tests to fail, because testing for errors was the whole point of the test.

At first I thought the issue was that psql didn't actually flush stderr after
displaying errors. And while that may be an issue, it doesn't seem to be the
one causing problems for me.

Lots of hair-pulling later, I have a somewhat confirmed theory for what's
happening:

BackgroundPsql::query() tries to detect if the passed in query has executed by
adding a "banner" after the query and using pump_until() to wait until that
banner has been "reached". That seems to work reasonably well on !windows.

On windows however, it looks like there's no guarantee that if stdout has been
received by IPC::Run, stderr also has been received, even if the stderr
content has been generated first. I tried to add an extra ->pump_nb() call to
query(), thinking that maybe IPC::Run just didn't get input that had actually
arrived, due to waiting on just one pipe. But no success.

My understanding is that IPC::Run uses a proxy process on windows to execute
subprocesses and then communicates with that over TCP (or something along
those lines). I suspect what's happening is that the communication with the
external process allows for reordering between stdout/stderr.

And indeed, changing BackgroundPsql::query() to emit the banner on both stdout
and stderr and waiting on both seems to fix the issue.

One complication is that I found that just waiting for the banner, without
also its newline, sometimes led to unexpected newlines causing later queries
to fail. I think that happens if the trailing newline is read separately from
the rest of the string.

However, matching the newlines caused tests to fail on some machines. After a
lot of cursing I figured out that for interactive psql we output \r\n, causing
my regex match to fail. I.e. tests failed whenever IO::Pty was available...

It's also not correct, as we did before, to just look for the banner; it has
to be anchored to either the start of the output or a newline, otherwise the
\echo (or \warn) command itself will be matched by pump_until() (but then
replacing the command would fail). Not sure that could cause active problems
without the addition of \warn (which is also echoed on stdout), but it
certainly could after.
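The anchoring and \r\n subtleties can be illustrated with a toy regex check (Python, with hypothetical interactive-psql output; the real matching lives in BackgroundPsql.pm):

```python
import re

banner = "background_psql: QUERY_SEPARATOR 1:"
# With a pty (IO::Pty available), psql echoes the command itself and
# line endings become \r\n. Invented captured stdout:
out = "postgres=# \\echo " + banner + "\r\n" + banner + "\r\n"

# Unanchored: the match lands inside the echoed \echo command line, so
# the wait would end before the real banner (and any stderr) arrived.
naive = re.search(re.escape(banner), out)
assert out[:naive.start()].endswith("\\echo ")

# Anchored to start-of-output or a newline, with the \r optional for
# the pty case: only the genuine banner line matches.
anchored = re.search(r"(^|\n)" + re.escape(banner) + r"\r?\n", out)
assert anchored.start() > naive.start()
assert out[anchored.start():].lstrip("\n").startswith(banner)
```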

The banner being the same between queries made it hard to understand if a
banner that appeared in the output was from the current query or a past
query. Therefore I added a counter to it.

For debugging I added a "note" that shows stdout/stderr after executing the
query, I think it may be worth keeping that, but I'm not sure.

This was a rather painful exercise.

Greetings,

Andres Freund

Attachments:

v1-0001-BackgroundPsql-Fix-potential-for-lost-errors-on-w.patch (text/x-diff; charset=us-ascii, +43/-13)
#2 Daniel Gustafsson
daniel@yesql.se
In reply to: Andres Freund (#1)
Re: BackgroundPsql swallowing errors on windows

On 13 Feb 2025, at 18:39, Andres Freund <andres@anarazel.de> wrote:

The banner being the same between queries made it hard to understand if a
banner that appeared in the output was from the current query or a past
query. Therefore I added a counter to it.

+	my $banner = "background_psql: QUERY_SEPARATOR $query_cnt";
+	my $banner_match = qr/(^|\n)$banner\r?\n/;
+	$self->{stdin} .= "$query\n;\n\\echo $banner\n\\warn $banner\n";
+	pump_until(
+		$self->{run}, $self->{timeout},
+		\$self->{stdout}, qr/$banner_match/);

Won't this allow "QUERY_SEPARATOR 11" to match against "QUERY_SEPARATOR 1"?
It's probably only of academic interest but appending an end-of-banner
character like "_" or something after the query counter should fix that.
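A toy illustration of the collision (Python; the real matching happens in Perl's pump_until):

```python
import re

# Without a terminator, banner N is a prefix of banners N0, N1, ...
# so a substring search for counter 1 fires on counter 11's output:
assert re.search("QUERY_SEPARATOR 1", "QUERY_SEPARATOR 11\n") is not None

# With an end-of-banner character (":" is what was picked later in the
# thread), the shorter banner can no longer match inside the longer one:
assert re.search("QUERY_SEPARATOR 1:", "QUERY_SEPARATOR 11:\n") is None
```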

For debugging I added a "note" that shows stdout/stderr after executing the
query, I think it may be worth keeping that, but I'm not sure.

I think it could be worth it, +1 for keeping it until it's been proven a
problem somewhere.

This was a rather painful exercise.

I have no trouble believing that.

--
Daniel Gustafsson

#3 Andrew Dunstan
andrew@dunslane.net
In reply to: Andres Freund (#1)
Re: BackgroundPsql swallowing errors on windows

On 2025-02-13 Th 12:39 PM, Andres Freund wrote:

Hi,

One of the remaining tasks for AIO was to convert the tests to be proper tap
tests. I did that and was thanked by the tests fairly randomly failing on
windows. Which test fails changes from run to run.

[...]

This was a rather painful exercise.

It's been discussed before, but I'd really really like to get rid of
BackgroundPsql. It's ugly, non-intuitive and fragile.

Last year I did some work on this. I was going to present it at Athens
but illness prevented me, and then other life events managed to get in
my way. But the basic work is around. See
<https://github.com/adunstan/test-pq/commit/98518e4929e80fb96f210bbc5aab9fdcea058512>
This introduces a libpq session object (PostgreSQL::Test::Session) which
can be backed either by FFI or a small XS wrapper - the commit has
recipes for both. Missing is a meson.build file for the XS wrapper.
There are significant performance gains to be had too (poll_query_until
is much nicer, for example, as are most uses of safe_psql). If there is 
interest I will bring the work up to date, and maybe we can introduce
this early in the v19 cycle. It would significantly reduce our
dependency on IPC::Run, especially the pump() stuff.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

#4 Andres Freund
andres@anarazel.de
In reply to: Daniel Gustafsson (#2)
Re: BackgroundPsql swallowing errors on windows

Hi,

On 2025-02-14 13:35:40 +0100, Daniel Gustafsson wrote:

On 13 Feb 2025, at 18:39, Andres Freund <andres@anarazel.de> wrote:

The banner being the same between queries made it hard to understand if a
banner that appeared in the output was from the current query or a past
query. Therefore I added a counter to it.

+	my $banner = "background_psql: QUERY_SEPARATOR $query_cnt";
+	my $banner_match = qr/(^|\n)$banner\r?\n/;
+	$self->{stdin} .= "$query\n;\n\\echo $banner\n\\warn $banner\n";
+	pump_until(
+		$self->{run}, $self->{timeout},
+		\$self->{stdout}, qr/$banner_match/);

Won't this allow "QUERY_SEPARATOR 11" to match against "QUERY_SEPARATOR 1"?
It's probably only of academic interest but appending an end-of-banner
character like "_" or something after the query counter should fix that.

You're right. I went with ":".

Thanks for reviewing!

Updated patch attached.

I also applied similar changes to wait_connect(), it had the same issues as
query(). This mostly matters for interactive_psql() uses. The fact that we
matched on the \echo itself led to the first query() having additional query
output, along the lines of

\echo background_psql: ready
psql (18devel)
Type "help" for help.

postgres=# \echo background_psql: ready
background_psql: ready

Which is rather confusing and can throw off checks based on the query results.

It does seem rather weird that psql outputs its input twice in this case, but
that's a separate issue.

I was thinking that this really ought to be backported; debugging failures due
to the set of fixed bugs is really painful (it just cost me 1 1/2 days), but
unfortunately there have been a bunch of recent changes that haven't been
backpatched:

commit 70291a3c66e
Author: Michael Paquier <michael@paquier.xyz>
Date: 2024-11-07 12:11:27 +0900

Improve handling of empty query results in BackgroundPsql::query()

commit ba08edb0654
Author: Michael Paquier <michael@paquier.xyz>
Date: 2024-11-06 15:31:14 +0900

Extend Cluster.pm's background_psql() to be able to start asynchronously

Particularly the former makes it hard to backpatch, as it's a behavioural
difference that really interacts with the problems described in this thread.

Michael, Jacob, thoughts?

Greetings,

Andres Freund

Attachments:

v2-0001-tests-BackgroundPsql-Fix-potential-for-lost-error.patch (text/x-diff; charset=us-ascii, +59/-20)
#5 Andres Freund
andres@anarazel.de
In reply to: Andrew Dunstan (#3)
Re: BackgroundPsql swallowing errors on windows

Hi,

On 2025-02-14 08:14:45 -0500, Andrew Dunstan wrote:

It's been discussed before, but I'd really really like to get rid of
BackgroundPsql. It's ugly, non-intuitive and fragile.

I agree, unfortunately we're stuck with this until we have a better
alternative in tree :(

Last year I did some work on this. I was going to present it at Athens but
illness prevented me, and then other life events managed to get in my way.
But the basic work is around. See <https://github.com/adunstan/test-pq/commit/98518e4929e80fb96f210bbc5aab9fdcea058512>
This introduces a libpq session object (PostgreSQL::Test::Session) which can
be backed either by FFI or a small XS wrapper - the commit has recipes for
both. Missing is a meson.build file for the XS wrapper. There are
significant performance gains to be had too (poll_query_until is much nicer,
for example, as are most uses of safe_psql). If there is interest I will
bring the work up to date, and maybe we can introduce this early in the v19
cycle. It would significantly reduce our dependency on IPC::Run, especially
the pump() stuff.

I definitely am interested.

Greetings,

Andres Freund

#6 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#5)
Re: BackgroundPsql swallowing errors on windows

Andres Freund <andres@anarazel.de> writes:

On 2025-02-14 08:14:45 -0500, Andrew Dunstan wrote:

... If there is interest I will
bring the work up to date, and maybe we can introduce this early in the v19
cycle. It would significantly reduce our dependency on IPC::Run, especially
the pump() stuff.

I definitely am interested.

+1. Would there be a chance of removing our use of IPC::Run entirely?
There'd be a lot to like about that.

regards, tom lane

#7 Jacob Champion
jacob.champion@enterprisedb.com
In reply to: Andres Freund (#4)
Re: BackgroundPsql swallowing errors on windows

On Fri, Feb 14, 2025 at 8:53 AM Andres Freund <andres@anarazel.de> wrote:

commit 70291a3c66e
Author: Michael Paquier <michael@paquier.xyz>
Date: 2024-11-07 12:11:27 +0900

Improve handling of empty query results in BackgroundPsql::query()

commit ba08edb0654
Author: Michael Paquier <michael@paquier.xyz>
Date: 2024-11-06 15:31:14 +0900

Extend Cluster.pm's background_psql() to be able to start asynchronously

Particularly the former makes it hard to backpatch, as it's a behavioural
difference that really interacts with the problems described in this thread.

Michael, Jacob, thoughts?

I think both should be backpatchable without too much risk, though
it's possible that there are more useless ok() calls in back branches
that would need to be touched when the first patch goes back. If we're
concerned about the second for any reason, the only conflicting part
should be the name and documentation of wait_connect, right?

--Jacob

#8 Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#6)
Re: BackgroundPsql swallowing errors on windows

Hi,

On 2025-02-14 12:14:55 -0500, Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

On 2025-02-14 08:14:45 -0500, Andrew Dunstan wrote:

... If there is interest I will
bring the work up to date, and maybe we can introduce this early in the v19
cycle. It would significantly reduce our dependency on IPC::Run, especially
the pump() stuff.

I definitely am interested.

+1. Would there be a chance of removing our use of IPC::Run entirely?
There'd be a lot to like about that.

Unfortunately I doubt it'd get us that close to that.

We do have several tests that intentionally involve psql,
e.g. 010_tab_completion.pl, 001_password.pl. Presumably those would continue
to use something like BackgroundPsql.pm, even if we had a sane way to have
long-running connections in tap tests.

There also are a bunch of things we use IPC::Run to run synchronously; we
probably could replace most of those without too much pain. The hardest
probably would be timeout handling.

Greetings,

Andres Freund

#9 Andres Freund
andres@anarazel.de
In reply to: Jacob Champion (#7)
Re: BackgroundPsql swallowing errors on windows

Hi,

On 2025-02-14 09:52:24 -0800, Jacob Champion wrote:

On Fri, Feb 14, 2025 at 8:53 AM Andres Freund <andres@anarazel.de> wrote:

commit 70291a3c66e
Author: Michael Paquier <michael@paquier.xyz>
Date: 2024-11-07 12:11:27 +0900

Improve handling of empty query results in BackgroundPsql::query()

commit ba08edb0654
Author: Michael Paquier <michael@paquier.xyz>
Date: 2024-11-06 15:31:14 +0900

Extend Cluster.pm's background_psql() to be able to start asynchronously

Particularly the former makes it hard to backpatch, as it's a behavioural
difference that really interacts with the problems described in this thread.

Michael, Jacob, thoughts?

I think both should be backpatchable without too much risk

I think so too. Obviously it'll have to wait until later next week due to the
new minor releases, but after that I think we should backpatch them.

though it's possible that there are more useless ok() calls in back branches
that would need to be touched when the first patch goes back.

FWIW, I don't really agree that such ok() calls are useless, they do make the
output of the test more readable, particularly if the test ends up hanging.
But that's really just an aside.

If we're concerned about the second for any reason, the only conflicting
part should be the name and documentation of wait_connect, right?

It doesn't seem concerning to me either. The first commit seems much more
likely to cause trouble and even that seems ok. Even if it were to cause
problem for an extension (which I think is rather unlikely), it shouldn't be
too hard to fix.

Greetings,

Andres Freund

#10 Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#1)
Re: BackgroundPsql swallowing errors on windows

On Thu, Feb 13, 2025 at 12:39:04PM -0500, Andres Freund wrote:

My understanding is that IPC::Run uses a proxy process on windows to execute
subprocesses and then communicates with that over TCP (or something along
those lines).

Right.

I suspect what's happening is that the communication with the
external process allows for reordering between stdout/stderr.

And indeed, changing BackgroundPsql::query() to emit the banner on both stdout
and stderr and waiting on both seems to fix the issue.

That makes sense. I wondered how one might fix IPC::Run to preserve the
relative timing of stdout and stderr, not perturbing the timing the way that
disrupted your test run. I can think of two strategies:

- Remove the proxy.

- Make pipe data visible to Perl variables only when at least one of the
proxy<-program_under_test pipes had no data ready to read. In other words,
if both pipes have data ready, make all that data visible to Perl code
simultaneously. (When both the stdout pipe and the stderr pipe have data
ready, one can't determine data arrival order.)

Is there a possibly-less-invasive change that might work?

The banner being the same between queries made it hard to understand if a
banner that appeared in the output was from the current query or a past
query. Therefore I added a counter to it.

Sounds good.

For debugging I added a "note" that shows stdout/stderr after executing the
query, I think it may be worth keeping that, but I'm not sure.

It should be okay to keep. We're not likely to funnel huge amounts of data
through BackgroundPsql. If we ever do that, we could just skip the "note" for
payloads larger than some threshold.

v2 of the patch looks fine.

#11 Andres Freund
andres@anarazel.de
In reply to: Noah Misch (#10)
Re: BackgroundPsql swallowing errors on windows

Hi,

On 2025-02-16 09:39:43 -0800, Noah Misch wrote:

On Thu, Feb 13, 2025 at 12:39:04PM -0500, Andres Freund wrote:

I suspect what's happening is that the communication with the
external process allows for reordering between stdout/stderr.

And indeed, changing BackgroundPsql::query() to emit the banner on both stdout
and stderr and waiting on both seems to fix the issue.

That makes sense. I wondered how one might fix IPC::Run to preserve the
relative timing of stdout and stderr, not perturbing the timing the way that
disrupted your test run. I can think of two strategies:

- Remove the proxy.

- Make pipe data visible to Perl variables only when at least one of the
proxy<-program_under_test pipes had no data ready to read. In other words,
if both pipes have data ready, make all that data visible to Perl code
simultaneously. (When both the stdout pipe and the stderr pipe have data
ready, one can't determine data arrival order.)

Is there a possibly-less-invasive change that might work?

I don't really know enough about IPC::Run's internals to answer. My
interpretation of how it might work, purely from observation, is that it opens
one tcp connection for each "pipe" and that that's what's introducing the
potential of reordering, as the different sockets can have different delivery
timeframes. If that's it, it seems proxying all the pipes through one
connection might be an option.

I did briefly look at IPC::Run's code, but couldn't figure out how it all fits
together quickly enough...

v2 of the patch looks fine.

Thanks for reviewing!

Greetings,

Andres Freund

#12 Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#11)
Re: BackgroundPsql swallowing errors on windows

On Sun, Feb 16, 2025 at 01:02:01PM -0500, Andres Freund wrote:

On 2025-02-16 09:39:43 -0800, Noah Misch wrote:

[...]

I don't really know enough about IPC::Run's internals to answer. My
interpretation of how it might work, purely from observation, is that it opens
one tcp connection for each "pipe" and that that's what's introducing the
potential of reordering, as the different sockets can have different delivery
timeframes.

Right.

If that's it, it seems proxying all the pipes through one
connection might be an option.

It would. Thanks. However, I think that would entail modifying the program
under test to cooperate with the arrangement. When running an ordinary
program that does write(1, ...) and write(2, ...), the read end needs some way
to deal with the uncertainty about which write happened first. dup2(1, 2)
solves the order ambiguity, but it loses other signal.

I did briefly look at IPC::Run's code, but couldn't figure out how it all fits
together quickly enough...

We can ignore what IPC::Run does today. The reordering is a general problem
of proxying >1 pipe. (Proxying on non-Windows would have the same problem.)

#13 Andres Freund
andres@anarazel.de
In reply to: Noah Misch (#12)
Re: BackgroundPsql swallowing errors on windows

Hi,

On 2025-02-16 10:47:40 -0800, Noah Misch wrote:

[...]

If that's it, it seems proxying all the pipes through one
connection might be an option.

It would. Thanks. However, I think that would entail modifying the program
under test to cooperate with the arrangement. When running an ordinary
program that does write(1, ...) and write(2, ...), the read end needs some way
to deal with the uncertainty about which write happened first. dup2(1, 2)
solves the order ambiguity, but it loses other signal.

I think what's happening in this case must go beyond just that. Afaict just
doing ->pump_nb() would otherwise solve it. My uninformed theory is that two
tcp connections are used. With two pipes

P1: write(1)
P1: write(2)
P2: read(1)
P2: read(2)

wouldn't ever result in P2 not seeing data on either of the reads. But with two
TCP sockets there can be time between the send() completing and the recv() on
the other side reading the data, even on a local system (e.g. due to the TCP
stack waiting a while for more data before sending it).

To avoid that the proxy program could read from N pipes and then proxy them
through *one* socket by prefixing the data with information about which pipe
the data is from. Then IPC::Run could split the data again, using the added
prefix.

I don't think that would require modifying the program under test?
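A minimal sketch of that idea (Python, with an invented frame format; this is not anything IPC::Run actually implements): the proxy tags each chunk it reads with the originating fd and forwards everything over one stream, so cross-stream arrival order survives end to end.

```python
import struct

# Frame = 1-byte fd tag + 4-byte big-endian length + payload.
def mux(chunks):
    out = b""
    for fd, payload in chunks:
        out += struct.pack("!BI", fd, len(payload)) + payload
    return out

def demux(stream):
    chunks, off = [], 0
    while off < len(stream):
        fd, length = struct.unpack_from("!BI", stream, off)
        off += struct.calcsize("!BI")
        chunks.append((fd, stream[off:off + length]))
        off += length
    return chunks

# stderr written before stdout: one connection keeps that order.
seen = [(2, b"ERROR:  boom\n"), (1, b"QUERY_SEPARATOR 7\n")]
assert demux(mux(seen)) == seen
```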

Greetings,

Andres Freund

#14 Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#13)
Re: BackgroundPsql swallowing errors on windows

On Sun, Feb 16, 2025 at 03:55:10PM -0500, Andres Freund wrote:

[...]

I think what's happening in this case must go beyond just that. Afaict just
doing ->pump_nb() would otherwise solve it. My uninformed theory is that two
tcp connections are used. With two pipes

P1: write(1)
P1: write(2)
P2: read(1)
P2: read(2)

wouldn't ever result in P2 not seeing data on either of reads.

True.

But with two
TCP sockets there can be time between the send() completing and recv() on the
other side reading the data, even on a local system (e.g. due to the tcp stack
waiting a while for more data before sending data).

To avoid that the proxy program could read from N pipes and then proxy them
through *one* socket by prefixing the data with information about which pipe
the data is from. Then IPC::Run could split the data again, using the added
prefix.

I don't think that would require modifying the program under test?

I think that gets an order anomaly when the proxy is slow and the program
under test does this:

write(2, "ERROR: 1")
write(1, "BANNER 1")
write(2, "ERROR: 2")
write(1, "BANNER 2")

If the proxy is sufficiently fast, it will wake up four times and perform two
read() calls on each of the two pipes. On the flip side, if the proxy is
sufficiently slow, it will wake up once and perform one read() on each of the
two pipes. In the slow case, the reads get "ERROR: 1ERROR: 2" and "BANNER
1BANNER 2". The proxy sends the data onward as though the program under test
had done:

write(1, "BANNER 1BANNER 2")
write(2, "ERROR: 1ERROR: 2")

From the slow proxy's perspective, it can't rule out the program under test
having done those two write() calls. The proxy doesn't have enough
information to reconstruct the original four write() calls. What prevents
that anomaly?

If the proxy were to convey this uncertainty so the consumer side knew to
handle these writes as an atomic unit, I think that would work:

write(n, "proxy-begin")
write(n, "fd-1[BANNER 1BANNER 2]")
write(n, "fd-2[ERROR: 1ERROR: 2]")
write(n, "proxy-commit")
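That bracketing could be decoded along these lines (Python sketch with illustrative marker strings, not a real protocol): frames read during one proxy wakeup are grouped, so the consumer knows their relative order is unknowable and treats them as one atomic unit.

```python
def encode(frames):
    # One proxy wakeup: everything read is bracketed into a batch.
    msgs = ["proxy-begin"]
    msgs += ["fd-%d[%s]" % (fd, data) for fd, data in frames]
    msgs.append("proxy-commit")
    return msgs

def decode(msgs):
    batches, cur = [], None
    for m in msgs:
        if m == "proxy-begin":
            cur = []
        elif m == "proxy-commit":
            batches.append(cur)  # within a batch, order is unknown
            cur = None
        else:
            cur.append(m)
    return batches

msgs = encode([(1, "BANNER 1BANNER 2"), (2, "ERROR: 1ERROR: 2")])
assert decode(msgs) == [["fd-1[BANNER 1BANNER 2]", "fd-2[ERROR: 1ERROR: 2]"]]
```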

#15 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Noah Misch (#14)
Re: BackgroundPsql swallowing errors on windows

Noah Misch <noah@leadboat.com> writes:

From the slow proxy's perspective, it can't rule out the program under test
having done those two write() calls. The proxy doesn't have enough
information to reconstruct the original four write() calls. What prevents
that anomaly?

Yeah, I think it's hopeless to expect that we can disambiguate the
order of writes to two different pipes. For the problem at hand,
though, it seems like we don't really need to do that. Rather, the
question is "when we detect that the program-under-test has exited,
can we be sure we have collected all of its output?". I think that
IPC::Run may be screwing up here, because I have seen non-Windows
CI failures that look like it didn't read all the stderr output.
For example, this pgbench test failure on macOS from [1]:

# Running: pgbench -n -t 1 -Dfoo=bla -Dnull=null -Dtrue=true -Done=1 -Dzero=0.0 -Dbadtrue=trueXXX -Dmaxint=9223372036854775807 -Dminint=-9223372036854775808 -M prepared -f /Users/admin/pgsql/build/testrun/pgbench/001_pgbench_with_server/data/t_001_pgbench_with_server_main_data/001_pgbench_error_shell_bad_command
[17:27:47.408](0.061s) ok 273 - pgbench script error: shell bad command status (got 2 vs expected 2)
[17:27:47.409](0.000s) ok 274 - pgbench script error: shell bad command stdout /(?^:processed: 0/1)/
[17:27:47.409](0.000s) not ok 275 - pgbench script error: shell bad command stderr /(?^:\(shell\) .* meta-command failed)/
[17:27:47.409](0.000s) # Failed test 'pgbench script error: shell bad command stderr /(?^:\(shell\) .* meta-command failed)/'
# at /Users/admin/pgsql/src/bin/pgbench/t/001_pgbench_with_server.pl line 1466.
# ''
# doesn't match '(?^:\(shell\) .* meta-command failed)'

The program's exited with a failure code as expected, and we saw (some
of?) the expected stdout output, but stderr output is reported to be
empty.

regards, tom lane

[1]: https://cirrus-ci.com/task/6221238034497536

#16 Michael Paquier
michael@paquier.xyz
In reply to: Andres Freund (#9)
Re: BackgroundPsql swallowing errors on windows

On Sat, Feb 15, 2025 at 04:13:37PM -0500, Andres Freund wrote:

On 2025-02-14 09:52:24 -0800, Jacob Champion wrote:

On Fri, Feb 14, 2025 at 8:53 AM Andres Freund <andres@anarazel.de> wrote:

commit 70291a3c66e

(Side question, entirely unrelated, that came up as I was reading that...)
What's your magic recipe for showing commits in that format? The best
thing I could come up with was to use "(14,trunc)%H" in format.pretty,
but it has the drawback of showing two dots at the end of the commit ID.

Author: Michael Paquier <michael@paquier.xyz>
Date: 2024-11-07 12:11:27 +0900

Improve handling of empty query results in BackgroundPsql::query()

commit ba08edb0654
Author: Michael Paquier <michael@paquier.xyz>
Date: 2024-11-06 15:31:14 +0900

Extend Cluster.pm's background_psql() to be able to start asynchronously

I think both should be backpatchable without too much risk

I think so too. Obviously it'll have to wait until later next week due to the
new minor releases, but after that I think we should backpatch them.

No issues with ba08edb0654.

If we're concerned about the second for any reason, the only conflicting
part should be the name and documentation of wait_connect, right?

It doesn't seem concerning to me either. The first commit seems much more
likely to cause trouble and even that seems ok. Even if it were to cause
problem for an extension (which I think is rather unlikely), it shouldn't be
too hard to fix.

FWIW, Debian Search reports that the only references to BackgroundPsql
are in the Postgres tree, so backpatching 70291a3c66e does not worry
me. GitHub has many more references due to forked code or direct
copies of BackgroundPsql.pm modified for their own purposes.
--
Michael

#17Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#15)
Re: BackgroundPsql swallowing errors on windows

Hi,

On 2025-02-16 18:18:44 -0500, Tom Lane wrote:

Noah Misch <noah@leadboat.com> writes:

From the slow proxy's perspective, it can't rule out the program under test
having done those two write() calls. The proxy doesn't have enough
information to reconstruct the original four write() calls. What prevents
that anomaly?

Yeah, I think it's hopeless to expect that we can disambiguate the
order of writes to two different pipes. For the problem at hand,
though, it seems like we don't really need to do that. Rather, the
question is "when we detect that the program-under-test has exited,
can we be sure we have collected all of its output?".

That's what my patch upthread tries to achieve by having a query separator
both on stdout and stderr and waiting for both.
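
The approach can be sketched like this (a hypothetical Python illustration; BackgroundPsql.pm itself is Perl), with a toy echo child standing in for psql. The query() helper and the banner format are made up for the example:

```python
import subprocess
import sys
import uuid

# Toy stand-in for psql: echoes "OUT:"/"ERR:" lines to stdout/stderr, and on
# "BANNER:" prints the banner to *both* streams, mimicking the patched
# behavior of emitting the query separator on stdout and stderr.
child_src = r'''
import sys
for line in sys.stdin:
    line = line.rstrip("\n")
    if line.startswith("ERR:"):
        print(line[4:], file=sys.stderr, flush=True)
    elif line.startswith("OUT:"):
        print(line[4:], flush=True)
    elif line.startswith("BANNER:"):
        print(line[7:], flush=True)
        print(line[7:], file=sys.stderr, flush=True)
'''

proc = subprocess.Popen([sys.executable, "-c", child_src],
                        stdin=subprocess.PIPE, stdout=subprocess.PIPE,
                        stderr=subprocess.PIPE, text=True)

def query(proc, *commands):
    # Unique banner per query, emitted on both streams; we then read each
    # stream up to and including its banner line (newline included), so any
    # earlier output on either stream must already have been collected.
    banner = "QUERY_SEPARATOR_" + uuid.uuid4().hex
    for c in commands:
        proc.stdin.write(c + "\n")
    proc.stdin.write("BANNER:" + banner + "\n")
    proc.stdin.flush()
    out_lines, err_lines = [], []
    for stream, acc in ((proc.stdout, out_lines), (proc.stderr, err_lines)):
        for line in stream:
            if line.rstrip("\n") == banner:
                break
            acc.append(line.rstrip("\n"))
    return out_lines, err_lines

out, err = query(proc, "ERR:something failed", "OUT:result row")
proc.stdin.close()
proc.wait()
```

Reading the two streams sequentially is fine for this toy, since the output is tiny; the real code instead uses IPC::Run's pump_until() to service both pipes concurrently, which avoids deadlocking on full pipe buffers.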

I think that IPC::Run may be screwing up here, because I have seen
non-Windows CI failures that look like it didn't read all the stderr output.
For example, this pgbench test failure on macOS from [1]:

# Running: pgbench -n -t 1 -Dfoo=bla -Dnull=null -Dtrue=true -Done=1 -Dzero=0.0 -Dbadtrue=trueXXX -Dmaxint=9223372036854775807 -Dminint=-9223372036854775808 -M prepared -f /Users/admin/pgsql/build/testrun/pgbench/001_pgbench_with_server/data/t_001_pgbench_with_server_main_data/001_pgbench_error_shell_bad_command
[17:27:47.408](0.061s) ok 273 - pgbench script error: shell bad command status (got 2 vs expected 2)
[17:27:47.409](0.000s) ok 274 - pgbench script error: shell bad command stdout /(?^:processed: 0/1)/
[17:27:47.409](0.000s) not ok 275 - pgbench script error: shell bad command stderr /(?^:\(shell\) .* meta-command failed)/
[17:27:47.409](0.000s) # Failed test 'pgbench script error: shell bad command stderr /(?^:\(shell\) .* meta-command failed)/'
# at /Users/admin/pgsql/src/bin/pgbench/t/001_pgbench_with_server.pl line 1466.
# ''
# doesn't match '(?^:\(shell\) .* meta-command failed)'

The program's exited with a failure code as expected, and we saw (some
of?) the expected stdout output, but stderr output is reported to be
empty.

It's possible this is caused by the same issue as on windows. Or by one of the
other things fixed in the patch: a) there's afaict no guarantee that we'd read
from pipe A if we were waiting for A|B and B got ready, b) we weren't
actually waiting for quite all the output to be generated (missing the
newline). Or it could be because psql doesn't actually flush stderr in all
paths, from what I can tell...

I hope it'll be easier to debug with the patch in place if it doesn't turn out
to already be fixed.

Greetings,

Andres Freund

#18Noah Misch
noah@leadboat.com
In reply to: Tom Lane (#15)
Re: BackgroundPsql swallowing errors on windows

On Sun, Feb 16, 2025 at 06:18:44PM -0500, Tom Lane wrote:

Noah Misch <noah@leadboat.com> writes:

From the slow proxy's perspective, it can't rule out the program under test
having done those two write() calls. The proxy doesn't have enough
information to reconstruct the original four write() calls. What prevents
that anomaly?

Yeah, I think it's hopeless to expect that we can disambiguate the
order of writes to two different pipes. For the problem at hand,
though, it seems like we don't really need to do that. Rather, the
question is "when we detect that the program-under-test has exited,
can we be sure we have collected all of its output?".

In the BackgroundPsql case, we have output collection moments for every query
while the psql-under-test is running, not just after exit. If I understand
the original post right, the specifics are as follows. If $stdout witnesses
the result of '\echo BANNER', $stderr should contain anything from psql
commands before the \echo. That holds on non-Windows, but the IPC::Run proxy
makes it not hold on Windows.

I think that
IPC::Run may be screwing up here, because I have seen non-Windows
CI failures that look like it didn't read all the stderr output.
For example, this pgbench test failure on macOS from [1]:

# Running: pgbench -n -t 1 -Dfoo=bla -Dnull=null -Dtrue=true -Done=1 -Dzero=0.0 -Dbadtrue=trueXXX -Dmaxint=9223372036854775807 -Dminint=-9223372036854775808 -M prepared -f /Users/admin/pgsql/build/testrun/pgbench/001_pgbench_with_server/data/t_001_pgbench_with_server_main_data/001_pgbench_error_shell_bad_command
[17:27:47.408](0.061s) ok 273 - pgbench script error: shell bad command status (got 2 vs expected 2)
[17:27:47.409](0.000s) ok 274 - pgbench script error: shell bad command stdout /(?^:processed: 0/1)/
[17:27:47.409](0.000s) not ok 275 - pgbench script error: shell bad command stderr /(?^:\(shell\) .* meta-command failed)/
[17:27:47.409](0.000s) # Failed test 'pgbench script error: shell bad command stderr /(?^:\(shell\) .* meta-command failed)/'
# at /Users/admin/pgsql/src/bin/pgbench/t/001_pgbench_with_server.pl line 1466.
# ''
# doesn't match '(?^:\(shell\) .* meta-command failed)'

The program's exited with a failure code as expected, and we saw (some
of?) the expected stdout output, but stderr output is reported to be
empty.

https://github.com/cpan-authors/IPC-Run/commit/2128df3bbcac7e733ac46302c4b1371ffb88fe14
fixed that one.


[1] https://cirrus-ci.com/task/6221238034497536

#19Andres Freund
andres@anarazel.de
In reply to: Michael Paquier (#16)
Re: BackgroundPsql swallowing errors on windows

Hi,

On 2025-02-17 08:52:58 +0900, Michael Paquier wrote:

On Sat, Feb 15, 2025 at 04:13:37PM -0500, Andres Freund wrote:

On 2025-02-14 09:52:24 -0800, Jacob Champion wrote:

On Fri, Feb 14, 2025 at 8:53 AM Andres Freund <andres@anarazel.de> wrote:

commit 70291a3c66e

(Side question, entirely unrelated, that I noticed while reading this..)
What's your magic recipe for showing commits in that format? The best
thing I could come up with was to use "(14,trunc)%H" in format.pretty,
but it has the side effect of showing two dots at the end of the commit ID.

git {show|log|...} --abbrev-commit

If we're concerned about the second for any reason, the only conflicting
part should be the name and documentation of wait_connect, right?

It doesn't seem concerning to me either. The first commit seems much more
likely to cause trouble and even that seems ok. Even if it were to cause
problem for an extension (which I think is rather unlikely), it shouldn't be
too hard to fix.

FWIW, Debian Search reports that the only references to BackgroundPsql
are in the Postgres tree, so backpatching 70291a3c66e does not worry
me. GitHub has many more references due to forked code or direct
copies of BackgroundPsql.pm modified for their own purposes.

Cool, will after the minor release freeze.

Greetings,

Andres Freund

#20Michael Paquier
michael@paquier.xyz
In reply to: Andres Freund (#19)
Re: BackgroundPsql swallowing errors on windows

On Sun, Feb 16, 2025 at 07:03:39PM -0500, Andres Freund wrote:

git {show|log|...} --abbrev-commit

Ah, thanks for the hint. This can also be controlled by core.abbrev at
repo level with %h. I didn't know about that keyword.
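
For what it's worth, both approaches can be tried out in a throwaway repository (the scratch-repo setup below is assumed for demonstration and is not from the thread):

```shell
# Create a scratch repo with one empty commit to format.
tmp=$(mktemp -d) && cd "$tmp"
git init -q .
git -c user.name=t -c user.email=t@example.com \
    commit -q --allow-empty -m test

# Per-invocation: --abbrev-commit shortens the hashes in default formats.
git log -1 --abbrev-commit --pretty=oneline

# Repo-wide: %h honors core.abbrev, so the length is set once.
git config core.abbrev 12
git log -1 --format='commit %h'
```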

Cool, will after the minor release freeze.

Thanks, Andres. If you'd prefer that I double-check the code and do
it myself, as the original committer of these two, please let me know.
--
Michael

#21Tom Lane
tgl@sss.pgh.pa.us
In reply to: Noah Misch (#18)
#22Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#21)
#23Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#22)
#24Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#22)
#25Tom Lane
tgl@sss.pgh.pa.us
In reply to: Noah Misch (#23)
#26Thomas Munro
thomas.munro@gmail.com
In reply to: Andres Freund (#11)
#27Andres Freund
andres@anarazel.de
In reply to: Michael Paquier (#20)
#28Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#27)
#29Andres Freund
andres@anarazel.de
In reply to: Noah Misch (#23)
#30Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#29)
#31Noah Misch
noah@leadboat.com
In reply to: Noah Misch (#30)
#32Tom Lane
tgl@sss.pgh.pa.us
In reply to: Noah Misch (#31)
#33Noah Misch
noah@leadboat.com
In reply to: Tom Lane (#32)
#34Tom Lane
tgl@sss.pgh.pa.us
In reply to: Noah Misch (#33)
#35Arseniy Mukhin
arseniy.mukhin.dev@gmail.com
In reply to: Tom Lane (#34)
#36Andrew Dunstan
andrew@dunslane.net
In reply to: Andres Freund (#5)
#37Andres Freund
andres@anarazel.de
In reply to: Andrew Dunstan (#36)
#38Andrew Dunstan
andrew@dunslane.net
In reply to: Andres Freund (#37)
#39Andres Freund
andres@anarazel.de
In reply to: Andrew Dunstan (#38)
#40Andrew Dunstan
andrew@dunslane.net
In reply to: Andres Freund (#39)