BackgroundPsql swallowing errors on Windows
Hi,
One of the remaining tasks for AIO was to convert the tests to proper TAP
tests. I did that and was thanked by the tests fairly randomly failing on
Windows. Which test fails changes from run to run.
The symptom is that BackgroundPsql->query() sometimes would simply swallow
errors that were clearly generated by the backend. That in turn would cause
tests to fail, because testing for those errors was the whole point of the test.
At first I thought the issue was that psql didn't actually flush stderr after
displaying errors. And while that may be an issue, it doesn't seem to be the
one causing problems for me.
Lots of hair-pulling later, I have a somewhat confirmed theory for what's
happening:
BackgroundPsql::query() tries to detect if the passed in query has executed by
adding a "banner" after the query and using pump_until() to wait until that
banner has been "reached". That seems to work reasonably well on !windows.
On Windows, however, it looks like there's no guarantee that if stdout has been
received by IPC::Run, stderr has also been received, even if the stderr
content was generated first. I tried adding an extra ->pump_nb() call to
query(), thinking that maybe IPC::Run just didn't get input that had actually
arrived, due to waiting on just one pipe. But no success.
My understanding is that IPC::Run uses a proxy process on Windows to execute
subprocesses and then communicates with it over TCP (or something along
those lines). I suspect what's happening is that the communication with the
external process allows reordering between stdout and stderr.
And indeed, changing BackgroundPsql::query() to emit the banner on both stdout
and stderr and waiting on both seems to fix the issue.
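The dual-banner scheme is easy to model outside Perl. Below is a minimal Python sketch; the `pump_until` function, the chunk iterators, and the sample output are invented stand-ins mimicking IPC::Run's behavior, not its real API:

```python
import re

def pump_until(chunks, buffer, pattern):
    # Stand-in for IPC::Run::pump_until: append incoming chunks to the
    # stream's buffer until the pattern matches (or we run out of data).
    for chunk in chunks:
        buffer.append(chunk)
        if re.search(pattern, "".join(buffer)):
            return True
    return False

BANNER = "background_psql: QUERY_SEPARATOR 1"

# stderr content generated first by psql, but delivered after stdout:
# the reordering observed on Windows.
stdout_chunks = iter(["(0 rows)\n", BANNER + "\n"])
stderr_chunks = iter(["ERROR:  relation \"nonexistent\" does not exist\n",
                      BANNER + "\n"])

out_buf, err_buf = [], []

# Waiting on stdout alone would declare the query done here, while the
# error may still be in flight on the other stream:
assert pump_until(stdout_chunks, out_buf, re.escape(BANNER))

# Waiting for the banner on stderr as well guarantees the error has been
# collected before the caller inspects the output:
assert pump_until(stderr_chunks, err_buf, re.escape(BANNER))
assert "ERROR" in "".join(err_buf)
```

The key point is that each stream carries its own copy of the banner, so neither stream can be left half-read when the query is declared complete.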
One complication is that I found that just waiting for the banner, without
also its newline, sometimes led to unexpected newlines causing later queries
to fail. I think that happens if the trailing newline is read separately from
the rest of the string.
However, matching the newlines caused tests to fail on some machines. After a
lot of cursing I figured out that for interactive psql we output \r\n, causing
my regex match to fail. I.e. tests failed whenever IO::PTY was available...
It's also not correct to just look for the banner, as we did before; it has
to be anchored to either the start of the output or a newline, otherwise the
\echo (or \warn) command itself will be matched by pump_until() (but then
replacing the command would fail). I'm not sure that could cause active
problems without the addition of \warn (which is also echoed on stdout), but
it certainly could after.
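A small Python model of the anchoring point; the sample psql output is invented, and the regex shape mirrors the `(^|\n)$banner\r?\n` pattern used in the patch:

```python
import re

BANNER = "background_psql: QUERY_SEPARATOR 3"
banner_re = re.compile(r"(^|\n)" + re.escape(BANNER) + r"\r?\n")

# Interactive psql echoes the \echo command itself before its output, and
# terminates lines with \r\n when IO::PTY is in use:
out = "postgres=# \\echo " + BANNER + "\r\n" + BANNER + "\r\n"

# An unanchored search latches onto the echoed command line, not the output:
naive = re.search(re.escape(BANNER), out)
assert out[:naive.start()].endswith("\\echo ")

# Anchored to start-of-output or a newline, only the real banner output
# matches, and \r?\n copes with both Unix and pty line endings:
m = banner_re.search(out)
assert out[m.start():m.end()] == "\n" + BANNER + "\r\n"
```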
The banner being the same between queries made it hard to understand if a
banner that appeared in the output was from the current query or a past
query. Therefore I added a counter to it.
For debugging I added a "note" that shows stdout/stderr after executing the
query, I think it may be worth keeping that, but I'm not sure.
This was a rather painful exercise.
Greetings,
Andres Freund
Attachments:
v1-0001-BackgroundPsql-Fix-potential-for-lost-errors-on-w.patch (text/x-diff; charset=us-ascii)
From 1aeb60e0687339b9f7524c3b347915bb53ed1284 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Thu, 13 Feb 2025 10:09:49 -0500
Subject: [PATCH v1] BackgroundPsql: Fix potential for lost errors on windows
Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
.../perl/PostgreSQL/Test/BackgroundPsql.pm | 55 +++++++++++++++----
1 file changed, 43 insertions(+), 12 deletions(-)
diff --git a/src/test/perl/PostgreSQL/Test/BackgroundPsql.pm b/src/test/perl/PostgreSQL/Test/BackgroundPsql.pm
index dbfd82e4fac..7e76845307d 100644
--- a/src/test/perl/PostgreSQL/Test/BackgroundPsql.pm
+++ b/src/test/perl/PostgreSQL/Test/BackgroundPsql.pm
@@ -89,7 +89,8 @@ sub new
'stdin' => '',
'stdout' => '',
'stderr' => '',
- 'query_timer_restart' => undef
+ 'query_timer_restart' => undef,
+ 'query_cnt' => 1,
};
my $run;
@@ -219,27 +220,57 @@ sub query
my ($self, $query) = @_;
my $ret;
my $output;
+ my $query_cnt = $self->{query_cnt}++;
+
local $Test::Builder::Level = $Test::Builder::Level + 1;
- note "issuing query via background psql: $query";
+ note "issuing query $query_cnt via background psql: $query";
$self->{timeout}->start() if (defined($self->{query_timer_restart}));
# Feed the query to psql's stdin, followed by \n (so psql processes the
# line), by a ; (so that psql issues the query, if it doesn't include a ;
- # itself), and a separator echoed with \echo, that we can wait on.
- my $banner = "background_psql: QUERY_SEPARATOR";
- $self->{stdin} .= "$query\n;\n\\echo $banner\n";
-
- pump_until($self->{run}, $self->{timeout}, \$self->{stdout}, qr/$banner/);
+ # itself), and a separator echoed both with \echo and \warn, that we can
+ # wait on.
+ #
+ # To avoid somehow confusing the separator from separately issued queries,
+ # and to make it easier to debug, we include a per-psql query counter in
+ # the separator.
+ #
+ # We need both \echo (printing to stdout) and \warn (printing to stderr),
+ # because on windows we can get data on stdout before seeing data on
+ # stderr (or vice versa), even if psql printed them in the opposite
+ # order. We therefore wait on both.
+ #
+ # We need to match for the newline, because we try to remove it below, and
+ # it's possible to consume just the input *without* the newline. In
+ # interactive psql we emit \r\n, so we need to allow for that. Also need
+ # to be careful that we don't e.g. match the echoed \echo command, rather
+ # than its output.
+ my $banner = "background_psql: QUERY_SEPARATOR $query_cnt";
+ my $banner_match = qr/(^|\n)$banner\r?\n/;
+ $self->{stdin} .= "$query\n;\n\\echo $banner\n\\warn $banner\n";
+ pump_until(
+ $self->{run}, $self->{timeout},
+ \$self->{stdout}, qr/$banner_match/);
+ pump_until(
+ $self->{run}, $self->{timeout},
+ \$self->{stderr}, qr/$banner_match/);
die "psql query timed out" if $self->{timeout}->is_expired;
+
+ note "query returned:\n",
+ explain {
+ stdout => $self->{stdout},
+ stderr => $self->{stderr},
+ };
+
+ # Remove banner from stdout and stderr, our caller doesn't care. The
+ # first newline is optional, as there would not be one if consuming an
+ # empty query result.
$output = $self->{stdout};
-
- # Remove banner again, our caller doesn't care. The first newline is
- # optional, as there would not be one if consuming an empty query
- # result.
- $output =~ s/\n?$banner\n$//s;
+ $output =~ s/$banner_match//;
+ $self->{stderr} =~ s/$banner_match//;
# clear out output for the next query
$self->{stdout} = '';
--
2.48.1.76.g4e746b1a31.dirty
On 13 Feb 2025, at 18:39, Andres Freund <andres@anarazel.de> wrote:
The banner being the same between queries made it hard to understand if a
banner that appeared in the output was from the current query or a past
query. Therefore I added a counter to it.
+ my $banner = "background_psql: QUERY_SEPARATOR $query_cnt";
+ my $banner_match = qr/(^|\n)$banner\r?\n/;
+ $self->{stdin} .= "$query\n;\n\\echo $banner\n\\warn $banner\n";
+ pump_until(
+ $self->{run}, $self->{timeout},
+ \$self->{stdout}, qr/$banner_match/);
Won't this allow "QUERY_SEPARATOR 11" to match against "QUERY_SEPARATOR 1"?
It's probably only of academic interest but appending an end-of-banner
character like "_" or something after the query counter should fix that.
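The prefix ambiguity and the terminator fix are easy to see in isolation; a tiny Python sketch with banner strings modeled on the patch:

```python
# "QUERY_SEPARATOR 1" is a prefix of "QUERY_SEPARATOR 11", so a pattern
# built from the former can in principle fire on output of the latter:
b1_plain = "background_psql: QUERY_SEPARATOR 1"
b11_plain = "background_psql: QUERY_SEPARATOR 11"
assert b11_plain.startswith(b1_plain)      # ambiguous by prefix

# Appending a terminator after the counter (v2 of the patch uses ":")
# removes the prefix relation between any two banners:
b1_term = b1_plain + ":"
b11_term = b11_plain + ":"
assert not b11_term.startswith(b1_term)    # ":" vs "1" after the counter
```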
For debugging I added a "note" that shows stdout/stderr after executing the
query, I think it may be worth keeping that, but I'm not sure.
I think it could be worth it, +1 for keeping it until it's been proven a
problem somewhere.
This was a rather painful exercise.
I have no trouble believing that.
--
Daniel Gustafsson
On 2025-02-13 Th 12:39 PM, Andres Freund wrote:
It's been discussed before, but I'd really really like to get rid of
BackgroundPsql. It's ugly, non-intuitive and fragile.
Last year I did some work on this. I was going to present it at Athens
but illness prevented me, and then other life events managed to get in
my way. But the basic work is around. See
<https://github.com/adunstan/test-pq/commit/98518e4929e80fb96f210bbc5aab9fdcea058512>
This introduces a libpq session object (PostgreSQL::Test::Session) which
can be backed either by FFI or a small XS wrapper - the commit has
recipes for both. Missing is a meson.build file for the XS wrapper.
There are significant performance gains to be had too (poll_query_until
is much nicer, for example, as are most uses of safe_psql). If there is
interest I will bring the work up to date, and maybe we can introduce
this early in the v19 cycle. It would significantly reduce our
dependency on IPC::Run, especially the pump() stuff.
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Hi,
On 2025-02-14 13:35:40 +0100, Daniel Gustafsson wrote:
On 13 Feb 2025, at 18:39, Andres Freund <andres@anarazel.de> wrote:
The banner being the same between queries made it hard to understand if a
banner that appeared in the output was from the current query or a past
query. Therefore I added a counter to it.
+ my $banner = "background_psql: QUERY_SEPARATOR $query_cnt";
+ my $banner_match = qr/(^|\n)$banner\r?\n/;
+ $self->{stdin} .= "$query\n;\n\\echo $banner\n\\warn $banner\n";
+ pump_until(
+ $self->{run}, $self->{timeout},
+ \$self->{stdout}, qr/$banner_match/);
Won't this allow "QUERY_SEPARATOR 11" to match against "QUERY_SEPARATOR 1"?
It's probably only of academic interest but appending an end-of-banner
character like "_" or something after the query counter should fix that.
You're right. I went with ":".
Thanks for reviewing!
Updated patch attached.
I also applied similar changes to wait_connect(); it had the same issues as
query(). This mostly matters for interactive_psql() uses. The fact that we
matched on the \echo itself led to the first query() having additional query
output, along the lines of
\echo background_psql: ready
psql (18devel)
Type "help" for help.
postgres=# \echo background_psql: ready
background_psql: ready
Which is rather confusing and can throw off checks based on the query results.
It does seem rather weird that psql outputs its input twice in this case, but
that's a separate issue.
I was thinking that this really ought to be backported; debugging failures due
to the set of fixed bugs is really painful (it just cost me 1 1/2 days), but
unfortunately there have been a bunch of recent changes that haven't been
backpatched:
commit 70291a3c66e
Author: Michael Paquier <michael@paquier.xyz>
Date: 2024-11-07 12:11:27 +0900
Improve handling of empty query results in BackgroundPsql::query()
commit ba08edb0654
Author: Michael Paquier <michael@paquier.xyz>
Date: 2024-11-06 15:31:14 +0900
Extend Cluster.pm's background_psql() to be able to start asynchronously
Particularly the former makes it hard to backpatch, as it's a behavioural
difference that really interacts with the problems described in this thread.
Michael, Jacob, thoughts?
Greetings,
Andres Freund
Attachments:
v2-0001-tests-BackgroundPsql-Fix-potential-for-lost-error.patch (text/x-diff; charset=us-ascii)
From 0bea4887adabd458cdefa4d9fcb4c1a4baefc12a Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Thu, 13 Feb 2025 10:09:49 -0500
Subject: [PATCH v2] tests: BackgroundPsql: Fix potential for lost errors on
windows
This addresses various corner cases in BackgroundPsql:
- On windows stdout and stderr may arrive out of order, leading to errors not
being reported, or attributed to the wrong statement.
To fix, emit the "query-separation banner" on both stdout and stderr and
wait for both.
- Very occasionally the "query-separation banner" would not get removed, because
we waited until the banner arrived, but then replaced the banner plus
newline.
To fix, wait for banner and newline.
- For interactive psql replacing $banner\n is not sufficient, interactive psql
outputs \r\n.
- For interactive psql, where commands are echoed to stdout, the \echo
command, rather than its output, would be matched.
This would sometimes lead to output from the prior query, or wait_connect(),
being returned in the next command.
This also affected wait_connect(), leading to sometimes sending queries to
psql before the connection actually was established.
While debugging these issues I also found that it's hard to know whether a
query separation banner was attributed to the right query. Make that easier by
counting the queries each BackgroundPsql instance has emitted and include the
number in the banner.
Also emit psql stdout/stderr in query() and wait_connect() as Test::More
notes, without that it's rather hard to debug some issues in CI and buildfarm.
As this can cause issues not just to-be-added tests, but also existing ones,
backpatch the fix to all supported versions.
Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
Discussion: https://postgr.es/m/wmovm6xcbwh7twdtymxuboaoarbvwj2haasd3sikzlb3dkgz76@n45rzycluzft
Backpatch-through: 13
---
.../perl/PostgreSQL/Test/BackgroundPsql.pm | 74 +++++++++++++++----
1 file changed, 59 insertions(+), 15 deletions(-)
diff --git a/src/test/perl/PostgreSQL/Test/BackgroundPsql.pm b/src/test/perl/PostgreSQL/Test/BackgroundPsql.pm
index dbfd82e4fac..2fd44c1f867 100644
--- a/src/test/perl/PostgreSQL/Test/BackgroundPsql.pm
+++ b/src/test/perl/PostgreSQL/Test/BackgroundPsql.pm
@@ -89,7 +89,8 @@ sub new
'stdin' => '',
'stdout' => '',
'stderr' => '',
- 'query_timer_restart' => undef
+ 'query_timer_restart' => undef,
+ 'query_cnt' => 1,
};
my $run;
@@ -148,11 +149,24 @@ sub wait_connect
# connection failures are caught here, relieving callers of the need to
# handle those. (Right now, we have no particularly good handling for
# errors anyway, but that might be added later.)
+ #
+ # See query() for details about why/how the banner is used.
my $banner = "background_psql: ready";
- $self->{stdin} .= "\\echo $banner\n";
+ my $banner_match = qr/(^|\n)$banner\r?\n/;
+ $self->{stdin} .= "\\echo $banner\n\\warn $banner\n";
$self->{run}->pump()
- until $self->{stdout} =~ /$banner/ || $self->{timeout}->is_expired;
- $self->{stdout} = ''; # clear out banner
+ until ($self->{stdout} =~ /$banner_match/ && $self->{stderr} =~ /$banner\r?\n/)
+ || $self->{timeout}->is_expired;
+
+ note "connect output:\n",
+ explain {
+ stdout => $self->{stdout},
+ stderr => $self->{stderr},
+ };
+
+ # clear out banners
+ $self->{stdout} = '';
+ $self->{stderr} = '';
die "psql startup timed out" if $self->{timeout}->is_expired;
}
@@ -219,27 +233,57 @@ sub query
my ($self, $query) = @_;
my $ret;
my $output;
+ my $query_cnt = $self->{query_cnt}++;
+
local $Test::Builder::Level = $Test::Builder::Level + 1;
- note "issuing query via background psql: $query";
+ note "issuing query $query_cnt via background psql: $query";
$self->{timeout}->start() if (defined($self->{query_timer_restart}));
# Feed the query to psql's stdin, followed by \n (so psql processes the
# line), by a ; (so that psql issues the query, if it doesn't include a ;
- # itself), and a separator echoed with \echo, that we can wait on.
- my $banner = "background_psql: QUERY_SEPARATOR";
- $self->{stdin} .= "$query\n;\n\\echo $banner\n";
-
- pump_until($self->{run}, $self->{timeout}, \$self->{stdout}, qr/$banner/);
+ # itself), and a separator echoed both with \echo and \warn, that we can
+ # wait on.
+ #
+ # To avoid somehow confusing the separator from separately issued queries,
+ # and to make it easier to debug, we include a per-psql query counter in
+ # the separator.
+ #
+ # We need both \echo (printing to stdout) and \warn (printing to stderr),
+ # because on windows we can get data on stdout before seeing data on
+ # stderr (or vice versa), even if psql printed them in the opposite
+ # order. We therefore wait on both.
+ #
+ # We need to match for the newline, because we try to remove it below, and
+ # it's possible to consume just the input *without* the newline. In
+ # interactive psql we emit \r\n, so we need to allow for that. Also need
+ # to be careful that we don't e.g. match the echoed \echo command, rather
+ # than its output.
+ my $banner = "background_psql: QUERY_SEPARATOR $query_cnt:";
+ my $banner_match = qr/(^|\n)$banner\r?\n/;
+ $self->{stdin} .= "$query\n;\n\\echo $banner\n\\warn $banner\n";
+ pump_until(
+ $self->{run}, $self->{timeout},
+ \$self->{stdout}, qr/$banner_match/);
+ pump_until(
+ $self->{run}, $self->{timeout},
+ \$self->{stderr}, qr/$banner_match/);
die "psql query timed out" if $self->{timeout}->is_expired;
+
+ note "results query $query_cnt:\n",
+ explain {
+ stdout => $self->{stdout},
+ stderr => $self->{stderr},
+ };
+
+ # Remove banner from stdout and stderr, our caller doesn't care. The
+ # first newline is optional, as there would not be one if consuming an
+ # empty query result.
$output = $self->{stdout};
-
- # Remove banner again, our caller doesn't care. The first newline is
- # optional, as there would not be one if consuming an empty query
- # result.
- $output =~ s/\n?$banner\n$//s;
+ $output =~ s/$banner_match//;
+ $self->{stderr} =~ s/$banner_match//;
# clear out output for the next query
$self->{stdout} = '';
--
2.48.1.76.g4e746b1a31.dirty
Hi,
On 2025-02-14 08:14:45 -0500, Andrew Dunstan wrote:
It's been discussed before, but I'd really really like to get rid of
BackgroundPsql. It's ugly, non-intuitive and fragile.
I agree, unfortunately we're stuck with this until we have a better
alternative in tree :(
Last year I did some work on this. I was going to present it at Athens but
illness prevented me, and then other life events managed to get in my way.
But the basic work is around. See <https://github.com/adunstan/test-pq/commit/98518e4929e80fb96f210bbc5aab9fdcea058512>
This introduces a libpq session object (PostgreSQL::Test::Session) which can
be backed either by FFI or a small XS wrapper - the commit has recipes for
both. Missing is a meson.build file for the XS wrapper. There are
significant performance gains to be had too (poll_query_until is much nicer,
for example, as are most uses of safe_psql). If there is interest I will
bring the work up to date, and maybe we can introduce this early in the v19
cycle. It would significantly reduce our dependency on IPC::Run, especially
the pump() stuff.
I definitely am interested.
Greetings,
Andres Freund
Andres Freund <andres@anarazel.de> writes:
On 2025-02-14 08:14:45 -0500, Andrew Dunstan wrote:
... If there is interest I will
bring the work up to date, and maybe we can introduce this early in the v19
cycle. It would significantly reduce our dependency on IPC::Run, especially
the pump() stuff.
I definitely am interested.
+1. Would there be a chance of removing our use of IPC::Run entirely?
There'd be a lot to like about that.
regards, tom lane
On Fri, Feb 14, 2025 at 8:53 AM Andres Freund <andres@anarazel.de> wrote:
commit 70291a3c66e
Author: Michael Paquier <michael@paquier.xyz>
Date: 2024-11-07 12:11:27 +0900
Improve handling of empty query results in BackgroundPsql::query()
commit ba08edb0654
Author: Michael Paquier <michael@paquier.xyz>
Date: 2024-11-06 15:31:14 +0900
Extend Cluster.pm's background_psql() to be able to start asynchronously
Particularly the former makes it hard to backpatch, as it's a behavioural
difference that really interacts with the problems described in this thread.Michael, Jacob, thoughts?
I think both should be backpatchable without too much risk, though
it's possible that there are more useless ok() calls in back branches
that would need to be touched when the first patch goes back. If we're
concerned about the second for any reason, the only conflicting part
should be the name and documentation of wait_connect, right?
--
Jacob
Hi,
On 2025-02-14 12:14:55 -0500, Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
On 2025-02-14 08:14:45 -0500, Andrew Dunstan wrote:
... If there is interest I will
bring the work up to date, and maybe we can introduce this early in the v19
cycle. It would significantly reduce our dependency on IPC::Run, especially
the pump() stuff.
I definitely am interested.
+1. Would there be a chance of removing our use of IPC::Run entirely?
There'd be a lot to like about that.
Unfortunately I doubt it'd get us all the way there.
We do have several tests that intentionally involve psql,
e.g. 010_tab_completion.pl, 001_password.pl. Presumably those would continue
to use something like BackgroundPsql.pm, even if we had a sane way to have
long-running connections in TAP tests.
There are also a bunch of things we use IPC::Run to run synchronously; we
could probably replace most of those without too much pain. The hardest part
would probably be timeout handling.
Greetings,
Andres Freund
Hi,
On 2025-02-14 09:52:24 -0800, Jacob Champion wrote:
On Fri, Feb 14, 2025 at 8:53 AM Andres Freund <andres@anarazel.de> wrote:
commit 70291a3c66e
Author: Michael Paquier <michael@paquier.xyz>
Date: 2024-11-07 12:11:27 +0900
Improve handling of empty query results in BackgroundPsql::query()
commit ba08edb0654
Author: Michael Paquier <michael@paquier.xyz>
Date: 2024-11-06 15:31:14 +0900
Extend Cluster.pm's background_psql() to be able to start asynchronously
Particularly the former makes it hard to backpatch, as it's a behavioural
difference that really interacts with the problems described in this thread.
Michael, Jacob, thoughts?
I think both should be backpatchable without too much risk
I think so too. Obviously it'll have to wait until later next week due to the
new minor releases, but after that I think we should backpatch them.
though it's possible that there are more useless ok() calls in back branches
that would need to be touched when the first patch goes back.
FWIW, I don't really agree that such ok() calls are useless; they do make the
output of the test more readable, particularly if the test ends up hanging.
But that's really just an aside.
If we're concerned about the second for any reason, the only conflicting
part should be the name and documentation of wait_connect, right?
It doesn't seem concerning to me either. The first commit seems much more
likely to cause trouble and even that seems ok. Even if it were to cause
problems for an extension (which I think is rather unlikely), it shouldn't be
too hard to fix.
Greetings,
Andres Freund
On Thu, Feb 13, 2025 at 12:39:04PM -0500, Andres Freund wrote:
My understanding is that IPC::Run uses a proxy process on windows to execute
subprocesses and then communicates with that over TCP (or something along
those lines).
Right.
I suspect what's happening is that the communication with the
external process allows for reordering between stdout/stderr.
And indeed, changing BackgroundPsql::query() to emit the banner on both stdout
and stderr and waiting on both seems to fix the issue.
That makes sense. I wondered how one might fix IPC::Run to preserve the
relative timing of stdout and stderr, not perturbing the timing the way that
disrupted your test run. I can think of two strategies:
- Remove the proxy.
- Make pipe data visible to Perl variables only when at least one of the
proxy<-program_under_test pipes had no data ready to read. In other words,
if both pipes have data ready, make all that data visible to Perl code
simultaneously. (When both the stdout pipe and the stderr pipe have data
ready, one can't determine data arrival order.)
Is there a possibly-less-invasive change that might work?
The banner being the same between queries made it hard to understand if a
banner that appeared in the output was from the current query or a past
query. Therefore I added a counter to it.
Sounds good.
For debugging I added a "note" that shows stdout/stderr after executing the
query, I think it may be worth keeping that, but I'm not sure.
It should be okay to keep. We're not likely to funnel huge amounts of data
through BackgroundPsql. If we ever do that, we could just skip the "note" for
payloads larger than some threshold.
v2 of the patch looks fine.
Hi,
On 2025-02-16 09:39:43 -0800, Noah Misch wrote:
On Thu, Feb 13, 2025 at 12:39:04PM -0500, Andres Freund wrote:
I suspect what's happening is that the communication with the
external process allows for reordering between stdout/stderr.
And indeed, changing BackgroundPsql::query() to emit the banner on both stdout
and stderr and waiting on both seems to fix the issue.
That makes sense. I wondered how one might fix IPC::Run to preserve the
relative timing of stdout and stderr, not perturbing the timing the way that
disrupted your test run. I can think of two strategies:
- Remove the proxy.
- Make pipe data visible to Perl variables only when at least one of the
proxy<-program_under_test pipes had no data ready to read. In other words,
if both pipes have data ready, make all that data visible to Perl code
simultaneously. (When both the stdout pipe and the stderr pipe have data
ready, one can't determine data arrival order.)
Is there a possibly-less-invasive change that might work?
I don't really know enough about IPC::Run's internals to answer. My
interpretation of how it might work, purely from observation, is that it opens
one tcp connection for each "pipe" and that that's what's introducing the
potential of reordering, as the different sockets can have different delivery
timeframes. If that's it, it seems proxying all the pipes through one
connection might be an option.
I did briefly look at IPC::Run's code, but couldn't figure out how it all fits
together quickly enough...
v2 of the patch looks fine.
Thanks for reviewing!
Greetings,
Andres Freund
On Sun, Feb 16, 2025 at 01:02:01PM -0500, Andres Freund wrote:
I don't really know enough about IPC::Run's internals to answer. My
interpretation of how it might work, purely from observation, is that it opens
one tcp connection for each "pipe" and that that's what's introducing the
potential of reordering, as the different sockets can have different delivery
timeframes.
Right.
If that's it, it seems proxying all the pipes through one
connection might be an option.
It would. Thanks. However, I think that would entail modifying the program
under test to cooperate with the arrangement. When running an ordinary
program that does write(1, ...) and write(2, ...), the read end needs some way
to deal with the uncertainty about which write happened first. dup2(1, 2)
solves the order ambiguity, but it loses other signal.
I did briefly look at IPC::Run's code, but couldn't figure out how it all fits
together quickly enough...
We can ignore what IPC::Run does today. The reordering is a general problem
of proxying >1 pipe. (Proxying on non-Windows would have the same problem.)
Hi,
On 2025-02-16 10:47:40 -0800, Noah Misch wrote:
My interpretation of how it might work, purely from observation, is that it opens
one tcp connection for each "pipe" and that that's what's introducing the
potential of reordering, as the different sockets can have different delivery
timeframes.
Right.
If that's it, it seems proxying all the pipes through one
connection might be an option.
It would. Thanks. However, I think that would entail modifying the program
under test to cooperate with the arrangement. When running an ordinary
program that does write(1, ...) and write(2, ...), the read end needs some way
to deal with the uncertainty about which write happened first. dup2(1, 2)
solves the order ambiguity, but it loses other signals.
I think what's happening in this case must go beyond just that. Afaict just
doing ->pump_nb() would otherwise solve it. My uninformed theory is that two
tcp connections are used. With two pipes
P1: write(1)
P1: write(2)
P2: read(1)
P2: read(2)
wouldn't ever result in P2 not seeing data on either of the reads. But with two
TCP sockets there can be time between the send() completing and recv() on the
other side reading the data, even on a local system (e.g. due to the tcp stack
waiting a while for more data before sending data).
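The pipe half of that argument can be checked directly. Here is a minimal sketch (Python used for brevity; not code from the thread): once both write() calls have returned, a single poll reports both read ends ready, so a reader checking both pipes cannot miss data on one of them — unlike two TCP sockets, where delivery of one stream can lag the other.

```python
import os
import select

# Two pipes stand in for the child's stdout and stderr.
out_r, out_w = os.pipe()
err_r, err_w = os.pipe()

# P1: write to both pipes, stderr first.
os.write(err_w, b"ERROR: 1\n")
os.write(out_w, b"BANNER 1\n")

# P2: after both writes have returned, one select() call must see
# data waiting on both read ends simultaneously.
ready, _, _ = select.select([out_r, err_r], [], [], 0)
err_data = os.read(err_r, 1024)
out_data = os.read(out_r, 1024)
```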
To avoid that the proxy program could read from N pipes and then proxy them
through *one* socket by prefixing the data with information about which pipe
the data is from. Then IPC::Run could split the data again, using the added
prefix.
I don't think that would require modifying the program under test?
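A minimal sketch of such a framing scheme (Python for illustration; the one-byte pipe id and four-byte length header are invented here, not anything IPC::Run actually implements): the proxy tags each chunk with the pipe it came from, and the receiver splits the single byte stream back into per-pipe chunks in arrival order.

```python
import struct

def mux(fd_id: int, data: bytes) -> bytes:
    """Frame one chunk: 1-byte pipe id + 4-byte length + payload."""
    return struct.pack(">BI", fd_id, len(data)) + data

def demux(buf: bytes):
    """Split a stream of frames back into (fd_id, payload) pairs."""
    out = []
    while buf:
        fd_id, length = struct.unpack(">BI", buf[:5])
        out.append((fd_id, buf[5:5 + length]))
        buf = buf[5 + length:]
    return out

# The proxy reads stderr first, then stdout, and sends both frames
# over the single connection; the receiver recovers that order.
wire = mux(2, b"ERROR: 1\n") + mux(1, b"BANNER 1\n")
frames = demux(wire)
```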
Greetings,
Andres Freund
On Sun, Feb 16, 2025 at 03:55:10PM -0500, Andres Freund wrote:
On 2025-02-16 10:47:40 -0800, Noah Misch wrote:
On Sun, Feb 16, 2025 at 01:02:01PM -0500, Andres Freund wrote:
On 2025-02-16 09:39:43 -0800, Noah Misch wrote:
On Thu, Feb 13, 2025 at 12:39:04PM -0500, Andres Freund wrote:
I suspect what's happening is that the communication with the
external process allows for reordering between stdout/stderr.

And indeed, changing BackgroundPsql::query() to emit the banner on both stdout
and stderr and waiting on both seems to fix the issue.

That makes sense. I wondered how one might fix IPC::Run to preserve the
relative timing of stdout and stderr, not perturbing the timing the way that
disrupted your test run. I can think of two strategies:

- Remove the proxy.
- Make pipe data visible to Perl variables only when at least one of the
proxy<-program_under_test pipes had no data ready to read. In other words,
if both pipes have data ready, make all that data visible to Perl code
simultaneously. (When both the stdout pipe and the stderr pipe have data
ready, one can't determine data arrival order.)

Is there a possibly-less-invasive change that might work?

If that's it, it seems proxying all the pipes through one
connection might be an option.

It would. Thanks. However, I think that would entail modifying the program
under test to cooperate with the arrangement. When running an ordinary
program that does write(1, ...) and write(2, ...), the read end needs some way
to deal with the uncertainty about which write happened first. dup2(1, 2)
solves the order ambiguity, but it loses other signals.

I think what's happening in this case must go beyond just that. Afaict just
doing ->pump_nb() would otherwise solve it. My uninformed theory is that two
tcp connections are used. With two pipes

P1: write(1)
P1: write(2)
P2: read(1)
P2: read(2)

wouldn't ever result in P2 not seeing data on either of the reads.
True.
But with two
TCP sockets there can be time between the send() completing and recv() on the
other side reading the data, even on a local system (e.g. due to the tcp stack
waiting a while for more data before sending data).

To avoid that the proxy program could read from N pipes and then proxy them
through *one* socket by prefixing the data with information about which pipe
the data is from. Then IPC::Run could split the data again, using the added
prefix.

I don't think that would require modifying the program under test?
I think that gets an order anomaly when the proxy is slow and the program
under test does this:
write(2, "ERROR: 1")
write(1, "BANNER 1")
write(2, "ERROR: 2")
write(1, "BANNER 2")
If the proxy is sufficiently fast, it will wake up four times and perform two
read() calls on each of the two pipes. On the flip side, if the proxy is
sufficiently slow, it will wake up once and perform one read() on each of the
two pipes. In the slow case, the reads get "ERROR: 1ERROR: 2" and "BANNER
1BANNER 2". The proxy sends the data onward as though the program under test
had done:
write(1, "BANNER 1BANNER 2")
write(2, "ERROR: 1ERROR: 2")
From the slow proxy's perspective, it can't rule out the program under test
having done those two write() calls. The proxy doesn't have enough
information to reconstruct the original four write() calls. What prevents
that anomaly?
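The coalescing in the slow case is easy to reproduce with plain pipes (an illustrative Python sketch, not code from the thread): two small writes to the same pipe come back from a single read() as one chunk, so the cross-pipe interleaving of the four writes is unrecoverable.

```python
import os

out_r, out_w = os.pipe()
err_r, err_w = os.pipe()

# The program under test interleaves four small writes.
os.write(err_w, b"ERROR: 1")
os.write(out_w, b"BANNER 1")
os.write(err_w, b"ERROR: 2")
os.write(out_w, b"BANNER 2")

# A proxy that wakes up only now does one read() per pipe: each read
# returns that pipe's buffered bytes as a single chunk, and the
# original ordering across the two pipes is gone.
errors = os.read(err_r, 1024)
banners = os.read(out_r, 1024)
```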
If the proxy were to convey this uncertainty so the consumer side knew to
handle these writes as an atomic unit, I think that would work:
write(n, "proxy-begin")
write(n, "fd-1[BANNER 1BANNER 2]")
write(n, "fd-2[ERROR: 1ERROR: 2]")
write(n, "proxy-commit")
Noah Misch <noah@leadboat.com> writes:
From the slow proxy's perspective, it can't rule out the program under test
having done those two write() calls. The proxy doesn't have enough
information to reconstruct the original four write() calls. What prevents
that anomaly?
Yeah, I think it's hopeless to expect that we can disambiguate the
order of writes to two different pipes. For the problem at hand,
though, it seems like we don't really need to do that. Rather, the
question is "when we detect that the program-under-test has exited,
can we be sure we have collected all of its output?". I think that
IPC::Run may be screwing up here, because I have seen non-Windows
CI failures that look like it didn't read all the stderr output.
For example, this pgbench test failure on macOS from [1]
(https://cirrus-ci.com/task/6221238034497536):
# Running: pgbench -n -t 1 -Dfoo=bla -Dnull=null -Dtrue=true -Done=1 -Dzero=0.0 -Dbadtrue=trueXXX -Dmaxint=9223372036854775807 -Dminint=-9223372036854775808 -M prepared -f /Users/admin/pgsql/build/testrun/pgbench/001_pgbench_with_server/data/t_001_pgbench_with_server_main_data/001_pgbench_error_shell_bad_command
[17:27:47.408](0.061s) ok 273 - pgbench script error: shell bad command status (got 2 vs expected 2)
[17:27:47.409](0.000s) ok 274 - pgbench script error: shell bad command stdout /(?^:processed: 0/1)/
[17:27:47.409](0.000s) not ok 275 - pgbench script error: shell bad command stderr /(?^:\(shell\) .* meta-command failed)/
[17:27:47.409](0.000s) # Failed test 'pgbench script error: shell bad command stderr /(?^:\(shell\) .* meta-command failed)/'
# at /Users/admin/pgsql/src/bin/pgbench/t/001_pgbench_with_server.pl line 1466.
# ''
# doesn't match '(?^:\(shell\) .* meta-command failed)'
The program's exited with a failure code as expected, and we saw (some
of?) the expected stdout output, but stderr output is reported to be
empty.
regards, tom lane
On Sat, Feb 15, 2025 at 04:13:37PM -0500, Andres Freund wrote:
On 2025-02-14 09:52:24 -0800, Jacob Champion wrote:
On Fri, Feb 14, 2025 at 8:53 AM Andres Freund <andres@anarazel.de> wrote:
commit 70291a3c66e
(Side question entirely unrelated as I'm reading that..)
What's your magic recipe for showing up with commit format? The best
thing I could come up with was to use "(14,trunc)%H" in format.pretty,
but it has the idea of showing two dots at the end of the commit ID.
Author: Michael Paquier <michael@paquier.xyz>
Date: 2024-11-07 12:11:27 +0900

Improve handling of empty query results in BackgroundPsql::query()
commit ba08edb0654
Author: Michael Paquier <michael@paquier.xyz>
Date: 2024-11-06 15:31:14 +0900

Extend Cluster.pm's background_psql() to be able to start asynchronously
I think both should be backpatchable without too much risk
I think so too. Obviously it'll have to wait until later next week due to the
new minor releases, but after that I think we should backpatch them.
No issues with ba08edb0654.
If we're concerned about the second for any reason, the only conflicting
part should be the name and documentation of wait_connect, right?

It doesn't seem concerning to me either. The first commit seems much more
likely to cause trouble and even that seems ok. Even if it were to cause
problem for an extension (which I think is rather unlikely), it shouldn't be
too hard to fix.
FWIW, Debian Search reports that the only references to BackgroundPsql
are in the Postgres tree, so backpatching 70291a3c66e does not worry
me. Github has many more references due to forked code or direct
copies of BackgroundPsql.pm modified for their own purposes.
--
Michael
Hi,
On 2025-02-16 18:18:44 -0500, Tom Lane wrote:
Noah Misch <noah@leadboat.com> writes:
From the slow proxy's perspective, it can't rule out the program under test
having done those two write() calls. The proxy doesn't have enough
information to reconstruct the original four write() calls. What prevents
that anomaly?

Yeah, I think it's hopeless to expect that we can disambiguate the
order of writes to two different pipes. For the problem at hand,
though, it seems like we don't really need to do that. Rather, the
question is "when we detect that the program-under-test has exited,
can we be sure we have collected all of its output?".
That's what my patch upthread tries to achieve by having a query separator
both on stdout and stderr and waiting for both.
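In outline the scheme works like this (a Python paraphrase of the Perl logic, mirroring BackgroundPsql's banner_match regex): the query is followed by commands that print a numbered banner on stdout and stderr, and the collector only declares the query done once a pattern anchoring the banner at a line start, including its trailing newline, has matched on both streams.

```python
import re

banner = "background_psql: QUERY_SEPARATOR 1:"
# Require a line start before the banner and its own trailing newline,
# so a banner glued onto earlier output, or one whose newline has not
# arrived yet, does not count as query completion.
banner_re = re.compile(r"(^|\n)" + re.escape(banner) + r"\r?\n")

def both_done(stdout_buf: str, stderr_buf: str) -> bool:
    return bool(banner_re.search(stdout_buf)) and bool(banner_re.search(stderr_buf))

# Banner seen on stdout but not yet on stderr: keep pumping.
assert not both_done(banner + "\n", "ERROR: boom\n")
# Banner, with newline, seen on both streams: done.
assert both_done("result\n" + banner + "\n", "ERROR: boom\n" + banner + "\n")
```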
I think that IPC::Run may be screwing up here, because I have seen
non-Windows CI failures that look like it didn't read all the stderr output.
For example, this pgbench test failure on macOS from [1]:

# Running: pgbench -n -t 1 -Dfoo=bla -Dnull=null -Dtrue=true -Done=1 -Dzero=0.0 -Dbadtrue=trueXXX -Dmaxint=9223372036854775807 -Dminint=-9223372036854775808 -M prepared -f /Users/admin/pgsql/build/testrun/pgbench/001_pgbench_with_server/data/t_001_pgbench_with_server_main_data/001_pgbench_error_shell_bad_command
[17:27:47.408](0.061s) ok 273 - pgbench script error: shell bad command status (got 2 vs expected 2)
[17:27:47.409](0.000s) ok 274 - pgbench script error: shell bad command stdout /(?^:processed: 0/1)/
[17:27:47.409](0.000s) not ok 275 - pgbench script error: shell bad command stderr /(?^:\(shell\) .* meta-command failed)/
[17:27:47.409](0.000s) # Failed test 'pgbench script error: shell bad command stderr /(?^:\(shell\) .* meta-command failed)/'
# at /Users/admin/pgsql/src/bin/pgbench/t/001_pgbench_with_server.pl line 1466.
# ''
# doesn't match '(?^:\(shell\) .* meta-command failed)'

The program's exited with a failure code as expected, and we saw (some
of?) the expected stdout output, but stderr output is reported to be
empty.
It's possible this is caused by the same issue as on windows. Or by one of the
other things fixed in the patch: a) there's afaict no guarantee that we'd read
from pipe A if we were waiting for A|B and B got ready, b) we weren't
actually waiting for quite all the output to be generated (missing the
newline). Or it could be because psql doesn't actually flush stderr in all
paths, from what I can tell...
I hope it'll be easier to debug with the patch in place if it doesn't turn out
to already be fixed.
Greetings,
Andres Freund
On Sun, Feb 16, 2025 at 06:18:44PM -0500, Tom Lane wrote:
Noah Misch <noah@leadboat.com> writes:
From the slow proxy's perspective, it can't rule out the program under test
having done those two write() calls. The proxy doesn't have enough
information to reconstruct the original four write() calls. What prevents
that anomaly?

Yeah, I think it's hopeless to expect that we can disambiguate the
order of writes to two different pipes. For the problem at hand,
though, it seems like we don't really need to do that. Rather, the
question is "when we detect that the program-under-test has exited,
can we be sure we have collected all of its output?".
In the BackgroundPsql case, we have output collection moments for every query
while the psql-under-test is running, not just after exit. If I understand
the original post right, the specifics are as follows. If $stdout witnesses
the result of '\echo BANNER', $stderr should contain anything from psql
commands before the \echo. That holds on non-Windows, but the IPC::Run proxy
makes it not hold on Windows.
I think that
IPC::Run may be screwing up here, because I have seen non-Windows
CI failures that look like it didn't read all the stderr output.
For example, this pgbench test failure on macOS from [1]:

# Running: pgbench -n -t 1 -Dfoo=bla -Dnull=null -Dtrue=true -Done=1 -Dzero=0.0 -Dbadtrue=trueXXX -Dmaxint=9223372036854775807 -Dminint=-9223372036854775808 -M prepared -f /Users/admin/pgsql/build/testrun/pgbench/001_pgbench_with_server/data/t_001_pgbench_with_server_main_data/001_pgbench_error_shell_bad_command
[17:27:47.408](0.061s) ok 273 - pgbench script error: shell bad command status (got 2 vs expected 2)
[17:27:47.409](0.000s) ok 274 - pgbench script error: shell bad command stdout /(?^:processed: 0/1)/
[17:27:47.409](0.000s) not ok 275 - pgbench script error: shell bad command stderr /(?^:\(shell\) .* meta-command failed)/
[17:27:47.409](0.000s) # Failed test 'pgbench script error: shell bad command stderr /(?^:\(shell\) .* meta-command failed)/'
# at /Users/admin/pgsql/src/bin/pgbench/t/001_pgbench_with_server.pl line 1466.
# ''
# doesn't match '(?^:\(shell\) .* meta-command failed)'

The program's exited with a failure code as expected, and we saw (some
of?) the expected stdout output, but stderr output is reported to be
empty.
https://github.com/cpan-authors/IPC-Run/commit/2128df3bbcac7e733ac46302c4b1371ffb88fe14
fixed that one.
Hi,
On 2025-02-17 08:52:58 +0900, Michael Paquier wrote:
On Sat, Feb 15, 2025 at 04:13:37PM -0500, Andres Freund wrote:
On 2025-02-14 09:52:24 -0800, Jacob Champion wrote:
On Fri, Feb 14, 2025 at 8:53 AM Andres Freund <andres@anarazel.de> wrote:
commit 70291a3c66e
(Side question entirely unrelated as I'm reading that..)
What's your magic recipe for showing up with commit format? The best
thing I could come up with was to use "(14,trunc)%H" in format.pretty,
but it has the idea of showing two dots at the end of the commit ID.
git {show|log|...} --abbrev-commit
If we're concerned about the second for any reason, the only conflicting
part should be the name and documentation of wait_connect, right?

It doesn't seem concerning to me either. The first commit seems much more
likely to cause trouble and even that seems ok. Even if it were to cause
problem for an extension (which I think is rather unlikely), it shouldn't be
too hard to fix.

FWIW, Debian Search reports that the only references to BackgroundPsql
are in the Postgres tree, so backpatching 70291a3c66e does not worry
me. Github has many more references due to forked code or direct
copies of BackgroundPsql.pm modified for their own purposes.
Cool, will after the minor release freeze.
Greetings,
Andres Freund
On Sun, Feb 16, 2025 at 07:03:39PM -0500, Andres Freund wrote:
git {show|log|...} --abbrev-commit
Ah, thanks for the hint. This can also be controlled by core.abbrev at
repo level with %h. Didn't suspect this keyword.
Cool, will after the minor release freeze.
Thanks, Andres. If you'd prefer that I double-check the code and do
it as the former committer of these two ones, please let me know.
--
Michael
Noah Misch <noah@leadboat.com> writes:
On Sun, Feb 16, 2025 at 06:18:44PM -0500, Tom Lane wrote:
I think that
IPC::Run may be screwing up here, because I have seen non-Windows
CI failures that look like it didn't read all the stderr output.
For example, this pgbench test failure on macOS from [1]:
https://github.com/cpan-authors/IPC-Run/commit/2128df3bbcac7e733ac46302c4b1371ffb88fe14
fixed that one.
Ah. Do we know whether that fix has made it into our CI images?
(Or anywhere else, for that matter?)
regards, tom lane
Hi,
On February 16, 2025 7:50:18 PM EST, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Noah Misch <noah@leadboat.com> writes:
On Sun, Feb 16, 2025 at 06:18:44PM -0500, Tom Lane wrote:
I think that
IPC::Run may be screwing up here, because I have seen non-Windows
CI failures that look like it didn't read all the stderr output.
For example, this pgbench test failure on macOS from [1]:

https://github.com/cpan-authors/IPC-Run/commit/2128df3bbcac7e733ac46302c4b1371ffb88fe14
fixed that one.

Ah. Do we know whether that fix has made it into our CI images?
(Or anywhere else, for that matter?)
The CI images are regenerated three times a week, but for most OSs, they will only install perl modules via the applicable packaging method, so it'll depend on when they pick up that version.
On Windows cpan is used, so it should pick that new version fairly quickly if a release has been made.
On macos we can't currently use images, so we just cache all the installed macports packages. The cache is keyed by OS version and list of packages to be installed, with no other forced invalidation right now. So it's hard to predict when a new version of a package will be picked up and it will differ between git repositories. I've been wondering whether the cached macports install should just be regularly generated instead, alongside the other CI images.
Greetings,
Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
On Sun, Feb 16, 2025 at 08:42:50PM -0500, Andres Freund wrote:
On February 16, 2025 7:50:18 PM EST, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Noah Misch <noah@leadboat.com> writes:
On Sun, Feb 16, 2025 at 06:18:44PM -0500, Tom Lane wrote:
I think that
IPC::Run may be screwing up here, because I have seen non-Windows
CI failures that look like it didn't read all the stderr output.
For example, this pgbench test failure on macOS from [1]:

https://github.com/cpan-authors/IPC-Run/commit/2128df3bbcac7e733ac46302c4b1371ffb88fe14
fixed that one.

Ah. Do we know whether that fix has made it into our CI images?
(Or anywhere else, for that matter?)

The CI images are regenerated three times a week, but for most OSs, they will only install perl modules via the applicable packaging method, so it'll depend on when they pick up that version.
On Windows cpan is used, so it should pick that new version fairly quickly if a release has been made.
On macos we can't currently use images, so we just cache all the installed macports packages. The cache is keyed by OS version and list of packages to be installed, with no other forced invalidation right now. So it's hard to predict when a new version of a package will be picked up and it will differ between git repositories. I've been wondering whether the cached macports install should just be regularly generated instead, along the other ci images.
The change is not in a release yet. We could have macos install IPC::Run from
github, or I could get a release cut so it can make its way to macports.
https://ports.macports.org/port/p5.34-ipc-run/builds/ suggests it ingested the
last release within a couple days of release, so macports itself may add
negligible latency.
Hi,
On February 16, 2025 8:42:50 PM EST, Andres Freund <andres@anarazel.de> wrote:
On Windows cpan is used, so it should pick that new version fairly quickly if a release has been made.
Looks like no release has happened since late 2023:
https://github.com/cpan-authors/IPC-Run/blob/master/Changelog
So it won't be picked up for now.
I guess we could make CI pick up the git version, but that doesn't really seem like a scalable approach.
Greetings,
Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
Noah Misch <noah@leadboat.com> writes:
On Sun, Feb 16, 2025 at 08:42:50PM -0500, Andres Freund wrote:
On macos we can't currently use images, so we just cache all the installed macports packages. The cache is keyed by OS version and list of packages to be installed, with no other forced invalidation right now. So it's hard to predict when a new version of a package will be picked up and it will differ between git repositories. I've been wondering whether the cached macports install should just be regularly generated instead, along the other ci images.
The change is not in a release yet. We could have macos install IPC::Run from
github, or I could get a release cut so it can make its way to macports.
https://ports.macports.org/port/p5.34-ipc-run/builds/ suggests it ingested the
last release within a couple days of release, so macports itself may add
negligible latency.
Yeah, my experience is that macports is pretty quick about picking up
new releases. If you can persuade upstream to make a release happen,
that'd be great.
regards, tom lane
On Mon, Feb 17, 2025 at 7:02 AM Andres Freund <andres@anarazel.de> wrote:
I don't really know enough about IPC::Run's internals to answer. My
interpretation of how it might work, purely from observation, is that it opens
one tcp connection for each "pipe" and that that's what's introducing the
potential of reordering, as the different sockets can have different delivery
timeframes. If that's it, it seems proxying all the pipes through one
connection might be an option.
I had a couple of ideas about how to get rid of the intermediate
subprocess. Obviously it can't convert "two pipes are ready" into two
separate socket send() calls that preserve the original order, as it
doesn't know them (unless perhaps it switches to completion-based
I/O). But really, the whole design is ugly and slow. If we have some
capacity to improve IPC::Run, I think we should try to get rid of the
pipe/socket bridge and plug either a pipe or a socket directly into
the target subprocess. But which one?
1. Pipes only: IPC::Run could use IOCP or WaitForMultipleObjects()
instead of select()/poll().
2. Sockets only: Apparently you can give sockets directly to
subprocesses as stdin/stdout/stderr:
https://stackoverflow.com/questions/4993119/redirect-io-of-process-to-windows-socket
The IPC::Run comments explain that the extra process was needed to be
able to forward all data even if the target subprocess exits without
closing the socket (the linger stuff we have met before in PostgreSQL
itself). I suspect that if we went that way, maybe asynchronous I/O
would fix that too (see my other thread with guesses and demos on that
topic), but it might not be race-free. I don't know. I'd like to
know for PostgreSQL's own sake, but for IPC::Run I think I'd prefer
option 1 anyway: if you have to write new native Windows API
interactions either way, you might as well go with the normal native
way for Windows processes to connect standard I/O streams.
Hi,
On 2025-02-17 09:24:24 +0900, Michael Paquier wrote:
On Sun, Feb 16, 2025 at 07:03:39PM -0500, Andres Freund wrote:
Cool, will after the minor release freeze.
Thanks, Andres. If you'd prefer that I double-check the code and do
it as the former committer of these two ones, please let me know.
Thanks, but I think it's ok, it looks unproblematic enough. I've pushed the
backports (but not yet the new fix), after running them through CI [1]. Let's
hope the buildfarm thinks similarly.
I'm planning to wait for a few hours and see what the BF says and then push
the fix this thread is about to all supported branches.
Greetings,
Andres Freund
[1]: I have WIP patches for CI support for 13, 14, with some work for windows left to do. I plan to write an email about that at some point.
Hi,
On 2025-02-19 10:41:53 -0500, Andres Freund wrote:
I'm planning to wait for a few hours and see what the BF says and then push
the fix this thread is about to all supported branches.
Not having seen any issues, I pushed the fix.
Thanks for the reviews etc!
Greetings,
Andres Freund
Hi,
On 2025-02-16 17:52:36 -0800, Noah Misch wrote:
On Sun, Feb 16, 2025 at 08:42:50PM -0500, Andres Freund wrote:
On February 16, 2025 7:50:18 PM EST, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Noah Misch <noah@leadboat.com> writes:
On Sun, Feb 16, 2025 at 06:18:44PM -0500, Tom Lane wrote:
I think that
IPC::Run may be screwing up here, because I have seen non-Windows
CI failures that look like it didn't read all the stderr output.
For example, this pgbench test failure on macOS from [1]:

https://github.com/cpan-authors/IPC-Run/commit/2128df3bbcac7e733ac46302c4b1371ffb88fe14
fixed that one.

Ah. Do we know whether that fix has made it into our CI images?
(Or anywhere else, for that matter?)

The CI images are regenerated three times a week, but for most OSs, they will only install perl modules via the applicable packaging method, so it'll depend on when they pick up that version.
On Windows cpan is used, so it should pick that new version fairly quickly if a release has been made.
On macos we can't currently use images, so we just cache all the installed
macports packages. The cache is keyed by OS version and list of packages
to be installed, with no other forced invalidation right now. So it's hard
to predict when a new version of a package will be picked up and it will
differ between git repositories. I've been wondering whether the cached
macports install should just be regularly generated instead, along the
other ci images.

The change is not in a release yet. We could have macos install IPC::Run from
github, or I could get a release cut so it can make its way to macports.
It'd be great if we could get a release. I guess I can figure out the magic
incantations to install it from git for CI, but that doesn't help every
individual developer encountering this issue. Only a release can eventually do
that...
I just hit it twice in an hour or so, once on CI and once locally on a mac
mini when trying to reproduce a separate issue.
Greetings,
Andres Freund
On Sun, Mar 09, 2025 at 12:47:34PM -0400, Andres Freund wrote:
On 2025-02-16 17:52:36 -0800, Noah Misch wrote:
On Sun, Feb 16, 2025 at 08:42:50PM -0500, Andres Freund wrote:
On February 16, 2025 7:50:18 PM EST, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Noah Misch <noah@leadboat.com> writes:
On Sun, Feb 16, 2025 at 06:18:44PM -0500, Tom Lane wrote:
I think that
IPC::Run may be screwing up here, because I have seen non-Windows
CI failures that look like it didn't read all the stderr output.
For example, this pgbench test failure on macOS from [1]:

https://github.com/cpan-authors/IPC-Run/commit/2128df3bbcac7e733ac46302c4b1371ffb88fe14
fixed that one.

Ah. Do we know whether that fix has made it into our CI images?
(Or anywhere else, for that matter?)

The CI images are regenerated three times a week, but for most OSs, they will only install perl modules via the applicable packaging method, so it'll depend on when they pick up that version.
On Windows cpan is used, so it should pick that new version fairly quickly if a release has been made.
On macos we can't currently use images, so we just cache all the installed
macports packages. The cache is keyed by OS version and list of packages
to be installed, with no other forced invalidation right now. So it's hard
to predict when a new version of a package will be picked up and it will
differ between git repositories. I've been wondering whether the cached
macports install should just be regularly generated instead, along the
other ci images.

The change is not in a release yet. We could have macos install IPC::Run from
github, or I could get a release cut so it can make its way to macports.

It'd be great if we could get a release.
Yep. I put the tree in the necessary state, and I contacted the person on
2025-02-17 and again on 2025-03-04. I'm scheduled to follow up again on
2025-03-11.
I guess I can figure out the magic
incantations to install it from git for CI
There's no need to build anything, so it suffices to do:
git clone https://github.com/cpan-authors/IPC-Run.git
export PERL5LIB=$PWD/IPC-Run/lib
# (or append, if you already have non-empty PERL5LIB)
On Sun, Mar 09, 2025 at 10:23:55AM -0700, Noah Misch wrote:
On Sun, Mar 09, 2025 at 12:47:34PM -0400, Andres Freund wrote:
On 2025-02-16 17:52:36 -0800, Noah Misch wrote:
On Sun, Feb 16, 2025 at 08:42:50PM -0500, Andres Freund wrote:
On February 16, 2025 7:50:18 PM EST, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Noah Misch <noah@leadboat.com> writes:
On Sun, Feb 16, 2025 at 06:18:44PM -0500, Tom Lane wrote:
I think that
IPC::Run may be screwing up here, because I have seen non-Windows
CI failures that look like it didn't read all the stderr output.
For example, this pgbench test failure on macOS from [1]:

https://github.com/cpan-authors/IPC-Run/commit/2128df3bbcac7e733ac46302c4b1371ffb88fe14
fixed that one.

Ah. Do we know whether that fix has made it into our CI images?
(Or anywhere else, for that matter?)

The CI images are regenerated three times a week, but for most OSs, they will only install perl modules via the applicable packaging method, so it'll depend on when they pick up that version.
On Windows cpan is used, so it should pick that new version fairly quickly if a release has been made.
On macos we can't currently use images, so we just cache all the installed
macports packages. The cache is keyed by OS version and list of packages
to be installed, with no other forced invalidation right now. So it's hard
to predict when a new version of a package will be picked up and it will
differ between git repositories. I've been wondering whether the cached
macports install should just be regularly generated instead, along the
other ci images.

The change is not in a release yet. We could have macos install IPC::Run from
github, or I could get a release cut so it can make its way to macports.

It'd be great if we could get a release.
Yep. I put the tree in the necessary state, and [...]
https://metacpan.org/dist/IPC-Run is now a fresh release, and
https://ports.macports.org/search/?q=ipc-run&name=on shows that version.
Noah Misch <noah@leadboat.com> writes:
On Sun, Mar 09, 2025 at 10:23:55AM -0700, Noah Misch wrote:
Yep. I put the tree in the necessary state, and [...]
https://metacpan.org/dist/IPC-Run is now a fresh release, and
https://ports.macports.org/search/?q=ipc-run&name=on shows that version.
That's great news, but apparently there is some lag in new packages
actually becoming available for download:
$ sudo port selfupdate
---> Checking for newer releases of MacPorts
MacPorts base version 2.11.4 installed,
MacPorts base version 2.11.4 available.
---> MacPorts base is already the latest version
---> Updating the ports tree
The ports tree has been updated.
All installed ports are up to date.
$ port installed | grep ipc
p5.34-ipc-run @20231003.0.0_0 (active)
regards, tom lane
On Sun, Aug 10, 2025 at 03:07:43PM -0400, Tom Lane wrote:
Noah Misch <noah@leadboat.com> writes:
On Sun, Mar 09, 2025 at 10:23:55AM -0700, Noah Misch wrote:
Yep. I put the tree in the necessary state, and [...]
https://metacpan.org/dist/IPC-Run is now a fresh release, and
https://ports.macports.org/search/?q=ipc-run&name=on shows that version.

That's great news, but apparently there is some lag in new packages
actually becoming available for download:

$ sudo port selfupdate
---> Checking for newer releases of MacPorts
MacPorts base version 2.11.4 installed,
MacPorts base version 2.11.4 available.
---> MacPorts base is already the latest version
---> Updating the ports tree

The ports tree has been updated.
All installed ports are up to date.
$ port installed | grep ipc
p5.34-ipc-run @20231003.0.0_0 (active)
Hmm. I've never used MacPorts, so I'll probably not be an informed
commentator on its release flow. If relevant (probably not):
https://ports.macports.org/port/p5.34-ipc-run/details/ does show missing "Port
Health" for Sequoia (x86_64), Sonoma (arm64), and Sonoma (x86_64). Meanwhile,
https://ports.macports.org/port/p5.34-ipc-run/builds/ shows all builds of this
version happening in a <10min span, with no failures. If there's some "port"
command mode to force an install from source, that might bypass the blockage.
Noah Misch <noah@leadboat.com> writes:
On Sun, Aug 10, 2025 at 03:07:43PM -0400, Tom Lane wrote:
That's great news, but apparently there is some lag in new packages
actually becoming available for download:
Hmm. I've never used MacPorts, so I'll probably not be an informed
commentator on its release flow. If relevant (probably not):
https://ports.macports.org/port/p5.34-ipc-run/details/ does show missing "Port
Health" for Sequoia (x86_64), Sonoma (arm64), and Sonoma (x86_64).
Dunno, but it is there now: after another upgrade run,
$ port installed | grep ipc
p5.34-ipc-run @20250809.0.0_0 (active)
I've updated all my macOS buildfarm animals to this version.
regards, tom lane
Hi,
I think I encountered a bug related to the patch from this thread, so I
decided to write a report here.
version(): PostgreSQL 19devel on x86_64-linux, compiled by gcc-13.3.0,
64-bit. (master)
While writing a TAP test I noticed that background_psql hangs on a
simple query for no apparent reason. The simplest reproducer I came up
with:
my $psql = $node->background_psql('postgres', timeout => 3);
$psql->query(q(\warn AAAAA));
$psql->query("select 1");
Here $psql->query("select 1") hangs until the timeout.
Here is what I managed to understand after some investigation:
As a reminder, here is what $banner and $banner_match look like:
my $banner = "background_psql: QUERY_SEPARATOR $query_cnt:";
my $banner_match = qr/(^|\n)$banner\r?\n/;
$psql->query() hangs in an endless loop in pump_until(), as the
termination condition (last if $$stream =~ /$until/) is never met.
Unfortunately, the logs don't show why we are stuck here (maybe I am
doing something wrong, but it seems that [0] explains why pump_until()'s
timeout diag code doesn't work), so to get more information we need to
add some extra logging to pump_until(). If we log $$stream and $until,
we see the following lines for the stderr pump_until() call:
STREAM: AAAAAbackground_psql: QUERY_SEPARATOR 2:
UNTIL: (?^:(^|\n)background_psql: QUERY_SEPARATOR 2:\r?\n)
STREAM here is what we have in stderr, and UNTIL is just $banner_match.
You can see that the stderr output from the previous query ('AAAAA') is
concatenated with the banner on the same line, so it doesn't match the
$until pattern, which requires (^|\n) before the banner. Hence the
endless loop. It seems that the reason there is no line separator after
'AAAAA' is an asymmetry between how we inject and remove the banner.
Here is how we inject the banner:
$self->{stdin} .= "$query\n;\n\\echo $banner\n\\warn $banner\n";
And here is how we remove the banner from stderr:
$self->{stderr} =~ s/$banner_match//;
So we remove from stderr not only the banner we previously injected with
\warn, but also the line separator preceding it.
[0]: /messages/by-id/1100715.1712265845@sss.pgh.pa.us
Best regards,
Arseniy Mukhin