buildfarm windows checks / tap tests on windows

Started by Andres Freundalmost 5 years ago20 messages
#1Andres Freund
andres@anarazel.de

Hi,

As part of trying to make the aio branch tests on cirrus CI pass with
some tap tests I noticed that "src/tools/msvc/vcregress.pl recoverycheck"
hangs. A long phase of remote debugging later I figured out that that's
not a fault of the aio branch - it also doesn't pass on master (fixed in
[1]: https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=1e6e40447115ca7b4749d7d117b81b016ee5e2c2

Which confused me - shouldn't the buildfarm have noticed? But it seems
that all msvc animals either don't have tap tests enabled or they
disable 'misc-checks' which in turn is what the buildfarm client uses to
trigger all the 'recovery' checks.

It seems we're not just skipping recovery, but also e.g. tap tests in
contrib, all the tests in src/test/modules/...

Andrew, what's the reason for that? Is it just that they hung at some
point? Were too slow?

On that last point: Running the tap tests on windows appears to be
*excruciatingly* slow. How does anybody develop on windows without a
mechanism to actually run tests in parallel?

I think it'd be good if vcregress.pl at least respected PROVE_FLAGS from
the environment - it can't currently be passed in for several of the
vcregress.pl tests, and it does seem to make things to be at least
somewhat less painful.

This makes it even clearer to me that we really need a builtin
testrunner that runs tests efficiently *and* debuggable on windows.

Greetings,

Andres Freund

[1]: https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=1e6e40447115ca7b4749d7d117b81b016ee5e2c2

#2Andrew Dunstan
andrew@dunslane.net
In reply to: Andres Freund (#1)
Re: buildfarm windows checks / tap tests on windows

On 3/1/21 3:07 PM, Andres Freund wrote:

Hi,

As part of trying to make the aio branch tests on cirrus CI pass with
some tap tests I noticed that "src/tools/msvc/vcregress.pl recoverycheck"
hangs. A long phase of remote debugging later I figured out that that's
not a fault of the aio branch - it also doesn't pass on master (fixed in
[1]).

Which confused me - shouldn't the buildfarm have noticed? But it seems
that all msvc animals either don't have tap tests enabled or they
disable 'misc-checks' which in turn is what the buildfarm client uses to
trigger all the 'recovery' checks.

It seems we're not just skipping recovery, but also e.g. tap tests in
contrib, all the tests in src/test/modules/...

Andrew, what's the reason for that? Is it just that they hung at some
point? Were too slow?

I don't think speed is the issue. I probably disabled misc-tests on
drongo and bowerbird (my two animals in question) because I got  either
instability or errors I was unable to diagnose. I'll go back and take
another look to narrow this down. It's possible to disable individual tests.

On that last point: Running the tap tests on windows appears to be
*excruciatingly* slow. How does anybody develop on windows without a
mechanism to actually run tests in parallel?

I think most people develop elsewhere and then adapt/test on Windows if
necessary.

I think it'd be good if vcregress.pl at least respected PROVE_FLAGS from
the environment - it can't currently be passed in for several of the
vcregress.pl tests, and it does seem to make things to be at least
somewhat less painful.

+1

This makes it even clearer to me that we really need a builtin
testrunner that runs tests efficiently *and* debuggable on windows.

"show me the code" :-)

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

#3Andres Freund
andres@anarazel.de
In reply to: Andrew Dunstan (#2)
Re: buildfarm windows checks / tap tests on windows

Hi,

On 2021-03-02 07:48:28 -0500, Andrew Dunstan wrote:

I don't think speed is the issue. I probably disabled misc-tests on
drongo and bowerbird (my two animals in question) because I got� either
instability or errors I was unable to diagnose. I'll go back and take
another look to narrow this down. It's possible to disable individual tests.

Yea, there was one test that hung in 021_row_visibility.pl which was a
weird perl hang - with weird symptoms. But that should be fixed now.

There's another failure in recoverycheck that I at first thought was the
fault of the aio branch. But it also see it on master - but only on one
of the two machines I use to test. Pretty odd.

t/003_recovery_targets.pl ............ 7/9
# Failed test 'multiple conflicting settings'
# at t/003_recovery_targets.pl line 151.

# Failed test 'recovery end before target reached is a fatal error'
# at t/003_recovery_targets.pl line 177.
t/003_recovery_targets.pl ............ 9/9 # Looks like you failed 2 tests of 9.
t/003_recovery_targets.pl ............ Dubious, test returned 2 (wstat 512, 0x200)
Failed 2/9 subtests

I think it's pretty dangerous if we have a substantial number of tests
that aren't run on windows - I think a lot of us just assume that the
BF would catch windows specific problems...

On that last point: Running the tap tests on windows appears to be
*excruciatingly* slow. How does anybody develop on windows without a
mechanism to actually run tests in parallel?

I think most people develop elsewhere and then adapt/test on Windows if
necessary.

Yea, but even that overstretches my patience by a good bit. I can't deal
with a serial check-world on linux either - I think that's the biggest
differentiator. It's a lot less painful to deal with slow-ish tests if
they do all their slowness concurrently. But that's basically impossible
with vcregress.pl, and the bf etc can't easily do it either until we
have a decent way to see the correct logfiles & output for individual
tests...

I think it'd be good if vcregress.pl at least respected PROVE_FLAGS from
the environment - it can't currently be passed in for several of the
vcregress.pl tests, and it does seem to make things to be at least
somewhat less painful.

+1

K, will send a patch for that in a bit.

This makes it even clearer to me that we really need a builtin
testrunner that runs tests efficiently *and* debuggable on windows.

"show me the code" :-)

The biggest obstacle on that front is perl. I started to write one, but
hit several perl issues within an hour. I think I might write one in
python, that'd be a lot less painful.

One windows build question I have is why the msvc infrastructure doesn't
accept msys perl in places like this:
guid => $^O eq "MSWin32" ? Win32::GuidGen() : 'FAKE',
If I change them to accept msys perl then the build ends up working.

The reason it'd be nice to accept msys perl is that git for windows
bundles that - and that's already installed on most CI projects...

Greetings,

Andres Freund

#4Andrew Dunstan
andrew@dunslane.net
In reply to: Andrew Dunstan (#2)
Re: buildfarm windows checks / tap tests on windows

On 3/2/21 7:48 AM, Andrew Dunstan wrote:

On 3/1/21 3:07 PM, Andres Freund wrote:

Hi,

As part of trying to make the aio branch tests on cirrus CI pass with
some tap tests I noticed that "src/tools/msvc/vcregress.pl recoverycheck"
hangs. A long phase of remote debugging later I figured out that that's
not a fault of the aio branch - it also doesn't pass on master (fixed in
[1]).

Which confused me - shouldn't the buildfarm have noticed? But it seems
that all msvc animals either don't have tap tests enabled or they
disable 'misc-checks' which in turn is what the buildfarm client uses to
trigger all the 'recovery' checks.

It seems we're not just skipping recovery, but also e.g. tap tests in
contrib, all the tests in src/test/modules/...

Andrew, what's the reason for that? Is it just that they hung at some
point? Were too slow?

I don't think speed is the issue. I probably disabled misc-tests on
drongo and bowerbird (my two animals in question) because I got  either
instability or errors I was unable to diagnose. I'll go back and take
another look to narrow this down. It's possible to disable individual tests.

Well, I though it was, but it wasn't. Fixed now, see
<https://github.com/PGBuildFarm/client-code/commit/377e9129a08d607bf7aa32a42bcf6ebecb92ba4d&gt;

I've deployed this on drongo, which will now run almost all the TAP
tests. The exception is the recovery tests, because
021_row_visibility.pl crashes so badly it brings down the whole
buildfarm client.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

#5Andres Freund
andres@anarazel.de
In reply to: Andrew Dunstan (#4)
Re: buildfarm windows checks / tap tests on windows

Hi,

On 2021-03-02 17:04:16 -0500, Andrew Dunstan wrote:

Well, I though it was, but it wasn't. Fixed now, see
<https://github.com/PGBuildFarm/client-code/commit/377e9129a08d607bf7aa32a42bcf6ebecb92ba4d&gt;

I've deployed this on drongo, which will now run almost all the TAP
tests.

Thanks!

The exception is the recovery tests, because
021_row_visibility.pl crashes so badly it brings down the whole
buildfarm client.

It still does, even after

commit 1e6e40447115ca7b4749d7d117b81b016ee5e2c2 (upstream/master, master)
Author: Andres Freund <andres@anarazel.de>
Date: 2021-03-01 09:52:15 -0800

Fix recovery test hang in 021_row_visibility.pl on windows.

? I didn't see failures after that?

Greetings,

Andres Freund

#6Michael Paquier
michael@paquier.xyz
In reply to: Andres Freund (#5)
Re: buildfarm windows checks / tap tests on windows

On Tue, Mar 02, 2021 at 04:54:57PM -0800, Andres Freund wrote:

It still does, even after

commit 1e6e40447115ca7b4749d7d117b81b016ee5e2c2 (upstream/master, master)
Author: Andres Freund <andres@anarazel.de>
Date: 2021-03-01 09:52:15 -0800

Fix recovery test hang in 021_row_visibility.pl on windows.

? I didn't see failures after that?

Yes. Testing this morning on top of 5b2f2af, it fails for me with a
"Terminating on signal SIGBREAK".

Having a support for PROVE_TESTS would be nice in src/tools/msvc/,
wrapping any ENV{PROVE_TESTS} value within an extra glob() before
passing that down to the prove command.
--
Michael

#7Andrew Dunstan
andrew@dunslane.net
In reply to: Michael Paquier (#6)
Re: buildfarm windows checks / tap tests on windows

On 3/2/21 8:27 PM, Michael Paquier wrote:

On Tue, Mar 02, 2021 at 04:54:57PM -0800, Andres Freund wrote:

It still does, even after

commit 1e6e40447115ca7b4749d7d117b81b016ee5e2c2 (upstream/master, master)
Author: Andres Freund <andres@anarazel.de>
Date: 2021-03-01 09:52:15 -0800

Fix recovery test hang in 021_row_visibility.pl on windows.

? I didn't see failures after that?

Yes. Testing this morning on top of 5b2f2af, it fails for me with a
"Terminating on signal SIGBREAK".

Having a support for PROVE_TESTS would be nice in src/tools/msvc/,
wrapping any ENV{PROVE_TESTS} value within an extra glob() before
passing that down to the prove command.

Yes, I saw similar this morning, which woud have been after that commit.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

#8Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#3)
Re: buildfarm windows checks / tap tests on windows

Hi,

On 2021-03-02 12:57:57 -0800, Andres Freund wrote:

t/003_recovery_targets.pl ............ 7/9
# Failed test 'multiple conflicting settings'
# at t/003_recovery_targets.pl line 151.

# Failed test 'recovery end before target reached is a fatal error'
# at t/003_recovery_targets.pl line 177.
t/003_recovery_targets.pl ............ 9/9 # Looks like you failed 2 tests of 9.
t/003_recovery_targets.pl ............ Dubious, test returned 2 (wstat 512, 0x200)
Failed 2/9 subtests

This appears to be caused by stderr in windows docker containers to
somehow not work quite right. cirrus-ci uses docker on windows.

If you look e.g. at https://cirrus-ci.com/task/6111560255930368, and
specifically at the relevant log file:
https://api.cirrus-ci.com/v1/artifact/task/6111560255930368/log/src/test/recovery/tmp_check/log/003_recovery_targets_primary.log
you can see that it's, uh, less full than we normally expect:
1 file(s) copied.
1 file(s) copied.
1 file(s) copied.
1 file(s) copied.

As that test uses the log file to determine the state of servers:

my $logfile = slurp_file($node_standby->logfile());
ok($logfile =~ qr/multiple recovery targets specified/,
'multiple conflicting settings');

that doesn't work.

I was *very* confused by this for a while. But finally the cluebait hit
when I discovered that stderr works just fine for *other*
programs. Including the programs that evidently log into
003_recovery_targets_primary.log. The problem is that
pgwin32_is_service() somehow decides that postgres is running as a
service. Despite that not really being the case (I guess somehow
internally docker containers are started below a service, and that
causes the problem).

I hate everything right now. So much.

I think it's quite nasty that postgres just silently starts to log to
the event log. Why on earth wasn't the solution instead to hardcode that
as a server parameter in pg_ctl register?

Not sure what a good fix is for this.

The second problem I saw was 001_initdb failing, which appears to have
been caused by some weird permission issue that I don't fully
understand. The directory with PG in it was created by user andres, an
administrator. But somehow the inherited permissions lead to the chmod()
that initdb does ("fixing permissions on existing directory %s ...") to
fail.

c:\src\postgres>icacls c:\src\postgres
c:\src\postgres BUILTIN\Administrators:(F)
BUILTIN\Administrators:(I)(OI)(CI)(F)
NT AUTHORITY\SYSTEM:(I)(OI)(CI)(F)
CREATOR OWNER:(I)(OI)(CI)(IO)(F)
BUILTIN\Users:(I)(OI)(CI)(RX)
BUILTIN\Users:(I)(CI)(AD)
BUILTIN\Users:(I)(CI)(WD)
c:\src\postgres>whoami
andres-build-te\andres

c:\src\postgres>net user andres
User name andres
...
Local Group Memberships *Administrators *Users

Greetings,

Andres Freund

#9Andres Freund
andres@anarazel.de
In reply to: Andrew Dunstan (#7)
Re: buildfarm windows checks / tap tests on windows

Hi,

On 2021-03-02 21:20:52 -0500, Andrew Dunstan wrote:

On 3/2/21 8:27 PM, Michael Paquier wrote:

On Tue, Mar 02, 2021 at 04:54:57PM -0800, Andres Freund wrote:

It still does, even after

commit 1e6e40447115ca7b4749d7d117b81b016ee5e2c2 (upstream/master, master)
Author: Andres Freund <andres@anarazel.de>
Date: 2021-03-01 09:52:15 -0800

Fix recovery test hang in 021_row_visibility.pl on windows.

? I didn't see failures after that?

Yes. Testing this morning on top of 5b2f2af, it fails for me with a
"Terminating on signal SIGBREAK".

Yes, I saw similar this morning, which woud have been after that commit.

I can't reproduce that here - could either (or both) of you send

src/test/recovery/tmp_check/log/regress_log_021_row_visibility
src/test/recovery/tmp_check/log/021_row_visibility_standby.log
src/test/recovery/tmp_check/log/021_row_visibility_primary.log

of a failed run? And maybe how you're invoking it?

Does adding a

$psql_primary{run}->finish;
$psql_standby{run}->finish;
before
$psql_primary{run}->kill_kill;
$psql_standby{run}->kill_kill;

fix the issue?

Regards,

Andres

#10Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#8)
Re: buildfarm windows checks / tap tests on windows

Hi,

On 2021-03-02 21:20:11 -0800, Andres Freund wrote:

On 2021-03-02 12:57:57 -0800, Andres Freund wrote:

t/003_recovery_targets.pl ............ 7/9
# Failed test 'multiple conflicting settings'
# at t/003_recovery_targets.pl line 151.

# Failed test 'recovery end before target reached is a fatal error'
# at t/003_recovery_targets.pl line 177.
t/003_recovery_targets.pl ............ 9/9 # Looks like you failed 2 tests of 9.
t/003_recovery_targets.pl ............ Dubious, test returned 2 (wstat 512, 0x200)
Failed 2/9 subtests

This appears to be caused by stderr in windows docker containers to
somehow not work quite right. cirrus-ci uses docker on windows.

If you look e.g. at https://cirrus-ci.com/task/6111560255930368, and
specifically at the relevant log file:
https://api.cirrus-ci.com/v1/artifact/task/6111560255930368/log/src/test/recovery/tmp_check/log/003_recovery_targets_primary.log
you can see that it's, uh, less full than we normally expect:
1 file(s) copied.
1 file(s) copied.
1 file(s) copied.
1 file(s) copied.

As that test uses the log file to determine the state of servers:

my $logfile = slurp_file($node_standby->logfile());
ok($logfile =~ qr/multiple recovery targets specified/,
'multiple conflicting settings');

that doesn't work.

I was *very* confused by this for a while. But finally the cluebait hit
when I discovered that stderr works just fine for *other*
programs. Including the programs that evidently log into
003_recovery_targets_primary.log. The problem is that
pgwin32_is_service() somehow decides that postgres is running as a
service. Despite that not really being the case (I guess somehow
internally docker containers are started below a service, and that
causes the problem).

I hate everything right now. So much.

I think it's quite nasty that postgres just silently starts to log to
the event log. Why on earth wasn't the solution instead to hardcode that
as a server parameter in pg_ctl register?

Not sure what a good fix is for this.

FWIW, just forcing pgwin32_is_service() to return false seems to get the
cirrus tests past 003_recovery_targets.pl. Possible it'll not finish due
to other problems (or too tight timeouts I set), but at least this one
can be considered diagnosed I think.

https://cirrus-ci.com/task/5049764917018624?command=windows_worker_buf#L132

Greetings,

Andres Freund

#11Michael Paquier
michael@paquier.xyz
In reply to: Andres Freund (#9)
3 attachment(s)
Re: buildfarm windows checks / tap tests on windows

On Tue, Mar 02, 2021 at 09:47:18PM -0800, Andres Freund wrote:

I can't reproduce that here - could either (or both) of you send

src/test/recovery/tmp_check/log/regress_log_021_row_visibility
src/test/recovery/tmp_check/log/021_row_visibility_standby.log
src/test/recovery/tmp_check/log/021_row_visibility_primary.log

of a failed run? And maybe how you're invoking it?

I have not checked this stuff in details, but here you go. I have
simply invoked that with vcregress taptest src/test/recovery/,
speeding up the process by removing temporarily all the other
scripts.
--
Michael

Attachments:

021_row_visibility_primary.logtext/plain; charset=us-asciiDownload
021_row_visibility_standby.logtext/plain; charset=us-asciiDownload
regress_log_021_row_visibilitytext/plain; charset=us-asciiDownload
#12Andrew Dunstan
andrew@dunslane.net
In reply to: Andres Freund (#9)
Re: buildfarm windows checks / tap tests on windows

On 3/3/21 12:47 AM, Andres Freund wrote:

Hi,

On 2021-03-02 21:20:52 -0500, Andrew Dunstan wrote:

On 3/2/21 8:27 PM, Michael Paquier wrote:

On Tue, Mar 02, 2021 at 04:54:57PM -0800, Andres Freund wrote:

It still does, even after

commit 1e6e40447115ca7b4749d7d117b81b016ee5e2c2 (upstream/master, master)
Author: Andres Freund <andres@anarazel.de>
Date: 2021-03-01 09:52:15 -0800

Fix recovery test hang in 021_row_visibility.pl on windows.

? I didn't see failures after that?

Yes. Testing this morning on top of 5b2f2af, it fails for me with a
"Terminating on signal SIGBREAK".

Yes, I saw similar this morning, which woud have been after that commit.

I can't reproduce that here - could either (or both) of you send

src/test/recovery/tmp_check/log/regress_log_021_row_visibility
src/test/recovery/tmp_check/log/021_row_visibility_standby.log
src/test/recovery/tmp_check/log/021_row_visibility_primary.log

of a failed run? And maybe how you're invoking it?

Does adding a

$psql_primary{run}->finish;
$psql_standby{run}->finish;
before
$psql_primary{run}->kill_kill;
$psql_standby{run}->kill_kill;

fix the issue?

I will check later on. Note that IPC::Run's kill_kill is known to cause
problems on MSWin32 perl, see the other recovery checks where we skip
tests involving it - crash_recovery and logical_decoding.

Maybe we need a wrapper for it that dies if called on MSWin32 perl. That
at least would not crash the invoking service with a nasty signal, and
the buildfarm would actually tell us about the issue.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

#13Andrew Dunstan
andrew@dunslane.net
In reply to: Andres Freund (#3)
Re: buildfarm windows checks / tap tests on windows

On 3/2/21 3:57 PM, Andres Freund wrote:

This makes it even clearer to me that we really need a builtin
testrunner that runs tests efficiently *and* debuggable on windows.

"show me the code" :-)

The biggest obstacle on that front is perl. I started to write one, but
hit several perl issues within an hour. I think I might write one in
python, that'd be a lot less painful.

Without knowing details I'm skeptical. Over nearly three decades of
using perl I have found very little I wanted to do that I could not.

One windows build question I have is why the msvc infrastructure doesn't
accept msys perl in places like this:
guid => $^O eq "MSWin32" ? Win32::GuidGen() : 'FAKE',
If I change them to accept msys perl then the build ends up working.

The reason it'd be nice to accept msys perl is that git for windows
bundles that - and that's already installed on most CI projects...

Nice idea, but we can't run prove under Git's msys perl, because it's
missing some stuff, at least on drongo:

C:\prog>bin\prove --version

C:\prog>"c:\Program Files\Git\usr\bin\perl" "c:\Program
Files\Git\usr\bin\core_perl\prove" --version
Can't locate TAP/Harness/Env.pm in @INC (you may need to install the
TAP::Harness::Env module) (@INC contains: /usr/lib/perl5/site_perl
/usr/share/perl5/site_perl /usr/lib/perl5/vendor_perl
/usr/share/perl5/vendor_perl /usr/lib/perl5/core_perl
/usr/share/perl5/core_perl) at /usr/share/perl5/core_perl/App/Prove.pm
line 6.
BEGIN failed--compilation aborted at
/usr/share/perl5/core_perl/App/Prove.pm line 6.
Compilation failed in require at c:\Program
Files\Git\usr\bin\core_perl\prove line 9.
BEGIN failed--compilation aborted at c:\Program
Files\Git\usr\bin\core_perl\prove line 9.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

#14Andrew Dunstan
andrew@dunslane.net
In reply to: Andrew Dunstan (#12)
Re: buildfarm windows checks / tap tests on windows

On 3/3/21 7:57 AM, Andrew Dunstan wrote:

On 3/3/21 12:47 AM, Andres Freund wrote:

Hi,

On 2021-03-02 21:20:52 -0500, Andrew Dunstan wrote:

On 3/2/21 8:27 PM, Michael Paquier wrote:

On Tue, Mar 02, 2021 at 04:54:57PM -0800, Andres Freund wrote:

It still does, even after

commit 1e6e40447115ca7b4749d7d117b81b016ee5e2c2 (upstream/master, master)
Author: Andres Freund <andres@anarazel.de>
Date: 2021-03-01 09:52:15 -0800

Fix recovery test hang in 021_row_visibility.pl on windows.

? I didn't see failures after that?

Yes. Testing this morning on top of 5b2f2af, it fails for me with a
"Terminating on signal SIGBREAK".

Yes, I saw similar this morning, which woud have been after that commit.

I can't reproduce that here - could either (or both) of you send

src/test/recovery/tmp_check/log/regress_log_021_row_visibility
src/test/recovery/tmp_check/log/021_row_visibility_standby.log
src/test/recovery/tmp_check/log/021_row_visibility_primary.log

of a failed run? And maybe how you're invoking it?

Does adding a

$psql_primary{run}->finish;
$psql_standby{run}->finish;
before
$psql_primary{run}->kill_kill;
$psql_standby{run}->kill_kill;

fix the issue?

I will check later on. Note that IPC::Run's kill_kill is known to cause
problems on MSWin32 perl, see the other recovery checks where we skip
tests involving it - crash_recovery and logical_decoding.

Here's what I actually got working. Rip out the calls to kill_kill and
replace them with:

$psql_primary{stdin} .= "\\q\n";
$psql_primary{run}->pump_nb();
$psql_standby{stdin} .= "\\q\n";
$psql_standby{run}->pump_nb();
sleep 2; # give them time to quit

No hang or signal now.

cheers

andrew

--

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

#15Andres Freund
andres@anarazel.de
In reply to: Andrew Dunstan (#14)
Re: buildfarm windows checks / tap tests on windows

Hi,

On 2021-03-03 16:07:13 -0500, Andrew Dunstan wrote:

Here's what I actually got working. Rip out the calls to kill_kill and
replace them with:

$psql_primary{stdin} .= "\\q\n";
$psql_primary{run}->pump_nb();
$psql_standby{stdin} .= "\\q\n";
$psql_standby{run}->pump_nb();
sleep 2; # give them time to quit

No hang or signal now.

Hm. I wonder if we can avoid the sleep 2 by doing something like
->pump(); ->finish(); instead of one pump_nb()? One pump() is needed to
send the \q to psql, and then we need to wait for the process to finish?
I'll try that, but given that I can't reproduce any problems...

I suspect that at least the crash recovery test suffers from exactly the
same problem and that we can re-enable it on windows after doign an
equivalent change...

Greetings,

Andres Freund

#16Andrew Dunstan
andrew@dunslane.net
In reply to: Andres Freund (#15)
Re: buildfarm windows checks / tap tests on windows

On 3/3/21 4:42 PM, Andres Freund wrote:

Hi,

On 2021-03-03 16:07:13 -0500, Andrew Dunstan wrote:

Here's what I actually got working. Rip out the calls to kill_kill and
replace them with:

$psql_primary{stdin} .= "\\q\n";
$psql_primary{run}->pump_nb();
$psql_standby{stdin} .= "\\q\n";
$psql_standby{run}->pump_nb();
sleep 2; # give them time to quit

No hang or signal now.

Hm. I wonder if we can avoid the sleep 2 by doing something like
->pump(); ->finish(); instead of one pump_nb()? One pump() is needed to
send the \q to psql, and then we need to wait for the process to finish?
I'll try that, but given that I can't reproduce any problems...

Looking at the examples in the IPC:Run docco, it looks like we might
just be able to replace the pump_nb above with finish, and leave the
sleep out. I'll try that.

I suspect that at least the crash recovery test suffers from exactly the
same problem and that we can re-enable it on windows after doign an
equivalent change...

Yes, possibly - it would be good to remove those skips.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

#17Andrew Dunstan
andrew@dunslane.net
In reply to: Andrew Dunstan (#16)
Re: buildfarm windows checks / tap tests on windows

On 3/3/21 5:32 PM, Andrew Dunstan wrote:

On 3/3/21 4:42 PM, Andres Freund wrote:

Hi,

On 2021-03-03 16:07:13 -0500, Andrew Dunstan wrote:

Here's what I actually got working. Rip out the calls to kill_kill and
replace them with:

$psql_primary{stdin} .= "\\q\n";
$psql_primary{run}->pump_nb();
$psql_standby{stdin} .= "\\q\n";
$psql_standby{run}->pump_nb();
sleep 2; # give them time to quit

No hang or signal now.

Hm. I wonder if we can avoid the sleep 2 by doing something like
->pump(); ->finish(); instead of one pump_nb()? One pump() is needed to
send the \q to psql, and then we need to wait for the process to finish?
I'll try that, but given that I can't reproduce any problems...

Looking at the examples in the IPC:Run docco, it looks like we might
just be able to replace the pump_nb above with finish, and leave the
sleep out. I'll try that.

OK, this worked fine. I'll try it s a recipe in the other places where
kill_kill is forcing us to skip tests under MSwin32 perl, and see how we go.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

#18Andrew Dunstan
andrew@dunslane.net
In reply to: Andrew Dunstan (#17)
1 attachment(s)
Re: buildfarm windows checks / tap tests on windows

On 3/3/21 7:21 PM, Andrew Dunstan wrote:

On 3/3/21 5:32 PM, Andrew Dunstan wrote:

On 3/3/21 4:42 PM, Andres Freund wrote:

Hi,

On 2021-03-03 16:07:13 -0500, Andrew Dunstan wrote:

Here's what I actually got working. Rip out the calls to kill_kill and
replace them with:

$psql_primary{stdin} .= "\\q\n";
$psql_primary{run}->pump_nb();
$psql_standby{stdin} .= "\\q\n";
$psql_standby{run}->pump_nb();
sleep 2; # give them time to quit

No hang or signal now.

Hm. I wonder if we can avoid the sleep 2 by doing something like
->pump(); ->finish(); instead of one pump_nb()? One pump() is needed to
send the \q to psql, and then we need to wait for the process to finish?
I'll try that, but given that I can't reproduce any problems...

Looking at the examples in the IPC:Run docco, it looks like we might
just be able to replace the pump_nb above with finish, and leave the
sleep out. I'll try that.

OK, this worked fine. I'll try it s a recipe in the other places where
kill_kill is forcing us to skip tests under MSwin32 perl, and see how we go.

Here's the patch. I didn't see a convenient way of handling the
pg_recvlogical case, so that's unchanged.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

Attachments:

fix-recovery-tests-for-windows.patchtext/x-patch; charset=UTF-8; name=fix-recovery-tests-for-windows.patchDownload
diff --git a/src/test/recovery/t/011_crash_recovery.pl b/src/test/recovery/t/011_crash_recovery.pl
index 5fe917978c..10cd98f70a 100644
--- a/src/test/recovery/t/011_crash_recovery.pl
+++ b/src/test/recovery/t/011_crash_recovery.pl
@@ -7,16 +7,8 @@ use PostgresNode;
 use TestLib;
 use Test::More;
 use Config;
-if ($Config{osname} eq 'MSWin32')
-{
 
-	# some Windows Perls at least don't like IPC::Run's start/kill_kill regime.
-	plan skip_all => "Test fails on Windows perl";
-}
-else
-{
-	plan tests => 3;
-}
+plan tests => 3;
 
 my $node = get_new_node('primary');
 $node->init(allows_streaming => 1);
@@ -65,4 +57,5 @@ cmp_ok($node->safe_psql('postgres', 'SELECT pg_current_xact_id()'),
 is($node->safe_psql('postgres', qq[SELECT pg_xact_status('$xid');]),
 	'aborted', 'xid is aborted after crash');
 
-$tx->kill_kill;
+$stdin .= "\\q\n";
+$tx->finish; # wait for psql to quit gracefully
diff --git a/src/test/recovery/t/021_row_visibility.pl b/src/test/recovery/t/021_row_visibility.pl
index b76990dfe0..f6a486bb88 100644
--- a/src/test/recovery/t/021_row_visibility.pl
+++ b/src/test/recovery/t/021_row_visibility.pl
@@ -151,9 +151,12 @@ ok(send_query_and_wait(\%psql_standby,
 					   qr/will_commit.*\n\(1 row\)$/m),
    'finished prepared visible');
 
-# explicitly shut down psql instances - they cause hangs on windows
-$psql_primary{run}->kill_kill;
-$psql_standby{run}->kill_kill;
+# explicitly shut down psql instances gracefully - to avoid hangs
+# or worse on windows
+$psql_primary{stdin}  .= "\\q\n";
+$psql_primary{run}->finish;
+$psql_standby{stdin} .= "\\q\n";
+$psql_standby{run}->finish;
 
 $node_primary->stop;
 $node_standby->stop;
#19Andres Freund
andres@anarazel.de
In reply to: Andrew Dunstan (#18)
Re: buildfarm windows checks / tap tests on windows

Hi,

On 2021-03-04 11:10:19 -0500, Andrew Dunstan wrote:

Here's the patch.

Awesome. Will you commit it?

I didn't see a convenient way of handling the pg_recvlogical case, so
that's unchanged.

Is the problem actually the kill_kill() itself, or just doing
->kill_kill() without a subsequent ->finish()?

But anyway, that seems like a less critical test...

Greetings,

Andres Freund

#20Andrew Dunstan
andrew@dunslane.net
In reply to: Andres Freund (#19)
Re: buildfarm windows checks / tap tests on windows

On 3/4/21 12:56 PM, Andres Freund wrote:

Hi,

On 2021-03-04 11:10:19 -0500, Andrew Dunstan wrote:

Here's the patch.

Awesome. Will you commit it?

Done

I didn't see a convenient way of handling the pg_recvlogical case, so
that's unchanged.

Is the problem actually the kill_kill() itself, or just doing
->kill_kill() without a subsequent ->finish()?

Pretty sure it's the kill_kill that causes the awful crash Michael and I
have seen.

But anyway, that seems like a less critical test...

right

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com