Postgres v15 windows bincheck regression test failures

Started by Russell Foster over 2 years ago, 8 messages
#1Russell Foster
russell.foster.coding@gmail.com

Hi All:

I upgraded to postgres v15, and I am getting intermittent failures for
some of the bin regression tests when building on Windows 10. Example:

perl vcregress.pl bincheck

Installation complete.
t/001_initdb.pl .. ok
All tests successful.
Files=1, Tests=25, 12 wallclock secs ( 0.03 usr + 0.01 sys = 0.05 CPU)
Result: PASS
t/001_basic.pl ........... ok
t/002_nonesuch.pl ........ 1/?
# Failed test 'checking a non-existent database stderr /(?^:FATAL:
database "qqq" does not exist)/'
# at t/002_nonesuch.pl line 25.
# 'pg_amcheck: error: connection to server at
"127.0.0.1", port 49393 failed: server closed the connection
unexpectedly
# This probably means the server terminated abnormally
# before or while processing the request.
# '
# doesn't match '(?^:FATAL: database "qqq" does not exist)'
t/002_nonesuch.pl ........ 97/? # Looks like you failed 1 test of 100.
t/002_nonesuch.pl ........ Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/100 subtests
t/003_check.pl ........... ok
t/004_verify_heapam.pl ... ok
t/005_opclass_damage.pl .. ok

Test Summary Report
-------------------
t/002_nonesuch.pl (Wstat: 256 Tests: 100 Failed: 1)
Failed test: 3
Non-zero exit status: 1
Files=5, Tests=196, 86 wallclock secs ( 0.11 usr + 0.08 sys = 0.19 CPU)
Result: FAIL
...

I see a similar failure on the build farm at:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2023-06-03%2020%3A03%3A07

I have also received the same error in the pg_dump test as the build
server above. Are these errors expected? Are they due to the fact that
Windows tests use SSPI? It seems to work correctly if I recreate all
of the steps with an HBA that does not use SSPI.

thanks,
Russell

#2Andrew Dunstan
andrew@dunslane.net
In reply to: Russell Foster (#1)
Re: Postgres v15 windows bincheck regression test failures

On 2023-06-08 Th 13:41, Russell Foster wrote:

I have also received the same error in the pg_dump test as the build
server above. Are these errors expected? Are they due to the fact that
Windows tests use SSPI? It seems to work correctly if I recreate all
of the steps with an HBA that does not use SSPI.

In general you're better off using something like this

set PG_TEST_USE_UNIX_SOCKETS=1
set PG_REGRESS_SOCK_DIR=%LOCALAPPDATA%\Local\temp

That avoids several sorts of issues.
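
For anyone who wants to apply these settings from a script rather than an interactive prompt, here is a minimal sketch in Python. It only sets the two variables named above and runs the same `perl vcregress.pl bincheck` command from the original report. The helper names (`bincheck_env`, `run_bincheck`), the fallback to the system temp directory, and the `msvc_dir` argument are assumptions made for illustration; they are not part of the PostgreSQL tooling.

```python
import os
import subprocess
import tempfile


def bincheck_env(sock_dir=None, base_env=None):
    """Return a copy of the environment with the two test variables set."""
    env = dict(os.environ if base_env is None else base_env)
    env["PG_TEST_USE_UNIX_SOCKETS"] = "1"
    # Assumption: any writable directory with a short path is acceptable here.
    env["PG_REGRESS_SOCK_DIR"] = sock_dir or tempfile.gettempdir()
    return env


def run_bincheck(msvc_dir, sock_dir=None):
    """Run `perl vcregress.pl bincheck` inside msvc_dir and return its exit status."""
    completed = subprocess.run(
        ["perl", "vcregress.pl", "bincheck"],
        cwd=msvc_dir,
        env=bincheck_env(sock_dir),
    )
    return completed.returncode
```

Calling `run_bincheck(r"C:\path\to\postgres\src\tools\msvc")` (a placeholder path) would then run the tests with Unix-domain sockets instead of TCP plus SSPI.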

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

#3Russell Foster
russell.foster.coding@gmail.com
In reply to: Andrew Dunstan (#2)
Re: Postgres v15 windows bincheck regression test failures

On Thu, Jun 8, 2023 at 3:33 PM Andrew Dunstan <andrew@dunslane.net> wrote:

In general you're better off using something like this

set PG_TEST_USE_UNIX_SOCKETS=1
set PG_REGRESS_SOCK_DIR=%LOCALAPPDATA%\Local\temp

That avoids several sorts of issues.

cheers

andrew

Thanks for responding! This does indeed work, but again it is no
longer using SSPI, nor the sockets that are used in the runtime. Plus
there is this scary comment in code:

/*
* We don't use Unix-domain sockets on Windows by default, even if the
* build supports them. (See comment at remove_temp() for a reason.)
* Override at your own risk.
*/

Is there some sort of race condition in the SSPI code that sometimes
doesn't gracefully finish/close the connection when the backend
decides to exit due to error?

#4Noah Misch
noah@leadboat.com
In reply to: Russell Foster (#3)
Re: Postgres v15 windows bincheck regression test failures

On Tue, Jun 20, 2023 at 07:49:52AM -0400, Russell Foster wrote:

/*
* We don't use Unix-domain sockets on Windows by default, even if the
* build supports them. (See comment at remove_temp() for a reason.)
* Override at your own risk.
*/

Is there some sort of race condition in the SSPI code that sometimes
doesn't gracefully finish/close the connection when the backend
decides to exit due to error?

No. remove_temp() is part of test driver "pg_regress". Non-test usage is
unaffected. Even for test usage, folks have reported no failures from the
cause mentioned in the remove_temp() comment.

#5Alexander Lakhin
exclusion@gmail.com
In reply to: Noah Misch (#4)
Re: Postgres v15 windows bincheck regression test failures

Hello,

28.07.2023 05:17, Noah Misch wrote:

No. remove_temp() is part of test driver "pg_regress". Non-test usage is
unaffected. Even for test usage, folks have reported no failures from the
cause mentioned in the remove_temp() comment.

It seems to me that it's just another manifestation of bug #16678 ([1]).
See also commits 6051857fc and 29992a6a5.

[1]: /messages/by-id/16678-253e48d34dc0c376@postgresql.org

Best regards,
Alexander

#6Noah Misch
noah@leadboat.com
In reply to: Alexander Lakhin (#5)
Re: Postgres v15 windows bincheck regression test failures

On Fri, Jul 28, 2023 at 07:00:01AM +0300, Alexander Lakhin wrote:

It seems to me that it's just another manifestation of bug #16678 ([1]).
See also commits 6051857fc and 29992a6a5.

[1] /messages/by-id/16678-253e48d34dc0c376@postgresql.org

That was about a bug that appears when using TCP sockets. The remove_temp()
comment involves code that doesn't run when using TCP sockets. I don't think
they can be manifestations of the same phenomenon.

#7Alexander Lakhin
exclusion@gmail.com
In reply to: Noah Misch (#6)
Re: Postgres v15 windows bincheck regression test failures

28.07.2023 14:42, Noah Misch wrote:

That was about a bug that appears when using TCP sockets. ...

Yes, and according to the failed test output, TCP sockets were used:

#                   'pg_amcheck: error: connection to server at
"127.0.0.1", port 49393 failed: server closed the connection
unexpectedly
#       This probably means the server terminated abnormally
#       before or while processing the request.
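
The same check can be done mechanically. The sketch below assumes the message wording used by libpq in recent releases: `connection to server at "<host>", port <n> failed` for TCP and `connection to server on socket "<path>" failed` for Unix-domain sockets. The function name `transport_of` is made up for this example.

```python
import re

# Failure prefixes as worded by libpq for the two transports (assumed wording).
_TCP = re.compile(r'connection to server at "[^"]+"(?: \([^)]*\))?, port \d+ failed')
_UNIX = re.compile(r'connection to server on socket "[^"]+" failed')


def transport_of(stderr_text):
    """Return "tcp", "unix", or "unknown" for a libpq connection error message."""
    # TAP diagnostics prefix lines with '#' and wrap the message across lines,
    # so strip the prefixes and collapse all whitespace before matching.
    parts = [line.lstrip("# \t") for line in stderr_text.splitlines()]
    flat = " ".join(" ".join(parts).split())
    if _TCP.search(flat):
        return "tcp"
    if _UNIX.search(flat):
        return "unix"
    return "unknown"
```

Applied to the output quoted above, this returns "tcp", which matches the reading that the failing run was not using Unix-domain sockets.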

Best regards,
Alexander

#8Noah Misch
noah@leadboat.com
In reply to: Alexander Lakhin (#7)
Re: Postgres v15 windows bincheck regression test failures

On Fri, Jul 28, 2023 at 04:00:00PM +0300, Alexander Lakhin wrote:

28.07.2023 14:42, Noah Misch wrote:

That was about a bug that appears when using TCP sockets. ...

Yes, and according to the failed test output, TCP sockets were used:

#                   'pg_amcheck: error: connection to server at
"127.0.0.1", port 49393 failed: server closed the connection
unexpectedly
#       This probably means the server terminated abnormally
#       before or while processing the request.

I think we were talking about different details. Agreed, bug #16678 probably
did cause the failure in the original post. I was saying that bug has no
connection to the "scary comment", though.