ssl tests fail due to TCP port conflict
Hello hackers,
As the buildfarm shows, the ssl tests are not stable when run via meson.
I can see the following failures for the last 90 days:
REL_16_STABLE:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=serinus&dt=2024-03-12%2023%3A15%3A50
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2024-03-21%2000%3A35%3A23
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon&dt=2024-03-27%2011%3A15%3A31
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=culicidae&dt=2024-04-16%2016%3A10%3A45
master:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=serinus&dt=2024-03-08%2011%3A19%3A42
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2024-03-11%2022%3A23%3A28
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tamandua&dt=2024-03-17%2023%3A03%3A50
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=serinus&dt=2024-03-20%2009%3A21%3A30
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kestrel&dt=2024-03-20%2016%3A53%3A27
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=culicidae&dt=2024-04-07%2012%3A25%3A03
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=rorqual&dt=2024-04-08%2019%3A50%3A13
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kestrel&dt=2024-04-19%2021%3A24%3A30
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon&dt=2024-04-22%2006%3A17%3A13
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kestrel&dt=2024-04-29%2023%3A27%3A15
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2024-04-30%2000%3A24%3A28
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kestrel&dt=2024-06-04%2011%3A20%3A07
All the failures are caused by the server's inability to restart on the
previously chosen random TCP/IP port. For example:
2024-06-04 11:30:40.227 UTC [3373644][postmaster][:0] LOG: starting PostgreSQL 17beta1 on x86_64-linux, compiled by clang-13.0.1-11, 64-bit
2024-06-04 11:30:40.231 UTC [3373644][postmaster][:0] LOG: listening on Unix socket "/tmp/tUmT8ItNQ2/.s.PGSQL.60362"
2024-06-04 11:30:40.337 UTC [3373798][startup][:0] LOG: database system was shut down at 2024-06-04 11:21:25 UTC
...
2024-06-04 11:30:45.273 UTC [3376046][postmaster][:0] LOG: starting PostgreSQL 17beta1 on x86_64-linux, compiled by clang-13.0.1-11, 64-bit
2024-06-04 11:30:45.273 UTC [3376046][postmaster][:0] LOG: could not bind IPv4 address "127.0.0.1": Address already in use
2024-06-04 11:30:45.273 UTC [3376046][postmaster][:0] HINT: Is another postmaster already running on port 60362? If not, wait a few seconds and retry.
2024-06-04 11:30:45.273 UTC [3376046][postmaster][:0] WARNING: could not create listen socket for "127.0.0.1"
I've managed to reproduce the failure locally with the following change:
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -149,7 +149,7 @@ INIT
$ENV{PGDATABASE} = 'postgres';
# Tracking of last port value assigned to accelerate free port lookup.
- $last_port_assigned = int(rand() * 16384) + 49152;
+ $last_port_assigned = int(rand() * 1024) + 49152;
and by multiplying one of the tests:
for i in `seq 50`; do cp .../src/test/ssl/t/001_ssltests.pl \
.../src/test/ssl/t/001_ssltests_$i.pl; \
sed -E "s|('t/001_ssltests.pl',)|\1\n't/001_ssltests_$i.pl',|" -i \
.../src/test/ssl/meson.build; done
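With 51 copies of the test each drawing a starting port from only 1024
values (the narrowed range 49152..50175), an initial collision is very
likely; a rough back-of-the-envelope check (a hypothetical sketch that
ignores timing and the free-port retry logic):
use strict;
use warnings;
# Birthday-style estimate: chance that at least two of 51 test scripts
# pick the same starting port out of 1024 possible values.
my ($ports, $tests) = (1024, 51);
my $p_distinct = 1;
$p_distinct *= ($ports - $_) / $ports for 0 .. $tests - 1;
printf "P(some collision) ~ %.2f\n", 1 - $p_distinct;    # ~0.71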
Then `meson test --suite ssl` fails for me as below:
...
26/53 postgresql:ssl / ssl/001_ssltests_26 OK 9.03s 205 subtests passed
27/53 postgresql:ssl / ssl/001_ssltests_18 ERROR 3.55s (exit status 255 or signal 127 SIGinvalid)
OPENSSL=/usr/bin/openssl ...
28/53 postgresql:ssl / ssl/001_ssltests_25 OK 10.98s 205 subtests passed
29/53 postgresql:ssl / ssl/001_ssltests_24 OK 10.84s 205 subtests passed
testrun/ssl/001_ssltests_18/log/001_ssltests_18_primary.log contains:
2024-06-05 11:51:22.005 UTC [710541] LOG: could not bind IPv4 address "127.0.0.1": Address already in use
2024-06-05 11:51:22.005 UTC [710541] HINT: Is another postmaster already running on port 49632? If not, wait a few seconds and retry.
`grep '\b49632\b' -r testrun/` finds:
testrun/ssl/001_ssltests_18/log/regress_log_001_ssltests_18:# Checking port 49632
testrun/ssl/001_ssltests_18/log/regress_log_001_ssltests_18:# Found port 49632
testrun/ssl/001_ssltests_18/log/regress_log_001_ssltests_18:Connection string: port=49632 host=/tmp/sp_VLbpjJF
testrun/ssl/001_ssltests_18/log/001_ssltests_18_primary.log:2024-06-05 11:51:18.896 UTC [710082] LOG: listening on Unix socket "/tmp/sp_VLbpjJF/.s.PGSQL.49632"
testrun/ssl/001_ssltests_18/log/001_ssltests_18_primary.log:2024-06-05 11:51:22.005 UTC [710541] HINT: Is another postmaster already running on port 49632? If not, wait a few seconds and retry.
...
testrun/ssl/001_ssltests_23/log/regress_log_001_ssltests_23:# Checking port 49632
testrun/ssl/001_ssltests_23/log/regress_log_001_ssltests_23:# Found port 49632
testrun/ssl/001_ssltests_23/log/regress_log_001_ssltests_23:Connection string: port=49632 host=/tmp/3lxVDNzuGC
testrun/ssl/001_ssltests_23/log/001_ssltests_23_primary.log:2024-06-05 11:51:13.337 UTC [708377] LOG: listening on Unix socket "/tmp/3lxVDNzuGC/.s.PGSQL.49632"
testrun/ssl/001_ssltests_23/log/001_ssltests_23_primary.log:2024-06-05 11:51:14.333 UTC [708715] LOG: listening on IPv4 address "127.0.0.1", port 49632
...
Another case (with psql using the port):
testrun/ssl/001_ssltests_47/log/regress_log_001_ssltests_47:# Checking port 49448
testrun/ssl/001_ssltests_47/log/regress_log_001_ssltests_47:# Found port 49448
testrun/ssl/001_ssltests_47/log/001_ssltests_47_primary.log:2024-06-05 12:20:50.178 UTC [976826] LOG: listening on Unix socket "/tmp/GePu6gmouP/.s.PGSQL.49448"
testrun/ssl/001_ssltests_47/log/001_ssltests_47_primary.log:2024-06-05 12:20:50.491 UTC [976927] HINT: Is another postmaster already running on port 49448? If not, wait a few seconds and retry.
...
testrun/ssl/001_ssltests_48/log/001_ssltests_48_primary.log:2024-06-05 12:20:50.491 UTC [976943] [unknown] LOG: connection received: host=localhost port=49448
The broader excerpt:
2024-06-05 12:20:50.415 UTC [976918] [unknown] LOG: connection received: host=localhost port=50326
2024-06-05 12:20:50.418 UTC [976918] [unknown] LOG: could not accept SSL connection: EOF detected
2024-06-05 12:20:50.433 UTC [976920] [unknown] LOG: connection received: host=localhost port=49420
2024-06-05 12:20:50.435 UTC [976920] [unknown] LOG: could not accept SSL connection: EOF detected
2024-06-05 12:20:50.447 UTC [976922] [unknown] LOG: connection received: host=localhost port=49430
2024-06-05 12:20:50.452 UTC [976922] [unknown] LOG: could not accept SSL connection: tlsv1 alert unknown ca
2024-06-05 12:20:50.466 UTC [976933] [unknown] LOG: connection received: host=localhost port=49440
2024-06-05 12:20:50.472 UTC [976933] [unknown] LOG: could not accept SSL connection: tlsv1 alert unknown ca
2024-06-05 12:20:50.491 UTC [976943] [unknown] LOG: connection received: host=localhost port=49448
2024-06-05 12:20:50.497 UTC [976943] [unknown] LOG: could not accept SSL connection: tlsv1 alert unknown ca
2024-06-05 12:20:50.513 UTC [976969] [unknown] LOG: connection received: host=localhost port=49464
2024-06-05 12:20:50.517 UTC [976969] [unknown] LOG: could not accept SSL connection: tlsv1 alert unknown ca
2024-06-05 12:20:50.532 UTC [976971] [unknown] LOG: connection received: host=localhost port=49468
Maybe the ssl tests should not be considered failed in case of a TCP port
binding error?
Best regards,
Alexander
On 2024-06-05 We 09:00, Alexander Lakhin wrote:
Another case (with psql using the port): ...
I think I see what's going on here. It looks like it's because we start
the server in unix socket mode, and then switch to using TCP as well.
Can you try your test with this patch applied and see if the problem
persists? If we start in TCP mode the framework should test for a port
clash.
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Attachments:
ssl-test-tcp-fix.patch (text/x-patch)
diff --git a/src/test/ssl/t/001_ssltests.pl b/src/test/ssl/t/001_ssltests.pl
index b877327023..20ec7cb5fb 100644
--- a/src/test/ssl/t/001_ssltests.pl
+++ b/src/test/ssl/t/001_ssltests.pl
@@ -4,8 +4,9 @@
use strict;
use warnings FATAL => 'all';
use Config qw ( %Config );
-use PostgreSQL::Test::Cluster;
use PostgreSQL::Test::Utils;
+BEGIN { $PostgreSQL::Test::Utils::use_unix_sockets = 0; }
+use PostgreSQL::Test::Cluster;
use Test::More;
use FindBin;
On 2024-06-05 We 14:10, Andrew Dunstan wrote:
Can you try your test with this patch applied and see if the problem
persists? If we start in TCP mode the framework should test for a port
clash.
Hmm, on closer inspection we should have reserved the port anyway. But
why is the port "already used" on restart? We haven't previously opened
a TCP connection on that port (except when checking if we can bind it),
and instances should be locked against using that port.
... wanders away muttering and scratching head ...
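For reference, the availability check in get_free_port() boils down to a
TCP bind probe; a simplified sketch of the idea (not the exact Cluster.pm
code):
use strict;
use warnings;
use IO::Socket::INET;
# Probe whether we can bind and listen on the port right now; success
# says nothing about whether the port stays free afterwards.
sub can_bind
{
    my ($host, $port) = @_;
    my $sock = IO::Socket::INET->new(
        LocalAddr => $host,
        LocalPort => $port,
        Proto => 'tcp',
        ReuseAddr => 1,
        Listen => 1);
    return 0 unless defined $sock;
    close($sock);
    return 1;
}
print can_bind('127.0.0.1', 49632) ? "free\n" : "in use\n";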
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Hello Andrew,
05.06.2024 21:10, Andrew Dunstan wrote:
I think I see what's going on here. It looks like it's because we start the server in unix socket mode, and then switch to using TCP as well.
Can you try your test with this patch applied and see if the problem persists? If we start in TCP mode the framework should test for a port clash.
It seems that the failure rate decreased (I guess the patch rules out the
case with two servers choosing the same port), but I still got:
16/53 postgresql:ssl / ssl/001_ssltests_36 OK 15.25s 205 subtests passed
17/53 postgresql:ssl / ssl/001_ssltests_30 ERROR 3.17s (exit status 255 or signal 127 SIGinvalid)
2024-06-05 19:40:37.395 UTC [414110] LOG: starting PostgreSQL 17beta1 on x86_64-linux, compiled by gcc-13.2.1, 64-bit
2024-06-05 19:40:37.395 UTC [414110] LOG: could not bind IPv4 address "127.0.0.1": Address already in use
2024-06-05 19:40:37.395 UTC [414110] HINT: Is another postmaster already running on port 50072? If not, wait a few seconds and retry.
`grep '\b50072\b' -r testrun/` yields:
testrun/ssl/001_ssltests_34/log/001_ssltests_34_primary.log:2024-06-05 19:40:37.392 UTC [414111] [unknown] LOG: connection received: host=localhost port=50072
(a psql case)
That is, psql from the test instance 001_ssltests_34 opened a connection to
the test server with client port 50072, which made it impossible for the
server from the test instance 001_ssltests_30 to use that port.
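The effect is easy to demonstrate in isolation; a self-contained sketch
(hypothetical, not part of the test suite):
use strict;
use warnings;
use IO::Socket::INET;
# A stand-in server, so there is something to connect to.
my $listener = IO::Socket::INET->new(
    LocalAddr => '127.0.0.1',
    Proto => 'tcp',
    Listen => 1) or die "listen: $!";
# The "psql" side: the kernel assigns the outgoing connection's source
# port from the ephemeral range.
my $client = IO::Socket::INET->new(
    PeerAddr => '127.0.0.1',
    PeerPort => $listener->sockport,
    Proto => 'tcp') or die "connect: $!";
print "client source port: ", $client->sockport, "\n";
# A test server that happened to pick the same number now cannot listen
# on it.
my $server = IO::Socket::INET->new(
    LocalAddr => '127.0.0.1',
    LocalPort => $client->sockport,
    Proto => 'tcp',
    Listen => 1);
print defined $server ? "bound\n" : "bind failed: $!\n";    # EADDRINUSE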
Best regards,
Alexander
On 2024-06-05 We 16:00, Alexander Lakhin wrote:
That is, psql from the test instance 001_ssltests_34 opened a connection to
the test server with client port 50072, which made it impossible for the
server from the test instance 001_ssltests_30 to use that port.
Oh. (kicks self)
Should we really be allocating server ports in the ephemeral range
49152..65535? Maybe we should be looking for an unallocated port
somewhere below 49152, and above, say, 32767, so we couldn't have a
client socket collision.
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
On 2024-06-05 We 16:49, Andrew Dunstan wrote:
Should we really be allocating server ports in the ephemeral range
49152..65535? Maybe we should be looking for an unallocated port
somewhere below 49152, and above, say, 32767, so we couldn't have a
client socket collision.
Except that I see this on my Ubuntu instance:
$ sudo sysctl net.ipv4.ip_local_port_range
net.ipv4.ip_local_port_range = 32768 60999
And indeed I see client sockets in that range.
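So on such a system the kernel's ephemeral client range 32768..60999
overlaps the 49152..65535 range the framework picks server ports from.
For scripting, the range can also be read directly (a Linux-specific
sketch):
use strict;
use warnings;
# Read the ephemeral (local) port range the kernel assigns client ports from.
open my $fh, '<', '/proc/sys/net/ipv4/ip_local_port_range' or die "open: $!";
my ($lo, $hi) = split ' ', <$fh>;
close $fh;
print "ephemeral client ports: $lo..$hi\n";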
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Andrew Dunstan <andrew@dunslane.net> writes:
On 2024-06-05 We 16:00, Alexander Lakhin wrote:
That is, psql from the test instance 001_ssltests_34 opened a connection to the test server with client port 50072, which made it impossible for the server from the test instance 001_ssltests_30 to use that port.
Oh. (kicks self)
D'oh.
Should we really be allocating server ports in the ephemeral range
49152..65535? Maybe we should be looking for an unallocated port
somewhere below 49152, and above, say, 32767, so we couldn't have a
client socket collision.
Hmm, are there really any standards about how these port numbers
are used?
I wonder if we don't need to just be prepared to retry the whole
thing a few times. Even if it's true that "clients" shouldn't
choose ports below 49152, we still have a small chance of failure
against a non-Postgres server starting up at the wrong time.
regards, tom lane
On 2024-06-05 We 17:37, Tom Lane wrote:
Hmm, are there really any standards about how these port numbers are used?
I wonder if we don't need to just be prepared to retry the whole thing a
few times. Even if it's true that "clients" shouldn't choose ports below
49152, we still have a small chance of failure against a non-Postgres
server starting up at the wrong time.
Yeah, I think you're right. One thing we should do is be careful to use
the port as soon as possible after we have picked it, to reduce the
possibility that something else will use it first.
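One way to shrink that window, shown here only as an illustration of the
idea (not what Cluster.pm does), is to hold the probe socket bound until
the moment the port is handed to the server:
use strict;
use warnings;
use IO::Socket::INET;
# Let the kernel pick a free port and keep it bound through the rest of
# the setup work, releasing it as late as possible.
my $hold = IO::Socket::INET->new(
    LocalAddr => '127.0.0.1',
    LocalPort => 0,            # 0 = kernel chooses a free port
    Proto => 'tcp',
    ReuseAddr => 1,
    Listen => 1) or die "bind: $!";
my $port = $hold->sockport;
# ... write config files, certificates, etc. while we still own the port ...
close($hold);                  # release just before starting the server
print "starting server on port $port\n";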
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
On 2024-06-05 We 16:00, Alexander Lakhin wrote:
It seems that the failure rate decreased (I guess the patch rules out the
case with two servers choosing the same port), but I still got: ...
After sleeping on it, I still think the patch would be a good thing.
Your torture test might still show some failures, but the buildfarm
isn't running those, and it might be enough to eliminate or at least
substantially reduce buildfarm failures by reducing to almost zero the
time in which a competing script might grab the port. The biggest
problem with the current script is apparently that we delay using the
TCP port by starting the server in Unix socket mode, and only switch to
using TCP when we restart. If changing that doesn't fix the problem
we'll have to rethink. If this isn't the cause, though, I would expect
to have seen similar failures from other test suites.
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
On Wed, 5 Jun 2024 at 23:37, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I wonder if we don't need to just be prepared to retry the whole thing a
few times. Even if it's true that "clients" shouldn't choose ports below
49152, we still have a small chance of failure against a non-Postgres
server starting up at the wrong time.
My suggestion would be to not touch the ephemeral port range at all
for these ports. In practice the ephemeral port range is used for
cases where the operating system assigns the port and the application
doesn't care what it is, not for when you want to get a free port but
want to know in advance which one it is.
For the PgBouncer test suite we do something similar to what the PG
perl tests do, but there we allocate a port between 10200 and 32768:
https://github.com/pgbouncer/pgbouncer/blob/master/test/utils.py#L192-L215
Sure, theoretically it's possible to hit a rare case where another
server starts up at the wrong time, but that chance seems way lower
than a client starting up at the wrong time, especially since there
aren't many servers that use a port with 5 digits.
Attached is a patch that updates the port numbers.
Attachments:
v1-0001-Don-t-use-ephemeral-port-range.patch (application/octet-stream)
From 84c98b6333e3e4fa3c882955298d9ab0e8246dc6 Mon Sep 17 00:00:00 2001
From: Jelte Fennema-Nio <jelte.fennema@microsoft.com>
Date: Thu, 6 Jun 2024 23:50:51 +0200
Subject: [PATCH v1] Don't use ephemeral port range
---
src/test/perl/PostgreSQL/Test/Cluster.pm | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)
diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index 83f385a4870..359cfc3715e 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -125,6 +125,13 @@ our $min_compat = 12;
# list of file reservations made by get_free_port
my @port_reservation_files;
+# this is out of ephemeral port range for many systems hence
+# it is a lower chance that it will conflict with "in-use" ports
+my $port_lower_bound = 10200;
+
+# ephemeral port start on many Linux systems
+my $port_upper_bound = 32768;
+
INIT
{
@@ -149,7 +156,7 @@ INIT
$ENV{PGDATABASE} = 'postgres';
# Tracking of last port value assigned to accelerate free port lookup.
- $last_port_assigned = int(rand() * 16384) + 49152;
+ $last_port_assigned = int(rand() * ($port_upper_bound - $port_lower_bound - 1)) + $port_lower_bound;
# Set the port lock directory
@@ -1687,7 +1694,7 @@ sub get_free_port
{
# advance $port, wrapping correctly around range end
- $port = 49152 if ++$port >= 65536;
+ $port = $port_lower_bound if ++$port >= $port_upper_bound;
print "# Checking port $port\n";
# Check first that candidate port number is not included in
base-commit: 7b71e5bbccd6c86bc12ba0124e7282cfb3aa3226
--
2.34.1
On 2024-06-06 Th 18:02, Jelte Fennema-Nio wrote:
My suggestion would be to not touch the ephemeral port range at all
for these ports. ...
Attached is a patch that updates the port numbers.
Makes sense to me.
I still think my patch to force TCP mode for the SSL test makes sense as
well.
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Andrew Dunstan <andrew@dunslane.net> writes:
Makes sense to me.
I still think my patch to force TCP mode for the SSL test makes sense as
well.
+1 to both things. If that doesn't get the failure rate down to an
acceptable level, we can look at the retry idea.
regards, tom lane
Hello,
07.06.2024 17:25, Tom Lane wrote:
Andrew Dunstan <andrew@dunslane.net> writes:
I still think my patch to force TCP mode for the SSL test makes sense as well.
+1 to both things. If that doesn't get the failure rate down to an
acceptable level, we can look at the retry idea.
I'd like to add that the kerberos/001_auth test also suffers from the port
conflict, but slightly differently. Look for example at [1]:
krb5kdc.log contains:
Jul 02 09:29:41 andres-postgres-buildfarm-v5 krb5kdc[471964](info): setting up network...
Jul 02 09:29:41 andres-postgres-buildfarm-v5 krb5kdc[471964](Error): Address already in use - Cannot bind server socket on 127.0.0.1.55853
Jul 02 09:29:41 andres-postgres-buildfarm-v5 krb5kdc[471964](Error): Failed setting up a UDP socket (for 127.0.0.1.55853)
Jul 02 09:29:41 andres-postgres-buildfarm-v5 krb5kdc[471964](Error): Address already in use - Error setting up network
As far as I can see, the port for kdc is chosen by
PostgreSQL::Test::Kerberos, via
PostgreSQL::Test::Cluster::get_free_port(), which checks only for TCP
port availability (with can_bind()), but not for UDP, so this increases
the probability of the conflict for this test (a similar failure: [2]).
Although we can also find a failure with TCP: [3].
(It's not clear to me what processes can use UDP ports while testing,
but maybe those buildfarm animals are running on the same logical
machine simultaneously?)
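Probing UDP alongside TCP would be straightforward; a hypothetical
sketch (can_bind_udp() does not exist in Cluster.pm):
use strict;
use warnings;
use IO::Socket::INET;
# For UDP there is no listen(); creating the socket with a LocalPort
# already performs the bind, which is all we need to check.
sub can_bind_udp
{
    my ($host, $port) = @_;
    my $sock = IO::Socket::INET->new(
        LocalAddr => $host,
        LocalPort => $port,
        Proto => 'udp');
    return 0 unless defined $sock;
    close($sock);
    return 1;
}
print can_bind_udp('127.0.0.1', 55853) ? "udp free\n" : "udp in use\n";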
[1]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=rorqual&dt=2024-07-02%2009%3A27%3A15
[2]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon&dt=2024-05-15%2001%3A25%3A07
[3]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=grassquit&dt=2024-07-04%2008%3A28%3A19
Best regards,
Alexander
On 2024-07-08 Mo 8:00 AM, Alexander Lakhin wrote:
Hello,
07.06.2024 17:25, Tom Lane wrote:
+1 to both things. If that doesn't get the failure rate down to an
acceptable level, we can look at the retry idea.
I have pushed patches for both of those (i.e. start SSL test nodes in TCP
mode and change the range of ports we allocate server ports from).
I didn't see this email until after I had pushed them.
I'd like to add that the kerberos/001_auth test also suffers from the
port conflict, but slightly differently. ... As far as I can see, the
port for kdc is chosen by PostgreSQL::Test::Kerberos, via
PostgreSQL::Test::Cluster::get_free_port(), which checks only for TCP
port availability (with can_bind()), but not for UDP.
Let's see if this persists now that we are testing for free ports in a
different range, not the range usually used for ephemeral ports.
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com