Brittleness in regression test setup

Started by Peter Eisentraut over 17 years ago | 15 messages | hackers
#1Peter Eisentraut
peter_e@gmx.net

I have experienced some brittleness in the regression test setup that
causes the tests to be run against a different server instance or fail
in confusing ways when you have multiple instances running.

For some historic reasons, I have my local scripts set up so that they
build development instances using the hardcoded port 65432. This will
cause a temp install regression test to attempt to use port 565432 which
will be rejected silently by pg_regress, which will then use its
hardcoded default 65432 (no relation to my 65432). If I have some other
instance already running on 65432, then this will fail in non-reassuring
ways such as

============== removing existing temp installation ==============
============== creating temporary installation ==============
============== initializing database system ==============
============== starting postmaster ==============
running on port 65432 with pid 94178
============== creating database "regression" ==============
ERROR: database "regression" already exists

It evidently failed to realize that there is another postmaster already
running at that port and just ran its test setup routines against that
other instance.

If there is no database named "regression" on that other instance, then
it will happily go ahead and run its full test suite against that other
instance.

I see two problems here:

1) It fails to realize that it could not start its own temp instance
when another instance is already running.

2) It ignores the port specification almost silently.

Since ports 49152 and above are for private use, I think it is valid to
configure test instances in that port range, but our regression test
setup does not handle that port range very well.
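
To make the failure mode concrete, here is a minimal Python sketch of the behavior described above: the build prefixes a "5" to the configured port, and an out-of-range result is dropped in favor of the hardcoded 65432. The function name and the 1024..65535 acceptance range are assumptions for illustration; pg_regress itself is written in C and its actual check may differ.

```python
FALLBACK_PORT = 65432  # hardcoded default mentioned in this thread


def derive_temp_port(configured_port: int) -> int:
    """Mimic the "concatenate 5" rule, then the silent fallback."""
    candidate = int("5" + str(configured_port))
    # Assumed acceptance range; anything outside it is dropped without a message.
    if 1024 <= candidate <= 65535:
        return candidate
    return FALLBACK_PORT


if __name__ == "__main__":
    for port in (5432, 9999, 10000, 65432):
        print(port, "->", derive_temp_port(port))
```

Under these assumptions, a configured port of 65432 yields the candidate 565432, which is out of range, so the temp instance ends up on 65432 anyway.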

So even if I configured my local scripts to use ports that are all
different from each other and from 65432, this problem would still exist.

So, also,

2a) It has an undocumented hardcoded default port.

Any thoughts on how to fix this?

#2Andrew Dunstan
andrew@dunslane.net
In reply to: Peter Eisentraut (#1)
Re: Brittleness in regression test setup

Peter Eisentraut wrote:

So even if I configured my local scripts to use ports that are all
different from each other and from 65432, this problem would still exist.

Only if you choose the private port to be above 9999. The standard
buildfarm config file contains a warning against using such ports for
exactly this reason. Is using a private 4-digit port terribly difficult
for you?

So, also,

2a) It has an undocumented hardcoded default port.

Any thoughts on how to fix this?

a) document it

b) make it a lot noisier if it falls back on 65432.

cheers

andrew

#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Eisentraut (#1)
Re: Brittleness in regression test setup

Peter Eisentraut <peter_e@gmx.net> writes:

For some historic reasons, I have my local scripts set up so that they
build development instances using the hardcoded port 65432. This will
cause a temp install regression test to attempt to use port 565432 which
will be rejected silently by pg_regress, which will then use its
hardcoded default 65432 (no relation to my 65432).

One thing we should do is have pg_regress.c, not the Makefile,
select the default port to use. The concatenate-5 behavior is
just not intelligent enough.

regards, tom lane

#4Peter Eisentraut
peter_e@gmx.net
In reply to: Peter Eisentraut (#1)
Re: Brittleness in regression test setup

Peter Eisentraut wrote:

============== removing existing temp installation ==============
============== creating temporary installation ==============
============== initializing database system ==============
============== starting postmaster ==============
running on port 65432 with pid 94178
============== creating database "regression" ==============
ERROR: database "regression" already exists

It evidently failed to realize that there is another postmaster already
running at that port and just ran its test setup routines against that
other instance.

On this matter, I noticed that pg_regress doesn't do anything to clean
up its child processes. I see zombies lying around on Linux and Mac OS
when the postmaster dies. (And the zombie is exactly the pid 94178 it
announced in the example above.)

I played around a little with signal handling to collect the dying
postmaster and report an error; see attached rough patch. I'm not
exactly understanding how this works though. I would expect lots of
psql zombies for example, because those go through the same
spawn_process() call, but I'm not seeing any. Also, the sleep() call in
my patch is necessary to get some effect. Anyone else got a clue on what
to do here?
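
For illustration only, and not the attached patch: a small Python sketch of collecting a child process that dies shortly after launch. It polls the child for a short grace period, since the child needs a moment to exit before its status can be collected, which may be why a delay is needed at all. The name `wait_for_early_death` and its parameters are hypothetical.

```python
import subprocess
import sys
import time


def wait_for_early_death(proc, grace_seconds=2.0, interval=0.1):
    """Return the exit status if the child exits within the grace period, else None."""
    deadline = time.monotonic() + grace_seconds
    while True:
        # poll() also collects the exit status, so no zombie is left behind.
        status = proc.poll()
        if status is not None:
            return status
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)


if __name__ == "__main__":
    # Stand-in for a server that fails right after startup.
    child = subprocess.Popen([sys.executable, "-c", "import sys; sys.exit(3)"])
    status = wait_for_early_death(child, grace_seconds=10.0)
    if status is not None:
        print("child died early with status", status)
```

A caller could treat a non-None result as "the server did not come up" and stop before running any tests.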

Attachments:

pgregress-postmaster-death.diff (text/plain), +36 -2

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Eisentraut (#4)
Re: Brittleness in regression test setup

Peter Eisentraut <peter_e@gmx.net> writes:

I played around a little with signal handling to collect the dying
postmaster and report an error; see attached rough patch. I'm not
exactly understanding how this works though. I would expect lots of
psql zombies for example, because those go through the same
spawn_process() call, but I'm not seeing any.

That's because wait_for_tests wait()s for them.

AFAICS the only way you'd end up with a zombie postmaster is if pg_ctl
stop fails, but I'm failing to understand why that's likely to happen.

regards, tom lane

#6Peter Eisentraut
peter_e@gmx.net
In reply to: Tom Lane (#3)
Re: Brittleness in regression test setup

Tom Lane wrote:

One thing we should do is have pg_regress.c, not the Makefile,
select the default port to use. The concatenate-5 behavior is
just not intelligent enough.

How about something like this, constructing a port number from the
version and a timestamp? We could also take 2 more bits from the
version and give it to the timestamp, which would make this a bit safer,
I think.
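
The patch itself is not reproduced in this archive, so the following Python sketch shows only one possible way to build such a port number. The bit layout (7 bits of version, 7 bits of timestamp, packed above 49152), the function name, and the use of a numeric version such as 80400 are all assumptions, not the patch's actual scheme.

```python
PRIVATE_BASE = 0xC000  # 49152, start of the private port range


def temp_port(version_num: int, timestamp: int, version_bits: int = 7) -> int:
    """Pack the low bits of version and timestamp into the 14 bits above 49152."""
    time_bits = 14 - version_bits
    version_part = version_num & ((1 << version_bits) - 1)
    time_part = timestamp & ((1 << time_bits) - 1)
    return PRIVATE_BASE | (version_part << time_bits) | time_part


if __name__ == "__main__":
    import time

    now = int(time.time())
    print(temp_port(80400, now))                  # 7 version bits, 7 time bits
    print(temp_port(80400, now, version_bits=5))  # 2 bits moved to the timestamp
```

Moving two bits from the version to the timestamp, as suggested, widens the time-derived part from 128 to 512 values in this layout, which makes two runs less likely to land on the same port.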

Attachments:

pgregress-port-number.diff (text/plain), +14 -23

#7Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Peter Eisentraut (#6)
Re: Brittleness in regression test setup

Peter Eisentraut wrote:

Tom Lane wrote:

One thing we should do is have pg_regress.c, not the Makefile,
select the default port to use. The concatenate-5 behavior is
just not intelligent enough.

How about something like this, constructing a port number from the
version and a timestamp? We could also take 2 more bits from the
version and give it to the timestamp, which would make this a bit safer,
I think.

Is it possible to make it retry in case the chosen port is busy? I
guess a simple check should suffice, ignoring the obvious race condition
that someone uses the port after you checked it was OK.

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#8Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Eisentraut (#6)
Re: Brittleness in regression test setup

Peter Eisentraut <peter_e@gmx.net> writes:

Tom Lane wrote:

One thing we should do is have pg_regress.c, not the Makefile,
select the default port to use. The concatenate-5 behavior is
just not intelligent enough.

How about something like this, constructing a port number from the
version and a timestamp? We could also take 2 more bits from the
version and give it to the timestamp, which would make this a bit safer,
I think.

I'd vote for keeping the --temp-port option but not having the Makefile
use it. Seems like it'd still be potentially useful for hand use of
pg_regress.

Also, like Alvaro I'm thinking that a retry is really needed. As this
patch stands you'd be vulnerable to random, unrepeatable failures
anytime you picked a port that happened to be in use for something else.

regards, tom lane

#9Peter Eisentraut
peter_e@gmx.net
In reply to: Tom Lane (#5)
Re: Brittleness in regression test setup

Tom Lane wrote:

AFAICS the only way you'd end up with a zombie postmaster is if pg_ctl
stop fails, but I'm failing to understand why that's likely to happen.

No, the zombies appear if the postmaster dies shortly after launch.

#10Peter Eisentraut
peter_e@gmx.net
In reply to: Alvaro Herrera (#7)
Re: Brittleness in regression test setup

Alvaro Herrera wrote:

Is it possible to make it retry in case the chosen port is busy? I
guess a simple check should suffice, ignoring the obvious race condition
that someone uses the port after you checked it was OK.

Well, the whole point of this exercise was to avoid that. If we had a
way to do a "simple check", we might as well stick to the hardcoded port
and count up from that or something.

The problem with doing the checking is that you have to emulate the
complete postmaster logic for port numbers, listen addresses, Unix
domain socket directories, etc. That can become quite involved.

Then again, a simple way to avoid the issue altogether on platforms
supporting Unix-domain sockets would be to run the test over Unix-domain
sockets (which we do anyway) placed in a private directory. How about that?
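
As a rough sketch of the private-directory idea (in Python, with a hypothetical helper name and directory prefix): each test run creates its own directory, and the server's socket file would live there. PostgreSQL names its Unix-domain socket file `.s.PGSQL.<port>`; how the directory is passed to the server and to the clients is left out here.

```python
import os
import tempfile


def private_socket_path(port: int) -> str:
    """Create a private directory and return the socket path a server would use in it."""
    # mkdtemp creates a fresh directory readable only by the current user.
    sock_dir = tempfile.mkdtemp(prefix="pg_regress_sock_")
    return os.path.join(sock_dir, f".s.PGSQL.{port}")


if __name__ == "__main__":
    print(private_socket_path(65432))
    print(private_socket_path(65432))  # same port, different directory
```

Two runs using the same port number then get different socket paths, so as far as the socket file is concerned they cannot reach each other's server by accident.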

#11Peter Eisentraut
peter_e@gmx.net
In reply to: Tom Lane (#8)
Re: Brittleness in regression test setup

Tom Lane wrote:

I'd vote for keeping the --temp-port option but not having the Makefile
use it. Seems like it'd still be potentially useful for hand use of
pg_regress.

Sorry, I didn't document this fully. The --temp-port option appears to
be redundant with the --port option, so I figured we could drop the
former and just use the latter for both the temp install and existing
install cases.

#12Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Eisentraut (#10)
Re: Brittleness in regression test setup

Peter Eisentraut <peter_e@gmx.net> writes:

Then again, a simple way to avoid the issue altogether on platforms
supporting Unix-domain sockets would be to run the test over Unix-domain
sockets (which we do anyway) placed in a private directory. How about that?

Then the brittleness is still there on Windows, only we'd probably get
confused and think it was a platform-specific bug.

regards, tom lane

#13Peter Eisentraut
peter_e@gmx.net
In reply to: Peter Eisentraut (#10)
Re: Brittleness in regression test setup

Peter Eisentraut wrote:

Alvaro Herrera wrote:

Is it possible to make it retry in case the chosen port is busy? I
guess a simple check should suffice, ignoring the obvious race condition
that someone uses the port after you checked it was OK.

Well, the whole point of this exercise was to avoid that. If we had a
way to do a "simple check", we might as well stick to the hardcoded port
and count up from that or something.

Well, duh, the checking is actually pretty simple. We just try to
connect with psql to the candidate port number before starting our own
postmaster and see if anyone is already there.

Patch attached. It solves my immediate problems nicely.
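
A hedged sketch of the general approach, not the attached patch: start from a candidate port, ask a probe whether a server already answers there, and count up until a free one is found. It is written in Python for brevity; `pick_free_port`, the `in_use` callback, and the limit of 16 tries are hypothetical. In the scheme described above, the probe would be an attempt to connect with psql.

```python
from typing import Callable


def pick_free_port(start: int, in_use: Callable[[int], bool], max_tries: int = 16) -> int:
    """Return the first port at or after `start` for which the probe reports no server."""
    for port in range(start, start + max_tries):
        if port > 65535:
            break  # ran past the valid port range
        if not in_use(port):
            return port
    raise RuntimeError(
        f"no free port found in {max_tries} tries starting at {start}"
    )


if __name__ == "__main__":
    busy = {65432, 65433}  # pretend two other instances are running
    print(pick_free_port(65432, lambda p: p in busy))
```

Failing loudly when every candidate is taken avoids the original problem of quietly running the tests against someone else's server.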

Attachments:

pgregress-port-number.diff (text/plain), +46 -24

#14Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Peter Eisentraut (#13)
Re: Brittleness in regression test setup

Peter Eisentraut wrote:

Peter Eisentraut wrote:

Alvaro Herrera wrote:

Is it possible to make it retry in case the chosen port is busy? I
guess a simple check should suffice, ignoring the obvious race condition
that someone uses the port after you checked it was OK.

Well, the whole point of this exercise was to avoid that. If we had a
way to do a "simple check", we might as well stick to the hardcoded
port and count up from that or something.

Well, duh, the checking is actually pretty simple. We just try to
connect with psql to the candidate port number before starting our own
postmaster and see if anyone is already there.

But what if something else is using the port? I think you could attempt
a bare connect().
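
A bare connect() check could look roughly like the following Python sketch. The function name, the loopback host, and the timeout are assumptions; a real implementation in pg_regress would be C and would also have to consider listen addresses and Unix-domain sockets, which this sketch ignores.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1", timeout: float = 1.0) -> bool:
    """Return True if anything accepts a TCP connection on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused or timed out: treat the port as not in use.
        return False


if __name__ == "__main__":
    print(port_in_use(65432))
```

Unlike a psql probe, this reports any listener on the port, not only a PostgreSQL server, but it is still subject to the race where something grabs the port after the check.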

Note typo here:

+ fprintf(stderr, _("Specify an used port using the --port option or shut down any conflicting PostgreSQL servers.\n"));

Should say "an unused port"

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

#15Peter Eisentraut
peter_e@gmx.net
In reply to: Alvaro Herrera (#14)
Re: Brittleness in regression test setup

Alvaro Herrera wrote:

Well, duh, the checking is actually pretty simple. We just try to
connect with psql to the candidate port number before starting our own
postmaster and see if anyone is already there.

But what if something else is using the port? I think you could attempt
a bare connect().

Well, that goes beyond the scope of my original problem, which is that
the regression tests will silently run against a different installation.
If you run psql against a non-PostgreSQL server, you will hopefully
see more obvious failures. We could add this in the future, if there
are complaints from the field.