Tests randomly failed

Started by Alexander Klimov about 25 years ago | 10 messages | bugs

ask@wisdom.weizmann.ac.il

about 25 years ago

Hi all.

The first time I executed `make check', 10 tests failed:
float8 ... FAILED
test numerology ... FAILED
point ... FAILED
lseg ... FAILED
interval ... FAILED
test geometry ... FAILED
test horology ... FAILED
subselect ... FAILED
union ... FAILED
test misc ... FAILED

the second time it was only 5:

abstime ... FAILED
test horology ... FAILED
subselect ... FAILED
union ... FAILED
test misc ... FAILED

the third time it was 10 again:
abstime ... FAILED
tinterval ... FAILED
inet ... FAILED
comments ... FAILED
oidjoins ... FAILED
test horology ... FAILED
case ... FAILED
join ... FAILED
portals ... FAILED
test misc ... FAILED

Results of second and third passes are in the attachment.
It looks like the failed tests are due to
! psql: connectDBStart() -- connect() failed: Connection refused
! Is the postmaster running locally
! and accepting connections on Unix socket '/tmp/.s.PGSQL.65432'?

My guess is that this could be due to high load of my box, but
w said
11:29am up 24 day(s), 18:30, 2 users, load average: 0.00, 0.18, 0.29
and I shut down my production postmaster before tests, and I have 256MB of
RAM,
SunOS iridium 5.6 Generic_105181-20 sun4u sparc SUNW,Ultra-5_10
gcc version 2.95.2 19991024 (release)
psql (PostgreSQL) 7.1RC1 (actually from CVS)

So, the question is: what is the reason for such behaviour, and how can I
fight it?

Regards,
ASK

Tom Lane

tgl@sss.pgh.pa.us

about 25 years ago

In reply to: Alexander Klimov (#1)

Re: Tests randomly failed

Alexander Klimov <ask@wisdom.weizmann.ac.il> writes:

It looks like the failed tests are due to
! psql: connectDBStart() -- connect() failed: Connection refused
! Is the postmaster running locally
! and accepting connections on Unix socket '/tmp/.s.PGSQL.65432'?

What I see is a lot of

! psql: Backend startup failed

which suggests a fork() failure. Look in the postmaster logfile to see
the exact kernel error code --- but probably you are out of swap space
or up against the kernel's limit on number of processes for one userid.
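
One way to check the second possibility is to compare the number of processes the test user currently owns against the per-user limit. A minimal sketch in POSIX shell; the function names `count_user_procs` and `check_headroom` are made up for illustration, and the limit of 64 in the usage line is only an example value that has to be read from the local system:

```shell
# Hypothetical helpers, not part of PostgreSQL or pg_regress.

# Count the processes currently owned by a user (defaults to the caller).
# tr strips the padding that some wc implementations add.
count_user_procs() {
    ps -u "${1:-$(id -un)}" -o pid= | wc -l | tr -d ' '
}

# Say whether COUNT has reached the per-user process LIMIT.
# usage: check_headroom COUNT LIMIT
check_headroom() {
    if [ "$1" -ge "$2" ]; then
        echo "at limit"
    else
        echo "ok"
    fi
}
```

Running `check_headroom "$(count_user_procs)" 64` from a second terminal while the tests are in progress gives a rough idea of how close the test user gets to the limit. It is only a snapshot, so a short spike in backend processes can still be missed.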

regards, tom lane

Alexander Klimov

ask@wisdom.weizmann.ac.il

about 25 years ago

In reply to: Tom Lane (#2)

Re: Tests randomly failed

On Thu, 22 Mar 2001, Peter Eisentraut wrote:

In src/test/regress/pg_regress[.sh], line 163, change

*-*-qnx* | *beos*)

to

*-*-qnx* | *beos* | *solaris*)

and rerun the tests. This will avoid using Unix domain sockets, which are
broken on Solaris.

Yes, it works now:
======================
All 76 tests passed.
======================

On the other hand, my production version uses Unix domain sockets
without problems.

Regards,
ASK

Peter Eisentraut

peter_e@gmx.net

about 25 years ago

In reply to: Alexander Klimov (#1)

Re: Tests randomly failed

Alexander Klimov writes:

Results of second and third passes are in the attachment.
It looks like the failed tests are due to
! psql: connectDBStart() -- connect() failed: Connection refused
! Is the postmaster running locally
! and accepting connections on Unix socket '/tmp/.s.PGSQL.65432'?

My guess is that this could be due to high load of my box, but
w said
11:29am up 24 day(s), 18:30, 2 users, load average: 0.00, 0.18, 0.29
and I shut down my production postmaster before tests, and I have 256MB of
RAM,
SunOS iridium 5.6 Generic_105181-20 sun4u sparc SUNW,Ultra-5_10
gcc version 2.95.2 19991024 (release)
psql (PostgreSQL) 7.1RC1 (actually from CVS)

In src/test/regress/pg_regress[.sh], line 163, change

*-*-qnx* | *beos*)

to

*-*-qnx* | *beos* | *solaris*)

and rerun the tests. This will avoid using Unix domain sockets, which are
broken on Solaris.
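
The effect of the pattern change can be sketched in isolation. This is not the actual pg_regress code: the function name `unix_sockets_for` and the sample platform strings are assumptions, and only the `case` pattern and the yes/no `unix_sockets` idea come from this thread:

```shell
# Hypothetical stand-alone sketch of the platform check.
# Prints "no" for platforms where the tests should use TCP/IP instead of
# Unix domain sockets, and "yes" otherwise.
unix_sockets_for() {
    case $1 in
        *-*-qnx* | *beos* | *solaris*)
            echo no
            ;;
        *)
            echo yes
            ;;
    esac
}

unix_sockets_for sparc-sun-solaris2.6   # matches *solaris*
unix_sockets_for i686-pc-linux-gnu      # falls through to the default
```

Because `case` patterns are shell globs, `*solaris*` matches any platform string that contains "solaris", whatever the CPU or vendor fields are.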

--
Peter Eisentraut peter_e@gmx.net http://yi.org/peter-e/

Tom Lane

tgl@sss.pgh.pa.us

about 25 years ago

In reply to: Peter Eisentraut (#4)

Re: Tests randomly failed

Peter Eisentraut <peter_e@gmx.net> writes:

In src/test/regress/pg_regress[.sh], line 163, change
*-*-qnx* | *beos*)
to
*-*-qnx* | *beos* | *solaris*)

and rerun the tests. This will avoid using Unix domain sockets, which are
broken on Solaris.

I was just thinking that maybe pg_regress should have a command line
option to set unix_sockets=no, so that both connection options could
be exercised when there's doubt.

regards, tom lane

Justin Clift

justin@postgresql.org

about 25 years ago

In reply to: Peter Eisentraut (#4)

Re: Tests randomly failed

Hey guys,

I don't understand what you mean by "This will avoid using Unix domain
sockets, which are broken on Solaris.".

If this were the case, then the errors which are described would happen
on ALL solaris platforms wouldn't they? And other packages using Unix
domain sockets would have problems too wouldn't they?

If it's of any help, I get the same types of regression testing failures
on Solaris, with the same "is the backend running?" type error
messages.. when the installation of solaris HAS NOT had its /etc/system
file altered to change the amount of shared memory segments and
semaphores.

Whenever I have those problems, I insert the updated (higher) values for
shared memory and semaphores, reboot the system, then the tests pass as
the backend is able to start fine.

Hope this is helpful.

Regards and best wishes,

Justin Clift

Tom Lane wrote:

Peter Eisentraut <peter_e@gmx.net> writes:

In src/test/regress/pg_regress[.sh], line 163, change
*-*-qnx* | *beos*)
to
*-*-qnx* | *beos* | *solaris*)

and rerun the tests. This will avoid using Unix domain sockets, which are
broken on Solaris.

I was just thinking that maybe pg_regress should have a command line
option to set unix_sockets=no, so that both connection options could
be exercised when there's doubt.

regards, tom lane

Tom Lane

tgl@sss.pgh.pa.us

about 25 years ago

In reply to: Justin Clift (#6)

Re: Tests randomly failed

Justin Clift <jclift@iprimus.com.au> writes:

If it's of any help, I get the same types of regression testing failures
on Solaris, with the same "is the backend running?" type error
messages.. when the installation of solaris HAS NOT had its /etc/system
file altered to change the amount of shared memory segments and
semaphores.

Whenever I have those problems, I insert the updated (higher) values for
shared memory and semaphores, reboot the system, then the tests pass as
the backend is able to start fine.

Hm. That's interesting, but it's fairly hard to believe. For at least
a couple releases past, Postgres has grabbed all the shared memory and
semaphores that it wants at postmaster start. Insufficient shmem/sema
resources should result in postmaster abort, not in occasional failures
to start backends.

regards, tom lane

Justin Clift

justin@postgresql.org

about 25 years ago

In reply to: Peter Eisentraut (#4)

Re: Tests randomly failed

Hi Tom,

I know what you're saying, but I've come across it multiple times.

The process for building a Solaris server for PostgreSQL is (from
memory) :

A) Install the OS
B) Install the latest Maintenance Update
C) Install the latest recommended patches
D) Adjust system values for semaphores and shared memory
E) Do an initial lockdown for system security
F) Reboot for the new settings to take effect
G) Create postgres group and postgres user
H) Compile postgres
I) Run the regression tests
J) Lockdown system again
K) Reboot, test startup scripts, etc
<etc>

If I'm working very late and can't find the semaphore settings, then
sometimes I'll do them out-of-order.

A number of times I've totally forgotten to change things until
PostgreSQL complains either in the regression tests (as described in
this thread) or during normal startup.

We're talking a few times anyway, probably about.... um... 15 - 20 times
or so that I've forgotten.

Regards and best wishes,

Justin Clift

Tom Lane wrote:

Justin Clift <jclift@iprimus.com.au> writes:

If it's of any help, I get the same types of regression testing failures
on Solaris, with the same "is the backend running?" type error
messages.. when the installation of solaris HAS NOT had its /etc/system
file altered to change the amount of shared memory segments and
semaphores.

Whenever I have those problems, I insert the updated (higher) values for
shared memory and semaphores, reboot the system, then the tests pass as
the backend is able to start fine.

Hm. That's interesting, but it's fairly hard to believe. For at least
a couple releases past, Postgres has grabbed all the shared memory and
semaphores that it wants at postmaster start. Insufficient shmem/sema
resources should result in postmaster abort, not in occasional failures
to start backends.

regards, tom lane

Alexander Klimov

ask@wisdom.weizmann.ac.il

about 25 years ago

In reply to: Tom Lane (#2)

Re: Tests randomly failed

On Thu, 22 Mar 2001, Tom Lane wrote:

What I see is a lot of

! psql: Backend startup failed

which suggests a fork() failure. Look in the postmaster logfile to see
the exact kernel error code --- but probably you are out of swap space
or up against the kernel's limit on number of processes for one userid.

Strange, but this solution *also* works: in /etc/system I raised the limit from 64 to
set maxuprc=256
reverted pg_regress.sh to its original state (with Unix sockets for Solaris),
and now all tests pass.
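
To see what a given /etc/system file sets for this tunable, a small parser is enough. A sketch, assuming the `set maxuprc=N` line format shown above; `get_maxuprc` is a made-up helper name, and a change to /etc/system only takes effect after a reboot:

```shell
# Hypothetical helper: print the maxuprc value configured in a
# Solaris-style /etc/system file, or nothing if the tunable is not set.
# Lines that do not start with "set" (such as "*" comments) are ignored.
# If the tunable appears more than once, the last occurrence is printed.
get_maxuprc() {
    sed -n 's/^[[:space:]]*set[[:space:]]\{1,\}maxuprc[[:space:]]*=[[:space:]]*\([0-9]\{1,\}\).*/\1/p' \
        "${1:-/etc/system}" | tail -n 1
}
```

Called without an argument it reads /etc/system. Empty output means the file does not set the tunable, so the kernel default applies.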

Regards,
ASK

Peter Eisentraut

peter_e@gmx.net

about 25 years ago

In reply to: Justin Clift (#6)

Re: Tests randomly failed

Justin Clift writes:

I don't understand what you mean by "This will avoid using Unix domain
sockets, which are broken on Solaris.".

If this were the case, then the errors which are described would happen
on ALL solaris platforms wouldn't they?

I suppose things are a bit more complicated than that. We once had a
brief suspicion that it could be related to Sun's tmpfs file system that
/tmp often resides on, but I don't think this turned out to be the case.

And other packages using Unix domain sockets would have problems too
wouldn't they?

Indeed. A while ago I looked around and found at least two packages (INN
and Postfix) that had similar-sounding problems. In fact, one of the two
ended up disabling it with the words "more trouble than it's worth".

You could argue that X and KDE and what else should be broken as well.
This is a good question. It could perhaps be related to a buffer problem,
under the assumption that X usually passes small amounts of data through
the pipe, whereas PostgreSQL can pass megabytes in a very short time.
(Don't know what INN and Postfix would want to do with a local socket.)

The bottom line here is that the switch from local sockets to TCP/IP
invariably fixes the identical failure pattern. Make of that what you
will.

--
Peter Eisentraut peter_e@gmx.net http://yi.org/peter-e/
