Server error
Hello,
I have a plpgsql function which dies strangely very often, with the
message "server closed the connection unexpectedly". The log file says
[...]
postgres[3315]: [391-8] :extprm () :locprm () :initplan <> :nprm 0
:scanrelid 1 }
postgres[30748]: [182] DEBUG: reaping dead processes
postgres[30748]: [183] DEBUG: child process (pid 3315) was terminated
by signal 11
postgres[30748]: [184] DEBUG: server process (pid 3315) was terminated
by signal 11
postgres[30748]: [185] DEBUG: terminating any other active server
processes
postgres[30748]: [186] DEBUG: all server processes terminated;
reinitializing shared memory and semaphores
postgres[30748]: [187] DEBUG: shmem_exit(0)
[...]
What is signal 11 (and where is it documented anyway?), and what could
be the cause? I tracked the error to a line in the function which
consists of a simple EXECUTE call.
Erik
__________________________________________________
Yahoo! Plus
For a better Internet experience
http://www.yahoo.co.uk/btoffer
On Tue, 6 May 2003, Erik Ronström wrote:
Hello,
I have a plpgsql function which dies strangely very often, with the
message "server closed the connection unexpectedly". The log file says [...]
What is signal 11 (and where is it documented anyway?), and what could
be the cause? I tracked the error to a line in the function which
consists of a simple EXECUTE call.
Sig 11 means you have bad memory or CPU, about 99.9% of the time.
www.memtest86.com
Signal 11 is a segfault, as can be seen in 'man kill'.
So it is either broken hardware or broken software; in the latter case,
broken postgres?
Perhaps it would be useful if you could post that plpgsql function
as well?
arjen
-----Original Message-----
From: pgsql-general-owner@postgresql.org
[mailto:pgsql-general-owner@postgresql.org] On behalf of Erik Ronström
Sent: Tuesday, 6 May 2003 23:53
To: pgsql-general@postgresql.org
Subject: [GENERAL] Server error

I have a plpgsql function which dies strangely very often,
with the message "server closed the connection unexpectedly". [...]
"scott.marlowe" <scott.marlowe@ihs.com> writes:
On Tue, 6 May 2003, Erik Ronström wrote:
I have a plpgsql function which dies strangely very often, with the
message "server closed the connection unexpectedly". The log file says
Sig 11 means you have bad memory or CPU, about 99.9% of the time.
In my part of the universe, about 99% of the time it means you've found
a software bug ;-) ... especially if you can create an example case that
is reproducible on another machine. Erik, can you wrap up a test case?
And which PG version are you running, anyway?
regards, tom lane
Hi again,
thanks for the answers.
--- Tom Lane <tgl@sss.pgh.pa.us> wrote:
"scott.marlowe" <scott.marlowe@ihs.com> writes:
Sig 11 means you have bad memory or CPU, about 99.9% of the time.
In my part of the universe, about 99% of the time it means you've
found a software bug ;-) ... especially if you can create an example
case that is reproducible on another machine. Erik, can you wrap up
a test case?
99% + 99.9%, that makes 198.9 percent :-)
Unfortunately, the function depends heavily on the database structure. I
tried to extract the essential parts to reproduce the problem within a
small test DB, but then everything worked just fine! But I will post an
example when I get one...
And which PG version are you running, anyway?
7.2.1. I've heard it has some bugs, but the guy running the server
refuses to upgrade to _anything_ that isn't cleared by Debian. 7.3
should have been cleared some days ago, but it hasn't, don't know why.
Best regards
Erik
On Wed, 2003-05-07 at 11:48, Erik Ronström wrote:
And which PG version are you running, anyway?
7.2.1. I've heard it has some bugs, but the guy running the server
refuses to upgrade to _anything_ that isn't cleared by Debian. 7.3
should have been cleared some days ago, but it hasn't, don't know why.
It's blocked by perl and python2.2. The package dependencies are set by
the environment when I build the package and my versions of perl and
python are ahead of those in testing.
His position is a bit irrational, since progression to testing means
only that no release critical bugs have been found within the 10 days
since the package was uploaded. It is in no sense a guarantee of
perfection; it only means that it will probably not break your system
through an egregious packaging error.
You may tell him that, as Debian maintainer, I consider the current
unstable package to be better than the one in testing! If he wants to
keep the rest of his system pure he could download the source package
and build from that.
If in fact you mean that he is running stable, there is a woody build of
7.3.2 in an aptable repository at
http://people.debian.org/~elphick/debian
--
Oliver Elphick Oliver.Elphick@lfix.co.uk
Isle of Wight, UK http://www.lfix.co.uk/oliver
GPG: 1024D/3E1D0C1C: CA12 09E0 E8D5 8870 5839 932A 614D 4C34 3E1D 0C1C
========================================
"Dearly beloved, avenge not yourselves, but rather give
place unto wrath. For it is written, Vengeance is
mine; I will repay, saith the Lord. Therefore if thine
enemy hunger, feed him; if he thirst, give him drink;
for in so doing thou shalt heap coals of fire on his
head. Be not overcome of evil, but overcome evil with
good." Romans 12:19-21
On Tue, 6 May 2003, Tom Lane wrote:
"scott.marlowe" <scott.marlowe@ihs.com> writes:
On Tue, 6 May 2003, Erik Ronström wrote:
I have a plpgsql function which dies strangely very often, with the
message "server closed the connection unexpectedly". The log file says
Sig 11 means you have bad memory or CPU, about 99.9% of the time.
In my part of the universe, about 99% of the time it means you've found
a software bug ;-) ... especially if you can create an example case that
is reproducible on another machine. Erik, can you wrap up a test case?
And which PG version are you running, anyway?
Touché. I think the real issue is whether or not the error remains the
same each time. If it occurs in the same exact place each time, then it
is usually code. But if the sig 11 shows up in different places each
time, then it is likely bad hardware.
Further, just because one gets a sig11 every time they run a certain
stored proc is not necessarily the same as getting one in the same exact
place of the stored proc or postgresql code while it's running.
So, it's a good idea to get several traces of the sig 11, and compare
them. If they aren't happening in the same place each time, then the
hardware should be checked.
My point on this is that YOU shouldn't be chasing down these problems
until such time as the user has proven that their hardware is sound.
Since bad hardware is pretty common, and your time is a limited resource,
I really feel that if someone is getting sig 11s, they should be directed
to test their hardware first with something like memtest86 and only after
it passes should they come back to you. Especially right now when you and
the other developers are working hard to get the 7.4 code ready to go.
The old test for bad hardware, by the way, was to compile the Linux kernel
100 times with a -j <bignum> switch, with bignum set high enough to use
all your memory. Of course, that was back when 64 megs was a fair bit,
so it wasn't hard to get the machine to use it all. With bigger and
bigger memory subsystems, bad memory is much more likely to stay hidden
until load increases, then boom, you hit that bad bit and get a sig11.
Hence the need for better hardware testing before chasing the software bug
possibility.
On Wed, 7 May 2003, Erik Ronström wrote:
Hi again,
thanks for the answers.
--- Tom Lane <tgl@sss.pgh.pa.us> wrote:
"scott.marlowe" <scott.marlowe@ihs.com> writes:
Sig 11 means you have bad memory or CPU, about 99.9% of the time.
In my part of the universe, about 99% of the time it means you've
found a software bug ;-) ... especially if you can create an example
case that is reproducible on another machine. Erik, can you wrap up
a test case?
99% + 99.9%, that makes 198.9 percent :-)
That's because Postgresql goes to 11. :-)
Unfortunately, the function depends heavily on the database structure. I
tried to extract the essential parts to reproduce the problem within a
small test DB, but then everything worked just fine! But I will post an
example when I get one...
It could easily be that some data structure has to get to a certain size
before it clobbers some pointers somewhere. Or that the single bad bit of
memory in both machines isn't used until load gets high enough for it to
get allocated for a postgresql backend process.
And which PG version are you running, anyway?
7.2.1. I've heard it has some bugs, but the guy running the server
refuses to upgrade to _anything_ that isn't cleared by Debian. 7.3
should have been cleared some days ago, but it hasn't, don't know why.
Hasn't debian approved 7.2.4 yet? With the known bugs in 7.2.1 your
friend is being a bit pedantic if he won't at least upgrade to the latest
version of 7.2. I'd surely trust the opinion of the postgresql developers
over that of the debian developers on which versions of postgresql have
bugs you should be worried about.
Hello,
Still stuck with the same error. Finally managed to upgrade from 7.2.1
to 7.2.4, and realized that the problem is still there. Shit! I've not
yet been able to reproduce the problem on another location, but at
least I've isolated it a bit:
I have a function which creates a "cache" table with a subset of rows
from another table (CREATE TABLE new AS SELECT ... FROM old WHERE ...).
Then it adds a foreign key (ALTER TABLE new ADD CONSTRAINT ref FOREIGN
KEY ... REFERENCES old). Everything is fine so far. Now, if I drop the
new table and run the function again, postgres crashes.
Things to note:
1) If the old table doesn't contain any rows *when running the query
the first time*, there is no crash the second time.
2) If I execute the queries from the function "manually", typing them
into psql, everything works fine.
Looks to me like there is some sort of cleanup problem, since it is
almost always the second run (in each session) that crashes.
One question is: is it always safe to create a foreign key constraint,
even when the table contains data?
Erik
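The pattern Erik describes could be sketched as a minimal plpgsql function. This is a hypothetical reconstruction, not his actual code: table and column names are invented, the function body uses the 7.2-era single-quoted syntax (with doubled inner quotes), and it assumes master(id) has a primary key so the REFERENCES clause is valid. DDL inside plpgsql has to go through EXECUTE, which matches the line Erik tracked the crash to.

```sql
-- Hypothetical sketch of the reported pattern (names invented).
-- Assumes: CREATE TABLE master (id integer PRIMARY KEY, val integer);
CREATE FUNCTION rebuild_cache() RETURNS integer AS '
BEGIN
    -- build the "cache" table from a subset of master
    EXECUTE ''CREATE TABLE cache AS
              SELECT id, val FROM master WHERE val > 0'';
    -- then point it back at the source table
    EXECUTE ''ALTER TABLE cache ADD CONSTRAINT cache_master_fk
              FOREIGN KEY (id) REFERENCES master (id)'';
    RETURN 1;
END;
' LANGUAGE 'plpgsql';

-- First run succeeds; per the report, dropping the table and calling
-- the function again in the same session segfaults a 7.2.x backend:
SELECT rebuild_cache();
DROP TABLE cache;
SELECT rebuild_cache();  -- crash reported here on 7.2.x
```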
On Thu, 8 May 2003, Erik Ronström wrote:
Still stuck with the same error. Finally managed to upgrade from 7.2.1
to 7.2.4, and realized that the problem is still there. Shit! I've not
yet been able to reproduce the problem on another location, but at
least I've isolated it a bit:
I have a function which creates a "cache" table with a subset of rows
from another table (CREATE TABLE new AS SELECT ... FROM old WHERE ...).
Then it adds a foreign key (ALTER TABLE new ADD CONSTRAINT ref FOREIGN
KEY ... REFERENCES old). Everything is fine so far. Now, if I drop the
new table and run the function again, postgres crashes.
I can reproduce on 7.2 but not 7.3 or 7.4. It looks like something is
getting clobbered. When I recompiled with debugging and assertions
enabled, I get a crash the first time the function is called. You may
need to go through it with a debugger.
One question is: is it always safe to create a foreign key constraint,
even when the table contains data?
It'll error if the constraint is violated or invalid, but otherwise it
should be.
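For illustration (hypothetical tables, not from the thread), the check happens up front: adding the constraint over existing data is rejected if any row violates it, so if the ALTER TABLE succeeds the data was valid.

```sql
-- Hypothetical illustration: an FK added over existing data is
-- validated immediately.
CREATE TABLE parent (id integer PRIMARY KEY);
CREATE TABLE child  (id integer);
INSERT INTO child VALUES (42);   -- no matching row in parent

ALTER TABLE child ADD CONSTRAINT child_id_fk
    FOREIGN KEY (id) REFERENCES parent (id);
-- fails with a referential integrity error, since 42 has no match;
-- the constraint is not added and the table is left unchanged
```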
Name your tables with some sort of sequence and a concatenation, like:
create table 'temp' || nextval(...) AS ..., making sure to delete them
afterwards of course :-)
It will avoid any race conditions on cleanup. Do this for a while, and
check the catalogs to make sure that everything DOES get cleaned up.
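That suggestion could be sketched roughly as follows. Everything here is hypothetical (sequence name, table names, the 7.2-style quoting); the idea is just that each run builds a uniquely named table via EXECUTE, so a rerun never collides with a half-cleaned-up predecessor.

```sql
-- Hypothetical sketch: derive a unique table name from a sequence,
-- then create the table dynamically with EXECUTE.
CREATE SEQUENCE cache_seq;

CREATE FUNCTION rebuild_cache_seq() RETURNS text AS '
DECLARE
    tname text;
BEGIN
    tname := ''cache_'' || nextval(''cache_seq'');
    EXECUTE ''CREATE TABLE '' || tname
         || '' AS SELECT id, val FROM master WHERE val > 0'';
    -- caller is responsible for dropping stale cache_N tables
    RETURN tname;
END;
' LANGUAGE 'plpgsql';
```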
Erik Ronström wrote:
Still stuck with the same error. Finally managed to upgrade from 7.2.1
to 7.2.4, and realized that the problem is still there. [...] Now, if I
drop the new table and run the function again, postgres crashes. [...]
Does it do a full scan of the table's child and parent columns upon
creation of a foreign key?
Stephan Szabo wrote:
I can reproduce on 7.2 but not 7.3 or 7.4. It looks like something is
getting clobbered. [...] It'll error if the constraint is violated or
invalid, but otherwise it should be.
On Thu, 8 May 2003, Dennis Gearon wrote:
it does a full scan of the table's child and parent columns upon
creation of a foreign key?
It currently runs the trigger once per row in the referencing table.
Doing a single select with not exists will almost certainly be faster, but
that's waiting for someone else to decide to do it or me to get time to do
it. :)
Some form of check is required, AFAIK, because if the constraint isn't
satisfied at the end of the ALTER TABLE, an error should be thrown.
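The single-select check Stephan mentions could be sketched like this (hypothetical child/parent tables and an id column; this is just the validation a per-row trigger loop performs, collapsed into one query):

```sql
-- Hypothetical sketch: validate the whole referencing table in one
-- query instead of firing the check trigger once per row.
-- The constraint is satisfied iff this returns no rows.
SELECT c.id
FROM   child c
WHERE  c.id IS NOT NULL
  AND  NOT EXISTS (SELECT 1 FROM parent p WHERE p.id = c.id);
```

Rows with a NULL key are skipped, matching the usual MATCH UNSPECIFIED foreign-key semantics for single-column keys.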