It happened again: Server hung up solid

Started by The Hermit Hacker, almost 26 years ago. 39 messages. List: hackers
#1The Hermit Hacker
scrappy@hub.org

Okay, this is with code of ~May 4th ... a 'psql' connection to the
database hangs solid.

errout is dated:

pgsql% !ls
ls -lt
total 13324
-rw------- 1 pgsql pgsql 4842715 May 7 10:57 errout.5432

and the last few lines contain:

ERROR: parser: parse error at or near "vpti"
pq_recvbuf: unexpected EOF on client connection
pq_flush: send() failed: Broken pipe
pq_recvbuf: recv() failed: Connection reset by peer
pq_recvbuf: unexpected EOF on client connection
pq_recvbuf: unexpected EOF on client connection
pq_flush: send() failed: Broken pipe
pq_recvbuf: recv() failed: Connection reset by peer

But, of course, no date/time ...

ps shows:

USER PID %CPU %MEM VSZ RSS TT STAT STARTED TIME COMMAND
pgsql 33515 0.0 0.0 0 0 ?? Z 4:45PM 0:00.00 (postgres)
pgsql 33516 0.0 0.0 0 0 ?? Z 4:45PM 0:00.00 (postgres)
pgsql 93757 0.0 0.2 1456 1088 p0 S Wed03PM 0:01.11 -su (tcsh)
pgsql 7100 0.0 0.5 38692 2616 ?? Is Fri12AM 8:43.44 /pgsql/bin/postmas
pgsql 33667 0.0 0.0 396 224 p0 R+ 7:35PM 0:00.00 ps ux

and postmaster is started with:

pgsql% cat pgstart
#!/bin/tcsh
setenv PORT 5432
setenv POSTMASTER /pgsql/bin/postmaster
unlimit
${POSTMASTER} -B 4096 -N 128 -S -o "-F -o /pgsql/errout.${PORT} -S 32768" \
-i -p ${PORT} -D/pgsql/data

The machine is a Dual PIII with 512Meg of RAM, running FreeBSD 4.0-STABLE
from April 22nd ...

pgsql% truss -p 7100

Shows zilch ...

Since this is a production server, I can't just leave it there hung like
that, but if someone wants to give some instructions on what to do the
next time this happens, please feel free to do so, and I'll add that to my
list ... maybe run a gdb command on it, since truss doesn't appear to
help?

At this time, I consider this to be a show-stopper on the release ... this
is what happened the last time when the result appeared to be the index
corruption ... this time, I've checked a VACUUM after re-starting and it
doesn't appear to be a problem, but they might not have been related, just
a fluke ...

Marc G. Fournier ICQ#7615664 IRC Nick: Scrappy
Systems Administrator @ hub.org
primary: scrappy@hub.org secondary: scrappy@{freebsd|postgresql}.org
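
The errout log quoted above carries no timestamps, which makes it hard to tell whether its last lines have anything to do with the hang. One way to get them, as a minimal sketch only, is to pipe a program's output through a small filter that prefixes each line with the current time. The function name stamp_lines is hypothetical, and the sketch assumes the output can be sent through a pipe at all; the pgstart script above instead has the backend write straight to the file, so this would need a different way of capturing the output.

```shell
#!/bin/sh
# Hypothetical helper (not part of PostgreSQL): prefix every line read
# from stdin with the local date and time, then write it to stdout.
stamp_lines() {
    # The "|| [ -n ... ]" part keeps a final line that lacks a newline.
    while IFS= read -r line || [ -n "$line" ]; do
        printf '%s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$line"
    done
}
```

A possible use would be `some_server 2>&1 | stamp_lines >> some_logfile`, where both names are placeholders. Calling date once per line is slow for a busy log, so this is only suitable as a stopgap while chasing a problem like this one.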

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: The Hermit Hacker (#1)
Re: It happened again: Server hung up solid

The Hermit Hacker <scrappy@hub.org> writes:

Okay, this is with code of ~May 4th ... a 'psql' connection to the
database hangs solid.

Do you mean you can't make a connection at all? Is there any indication
that the postmaster is lighting off a backend for you? Since you show
a couple of zombie backends hanging around, it would seem like a good
bet that the postmaster itself is wedged and not responding to events,
but I'm not sure.

errout is dated:

pgsql% !ls
ls -lt
total 13324
-rw------- 1 pgsql pgsql 4842715 May 7 10:57 errout.5432

and the last few lines contain:

ERROR: parser: parse error at or near "vpti"
pq_recvbuf: unexpected EOF on client connection
pq_flush: send() failed: Broken pipe
pq_recvbuf: recv() failed: Connection reset by peer
pq_recvbuf: unexpected EOF on client connection
pq_recvbuf: unexpected EOF on client connection
pq_flush: send() failed: Broken pipe
pq_recvbuf: recv() failed: Connection reset by peer

But, of course, no date/time ...

Given that the file mod time is considerably before the hang (right?)
the messages in it are probably unrelated. It does seem odd that you
have so many clients disconnecting ungracefully; what client apps are
you running?

Since this is a production server, I can't just leave it there hung like
that, but if someone wants to give some instructions on what to do the
next time this happens, please feel free to do so, and I'll add that to my
list ... maybe run a gdb command on it, since truss doesn't appear to
help?

Try killing the postmaster itself in such a way as to produce a coredump
(kill -ABRT ought to do) and get a backtrace from that. It might also
be worth running the postmaster with connection tracing turned on (I
forget the incantation for that, but it should be in TFM).

At this time, I consider this to be a show-stopper on the release ... this
is what happened the last time when the result appeared to be the index
corruption

If the postmaster is hanging then it's almost certainly unrelated to
index corruption...

regards, tom lane

#3The Hermit Hacker
scrappy@hub.org
In reply to: Tom Lane (#2)
Re: It happened again: Server hung up solid

On Sun, 7 May 2000, Tom Lane wrote:

The Hermit Hacker <scrappy@hub.org> writes:

Okay, this is with code of ~May 4th ... a 'psql' connection to the
database hangs solid.

Do you mean you can't make a connection at all? Is there any indication
that the postmaster is lighting off a backend for you? Since you show
a couple of zombie backends hanging around, it would seem like a good
bet that the postmaster itself is wedged and not responding to events,
but I'm not sure.

This appears to be the case, but next time it happens I will make
double-sure of that ... considering that it was ~7pm when I tried, my
initial guess is that nothing was going through the postmaster at the
time of the hang ...

Given that the file mod time is considerably before the hang (right?)
the messages in it are probably unrelated. It does seem odd that you
have so many clients disconnecting ungracefully; what client apps are
you running?

A lot of DBI stuff, the search engine for udmsearch, some PHP ... the
server is currently serving ~12 databases for various clients ...

Try killing the postmaster itself in such a way as to produce a coredump
(kill -ABRT ought to do) and get a backtrace from that. It might also
be worth running the postmaster with connection tracing turned on (I
forget the incantation for that, but it should be in TFM).

Will look at that one ...

#4The Hermit Hacker
scrappy@hub.org
In reply to: Tom Lane (#2)
Re: It happened again: Server hung up solid

Okay, just happened again ... no postgres backend is being started:

USER PID %CPU %MEM VSZ RSS TT STAT STARTED TIME COMMAND
pgsql 34611 0.0 0.0 0 0 ?? Z 8:43PM 0:00.00 (postgres)
pgsql 93757 0.0 0.2 1456 1104 p0 S Wed03PM 0:01.16 -su (tcsh)
pgsql 33683 0.0 0.6 38356 3024 ?? Is 7:38PM 0:03.54 /pgsql/bin/postmaster -B 4096 -N 128 -S -o -F -o /pgsql/errout.5432
pgsql 34677 0.0 0.2 1408 1048 p2 S 8:50PM 0:00.07 -su (tcsh)
pgsql 34685 0.0 0.2 1652 1032 p0 S+ 8:51PM 0:00.01 psql udmsearch
pgsql 34687 0.0 0.0 400 232 p2 R+ 8:51PM 0:00.00 ps ux

Going to look at the connection tracing option now and see what I can come
up with ...

Marc G. Fournier ICQ#7615664 IRC Nick: Scrappy
Systems Administrator @ hub.org
primary: scrappy@hub.org secondary: scrappy@{freebsd|postgresql}.org

#5The Hermit Hacker
scrappy@hub.org
In reply to: The Hermit Hacker (#4)
Re: It happened again: Server hung up solid

kill -ABRT does nothing:

pgsql% kill -ABRT 33683
pgsql% !ps
ps ux
USER PID %CPU %MEM VSZ RSS TT STAT STARTED TIME COMMAND
pgsql 34611 0.0 0.0 0 0 ?? Z 8:43PM 0:00.00 (postgres)
pgsql 93757 0.0 0.2 1456 1104 p0 S Wed03PM 0:01.17 -su (tcsh)
pgsql 33683 0.0 0.6 38356 3024 ?? Is 7:38PM 0:03.54 /pgsql/bin/postmas
pgsql 34677 0.0 0.2 1408 1048 p2 S+ 8:50PM 0:00.08 -su (tcsh)
pgsql 34696 0.0 0.0 396 232 p0 R+ 8:56PM 0:00.00 ps ux
pgsql% !ps
ps ux
USER PID %CPU %MEM VSZ RSS TT STAT STARTED TIME COMMAND
pgsql 34611 0.0 0.0 0 0 ?? Z 8:43PM 0:00.00 (postgres)
pgsql 93757 0.0 0.2 1456 1104 p0 S Wed03PM 0:01.17 -su (tcsh)
pgsql 33683 0.0 0.6 38356 3024 ?? Is 7:38PM 0:03.54 /pgsql/bin/postmas
pgsql 34677 0.0 0.2 1408 1048 p2 S+ 8:50PM 0:00.08 -su (tcsh)
pgsql 34697 0.0 0.0 396 232 p0 R+ 8:56PM 0:00.00 ps ux

Marc G. Fournier ICQ#7615664 IRC Nick: Scrappy
Systems Administrator @ hub.org
primary: scrappy@hub.org secondary: scrappy@{freebsd|postgresql}.org

#6Vince Vielhaber
vev@michvhf.com
In reply to: The Hermit Hacker (#4)
Re: It happened again: Server hung up solid

On Sun, 7 May 2000, The Hermit Hacker wrote:

Okay, just happened again ... no postgres backend is being started:

I don't know how close in time it was, but I just hit reload on that
query that was sent to webmaster.

Vince.

--
==========================================================================
Vince Vielhaber -- KA8CSH email: vev@michvhf.com http://www.pop4.net
128K ISDN from $22.00/mo - 56K Dialup from $16.00/mo at Pop4 Networking
Online Campground Directory http://www.camping-usa.com
Online Giftshop Superstore http://www.cloudninegifts.com
==========================================================================

#7The Hermit Hacker
scrappy@hub.org
In reply to: The Hermit Hacker (#1)
Re: It happened again: Server hung up solid

With -d set to 1 (connection tracing), all I see when I connect, in the
log files, is:

FindExec: found "/pgsql/bin/postgres" using argv[0]
FindExec: found "/pgsql/bin/postgres" using argv[0]

That doesn't tell me what I'm connecting through ...

Marc G. Fournier ICQ#7615664 IRC Nick: Scrappy
Systems Administrator @ hub.org
primary: scrappy@hub.org secondary: scrappy@{freebsd|postgresql}.org

#8Tom Lane
tgl@sss.pgh.pa.us
In reply to: The Hermit Hacker (#5)
Re: It happened again: Server hung up solid

The Hermit Hacker <scrappy@hub.org> writes:

kill -ABRT does nothing:

Oh? Must be hung up in a kernel call then. That will probably mean
that you can't attach to the stuck process with gdb either (though
it'd be worth trying, since a backtrace would be mighty useful if
you could get it).

My next thought is to truss the postmaster process before it hangs
up, with hopes of finding out what kernel call is hanging.

Also, you might try netstat to see if you can see any freshly-opened
incoming connections when it happens. Also, "lsof -p" or local
equivalent on the stuck postmaster.

regards, tom lane
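
The checks suggested above (process state, open files, open connections) could be gathered in one step just before a restart, so that a production server does not have to stay hung while commands are typed by hand. What follows is a minimal sketch, assuming a POSIX shell. The function name snapshot_pid and the output file are hypothetical, and any tool that is not installed is simply skipped; the thread does not say whether lsof or a local equivalent exists on this machine.

```shell
#!/bin/sh
# Hypothetical helper: record what can be learned about a stuck process.
# Usage: snapshot_pid <pid> <output-file>
snapshot_pid() {
    pid=$1
    out=$2
    {
        echo "== snapshot of pid $pid at $(date '+%Y-%m-%d %H:%M:%S') =="
        # kill -0 sends no signal; it only tests that the pid can be signalled.
        if kill -0 "$pid" 2>/dev/null; then
            echo "process exists"
        else
            echo "process not found (or not owned by this user)"
        fi
        # STAT and WCHAN hint at whether the process is asleep in the kernel.
        if command -v ps >/dev/null 2>&1; then
            ps -o pid,stat,wchan,command -p "$pid" 2>&1 || true
        fi
        # Open files and sockets held by the process.
        if command -v lsof >/dev/null 2>&1; then
            lsof -p "$pid" 2>&1 || true
        fi
        # All sockets on the machine, to spot freshly opened connections.
        if command -v netstat >/dev/null 2>&1; then
            netstat -an 2>&1 || true
        fi
    } > "$out"
}
```

It might be called as `snapshot_pid 33683 /tmp/postmaster-hang.txt`, using the postmaster PID from the ps listing earlier in the thread; the output path is made up. None of this replaces a backtrace, it only preserves the state that is lost once the postmaster is restarted.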

#9Michael Robinson
robinson@netrinsics.com
In reply to: Tom Lane (#8)
Re: It happened again: Server hung up solid

Try killing the postmaster itself in such a way as to produce a coredump
(kill -ABRT ought to do) and get a backtrace from that.

The "gcore" command (on most modern unices) will generate a core dump of a
running process without killing the process. It seems that would be more
useful in this circumstance.

-Michael Robinson

#10The Hermit Hacker
scrappy@hub.org
In reply to: Michael Robinson (#9)
Re: Re: It happened again: Server hung up solid

*sigh*

gcore 87721

gcore: /proc/87721/file: No such file or directory

Marc G. Fournier ICQ#7615664 IRC Nick: Scrappy
Systems Administrator @ hub.org
primary: scrappy@hub.org secondary: scrappy@{freebsd|postgresql}.org

#11Bruce Momjian
bruce@momjian.us
In reply to: The Hermit Hacker (#10)
Re: Re: It happened again: Server hung up solid

Are we still releasing 7.0 tomorrow?

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#12The Hermit Hacker
scrappy@hub.org
In reply to: Bruce Momjian (#11)
Re: Re: It happened again: Server hung up solid

On Sun, 7 May 2000, Bruce Momjian wrote:

Are we still releasing 7.0 tomorrow?

I don't know ... this problem has me nervous, but I can't seem to
re-create it on the fly :( It happened twice so far today, and I'm
working on improving logging to see if I can narrow it down ...

I would like to *at least* postpone until Wednesday to see if I can
recreate this between now and then ... will spend a good part of tomorrow
seeing if I can get a more decent amount of data logged, to narrow her
down ...

We still have to write up a release announcement (can someone summarize
the key features of v7.0?), so that gives us a little bit of time ...

Marc G. Fournier ICQ#7615664 IRC Nick: Scrappy
Systems Administrator @ hub.org
primary: scrappy@hub.org secondary: scrappy@{freebsd|postgresql}.org

#13Bruce Momjian
bruce@momjian.us
In reply to: The Hermit Hacker (#12)
Re: Re: It happened again: Server hung up solid

On Sun, 7 May 2000, Bruce Momjian wrote:

Are we still releasing 7.0 tomorrow?

I don't know ... this problem has me nervous, but I can't seem to
re-create it on the fly :( It happened twice so far today, and I'm
working on improving logging to see if I can narrow it down ...

I would like to *at least* postpone until Wednesday to see if I can
recreate this between now and then ... will spend a good part of tomorrow
seeing if I can get a more decent amount of data logged, to narrow her
down ...

Isn't it something we can fix with a 7.0.1? Seems many people are
already using 7.0 in production systems. I just hate to see the date
slip again.

We still have to write up a release announcement (can someone summarize
the key features of v7.0?), so that gives us a little bit of time ...

Well, you can take it off the top of the HISTORY file.

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#14The Hermit Hacker
scrappy@hub.org
In reply to: Bruce Momjian (#13)
Re: Re: It happened again: Server hung up solid

On Mon, 8 May 2000, Bruce Momjian wrote:

On Sun, 7 May 2000, Bruce Momjian wrote:

Are we still releasing 7.0 tomorrow?

I don't know ... this problem has me nervous, but I can't seem to
re-create it on the fly :( It happened twice so far today, and I'm
working on improving logging to see if I can narrow it down ...

I would like to *at least* postpone until Wednesday to see if I can
recreate this between now and then ... will spend a good part of tomorrow
seeing if I can get a more decent amount of data logged, to narrow her
down ...

Isn't it something we can fix with a 7.0.1? Seems many people are
already using 7.0 in production systems. I just hate to see the date
slip again.

As I said, if we feel comfortable with this, no probs ... it's not an issue
I'm going to push, since it is something that I'm finding relatively
difficult to recreate "at will" :(

We still have to write up a release announcement (can someone summarize
the key features of v7.0?), so that gives us a little bit of time ...

Well, you can take it off the top of the HISTORY file.

Great, will work this up tomorrow during the day :) Thanks ...

Marc G. Fournier ICQ#7615664 IRC Nick: Scrappy
Systems Administrator @ hub.org
primary: scrappy@hub.org secondary: scrappy@{freebsd|postgresql}.org

#15Bruce Momjian
bruce@momjian.us
In reply to: The Hermit Hacker (#14)
Re: Re: It happened again: Server hung up solid

Isn't it something we can fix with a 7.0.1? Seems many people are
already using 7.0 in production systems. I just hate to see the date
slip again.

As I said, if we feel comfortable with this, no probs ... it's not an issue
I'm going to push, since it is something that I'm finding relatively
difficult to recreate "at will" :(

We still have to write up a release announcement (can someone summarize
the key features of v7.0?), so that gives us a little bit of time ...

Well, you can take it off the top of the HISTORY file.

Great, will work this up tomorrow during the day :) Thanks ...

My feeling is that we can address this in 7.0.1, though our recent
pg_group fix could not be done in 7.0.1, but this doesn't seem like that
kind of problem. Such problems are usually easily reproducible because
they represent problems with the system catalogs.

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#16Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#13)
Re: Re: It happened again: Server hung up solid

Bruce Momjian <pgman@candle.pha.pa.us> writes:

Are we still releasing 7.0 tomorrow?

I don't know ... this problem has me nervous, but I can't seem to
re-create it on the fly :( It happened twice so far today, and I'm
working on improving logging to see if I can narrow it down ...

I would like to *at least* postpone until Wednesday to see if I can
recreate this between now and then ... will spend a good part of tomorrow
seeing if I can get a more decent amount of data logged, to narrow her
down ...

Isn't it something we can fix with a 7.0.1? Seems many people are
already using 7.0 in production systems. I just hate to see the date
slip again.

That's my feeling too. Whatever this is, it seems to be in the
postmaster not the backend. We've hardly changed the postmaster since
6.5.3, so I suspect the problem has existed for a good while and is of
low probability. (I have no explanation why Marc's suddenly getting
bit, but if it weren't low-probability we'd surely have more reports
than just his, no?)

Almost certainly, we will need a 7.0.1 in a few weeks, once 7.0 gets out
there and starts getting pounded on by people outside the circle of
usual suspects (sorry, been watching _Casablanca_ again). If we delay
7.0 until we can figure out what this bug is all about, we might be
sitting on it for days or weeks. Let's push 7.0 out the door and let
some other work go on in parallel while we try to figure out this one.

Marc, if you see it happen again could you give me a call before you
restart? I'd like to telnet in and poke at it a little myself.
(Wait a sec, is this happening on hub, or somewhere else?)

regards, tom lane

#17Vince Vielhaber
vev@michvhf.com
In reply to: The Hermit Hacker (#10)
Re: Re: It happened again: Server hung up solid

On Mon, 8 May 2000, The Hermit Hacker wrote:

*sigh*

gcore 87721

gcore: /proc/87721/file: No such file or directory

According to TFM:

The process identifier, pid, must be given on the command line. If no
executable image is specified, gcore will use ``/proc/<pid>/file''.

So you might try:

gcore /path_to_postmaster/postmaster 87721

or something close to that.

Vince.

--
==========================================================================
Vince Vielhaber -- KA8CSH email: vev@michvhf.com http://www.pop4.net
128K ISDN from $22.00/mo - 56K Dialup from $16.00/mo at Pop4 Networking
Online Campground Directory http://www.camping-usa.com
Online Giftshop Superstore http://www.cloudninegifts.com
==========================================================================

#18The Hermit Hacker
scrappy@hub.org
In reply to: Tom Lane (#16)
Re: Re: It happened again: Server hung up solid

On Mon, 8 May 2000, Tom Lane wrote:

Marc, if you see it happen again could you give me a call before you
restart? I'd like to telnet in and poke at it a little myself.
(Wait a sec, is this happening on hub, or somewhere else?)

We built a Dual-PIII server to handle just database server, so I can give
you access to it ...

#19D'Arcy J.M. Cain
darcy@druid.net
In reply to: The Hermit Hacker (#18)
Re: Re: It happened again: Server hung up solid

Thus spake The Hermit Hacker

Marc, if you see it happen again could you give me a call before you
restart? I'd like to telnet in and poke at it a little myself.
(Wait a sec, is this happening on hub, or somewhere else?)

We built a Dual-PIII server to handle just database server, so I can give
you access to it ...

Are you talking about the new database server for Trends? If so I should
mention that I had to restart it this morning. Sorry, I didn't poke
around in it before doing so. Clients couldn't log in and I couldn't wait.

I should mention that I did have to kill -9 it. A simple kill didn't work.
I then cleared out the lock file and restarted it and connections seem to
be working again.

-- 
D'Arcy J.M. Cain <darcy@{druid|vex}.net>   |  Democracy is three wolves
http://www.druid.net/darcy/                |  and a sheep voting on
+1 416 425 1212     (DoD#0082)    (eNTP)   |  what's for dinner.
#20The Hermit Hacker
scrappy@hub.org
In reply to: D'Arcy J.M. Cain (#19)
Re: Re: It happened again: Server hung up solid

On Mon, 8 May 2000, D'Arcy J.M. Cain wrote:

Thus spake The Hermit Hacker

Marc, if you see it happen again could you give me a call before you
restart? I'd like to telnet in and poke at it a little myself.
(Wait a sec, is this happening on hub, or somewhere else?)

We built a Dual-PIII server to handle just database server, so I can give
you access to it ...

Are you talking about the new database server for Trends? If so I
should mention that I had to restart it this morning. Sorry, I didn't
poke around in it before doing so. Clients couldn't log in and I
couldn't wait.

I should mention that I did have to kill -9 it. A simple kill didn't
work. I then cleared out the lock file and restarted it and
connections seem to be working again.

That's the server ... and that's the key problem ... there are apps
running on here that are such that delaying the restart, when it requires
it, is very difficult :(

D'Arcy, when it happens again, and if you catch it before me, can you run:

gcore -s bin/postmaster <pid>

on it as the pgsql user before restarting it? I just tested it here and
it dump'd core nicely ... I'm hoping it does the same if/when the
postmaster itself hangs *cross fingers*

Marc G. Fournier ICQ#7615664 IRC Nick: Scrappy
Systems Administrator @ hub.org
primary: scrappy@hub.org secondary: scrappy@{freebsd|postgresql}.org

#21D'Arcy J.M. Cain
darcy@druid.net
In reply to: The Hermit Hacker (#20)
#22Tom Lane
tgl@sss.pgh.pa.us
In reply to: The Hermit Hacker (#12)
#23Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#22)
#24Mitch Vincent
mitch@huntsvilleal.com
In reply to: The Hermit Hacker (#12)
#25Tom Lane
tgl@sss.pgh.pa.us
In reply to: Mitch Vincent (#24)
#26The Hermit Hacker
scrappy@hub.org
In reply to: Mitch Vincent (#24)
#27Hannu Krosing
hannu@tm.ee
In reply to: The Hermit Hacker (#12)
#28Hannu Krosing
hannu@tm.ee
In reply to: Hannu Krosing (#27)
#29Vince Vielhaber
vev@michvhf.com
In reply to: Tom Lane (#25)
#30D'Arcy J.M. Cain
darcy@druid.net
In reply to: Vince Vielhaber (#29)
#31Vince Vielhaber
vev@michvhf.com
In reply to: D'Arcy J.M. Cain (#30)
#32Mitch Vincent
mitch@huntsvilleal.com
In reply to: The Hermit Hacker (#12)
#33Hannu Krosing
hannu@tm.ee
In reply to: The Hermit Hacker (#12)
#34D'Arcy J.M. Cain
darcy@druid.net
In reply to: Vince Vielhaber (#31)
#35Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#23)
#36Thomas Lockhart
lockhart@alumni.caltech.edu
In reply to: The Hermit Hacker (#12)
#37Bruce Momjian
bruce@momjian.us
In reply to: Thomas Lockhart (#36)
#38Thomas Lockhart
lockhart@alumni.caltech.edu
In reply to: Bruce Momjian (#37)
#39Vince Vielhaber
vev@michvhf.com
In reply to: Thomas Lockhart (#38)