PG service restart failure (start getting ahead of stop?)

Started by George Pavlovalmost 19 years ago6 messagesgeneral

gpavlov@mynewplace.com

almost 19 years ago

We have a nightly restart of one PG database. Today it failed and I
can't seem to understand why and how to prevent this in the future (nor
can I reproduce the problem).

We have a line in a shell script that calls "/etc/init.d/postgresql
restart". In the shell script's log from this invocation I have:

Stopping postgresql service: [FAILED]
Starting postgresql service: [ OK ]

The outcome of this was that the service was not running (despite the "[
OK ]" on the second line). The query log from that run ends with:

2007-04-23 03:03:59 PDT [23265] LOG: received fast shutdown request
2007-04-23 03:03:59 PDT [23265] LOG: aborting any active transactions
2007-04-23 03:03:59 PDT [26749] FATAL: terminating connection due to
administrator command
<... snipped more lines like the one above ...>
2007-04-23 03:03:59 PDT [24090] LOG: could not send data to client:
Broken pipe
2007-04-23 03:03:59 PDT [26700] FATAL: terminating connection due to
administrator command
<... snipped more lines like the one above ...>
2007-04-23 03:04:13 PDT [26820] FATAL: the database system is
shutting down
<... snipped more lines like the one above ...>
2007-04-23 03:06:10 PDT [23269] LOG: shutting down
2007-04-23 03:06:10 PDT [23269] LOG: database system is shut down
2007-04-23 03:06:13 PDT [23267] LOG: logger shutting down
<end of log>

So it looks like the STOPPING of the service actually succeeded, albeit
it took a while (more than the usual sessions open?). The STARTING is
the one that actually failed (is that because the STOP was still in
process?). The question is why -- in a RESTART situation
wouldn't/shouldn't the START part wait for the STOP part to complete
(regardless of how long it takes)?

Also what can we do to avoid this in the future? We can issue a separate
STOP and then a START X minutes later, but how long an X? It would seem
that a RESTART is really what I want...

TIA,

George

Brent Wood

b.wood@niwa.co.nz

almost 19 years ago

In reply to: George Pavlov (#1)

Re: PG service restart failure (start getting ahead of stop?)

George Pavlov wrote:

We have a nightly restart of one PG database. Today it failed and I
can't seem to understand why and how to prevent this in the future (nor
can I reproduce the problem).

If you get the PID of the psql server process then invoke a STOP, then
have a loop waiting for that PID to disappear, then run your START, it
may do what you want. Or someone may have a more elegant solution :-)

Brent Wood

Show quoted text

We have a line in a shell script that calls "/etc/init.d/postgresql
restart". In the shell script's log from this invocation I have:

Stopping postgresql service: [FAILED]
Starting postgresql service: [ OK ]

The outcome of this was that the service was not running (despite the "[
OK ]" on the second line). The query log from that run ends with:

2007-04-23 03:03:59 PDT [23265] LOG: received fast shutdown request
2007-04-23 03:03:59 PDT [23265] LOG: aborting any active transactions
2007-04-23 03:03:59 PDT [26749] FATAL: terminating connection due to
administrator command
<... snipped more lines like the one above ...>
2007-04-23 03:03:59 PDT [24090] LOG: could not send data to client:
Broken pipe
2007-04-23 03:03:59 PDT [26700] FATAL: terminating connection due to
administrator command
<... snipped more lines like the one above ...>
2007-04-23 03:04:13 PDT [26820] FATAL: the database system is
shutting down
<... snipped more lines like the one above ...>
2007-04-23 03:06:10 PDT [23269] LOG: shutting down
2007-04-23 03:06:10 PDT [23269] LOG: database system is shut down
2007-04-23 03:06:13 PDT [23267] LOG: logger shutting down
<end of log>

So it looks like the STOPPING of the service actually succeeded, albeit
it took a while (more than the usual sessions open?). The STARTING is
the one that actually failed (is that because the STOP was still in
process?). The question is why -- in a RESTART situation
wouldn't/shouldn't the START part wait for the STOP part to complete
(regardless of how long it takes)?

Also what can we do to avoid this in the future? We can issue a separate
STOP and then a START X minutes later, but how long an X? It would seem
that a RESTART is really what I want...

TIA,

George

---------------------------(end of broadcast)---------------------------
TIP 3: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faq

Tom Lane

tgl@sss.pgh.pa.us

almost 19 years ago

In reply to: George Pavlov (#1)

Re: PG service restart failure (start getting ahead of stop?)

"George Pavlov" <gpavlov@mynewplace.com> writes:

We have a nightly restart of one PG database.

Just out of curiosity, what for? I can't imagine any really good reason
for just shutting down the postmaster and immediately restarting it.

So it looks like the STOPPING of the service actually succeeded, albeit
it took a while (more than the usual sessions open?). The STARTING is
the one that actually failed (is that because the STOP was still in
process?). The question is why -- in a RESTART situation
wouldn't/shouldn't the START part wait for the STOP part to complete
(regardless of how long it takes)?

Well, this'd depend on the details of the postgres init script you're
using, which you gave no hint about (and yes, there are a *ton* of
different versions out there). The one I'm currently shipping for Red
Hat would give up waiting after a minute, but it should report failure
not success in that case.

regards, tom lane

George Pavlov

gpavlov@mynewplace.com

almost 19 years ago

In reply to: Tom Lane (#3)

Re: PG service restart failure (start getting ahead of stop?)

So it looks like the STOPPING of the service actually
succeeded, albeit
it took a while (more than the usual sessions open?). The
STARTING is
the one that actually failed (is that because the STOP was still in
process?). The question is why -- in a RESTART situation
wouldn't/shouldn't the START part wait for the STOP part to complete
(regardless of how long it takes)?

Well, this'd depend on the details of the postgres init script you're
using, which you gave no hint about (and yes, there are a *ton* of
different versions out there). The one I'm currently shipping for Red
Hat would give up waiting after a minute, but it should report failure
not success in that case.

oh, sorry, i did not realize there were many of them. this is the 8.1.8
redhat one; the (hopefully) identifying lines are:

64 # Version 8.1 Devrim Gunduz <devrim@PostgreSQL.org>
65 # Increased sleep time from 1 sec to 2 sec.
66
67 # PGVERSION is the full package version, e.g., 8.1.1
68 # Note: the specfile ordinarily updates this during install
69 PGVERSION=8.1.8

i have examined the stop() and start() and i think i understand why the
stop() reported a failure (it took to long), but i don't understand how
the start() could have reported success:

Stopping postgresql service: [FAILED]
Starting postgresql service: [ OK ]

there was definitely no running DB after that (until someone manually
started it hours later).

We have a nightly restart of one PG database.

Just out of curiosity, what for? I can't imagine any really
good reason
for just shutting down the postmaster and immediately restarting it.

why have a "restart" option if there is never a reason for it? :-)
seriously, this is a good question, i think this was someone's idea of a
quick way to clear any remaining DB sessions in order to be able to drop
a database and restore a newer version (this is a reporting DB that gets
refreshed nightly with a dump from another DB). this seems a bit
misguided in that if you want to kill sessions you should just kill
sessions, (a la http://varlena.com/GeneralBits/29.html), right? what's
your opinion on the best way to do this?

thanks.

george

Tom Lane

tgl@sss.pgh.pa.us

almost 19 years ago

In reply to: George Pavlov (#4)

Re: PG service restart failure (start getting ahead of stop?)

"George Pavlov" <gpavlov@mynewplace.com> writes:

i have examined the stop() and start() and i think i understand why the
stop() reported a failure (it took to long), but i don't understand how
the start() could have reported success:

Stopping postgresql service: [FAILED]
Starting postgresql service: [ OK ]

Well, that makes sense: if the shutdown took more than a minute then the
"stop" script action would give up waiting, and then the "start" action
would see the postmaster running and go away happy. (It's a bit odd
that "service start" actions are supposed to treat "already running"
as OK, but I've been told that that's required by the Linux Standards
Base and I can't change it.)

The real question here is why'd it take so long to stop? It should be
using "mode fast" which'd kick out active queries.

Just out of curiosity, what for? I can't imagine any really
good reason
for just shutting down the postmaster and immediately restarting it.

why have a "restart" option if there is never a reason for it? :-)

Well, there are times when you need it, like changing shared_buffers
or one of the other postmaster-start-time-only parameters. But doing
it just as a routine action smacks of Microsoftish "you need to reboot
regularly" thinking...

seriously, this is a good question, i think this was someone's idea of a
quick way to clear any remaining DB sessions in order to be able to drop
a database and restore a newer version (this is a reporting DB that gets
refreshed nightly with a dump from another DB).

Hmm. It's not exactly bulletproof --- what if someone reconnects to the
DB as soon as you restart? But I guess it's a simple answer that might
not be worth improving on.

regards, tom lane

George Pavlov

gpavlov@mynewplace.com

almost 19 years ago

In reply to: Tom Lane (#5)

Re: [QUARANTINE] Re: PG service restart failure (start getting ahead of stop?)

Well, that makes sense: if the shutdown took more than a
minute then the
"stop" script action would give up waiting, and then the
"start" action
would see the postmaster running and go away happy. (It's a bit odd
that "service start" actions are supposed to treat "already running"
as OK, but I've been told that that's required by the Linux Standards
Base and I can't change it.)

thanks, that's good to know. i never realized that's how it behaves.

The real question here is why'd it take so long to stop? It should be
using "mode fast" which'd kick out active queries.

indeed a mystery -- from looking at the query log there didn't seem to
be ANY active queries at the time. seems that > 85% of the session IDs
(84 total) for which i got lines like "2007-04-23 03:05:48 PDT [26987]
FATAL: the database system is shutting down" did not even have any
preceding query activity. hard to debug retroactively -- it's just
annoying that i don't understand what was different about this restart.

and, yes, i confirmed that it is using fast mode:

212 $SU -l postgres -c "$PGENGINE/pg_ctl stop -D '$PGDATA' -s -m
fast" > /dev/null 2>&1 < /dev/null

george