Why has postmaster shutdown gotten so slow?

Started by Tom Lane almost 22 years ago, 7 messages
#1 Tom Lane
tgl@sss.pgh.pa.us

Shutdown of an idle postmaster used to take about two or three seconds
(mostly due to the sync/sleep(2)/sync in md_sync). For the last couple
of days it's taking more like a dozen seconds. I presume somebody broke
something, but I'm unsure whether to pin the blame on bgwriter or
Windows changes. Anyone care to fess up?

regards, tom lane

#2 Claudio Natoli
claudio.natoli@memetrics.com
In reply to: Tom Lane (#1)
Re: Why has postmaster shutdown gotten so slow?

> Shutdown of an idle postmaster used to take about two or three seconds
> (mostly due to the sync/sleep(2)/sync in md_sync). For the last couple
> of days it's taking more like a dozen seconds. I presume somebody broke
> something, but I'm unsure whether to pin the blame on bgwriter or
> Windows changes. Anyone care to fess up?

AFAICS, Win32 changes for the past few days have been minimal, and pretty
much isolated to Win32. Happy to stand corrected, but I'd start by looking
elsewhere...

Cheers,
Claudio

#3 Jan Wieck
JanWieck@Yahoo.com
In reply to: Tom Lane (#1)
Re: Why has postmaster shutdown gotten so slow?

Tom Lane wrote:

> Shutdown of an idle postmaster used to take about two or three seconds
> (mostly due to the sync/sleep(2)/sync in md_sync). For the last couple
> of days it's taking more like a dozen seconds. I presume somebody broke
> something, but I'm unsure whether to pin the blame on bgwriter or
> Windows changes. Anyone care to fess up?

I guess it could well be the bgwriter, which when having nothing to do
at all is sleeping for 10 seconds. Not sure, will check.

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck@Yahoo.com #

#4 Jan Wieck
JanWieck@Yahoo.com
In reply to: Jan Wieck (#3)
Re: Why has postmaster shutdown gotten so slow?

Jan Wieck wrote:

> Tom Lane wrote:
>> Shutdown of an idle postmaster used to take about two or three seconds
>> (mostly due to the sync/sleep(2)/sync in md_sync). For the last couple
>> of days it's taking more like a dozen seconds. I presume somebody broke
>> something, but I'm unsure whether to pin the blame on bgwriter or
>> Windows changes. Anyone care to fess up?
>
> I guess it could well be the bgwriter, which when having nothing to do
> at all is sleeping for 10 seconds. Not sure, will check.

I checked the background writer for this and I can not reproduce the
behaviour. If the bgwriter had zero blocks to write it does PG_USLEEP
for 10 seconds, which on Unix is done by select() and that is correctly
interrupted when the postmaster sends it the term signal on shutdown.

Jan


#5 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jan Wieck (#4)
Re: Why has postmaster shutdown gotten so slow?

Jan Wieck <JanWieck@Yahoo.com> writes:

> I checked the background writer for this and I can not reproduce the
> behaviour. If the bgwriter had zero blocks to write it does PG_USLEEP
> for 10 seconds, which on Unix is done by select() and that is correctly
> interrupted when the postmaster sends it the term signal on shutdown.

This appears to be a platform-dependent behavior. The HPUX select(2) man
page says

    [EINTR]    The select() function was interrupted before any
               of the selected events occurred and before the
               timeout interval expired. If SA_RESTART has been
               set for the interrupting signal, it is
               implementation-dependent whether select() restarts
               or returns with EINTR.

which text also appears verbatim in the Single Unix Spec. Since we set
SA_RESTART for every signal except SIGALRM (see pqsignal.c), we are
subject to the implementation dependency for SIGTERM.
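
A minimal sketch of the signal-installation policy described above, for illustration only. The helper names install_handler(), has_sa_restart() and noop_handler() are hypothetical and this is not the actual pqsignal.c code; it only shows SA_RESTART being requested for every signal except SIGALRM.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

typedef void (*sig_handler)(int);

/*
 * Hypothetical helper: install `handler` for `signo`, requesting
 * SA_RESTART for every signal except SIGALRM.
 * Returns 0 on success, -1 on failure (the sigaction() result).
 */
int
install_handler(int signo, sig_handler handler)
{
    struct sigaction act;

    memset(&act, 0, sizeof(act));
    act.sa_handler = handler;
    sigemptyset(&act.sa_mask);
    act.sa_flags = (signo == SIGALRM) ? 0 : SA_RESTART;
    return sigaction(signo, &act, NULL);
}

/*
 * Report whether the action currently installed for `signo` carries
 * SA_RESTART: 1 if yes, 0 if no, -1 if the query failed.
 */
int
has_sa_restart(int signo)
{
    struct sigaction cur;

    if (sigaction(signo, NULL, &cur) != 0)
        return -1;
    return (cur.sa_flags & SA_RESTART) ? 1 : 0;
}

/* Handler that does nothing; enough to make the flags observable. */
void
noop_handler(int signo)
{
    (void) signo;
}
```

With handlers installed this way, SIGTERM falls on the SA_RESTART side, so whether a blocked select() returns EINTR or restarts is left to the platform.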

Tracing the bgwriter process on my machine makes it real obvious that in
fact the select delay is allowed to finish out when SIGTERM is received.
In fact worse than that: it's restarted from the beginning. If 5
seconds have already elapsed, another 10 still elapse before the select
exits.

This won't do :-(. We cannot afford to fritter away 10 seconds in the
SIGTERM shutdown cycle --- on typical systems init isn't going to give
us more than 20 seconds before a hard kill.

I'd suggest reducing the delay to a second or two, or perhaps breaking
it into several 1-second waits with interrupt flag checks between.

In the longer run we might want to rethink what we are doing with
SA_RESTART, but I am not sure about the implications of fooling with
that.

regards, tom lane

#6 Jan Wieck
JanWieck@Yahoo.com
In reply to: Tom Lane (#5)
Re: Why has postmaster shutdown gotten so slow?

Tom Lane wrote:

> Jan Wieck <JanWieck@Yahoo.com> writes:
>> I checked the background writer for this and I can not reproduce the
>> behaviour. If the bgwriter had zero blocks to write it does PG_USLEEP
>> for 10 seconds, which on Unix is done by select() and that is correctly
>> interrupted when the postmaster sends it the term signal on shutdown.
>
> This appears to be a platform-dependent behavior. The HPUX select(2) man
> page says
>
>     [EINTR]    The select() function was interrupted before any
>                of the selected events occurred and before the
>                timeout interval expired. If SA_RESTART has been
>                set for the interrupting signal, it is
>                implementation-dependent whether select() restarts
>                or returns with EINTR.
>
> which text also appears verbatim in the Single Unix Spec. Since we set
> SA_RESTART for every signal except SIGALRM (see pqsignal.c), we are
> subject to the implementation dependency for SIGTERM.

That explains it.

> Tracing the bgwriter process on my machine makes it real obvious that in
> fact the select delay is allowed to finish out when SIGTERM is received.
> In fact worse than that: it's restarted from the beginning. If 5
> seconds have already elapsed, another 10 still elapse before the select
> exits.
>
> This won't do :-(. We cannot afford to fritter away 10 seconds in the
> SIGTERM shutdown cycle --- on typical systems init isn't going to give
> us more than 20 seconds before a hard kill.
>
> I'd suggest reducing the delay to a second or two, or perhaps breaking
> it into several 1-second waits with interrupt flag checks between.
>
> In the longer run we might want to rethink what we are doing with
> SA_RESTART, but I am not sure about the implications of fooling with
> that.

I think at this point we should have some maximum value for PG_xSLEEP,
above which it falls back to a function call that either breaks the wait
up into a loop that checks InterruptPending, or removes the SA_RESTART
flag while waiting for the timeout.

Jan


#7 Magnus Hagander
mha@sollentuna.net
In reply to: Jan Wieck (#6)
Re: Why has postmaster shutdown gotten so slow?

>> Tracing the bgwriter process on my machine makes it real obvious that in
>> fact the select delay is allowed to finish out when SIGTERM is received.
>> In fact worse than that: it's restarted from the beginning. If 5
>> seconds have already elapsed, another 10 still elapse before the select
>> exits.
>>
>> This won't do :-(. We cannot afford to fritter away 10 seconds in the
>> SIGTERM shutdown cycle --- on typical systems init isn't going to give
>> us more than 20 seconds before a hard kill.
>>
>> I'd suggest reducing the delay to a second or two, or perhaps breaking
>> it into several 1-second waits with interrupt flag checks between.
>>
>> In the longer run we might want to rethink what we are doing with
>> SA_RESTART, but I am not sure about the implications of fooling with
>> that.
>
> I think we should at this point have some maximum value for PG_xSLEEP
> over which it falls back to a function call that does either this
> breaking up into a loop with checking InterruptPending or removes the
> SA_RESTART flag while waiting for the timeout.

If you look at my win32 signals patch nr 3 (posted Feb 4th), I have code
to do this for win32 in it. It breaks up select() timeouts into pieces
of 1 second and polls for win32 signals in between.

Turns out it wasn't necessary, since win32 *does* deliver our signals
while in select. So for once it's win32 that does what we want - I think
that's a first. But it might help on another platform.

//Magnus