postmaster recovery and automatic restart suppression

Started by Kolb, Harald (NSN - DE/Munich), almost 17 years ago, 27 messages, hackers

Hi,

In case of a serious failure of a backend or an auxiliary process, the postmaster performs crash recovery and restarts the db automatically.
Is there a way to deactivate the restart and to force the postmaster to simply exit at the end?
The background is that we will have a watchdog process which, in this case, will either perform a fast switchover to the standby side (with synchronous replication) or restart the db on its own, and will in addition perform some specific actions.
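
A minimal sketch of the kind of watchdog described above, assuming the postmaster could be made to exit rather than restart on its own. The names `supervise` and `choose_action`, and the idea of passing the standby's state in as a flag, are illustrative assumptions and not anything PostgreSQL provides.

```python
import subprocess
import sys
from typing import Sequence


def choose_action(exit_status: int, standby_in_sync: bool) -> str:
    """Decide what the watchdog does once the supervised process has exited."""
    if exit_status == 0:
        return "none"          # clean shutdown, nothing to recover
    if standby_in_sync:
        return "switchover"    # synchronous standby is current: switch to it
    return "restart"           # otherwise restart the local database


def supervise(cmd: Sequence[str], standby_in_sync: bool) -> str:
    """Run cmd as a child, block until it exits, and return the chosen action."""
    # subprocess.run waits for the child; on POSIX the return code is
    # negative if the child was killed by a signal.
    result = subprocess.run(cmd)
    return choose_action(result.returncode, standby_in_sync)


if __name__ == "__main__":
    # Stand-in for the postmaster: a child that exits with a non-zero status.
    fake_postmaster = [sys.executable, "-c", "import sys; sys.exit(1)"]
    print(supervise(fake_postmaster, standby_in_sync=True))
```

The decision is kept separate from process handling so that the policy (switchover, restart, or something site-specific) can be changed without touching the supervision loop.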

Regards,

Harald Kolb.

Best regards / freundliche Grüße
-----------------------------------------
Harald Kolb
COO RTP PD SW RD Area B 1 DE
Mch-M Building 5532 / Room 3045
Tel: +49 89 636 47606

mailto:Harald.Kolb@nsn.com
http://www.nokiasiemensnetworks.com/global/

#2Fujii Masao
masao.fujii@gmail.com
In reply to: Kolb, Harald (NSN - DE/Munich) (#1)
Re: postmaster recovery and automatic restart suppression

Hi,

On Fri, Jun 5, 2009 at 1:02 AM, Kolb, Harald (NSN - DE/Munich)
<harald.kolb@nsn.com> wrote:

Hi,

in case of a serious failure of a backend or an auxiliary process the
postmaster performs a crash recovery and restarts the db automatically.

Is there a possibility to deactivate the restart and to force the postmaster
to simply exit at the end ?

Good point. I also think that this makes failover handling
more complicated. In other words, clusterware cannot determine
whether to do failover when it detects the death of the primary
postgres. A wrong decision might cause split-brain syndrome.

How about a new GUC parameter to determine whether to restart
postmaster automatically when it fails abnormally? This would
be useful for various failover systems.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#3Kolb, Harald (NSN - DE/Munich)
harald.kolb@nsn.com
In reply to: Fujii Masao (#2)
Re: postmaster recovery and automatic restart suppression

Hi,

-----Original Message-----
From: ext Fujii Masao [mailto:masao.fujii@gmail.com]
Sent: Friday, June 05, 2009 8:14 AM
To: Kolb, Harald (NSN - DE/Munich)
Cc: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] postmaster recovery and automatic
restart suppression

Hi,

On Fri, Jun 5, 2009 at 1:02 AM, Kolb, Harald (NSN - DE/Munich)
<harald.kolb@nsn.com> wrote:

Hi,

in case of a serious failure of a backend or an auxiliary process the
postmaster performs a crash recovery and restarts the db automatically.

Is there a possibility to deactivate the restart and to force the postmaster
to simply exit at the end ?

Good point. I also think that this makes a handling of failover
more complicated. In other words, clusterware cannot determine
whether to do failover when it detects the death of the primary
postgres. A wrong decision might cause split brain syndrome.

Hm, I cannot follow your reasoning. Could you explain a little bit
more?

How about new GUC parameter to determine whether to restart
postmaster automatically when it fails abnormally? This would
be useful for various failover system.

A new GUC parameter would be the optimal solution.
Since I'm new to the community, what's the "usual" way to make this
happen ?

Regards, Harald.

#4Fujii Masao
masao.fujii@gmail.com
In reply to: Kolb, Harald (NSN - DE/Munich) (#3)
Re: postmaster recovery and automatic restart suppression

Hi,

On Fri, Jun 5, 2009 at 9:24 PM, Kolb, Harald (NSN -
DE/Munich)<harald.kolb@nsn.com> wrote:

Good point. I also think that this makes a handling of failover
more complicated. In other words, clusterware cannot determine
whether to do failover when it detects the death of the primary
postgres. A wrong decision might cause split brain syndrome.

Mh, I cannot follow your reflections. Could you explain a little bit
more ?

How about new GUC parameter to determine whether to restart
postmaster automatically when it fails abnormally? This would
be useful for various failover system.

The primary postgres might restart automatically after the clusterware has finished
failover (i.e. the standby postgres has come up live). In this case, postgres
would be running on each server, and the two would be independent of each other. This
is one form of split-brain syndrome. The problem is that, for example,
if they share the archival storage, some archived files might get lost; the
original primary postgres might overwrite an archived file that was written
by the new primary.
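
One partial guard against the overwrite described here is an archive step that refuses to replace a file that already exists, similar in spirit to the `test ! -f ... && cp ...` pattern that the PostgreSQL documentation shows for `archive_command`. The sketch below is only an illustration: the function name and the paths are made up, and it turns a silent overwrite into a visible failure without preventing the split brain itself.

```python
import shutil
import tempfile
from pathlib import Path


def archive_segment(src: Path, archive_dir: Path) -> bool:
    """Copy src into archive_dir unless a file of that name is already there.

    Returns True if the file was archived, False if it was refused.
    """
    dest = archive_dir / src.name
    try:
        # Mode "xb" creates the file exclusively and raises FileExistsError
        # if it already exists, so the check and the create are one step.
        with open(src, "rb") as inp, open(dest, "xb") as out:
            shutil.copyfileobj(inp, out)
    except FileExistsError:
        return False
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        archive = root / "archive"
        archive.mkdir()
        # Two servers that both believe they are the primary produce a
        # segment with the same name but different contents.
        for server, payload in (("new_primary", b"new"), ("old_primary", b"old")):
            seg = root / server / "000000010000000000000001"
            seg.parent.mkdir()
            seg.write_bytes(payload)
            print(server, archive_segment(seg, archive))
```

With a guard like this, the second writer's archive step fails and the file written first is kept, which at least makes the conflict visible to whoever monitors archiving.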

On the other hand, the primary postgres might *not* restart automatically.
So, it's difficult for clusterware to choose whether to do failover when it
detects the death of the primary postgres, I think.

A new GUC parameter would be the optimal solution.
Since I'm new to the community, what's the "usual" way to make this
happen ?

The following might be good references for you.

http://www.pgcon.org/2009/schedule/events/178.en.html
http://wiki.postgresql.org/wiki/Submitting_a_Patch

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#5Bruce Momjian
bruce@momjian.us
In reply to: Fujii Masao (#4)
Re: postmaster recovery and automatic restart suppression

Fujii Masao <masao.fujii@gmail.com> writes:

On the other hand, the primary postgres might *not* restart automatically.
So, it's difficult for clusterware to choose whether to do failover when it
detects the death of the primary postgres, I think.

I think the accepted way to handle this kind of situation is called STONITH --
"Shoot The Other Node In The Head".

You need some way when the cluster software decides to initiate failover to
ensure that the first node *cannot* come back up. That could mean shutting the
power to it at the PDU or disabling its network connection at the switch, or
various other options.
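
The ordering constraint described here (fence first, promote only afterwards) can be sketched as follows. This is an illustration under assumptions: `fence_primary` and `promote_standby` are hypothetical callables standing in for whatever the cluster software actually does, such as cutting power at the PDU or disabling a switch port.

```python
from typing import Callable


def fail_over(fence_primary: Callable[[], bool],
              promote_standby: Callable[[], None]) -> bool:
    """Promote the standby only after the old primary is confirmed fenced."""
    try:
        fenced = fence_primary()
    except Exception:
        fenced = False   # an error while fencing counts as "not fenced"
    if not fenced:
        return False     # the old primary may still come back: do not promote
    promote_standby()
    return True


if __name__ == "__main__":
    # Stand-ins for real fencing and promotion actions.
    print(fail_over(lambda: True, lambda: print("standby promoted")))
    print(fail_over(lambda: False, lambda: print("standby promoted")))
```

Treating any fencing error as a failure is the conservative choice: the standby stays a standby unless the first node is known to be unable to come back.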

Gregory Stark
http://mit.edu/~gsstark/resume.pdf

#6Fujii Masao
masao.fujii@gmail.com
In reply to: Bruce Momjian (#5)
Re: postmaster recovery and automatic restart suppression

Hi,

On Mon, Jun 8, 2009 at 6:45 PM, Gregory Stark<stark@enterprisedb.com> wrote:

Fujii Masao <masao.fujii@gmail.com> writes:

On the other hand, the primary postgres might *not* restart automatically.
So, it's difficult for clusterware to choose whether to do failover when it
detects the death of the primary postgres, I think.

I think the accepted way to handle this kind of situation is called STONITH --
"Shoot The Other Node In The Head".

You need some way when the cluster software decides to initiate failover to
ensure that the first node *cannot* come back up. That could mean shutting the
power to it at the PDU or disabling its network connection at the switch, or
various other options.

Yes, I understand that STONITH is a safe solution for split-brain. But
since it probably requires special equipment such as a PDU,
I think that some people (including me) want another reasonable way.

The proposed feature is not a perfect solution, but it is a convenient way
to prevent one kind of split-brain situation.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#5)
Re: postmaster recovery and automatic restart suppression

Gregory Stark <stark@enterprisedb.com> writes:

I think the accepted way to handle this kind of situation is called STONITH --
"Shoot The Other Node In The Head".

Yeah, and the reason people go to the trouble of having special hardware
for that is that pure-software solutions are unreliable.

I think the proposed don't-restart flag is exceedingly ugly and will not
solve any real-world problem.

regards, tom lane

#8Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#7)
Re: postmaster recovery and automatic restart suppression

On Mon, 2009-06-08 at 09:47 -0400, Tom Lane wrote:

I think the proposed don't-restart flag is exceedingly ugly and will not
solve any real-world problem.

Agreed.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

#9Bruce Momjian
bruce@momjian.us
In reply to: Simon Riggs (#8)
Re: postmaster recovery and automatic restart suppression

On Mon, Jun 8, 2009 at 6:58 PM, Simon Riggs<simon@2ndquadrant.com> wrote:

On Mon, 2009-06-08 at 09:47 -0400, Tom Lane wrote:

I think the proposed don't-restart flag is exceedingly ugly and will not
solve any real-world problem.

Agreed.

Hm. I'm not sure I see a solid use case for it -- in my experience you
want to be pretty sure you have a persistent problem before you fail
over.

But I don't really see why it's ugly either. I mean our auto-restart
behaviour is pretty arbitrary. You could just as easily argue we
shouldn't auto-restart and rely on the user to restart the service
like he would any service which crashes.

I would file it under "mechanism not policy" and make it optional. The
user should be able to select what to do when a backend crash is
detected from amongst the various safe options, even if we think some
of the options don't have any use cases we can think of. Someone will
surely think of one at some point. (idly I wonder if cloud
environments where you can have an infinite supply of slaves are such
a use case...)

--
greg
http://mit.edu/~gsstark/resume.pdf

#10Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#9)
Re: postmaster recovery and automatic restart suppression

Greg Stark <stark@enterprisedb.com> writes:

On Mon, 2009-06-08 at 09:47 -0400, Tom Lane wrote:

I think the proposed don't-restart flag is exceedingly ugly and will not
solve any real-world problem.

Hm. I'm not sure I see a solid use case for it -- in my experience you
want to be pretty sure you have a persistent problem before you fail
over.

Yeah, and when you do fail over you want more guarantee than "none at
all" that the primary won't start back up again on its own.

But I don't really see why it's ugly either.

Because it's intentionally blowing a hole in one of the most prized
properties of the database, ie, that it doesn't go down if it can help
it. I want a *WHOLE* lot stronger rationale than "somebody might want
it someday" before providing a switch that lets somebody thoughtlessly
break a property we've sweated blood for ten years to ensure.

regards, tom lane

#11Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#10)
Re: postmaster recovery and automatic restart suppression

On Mon, Jun 8, 2009 at 4:30 PM, Tom Lane<tgl@sss.pgh.pa.us> wrote:

Greg Stark <stark@enterprisedb.com> writes:

On Mon, 2009-06-08 at 09:47 -0400, Tom Lane wrote:

I think the proposed don't-restart flag is exceedingly ugly and will not
solve any real-world problem.

Hm. I'm not sure I see a solid use case for it -- in my experience you
want to be pretty sure you have a persistent problem before you fail
over.

Yeah, and when you do fail over you want more guarantee than "none at
all" that the primary won't start back up again on its own.

But I don't really see why it's ugly either.

Because it's intentionally blowing a hole in one of the most prized
properties of the database, ie, that it doesn't go down if it can help
it.  I want a *WHOLE* lot stronger rationale than "somebody might want
it someday" before providing a switch that lets somebody thoughtlessly
break a property we've sweated blood for ten years to ensure.

I see that you've carefully not quoted Greg's remark about "mechanism
not policy" with which I completely agree. This seems like a pretty
useful switch for people who want more control over how the database
gets restarted on those rare occasions when it wipes out (and possibly
for debugging crash-type problems as well). The amount of
blood-sweating that was required to make a robust automatic restart
mechanism doesn't seem relevant to this discussion, though it is
certainly a cool feature.

I also don't see any reason to assume that users will do this
"thoughtlessly". Perhaps someone will, but if our policy is to not
add any features on the theory that someone might use in a stupid way,
we'd better get busy reverting a significant fraction of the work done
for 8.4. I'm not going to go so far as to say that we should never
reject a feature because the danger of someone shooting themselves in
the foot is too high, but this doesn't even seem like a likely
candidate. If we put an option in postgresql.conf called
"automatic_restart_after_crash = on", anyone who switches that to
"off" should have a pretty good idea what the likely consequences of
that decision will be. The people who are too stupid to figure that
one out are likely to have a whole lot of other problems too, and
they're not the people at whom we should be targeting this product.

...Robert

#12Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#11)
Re: postmaster recovery and automatic restart suppression

Robert Haas <robertmhaas@gmail.com> writes:

I see that you've carefully not quoted Greg's remark about "mechanism
not policy" with which I completely agree.

Mechanism should exist to support useful policy. I don't believe that
the proposed switch has any real-world usefulness.

regards, tom lane

#13Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#12)
Re: postmaster recovery and automatic restart suppression

On Mon, Jun 8, 2009 at 7:34 PM, Tom Lane<tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

I see that you've carefully not quoted Greg's remark about "mechanism
not policy" with which I completely agree.

Mechanism should exist to support useful policy.  I don't believe that
the proposed switch has any real-world usefulness.

I guess I agree that it doesn't seem to make much sense to trigger
failover on a DB crash, as the OP suggested. The most likely cause of
a DB crash is probably a software bug, in which case failover isn't
going to help (won't you just trigger the same bug on the standby
server?). The case where you'd probably want to do failover is when
the whole server has gone down to a hardware or power failure, in
which case your hypothetical home-grown supervisor process won't be
able to run anyway.

But I'm still not 100% convinced that the proposed mechanism is
useless. There might be other reasons to want to get control in the
event of a crash. You might want to page the system administrator, or
trigger a filesystem snapshot so you can go back and do a post-mortem.
(The former could arguably be done just as well by scanning the log
file for the relevant log messages, I suppose, but the latter
certainly couldn't be, if your goal is to get a snapshot before
recovery is done.)
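
The use case in this paragraph, getting control after a crash but before recovery runs, could look roughly like the sketch below. Everything in it is hypothetical: the step names, the `run_before_recovery` helper, and the assumption that the database stays down until the supervisor starts it again.

```python
from typing import Callable, List, Sequence, Tuple

Step = Tuple[str, Callable[[], None]]


def run_before_recovery(steps: Sequence[Step],
                        start_recovery: Callable[[], None]) -> List[str]:
    """Run each step in order, then start recovery.

    Returns the names of the steps that raised an exception.
    """
    failed: List[str] = []
    for name, action in steps:
        try:
            action()
        except Exception:
            # A failed hook is recorded but must not block recovery forever.
            failed.append(name)
    start_recovery()
    return failed


if __name__ == "__main__":
    log: List[str] = []
    steps = [
        ("page_admin", lambda: log.append("paged")),
        ("snapshot", lambda: log.append("snapshot taken")),
    ]
    failed = run_before_recovery(steps, lambda: log.append("recovery started"))
    print(log, failed)
```

The point of the ordering is that the snapshot captures the data directory as the crash left it, which is not possible once automatic recovery has already run.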

But maybe I'm all wet...

...Robert

#14Kolb, Harald (NSN - DE/Munich)
harald.kolb@nsn.com
In reply to: Tom Lane (#12)
Re: postmaster recovery and automatic restart suppression

Hi,

-----Original Message-----
From: ext Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: Tuesday, June 09, 2009 1:35 AM
To: Robert Haas
Cc: Greg Stark; Simon Riggs; Fujii Masao; Kolb, Harald (NSN -
DE/Munich); pgsql-hackers@postgresql.org; Czichy, Thoralf
(NSN - FI/Helsinki)
Subject: Re: [HACKERS] postmaster recovery and automatic
restart suppression

Robert Haas <robertmhaas@gmail.com> writes:

I see that you've carefully not quoted Greg's remark about "mechanism
not policy" with which I completely agree.

Mechanism should exist to support useful policy. I don't believe that
the proposed switch has any real-world usefulness.

regards, tom lane

There are some good reasons why a switchover could be an appropriate
measure in case the DB is facing trouble. It may be that the root cause
is not the DB itself, but the resources it uses or other things which are
going crazy and hit the DB first (we've seen a lot of these
unbelievable things, which have made us quite sensitive to robustness
aspects). Therefore we want to have control over the DB recovery.
If you don't want to see this option as a GUC parameter, would it be
acceptable to have it as a new postmaster command-line option?

Regards, Harald Kolb.

#15Tom Lane
tgl@sss.pgh.pa.us
In reply to: Kolb, Harald (NSN - DE/Munich) (#14)
Re: postmaster recovery and automatic restart suppression

"Kolb, Harald (NSN - DE/Munich)" <harald.kolb@nsn.com> writes:

If you don't want to see this option as a GUC parameter, would it be
acceptable to have it as a new postmaster cmd line option ?

That would make two kluges, not one (we don't do options that are
settable in only one way). And it does nothing whatever to address
my objection to the concept.

regards, tom lane

#16Simon Riggs
simon@2ndQuadrant.com
In reply to: Kolb, Harald (NSN - DE/Munich) (#14)
Re: postmaster recovery and automatic restart suppression

On Tue, 2009-06-09 at 20:59 +0200, Kolb, Harald (NSN - DE/Munich) wrote:

There are some good reasons why a switchover could be an appropriate
means in case the DB is facing troubles. It may be that the root cause
is not the DB itself, but used resources or other things which are
going crazy and hit the DB first ( we've seen a lot of these
unbelievable things which made us quite sensible for robustness
aspects). Therefore we want to have control on the DB recovery.
If you don't want to see this option as a GUC parameter, would it be
acceptable to have it as a new postmaster cmd line option ?

Even if you had this, you still need to STONITH just in case the
failover happens by mistake.

If you still have to take an action to be certain, what is the point of
the feature?

Most losses of availability are caused by human error and this seems
like one more way to blow your remaining toes off.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

#17Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Kolb, Harald (NSN - DE/Munich) (#14)
Re: postmaster recovery and automatic restart suppression

"Kolb, Harald (NSN - DE/Munich)" <harald.kolb@nsn.com> wrote:

From: ext Tom Lane [mailto:tgl@sss.pgh.pa.us]

Mechanism should exist to support useful policy. I don't believe
that the proposed switch has any real-world usefulness.

There are some good reasons why a switchover could be an appropriate
means in case the DB is facing troubles. It may be that the root
cause is not the DB itself, but used resources or other things
which are going crazy and hit the DB first

Would an example of this be that one drive in a RAID has gone bad and
the hot spare rebuild has been triggered, leading to poor performance
for a while? Is that the sort of issue where you see value?

-Kevin

#18Bruce Momjian
bruce@momjian.us
In reply to: Kevin Grittner (#17)
Re: postmaster recovery and automatic restart suppression

Not really since once you fail over you may as well stop the rebuild
since you'll have to restore the whole database. Moreover wouldn't
that have to be a manual decision?

The closest thing I can come to a use case would be if you run a very
large cluster with hundreds of read-only replicas. If one has problems
you would rather the load balancer notice and take it out of rotation
immediately rather than have it flap and continue to cause problems.

Even there it would be dicey since a software bug could easily cause
all your replicas to start misbehaving simultaneously. It would suck
to see them all shut down one by one...

--
Greg

On 9 Jun 2009, at 20:53, "Kevin Grittner"
<Kevin.Grittner@wicourts.gov> wrote:

"Kolb, Harald (NSN - DE/Munich)" <harald.kolb@nsn.com> wrote:

From: ext Tom Lane [mailto:tgl@sss.pgh.pa.us]

Mechanism should exist to support useful policy. I don't believe
that the proposed switch has any real-world usefulness.

There are some good reasons why a switchover could be an appropriate
means in case the DB is facing troubles. It may be that the root
cause is not the DB itself, but used resources or other things
which are going crazy and hit the DB first

Would an example of this be that one drive in a RAID has gone bad and
the hot spare rebuild has been triggered, leading to poor performance
for a while? Is that the sort of issue where you see value?

-Kevin

#19Tom Lane
tgl@sss.pgh.pa.us
In reply to: Kevin Grittner (#17)
Re: postmaster recovery and automatic restart suppression

"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:

"Kolb, Harald (NSN - DE/Munich)" <harald.kolb@nsn.com> wrote:

There are some good reasons why a switchover could be an appropriate
means in case the DB is facing troubles. It may be that the root
cause is not the DB itself, but used resources or other things
which are going crazy and hit the DB first

Would an example of this be that one drive in a RAID has gone bad and
the hot spare rebuild has been triggered, leading to poor performance
for a while? Is that the sort of issue where you see value?

How would that be connected to a "no restart on crash" setting?

regards, tom lane

#20Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Tom Lane (#19)
Re: postmaster recovery and automatic restart suppression

Tom Lane <tgl@sss.pgh.pa.us> wrote:

"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:

"Kolb, Harald (NSN - DE/Munich)" <harald.kolb@nsn.com> wrote:

There are some good reasons why a switchover could be an
appropriate means in case the DB is facing troubles. It may be
that the root cause is not the DB itself, but used resources or
other things which are going crazy and hit the DB first

Would an example of this be that one drive in a RAID has gone bad
and the hot spare rebuild has been triggered, leading to poor
performance for a while? Is that the sort of issue where you see
value?

How would that be connected to a "no restart on crash" setting?

It wouldn't; but I'm trying to better understand the problem the OP is
trying to solve, to see where that leads.

My first reaction on hearing the request was that it might have *some*
use; but in trying to recall any restart where it is what I would have
wanted, I come up dry. I haven't even really come up with a good
hypothetical use case. But I get the feeling the OP has had some
problem this is attempting to address. I'm just not clear what that
is.

-Kevin
