How to simulate crashes of PostgreSQL?

Started by Sergey Samokhin, over 16 years ago, 25 messages, general
#1Sergey Samokhin
prikrutil@gmail.com

Hello!

To make my client application tolerant of PostgreSQL failures I first
need to be able to simulate them in a safe manner (hard reset isn't a
solution I'm looking for :)

Is there a way to disconnect all the clients as if the server has
crashed? It should look like a real crash from the client's point of
view.

Is kill what everyone uses for this purpose?

Thanks.

--
Sergey Samokhin

#2Ray Stell
stellr@cns.vt.edu
In reply to: Sergey Samokhin (#1)
Re: How to simulate crashes of PostgreSQL?

On Sat, Aug 22, 2009 at 01:03:43PM -0700, Sergey Samokhin wrote:

Is there a way to disconnect all the clients as if the server has
crashed? It should look like a real crash from the client's point of
view.

ifconfig ethx down ?

#3Greg Sabino Mullane
greg@turnstep.com
In reply to: Sergey Samokhin (#1)
Re: How to simulate crashes of PostgreSQL?

Is there a way to disconnect all the clients as if the server has
crashed? It should look like a real crash from the client's point of
view.

ifconfig ethx down ?

Or even:

iptables -I INPUT -p tcp --dport 5432 -j DROP

Keep in mind that both of those are simulating network failures, not
a "server crash". But network failures are something your application
should handle gracefully too. :) To make something look like a real
crash, you should do a real crash. In this case, kill -9 the backend(s).

A server crash is a pretty rare event in the Postgres world, so I
would not spend too many cycles on this...

--
Greg Sabino Mullane greg@turnstep.com
End Point Corporation
PGP Key: 0x14964AC8 200908221849
http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8

#4Scott Marlowe
scott.marlowe@gmail.com
In reply to: Greg Sabino Mullane (#3)
Re: How to simulate crashes of PostgreSQL?

On Sat, Aug 22, 2009 at 4:55 PM, Greg Sabino Mullane<greg@turnstep.com> wrote:

A server crash is a pretty rare event in the Postgres world, so I
would not spend too many cycles on this...

I've been running pg in production since 7.0 came out. zero server crashes.

#5Craig Ringer
craig@2ndquadrant.com
In reply to: Sergey Samokhin (#1)
Re: How to simulate crashes of PostgreSQL?

On Sat, 2009-08-22 at 13:03 -0700, Sergey Samokhin wrote:

Hello!

To make my client application tolerant of PostgreSQL failures I first
need to be able to simulate them in a safe manner (hard reset isn't a
solution I'm looking for :)

Is there a way to disconnect all the clients as if the server has
crashed? It should look like a real crash from the client's point of
view.

If you mean a PostgreSQL server crash: write a C extension function that
de-references a null pointer or calls abort() . Instant crash on demand.
`kill -9' on a backend should have much the same effect, though, and is
easier - it's just not something a client can trigger through an SQL
query.

Remember to keep backups - Pg's designed to be fault tolerant, but it's
still good to be careful just in case.

If, however, you mean a crash of the server machine PostgreSQL is
running on, which is MUCH more likely and will have different
effects/behaviour, then Ray Stell's advice to bring the interface down
is probably pretty good. The machine should stop responding to ARP
requests or to any packets directed to its MAC address and will stop
sending packets, so it'll look to the client like it's a hard server
crash.

You should also test your client's response to the Pg server remaining
up but becoming non-responsive (eg: failed disk array causes Pg backends
to remain in uninterruptible disk I/O system calls in the kernel). A
possibly good way to do this is to SIGSTOP the backend(s).
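
A minimal sketch of the SIGSTOP idea on a POSIX system, written in Python. The helper names are hypothetical, and the demo freezes a harmless child process rather than a real backend; the PID of an actual PostgreSQL backend would have to come from your own tooling. While a process is stopped, its sockets stay open, so a client sees a server that accepts data but never answers.

```python
import os
import signal
import subprocess
import sys
import time


def freeze_process(pid: int, seconds: float) -> None:
    """Suspend a process with SIGSTOP, wait, then resume it with SIGCONT."""
    os.kill(pid, signal.SIGSTOP)      # process stops running but keeps its sockets open
    try:
        time.sleep(seconds)           # window in which clients see a silent server
    finally:
        os.kill(pid, signal.SIGCONT)  # always resume, even if the sleep is interrupted


def demo_on_dummy_child() -> int:
    """Freeze a harmless child process instead of a real backend; return its exit code."""
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(0.5)"])
    freeze_process(child.pid, 0.2)
    return child.wait()
```

Resuming in a `finally` block is a deliberate choice here: a test that forgets the SIGCONT leaves the process frozen indefinitely.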

--
Craig Ringer

#6Scott Marlowe
scott.marlowe@gmail.com
In reply to: Craig Ringer (#5)
Re: How to simulate crashes of PostgreSQL?

On Mon, Aug 24, 2009 at 12:49 AM, Craig
Ringer<craig@postnewspapers.com.au> wrote:

You should also test your client's response to the Pg server remaining
up but becoming non-responsive (eg: failed disk array causes Pg backends
to remain in uninterruptible disk I/O system calls in the kernel). A
possibly good way to do this is to SIGSTOP the backend(s).

This is a far more common and likely problem than the server crash
scenario. I've had servers go unresponsive under load before. Took
the load away and they came back, but the way the app responded has
not always been optimal. Many apps get jammed up from something like
this and require the app servers to be restarted.

#7David Fetter
david@fetter.org
In reply to: Scott Marlowe (#4)
Re: How to simulate crashes of PostgreSQL?

On Mon, Aug 24, 2009 at 12:10:30AM -0600, Scott Marlowe wrote:

On Sat, Aug 22, 2009 at 4:55 PM, Greg Sabino Mullane<greg@turnstep.com> wrote:

A server crash is a pretty rare event in the Postgres world, so I
would not spend too many cycles on this...

I've been running pg in production since 7.0 came out. zero server
crashes.

In my experience, OS crashes are much more common than PostgreSQL
crashes.

Cheers,
David.
--
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david.fetter@gmail.com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

#8Scott Marlowe
scott.marlowe@gmail.com
In reply to: David Fetter (#7)
Re: How to simulate crashes of PostgreSQL?

On Mon, Aug 24, 2009 at 12:41 PM, David Fetter<david@fetter.org> wrote:

On Mon, Aug 24, 2009 at 12:10:30AM -0600, Scott Marlowe wrote:

On Sat, Aug 22, 2009 at 4:55 PM, Greg Sabino Mullane<greg@turnstep.com> wrote:

A server crash is a pretty rare event in the Postgres world, so I
would not spend too many cycles on this...

I've been running pg in production since 7.0 came out.  zero server
crashes.

In my experience, OS crashes are much more common than PostgreSQL
crashes.

Also, admin mistakes are more common than pgsql crashes. I've done
things like typing "sudo reboot" into my workstation, only to realize
seconds later that I'm logged into a production server (long time ago,
but still).

#9Sergey Samokhin
prikrutil@gmail.com
In reply to: Craig Ringer (#5)
Re: How to simulate crashes of PostgreSQL?

Hello!

If, however, you mean a crash of the server machine PostgreSQL is
running on, which is MUCH more likely and will have different
effects/behaviour, then Ray Stell's advice to bring the interface down
is probably pretty good.

Sorry for the somewhat ambiguous use of the terms "crash" and "fault". By
those words I meant a crash of the server machine PostgreSQL is running
on, not of PostgreSQL itself. Network outages between the client and
PostgreSQL are also something I would like to simulate in some
way.

Though I don't think there are any differences between the crash of
PostgreSQL itself and the crash of the machine PostgreSQL is running on
from the client's point of view.

Yet another way I've found to simulate this terrible behaviour is to
stop PostgreSQL with the "pg_ctl stop -m immediate" command.

Thanks to everyone who answered in this thread! It was very helpful to read!

--
Sergey Samokhin

#10Sergey Samokhin
prikrutil@gmail.com
In reply to: Craig Ringer (#5)
Re: How to simulate crashes of PostgreSQL?

Hello!

You should also test your client's response to the Pg server remaining
up but becoming non-responsive (eg: failed disk array causes Pg backends
to remain in uninterruptible disk I/O system calls in the kernel). A
possibly good way to do this is to SIGSTOP the backend(s).

I hadn't thought about that yet. It's probably the place where I should
use timeouts on operations involving calls to PostgreSQL.
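
To illustrate why a timeout matters against a non-responsive server, here is a small Python sketch using plain sockets. It is not a PostgreSQL client: `start_silent_server` and `query_with_timeout` are hypothetical names, and the "server" simply accepts connections and never replies, which is roughly what a stopped backend looks like from the client. Real drivers expose their own timeout settings, so check your driver's documentation for the equivalent.

```python
import socket
import threading


def start_silent_server() -> tuple[socket.socket, int]:
    """Listen on a free local port, accept connections, and never answer them."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))   # port 0 lets the OS choose a free port
    listener.listen(5)
    held = []                         # keep accepted sockets referenced so they stay open

    def accept_forever() -> None:
        while True:
            try:
                conn, _ = listener.accept()
            except OSError:           # listener was closed
                return
            held.append(conn)

    threading.Thread(target=accept_forever, daemon=True).start()
    return listener, listener.getsockname()[1]


def query_with_timeout(port: int, timeout: float) -> str:
    """Send a request and wait at most `timeout` seconds for any reply."""
    with socket.create_connection(("127.0.0.1", port), timeout=timeout) as sock:
        sock.sendall(b"SELECT 1;")
        try:
            sock.recv(1024)
            return "replied"
        except socket.timeout:
            return "timed out"
```

Without the timeout, the `recv` call would block for as long as the server stays silent.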

--
Sergey Samokhin

#11Craig Ringer
craig@2ndquadrant.com
In reply to: Sergey Samokhin (#9)
Re: How to simulate crashes of PostgreSQL?

On Tue, 2009-08-25 at 00:26 +0400, Sergey Samokhin wrote:

Hello!

If, however, you mean a crash of the server machine PostgreSQL is
running on, which is MUCH more likely and will have different
effects/behaviour, then Ray Stell's advice to bring the interface down
is probably pretty good.

Sorry for the somewhat ambiguous use of the terms "crash" and "fault". By
those words I meant a crash of the server machine PostgreSQL is running
on, not of PostgreSQL itself. Network outages between the client and
PostgreSQL are also something I would like to simulate in some
way.

Get a cheap PC with two Ethernet cards running Linux, and put it between
your Pg server and the rest of the network - or between your client and
the rest of the network.

Set it up to route packets between the two interfaces using iptables.
You can now easily introduce rules to do things like drop random
packets, drop packets of particular sizes, drop a regular percentage of
packets, etc.

You can also introduce latency using iproute2's `tc' .

http://lartc.org/

example:

http://www.kdedevelopers.org/node/1878

showing the use of the "delay" option of the network emulation (netem)
qdisc.

Alternately: ebtables lets you do some network issue simulation on a
Linux machine that's bridging between two interfaces instead of routing
between them, so you can make your router transparent to the network.

Unless you've worked a bit with iptables before or at least done a lot
of general networking work you'll need to do a bit of learning to get
much of this up and running smoothly. It's not a trivial drop-in. I'm
not going to give detailed instructions and support, as I just don't
have the time to go into it at present - sorry.
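
For quick experiments on a single machine, latency can also be approximated in user space. The Python sketch below is not the iptables/netem setup described above; it swaps in a different technique, a small TCP proxy that forwards bytes after a fixed delay. All function names are hypothetical, and the echo server is only a stand-in for the database so the example runs on its own.

```python
import socket
import threading
import time


def _pump(src, dst, delay):
    """Copy bytes from src to dst, sleeping `delay` seconds before each forward."""
    try:
        while True:
            data = src.recv(4096)
            if not data:                   # peer closed its side
                break
            time.sleep(delay)              # the injected latency
            dst.sendall(data)
    except OSError:
        pass                               # either side went away; stop copying
    finally:
        try:
            dst.shutdown(socket.SHUT_WR)   # pass the end-of-stream on
        except OSError:
            pass


def _listen():
    """Open a listening socket on a free local port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind(("127.0.0.1", 0))
    sock.listen(5)
    return sock


def _serve(listener, handler):
    """Accept connections forever on a daemon thread, one handler thread each."""
    def loop():
        while True:
            conn, _ = listener.accept()
            threading.Thread(target=handler, args=(conn,), daemon=True).start()
    threading.Thread(target=loop, daemon=True).start()


def start_echo_server() -> int:
    """Stand-in for the database: echoes back whatever it receives."""
    listener = _listen()
    _serve(listener, lambda conn: _pump(conn, conn, 0.0))
    return listener.getsockname()[1]


def start_delay_proxy(target_port: int, delay: float) -> int:
    """Forward local connections to 127.0.0.1:target_port with added latency."""
    listener = _listen()

    def handle(client):
        upstream = socket.create_connection(("127.0.0.1", target_port))
        # One thread per direction: replies flow back while requests flow forward.
        threading.Thread(target=_pump, args=(upstream, client, delay), daemon=True).start()
        _pump(client, upstream, delay)

    _serve(listener, handle)
    return listener.getsockname()[1]
```

Pointing a client at the proxy port instead of the server port adds the delay in each direction. This toy does not model packet loss or reordering, which is where a kernel-level tool remains the better fit.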

--
Craig Ringer

#12Craig Ringer
craig@2ndquadrant.com
In reply to: Sergey Samokhin (#9)
Re: How to simulate crashes of PostgreSQL?

On Tue, 2009-08-25 at 00:26 +0400, Sergey Samokhin wrote:

Hello!

If, however, you mean a crash of the server machine PostgreSQL is
running on, which is MUCH more likely and will have different
effects/behaviour, then Ray Stell's advice to bring the interface down
is probably pretty good.

Sorry for the somewhat ambiguous use of the terms "crash" and "fault". By
those words I meant a crash of the server machine PostgreSQL is running
on, not of PostgreSQL itself. Network outages between the client and
PostgreSQL are also something I would like to simulate in some
way.

This is the reference I should've given:

http://www.linuxfoundation.org/en/Net:Netem

--
Craig Ringer

#13Craig Ringer
craig@2ndquadrant.com
In reply to: Sergey Samokhin (#9)
Re: How to simulate crashes of PostgreSQL?

On Tue, 2009-08-25 at 00:26 +0400, Sergey Samokhin wrote:

Though I don't think there are any differences between the crash of
PostgreSQL itself and the crash of the machine PostgreSQL is running on
from the client's point of view.

There certainly are!

For one thing, if a client with an established connection sends a packet
to a machine where PostgreSQL has crashed (the backend process has
exited on a signal) it'll receive a TCP RST indicating that the
connection has been broken. The OS will also generally FIN to the client
when the backend crashes to inform it that the connection is closing, so
you'll often find out as soon as the backend dies or at least as soon as
you next try to use the connection. If the issue was just with that
backend, your client can just reconnect, retry its most recent work, and
keep on going.
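
The reconnect-and-retry pattern can be sketched in Python as follows. This is an illustration under assumptions, not any driver's API: `run_with_reconnect`, `connect`, and `work` are hypothetical names, and the sketch assumes connection failures surface as `ConnectionError` subclasses. Retrying is only safe when the work is idempotent or was known not to have committed.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_reconnect(connect: Callable[[], object],
                       work: Callable[[object], T],
                       attempts: int = 3,
                       base_delay: float = 0.1) -> T:
    """Run `work(conn)`; on a connection error, reconnect and retry with backoff."""
    last_error = None
    for attempt in range(attempts):
        try:
            conn = connect()              # fresh connection for every attempt
            return work(conn)
        except ConnectionError as exc:    # covers reset, refused, aborted, broken pipe
            last_error = exc
            time.sleep(base_delay * (2 ** attempt))   # exponential backoff
    raise RuntimeError(f"gave up after {attempts} attempts") from last_error
```

The backoff keeps a fleet of clients from hammering a server that is still in crash recovery.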

Similarly, a new client trying to connect to a machine where the
postmaster has crashed will receive a TCP RST packet indicating that the
connection attempt was actively refused. It'll know immediately that
something's not right and will get a useful error from the TCP stack.

If, on the other hand, the server has crashed, clients may not receive
any response at all to packets. The server may even stop responding to
ARP requests, in which case the nearest router to it will - eventually,
maybe - send your client an ICMP destination-unreachable . There will be
long delays either way before the TCP/IP stack decides the connection
has died. Your client will probably block on recv(...) / read(...) for
an extended period.

If a backend is still running but in a nonresponsive state, the TCP/IP
stack on the server will still ACK packets you send to the backend (at
least until the buffers fill up), but the backend won't be doing
anything with the data. The local TCP stack won't see anything wrong
because, at the TCP level, there isn't - something that can't happen in
a server crash.

So, yes, there's a pretty big difference between a crash of PostgreSQL
and a server crash. Behaviour is different from the client perspective
and you need to consider that. Intermediate network issues are different
again, as you might encounter huge latency (possibly randomly only on
some packets), random packet loss, etc. This will cause weird pauses and
delays in communication that your client must cope with.
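
The distinction drawn above can be observed from a client with a simple connection probe. The Python sketch below uses only the standard socket module; `probe` is a hypothetical helper, and the labels it returns are just one way to name the outcomes.

```python
import socket


def probe(host: str, port: int, timeout: float = 2.0) -> str:
    """Try a TCP connect and report how the attempt ended."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "accepted"          # something is listening (it may still be hung)
    except ConnectionRefusedError:
        return "refused"               # host is up, nothing listening: an RST came back
    except socket.timeout:
        return "no answer"             # packets vanished: dead host or dropped traffic
    except OSError as exc:
        return f"other error: {exc}"   # e.g. host or network unreachable
```

A "refused" result arrives almost immediately, while "no answer" only arrives after the full timeout, which is why the two cases need different handling in the client.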

This, by the way, is one of the reasons you *really* should do all your
database work in a separate worker thread on GUI clients. The GUI must
remain responsive even when you're waiting for a response that'll never
come, or being held up by multi-second network latencies.
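
A minimal sketch of the worker-thread approach in Python, assuming a GUI toolkit that can call a function periodically from its event loop. The names are hypothetical, and `slow_query` merely sleeps to stand in for a blocking database call.

```python
import queue
import threading
import time
from typing import Callable


def slow_query() -> str:
    """Stand-in for a blocking database call (hypothetical)."""
    time.sleep(0.2)
    return "rows"


def run_in_worker(task: Callable[[], object]) -> "queue.Queue":
    """Start `task` on a daemon thread; its outcome arrives on the returned queue."""
    results: "queue.Queue" = queue.Queue()

    def worker() -> None:
        try:
            results.put(("ok", task()))
        except Exception as exc:       # hand errors back instead of losing them
            results.put(("error", exc))

    threading.Thread(target=worker, daemon=True).start()
    return results


def poll(results: "queue.Queue"):
    """Non-blocking check a GUI timer could call; returns None while still waiting."""
    try:
        return results.get_nowait()
    except queue.Empty:
        return None
```

The GUI thread only ever calls `poll`, which returns immediately, so the interface stays responsive however long the database takes.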

--
Craig Ringer

#14Vick Khera
vivek@khera.org
In reply to: Greg Sabino Mullane (#3)
Re: How to simulate crashes of PostgreSQL?

On Sat, Aug 22, 2009 at 6:55 PM, Greg Sabino Mullane<greg@turnstep.com> wrote:

A server crash is a pretty rare event in the Postgres world, so I
would not spend too many cycles on this...

I had one the other day caused by server resource issues: I ran out of
file descriptors when I had a very large surge in activity. Pg
rightfully panicked and disconnected all my clients.

Only the well-written ones recovered automagically. I had to restart a
handful of services :-( It is wise to put effort into testing that your
client recovery strategy actually works.

I must say that I haven't had a Postgres crash due to a Postgres bug
since version 7.2 or so.

#15Merlin Moncure
mmoncure@gmail.com
In reply to: Scott Marlowe (#4)
Re: How to simulate crashes of PostgreSQL?

On Mon, Aug 24, 2009 at 2:10 AM, Scott Marlowe<scott.marlowe@gmail.com> wrote:

On Sat, Aug 22, 2009 at 4:55 PM, Greg Sabino Mullane<greg@turnstep.com> wrote:

A server crash is a pretty rare event in the Postgres world, so I
would not spend too many cycles on this...

I've been running pg in production since 7.0 came out.  zero server crashes.

I've found a few... I discovered the aggregate problem in 8.4. I also
co-discovered the prepared query/alter table problem that can trivially
crash any pg server up to 8.2.

merlin

#16Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Vick Khera (#14)
Re: How to simulate crashes of PostgreSQL?

Vick Khera wrote:

On Sat, Aug 22, 2009 at 6:55 PM, Greg Sabino Mullane<greg@turnstep.com> wrote:

A server crash is a pretty rare event in the Postgres world, so I
would not spend too many cycles on this...

I had one the other day caused by server resource issues: I ran out of
file descriptors when I had a very large surge in activity. Pg
rightfully panicked and disconnected all my clients.

PG is not supposed to crash when it runs out of file descriptors. In
fact there's a whole abstraction layer to ensure this does not happen.
What you saw was either misconfiguration or a bug somewhere (for example
maybe you have untrusted functions that try to open files?)

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

#17Vick Khera
vivek@khera.org
In reply to: Alvaro Herrera (#16)
Re: How to simulate crashes of PostgreSQL?

On Tue, Aug 25, 2009 at 1:09 PM, Alvaro
Herrera<alvherre@commandprompt.com> wrote:

Vick Khera wrote:

On Sat, Aug 22, 2009 at 6:55 PM, Greg Sabino Mullane<greg@turnstep.com> wrote:

A server crash is a pretty rare event in the Postgres world, so I
would not spend too many cycles on this...

I had one the other day caused by server resource issues: I ran out of
file descriptors when I had a very large surge in activity.  Pg
rightfully panicked and disconnected all my clients.

PG is not supposed to crash when it runs out of file descriptors.  In
fact there's a whole abstraction layer to ensure this does not happen.
What you saw was either misconfiguration or a bug somewhere (for example
maybe you have untrusted functions that try to open files?)

From my syslog:

Aug 21 15:11:13 d01 postgres[12037]: [156-1] PANIC: could not open
file "pg_xlog/00000001000013E300000014" (log file 5091, segment 20):
Too many open files in system

Then all other processes did this:

Aug 21 15:11:15 d01 postgres[38452]: [71-1] WARNING: terminating
connection because of crash of another server process

Then recovery began. Luckily it only took 3 minutes because I limit
the number of log segments when in production mode.

Seems to me to be a part of the core server that caused the panic, not
any external functions (only external modules I use are pl/pgsql and
slony1).

#18Tom Lane
tgl@sss.pgh.pa.us
In reply to: Vick Khera (#17)
Re: How to simulate crashes of PostgreSQL?

Vick Khera <vivek@khera.org> writes:

On Tue, Aug 25, 2009 at 1:09 PM, Alvaro
Herrera<alvherre@commandprompt.com> wrote:

PG is not supposed to crash when it runs out of file descriptors. In
fact there's a whole abstraction layer to ensure this does not happen.

From my syslog:
Aug 21 15:11:13 d01 postgres[12037]: [156-1] PANIC: could not open
file "pg_xlog/00000001000013E300000014" (log file 5091, segment 20):
Too many open files in system

This is probably coming from walwriter, which might not have very much
of a cushion of "extra" open files to close.

regards, tom lane

#19Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tom Lane (#18)
Re: How to simulate crashes of PostgreSQL?

Tom Lane wrote:

Vick Khera <vivek@khera.org> writes:

On Tue, Aug 25, 2009 at 1:09 PM, Alvaro
Herrera<alvherre@commandprompt.com> wrote:

PG is not supposed to crash when it runs out of file descriptors. In
fact there's a whole abstraction layer to ensure this does not happen.

From my syslog:
Aug 21 15:11:13 d01 postgres[12037]: [156-1] PANIC: could not open
file "pg_xlog/00000001000013E300000014" (log file 5091, segment 20):
Too many open files in system

This is probably coming from walwriter, which might not have very much
of a cushion of "extra" open files to close.

Note that this is ENFILE, not EMFILE; so if the load is high, it's
possible that the released file descriptor is immediately taken by
another process before BasicFileOpen is able to grab it (assuming
there's any open file to close).

Vivek, do you see this error message before the PANIC?
LOG: out of file descriptors: %m; release and retry

Would it be worthwhile for walwriter to grab a dozen dummy fds?

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#20Vick Khera
vivek@khera.org
In reply to: Alvaro Herrera (#19)
Re: How to simulate crashes of PostgreSQL?

On Tue, Aug 25, 2009 at 2:49 PM, Alvaro
Herrera<alvherre@commandprompt.com> wrote:

Vivek, do you see this error message before the PANIC?
LOG:    out of file descriptors: %m; release and retry

Nope, no mention of "release" in that log file. I have a handful of
lines like these:

ERROR: could not load library
"/usr/local/lib/postgresql/slony1_funcs.so": dlopen
(/usr/local/lib/postgresql/slony1_funcs.so) failed:

ERROR: could not load library "/usr/local/lib/postgresql/plpgsql.so":
dlopen (/usr/local/lib/postgresql/plpgsql.so) failed:

#21Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#19)
#22Vick Khera
vivek@khera.org
In reply to: Tom Lane (#21)
#23Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Vick Khera (#22)
#24Bruce Momjian
bruce@momjian.us
In reply to: Alvaro Herrera (#23)
#25Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#23)