killing pg_dump leaves backend process

Started by Tatsuo Ishii · over 12 years ago · 14 messages
#1Tatsuo Ishii
ishii@postgresql.org

I noticed pg_dump does not exit gracefully when killed.

start pg_dump
kill pg_dump by ctrl-c
ps x

27246 ? Ds 96:02 postgres: t-ishii dbt3 [local] COPY
29920 ? S 0:00 sshd: ishii@pts/5
29921 pts/5 Ss 0:00 -bash
30172 ? Ss 0:00 postgres: t-ishii dbt3 [local] LOCK TABLE waiting

As you can see, after killing pg_dump, a backend process (LOCK
TABLE waiting) is left behind. I think this could easily be fixed by
adding a signal handler to pg_dump so that it catches the signal and
issues a query cancel request.
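
Something along these lines, perhaps (a minimal sketch, not a patch;
it assumes a single global connection, and it prepares the PGcancel
object ahead of time since PQgetCancel() allocates memory and so is
not safe inside a signal handler):

#include <signal.h>
#include <unistd.h>
#include "libpq-fe.h"

static PGcancel *volatile cancel_obj;   /* set right after connecting */

static void
sigint_handler(int signo)
{
    char errbuf[256];

    /* PQcancel() is documented as safe to call from a signal handler */
    if (cancel_obj != NULL)
        PQcancel(cancel_obj, errbuf, sizeof(errbuf));
    _exit(128 + signo);
}

/*
 * After PQconnectdb() succeeds:
 *     cancel_obj = PQgetCancel(conn);
 *     signal(SIGINT, sigint_handler);
 */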

Thoughts?
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tatsuo Ishii (#1)
Re: killing pg_dump leaves backend process

Tatsuo Ishii <ishii@postgresql.org> writes:

I noticed pg_dump does not exit gracefully when killed.
start pg_dump
kill pg_dump by ctrl-c
ps x

27246 ? Ds 96:02 postgres: t-ishii dbt3 [local] COPY
29920 ? S 0:00 sshd: ishii@pts/5
29921 pts/5 Ss 0:00 -bash
30172 ? Ss 0:00 postgres: t-ishii dbt3 [local] LOCK TABLE waiting

As you can see, after killing pg_dump, a backend process (LOCK
TABLE waiting) is left behind. I think this could easily be fixed by
adding a signal handler to pg_dump so that it catches the signal and
issues a query cancel request.

If we think that's a problem (which I'm not convinced of) then pg_dump
is the wrong place to fix it. Any other client would behave the same
if it were killed while waiting for some backend query. So the right
fix would involve figuring out a way for the backend to kill itself
if the client connection goes away while it's waiting.

regards, tom lane

#3Greg Stark
stark@mit.edu
In reply to: Tom Lane (#2)
Re: killing pg_dump leaves backend process

On Sat, Aug 10, 2013 at 5:30 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Any other client would behave the same
if it were killed while waiting for some backend query. So the right
fix would involve figuring out a way for the backend to kill itself
if the client connection goes away while it's waiting.

Well I'm not sure. Maybe every other client should also issue a query
cancel and close the connection if it gets killed. libpq could offer a
function specifically for programs to call from atexit(), signal
handlers, or exception handlers (yes, that might be a bit tricky).
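
For instance (a sketch only -- no such libpq function exists today,
and pq_cancel_on_exit() is a hypothetical name -- built from the
existing cancel API):

#include <stdlib.h>
#include "libpq-fe.h"

static PGcancel *exit_cancel;   /* prepared while the conn is healthy */

/* hypothetical helper a client would register with atexit() */
static void
pq_cancel_on_exit(void)
{
    char errbuf[256];

    if (exit_cancel != NULL)
    {
        PQcancel(exit_cancel, errbuf, sizeof(errbuf));
        PQfreeCancel(exit_cancel);
    }
}

/*
 * Usage, once after connecting:
 *     exit_cancel = PQgetCancel(conn);
 *     atexit(pq_cancel_on_exit);
 */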

But I do see a convincing argument for doing something in the server.
Namely that if you kill -9 the client surely the server should still
detect that the connection has gone away immediately.

The problem is that I don't know of any way to detect eof on a socket
other than trying to read from it (or calling poll or select). So the
server would have to periodically poll the client even when it's not
expecting any data. The inefficiency is annoying enough and it still
won't detect the eof immediately.
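
(For what it's worth, such a poll could look like this -- a sketch
using a nonblocking MSG_PEEK read so no protocol data is consumed;
Linux/BSD socket semantics assumed:)

#include <sys/socket.h>
#include <errno.h>
#include <stdbool.h>

static bool
client_gone(int sock)
{
    char        c;
    ssize_t     n = recv(sock, &c, 1, MSG_PEEK | MSG_DONTWAIT);

    if (n == 0)
        return true;            /* orderly shutdown: EOF */
    if (n < 0 && errno != EAGAIN && errno != EWOULDBLOCK && errno != EINTR)
        return true;            /* e.g. ECONNRESET */
    return false;               /* data pending, or simply nothing yet */
}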

I would actually tend to think libpq should offer a way to easily send
a cancel and disconnect message immediately upon exiting or closing
the connection *and* the server should periodically poll to check for
the connection being cleanly closed to handle kill -9.

I'm surprised this is the first time we're hearing people complain
about this. I know I've seen similar behaviour from MySQL and thought
to myself that it represented pretty poor behaviour, and assumed
Postgres did better.

--
greg

#4Tatsuo Ishii
ishii@postgresql.org
In reply to: Greg Stark (#3)
Re: killing pg_dump leaves backend process

On Sat, Aug 10, 2013 at 5:30 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Any other client would behave the same
if it were killed while waiting for some backend query. So the right
fix would involve figuring out a way for the backend to kill itself
if the client connection goes away while it's waiting.

I am a little bit surprised to hear that response. I'm talking about
one of the client programs that are part of PostgreSQL. IMO they
should satisfy a higher standard of error control and signal handling
than other PostgreSQL application programs.

Well I'm not sure. Maybe every other client should also issue a query
cancel and close the connection if it gets killed. libpq could offer a
function specifically for programs to call from atexit(), signal
handlers, or exception handlers (yes, that might be a bit tricky).

I'm not sure it's libpq's job. Different applications need to behave
differently when catching signals. I think it would be better to
leave that to the applications.

But I do see a convincing argument for doing something in the server.
Namely that if you kill -9 the client surely the server should still
detect that the connection has gone away immediately.

The problem is that I don't know of any way to detect eof on a socket
other than trying to read from it (or calling poll or select). So the
server would have to periodically poll the client even when it's not
expecting any data. The inefficiency is annoying enough and it still
won't detect the eof immediately.

I think in some cases reading from the socket is not reliable enough
to detect a broken connection. Writing to the socket is more
reliable. For this purpose Pgpool-II periodically sends a "parameter
status" packet to the frontend while waiting for a response from the
backend, to detect whether the socket is broken. Probably the
PostgreSQL backend could do a similar thing.
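
Roughly like this (a sketch of the idea only, not Pgpool-II's actual
code; the message is a v3-protocol ParameterStatus, and because of
TCP buffering a dead peer typically fails only on a *subsequent*
write, which is why the probe has to be periodic):

#include <sys/socket.h>
#include <arpa/inet.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Write a harmless ParameterStatus ('S') message to the frontend;
 * returns false once the socket is known to be broken. */
static bool
probe_frontend(int sock)
{
    char        msg[1 + 4 + 2];
    uint32_t    len = htonl(4 + 2);     /* length word + name\0 + value\0 */

    msg[0] = 'S';
    memcpy(msg + 1, &len, 4);
    msg[5] = '\0';                      /* empty parameter name */
    msg[6] = '\0';                      /* empty parameter value */

    if (send(sock, msg, sizeof(msg), MSG_NOSIGNAL) < 0 &&
        (errno == EPIPE || errno == ECONNRESET))
        return false;                   /* peer is gone */
    return true;
}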

I would actually tend to think libpq should offer a way to easily send
a cancel and disconnect message immediately upon exiting or closing
the connection *and* the server should periodically poll to check for
the connection being cleanly closed to handle kill -9.

I'm surprised this is the first time we're hearing people complain
about this. I know I've seen similar behaviour from MySQL and thought
to myself that it represented pretty poor behaviour, and assumed
Postgres did better.

--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

#5Christopher Browne
cbbrowne@gmail.com
In reply to: Tom Lane (#2)
Re: killing pg_dump leaves backend process

On Sat, Aug 10, 2013 at 12:30 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Tatsuo Ishii <ishii@postgresql.org> writes:

I noticed pg_dump does not exit gracefully when killed.
start pg_dump
kill pg_dump by ctrl-c
ps x

27246 ? Ds 96:02 postgres: t-ishii dbt3 [local] COPY
29920 ? S 0:00 sshd: ishii@pts/5
29921 pts/5 Ss 0:00 -bash
30172 ? Ss 0:00 postgres: t-ishii dbt3 [local] LOCK TABLE waiting

As you can see, after killing pg_dump, a backend process (LOCK
TABLE waiting) is left behind. I think this could easily be fixed by
adding a signal handler to pg_dump so that it catches the signal and
issues a query cancel request.

If we think that's a problem (which I'm not convinced of) then pg_dump
is the wrong place to fix it. Any other client would behave the same
if it were killed while waiting for some backend query. So the right
fix would involve figuring out a way for the backend to kill itself
if the client connection goes away while it's waiting.

This seems to me to be quite a bit like the TCP keepalive issue.

We noticed with Slony that if something ungraceful happens in the
networking layer (the specific thing noticed was someone shutting off
networking, e.g. "/etc/init.d/networking stop" before shutting down
Postgres+Slony), the usual timeouts are really rather excessive, on
the order of a couple hours.

Probably it would be desirable to reduce the timeout period, so that
the server could figure out that clients are incommunicado "reasonably
quickly." It's conceivable that it would be apropos to diminish the
timeout values in postgresql.conf, or at least to recommend that users
consider doing so.
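
For example (values purely illustrative; these GUCs already exist and
default to the kernel's settings):

# postgresql.conf -- tighter TCP keepalives, so dead peers are
# noticed in a couple of minutes rather than a couple of hours
tcp_keepalives_idle = 60        # seconds of idle before the first probe
tcp_keepalives_interval = 10    # seconds between unanswered probes
tcp_keepalives_count = 5        # unanswered probes before dropping
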
--
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"

#6Noah Misch
noah@leadboat.com
In reply to: Greg Stark (#3)
Re: killing pg_dump leaves backend process

On Sat, Aug 10, 2013 at 12:26:43PM +0100, Greg Stark wrote:

On Sat, Aug 10, 2013 at 5:30 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Any other client would behave the same
if it were killed while waiting for some backend query. So the right
fix would involve figuring out a way for the backend to kill itself
if the client connection goes away while it's waiting.

Well I'm not sure. Maybe every other client should also issue a query
cancel and close the connection if it gets killed. libpq could offer a
function specifically for programs to call from atexit(), signal
handlers, or exception handlers (yes, that might be a bit tricky).

But I do see a convincing argument for doing something in the server.
Namely that if you kill -9 the client surely the server should still
detect that the connection has gone away immediately.

I agree that both efforts have value. A client-side change can't replace the
server-side change, and tightening the client side will be more of a neatness
measure once the server-side mechanism is in place.

The problem is that I don't know of any way to detect eof on a socket
other than trying to read from it (or calling poll or select). So the
server would have to periodically poll the client even when it's not
expecting any data. The inefficiency is annoying enough and it still
won't detect the eof immediately.

Yes, I think that is the way to do it. The check interval could default to
something like 90s, high enough to make the cost disappear into the noise and
yet a dramatic improvement over the current "no fixed time limit".

I bet the utils/timeout.h infrastructure added in 9.3 will make this at least
60% easier to implement than it would have been before.
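
Hand-waving with that machinery, something like this (a sketch;
CLIENT_CHECK_TIMEOUT_MS and all the names here are made up, and the
actual socket test would live near a safe point such as the
CHECK_FOR_INTERRUPTS neighborhood):

#include "postgres.h"
#include "utils/timeout.h"

#define CLIENT_CHECK_TIMEOUT_MS (90 * 1000)   /* the 90s suggested above */

static TimeoutId client_check_id;
static volatile sig_atomic_t client_check_pending = false;

/* timeout handler: just set a flag to examine at the next safe point */
static void
client_check_handler(void)
{
    client_check_pending = true;
}

/* at backend startup */
static void
setup_client_check(void)
{
    client_check_id = RegisterTimeout(USER_TIMEOUT, client_check_handler);
    enable_timeout_after(client_check_id, CLIENT_CHECK_TIMEOUT_MS);
}

/*
 * At a safe point: if client_check_pending is set, clear it, test the
 * client socket for EOF (say, with a nonblocking MSG_PEEK read), issue
 * ereport(FATAL, ...) if the client is gone, and re-arm the timeout
 * with enable_timeout_after().
 */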

--
Noah Misch
EnterpriseDB http://www.enterprisedb.com

#7Greg Stark
stark@mit.edu
In reply to: Christopher Browne (#5)
Re: killing pg_dump leaves backend process

I think this is utterly the wrong way to think about this.

TCP is designed to be robust against transient network outages. They
are *not* supposed to cause disconnections. The purpose of keepalives
is to detect connections that still look valid but are stale, with
the remote end no longer present.

Keepalives that trigger on a timescale of less than several times the
MSL are just broken and make TCP unreliable. That means they cannot
trigger in less than many minutes.

This case is one that should just work, and work immediately. From
the user's point of view, when a client dies cleanly the kernel on
the client side is fully aware of the connection being closed and the
network is working fine. The server should be aware the client has
gone away *immediately*. There's no excuse for any polling or
timeouts.

--
greg
On 10 Aug 2013 17:30, "Christopher Browne" <cbbrowne@gmail.com> wrote:

On Sat, Aug 10, 2013 at 12:30 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Tatsuo Ishii <ishii@postgresql.org> writes:

I noticed pg_dump does not exit gracefully when killed.
start pg_dump
kill pg_dump by ctrl-c
ps x

27246 ? Ds 96:02 postgres: t-ishii dbt3 [local] COPY
29920 ? S 0:00 sshd: ishii@pts/5
29921 pts/5 Ss 0:00 -bash
30172 ? Ss 0:00 postgres: t-ishii dbt3 [local] LOCK TABLE waiting

As you can see, after killing pg_dump, a backend process (LOCK
TABLE waiting) is left behind. I think this could easily be fixed by
adding a signal handler to pg_dump so that it catches the signal and
issues a query cancel request.

If we think that's a problem (which I'm not convinced of) then pg_dump
is the wrong place to fix it. Any other client would behave the same
if it were killed while waiting for some backend query. So the right
fix would involve figuring out a way for the backend to kill itself
if the client connection goes away while it's waiting.

This seems to me to be quite a bit like the TCP keepalive issue.

We noticed with Slony that if something ungraceful happens in the
networking layer (the specific thing noticed was someone shutting off
networking, e.g. "/etc/init.d/networking stop" before shutting down
Postgres+Slony), the usual timeouts are really rather excessive, on
the order of a couple hours.

Probably it would be desirable to reduce the timeout period, so that
the server could figure out that clients are incommunicado "reasonably
quickly." It's conceivable that it would be apropos to diminish the
timeout values in postgresql.conf, or at least to recommend that users
consider doing so.
--
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"

#8Josh Berkus
josh@agliodbs.com
In reply to: Tatsuo Ishii (#1)
Re: killing pg_dump leaves backend process

On 08/10/2013 04:26 AM, Greg Stark wrote:

On Sat, Aug 10, 2013 at 5:30 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Any other client would behave the same
if it were killed while waiting for some backend query. So the right
fix would involve figuring out a way for the backend to kill itself
if the client connection goes away while it's waiting.

I've been waiting forever to have something we can justifiably call the
"loner suicide patch". ;-)

I'm surprised this is the first time we're hearing people complain
about this. I know I've seen similar behaviour from MySQL and thought
to myself that it represented pretty poor behaviour, and assumed
Postgres did better.

No, it's been a chronic issue since we got SMP support, pretty much
forever. Why do you think we have pg_terminate_backend()?

The problem, as explored downthread, is that there's no clear way to fix
this. It's a problem which goes pretty far beyond PostgreSQL; you can
experience the same issue on Apache with stuck downloads.

Our advantage over MySQL is that the idle process isn't likely to crash
anything ...

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

#9Greg Stark
stark@mit.edu
In reply to: Josh Berkus (#8)
Re: killing pg_dump leaves backend process

On Sun, Aug 11, 2013 at 9:25 PM, Josh Berkus <josh@agliodbs.com> wrote:

No, it's been a chronic issue since we got SMP support, pretty much
forever. Why do you think we have pg_terminate_backend()?

The problem, as explored downthread, is that there's no clear way to fix
this. It's a problem which goes pretty far beyond PostgreSQL; you can
experience the same issue on Apache with stuck downloads.

No. There are multiple problems that can cause a stuck orphaned server
process and I think you're conflating different kinds of problems.

a) If the client dies due to C-c or kill or any other normal exit
path. There's really no excuse for not detecting that situation
*immediately*. As suggested in the original post the client could
notify the server before it dies that it's about to die.

b) If the client dies in some abnormal path such as kill -9. In that
case we could easily detect the situation as quickly as we want, but
the more often we probe, the more time and CPU wakeups we waste
sending probes. We would only need to react to errors on that
connection (RST packets, which will cause a SIGIO or EOF depending on
what we ask for), not a lack of response, so it doesn't need to make
things more fragile.

c) If the client goes away either because it's turned off or the
network is disconnected. This is the problem Apache faces because
it's exposed to the internet at large. We're not entirely immune to
it, but we have much less of a problem with it. The problem here is
that there's really no easy solution at all. If you send keepalives
and time them out, then transient network outages become spurious
fatal errors.

Our advantage over MySQL is that the idle process isn't likely to crash
anything ...

--
greg

#10Jeff Janes
jeff.janes@gmail.com
In reply to: Greg Stark (#3)
Re: killing pg_dump leaves backend process

On Sat, Aug 10, 2013 at 4:26 AM, Greg Stark <stark@mit.edu> wrote:

The problem is that I don't know of any way to detect eof on a socket
other than trying to read from it (or calling poll or select). So the
server would have to periodically poll the client even when it's not
expecting any data. The inefficiency is annoying enough and it still
won't detect the eof immediately.

Do we know how inefficient it is, compared to the baseline work done
by CHECK_FOR_INTERRUPTS() and its affiliated machinery?

...

I'm surprised this is the first time we're hearing people complain
about this. I know I've seen similar behaviour from MySQL and thought
to myself that it represented pretty poor behaviour, and assumed
Postgres did better.

I've seen other complaints about it (and made at least one myself).

Cheers,

Jeff

#11Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jeff Janes (#10)
Re: killing pg_dump leaves backend process

Jeff Janes <jeff.janes@gmail.com> writes:

On Sat, Aug 10, 2013 at 4:26 AM, Greg Stark <stark@mit.edu> wrote:

The problem is that I don't know of any way to detect eof on a socket
other than trying to read from it (or calling poll or select).

Do we know how inefficient it is, compared to the baseline work done
by CHECK_FOR_INTERRUPTS() and its affiliated machinery?

CHECK_FOR_INTERRUPTS() is about two instructions (test a global variable
and branch) in the normal case with nothing to do. Don't even think of
putting a kernel call into it.

regards, tom lane

#12Greg Stark
stark@mit.edu
In reply to: Tom Lane (#11)
Re: killing pg_dump leaves backend process

On Mon, Aug 12, 2013 at 6:56 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Jeff Janes <jeff.janes@gmail.com> writes:

On Sat, Aug 10, 2013 at 4:26 AM, Greg Stark <stark@mit.edu> wrote:

The problem is that I don't know of any way to detect eof on a socket
other than trying to read from it (or calling poll or select).

Do we know how inefficient it is, compared to the baseline work done
by CHECK_FOR_INTERRUPTS() and its affiliated machinery?

CHECK_FOR_INTERRUPTS() is about two instructions (test a global variable
and branch) in the normal case with nothing to do. Don't even think of
putting a kernel call into it.

So I poked around a bit. It looks like Linux does send a SIGIO when a
tcp connection is closed (with POLL_HUP if it's closed and POLL_IN if
it's half-closed). So it should be possible to arrange to get a signal
which CHECK_FOR_INTERRUPTS could handle the normal way.

However this would mean getting a signal every time there's data
available from the client. I don't know how inefficient that would be
or how convenient it would be to turn it off and on all the time so we
aren't constantly receiving useless signals.
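
For reference, arranging the delivery looks something like this
(Linux-specific, per the portability caveat below; the handler only
sets a flag for CHECK_FOR_INTERRUPTS to notice):

#define _GNU_SOURCE
#include <fcntl.h>
#include <signal.h>
#include <unistd.h>

static volatile sig_atomic_t client_socket_event = false;

static void
sigio_handler(int signo)
{
    client_socket_event = true;     /* checked later, at a safe point */
}

static void
arm_sigio(int sock)
{
    signal(SIGIO, sigio_handler);
    fcntl(sock, F_SETOWN, getpid());    /* deliver SIGIO to this process */
    fcntl(sock, F_SETFL, fcntl(sock, F_GETFL) | O_ASYNC);  /* enable it */
}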

I'm not sure how portable this behaviour is either. There may well be
platforms where having the socket closed doesn't generate a SIGIO.

I'm not sure this is the end of the story either. Ok, so the tcp
stream is closed, does that mean it's safe to end the currently
executing command? There may be a commit buffered up in the stream
that hasn't been processed yet. If you connect and send "vacuum" and
then close the connection do you expect the vacuum to just cancel
immediately?

It does seem obvious that a SELECT shouldn't keep running, since it
will die as soon as it produces any output. It may well be that
Postgres should just document, as part of the protocol, that if the
TCP connection is closed then whatever command was running may be
terminated at any time. That's effectively true already, since any
WARNING or INFO would have the same effect, and we don't have any
policy of discouraging those for fear of causing spurious failures.

--
greg

#13Tom Lane
tgl@sss.pgh.pa.us
In reply to: Greg Stark (#12)
Re: killing pg_dump leaves backend process

Greg Stark <stark@mit.edu> writes:

So I poked around a bit. It looks like Linux does send a SIGIO when a
tcp connection is closed (with POLL_HUP if it's closed and POLL_IN if
it's half-closed). So it should be possible to arrange to get a signal
which CHECK_FOR_INTERRUPTS could handle the normal way.

However this would mean getting a signal every time there's data
available from the client. I don't know how inefficient that would be
or how convenient it would be to turn it off and on all the time so we
aren't constantly receiving useless signals.

That sounds like a mess --- race conditions all over the place,
even aside from efficiency worries.

I'm not sure how portable this behaviour is either. There may well be
platforms where having the socket closed doesn't generate a SIGIO.

AFAICS, the POSIX spec doesn't define SIGIO at all, so this worry is
probably very real.

What I *do* see standardized in POSIX is SIGURG (out-of-band data is
available). If that's delivered upon socket close, which unfortunately
POSIX doesn't say, then it'd avoid the race condition issue. We don't
use out-of-band data in the protocol and could easily say that we'll
never do so in future.

Of course the elephant in the room is Windows --- does it support
any of this stuff?

regards, tom lane

#14Greg Stark
stark@mit.edu
In reply to: Tom Lane (#13)
Re: killing pg_dump leaves backend process

On Mon, Aug 12, 2013 at 11:41 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

That sounds like a mess --- race conditions all over the place, even aside from efficiency worries.

This I don't understand. All I'm envisioning is setting a flag in the
signal handler. If that flag is set then the next CHECK_FOR_INTERRUPTS
would check for EOF on the client input anyway (by reading some
additional data into the buffer), so any spurious signals due to
races would just be ignored.

It occurs to me that it can be tricky to arrange for the
communication layer to actually try to read, however. It may have
some data buffered up and choose not to read anything. It's possibly
even going through OpenSSL, so we may not even know whether the read
actually happened. Still, at least trying is better than not.

AFAICS, the POSIX spec doesn't define SIGIO at all, so this worry is
probably very real.

What I *do* see standardized in POSIX is SIGURG (out-of-band data is
available). If that's delivered upon socket close

It's not. You're not going to get SIGURG unless data is actually sent
with MSG_OOB. That's not helpful, since if the client was aware it
was about to exit it could have happily done the existing query
cancel dance. (We could use MSG_OOB and SIGURG instead of our
existing query cancel tricks, which might be simpler, but given that
we already have the existing code and it works, I doubt anyone's
going to get excited about experimenting with replacing it with
something that's rarely used and nobody's familiar with any more.)

I do think it's worth making it easy for clients to send a normal
cancel whenever they exit normally. That would probably cover 90% of
the actual problem cases.

Of course the elephant in the room is Windows --- does it support
any of this stuff?

I suspect there are three different competing APIs for doing this on
Windows, none of which is spelled the same as Unix but are all better
in various subtly different ways.

--
greg
