Client-based EOFs triggering hung queries?
Hello all,
First time poster here - probably a good sign since I've been running
postgresql with zero issues for the last several years! At any rate, I've
recently run into a strange issue. Client-based EOFs are nothing new to our
application; people can sometimes close a connection for a number of
reasons. However, on the DB side these have always been released with no
issue.
Today, we had 4 connections that saw either a client-based EOF or a
'canceling statement due to user request', and in all of these 4 cases the
query remained active and started chewing heavily into the cpu such that
they ground the db server to a halt until each of the procpids were manually
issued a 'pg_cancel_backed ( procpid );' from the console. None of the 4
queries were particularly heavy; they registered approximately a 2300ms
completion time using a query EXPLAIN ANALYZE. A little long, but not too
far out of the ordinary on our usual database usage.
Has anyone run into something similar, or have any ideas on how I can track
down what might have kept these queries in the pg_stat_activity table even
after the procpids were gone on the originating device?
TIA,
-Chris.
Christopher Opena <counterveil@gmail.com> writes:
Hello all,
First time poster here - probably a good sign since I've been running
postgresql with zero issues for the last several years! At any rate, I've
recently run into a strange issue. Client-based EOFs are nothing new to our
application; people can sometimes close a connection for a number of
reasons. However, on the DB side these have always been released with no
issue.
Today, we had 4 connections that saw either a client-based EOF or a
'canceling statement due to user request', and in all of these 4 cases the
query remained active and started chewing heavily into the cpu such that
they ground the db server to a halt until each of the procpids were manually
issued a 'pg_cancel_backed ( procpid );' from the console. None of the 4
queries were particularly heavy; they registered approximately a 2300ms
completion time using a query EXPLAIN ANALYZE. A little long, but not too
far out of the ordinary on our usual database usage.
It's always been the case that if the client drops the connection
mid-query, the backend will notice that only when it next tries to send
some data --- and even then, if the kernel isn't aware that the far end
has dropped the connection, the kernel may accept and buffer quite a bit
of data before blocking the backend from doing more work. So if the
query requires a lot of processing before it starts to emit data, the
backend can do a lot of work before noticing anything is wrong.
I suspect that these queries were expensive and you made a mistake in
what you tested after the fact (or maybe conditions changed, eg
autovacuum came along and updated statistics). It's also possible that
some network glitch affected the connections such that the server's
kernel didn't know they were lost. It's hard to say much more than that
without a lot more evidence than you've provided. If it happens again,
you might try poking a bit harder into the situation before you kill the
queries --- netstat's opinion of the connection statuses might be
interesting, and so might stack traces from the busy backends.
regards, tom lane
Thanks for the reply, Tom. I admit we were a bit rushed during the
troubleshooting process; now that we know precisely how to identify these
procs and deal with them, I imagine we'll grab more info next time before
killing them, including a netstat view and a stack trace per your
recommendation.
Cheers,
-Chris.
On Mon, May 16, 2011 at 1:04 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Show quoted text
Christopher Opena <counterveil@gmail.com> writes:
Hello all,
First time poster here - probably a good sign since I've been running
postgresql with zero issues for the last several years! At any rate,I've
recently run into a strange issue. Client-based EOFs are nothing new to
our
application; people can sometimes close a connection for a number of
reasons. However, on the DB side these have always been released with no
issue.Today, we had 4 connections that saw either a client-based EOF or a
'canceling statement due to user request', and in all of these 4 casesthe
query remained active and started chewing heavily into the cpu such that
they ground the db server to a halt until each of the procpids weremanually
issued a 'pg_cancel_backed ( procpid );' from the console. None of the 4
queries were particularly heavy; they registered approximately a 2300ms
completion time using a query EXPLAIN ANALYZE. A little long, but nottoo
far out of the ordinary on our usual database usage.
It's always been the case that if the client drops the connection
mid-query, the backend will notice that only when it next tries to send
some data --- and even then, if the kernel isn't aware that the far end
has dropped the connection, the kernel may accept and buffer quite a bit
of data before blocking the backend from doing more work. So if the
query requires a lot of processing before it starts to emit data, the
backend can do a lot of work before noticing anything is wrong.
I suspect that these queries were expensive and you made a mistake in
what you tested after the fact (or maybe conditions changed, eg
autovacuum came along and updated statistics). It's also possible that
some network glitch affected the connections such that the server's
kernel didn't know they were lost. It's hard to say much more than that
without a lot more evidence than you've provided. If it happens again,
you might try poking a bit harder into the situation before you kill the
queries --- netstat's opinion of the connection statuses might be
interesting, and so might stack traces from the busy backends.regards, tom lane