Long running query - connection and keepalives enabled but query not canceled
Hi.
I have database users who have to run some ad-hoc queries on the
database. These queries are sometimes pretty long running, especially
when a user messes up a join condition between two 1G-row tables and
therefore tries to count the number of quarks in the universe or
something like that.
After a query fails to return a result for some time, the user sometimes
just closes his laptop lid, dropping his VPN connection, and goes home to
watch some cats on the Internet.
The TCP keepalives functionality on the database server's operating
system would figure out that the connection is not working anymore and
close it.
You'd assume that the database would get the info and cancel the query.
Except - it does not.
The database would get an error on the connection socket only after it
tries to read or write to it next time. But it does not try to do this -
it's busy counting those quarks until statement_timeout is reached.
But the default statement_timeout has to be quite large - the databases
are large (up to several TB) and are sometimes exported with pg_dump,
restored, etc. - so 24 hours is maybe a reasonable setting.
Is there a way to configure the database somehow, so it would figure out
that the connection is closed, faster than statement_timeout?
I'd imagine maybe a periodic signal to the backend process, asking it to
somehow validate its connection if it has been running a query for longer
than tcp_keepalives_idle + tcp_keepalives_count * tcp_keepalives_interval?
I think it would be possible to work around this with some external
process checking /proc/[PID]/net/tcp[6] for closed connections and
calling pg_cancel_backend() on them. But it's managed PostgreSQL (AWS
RDS), so that's not allowed.
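For illustration, such a watchdog could be sketched roughly like this
(hypothetical code, not an existing tool, and as noted above not possible
on RDS; all names here are mine):

```python
# Sketch of the external-watchdog idea: collect the inodes of ESTABLISHED
# TCP sockets on the host, compare them with each backend's open sockets,
# and pg_cancel_backend() any backend whose connection has vanished.
import os
import re

ESTABLISHED = "01"  # TCP state code used in /proc/net/tcp and /proc/net/tcp6

def live_socket_inodes(proc_net_tcp_text):
    """Socket inodes of ESTABLISHED connections, parsed from /proc/net/tcp[6] text."""
    inodes = set()
    for line in proc_net_tcp_text.splitlines()[1:]:  # first line is the header
        fields = line.split()
        if len(fields) >= 10 and fields[3] == ESTABLISHED:
            inodes.add(int(fields[9]))  # column 10 is the socket inode
    return inodes

def backend_socket_inodes(pid):
    """Socket inodes held open by one process, from its /proc/<pid>/fd symlinks."""
    inodes = set()
    fd_dir = "/proc/%d/fd" % pid
    for fd in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, fd))
        except OSError:
            continue  # fd disappeared between listdir() and readlink()
        m = re.match(r"socket:\[(\d+)\]$", target)
        if m:
            inodes.add(int(m.group(1)))
    return inodes
```

A driver loop would then read the pids of active backends from
pg_stat_activity and, whenever backend_socket_inodes(pid) no longer
intersects live_socket_inodes(...), issue SELECT pg_cancel_backend(pid).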
Is there any possibility to have something like this added to
PostgreSQL? I believe it's not just us having this problem.
--
Tomasz "Tometzky" Ostrowski
Tomasz Ostrowski wrote:
I have database users which have to run some ad-hoc queries on the
database. These queries sometimes are pretty long running, especially
when a user messes up a join condition for two 1Grows tables and
therefore tries to count the number of quarks in the universe or
something like this.
After a query fails to return result for some time the user sometimes
just closes his laptop lid, dropping his VPN connection and goes home to
watch some cats on the Internet.
The TCP keepalives functionality on the database server's operating
system would figure out that the connection is not working anymore and
close it.
You'd assume that the database would get the info and cancel the query.
Except - it does not.
The database would get an error on the connection socket only after it
tries to read or write to it next time. But it does not try to do this -
it's busy counting those quarks until statement_timeout is reached.
By default "tcp_keepalives_idle" is quite large: 2 hours.
Are you sure that the queries keep running for longer than that?
Try to experiment with lower settings. It will cause marginally more
network traffic, but dead connections will be detected more quickly.
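For example (illustrative values only, assuming a self-managed server
where ALTER SYSTEM is available; on RDS the same parameters live in the
parameter group, and libpq clients can also set keepalives_idle etc. in
the connection string):

```sql
-- Probe an idle connection after 60 s, then every 10 s; give up after
-- 6 failed probes, so a dead peer is detected in roughly 2 minutes:
ALTER SYSTEM SET tcp_keepalives_idle = 60;
ALTER SYSTEM SET tcp_keepalives_interval = 10;
ALTER SYSTEM SET tcp_keepalives_count = 6;
SELECT pg_reload_conf();
```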
Yours,
Laurenz Albe
--
Cybertec | https://www.cybertec-postgresql.com
On 6/4/19 6:24 PM, Laurenz Albe wrote:
Tomasz Ostrowski wrote:
The TCP keepalives functionality on the database server's operating
system would figure out that the connection is not working anymore and
close it.
You'd assume that the database would get the info and cancel the query.
Except - it does not.
The database would get an error on the connection socket only after it
tries to read or write to it next time. But it does not try to do this -
it's busy counting those quarks until statement_timeout is reached.
By default "tcp_keepalives_idle" is quite large: 2 hours.
Are you sure that the queries keep running for longer than that?
Try to experiment with lower settings. It will cause marginally more
network traffic, but dead connections will be detected more quickly.
The keepalive settings we use are idle/interval/count = 60/10/60. That
means dead connections should be detected after 60s + 60*10s = 660s = 11
minutes. I started to investigate when I found a query that had been
running for over 11 days (there was no statement_timeout set on this
server).
But the problem is not that the dead connections aren't detected - they
are, and the operating system clears them: they disappear from "netstat
--tcp" output after the expected time.
The problem is that the database does not get the info and does not kill
the query.
I've reproduced this by running the SQL below on a local PostgreSQL 11,
connected over TCP and configured with idle/interval/count = 15/15/15:

create or replace function pg_temp.fib(n int) returns int
language plpgsql as $$
begin
  if n <= 1 then
    return n;
  end if;
  return pg_temp.fib(n-1) + pg_temp.fib(n-2);
end;
$$;

select pg_temp.fib(50);
And then filtering out the connection with iptables.
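For reference, the connection can be blackholed with something like this
(a sketch; the address and port are assumptions for a local test, and it
must run as root on the server):

```shell
# Silently drop the client's packets so the server's keepalive probes
# time out; no RST or FIN ever reaches either side:
iptables -A INPUT -p tcp -s 127.0.0.1 --dport 5432 -j DROP
# ... wait about idle + interval * count seconds, watch pg_stat_activity ...
# then remove the rule again:
iptables -D INPUT -p tcp -s 127.0.0.1 --dport 5432 -j DROP
```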
--
Tomasz "Tometzky" Ostrowski
Tomasz Ostrowski wrote:
The database would get an error on the connection socket only after it tries to read or write to it next time. But it does not try to do this - it's busy counting those quarks until statement_timeout is reached.
It is unfortunate that PostgreSQL does not take into account that the client has disconnected and no longer needs the query to be processed.
I know of only one non-stable utility that can solve your problem: https://github.com/yandex/odyssey (unfortunately pgbouncer can't do that).
--
Dmitry Vasiliev
Tomasz Ostrowski wrote:
The database would get an error on the connection socket only after it tries to read or write to it next time. But it does not try to do this - it's busy counting those quarks until statement_timeout is reached.
I found a relevant patch from postgrespro on the commitfest: https://commitfest.postgresql.org/21/1882/
--
Dmitry Vasiliev