What could keep a connection / query alive?

Started by Derrick Riceabout 15 years ago4 messagesgeneral

derrick.rice@gmail.com

about 15 years ago

Hi folks,

I'm investigating (using 8.2) an instance of a database client connection
remaining open in a single query well past statement timeout settings. I
understand that severed TCP connections can cause the backend to hang until
the connection is closed, but our tcp keepalive settings should have
recognized that condition after about 30 minutes. The connection and
backend didn't terminate for nearly 90 minutes.

What can cause this? Why would these tcp and statement timeout settings not
terminate the backend?

Active Query Time (from pg_stat_activity): 01:27:44 - 00 days
Statement Timeout: 2 minutes
tcp_keepalives_idle=20min
(count and interval set to 0, using system defaults)

system keepalive settings are default for Linux, being:
count: 9
interval: 75

Expect severed connections to be detected within 20 min + 9*75s < 32min.

Thanks,

Derrick

gsievers19@comcast.net

about 15 years ago

In reply to: Derrick Rice (#1)

Re: What could keep a connection / query alive?

Derrick Rice <derrick.rice@gmail.com> writes:

I'm investigating (using 8.2) an instance of a database client
connection remaining open in a single query well past statement
timeout settings.? I understand that severed TCP connections can cause
the backend to hang until the connection is closed, but our tcp
keepalive settings should have recognized that condition after about
30 minutes.? The connection and backend didn't terminate for nearly 90
minutes.

What can cause this?? Why would these tcp and statement timeout settings not terminate the backend?

Try trussing the backend process. You may find it in a network IO wait
trying to send data to a client that is hung or over a socket that was
timed out by a firewall or network equipment.

Such a condition will cause the backend to be unable to hear the
cancel. The statement will still show as running in pg_stat_activity.

SIGTERM on such a backend will probably also fall on deaf ears.

This has been my experience several times in an environment of
EnterpriseDB 8.2 systems on Solaris 10.

YMMV

--
Jerry Sievers
e: gsievers19@comcast.net
p: 305.321.1144

derrick.rice@gmail.com

about 15 years ago

In reply to: Jerry Sievers (#2)

Re: What could keep a connection / query alive?

On Tue, Mar 29, 2011 at 3:17 AM, Jerry Sievers <gsievers19@comcast.net>wrote:

What can cause this?? Why would these tcp and statement timeout settings

not terminate the backend?

Try trussing the backend process. You may find it in a network IO wait
trying to send data to a client that is hung or over a socket that was
timed out by a firewall or network equipment.

Such a condition will cause the backend to be unable to hear the
cancel. The statement will still show as running in pg_stat_activity.

SIGTERM on such a backend will probably also fall on deaf ears.

I'm aware of that condition, which is exactly what the keepalive settings
are supposed to detect.

# strace -p 32307
Process 32307 attached - interrupt to quit
send(6, "\252\0\17\0\0\0\01042810425\0\0\0\01010010333\0\0\0\27"..., 880, 0

The client is remote (not unix-domain socket) so I expect tcp_keepalive
settings to kill this connection after 32 minutes. That's not happening.
Not sure where else to look.

Derrick

derrick.rice@gmail.com

about 15 years ago

In reply to: Derrick Rice (#3)

Re: What could keep a connection / query alive?

On Tue, Mar 29, 2011 at 2:54 PM, Derrick Rice <derrick.rice@gmail.com>wrote:

Try trussing the backend process. You may find it in a network IO wait
trying to send data to a client that is hung or over a socket that was
timed out by a firewall or network equipment.

Such a condition will cause the backend to be unable to hear the
cancel. The statement will still show as running in pg_stat_activity.

SIGTERM on such a backend will probably also fall on deaf ears.

I'm aware of that condition, which is exactly what the keepalive settings
are supposed to detect.

So I spent some time reading Linux-2.6 TCP code and my previous statement is
downright wrong. Keepalive is only in use when there is no data
unacknowledged and no data to send. Retransmission timeouts are in use for
those other scenarios.

In any case, I would have expected a retransmission timeout. My new
hypothesis based on output from `ss' is that a firewall, NAT, or VPN of my
users is putting the connection into persist mode (setting the window size
to 0) when the end point of the connection is unresponsive. Furthermore, I
think that firewall is continuing to respond to the persist probes of my
machine until it finally decides that the end point is gone. At which point
it might be ignoring future probes, starting the retransmission timeouts for
my machine.

So I'm not looking for any further help here, since this isn't a PostgreSQL
issue. If I resolve the problem I'll let you all know just for
entertainment purposes :)

Thanks

Derrick