TCP keepalive support for libpq
(please Cc me on replies, I am not subscribed)
Hi,
libpq currently does not use TCP keepalives. This is a problem in our
case where we have some clients waiting for notifies and then the
connection is dropped on the server side. The client never gets the FIN
and thinks the connection is up. The attached patch unconditionally
adds keepalives. I chose unconditionally as this is what the server
does. We didn't need the ability to tune the timeouts, but that could
be added with reasonable ease.
--
Tollef Fog Heen
UNIX is user friendly, it's just picky about who its friends are
Attachments:
libpq_keepalive.difftext/x-diffDownload+12-0
On Tue, Feb 9, 2010 at 14:03, Tollef Fog Heen
<tollef.fog.heen@collabora.co.uk> wrote:
(please Cc me on replies, I am not subscribed)
Hi,
libpq currently does not use TCP keepalives. This is a problem in our
case where we have some clients waiting for notifies and then the
connection is dropped on the server side. The client never gets the FIN
and thinks the connection is up. The attached patch unconditionally
adds keepalives. I chose unconditionally as this is what the server
does. We didn't need the ability to tune the timeouts, but that could
be added with reasonable ease.
Seems reasonable to add this. Are there any scenarios where this can
cause trouble, that would be fixed by having the ability to select
non-standard behavior?
I don't recall ever changing away from the standard behavior in any of
my deployments, but that might be platform dependent?
If not, I think this is small and trivial enough not to have to push
back for 9.1 ;)
--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/
]] Magnus Hagander
| Seems reasonable to add this. Are there any scenarios where this can
| cause trouble, that would be fixed by having the ability to select
| non-standard behavior?
Well, it might be unwanted if you're on a pay-per-bit connection such as
3G, but in this case, it just makes the problem a bit worse than the
server keepalive already makes it – it doesn't introduce a new problem.
| I don't recall ever changing away from the standard behavior in any of
| my deployments, but that might be platform dependent?
If you were (ab)using postgres as an IPC mechanism, I could see it being
useful, but not in the normal case.
| If not, I think this is small and trivial enough not to have to push
| back for 9.1 ;)
\o/
--
Tollef Fog Heen
UNIX is user friendly, it's just picky about who its friends are
Tollef Fog Heen wrote:
(please Cc me on replies, I am not subscribed)
Hi,
libpq currently does not use TCP keepalives. This is a problem in our
case where we have some clients waiting for notifies and then the
connection is dropped on the server side. The client never gets the FIN
and thinks the connection is up. The attached patch unconditionally
adds keepalives. I chose unconditionally as this is what the server
does. We didn't need the ability to tune the timeouts, but that could
be added with reasonable ease.
ISTM that the default behavior should be keep alives disabled, as it is
now, and those wanting it can just set it in their apps:
setsockopt(PQsocket(conn), SOL_SOCKET, SO_KEEPALIVE, ...)
If you really want libpq to manage this, I think you need to expose the
probe interval and timeouts. There should be some platform checks as
well. Check out...
http://www.mail-archive.com/pgsql-hackers@postgresql.org/msg128603.html
--
Andrew Chernow
eSilo, LLC
every bit counts
http://www.esilo.com/
On Tue, Feb 9, 2010 at 11:34 PM, Andrew Chernow <ac@esilo.com> wrote:
If you really want libpq to manage this, I think you need to expose the
probe interval and timeouts.
Agreed.
Previously I was making the patch that exposes them as conninfo
options so that the standby can detect a network outage ASAP in SR.
I attached that WIP patch as a reference. Hope this helps.
Regards,
--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Attachments:
keepalive.patchtext/x-patch; charset=US-ASCII; name=keepalive.patchDownload+160-2
On Tue, Feb 09, 2010 at 09:34:10AM -0500, Andrew Chernow wrote:
Tollef Fog Heen wrote:
(please Cc me on replies, I am not subscribed)
Hi,
libpq currently does not use TCP keepalives. This is a problem in our
case where we have some clients waiting for notifies and then the
connection is dropped on the server side. The client never gets the FIN
and thinks the connection is up. The attached patch unconditionally
adds keepalives. I chose unconditionally as this is what the server
does. We didn't need the ability to tune the timeouts, but that could
be added with reasonable ease.ISTM that the default behavior should be keep alives disabled, as it is
now, and those wanting it can just set it in their apps:setsockopt(PQsocket(conn), SOL_SOCKET, SO_KEEPALIVE, ...)
I disagree. I have clients who have problems with leftover client connections
due to server host failures. They do not write apps in C. For a non-default
change to be effective we would need to have all the client drivers, eg JDBC,
psycopg, DBD-DBI, and the apps like psql make changes to turn it on. Adding
this option as a non-default will not really help.
-dg
--
David Gould daveg@sonic.net 510 536 1443 510 282 0869
If simplicity worked, the world would be overrun with insects.
]] daveg
| I disagree. I have clients who have problems with leftover client connections
| due to server host failures. They do not write apps in C. For a non-default
| change to be effective we would need to have all the client drivers, eg JDBC,
| psycopg, DBD-DBI, and the apps like psql make changes to turn it on. Adding
| this option as a non-default will not really help.
FWIW, this is my case. My application uses psycopg, which provides no
way to get access to the underlying socket. Sure, I could hack my way
around this, but from the application writer's point of view, I have a
connection that I expect to stay around and be reliable. Whether that
connection is over a UNIX socket, a TCP socket or something else is
something I would rather not have to worry about; it feels very much
like an abstraction violation.
--
Tollef Fog Heen
UNIX is user friendly, it's just picky about who its friends are
2010/2/10 daveg <daveg@sonic.net>:
On Tue, Feb 09, 2010 at 09:34:10AM -0500, Andrew Chernow wrote:
Tollef Fog Heen wrote:
(please Cc me on replies, I am not subscribed)
Hi,
libpq currently does not use TCP keepalives. This is a problem in our
case where we have some clients waiting for notifies and then the
connection is dropped on the server side. The client never gets the FIN
and thinks the connection is up. The attached patch unconditionally
adds keepalives. I chose unconditionally as this is what the server
does. We didn't need the ability to tune the timeouts, but that could
be added with reasonable ease.ISTM that the default behavior should be keep alives disabled, as it is
now, and those wanting it can just set it in their apps:setsockopt(PQsocket(conn), SOL_SOCKET, SO_KEEPALIVE, ...)
I disagree. I have clients who have problems with leftover client connections
due to server host failures. They do not write apps in C. For a non-default
change to be effective we would need to have all the client drivers, eg JDBC,
psycopg, DBD-DBI, and the apps like psql make changes to turn it on. Adding
this option as a non-default will not really help.
Yes, that's definitely the use-case. PQsocket() will work fine for C apps only.
But it should work fine as an option, no? As long as you can specify
it on the connection string - I don't think there are any interfaces
that won't take a connection string?
--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/
ISTM that the default behavior should be keep alives disabled, as it is
now, and those wanting it can just set it in their apps:setsockopt(PQsocket(conn), SOL_SOCKET, SO_KEEPALIVE, ...)
I disagree. I have clients who have problems with leftover client connections
due to server host failures. They do not write apps in C. For a non-default
change to be effective we would need to have all the client drivers, eg JDBC,
psycopg, DBD-DBI, and the apps like psql make changes to turn it on. Adding
this option as a non-default will not really help.Yes, that's definitely the use-case. PQsocket() will work fine for C apps only.
But it should work fine as an option, no? As long as you can specify
it on the connection string - I don't think there are any interfaces
that won't take a connection string?
Perl and python appear to have the same abilities as C. I don't use either of
these drivers but I *think* the below would work:
DBD:DBI
setsockopt($dbh->pg_socket(), ...);
psycopg
conn.cursor().socket().setsockopt(...);
Although, I think Dave's comments have made me change my mind about this patch.
Looks like it serves a good purpose. That said, there is no guarentee the
driver will implement the new feature ... JDBC seems to lack the ability to get
the backing Socket object but java can set socket options. Maybe a JDBC kong fu
master knows how to do this.
--
Andrew Chernow
eSilo, LLC
every bit counts
http://www.esilo.com/
On Thu, Feb 11, 2010 at 2:15 AM, Tollef Fog Heen
<tollef.fog.heen@collabora.co.uk> wrote:
]] daveg
| I disagree. I have clients who have problems with leftover client connections
| due to server host failures. They do not write apps in C. For a non-default
| change to be effective we would need to have all the client drivers, eg JDBC,
| psycopg, DBD-DBI, and the apps like psql make changes to turn it on. Adding
| this option as a non-default will not really help.FWIW, this is my case. My application uses psycopg, which provides no
way to get access to the underlying socket. Sure, I could hack my way
around this, but from the application writer's point of view, I have a
connection that I expect to stay around and be reliable. Whether that
connection is over a UNIX socket, a TCP socket or something else is
something I would rather not have to worry about; it feels very much
like an abstraction violation.
I've sometimes wondered why keepalives aren't the default for all TCP
connections. They seem like they're usually a Good Thing (TM), but I
wonder if we can think of any situations where someone might not want
them?
...Robert
Robert Haas <robertmhaas@gmail.com> wrote:
I've sometimes wondered why keepalives aren't the default for all
TCP connections. They seem like they're usually a Good Thing
(TM), but I wonder if we can think of any situations where someone
might not want them?
I think it's insane not to use them at all, but there are valid use
cases for different timings. Personally, I'd be happy to see a
default of sending them if a connection is idle for two minutes, but
those people who create 2000 lightly used connections to the
database might feel differently.
-Kevin
From the Slony-I docs (http://www.slony.info/documentation/faq.html) :
"Supposing you experience some sort of network outage, the connection
between slon and database may fail, and the slon may figure this out
long before the PostgreSQL instance it was connected to does. The
result is that there will be some number of idle connections left on
the database server, which won't be closed out until TCP/IP timeouts
complete, which seems to normally take about two hours. For that two
hour period, the slon will try to connect, over and over, and will get
the above fatal message, over and over. "
Speaking as someone who uses Slony quite a lot, this patch sounds very
helpful. Why hasn't libpq had keepalives for years?
Regards,
Peter Geoghegan
Robert Haas wrote:
On Thu, Feb 11, 2010 at 2:15 AM, Tollef Fog Heen
<tollef.fog.heen@collabora.co.uk> wrote:]] daveg
| I disagree. I have clients who have problems with leftover client connections
| due to server host failures. They do not write apps in C. For a non-default
| change to be effective we would need to have all the client drivers, eg JDBC,
| psycopg, DBD-DBI, and the apps like psql make changes to turn it on. Adding
| this option as a non-default will not really help.FWIW, this is my case. My application uses psycopg, which provides no
way to get access to the underlying socket. Sure, I could hack my way
around this, but from the application writer's point of view, I have a
connection that I expect to stay around and be reliable. Whether that
connection is over a UNIX socket, a TCP socket or something else is
something I would rather not have to worry about; it feels very much
like an abstraction violation.I've sometimes wondered why keepalives aren't the default for all TCP
connections. They seem like they're usually a Good Thing (TM), but I
wonder if we can think of any situations where someone might not want
them?
The only case I can think of are systems that send application layer
keepalive-like packets; I've worked on systems like this. The goal
wasn't to reinvent keepalives but to check-in every minute or two to
meet a different set of requirements, thus TCP keepalives weren't
needed. However, I don't think they would of caused any harm.
The more I think about this the more I think it's a pretty non-invasive
change to enable keepalives in libpq. I don't think this has any
negative impact on clients written while the default was disabled.
This is really a driver setting. There is no way to ensure libpq, DBI,
psycopg, JDBC, etc... all enable or disable keepalives by default. I
only bring this up because it appears there are complaints from
non-libpq clients. This patch wouldn't fix those cases.
--
Andrew Chernow
eSilo, LLC
every bit counts
http://www.esilo.com/
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
those people who create 2000 lightly used connections to the
database might feel differently.
Yeah I still run against installation using the infamous PHP pconnect()
function. You certainly don't want to add some load there, but that
could urge them into arranging for being able to use pgbouncer in
transaction pooling mode (and stop using pconnect(), damn it).
Regards,
--
dim
Also, more importantly (from
http://www.slony.info/documentation/slonyadmin.html):
"A WAN outage (or flakiness of the WAN in general) can leave database
connections "zombied", and typical TCP/IP behaviour will allow those
connections to persist, preventing a slon restart for around two
hours. "
Regards,
Peter Geoghegan
]] Robert Haas
| I've sometimes wondered why keepalives aren't the default for all TCP
| connections. They seem like they're usually a Good Thing (TM), but I
| wonder if we can think of any situations where someone might not want
| them?
As somebody mentioned somewhere else (I think): If you pay per byte
transmitted, be it 3G/GPRS. Or if you're on a very, very high-latency
link or have no bandwidth. Like, a rocket to Mars or maybe the moon.
While I think they are valid use-cases, requiring people to change the
defaults if that kind of thing sounds like a sensible solution to me.
--
Tollef Fog Heen
UNIX is user friendly, it's just picky about who its friends are
On Thu, 11 Feb 2010, Andrew Chernow wrote:
Although, I think Dave's comments have made me change my mind about this
patch. Looks like it serves a good purpose. That said, there is no
guarentee the driver will implement the new feature ... JDBC seems to
lack the ability to get the backing Socket object but java can set
socket options. Maybe a JDBC kong fu master knows how to do this.
Use the tcpKeepAlive connection option as described here:
http://jdbc.postgresql.org/documentation/84/connect.html#connection-parameters
Java can only enable/disable keep alives, it can't set the desired
timeout.
http://java.sun.com/javase/6/docs/api/java/net/Socket.html#setKeepAlive%28boolean%29
Kris Jurka
On Fri, Feb 12, 2010 at 1:33 AM, Peter Geoghegan
<peter.geoghegan86@gmail.com> wrote:
Why hasn't libpq had keepalives for years?
I guess that it's because keepalive doesn't work as expected
in some cases. For example, if the network outage happens
before a client sends some packets, keepalive doesn't work,
then it would have to wait for a long time until it detects
the outage. This is the specification of linux kernel. So
a client should not have excessive expectations of keepalive,
and should have another timeout like QueryTimeout of JDBC.
Regards,
--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
<peter.geoghegan86@gmail.com> wrote:
Why hasn't libpq had keepalives for years?
I guess that it's because keepalive doesn't work as expected
in some cases. For example, if the network outage happens
before a client sends some packets, keepalive doesn't work,
then it would have to wait for a long time until it detects
the outage. This is the specification of linux kernel. So
a client should not have excessive expectations of keepalive,
and should have another timeout like QueryTimeout of JDBC.
In my experience, the problems described are common when using libpq
over any sort of flaky connection, which I myself regularly do (not
just with Slony, but with a handheld wi-fi PDT application, where
libpq is used without any wrapper). The slony docs say it takes about
2 hours for the problem to correct itself, but I have found that it
may take a lot longer, perhaps because I have a hybrid Linux/Windows
Slony cluster.
keepalive doesn't work,
then it would have to wait for a long time until it detects
the outage.
I'm not really sure what you mean. In this scenario, would it take as
long as it would have taken had keepalives not been used?
I strongly welcome anything that can ameliorate these problems, which
are probably not noticed by the majority of users, but are a real
inconvenience when they do arise.
Regards,
Peter Geoghegan
On Fri, Feb 12, 2010 at 6:40 PM, Peter Geoghegan
<peter.geoghegan86@gmail.com> wrote:
keepalive doesn't work,
then it would have to wait for a long time until it detects
the outage.I'm not really sure what you mean. In this scenario, would it take as
long as it would have taken had keepalives not been used?
Please see the following threads.
http://archives.postgresql.org/pgsql-bugs/2006-08/msg00098.php
http://lkml.indiana.edu/hypermail/linux/kernel/0508.2/0757.html
Regards,
--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center