libpq connection timeout mismanagement

Started by Tom Laneover 7 years ago6 messages
#1Tom Lane
tgl@sss.pgh.pa.us
1 attachment(s)

The patch that taught libpq about allowing multiple target hosts
modified connectDBComplete() with the intent of making the
connect_timeout (if specified) apply per-host, not to the complete
connection attempt. It did not do a very good job though, because
the timeout only gets reset when connectDBComplete() itself detects
a timeout. If PQconnectPoll advances to a new host due to some
other cause, the previous host's timeout continues to run, possibly
causing a premature timeout failure for the new one.

Another thing that I find pretty strange is that it is coded so that,
in event of a timeout detection by connectDBComplete, we give up on the
current connhost entry and advance to the next host, ignoring any
additional addresses we might have for the current hostname. This seems
at best poorly thought through. There's no good reason for libpq to
assume that all the addresses returned by DNS point at the same machine,
or share the same network failure points in between.

Attached is an attempt to improve this. I'll park it in the Sept fest.

regards, tom lane

Attachments:

libpq-connect-timeout-1.patchtext/x-diff; charset=us-ascii; name=libpq-connect-timeout-1.patchDownload
diff --git a/src/interfaces/libpq/fe-connect.c b/src/interfaces/libpq/fe-connect.c
index 986f74f..429be74 100644
*** a/src/interfaces/libpq/fe-connect.c
--- b/src/interfaces/libpq/fe-connect.c
*************** connectDBComplete(PGconn *conn)
*** 1905,1910 ****
--- 1905,1912 ----
  	PostgresPollingStatusType flag = PGRES_POLLING_WRITING;
  	time_t		finish_time = ((time_t) -1);
  	int			timeout = 0;
+ 	int			last_whichhost = -2;	/* certainly different from whichhost */
+ 	struct addrinfo *last_addr_cur = NULL;
  
  	if (conn == NULL || conn->status == CONNECTION_BAD)
  		return 0;
*************** connectDBComplete(PGconn *conn)
*** 1922,1929 ****
  			 */
  			if (timeout < 2)
  				timeout = 2;
- 			/* calculate the finish time based on start + timeout */
- 			finish_time = time(NULL) + timeout;
  		}
  	}
  
--- 1924,1929 ----
*************** connectDBComplete(PGconn *conn)
*** 1932,1937 ****
--- 1932,1952 ----
  		int			ret = 0;
  
  		/*
+ 		 * (Re)start the connect_timeout timer if it's active and we are
+ 		 * considering a different host than we were last time through.  If
+ 		 * we've already succeeded, though, needn't recalculate.
+ 		 */
+ 		if (flag != PGRES_POLLING_OK &&
+ 			timeout > 0 &&
+ 			(conn->whichhost != last_whichhost ||
+ 			 conn->addr_cur != last_addr_cur))
+ 		{
+ 			finish_time = time(NULL) + timeout;
+ 			last_whichhost = conn->whichhost;
+ 			last_addr_cur = conn->addr_cur;
+ 		}
+ 
+ 		/*
  		 * Wait, if necessary.  Note that the initial state (just after
  		 * PQconnectStart) is to wait for the socket to select for writing.
  		 */
*************** connectDBComplete(PGconn *conn)
*** 1975,1992 ****
  		if (ret == 1)			/* connect_timeout elapsed */
  		{
  			/*
! 			 * Attempt connection to the next host, ignoring any remaining
! 			 * addresses for the current host.
  			 */
! 			conn->try_next_addr = false;
! 			conn->try_next_host = true;
  			conn->status = CONNECTION_NEEDED;
- 
- 			/*
- 			 * Restart the connect_timeout timer for the new host.
- 			 */
- 			if (timeout > 0)
- 				finish_time = time(NULL) + timeout;
  		}
  
  		/*
--- 1990,1999 ----
  		if (ret == 1)			/* connect_timeout elapsed */
  		{
  			/*
! 			 * Give up on current server/address, try the next one.
  			 */
! 			conn->try_next_addr = true;
  			conn->status = CONNECTION_NEEDED;
  		}
  
  		/*
#2Fabien COELHO
coelho@cri.ensmp.fr
In reply to: Tom Lane (#1)
Re: libpq connection timeout mismanagement

Hello Tom,

The patch that taught libpq about allowing multiple target hosts
modified connectDBComplete() with the intent of making the
connect_timeout (if specified) apply per-host, not to the complete
connection attempt. It did not do a very good job though, because
the timeout only gets reset when connectDBComplete() itself detects
a timeout. If PQconnectPoll advances to a new host due to some
other cause, the previous host's timeout continues to run, possibly
causing a premature timeout failure for the new one.

Another thing that I find pretty strange is that it is coded so that,
in event of a timeout detection by connectDBComplete, we give up on the
current connhost entry and advance to the next host, ignoring any
additional addresses we might have for the current hostname. This seems
at best poorly thought through. There's no good reason for libpq to
assume that all the addresses returned by DNS point at the same machine,
or share the same network failure points in between.

Attached is an attempt to improve this. I'll park it in the Sept fest.

Patch does not "git apply", but is ok with "patch -p1". Compiles.
Global "make check" is okay.

Feature works for me, i.e. each target host seems to get at least its
timeout.

Code looks straightforward enough.

In passing:

The code suggests that timeout is always 2 or more

if (timeout < 2) timeout = 2;

but doc says "It is not recommended to use a timeout of less than 2
seconds", which is inconsistent. It should read "Timeout is always set to
at least 2 whatever the user asks.".

connect_timeout=2.9 is accepted and considered as meaning 2.
connect_timeout=-10 or connect_timeout=two are also accepted and mean
forever. Probably thanks to "atoi".

--
Fabien.

#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Fabien COELHO (#2)
Re: libpq connection timeout mismanagement

Fabien COELHO <coelho@cri.ensmp.fr> writes:

Patch does not "git apply", but is ok with "patch -p1". Compiles.

Yeah, as far as I can tell, "git apply" is very intolerant.

The code suggests that timeout is always 2 or more
if (timeout < 2) timeout = 2;
but doc says "It is not recommended to use a timeout of less than 2
seconds", which is inconsistent. It should read "Timeout is always set to
at least 2 whatever the user asks.".

Agreed, changed the docs to clarify this.

connect_timeout=2.9 is accepted and considered as meaning 2.
connect_timeout=-10 or connect_timeout=two are also accepted and mean
forever. Probably thanks to "atoi".

Right. As before, I'm not excited about rejecting trailing junk,
considering we never did before. It is kind of bogus that the
resolution of the timeout is only 1 second though --- maybe in
2002 that was fine, but today it seems strange. I'm tempted to
change the parameter to be floating point (parse with atof, say)
and allow millisecond precision. Then we'd not really need to
have the restriction about minimum delay.

That's a task for a different patch though. In the meantime,
pushed this one --- thanks for reviewing!

regards, tom lane

#4Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#1)
Re: libpq connection timeout mismanagement

On Thu, Aug 9, 2018 at 11:23 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

The patch that taught libpq about allowing multiple target hosts
modified connectDBComplete() with the intent of making the
connect_timeout (if specified) apply per-host, not to the complete
connection attempt. It did not do a very good job though, because
the timeout only gets reset when connectDBComplete() itself detects
a timeout. If PQconnectPoll advances to a new host due to some
other cause, the previous host's timeout continues to run, possibly
causing a premature timeout failure for the new one.

Oops.

Another thing that I find pretty strange is that it is coded so that,
in event of a timeout detection by connectDBComplete, we give up on the
current connhost entry and advance to the next host, ignoring any
additional addresses we might have for the current hostname. This seems
at best poorly thought through. There's no good reason for libpq to
assume that all the addresses returned by DNS point at the same machine,
or share the same network failure points in between.

Hmm, well, that was deliberate, but maybe ill-advised. I guess I was
worried about making users wait for a long time to no gain, but your
points are valid, too.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#5Fabien COELHO
coelho@cri.ensmp.fr
In reply to: Tom Lane (#3)
Re: libpq connection timeout mismanagement

Hello Tom,

connect_timeout=2.9 is accepted and considered as meaning 2.
connect_timeout=-10 or connect_timeout=two are also accepted and mean
forever. Probably thanks to "atoi".

Right. As before, I'm not excited about rejecting trailing junk,
considering we never did before.

My 0.02ᅵ: accepting trailing junk is just a recipee for hiding bugs:

sh> psql "connect_timeout=2,port=5433"
psql> \conninfo
... port=5432 # port directive was silently ignored

So erroring out would be a good thing, and it could be done on a new
release. At the minimum there should be a warning.

--
Fabien.

#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Fabien COELHO (#5)
Re: libpq connection timeout mismanagement

Fabien COELHO <coelho@cri.ensmp.fr> writes:

Right. As before, I'm not excited about rejecting trailing junk,
considering we never did before.

My 0.02¤: accepting trailing junk is just a recipee for hiding bugs:
sh> psql "connect_timeout=2,port=5433"
psql> \conninfo
... port=5432 # port directive was silently ignored
So erroring out would be a good thing, and it could be done on a new
release.

Perhaps. That should certainly be a separate patch, though, and
it should cover everything not just connection_timeout (I see at
least five parameters that are parsed with atoi() and nothing else).
I'm still not excited, but feel free to send a patch.

regards, tom lane