[bug fix] PG10: libpq doesn't connect to alternative hosts when some errors occur
Hello,
I found a problem with libpq connection failover. When libpq cannot connect to earlier hosts in the host list, it doesn't try to connect to the remaining hosts. For example, when you specify a wrong port on which some non-postgres program is listening, or some non-postgres program is unexpectedly using PG's port, you get an error like this:
$ psql -h localhost -p 23
psql: received invalid response to SSL negotiation:
$ psql -h localhost -p 23 -d "sslmode=disable"
psql: expected authentication request from server, but received
Likewise, when the first host has already reached max_connections, libpq doesn't attempt the connection against later hosts.
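For illustration, here is a minimal sketch (in Python, with hypothetical helper names, not actual libpq APIs) of the failover loop I would expect:

```python
# Minimal sketch of host-list failover: try each host in turn and fall
# through to the next one when the attempt fails for a reason that looks
# like "server unavailable". All names here are hypothetical.

class ServerUnavailable(Exception):
    """The host could not serve the connection (down, full, shutting down...)."""

def connect_with_failover(hosts, connect_one):
    """Try each (host, port) pair with connect_one; return the first success."""
    last_error = None
    for host, port in hosts:
        try:
            return connect_one(host, port)
        except ServerUnavailable as exc:
            last_error = exc  # remember the failure and try the next host
    # every host failed: surface the last error to the caller
    raise last_error if last_error else ServerUnavailable("empty host list")
```

With a host list like [("localhost", 5450), ("localhost", 5451)], a failure on the first entry would then no longer stop the whole connection attempt.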
The attached patch fixes this. I'll add this item in the PostgreSQL 10 Open Items.
Regards
Takayuki Tsunakawa
Attachments:
libpq-reconnect-on-error.patch (application/octet-stream, +41 -0)
On Fri, May 12, 2017 at 1:28 PM, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:
Likewise, when the first host has already reached max_connections, libpq doesn't attempt the connection against later hosts.
It seems to me that the feature is behaving as wanted. Or in short
attempt to connect to the next host only if a connection cannot be
established. If there is a failure once the exchange with the server
has begun, just consider it as a hard failure. This is an important
property for authentication and SSL connection failures actually.
--
Michael
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
From: Michael Paquier [mailto:michael.paquier@gmail.com]
It seems to me that the feature is behaving as wanted. Or in short attempt
to connect to the next host only if a connection cannot be established.
If there is a failure once the exchange with the server has begun, just
consider it as a hard failure. This is an important property for
authentication and SSL connection failures actually.
But PgJDBC behaves as expected -- attempt another connection to other hosts (and succeed). I believe that's what users would naturally expect. The current libpq implementation handles only the socket-level connect failure.
Regards
Takayuki Tsunakawa
Michael Paquier <michael.paquier@gmail.com> writes:
On Fri, May 12, 2017 at 1:28 PM, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:
Likewise, when the first host has already reached max_connections, libpq doesn't attempt the connection against later hosts.
It seems to me that the feature is behaving as wanted. Or in short
attempt to connect to the next host only if a connection cannot be
established. If there is a failure once the exchange with the server
has begun, just consider it as a hard failure. This is an important
property for authentication and SSL connection failures actually.
I would not really expect that reconnection would retry after arbitrary
failure cases. Should it retry for "wrong database name", for instance?
It's not hard to imagine that leading to very confusing behavior.
regards, tom lane
On Fri, May 12, 2017 at 10:44 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Michael Paquier <michael.paquier@gmail.com> writes:
On Fri, May 12, 2017 at 1:28 PM, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:
Likewise, when the first host has already reached max_connections, libpq doesn't attempt the connection against later hosts.
It seems to me that the feature is behaving as wanted. Or in short
attempt to connect to the next host only if a connection cannot be
established. If there is a failure once the exchange with the server
has begun, just consider it as a hard failure. This is an important
property for authentication and SSL connection failures actually.
I would not really expect that reconnection would retry after arbitrary
failure cases. Should it retry for "wrong database name", for instance?
It's not hard to imagine that leading to very confusing behavior.
I guess not as well. That would be tricky for the user to have a
different behavior depending on the error returned by the server,
which is why the current code is doing things right IMO. Now, the
feature has been designed similarly to JDBC with its parametrization,
so it could be surprising for users to get a different failure
handling compared to that. Not saying that JDBC is doing it wrong, but
libpq does nothing wrong either.
--
Michael
On Sun, May 14, 2017 at 9:19 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
On Fri, May 12, 2017 at 10:44 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Michael Paquier <michael.paquier@gmail.com> writes:
On Fri, May 12, 2017 at 1:28 PM, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:
Likewise, when the first host has already reached max_connections, libpq doesn't attempt the connection against later hosts.
It seems to me that the feature is behaving as wanted. Or in short
attempt to connect to the next host only if a connection cannot be
established. If there is a failure once the exchange with the server
has begun, just consider it as a hard failure. This is an important
property for authentication and SSL connection failures actually.
I would not really expect that reconnection would retry after arbitrary
failure cases. Should it retry for "wrong database name", for instance?
It's not hard to imagine that leading to very confusing behavior.
I guess not as well. That would be tricky for the user to have a
different behavior depending on the error returned by the server,
which is why the current code is doing things right IMO. Now, the
feature has been designed similarly to JDBC with its parametrization,
so it could be surprising for users to get a different failure
handling compared to that. Not saying that JDBC is doing it wrong, but
libpq does nothing wrong either.
I concur.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
From: Michael Paquier [mailto:michael.paquier@gmail.com]
On Fri, May 12, 2017 at 10:44 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I would not really expect that reconnection would retry after
arbitrary failure cases. Should it retry for "wrong database name", for instance?
It's not hard to imagine that leading to very confusing behavior.
I guess not as well. That would be tricky for the user to have a different
behavior depending on the error returned by the server, which is why the
current code is doing things right IMO. Now, the feature has been designed
similarly to JDBC with its parametrization, so it could be surprising for
users to get a different failure handling compared to that. Not saying that
JDBC is doing it wrong, but libpq does nothing wrong either.
I didn't intend to make the user have a different behavior depending on the error returned by the server. I meant attempting connection to alternative hosts when the server returned an error. I thought the new libpq feature tries to connect to other hosts when a connection attempt fails, where the "connection" is the *database connection* (user's perspective), not the *socket connection* (PG developer's perspective). I think PgJDBC meets the user's desire better -- "Please connect to some host for better HA if a database server is unavailable for some reason."
By the way, could you elaborate what problem could occur if my solution is applied? (it doesn't seem easy for me to imagine...) FYI, as below, the case Tom picked up didn't raise an issue:
[libpq]
$ psql -h localhost,localhost -p 5450,5451 -d aaa
psql: FATAL: database "aaa" does not exist
$
[JDBC]
$ java org.hsqldb.cmdline.SqlTool postgres
SqlTool v. 3481.
2017-05-15T10:23:55.991+0900 SEVERE Connection error:
org.postgresql.util.PSQLException: FATAL: database "aaa" does not exist
Location: File: postinit.c, Routine: InitPostgres, Line: 846
Server SQLState: 3D000
at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2412)
at org.postgresql.core.v3.QueryExecutorImpl.readStartupMessages(QueryExecutorImpl.java:2538)
at org.postgresql.core.v3.QueryExecutorImpl.<init>(QueryExecutorImpl.java:122)
at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:227)
at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:49)
at org.postgresql.jdbc.PgConnection.<init>(PgConnection.java:194)
at org.postgresql.Driver.makeConnection(Driver.java:431)
at org.postgresql.Driver.connect(Driver.java:247)
at java.sql.DriverManager.getConnection(DriverManager.java:664)
at java.sql.DriverManager.getConnection(DriverManager.java:247)
at org.hsqldb.lib.RCData.getConnection(Unknown Source)
at org.hsqldb.cmdline.SqlTool.objectMain(Unknown Source)
at org.hsqldb.cmdline.SqlTool.main(Unknown Source)
Failed to get a connection to 'jdbc:postgresql://localhost:5450,localhost:5451/aaa' as user "tunakawa".
Cause: FATAL: database "aaa" does not exist
Location: File: postinit.c, Routine: InitPostgres, Line: 846
Server SQLState: 3D000
$
Regards
Takayuki Tsunakawa
On Sun, May 14, 2017 at 9:50 PM, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:
I guess not as well. That would be tricky for the user to have a different
behavior depending on the error returned by the server, which is why the
current code is doing things right IMO. Now, the feature has been designed
similarly to JDBC with its parametrization, so it could be surprising for
users to get a different failure handling compared to that. Not saying that
JDBC is doing it wrong, but libpq does nothing wrong either.
I didn't intend to make the user have a different behavior depending on the error returned by the server. I meant attempting connection to alternative hosts when the server returned an error. I thought the new libpq feature tries to connect to other hosts when a connection attempt fails, where the "connection" is the *database connection* (user's perspective), not the *socket connection* (PG developer's perspective). I think PgJDBC meets the user's desire better -- "Please connect to some host for better HA if a database server is unavailable for some reason."
By the way, could you elaborate what problem could occur if my solution is applied? (it doesn't seem easy for me to imagine...)
Sure. Imagine that the user thinks that 'foo' and 'bar' are the
relevant database servers for some service and writes 'dbname=quux
host=foo,bar' as a connection string. However, actually the user has
made a mistake and 'foo' is supporting some other service entirely; it
has no database 'quux'; the database servers which have database
'quux' are in fact 'bar' and 'baz'. All appears well as long as 'bar'
remains up, because the missing-database error for 'foo' is ignored
and we just connect to 'bar'. However, when 'bar' goes down then we
are out of service instead of failing over to 'baz' as we should have
done.
Now it's quite possible that the user, if they test carefully, might
realize that things are not working as intended, because the DBA might
say "hey, all of your connections are being directed to 'bar' instead
of being load-balanced properly!". But even if they are careful
enough to realize this, it may not be clear what has gone wrong.
Under your proposal, the connection to 'foo' could be failing for *any
reason whatsoever* from lack of connectivity to a missing database to
a missing user to a missing CONNECT privilege to an authentication
failure. If the user looks at the server log and can pick out the
entries from their own connection attempts they can figure it out, but
otherwise they might spend quite a bit of time wondering what's wrong;
after all, libpq will report no error, as long as the connection to
the other server works.
Now, this is all arguable. You could certainly say -- and you are
saying -- that this feature ought to be defined to retry after any
kind of failure whatsoever. But I think what Tom and Michael and I
are saying is that this is a failover feature and therefore ought to
try the next server when the first one in the list appears to have
gone down, but not when the first one in the list is unhappy with the
connection request for some other reason. Who is right is a judgement
call, but I don't think it's self-evident that users want to ignore
anything and everything that might have gone wrong with the connection
to the first server, rather than only those things which resemble a
down server. It seems quite possible to me that if we had defined it
as you are proposing, somebody would now be arguing for a behavior
change in the other direction.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
On Sun, May 14, 2017 at 9:50 PM, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:
By the way, could you elaborate what problem could occur if my solution is applied? (it doesn't seem easy for me to imagine...)
Sure. Imagine that the user thinks that 'foo' and 'bar' are the
relevant database servers for some service and writes 'dbname=quux
host=foo,bar' as a connection string. However, actually the user has
made a mistake and 'foo' is supporting some other service entirely; it
has no database 'quux'; the database servers which have database
'quux' are in fact 'bar' and 'baz'.
Even more simply, suppose that your userid is known to host bar but the
DBA has forgotten to create it on foo. This is surely a configuration
error that ought to be rectified, not just failed past, or else you don't
have any of the redundancy you think you do.
Of course, the user would have to try connections to both foo and bar
to be sure that they're both configured correctly. But he might try
"host=foo,bar" and "host=bar,foo" and figure he was OK, not noticing
that both connections had silently been made to bar.
The bigger picture here is that we only want to fail past transient
errors, not configuration errors. I'm willing to err in favor of
regarding doubtful cases as transient, but most server login rejections
aren't for transient causes.
There might be specific post-connection errors that we should consider
retrying; "too many connections" is an obvious case.
regards, tom lane
Hello Robert, Tom,
Thank you for being kind enough to explain. I think I now understand your concern.
From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Robert Haas
Who is right is a judgement call, but I don't think it's self-evident that
users want to ignore anything and everything that might have gone wrong
with the connection to the first server, rather than only those things which
resemble a down server. It seems quite possible to me that if we had defined
it as you are proposing, somebody would now be arguing for a behavior change
in the other direction.
Judgment call... so, I understood that it's a matter of choosing between helping to detect configuration errors early and preserving service continuity. Hmm, I'd like to know how other databases treat this, but I couldn't find useful information after some Google searching. I wonder whether I should ask the PgJDBC people if they know something, because they chose service continuity.
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
The bigger picture here is that we only want to fail past transient errors,
not configuration errors. I'm willing to err in favor of regarding doubtful
cases as transient, but most server login rejections aren't for transient
causes.
I took "doubtful cases" to mean ones such as specifying a non-existent host or an unused port number. In such cases, a configuration error can't be distinguished from a server failure.
What do you think of the following cases? Don't you want to connect to other servers?
* The DBA shuts down the database. The server takes a long time to do checkpointing. During the shutdown checkpoint, libpq tries to connect to the server and receives the error "the database system is shutting down."
* The former primary failed and is now trying to start as a standby, catching up by applying WAL. During the recovery, libpq tries to connect to the server and receives the error "the database system is performing recovery."
* The database server crashed due to a bug. Unfortunately, the server takes an unexpectedly long time to shut down because it takes many seconds to write the stats file (as you remember, Tom-san experienced 57 seconds to write the stats file during regression tests). During the stats file write, libpq tries to connect to the server and receives the error "the database system is shutting down."
These are equivalent to server failure. I believe we should prioritize rescuing errors during operation over detecting configuration errors.
Of course, the user would have to try connections to both foo and bar to
be sure that they're both configured correctly. But he might try
"host=foo,bar" and "host=bar,foo" and figure he was OK, not noticing that
both connections had silently been made to bar.
In that case, I think he would specify "host=foo" and "host=bar" in turn, because he would be worried about where he's connected if he specified multiple hosts.
Regards
Takayuki Tsunakawa
On Wed, May 17, 2017 at 3:06 AM, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:
What do you think of the following cases? Don't you want to connect to other servers?
* The DBA shuts down the database. The server takes a long time to do checkpointing. During the shutdown checkpoint, libpq tries to connect to the server and receives the error "the database system is shutting down."
* The former primary failed and is now trying to start as a standby, catching up by applying WAL. During the recovery, libpq tries to connect to the server and receives the error "the database system is performing recovery."
* The database server crashed due to a bug. Unfortunately, the server takes an unexpectedly long time to shut down because it takes many seconds to write the stats file (as you remember, Tom-san experienced 57 seconds to write the stats file during regression tests). During the stats file write, libpq tries to connect to the server and receives the error "the database system is shutting down."
These are equivalent to server failure. I believe we should prioritize rescuing errors during operation over detecting configuration errors.
Yeah, you have a point. I'm willing to admit that we may have defined
the behavior of the feature incorrectly, provided that you're willing
to admit that you're proposing a definition change, not just a bug
fix.
Anybody else want to weigh in with an opinion here?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
Yeah, you have a point. I'm willing to admit that we may have defined
the behavior of the feature incorrectly, provided that you're willing
to admit that you're proposing a definition change, not just a bug
fix.
Anybody else want to weigh in with an opinion here?
I'm not really on board with "try each server until you find one where
this dbname+username+password combination works". That's just a recipe
for trouble, especially the password angle.
I think it's a good point that there are certain server responses that
we should take as equivalent to "server down", but by the same token
there are responses that we should not take that way.
I suggest that we need to conditionalize the decision based on what
SQLSTATE is reported. Not sure offhand if it's better to have a whitelist
of SQLSTATEs that allow failing over to the next server, or a blacklist of
SQLSTATEs that don't.
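A rough sketch of the whitelist variant (the SQLSTATE values below are real PostgreSQL error codes, but the helper itself is only an illustration, not proposed libpq code):

```python
# Illustrative whitelist: fail over to the next host only for SQLSTATEs
# that mean "this server cannot take connections right now", i.e. errors
# that are plausibly transient rather than configuration mistakes.
RETRYABLE_SQLSTATES = {
    "53300",  # too_many_connections
    "57P03",  # cannot_connect_now (starting up, shutting down, in recovery)
}

def should_try_next_host(sqlstate):
    """True if the reported SQLSTATE looks like a transient server condition."""
    return sqlstate in RETRYABLE_SQLSTATES
```

Under such a rule, "database does not exist" (3D000) or a password failure (28P01) would remain hard failures, while "the database system is shutting down" (57P03) would move on to the next host.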
regards, tom lane
Tom, Robert,
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
Robert Haas <robertmhaas@gmail.com> writes:
Yeah, you have a point. I'm willing to admit that we may have defined
the behavior of the feature incorrectly, provided that you're willing
to admit that you're proposing a definition change, not just a bug
fix.
Anybody else want to weigh in with an opinion here?
I'm not really on board with "try each server until you find one where
this dbname+username+password combination works". That's just a recipe
for trouble, especially the password angle.
Agreed.
I think it's a good point that there are certain server responses that
we should take as equivalent to "server down", but by the same token
there are responses that we should not take that way.
Right.
I suggest that we need to conditionalize the decision based on what
SQLSTATE is reported. Not sure offhand if it's better to have a whitelist
of SQLSTATEs that allow failing over to the next server, or a blacklist of
SQLSTATEs that don't.
No particular comment on this. I do wonder about forward/backwards
compatibility in such lists and if SQLSTATE really covers all
cases/distinctions which are interesting when it comes to making this
decision.
Thanks!
Stephen
Stephen Frost <sfrost@snowman.net> writes:
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
I suggest that we need to conditionalize the decision based on what
SQLSTATE is reported. Not sure offhand if it's better to have a whitelist
of SQLSTATEs that allow failing over to the next server, or a blacklist of
SQLSTATEs that don't.
No particular comment on this. I do wonder about forward/backwards
compatibility in such lists and if SQLSTATE really covers all
cases/distinctions which are interesting when it comes to making this
decision.
If the server is reporting the same SQLSTATE for server-down types
of conditions as for server-up, then that's a bug and we need to change
the SQLSTATE assigned to one case or the other. The entire point of
SQLSTATE is that it should generally capture distinctions as finely
as client software is likely to be interested in.
regards, tom lane
On Wed, May 17, 2017 at 12:48 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
Yeah, you have a point. I'm willing to admit that we may have defined
the behavior of the feature incorrectly, provided that you're willing
to admit that you're proposing a definition change, not just a bug
fix.
Anybody else want to weigh in with an opinion here?
I'm not really on board with "try each server until you find one where
this dbname+username+password combination works". That's just a recipe
for trouble, especially the password angle.
Sure, I know what *your* opinion is. And I'm somewhat inclined to
agree, but not to the degree that I don't think we should hear what
other people have to say.
I suggest that we need to conditionalize the decision based on what
SQLSTATE is reported. Not sure offhand if it's better to have a whitelist
of SQLSTATEs that allow failing over to the next server, or a blacklist of
SQLSTATEs that don't.
Urgh. There are two things I don't like about that. First, it's a
major redesign of this feature at the 11th hour. Second, if we can't
even agree on the general question of whether all, some, or no server
errors should cause a retry, the chances of agreeing on which SQL
states to include in the retry loop are probably pretty low. Indeed,
there might not be one answer that will be right for everyone.
One good argument for leaving this alone entirely is that this feature
was committed on November 3rd and this thread began on May 12th. If
there was ample time before feature freeze to question the design and
nobody did, then I'm not sure why we should disregard the freeze to
start whacking it around now, especially on the strength of one
complaint. It may be that after we get some field experience with
this the right thing to do will become clearer.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert,
* Robert Haas (robertmhaas@gmail.com) wrote:
One good argument for leaving this alone entirely is that this feature
was committed on November 3rd and this thread began on May 12th. If
there was ample time before feature freeze to question the design and
nobody did, then I'm not sure why we should disregard the freeze to
start whacking it around now, especially on the strength of one
complaint. It may be that after we get some field experience with
this the right thing to do will become clearer.
I am not particularly convinced by this argument. As much as we hope
that committers have worked with a variety of people with varying
interests and that individuals who are concerned about such start
testing just as soon as something is committed, that, frankly, isn't how
the world really works, based on my observations, at least.
The point of this period of time between feature freeze and actual
release is, more-or-less, to figure out if the solution we've reached
actually is a good one, and if not, to do something about it.
Thanks!
Stephen
Stephen Frost <sfrost@snowman.net> writes:
* Robert Haas (robertmhaas@gmail.com) wrote:
One good argument for leaving this alone entirely is that this feature
was committed on November 3rd and this thread began on May 12th. If
there was ample time before feature freeze to question the design and
nobody did, then I'm not sure why we should disregard the freeze to
start whacking it around now, especially on the strength of one
complaint. It may be that after we get some field experience with
this the right thing to do will become clearer.
I am not particularly convinced by this argument. As much as we hope
that committers have worked with a variety of people with varying
interests and that individuals who are concerned about such start
testing just as soon as something is committed, that, frankly, isn't how
the world really works, based on my observations, at least.
The point of this period of time between feature freeze and actual
release is, more-or-less, to figure out if the solution we've reached
actually is a good one, and if not, to do something about it.
Sure, but part of the point of beta testing is to get user feedback.
I agree with Robert's point that major redesign of the feature on the
basis of one complaint isn't necessarily the way to go. Since the
existing behavior is already out in beta1, let's wait and see if anyone
else complains. We don't need to fix it Right This Instant.
Maybe add this to the list of open issues to reconsider mid-beta?
regards, tom lane
On Wed, May 17, 2017 at 12:06 AM, Tsunakawa, Takayuki <
tsunakawa.takay@jp.fujitsu.com> wrote:
From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Robert Haas
Who is right is a judgement call, but I don't think it's self-evident that
users want to ignore anything and everything that might have gone wrong
with the connection to the first server, rather than only those things which
resemble a down server. It seems quite possible to me that if we had
defined it as you are proposing, somebody would now be arguing for a
behavior change in the other direction.
Judgment call... so, I understood that it's a matter of choosing between
helping to detect configuration errors early or service continuity.
This is how I've been reading this thread and I'm tending to agree with
prioritizing service continuity over configuration error detection. As a
client if I have an alternative that ends up working I don't really care
whose fault it is that the earlier options weren't. I don't have enough
experience to think up plausible scenarios here but I'm sold on the theory.
David J.
Tom,
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
I agree with Robert's point that major redesign of the feature on the
basis of one complaint isn't necessarily the way to go. Since the
existing behavior is already out in beta1, let's wait and see if anyone
else complains. We don't need to fix it Right This Instant.
Fair enough.
Maybe add this to the list of open issues to reconsider mid-beta?
Works for me.
Thanks!
Stephen
Moin,
On Wed, May 17, 2017 12:34 pm, Robert Haas wrote:
On Wed, May 17, 2017 at 3:06 AM, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:
What do you think of the following cases? Don't you want to connect to
other servers?
* The DBA shuts down the database. The server takes a long time to do
checkpointing. During the shutdown checkpoint, libpq tries to connect
to the server and receives the error "the database system is shutting
down."
* The former primary failed and is now trying to start as a standby,
catching up by applying WAL. During the recovery, libpq tries to
connect to the server and receives the error "the database system is
performing recovery."
* The database server crashed due to a bug. Unfortunately, the server
takes an unexpectedly long time to shut down because it takes many
seconds to write the stats file (as you remember, Tom-san experienced
57 seconds to write the stats file during regression tests). During the
stats file write, libpq tries to connect to the server and receives the
error "the database system is shutting down."
These are equivalent to server failure. I believe we should prioritize
rescuing errors during operation over detecting configuration errors.
Yeah, you have a point. I'm willing to admit that we may have defined
the behavior of the feature incorrectly, provided that you're willing
to admit that you're proposing a definition change, not just a bug
fix.
Anybody else want to weigh in with an opinion here?
Hm, to me the feature needs to be reliable (for certain values of
reliable) to be useful.
Consider that you have X hosts (redundancy), and a lot of applications
that want a stable connection to the one that (still) works, whichever
one that is.
You can then either:
1. make one primary, the other standby(s) and play DNS tricks or similar
to make it appear that there is only one working host, and have all apps
connect to the "one host" (and reconnect to it upon failure)
2. let each app try each host until it finds a working one, if the
connection breaks, retry with the next host
3. or use libpq and let it try the hosts for you.
However, if I understand it correctly, #3 only works reliably in certain
cases (e.g. host down), but not if the host is "sort of down". In that case
each app would again need code to retry different hosts until it finds a
working one, instead of letting libpq do the work.
That makes #3 sound hard to deploy in practice, as you might easily just
code up #1 or #2 and call it a day.
All the best,
Tels