Multiple hosts in connection string failed to failover in non-hot standby mode

Started by Hubert Zhangover 5 years ago38 messageshackers
Jump to latest
#1Hubert Zhang
zhubert@vmware.com

Hi hackers,

Libpq has supported to specify multiple hosts in connection string and enable auto failover when the previous PostgreSQL instance cannot be accessed.
But when I tried to enable this feature for a non-hot standby, it cannot do the failover with the following messages.

psql: error: could not connect to server: FATAL: the database system is starting up

Document says ' If a connection is established successfully, but authentication fails, the remaining hosts in the list are not tried.'
I'm wondering is it a feature by design or a bug? If it's a bug, I plan to fix it.

Thanks,
Hubert Zhang

#2tsunakawa.takay@fujitsu.com
tsunakawa.takay@fujitsu.com
In reply to: Hubert Zhang (#1)
RE: Multiple hosts in connection string failed to failover in non-hot standby mode

Please send emails in text format. Your email was in HTML, and I changed this reply to text format.

From: Hubert Zhang <zhubert@vmware.com>

Libpq has supported to specify multiple hosts in connection string and enable auto failover when the previous PostgreSQL instance cannot be accessed.
But when I tried to enable this feature for a non-hot standby, it cannot do the failover with the following messages.

psql: error: could not connect to server: FATAL: the database system is starting up

Was the primary running and accepting connections when you encountered this error? That is, if you specified host="host1 host2", host1 was the non-hot standby and host2 was a running primary? Or only the non-hot standby was running?

If a primary was running, I'd say it's a bug... Perhaps the following part in libpq gives up connection attempts wen the above FATAL error is returned from the server. Maybe libpq should differentiate errors using SQLSTATE and continue connection attempts on other hosts.

[fe-connect.c]
/* Handle errors. */
if (beresp == 'E')
{
if (PG_PROTOCOL_MAJOR(conn->pversion) >= 3)
...
#endif

goto error_return;
}

/* It is an authentication request. */
conn->auth_req_received = true;

/* Get the type of request. */

Regards
Takayuki Tsunakawa

#3Noah Misch
noah@leadboat.com
In reply to: Hubert Zhang (#1)
Re: Multiple hosts in connection string failed to failover in non-hot standby mode

On Tue, Oct 27, 2020 at 07:14:14AM +0000, Hubert Zhang wrote:

Libpq has supported to specify multiple hosts in connection string and enable auto failover when the previous PostgreSQL instance cannot be accessed.
But when I tried to enable this feature for a non-hot standby, it cannot do the failover with the following messages.

psql: error: could not connect to server: FATAL: the database system is starting up

Document says ' If a connection is established successfully, but authentication fails, the remaining hosts in the list are not tried.'
I'm wondering is it a feature by design or a bug? If it's a bug, I plan to fix it.

I felt it was a bug, but the community as a whole may or may not agree:
/messages/by-id/flat/16508-1a63222835164566@postgresql.org

#4Hubert Zhang
zhubert@vmware.com
In reply to: tsunakawa.takay@fujitsu.com (#2)
Re: Multiple hosts in connection string failed to failover in non-hot standby mode

Was the primary running and accepting connections when you encountered this error? That is, if you specified host="host1 host2", host1 was the non-hot standby and host2 was a running primary? Or only the non-hot standby was running?

If a primary was running, I'd say it's a bug... Perhaps the following part in libpq gives up connection attempts wen the above FATAL error is returned from the server. Maybe libpq should differentiate errors using SQLSTATE and continue connection attempts on other hosts.
Yes, the primary was running, but non-hot standby is in front of the primary in connection string.
Hao Wu and I wrote a patch to fix this problem. Client side libpq should try another hosts in connection string when it is rejected by a non-hot standby, or the first host encounter some n/w problems during the libpq handshake.

Please send emails in text format. Your email was in HTML, and I changed this reply to text format.
Thanks. Is this email in text format now? I just use outlook in chrome. Let me know if it still in html format.

Hubert & Hao Wu

________________________________
From: tsunakawa.takay@fujitsu.com <tsunakawa.takay@fujitsu.com>
Sent: Tuesday, October 27, 2020 5:30 PM
To: Hubert Zhang <zhubert@vmware.com>
Cc: pgsql-hackers@postgresql.org <pgsql-hackers@postgresql.org>
Subject: RE: Multiple hosts in connection string failed to failover in non-hot standby mode

Please send emails in text format. Your email was in HTML, and I changed this reply to text format.

From: Hubert Zhang <zhubert@vmware.com>

Libpq has supported to specify multiple hosts in connection string and enable auto failover when the previous PostgreSQL instance cannot be accessed.
But when I tried to enable this feature for a non-hot standby, it cannot do the failover with the following messages.

psql: error: could not connect to server: FATAL: the database system is starting up

Was the primary running and accepting connections when you encountered this error? That is, if you specified host="host1 host2", host1 was the non-hot standby and host2 was a running primary? Or only the non-hot standby was running?

If a primary was running, I'd say it's a bug... Perhaps the following part in libpq gives up connection attempts wen the above FATAL error is returned from the server. Maybe libpq should differentiate errors using SQLSTATE and continue connection attempts on other hosts.

[fe-connect.c]
/* Handle errors. */
if (beresp == 'E')
{
if (PG_PROTOCOL_MAJOR(conn->pversion) >= 3)
...
#endif

goto error_return;
}

/* It is an authentication request. */
conn->auth_req_received = true;

/* Get the type of request. */

Regards
Takayuki Tsunakawa

Attachments:

0001-Enhance-libpq-to-support-multiple-host-for-non-hot-s.patchapplication/octet-stream; name=0001-Enhance-libpq-to-support-multiple-host-for-non-hot-s.patchDownload+52-6
#5tsunakawa.takay@fujitsu.com
tsunakawa.takay@fujitsu.com
In reply to: Hubert Zhang (#4)
RE: Multiple hosts in connection string failed to failover in non-hot standby mode

From: Hubert Zhang <zhubert@vmware.com>

Hao Wu and I wrote a patch to fix this problem. Client side libpq should try another hosts in connection string when it is rejected by a non-hot standby, or the first host encounter some n/w problems during the libpq handshake.

Thank you. Please add it to the November Commitfest.

Thanks. Is this email in text format now? I just use outlook in chrome. Let me know if it still in html format.

I'm afraid not. The Outlook's title bar says that it's in HTML format. I'm using Outlook 2016 client app on Windows 10. I may have failed to convert my previous email to text, but it should be text format this time.

Regards
Takayuki Tsunakawa

#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Hubert Zhang (#4)
Re: Multiple hosts in connection string failed to failover in non-hot standby mode

Hubert Zhang <zhubert@vmware.com> writes:

[ 0001-Enhance-libpq-to-support-multiple-host-for-non-hot-s.patch ]

I took a quick look at this. TBH, I'd just drop the first three hunks,
as they've got nothing to do with any failure mode that there's evidence
for in this thread or the prior one, and I'm afraid they're more likely
to create trouble than fix it.

As for the last hunk, why is it after rather than before the SSL/GSS
checks? I doubt that retrying with/without SSL is going to change
a CANNOT_CONNECT_NOW result, unless maybe by slowing things down to
the point where recovery has finished ;-)

The bigger picture though is

(1) what set of failures should we retry on? I think CANNOT_CONNECT_NOW
is reasonable, but are there others?

(2) what does this do to the quality of the error messages in cases
where all the connection attempts fail?

I think that error message quality was not thought too much about
in the original development of the multi-host feature, so to some
extent I'm asking you to clean up someone else's mess. Nonetheless,
I feel that we do need to clean it up before we do things that risk
making it even more confusing.

The problems that I see in this area are first that there's no
real standardization in libpq as to whether to append error messages
together or just flush preceding messages; and second that no effort
is made in multi-connection-attempt cases to separate the errors from
different attempts, much less identify which error goes with which
host or IP address. I think we really need to put some work into
that. In some cases you can infer what happened from breadcrumbs
we already put into the text, for example

$ psql -h localhost,/tmp -p 12345
psql: error: could not connect to server: Connection refused
Is the server running on host "localhost" (::1) and accepting
TCP/IP connections on port 12345?
could not connect to server: Connection refused
Is the server running on host "localhost" (127.0.0.1) and accepting
TCP/IP connections on port 12345?
could not connect to server: No such file or directory
Is the server running locally and accepting
connections on Unix domain socket "/tmp/.s.PGSQL.12345"?

but this doesn't seem particularly helpfully laid out to me, and we don't
provide the breadcrumbs at all for a lot of other error cases.

I'm vaguely imagining that we could do something more like

could not connect to host "localhost" (::1), port 12345: Connection refused
could not connect to host "localhost" (127.0.0.1), port 12345: Connection refused
could not connect to socket "/tmp/.s.PGSQL.12345": No such file or directory

Not quite sure if the "Is the server running" hint is worth preserving.
We'd have to reword it quite a bit, and it'd be very duplicative.

The implementation of this might involve sticking the initial string
(up to the colon, in this example) into conn->errorMessage speculatively
as we try each host. If we then append an error to it and go around
again, we're good. If we successfully connect, then the contents of
conn->errorMessage don't matter.

regards, tom lane

#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#6)
Re: Multiple hosts in connection string failed to failover in non-hot standby mode

I wrote:

The problems that I see in this area are first that there's no
real standardization in libpq as to whether to append error messages
together or just flush preceding messages; and second that no effort
is made in multi-connection-attempt cases to separate the errors from
different attempts, much less identify which error goes with which
host or IP address. I think we really need to put some work into
that.

I spent some time on this, and here is a patch set that tries to
improve matters.

0001 changes the libpq coding rules to be that all error messages should
be appended to conn->errorMessage, never overwritten (there are a couple
of exceptions in fe-lobj.c) and we reset conn->errorMessage to empty only
at the start of a connection request or new query. This is something
that's been bugging me for a long time and I'm glad to get it cleaned up.
Formerly it seemed that a dartboard had been used to decide whether to use
"printfPQExpBuffer" or "appendPQExpBuffer"; now it's always the latter.
We can also get rid of several hacks that were used to get around the
mess and force appending behavior.

0002 then changes the reporting rules in fe-connect.c as I suggested,
so that you might get errors like this:

$ psql -h localhost,/tmp -p 12345
psql: error: could not connect to host "localhost" (::1), port 12345: Connection refused
Is the server running on that host and accepting TCP/IP connections?
could not connect to host "localhost" (127.0.0.1), port 12345: Connection refused
Is the server running on that host and accepting TCP/IP connections?
could not connect to socket "/tmp/.s.PGSQL.12345": No such file or directory
Is the server running locally and accepting connections on that socket?

and we have a pretty uniform rule that errors coming back from a
connection attempt will be prefixed with the server address.

Then 0003 is the part of your patch that I'm happy with. Given 0001+0002
we could certainly consider looping back and retrying for more cases, but
I still want to tread lightly on that. I don't see a lot of value in
retrying errors that seem to be on the client side, such as failure to
set socket properties; and in general I'm hesitant to add untestable
code paths here.

I feel pretty good about 0001: it might be committable as-is. 0002 is
probably subject to bikeshedding, plus it has a problem in the ECPG tests.
Two of the error messages are now unstable because they expose
chosen-at-random socket paths:

diff -U3 /home/postgres/pgsql/src/interfaces/ecpg/test/expected/connect-test5.stderr /home/postgres/pgsql/src/interfaces/ecpg/test/results/connect-test5.stderr
--- /home/postgres/pgsql/src/interfaces/ecpg/test/expected/connect-test5.stderr 2020-08-04 14:59:45.617380050 -0400
+++ /home/postgres/pgsql/src/interfaces/ecpg/test/results/connect-test5.stderr  2021-01-10 16:12:22.539433702 -0500
@@ -36,7 +36,7 @@
 [NO_PID]: sqlca: code: 0, state: 00000
 [NO_PID]: ECPGconnect: opening database <DEFAULT> on <DEFAULT> port <DEFAULT>  for user regress_ecpg_user2
 [NO_PID]: sqlca: code: 0, state: 00000
-[NO_PID]: ECPGconnect: could not open database: FATAL:  database "regress_ecpg_user2" does not exist
+[NO_PID]: ECPGconnect: could not open database: could not connect to socket "/tmp/pg_regress-EbHubF/.s.PGSQL.58080": FATAL:  database "regress_ecpg_user2" does not exist
 [NO_PID]: sqlca: code: 0, state: 00000
 [NO_PID]: ecpg_finish: connection main closed
@@ -73,7 +73,7 @@
 [NO_PID]: sqlca: code: -220, state: 08003
 [NO_PID]: ECPGconnect: opening database <DEFAULT> on <DEFAULT> port <DEFAULT>  for user regress_ecpg_user2
 [NO_PID]: sqlca: code: 0, state: 00000
-[NO_PID]: ECPGconnect: could not open database: FATAL:  database "regress_ecpg_user2" does not exist
+[NO_PID]: ECPGconnect: could not open database: could not connect to socket "/tmp/pg_regress-EbHubF/.s.PGSQL.58080": FATAL:  database "regress_ecpg_user2" does not exist

[NO_PID]: sqlca: code: 0, state: 00000
[NO_PID]: ecpg_finish: connection main closed

I don't have any non-hacky ideas what to do about that. The extra detail
seems useful to end users, but we don't have any infrastructure that
would allow filtering it out in the ECPG tests.

regards, tom lane

Attachments:

0001-append-all-error-messages.patchtext/x-diff; charset=us-ascii; name=0001-append-all-error-messages.patchDownload+493-590
0002-identify-host-more-consistently-wip.patchtext/x-diff; charset=us-ascii; name=0002-identify-host-more-consistently-wip.patchDownload+60-63
0003-retry-if-cannot-connect-now.patchtext/x-diff; charset=us-ascii; name=0003-retry-if-cannot-connect-now.patchDownload+14-0
#8Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#7)
Re: Multiple hosts in connection string failed to failover in non-hot standby mode

I wrote:

I feel pretty good about 0001: it might be committable as-is. 0002 is
probably subject to bikeshedding, plus it has a problem in the ECPG tests.
Two of the error messages are now unstable because they expose
chosen-at-random socket paths:
...
I don't have any non-hacky ideas what to do about that. The extra detail
seems useful to end users, but we don't have any infrastructure that
would allow filtering it out in the ECPG tests.

So far the only solution that comes to mind is to introduce some
infrastructure to do that filtering. 0001-0003 below are unchanged,
0004 patches up the ecpg test framework with a rather ad-hoc filtering
function. I'd feel worse about this if there weren't already a very
ad-hoc filtering function there ;-)

This set passes check-world for me; we'll soon see what the cfbot
thinks.

regards, tom lane

Attachments:

0001-append-all-error-messages.patchtext/x-diff; charset=us-ascii; name=0001-append-all-error-messages.patchDownload+493-590
0002-identify-host-more-consistently-wip.patchtext/x-diff; charset=us-ascii; name=0002-identify-host-more-consistently-wip.patchDownload+60-63
0003-retry-if-cannot-connect-now.patchtext/x-diff; charset=us-ascii; name=0003-retry-if-cannot-connect-now.patchDownload+14-0
0004-fix-unstable-ecpg-tests.patchtext/x-diff; charset=us-ascii; name=0004-fix-unstable-ecpg-tests.patchDownload+134-26
#9Hubert Zhang
zhubert@vmware.com
In reply to: Tom Lane (#8)
Re: Multiple hosts in connection string failed to failover in non-hot standby mode

Hi Tom,

I agree to get detailed error message for each failed host as your patch 0001.

As for patch 0004, find ':' after "could not connect to" may failed when error message like:
"could not connect to host "localhost" (::1), port 12345: Connection refused", where p2 will point to "::1" instead of ": Connection refused". But since it's only used for test case, we don't need to filter the error message precisely.

```
ecpg_filter_stderr(const char *resultfile, const char *tmpfile)
{
......
char *p1 = strstr(linebuf.data, "could not connect to ");
if (p1)
{
char *p2 = strchr(p1, ':');
if (p2)
memmove(p1 + 17, p2, strlen(p2) + 1);
}
}
```

Thanks,
Hubert

________________________________
From: Tom Lane <tgl@sss.pgh.pa.us>
Sent: Monday, January 11, 2021 10:56 AM
To: Hubert Zhang <zhubert@vmware.com>
Cc: tsunakawa.takay@fujitsu.com <tsunakawa.takay@fujitsu.com>; pgsql-hackers@postgresql.org <pgsql-hackers@postgresql.org>
Subject: Re: Multiple hosts in connection string failed to failover in non-hot standby mode

I wrote:

I feel pretty good about 0001: it might be committable as-is. 0002 is
probably subject to bikeshedding, plus it has a problem in the ECPG tests.
Two of the error messages are now unstable because they expose
chosen-at-random socket paths:
...
I don't have any non-hacky ideas what to do about that. The extra detail
seems useful to end users, but we don't have any infrastructure that
would allow filtering it out in the ECPG tests.

So far the only solution that comes to mind is to introduce some
infrastructure to do that filtering. 0001-0003 below are unchanged,
0004 patches up the ecpg test framework with a rather ad-hoc filtering
function. I'd feel worse about this if there weren't already a very
ad-hoc filtering function there ;-)

This set passes check-world for me; we'll soon see what the cfbot
thinks.

regards, tom lane

#10Tom Lane
tgl@sss.pgh.pa.us
In reply to: Hubert Zhang (#9)
Re: Multiple hosts in connection string failed to failover in non-hot standby mode

Hubert Zhang <zhubert@vmware.com> writes:

As for patch 0004, find ':' after "could not connect to" may failed when error message like:
"could not connect to host "localhost" (::1), port 12345: Connection refused", where p2 will point to "::1" instead of ": Connection refused". But since it's only used for test case, we don't need to filter the error message precisely.

Excellent point, and I think that could happen on a Windows installation.
We can make it look for ": " instead of just ':', and that'll reduce the
odds of trouble.

regards, tom lane

#11Justin Pryzby
pryzby@telsasoft.com
In reply to: Tom Lane (#7)
Re: Multiple hosts in connection string failed to failover in non-hot standby mode

On Sun, Jan 10, 2021 at 05:38:50PM -0500, Tom Lane wrote:

I wrote:

The problems that I see in this area are first that there's no
real standardization in libpq as to whether to append error messages
together or just flush preceding messages; and second that no effort
is made in multi-connection-attempt cases to separate the errors from
different attempts, much less identify which error goes with which
host or IP address. I think we really need to put some work into
that.

I spent some time on this, and here is a patch set that tries to
improve matters.

0001 changes the libpq coding rules to be that all error messages should
be appended to conn->errorMessage, never overwritten (there are a couple
of exceptions in fe-lobj.c) and we reset conn->errorMessage to empty only
at the start of a connection request or new query. This is something
that's been bugging me for a long time and I'm glad to get it cleaned up.
Formerly it seemed that a dartboard had been used to decide whether to use
"printfPQExpBuffer" or "appendPQExpBuffer"; now it's always the latter.
We can also get rid of several hacks that were used to get around the
mess and force appending behavior.

0002 then changes the reporting rules in fe-connect.c as I suggested,
so that you might get errors like this:

$ psql -h localhost,/tmp -p 12345
psql: error: could not connect to host "localhost" (::1), port 12345: Connection refused
Is the server running on that host and accepting TCP/IP connections?
could not connect to host "localhost" (127.0.0.1), port 12345: Connection refused
Is the server running on that host and accepting TCP/IP connections?
could not connect to socket "/tmp/.s.PGSQL.12345": No such file or directory
Is the server running locally and accepting connections on that socket?

and we have a pretty uniform rule that errors coming back from a
connection attempt will be prefixed with the server address.

52a10224 broke sqlsmith, of all things.

It was using errmsg as a test of success, instead of checking if the connection
result wasn't null:

conn = PQconnectdb(conninfo.c_str());
char *errmsg = PQerrorMessage(conn);
if (strlen(errmsg))
throw dut::broken(errmsg, "08001");

That's clearly the wrong thing to do, but maybe this should be described in the
release notes as a compatibility issue, in case other people had the same idea.
Clearing errorMessage during success is an option..

--
Justin

#12Tom Lane
tgl@sss.pgh.pa.us
In reply to: Justin Pryzby (#11)
Re: Multiple hosts in connection string failed to failover in non-hot standby mode

Justin Pryzby <pryzby@telsasoft.com> writes:

52a10224 broke sqlsmith, of all things.

It was using errmsg as a test of success, instead of checking if the connection
result wasn't null:

conn = PQconnectdb(conninfo.c_str());
char *errmsg = PQerrorMessage(conn);
if (strlen(errmsg))
throw dut::broken(errmsg, "08001");

That's clearly the wrong thing to do, but maybe this should be described in the
release notes as a compatibility issue, in case other people had the same idea.
Clearing errorMessage during success is an option..

Hm. I'd debated whether to clear conn->errorMessage at the end of
a successful connection sequence, and decided not to on the grounds
that it might be interesting info (eg it could tell you why you
ended up connected to server Y and not server X). But perhaps
it's too much of a compatibility break for this small benefit.

I'm curious though why it took this long for anyone to complain.
I'd supposed that people were running sqlsmith against HEAD on
a pretty regular basis.

regards, tom lane

#13Justin Pryzby
pryzby@telsasoft.com
In reply to: Tom Lane (#12)
Re: Multiple hosts in connection string failed to failover in non-hot standby mode

On Thu, May 06, 2021 at 01:22:27PM -0400, Tom Lane wrote:

I'm curious though why it took this long for anyone to complain.
I'd supposed that people were running sqlsmith against HEAD on
a pretty regular basis.

I think it's also becase sqlsmith would need to run against the v14 *client*
library. I don't know about anyone else's workflow, but I tend not to "make
install", but work with binaries out of ./tmp_install.

There's a few changes needed on sqlsmith HEAD, but I guess nobody would have
gotten that far if the connection was failing (or rather, detected as such).

diff --git a/grammar.cc b/grammar.cc
index 62aa8e9..76491ff 100644
--- a/grammar.cc
+++ b/grammar.cc
@@ -327,7 +327,11 @@ query_spec::query_spec(prod *p, struct scope *s, bool lateral) :

search = bool_expr::factory(this);

-  if (d6() > 2) {
+  if (d6() > 4) {
+    ostringstream cons;
+    cons << "order by 1 fetch first " << d100() + d100() << " rows with ties";
+    limit_clause = cons.str();
+  } else if (d6() > 2) {
     ostringstream cons;
     cons << "limit " << d100() + d100();
     limit_clause = cons.str();
diff --git a/postgres.cc b/postgres.cc
index f2a3627..1c0c55f 100644
--- a/postgres.cc
+++ b/postgres.cc
@@ -30,6 +30,7 @@ bool pg_type::consistent(sqltype *rvalue)
   case 'c': /* composite type */
   case 'd': /* domain */
   case 'r': /* range */
+  case 'm': /* multirange */
   case 'e': /* enum */
     return this == t;
#14Andreas Seltenreich
seltenreich@gmx.de
In reply to: Justin Pryzby (#13)
Re: Multiple hosts in connection string failed to failover in non-hot standby mode

On Thu, May 06, 2021 at 01:22:27PM -0400, Tom Lane wrote:

I'm curious though why it took this long for anyone to complain.
I'd supposed that people were running sqlsmith against HEAD on
a pretty regular basis.

Last time I ran it was November 27. I'm neglecting it on my spare time
and there is hardly any opportunity to sneak it onto my agenda at work.
I'll do my best to try to get either of these fixed.

Justin Pryzby writes:

I think it's also becase sqlsmith would need to run against the v14 *client*
library. I don't know about anyone else's workflow, but I tend not to "make
install", but work with binaries out of ./tmp_install.

My playbooks don't grab the client libraries of the test target either.
I'll change them.

There's a few changes needed on sqlsmith HEAD, but I guess nobody would have
gotten that far if the connection was failing (or rather, detected as such).

Thanks for the patch.

regards,
andreas

#15Michael Paquier
michael@paquier.xyz
In reply to: Tom Lane (#12)
Re: Multiple hosts in connection string failed to failover in non-hot standby mode

On Thu, May 06, 2021 at 01:22:27PM -0400, Tom Lane wrote:

Hm. I'd debated whether to clear conn->errorMessage at the end of
a successful connection sequence, and decided not to on the grounds
that it might be interesting info (eg it could tell you why you
ended up connected to server Y and not server X). But perhaps
it's too much of a compatibility break for this small benefit.

I'm curious though why it took this long for anyone to complain.
I'd supposed that people were running sqlsmith against HEAD on
a pretty regular basis.

FWIW, I think that the case of getting some information about any
failed connections while a connection has been successfully made
within the scope of the connection string parameters provided by the
user is rather thin, and I really feel that this is going to cause
more pain to users than this is worth it. So my vote would be to
clean up conn->errorMessage after a successful connection.

Now, I would not mind either if we finish by taking a decision here
after beta1, to see if there are actual complains on the matter based
on the feedback we get.
--
Michael

#16Daniel Gustafsson
daniel@yesql.se
In reply to: Michael Paquier (#15)
Re: Multiple hosts in connection string failed to failover in non-hot standby mode

On 11 May 2021, at 09:29, Michael Paquier <michael@paquier.xyz> wrote:

FWIW, I think that the case of getting some information about any
failed connections while a connection has been successfully made
within the scope of the connection string parameters provided by the
user is rather thin, and I really feel that this is going to cause
more pain to users than this is worth it. So my vote would be to
clean up conn->errorMessage after a successful connection.

Agreed, given how conservative we typically are with backwards compatibility it
seems a too thin benefit to warrant potential breakage.

My vote would too be to restore the behavior by clearing conn->errorMessage.

--
Daniel Gustafsson https://vmware.com/

#17Justin Pryzby
pryzby@telsasoft.com
In reply to: Tom Lane (#12)
Re: Multiple hosts in connection string failed to failover in non-hot standby mode

On Thu, May 06, 2021 at 01:22:27PM -0400, Tom Lane wrote:

Justin Pryzby <pryzby@telsasoft.com> writes:

52a10224 broke sqlsmith, of all things.

It was using errmsg as a test of success, instead of checking if the connection
result wasn't null:

conn = PQconnectdb(conninfo.c_str());
char *errmsg = PQerrorMessage(conn);
if (strlen(errmsg))
throw dut::broken(errmsg, "08001");

That's clearly the wrong thing to do, but maybe this should be described in the
release notes as a compatibility issue, in case other people had the same idea.
Clearing errorMessage during success is an option..

Hm. I'd debated whether to clear conn->errorMessage at the end of
a successful connection sequence, and decided not to on the grounds
that it might be interesting info (eg it could tell you why you
ended up connected to server Y and not server X). But perhaps
it's too much of a compatibility break for this small benefit.

I don't care if applications break because they check the errorMessage instead
of the return value...

..But I think it's not useful to put details into errorMessage on success,
unless you're going to document that. It would never have occurred to me to
look there, or that it would even be safe.

--
Justin

#18Michael Paquier
michael@paquier.xyz
In reply to: Justin Pryzby (#17)
Re: Multiple hosts in connection string failed to failover in non-hot standby mode

On Sun, May 30, 2021 at 08:25:00PM -0500, Justin Pryzby wrote:

..But I think it's not useful to put details into errorMessage on success,
unless you're going to document that. It would never have occurred to me to
look there, or that it would even be safe.

Yeah. On the contrary, it could be confusing if one sees an error
message but there is nothing to worry about, because things are
working in the scope of what the user wanted at connection time.
--
Michael

#19Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#18)
Re: Multiple hosts in connection string failed to failover in non-hot standby mode

On Mon, May 31, 2021 at 11:00:55AM +0900, Michael Paquier wrote:

Yeah. On the contrary, it could be confusing if one sees an error
message but there is nothing to worry about, because things are
working in the scope of what the user wanted at connection time.

In my recent quest to look at GSSAPI builds on Windows, I have bumped
into another failure that's related to this thread. hamerkop
summarizes the situation here:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hamerkop&amp;dt=2021-05-29%2010%3A15%3A42

There are two failures like this one as errorMessage piles up on
failures, as of connect/test5:
-[NO_PID]: ECPGconnect: connection to server failed: FATAL:  database
 "regress_ecpg_user2" does not exist
+[NO_PID]: ECPGconnect: connection to server failed: could not
 initiate GSSAPI security context: Unspecified GSS failure.  Minor
 code may provide more information: Credential cache is empty
+connection to server failed: FATAL:  database "regress_ecpg_user2"
 does not exist 
--
Michael
#20Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Paquier (#19)
Re: Multiple hosts in connection string failed to failover in non-hot standby mode

Michael Paquier <michael@paquier.xyz> writes:

In my recent quest to look at GSSAPI builds on Windows, I have bumped
into another failure that's related to this thread.  hamerkop
summarizes the situation here:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hamerkop&dt=2021-05-29%2010%3A15%3A42
There are two failures like this one as errorMessage piles up on
failures, as of connect/test5:
-[NO_PID]: ECPGconnect: connection to server failed: FATAL:  database
"regress_ecpg_user2" does not exist
+[NO_PID]: ECPGconnect: connection to server failed: could not
initiate GSSAPI security context: Unspecified GSS failure.  Minor
code may provide more information: Credential cache is empty
+connection to server failed: FATAL:  database "regress_ecpg_user2"
does not exist 

Yeah, I was looking at that earlier today. Evidently libpq is
trying a GSS-encrypted connection, and that doesn't work, so
it falls back to a regular connection where we get the expected
error. Probably all the connections in this test are hitting the
GSS failure, but only the ones where the second attempt fails
show a visible issue.

What is not clear is why GSS is acting that way. We wouldn't
have tried a GSS connection unless pg_GSS_have_cred_cache
succeeded ... so how come that worked but then gss_init_sec_context
complained "Credential cache is empty"?

My rough guess is that Windows has implemented the GSS APIs in
such a way that what pg_GSS_have_cred_cache is testing isn't
sufficient to tell whether a sane credential actually exists.

Or there's something particularly weird about how hamerkop
is set up.

regards, tom lane

#21Michael Paquier
michael@paquier.xyz
In reply to: Tom Lane (#20)
#22Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Paquier (#21)
#23Michael Paquier
michael@paquier.xyz
In reply to: Tom Lane (#22)
#24Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Paquier (#23)
#25Michael Paquier
michael@paquier.xyz
In reply to: Tom Lane (#24)
#26Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Paquier (#25)
#27Michael Paquier
michael@paquier.xyz
In reply to: Tom Lane (#26)
#28Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Paquier (#27)
#29Michael Paquier
michael@paquier.xyz
In reply to: Tom Lane (#28)
#30Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Paquier (#29)
#31Michael Paquier
michael@paquier.xyz
In reply to: Tom Lane (#30)
#32Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Paquier (#31)
#33Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#32)
#34Michael Paquier
michael@paquier.xyz
In reply to: Tom Lane (#33)
#35Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Paquier (#34)
#36Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#35)
#37Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Paquier (#18)
#38Michael Paquier
michael@paquier.xyz
In reply to: Tom Lane (#37)