libpq should append auth failures, not overwrite

Started by Tom Laneover 7 years ago16 messageshackers
Jump to latest
#1Tom Lane
tgl@sss.pgh.pa.us

I noticed that, although most error reports during libpq's connection
setup code append to conn->errorMessage, the ones in fe-auth.c and
fe-auth-scram.c don't: they're all printfPQExpBuffer() not
appendPQExpBuffer(). This seems wrong to me. It makes no difference
in simple cases with a single target server, but as soon as you have
multiple servers listed in "host", this coding makes it impossible
to tell which server rejected your login.

So I think we should basically s/printfPQExpBuffer/appendPQExpBuffer/g
anywhere those files touch conn->errorMessage, allowing any problems
with previous servers to be preserved in the eventually-reported message.
I won't bother posting an actual patch for that right now, but has anyone
got an objection?

regards, tom lane

#2Michael Paquier
michael@paquier.xyz
In reply to: Tom Lane (#1)
Re: libpq should append auth failures, not overwrite

On Thu, Aug 09, 2018 at 11:44:27AM -0400, Tom Lane wrote:

I noticed that, although most error reports during libpq's connection
setup code append to conn->errorMessage, the ones in fe-auth.c and
fe-auth-scram.c don't: they're all printfPQExpBuffer() not
appendPQExpBuffer(). This seems wrong to me. It makes no difference
in simple cases with a single target server, but as soon as you have
multiple servers listed in "host", this coding makes it impossible
to tell which server rejected your login.

+1.
--
Michael
#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Paquier (#2)
Re: libpq should append auth failures, not overwrite

Michael Paquier <michael@paquier.xyz> writes:

On Thu, Aug 09, 2018 at 11:44:27AM -0400, Tom Lane wrote:

I noticed that, although most error reports during libpq's connection
setup code append to conn->errorMessage, the ones in fe-auth.c and
fe-auth-scram.c don't: they're all printfPQExpBuffer() not
appendPQExpBuffer(). This seems wrong to me. It makes no difference
in simple cases with a single target server, but as soon as you have
multiple servers listed in "host", this coding makes it impossible
to tell which server rejected your login.

+1.

So I thought this would be a trivial patch barely even worthy of posting,
but an awful lot of critters crawled out from under that rock. It's not
just fe-auth*.c at fault; low-level failures such as timeout errors were
also clearing conn->errorMessage, and if you got an actual server error
(like "role does not exist"), that did too. What I ended up doing was a
wholesale conversion of libpq's management of conn->errorMessage. Now,
there are a few identified points at connection start or query start that
explicitly clear errorMessage, and everyplace else in libpq is supposed to
append to it, never just printf onto it. (Interestingly, all of those
"identified points" already did clear errorMessage, meaning that a large
fraction of the existing printf's would always have started with an empty
errorMessage anyway.)

The first patch attached does that, and then the second one is a proposed
modification in the way we report failures for multi-host connection
attempts: let's explicitly label each section of conn->errorMessage with
the host we're trying to connect to. This can replace the ad-hoc labeling
that somebody thought would be a good idea for two errors associated with
the target_session_attrs=read-write feature. (That's not a bad idea in
itself, but doing it only for those two errors is.)

Here's some examples of the kind of connection failure reports the code
can produce with these patches:

$ psql "host=refusenik,dead,localhost user=readonly dbname=postgres connect_timeout=3 target_session_attrs=read-write"
psql: server "refusenik" port 5432:
could not connect to server: Connection refused
Is the server running on host "refusenik" (192.168.1.43) and accepting
TCP/IP connections on port 5432?
server "dead" port 5432:
timeout expired
server "localhost" port 5432:
could not open a read-write session

$ psql "host=refusenik,dead,localhost user=nobody dbname=postgres"
psql: server "refusenik" port 5432:
could not connect to server: Connection refused
Is the server running on host "refusenik" (192.168.1.43) and accepting
TCP/IP connections on port 5432?
server "dead" port 5432:
timeout expired
server "localhost" port 5432:
FATAL: role "nobody" does not exist

Before, you'd get a rather random subset of these messages depending on
which parts of the code decided to clear conn->errorMessage, and in
many cases it'd be really hard to tell which server was involved with
which message(s).

Some points for discussion and review:

1. The bulk of patch 0001 is indeed just
s/printfPQExpBuffer/appendPQExpBuffer/g, though I also made an attempt
to use appendPQExpBufferStr wherever possible. There are two categories
of printfPQExpBuffer calls I didn't change:

1a. Calls reporting an out-of-memory situation. There was already a
policy in some places that we'd intentionally printf not append such
messages, and I made that explicit. The idea is that we might not
have room to add more text to errorMessage, so resetting the buffer
provides more certainty of success. However, this makes for a pretty
weird inconsistency in the code; there are a lot of functions now in
which half of the messages are printf'd and half are append'd, so I'm
afraid that future patches will get it wrong as to which to use. Also,
in reality there often *would* be room to append "out of memory" without
enlarging errorMessage's buffer, so that this could just be pointless
destruction of useful error history. I didn't do anything about it here,
but I'm tempted to replace all the printf's for OOM messages with a
subroutine that will append if it can do so without enlarging the existing
buffer, else replace.

1b. There are a few places where it didn't seem worth changing anything
because it'd just result in needing to add a resetPQExpBuffer in the same
function or very nearby. Mostly in fe-lobj.c.

2. I had to drop some code (notably in pqPrepareAsyncResult) that
attempted to force conn->errorMessage to always have the same contents
as the current PGresult's PQresultErrorMessage; as it stood, server
errors such as "role does not exist" wiped out preceding contents of
errorMessage, breaking cases such as the second example above. This is
slightly annoying, but in the new dispensation it's possible that
conn->errorMessage contains a mix of libpq-generated errors and text
received from the server, so I'm not sure that preserving the old
equivalence is a useful goal anyway. We shouldn't cram pre-existing
errorMessage text back into a server-generated error; that would
invalidate the server's auxiliary error info such as the SQLSTATE.
I don't know if it's possible to do better here. One idea is to tweak
pqPrepareAsyncResult so that it does the message synchronization only
when in CONNECTION_OK state, allowing conn->errorMessage to be a
historical record only for connection processing. But that seems like
a hack.

3. In some places I failed to resist the temptation to copy-edit error
messages that didn't meet style guidelines, or to add libpq_gettext()
annotation where it seemed to have been missed.

4. connectDBComplete() cleared errorMessage after a successful
completion, but that's the wrong place to do it; we have to do that
at the success exits from PQconnectPoll, else applications that
use PQconnectPoll directly will see leftover garbage there.

5. There were places in PQconnectPoll that temporarily set
conn->status = CONNECTION_OK to prevent PQsendQueryStart from
complaining. A better idea is to leave the status alone and change
that test in PQsendQueryStart to be status != CONNECTION_BAD. This
not only simplifies PQconnectPoll, but allows PQsendQueryStart
to tell whether we're still in the connection sequence. I fixed
it to not clear errorMessage when we are, thus solving the problem
of PQsendQuery destroying prior connection error history. That allows
getting rid of saveErrorMessage/restoreErrorMessage, which were
(a) an incredibly ugly and inefficient hack, and (b) inadequate,
since there were other paths through PQconnectPoll that could reach
PQsendQueryStart but were not protected with calls to those.

6. In the 0002 patch, I have it printing the annotation only when
there's more than one connhost[], and only once per connhost not
once per IP address. The motivation for doing it like that was to
not change the behavior for simple cases with only one host name.
There is certainly room to argue that we should make an annotation
once per IP address, but that'd change the behavior for pretty common
cases like "localhost" resolving to both 127.0.0.1 and ::1, so I don't
know if it'd be a good idea or not. There's also an interaction to
consider with my patch at
/messages/by-id/4913.1533827102@sss.pgh.pa.us
namely that we have to be sure the labeling behaves sanely for
connhosts that fail to resolve any IP addresses.

I'll put this into the September commitfest.

regards, tom lane

Attachments:

0001-libpq-append-errors-1.patchtext/x-diff; charset=us-ascii; name=0001-libpq-append-errors-1.patchDownload+380-472
0002-libpq-better-server-id-1.patchtext/x-diff; charset=us-ascii; name=0002-libpq-better-server-id-1.patchDownload+34-35
#4Fabien COELHO
coelho@cri.ensmp.fr
In reply to: Tom Lane (#3)
Re: libpq should append auth failures, not overwrite

Hello Tom,

Some points for discussion and review:

1. The bulk of patch 0001 is indeed just
s/printfPQExpBuffer/appendPQExpBuffer/g, though I also made an attempt
to use appendPQExpBufferStr wherever possible. There are two categories
of printfPQExpBuffer calls I didn't change:

1a. Calls reporting an out-of-memory situation. There was already a
policy in some places that we'd intentionally printf not append such
messages, and I made that explicit. The idea is that we might not
have room to add more text to errorMessage, so resetting the buffer
provides more certainty of success. However, this makes for a pretty
weird inconsistency in the code; there are a lot of functions now in
which half of the messages are printf'd and half are append'd, so I'm
afraid that future patches will get it wrong as to which to use. Also,
in reality there often *would* be room to append "out of memory" without
enlarging errorMessage's buffer, so that this could just be pointless
destruction of useful error history. I didn't do anything about it here,
but I'm tempted to replace all the printf's for OOM messages with a
subroutine that will append if it can do so without enlarging the existing
buffer, else replace.

1b. There are a few places where it didn't seem worth changing anything
because it'd just result in needing to add a resetPQExpBuffer in the same
function or very nearby. Mostly in fe-lobj.c.

2. I had to drop some code (notably in pqPrepareAsyncResult) that
attempted to force conn->errorMessage to always have the same contents
as the current PGresult's PQresultErrorMessage; as it stood, server
errors such as "role does not exist" wiped out preceding contents of
errorMessage, breaking cases such as the second example above. This is
slightly annoying, but in the new dispensation it's possible that
conn->errorMessage contains a mix of libpq-generated errors and text
received from the server, so I'm not sure that preserving the old
equivalence is a useful goal anyway. We shouldn't cram pre-existing
errorMessage text back into a server-generated error; that would
invalidate the server's auxiliary error info such as the SQLSTATE.
I don't know if it's possible to do better here. One idea is to tweak
pqPrepareAsyncResult so that it does the message synchronization only
when in CONNECTION_OK state, allowing conn->errorMessage to be a
historical record only for connection processing. But that seems like
a hack.

3. In some places I failed to resist the temptation to copy-edit error
messages that didn't meet style guidelines, or to add libpq_gettext()
annotation where it seemed to have been missed.

4. connectDBComplete() cleared errorMessage after a successful
completion, but that's the wrong place to do it; we have to do that
at the success exits from PQconnectPoll, else applications that
use PQconnectPoll directly will see leftover garbage there.

5. There were places in PQconnectPoll that temporarily set
conn->status = CONNECTION_OK to prevent PQsendQueryStart from
complaining. A better idea is to leave the status alone and change
that test in PQsendQueryStart to be status != CONNECTION_BAD. This
not only simplifies PQconnectPoll, but allows PQsendQueryStart
to tell whether we're still in the connection sequence. I fixed
it to not clear errorMessage when we are, thus solving the problem
of PQsendQuery destroying prior connection error history. That allows
getting rid of saveErrorMessage/restoreErrorMessage, which were
(a) an incredibly ugly and inefficient hack, and (b) inadequate,
since there were other paths through PQconnectPoll that could reach
PQsendQueryStart but were not protected with calls to those.

6. In the 0002 patch, I have it printing the annotation only when
there's more than one connhost[], and only once per connhost not
once per IP address. The motivation for doing it like that was to
not change the behavior for simple cases with only one host name.
There is certainly room to argue that we should make an annotation
once per IP address, but that'd change the behavior for pretty common
cases like "localhost" resolving to both 127.0.0.1 and ::1, so I don't
know if it'd be a good idea or not. There's also an interaction to
consider with my patch at
/messages/by-id/4913.1533827102@sss.pgh.pa.us
namely that we have to be sure the labeling behaves sanely for
connhosts that fail to resolve any IP addresses.

I'll put this into the September commitfest.

* About the *first* patch

It applies cleanly and compiles.

Alas, global "make check" is ok although there is no change on regression
tests, which suggest that the multiple error message reporting is not
tested anywhere. Should there be at least one test?

Getting rid of somehow ugly "let's save the current message and restore it
later" logic in places looks like a definite improvement.

There are still 86 instances of "printfPQExpBuffer", which seems like a
lot. They are mostly OOM messages, but not only. This make checking the
patch quite cumbersome.

I'd rather there would be only one rule, and clear reset with a comment
when needded (eg "fe-lobj.c").

I agree with your suggestion of adding a function trying to preserve
messages on OOM errors if possible. Maybe "appendPQExpBufferOOM"?

It is unclear to me why the "printf" variant is used in
"PQencryptPasswordConn" and "connectOptions2", it looks that you skipped
these functions.

In passing, some OOM message are rather terse: "out of memory\n", other
are more precise "out of memory allocating GSSAPI buffer (%d)\n".

I do not think it is worth creating a l10n exception on a should not
happen message, eg (there are a few):

  -  /* shouldn't happen */
  -  printfPQExpBuffer(&conn->errorMessage,
  -                    libpq_gettext("invalid SCRAM exchange state\n"));
  +  /* shouldn't happen, so we don't translate the msg */
  +  appendPQExpBufferStr(&conn->errorMessage,
  +                       "invalid SCRAM exchange state\n");

I would have let "libpq_gettext" it as it was.

I'm ok of libpq & server messages are mixed, because I do not see why one
may take precedent on the others, and when some strange & combined errors,
every clue is worthy.

I have not seen anything which raises a "risky" alarm about undesired
consequences, but who knows?

Comments update seem ok. It is possible that some were missed.

* About the second patch:

It applies cleanly on top of previous and compiles. Yet again, global
"make check" is ok although there were no test changes, which just show
the abysmal coverage of psql regression tests.

Should there be at least one test?

The feature answers some of my issues with the "libpq should not look up
all host addresses at once" patch.

* "server \"%s\" port %s:\n"

I do not like the resulting output

server:
problem found
more details

I'd rather have

server: problem found
more details

which would require to append a '\n' at the beginning of the line from the
second attempt.

ISTM that both the hostname and ip should be shown to avoid confusion
about hosts with multiple ips, esp. as ips are given in any order by the
dns.

The patch misses the one server with multiple ips case. ISTM that the
introductory line should be shown in such case as well.

sh> psql "host=local.coelho.net,localhost port=5434"
psql: server "local.coelho.net" port 5434:
...
server "localhost" port 5434:
...

But:

sh> psql "host=local2.coelho.net port=5434"
psql: could not connect to server: Connection refused
...
could not connect to server: Connection refused
...

Even if the hint message in this case is explicit enough, ISTM that the
server line is deserved.

Also for homogeneity, I'd suggest to always add the server line. If the
server introduction is inserted in all cases, including when only one host
is used, hints become partially redundant:

server "local.coelho.net" port 5434:
could not connect to server: Connection refused
Is the server running on host "local.coelho.net" (127.0.0.1) and accepting
TCP/IP connections on port 5434?

This would allow to simplify more hints, which you seem to have done on
"open read-write session" and "SHOW transaction_read_only" but not others.

--
Fabien.

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Fabien COELHO (#4)
Re: libpq should append auth failures, not overwrite

Fabien COELHO <coelho@cri.ensmp.fr> writes:

ISTM that both the hostname and ip should be shown to avoid confusion
about hosts with multiple ips, esp. as ips are given in any order by the
dns.
...
Also for homogeneity, I'd suggest to always add the server line. If the
server introduction is inserted in all cases, including when only one host
is used, hints become partially redundant:
server "local.coelho.net" port 5434:
could not connect to server: Connection refused
Is the server running on host "local.coelho.net" (127.0.0.1) and accepting
TCP/IP connections on port 5434?
This would allow to simplify more hints, which you seem to have done on
"open read-write session" and "SHOW transaction_read_only" but not others.

As I explained in my comments, the reason I did not do these things
is that I didn't want to change the output for cases in which just a
single host name is given. I still don't. People will think it's
change for the sake of change, and they tend not to like that.
The multi-host feature is new enough that I think we can still get away
with changing how errors are reported in those cases ... but what you're
proposing here is to mess with error-reporting behavior that was settled
on decades ago. I'm not really interested in taking the flak that will
come with that.

regards, tom lane

#6Fabien COELHO
coelho@cri.ensmp.fr
In reply to: Tom Lane (#5)
Re: libpq should append auth failures, not overwrite

Hello Tom,

ISTM that both the hostname and ip should be shown to avoid confusion
about hosts with multiple ips, esp. as ips are given in any order by the
dns.
...
Also for homogeneity, I'd suggest to always add the server line. If the
server introduction is inserted in all cases, including when only one host
is used, hints become partially redundant:
server "local.coelho.net" port 5434:
could not connect to server: Connection refused
Is the server running on host "local.coelho.net" (127.0.0.1) and accepting
TCP/IP connections on port 5434?
This would allow to simplify more hints, which you seem to have done on
"open read-write session" and "SHOW transaction_read_only" but not others.

As I explained in my comments, the reason I did not do these things
is that I didn't want to change the output for cases in which just a
single host name is given. I still don't.

Ok, I get your argument when there is just one target server (cluster),
which is probably at least 99.9% of the use case in practice.

However, ISTM multiple ip & multiple hostname look pretty close.

I guess that the number of people that use these features is small, but
for these when things go wrong better messages are useful... so I would
see no harm to trigger the server introductory line when there are
multiples servers, whatever the reason why there are multiples servers.

--
Fabien.

#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: Fabien COELHO (#6)
Re: libpq should append auth failures, not overwrite

Fabien COELHO <coelho@cri.ensmp.fr> writes:

As I explained in my comments, the reason I did not do these things
is that I didn't want to change the output for cases in which just a
single host name is given. I still don't.

Ok, I get your argument when there is just one target server (cluster),
which is probably at least 99.9% of the use case in practice.
However, ISTM multiple ip & multiple hostname look pretty close.
I guess that the number of people that use these features is small, but
for these when things go wrong better messages are useful... so I would
see no harm to trigger the server introductory line when there are
multiples servers, whatever the reason why there are multiples servers.

The thing is that there are a *lot* of systems nowadays on which localhost
maps to both ipv4 and ipv6, so that "host=localhost" would be enough to
trigger the new behavior, and I think that would inevitably give rise to
more complaints than kudos. So I'm still of the opinion that we're better
off with the second patch's behavior as it stands.

Responding to your other points from upthread:

There are still 86 instances of "printfPQExpBuffer", which seems like a
lot. They are mostly OOM messages, but not only. This make checking the
patch quite cumbersome.
I'd rather there would be only one rule, and clear reset with a comment
when needded (eg "fe-lobj.c").

Well, I did this for discussion's sake, but I don't really find the
resulting changes in fe-lobj.c to be any sort of improvement. In
particular, the messiness around suppressing extraneous complaints from
lo_close is still there, if anything worse than before (it's hard to
get rid of because lo_close will reset the buffer). I'd rather just
leave fe-lobj.c alone, possibly adding a header comment explaining why
it's different. Which is basically because none of those functions
expect to be appending multiple errors; they're all gonna quit after
the first one.

I agree with your suggestion of adding a function trying to preserve
messages on OOM errors if possible. Maybe "appendPQExpBufferOOM"?

Yeah, done.

It is unclear to me why the "printf" variant is used in
"PQencryptPasswordConn" and "connectOptions2", it looks that you skipped
these functions.

I did because it seemed like unnecessary extra diff content, since
we're expecting the error buffer to be initially empty anyway (for
connectOptions2) or we'd just need an extra reset (for
PQencryptPasswordConn). In the attached I went ahead and made those
changes, but again I'm unsure that it's really a net improvement.

In passing, some OOM message are rather terse: "out of memory\n", other
are more precise "out of memory allocating GSSAPI buffer (%d)\n".

I got rid of the latter; I think they're unnecessary translation burdens,
and it'd be hard to fit them into a one-size-fits-all OOM reporting
subroutine, and it's pretty unclear what's the point of special-casing
them anyway.

I do not like the resulting output
server:
problem found
more details
I'd rather have
server: problem found
more details

Done in the attached, although I'm concerned about the resulting effects
for error message line length --- in my tests, lines that used to reliably
fit in 80 columns don't anymore. Again, I'm not really convinced this is
an improvement. It would absolutely *not* be an improvement if we tried
to also shoehorn the server's IP address in there; that'd make the lines
way too long, with way too much stuff before the important part.

regards, tom lane

Attachments:

0001-libpq-append-errors-2.patchtext/x-diff; charset=us-ascii; name=0001-libpq-append-errors-2.patchDownload+610-637
0002-libpq-better-server-id-2.patchtext/x-diff; charset=us-ascii; name=0002-libpq-better-server-id-2.patchDownload+34-35
#8Fabien COELHO
coelho@cri.ensmp.fr
In reply to: Tom Lane (#7)
Re: libpq should append auth failures, not overwrite

Hello Tom,

The thing is that there are a *lot* of systems nowadays on which localhost
maps to both ipv4 and ipv6, so that "host=localhost" would be enough to
trigger the new behavior,

I think that people would survive having the ip spelled out on localhost
errors when there are several ips associated to the name.

You could also create an exception for "localhost" if you see it as a
large problem, but if someone redefined localhost somehow (I did that once
by inadvertence), it would be nice to help and be explicit.

and I think that would inevitably give rise to more complaints than
kudos.

Hmmm. I'm not sure about what the complaint could be in case of multiple
ips. "Hey, I have a 'host' with multiple ips, and on errors you tell me
which ip generated an issue, and I do not want to know, really".

So I'm still of the opinion that we're better off with the
second patch's behavior as it stands.

This was a mere suggestion.

As a user, and as a support person for others on occasions, I like precise
error messages which convey all relevant information. In this instance a
key information is hidden, hence the proposal.

Responding to your other points from upthread:

There are still 86 instances of "printfPQExpBuffer", which seems like a
lot. They are mostly OOM messages, but not only. This make checking the
patch quite cumbersome.
I'd rather there would be only one rule, and clear reset with a comment
when needded (eg "fe-lobj.c").

Well, I did this for discussion's sake, but I don't really find the
resulting changes in fe-lobj.c to be any sort of improvement. [...]

The improvement I see is that if there would not be single remaining
printf, so it is easy to check that no case were forgotten in libpq, and
for future changes that everywhere in libpq there is only one rule which
is "append errors to expbuf", not "it depends on the file or function you
are in, please look at it in detail".

It is unclear to me why the "printf" variant is used in
"PQencryptPasswordConn" and "connectOptions2", it looks that you
skipped these functions.

I did because it seemed like unnecessary extra diff content, since
we're expecting the error buffer to be initially empty anyway (for
connectOptions2) or we'd just need an extra reset (for
PQencryptPasswordConn). In the attached I went ahead and made those
changes, but again I'm unsure that it's really a net improvement.

Same as above: easier to check, one simple rule for all libpq errors.

I do not like the resulting output
server:
problem found
more details
I'd rather have
server: problem found
more details

Done in the attached, although I'm concerned about the resulting effects
for error message line length --- in my tests, lines that used to reliably
fit in 80 columns don't anymore. Again, I'm not really convinced this is
an improvement. It would absolutely *not* be an improvement if we tried
to also shoehorn the server's IP address in there; that'd make the lines
way too long, with way too much stuff before the important part.

I understand your concern about line length.

A possible compromise would be to newline *and* indent before "problem
found". To avoid messy code, there could be an internal function to manage
that, eg.

appendErrorContext(...), for "server:"
appendError(exbuf, fmt, ...) , for errors, which would indent the
initial line by two spaces.

server1:
problem found
details
server2:
problem found
details

This would also give room for the ip on a 80-column server line.

The alternative is that the committer does as they want, which is also
fine with me:-)

--
Fabien.

#9Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#1)
Re: libpq should append auth failures, not overwrite

On Thu, Aug 9, 2018 at 11:44 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I noticed that, although most error reports during libpq's connection
setup code append to conn->errorMessage, the ones in fe-auth.c and
fe-auth-scram.c don't: they're all printfPQExpBuffer() not
appendPQExpBuffer(). This seems wrong to me. It makes no difference
in simple cases with a single target server, but as soon as you have
multiple servers listed in "host", this coding makes it impossible
to tell which server rejected your login.

So I think we should basically s/printfPQExpBuffer/appendPQExpBuffer/g
anywhere those files touch conn->errorMessage, allowing any problems
with previous servers to be preserved in the eventually-reported message.
I won't bother posting an actual patch for that right now, but has anyone
got an objection?

I guess the question in my mind is what behavior is most useful for
users. There are, I'm sure, cases where it's useful to see all the
details you are proposing to print. But it also seems like there
could be a significant number of cases where it's really more than
anybody wants to know. For instance, if I try to connect
host=primary1,primary2,dr and primary1 is offline right now, because
we're using primary2, and if it happens that I mistype my password,
with your patch, I'll get an error saying that primary1 is down,
followed by an error about my password. I only care about the second
one. If I had typed my password correctly, I would have just gotten a
connection, and everything would have been fine. Telling me that
primary1 is down may make me (1) panic or (2) miss the real cause of
my problem.

So I'm not as convinced as you are that always displaying maximum
detail is really the right thing to do.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#10Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#9)
Re: libpq should append auth failures, not overwrite

Robert Haas <robertmhaas@gmail.com> writes:

On Thu, Aug 9, 2018 at 11:44 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

So I think we should basically s/printfPQExpBuffer/appendPQExpBuffer/g
anywhere those files touch conn->errorMessage, allowing any problems
with previous servers to be preserved in the eventually-reported message.

I guess the question in my mind is what behavior is most useful for
users. There are, I'm sure, cases where it's useful to see all the
details you are proposing to print. But it also seems like there
could be a significant number of cases where it's really more than
anybody wants to know. For instance, if I try to connect
host=primary1,primary2,dr and primary1 is offline right now, because
we're using primary2, and if it happens that I mistype my password,
with your patch, I'll get an error saying that primary1 is down,
followed by an error about my password. I only care about the second
one. If I had typed my password correctly, I would have just gotten a
connection, and everything would have been fine. Telling me that
primary1 is down may make me (1) panic or (2) miss the real cause of
my problem.

So I'm not as convinced as you are that always displaying maximum
detail is really the right thing to do.

Well, I'm actually not proposing to print "maximum detail", and Fabien
is complaining about that, which makes me think maybe I've hit a happy
medium ;-). In particular, the proposed patches won't change behavior
for cases where you just give one hostname or hostaddr, and I'd argue
that that is the only case where we really have enough track record to
be sure that people are happy with the amount of detail provided.
As soon as you have multiple target hosts, though, the current code's
behavior is inadequate IMO. Indisputably it's inconsistent, because
some code paths will concatenate error messages and some won't, without
any rhyme or reason that I can detect. I think the only reason we've
not had more complaints is that hardly anyone is using this feature yet.

regards, tom lane

#11Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#10)
Re: libpq should append auth failures, not overwrite

On Wed, Aug 15, 2018 at 11:53 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Well, I'm actually not proposing to print "maximum detail", and Fabien
is complaining about that, which makes me think maybe I've hit a happy
medium ;-). In particular, the proposed patches won't change behavior
for cases where you just give one hostname or hostaddr, and I'd argue
that that is the only case where we really have enough track record to
be sure that people are happy with the amount of detail provided.
As soon as you have multiple target hosts, though, the current code's
behavior is inadequate IMO.

I'm not entirely convinced; see the example I posted before.

Indisputably it's inconsistent, because
some code paths will concatenate error messages and some won't, without
any rhyme or reason that I can detect. I think the only reason we've
not had more complaints is that hardly anyone is using this feature yet.

I agree that inconsistency is bad as a rule. :-)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#12Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#11)
Re: libpq should append auth failures, not overwrite

Robert Haas <robertmhaas@gmail.com> writes:

On Wed, Aug 15, 2018 at 11:53 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

As soon as you have multiple target hosts, though, the current code's
behavior is inadequate IMO.

I'm not entirely convinced; see the example I posted before.

TBH I find your example to be the exact opposite of convincing.
You've cherry-picked a case where the current behavior tells you
what you need to know and not anything you don't, but very small
variations on the case make that not hold anymore. Remember the
complaint we started with, which was that if you get a bad-password
error it's not apparent which host gave that response. If, in
fact, the hosts you've listed don't all use the same password,
you could be tearing your hair out for quite awhile before you
figure out what's going wrong.

In any case, I'm a little bit confused as to why someone who'd
listed multiple hosts would be surprised to see that the connection
had in fact fallen back to not-the-first host.

regards, tom lane

#13Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#12)
Re: libpq should append auth failures, not overwrite

On Wed, Aug 15, 2018 at 12:05 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

TBH I find your example to be the exact opposite of convincing.
You've cherry-picked a case where the current behavior tells you
what you need to know and not anything you don't, but very small
variations on the case make that not hold anymore. Remember the
complaint we started with, which was that if you get a bad-password
error it's not apparent which host gave that response. If, in
fact, the hosts you've listed don't all use the same password,
you could be tearing your hair out for quite awhile before you
figure out what's going wrong.

That seems like a pretty unlikely use case, though. It seems to me
that the virtue of the feature is in letting you connect to one of a
number of hosts that are basically all equivalent. If they're not all
equivalent, how come you're OK with connecting to any one of them?
It's of course not impossible for somebody to have databases with
equivalent contents but different authentication, but that sounds like
some kind of confusing fringe thing to me. My guess -- and I think
this is what underlay the original coding, though maybe I didn't get
all the details right -- is that if your connection fails to reach any
server, you want to know the details as to why that failed for each
server in the list, but that if your connection fails for some reason
other than not being able to reach the server at all, like a missing
role name or password, you just want to know about that, and you don't
really care about the servers to which you failed to connect
altogether. Saying host=a,b,c basically means exactly that: I don't
really care about failure to connect to a or even a and b; that's not
an error condition. Of course if I can't connect to any of them --
that's still got to be an error.

In any case, I'm a little bit confused as to why someone who'd
listed multiple hosts would be surprised to see that the connection
had in fact fallen back to not-the-first host.

In a perfect world where users never misinterpret error messages, they
wouldn't, but that is not a world I have the privilege of inhabiting.
Focusing the error on the thing that is probably what the user needs
to fix is a very sensible decision, I think. We could print out a
full stack backtrace, a memory context dump, and a node tree display
of the plan every time a query errors out, but most of the time, that
would be a truly excessive amount of information that would distract
from, say, the fact that you divided by zero, or whatever. Brevity
has value.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#14Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#13)
Re: libpq should append auth failures, not overwrite

Robert Haas <robertmhaas@gmail.com> writes:

On Wed, Aug 15, 2018 at 12:05 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

TBH I find your example to be the exact opposite of convincing.

That seems like a pretty unlikely use case, though. It seems to me
that the virtue of the feature is in letting you connect to one of a
number of hosts that are basically all equivalent. If they're not all
equivalent, how come you're OK with connecting to any one of them?

I'm still unpersuaded. The point you're ignoring is that the *intention*
might be that the hosts are equivalent, but the *reality* might be that
they're not, and the difference very likely is exactly why your connection
failed. Password not what you expected, user or database not there at
all, etc. In such situations, an error message that doesn't give you
enough info to realize where the problem really lies is very unfriendly.

Also, if I buy this line of argument, I fail to see why the existing
messages that explicitly mention which host we didn't connect to (because
of problems with the read-write session test) are like that. How is it
that those need to be labeled as to host, and concatenated, but other
sorts of failures don't? I think the author(s) of that patch understood
the problem perfectly well, but were too lazy or cowardly to fix it other
than in code they were adding. I'm here to clean that up.

In a perfect world where users never misinterpret error messages, they
wouldn't, but that is not a world I have the privilege of inhabiting.

Me neither, but a message that fails to provide critical context
is much more susceptible to misinterpretation than one that doesn't.

regards, tom lane

#15Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#14)
Re: libpq should append auth failures, not overwrite

On Wed, Aug 15, 2018 at 12:46 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I think the author(s) of that patch understood
the problem perfectly well, but were too lazy or cowardly to fix it other
than in code they were adding.

I think this is an ad hominum attack.

I have explained some factors that are relevant from my point of view,
and that explain some of the behavior you don't like. I have already
said several times that the implementation might not be up to the
mark, and I apologize for that. If you want to interpret the current
state of affairs as being 0% attributable to a difference of opinion
between reasonable people about what the behavior ought to be, and
100% due to laziness or cowardliness on my part and the part of the
other people involved in these patches, I cannot prevent you from
doing so. However, I think that judgement is inaccurate. More
importantly, I think it's a nasty sentiment which is inappropriate on
this mailing list or in any other civilized discourse.

If you think that the code is bad, you can say that without commenting
on my character.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#16Michael Paquier
michael@paquier.xyz
In reply to: Fabien COELHO (#8)
Re: libpq should append auth failures, not overwrite

On Tue, Aug 14, 2018 at 02:17:09PM +0200, Fabien COELHO wrote:

I think that people would survive having the ip spelled out on localhost
errors when there are several ips associated to the name.

You could also create an exception for "localhost" if you see it as a large
problem, but if someone redefined localhost somehow (I did that once by
inadvertence), it would be nice to help and be explicit.

The last status of this patch comes from August, which has been
unanswered. So I am marking the patch as returned with feedback. The
patch set does not apply cleanly by the way.
--
Michael