Expansion of our checks for connection-loss errors

Started by Tom Laneover 5 years ago7 messageshackers
Jump to latest
#1Tom Lane
tgl@sss.pgh.pa.us

Over in the thread at [1]/messages/by-id/E1kPc9v-0005L4-2l@gemulon.postgresql.org, we've tentatively determined that the
reason buildfarm member lorikeet is currently failing is that its
network stack returns ECONNABORTED for (some?) connection failures,
whereas our code is only expecting ECONNRESET. Fujii Masao therefore
proposes that we treat ECONNABORTED the same as ECONNRESET. I think
this is a good idea, but after a bit of research I feel it does not
go far enough. I find these POSIX-standard errnos that also seem
likely candidates to be returned for a hard loss of connection:

ECONNABORTED
EHOSTUNREACH
ENETDOWN
ENETUNREACH

All of these have been in POSIX since SUSv2, so it seems unlikely
that we need to #ifdef any of them. (It is in any case pretty silly
that we have #ifdefs around a very small minority of our references
to ECONNRESET :-(.)

There are some other related errnos, such as ECONNREFUSED, that
don't seem like they'd be returned for a failure of a pre-existing
connection, so we don't need to include them in such tests.

Accordingly, I propose the attached patch (an expansion of
Fujii-san's) that causes us to test for all five errnos anyplace
we had been checking for ECONNRESET. I felt that this was getting to
the point where we'd better centralize the knowledge of what to check,
so the patch does that, via an inline function and an admittedly hacky
macro. I also upgraded some places such as strerror.c to have full
support for these symbols.

All of the machines I have (even as far back as HPUX 10.20) also
define ENETRESET and EHOSTDOWN. However, those symbols do not appear
in SUSv2. ENETRESET was added at some later point, but EHOSTDOWN is
still not in POSIX. For the moment I've left these second-tier
symbols out of the patch, but there's a case for adding them. I'm
not sure whether there'd be any point in trying to #ifdef them.

BTW, I took out the conditional defines of some of these errnos in
libpq's win32.h; AFAICS that's been dead code ever since we added
#define's for them to win32_port.h. Am I missing something?

This seems like a bug fix to me, so I'm inclined to back-patch.

regards, tom lane

[1]: /messages/by-id/E1kPc9v-0005L4-2l@gemulon.postgresql.org

Attachments:

recognize-more-connection-loss-errnos-1.patchtext/x-diff; charset=us-ascii; name=recognize-more-connection-loss-errnos-1.patchDownload+49-37
#2Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Tom Lane (#1)
Re: Expansion of our checks for connection-loss errors

At Thu, 08 Oct 2020 15:15:54 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in

Over in the thread at [1], we've tentatively determined that the
reason buildfarm member lorikeet is currently failing is that its
network stack returns ECONNABORTED for (some?) connection failures,
whereas our code is only expecting ECONNRESET. Fujii Masao therefore
proposes that we treat ECONNABORTED the same as ECONNRESET. I think
this is a good idea, but after a bit of research I feel it does not
go far enough. I find these POSIX-standard errnos that also seem
likely candidates to be returned for a hard loss of connection:

ECONNABORTED
EHOSTUNREACH
ENETDOWN
ENETUNREACH

All of these have been in POSIX since SUSv2, so it seems unlikely
that we need to #ifdef any of them. (It is in any case pretty silly
that we have #ifdefs around a very small minority of our references
to ECONNRESET :-(.)

There are some other related errnos, such as ECONNREFUSED, that
don't seem like they'd be returned for a failure of a pre-existing
connection, so we don't need to include them in such tests.

Accordingly, I propose the attached patch (an expansion of
Fujii-san's) that causes us to test for all five errnos anyplace
we had been checking for ECONNRESET. I felt that this was getting to
the point where we'd better centralize the knowledge of what to check,
so the patch does that, via an inline function and an admittedly hacky
macro. I also upgraded some places such as strerror.c to have full
support for these symbols.

All of the machines I have (even as far back as HPUX 10.20) also
define ENETRESET and EHOSTDOWN. However, those symbols do not appear
in SUSv2. ENETRESET was added at some later point, but EHOSTDOWN is
still not in POSIX. For the moment I've left these second-tier
symbols out of the patch, but there's a case for adding them. I'm
not sure whether there'd be any point in trying to #ifdef them.

BTW, I took out the conditional defines of some of these errnos in
libpq's win32.h; AFAICS that's been dead code ever since we added
#define's for them to win32_port.h. Am I missing something?

This seems like a bug fix to me, so I'm inclined to back-patch.

regards, tom lane

[1] /messages/by-id/E1kPc9v-0005L4-2l@gemulon.postgresql.org

+1 for the direction.

In terms of connection errors, connect(2) and bind(2) can return
EADDRNOTAVAIL. bind(2) and listen(2) can return EADDRINUSE. FWIW I
recetnly saw pgbench getting EADDRNOTAVAIL. (They have mapping from
respective WSA errors in TranslateSocketError())

I'm not sure how we should treat EMFILE/ENFILE/ENOBUFS/ENOMEM from
accept(2). (select(2) can return ENOMEM.)

I'd make errno_is_connection_loss use ALL_CONNECTION_LOSS_ERRNOS to
avoid duplication definition of the errno list.

-	if (ret < 0 && WSAGetLastError() == WSAECONNRESET)
+	if (ret < 0 && errno_is_connection_loss(WSAGetLastError()))

Don't we need to use TranslateSocketError() before?

+		/* We might get ECONNRESET etc here if using TCP and backend died */
+		if (errno_is_connection_loss(SOCK_ERRNO))

Perhaps I'm confused but SOCK_ERROR doesn't seem portable between
Windows and Linux.

=====
/*
* These macros are needed to let error-handling code be portable between
* Unix and Windows. (ugh)
*/
#ifdef WIN32
#define SOCK_ERRNO (WSAGetLastError())
#define SOCK_STRERROR winsock_strerror
#define SOCK_ERRNO_SET(e) WSASetLastError(e)
#else
#define SOCK_ERRNO errno
#define SOCK_STRERROR strerror_r
#define SOCK_ERRNO_SET(e) (errno = (e))
#endif
=====

AFAICS SOCK_ERRNO is intended to be used idiomatically as:

SOCK_STRERROR(SOCK_ERRNO, ...)

The WSAE values from WSAGetLastError() and E values in errno are not
compatible and needs translation by TranslateSocketError()?

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Kyotaro Horiguchi (#2)
Re: Expansion of our checks for connection-loss errors

Kyotaro Horiguchi <horikyota.ntt@gmail.com> writes:

At Thu, 08 Oct 2020 15:15:54 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in

Accordingly, I propose the attached patch (an expansion of
Fujii-san's) that causes us to test for all five errnos anyplace
we had been checking for ECONNRESET.

+1 for the direction.

In terms of connection errors, connect(2) and bind(2) can return
EADDRNOTAVAIL. bind(2) and listen(2) can return EADDRINUSE. FWIW I
recetnly saw pgbench getting EADDRNOTAVAIL. (They have mapping from
respective WSA errors in TranslateSocketError())

I do not think we have any issues with connection-time errors;
or at least, if we do, the spots being touched here certainly
shouldn't need to worry about them. These places are dealing
with already-established connections.

I'd make errno_is_connection_loss use ALL_CONNECTION_LOSS_ERRNOS to
avoid duplication definition of the errno list.

Hmm, might be worth doing, but I'm not sure. I am worried about
whether compilers will generate equally good code that way.

-	if (ret < 0 && WSAGetLastError() == WSAECONNRESET)
+	if (ret < 0 && errno_is_connection_loss(WSAGetLastError()))

Don't we need to use TranslateSocketError() before?

Oh, I missed that. But:

Perhaps I'm confused but SOCK_ERROR doesn't seem portable between
Windows and Linux.

In that case, nothing would have worked on Windows for the last
ten years, so you're mistaken. I think the actual explanation
why this works, and why that test in parallel.c probably still
works even with my mistake, is that win32_port.h makes sure that
our values of ECONNRESET etc match WSAECONNRESET etc.

IOW, we'd not actually need TranslateSocketError at all, except
that it maps some not-similarly-named error codes for conditions
that don't exist in Unix into ones that do. We probably do want
TranslateSocketError in this parallel.c test so that anything that
it maps to one of the errno_is_connection_loss codes will be
recognized; but the basic cases would work anyway, unless I
misunderstand this stuff entirely.

regards, tom lane

#4Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Tom Lane (#3)
Re: Expansion of our checks for connection-loss errors

At Thu, 08 Oct 2020 21:41:55 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in

Kyotaro Horiguchi <horikyota.ntt@gmail.com> writes:

At Thu, 08 Oct 2020 15:15:54 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in

Accordingly, I propose the attached patch (an expansion of
Fujii-san's) that causes us to test for all five errnos anyplace
we had been checking for ECONNRESET.

+1 for the direction.

In terms of connection errors, connect(2) and bind(2) can return
EADDRNOTAVAIL. bind(2) and listen(2) can return EADDRINUSE. FWIW I
recetnly saw pgbench getting EADDRNOTAVAIL. (They have mapping from
respective WSA errors in TranslateSocketError())

I do not think we have any issues with connection-time errors;
or at least, if we do, the spots being touched here certainly
shouldn't need to worry about them. These places are dealing
with already-established connections.

errcode_for_socket_access() is called for connect, bind and lesten but
I understand we don't consider the case since we don't have an actual
issue related to the functions.

I'd make errno_is_connection_loss use ALL_CONNECTION_LOSS_ERRNOS to
avoid duplication definition of the errno list.

Hmm, might be worth doing, but I'm not sure. I am worried about
whether compilers will generate equally good code that way.

The two are placed side-by-side so either will do for me.

-	if (ret < 0 && WSAGetLastError() == WSAECONNRESET)
+	if (ret < 0 && errno_is_connection_loss(WSAGetLastError()))

Don't we need to use TranslateSocketError() before?

Oh, I missed that. But:

Perhaps I'm confused but SOCK_ERROR doesn't seem portable between
Windows and Linux.

In that case, nothing would have worked on Windows for the last
ten years, so you're mistaken. I think the actual explanation
why this works, and why that test in parallel.c probably still
works even with my mistake, is that win32_port.h makes sure that
our values of ECONNRESET etc match WSAECONNRESET etc.

Mmmmmmmmmm. Sure.

IOW, we'd not actually need TranslateSocketError at all, except
that it maps some not-similarly-named error codes for conditions
that don't exist in Unix into ones that do. We probably do want
TranslateSocketError in this parallel.c test so that anything that
it maps to one of the errno_is_connection_loss codes will be
recognized; but the basic cases would work anyway, unless I
misunderstand this stuff entirely.

Yeah, that seems to work.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#5Fujii Masao
masao.fujii@gmail.com
In reply to: Tom Lane (#1)
Re: Expansion of our checks for connection-loss errors

On 2020/10/09 4:15, Tom Lane wrote:

Over in the thread at [1], we've tentatively determined that the
reason buildfarm member lorikeet is currently failing is that its
network stack returns ECONNABORTED for (some?) connection failures,
whereas our code is only expecting ECONNRESET. Fujii Masao therefore
proposes that we treat ECONNABORTED the same as ECONNRESET. I think
this is a good idea, but after a bit of research I feel it does not
go far enough. I find these POSIX-standard errnos that also seem
likely candidates to be returned for a hard loss of connection:

ECONNABORTED
EHOSTUNREACH
ENETDOWN
ENETUNREACH

All of these have been in POSIX since SUSv2, so it seems unlikely
that we need to #ifdef any of them. (It is in any case pretty silly
that we have #ifdefs around a very small minority of our references
to ECONNRESET :-(.)

There are some other related errnos, such as ECONNREFUSED, that
don't seem like they'd be returned for a failure of a pre-existing
connection, so we don't need to include them in such tests.

Accordingly, I propose the attached patch (an expansion of
Fujii-san's) that causes us to test for all five errnos anyplace
we had been checking for ECONNRESET.

+1

Thanks for expanding the patch!

-#ifdef ECONNRESET
- case ECONNRESET:
+ case ALL_CONNECTION_LOSS_ERRNOS:
printfPQExpBuffer(&conn->errorMessage,
libpq_gettext("server closed the connection unexpectedly\n"
"\tThis probably means the server terminated abnormally\n"
"\tbefore or while processing the request.\n"));

This change causes the same error message to be reported for those five errno.
That is, we cannot identify which errno is actually reported, from the error
message. But I just wonder if it's more helpful for the troubleshooting if we,
for example, append strerror() into the message so that we can easily
identify errno. Thought?

I felt that this was getting to
the point where we'd better centralize the knowledge of what to check,
so the patch does that, via an inline function and an admittedly hacky
macro. I also upgraded some places such as strerror.c to have full
support for these symbols.

All of the machines I have (even as far back as HPUX 10.20) also
define ENETRESET and EHOSTDOWN. However, those symbols do not appear
in SUSv2. ENETRESET was added at some later point, but EHOSTDOWN is
still not in POSIX. For the moment I've left these second-tier
symbols out of the patch, but there's a case for adding them. I'm
not sure whether there'd be any point in trying to #ifdef them.

BTW, I took out the conditional defines of some of these errnos in
libpq's win32.h; AFAICS that's been dead code ever since we added
#define's for them to win32_port.h. Am I missing something?

This seems like a bug fix to me, so I'm inclined to back-patch.

+1

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Fujii Masao (#5)
Re: Expansion of our checks for connection-loss errors

Fujii Masao <masao.fujii@oss.nttdata.com> writes:

On 2020/10/09 4:15, Tom Lane wrote:

-#ifdef ECONNRESET
- case ECONNRESET:
+ case ALL_CONNECTION_LOSS_ERRNOS:
printfPQExpBuffer(&conn->errorMessage,
libpq_gettext("server closed the connection unexpectedly\n"
"\tThis probably means the server terminated abnormally\n"
"\tbefore or while processing the request.\n"));

This change causes the same error message to be reported for those five errno.
That is, we cannot identify which errno is actually reported, from the error
message. But I just wonder if it's more helpful for the troubleshooting if we,
for example, append strerror() into the message so that we can easily
identify errno. Thought?

Hmm, excellent point. While our code response to all these errors
should be the same, you are right that that doesn't extend to emitting
identical error texts. For EHOSTUNREACH/ENETDOWN/ENETUNREACH, we
should say something like "connection to server lost", without claiming
that the server crashed. It is less clear what to do with ECONNABORTED,
but I'm inclined to put it in the network-problem bucket not the
server-crash bucket, despite lorikeet's behavior. Thoughts?

This also destroys the patch's idea that switch statements should be
able to handle these all alike. If we group things as "ECONNRESET means
server crash and the others are all network failures", then I'd be
inclined to leave the ECONNRESET cases alone and just introduce
new infrastructure to recognize all the network-failure errnos.

regards, tom lane

#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#6)
Re: Expansion of our checks for connection-loss errors

I wrote:

Hmm, excellent point. While our code response to all these errors
should be the same, you are right that that doesn't extend to emitting
identical error texts. For EHOSTUNREACH/ENETDOWN/ENETUNREACH, we
should say something like "connection to server lost", without claiming
that the server crashed. It is less clear what to do with ECONNABORTED,
but I'm inclined to put it in the network-problem bucket not the
server-crash bucket, despite lorikeet's behavior. Thoughts?

This also destroys the patch's idea that switch statements should be
able to handle these all alike. If we group things as "ECONNRESET means
server crash and the others are all network failures", then I'd be
inclined to leave the ECONNRESET cases alone and just introduce
new infrastructure to recognize all the network-failure errnos.

Actually, treating it that way seems like a good thing because it nets
out as (nearly) no change to our error message behavior. The connection
failure errnos fall through to the default case, which produces a
perfectly reasonable report that includes strerror(). The only big thing
we're changing is the set of errnos that errcode_for_socket_access will
map to ERRCODE_CONNECTION_FAILURE, so this is spiritually closer to your
original patch.

Some other changes in the attached v2:

* I incorporated Kyotaro-san's suggested improvements.

* I went ahead and included ENETRESET and EHOSTDOWN, figuring that
if they exist we definitely want to class them as network failures.
We can worry about ifdef'ing them when and if we find a platform
that hasn't got them. (I don't see any non-ugly way to make the
ALL_NETWORK_FAILURE_ERRNOS macro vary for missing symbols, so I'd
rather not deal with that unless it's proven necessary.)

* I noticed that we were not terribly consistent about whether
EPIPE is regarded as indicating a server failure like ECONNRESET
does. So this patch also makes sure that EPIPE is treated like
ECONNRESET everywhere. (Hence, pqsecure_raw_read's error reporting
does change, since it'll now report EPIPE as server failure.)

I lack a way to test this on Windows, but otherwise it feels
like it's about ready.

regards, tom lane

Attachments:

recognize-more-connection-loss-errnos-2.patchtext/x-diff; charset=us-ascii; name=recognize-more-connection-loss-errnos-2.patchDownload+70-36