libpq maligning postgres stability

Started by Andres Freund, 10 months ago, 4 messages
#1 Andres Freund
andres@anarazel.de

Hi,

We have several places in libpq where libpq says that a connection closing is
probably due to a server crash with a message like:

server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing

I think this is rather unhelpful, at least these days. There are a lot of
reasons the connection could have failed, the server having terminated
abnormally is just one of them.

It's common to see this due to network issues, for example. I've quite a few
times fielded worried questions of postgres users due to the message.

The reason I was looking at this message just now was a discussion of CI
failures on Windows [1], which were likely caused by the known issue of
Windows occasionally swallowing the server's last messages before the backend
exits (more detail e.g. in [2]). Because of the message, it's easy to wrongly
conclude that the failure was caused by a postgres crash, rather than by not
receiving the expected FATAL.

And we don't even just add this message when the connection was actually
closed unexpectedly, we often do it even when we *did* get a FATAL, as in this
example:

psql -c 'select pg_terminate_backend(pg_backend_pid())'
FATAL: 57P01: terminating connection due to administrator command
LOCATION: ProcessInterrupts, postgres.c:3351
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
connection to server was lost

I think this one is mostly a weakness in how libpq tracks connection state,
but it kind of shows the silliness of claiming postgres probably crashed.

Greetings,

Andres Freund

[1]: Via Bilal:

4 of the failures on the front page are related to Windows:
https://cirrus-ci.com/build/4878370632105984
https://cirrus-ci.com/build/5063665856020480
https://cirrus-ci.com/build/4636858312818688
https://cirrus-ci.com/build/6385762419081216

[2]: /messages/by-id/CA+hUKGLR10ZqRCvdoRrkQusq75wF5=vEetRSs2_u1s+FAUosFQ@mail.gmail.com

#2 Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#1)
Re: libpq maligning postgres stability

On Thu, Mar 27, 2025 at 11:19 AM Andres Freund <andres@anarazel.de> wrote:

> We have several places in libpq where libpq says that a connection closing is
> probably due to a server crash with a message like:
>
> server closed the connection unexpectedly
> This probably means the server terminated abnormally
> before or while processing
>
> I think this is rather unhelpful, at least these days. There are a lot of
> reasons the connection could have failed, the server having terminated
> abnormally is just one of them.
>
> It's common to see this due to network issues, for example. I've quite a few
> times fielded worried questions of postgres users due to the message.

Yeah, I agree. I used to think this hint was helpful, but it's gotten
less helpful as the years have passed, because the server is more
stable these days. Another thing that can cause this (as discussed in
Discord) is that the individual backend process can have died, but not
the server as a whole. In that case, the hint is only accurate if you
read "server" to mean your individual server process.

I wonder if, in addition to removing the hint, we could also consider
rewording the message. For example, a slight rewording to "server
connection closed unexpectedly" would avoid implying that it was the
server that took action, which is correct, because it could be a
firewall in between the machines or even security software on the
client side. Maybe there is some more dramatic rewording that is even
better, but there's probably some value in keeping it similar to what
people are used to seeing.

--
Robert Haas
EDB: http://www.enterprisedb.com

#3 Christoph Berg
myon@debian.org
In reply to: Robert Haas (#2)
Re: libpq maligning postgres stability

Re: Robert Haas

> I wonder if, in addition to removing the hint, we could also consider
> rewording the message. For example, a slight rewording to "server
> connection closed unexpectedly" would avoid implying that it was the

There is a lot of software doing string-parsing of this part of the
message, so it might be advisable to leave the first line alone.

https://sources.debian.org/src/php-laravel-framework/10.48.25+dfsg-2/src/Illuminate/Database/DetectsLostConnections.php/?hl=28#L28
https://sources.debian.org/src/python-taskflow/5.9.1-4/taskflow/persistence/backends/impl_sqlalchemy.py/?hl=87#L87
https://sources.debian.org/src/gnucash/1:5.10-0.1/libgnucash/backend/dbi/gnc-backend-dbi.cpp/?hl=798#L798
https://sources.debian.org/src/pgbouncer/1.24.0-3/test/test_misc.py/?hl=301#L301
https://sources.debian.org/src/icingaweb2-module-reporting/1.0.2-2/library/Reporting/RetryConnection.php/?hl=23#L23
https://sources.debian.org/src/storm/1.0-1/storm/databases/postgres.py/?hl=353#L353
https://sources.debian.org/src/timescaledb/2.19.0+dfsg-1/test/expected/loader-tsl.out/?hl=473#L473
https://sources.debian.org/src/odoo/18.0.0+dfsg-2/addons/web/tests/test_db_manager.py/?hl=277#L277

https://codesearch.debian.net/search?q=server+closed+the+connection+unexpectedly&literal=1
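
As a rough illustration of the kind of check such software performs, here is
a minimal C sketch. The helper name looks_like_lost_connection is
hypothetical and the code is not taken from any of the linked projects; it
only assumes a client that matches the first line of the libpq message as a
literal substring.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* First line of the libpq message quoted in this thread. */
static const char *const LOST_CONNECTION_TEXT =
    "server closed the connection unexpectedly";

/*
 * Hypothetical helper: returns true when an error message contains the
 * exact libpq wording. This mirrors the general idea of a substring check
 * in client code; it is an assumption for illustration, not a copy of any
 * project's implementation.
 */
bool
looks_like_lost_connection(const char *errmsg)
{
    if (errmsg == NULL)
        return false;
    return strstr(errmsg, LOST_CONNECTION_TEXT) != NULL;
}
```

With a check like this, the current multi-line message still matches because
only the first line is compared, while a reworded first line such as "server
connection closed unexpectedly" would no longer be detected. Changing only
the hint lines would leave such callers unaffected.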

(There might be room for asking why this string parsing is being done,
is libpq missing "connection lost" detection vs. other errors?)

The remaining message lines are admittedly very pessimistic about
PostgreSQL's stability and should mention networking issues first.

Christoph

#4 Bruce Momjian
bruce@momjian.us
In reply to: Robert Haas (#2)
Re: libpq maligning postgres stability

On Thu, Mar 27, 2025 at 11:48:26AM -0400, Robert Haas wrote:

> On Thu, Mar 27, 2025 at 11:19 AM Andres Freund <andres@anarazel.de> wrote:
>
> > We have several places in libpq where libpq says that a connection closing is
> > probably due to a server crash with a message like:
> >
> > server closed the connection unexpectedly
> > This probably means the server terminated abnormally
> > before or while processing
> >
> > I think this is rather unhelpful, at least these days. There are a lot of
> > reasons the connection could have failed, the server having terminated
> > abnormally is just one of them.
> >
> > It's common to see this due to network issues, for example. I've quite a few
> > times fielded worried questions of postgres users due to the message.
>
> Yeah, I agree. I used to think this hint was helpful, but it's gotten
> less helpful as the years have passed, because the server is more
> stable these days. Another thing that can cause this (as discussed in
> Discord) is that the individual backend process can have died, but not
> the server as a whole. In that case, the hint is only accurate if you
> read "server" to mean your individual server process.
>
> I wonder if, in addition to removing the hint, we could also consider
> rewording the message. For example, a slight rewording to "server
> connection closed unexpectedly" would avoid implying that it was the
> server that took action, which is correct, because it could be a
> firewall in between the machines or even security software on the
> client side. Maybe there is some more dramatic rewording that is even
> better, but there's probably some value in keeping it similar to what
> people are used to seeing.

FYI, I researched these messages in 2023 to see if the message could be
adjusted based on the code line generating it, but reached no
conclusion:

/messages/by-id/CA+TgmoZYvqmyQpzSUdtDmtk4Aj94MppDGe9qVJczbPLy4G2Yfg@mail.gmail.com

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

Do not let urgent matters crowd out time for investment in the future.