Misleading panic message in backend/access/transam/xlog.c

Started by Michael Banckabout 7 years ago7 messages
#1Michael Banck
michael.banck@credativ.de

Hi,

there was a report in #postgresql recently about a crash on Google Cloud
SQL with the somewhat misleading message "could not write to log file"
while in fact it was the xlog/wal:

|PANIC: could not write to log file 000000010000019600000054 at offset
| 13279232, length 245760: Cannot allocate memory 
|ERROR: could not write block 74666 in file "base/18031/48587": Cannot
| allocate memory 
|CONTEXT: writing block 74666 of relation base/18031/48587
|LOG: server process (PID 5160) was terminated by signal 9: Killed

The slightly longer logfile can be found here: http://dpaste.com/2T61PS9

I suggest to reword that message, e.g. "could not write to transaction
log file" or "could not write to wal file".

Also, the errno (ENOMEM) is curious (and the user wrote that Google
monitoring reported memory at 16/20GB at the time of the crash), but it
could be a due to running on a cloud-fork? As you have no access to
PGDATA, it sounds difficult to diagnose after the fact.

Michael

--
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax: +49 2166 9901-100
Email: michael.banck@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz

#2Magnus Hagander
magnus@hagander.net
In reply to: Michael Banck (#1)
Re: Misleading panic message in backend/access/transam/xlog.c

On Wed, Jan 9, 2019 at 12:06 PM Michael Banck <michael.banck@credativ.de>
wrote:

Hi,

there was a report in #postgresql recently about a crash on Google Cloud
SQL with the somewhat misleading message "could not write to log file"
while in fact it was the xlog/wal:

|PANIC: could not write to log file 000000010000019600000054 at offset
| 13279232, length 245760: Cannot allocate memory
|ERROR: could not write block 74666 in file "base/18031/48587": Cannot
| allocate memory
|CONTEXT: writing block 74666 of relation base/18031/48587
|LOG: server process (PID 5160) was terminated by signal 9: Killed

The slightly longer logfile can be found here: http://dpaste.com/2T61PS9

I suggest to reword that message, e.g. "could not write to transaction
log file" or "could not write to wal file".

Given the change xlog -> wal, I would suggest "could not write to wal file"
as the correct option there.

And +1 for rewording it. I think there are also some other messages like it
that needs to be changed, and also things like

(errmsg("restored log file \"%s\" from archive"

could do with an update.

Also, the errno (ENOMEM) is curious (and the user wrote that Google

monitoring reported memory at 16/20GB at the time of the crash), but it
could be a due to running on a cloud-fork? As you have no access to
PGDATA, it sounds difficult to diagnose after the fact.

Yeah, nobody knows what Google has done in their fork *or* how they
actually measure those things, so without a repro I think that's hard..

//Magnus

#3Andres Freund
andres@anarazel.de
In reply to: Michael Banck (#1)
Re: Misleading panic message in backend/access/transam/xlog.c

Hi,

On 2019-01-09 12:06:39 +0100, Michael Banck wrote:

there was a report in #postgresql recently about a crash on Google Cloud
SQL with the somewhat misleading message "could not write to log file"
while in fact it was the xlog/wal:

|PANIC: could not write to log file 000000010000019600000054 at offset
| 13279232, length 245760: Cannot allocate memory�
|ERROR: could not write block 74666 in file "base/18031/48587": Cannot
| allocate memory�
|CONTEXT: writing block 74666 of relation base/18031/48587
|LOG: server process (PID 5160) was terminated by signal 9: Killed

The slightly longer logfile can be found here: http://dpaste.com/2T61PS9

I suggest to reword that message, e.g. "could not write to transaction
log file" or "could not write to wal file".

I'm quite unenthused about that. If anything, I'd remove detail and use
the standard error message about not being able to write to a file, and
include the full path.

Also, the errno (ENOMEM) is curious (and the user wrote that Google
monitoring reported memory at 16/20GB at the time of the crash), but it
could be a due to running on a cloud-fork? As you have no access to
PGDATA, it sounds difficult to diagnose after the fact.

Yes, that sounds quite likely. This pretty much is a write() which isn't
documented to return ENOMEM commonly, so I assume Google's doing
something odd.

Greetings,

Andres Freund

#4Michael Paquier
michael@paquier.xyz
In reply to: Andres Freund (#3)
Re: Misleading panic message in backend/access/transam/xlog.c

On Wed, Jan 09, 2019 at 08:10:43AM -0800, Andres Freund wrote:

I'm quite unenthused about that. If anything, I'd remove detail and use
the standard error message about not being able to write to a file, and
include the full path.

Partially agreed. Those messages have been left out of 56df07b
because they include some context about the offset and the length, and
I don't think that we simply want to remove that information. What
about making the offset and the length part of an extra errdetail, and
switch the main error string to a more generic one?
--
Michael

#5Andres Freund
andres@anarazel.de
In reply to: Michael Paquier (#4)
Re: Misleading panic message in backend/access/transam/xlog.c

On January 9, 2019 5:01:40 PM PST, Michael Paquier <michael@paquier.xyz> wrote:

On Wed, Jan 09, 2019 at 08:10:43AM -0800, Andres Freund wrote:

I'm quite unenthused about that. If anything, I'd remove detail and

use

the standard error message about not being able to write to a file,

and

include the full path.

Partially agreed. Those messages have been left out of 56df07b
because they include some context about the offset and the length, and
I don't think that we simply want to remove that information. What
about making the offset and the length part of an extra errdetail, and
switch the main error string to a more generic one?

IIRC we have other such errors including offset and length (and if not we'll grow some). It should be formatted as a genetic write error with the file name, no reference to log file, etc, even if there's no precedent.

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.

#6Michael Paquier
michael@paquier.xyz
In reply to: Andres Freund (#5)
Re: Misleading panic message in backend/access/transam/xlog.c

On Wed, Jan 09, 2019 at 05:09:19PM -0800, Andres Freund wrote:

IIRC we have other such errors including offset and length (and if
not we'll grow some). It should be formatted as a genetic write
error with the file name, no reference to log file, etc, even if
there's no precedent.

Yeah, there are a couple of them:
access/transam/xlog.c:
errmsg("could not read from log segment %s, offset %u: %m",
access/transam/xlog.c:
errmsg("could not read from log segment %s, offset %u: read %d of %zu",
access/transam/xlogutils.c:
errmsg("could not seek in log segment %s to offset %u: %m"
access/transam/xlogutils.c:
errmsg("could not read from log segment %s, offset %u, length %lu: %m",
replication/walreceiver.c:
errmsg("could not seek in log segment %s to offset %u: %m",
replication/walsender.c:
errmsg("could not seek in log segment %s to offset %u: %m",
replication/walsender.c:
errmsg("could not read from log segment %s, offset %u, length %zu: %m",
replication/walsender.c:
errmsg("could not read from log segment %s, offset %u: read %d of %zu",
--
Michael

#7Robert Haas
robertmhaas@gmail.com
In reply to: Michael Paquier (#6)
Re: Misleading panic message in backend/access/transam/xlog.c

On Wed, Jan 9, 2019 at 8:38 PM Michael Paquier <michael@paquier.xyz> wrote:

On Wed, Jan 09, 2019 at 05:09:19PM -0800, Andres Freund wrote:

IIRC we have other such errors including offset and length (and if
not we'll grow some). It should be formatted as a genetic write
error with the file name, no reference to log file, etc, even if
there's no precedent.

Yeah, there are a couple of them:
access/transam/xlog.c:
errmsg("could not read from log segment %s, offset %u: %m",

In smgr.c, we have:

"could not read block %u in file \"%s\": %m"

That seems to be the closet thing we have to a generic message
template right now, but it's not entirely generic because it talks
about blocks. Maybe we should go with something like:

"could not read %u bytes in file \"%s\" at offset %u: %m"

...and use that for both WAL and smgr.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company