backtrace_on_internal_error
We have backtrace support for server errors. You can activate that
either by setting backtrace_functions or by explicitly attaching
errbacktrace() to an ereport() call.
I would like an additional mode that essentially triggers a backtrace
anytime elog() (for internal errors) is called. This is not well
covered by backtrace_functions, because there are many equally-worded
low-level errors in many functions. And once you have found out where the error
is, you need to manually rewrite the elog() to an ereport() to attach
the errbacktrace(), which is annoying. Having a backtrace automatically
on every elog() call would be very helpful during development for
various kinds of common errors from palloc, syscache, node support, etc.
I think the implementation would be very simple, something like
diff --git a/src/backend/utils/error/elog.c b/src/backend/utils/error/elog.c
index 6aeb855e49..45d40abe92 100644
--- a/src/backend/utils/error/elog.c
+++ b/src/backend/utils/error/elog.c
@@ -498,9 +498,11 @@ errfinish(const char *filename, int lineno, const char *funcname)
 	/* Collect backtrace, if enabled and we didn't already */
 	if (!edata->backtrace &&
-		edata->funcname &&
-		backtrace_functions &&
-		matches_backtrace_functions(edata->funcname))
+		((edata->funcname &&
+		  backtrace_functions &&
+		  matches_backtrace_functions(edata->funcname)) ||
+		 (edata->sqlerrcode == ERRCODE_INTERNAL_ERROR &&
+		  backtrace_on_internal_error)))
 		set_backtrace(edata, 2);

 	/*
where backtrace_on_internal_error would be a GUC variable.
Would others find this useful? Any other settings or variants in this
area that should be considered while we're here?
On Tue, Dec 05, 2023 at 11:55:05AM +0100, Peter Eisentraut wrote:
> Would others find this useful?

Yes. I think I would use this pretty frequently.

> Any other settings or variants in this area
> that should be considered while we're here?
IMO it would be nice to have a way to turn on backtraces for everything, or
at least everything above a certain logging level. That would primarily be
useful for when you don't know exactly which C function is producing the
error.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Tue, Dec 5, 2023 at 12:40 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
> On Tue, Dec 05, 2023 at 11:55:05AM +0100, Peter Eisentraut wrote:
>> Would others find this useful?
>
> Yes. I think I would use this pretty frequently.
I think we should consider unconditionally emitting a backtrace when
an elog() is hit, instead of requiring a GUC. Or at least any elog()
that's not at a DEBUGn level. If the extra output annoys anybody, that
means they're regularly hitting an elog(), and it ought to be promoted
to ereport().
--
Robert Haas
EDB: http://www.enterprisedb.com
On Tue, Dec 05, 2023 at 01:16:22PM -0500, Robert Haas wrote:
> I think we should consider unconditionally emitting a backtrace when
> an elog() is hit, instead of requiring a GUC. Or at least any elog()
> that's not at a DEBUGn level. If the extra output annoys anybody, that
> means they're regularly hitting an elog(), and it ought to be promoted
> to ereport().
Perhaps this should be a GUC that defaults to LOG or ERROR.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Tue, Dec 5, 2023 at 1:28 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
> On Tue, Dec 05, 2023 at 01:16:22PM -0500, Robert Haas wrote:
>> I think we should consider unconditionally emitting a backtrace when
>> an elog() is hit, instead of requiring a GUC. Or at least any elog()
>> that's not at a DEBUGn level. If the extra output annoys anybody, that
>> means they're regularly hitting an elog(), and it ought to be promoted
>> to ereport().
>
> Perhaps this should be a GUC that defaults to LOG or ERROR.
Why?
--
Robert Haas
EDB: http://www.enterprisedb.com
On Tue, 5 Dec 2023 at 19:30, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Dec 5, 2023 at 1:28 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
>> On Tue, Dec 05, 2023 at 01:16:22PM -0500, Robert Haas wrote:
>>> I think we should consider unconditionally emitting a backtrace when
>>> an elog() is hit, instead of requiring a GUC. Or at least any elog()
>>> that's not at a DEBUGn level. If the extra output annoys anybody, that
>>> means they're regularly hitting an elog(), and it ought to be promoted
>>> to ereport().
>>
>> Perhaps this should be a GUC that defaults to LOG or ERROR.
>
> Why?
I can't speak for Nathan, but my reason would be that I'm not in the
habit to attach a debugger to my program to keep track of state
progression, but instead use elog() during patch development. I'm not
super stoked for getting my developmental elog(LOG)-s spammed with
stack traces, so I'd want to set this at least to ERROR, while in
production LOG could be fine.
Similarly, there are probably extensions that do not use ereport()
directly, but instead use elog(), because of reasons like 'not
planning on doing translations' and 'elog() is the easier API'.
Forcing a change over to ereport because of stack trace spam in logs
caused by elog would be quite annoying.
Kind regards,
Matthias van de Meent
Neon (https://neon.tech)
On Tue, Dec 5, 2023 at 1:47 PM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
> I can't speak for Nathan, but my reason would be that I'm not in the
> habit to attach a debugger to my program to keep track of state
> progression, but instead use elog() during patch development. I'm not
> super stoked for getting my developmental elog(LOG)-s spammed with
> stack traces, so I'd want to set this at least to ERROR, while in
> production LOG could be fine.
>
> Similarly, there are probably extensions that do not use ereport()
> directly, but instead use elog(), because of reasons like 'not
> planning on doing translations' and 'elog() is the easier API'.
> Forcing a change over to ereport because of stack trace spam in logs
> caused by elog would be quite annoying.
That does seem like a fair complaint. But I also think it would be
really good if we had something that could be enabled unconditionally
instead of via a GUC... because if it's gated by a GUC then it often
won't be there when you need it.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Tue, Dec 05, 2023 at 07:47:25PM +0100, Matthias van de Meent wrote:
> On Tue, 5 Dec 2023 at 19:30, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Tue, Dec 5, 2023 at 1:28 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
>>> Perhaps this should be a GUC that defaults to LOG or ERROR.
>> Why?
Sorry, I should've explained why in my message.
> I can't speak for Nathan, but my reason would be that I'm not in the
> habit to attach a debugger to my program to keep track of state
> progression, but instead use elog() during patch development. I'm not
> super stoked for getting my developmental elog(LOG)-s spammed with
> stack traces, so I'd want to set this at least to ERROR, while in
> production LOG could be fine.
>
> Similarly, there are probably extensions that do not use ereport()
> directly, but instead use elog(), because of reasons like 'not
> planning on doing translations' and 'elog() is the easier API'.
> Forcing a change over to ereport because of stack trace spam in logs
> caused by elog would be quite annoying.
My main concern was forcing extra logging that users won't have a great way
to turn off (except for maybe raising log_min_messages or something).
Also, it'd give us a way to slowly ramp up backtraces over a few years
without suddenly spamming everyone's logs in v17. For example, maybe this
GUC defaults to PANIC or FATAL in v17. Once we've had a chance to address
any common backtraces there, we could bump it to ERROR or WARNING in v18.
And so on. If we just flood everyone's logs immediately, I worry that
folks will just turn it off, and we won't get the reports we are hoping
for.
I know we already have so many GUCs and would like to avoid adding new ones
when possible. Maybe this is one that could eventually be retired as we
gain confidence that it won't obliterate the log files.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Matthias van de Meent <boekewurm+postgres@gmail.com> writes:
> On Tue, 5 Dec 2023 at 19:30, Robert Haas <robertmhaas@gmail.com> wrote:
>>> I think we should consider unconditionally emitting a backtrace when
>>> an elog() is hit, instead of requiring a GUC.
>> Perhaps this should be a GUC that defaults to LOG or ERROR.
> I can't speak for Nathan, but my reason would be that I'm not in the
> habit to attach a debugger to my program to keep track of state
> progression, but instead use elog() during patch development. I'm not
> super stoked for getting my developmental elog(LOG)-s spammed with
> stack traces, so I'd want to set this at least to ERROR, while in
> production LOG could be fine.
Yeah, I would not be happy either with elog(LOG) suddenly getting
10x more verbose. I think it might be okay to unconditionally do this
when elevel >= ERROR, though.
(At the same time, I don't have a problem with the idea of a GUC
controlling the minimum elevel to cause the report. Other people
might have other use-cases than I do.)
regards, tom lane
On 05.12.23 11:55, Peter Eisentraut wrote:
> I think the implementation would be very simple, something like
>
> diff --git a/src/backend/utils/error/elog.c b/src/backend/utils/error/elog.c
> index 6aeb855e49..45d40abe92 100644
> --- a/src/backend/utils/error/elog.c
> +++ b/src/backend/utils/error/elog.c
> @@ -498,9 +498,11 @@ errfinish(const char *filename, int lineno, const char *funcname)
>  	/* Collect backtrace, if enabled and we didn't already */
>  	if (!edata->backtrace &&
> -		edata->funcname &&
> -		backtrace_functions &&
> -		matches_backtrace_functions(edata->funcname))
> +		((edata->funcname &&
> +		  backtrace_functions &&
> +		  matches_backtrace_functions(edata->funcname)) ||
> +		 (edata->sqlerrcode == ERRCODE_INTERNAL_ERROR &&
> +		  backtrace_on_internal_error)))
>  		set_backtrace(edata, 2);
>
> where backtrace_on_internal_error would be a GUC variable.
It looks like many people found this idea worthwhile.
Several people asked for a way to set the minimum log level for this
treatment.
Something else to note: I wrote the above code to check the error code;
it doesn't check whether the original code used elog() or ereport().
There are some internal errors that are written as ereport() now.
Others might be changed from time to time; until now there would have
been no external effect from this. I think it would be weird to
introduce a difference between these forms now.
But then, elog() only uses the error code ERRCODE_INTERNAL_ERROR if the
error level is >=ERROR. So this already excludes everything below.
Do people want a way to distinguish ERROR/FATAL/PANIC?
Or do people want a way to enable backtraces for elog(LOG)? This didn't
look too interesting to me. (Many current elog(LOG) calls are behind
additional guards like TRACE_SORT or LOCK_DEBUG.)
If neither of these two are very interesting, then the above code would
already appear to do what was asked.
Peter Eisentraut <peter@eisentraut.org> writes:
> Something else to note: I wrote the above code to check the error code;
> it doesn't check whether the original code used elog() or ereport().
> There are some internal errors that are written as ereport() now.
> Others might be changed from time to time; until now there would have
> been no external effect from this. I think it would be weird to
> introduce a difference between these forms now.
Yeah, that was bothering me too. IIRC, elog is already documented
as being *exactly equivalent* to ereport with a minimal set of
options. I don't think we should break that equivalence. So I
agree with driving this off the stated-or-imputed errcode rather
than which function is called.
> Do people want a way to distinguish ERROR/FATAL/PANIC?
> Or do people want a way to enable backtraces for elog(LOG)?
Personally I don't see a need for either.
regards, tom lane
Here is a patch to play with.
I also found a related typo.
One possible question for discussion is whether the default for this
should be off, on, or possibly something like on-in-assert-builds.
(Personally, I'm happy to turn it on myself at run time, but everyone
has different workflows.)
Peter Eisentraut <peter@eisentraut.org> writes:
> Here is a patch to play with.
Didn't read the patch yet, but ...
> One possible question for discussion is whether the default for this
> should be off, on, or possibly something like on-in-assert-builds.
> (Personally, I'm happy to turn it on myself at run time, but everyone
> has different workflows.)
... there was already opinion upthread that this should be on by
default, which I agree with. You shouldn't be hitting cases like
this commonly (if so, they're bugs to fix or the errcode should be
rethought), and the failure might be pretty hard to reproduce.
I'm not really sold that we even need YA GUC, for that matter.
How about committing the behavior without a GUC, and then
back-filling one if we get pushback?
regards, tom lane
Hi,
On 2023-12-08 10:05:09 -0500, Tom Lane wrote:
> Peter Eisentraut <peter@eisentraut.org> writes:
>> One possible question for discussion is whether the default for this
>> should be off, on, or possibly something like on-in-assert-builds.
>> (Personally, I'm happy to turn it on myself at run time, but everyone
>> has different workflows.)
>
> ... there was already opinion upthread that this should be on by
> default, which I agree with. You shouldn't be hitting cases like
> this commonly (if so, they're bugs to fix or the errcode should be
> rethought), and the failure might be pretty hard to reproduce.
FWIW, I did some analysis on aggregated logs on a larger number of machines,
and it does look like that'd be a measurable increase in log volume. There are
a few voluminous internal errors in core, but the bigger issue is
extensions. They are typically much less disciplined about assigning error
codes than core PG is.
I've been wondering about doing some macro hackery to inform elog.c about
whether a log message is from core or an extension. It might even be possible
to identify the concrete extension, e.g. by updating the contents of
PG_MODULE_MAGIC during module loading, and referencing that.
Based on the aforementioned data, the most common, in-core, log messages
without assigned error codes are:
could not accept SSL connection: %m - with zero errno
archive command was terminated by signal %d: %s
could not send data to client: %m - with zero errno
cache lookup failed for type %u
archive command failed with exit code %d
tuple concurrently updated
could not restore file "%s" from archive: %s
archive command was terminated by signal %d: %s
%s at file "%s" line %u
invalid memory alloc request size %zu
could not send data to client: %m
could not open directory "%s": %m - errno indicating ENOMEM
could not write init file
out of relcache_callback_list slots
online backup was canceled, recovery cannot continue
requested timeline %u does not contain minimum recovery point %X/%X on timeline %u
There were a lot more in older PG versions, I tried to filter those out.
I'm a bit confused about the huge number of "could not accept SSL connection:
%m" with a zero errno. I guess we must be clearing errno somehow, but I don't
immediately see where. Or perhaps we need to actually look at what
SSL_get_error() returns?
Greetings,
Andres Freund
Andres Freund <andres@anarazel.de> writes:
> On 2023-12-08 10:05:09 -0500, Tom Lane wrote:
>> ... there was already opinion upthread that this should be on by
>> default, which I agree with. You shouldn't be hitting cases like
>> this commonly (if so, they're bugs to fix or the errcode should be
>> rethought), and the failure might be pretty hard to reproduce.
> FWIW, I did some analysis on aggregated logs on a larger number of machines,
> and it does look like that'd be a measurable increase in log volume. There are
> a few voluminous internal errors in core, but the bigger issue is
> extensions. They are typically much less disciplined about assigning error
> codes than core PG is.
Well, I don't see much wrong with making a push to assign error codes
to more calls. We've had other discussions about doing that.
Certainly these SSL failures are not "internal" errors.
> could not accept SSL connection: %m - with zero errno
> ...
> I'm a bit confused about the huge number of "could not accept SSL connection:
> %m" with a zero errno. I guess we must be clearing errno somehow, but I don't
> immediately see where. Or perhaps we need to actually look at what
> SSL_get_error() returns?
Hmm, don't suppose you have a way to reproduce that?
regards, tom lane
Hi,
On 2023-12-08 13:23:50 -0500, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
>> On 2023-12-08 10:05:09 -0500, Tom Lane wrote:
>>> ... there was already opinion upthread that this should be on by
>>> default, which I agree with. You shouldn't be hitting cases like
>>> this commonly (if so, they're bugs to fix or the errcode should be
>>> rethought), and the failure might be pretty hard to reproduce.
>> FWIW, I did some analysis on aggregated logs on a larger number of machines,
>> and it does look like that'd be a measurable increase in log volume. There are
>> a few voluminous internal errors in core, but the bigger issue is
>> extensions. They are typically much less disciplined about assigning error
>> codes than core PG is.
>
> Well, I don't see much wrong with making a push to assign error codes
> to more calls.
Oh, very much agreed. But I suspect we won't quickly do the same for
out-of-core extensions...
> Certainly these SSL failures are not "internal" errors.
>> could not accept SSL connection: %m - with zero errno
>> ...
>> I'm a bit confused about the huge number of "could not accept SSL connection:
>> %m" with a zero errno. I guess we must be clearing errno somehow, but I don't
>> immediately see where. Or perhaps we need to actually look at what
>> SSL_get_error() returns?
>
> Hmm, don't suppose you have a way to reproduce that?
After a bit of trying, yes. I put an abort() into pgtls_open_client(), after
initialize_SSL(). Connecting does result in:
LOG: could not accept SSL connection: Success
Greetings,
Andres Freund
Andres Freund <andres@anarazel.de> writes:
> On 2023-12-08 13:23:50 -0500, Tom Lane wrote:
>> Hmm, don't suppose you have a way to reproduce that?
> After a bit of trying, yes. I put an abort() into pgtls_open_client(), after
> initialize_SSL(). Connecting does result in:
> LOG:  could not accept SSL connection: Success
OK. I can dig into that, unless you're already on it?
regards, tom lane
Hi,
On 2023-12-08 13:46:07 -0500, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
>> On 2023-12-08 13:23:50 -0500, Tom Lane wrote:
>>> Hmm, don't suppose you have a way to reproduce that?
>> After a bit of trying, yes. I put an abort() into pgtls_open_client(), after
>> initialize_SSL(). Connecting does result in:
>> LOG:  could not accept SSL connection: Success
>
> OK. I can dig into that, unless you're already on it?
I think I figured it out. Looks like we need to translate a closed socket
(recvfrom() returning 0) to ECONNRESET or such.
Greetings,
Andres Freund
Hi,
On 2023-12-08 10:51:01 -0800, Andres Freund wrote:
> On 2023-12-08 13:46:07 -0500, Tom Lane wrote:
>> Andres Freund <andres@anarazel.de> writes:
>>> On 2023-12-08 13:23:50 -0500, Tom Lane wrote:
>>>> Hmm, don't suppose you have a way to reproduce that?
>>> After a bit of trying, yes. I put an abort() into pgtls_open_client(), after
>>> initialize_SSL(). Connecting does result in:
>>> LOG:  could not accept SSL connection: Success
>>
>> OK. I can dig into that, unless you're already on it?
>
> I think I figured it out. Looks like we need to translate a closed socket
> (recvfrom() returning 0) to ECONNRESET or such.
I think we might just need to expand the existing branch for EOF:
	if (r < 0)
		ereport(COMMERROR,
				(errcode_for_socket_access(),
				 errmsg("could not accept SSL connection: %m")));
	else
		ereport(COMMERROR,
				(errcode(ERRCODE_PROTOCOL_VIOLATION),
				 errmsg("could not accept SSL connection: EOF detected")));
The openssl docs say:

    The following return values can occur:

    0   The TLS/SSL handshake was not successful but was shut down controlled
        and by the specifications of the TLS/SSL protocol. Call
        SSL_get_error() with the return value ret to find out the reason.

    1   The TLS/SSL handshake was successfully completed, a TLS/SSL
        connection has been established.

    <0  The TLS/SSL handshake was not successful because a fatal error
        occurred either at the protocol level or a connection failure
        occurred. The shutdown was not clean. It can also occur if action is
        needed to continue the operation for nonblocking BIOs. Call
        SSL_get_error() with the return value ret to find out the reason.
Which fits with my reproducer - due to the abort the connection was *not* shut
down via SSL in a controlled manner, therefore r < 0.
Hm, oddly enough, there's this tidbit in the SSL_get_error() manpage:
On an unexpected EOF, versions before OpenSSL 3.0 returned SSL_ERROR_SYSCALL,
nothing was added to the error stack, and errno was 0. Since OpenSSL 3.0 the
returned error is SSL_ERROR_SSL with a meaningful error on the error stack.
But I reproduced this with 3.1.
Seems like we should just treat errno == 0 as a reason to emit the "EOF
detected" message?
I wonder if we should treat send/recv returning 0 different from an error
message perspective during an established connection. Right now we produce
could not receive data from client: Connection reset by peer
because be_tls_read() sets errno to ECONNRESET - despite that not having been
returned by the OS. But I guess that's a topic for another day.
Greetings,
Andres Freund
Hi,
On 2023-12-08 11:33:16 -0800, Andres Freund wrote:
> On 2023-12-08 10:51:01 -0800, Andres Freund wrote:
>> On 2023-12-08 13:46:07 -0500, Tom Lane wrote:
>>> Andres Freund <andres@anarazel.de> writes:
>>>> On 2023-12-08 13:23:50 -0500, Tom Lane wrote:
>>>>> Hmm, don't suppose you have a way to reproduce that?
>>>> After a bit of trying, yes. I put an abort() into pgtls_open_client(), after
>>>> initialize_SSL(). Connecting does result in:
>>>> LOG:  could not accept SSL connection: Success
>>>
>>> OK. I can dig into that, unless you're already on it?
> [...]
> Seems like we should just treat errno == 0 as a reason to emit the "EOF
> detected" message?
I thought it'd be nice to have a test for this, particularly because it's not
clear that the behaviour is consistent across openssl versions.
I couldn't think of a way to do that with psql. But it's just a few lines of
perl to gin up an "SSL" startup packet and then close the socket. I couldn't
quite figure out when IO::Socket::INET was added, but I think it's likely been
long enough, I see references from 1999.
This worked easily on linux and freebsd, but not on windows and macos, where
it seems to cause ECONNRESET. I thought that explicitly shutting down the
socket might help, but that just additionally caused freebsd to fail.
Windows uses an older openssl, so it could also be caused by the behaviour
differing back then.
To deal with that, I changed the test to instead check if "not accept SSL
connection: Success" is not logged. I'm not sure that actually would be
logged on windows, it does seem to have different strings for errors than
other platforms.
Greetings,
Andres Freund