Strange failure on mamba

Started by Thomas Munro · over 3 years ago · 20 messages · pgsql-hackers
#1Thomas Munro
thomas.munro@gmail.com

Hi,

I wonder why the walreceiver didn't start in
008_min_recovery_point_node_3.log here:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamba&dt=2022-11-16%2023%3A13%3A38

There was the case of commit 8acd8f86, but that involved a deadlocked
postmaster whereas this one still handled a shutdown request.

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Thomas Munro (#1)
Re: Strange failure on mamba

Thomas Munro <thomas.munro@gmail.com> writes:

I wonder why the walreceiver didn't start in
008_min_recovery_point_node_3.log here:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamba&dt=2022-11-16%2023%3A13%3A38

mamba has been showing intermittent failures in various replication
tests since day one. My guess is that it's slow enough to be
particularly subject to the signal-handler race conditions that we
know exist in walreceivers and elsewhere. (Now, it wasn't any faster
in its previous incarnation as a macOS critter. But maybe modern
NetBSD has different scheduler behavior than ancient macOS and that
contributes somehow. Or maybe there's some other NetBSD weirdness
in here.)

I've tried to reproduce manually, without much success :-(

Like many of its other failures, there's a suggestive postmaster
log entry at the very end:

2022-11-16 19:45:53.851 EST [2036:4] LOG: received immediate shutdown request
2022-11-16 19:45:58.873 EST [2036:5] LOG: issuing SIGKILL to recalcitrant children
2022-11-16 19:45:58.881 EST [2036:6] LOG: database system is shut down

So some postmaster child is stuck somewhere where it's not responding
to SIGQUIT. While it's not unreasonable to guess that that's a
walreceiver, there's no hard evidence of it here. I've been wondering
if it'd be worth patching the postmaster so that it's a bit more verbose
about which children it had to SIGKILL. I've also wondered about
changing the SIGKILL to SIGABRT in hopes of reaping a core file that
could be investigated.

regards, tom lane

#3Thomas Munro
thomas.munro@gmail.com
In reply to: Tom Lane (#2)
Re: Strange failure on mamba

On Fri, Nov 18, 2022 at 11:08 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Thomas Munro <thomas.munro@gmail.com> writes:

I wonder why the walreceiver didn't start in
008_min_recovery_point_node_3.log here:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamba&dt=2022-11-16%2023%3A13%3A38

mamba has been showing intermittent failures in various replication
tests since day one. My guess is that it's slow enough to be
particularly subject to the signal-handler race conditions that we
know exist in walreceivers and elsewhere. (Now, it wasn't any faster
in its previous incarnation as a macOS critter. But maybe modern
NetBSD has different scheduler behavior than ancient macOS and that
contributes somehow. Or maybe there's some other NetBSD weirdness
in here.)

I've tried to reproduce manually, without much success :-(

Like many of its other failures, there's a suggestive postmaster
log entry at the very end:

2022-11-16 19:45:53.851 EST [2036:4] LOG: received immediate shutdown request
2022-11-16 19:45:58.873 EST [2036:5] LOG: issuing SIGKILL to recalcitrant children
2022-11-16 19:45:58.881 EST [2036:6] LOG: database system is shut down

So some postmaster child is stuck somewhere where it's not responding
to SIGQUIT. While it's not unreasonable to guess that that's a
walreceiver, there's no hard evidence of it here. I've been wondering
if it'd be worth patching the postmaster so that it's a bit more verbose
about which children it had to SIGKILL. I've also wondered about
changing the SIGKILL to SIGABRT in hopes of reaping a core file that
could be investigated.

I wonder if it's a runtime variant of the other problem. We do
load_file("libpqwalreceiver", false) before unblocking signals but
maybe don't resolve the symbols until calling them, or something like
that...

#4Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#3)
Re: Strange failure on mamba

On Fri, Nov 18, 2022 at 11:35 AM Thomas Munro <thomas.munro@gmail.com> wrote:

I wonder if it's a runtime variant of the other problem. We do
load_file("libpqwalreceiver", false) before unblocking signals but
maybe don't resolve the symbols until calling them, or something like
that...

Hmm, no, I take that back. A key ingredient was that a symbol was
being resolved inside the signal handler, which is a postmaster-only
thing.

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Thomas Munro (#3)
Re: Strange failure on mamba

Thomas Munro <thomas.munro@gmail.com> writes:

On Fri, Nov 18, 2022 at 11:08 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

mamba has been showing intermittent failures in various replication
tests since day one.

I wonder if it's a runtime variant of the other problem. We do
load_file("libpqwalreceiver", false) before unblocking signals but
maybe don't resolve the symbols until calling them, or something like
that...

Yeah, that or some other NetBSD bug could be the explanation, too.
Without a stack trace it's hard to have any confidence about it,
but I've been unable to reproduce the problem outside the buildfarm.
(Which is a familiar refrain. I wonder what it is about the buildfarm
environment that makes it act different from the exact same code running
on the exact same machine.)

So I'd like to have some way to make the postmaster send SIGABRT instead
of SIGKILL in the buildfarm environment. The lowest-tech way would be
to drive that off some #define or other. We could scale it up to a GUC
perhaps. Adjacent to that, I also wonder whether SIGABRT wouldn't be
more useful than SIGSTOP for the existing SendStop half-a-feature ---
the idea that people should collect cores manually seems mighty
last-century.

regards, tom lane

#6Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#5)
Re: Strange failure on mamba

Hi,

On 2022-11-17 17:47:50 -0500, Tom Lane wrote:

Yeah, that or some other NetBSD bug could be the explanation, too.
Without a stack trace it's hard to have any confidence about it,
but I've been unable to reproduce the problem outside the buildfarm.
(Which is a familiar refrain. I wonder what it is about the buildfarm
environment that makes it act different from the exact same code running
on the exact same machine.)

So I'd like to have some way to make the postmaster send SIGABRT instead
of SIGKILL in the buildfarm environment. The lowest-tech way would be
to drive that off some #define or other. We could scale it up to a GUC
perhaps. Adjacent to that, I also wonder whether SIGABRT wouldn't be
more useful than SIGSTOP for the existing SendStop half-a-feature ---
the idea that people should collect cores manually seems mighty
last-century.

I suspect that having a GUC would be a good idea. I needed something similar
recently, debugging an occasional hang in the AIO patchset. I first tried
something like your #define approach and it did cause a problematic flood of
core files.

I ended up using libbacktrace to generate useful backtraces (vs what
backtrace_symbols() generates) when receiving SIGQUIT. I didn't do the legwork
to make it properly signal safe, but it'd be doable afaiu.

Greetings,

Andres Freund

#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#6)
Sending SIGABRT to child processes (was Re: Strange failure on mamba)

Andres Freund <andres@anarazel.de> writes:

On 2022-11-17 17:47:50 -0500, Tom Lane wrote:

So I'd like to have some way to make the postmaster send SIGABRT instead
of SIGKILL in the buildfarm environment. The lowest-tech way would be
to drive that off some #define or other. We could scale it up to a GUC
perhaps. Adjacent to that, I also wonder whether SIGABRT wouldn't be
more useful than SIGSTOP for the existing SendStop half-a-feature ---
the idea that people should collect cores manually seems mighty
last-century.

I suspect that having a GUC would be a good idea. I needed something similar
recently, debugging an occasional hang in the AIO patchset. I first tried
something like your #define approach and it did cause a problematic flood of
core files.

Yeah, the main downside of such a thing is the risk of lots of core files
accumulating over repeated crashes. Nonetheless, I think it'll be a
useful debugging aid. Here's a proposed patch. (I took the opportunity
to kill off the long-since-unimplemented Reinit switch, too.)

One thing I'm not too clear on is if we want to send SIGABRT to the child
groups (ie, SIGABRT grandchild processes too). I made signal_child do
so here, but perhaps it's overkill.

regards, tom lane

Attachments:

optionally-kill-child-processes-with-SIGABRT-v1.patch (text/x-diff, charset=us-ascii) +134/-129
#8Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#7)
Re: Sending SIGABRT to child processes (was Re: Strange failure on mamba)

I wrote:

Andres Freund <andres@anarazel.de> writes:

I suspect that having a GUC would be a good idea. I needed something similar
recently, debugging an occasional hang in the AIO patchset. I first tried
something like your #define approach and it did cause a problematic flood of
core files.

Yeah, the main downside of such a thing is the risk of lots of core files
accumulating over repeated crashes. Nonetheless, I think it'll be a
useful debugging aid. Here's a proposed patch. (I took the opportunity
to kill off the long-since-unimplemented Reinit switch, too.)

Hearing no complaints, I've pushed this and reconfigured mamba to use
send_abort_for_kill. Once I've got a core file or two to look at,
I'll try to figure out what's going on there.

One thing I'm not too clear on is if we want to send SIGABRT to the child
groups (ie, SIGABRT grandchild processes too). I made signal_child do
so here, but perhaps it's overkill.

After further thought, we do have to SIGABRT the grandchildren too,
or they won't shut down promptly. I think there might be a small
risk of some programs trapping SIGABRT and doing something other than
what we want; but since this is only a debug aid that's probably
tolerable.

regards, tom lane

#9Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#5)
Re: Strange failure on mamba

I wrote:

Thomas Munro <thomas.munro@gmail.com> writes:

On Fri, Nov 18, 2022 at 11:08 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

mamba has been showing intermittent failures in various replication
tests since day one.

I wonder if it's a runtime variant of the other problem. We do
load_file("libpqwalreceiver", false) before unblocking signals but
maybe don't resolve the symbols until calling them, or something like
that...

Yeah, that or some other NetBSD bug could be the explanation, too.
Without a stack trace it's hard to have any confidence about it,
but I've been unable to reproduce the problem outside the buildfarm.

Thanks to commit 51b5834cd I've now been able to capture some info
from mamba's last couple of failures [1][2]. Sure enough, what is
happening is that postmaster children are getting stuck in recursive
rtld symbol resolution. A couple of the stack traces I collected are

#0 0xfdeede4c in ___lwp_park60 () from /usr/libexec/ld.elf_so
#1 0xfdee3e08 in _rtld_exclusive_enter () from /usr/libexec/ld.elf_so
#2 0xfdee59e4 in dlopen () from /usr/libexec/ld.elf_so
#3 0x01e54ed0 in internal_load_library (
libname=libname@entry=0xfd74cc88 "/home/buildfarm/bf-data/HEAD/pgsql.build/tmp_install/home/buildfarm/bf-data/HEAD/inst/lib/postgresql/libpqwalreceiver.so") at dfmgr.c:239
#4 0x01e55c78 in load_file (filename=<optimized out>, restricted=<optimized out>) at dfmgr.c:156
#5 0x01c5ba24 in WalReceiverMain () at walreceiver.c:292
#6 0x01c090f8 in AuxiliaryProcessMain (auxtype=auxtype@entry=WalReceiverProcess) at auxprocess.c:161
#7 0x01c10970 in StartChildProcess (type=WalReceiverProcess) at postmaster.c:5310
#8 0x01c123ac in MaybeStartWalReceiver () at postmaster.c:5475
#9 MaybeStartWalReceiver () at postmaster.c:5468
#10 sigusr1_handler (postgres_signal_arg=<optimized out>) at postmaster.c:5131
#11 <signal handler called>
#12 0xfdee6b44 in _rtld_symlook_obj () from /usr/libexec/ld.elf_so
#13 0xfdee6fc0 in _rtld_symlook_list () from /usr/libexec/ld.elf_so
#14 0xfdee7644 in _rtld_symlook_default () from /usr/libexec/ld.elf_so
#15 0xfdee795c in _rtld_find_symdef () from /usr/libexec/ld.elf_so
#16 0xfdee7ad0 in _rtld_find_plt_symdef () from /usr/libexec/ld.elf_so
#17 0xfdee1918 in _rtld_bind () from /usr/libexec/ld.elf_so
#18 0xfdee1dc0 in _rtld_bind_secureplt_start () from /usr/libexec/ld.elf_so
Backtrace stopped: frame did not save the PC

#0 0xfdeede4c in ___lwp_park60 () from /usr/libexec/ld.elf_so
#1 0xfdee3e08 in _rtld_exclusive_enter () from /usr/libexec/ld.elf_so
#2 0xfdee4ba4 in _rtld_exit () from /usr/libexec/ld.elf_so
#3 0xfd54ea74 in __cxa_finalize () from /usr/lib/libc.so.12
#4 0xfd54e354 in exit () from /usr/lib/libc.so.12
#5 0x01c963c0 in proc_exit (code=code@entry=0) at ipc.c:152
#6 0x01c056e4 in AutoVacLauncherShutdown () at autovacuum.c:853
#7 0x01c071dc in AutoVacLauncherMain (argv=0x0, argc=0) at autovacuum.c:800
#8 0x01c07694 in StartAutoVacLauncher () at autovacuum.c:416
#9 0x01c11d3c in reaper (postgres_signal_arg=<optimized out>) at postmaster.c:3038
#10 <signal handler called>
#11 0xfdee6f64 in _rtld_symlook_list () from /usr/libexec/ld.elf_so
#12 0xfdee7644 in _rtld_symlook_default () from /usr/libexec/ld.elf_so
#13 0xfdee795c in _rtld_find_symdef () from /usr/libexec/ld.elf_so
#14 0xfdee7ad0 in _rtld_find_plt_symdef () from /usr/libexec/ld.elf_so
#15 0xfdee1918 in _rtld_bind () from /usr/libexec/ld.elf_so
#16 0xfdee1dc0 in _rtld_bind_secureplt_start () from /usr/libexec/ld.elf_so
Backtrace stopped: frame did not save the PC

which is pretty much just the same thing we were seeing before
commit 8acd8f869 :-(

Now, we certainly cannot think that these are occurring early in
postmaster startup. In the wake of 8acd8f869, we should expect
that there's no further need to call rtld_bind at all in the
postmaster, but seemingly that's not so. It's very frustrating
that the backtrace stops where it does :-(. It's also strange
that we're apparently running with signals enabled wherever
it is that rtld_bind is getting called from. Could it be that
sigaction is failing to install the requested signal mask, so
that one postmaster signal handler is interrupting another?

regards, tom lane

[1]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamba&dt=2022-11-24%2021%3A45%3A29
[2]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamba&dt=2022-11-29%2020%3A50%3A36

#10Thomas Munro
thomas.munro@gmail.com
In reply to: Tom Lane (#9)
Re: Strange failure on mamba

On Wed, Nov 30, 2022 at 2:44 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Now, we certainly cannot think that these are occurring early in
postmaster startup. In the wake of 8acd8f869, we should expect
that there's no further need to call rtld_bind at all in the
postmaster, but seemingly that's not so. It's very frustrating
that the backtrace stops where it does :-(. It's also strange
that we're apparently running with signals enabled wherever
it is that rtld_bind is getting called from. Could it be that
sigaction is failing to install the requested signal mask, so
that one postmaster signal handler is interrupting another?

Add in some code that does sigaction(0, NULL, &mask) to read the
current mask and assert that it's blocked as expected in the handlers?
Start the postmaster in gdb with a break on _rtld_bind to find all the
places that reach it (unexpectedly)?

#11Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#10)
Re: Strange failure on mamba

On Wed, Nov 30, 2022 at 3:43 PM Thomas Munro <thomas.munro@gmail.com> wrote:

sigaction(0, NULL, &mask)

s/sigaction/sigprocmask/

#12Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#9)
Re: Strange failure on mamba

Hi,

On 2022-11-29 20:44:34 -0500, Tom Lane wrote:

Thanks to commit 51b5834cd I've now been able to capture some info
from mamba's last couple of failures [1][2]. Sure enough, what is
happening is that postmaster children are getting stuck in recursive
rtld symbol resolution. A couple of the stack traces I collected are

#0 0xfdeede4c in ___lwp_park60 () from /usr/libexec/ld.elf_so
#1 0xfdee3e08 in _rtld_exclusive_enter () from /usr/libexec/ld.elf_so
#2 0xfdee59e4 in dlopen () from /usr/libexec/ld.elf_so
#3 0x01e54ed0 in internal_load_library (
libname=libname@entry=0xfd74cc88 "/home/buildfarm/bf-data/HEAD/pgsql.build/tmp_install/home/buildfarm/bf-data/HEAD/inst/lib/postgresql/libpqwalreceiver.so") at dfmgr.c:239
#4 0x01e55c78 in load_file (filename=<optimized out>, restricted=<optimized out>) at dfmgr.c:156
#5 0x01c5ba24 in WalReceiverMain () at walreceiver.c:292
#6 0x01c090f8 in AuxiliaryProcessMain (auxtype=auxtype@entry=WalReceiverProcess) at auxprocess.c:161
#7 0x01c10970 in StartChildProcess (type=WalReceiverProcess) at postmaster.c:5310
#8 0x01c123ac in MaybeStartWalReceiver () at postmaster.c:5475
#9 MaybeStartWalReceiver () at postmaster.c:5468
#10 sigusr1_handler (postgres_signal_arg=<optimized out>) at postmaster.c:5131
#11 <signal handler called>
#12 0xfdee6b44 in _rtld_symlook_obj () from /usr/libexec/ld.elf_so
#13 0xfdee6fc0 in _rtld_symlook_list () from /usr/libexec/ld.elf_so
#14 0xfdee7644 in _rtld_symlook_default () from /usr/libexec/ld.elf_so
#15 0xfdee795c in _rtld_find_symdef () from /usr/libexec/ld.elf_so
#16 0xfdee7ad0 in _rtld_find_plt_symdef () from /usr/libexec/ld.elf_so
#17 0xfdee1918 in _rtld_bind () from /usr/libexec/ld.elf_so
#18 0xfdee1dc0 in _rtld_bind_secureplt_start () from /usr/libexec/ld.elf_so
Backtrace stopped: frame did not save the PC

Do you have any idea why the stack can't be unwound further here? Is it
possibly indicative of a corrupted stack? I guess we'd need to dig into the
the netbsd libc code :(

which is pretty much just the same thing we were seeing before
commit 8acd8f869 :-(

What libraries is postgres linked against? I don't know whether -z now only
affects the "top-level" dependencies of postgres, or also the dependencies of
shared libraries that haven't been built with -z now. The only dependencies
that I could see being relevant are libintl and openssl.

You could try if anything changes if you set LD_BIND_NOW, that should trigger
"recursive" dependencies to be loaded eagerly as well.

Greetings,

Andres Freund

#13Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#12)
Re: Strange failure on mamba

Andres Freund <andres@anarazel.de> writes:

On 2022-11-29 20:44:34 -0500, Tom Lane wrote:

Backtrace stopped: frame did not save the PC

Do you have any idea why the stack can't be unwound further here? Is it
possibly indicative of a corrupted stack? I guess we'd need to dig into
the netbsd libc code :(

I did do some digging in that area previously when we were seeing this
on HPPA, and determined that the assembly code in that area was not
bothering to establish a standard stack frame, for no very obvious
reason :-(. I haven't studied their equivalent PPC code, but apparently
it's equally cavalier. I recall trying to hack the HPPA code to make
it set up the stack frame correctly, without success, but I didn't
try very hard. Maybe I'll have a go at that on the PPC side.

What libraries is postgres linked against? I don't know whether -z now only
affects the "top-level" dependencies of postgres, or also the dependencies of
shared libraries that haven't been built with -z now. The only dependencies
that I could see being relevant are libintl and openssl.

Hmm. mamba is using both --enable-nls and --with-openssl, but
I can't see a reason why the postmaster would be interacting with
OpenSSL post-startup in test cases that don't use SSL. Perhaps
libintl is doing something it shouldn't?

You could try if anything changes if you set LD_BIND_NOW, that should trigger
"recursive" dependencies to be loaded eagerly as well.

Googling LD_BIND_NOW suggests that that's a Linux thing; do you know that
it should have an effect on NetBSD?

regards, tom lane

#14Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#9)
Re: Strange failure on mamba

Hi,

On 2022-11-29 20:44:34 -0500, Tom Lane wrote:

It's also strange that we're apparently running with signals enabled
wherever it is that rtld_bind is getting called from. Could it be that
sigaction is failing to install the requested signal mask, so that one
postmaster signal handler is interrupting another?

This made me look at pqsignal_pm() / pqsignal() and realize that we wouldn't
even notice if it failed, because they just return SIG_ERR and callers don't
check. I don't think that's a likely to be related, but theoretically it could
lead to some odd situations.

Greetings,

Andres Freund

#15Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#14)
Re: Strange failure on mamba

Andres Freund <andres@anarazel.de> writes:

On 2022-11-29 20:44:34 -0500, Tom Lane wrote:

It's also strange that we're apparently running with signals enabled
wherever it is that rtld_bind is getting called from. Could it be that
sigaction is failing to install the requested signal mask, so that one
postmaster signal handler is interrupting another?

This made me look at pqsignal_pm() / pqsignal() and realize that we wouldn't
even notice if it failed, because they just return SIG_ERR and callers don't
check. I don't think that's likely to be related, but theoretically it could
lead to some odd situations.

Yeah, I noticed that just now too. But if sigaction() failed,
the signal handler wouldn't get installed at all, which'd lead
to different and more-obvious symptoms. So I doubt that that's
what happened.

regards, tom lane

#16Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#13)
Re: Strange failure on mamba

Hi,

On 2022-11-30 00:55:42 -0500, Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

What libraries is postgres linked against? I don't know whether -z now only
affects the "top-level" dependencies of postgres, or also the dependencies of
shared libraries that haven't been built with -z now. The only dependencies
that I could see being relevant are libintl and openssl.

Hmm. mamba is using both --enable-nls and --with-openssl, but
I can't see a reason why the postmaster would be interacting with
OpenSSL post-startup in test cases that don't use SSL. Perhaps
libintl is doing something it shouldn't?

We do call into openssl in postmaster, via RandomCancelKey(). But we should
have signals masked at that point, so it shouldn't matter.

You could try if anything changes if you set LD_BIND_NOW, that should trigger
"recursive" dependencies to be loaded eagerly as well.

Googling LD_BIND_NOW suggests that that's a Linux thing; do you know that
it should have an effect on NetBSD?

I'm not at all sure it does, but I did see it listed in
https://man.netbsd.org/ld.elf_so.1

LD_BIND_NOW If defined immediate binding of Procedure Link Table
(PLT) entries is performed instead of the default lazy
method.

so I assumed it would do the same as on linux.

Greetings,

Andres Freund

#17Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#16)
Re: Strange failure on mamba

Hi,

On 2022-11-29 22:31:50 -0800, Andres Freund wrote:

On 2022-11-30 00:55:42 -0500, Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

What libraries is postgres linked against? I don't know whether -z now only
affects the "top-level" dependencies of postgres, or also the dependencies of
shared libraries that haven't been built with -z now. The only dependencies
that I could see being relevant are libintl and openssl.

Hmm. mamba is using both --enable-nls and --with-openssl, but
I can't see a reason why the postmaster would be interacting with
OpenSSL post-startup in test cases that don't use SSL. Perhaps
libintl is doing something it shouldn't?

We do call into openssl in postmaster, via RandomCancelKey(). But we should
have signals masked at that point, so it shouldn't matter.

Openssl does some muckery with signal masks on ppc (and a few others archs,
but not x86), but I don't immediately see it conflicting with our code:

https://github.com/openssl/openssl/blob/master/crypto/ppccap.c#L275

It should also already have been executed by the time we accept connections,
due to the __attribute__ ((constructor)).

I didn't check where netbsd gets libcrypto and whether it does something
different than upstream openssl...

Greetings,

Andres Freund

#18Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#16)
Re: Strange failure on mamba

Andres Freund <andres@anarazel.de> writes:

On 2022-11-30 00:55:42 -0500, Tom Lane wrote:

Googling LD_BIND_NOW suggests that that's a Linux thing; do you know that
it should have an effect on NetBSD?

I'm not at all sure it does, but I did see it listed in
https://man.netbsd.org/ld.elf_so.1
LD_BIND_NOW If defined immediate binding of Procedure Link Table
(PLT) entries is performed instead of the default lazy
method.

I checked the source code, and learned that (1) yes, rtld does pay
attention to this, and (2) the documentation lies: it has to be not
only defined, but nonempty, to get any effect.

Also, I dug into my stuck processes some more, and I have to take
back the claim that this is happening later than postmaster startup.
All the stuck children are ones that either are launched on request
from the startup process, or are launched as soon as we get the
termination report for the startup process. So it's plausible that
the problem is happening during the postmaster's first select()
wait. I then got dirty with the assembly code, and found out that
where the stack trace stops is an attempt to resolve this call:

0xfd6f7a48 <__select50+76>: bl 0xfd700ed0 <0000803c.got2.plt_pic32._sys___select50>

which is inside libpthread.so and is trying to call something in libc.so.
So we successfully got to the select() function from PostmasterMain, but
that has a non-prelinked call to someplace else, and kaboom.

In short, looks like Andres' theory is right. It means that 8acd8f869
didn't actually fix anything, though it reduced the probability of the
failure by reducing the number of vulnerable PLT-indirect calls.

I've adjusted mamba to set LD_BIND_NOW=1 in its environment.
I've verified that that causes the call inside __select50
to get resolved before we reach main(), so I'm hopeful that
it will cure the issue. But it'll probably be a few weeks
before we can be sure.

Don't have a good idea about a non-band-aid fix. Perhaps we
should revert 8acd8f869 altogether, but then what? Even if
somebody comes up with a rewrite to avoid doing interesting
stuff in the postmaster's signal handlers, we surely wouldn't
risk back-patching it.

It's possible that doing nothing is okay, at least in the
short term. It's probably nigh impossible to hit this
issue on modern multi-CPU hardware. Or perhaps we could revive
the idea of having postmaster.c do one dummy select() call
before it unblocks signals.

regards, tom lane

#19Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#18)
Re: Strange failure on mamba

Hi,

On 2022-11-30 18:33:06 -0500, Tom Lane wrote:

Also, I dug into my stuck processes some more, and I have to take
back the claim that this is happening later than postmaster startup.
All the stuck children are ones that either are launched on request
from the startup process, or are launched as soon as we get the
termination report for the startup process. So it's plausible that
the problem is happening during the postmaster's first select()
wait. I then got dirty with the assembly code, and found out that
where the stack trace stops is an attempt to resolve this call:

0xfd6f7a48 <__select50+76>: bl 0xfd700ed0 <0000803c.got2.plt_pic32._sys___select50>

which is inside libpthread.so and is trying to call something in libc.so.
So we successfully got to the select() function from PostmasterMain, but
that has a non-prelinked call to someplace else, and kaboom.

This whole area just seems quite broken in netbsd :(.

We're clearly doing stuff in a signal handler that we really shouldn't, but
not being able to call any functions implemented in libc, even if they're
async signal safe (as e.g. select is) means signals are basically not
usable. Afaict this basically means that signals are *never* safe on netbsd,
as long as there's a single external function call in a signal handler.

I've adjusted mamba to set LD_BIND_NOW=1 in its environment.
I've verified that that causes the call inside __select50
to get resolved before we reach main(), so I'm hopeful that
it will cure the issue. But it'll probably be a few weeks
before we can be sure.

Don't have a good idea about a non-band-aid fix.

It's also a band aid, but perhaps a bit more reliable: We could link
statically to libc and libpthread.

Another approach could be to iterate over the loaded shared libraries during
postmaster startup and force symbols to be resolved. IIRC there's functions
that'd allow that. But it seems like a lot of work to work around an OS bug.

Perhaps we should revert 8acd8f869 altogether, but then what?

FWIW, I think we should consider using those flags everywhere for the backend
- they make copy-on-write more effective and decrease connection overhead a
bit, because otherwise each backend process does the same symbol resolutions
again and again, dirtying memory post-fork.

Even if somebody comes up with a rewrite to avoid doing interesting stuff in
the postmaster's signal handlers, we surely wouldn't risk back-patching it.

Would that actually fix anything, given netbsd's brokenness? If we used a
latch like mechanism, the signal handler would still use functions in libc. So
postmaster could deadlock, at least during the first execution of a signal
handler? So I think 8acd8f869 continues to be important...

Greetings,

Andres Freund

#20Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#19)
Re: Strange failure on mamba

Andres Freund <andres@anarazel.de> writes:

On 2022-11-30 18:33:06 -0500, Tom Lane wrote:

Even if somebody comes up with a rewrite to avoid doing interesting stuff in
the postmaster's signal handlers, we surely wouldn't risk back-patching it.

Would that actually fix anything, given netbsd's brokenness? If we used a
latch like mechanism, the signal handler would still use functions in libc. So
postmaster could deadlock, at least during the first execution of a signal
handler? So I think 8acd8f869 continues to be important...

I agree that "-z now" is a good idea for performance reasons, but
what we're seeing is that it's only a partial fix for netbsd's issue,
since it doesn't apply to shared libraries that the postmaster pulls
in.

I'm not sure about your thesis that things are fundamentally broken.
It does seem like if a signal handler does SetLatch then that could
require PLT resolution, and if it interrupts something else doing
PLT resolution then we have a problem. But if it were a live
problem then we'd have seen instances outside of the postmaster's
select() wait, and we haven't.

I'm kind of inclined to band-aid that select() call as previously
suggested, and see where we end up.

regards, tom lane