Preventing abort() and exit() calls in libpq
[ starting a new thread so as not to confuse the cfbot ]
I wrote:
Michael Paquier <michael@paquier.xyz> writes:
Good point. That's worse than just pfree() which is just a plain call
to free() in the frontend. We could have more policies here, but my
take is that we'd better move fe_memutils.o to OBJS_FRONTEND in
src/common/Makefile so as shared libraries don't use those routines in
the long term.
Ugh. Not only is that bad, but your proposed fix doesn't fix it.
At least in psql, and probably in most/all of our other clients,
removing fe_memutils.o from libpq's link just causes it to start
relying on the copy in the psql executable :-(. So I agree that
some sort of mechanical enforcement would be a really good thing,
but I'm not sure what it would look like.
After some thought I propose that what we really want is to prevent
any calls of abort() or exit() from inside libpq. Attached is a
draft patch to do that. This can't be committed as-is, because
we still have some abort() calls in there in HEAD, but if we could
get that cleaned up it'd work. Alternatively we could just disallow
exit(), which'd be enough to catch the problematic src/common files.
This relies on "nm" being able to work on shlibs, which it's not
required to by POSIX. However, it seems to behave as desired even
on my oldest dinosaurs. In any case, if "nm" doesn't work then
we'll just not detect such problems on that platform, which should
be OK as long as the test does work on common platforms.
Other than that point I think it's relying only on POSIX-spec
features.
I'll stick this into the CF list to see if the cfbot agrees that
it finds the abort() problems...
regards, tom lane
Attachments:
disallow-abort-and-exit-in-libpq-1.patchtext/x-diff; charset=us-ascii; name=disallow-abort-and-exit-in-libpq-1.patchDownload+7-1
On Sat, Jun 26, 2021 at 05:29:29PM -0400, Tom Lane wrote:
I'll stick this into the CF list to see if the cfbot agrees that
it finds the abort() problems...
The CF Bot is finding those problems.
+# Check for functions that libpq must not call. +# (If nm doesn't exist or doesn't work on shlibs, this test will silently +# do nothing, which is fine.) +.PHONY: check-libpq-refs +check-libpq-refs: $(shlib) + @! nm -A -g -u $< 2>/dev/null | grep -e abort -e exit
"abort" and "exit" could be generic terms present in some other
libraries. Could be be better to match with "U abort" and "U exit"
instead? MinGW has a nm command, and it has a compatible option set,
so I think that it should work.
--
Michael
+# Check for functions that libpq must not call. +# (If nm doesn't exist or doesn't work on shlibs, this test will silently +# do nothing, which is fine.) +.PHONY: check-libpq-refs +check-libpq-refs: $(shlib) + @! nm -A -g -u $< 2>/dev/null | grep -e abort -e exit"abort" and "exit" could be generic terms present in some other
libraries. Could be be better to match with "U abort" and "U exit"
instead? MinGW has a nm command, and it has a compatible option set,
so I think that it should work.
A possible trick is to add ccp flags such as: -Dexit=exit_BAD
-Dabort=abort_BAD.
--
Fabien.
Michael Paquier <michael@paquier.xyz> writes:
On Sat, Jun 26, 2021 at 05:29:29PM -0400, Tom Lane wrote:
I'll stick this into the CF list to see if the cfbot agrees that
it finds the abort() problems...
The CF Bot is finding those problems.
+# Check for functions that libpq must not call. +# (If nm doesn't exist or doesn't work on shlibs, this test will silently +# do nothing, which is fine.) +.PHONY: check-libpq-refs +check-libpq-refs: $(shlib) + @! nm -A -g -u $< 2>/dev/null | grep -e abort -e exit
Yeah, all except on Windows. Not sure if it's worth trying to build
some way to make this check on Windows.
"abort" and "exit" could be generic terms present in some other
libraries. Could be be better to match with "U abort" and "U exit"
instead?
No, for a couple of reasons:
* nm's output format isn't all that well standardized
* on some platforms, what appears here is "_abort".
I would have liked to use "-w" in the grep call, but between the
"_abort" case and the "abort@@GLIBC" case we see elsewhere, we'd
be assuming way too much about what grep will consider to be a word.
In practice I don't think it's too much of a problem. It doesn't
matter whether libc has exported names containing "exit", unless
libpq or something it imports from src/common or src/port actually
attempts to call those names. Which I'm not expecting.
A possible counterexample is atexit(3). If libpq ever grew a
reason to call that then we'd have an issue. It wouldn't be
that hard to work around, by adding a grep -v filter. But in
any case I'm dubious that we could ever make correct use of
atexit(3) in libpq, because we'd have no way to know whether
the host application has its own atexit callbacks and if so
whether they'll run before or after libpq's. Something like
isolationtester's atexit callback to PQclose all its connections
would risk breakage if libpq tried to clean up via atexit.
regards, tom lane
Fabien COELHO <coelho@cri.ensmp.fr> writes:
A possible trick is to add ccp flags such as: -Dexit=exit_BAD
-Dabort=abort_BAD.
Not really going to work, at least not without a lot of fragile
kluges, because the main problem here is to prevent libpq from
*indirectly* calling those functions via stuff it imports from
src/port or src/common.
It's possible that we could make it work by generalizing the
policy that "libpq may not call abort/exit" into "no PG shlib
may call abort/exit", and then apply the cpp #defines while
compiling the xxx_shlib.o variants of those files. This does
not seem more attractive than what I proposed, though.
regards, tom lane
I wrote:
This relies on "nm" being able to work on shlibs, which it's not
required to by POSIX. However, it seems to behave as desired even
on my oldest dinosaurs. In any case, if "nm" doesn't work then
we'll just not detect such problems on that platform, which should
be OK as long as the test does work on common platforms.
Other than that point I think it's relying only on POSIX-spec
features.
Further dinosaur-wrangling reveals a small problem on prairiedog
(ancient macOS):
$ nm -A -g -u libpq.5.14.dylib | grep abort
libpq.5.14.dylib:fe-connect.o: _abort
libpq.5.14.dylib:_eprintf.o: _abort
The fe-connect.o reference is from PGTHREAD_ERROR of course,
but what's that other thing? Investigation finds this:
IOW it seems that this file is pulled in to implement <assert.h>,
and the abort call underlies uses of Assert. So that seems fine
from a coding-rule perspective: it's okay for development builds
to contain core-dumping assertions. It complicates matters for
the proposed patch though.
As far as old macOS goes, it seems like we can work around this
pretty easily, since this version of nm helpfully breaks down
the references by .o file: just add a "grep -v" pass to reject
"_eprintf.o:". However, if there are any other platforms that
similarly convert assert() calls into some direct reference
to abort(), it may be harder to work around it elsewhere.
I guess the only way to know is to see what the buildfarm
says.
Worst case, we might only be able to enforce the prohibition
against exit(). That'd be annoying but it's still much better
than nothing.
regards, tom lane
I wrote:
This relies on "nm" being able to work on shlibs, which it's not
required to by POSIX. However, it seems to behave as desired even
on my oldest dinosaurs. In any case, if "nm" doesn't work then
we'll just not detect such problems on that platform, which should
be OK as long as the test does work on common platforms.
So I pushed that, and not very surprisingly, it's run into some
portability problems. gombessa (recent OpenBSD) reports
! nm -A -g -u libpq.so.5.15 2>/dev/null | grep -v '_eprintf\\.o:' | grep -e abort -e exit
libpq.so.5.15:__cxa_atexit
So far as I can find, __cxa_atexit is a C++ support routine, so
I wondered what the heck libpq.so is doing calling it. I managed
to reproduce the failure here using an OpenBSD installation I had
at hand, and confirmed that __cxa_atexit is *not* referenced by any
of the .o files in src/port, src/common, or src/interfaces/libpq.
So apparently it's being injected at some fairly low level of the
shlib support on that platform.
Probably the thing to do is adjust the grep filter to exclude
__cxa_atexit, but I want to wait awhile and see whether any
other buildfarm animals report this. morepork at least will
be pretty interesting, since it's a slightly older OpenBSD
vintage.
regards, tom lane
I wrote:
So I pushed that, and not very surprisingly, it's run into some
portability problems. gombessa (recent OpenBSD) reports
! nm -A -g -u libpq.so.5.15 2>/dev/null | grep -v '_eprintf\\.o:' | grep -e abort -e exit
libpq.so.5.15:__cxa_atexit
After a few more hours, all of our OpenBSD animals have reported
that, on several different OpenBSD releases and with both gcc
and clang compilers. So at least it's a longstanding platform
behavior.
More troublingly, fossa reports this:
! nm -A -g -u libpq.so.5.15 2>/dev/null | grep -v '_eprintf\\.o:' | grep -e abort -e exit
libpq.so.5.15: U abort@@GLIBC_2.2.5
Where is that coming from? hippopotamus and jay, which seem to
be different compilers on the same physical machine, aren't showing
it. That'd lead to the conclusion that icc is injecting abort()
calls of its own accord, which seems quite nasty. Lacking an icc
license, I can't poke into that more directly here.
Perhaps we could wrap the test for abort() in something like
'if "$CC" != icc then ...', but ugh.
regards, tom lane
On 2021-Jun-29, Tom Lane wrote:
More troublingly, fossa reports this:
! nm -A -g -u libpq.so.5.15 2>/dev/null | grep -v '_eprintf\\.o:' | grep -e abort -e exit
libpq.so.5.15: U abort@@GLIBC_2.2.5Where is that coming from? hippopotamus and jay, which seem to
be different compilers on the same physical machine, aren't showing
it. That'd lead to the conclusion that icc is injecting abort()
calls of its own accord, which seems quite nasty. Lacking an icc
license, I can't poke into that more directly here.
I noticed that the coverage report is not updating, and lo and behold
it's failing this bit.
I can inspect the built files ... what exactly are you looking for?
--
�lvaro Herrera Valdivia, Chile
<Schwern> It does it in a really, really complicated way
<crab> why does it need to be complicated?
<Schwern> Because it's MakeMaker.
Ah, I nm'd all files in src/interfaces/libpq and got no hits for abort.
But I did get one in libpgport_shlib.a:
path_shlib.o:
U abort
0000000000000320 T canonicalize_path
0000000000000197 T cleanup_path
00000000000009e3 t dir_strcmp
...
--
�lvaro Herrera Valdivia, Chile
"People get annoyed when you try to debug them." (Larry Wall)
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
Ah, I nm'd all files in src/interfaces/libpq and got no hits for abort.
But I did get one in libpgport_shlib.a:
path_shlib.o:
U abort
Yeah, there is one in get_progname(). But path.o shouldn't be getting
pulled into libpq ... else why aren't all the animals failing?
What platform does the coverage report run on exactly?
regards, tom lane
On 2021-Jun-29, Tom Lane wrote:
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
Ah, I nm'd all files in src/interfaces/libpq and got no hits for abort.
But I did get one in libpgport_shlib.a:path_shlib.o:
U abortYeah, there is one in get_progname(). But path.o shouldn't be getting
pulled into libpq ... else why aren't all the animals failing?
Maybe there's something about the linker flags being used.
... ah yeah, if I configure with coverage enabled on my machine, it fails in the same way.
What platform does the coverage report run on exactly?
It's Debian Buster.
libpq.so is linked as
gcc -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Werror=vla -Wendif-labels -Wmissing-format-attribute -Wimplicit-fallthrough=3 -Wcast-function-type -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -Wno-format-truncation -Wno-stringop-truncation -fprofile-arcs -ftest-coverage -O0 -pthread -D_REENTRANT -D_THREAD_SAFE -D_POSIX_PTHREAD_SEMANTICS -fPIC -shared -Wl,-soname,libpq.so.5 -Wl,--version-script=exports.list -o libpq.so.5.15 fe-auth-scram.o fe-connect.o fe-exec.o fe-lobj.o fe-misc.o fe-print.o fe-protocol3.o fe-secure.o fe-trace.o legacy-pqsignal.o libpq-events.o pqexpbuffer.o fe-auth.o fe-secure-common.o fe-secure-openssl.o -L../../../src/port -L../../../src/common -lpgcommon_shlib -lpgport_shlib -L/usr/lib/llvm-6.0/lib -Wl,--as-needed -Wl,-rpath,'/usr/local/pgsql/lib',--enable-new-dtags -lssl -lcrypto -lm -lldap_r
and libpgport was just
ar crs libpgport_shlib.a fls_shlib.o getpeereid_shlib.o strlcat_shlib.o strlcpy_shlib.o pg_crc32c_sse42_shlib.o pg_crc32c_sb8_shlib.o pg_crc32c_sse42_choose_shlib.o bsearch_arg_shlib.o chklocale_shlib.o erand48_shlib.o inet_net_ntop_shlib.o noblock_shlib.o path_shlib.o pg_bitutils_shlib.o pg_strong_random_shlib.o pgcheckdir_shlib.o pgmkdirp_shlib.o pgsleep_shlib.o pgstrcasecmp_shlib.o pgstrsignal_shlib.o pqsignal_shlib.o qsort_shlib.o qsort_arg_shlib.o quotes_shlib.o snprintf_shlib.o strerror_shlib.o tar_shlib.o thread_shlib.o
--
�lvaro Herrera Valdivia, Chile
https://www.EnterpriseDB.com/
On 2021-Jun-30, Alvaro Herrera wrote:
On 2021-Jun-29, Tom Lane wrote:
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
Ah, I nm'd all files in src/interfaces/libpq and got no hits for abort.
But I did get one in libpgport_shlib.a:path_shlib.o:
U abortYeah, there is one in get_progname(). But path.o shouldn't be getting
pulled into libpq ... else why aren't all the animals failing?Maybe there's something about the linker flags being used.
... ah yeah, if I configure with coverage enabled on my machine, it fails in the same way.
If I remove -fprofile-arcs from CFLAGS, then abort is no longer present,
but we still get a fail because of __gcov_exit. I suppose if you'd add
an exception for __cxa_atexit, the same place could use one for
__gcov_exit.
I'm not sure what to make of the -fprofile-arcs stuff though.
--
�lvaro Herrera Valdivia, Chile
https://www.EnterpriseDB.com/
"Java is clearly an example of money oriented programming" (A. Stepanov)
On 2021-Jun-30, Alvaro Herrera wrote:
If I remove -fprofile-arcs from CFLAGS, then abort is no longer present,
but we still get a fail because of __gcov_exit. I suppose if you'd add
an exception for __cxa_atexit, the same place could use one for
__gcov_exit.
I tried the attached patch, and while libpq.so now builds successfully,
it causes anything that tries to link to libpq fail like
gcc -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Werror=vla -Wendif-labels -Wmissing-format-attribute -Wimplicit-fallthrough=3 -Wcast-function-type -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -Wno-format-truncation -Wno-stringop-truncation -g -fprofile-arcs -ftest-coverage findtimezone.o initdb.o localtime.o -L../../../src/port -L../../../src/common -L../../../src/fe_utils -lpgfeutils -L../../../src/common -lpgcommon -L../../../src/port -lpgport -L../../../src/interfaces/libpq -lpq -Wl,--as-needed -Wl,-rpath,'/pgsql/install/master-coverage/lib',--enable-new-dtags -lpgcommon -lpgport -lpthread -lxml2 -lssl -lcrypto -lz -lreadline -lpthread -lrt -ldl -lm -o initdb
/usr/bin/ld: initdb: hidden symbol `__gcov_merge_add' in /usr/lib/gcc/x86_64-linux-gnu/8/libgcov.a(_gcov_merge_add.o) is referenced by DSO
/usr/bin/ld: final link failed: bad value
collect2: error: ld returned 1 exit status
make[3]: *** [Makefile:43: initdb] Error 1
so this doesn't look too promising.
--
�lvaro Herrera Valdivia, Chile
https://www.EnterpriseDB.com/
Attachments:
0001-Fix-libpq-build-for-coverage-support.patchtext/x-diff; charset=us-asciiDownload+2-3
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
Maybe there's something about the linker flags being used.
... ah yeah, if I configure with coverage enabled on my machine, it fails in the same way.
Ah-hah, yeah, I see it too if I enable profiling. I can confirm
that it's not from the abort() call in path.c, because it's still
there if I remove that. So this is another case where build
infrastructure is injecting abort() calls we didn't ask for.
Between this and the icc case, I'm now inclined to give up on
trying to forbid abort() calls in libpq. I think the value-add
for that is a lot lower than it is for exit() anyway. abort()
is something one doesn't toss around lightly.
You mentioned __gcov_exit, but I'm not sure if we need an
exception for that. I see it referenced by the individual .o
files, but the completed .so has no such reference, so at least
on RHEL8 it's apparently satisfied during .so linkage. Do you
see something different?
regards, tom lane
On 2021-Jun-30, Tom Lane wrote:
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
Maybe there's something about the linker flags being used.
... ah yeah, if I configure with coverage enabled on my machine, it fails in the same way.Ah-hah, yeah, I see it too if I enable profiling. I can confirm
that it's not from the abort() call in path.c, because it's still
there if I remove that. So this is another case where build
infrastructure is injecting abort() calls we didn't ask for.
Hah, I didn't think to try that.
Between this and the icc case, I'm now inclined to give up on
trying to forbid abort() calls in libpq. I think the value-add
for that is a lot lower than it is for exit() anyway. abort()
is something one doesn't toss around lightly.
No objections to that.
You mentioned __gcov_exit, but I'm not sure if we need an
exception for that. I see it referenced by the individual .o
files, but the completed .so has no such reference, so at least
on RHEL8 it's apparently satisfied during .so linkage. Do you
see something different?
Well, not really. I saw it but only after I removed -fprofile-arcs from
Makefile.shlib's link line; but per my other email, that doesn't really
work.
Everything seems to work well for me after removing abort from that grep.
--
�lvaro Herrera Valdivia, Chile
https://www.EnterpriseDB.com/
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
On 2021-Jun-30, Tom Lane wrote:
You mentioned __gcov_exit, but I'm not sure if we need an
exception for that. I see it referenced by the individual .o
files, but the completed .so has no such reference, so at least
on RHEL8 it's apparently satisfied during .so linkage. Do you
see something different?
Well, not really. I saw it but only after I removed -fprofile-arcs from
Makefile.shlib's link line; but per my other email, that doesn't really
work.
Everything seems to work well for me after removing abort from that grep.
OK, thanks, will push a fix momentarily.
regards, tom lane
On 2021-Jun-30, Tom Lane wrote:
OK, thanks, will push a fix momentarily.
(BTW since the _eprintf.o stuff comes from _abort, I suppose you're
going to remove that grep -v too?)
--
�lvaro Herrera 39�49'30"S 73�17'W
https://www.EnterpriseDB.com/
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
(BTW since the _eprintf.o stuff comes from _abort, I suppose you're
going to remove that grep -v too?)
Right, I did that.
regards, tom lane
I wrote:
OK, thanks, will push a fix momentarily.
Did so, and look what popped up on wrasse [1]https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=wrasse&dt=2021-06-30%2014%3A58%3A15:
! nm -A -g -u libpq.so.5.15 2>/dev/null | grep -v __cxa_atexit | grep exit
libpq.so.5.15: [765] | 232544| 248|FUNC |GLOB |3 |14 |PQexitPipelineMode
This makes no sense, because (a) wrasse was happy with the previous
version, and (b) surely the "-u" switch should prevent nm from
printing PQexitPipelineMode. Noah, did you change anything about
wrasse's configuration today?
regards, tom lane
[1]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=wrasse&dt=2021-06-30%2014%3A58%3A15