RTLD_GLOBAL (& JIT inlining)

Started by Andres Freundabout 8 years ago8 messageshackers
Jump to latest
#1Andres Freund
andres@anarazel.de

Hi Peter, All,

First question:

Why do we currently use RTLD_GLOBAL loading extension libraries, but
take pains ([1]http://archives.postgresql.org/message-id/2652.1475512158%40sss.pgh.pa.us) to make things work without RTLD_GLOBAL. It seems like
it'd be both safer to RTLD_LOCAL on platforms supporting it (or
equivalent), as well as less error-prone because we'd be less likely to
depend on it like we did e.g. during transforms development.

It seems error prone because on systmes I'm aware of (IOW linux ;))
conflicting symbols will not cause errors. Neither when there's a
conflict between a .so and the main binary, nor between different
shlibs. My reading of glibc's ld is that the search order for
unresolved references in a .so is main binary, and then other
RTLD_GLOBAL shared libraries in order of time they're loaded.

Right now it appears that e.g. hstore_plperl has some issue, building
with RTLD_LOCAL makes its test fail. Which means it'll probably not work
on some platforms.

I think using RTLD_LOCAL on most machines would be a much better
idea. I've not found proper explanations why GLOBAL is used. We started
using it ages ago, with [2]https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=c87bc779d4e9f109e92f8b8f1dfad5d6739f8e97, but that commit contains no explanation,
and a quick search didn't show up anything either. Peter?

Secondly:

For JITing internal & C operators, and other functions referenced in
JITed code, can be inlined if the definition is available in a suitable
form. The details how that works don't matter much here.

For my jitting work I've so far treated symbols of binaries and all
referenced extensions as being part of a single namespace. Currently the
order in which symbols with multiple definitions were looked up was in
essentially in filesystem order.

While I'm not aware of any observable problems, reviewing that decision
I think that's a terrible idea. First and foremost it seems obviously
wrong to not load symbols from the main postgres binary first. But
leaving that aside, I'm uncomfortable doing inlining between shlibs for
JITing, because that can lead to symbols beeing reloaded differently
between non JITed code (where all external libraries loaded in the
current library are searched in load/usage order) and JITed code (where
the symbols would be resolved according to some static order).

I'm about to rewrite it so that functions are inlined just from the main
binary, or the basename of the shared library explicitly referenced in
the function definition. That's still different, but at least easy to
understand. But that still means there's inconsistent order between
different methods of execution, which isn't great.

Comments?

Greetings,

Andres Freund

[1]: http://archives.postgresql.org/message-id/2652.1475512158%40sss.pgh.pa.us
[2]: https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=c87bc779d4e9f109e92f8b8f1dfad5d6739f8e97

#2Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#1)
Re: RTLD_GLOBAL (& JIT inlining)

Hi,

On 2018-02-22 23:11:07 -0800, Andres Freund wrote:

I think using RTLD_LOCAL on most machines would be a much better
idea. I've not found proper explanations why GLOBAL is used. We started
using it ages ago, with [2], but that commit contains no explanation,
and a quick search didn't show up anything either. Peter?

Hm. No idea if that was the reason back then, but I think it's
unfortunately not easy to change. The reason why some plperlu tests
fail, is that plperl extension libraries rely on plperl to be loaded
into the global namespace.

When plperl gets loaded with RTLD_LOCAL all the dependant shared
libaries (i.e. libperl.so) also get loaded with it. That works perfectly
for plperl itself, but once per loads an extension library is loaded, it
fails to resolve libperl.so symbols... I can "fix" this by
dlopen(RTLD_GLOBAL) libperl.so, but that's obviously not a practial
solution.

FWIW, something in plperl's error handling appears to be busted. Instead
of a helpful error report we get:
ERROR: 22021: invalid byte sequence for encoding "UTF8": 0x00
which doesn't help much to nail down the issue.

With a breakpoint it becomes clear what the issue is:

#0 report_invalid_encoding (encoding=6, mbstr=0x557ae880416b "", len=252)
at /home/andres/src/postgresql/src/backend/utils/mb/wchar.c:1997
#1 0x0000557ae73d4473 in pg_verify_mbstr_len (encoding=6, mbstr=0x557ae880416b "", len=252, noError=0 '\000')
at /home/andres/src/postgresql/src/backend/utils/mb/wchar.c:1936
#2 0x0000557ae73d4354 in pg_verify_mbstr (encoding=6,
mbstr=0x557ae8804080 "Can't load '/usr/lib/x86_64-linux-gnu/perl5/5.26/auto/List/Util/Util.so' for module List::Util: /usr/lib/x86_64-linux-gnu/perl5/5.26/auto/List/Util/Util.so: undefined symbol: PL_memory_wrap at /usr/share/perl/5.26/XSLoader.pm line 96.\n", len=487, noError=0 '\000') at /home/andres/src/postgresql/src/backend/utils/mb/wchar.c:1879
#3 0x0000557ae73d119a in pg_any_to_server (
s=0x557ae8804080 "Can't load '/usr/lib/x86_64-linux-gnu/perl5/5.26/auto/List/Util/Util.so' for module List::Util: /usr/lib/x86_64-linux-gnu/perl5/5.26/auto/List/Util/Util.so: undefined symbol: PL_memory_wrap at /usr/share/perl/5.26/XSLoader.pm line 96.\n", len=487, encoding=6) at /home/andres/src/postgresql/src/backend/utils/mb/mbutils.c:572
#4 0x00007f4cabb6b123 in utf_u2e (
utf8_str=0x557ae8804080 "Can't load '/usr/lib/x86_64-linux-gnu/perl5/5.26/auto/List/Util/Util.so' for module List::Util: /usr/lib/x86_64-linux-gnu/perl5/5.26/auto/List/Util/Util.so: undefined symbol: PL_memory_wrap at /usr/share/perl/5.26/XSLoader.pm line 96.\n", len=487) at /home/andres/src/postgresql/src/pl/plperl/plperl_helpers.h:16
#5 0x00007f4cabb6b2f6 in sv2cstr (sv=0x557ae8769568) at /home/andres/src/postgresql/src/pl/plperl/plperl_helpers.h:96
#6 0x00007f4cabb732dc in plperl_call_perl_func (desc=0x557ae86cf6b0, fcinfo=0x557ae875fc20)
at /home/andres/src/postgresql/src/pl/plperl/plperl.c:2250
#7 0x00007f4cabb74a3b in plperl_func_handler (fcinfo=0x557ae875fc20)
at /home/andres/src/postgresql/src/pl/plperl/plperl.c:2438

I don't now perl apis at all, but it's clearly wrong that fram 3,4 show
len=487, which comes from SvPVutf8() in sv2cstr(). It appears that
somehow perl throws out two error messages separated by a null byte...

(gdb) p *(char[487]*)utf8_str
$42 = "Can't load '/usr/lib/x86_64-linux-gnu/perl5/5.26/auto/List/Util/Util.so' for module List::Util: /usr/lib/x86_64-linux-gnu/perl5/5.26/auto/List/Util/Util.so: undefined symbol: PL_memory_wrap at /usr/share/perl/5.26/XSLoader.pm line 96.\n\000 at /usr/lib/x86_64-linux-gnu/perl5/5.26/List/Util.pm line 23.\nCompilation failed in require at /usr/lib/x86_64-linux-gnu/perl5/5.26/Scalar/Util.pm line 23.\nCompilation failed in require at /usr/lib/x86_64-linux-gnu/perl/5.26/Data/Dumper.pm line 303.\n"

Greetings,

Andres Freund

#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#1)
Re: RTLD_GLOBAL (& JIT inlining)

Andres Freund <andres@anarazel.de> writes:

First question:
Why do we currently use RTLD_GLOBAL loading extension libraries, but
take pains ([1]) to make things work without RTLD_GLOBAL.

As mentioned in that very message, the point was better missing-symbol
error detection, not RTLD_GLOBAL/LOCAL per se.

I think using RTLD_LOCAL on most machines would be a much better
idea. I've not found proper explanations why GLOBAL is used. We started
using it ages ago, with [2], but that commit contains no explanation,
and a quick search didn't show up anything either. Peter?

/messages/by-id/7142.1277772335@sss.pgh.pa.us

My position is the same as then: I'm happy to remove it if it doesn't
break things anywhere ... but it seems like it would cause problems for
plpython, unless their behavior has changed since 2001 which is
surely possible.

regards, tom lane

#4Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#3)
Re: RTLD_GLOBAL (& JIT inlining)

On 2018-02-23 11:05:23 -0500, Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

I think using RTLD_LOCAL on most machines would be a much better
idea. I've not found proper explanations why GLOBAL is used. We started
using it ages ago, with [2], but that commit contains no explanation,
and a quick search didn't show up anything either. Peter?

/messages/by-id/7142.1277772335@sss.pgh.pa.us

My position is the same as then: I'm happy to remove it if it doesn't
break things anywhere ... but it seems like it would cause problems for
plpython, unless their behavior has changed since 2001 which is
surely possible.

It hasn't, and it's not just plpython, at least plperl is affected as
well. I'd guess most libraries that have their own dynamic loading
support are affected.

So RTLD_LOCAL is out of the question, but I think we can get a good bit
of the benefit by either specifying -Wl,-Bsymbolic at shlib build time,
or RTLD_DEEPBIND at dlopen() time. Either leads to the opened shared
library effectively being put at the beginning of the search path,
therefore avoiding the issue that an earlier loaded shared library or
symbols from the main binary can accidentally overwrite things in the
shared library itself. Which incidentally also makes loading a bit
faster.

A bit of googling suggests -Bsymbolic is likely to be more portable.

A quick test confirms that under linux our tests pass with it, after
patching Makefile.shlib.

Greetings,

Andres Freund

#5Ants Aasma
ants.aasma@cybertec.at
In reply to: Andres Freund (#4)
Re: RTLD_GLOBAL (& JIT inlining)

On Mon, Feb 26, 2018 at 11:28 PM, Andres Freund <andres@anarazel.de> wrote:

So RTLD_LOCAL is out of the question, but I think we can get a good bit
of the benefit by either specifying -Wl,-Bsymbolic at shlib build time,
or RTLD_DEEPBIND at dlopen() time. Either leads to the opened shared
library effectively being put at the beginning of the search path,
therefore avoiding the issue that an earlier loaded shared library or
symbols from the main binary can accidentally overwrite things in the
shared library itself. Which incidentally also makes loading a bit
faster.

I think this would also fix oracle_fdw crashing when postgres is
compiled with --with-ldap. At least RTLD_DEEPBIND helped. [1]/messages/by-id/CA+CSw_tPDYgnzCYW0S4oU0mTUoUhZ9pc7MRBPXVD-3Zbiwni9w@mail.gmail.com

[1]: /messages/by-id/CA+CSw_tPDYgnzCYW0S4oU0mTUoUhZ9pc7MRBPXVD-3Zbiwni9w@mail.gmail.com

Ants Aasma

#6Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Ants Aasma (#5)
Re: RTLD_GLOBAL (& JIT inlining)

Hello.

At Mon, 26 Feb 2018 23:50:41 +0200, Ants Aasma <ants.aasma@eesti.ee> wrote in <CA+CSw_uD6dZdg9BbOLCx+oY0rufsRSp5V5W8teonCB=OK=74aw@mail.gmail.com>

On Mon, Feb 26, 2018 at 11:28 PM, Andres Freund <andres@anarazel.de> wrote:

So RTLD_LOCAL is out of the question, but I think we can get a good bit
of the benefit by either specifying -Wl,-Bsymbolic at shlib build time,
or RTLD_DEEPBIND at dlopen() time. Either leads to the opened shared
library effectively being put at the beginning of the search path,
therefore avoiding the issue that an earlier loaded shared library or
symbols from the main binary can accidentally overwrite things in the
shared library itself. Which incidentally also makes loading a bit
faster.

I think this would also fix oracle_fdw crashing when postgres is
compiled with --with-ldap. At least RTLD_DEEPBIND helped. [1]

[1] /messages/by-id/CA+CSw_tPDYgnzCYW0S4oU0mTUoUhZ9pc7MRBPXVD-3Zbiwni9w@mail.gmail.com

I saw several cases of the misbinding specifically among
extensions using different versions of the same library or a part
of which was just transplanted from another extension. Now I'm
seeing a case nearby again and found this thread.

Some pl-libraries (plpython does but plperl doesn't for me)
mandates to export their symbols for external modules. So
RTLD_GLOBAL is needed for such extensions but not needed in most
cases. We need a means to instruct whether a module wants to
expose symbols or not.

The extension control file doesn't seem the place. The most
appropriate way seems to be another magic data.

With the attached patch, external modules are loaded RTLD_LOCAL
by default. PG_MODULE_EXPORT_SYMBOL in a module source files let
it loaded RTLD_GLOBAL. I suppose this is the change with the
least impact on existing modules because I believe most of them
don't need to export symbols.

I think this works for all platforms that have dlopen or
shl_load. Windows doesn't the equivalent but anyway we should
explicitly declare using dllexport.

Any opinions or thoughts?

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

0001-Add-control-on-wheter-symbols-in-external-module-is-.patchtext/x-patch; charset=us-asciiDownload+29-2
#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: Kyotaro Horiguchi (#6)
Re: RTLD_GLOBAL (& JIT inlining)

Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> writes:

With the attached patch, external modules are loaded RTLD_LOCAL
by default. PG_MODULE_EXPORT_SYMBOL in a module source files let
it loaded RTLD_GLOBAL. I suppose this is the change with the
least impact on existing modules because I believe most of them
don't need to export symbols.

This seems really hacky :-(

In the PL cases, at least, it's not so much the PL itself that
wants to export symbols as the underlying language library
(libpython, libperl, etc). You're assuming that an action on
the PL .so will propagate to the underlying language .so, which
seems like a pretty shaky assumption. I wonder also whether
closing and reopening the DL handle will really do anything
at all for already-loaded libraries ...

regards, tom lane

#8Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Tom Lane (#7)
Re: RTLD_GLOBAL (& JIT inlining)

At Thu, 24 Jan 2019 01:02:14 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote in <24744.1548309734@sss.pgh.pa.us>

Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> writes:

With the attached patch, external modules are loaded RTLD_LOCAL
by default. PG_MODULE_EXPORT_SYMBOL in a module source files let
it loaded RTLD_GLOBAL. I suppose this is the change with the
least impact on existing modules because I believe most of them
don't need to export symbols.

This seems really hacky :-(

In the PL cases, at least, it's not so much the PL itself that
wants to export symbols as the underlying language library
(libpython, libperl, etc). You're assuming that an action on
the PL .so will propagate to the underlying language .so, which

(If I understand you correctly) I'm not assuming that the
propagation happens or even the contray. As can be seen from it
actually works, symbols of underlying libraries referred from
pl*.so are resoved using dependency list in pl*.so itself
independently of the RTLD flags given for it. The problem here is
while underlying language libraries cannot ferer symbols back in
the pl*.so when it is loaded with RTLD_LOCAL, RTLD_GLOBAL makes
some extensions unhappy.

I think that the assumption that RTLD_GLOBAL resolves all was
proved rather shaky. RTLD_DEEPBIND is not available on some
platforms so we need to choose between LOCAL and GLOBAL per
loading module.

seems like a pretty shaky assumption. I wonder also whether
closing and reopening the DL handle will really do anything
at all for already-loaded libraries ...

internal_load_library() avoids duplicate loading so I believe it
cannot happen. Regarding duplicate dlopen, RTLD_GLOBAL just
promotes RTLD_LOCAL. In other words, the former wins. So it's not
a problem if the dlclose() there did nothing.

We could RTLD_NOLOAD for almost all platforms but win32 implement
needs additinal change to make work it as expected.

If you are saying hacky about that it uses only the symbol of
unused function, we can actually call it to get the flag.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center