Reduce function call costs on ELF platforms

Started by Andres Freundabout 4 years ago10 messages
#1Andres Freund
andres@anarazel.de

Hi,

There's two related, but somewhat different aspects to $subject.

TL;DR: We use use -fvisibility=hidden + explicit symbol visiblity,
-Wl,-Bdynamic, -fno-plt

1) Cross-translation-unit calls in extension library

A while ago I was looking at a profile of a workload that spent a good chunk
of time in an extension. Looking at the instruction level profile it showed
that some of that time was spent doing more-complicated-than-necessary
function calls to other functions within the extension.

Basically they way we currently build our extensions, the compiler & linker
assume every symbol inside the extension libraries needs to be interceptable
by the main binary. Which means that all function calls to symbols visible
outside the current translation unit need to be made indirectly via the PLT.

An example of this (picked from plpgsql, for simplicity)

0000000000024a40 <plpgsql_inline_handler>:
{
...
func = plpgsql_compile_inline(codeblock->source_text);
24a80: 48 8b 85 a8 fe ff ff mov -0x158(%rbp),%rax
24a87: 48 8b 78 08 mov 0x8(%rax),%rdi
24a8b: e8 20 41 fe ff call 8bb0 <plpgsql_compile_inline@plt>
...

0000000000008bb0 <plpgsql_compile_inline@plt>:
8bb0: ff 25 da ac 02 00 jmp *0x2acda(%rip) # 33890 <plpgsql_compile_inline@@Base+0x24de0>
8bb6: 68 12 01 00 00 push $0x112
8bbb: e9 c0 ee ff ff jmp 7a80 <_init+0x18>

I.e. plpgsql_inline_handler doesn't call plpgsql_compile_inline() directly, it
calls plpgsql_compile_inline@plt(), which then loads the target address for
plpgsql_compile_inline() from the global offset table. Depending on the linker
settings / flags passed to dlopen() that'll point to yet another wrapper
function (doing a dynamic symbol lookup on the first call, putting the
real address in the GOT).

This can be addressed to some degree by using explicit symbol visibility
markers, as I propose in [1]/messages/by-id/20211101020311.av6hphdl6xbjbuif@alap3.anarazel.de.

With that patch applied compiler / linker know that plpgsql_compile_inline()
is not an external symbol, and therefore doesn't need to go through the
PLT/GOT. That changes the above to:

func = plpgsql_compile_inline(codeblock->source_text);
23000: 48 8b 85 a8 fe ff ff mov -0x158(%rbp),%rax
23007: 48 8b 78 08 mov 0x8(%rax),%rdi
2300b: e8 00 a1 fe ff call d110 <plpgsql_compile_inline>

which unsurprisingly is cheaper.

2) Calls to exported functions in extension library

However, this does *not* address the issue fully. When an extension calls a
function that has to be exported, the symbol with continue to be loaded from
the PLT.

E.g. hstorePairs() has to be exported, because it's called from transform
modules. That results in calls to hstorePairs() from within hstore.so to go
through the PLT. e.g.

000000000000e380 <hstore_subscript_assign>:
{
...
e427: e8 e4 59 ff ff call 3e10 <hstorePairs@plt>

In theory we could mark such symbols as "protected" while compiling hstore.so
and as "default" otherwise, but that's pretty complicated. And there are some
toolchain issues with protected visibility.

The easier approach for this class of issues is to use the linker option
-Bsymbolic. That turns the above into a plain function call

000000000000e250 <hstore_subscript_assign>:
{
...
e2f7: e8 f4 a2 ff ff call 85f0 <hstorePairs>

As it turns out we already use -Bsymbolic on some platforms (solaris,
hpux). But not elsehwere.

3) Function calls from extension library to main binary
4) C library function calls

However, even with the above done, calls into shared libraries still
go through the PLT. This is particularly annoying for functions like palloc()
that are quite performance sensitive and where there's no potential use of
intercepting the function call with a different shared library.

E.g. the optimized disassembly add_dummy_return() looks like

000000000000bc30 <add_dummy_return>:
{
...
new = palloc0(sizeof(PLpgSQL_stmt_block));
bc4d: bf 38 00 00 00 mov $0x38,%edi
bc52: e8 d9 a7 ff ff call 6430 <palloc0@plt>
...
0000000000006430 <palloc0@plt>:
6430: ff 25 d2 bb 02 00 jmp *0x2bbd2(%rip) # 32008 <palloc0>
6436: 68 01 00 00 00 push $0x1
643b: e9 d0 ff ff ff jmp 6410 <_init+0x20>

Obviously we cannot easily avoid indirection entirely in this case. The offset
to call palloc0 is not known when plpgsql.so is built. But we don't actually
need a two-level indirection.

By compiling with -fno-plt, the above becomes:

000000000000b130 <add_dummy_return>:
{
...
new = palloc0(sizeof(PLpgSQL_stmt_block));
b14d: bf 38 00 00 00 mov $0x38,%edi
b152: ff 15 80 66 02 00 call *0x26680(%rip) # 317d8 <palloc0>

I.e. a single level of indirection. This has more benefits than just removing
one layer of indirection. Here's what gcc's man page says:

-fno-plt
Do not use the PLT for external function calls in position-independent code. Instead, load the callee address
at call sites from the GOT and branch to it. This leads to more efficient code by eliminating PLT stubs and
exposing GOT loads to optimizations.

In some cases this allows functions to use the sibling-call optimization where
that previously was not possible (i.e. for x86 use "jmp" instead of "call" to
call another function when that function call is the last thing done in a
function, thereby reusing the call frame and reducing the cost of returns).

This doesn't just matter for extension libraries. It's also relevant for the
main binary (i.e. the upsides are bigger / more widely applicable) - every
function call to libc goes through PLT+GOT (well, with a dynamically linked
libc). This includes things that are often called in performance critical
bits, like strlen. E.g. without -fno-plt raw_parser() calls strlen via the
plt:

cur_token_length = strlen(yyextra->core_yy_extra.scanbuf + *llocp);
2775a6: 49 63 55 00 movslq 0x0(%r13),%rdx
2775aa: 4c 8b 3b mov (%rbx),%r15
2775ad: 48 89 4d c0 mov %rcx,-0x40(%rbp)
2775b1: 49 8d 3c 17 lea (%r15,%rdx,1),%rdi
2775b5: 48 89 55 c8 mov %rdx,-0x38(%rbp)
2775b9: e8 82 03 e5 ff call c7940 <strlen@plt>

but not with -fno-plt:
cur_token_length = strlen(yyextra->core_yy_extra.scanbuf + *llocp);
2838e6: 49 63 55 00 movslq 0x0(%r13),%rdx
2838ea: 4c 8b 3b mov (%rbx),%r15
2838ed: 48 89 4d c0 mov %rcx,-0x40(%rbp)
2838f1: 49 8d 3c 17 lea (%r15,%rdx,1),%rdi
2838f5: 48 89 55 c8 mov %rdx,-0x38(%rbp)
2838f9: ff 15 09 45 66 00 call *0x664509(%rip) # 8e7e08 <strlen@GLIBC_2.2.5>

I haven't run detailed benchmarks in isolation, but have seen some good
results. It obviously is heavily workload dependent.

Greetings,

Andres Freund

[1]: /messages/by-id/20211101020311.av6hphdl6xbjbuif@alap3.anarazel.de

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#1)
Re: Reduce function call costs on ELF platforms

Andres Freund <andres@anarazel.de> writes:

Basically they way we currently build our extensions, the compiler & linker
assume every symbol inside the extension libraries needs to be interceptable
by the main binary. Which means that all function calls to symbols visible
outside the current translation unit need to be made indirectly via the PLT.

Yeah, that would be nice to improve.

The easier approach for this class of issues is to use the linker option
-Bsymbolic.

I don't recall details, but we've previously rejected the idea of
trying to use -Bsymbolic widely; apparently it has undesirable
side-effects on some platforms. See commit message for e3d77ea6b
(hopefully there's some detail in the email thread [1]/messages/by-id/153626613985.23143.4743626885618266803@wrigleys.postgresql.org). It sounds
like you're not actually proposing that, but I thought it would be
a good idea to note the hazard here.

By compiling with -fno-plt, the above becomes:

Does -fno-plt amount to an ABI change? If so, I'm worried that it'd
break the ability to compile extensions with a different compiler.

Also, we have at least some places where there actually are cross-calls
between extensions, eg hstore_perl -> plperl. Do we need to worry about
breaking those?

regards, tom lane

[1]: /messages/by-id/153626613985.23143.4743626885618266803@wrigleys.postgresql.org

#3Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#2)
Re: Reduce function call costs on ELF platforms

Hi,

On 2021-11-22 17:32:21 -0500, Tom Lane wrote:

The easier approach for this class of issues is to use the linker option
-Bsymbolic.

I don't recall details, but we've previously rejected the idea of
trying to use -Bsymbolic widely; apparently it has undesirable
side-effects on some platforms. See commit message for e3d77ea6b
(hopefully there's some detail in the email thread [1]).

Hm. I didn't really see reasons to not use -Bsymbolic in that thread.

It sounds like you're not actually proposing that, but I thought it would be
a good idea to note the hazard here.

I think it'd be good to use, but it'll be much less important once we use
symbol visibility.

By compiling with -fno-plt, the above becomes:

Does -fno-plt amount to an ABI change? If so, I'm worried that it'd
break the ability to compile extensions with a different compiler.

No, it is a change at the function callsite thats transparent to the function
itself (unless the called function goes out of its way to do stuff like
inspecting frame pointers or such, but they're not available by default on
most platforms anyway).

It does however change symbol binding, basically making all symbols bound
eagerly. Which I guess theoretically could be considered an ABI change,
because it removes the ability to intercept symbols referenced in a previously
loaded shared library, with a subsequently loaded library (e.g. loaded with
RTLD_DEEPBIND) function before the symbol is used. But that seems like a
stretch. And I think most ELF platforms/linux distributions have/are moving
towards using -Wl,-z,now -Wl,-z,relro also makes symbols bound eagerly.

Also, we have at least some places where there actually are cross-calls
between extensions, eg hstore_perl -> plperl. Do we need to worry about
breaking those?

It does require a bit of care in the symbol visibility patch, basically
marking all such symbols as exported (which may happen implicitly via
PG_FUNCTION_INFO_V1). For extensions that are referenced that way that
actually seems like a good thing, because it makes it clearer what could be
referenced that way.

Greetings,

Andres Freund

#4Bruce Momjian
bruce@momjian.us
In reply to: Andres Freund (#3)
Re: Reduce function call costs on ELF platforms

On Mon, Nov 22, 2021 at 03:57:45PM -0800, Andres Freund wrote:

It does however change symbol binding, basically making all symbols bound
eagerly. Which I guess theoretically could be considered an ABI change,
because it removes the ability to intercept symbols referenced in a previously
loaded shared library, with a subsequently loaded library (e.g. loaded with
RTLD_DEEPBIND) function before the symbol is used. But that seems like a
stretch. And I think most ELF platforms/linux distributions have/are moving
towards using -Wl,-z,now -Wl,-z,relro also makes symbols bound eagerly.

I found this really interesting, and I am surprised how things got so
suboptimal. Has it always been this way? Is it the use of C++ that is
causing this by default?

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

If only the physical world exists, free will is an illusion.

#5Andres Freund
andres@anarazel.de
In reply to: Bruce Momjian (#4)
Re: Reduce function call costs on ELF platforms

Hi,

On 2021-11-22 20:13:28 -0500, Bruce Momjian wrote:

On Mon, Nov 22, 2021 at 03:57:45PM -0800, Andres Freund wrote:

It does however change symbol binding, basically making all symbols bound
eagerly. Which I guess theoretically could be considered an ABI change,
because it removes the ability to intercept symbols referenced in a previously
loaded shared library, with a subsequently loaded library (e.g. loaded with
RTLD_DEEPBIND) function before the symbol is used. But that seems like a
stretch. And I think most ELF platforms/linux distributions have/are moving
towards using -Wl,-z,now -Wl,-z,relro also makes symbols bound eagerly.

I found this really interesting, and I am surprised how things got so
suboptimal.

It's always been this way on ELF afaict (*).

I don't think anybody would design ELF the same way today. I think it's pretty
clear that defaulting to making symbols interceptable was a bad idea. But it's
where we are...

Has it always been this way? Is it the use of C++ that is causing this by
default?

Nope, this is with plain C.

Greetings,

Andres Freund

(*) I guess you can argue it got a tad worse with the increasing use of
position independent executables, but that's a relatively small difference in
the scope of the issue.

#6Peter Eisentraut
peter.eisentraut@enterprisedb.com
In reply to: Tom Lane (#2)
Re: Reduce function call costs on ELF platforms

On 22.11.21 23:32, Tom Lane wrote:

The easier approach for this class of issues is to use the linker option
-Bsymbolic.

I don't recall details, but we've previously rejected the idea of
trying to use -Bsymbolic widely; apparently it has undesirable
side-effects on some platforms. See commit message for e3d77ea6b
(hopefully there's some detail in the email thread [1]). It sounds
like you're not actually proposing that, but I thought it would be
a good idea to note the hazard here.

Also, IIRC, -Bsymbolic was once frowned upon by packaging policies,
since it prevents use of LD_PRELOAD. I'm not sure what the current
thinking there is, however.

#7Andres Freund
andres@anarazel.de
In reply to: Peter Eisentraut (#6)
Re: Reduce function call costs on ELF platforms

Hi,

On 2021-11-23 17:28:08 +0100, Peter Eisentraut wrote:

On 22.11.21 23:32, Tom Lane wrote:

The easier approach for this class of issues is to use the linker option
-Bsymbolic.

I don't recall details, but we've previously rejected the idea of
trying to use -Bsymbolic widely; apparently it has undesirable
side-effects on some platforms. See commit message for e3d77ea6b
(hopefully there's some detail in the email thread [1]). It sounds
like you're not actually proposing that, but I thought it would be
a good idea to note the hazard here.

Also, IIRC, -Bsymbolic was once frowned upon by packaging policies, since it
prevents use of LD_PRELOAD. I'm not sure what the current thinking there
is, however.

It doesn't break some (most?) of the uses of LD_PRELOAD. In particular, it
doesn't break things like replacing the malloc implementation. When do you
have a symbol that you want to override *inside* your library (executables
already bind to their own symbols at compile time)? I've seen that for
replacing buggy functions in closed source things, but that's about it?

Greetings,

Andres Freund

#8Andrew Dunstan
andrew@dunslane.net
In reply to: Andres Freund (#7)
Re: Reduce function call costs on ELF platforms

On 11/24/21 13:55, Andres Freund wrote:

Hi,

On 2021-11-23 17:28:08 +0100, Peter Eisentraut wrote:

On 22.11.21 23:32, Tom Lane wrote:

The easier approach for this class of issues is to use the linker option
-Bsymbolic.

I don't recall details, but we've previously rejected the idea of
trying to use -Bsymbolic widely; apparently it has undesirable
side-effects on some platforms. See commit message for e3d77ea6b
(hopefully there's some detail in the email thread [1]). It sounds
like you're not actually proposing that, but I thought it would be
a good idea to note the hazard here.

Also, IIRC, -Bsymbolic was once frowned upon by packaging policies, since it
prevents use of LD_PRELOAD. I'm not sure what the current thinking there
is, however.

It doesn't break some (most?) of the uses of LD_PRELOAD. In particular, it
doesn't break things like replacing the malloc implementation. When do you
have a symbol that you want to override *inside* your library (executables
already bind to their own symbols at compile time)? I've seen that for
replacing buggy functions in closed source things, but that's about it?

Which things does it break exactly? I have a case where a library that
is LD_PRELOADed calls PQsetSSLKeyPassHook_OpenSSL() in its constructor
function. I'd be very unhappy if that stopped working (and so would our
client).

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

#9Andres Freund
andres@anarazel.de
In reply to: Andrew Dunstan (#8)
Re: Reduce function call costs on ELF platforms

Hi,

On 2021-11-24 17:55:03 -0500, Andrew Dunstan wrote:

On 11/24/21 13:55, Andres Freund wrote:

On 2021-11-23 17:28:08 +0100, Peter Eisentraut wrote:

On 22.11.21 23:32, Tom Lane wrote:

The easier approach for this class of issues is to use the linker option
-Bsymbolic.

I don't recall details, but we've previously rejected the idea of
trying to use -Bsymbolic widely; apparently it has undesirable
side-effects on some platforms. See commit message for e3d77ea6b
(hopefully there's some detail in the email thread [1]). It sounds
like you're not actually proposing that, but I thought it would be
a good idea to note the hazard here.

Also, IIRC, -Bsymbolic was once frowned upon by packaging policies, since it
prevents use of LD_PRELOAD. I'm not sure what the current thinking there
is, however.

It doesn't break some (most?) of the uses of LD_PRELOAD. In particular, it
doesn't break things like replacing the malloc implementation. When do you
have a symbol that you want to override *inside* your library (executables
already bind to their own symbols at compile time)? I've seen that for
replacing buggy functions in closed source things, but that's about it?

Which things does it break exactly?

-Bsymbolic causes symbols that are defined and referenced within one shared
library to use that definition. E.g. if a shared lib has a function
"do_something()" and some of its code calls do_something(), you cannot use
LD_PRELOAD (or a definition in the main binary) to redirect the call to
do_something() inside the shared library to something else.

I.e. if a shared library calls a function that's *not* defined within that
shared library, -Bsymbolic doesn't have an effect for that symbol.

I have a case where a library that
is LD_PRELOADed calls PQsetSSLKeyPassHook_OpenSSL() in its constructor
function. I'd be very unhappy if that stopped working (and so would our
client).

Bsymbolic shouldn't affect that at all.

Greetings,

Andres Freund

#10Andrew Dunstan
andrew@dunslane.net
In reply to: Andres Freund (#9)
Re: Reduce function call costs on ELF platforms

On 11/24/21 22:57, Andres Freund wrote:

Which things does it break exactly?

-Bsymbolic causes symbols that are defined and referenced within one shared
library to use that definition. E.g. if a shared lib has a function
"do_something()" and some of its code calls do_something(), you cannot use
LD_PRELOAD (or a definition in the main binary) to redirect the call to
do_something() inside the shared library to something else.

I.e. if a shared library calls a function that's *not* defined within that
shared library, -Bsymbolic doesn't have an effect for that symbol.

I have a case where a library that
is LD_PRELOADed calls PQsetSSLKeyPassHook_OpenSSL() in its constructor
function. I'd be very unhappy if that stopped working (and so would our
client).

Bsymbolic shouldn't affect that at all.

Thanks for the explanation.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com