inline newNode()
I've been taking a look at improving the performance of the GEQO
code. Before looking at algorithmic improvements, I noticed some
"low-hanging fruit": gprof revealed that newNode() was a hotspot. When
I altered it to be an inline function, the overall time to plan a
12-table table join (using the default GEQO settings) dropped by about
9%. I haven't taken a look at how the patch effects the other places
that use newNode(), but it stands to reason that they'd see a
performance improvement as well (although probably less noticeable).
However, I'm not sure if I used the correct syntax for inlining the
function (since it was originally declared in a header and defined
elsewhere, I couldn't just add 'inline'). The method I used (declaring
the function 'extern inline' and defining it in the header file) works
for me with GCC 3.2, but I'm open to suggestions for improvement.
BTW, after applying the patch, the GEQO profile looks like:
% cumulative self self total
time seconds seconds calls s/call s/call name
16.07 0.18 0.18 4618735 0.00 0.00 compare_path_costs
9.82 0.29 0.11 2149666 0.00 0.00 AllocSetAlloc
8.04 0.38 0.09 396333 0.00 0.00 add_path
4.46 0.43 0.05 2149661 0.00 0.00 MemoryContextAlloc
3.57 0.47 0.04 1150953 0.00 0.00 compare_pathkeys
(Yes, gprof on my machine is still a bit fubared...)
Cheers,
Neil
--
Neil Conway <neilc@samurai.com> || PGP Key ID: DB3C29FC
Attachments:
geqo-3.patchtext/x-patchDownload+17-49
Neil Conway <neilc@samurai.com> writes:
I've been taking a look at improving the performance of the GEQO
code. Before looking at algorithmic improvements, I noticed some
"low-hanging fruit": gprof revealed that newNode() was a hotspot. When
I altered it to be an inline function, the overall time to plan a
12-table table join (using the default GEQO settings) dropped by about
9%.
How much did you bloat the code? There are an awful lot of calls to
newNode(), so even though it's not all that large, I'd think the
multiplier would be nasty.
It might be better to offer two versions, standard out-of-line and a
macro-or-inline called something like newNodeInline(), and change just
the hottest call sites to use the inline. You could probably get most
of the speedup from inlining at just a few call sites. (I've been
intending to do something similar with the TransactionId comparison
functions.)
However, I'm not sure if I used the correct syntax for inlining the
function (since it was originally declared in a header and defined
elsewhere, I couldn't just add 'inline'). The method I used (declaring
the function 'extern inline' and defining it in the header file) works
for me with GCC 3.2, but I'm open to suggestions for improvement.
This isn't portable at all, AFAIK :-(. Unfortunately I can't think
of a portable way to do it with a macro, either.
However, if you're willing to go the route involving changing call
sites, then you could change
foo = newNode(typename);
to
newNodeMacro(foo, typename);
whereupon it becomes pretty easy to make a suitable macro.
regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes:
How much did you bloat the code? There are an awful lot of calls to
newNode(), so even though it's not all that large, I'd think the
multiplier would be nasty.
The patch increases the executable from 12844452 to 13005244 bytes,
when compiled with '-pg -g -O2' and without being stripped.
This isn't portable at all, AFAIK :-(. Unfortunately I can't think
of a portable way to do it with a macro, either.
Well, one alternative might be to provide 2 definitions of the
function -- one an extern inline in the header file, and one using the
current method (in a separate file, non-inline). Then wrap the header
file in an #ifdef __GNUC__ block, and the non-inlined version in
#ifndef __GNUC__. The downside is that it means maintaining two
versions of the same function -- but given that newNode() is pretty
trivial, that might be acceptable.
BTW, the GCC docs on inline functions are here:
http://gcc.gnu.org/onlinedocs/gcc-3.2/gcc/Inline.html#Inline
According to that page, using 'static inline' instead of 'extern
inline' is recommended for future compatability with C99, so that's
what we should probably use (in the __GNUC__ version).
Cheers,
Neil
--
Neil Conway <neilc@samurai.com> || PGP Key ID: DB3C29FC
Neil Conway <neilc@samurai.com> writes:
Tom Lane <tgl@sss.pgh.pa.us> writes:
How much did you bloat the code? There are an awful lot of calls to
newNode(), so even though it's not all that large, I'd think the
multiplier would be nasty.
The patch increases the executable from 12844452 to 13005244 bytes,
when compiled with '-pg -g -O2' and without being stripped.
Okay, not as bad as I feared, but still kinda high.
I believe that most of the bloat comes from the MemSet macro; there's
just not much else in newNode(). Now, the reason MemSet expands to
a fair amount of code is its if-then-else case to decide whether to
call memset() or do an inline loop. I've looked at the assembler code
for it on a couple of machines, and the loop proper is only about a
third of the code that gets generated.
Ideally, we'd like to eliminate the if-test for inlined newNode calls.
That would buy back a lot of the bloat and speed things up still
further.
Now the tests on _val == 0 and _len <= MEMSET_LOOP_LIMIT and _len being
a multiple of 4 are no problem, since _val and _len are compile-time
constants; these will be optimized away. What is not optimized away
(on the compilers I've looked at) is the check for _start being
int-aligned.
A brute-force approach is to say "we know _start is word-aligned because
we just got it from palloc, which guarantees MAXALIGNment". We could
make a variant version of MemSet that omits the alignment check, and use
it here and anywhere else we're sure it's safe.
A nicer approach would be to somehow make use of the datatype of the
first argument to MemSet. If we could determine at compile time that
it's supposed to point at a type with at least int alignment, then
it'd be possible for the compiler to optimize away this check in a
reasonably safe fashion. I'm not sure if there's a portable way to
do this, though. There's no "alignof()" construct in C :-(.
Any ideas?
regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes:
[ snip interesting analysis ]
A nicer approach would be to somehow make use of the datatype of the
first argument to MemSet. If we could determine at compile time
that it's supposed to point at a type with at least int alignment,
then it'd be possible for the compiler to optimize away this check
in a reasonably safe fashion. I'm not sure if there's a portable
way to do this, though. There's no "alignof()" construct in C
:-(. Any ideas?
Well, we could make use of (yet another) GCC-ism: the __alignof__
keyword, which is described here:
http://gcc.gnu.org/onlinedocs/gcc-3.2/gcc/Alignment.html#Alignment
I don't like making the code GCC-specific any more than anyone else
does, but given that the code-bloat is specific to the inline version
of newNode (which in the scheme I described earlier would be
GCC-only) -- so introducing a GCC-specific fix for a GCC-specific
problem isn't too bad, IMHO.
Or we could just use your other suggestion: define a variant of
MemSet() and use it when we know it's safe. Not sure which is the
better solution: any comments?
Cheers,
Neil
--
Neil Conway <neilc@samurai.com> || PGP Key ID: DB3C29FC
Neil Conway <neilc@samurai.com> writes:
I don't like making the code GCC-specific any more than anyone else
does, but given that the code-bloat is specific to the inline version
of newNode (which in the scheme I described earlier would be
GCC-only) -- so introducing a GCC-specific fix for a GCC-specific
problem isn't too bad, IMHO.
Or we could just use your other suggestion: define a variant of
MemSet() and use it when we know it's safe. Not sure which is the
better solution: any comments?
If we're going with a GCC-only approach to inlining newNode then it
seems like a tossup to me too. Any other thoughts out there?
regards, tom lane
Tom Lane wrote:
Neil Conway <neilc@samurai.com> writes:
I don't like making the code GCC-specific any more than anyone else
does, but given that the code-bloat is specific to the inline version
of newNode (which in the scheme I described earlier would be
GCC-only) -- so introducing a GCC-specific fix for a GCC-specific
problem isn't too bad, IMHO.Or we could just use your other suggestion: define a variant of
MemSet() and use it when we know it's safe. Not sure which is the
better solution: any comments?If we're going with a GCC-only approach to inlining newNode then it
seems like a tossup to me too. Any other thoughts out there?
Seems newNode can easily be made into a macro. I can do the coding, and
if you tell me that newNode will always be int-aligned, I can make an
assume-aligned version of MemSet.
Is that what people want to try?
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073
Bruce Momjian <pgman@candle.pha.pa.us> writes:
Seems newNode can easily be made into a macro.
If you can see how to do that, show us ... that would eliminate the
need for a GCC-only inline function, which would sway my vote away
from depending on __alignof__ for the other thing.
regards, tom lane
Tom Lane writes:
A brute-force approach is to say "we know _start is word-aligned because
we just got it from palloc, which guarantees MAXALIGNment". We could
make a variant version of MemSet that omits the alignment check, and use
it here and anywhere else we're sure it's safe.
Or make a version of palloc that zeroes the memory.
--
Peter Eisentraut peter_e@gmx.net
Neil Conway writes:
Well, one alternative might be to provide 2 definitions of the
function -- one an extern inline in the header file, and one using the
current method (in a separate file, non-inline). Then wrap the header
file in an #ifdef __GNUC__ block, and the non-inlined version in
#ifndef __GNUC__.
External inline functions aren't even portable across different versions
of GCC. It's best not to go there at all.
--
Peter Eisentraut peter_e@gmx.net
Peter Eisentraut <peter_e@gmx.net> writes:
Or make a version of palloc that zeroes the memory.
Well, we might lose a bit of speed that way -- with the current
MemSet() macro, most of the tests that MemSet() does can be eliminated
at compile-time, since MemSet() is usually called with some constant
parameters. If we moved this code inside palloc(), it wouldn't be
possible to do this constant propagation.
Not sure it would make a big performance difference, though -- and I
think palloc() is a clean solution. In fact, it's probably worth
having anyway: a lot of the call sites of palloc() immediately zero
the memory after they allocate it...
Cheers,
Neil
--
Neil Conway <neilc@samurai.com> || PGP Key ID: DB3C29FC
Peter Eisentraut wrote:
Tom Lane writes:
A brute-force approach is to say "we know _start is word-aligned because
we just got it from palloc, which guarantees MAXALIGNment". We could
make a variant version of MemSet that omits the alignment check, and use
it here and anywhere else we're sure it's safe.Or make a version of palloc that zeroes the memory.
OK, here is a version of newNode that is a macro. However, there are
problems:
First, I can't get the code to assign the tag (dereference the pointer)
_and_ return the pointer, in a macro. I can convert it to take a third
pointer argument, and it is only called <10 times, so that is easy, but
it is used by makeNode, which is called ~1000 times, so that seems like
a bad idea. My solution was to declare a global variable in the header
file to be used by the macro. Because the macro isn't called
recursively, that should be fine.
Now, the other problem is MemSet. MemSet isn't designed to be called
inside a macro. In fact, the 'if' tests are easy to do in a macro with
"? :", but the "while" loop seems to be impossible in a macro. I tried
seeing if 'goto' would work in a macro, but of course it doesn't, and
even if it did, I doubt it would be portable. The patch merely calls
memset() rather than MemSet.
Not sure if this is a win or not. It makes makeNode a macro, but
removes the use of the MemSet macro in favor of memset().
Does anyone have additional suggestions? The only thing I can suggest
is to make a clear-memory version of palloc because palloc always calls
MemoryContextAlloc() so I can put it in there. How does that sound?
However, now that I look at it, MemoryContextAlloc() could be made a
macro easier than newNode() and that would save function call in more
cases than newNode. In fact, the comment at the top of
MemoryContextAlloc() says:
* This could be turned into a macro, but we'd have to import
* nodes/memnodes.h into postgres.h which seems a bad idea.
Ideas?
The regression tests do pass with this patch, so functionally it works
fine.
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073
Attachments:
/bjm/difftext/plainDownload+24-17
Bruce Momjian <pgman@candle.pha.pa.us> writes:
OK, here is a version of newNode that is a macro.
If you use memset() instead of MemSet(), I'm afraid you're going to blow
off most of the performance gain this was supposed to achieve.
Does anyone have additional suggestions? The only thing I can suggest
is to make a clear-memory version of palloc because palloc always calls
MemoryContextAlloc() so I can put it in there. How does that sound?
I do not think palloc should auto-zero memory. Hard to explain why,
but it just feels like a bad decision. One point is that the MemSet
has to be inlined or it cannot compile-out the tests on _len. palloc
can't treat the length as a compile-time constant.
The regression tests do pass with this patch, so functionally it works
fine.
Speed is the issue here, not functionality...
regards, tom lane
Tom Lane wrote:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
OK, here is a version of newNode that is a macro.
If you use memset() instead of MemSet(), I'm afraid you're going to blow
off most of the performance gain this was supposed to achieve.
Yep.
Does anyone have additional suggestions? The only thing I can suggest
is to make a clear-memory version of palloc because palloc always calls
MemoryContextAlloc() so I can put it in there. How does that sound?I do not think palloc should auto-zero memory. Hard to explain why,
but it just feels like a bad decision. One point is that the MemSet
has to be inlined or it cannot compile-out the tests on _len. palloc
can't treat the length as a compile-time constant.
Right, palloc shouldn't. I was thinking of having another version of
palloc that _does_ clear out memory, and calling that from a newNode()
macro. We already know palloc is going to call MemoryContextAlloc, so
we could have a pallocC() that calls a new MemoryContextAllocC() that
would call the underlying memory allocation function, then do the loop
like MemSet to clear it.
It would allow newNode to be a macro and prevent the bloat of having
MemSet appear everywhere newNode appears; it would all be in the new
function MemoryContextAllocC().
The regression tests do pass with this patch, so functionally it works
fine.Speed is the issue here, not functionality...
I was just proving it works, that's all.
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073
Bruce Momjian <pgman@candle.pha.pa.us> writes:
Right, palloc shouldn't. I was thinking of having another version of
palloc that _does_ clear out memory, and calling that from a newNode()
macro. We already know palloc is going to call MemoryContextAlloc, so
we could have a pallocC() that calls a new MemoryContextAllocC() that
would call the underlying memory allocation function, then do the loop
like MemSet to clear it.
But if the MemSet is inside the called function then it cannot reduce
the if-tests to a compile-time decision to invoke the word-zeroing loop.
We want the MemSet to be expanded at the newNode call site, where the
size of the allocated memory is a compile-time constant.
regards, tom lane
Tom Lane wrote:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
Right, palloc shouldn't. I was thinking of having another version of
palloc that _does_ clear out memory, and calling that from a newNode()
macro. We already know palloc is going to call MemoryContextAlloc, so
we could have a pallocC() that calls a new MemoryContextAllocC() that
would call the underlying memory allocation function, then do the loop
like MemSet to clear it.But if the MemSet is inside the called function then it cannot reduce
the if-tests to a compile-time decision to invoke the word-zeroing loop.
We want the MemSet to be expanded at the newNode call site, where the
size of the allocated memory is a compile-time constant.
I can easily do the tests in the MemSet macro, but I can't do a loop in
a macro that has to return a value; I need while(). Though a loop in a
new fuction will not be as fast as a MemSet macro, I think it will be
better than what we have now with newNode only because newNode will be a
macro and not a function anymore, i.e. the MemSet will happen in the
function called by pallocC and not in newNode anymore, and there will be
zero code bloat. I wish I saw another way.
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073
Bruce Momjian wrote:
Tom Lane wrote:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
Right, palloc shouldn't. I was thinking of having another version of
palloc that _does_ clear out memory, and calling that from a newNode()
macro. We already know palloc is going to call MemoryContextAlloc, so
we could have a pallocC() that calls a new MemoryContextAllocC() that
would call the underlying memory allocation function, then do the loop
like MemSet to clear it.But if the MemSet is inside the called function then it cannot reduce
the if-tests to a compile-time decision to invoke the word-zeroing loop.
We want the MemSet to be expanded at the newNode call site, where the
size of the allocated memory is a compile-time constant.I can easily do the tests in the MemSet macro, but I can't do a loop in
a macro that has to return a value; I need while(). Though a loop in a
new fuction will not be as fast as a MemSet macro, I think it will be
better than what we have now with newNode only because newNode will be a
macro and not a function anymore, i.e. the MemSet will happen in the
function called by pallocC and not in newNode anymore, and there will be
zero code bloat. I wish I saw another way.
OK, here's a patch for testing. It needs cleanup because the final
version would remove the nodes/nodes.c file. The net effect of the
patch is to make newNode a macro with little code bloat.
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073
Attachments:
/pgpatches/macrotext/plainDownload+49-17
Bruce Momjian <pgman@candle.pha.pa.us> writes:
... I wish I saw another way.
I liked Neil's proposal better.
regards, tom lane
Tom Lane wrote:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
... I wish I saw another way.
I liked Neil's proposal better.
What is it you liked, specifically?
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073
Tom Lane wrote:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
... I wish I saw another way.
I liked Neil's proposal better.
I think Neil is going to test my patch. I think he will find it yields
identical performance to his patch without the bloat, and it will work
on all platforms. I don't think the MemSet() constant calls are that
big a performance win.
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073