User Defined Functions/AM's inherently slow?

Started by Eric Ridge · almost 22 years ago · 13 messages
#1Eric Ridge
ebr@tcdi.com

I've created a stub AM that literally does nothing. It indexes
nothing. It scans for nothing. Nada. It does just enough work to
specify return values that prevent PG from dumping core.

What I've found is that this stub AM, compiled with the same options as
postgres itself (-O2 -fno-strict-aliasing), is roughly 4 times slower
than the built in btree AM that actually does something useful!

The test table contains 1 column, and 1 row:

My stub AM:
explain analyze select * from test where a ==> '1';
Total runtime: 0.254 ms

builtin btree AM:
explain analyze select * from test where a = '1';
Total runtime: 0.058 ms

(I ran each one a number of times, with basically the same results).

What gives? *scratches head* I know btrees are efficient, but geez,
can they really be more efficient than O(zero)? :) Looking at the
backtrace from the beginscan function of each AM, PG doesn't appear to
do anything different for user-provided AM's.

My platform is OS X 10.3.2, using PG 7.4, gcc version 3.3 (Apple's
build #1495).

Any insight would be greatly appreciated.

Thanks!

eric

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Eric Ridge (#1)
Re: User Defined Functions/AM's inherently slow?

Eric Ridge <ebr@tcdi.com> writes:

I've created a stub AM that literally does nothing.

It's not possible for an index AM to "do nothing", at least not for an
indexscan. It has to return tuple pointers. What are you doing for
that?

regards, tom lane

#3Eric Ridge
ebr@tcdi.com
In reply to: Tom Lane (#2)
Re: User Defined Functions/AM's inherently slow?

On Jan 17, 2004, at 10:22 PM, Tom Lane wrote:

Eric Ridge <ebr@tcdi.com> writes:

I've created a stub AM that literally does nothing.

It's not possible for an index AM to "do nothing", at least not for an
indexscan. It has to return tuple pointers. What are you doing for
that?

costestimate: {
PG_RETURN_VOID();
}

beginscan: {
Relation index = (Relation) PG_GETARG_POINTER(0);
int keysz = PG_GETARG_INT32(1);
ScanKey scankey = (ScanKey) PG_GETARG_POINTER(2);
IndexScanDesc scan = RelationGetIndexScan(index, keysz, scankey);

PG_RETURN_POINTER(scan);
}

rescan: {
PG_RETURN_VOID();
}

gettuple: {
PG_RETURN_BOOL(false);
}

endscan: {
PG_RETURN_VOID();
}

I think the above is about as close to "nothing" as one can get.

In trying to track down some performance issues with my real AM, I
decided to make this stub AM just to see what the overhead is... then I
started seeing these crazy results.

eric

#4Eric Ridge
ebr@tcdi.com
In reply to: Tom Lane (#2)
Re: User Defined Functions/AM's inherently slow?

On Jan 17, 2004, at 10:22 PM, Tom Lane wrote:

Eric Ridge <ebr@tcdi.com> writes:

I've created a stub AM that literally does nothing.

It's not possible for an index AM to "do nothing", at least not for an
indexscan. It has to return tuple pointers. What are you doing for
that?

I should have included the entire explain output:

stub AM:
Index Scan using idxa_stub on test2 (cost=0.00..2.68 rows=1 width=5)
(actual time=0.002..0.002 rows=0 loops=1)
Index Cond: (a ==> '1'::text)
Total runtime: 0.247 ms

builtin btree AM:
Index Scan using idxa_btree on test2 (cost=0.00..4.68 rows=1 width=5)
(actual time=0.024..0.026 rows=1 loops=1)
Index Cond: (a = '1'::text)
Total runtime: 0.060 ms

If the "actual time" numbers are really a measure of the amount of time
spent in (at least) the index, it seems the stub should report a
smaller "total runtime", but alas, it doesn't.

eric

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Eric Ridge (#3)
Re: User Defined Functions/AM's inherently slow?

Eric Ridge <ebr@tcdi.com> writes:

costestimate: {
PG_RETURN_VOID();
}

This at least needs to set some values into the output parameters ---
zeroes are okay, not setting them at all isn't. I'm surprised the
planner doesn't go nuts. It looks from your EXPLAIN results like
the values are coming up zero anyway, but seeing that cost_index()
doesn't do anything to initialize those local variables, I'd expect
fairly unpredictable behavior.

The other stubs look fine on quick review.

regards, tom lane

#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Eric Ridge (#4)
Re: User Defined Functions/AM's inherently slow?

Eric Ridge <ebr@tcdi.com> writes:

I should have included the entire explain output:

stub AM:
Index Scan using idxa_stub on test2 (cost=0.00..2.68 rows=1 width=5)
(actual time=0.002..0.002 rows=0 loops=1)
Index Cond: (a ==> '1'::text)
Total runtime: 0.247 ms

builtin btree AM:
Index Scan using idxa_btree on test2 (cost=0.00..4.68 rows=1 width=5)
(actual time=0.024..0.026 rows=1 loops=1)
Index Cond: (a = '1'::text)
Total runtime: 0.060 ms

If the "actual time" numbers are really a measure of the amount of time
spent in (at least) the index, it seems the stub should report a
smaller "total runtime", but alas, it doesn't.

The difference between "total runtime" and the top plan node's runtime
has to represent plan startup/shutdown time. I'm suspicious that your
stubs are somehow not initializing something, though on first glance I
do not see what.

Theory B would be that there's some huge overhead in calling non-built-in
functions on your platform. We do know that looking up a "C" function is
significantly more expensive than looking up a "builtin" function, but
there should only be half a dozen such calls involved in this test case;
it's hard to credit that that takes 200 µsec. Does the time drop at all
on second and subsequent repetitions in a single backend run?

regards, tom lane

#7Eric Ridge
ebr@tcdi.com
In reply to: Tom Lane (#5)
Re: User Defined Functions/AM's inherently slow?

On Jan 17, 2004, at 11:27 PM, Tom Lane wrote:

Eric Ridge <ebr@tcdi.com> writes:

costestimate: {
PG_RETURN_VOID();
}

This at least needs to set some values into the output parameters ---
zeroes are okay, not setting them at all isn't. I'm surprised the
planner doesn't go nuts. It looks from your EXPLAIN results like
the values are coming up zero anyway, but seeing that cost_index()
doesn't do anything to initialize those local variables, I'd expect
fairly unpredictable behavior.

I have tried setting them all to zero, and even using the
backend/utils/adt/selfuncs.c:genericcostestimate() code (it's not
exported, it seems), but no matter what sane (or crazy!) numbers I
provide for the cost estimate, the "fact" remains, my AM, which does
"nothing" is slower than the builtin btree AM.

Could this just be related to the fact that my AM is in a .so, and
there's just some operating system/C runtime overhead in calling
functions in dynamically loaded libraries?

eric

#8Eric Ridge
ebr@tcdi.com
In reply to: Tom Lane (#6)
Re: User Defined Functions/AM's inherently slow?

On Jan 17, 2004, at 11:33 PM, Tom Lane wrote:

The difference between "total runtime" and the top plan node's runtime
has to represent plan startup/shutdown time. I'm suspicious that your
stubs are somehow not initializing something, though on first glance I
do not see what.

I can't see anything either... which is why I brought it up. I'm still
a noob with this stuff, and thought maybe I was just missing something.

Theory B would be that there's some huge overhead in calling non-built-in
functions on your platform. We do know that looking up a "C" function is
significantly more expensive than looking up a "builtin" function, but
there should only be half a dozen such calls involved in this test case;
it's hard to credit that that takes 200 µsec. Does the time drop at all
on second and subsequent repetitions in a single backend run?

Yes, it drops from about .680ms to the .250ish that I posted.

I suppose I should try compiling this little stub into postgres, eh?

eric

#9Christopher Kings-Lynne
chriskl@familyhealth.com.au
In reply to: Eric Ridge (#8)
Re: User Defined Functions/AM's inherently slow?

Theory B would be that there's some huge overhead in calling non-built-in
functions on your platform. We do know that looking up a "C" function is
significantly more expensive than looking up a "builtin" function, but
there should only be half a dozen such calls involved in this test case;
it's hard to credit that that takes 200 µsec. Does the time drop at all
on second and subsequent repetitions in a single backend run?

Yes, it drops from about .680ms to the .250ish that I posted.

I suppose I should try compiling this little stub into postgres, eh?

What if you try the new preload_libraries (or whatever it's called)
config variable in postgresql.conf in the 7.4 release?

Chris
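
(The 7.4 setting Chris is thinking of is preload_libraries; it makes the
postmaster load the shared library once at startup instead of on first use
in each backend. A minimal sketch — the library name is hypothetical:

```
# postgresql.conf (7.4): preload the AM's shared library at postmaster
# start so per-backend first-call load cost disappears.
preload_libraries = '$libdir/stub_am'
```

Each backend then inherits the already-loaded library via fork.)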

#10Eric Ridge
ebr@tcdi.com
In reply to: Christopher Kings-Lynne (#9)
Re: User Defined Functions/AM's inherently slow?

On Jan 18, 2004, at 1:34 AM, Christopher Kings-Lynne wrote:

Theory B would be that there's some huge overhead in calling non-built-in
functions on your platform. We do know that looking up a "C" function is
significantly more expensive than looking up a "builtin" function, but
there should only be half a dozen such calls involved in this test case;
it's hard to credit that that takes 200 µsec. Does the time drop at all
on second and subsequent repetitions in a single backend run?

Yes, it drops from about .680ms to the .250ish that I posted.
I suppose I should try compiling this little stub into postgres, eh?

What if you try the new preload_libraries (or whatever it's called)
config variable in postgresql.conf in the 7.4 release?

Yes, that takes care of the initial load time of the library itself
(bringing the initial .680ms back to .250ish for the first run), but
this is not the problem.

eric

#11Tom Lane
tgl@sss.pgh.pa.us
In reply to: Eric Ridge (#8)
Re: User Defined Functions/AM's inherently slow?

Theory B would be that there's some huge overhead in calling
non-built-in functions on your platform.

I've done some profiling and convinced myself that indeed there's pretty
steep overhead involved in fmgr_info() for a "C"-language function.
Much of it isn't platform-dependent either --- as best I can tell,
the lion's share of the time is being eaten in
expand_dynamic_library_name(). In scenarios where a function is called
many times per query, we cache the results of fmgr_info() ... but we do
not do so for operations like ambeginscan that are done just once per
query.

Every other function language uses shortcuts or caching to reduce the
cost of fmgr_info() lookup; external C language is the only one that
hasn't been optimized in this way. I shall see what I can do about that.
ISTM we can have a hash table that maps function OID to function address
using the same sorts of techniques that plpgsql et al use.

regards, tom lane

#12Eric B. Ridge
ebr@tcdi.com
In reply to: Tom Lane (#11)
Re: User Defined Functions/AM's inherently slow?

On Jan 18, 2004, at 7:28 PM, Tom Lane wrote:

Theory B would be that there's some huge overhead in calling
non-built-in functions on your platform.

I've done some profiling and convinced myself that indeed there's pretty
steep overhead involved in fmgr_info() for a "C"-language function.
Much of it isn't platform-dependent either --- as best I can tell,
the lion's share of the time is being eaten in
expand_dynamic_library_name(). In scenarios where a function is called
many times per query, we cache the results of fmgr_info() ... but we do
not do so for operations like ambeginscan that are done just once per
query.

Wow, thanks for spending the time on this. What about for gettuple?
Do calls to it take advantage of the cache? If not, this likely
explains some of my custom am's performance troubles.

Every other function language uses shortcuts or caching to reduce the
cost of fmgr_info() lookup; external C language is the only one that
hasn't been optimized in this way. I shall see what I can do about that.
ISTM we can have a hash table that maps function OID to function address
using the same sorts of techniques that plpgsql et al use.

If there's anything I can do to help, let me know. I'll be happy to
test any patches you might come up with too.

eric

#13Tom Lane
tgl@sss.pgh.pa.us
In reply to: Eric B. Ridge (#12)
Re: User Defined Functions/AM's inherently slow?

"Eric B. Ridge" <ebr@tcdi.com> writes:

Wow, thanks for spending the time on this. What about for gettuple?
Do calls to it take advantage of the cache? If not, this likely
explains some of my custom am's performance troubles.

gettuple is looked up once at the start of a scan, so there's no
per-tuple overhead involved there. As I said before, we're usually
pretty good about avoiding per-tuple overheads --- the cost you
identified here is a per-query overhead.

If there's anything I can do to help, let me know. I'll be happy to
test any patches you might come up with too.

I have committed a patch into CVS HEAD --- give it a try.

regards, tom lane