tweaking MemSet() performance - 7.4.5

Started by Marc Colosimo over 21 years ago, 13 messages
#1Marc Colosimo
mcolosimo@mitre.org

Hi,

I'm using 7.4.5 on Mac OS X (G5) and was profiling it to see why it is
SO SLOW at committing inserts and deletes into a large database. One
of the many slowdowns was from MemSet. I found an old (2002) thread
about this and retried the tests (see below). The main point is that
the system memset crushes pg's!! Is it possible to add a define to call
the system memset at build time? This probably isn't the case on other
systems.

I wanted to know the size of FunctionCallInfoData (in execQual.c)
because the profiler suggested using the system call for anything over
128 bytes.

Here are my results:

pgMemSet
* 64
0.410u 0.000s 0:00.42 97.6% 0+0k 0+0io 0pf+0w
* 128
0.600u 0.000s 0:00.61 98.3% 0+0k 0+0io 0pf+0w
* 176 Size of fcinfo is 176, used in execQual.c which was being very
slow here!
0.790u 0.000s 0:00.79 100.0% 0+0k 0+0io 0pf+0w
* 256
1.040u 0.000s 0:01.08 96.2% 0+0k 0+0io 0pf+0w
* 512
2.030u 0.000s 0:02.04 99.5% 0+0k 0+0io 0pf+0w
* 1024
3.950u 0.010s 0:03.94 100.5% 0+0k 0+0io 0pf+0w
* 2048
7.710u 0.000s 0:07.75 99.4% 0+0k 0+0io 0pf+0w
* 4096
15.390u 0.000s 0:15.37 100.1% 0+0k 0+0io 0pf+0w

system memset
* 64
0.260u 0.000s 0:00.25 104.0% 0+0k 0+0io 0pf+0w
* 128
0.310u 0.000s 0:00.31 100.0% 0+0k 0+0io 0pf+0w
* 176 Size of fcinfo is 176
0.300u 0.010s 0:00.30 103.3% 0+0k 0+0io 0pf+0w
* 256
0.310u 0.000s 0:00.30 103.3% 0+0k 0+0io 0pf+0w
* 512
0.350u 0.000s 0:00.33 106.0% 0+0k 0+0io 0pf+0w
* 1024
0.590u 0.010s 0:00.63 95.2% 0+0k 0+0io 0pf+0w
* 2048
0.780u 0.000s 0:00.77 101.2% 0+0k 0+0io 0pf+0w
* 4096
1.320u 0.000s 0:01.33 99.2% 0+0k 0+0io 0pf+0w

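#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "postgres.h"
#include "fmgr.h"

/* Raise the limit so MemSet's own loop is used at every tested size */
#undef MEMSET_LOOP_LIMIT
#define MEMSET_LOOP_LIMIT 1000000

int
main(int argc, char **argv)
{
    long long i;
    FunctionCallInfoData fcinfo;

    if (argc < 2)
    {
        fprintf(stderr, "usage: %s block-size\n", argv[0]);
        return 1;
    }

    int  len = atoi(argv[1]);
    char buffer[len];           /* variable-length array (C99 / GCC) */

    /* sizeof yields size_t, so cast to match the %d conversion */
    printf("Size of fcinfo is %d\n", (int) sizeof(fcinfo));

    for (i = 0; i < 9900000; i++)
        MemSet(buffer, 0, len);
        /* memset(buffer, 0, len); */   /* swap in to time the system memset */
    return 0;
}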
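
The test program above needs a PostgreSQL source tree to build. A minimal sketch of the same comparison without any PostgreSQL headers might look like the following. It assumes a long-aligned buffer and lengths that are multiples of sizeof(long); `word_loop_clear` and `time_clears` are hypothetical names, and the loop is only a stand-in for what MemSet does, not the macro from c.h.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Hypothetical stand-in for a word-at-a-time clearing loop.
 * Assumes buf is long-aligned and len is a multiple of sizeof(long). */
static void
word_loop_clear(void *buf, size_t len)
{
    long *p = (long *) buf;
    long *stop = (long *) ((char *) buf + len);

    assert(len % sizeof(long) == 0);
    while (p < stop)
        *p++ = 0;
}

/* Time `iters` clears of `len` bytes and return CPU seconds.
 * use_system != 0 selects the C library memset. */
static double
time_clears(size_t len, long iters, int use_system)
{
    static long buffer[512];    /* long-aligned; at least 2048 bytes */
    volatile char *sink = (volatile char *) buffer;
    clock_t start = clock();
    long i;

    assert(len <= sizeof(buffer));
    for (i = 0; i < iters; i++)
    {
        if (use_system)
            memset(buffer, 0, len);
        else
            word_loop_clear(buffer, len);
        (void) sink[0];         /* volatile read to discourage dropping the clear */
    }
    return (double) (clock() - start) / CLOCKS_PER_SEC;
}
```

Calling `time_clears(len, 9900000, 0)` and `time_clears(len, 9900000, 1)` for each block size gives the two columns to compare; the compiler and optimization level should be reported along with the numbers.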

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Marc Colosimo (#1)
Re: tweaking MemSet() performance - 7.4.5

Marc Colosimo <mcolosimo@mitre.org> writes:

I'm using 7.4.5 on Mac OS X (G5) and was profiling it to see why it is
SO SLOW at committing inserts and deletes into a large database. One
of the many slowdowns was from MemSet. I found an old (2002) thread
about this and retried the tests (see below). The main point is that
the system memset crushes pg's!!

Hmm. I tried to duplicate this on my G4 laptop, and found that they
were more or less on a par for small-to-middling block sizes (using
"gcc -O2"). Darwin's memset code must have some additional tweaks for
use on G5 hardware. Good for Apple --- this is the sort of thing that
OS vendors *ought* to be doing. The fact that we can beat the system
memset on so many platforms is an indictment of those platforms.

Is it possible to add a define to call
the system memset at build time! This probably isn't the case on other
systems.

Feel free to hack the definition of MemSet in src/include/c.h. See the
comments for it for more context.

Note that for small compile-time-constant block sizes (a case your test
program doesn't test, but it's common in pgsql), gcc with a sufficiently
high optimization setting can unroll the loop into a linear sequence of
words zeroings. I would expect that to beat the system memset up to a
few dozen words, no matter how tense the memset coding is. So you
probably want to think in terms of reducing MEMSET_LOOP_LIMIT rather
than diking out the macro code altogether. Or maybe reduce MemSet to
"memset(...)" but leave MemSetAligned and/or MemSetTest/MemSetLoop
as-is. In any case, reporting results without mentioning the compiler
and optimization level in use isn't going to convince anybody ...

regards, tom lane
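
The idea of keeping an inline loop for small blocks and deferring to the library above a cutoff can be sketched roughly as below. This is an illustration only, not the MemSet definition from src/include/c.h: `hybrid_clear` and `HYBRID_LOOP_LIMIT` are hypothetical names, and the cutoff of 32 bytes is an arbitrary placeholder that would have to be tuned by measurement on each platform.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define HYBRID_LOOP_LIMIT 32    /* hypothetical cutoff in bytes; tune by measurement */

/* Zero len bytes at start: inline word stores for small, long-aligned
 * requests; the C library memset for everything else. */
static inline void
hybrid_clear(void *start, size_t len)
{
    assert(start != NULL);

    if (((uintptr_t) start % sizeof(long)) == 0 &&
        (len % sizeof(long)) == 0 &&
        len <= HYBRID_LOOP_LIMIT)
    {
        long *p = (long *) start;
        long *stop = (long *) ((char *) start + len);

        while (p < stop)
            *p++ = 0;
    }
    else
        memset(start, 0, len);
}
```

When the length is a compile-time constant, an optimizing compiler can resolve the alignment and size tests at compile time and may unroll the short loop, which is the case the inline path is meant to serve.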

#3Marc Colosimo
mcolosimo@mitre.org
In reply to: Tom Lane (#2)
Re: tweaking MemSet() performance - 7.4.5

On Sep 17, 2004, at 3:55 PM, Tom Lane wrote:

Marc Colosimo <mcolosimo@mitre.org> writes:

I'm using 7.4.5 on Mac OS X (G5) and was profiling it to see why it is
SO SLOW at committing inserts and deletes into a large database. One
of the many slowdowns was from MemSet. I found an old (2002) thread
about this and retried the tests (see below). The main point is that
the system memset crushes pg's!!

Hmm. I tried to duplicate this on my G4 laptop, and found that they
were more or less on a par for small-to-middling block sizes (using
"gcc -O2"). Darwin's memset code must have some additional tweaks for
use on G5 hardware. Good for Apple --- this is the sort of thing that
OS vendors *ought* to be doing. The fact that we can beat the system
memset on so many platforms is an indictment of those platforms.

Is it possible to add a define to call
the system memset at build time! This probably isn't the case on other
systems.

Feel free to hack the definition of MemSet in src/include/c.h. See the
comments for it for more context.

Note that for small compile-time-constant block sizes (a case your test
program doesn't test, but it's common in pgsql), gcc with a sufficiently
high optimization setting can unroll the loop into a linear sequence of
words zeroings. I would expect that to beat the system memset up to a
few dozen words, no matter how tense the memset coding is. So you
probably want to think in terms of reducing MEMSET_LOOP_LIMIT rather
than diking out the macro code altogether. Or maybe reduce MemSet to
"memset(...)" but leave MemSetAligned and/or MemSetTest/MemSetLoop
as-is. In any case, reporting results without mentioning the compiler
and optimization level in use isn't going to convince anybody ...

Oops, I used the same setting as in the old hacking message (-O2, gcc
3.3). If I understand what you are saying, then it turns out yes, PG's
MemSet is faster for smaller blocksizes (see below, between 32 and 64).
I just replaced the whole MemSet with memset and it is now very low
when I profile. I could squeeze more out of it if I spent more time
trying to understand it (change MEMSET_LOOP_LIMIT to 32 and then add
memset after that?). I'm now working on understanding Spin Locks and
friends. Putting in a sync call (in s_lock.h) is really a time killer
and bad for performance (it takes up 35 cycles).

Run on a single-processor G5 (1.8 GHz; the other run was on a DP 2 GHz G5)
pgMemSet:
* 4
0.070u 0.000s 0:00.15 46.6% 0+0k 0+0io 0pf+0w
* 8
0.090u 0.000s 0:00.16 56.2% 0+0k 0+0io 0pf+0w
* 16
0.120u 0.000s 0:00.17 70.5% 0+0k 0+0io 0pf+0w
* 32
0.180u 0.000s 0:00.29 62.0% 0+0k 0+0io 0pf+0w
* 64
0.450u 0.000s 0:00.92 48.9% 0+0k 0+0io 0pf+0w

memset:
* 4
0.170u 0.010s 0:00.44 40.9% 0+0k 0+0io 0pf+0w
* 8
0.190u 0.000s 0:00.42 45.2% 0+0k 0+0io 0pf+0w
* 16
0.190u 0.010s 0:00.39 51.2% 0+0k 0+0io 0pf+0w
* 32
0.200u 0.000s 0:00.39 51.2% 0+0k 0+0io 0pf+0w
* 64
0.260u 0.000s 0:00.38 68.4% 0+0k 0+0io 0pf+0w

Marc

#4Manfred Spraul
manfred@colorfullife.com
In reply to: Marc Colosimo (#3)
Re: tweaking MemSet() performance - 7.4.5

Marc Colosimo wrote:

Oops, I used the same setting as in the old hacking message (-O2, gcc
3.3). If I understand what you are saying, then it turns out yes, PG's
MemSet is faster for smaller blocksizes (see below, between 32 and
64). I just replaced the whole MemSet with memset and it is now very
low when I profile.

Could you check what the OS-X memset function does internally?
One trick to speed up memset is to bypass the cache and bulk-write
directly from write buffers to main memory. i386 cpus support that and
in microbenchmarks it's 3 times faster (or something like that).
Unfortunately it's a loss in real-world tests: Typically a structure is
initialized with memset and then immediately accessed. If the memset
bypasses the cache then the following access will cause a cache line
miss, which can be so slow that using the faster memset can result in a
net performance loss.

I could squeeze more out of it if I spent more time trying to
understand it (change MEMSET_LOOP_LIMIT to 32 and then add memset
after that?). I'm now working on understanding Spin Locks and
friends. Putting in a sync call (in s_lock.h) is really a time killer
and bad for performance (it takes up 35 cycles).

That's the price you pay for weakly ordered memory access.
Linux on ppc uses eieio, on ppc64 lwsync is used. Could you check if
they are faster?

--
Manfred

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Manfred Spraul (#4)
Re: tweaking MemSet() performance - 7.4.5

Manfred Spraul <manfred@colorfullife.com> writes:

That's the price you pay for weakly ordered memory access.
Linux on ppc uses eieio, on ppc64 lwsync is used. Could you check if
they are faster?

I recall looking at lwsync and being concerned about portability
problems --- older assemblers will fail to recognize it. I'd want
to see some hard evidence that changing sync to lwsync would be a
significant performance win before taking any portability risk here.

regards, tom lane
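
For readers unfamiliar with where the barrier sits, a test-and-set lock can be sketched with GCC's `__sync` builtins, which GCC documents as carrying acquire and release semantics. This is an assumption-laden illustration, not the code in s_lock.h: `sketch_lock_t`, `sketch_try_lock` and `sketch_unlock` are hypothetical names, the builtins require a GCC release newer than the 3.3 used in this thread, and which instruction (sync, lwsync, or something else) the compiler emits on PowerPC is not claimed here.

```c
#include <assert.h>

typedef volatile int sketch_lock_t;     /* hypothetical lock type */

/* Try once to take the lock; returns 1 on success, 0 if already held.
 * __sync_lock_test_and_set is documented by GCC as an acquire barrier. */
static int
sketch_try_lock(sketch_lock_t *lock)
{
    return __sync_lock_test_and_set(lock, 1) == 0;
}

/* Release the lock; __sync_lock_release stores 0 with release semantics. */
static void
sketch_unlock(sketch_lock_t *lock)
{
    assert(*lock == 1);
    __sync_lock_release(lock);
}
```

The cost being discussed is the barrier on every acquire and release, so any comparison of sync against a lighter instruction would need to time lock and unlock pairs under contention, not a single-threaded loop.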

#6Noname
mcolosimo@mitre.org
In reply to: Tom Lane (#5)
Re: tweaking MemSet() performance - 7.4.5

Marc Colosimo wrote:

Oops, I used the same setting as in the old hacking message (-O2, gcc
3.3). If I understand what you are saying, then it turns out yes, PG's
MemSet is faster for smaller blocksizes (see below, between 32 and
64). I just replaced the whole MemSet with memset and it is now very
low when I profile.

Could you check what the OS-X memset function does internally?
One trick to speed up memset is to bypass the cache and bulk-write
directly from write buffers to main memory. i386 cpus support that and
in microbenchmarks it's 3 times faster (or something like that).
Unfortunately it's a loss in real-world tests: Typically a structure is
initialized with memset and then immediately accessed. If the memset
bypasses the cache then the following access will cause a cache line
miss, which can be so slow that using the faster memset can result in a
net performance loss.

Could you suggest some structs to test? If I get your meaning, I would make a loop that sets then reads from the structure.

I could squeeze more out of it if I spent more time trying to
understand it (change MEMSET_LOOP_LIMIT to 32 and then add memset
after that?). I'm now working on understanding Spin Locks and
friends. Putting in a sync call (in s_lock.h) is really a time killer
and bad for performance (it takes up 35 cycles).

That's the price you pay for weakly ordered memory access.
Linux on ppc uses eieio, on ppc64 lwsync is used. Could you check if
they are faster?

I found the reason why "sync" was put in
<http://archives.postgresql.org/pgsql-bugs/2002-09/msg00239.php>, but it
is odd that it works. Why syncing one processor prevents the other from
doing something is interesting. What type of shared memory is being used
on OS X? I'm confused about the two types of semaphores, SysV or POSIX
<http://archives.postgresql.org/pgsql-patches/2001-01/msg00052.php>. It
seems POSIX is the way to go on OS X.

Marc
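
The "set, then read" loop described in this message could be sketched as follows. `FakeCallInfo` is a hypothetical stand-in sized to the 176 bytes reported for fcinfo earlier in the thread; it does not reproduce the real FunctionCallInfoData layout, and `set_then_read` is likewise an illustrative name.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in sized like the 176-byte fcinfo reported earlier;
 * the real FunctionCallInfoData layout is not reproduced here. */
typedef struct FakeCallInfo
{
    long words[176 / sizeof(long)];
} FakeCallInfo;

/* Clear the struct, then immediately write and read fields: the access
 * pattern where a cache-bypassing memset would be expected to lose.
 * Returns the sum of the values read so the loop has a visible result. */
static long
set_then_read(long iters)
{
    FakeCallInfo info;
    volatile FakeCallInfo *vinfo = &info;
    size_t nwords = sizeof(info.words) / sizeof(info.words[0]);
    long sum = 0;
    long i;

    assert(iters >= 0);
    for (i = 0; i < iters; i++)
    {
        memset(&info, 0, sizeof(info));
        vinfo->words[0] = i;                /* touch the first word after the clear */
        sum += vinfo->words[0];
        sum += vinfo->words[nwords - 1];    /* last word should still be zero */
    }
    return sum;
}
```

Timing this loop once with memset and once with the routine under test measures the clear together with the access that follows it, which a clear-only loop does not capture.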

#7Manfred Spraul
manfred@colorfullife.com
In reply to: Noname (#6)
Re: tweaking MemSet() performance - 7.4.5

mcolosimo@mitre.org wrote:

If the memset
bypasses the cache then the following access will cause a cache line
miss, which can be so slow that using the faster memset can result in a
net performance loss.

Could you suggest some structs to test? If I get your meaning, I would make a loop that sets then reads from the structure.

Read the sources and the cpu specs. Benchmarking such problems is
virtually impossible.
I don't have OS-X, thus I checked the Linux-kernel sources: It seems
that the power architecture doesn't have the same problem as x86.
There is a special clear cacheline instruction for large memsets and the
rest is done through carefully optimized store byte/halfword/word/double
word sequences.

Thus I'd check what happens if you memset not perfectly aligned buffers.
That's another point where over-optimized functions sometimes break
down. If there is no slowdown, then I'd replace the postgres function
with the OS provided function.

I'd add some __builtin_constant_p() optimizations, but I guess Tom won't
like gcc hacks ;-)
--
Manfred
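
A check on buffers that are not perfectly aligned might be sketched like this. `time_offset_memset` is a hypothetical helper, and the approach (offsetting into long-aligned storage) is just one assumed way to produce a misaligned start address.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Time `iters` calls of memset on a buffer that starts `offset` bytes
 * past a long-aligned address. Returns CPU seconds. */
static double
time_offset_memset(size_t offset, size_t len, long iters)
{
    static long backing[520];       /* long-aligned storage */
    char *buf = (char *) backing + offset;
    volatile char *sink = buf;
    clock_t start = clock();
    long i;

    assert(len > 0);
    assert(offset + len <= sizeof(backing));
    for (i = 0; i < iters; i++)
    {
        memset(buf, 0, len);
        (void) sink[len - 1];       /* volatile read to keep the store live */
    }
    return (double) (clock() - start) / CLOCKS_PER_SEC;
}
```

Comparing `time_offset_memset(0, 176, n)` against offsets such as 1, 3 or 7 would show whether the library routine slows down noticeably when the start address is misaligned.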

#8Karel Zak
zakkr@zf.jcu.cz
In reply to: Manfred Spraul (#7)
Re: tweaking MemSet() performance - 7.4.5

On Sat, 2004-09-25 at 23:23 +0200, Manfred Spraul wrote:

mcolosimo@mitre.org wrote:

If the memset
bypasses the cache then the following access will cause a cache line
miss, which can be so slow that using the faster memset can result in a
net performance loss.

Could you suggest some structs to test? If I get your meaning, I would make a loop that sets then reads from the structure.

Read the sources and the cpu specs. Benchmarking such problems is
virtually impossible.
I don't have OS-X, thus I checked the Linux-kernel sources: It seems
that the power architecture doesn't have the same problem as x86.
There is a special clear cacheline instruction for large memsets and the
rest is done through carefully optimized store byte/halfword/word/double
word sequences.

Thus I'd check what happens if you memset not perfectly aligned buffers.
That's another point where over-optimized functions sometimes break
down. If there is no slowdown, then I'd replace the postgres function
with the OS provided function.

I'd add some __builtin_constant_p() optimizations, but I guess Tom won't
like gcc hacks ;-)

I think it cannot be a problem if you write it in some .h file (in the
port directory?) as a macro with "#ifdef GCC". The other question is the
real advantage of hacks like this in practical PG usage :-)

Karel

--
Karel Zak
http://home.zf.jcu.cz/~zakkr
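
A compiler-guarded macro along the lines discussed here might look like the sketch below. The names `CLEAR_BYTES`, `clear_generic` and `SMALL_CLEAR_LIMIT` are hypothetical and the 64-byte threshold is arbitrary. The guard uses `__GNUC__`, the macro GCC predefines, on the assumption that the "GCC" in the message was shorthand for it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SMALL_CLEAR_LIMIT 64    /* hypothetical threshold in bytes */

/* Out-of-line fallback: always the C library routine. */
static void
clear_generic(void *start, size_t len)
{
    assert(start != NULL || len == 0);
    memset(start, 0, len);
}

#ifdef __GNUC__
/* With GCC, a small compile-time-constant length goes to
 * __builtin_memset, which the compiler may expand inline;
 * everything else takes the generic path. */
#define CLEAR_BYTES(start, len) \
    ((__builtin_constant_p(len) && (len) <= SMALL_CLEAR_LIMIT) \
        ? (void) __builtin_memset((start), 0, (len)) \
        : clear_generic((start), (len)))
#else
#define CLEAR_BYTES(start, len) clear_generic((start), (len))
#endif
```

Other compilers see only the plain function call, so the GCC-specific part stays confined to one header. Whether it pays off in practice is exactly the open question raised above and would need profiling of a real workload.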

#9Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Karel Zak (#8)
Re: tweaking MemSet() performance - 7.4.5

Karel Zak wrote:

On Sat, 2004-09-25 at 23:23 +0200, Manfred Spraul wrote:

mcolosimo@mitre.org wrote:

If the memset
bypasses the cache then the following access will cause a cache line
miss, which can be so slow that using the faster memset can result in a
net performance loss.

Could you suggest some structs to test? If I get your meaning, I would make a loop that sets then reads from the structure.

Read the sources and the cpu specs. Benchmarking such problems is
virtually impossible.
I don't have OS-X, thus I checked the Linux-kernel sources: It seems
that the power architecture doesn't have the same problem as x86.
There is a special clear cacheline instruction for large memsets and the
rest is done through carefully optimized store byte/halfword/word/double
word sequences.

Thus I'd check what happens if you memset not perfectly aligned buffers.
That's another point where over-optimized functions sometimes break
down. If there is no slowdown, then I'd replace the postgres function
with the OS provided function.

I'd add some __builtin_constant_p() optimizations, but I guess Tom won't
like gcc hacks ;-)

I think it cannot be a problem if you write it in some .h file (in the
port directory?) as a macro with "#ifdef GCC". The other question is the
real advantage of hacks like this in practical PG usage :-)

The reason MemSet is a win is not that the C code is great but because
it eliminates a function call.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

#10Marc Colosimo
mcolosimo@mitre.org
In reply to: Bruce Momjian (#9)
Re: tweaking MemSet() performance - 7.4.5

On Sep 29, 2004, at 7:37 AM, Bruce Momjian wrote:

Karel Zak wrote:

On Sat, 2004-09-25 at 23:23 +0200, Manfred Spraul wrote:

mcolosimo@mitre.org wrote:

If the memset
bypasses the cache then the following access will cause a cache line
miss, which can be so slow that using the faster memset can result in a
net performance loss.

Could you suggest some structs to test? If I get your meaning, I
would make a loop that sets then reads from the structure.

Read the sources and the cpu specs. Benchmarking such problems is
virtually impossible.
I don't have OS-X, thus I checked the Linux-kernel sources: It seems
that the power architecture doesn't have the same problem as x86.
There is a special clear cacheline instruction for large memsets and the
rest is done through carefully optimized store byte/halfword/word/double
word sequences.

Thus I'd check what happens if you memset not perfectly aligned buffers.
That's another point where over-optimized functions sometimes break
down. If there is no slowdown, then I'd replace the postgres function
with the OS provided function.

All memory (via malloc and friends) will be aligned on OS X, unless you
remove padding (which I don't think you do).

I'd add some __builtin_constant_p() optimizations, but I guess Tom won't
like gcc hacks ;-)

I think it cannot be a problem if you write it in some .h file (in the
port directory?) as a macro with "#ifdef GCC". The other question is the
real advantage of hacks like this in practical PG usage :-)

The reason MemSet is a win is not that the C code is great but because
it eliminates a function call.

Using memset really did speed things up. I think the function call
overhead is okay. As for real-world usage, the function
ExecMakeFunctionResult dropped from the top of the list when profiling
(now < 1% vs 16% before)! This was doing a big nasty delete (w/
cascading) and insert in a cursor.

Here are results for a Mac G4 (single processor) OS 10.3, using -O2.
This time the mac memset wins all around. Someone posted that this
wasn't the case.

PG MemSet:
pgmemset_test 32
0.670u 0.020s 0:00.70 98.5% 0+0k 0+0io 0pf+0w
pgmemset_test 64
1.060u 0.000s 0:01.05 100.9% 0+0k 0+0io 0pf+0w
pgmemset_test 128
1.750u 0.010s 0:01.76 100.0% 0+0k 0+0io 0pf+0w
pgmemset_test 512
6.010u 0.030s 0:06.04 100.0% 0+0k 0+0io 0pf+0w

Mac memset:
memset_test 32
0.660u 0.020s 0:00.67 101.4% 0+0k 0+0io 0pf+0w
memset_test 64
0.720u 0.000s 0:00.72 100.0% 0+0k 0+0io 0pf+0w
memset_test 128
0.800u 0.010s 0:00.81 100.0% 0+0k 0+0io 0pf+0w
memset_test 512
1.470u 0.010s 0:01.48 100.0% 0+0k 0+0io 0pf+0w

I also checked setting a byte right after the memset, and it does slow
down a tiny bit. But the slowdown is the same for both MemSet and memset
for sizes under 64.

#11Neil Conway
neilc@samurai.com
In reply to: Bruce Momjian (#9)
Re: tweaking MemSet() performance - 7.4.5

On Wed, 2004-09-29 at 21:37, Bruce Momjian wrote:

The reason MemSet is a win is not that the C code is great but because
it eliminates a function call.

A reasonable compiler ought to be able to implement memset() as a
compiler intrinsic where it makes sense to do so. MSVC++ can certainly
do this; per the GCC 3.4 docs, it seems GCC can/does as well:

The ISO C90 functions abort, abs, acos, asin, atan2, atan, calloc, ceil,
cosh, cos, exit, exp, fabs, floor, fmod, fprintf, fputs, frexp, fscanf,
labs, ldexp, log10, log, malloc, memcmp, memcpy, memset, modf, pow,
printf, putchar, puts, scanf, sinh, sin, snprintf, sprintf, sqrt,
sscanf, strcat, strchr, strcmp, strcpy, strcspn, strlen, strncat,
strncmp, strncpy, strpbrk, strrchr, strspn, strstr, tanh, tan, vfprintf,
vprintf and vsprintf are all recognized as built-in functions unless
-fno-builtin is specified (or -fno-builtin-function is specified for an
individual function). All of these functions have corresponding versions
prefixed with __builtin_.

(http://gcc.gnu.org/onlinedocs/gcc-3.4.2/gcc/Other-Builtins.html#Other-Builtins)

-Neil
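
One way to see the built-in at work is to compile a small fixed-size clear and inspect the assembly. The sketch below uses a hypothetical `SmallState` struct and `reset_state` function. Building it with `gcc -O2 -S` and again with `-fno-builtin-memset` added should show whether a call to memset is emitted or the stores are expanded inline; the outcome depends on the GCC version and target, so none is claimed here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical small struct; sizeof is a compile-time constant. */
typedef struct SmallState
{
    int   a;
    int   b;
    void *ptr;
} SmallState;

/* Plain memset with a constant size: a candidate for inline expansion
 * when the compiler treats memset as a built-in. */
static void
reset_state(SmallState *s)
{
    assert(s != NULL);
    memset(s, 0, sizeof(*s));
}
```

If the inline expansion appears for sizes like this, a hand-written macro has little left to win on small constant-size clears with that compiler.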

#12Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Neil Conway (#11)
Re: tweaking MemSet() performance - 7.4.5

Neil Conway wrote:

On Wed, 2004-09-29 at 21:37, Bruce Momjian wrote:

The reason MemSet is a win is not that the C code is great but because
it eliminates a function call.

A reasonable compiler ought to be able to implement memset() as a
compiler intrinsic where it makes sense to do so. MSVC++ can certainly
do this; per the GCC 3.4 docs, it seems GCC can/does as well:

The ISO C90 functions abort, abs, acos, asin, atan2, atan, calloc, ceil,
cosh, cos, exit, exp, fabs, floor, fmod, fprintf, fputs, frexp, fscanf,
labs, ldexp, log10, log, malloc, memcmp, memcpy, memset, modf, pow,
printf, putchar, puts, scanf, sinh, sin, snprintf, sprintf, sqrt,
sscanf, strcat, strchr, strcmp, strcpy, strcspn, strlen, strncat,
strncmp, strncpy, strpbrk, strrchr, strspn, strstr, tanh, tan, vfprintf,
vprintf and vsprintf are all recognized as built-in functions unless
-fno-builtin is specified (or -fno-builtin-function is specified for an
individual function). All of these functions have corresponding versions
prefixed with __builtin_.

(http://gcc.gnu.org/onlinedocs/gcc-3.4.2/gcc/Other-Builtins.html#Other-Builtins)

MemSet was written when gcc 2.X wasn't even stable yet. Have you run
any tests on 3.4 to see if MemSet is still a win with that compiler?

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

#13Peter Eisentraut
peter_e@gmx.net
In reply to: Bruce Momjian (#12)
Re: tweaking MemSet() performance - 7.4.5

Bruce Momjian wrote:

MemSet was written when gcc 2.X wasn't even stable yet. Have you run
any tests on 3.4 to see if MemSet is still a win with that compiler?

I did a test years ago that showed that memset is usually at least
as good as MemSet:

http://archives.postgresql.org/pgsql-patches/2002-10/msg00085.php

--
Peter Eisentraut
http://developer.postgresql.org/~petere/