spinlocks on HP-UX
I was able to obtain access to a 32-core HP-UX server. I repeated the
pgbench -S testing that I have previously done on Linux, and found
that the results were not too good. Here are the results at scale
factor 100, on 9.2devel, with various numbers of clients. Five minute
runs, shared_buffers=8GB.
1:tps = 5590.070816 (including connections establishing)
8:tps = 37660.233932 (including connections establishing)
16:tps = 67366.099286 (including connections establishing)
32:tps = 82781.624665 (including connections establishing)
48:tps = 18589.995074 (including connections establishing)
64:tps = 16424.661371 (including connections establishing)
And just for comparison, here are the numbers at scale factor 1000:
1:tps = 4751.768608 (including connections establishing)
8:tps = 33621.474490 (including connections establishing)
16:tps = 58959.043171 (including connections establishing)
32:tps = 78801.265189 (including connections establishing)
48:tps = 21635.234969 (including connections establishing)
64:tps = 18611.863567 (including connections establishing)
After mulling over the vmstat output for a bit, I began to suspect
spinlock contention. I took a look at a document called "Implementing
Spinlocks on the Intel Itanium Architecture and PA-RISC", by Tor
Ekqvist and David Graves and available via the HP web site, which
states that when spinning on a spinlock on these machines, you should
use a regular, unlocked test first and use the atomic test only when
the unlocked test looks OK. I tried implementing this in two ways,
and both produced results which are FAR superior to our current
implementation. First, I did this:
--- a/src/include/storage/s_lock.h
+++ b/src/include/storage/s_lock.h
@@ -726,7 +726,7 @@ tas(volatile slock_t *lock)
typedef unsigned int slock_t;
#include <ia64/sys/inline.h>
-#define TAS(lock) _Asm_xchg(_SZ_W, lock, 1, _LDHINT_NONE)
+#define TAS(lock) (*(lock) ? 1 : _Asm_xchg(_SZ_W, lock, 1, _LDHINT_NONE))
#endif /* HPUX on IA64, non gcc */
That resulted in these numbers. Scale factor 100:
1:tps = 5569.911714 (including connections establishing)
8:tps = 37365.364468 (including connections establishing)
16:tps = 63596.261875 (including connections establishing)
32:tps = 95948.157678 (including connections establishing)
48:tps = 90708.253920 (including connections establishing)
64:tps = 100109.065744 (including connections establishing)
Scale factor 1000:
1:tps = 4878.332996 (including connections establishing)
8:tps = 33245.469907 (including connections establishing)
16:tps = 56708.424880 (including connections establishing)
48:tps = 69652.232635 (including connections establishing)
64:tps = 70593.208637 (including connections establishing)
Then, I did this:
--- a/src/backend/storage/lmgr/s_lock.c
+++ b/src/backend/storage/lmgr/s_lock.c
@@ -96,7 +96,7 @@ s_lock(volatile slock_t *lock, const char *file, int line)
int delays = 0;
int cur_delay = 0;
- while (TAS(lock))
+ while (*lock ? 1 : TAS(lock))
{
/* CPU-specific delay each time through the loop */
SPIN_DELAY();
That resulted in these numbers, at scale factor 100:
1:tps = 5564.059494 (including connections establishing)
8:tps = 37487.090798 (including connections establishing)
16:tps = 66061.524760 (including connections establishing)
32:tps = 96535.523905 (including connections establishing)
48:tps = 92031.618360 (including connections establishing)
64:tps = 106813.631701 (including connections establishing)
And at scale factor 1000:
1:tps = 4980.338246 (including connections establishing)
8:tps = 33576.680072 (including connections establishing)
16:tps = 55618.677975 (including connections establishing)
32:tps = 73589.442746 (including connections establishing)
48:tps = 70987.026228 (including connections establishing)
Not sure why I am missing the 64-client results for that last set of
tests, but no matter.
Of course, we can't apply the second patch as it stands, because I
tested it on x86 and it loses. But it seems pretty clear we need to
do it at least for this architecture...
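For anyone without the HP document at hand, the technique both patches implement is test-and-test-and-set: spin on an ordinary load and issue the atomic exchange only when the lock looks free. A minimal sketch in portable C11 atomics (illustrative names, not the actual s_lock.h code):

```c
#include <stdatomic.h>

typedef atomic_int slock_t;

/* Spin attempt: an ordinary (relaxed) load first, so waiters spin on a
 * cache-local copy; the bus-locking exchange is issued only once the
 * lock looks free.  Returns 0 if the lock was acquired. */
int
tas_spin(slock_t *lock)
{
    if (atomic_load_explicit(lock, memory_order_relaxed))
        return 1;                       /* looks held: fail without an atomic op */
    return atomic_exchange(lock, 1);    /* full-barrier exchange, old value */
}

void
s_unlock(slock_t *lock)
{
    atomic_store(lock, 0);              /* seq_cst store releases the lock */
}
```

The payoff is that spinning waiters stop requesting the cache line exclusively on every iteration; only when the holder releases does one of them issue an exchange.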
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Pity that this patch works only on HP-UX :(.
But I have an idea: maybe when the executor stops at a locked row, it
should process the next row instead of waiting. Of course, only if the
query contains no "order by" or windowing functions.
--
------------
pasman
Robert Haas <robertmhaas@gmail.com> writes:
First, I did this:
-#define TAS(lock) _Asm_xchg(_SZ_W, lock, 1, _LDHINT_NONE)
+#define TAS(lock) (*(lock) ? 1 : _Asm_xchg(_SZ_W, lock, 1, _LDHINT_NONE))
Seems reasonable, and similar to x86 logic.
Then, I did this:
- while (TAS(lock))
+ while (*lock ? 1 : TAS(lock))
Er, what? That sure looks like a manual application of what you'd
already done in the TAS macro.
Of course, we can't apply the second patch as it stands, because I
tested it on x86 and it loses. But it seems pretty clear we need to
do it at least for this architecture...
Please clarify: when you say "this architecture", are you talking about
IA64 or PA-RISC? Is there any reason to think that this is specific to
HP-UX rather than any other system on the same architecture? (I'm sure
I can get access to some IA64 clusters at Red Hat, though maybe not
64-core ones.)
I don't have an objection to the TAS macro change, but I do object to
fooling with the hardware-independent code in s_lock.c ... especially
when the additional improvement seems barely above the noise threshold.
You ought to be able to do whatever you need inside the TAS macro.
regards, tom lane
On Sun, Aug 28, 2011 at 11:35 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
First, I did this:
-#define TAS(lock) _Asm_xchg(_SZ_W, lock, 1, _LDHINT_NONE)
+#define TAS(lock) (*(lock) ? 1 : _Asm_xchg(_SZ_W, lock, 1, _LDHINT_NONE))
Seems reasonable, and similar to x86 logic.
Then, I did this:
- while (TAS(lock))
+ while (*lock ? 1 : TAS(lock))
Er, what? That sure looks like a manual application of what you'd
already done in the TAS macro.
Sorry, I blew through that a little too blithely. If you change TAS()
itself, then even the very first attempt to acquire the lock will try
the unlocked instruction first, whereas changing s_lock() allows you
to do something different in the contended case than you do in the
uncontended case. We COULD just change the TAS() macro since, in this
case, it seems to make only a minor difference, but what I was
thinking is that we could change s_lock.h to define two macros, TAS()
and TAS_SPIN(). If a particular architecture defines TAS() but not
TAS_SPIN(), then we define TAS_SPIN(x) to be TAS(x). Then, S_LOCK()
can stay as-is - calling TAS() - but s_lock() can call TAS_SPIN(),
which will normally be the same as TAS() but can be made different on
any architecture where the retry loop should do something different
than the initial attempt.
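Concretely, the scheme described above might look like this (a self-contained sketch; fake_xchg is a stand-in invented here so the macro plumbing can be compiled, only TAS/TAS_SPIN are the real names):

```c
typedef int slock_t;

/* Stand-in for an architecture's atomic exchange instruction: returns
 * the old value and stores 1. */
int
fake_xchg(volatile slock_t *lock)
{
    int old = *lock;
    *lock = 1;
    return old;
}

#define TAS(lock)       fake_xchg(lock)

/* A port that wants different behavior in the retry loop defines
 * TAS_SPIN itself, e.g. with the unlocked pre-test: */
#define TAS_SPIN(lock)  (*(lock) ? 1 : TAS(lock))

/* Ports that define only TAS would get the generic fallback:
 *
 *   #if !defined(TAS_SPIN)
 *   #define TAS_SPIN(lock)  TAS(lock)
 *   #endif
 */
```

With this split, S_LOCK() keeps calling TAS() for the uncontended first attempt, while s_lock()'s delay loop calls TAS_SPIN().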
Please clarify: when you say "this architecture", are you talking about
IA64 or PA-RISC? Is there any reason to think that this is specific to
HP-UX rather than any other system on the same architecture? (I'm sure
I can get access to some IA64 clusters at Red Hat, though maybe not
64-core ones.)
I tested on IA64; I don't currently have access to a PA-RISC box. The
documentation I'm looking at implies that the same approach would be
desirable there, but that's just an unsubstantiated rumor at this
point....
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
On Sun, Aug 28, 2011 at 11:35 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
Then, I did this:
-       while (TAS(lock))
+       while (*lock ? 1 : TAS(lock))
Er, what? That sure looks like a manual application of what you'd
already done in the TAS macro.
Sorry, I blew through that a little too blithely. If you change TAS()
itself, then even the very first attempt to acquire the lock will try
the unlocked instruction first, whereas changing s_lock() allows you
to do something different in the contended case than you do in the
uncontended case.
Yeah, I figured out that was probably what you meant a little while
later. I found a 64-CPU IA64 machine in Red Hat's test labs and am
currently trying to replicate your results; report to follow.
We COULD just change the TAS() macro since, in this
case, it seems to make only a minor difference, but what I was
thinking is that we could change s_lock.h to define two macros, TAS()
and TAS_SPIN().
Yeah, I was thinking along the same lines, though perhaps the name of
the new macro could use a little bikeshedding.
The comments in s_lock.h note that the unlocked test in x86 TAS is of
uncertain usefulness. It seems entirely possible to me that we ought
to use a similar design on x86, ie, use the unlocked test only once
we've entered the delay loop.
Please clarify: when you say "this architecture", are you talking about
IA64 or PA-RISC? Is there any reason to think that this is specific to
HP-UX rather than any other system on the same architecture? (I'm sure
I can get access to some IA64 clusters at Red Hat, though maybe not
64-core ones.)
I tested on IA64; I don't currently have access to a PA-RISC box. The
documentation I'm looking at implies that the same approach would be
desirable there, but that's just an unsubstantiated rumor at this
point....
Well, I've got a PA-RISC box, but it's only a single processor so it's
not gonna prove much. Anybody?
regards, tom lane
I wrote:
Yeah, I figured out that was probably what you meant a little while
later. I found a 64-CPU IA64 machine in Red Hat's test labs and am
currently trying to replicate your results; report to follow.
OK, these results are on a 64-processor SGI IA64 machine (AFAICT, 64
independent sockets, no hyperthreading or any funny business); 124GB
in 32 NUMA nodes; running RHEL5.7, gcc 4.1.2. I built today's git
head with --enable-debug (but not --enable-cassert) and ran with all
default configuration settings except shared_buffers = 8GB and
max_connections = 200. The test database is initialized at -s 100.
I did not change the database between runs, but restarted the postmaster
and then did this to warm the caches a tad:
pgbench -c 1 -j 1 -S -T 30 bench
Per-run pgbench parameters are as shown below --- note in particular
that I assigned one pgbench thread per 8 backends.
The numbers are fairly variable even with 5-minute runs; I did each
series twice so you could get a feeling for how much.
Today's git head:
pgbench -c 1 -j 1 -S -T 300 bench tps = 5835.213934 (including ...
pgbench -c 2 -j 1 -S -T 300 bench tps = 8499.223161 (including ...
pgbench -c 8 -j 1 -S -T 300 bench tps = 15197.126952 (including ...
pgbench -c 16 -j 2 -S -T 300 bench tps = 30803.255561 (including ...
pgbench -c 32 -j 4 -S -T 300 bench tps = 65795.356797 (including ...
pgbench -c 64 -j 8 -S -T 300 bench tps = 81644.914241 (including ...
pgbench -c 96 -j 12 -S -T 300 bench tps = 40059.202836 (including ...
pgbench -c 128 -j 16 -S -T 300 bench tps = 21309.615001 (including ...
run 2:
pgbench -c 1 -j 1 -S -T 300 bench tps = 5787.310115 (including ...
pgbench -c 2 -j 1 -S -T 300 bench tps = 8747.104236 (including ...
pgbench -c 8 -j 1 -S -T 300 bench tps = 14655.369995 (including ...
pgbench -c 16 -j 2 -S -T 300 bench tps = 28287.254924 (including ...
pgbench -c 32 -j 4 -S -T 300 bench tps = 61614.715187 (including ...
pgbench -c 64 -j 8 -S -T 300 bench tps = 79754.640518 (including ...
pgbench -c 96 -j 12 -S -T 300 bench tps = 40334.994324 (including ...
pgbench -c 128 -j 16 -S -T 300 bench tps = 23285.271257 (including ...
With modified TAS macro (see patch 1 below):
pgbench -c 1 -j 1 -S -T 300 bench tps = 6171.454468 (including ...
pgbench -c 2 -j 1 -S -T 300 bench tps = 8709.003728 (including ...
pgbench -c 8 -j 1 -S -T 300 bench tps = 14902.731035 (including ...
pgbench -c 16 -j 2 -S -T 300 bench tps = 29789.744482 (including ...
pgbench -c 32 -j 4 -S -T 300 bench tps = 59991.549128 (including ...
pgbench -c 64 -j 8 -S -T 300 bench tps = 117369.287466 (including ...
pgbench -c 96 -j 12 -S -T 300 bench tps = 112583.144495 (including ...
pgbench -c 128 -j 16 -S -T 300 bench tps = 110231.305282 (including ...
run 2:
pgbench -c 1 -j 1 -S -T 300 bench tps = 5670.097936 (including ...
pgbench -c 2 -j 1 -S -T 300 bench tps = 8230.786940 (including ...
pgbench -c 8 -j 1 -S -T 300 bench tps = 14785.952481 (including ...
pgbench -c 16 -j 2 -S -T 300 bench tps = 29335.875139 (including ...
pgbench -c 32 -j 4 -S -T 300 bench tps = 59605.433837 (including ...
pgbench -c 64 -j 8 -S -T 300 bench tps = 108884.294519 (including ...
pgbench -c 96 -j 12 -S -T 300 bench tps = 110387.439978 (including ...
pgbench -c 128 -j 16 -S -T 300 bench tps = 109046.121191 (including ...
With unlocked test in s_lock.c delay loop only (patch 2 below):
pgbench -c 1 -j 1 -S -T 300 bench tps = 5426.491088 (including ...
pgbench -c 2 -j 1 -S -T 300 bench tps = 8787.939425 (including ...
pgbench -c 8 -j 1 -S -T 300 bench tps = 15720.801359 (including ...
pgbench -c 16 -j 2 -S -T 300 bench tps = 33711.102718 (including ...
pgbench -c 32 -j 4 -S -T 300 bench tps = 61829.180234 (including ...
pgbench -c 64 -j 8 -S -T 300 bench tps = 109781.655020 (including ...
pgbench -c 96 -j 12 -S -T 300 bench tps = 107132.848280 (including ...
pgbench -c 128 -j 16 -S -T 300 bench tps = 106533.630986 (including ...
run 2:
pgbench -c 1 -j 1 -S -T 300 bench tps = 5705.283316 (including ...
pgbench -c 2 -j 1 -S -T 300 bench tps = 8442.798662 (including ...
pgbench -c 8 -j 1 -S -T 300 bench tps = 14423.723837 (including ...
pgbench -c 16 -j 2 -S -T 300 bench tps = 29112.751995 (including ...
pgbench -c 32 -j 4 -S -T 300 bench tps = 62258.984033 (including ...
pgbench -c 64 -j 8 -S -T 300 bench tps = 107741.988800 (including ...
pgbench -c 96 -j 12 -S -T 300 bench tps = 107138.968981 (including ...
pgbench -c 128 -j 16 -S -T 300 bench tps = 106110.215138 (including ...
So this pretty well confirms Robert's results, in particular that all of
the win from an unlocked test comes from using it in the delay loop.
Given the lack of evidence that a general change in TAS() is beneficial,
I'm inclined to vote against it, on the grounds that the extra test is
surely a loss at some level when there is not contention.
(IOW, +1 for inventing a second macro to use in the delay loop only.)
We ought to do similar tests on other architectures. I found some
lots-o-processors x86_64 machines at Red Hat, but they don't seem to
own any PPC systems with more than 8 processors. Anybody have big
iron with other non-Intel chips?
regards, tom lane
Patch 1: change TAS globally, non-HPUX code:
*** src/include/storage/s_lock.h.orig Sat Jan 1 13:27:24 2011
--- src/include/storage/s_lock.h Sun Aug 28 13:32:47 2011
***************
*** 228,233 ****
--- 228,240 ----
{
long int ret;
+ /*
+ * Use a non-locking test before the locking instruction proper. This
+ * appears to be a very significant win on many-core IA64.
+ */
+ if (*lock)
+ return 1;
+
__asm__ __volatile__(
" xchg4 %0=%1,%2 \n"
: "=r"(ret), "+m"(*lock)
***************
*** 243,248 ****
--- 250,262 ----
{
int ret;
+ /*
+ * Use a non-locking test before the locking instruction proper. This
+ * appears to be a very significant win on many-core IA64.
+ */
+ if (*lock)
+ return 1;
+
ret = _InterlockedExchange(lock,1); /* this is a xchg asm macro */
return ret;
Patch 2: change s_lock only (same as Robert's quick hack):
*** src/backend/storage/lmgr/s_lock.c.orig Sat Jan 1 13:27:09 2011
--- src/backend/storage/lmgr/s_lock.c Sun Aug 28 14:02:29 2011
***************
*** 96,102 ****
int delays = 0;
int cur_delay = 0;
! while (TAS(lock))
{
/* CPU-specific delay each time through the loop */
SPIN_DELAY();
--- 96,102 ----
int delays = 0;
int cur_delay = 0;
! while (*lock ? 1 : TAS(lock))
{
/* CPU-specific delay each time through the loop */
SPIN_DELAY();
On Sun, Aug 28, 2011 at 7:19 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
So this pretty well confirms Robert's results, in particular that all of
the win from an unlocked test comes from using it in the delay loop.
Given the lack of evidence that a general change in TAS() is beneficial,
I'm inclined to vote against it, on the grounds that the extra test is
surely a loss at some level when there is not contention.
(IOW, +1 for inventing a second macro to use in the delay loop only.)
Beautiful. Got a naming preference for that second macro? I
suggested TAS_SPIN() because it's what you use when you spin, as
opposed to what you use in the uncontended case, but I'm not attached
to that.
We ought to do similar tests on other architectures. I found some
lots-o-processors x86_64 machines at Red Hat, but they don't seem to
own any PPC systems with more than 8 processors. Anybody have big
iron with other non-Intel chips?
Aside from PPC, it would probably be worth testing SPARC and ARM if we
can find machines. Anything else is probably too old or too marginal
to get excited about. AFAIK these effects don't manifest with <32
cores, so I suspect that a lot of what's in s_lock.h is irrelevant
just because many of those architectures are too old to exist in
32-core versions.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
On Sun, Aug 28, 2011 at 7:19 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
(IOW, +1 for inventing a second macro to use in the delay loop only.)
Beautiful. Got a naming preference for that second macro? I
suggested TAS_SPIN() because it's what you use when you spin, as
opposed to what you use in the uncontended case, but I'm not attached
to that.
I had been thinking TAS_CONTENDED, but on reflection there's not a
strong argument for that over TAS_SPIN. Do what you will.
regards, tom lane
2011/8/28 pasman pasmański <pasman.p@gmail.com>:
Pity that this patch works only on HP-UX :(.
Well, not really. x86 is already well-behaved. On a 32-core x86 box
running Linux, performance seems to plateau, and then fall
off gradually. But on ia64, performance just collapses after about 24
cores. The fact that we don't have that problem everywhere is a good
thing, not a bad thing...
But I have an idea: maybe when the executor stops at a locked row, it
should process the next row instead of waiting. Of course, only if the
query contains no "order by" or windowing functions.
That wouldn't really help, first of all because you'd then have to
remember to go back to that row (and chances are it would still be
contended then), and second because these aren't row-level locks
anyway. They're locks on various global data structures, such as the
ProcArray.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sun, Aug 28, 2011 at 8:00 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
On Sun, Aug 28, 2011 at 7:19 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
(IOW, +1 for inventing a second macro to use in the delay loop only.)
Beautiful. Got a naming preference for that second macro? I
suggested TAS_SPIN() because it's what you use when you spin, as
opposed to what you use in the uncontended case, but I'm not attached
to that.
I had been thinking TAS_CONTENDED, but on reflection there's not a
strong argument for that over TAS_SPIN. Do what you will.
OK, done. I think while we're tidying up here we ought to do
something about this comment:
* ANOTHER CAUTION: be sure that TAS(), TAS_SPIN(), and S_UNLOCK() represent
* sequence points, ie, loads and stores of other values must not be moved
* across a lock or unlock. In most cases it suffices to make the operation
* be done through a "volatile" pointer.
IIUC, this is basically total nonsense. volatile only constrains the
optimizer, not the CPU; and only with regard to the ordering of one
volatile access vs. another, NOT the ordering of volatile accesses
with any non-volatile accesses that may be floating around. So its
practical utility for synchronization purposes seems to be nearly
zero. I think what this should say is that we expect these operations
to act as a full memory fence. Note that in some other worlds (e.g.
Linux) a spinlock acquisition or release is only required to act as a
half-fence; that is, other loads and stores are allowed to bleed into
the critical section but not out. However, I think we have been
assuming full-fence behavior. In either case, claiming that the use
of a volatile pointer is all that's needed here strikes me as pretty
misleading.
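The half-fence behavior mentioned above maps directly onto C11 acquire/release ordering; a sketch (not PostgreSQL code) of a Linux-style spinlock:

```c
#include <stdatomic.h>

typedef atomic_int slock_t;

/* Half fence on acquire: loads and stores after the exchange may not
 * move above it, but earlier accesses may sink into the critical
 * section.  Returns the old value, so 0 means "acquired". */
int
tas_acquire(slock_t *lock)
{
    return atomic_exchange_explicit(lock, 1, memory_order_acquire);
}

/* Half fence on release: accesses before the store may not move below
 * it, so the critical section's work is visible before the lock frees. */
void
s_unlock_release(slock_t *lock)
{
    atomic_store_explicit(lock, 0, memory_order_release);
}
```

Full-fence behavior, as the thread assumes PostgreSQL requires, would use memory_order_seq_cst instead on both sides.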
In the department of loose ends, there are a bunch of other things
that maybe need cleanup here: (1) the gcc/intel compiler cases on
ia64, (2) PA-RISC, (3) ARM, (4) PowerPC... and we should also perhaps
reconsider the 32-bit x86 case. Right now TAS() says:
/*
* Use a non-locking test before asserting the bus lock. Note that the
* extra test appears to be a small loss on some x86 platforms and a small
* win on others; it's by no means clear that we should keep it.
*/
I can't get too excited about spending a lot of effort optimizing
32-bit PostgreSQL on any architecture at this point, but if someone
else is, we might want to check whether it makes more sense to do the
non-locking test only in TAS_SPIN().
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
OK, done. I think while we're tidying up here we ought to do
something about this comment:
* ANOTHER CAUTION: be sure that TAS(), TAS_SPIN(), and S_UNLOCK() represent
* sequence points, ie, loads and stores of other values must not be moved
* across a lock or unlock. In most cases it suffices to make the operation
* be done through a "volatile" pointer.
IIUC, this is basically total nonsense.
It could maybe be rewritten for more clarity, but it's far from being
nonsense. The responsibility for having an actual hardware memory fence
instruction lies with the author of the TAS macro. But the
responsibility for keeping the compiler from reordering stuff around the
TAS call is that of the *user* of the TAS macro (or spinlock), and in
most cases the way to do that is to make sure that both the spinlock and
the shared data structure are referenced through volatile pointers.
This isn't academic; we've seen bugs from failure to do that. (BTW,
the reason for not being equivalently tense about LWLock-protected
structures is that the compiler generally won't try to move operations
around an out-of-line function call. It's the fact that spinlocks are
inline-able that creates the risk here.)
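The caller-side discipline being described — volatile-qualified access to both the spinlock and the protected data — might look like this toy single-threaded sketch (struct and function names invented for illustration; the "lock" here is a plain flag, not a real TAS, and nothing fences the hardware):

```c
typedef volatile int slock_t;

typedef struct
{
    int counter;
} SharedCounter;

slock_t       demo_lock = 0;
SharedCounter demo_shared = {0};

void
bump_counter(void)
{
    /* Both the lock and the data are touched only through volatile
     * lvalues, so the compiler may not hoist the counter store above
     * the lock acquisition or sink it below the release. */
    volatile SharedCounter *vs = &demo_shared;

    while (demo_lock)           /* toy spin; a real port uses TAS() */
        ;
    demo_lock = 1;
    vs->counter++;
    demo_lock = 0;
}
```

Note that volatile constrains only the compiler, which is exactly the division of labor described: the hardware fence is the TAS author's job, the volatile discipline is the caller's.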
In the department of loose ends, there are a bunch of other things
that maybe need cleanup here: (1) the gcc/intel compiler cases on
ia64, (2) PA-RISC, (3) ARM, (4) PowerPC... and we should also perhaps
reconsider the 32-bit x86 case.
The results I got yesterday seem sufficient basis to change the
gcc/intel cases for IA64, so I will go do that if you didn't already.
I am also currently running tests on x86_64 and PPC using Red Hat test
machines --- expect results later today. Red Hat doesn't seem to own
any many-CPU machines that are 32-bit-only, and I rather wonder if there
are any. It might be that it only makes sense to optimize the x86 paths
for a few CPUs, in which case this test methodology may not be very
helpful.
regards, tom lane
On Mon, Aug 29, 2011 at 11:07 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
OK, done. I think while we're tidying up here we ought to do
something about this comment:
* ANOTHER CAUTION: be sure that TAS(), TAS_SPIN(), and S_UNLOCK() represent
* sequence points, ie, loads and stores of other values must not be moved
* across a lock or unlock. In most cases it suffices to make the operation
* be done through a "volatile" pointer.
IIUC, this is basically total nonsense.
It could maybe be rewritten for more clarity, but it's far from being
nonsense. The responsibility for having an actual hardware memory fence
instruction lies with the author of the TAS macro.
Right... but the comment implies that you probably don't need one, and
doesn't even mention that you MIGHT need one.
In the department of loose ends, there are a bunch of other things
that maybe need cleanup here: (1) the gcc/intel compiler cases on
ia64, (2) PA-RISC, (3) ARM, (4) PowerPC... and we should also perhaps
reconsider the 32-bit x86 case.
The results I got yesterday seem sufficient basis to change the
gcc/intel cases for IA64, so I will go do that if you didn't already.
I did not; please go ahead. I wasn't relishing the idea of trying to
figure out how to install gcc to test that case.
I am also currently running tests on x86_64 and PPC using Red Hat test
machines --- expect results later today. Red Hat doesn't seem to own
any many-CPU machines that are 32-bit-only, and I rather wonder if there
are any. It might be that it only makes sense to optimize the x86 paths
for a few CPUs, in which case this test methodology may not be very
helpful.
FWIW, I tried spinning unlocked on x86_64 at 32 cores and got a
regression. Don't have any PPC gear at present.
I think optimizing spinlocks for machines with only a few CPUs is
probably pointless. Based on what I've seen so far, spinlock
contention even at 16 CPUs is negligible pretty much no matter what
you do. Whether your implementation is fast or slow isn't going to
matter, because even an inefficient implementation will account for
only a negligible percentage of the total CPU time - much less than 1%
- as opposed to a 64-core machine, where it's not that hard to find
cases where spin-waits consume the *majority* of available CPU time
(recall previous discussion of lseek). So I'm disinclined to spend
the time it would take to tinker with the 32-bit code, because it will
not matter; for that platform we're better off spending our time
installing a hash table in ScanKeywordLookup(). And there's always a
possibility that AMD and Intel chips could be different, or there
could even be differences between different chip generations from the
same manufacturer, so all in all it seems like a pretty unrewarding
exercise.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
On Mon, Aug 29, 2011 at 11:07 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
IIUC, this is basically total nonsense.
It could maybe be rewritten for more clarity, but it's far from being
nonsense. The responsibility for having an actual hardware memory fence
instruction lies with the author of the TAS macro.
Right... but the comment implies that you probably don't need one, and
doesn't even mention that you MIGHT need one.
I think maybe we need to split it into two paragraphs, one addressing
the TAS author and the other for the TAS user. I'll have a go at that.
I think optimizing spinlocks for machines with only a few CPUs is
probably pointless. Based on what I've seen so far, spinlock
contention even at 16 CPUs is negligible pretty much no matter what
you do.
We did find significant differences several years ago, testing on
machines that probably had no more than four cores; that's where the
existing comments in s_lock.h came from. Whether those tests are
still relevant for today's source code is not obvious though.
regards, tom lane
On Mon, Aug 29, 2011 at 11:42 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
On Mon, Aug 29, 2011 at 11:07 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
IIUC, this is basically total nonsense.
It could maybe be rewritten for more clarity, but it's far from being
nonsense. The responsibility for having an actual hardware memory fence
instruction lies with the author of the TAS macro.
Right... but the comment implies that you probably don't need one, and
doesn't even mention that you MIGHT need one.
I think maybe we need to split it into two paragraphs, one addressing
the TAS author and the other for the TAS user. I'll have a go at that.
OK.
I think optimizing spinlocks for machines with only a few CPUs is
probably pointless. Based on what I've seen so far, spinlock
contention even at 16 CPUs is negligible pretty much no matter what
you do.
We did find significant differences several years ago, testing on
machines that probably had no more than four cores; that's where the
existing comments in s_lock.h came from. Whether those tests are
still relevant for today's source code is not obvious though.
Hmm, OK. I guess if you want to put energy into it, I'm not going to
complain too much... just not sure it's the best use of time.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Aug 29, 2011 at 4:07 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
* ANOTHER CAUTION: be sure that TAS(), TAS_SPIN(), and S_UNLOCK() represent
* sequence points, ie, loads and stores of other values must not be moved
* across a lock or unlock. In most cases it suffices to make the operation
* be done through a "volatile" pointer.
IIUC, this is basically total nonsense.
It could maybe be rewritten for more clarity, but it's far from being
nonsense.
The confusion for me is that it's talking about sequence points and
volatile pointers in the same breath as if one implies the other.
Making something a volatile pointer does not create a sequence point.
It requires that the compiler not move the access or store across any
sequence points that are already there.
It might be helpful to include the actual bug that the comment is
trying to warn against because iirc it was a real case that caused you
to add the volatile modifiers.
--
greg
Greg Stark <stark@mit.edu> writes:
The confusion for me is that it's talking about sequence points and
volatile pointers in the same breath as if one implies the other.
Making something a volatile pointer dose not create a sequence point.
It requires that the compiler not move the access or store across any
sequence points that are already there.
Well, no, I don't think that's the useful way to think about it. Modern
compilers seem to only honor sequence points in terms of the semantics
seen by the executing program; they don't believe that they can't
reorder loads/stores freely. And as long as the memory in question is
only used by the given program, they're right. But for memory locations
shared with other threads of execution, you have to be careful about the
order of accesses to those locations. My understanding of "volatile"
is that the compiler is forbidden from altering the order of
volatile-qualified loads and stores *relative to each other*, or from
optimizing away a load or store that seems redundant in the context of
the given program. That's got nothing to do with sequence points per se.
It might be helpful to include the actual bug that the comment is
trying to warn against because iirc it was a real case that caused you
to add the volatile modifiers.
Well, if there were one and only one bug involved here, it wouldn't be
such a critical problem ...
regards, tom lane
On Mon, Aug 29, 2011 at 12:00 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Greg Stark <stark@mit.edu> writes:
The confusion for me is that it's talking about sequence points and
volatile pointers in the same breath as if one implies the other.
Making something a volatile pointer dose not create a sequence point.
It requires that the compiler not move the access or store across any
sequence points that are already there.
Well, no, I don't think that's the useful way to think about it. Modern
compilers seem to only honor sequence points in terms of the semantics
seen by the executing program; they don't believe that they can't
reorder loads/stores freely.
This discussion seems to miss the fact that there are two levels of
reordering that can happen. First, the compiler can move things
around. Second, the CPU can move things around. Even on x86, for
example, a sequence like this can be reordered:
LOAD A
STORE A
LOAD B
Even though the compiler may emit those instructions in exactly that
order, an x86 CPU can, IIUC, decide to load B before it finishes
storing A, so that the actual apparent execution order as seen by
other CPUs will be either the above, or the above with the last two
instructions reversed. On a weakly ordered CPU, the load of B could
be moved back even further, before the LOAD of A.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
This discussion seems to miss the fact that there are two levels of
reordering that can happen. First, the compiler can move things
around. Second, the CPU can move things around.
Right, I think that's exactly the problem with the previous wording of
that comment; it doesn't address the two logical levels involved.
I've rewritten it, see what you think.
* Another caution for users of these macros is that it is the caller's
* responsibility to ensure that the compiler doesn't re-order accesses
* to shared memory to precede the actual lock acquisition, or follow the
* lock release. Typically we handle this by using volatile-qualified
* pointers to refer to both the spinlock itself and the shared data
* structure being accessed within the spinlocked critical section.
* That fixes it because compilers are not allowed to re-order accesses
* to volatile objects relative to other such accesses.
*
* On platforms with weak memory ordering, the TAS(), TAS_SPIN(), and
* S_UNLOCK() macros must further include hardware-level memory fence
* instructions to prevent similar re-ordering at the hardware level.
* TAS() and TAS_SPIN() must guarantee that loads and stores issued after
* the macro are not executed until the lock has been obtained. Conversely,
* S_UNLOCK() must guarantee that loads and stores issued before the macro
* have been executed before the lock is released.
regards, tom lane
On Mon, Aug 29, 2011 at 5:53 PM, Robert Haas <robertmhaas@gmail.com> wrote:
Even though the compiler may emit those instructions in exactly that
order, an x86 CPU can, IIUC, decide to load B before it finishes
storing A, so that the actual apparent execution order as seen by
other CPUs will be either the above, or the above with the last two
instructions reversed. On a weakly ordered CPU, the load of B could
be moved back even further, before the LOAD of A.
My understanding of what the comment meant is that there is already a
full memory barrier as far as the CPU is concerned due to the TAS or
whatever, but it's important that there also be a sequence point there
so that the volatile memory access isn't reordered by the compiler to
occur before the memory barrier.
I was going to say the same thing as Tom that sequence points and
volatile pointers have nothing at all to do with each other. However
my brief searching online actually seemed to indicate that in fact the
compiler isn't supposed to reorder volatile memory accesses across
sequence points. That seemed to make sense since I couldn't think of
any other way to rigorously describe the constraints the compiler
should operate under.
--
greg
On Mon, Aug 29, 2011 at 1:24 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
This discussion seems to miss the fact that there are two levels of
reordering that can happen. First, the compiler can move things
around. Second, the CPU can move things around.
Right, I think that's exactly the problem with the previous wording of
that comment; it doesn't address the two logical levels involved.
I've rewritten it, see what you think.
* Another caution for users of these macros is that it is the caller's
* responsibility to ensure that the compiler doesn't re-order accesses
* to shared memory to precede the actual lock acquisition, or follow the
* lock release. Typically we handle this by using volatile-qualified
* pointers to refer to both the spinlock itself and the shared data
* structure being accessed within the spinlocked critical section.
* That fixes it because compilers are not allowed to re-order accesses
* to volatile objects relative to other such accesses.
*
* On platforms with weak memory ordering, the TAS(), TAS_SPIN(), and
* S_UNLOCK() macros must further include hardware-level memory fence
* instructions to prevent similar re-ordering at the hardware level.
* TAS() and TAS_SPIN() must guarantee that loads and stores issued after
* the macro are not executed until the lock has been obtained. Conversely,
* S_UNLOCK() must guarantee that loads and stores issued before the macro
* have been executed before the lock is released.
That's definitely an improvement.
I'm actually not convinced that we're entirely consistent here about
what we require the semantics of acquiring and releasing a spinlock to
be. For example, on x86 and x86_64, we acquire the lock using xchgb,
which acts a full memory barrier. But when we release the lock, we
just zero out the memory address, which is NOT a full memory barrier.
Stores can't cross it, but non-dependent loads of different locations
can back up over it. That's pretty close to a full barrier, but it
isn't, quite. Now, I don't see why that should really cause any
problem, at least for common cases like LWLockAcquire(). If the CPU
prefetches the data protected by the lwlock after we know we've got
the lock before we've actually released the spinlock and returned from
LWLockAcquire(), that should be fine, even good (for performance).
The real problem with being squiffy here is that it's not clear how
weak we can make the fence instruction on weakly ordered architectures
that support multiple types. Right now we're pretty conservative, but
I think that may be costing us. I might be wrong; more research is
needed here; but I think that we should at least start to get our heads
around what semantics we actually need.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company