spinlocks on HP-UX

Started by Robert Haas over 14 years ago, 57 messages
#1Robert Haas
robertmhaas@gmail.com

I was able to obtain access to a 32-core HP-UX server. I repeated the
pgbench -S testing that I have previously done on Linux, and found
that the results were not too good. Here are the results at scale
factor 100, on 9.2devel, with various numbers of clients. Five minute
runs, shared_buffers=8GB.

1:tps = 5590.070816 (including connections establishing)
8:tps = 37660.233932 (including connections establishing)
16:tps = 67366.099286 (including connections establishing)
32:tps = 82781.624665 (including connections establishing)
48:tps = 18589.995074 (including connections establishing)
64:tps = 16424.661371 (including connections establishing)

And just for comparison, here are the numbers at scale factor 1000:

1:tps = 4751.768608 (including connections establishing)
8:tps = 33621.474490 (including connections establishing)
16:tps = 58959.043171 (including connections establishing)
32:tps = 78801.265189 (including connections establishing)
48:tps = 21635.234969 (including connections establishing)
64:tps = 18611.863567 (including connections establishing)

After mulling over the vmstat output for a bit, I began to suspect
spinlock contention. I took a look at a document called "Implementing
Spinlocks on the Intel Itanium Architecture and PA-RISC", by Tor
Ekqvist and David Graves and available via the HP web site, which
states that when spinning on a spinlock on these machines, you should
use a regular, unlocked test first and use the atomic test only when
the unlocked test looks OK. I tried implementing this in two ways,
and both produced results which are FAR superior to our current
implementation. First, I did this:

--- a/src/include/storage/s_lock.h
+++ b/src/include/storage/s_lock.h
@@ -726,7 +726,7 @@ tas(volatile slock_t *lock)
 typedef unsigned int slock_t;
 #include <ia64/sys/inline.h>
-#define TAS(lock) _Asm_xchg(_SZ_W, lock, 1, _LDHINT_NONE)
+#define TAS(lock) (*(lock) ? 1 : _Asm_xchg(_SZ_W, lock, 1, _LDHINT_NONE))

#endif /* HPUX on IA64, non gcc */

That resulted in these numbers. Scale factor 100:

1:tps = 5569.911714 (including connections establishing)
8:tps = 37365.364468 (including connections establishing)
16:tps = 63596.261875 (including connections establishing)
32:tps = 95948.157678 (including connections establishing)
48:tps = 90708.253920 (including connections establishing)
64:tps = 100109.065744 (including connections establishing)

Scale factor 1000:

1:tps = 4878.332996 (including connections establishing)
8:tps = 33245.469907 (including connections establishing)
16:tps = 56708.424880 (including connections establishing)
48:tps = 69652.232635 (including connections establishing)
64:tps = 70593.208637 (including connections establishing)
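The first patch above is the classic test-and-test-and-set pattern: read the lock with a plain load, and pay for the cache-line-invalidating atomic exchange only when the lock looks free. A minimal sketch in portable C11 atomics (illustrative only; the real s_lock.h uses per-platform assembly, and `slock_t`/`tas_unlocked_first` here are stand-in names):

```c
#include <stdatomic.h>

typedef atomic_int slock_t;

/*
 * Test-and-test-and-set: a relaxed plain load first, then the atomic
 * exchange only when the lock appears free.  Returns 0 if we acquired
 * the lock, nonzero otherwise (matching TAS() conventions).
 */
static int
tas_unlocked_first(volatile slock_t *lock)
{
	if (atomic_load_explicit(lock, memory_order_relaxed))
		return 1;				/* looks held: fail without locking the bus */
	return atomic_exchange(lock, 1);	/* 0 = acquired, 1 = lost the race */
}
```

The win comes from spinning waiters reading a shared cache line instead of repeatedly forcing exclusive ownership of it with the atomic instruction.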

Then, I did this:

--- a/src/backend/storage/lmgr/s_lock.c
+++ b/src/backend/storage/lmgr/s_lock.c
@@ -96,7 +96,7 @@ s_lock(volatile slock_t *lock, const char *file, int line)
        int                     delays = 0;
        int                     cur_delay = 0;

-	while (TAS(lock))
+	while (*lock ? 1 : TAS(lock))
 	{
 		/* CPU-specific delay each time through the loop */
 		SPIN_DELAY();

That resulted in these numbers, at scale factor 100:

1:tps = 5564.059494 (including connections establishing)
8:tps = 37487.090798 (including connections establishing)
16:tps = 66061.524760 (including connections establishing)
32:tps = 96535.523905 (including connections establishing)
48:tps = 92031.618360 (including connections establishing)
64:tps = 106813.631701 (including connections establishing)

And at scale factor 1000:

1:tps = 4980.338246 (including connections establishing)
8:tps = 33576.680072 (including connections establishing)
16:tps = 55618.677975 (including connections establishing)
32:tps = 73589.442746 (including connections establishing)
48:tps = 70987.026228 (including connections establishing)

Not sure why I am missing the 64-client results for that last set of
tests, but no matter.

Of course, we can't apply the second patch as it stands, because I
tested it on x86 and it loses. But it seems pretty clear we need to
do it at least for this architecture...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#2pasman pasmański
pasman.p@gmail.com
In reply to: Robert Haas (#1)
Re: spinlocks on HP-UX

Pity that this patch works only on HP-UX :(.

But I have an idea: maybe when the executor stops at a locked row, it
should process the next row instead of waiting.

Of course, only if the query doesn't contain "order by" or windowing functions.

--
------------
pasman

#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#1)
Re: spinlocks on HP-UX

Robert Haas <robertmhaas@gmail.com> writes:

First, I did this:

-#define TAS(lock) _Asm_xchg(_SZ_W, lock, 1, _LDHINT_NONE)
+#define TAS(lock) (*(lock) ? 1 : _Asm_xchg(_SZ_W, lock, 1, _LDHINT_NONE))

Seems reasonable, and similar to x86 logic.

Then, I did this:

-       while (TAS(lock))
+       while (*lock ? 1 : TAS(lock))

Er, what? That sure looks like a manual application of what you'd
already done in the TAS macro.

Of course, we can't apply the second patch as it stands, because I
tested it on x86 and it loses. But it seems pretty clear we need to
do it at least for this architecture...

Please clarify: when you say "this architecture", are you talking about
IA64 or PA-RISC? Is there any reason to think that this is specific to
HP-UX rather than any other system on the same architecture? (I'm sure
I can get access to some IA64 clusters at Red Hat, though maybe not
64-core ones.)

I don't have an objection to the TAS macro change, but I do object to
fooling with the hardware-independent code in s_lock.c ... especially
when the additional improvement seems barely above the noise threshold.
You ought to be able to do whatever you need inside the TAS macro.

regards, tom lane

#4Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#3)
Re: spinlocks on HP-UX

On Sun, Aug 28, 2011 at 11:35 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

First, I did this:

-#define TAS(lock) _Asm_xchg(_SZ_W, lock, 1, _LDHINT_NONE)
+#define TAS(lock) (*(lock) ? 1 : _Asm_xchg(_SZ_W, lock, 1, _LDHINT_NONE))

Seems reasonable, and similar to x86 logic.

Then, I did this:

-       while (TAS(lock))
+       while (*lock ? 1 : TAS(lock))

Er, what?  That sure looks like a manual application of what you'd
already done in the TAS macro.

Sorry, I blew through that a little too blithely. If you change TAS()
itself, then even the very first attempt to acquire the lock will try
the unlocked instruction first, whereas changing s_lock() allows you
to do something different in the contended case than you do in the
uncontended case. We COULD just change the TAS() macro since, in this
case, it seems to make only a minor difference, but what I was
thinking is that we could change s_lock.h to define two macros, TAS()
and TAS_SPIN(). If a particular architecture defines TAS() but not
TAS_SPIN(), then we define TAS_SPIN(x) to be TAS(x). Then, S_LOCK()
can stay as-is - calling TAS() - but s_lock() can call TAS_SPIN(),
which will normally be the same as TAS() but can be made different on
any architecture where the retry loop should do something different
than the initial attempt.
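A minimal sketch of that fallback arrangement (illustrative only; `my_tas` stands in for a port's real test-and-set primitive, and this is not the committed s_lock.h text):

```c
/* Hypothetical stand-in for a port's test-and-set primitive */
static int
my_tas(volatile int *lock)
{
	return __sync_lock_test_and_set(lock, 1);	/* returns prior value */
}

/* Each port defines TAS() as usual */
#define TAS(lock)		my_tas(lock)

/* Ports that don't define TAS_SPIN() get plain TAS() in the retry loop */
#ifndef TAS_SPIN
#define TAS_SPIN(lock)	TAS(lock)
#endif
```

S_LOCK() keeps calling TAS() for the uncontended first attempt; only the retry loop in s_lock() switches to TAS_SPIN().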

Please clarify: when you say "this architecture", are you talking about
IA64 or PA-RISC?  Is there any reason to think that this is specific to
HP-UX rather than any other system on the same architecture?  (I'm sure
I can get access to some IA64 clusters at Red Hat, though maybe not
64-core ones.)

I tested on IA64; I don't currently have access to a PA-RISC box. The
documentation I'm looking at implies that the same approach would be
desirable there, but that's just an unsubstantiated rumor at this
point....

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#4)
Re: spinlocks on HP-UX

Robert Haas <robertmhaas@gmail.com> writes:

On Sun, Aug 28, 2011 at 11:35 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

Then, I did this:

-       while (TAS(lock))
+       while (*lock ? 1 : TAS(lock))

Er, what?  That sure looks like a manual application of what you'd
already done in the TAS macro.

Sorry, I blew through that a little too blithely. If you change TAS()
itself, then even the very first attempt to acquire the lock will try
the unlocked instruction first, whereas changing s_lock() allows you
to do something different in the contended case than you do in the
uncontended case.

Yeah, I figured out that was probably what you meant a little while
later. I found a 64-CPU IA64 machine in Red Hat's test labs and am
currently trying to replicate your results; report to follow.

We COULD just change the TAS() macro since, in this
case, it seems to make only a minor difference, but what I was
thinking is that we could change s_lock.h to define two macros, TAS()
and TAS_SPIN().

Yeah, I was thinking along the same lines, though perhaps the name of
the new macro could use a little bikeshedding.

The comments in s_lock.h note that the unlocked test in x86 TAS is of
uncertain usefulness. It seems entirely possible to me that we ought
to use a similar design on x86, ie, use the unlocked test only once
we've entered the delay loop.

Please clarify: when you say "this architecture", are you talking about
IA64 or PA-RISC?  Is there any reason to think that this is specific to
HP-UX rather than any other system on the same architecture?  (I'm sure
I can get access to some IA64 clusters at Red Hat, though maybe not
64-core ones.)

I tested on IA64; I don't currently have access to a PA-RISC box. The
documentation I'm looking at implies that the same approach would be
desirable there, but that's just an unsubstantiated rumor at this
point....

Well, I've got a PA-RISC box, but it's only a single processor so it's
not gonna prove much. Anybody?

regards, tom lane

#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#5)
Re: spinlocks on HP-UX

I wrote:

Yeah, I figured out that was probably what you meant a little while
later. I found a 64-CPU IA64 machine in Red Hat's test labs and am
currently trying to replicate your results; report to follow.

OK, these results are on a 64-processor SGI IA64 machine (AFAICT, 64
independent sockets, no hyperthreading or any funny business); 124GB
in 32 NUMA nodes; running RHEL5.7, gcc 4.1.2. I built today's git
head with --enable-debug (but not --enable-cassert) and ran with all
default configuration settings except shared_buffers = 8GB and
max_connections = 200. The test database is initialized at -s 100.
I did not change the database between runs, but restarted the postmaster
and then did this to warm the caches a tad:

pgbench -c 1 -j 1 -S -T 30 bench

Per-run pgbench parameters are as shown below --- note in particular
that I assigned one pgbench thread per 8 backends.

The numbers are fairly variable even with 5-minute runs; I did each
series twice so you could get a feeling for how much.

Today's git head:

pgbench -c 1 -j 1 -S -T 300 bench tps = 5835.213934 (including ...
pgbench -c 2 -j 1 -S -T 300 bench tps = 8499.223161 (including ...
pgbench -c 8 -j 1 -S -T 300 bench tps = 15197.126952 (including ...
pgbench -c 16 -j 2 -S -T 300 bench tps = 30803.255561 (including ...
pgbench -c 32 -j 4 -S -T 300 bench tps = 65795.356797 (including ...
pgbench -c 64 -j 8 -S -T 300 bench tps = 81644.914241 (including ...
pgbench -c 96 -j 12 -S -T 300 bench tps = 40059.202836 (including ...
pgbench -c 128 -j 16 -S -T 300 bench tps = 21309.615001 (including ...

run 2:

pgbench -c 1 -j 1 -S -T 300 bench tps = 5787.310115 (including ...
pgbench -c 2 -j 1 -S -T 300 bench tps = 8747.104236 (including ...
pgbench -c 8 -j 1 -S -T 300 bench tps = 14655.369995 (including ...
pgbench -c 16 -j 2 -S -T 300 bench tps = 28287.254924 (including ...
pgbench -c 32 -j 4 -S -T 300 bench tps = 61614.715187 (including ...
pgbench -c 64 -j 8 -S -T 300 bench tps = 79754.640518 (including ...
pgbench -c 96 -j 12 -S -T 300 bench tps = 40334.994324 (including ...
pgbench -c 128 -j 16 -S -T 300 bench tps = 23285.271257 (including ...

With modified TAS macro (see patch 1 below):

pgbench -c 1 -j 1 -S -T 300 bench tps = 6171.454468 (including ...
pgbench -c 2 -j 1 -S -T 300 bench tps = 8709.003728 (including ...
pgbench -c 8 -j 1 -S -T 300 bench tps = 14902.731035 (including ...
pgbench -c 16 -j 2 -S -T 300 bench tps = 29789.744482 (including ...
pgbench -c 32 -j 4 -S -T 300 bench tps = 59991.549128 (including ...
pgbench -c 64 -j 8 -S -T 300 bench tps = 117369.287466 (including ...
pgbench -c 96 -j 12 -S -T 300 bench tps = 112583.144495 (including ...
pgbench -c 128 -j 16 -S -T 300 bench tps = 110231.305282 (including ...

run 2:

pgbench -c 1 -j 1 -S -T 300 bench tps = 5670.097936 (including ...
pgbench -c 2 -j 1 -S -T 300 bench tps = 8230.786940 (including ...
pgbench -c 8 -j 1 -S -T 300 bench tps = 14785.952481 (including ...
pgbench -c 16 -j 2 -S -T 300 bench tps = 29335.875139 (including ...
pgbench -c 32 -j 4 -S -T 300 bench tps = 59605.433837 (including ...
pgbench -c 64 -j 8 -S -T 300 bench tps = 108884.294519 (including ...
pgbench -c 96 -j 12 -S -T 300 bench tps = 110387.439978 (including ...
pgbench -c 128 -j 16 -S -T 300 bench tps = 109046.121191 (including ...

With unlocked test in s_lock.c delay loop only (patch 2 below):

pgbench -c 1 -j 1 -S -T 300 bench tps = 5426.491088 (including ...
pgbench -c 2 -j 1 -S -T 300 bench tps = 8787.939425 (including ...
pgbench -c 8 -j 1 -S -T 300 bench tps = 15720.801359 (including ...
pgbench -c 16 -j 2 -S -T 300 bench tps = 33711.102718 (including ...
pgbench -c 32 -j 4 -S -T 300 bench tps = 61829.180234 (including ...
pgbench -c 64 -j 8 -S -T 300 bench tps = 109781.655020 (including ...
pgbench -c 96 -j 12 -S -T 300 bench tps = 107132.848280 (including ...
pgbench -c 128 -j 16 -S -T 300 bench tps = 106533.630986 (including ...

run 2:

pgbench -c 1 -j 1 -S -T 300 bench tps = 5705.283316 (including ...
pgbench -c 2 -j 1 -S -T 300 bench tps = 8442.798662 (including ...
pgbench -c 8 -j 1 -S -T 300 bench tps = 14423.723837 (including ...
pgbench -c 16 -j 2 -S -T 300 bench tps = 29112.751995 (including ...
pgbench -c 32 -j 4 -S -T 300 bench tps = 62258.984033 (including ...
pgbench -c 64 -j 8 -S -T 300 bench tps = 107741.988800 (including ...
pgbench -c 96 -j 12 -S -T 300 bench tps = 107138.968981 (including ...
pgbench -c 128 -j 16 -S -T 300 bench tps = 106110.215138 (including ...

So this pretty well confirms Robert's results, in particular that all of
the win from an unlocked test comes from using it in the delay loop.
Given the lack of evidence that a general change in TAS() is beneficial,
I'm inclined to vote against it, on the grounds that the extra test is
surely a loss at some level when there is not contention.
(IOW, +1 for inventing a second macro to use in the delay loop only.)

We ought to do similar tests on other architectures. I found some
lots-o-processors x86_64 machines at Red Hat, but they don't seem to
own any PPC systems with more than 8 processors. Anybody have big
iron with other non-Intel chips?

regards, tom lane

Patch 1: change TAS globally, non-HPUX code:

*** src/include/storage/s_lock.h.orig	Sat Jan  1 13:27:24 2011
--- src/include/storage/s_lock.h	Sun Aug 28 13:32:47 2011
***************
*** 228,233 ****
--- 228,240 ----
  {
  	long int	ret;
+ 	/*
+ 	 * Use a non-locking test before the locking instruction proper.  This
+ 	 * appears to be a very significant win on many-core IA64.
+ 	 */
+ 	if (*lock)
+ 		return 1;
+ 
  	__asm__ __volatile__(
  		"	xchg4 	%0=%1,%2	\n"
  :		"=r"(ret), "+m"(*lock)
***************
*** 243,248 ****
--- 250,262 ----
  {
  	int		ret;
+ 	/*
+ 	 * Use a non-locking test before the locking instruction proper.  This
+ 	 * appears to be a very significant win on many-core IA64.
+ 	 */
+ 	if (*lock)
+ 		return 1;
+ 
  	ret = _InterlockedExchange(lock,1);	/* this is a xchg asm macro */

  	return ret;

Patch 2: change s_lock only (same as Robert's quick hack):

*** src/backend/storage/lmgr/s_lock.c.orig	Sat Jan  1 13:27:09 2011
--- src/backend/storage/lmgr/s_lock.c	Sun Aug 28 14:02:29 2011
***************
*** 96,102 ****
  	int			delays = 0;
  	int			cur_delay = 0;
! 	while (TAS(lock))
  	{
  		/* CPU-specific delay each time through the loop */
  		SPIN_DELAY();
--- 96,102 ----
  	int			delays = 0;
  	int			cur_delay = 0;

! 	while (*lock ? 1 : TAS(lock))
  	{
  		/* CPU-specific delay each time through the loop */
  		SPIN_DELAY();

#7Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#6)
Re: spinlocks on HP-UX

On Sun, Aug 28, 2011 at 7:19 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

So this pretty well confirms Robert's results, in particular that all of
the win from an unlocked test comes from using it in the delay loop.
Given the lack of evidence that a general change in TAS() is beneficial,
I'm inclined to vote against it, on the grounds that the extra test is
surely a loss at some level when there is not contention.
(IOW, +1 for inventing a second macro to use in the delay loop only.)

Beautiful. Got a naming preference for that second macro? I
suggested TAS_SPIN() because it's what you use when you spin, as
opposed to what you use in the uncontended case, but I'm not attached
to that.
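As a sketch, the contended path would then look roughly like this (`my_tas` and `my_s_lock` are stand-in names, and this generic TAS_SPIN() definition is illustrative, not the committed code):

```c
static int
my_tas(volatile int *lock)
{
	return __sync_lock_test_and_set(lock, 1);	/* returns prior value */
}

#define TAS(lock)		my_tas(lock)
/* Contended-path variant: plain read first, atomic op only when free */
#define TAS_SPIN(lock)	(*(lock) ? 1 : TAS(lock))

/*
 * Simplified shape of s_lock()'s retry loop; the real function also
 * counts iterations and eventually sleeps rather than busy-waiting.
 */
static void
my_s_lock(volatile int *lock)
{
	while (TAS_SPIN(lock))
	{
		/* SPIN_DELAY() would go here */
	}
}
```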

We ought to do similar tests on other architectures.  I found some
lots-o-processors x86_64 machines at Red Hat, but they don't seem to
own any PPC systems with more than 8 processors.  Anybody have big
iron with other non-Intel chips?

Aside from PPC, it would probably be worth testing SPARC and ARM if we
can find machines. Anything else is probably too old or too marginal
to get excited about. AFAIK these effects don't manifest with <32
cores, so I suspect that a lot of what's in s_lock.h is irrelevant
just because many of those architectures are too old to exist in
32-core versions.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#8Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#7)
Re: spinlocks on HP-UX

Robert Haas <robertmhaas@gmail.com> writes:

On Sun, Aug 28, 2011 at 7:19 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

(IOW, +1 for inventing a second macro to use in the delay loop only.)

Beautiful. Got a naming preference for that second macro? I
suggested TAS_SPIN() because it's what you use when you spin, as
opposed to what you use in the uncontended case, but I'm not attached
to that.

I had been thinking TAS_CONTENDED, but on reflection there's not a
strong argument for that over TAS_SPIN. Do what you will.

regards, tom lane

#9Robert Haas
robertmhaas@gmail.com
In reply to: pasman pasmański (#2)
Re: spinlocks on HP-UX

2011/8/28 pasman pasmański <pasman.p@gmail.com>:

Pity that this patch works only on hpux :(.

Well, not really. x86 is already well-behaved. On a 32-core x86 box
running Linux, performance seems to plateau and then fall
off gradually. But on ia64, performance just collapses after about 24
cores. The fact that we don't have that problem everywhere is a good
thing, not a bad thing...

But i have an idea: maybe when executor stop at locked row, it should
process next row instead of wait.

Of course if query not contain "order by" or windowing functions.

That wouldn't really help, first of all because you'd then have to
remember to go back to that row (and chances are it would still be
contended then), and second because these aren't row-level locks
anyway. They're locks on various global data structures, such as the
ProcArray.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#10Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#8)
Re: spinlocks on HP-UX

On Sun, Aug 28, 2011 at 8:00 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

On Sun, Aug 28, 2011 at 7:19 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

(IOW, +1 for inventing a second macro to use in the delay loop only.)

Beautiful.  Got a naming preference for that second macro?  I
suggested TAS_SPIN() because it's what you use when you spin, as
opposed to what you use in the uncontended case, but I'm not attached
to that.

I had been thinking TAS_CONTENDED, but on reflection there's not a
strong argument for that over TAS_SPIN.  Do what you will.

OK, done. I think while we're tidying up here we ought to do
something about this comment:

 * ANOTHER CAUTION: be sure that TAS(), TAS_SPIN(), and S_UNLOCK() represent
 * sequence points, ie, loads and stores of other values must not be moved
 * across a lock or unlock.  In most cases it suffices to make the operation
 * be done through a "volatile" pointer.

IIUC, this is basically total nonsense. volatile only constrains the
optimizer, not the CPU; and only with regard to the ordering of one
volatile access vs. another, NOT the ordering of volatile accesses
with any non-volatile accesses that may be floating around. So its
practical utility for synchronization purposes seems to be nearly
zero. I think what this should say is that we expect these operations
to act as a full memory fence. Note that in some other worlds (e.g.
Linux) a spinlock acquisition or release is only required to act as a
half-fence; that is, other loads and stores are allowed to bleed into
the critical section but not out. However, I think we have been
assuming full-fence behavior. In either case, claiming that the use
of a volatile pointer is all that's needed here strikes me as pretty
misleading.
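Under that reading, a lock primitive might pair the atomic exchange with an explicit barrier. A hedged sketch using GCC builtins (the builtin names are GCC's; the full-fence placement is the assumption being discussed, not PostgreSQL's actual code):

```c
static int
fence_tas(volatile int *lock)
{
	int		ret = __sync_lock_test_and_set(lock, 1);	/* atomic xchg */

	__sync_synchronize();	/* full fence: nothing moves across the acquire */
	return ret;				/* 0 = acquired, 1 = already held */
}

static void
fence_unlock(volatile int *lock)
{
	__sync_synchronize();	/* no loads/stores may sink below the unlock */
	*lock = 0;
}
```

On x86 the xchg instruction is itself a full fence, so the extra barrier costs nothing there; on weaker architectures it is exactly the half-fence vs. full-fence question raised above.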

In the department of loose ends, there are a bunch of other things
that maybe need cleanup here: (1) the gcc/intel compiler cases on
ia64, (2) PA-RISC, (3) ARM, (4) PowerPC... and we should also perhaps
reconsider the 32-bit x86 case. Right now TAS() says:

	/*
	 * Use a non-locking test before asserting the bus lock.  Note that the
	 * extra test appears to be a small loss on some x86 platforms and a small
	 * win on others; it's by no means clear that we should keep it.
	 */

I can't get too excited about spending a lot of effort optimizing
32-bit PostgreSQL on any architecture at this point, but if someone
else is, we might want to check whether it makes more sense to do the
non-locking test only in TAS_SPIN().

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#11Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#10)
Re: spinlocks on HP-UX

Robert Haas <robertmhaas@gmail.com> writes:

OK, done. I think while we're tidying up here we ought to do
something about this comment:

 * ANOTHER CAUTION: be sure that TAS(), TAS_SPIN(), and S_UNLOCK() represent
 * sequence points, ie, loads and stores of other values must not be moved
 * across a lock or unlock.  In most cases it suffices to make the operation
 * be done through a "volatile" pointer.

IIUC, this is basically total nonsense.

It could maybe be rewritten for more clarity, but it's far from being
nonsense. The responsibility for having an actual hardware memory fence
instruction lies with the author of the TAS macro. But the
responsibility for keeping the compiler from reordering stuff around the
TAS call is that of the *user* of the TAS macro (or spinlock), and in
most cases the way to do that is to make sure that both the spinlock and
the shared data structure are referenced through volatile pointers.
This isn't academic; we've seen bugs from failure to do that. (BTW,
the reason for not being equivalently tense about LWLock-protected
structures is that the compiler generally won't try to move operations
around an out-of-line function call. It's the fact that spinlocks are
inline-able that creates the risk here.)
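The usage rule described here can be sketched like so (hypothetical struct and function names, not real PostgreSQL code):

```c
typedef struct
{
	int		mutex;		/* the spinlock itself */
	int		counter;	/* data it protects */
} SharedState;

static int
my_tas(volatile int *lock)
{
	return __sync_lock_test_and_set(lock, 1);
}

/*
 * Access the shared structure only through a volatile-qualified pointer,
 * so the compiler cannot move those loads/stores across the inlined TAS.
 */
static void
bump_counter(SharedState *shared)
{
	volatile SharedState *vshared = shared;

	while (my_tas(&vshared->mutex))
		;						/* spin (single-threaded in this demo) */
	vshared->counter++;			/* volatile access: stays inside the lock */
	vshared->mutex = 0;			/* volatile store: cannot move above it */
}
```

Dropping the volatile qualifier would let the compiler keep `counter` in a register across the lock and unlock, which is precisely the class of bug being warned about.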

In the department of loose ends, there are a bunch of other things
that maybe need cleanup here: (1) the gcc/intel compiler cases on
ia64, (2) PA-RISC, (3) ARM, (4) PowerPC... and we should also perhaps
reconsider the 32-bit x86 case.

The results I got yesterday seem sufficient basis to change the
gcc/intel cases for IA64, so I will go do that if you didn't already.
I am also currently running tests on x86_64 and PPC using Red Hat test
machines --- expect results later today. Red Hat doesn't seem to own
any many-CPU machines that are 32-bit-only, and I rather wonder if there
are any. It might be that it only makes sense to optimize the x86 paths
for a few CPUs, in which case this test methodology may not be very
helpful.

regards, tom lane

#12Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#11)
Re: spinlocks on HP-UX

On Mon, Aug 29, 2011 at 11:07 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

OK, done.  I think while we're tidying up here we ought to do
something about this comment:

 *      ANOTHER CAUTION: be sure that TAS(), TAS_SPIN(), and S_UNLOCK()
 *      represent sequence points, ie, loads and stores of other values must
 *      not be moved across a lock or unlock.  In most cases it suffices to
 *      make the operation be done through a "volatile" pointer.

IIUC, this is basically total nonsense.

It could maybe be rewritten for more clarity, but it's far from being
nonsense.  The responsibility for having an actual hardware memory fence
instruction lies with the author of the TAS macro.

Right... but the comment implies that you probably don't need one, and
doesn't even mention that you MIGHT need one.

In the department of loose ends, there are a bunch of other things
that maybe need cleanup here: (1) the gcc/intel compiler cases on
ia64, (2) PA-RISC, (3) ARM, (4) PowerPC... and we should also perhaps
reconsider the 32-bit x86 case.

The results I got yesterday seem sufficient basis to change the
gcc/intel cases for IA64, so I will go do that if you didn't already.

I did not; please go ahead. I wasn't relishing the idea of trying to
figure out how to install gcc to test that case.

I am also currently running tests on x86_64 and PPC using Red Hat test
machines --- expect results later today.  Red Hat doesn't seem to own
any many-CPU machines that are 32-bit-only, and I rather wonder if there
are any.  It might be that it only makes sense to optimize the x86 paths
for a few CPUs, in which case this test methodology may not be very
helpful.

FWIW, I tried spinning unlocked on x86_64 at 32 cores and got a
regression. Don't have any PPC gear at present.

I think optimizing spinlocks for machines with only a few CPUs is
probably pointless. Based on what I've seen so far, spinlock
contention even at 16 CPUs is negligible pretty much no matter what
you do. Whether your implementation is fast or slow isn't going to
matter, because even an inefficient implementation will account for
only a negligible percentage of the total CPU time - much less than 1%
- as opposed to a 64-core machine, where it's not that hard to find
cases where spin-waits consume the *majority* of available CPU time
(recall previous discussion of lseek). So I'm disinclined to spend
the time it would take to tinker with the 32-bit code, because it will
not matter; for that platform we're better off spending our time
installing a hash table in ScanKeywordLookup(). And there's always a
possibility that AMD and Intel chips could be different, or there
could even be differences between different chip generations from the
same manufacturer, so all in all it seems like a pretty unrewarding
exercise.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#13Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#12)
Re: spinlocks on HP-UX

Robert Haas <robertmhaas@gmail.com> writes:

On Mon, Aug 29, 2011 at 11:07 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

IIUC, this is basically total nonsense.

It could maybe be rewritten for more clarity, but it's far from being
nonsense.  The responsibility for having an actual hardware memory fence
instruction lies with the author of the TAS macro.

Right... but the comment implies that you probably don't need one, and
doesn't even mention that you MIGHT need one.

I think maybe we need to split it into two paragraphs, one addressing
the TAS author and the other for the TAS user. I'll have a go at that.

I think optimizing spinlocks for machines with only a few CPUs is
probably pointless. Based on what I've seen so far, spinlock
contention even at 16 CPUs is negligible pretty much no matter what
you do.

We did find significant differences several years ago, testing on
machines that probably had no more than four cores; that's where the
existing comments in s_lock.h came from. Whether those tests are
still relevant for today's source code is not obvious though.

regards, tom lane

#14Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#13)
Re: spinlocks on HP-UX

On Mon, Aug 29, 2011 at 11:42 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

On Mon, Aug 29, 2011 at 11:07 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

IIUC, this is basically total nonsense.

It could maybe be rewritten for more clarity, but it's far from being
nonsense.  The responsibility for having an actual hardware memory fence
instruction lies with the author of the TAS macro.

Right... but the comment implies that you probably don't need one, and
doesn't even mention that you MIGHT need one.

I think maybe we need to split it into two paragraphs, one addressing
the TAS author and the other for the TAS user.  I'll have a go at that.

OK.

I think optimizing spinlocks for machines with only a few CPUs is
probably pointless.  Based on what I've seen so far, spinlock
contention even at 16 CPUs is negligible pretty much no matter what
you do.

We did find significant differences several years ago, testing on
machines that probably had no more than four cores; that's where the
existing comments in s_lock.h came from.  Whether those tests are
still relevant for today's source code is not obvious though.

Hmm, OK. I guess if you want to put energy into it, I'm not going to
complain too much... just not sure it's the best use of time.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#15Greg Stark
stark@mit.edu
In reply to: Tom Lane (#11)
Re: spinlocks on HP-UX

On Mon, Aug 29, 2011 at 4:07 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

 *      ANOTHER CAUTION: be sure that TAS(), TAS_SPIN(), and S_UNLOCK()
 *      represent sequence points, ie, loads and stores of other values must
 *      not be moved across a lock or unlock.  In most cases it suffices to
 *      make the operation be done through a "volatile" pointer.

IIUC, this is basically total nonsense.

It could maybe be rewritten for more clarity, but it's far from being
nonsense.

The confusion for me is that it's talking about sequence points and
volatile pointers in the same breath as if one implies the other.
Making something a volatile pointer dose not create a sequence point.
It requires that the compiler not move the access or store across any
sequence points that are already there.

It might be helpful to include the actual bug that the comment is
trying to warn against because iirc it was a real case that caused you
to add the volatile modifiers.

--
greg

#16Tom Lane
tgl@sss.pgh.pa.us
In reply to: Greg Stark (#15)
Re: spinlocks on HP-UX

Greg Stark <stark@mit.edu> writes:

The confusion for me is that it's talking about sequence points and
volatile pointers in the same breath as if one implies the other.
Making something a volatile pointer does not create a sequence point.
It requires that the compiler not move the access or store across any
sequence points that are already there.

Well, no, I don't think that's the useful way to think about it. Modern
compilers honor sequence points only in terms of the semantics seen by
the executing program; they don't consider themselves barred from
reordering loads/stores freely. And as long as the memory in question is
only used by the given program, they're right. But for memory locations
shared with other threads of execution, you have to be careful about the
order of accesses to those locations. My understanding of "volatile"
is that the compiler is forbidden from altering the order of
volatile-qualified loads and stores *relative to each other*, or from
optimizing away a load or store that seems redundant in the context of
the given program. That's got nothing to do with sequence points per se.

It might be helpful to include the actual bug that the comment is
trying to warn against because iirc it was a real case that caused you
to add the volatile modifiers.

Well, if there were one and only one bug involved here, it wouldn't be
such a critical problem ...

regards, tom lane

#17Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#16)
Re: spinlocks on HP-UX

On Mon, Aug 29, 2011 at 12:00 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Greg Stark <stark@mit.edu> writes:

The confusion for me is that it's talking about sequence points and
volatile pointers in the same breath as if one implies the other.
Making something a volatile pointer does not create a sequence point.
It requires that the compiler not move the access or store across any
sequence points that are already there.

Well, no, I don't think that's the useful way to think about it.  Modern
compilers honor sequence points only in terms of the semantics seen by
the executing program; they don't consider themselves barred from
reordering loads/stores freely.

This discussion seems to miss the fact that there are two levels of
reordering that can happen. First, the compiler can move things
around. Second, the CPU can move things around. Even on x86, for
example, a sequence like this can be reordered:

LOAD A
STORE A
LOAD B

Even though the compiler may emit those instructions in exactly that
order, an x86 CPU can, IIUC, decide to load B before it finishes
storing A, so that the actual apparent execution order as seen by
other CPUs will be either the above, or the above with the last two
instructions reversed. On a weakly ordered CPU, the load of B could
be moved back even further, before the LOAD of A.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#18Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#17)
Re: spinlocks on HP-UX

Robert Haas <robertmhaas@gmail.com> writes:

This discussion seems to miss the fact that there are two levels of
reordering that can happen. First, the compiler can move things
around. Second, the CPU can move things around.

Right, I think that's exactly the problem with the previous wording of
that comment; it doesn't address the two logical levels involved.
I've rewritten it, see what you think.

* Another caution for users of these macros is that it is the caller's
* responsibility to ensure that the compiler doesn't re-order accesses
* to shared memory to precede the actual lock acquisition, or follow the
* lock release. Typically we handle this by using volatile-qualified
* pointers to refer to both the spinlock itself and the shared data
* structure being accessed within the spinlocked critical section.
* That fixes it because compilers are not allowed to re-order accesses
* to volatile objects relative to other such accesses.
*
* On platforms with weak memory ordering, the TAS(), TAS_SPIN(), and
* S_UNLOCK() macros must further include hardware-level memory fence
* instructions to prevent similar re-ordering at the hardware level.
* TAS() and TAS_SPIN() must guarantee that loads and stores issued after
* the macro are not executed until the lock has been obtained. Conversely,
* S_UNLOCK() must guarantee that loads and stores issued before the macro
* have been executed before the lock is released.

regards, tom lane

#19Greg Stark
stark@mit.edu
In reply to: Robert Haas (#17)
Re: spinlocks on HP-UX

On Mon, Aug 29, 2011 at 5:53 PM, Robert Haas <robertmhaas@gmail.com> wrote:

Even though the compiler may emit those instructions in exactly that
order, an x86 CPU can, IIUC, decide to load B before it finishes
storing A, so that the actual apparent execution order as seen by
other CPUs will be either the above, or the above with the last two
instructions reversed.  On a weakly ordered CPU, the load of B could
be moved back even further, before the LOAD of A.

My understanding of what the comment meant is that there is already a
full memory barrier as far as the CPU is concerned due to the TAS or
whatever, but it's important that there also be a sequence point there
so that the volatile memory access isn't reordered by the compiler to
occur before the memory barrier.

I was going to say the same thing as Tom that sequence points and
volatile pointers have nothing at all to do with each other. However
my brief searching online actually seemed to indicate that in fact the
compiler isn't supposed to reorder volatile memory accesses across
sequence points. That seemed to make sense since I couldn't think of
any other way to rigorously describe the constraints the compiler
should operate under.

--
greg

#20Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#18)
Re: spinlocks on HP-UX

On Mon, Aug 29, 2011 at 1:24 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

This discussion seems to miss the fact that there are two levels of
reordering that can happen.  First, the compiler can move things
around.  Second, the CPU can move things around.

Right, I think that's exactly the problem with the previous wording of
that comment; it doesn't address the two logical levels involved.
I've rewritten it, see what you think.

 *      Another caution for users of these macros is that it is the caller's
 *      responsibility to ensure that the compiler doesn't re-order accesses
 *      to shared memory to precede the actual lock acquisition, or follow the
 *      lock release.  Typically we handle this by using volatile-qualified
 *      pointers to refer to both the spinlock itself and the shared data
 *      structure being accessed within the spinlocked critical section.
 *      That fixes it because compilers are not allowed to re-order accesses
 *      to volatile objects relative to other such accesses.
 *
 *      On platforms with weak memory ordering, the TAS(), TAS_SPIN(), and
 *      S_UNLOCK() macros must further include hardware-level memory fence
 *      instructions to prevent similar re-ordering at the hardware level.
 *      TAS() and TAS_SPIN() must guarantee that loads and stores issued after
 *      the macro are not executed until the lock has been obtained.  Conversely,
 *      S_UNLOCK() must guarantee that loads and stores issued before the macro
 *      have been executed before the lock is released.

That's definitely an improvement.

I'm actually not convinced that we're entirely consistent here about
what we require the semantics of acquiring and releasing a spinlock to
be. For example, on x86 and x86_64, we acquire the lock using xchgb,
which acts as a full memory barrier. But when we release the lock, we
just zero out the memory address, which is NOT a full memory barrier.
Stores can't cross it, but non-dependent loads of different locations
can back up over it. That's pretty close to a full barrier, but it
isn't, quite. Now, I don't see why that should really cause any
problem, at least for common cases like LWLockAcquire(). If the CPU
prefetches the data protected by the lwlock after we know we've got
the lock before we've actually released the spinlock and returned from
LWLockAcquire(), that should be fine, even good (for performance).
The real problem with being squiffy here is that it's not clear how
weak we can make the fence instruction on weakly ordered architectures
that support multiple types. Right now we're pretty conservative, but
I think that may be costing us. I might be wrong; more research is
needed here; but I think that we should at least start to get our head
about what semantics we actually need.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#21Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#11)
Re: spinlocks on HP-UX

I wrote:

I am also currently running tests on x86_64 and PPC using Red Hat test
machines --- expect results later today.

OK, I ran some more tests. These are not directly comparable to my
previous results with IA64, because (a) I used RHEL6.2 and gcc 4.4.6;
(b) I used half as many pgbench threads as backends, rather than one
thread per eight backends. Testing showed that pgbench cannot saturate
more than two backends per thread in this test environment, as shown
for example by this series:

pgbench -c 8 -j 1 -S -T 300 bench tps = 22091.461409 (including ...
pgbench -c 8 -j 2 -S -T 300 bench tps = 42587.661755 (including ...
pgbench -c 8 -j 4 -S -T 300 bench tps = 77515.057885 (including ...
pgbench -c 8 -j 8 -S -T 300 bench tps = 75830.463821 (including ...

I find this entirely astonishing, BTW; the backend is surely doing far
more than twice as much work per query as pgbench. We need to look into
why pgbench is apparently still such a dog. However, that's not
tremendously relevant to the question of whether we need an unlocked
test in spinlocks.

These tests were run on a 32-CPU Opteron machine (Sun Fire X4600 M2,
8 quad-core sockets). Test conditions the same as my IA64 set, except
for the OS and the -j switches:

Stock git head:

pgbench -c 1 -j 1 -S -T 300 bench tps = 9515.435401 (including ...
pgbench -c 2 -j 1 -S -T 300 bench tps = 20239.289880 (including ...
pgbench -c 8 -j 4 -S -T 300 bench tps = 78628.371372 (including ...
pgbench -c 16 -j 8 -S -T 300 bench tps = 143065.596555 (including ...
pgbench -c 32 -j 16 -S -T 300 bench tps = 227349.424654 (including ...
pgbench -c 64 -j 32 -S -T 300 bench tps = 269016.946095 (including ...
pgbench -c 96 -j 48 -S -T 300 bench tps = 253884.095190 (including ...
pgbench -c 128 -j 64 -S -T 300 bench tps = 269235.253012 (including ...

Non-locked test in TAS():

pgbench -c 1 -j 1 -S -T 300 bench tps = 9316.195621 (including ...
pgbench -c 2 -j 1 -S -T 300 bench tps = 19852.444846 (including ...
pgbench -c 8 -j 4 -S -T 300 bench tps = 77701.546927 (including ...
pgbench -c 16 -j 8 -S -T 300 bench tps = 138926.775553 (including ...
pgbench -c 32 -j 16 -S -T 300 bench tps = 188485.669320 (including ...
pgbench -c 64 -j 32 -S -T 300 bench tps = 253602.490286 (including ...
pgbench -c 96 -j 48 -S -T 300 bench tps = 251181.310600 (including ...
pgbench -c 128 -j 64 -S -T 300 bench tps = 260812.933702 (including ...

Non-locked test in TAS_SPIN() only:

pgbench -c 1 -j 1 -S -T 300 bench tps = 9283.944739 (including ...
pgbench -c 2 -j 1 -S -T 300 bench tps = 20213.208443 (including ...
pgbench -c 8 -j 4 -S -T 300 bench tps = 78824.247744 (including ...
pgbench -c 16 -j 8 -S -T 300 bench tps = 141027.072774 (including ...
pgbench -c 32 -j 16 -S -T 300 bench tps = 201658.416366 (including ...
pgbench -c 64 -j 32 -S -T 300 bench tps = 271035.843105 (including ...
pgbench -c 96 -j 48 -S -T 300 bench tps = 261337.324585 (including ...
pgbench -c 128 -j 64 -S -T 300 bench tps = 271272.921058 (including ...

So basically there is no benefit to the unlocked test on this hardware.
But it doesn't cost much either, which is odd because the last time we
did this type of testing, adding an unlocked test was a "huge loss" on
Opteron. Apparently AMD improved their handling of the case, and/or
the other changes we've made change the usage pattern completely.

I am hoping to do a similar test on another machine with $bignum Xeon
processors, to see if Intel hardware reacts any differently. But that
machine is in the Westford office which is currently without power,
so it will have to wait a few days. (I can no longer get at either
of the machines cited in this mail, either, so if you want to see
more test cases it'll have to wait.)

These tests were run on a 32-processor PPC64 machine (IBM 8406-71Y,
POWER7 architecture; I think it might be 16 cores with hyperthreading,
not sure). The machine has "only" 6GB of RAM so I set shared_buffers to
4GB, other test conditions the same:

Stock git head:

pgbench -c 1 -j 1 -S -T 300 bench tps = 8746.076443 (including ...
pgbench -c 2 -j 1 -S -T 300 bench tps = 12297.297308 (including ...
pgbench -c 8 -j 4 -S -T 300 bench tps = 48697.392492 (including ...
pgbench -c 16 -j 8 -S -T 300 bench tps = 94133.227472 (including ...
pgbench -c 32 -j 16 -S -T 300 bench tps = 126822.857978 (including ...
pgbench -c 64 -j 32 -S -T 300 bench tps = 129364.417801 (including ...
pgbench -c 96 -j 48 -S -T 300 bench tps = 125728.697772 (including ...
pgbench -c 128 -j 64 -S -T 300 bench tps = 131566.394880 (including ...

Non-locked test in TAS():

pgbench -c 1 -j 1 -S -T 300 bench tps = 8810.484890 (including ...
pgbench -c 2 -j 1 -S -T 300 bench tps = 12336.612804 (including ...
pgbench -c 8 -j 4 -S -T 300 bench tps = 49023.435650 (including ...
pgbench -c 16 -j 8 -S -T 300 bench tps = 96306.706556 (including ...
pgbench -c 32 -j 16 -S -T 300 bench tps = 131731.475778 (including ...
pgbench -c 64 -j 32 -S -T 300 bench tps = 133451.416612 (including ...
pgbench -c 96 -j 48 -S -T 300 bench tps = 110076.269474 (including ...
pgbench -c 128 -j 64 -S -T 300 bench tps = 111339.797242 (including ...

Non-locked test in TAS_SPIN() only:

pgbench -c 1 -j 1 -S -T 300 bench tps = 8726.269726 (including ...
pgbench -c 2 -j 1 -S -T 300 bench tps = 12228.415466 (including ...
pgbench -c 8 -j 4 -S -T 300 bench tps = 48227.623829 (including ...
pgbench -c 16 -j 8 -S -T 300 bench tps = 93302.510254 (including ...
pgbench -c 32 -j 16 -S -T 300 bench tps = 130661.097475 (including ...
pgbench -c 64 -j 32 -S -T 300 bench tps = 133009.181697 (including ...
pgbench -c 96 -j 48 -S -T 300 bench tps = 128710.757986 (including ...
pgbench -c 128 -j 64 -S -T 300 bench tps = 133063.460934 (including ...

So basically no value to an unlocked test on this platform either.

regards, tom lane

#22Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#20)
Re: spinlocks on HP-UX

Robert Haas <robertmhaas@gmail.com> writes:

I'm actually not convinced that we're entirely consistent here about
what we require the semantics of acquiring and releasing a spinlock to
be. For example, on x86 and x86_64, we acquire the lock using xchgb,
which acts as a full memory barrier. But when we release the lock, we
just zero out the memory address, which is NOT a full memory barrier.
Stores can't cross it, but non-dependent loads of different locations
can back up over it. That's pretty close to a full barrier, but it
isn't, quite.

Right. That's why I wrote the comment as I did; it says what the actual
requirement is. There probably are cases where our implementations are
more restrictive than necessary (I hope none where they are weaker).

regards, tom lane

#23Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#21)
Re: spinlocks on HP-UX

On Mon, Aug 29, 2011 at 2:15 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

These tests were run on a 32-CPU Opteron machine (Sun Fire X4600 M2,
8 quad-core sockets).  Test conditions the same as my IA64 set, except
for the OS and the -j switches:

Stock git head:

pgbench -c 1 -j 1 -S -T 300 bench       tps = 9515.435401 (including ...
pgbench -c 32 -j 16 -S -T 300 bench     tps = 227349.424654 (including ...

These tests were run on a 32-processor PPC64 machine (IBM 8406-71Y,
POWER7 architecture; I think it might be 16 cores with hyperthreading,
not sure).  The machine has "only" 6GB of RAM so I set shared_buffers to
4GB, other test conditions the same:

Stock git head:

pgbench -c 1 -j 1 -S -T 300 bench       tps = 8746.076443 (including ...
pgbench -c 32 -j 16 -S -T 300 bench     tps = 126822.857978 (including ...

Stepping beyond the immediate issue of whether we want an unlocked
test in there or not (and I agree that based on these numbers we
don't), there's a clear and puzzling difference between those sets of
numbers. The Opteron test is showing 32 clients getting about 23.9
times the throughput of a single client, which is not exactly linear
but is at least respectable, whereas the PPC64 test is showing 32
clients getting just 14.5 times the throughput of a single client,
which is pretty significantly less good. Moreover, cranking it up to
64 clients is squeezing a significant amount of additional work out on
Opteron, but not on PPC64. The HP-UX/Itanium numbers in my OP give a
ratio of 17.3x - a little better than your PPC64 numbers, but
certainly not great.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#24Tom Lane
tgl@sss.pgh.pa.us
In reply to: Greg Stark (#19)
Re: spinlocks on HP-UX

Greg Stark <stark@mit.edu> writes:

I was going to say the same thing as Tom that sequence points and
volatile pointers have nothing at all to do with each other. However
my brief searching online actually seemed to indicate that in fact the
compiler isn't supposed to reorder volatile memory accesses across
sequence points. That seemed to make sense since I couldn't think of
any other way to rigorously describe the constraints the compiler
should operate under.

It's a bit confusing. I agree that if the code is written such that
there are two volatile accesses with no intervening sequence point,
the compiler is at liberty to do them in either order; for instance in

foo(*x, *y);

there are no guarantees about which value is fetched first, even if x
and y are volatile-qualified. What's bothering me is that in, say,

*x = 0;
*y = 1;
*z = 2;

if x and z are volatile-qualified but y isn't, I believe the compilers
think they are only required to store into *x before *z, and can reorder
the store to *y around either of the others. So this makes the notion
of a sequence point pretty squishy in itself.

regards, tom lane

#25Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Robert Haas (#23)
1 attachment(s)
Re: spinlocks on HP-UX

Robert Haas <robertmhaas@gmail.com> wrote:

Stepping beyond the immediate issue of whether we want an unlocked
test in there or not (and I agree that based on these numbers we
don't), there's a clear and puzzling difference between those sets
of numbers. The Opteron test is showing 32 clients getting about
23.9 times the throughput of a single client, which is not exactly
linear but is at least respectable, whereas the PPC64 test is
showing 32 clients getting just 14.5 times the throughput of a
single client, which is pretty significantly less good. Moreover,
cranking it up to 64 clients is squeezing a significant amount of
additional work out on Opteron, but not on PPC64. The
HP-UX/Itanium numbers in my OP give a ratio of 17.3x - a little
better than your PPC64 numbers, but certainly not great.

I wouldn't make too much of that without comparing to a STREAM test
(properly configured -- the default array size is likely not to be
large enough for these machines). On a recently delivered 32 core
machine with 256 GB RAM, I saw numbers like this for just RAM
access:

Threads Copy Scale Add Triad
1 3332.3721 3374.8146 4500.1954 4309.7392
2 5822.8107 6158.4621 8563.3236 7756.9050
4 12474.9646 12282.3401 16960.7216 15399.2406
8 22353.6013 23502.4389 31911.5206 29574.8124
16 35760.8782 40946.6710 49108.4386 49264.6576
32 47567.3882 44935.4608 52983.9355 52278.1373
64 48428.9939 47851.7320 54141.8830 54560.0520
128 49354.4303 49586.6092 55663.2606 57489.5798
256 45081.3601 44303.1032 49561.3815 50073.3530
512 42271.9688 41689.8609 47362.4190 46388.9720

Graph attached for those who are visually inclined and have support
for the display of JPEG files.

Note that this is a machine which is *not* configured to be
blazingly fast for a single connection, but to scale up well for a
large number of concurrent processes:

http://www.redbooks.ibm.com/redpapers/pdfs/redp4650.pdf

Unless your benchmarks are falling off a lot faster than the STREAM
test on that hardware, I wouldn't worry.

-Kevin

Attachments:

cir-stream-tests_html_590087c.jpg (image/jpeg)
#26Tom Lane
tgl@sss.pgh.pa.us
In reply to: Kevin Grittner (#25)
Re: spinlocks on HP-UX

"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:

Robert Haas <robertmhaas@gmail.com> wrote:

Stepping beyond the immediate issue of whether we want an unlocked
test in there or not (and I agree that based on these numbers we
don't), there's a clear and puzzling difference between those sets
of numbers. The Opteron test is showing 32 clients getting about
23.9 times the throughput of a single client, which is not exactly
linear but is at least respectable, whereas the PPC64 test is
showing 32 clients getting just 14.5 times the throughput of a
single client, which is pretty significantly less good.

I wouldn't make too much of that without comparing to a STREAM test
(properly configured -- the default array size is likely not to be
large enough for these machines).

Yeah. One point I didn't mention is that the Opteron machine's memory
is split across 8 NUMA nodes, whereas the PPC machine's isn't. I would
bet there's a significant difference in aggregate available memory
bandwidth.

Also, if the PPC machine really is hyperthreaded (the internal webpage
for it says "Hyper? True" but /proc/cpuinfo doesn't provide any clear
indications), that might mean it's not going to scale too well past 16x
the single-thread case.

regards, tom lane

#27Ants Aasma
ants.aasma@eesti.ee
In reply to: Robert Haas (#1)
Re: spinlocks on HP-UX

Sorry, forgot to cc the list.

On Mon, Aug 29, 2011 at 10:12 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Also, if the PPC machine really is hyperthreaded (the internal webpage
for it says "Hyper? True" but /proc/cpuinfo doesn't provide any clear
indications), that might mean it's not going to scale too well past 16x
the single-thread case.

According to IBM docs [1], 8406-71Y contains one 8 core POWER7 chip
that is 4-way multi-threaded and has 4 memory channels. X4600M2 should
have 16 memory channels, although at 2/3 the transfer rate. 6GB of
memory is a strange amount for the IBM; according to the specs it should take
4 or 8GB DIMMs in pairs. Sounds like the server is split into multiple
partitions.

--
Ants Aasma

[1]: http://www.redbooks.ibm.com/redpapers/pdfs/redp4655.pdf

#28Tom Lane
tgl@sss.pgh.pa.us
In reply to: Ants Aasma (#27)
Re: spinlocks on HP-UX

Ants Aasma <ants.aasma@eesti.ee> writes:

On Mon, Aug 29, 2011 at 10:12 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Also, if the PPC machine really is hyperthreaded (the internal webpage
for it says "Hyper? True" but /proc/cpuinfo doesn't provide any clear
indications), that might mean it's not going to scale too well past 16x
the single-thread case.

According to IBM docs [1], 8406-71Y contains one 8 core POWER7 chip
that is 4-way multi-threaded and has 4 memory channels.

Yeah, I looked at the docs. "Multi threading" is IBM's term for
hyperthreading, that is several instruction streams competing for use of
a single processor core's pipelines. So the 32 virtual processors on
the machine only really represent 8 physically independent cores,
squaring with the hardware designation. I found an IBM doc
http://www-03.ibm.com/systems/resources/pwrsysperf_SMT4OnP7.pdf
suggesting that the throughput benefit of 4-way SMT is typically 1.5 to
2X, that is you max out at 1.5 to 2X as much work as you'd get with just
8 virtual processors on the same 8 cores. So I'd say we're really doing
quite well to get the numbers I got. (The paper also implies that you
get more benefit from SMT with workloads that incur more memory-access
stalls, so the relatively large working set of this test case is helping
it look good.)

6GB of
memory is a strange amount for the IBM; according to the specs it should take
4 or 8GB DIMMs in pairs. Sounds like the server is split into multiple
partitions.

I'm confused about that too. There definitely seemed to be only 6GB of
available RAM, but there's no way I can see that memory might be
partitioned across different blades. The blades look pretty independent
...

regards, tom lane

#29Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#21)
Re: spinlocks on HP-UX

I wrote:

I am hoping to do a similar test on another machine with $bignum Xeon
processors, to see if Intel hardware reacts any differently. But that
machine is in the Westford office which is currently without power,
so it will have to wait a few days.

OK, the lights are on again in Westford, so here are some results from
an 8-socket Fujitsu PRIMEQUEST 1800 with 10-core Xeon E7-8870 processors,
hyperthreading enabled for a total of 160 virtual processors.
All test conditions the same as from my Opteron runs yesterday,
except just for the heck of it I ran it up to 160 backends.

Stock git head (of a couple of days ago now):

pgbench -c 1 -j 1 -S -T 300 bench tps = 4401.589257 (including ...
pgbench -c 2 -j 1 -S -T 300 bench tps = 8585.789827 (including ...
pgbench -c 8 -j 4 -S -T 300 bench tps = 36315.227334 (including ...
pgbench -c 16 -j 8 -S -T 300 bench tps = 73841.195884 (including ...
pgbench -c 32 -j 16 -S -T 300 bench tps = 155309.526039 (including ...
pgbench -c 64 -j 32 -S -T 300 bench tps = 77477.101725 (including ...
pgbench -c 96 -j 48 -S -T 300 bench tps = 41301.481915 (including ...
pgbench -c 128 -j 64 -S -T 300 bench tps = 30443.815506 (including ...
pgbench -c 160 -j 80 -S -T 300 bench tps = 24600.584202 (including ...

Non-locked test in TAS():

pgbench -c 1 -j 1 -S -T 300 bench tps = 4412.336573 (including ...
pgbench -c 2 -j 1 -S -T 300 bench tps = 8739.900806 (including ...
pgbench -c 8 -j 4 -S -T 300 bench tps = 32957.710818 (including ...
pgbench -c 16 -j 8 -S -T 300 bench tps = 71538.032629 (including ...
pgbench -c 32 -j 16 -S -T 300 bench tps = 153892.469308 (including ...
pgbench -c 64 -j 32 -S -T 300 bench tps = 127786.277182 (including ...
pgbench -c 96 -j 48 -S -T 300 bench tps = 92108.895423 (including ...
pgbench -c 128 -j 64 -S -T 300 bench tps = 75382.131814 (including ...
pgbench -c 160 -j 80 -S -T 300 bench tps = 67277.057981 (including ...

Non-locked test in TAS_SPIN() only:

pgbench -c 1 -j 1 -S -T 300 bench tps = 4006.626861 (including ...
pgbench -c 2 -j 1 -S -T 300 bench tps = 9020.124850 (including ...
pgbench -c 8 -j 4 -S -T 300 bench tps = 36507.582318 (including ...
pgbench -c 16 -j 8 -S -T 300 bench tps = 69668.921550 (including ...
pgbench -c 32 -j 16 -S -T 300 bench tps = 150886.395754 (including ...
pgbench -c 64 -j 32 -S -T 300 bench tps = 216697.745497 (including ...
pgbench -c 96 -j 48 -S -T 300 bench tps = 171013.266643 (including ...
pgbench -c 128 -j 64 -S -T 300 bench tps = 115205.718495 (including ...
pgbench -c 160 -j 80 -S -T 300 bench tps = 92073.704665 (including ...

This suggests that (1) an unlocked test in TAS_SPIN might be a good idea
on x86_64 after all, and (2) this test scenario may not be pushing the
system hard enough to expose limitations of the spinlock implementation.

I am now thinking that the reason we saw clear differences in spinlock
implementations years ago, and now are not seeing them except on insane
hardware, is mainly that we've managed to reduce contention at higher
levels of the system. That doesn't mean spinlocks have become
uninteresting, just that "pgbench -S" isn't the ideal test case for
stressing them. I'm thinking maybe we need a test scenario that
generates sinval traffic, for example, or forces snapshots to be taken
more often. Ideas anyone?

regards, tom lane

#30Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#29)
Re: spinlocks on HP-UX

On Tue, Aug 30, 2011 at 4:05 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

This suggests that (1) an unlocked test in TAS_SPIN might be a good idea
on x86_64 after all, and (2) this test scenario may not be pushing the
system hard enough to expose limitations of the spinlock implementation.

I am now thinking that the reason we saw clear differences in spinlock
implementations years ago, and now are not seeing them except on insane
hardware, is mainly that we've managed to reduce contention at higher
levels of the system.  That doesn't mean spinlocks have become
uninteresting, just that "pgbench -S" isn't the ideal test case for
stressing them.  I'm thinking maybe we need a test scenario that
generates sinval traffic, for example, or forces snapshots to be taken
more often.  Ideas anyone?

On current sources, with a workload that fits into shared_buffers,
pgbench -S hammers the spinlock protecting ProcArrayLock extremely
hard. I'm sure it's possible to come up with a test case that
hammers them harder, but using a real workload can expose issues (like
aggregate memory bandwidth) that you might not see otherwise.

I am a bit surprised by your test results, because I also tried x86_64
with an unlocked test, also on pgbench -S, and I am pretty sure I got
a regression. Maybe I'll try rerunning that. It seems possible that
the x86_64 results depend on the particular sub-architecture and/or
whether HT is in use, which would be kind of a nuisance.

Also, did you happen to measure the amount of user time vs. system
time that your test runs used? If this is on Linux, I am surprised
that you didn't get killed by the lseek() contention problem on a
machine with that many cores. I found it to be visible at 32 and
crippling at 64, so I can't even imagine what it would be like at 160.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#31Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#30)
Re: spinlocks on HP-UX

Robert Haas <robertmhaas@gmail.com> writes:

I am a bit surprised by your test results, because I also tried x86_64
with an unlocked test, also on pgbench -S, and I am pretty sure I got
a regression. Maybe I'll try rerunning that. It seems possible that
the x86_64 results depend on the particular sub-architecture and/or
whether HT is in use, which would be kind of a nuisance.

Well, if you consider Opteron as a sub-architecture of x86_64, that was
already true the last time we did this. So far there have not been
cases where something really good for one implementation was really bad
for another, but someday we'll probably hit that.

Also, did you happen to measure the amount of user time vs. system
time that your test runs used?

Did not think about that. I was considering how to measure the average
context swap rate over each run, so that we could keep an eye out for
the "context swap storm" behavior that's the usual visible-in-top
symptom of these sorts of problems. But it'd have to be automated;
I'm not going to keep my eyes glued to "top" output for several hours.

I'd be happy to re-run these tests with any RHEL-compatible measurement
scaffolding somebody else provides, but if I have to write it, it
probably won't happen very soon.

If this is on Linux, I am surprised
that you didn't get killed by the lseek() contention problem on a
machine with that many cores.

Hm ... now that you mention it, all of these tests have been using
the latest-and-greatest unreleased RHEL kernels. Maybe Red Hat already
fixed that contention problem in their kernel? Have you got a RH
bugzilla number for the issue?

regards, tom lane

#32Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#31)
Re: spinlocks on HP-UX

On Tue, Aug 30, 2011 at 4:37 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

If this is on Linux, I am surprised
that you didn't get killed by the lseek() contention problem on a
machine with that many cores.

Hm ... now that you mention it, all of these tests have been using
the latest-and-greatest unreleased RHEL kernels.  Maybe Red Hat already
fixed that contention problem in their kernel?  Have you got a RH
bugzilla number for the issue?

No, I haven't had much luck filing bugs against Red Hat releases, so
I've sort of given up on that. I did have some off-list
correspondence with a Red Hat engineer who read my blog post, though.

It should be pretty easy to figure it out, though. Just fire up
pgbench with lots of clients (say, 160) and run vmstat in another
window. If the machine reports 10% system time, it's fixed. If it
reports 90% system time, it's not.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#33Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#32)
Re: spinlocks on HP-UX

Robert Haas <robertmhaas@gmail.com> writes:

On Tue, Aug 30, 2011 at 4:37 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

If this is on Linux, I am surprised
that you didn't get killed by the lseek() contention problem on a
machine with that many cores.

Hm ... now that you mention it, all of these tests have been using
the latest-and-greatest unreleased RHEL kernels.

It should be pretty easy to figure it out, though. Just fire up
pgbench with lots of clients (say, 160) and run vmstat in another
window. If the machine reports 10% system time, it's fixed. If it
reports 90% system time, it's not.

I ran it up to "pgbench -c 200 -j 200 -S -T 300 bench" and still see
vmstat numbers around 50% user time, 12% system time, 38% idle.
So no lseek problem here, boss. Kernel calls itself 2.6.32-192.el6.x86_64.

regards, tom lane

#34Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#33)
Re: spinlocks on HP-UX

On Tue, Aug 30, 2011 at 6:33 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

On Tue, Aug 30, 2011 at 4:37 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

If this is on Linux, I am surprised
that you didn't get killed by the lseek() contention problem on a
machine with that many cores.

Hm ... now that you mention it, all of these tests have been using
the latest-and-greatest unreleased RHEL kernels.

It should be pretty easy to figure it out, though.   Just fire up
pgbench with lots of clients (say, 160) and run vmstat in another
window.  If the machine reports 10% system time, it's fixed.  If it
reports 90% system time, it's not.

I ran it up to "pgbench -c 200 -j 200 -S -T 300 bench" and still see
vmstat numbers around 50% user time, 12% system time, 38% idle.
So no lseek problem here, boss.  Kernel calls itself 2.6.32-192.el6.x86_64.

Eh, wait a minute. 38% idle time? Did you use a scale factor that
doesn't fit in shared_buffers? If so you're probably testing how fast
you pass BufFreelistLock around...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#35Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#34)
Re: spinlocks on HP-UX

Robert Haas <robertmhaas@gmail.com> writes:

On Tue, Aug 30, 2011 at 6:33 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I ran it up to "pgbench -c 200 -j 200 -S -T 300 bench" and still see
vmstat numbers around 50% user time, 12% system time, 38% idle.
So no lseek problem here, boss. Kernel calls itself 2.6.32-192.el6.x86_64.

Eh, wait a minute. 38% idle time? Did you use a scale factor that
doesn't fit in shared_buffers?

Nope: -s 100, 8GB shared_buffers, same as all the other tests.

Typical strace of one backend looks like

recvfrom(9, "Q\0\0\0?SELECT abalance FROM pgbenc"..., 8192, 0, NULL, NULL) = 64
lseek(10, 0, SEEK_END) = 269213696
lseek(11, 0, SEEK_END) = 224641024
sendto(9, "T\0\0\0!\0\1abalance\0\0\0\241\267\0\3\0\0\0\27\0\4\377\377\377\377"..., 66, 0, NULL, 0) = 66
recvfrom(9, "Q\0\0\0?SELECT abalance FROM pgbenc"..., 8192, 0, NULL, NULL) = 64
lseek(10, 0, SEEK_END) = 269213696
lseek(11, 0, SEEK_END) = 224641024
sendto(9, "T\0\0\0!\0\1abalance\0\0\0\241\267\0\3\0\0\0\27\0\4\377\377\377\377"..., 66, 0, NULL, 0) = 66
recvfrom(9, "Q\0\0\0?SELECT abalance FROM pgbenc"..., 8192, 0, NULL, NULL) = 64
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
lseek(10, 0, SEEK_END) = 269213696
lseek(11, 0, SEEK_END) = 224641024
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
sendto(9, "T\0\0\0!\0\1abalance\0\0\0\241\267\0\3\0\0\0\27\0\4\377\377\377\377"..., 66, 0, NULL, 0) = 66
recvfrom(9, "Q\0\0\0?SELECT abalance FROM pgbenc"..., 8192, 0, NULL, NULL) = 64
lseek(10, 0, SEEK_END) = 269213696
lseek(11, 0, SEEK_END) = 224641024
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
sendto(9, "T\0\0\0!\0\1abalance\0\0\0\241\267\0\3\0\0\0\27\0\4\377\377\377\377"..., 66, 0, NULL, 0) = 66

No I/O anywhere. I'm thinking the reported idle time must correspond to
spinlock delays that are long enough to reach the select() calls in
s_lock. If so, 38% is depressingly high, but it's not out of line with
what we've seen in the past in tests designed to provoke spinlock
contention.

(BTW, this is with the unlocked test added to TAS_SPIN.)

regards, tom lane

#36Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#35)
Re: spinlocks on HP-UX

On Tue, Aug 30, 2011 at 7:21 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

On Tue, Aug 30, 2011 at 6:33 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I ran it up to "pgbench -c 200 -j 200 -S -T 300 bench" and still see
vmstat numbers around 50% user time, 12% system time, 38% idle.
So no lseek problem here, boss. Kernel calls itself 2.6.32-192.el6.x86_64.

Eh, wait a minute.  38% idle time?  Did you use a scale factor that
doesn't fit in shared_buffers?

Nope: -s 100, 8GB shared_buffers, same as all the other tests.

Typical strace of one backend looks like

recvfrom(9, "Q\0\0\0?SELECT abalance FROM pgbenc"..., 8192, 0, NULL, NULL) = 64
lseek(10, 0, SEEK_END)                  = 269213696
lseek(11, 0, SEEK_END)                  = 224641024
sendto(9, "T\0\0\0!\0\1abalance\0\0\0\241\267\0\3\0\0\0\27\0\4\377\377\377\377"..., 66, 0, NULL, 0) = 66
recvfrom(9, "Q\0\0\0?SELECT abalance FROM pgbenc"..., 8192, 0, NULL, NULL) = 64
lseek(10, 0, SEEK_END)                  = 269213696
lseek(11, 0, SEEK_END)                  = 224641024
sendto(9, "T\0\0\0!\0\1abalance\0\0\0\241\267\0\3\0\0\0\27\0\4\377\377\377\377"..., 66, 0, NULL, 0) = 66
recvfrom(9, "Q\0\0\0?SELECT abalance FROM pgbenc"..., 8192, 0, NULL, NULL) = 64
select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
lseek(10, 0, SEEK_END)                  = 269213696
lseek(11, 0, SEEK_END)                  = 224641024
select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
sendto(9, "T\0\0\0!\0\1abalance\0\0\0\241\267\0\3\0\0\0\27\0\4\377\377\377\377"..., 66, 0, NULL, 0) = 66
recvfrom(9, "Q\0\0\0?SELECT abalance FROM pgbenc"..., 8192, 0, NULL, NULL) = 64
lseek(10, 0, SEEK_END)                  = 269213696
lseek(11, 0, SEEK_END)                  = 224641024
select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
sendto(9, "T\0\0\0!\0\1abalance\0\0\0\241\267\0\3\0\0\0\27\0\4\377\377\377\377"..., 66, 0, NULL, 0) = 66

No I/O anywhere.  I'm thinking the reported idle time must correspond to
spinlock delays that are long enough to reach the select() calls in
s_lock.  If so, 38% is depressingly high, but it's not out of line with
what we've seen in the past in tests designed to provoke spinlock
contention.

(BTW, this is with the unlocked test added to TAS_SPIN.)

Well, that is mighty interesting. That strace looks familiar, but I
have never seen a case where the idle time was more than a few
percentage points on this test (well, assuming you're using 9.2
sources, anyway).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#37Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#35)
Re: spinlocks on HP-UX

I wrote:

No I/O anywhere. I'm thinking the reported idle time must correspond to
spinlock delays that are long enough to reach the select() calls in
s_lock. If so, 38% is depressingly high, but it's not out of line with
what we've seen in the past in tests designed to provoke spinlock
contention.

I tried increasing MAX_SPINS_PER_DELAY from 1000 to 10000. (Again, this
is with the unlocked test added to TAS_SPIN.) This resulted in a very
significant drop in the reported idle-time percentage, down to 10% or so
at full load; but unfortunately the TPS numbers got worse for the higher
end of the curve:

pgbench -c 1 -j 1 -S -T 300 bench tps = 4526.914824 (including ...
pgbench -c 2 -j 1 -S -T 300 bench tps = 8183.815526 (including ...
pgbench -c 8 -j 4 -S -T 300 bench tps = 34637.075173 (including ...
pgbench -c 16 -j 8 -S -T 300 bench tps = 68792.550304 (including ...
pgbench -c 32 -j 16 -S -T 300 bench tps = 159195.038317 (including ...
pgbench -c 64 -j 32 -S -T 300 bench tps = 220544.912947 (including ...
pgbench -c 96 -j 48 -S -T 300 bench tps = 147367.793544 (including ...
pgbench -c 128 -j 64 -S -T 300 bench tps = 79187.042252 (including ...
pgbench -c 160 -j 80 -S -T 300 bench tps = 43957.912879 (including ...

So that confirms the idea that the reported idle time corresponds to
s_lock select() sleeps. Unfortunately, it doesn't appear to lead to
anything that would result in increasing performance. I suppose the
reason that performance gets worse, even though we've presumably
eliminated some process context swaps, is that we have more cache line
contention for whichever spinlock(s) they're all fighting over.

regards, tom lane

#38Tatsuo Ishii
ishii@postgresql.org
In reply to: Robert Haas (#1)
Re: spinlocks on HP-UX

Hi,

I am interested in this thread because I may be able to borrow a big
IBM machine and might be able to do some tests on it if that would
help enhance PostgreSQL. Is there anything I can do for this?
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

I was able to obtain access to a 32-core HP-UX server. I repeated the
pgbench -S testing that I have previously done on Linux, and found
that the results were not too good. Here are the results at scale
factor 100, on 9.2devel, with various numbers of clients. Five minute
runs, shared_buffers=8GB.

1:tps = 5590.070816 (including connections establishing)
8:tps = 37660.233932 (including connections establishing)
16:tps = 67366.099286 (including connections establishing)
32:tps = 82781.624665 (including connections establishing)
48:tps = 18589.995074 (including connections establishing)
64:tps = 16424.661371 (including connections establishing)

And just for comparison, here are the numbers at scale factor 1000:

1:tps = 4751.768608 (including connections establishing)
8:tps = 33621.474490 (including connections establishing)
16:tps = 58959.043171 (including connections establishing)
32:tps = 78801.265189 (including connections establishing)
48:tps = 21635.234969 (including connections establishing)
64:tps = 18611.863567 (including connections establishing)

After mulling over the vmstat output for a bit, I began to suspect
spinlock contention. I took a look at document called "Implementing
Spinlocks on the Intel Itanium Architecture and PA-RISC", by Tor
Ekqvist and David Graves and available via the HP web site, which
states that when spinning on a spinlock on these machines, you should
use a regular, unlocked test first and use the atomic test only when
the unlocked test looks OK. I tried implementing this in two ways,
and both produced results which are FAR superior to our current
implementation. First, I did this:

--- a/src/include/storage/s_lock.h
+++ b/src/include/storage/s_lock.h
@@ -726,7 +726,7 @@ tas(volatile slock_t *lock)
typedef unsigned int slock_t;
#include <ia64/sys/inline.h>
-#define TAS(lock) _Asm_xchg(_SZ_W, lock, 1, _LDHINT_NONE)
+#define TAS(lock) (*(lock) ? 1 : _Asm_xchg(_SZ_W, lock, 1, _LDHINT_NONE))

#endif /* HPUX on IA64, non gcc */

That resulted in these numbers. Scale factor 100:

1:tps = 5569.911714 (including connections establishing)
8:tps = 37365.364468 (including connections establishing)
16:tps = 63596.261875 (including connections establishing)
32:tps = 95948.157678 (including connections establishing)
48:tps = 90708.253920 (including connections establishing)
64:tps = 100109.065744 (including connections establishing)

Scale factor 1000:

1:tps = 4878.332996 (including connections establishing)
8:tps = 33245.469907 (including connections establishing)
16:tps = 56708.424880 (including connections establishing)
48:tps = 69652.232635 (including connections establishing)
64:tps = 70593.208637 (including connections establishing)

Then, I did this:

--- a/src/backend/storage/lmgr/s_lock.c
+++ b/src/backend/storage/lmgr/s_lock.c
@@ -96,7 +96,7 @@ s_lock(volatile slock_t *lock, const char *file, int line)
int                     delays = 0;
int                     cur_delay = 0;

- while (TAS(lock))
+ while (*lock ? 1 : TAS(lock))
{
/* CPU-specific delay each time through the loop */
SPIN_DELAY();

That resulted in these numbers, at scale factor 100:

1:tps = 5564.059494 (including connections establishing)
8:tps = 37487.090798 (including connections establishing)
16:tps = 66061.524760 (including connections establishing)
32:tps = 96535.523905 (including connections establishing)
48:tps = 92031.618360 (including connections establishing)
64:tps = 106813.631701 (including connections establishing)

And at scale factor 1000:

1:tps = 4980.338246 (including connections establishing)
8:tps = 33576.680072 (including connections establishing)
16:tps = 55618.677975 (including connections establishing)
32:tps = 73589.442746 (including connections establishing)
48:tps = 70987.026228 (including connections establishing)

Not sure why I am missing the 64-client results for that last set of
tests, but no matter.

Of course, we can't apply the second patch as it stands, because I
tested it on x86 and it loses. But it seems pretty clear we need to
do it at least for this architecture...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#39Robert Haas
robertmhaas@gmail.com
In reply to: Tatsuo Ishii (#38)
Re: spinlocks on HP-UX

On Tue, Sep 6, 2011 at 4:33 AM, Tatsuo Ishii <ishii@postgresql.org> wrote:

I am interested in this thread because I may be able to borrow a big
IBM machine and might be able to do some tests on it if that would
help enhance PostgreSQL. Is there anything I can do for this?

That would be great. What I've been using as a test case is pgbench
-S -c $NUM_CPU_CORES -j $NUM_CPU_CORES with scale factor 100 and
shared_buffers=8GB.

I think what you'd want to compare is the performance of unpatched
master, vs. the performance with this line added to s_lock.h for your
architecture:

#define TAS_SPIN(lock) (*(lock) ? 1 : TAS(lock))

We've now added that line for ia64 (the line is present in two
different places in the file, one for GCC and the other for HP's
compiler). So the question is whether we need it for any other
architectures.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#40Tatsuo Ishii
ishii@postgresql.org
In reply to: Robert Haas (#39)
Re: spinlocks on HP-UX

That would be great. What I've been using as a test case is pgbench
-S -c $NUM_CPU_CORES -j $NUM_CPU_CORES with scale factor 100 and
shared_buffers=8GB.

I think what you'd want to compare is the performance of unpatched
master, vs. the performance with this line added to s_lock.h for your
architecture:

#define TAS_SPIN(lock) (*(lock) ? 1 : TAS(lock))

We've now added that line for ia64 (the line is present in two
different places in the file, one for GCC and the other for HP's
compiler). So the question is whether we need it for any other
architectures.

Ok. Let me talk to IBM guys...
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

#41Tatsuo Ishii
ishii@postgresql.org
In reply to: Tatsuo Ishii (#40)
Re: spinlocks on HP-UX

That would be great. What I've been using as a test case is pgbench
-S -c $NUM_CPU_CORES -j $NUM_CPU_CORES with scale factor 100 and
shared_buffers=8GB.

I think what you'd want to compare is the performance of unpatched
master, vs. the performance with this line added to s_lock.h for your
architecture:

#define TAS_SPIN(lock) (*(lock) ? 1 : TAS(lock))

We've now added that line for ia64 (the line is present in two
different places in the file, one for GCC and the other for HP's
compiler). So the question is whether we need it for any other
architectures.

Ok. Let me talk to IBM guys...

With help from IBM Japan Ltd. we did some tests on a larger IBM
machine than the one Tom Lane used for his test
(http://archives.postgresql.org/message-id/8292.1314641721@sss.pgh.pa.us).
In his case it was an IBM 8406-71Y, which has 8 physical cores and
4-way SMT (32 hardware threads). Ours is an IBM Power 750 Express,
which has 32 physical cores and 4-way SMT (128 hardware threads),
with 256GB RAM.

The test method was the same as in the message above. The
differences are the OS (RHEL 6.1), gcc version (4.4.5), and shared
buffer size (8GB).

We tested 3 methods to reduce spin lock contention:

1) Add the "hint" parameter to the lwarx op, which is available on
   POWER6 and later architectures.

2) Add a non-locked test in TAS()

3) #1 + #2

We saw a small performance improvement with #1, a larger one with #2,
and even better results with #1+#2.

Stock git head:

pgbench -c 1 -j 1 -S -T 300 bench tps = 10356.306513 (including ...
pgbench -c 2 -j 1 -S -T 300 bench tps = 21841.10285 (including ...
pgbench -c 8 -j 4 -S -T 300 bench tps = 63800.868529 (including ...
pgbench -c 16 -j 8 -S -T 300 bench tps = 144872.64726 (including ...
pgbench -c 32 -j 16 -S -T 300 bench tps = 120943.238461 (including ...
pgbench -c 64 -j 32 -S -T 300 bench tps = 108144.933981 (including ...
pgbench -c 128 -j 64 -S -T 300 bench tps = 92202.782791 (including ...

With hint (method #1):

pgbench -c 1 -j 1 -S -T 300 bench tps = 11198.1872 (including ...
pgbench -c 2 -j 1 -S -T 300 bench tps = 21390.592014 (including ...
pgbench -c 8 -j 4 -S -T 300 bench tps = 74423.488089 (including ...
pgbench -c 16 -j 8 -S -T 300 bench tps = 153766.351105 (including ...
pgbench -c 32 -j 16 -S -T 300 bench tps = 134313.758113 (including ...
pgbench -c 64 -j 32 -S -T 300 bench tps = 129392.154047 (including ...
pgbench -c 128 -j 64 -S -T 300 bench tps = 105506.948058 (including ...

Non-locked test in TAS() (method #2):

pgbench -c 1 -j 1 -S -T 300 bench tps = 10537.893154 (including ...
pgbench -c 2 -j 1 -S -T 300 bench tps = 22019.388666 (including ...
pgbench -c 8 -j 4 -S -T 300 bench tps = 78763.930379 (including ...
pgbench -c 16 -j 8 -S -T 300 bench tps = 142791.99724 (including ...
pgbench -c 32 -j 16 -S -T 300 bench tps = 222008.903675 (including ...
pgbench -c 64 -j 32 -S -T 300 bench tps = 209912.691058 (including ...
pgbench -c 128 -j 64 -S -T 300 bench tps = 199437.23965 (including ...

With hint and non-locked test in TAS (#1 + #2)

pgbench -c 1 -j 1 -S -T 300 bench tps = 11419.881375 (including ...
pgbench -c 2 -j 1 -S -T 300 bench tps = 21919.530209 (including ...
pgbench -c 8 -j 4 -S -T 300 bench tps = 74788.242876 (including ...
pgbench -c 16 -j 8 -S -T 300 bench tps = 156354.988564 (including ...
pgbench -c 32 -j 16 -S -T 300 bench tps = 240521.495 (including ...
pgbench -c 64 -j 32 -S -T 300 bench tps = 235709.272642 (including ...
pgbench -c 128 -j 64 -S -T 300 bench tps = 220135.729663 (including ...

Since each core's usage is around 50% during the benchmark, there is
room for further performance improvement by eliminating other
contention, tuning compiler options, etc.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

#42Robert Haas
robertmhaas@gmail.com
In reply to: Tatsuo Ishii (#41)
Re: spinlocks on HP-UX

On Tue, Oct 18, 2011 at 12:11 AM, Tatsuo Ishii <ishii@postgresql.org> wrote:

That would be great.  What I've been using as a test case is pgbench
-S -c $NUM_CPU_CORES -j $NUM_CPU_CORES with scale factor 100 and
shared_buffers=8GB.

I think what you'd want to compare is the performance of unpatched
master, vs. the performance with this line added to s_lock.h for your
architecture:

#define TAS_SPIN(lock)  (*(lock) ? 1 : TAS(lock))

We've now added that line for ia64 (the line is present in two
different places in the file, one for GCC and the other for HP's
compiler).  So the question is whether we need it for any other
architectures.

Ok. Let me talk to IBM guys...

With help from IBM Japan Ltd. we did some tests on a larger IBM
machine than the one Tom Lane used for his test
(http://archives.postgresql.org/message-id/8292.1314641721@sss.pgh.pa.us).
In his case it was an IBM 8406-71Y, which has 8 physical cores and
4-way SMT (32 hardware threads). Ours is an IBM Power 750 Express,
which has 32 physical cores and 4-way SMT (128 hardware threads),
with 256GB RAM.

The test method was the same as in the message above. The
differences are the OS (RHEL 6.1), gcc version (4.4.5), and shared
buffer size (8GB).

We tested 3 methods to reduce spin lock contention:

1) Add the "hint" parameter to the lwarx op, which is available on
   POWER6 and later architectures.

2) Add a non-locked test in TAS()

3) #1 + #2

We saw a small performance improvement with #1, a larger one with #2,
and even better results with #1+#2.

Hmm, so you added the non-locked test in TAS()? Did you try adding it
just to TAS_SPIN()? On Itanium, I found that it was slightly better
to do it only in TAS_SPIN() - i.e. in the contended case.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#43Tatsuo Ishii
ishii@postgresql.org
In reply to: Robert Haas (#42)
Re: spinlocks on HP-UX

With help from IBM Japan Ltd. we did some tests on a larger IBM
machine than the one Tom Lane used for his test
(http://archives.postgresql.org/message-id/8292.1314641721@sss.pgh.pa.us).
In his case it was an IBM 8406-71Y, which has 8 physical cores and
4-way SMT (32 hardware threads). Ours is an IBM Power 750 Express,
which has 32 physical cores and 4-way SMT (128 hardware threads),
with 256GB RAM.

The test method was the same as in the message above. The
differences are the OS (RHEL 6.1), gcc version (4.4.5), and shared
buffer size (8GB).

We tested 3 methods to reduce spin lock contention:

1) Add the "hint" parameter to the lwarx op, which is available on
   POWER6 and later architectures.

2) Add a non-locked test in TAS()

3) #1 + #2

We saw a small performance improvement with #1, a larger one with #2,
and even better results with #1+#2.

Hmm, so you added the non-locked test in TAS()? Did you try adding it
just to TAS_SPIN()? On Itanium, I found that it was slightly better
to do it only in TAS_SPIN() - i.e. in the contended case.

The actual test was performed by one of our engineers at my company
(Toshihiro Kitagawa). I think the answer to your question is yes, but
let me talk to him to make sure.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

#44Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Robert Haas (#42)
Re: spinlocks on HP-UX

On Tue, Oct 18, 2011 at 10:04 AM, Robert Haas <robertmhaas@gmail.com> wrote:

Hmm, so you added the non-locked test in TAS()?  Did you try adding it
just to TAS_SPIN()?  On Itanium, I found that it was slightly better
to do it only in TAS_SPIN() - i.e. in the contended case.

Would it be a good change for S_LOCK() to use TAS_SPIN() as well ?

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com

#45Robert Haas
robertmhaas@gmail.com
In reply to: Pavan Deolasee (#44)
Re: spinlocks on HP-UX

On Tue, Oct 18, 2011 at 2:20 AM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:

On Tue, Oct 18, 2011 at 10:04 AM, Robert Haas <robertmhaas@gmail.com> wrote:

Hmm, so you added the non-locked test in TAS()?  Did you try adding it
just to TAS_SPIN()?  On Itanium, I found that it was slightly better
to do it only in TAS_SPIN() - i.e. in the contended case.

Would it be a good change for S_LOCK() to use TAS_SPIN()  as well ?

Well, that would be sort of missing the point of why we invented
TAS_SPIN() in the first place. What we found on Itanium is that using
the unlocked test always was better than never doing it, but what was
even slightly better was to use the unlocked first test *only when
spinning*. In other words, on the very first go-around, we use the
atomic instruction right away. Only if that fails do we switch to
using the unlocked test first.

Now it's possible that on some other architecture it's better to do
the unlocked test first every time. But it seems somewhat unlikely,
because in the hopefully-common case where the spinlock is
uncontended, it's just a waste. If you're having enough spinlock
contention that the first TAS() is failing frequently, you need to fix
the underlying cause of the spinlock contention...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#46Tatsuo Ishii
ishii@postgresql.org
In reply to: Robert Haas (#42)
2 attachment(s)
Re: spinlocks on HP-UX

With help from IBM Japan Ltd. we did some tests on a larger IBM
machine than the one Tom Lane used for his test
(http://archives.postgresql.org/message-id/8292.1314641721@sss.pgh.pa.us).
In his case it was an IBM 8406-71Y, which has 8 physical cores and
4-way SMT (32 hardware threads). Ours is an IBM Power 750 Express,
which has 32 physical cores and 4-way SMT (128 hardware threads),
with 256GB RAM.

The test method was the same as in the message above. The
differences are the OS (RHEL 6.1), gcc version (4.4.5), and shared
buffer size (8GB).

We tested 3 methods to reduce spin lock contention:

1) Add the "hint" parameter to the lwarx op, which is available on
   POWER6 and later architectures.

2) Add a non-locked test in TAS()

3) #1 + #2

We saw a small performance improvement with #1, a larger one with #2,
and even better results with #1+#2.

Hmm, so you added the non-locked test in TAS()? Did you try adding it
just to TAS_SPIN()? On Itanium, I found that it was slightly better
to do it only in TAS_SPIN() - i.e. in the contended case.

Here is a new patch using TAS_SPIN(), created by Manabu Ori from IBM
Japan. This patch also deals with older Power architectures, which do
not have the "hint" argument to the lwarx opcode.

According to him, the patch resulted in much better performance than
stock git head.

Stock git head without patch:
pgbench -c 1 -j 1 -S -T 300 tps = 11360.472691 (including ...
pgbench -c 2 -j 1 -S -T 300 tps = 22173.943133 (including ...
pgbench -c 4 -j 2 -S -T 300 tps = 43397.331641 (including ...
pgbench -c 8 -j 4 -S -T 300 tps = 73469.073714 (including ...
pgbench -c 16 -j 8 -S -T 300 tps = 151094.270443 (including ...
pgbench -c 32 -j 16 -S -T 300 tps = 166752.637452 (including ...
pgbench -c 64 -j 32 -S -T 300 tps = 148139.338204 (including ...
pgbench -c 128 -j 64 -S -T 300 tps = 115412.622895 (including ...

Stock git head with patch:
pgbench -c 1 -j 1 -S -T 300 tps = 11103.370854 (including ...
pgbench -c 2 -j 1 -S -T 300 tps = 22118.907582 (including ...
pgbench -c 4 -j 2 -S -T 300 tps = 42608.641820 (including ...
pgbench -c 8 -j 4 -S -T 300 tps = 77592.862639 (including ...
pgbench -c 16 -j 8 -S -T 300 tps = 150469.841892 (including ...
pgbench -c 32 -j 16 -S -T 300 tps = 267726.082168 (including ...
pgbench -c 64 -j 32 -S -T 300 tps = 322582.271713 (including ...
pgbench -c 128 -j 64 -S -T 300 tps = 273071.683663 (including ...

(Graph is attached)

Test environment:
Power 750 (32 physical cores, virtually 128 cores using SMT4)
mem: 256GB
OS: RHEL6.1 kernel 2.6.32-131.0.15.el6.ppc64
gcc version 4.4.5 20110214 (Red Hat 4.4.5-6)
PostgreSQL Git head (0510b62d91151b9d8c1fe1aa15c9cf3ffe9bf25b)
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

Attachments:

PostgreSQL-ppc-TAS_SPIN-20111228.png (image/png)
U�sssKK���p����f�-~�w	���"��������E"��m���_�^VV6}�t]]��/���t8l���(4�V3��DpE���x��G��>�����2����($$$33Q�����q��$*+������q��5�L���0��'9�qd���=R���������c�3��|����GrsL�@�G
�Z�g�
����������$�%�����Z����W���rF����:��5Tk5P�>��������GM�&T^N�x�����&===�@������E2oL��K����w/&&���K�������&M�4i����7�IE���M�2E�2`�sssTz��9�[�[�jz�dM	���;v����3gf��=~��i��}������7n���{�_|�r�����������%%%�'O���h{2���m666�9s&!$33����������c���c���C����1*'l��\�D���������h��c���7VWW��t��Jjjj@@@UU��������+W�B��s����!����B�P}������~������#F�D�;w��8q"  ��|� IDAT�����������k���z��]�P��U2V��_����;�l6����...m���-,,���lll4{����������	!������������ �������-]��������G���;W��l���'O���geeu�u���a��ukH~~���>�h/��\M�L����jP������	77����v����L&���Z��mcc���.�;o��c��555Bjjj�\n�!���^^^�d@���/jkk����������!QQQ��@{���]7�^���/EgQ���?��m[q������d}����i���gmm]]]�T*)��0j��uk���m[

����.BHw�����B-�[����I�#+yK4��h�,���K{��U���boo�z*��[��JeWf4������,���V�T��}�������&Lvvv�H$����FYYY���=����!yyy�P@K9F8��i�k�a����f���?s���m{g���cg���u��aBHuu5!$:::88x����O�������O����|~��|>���kv�X,�]������p��62�����Y��LL�@/��G�=z�K?��������=z��9�kk�����B��Y�BCC�;F������
���</77��.��n
���l6
h�j]Q����S'�ru�V���r��C����4��zNMMMS�L����N�R/
Q�������t������
�J�666���Ov�d2BH��H�R\(�uD"Z3
���^5BBB��;7~�x�������3##����wB���S�N-..�t����@��� 55�]Wjj����P(���9�h�m�/��MS������M��:�^5~������/��r��������/���}}}_������Q�F�d�[�nu�~���h��a��S�N��
�������<g���f����!����P@��V��v�*�a�z}�������"����DB��X�j�w�}�"���������P(���������jnnNLL\�jULLLMMMBBBXXXTT������[�`0�L�"�
���w������-Z�a��C��;���
h�N���h
�N��G
;;���lB������wU�uuu�G����+((HII���������p?~�x�b.����s������o�����0!!A�T�������;6000&&����������MS����e�:������g�����������[[[�=z���G�����s��K9���"���QT���Y"���'�{Z�s�6&��NW�������6��T^�mL���YRRB�p8���KJJ�<yryy�'�|���7++�'s!��d���v����Cz9�.���h�#�z��"�����pT�5*''�����dTUU����^B5u�����U�� x������r���P;��CW�+�)�VY2�:�����555��������B�vW+�Lv��e===[[[�������Gk1u�5�M�v��5��;w�l�K��6n�������;u���	hC���g���fT(�����KOO�����g����+�E�*�0u�5\]]U���O���~~~(@�%
����5�(hM�P		A�z3��f���f�f6��Q
��:{�\.W�Uihh���.((@�z]��c�c��
L���F��+W�:uJ�q\\�����%K������tN;�hJ�����6jdddxzz�>��i�����R��S�:���\f�������gc��8jdee���B����]����7i����WVV�v/������bxE��RT�8j������z����C��(OQEQ:::���"
�(L��v�,1�?~�����M���To����bjjjhh���f�����nx������v�Z�q���w�}w�����������p/��)�U�(���4��P�077���G)��q���(�(�p|�1ni�K#��������E"�3���]+**23��Kfv���?{E6rh����R���e�����1b��111���?F�z��)�#*Q���;Qc���\�f���W��<o���(@�B�>�B)@�t�,���#[�l�={vyy����W^���+--(@�\�c�7<��	h���j�7�]cuuuUU
�c��L�=��#1u}.j�D�����k<y�$���-
�������S'��:�@Y�l���~ZRR@���?q�DTTThh(
�3W&��d|�����Q��w������y��M�!���~�aHH
���L���Z�j@���5k��\�2//���X �������j=�i�E(L�@_��&�)���h��g���&�`��%����L�H����e��:D
�4��N��c�Q4��'s��M���n�������,g:�u���b�5@���;Qt���G�(++k�����������T
�1���$�$��,L�@�A�(�i}�g�X�t)!$..n��	E555������{�];�����m���>��Ee��9Ca~]z����%�%��f������]���}5222V�Z��x��M���?v�����7n����EQ>���-))�<y���G������7o�������;V����kvHW0�L��\.���W$t�����A9��tB5�7��~����`�����FFFuuu|>?66���K�T�����bccc��wjjj@@@UU��������+W�B��s����!���		1b�H$�s�Naa��'����!]�s�����2<<� <�-�����:��+]:Z�j@/'�����4pW���5;;������cC����R�x�(JGGG����rw��=q�DU�����������#""RRR�-[v����s��^�l���3/\��a{~~~VV�������������7�T�!�A:Z��}Ogw5����\�2m������;w��=���������YY=�4z�����m+..~��wSRR����]555�@.�{{{?���!K�<{�w5��
�p��G����jP�{w5:{���k'N�x���w�}w������������wbmm]]]�T*o��5u���]���c��!�t���!�������1�dn|�8k�u}Sg� ����������q��)S�L����q����;���H$����z���!����ipH^^.xA0u�:jB�R����*�H�ry��5k���3KOO���:|�p}}�j]j�X,!��v>���5M
���qqq�P�����v'�Ie�`�.��Q#66v���t:���\�K���F]]����G�=g��BA���eB�j����r�������jd�\.g���P�E0�hn|�8�_�-\L�@���7o��={�LfO�SSS��)S�r��S�h4������i�}KUd2!��v�TjccSXX��!R����
h��T�a��4S'��u�,����!!!=�3jkk�N�Z\\|��%�V�B�055������mll48���
hE��r��T���:�~5���RRRz�l���G�%��n���v��9s��:u���I�������4c������58$88
h��Es�x���Y�:�~5>���M�6����9���???�B��7�TUUe�Wss��E���)S�b�B��{�nPPP``��C�:l�����\(�Az�z�����0u}�3��������<i�$�������?�s��xzz�mOLL���LHHx�������J%���5k����Y,V���CCCM
�U�D�����~?������B�/_^XX�a���g_��677K$�PH������!��n�������9�mJ+Y�j����3P"##���{������@����j�*����1�L[[���kv�_�W�����,�9����jH�R77������__����m��9::>�6
x6�8owV�+�Wa�����j������$CCCU�\._�x�������t��^</mSZ�!�u�Hgw5n�����}}����{������v]���N^��	 j�QYY�����������TUU�v]E�m�
}6�D�?qww��o_���G������k@�,�[�xY��Z
[Q
�o:[�:o�<������������/�9rd���(@���9p(��@�x��i�jjj6l��g�U��
��'��q��Y���Z��@����o���[o�������������h5�������i�:D�?\�~���l������/^���QwI$��>�={6j�9�}���J�G`�5����>�y��L&322
		����p��3P��+'��"{6�D�6�]�FQ���!���3J�����O�w(������yK�}^��TL�@?����C�3!�������g�KK���F��i�%����K��V������)}������ ����L���\.�����E��9)8Z6u�iS��ke��������;Pjjj���	!


������d2���������S���<�}S'�}�~�z!���RI}�q��M�7��	/$jL�6���k�Ow�����F�m������D��k��	�4u��*H�����={���	!T(�����KOO�������pP=����^�5S'��K��u����'��FFuq�45\]]U���O���~~~(@WX��4�o��%�u@��:��j���������KM��M�--�N;z4o�0�A����l:���45�BBBP �.����c_2��N����=�������{��V*m"�0�t;;��/OG�~�|�RIQ!���TV���4����TRc�@����A��..���t�����ahh�t	E��:+8
�2�K?��4�REVVmvv}MM!��a�DF�6ut4pv ��?nT���l���TR�<�Nee���C��!uu���5ii�))�?��'�+�Hd�Jr�|<��7jH$6���r�6VVV
4�P�<ki�d����L���P��ee���E��z�\A16f�DS�Z::�D,,�iO���3��y��
)ry�LF�������~~�~~����V*+�.-�:5�����S�
	!�r��

���q0���o�����kg����1''g��I%%%��
������_M�����������E^^Ck+E�}�h����������������lt��$8�V�3�������pu0c�5!���1-�&5�:9����2�<��������yxF����qwwo�8h� �L&��x<�@(���S+���N�TV6�WZ<zT+��)�04[[�����^�P���p��f<l�q����dYZ���B���������O���+��G��c� #�� jt���V��W[2�L�T644 jB��Z%�|�����D��i��]_^�D��g������U7-��8Lfo�m��0|}�}}�	!
%�''W��V������"B�@�?p��*v��q�����Aquu������o�6FEEq8+++��U�����dbI������(T~�z!g]vv]]]+!d�����W�
D"�P;~C34''''�<KiiSrrUZZMJJ�/��)���f��s48���j�f3p-�����,\������@__����k��EGG����pL�h/���Is�R,���R���������>���h0}����������~��@�7n����,�ii5���O�\��������
U�C ���������skjj6n�x��!U�����~�v�Z����������-,�����S'��:��km{�"?�A���t�����`�H�Hd 
02����pt�����V?9����4���`�y�>����-O���z������Q�F�������j277���G)�v����x����BB����G�<F���J3?��pHyy�zG���Z�DN��������
U7-8���RZ*OMU���*'�^5���j�ZX���y�^D.�����D��tWC�Tn��u��]R����D��������Oo��	����3��>�}!C	�H�TE�vB)~�X�@QD"�gf��nZdg�UV6BX,�����a&NN��vv]]���1�@_ ��U3BHC�"#�&9�:5�����#Gr�tbg�4�h� ��h�����������5k�xyy��5K�����������n�S'5~�T~�r�j����VB�������	�U9��Y���y��������B�J��q]rruZZ���g�B�|���F��J8�mO;��7�o��._���������q���-[���=[u?C��W^���+--(�w�Vz>�����L�r''��3�X����5�N'��N�fE�J�T�#5����2�������
2rw7tw�v��������	!}��i�����6zG�(**7n\�������*���&�����-E"T��������Y�y����'�Q��4;;�*v�57�o�3!J%���H�&j�D�������{mO�<�f�qK�IE=x�`��1�+����>���z�
�d\�?Cu~	9�����+����99�g� ��h�4Z�R�#�����D�P�_��b���#F�����,Y����l��O?����$  ��������W#&&���z�����3g�����=���x���������7o�������;�mzZ�s�
�cb���ldd$
qEh;�J7�q���)�8%�ey$�3++9��&��On�N��HS���C��^�����6��x�����7o���1Q���`��w����\����������������������������������1B$��s�������l������3��]��/������w��`���A�(Q���Z����J��~��h��c�����=���Uz�z�����hnn���+../z_����3fTVV����n-��lmm����.]�������G���,[��������YYYv]�pa��a������yD
�>�����{_S;������a�����]��G����d�X,���:�6r$��7��1B�^f��J�����D"���m���Y���V*�666��������8x�`��������7g���C,X����n���qqqv��roo�n
�������@���U�MH���9|��xU���O?	z�[�-J������ss��w--��;W\\�hg�	
�7N�ba��5:[�;v�X:�nnn�b�^t��(j��E7n��hOv���p��v��n��7o^�CC�1c������}���K�.M�:�[Cbcc�����G���WK/����;��r�DK���l�<h�����6l�<���G������������?��k�k�Y�����5����������g��n��xn�_��W�]uuuJ��������OLL�0aBpp����D"���lw���U^^���!�������������.�yn�Cc�����D���o�<���������A��i~~�~~����g�]�$9}����h�t__�^�����=�|HHH����UWWB������V�^���COO�/���������;���gggw���7��!|>_"��B�WRSk���7��R��;9����;<�9�-kk�����N
_�JT^�����9~����u{�w5���RRR���_�YZ[[������q��f�


=v��Fk���"��y<^nn��]��)�"���l6.��#)�J�f��������G����,c��)S,���)������;v�YP����y!Q���>
2d�����?Q�?�jpp��_~Ii�i��T*���),,|�K&�uw�T*������O��W�����V��i��+r���KC�
b$�6�;W|��$&F2x07(�j�HS\��������&M���t�����/��kkkU�����JMM9rdQQQ�]�1d����P�E������0S10sqv��"��35�[��n��Wn�(;}��_�J���&O��<������t]gk5rrr����x�
�V�g/�h�� IDAT�������
N�:����nONNNJJ
�3gN�]3f������`\(}?g���\?h��5um:r��������?��/oo��cy�f�m�������45"##������zk�����>����U�V������$$$���EEE�^�z��uc��)b�X�P��{7(((00�����Ev:t��C���q��m~���g�N��Z9I����\\����<9���_IM�Y�<q��{/J���(N�z����p8�?^�xqII	!����_�~���L&3!!����vssS*�,k��Y�w�&�v��b��;W	@��r���]�MtX�����D��������9s��nI��-��-����I�S�Z��"O�l����z��f���]QTT�b����?;���Y"���'��zZ�s�v�F��n����S�����a�.�����3g�._.mjR���Yyy�h�`���6&722�s�Nox��V@��:��%���r���{�`xN���?�\z�LQAA��
+(�z��������Z
�����O*g���o�y���3�3���pt�M������m�P�������o�����[��-�W@�d�
�u}H� GUD���4�F#>>�>>�������%��
bd=b�NG��Ho�W�9Q�%����a��w4l
�fa��t��?�iw�j���E�*�O�j9i������'�}5���!�a/
���w���\a5:�����,����#�0zz�I�,&M�HM�>}����r��>��WAAV��5��k���n���[���!���T"g@�8�;p �������,���W���Y��������c���{������@����j�*\:�e?�jt,;Y����:l39z��1s��W�����W���E�7g������S�X�������I�T*uss

-((��������m�����
���������J|���~[�3��E^Z`����C�>��d���q�~���A_^���]����@���dhh�j����/^�`���wq��V`����5�+����kky(���k�:/]�p������U���q�����7���koV����q�FHH�:gB���CCC���WW��{@pS�nK<J��g\F���f��Y�l����u0��4}zlddvaac_�6;��QYY�����������TUU�*���$����{
���b��� �K����������
��+�tIr�t��7o�4k__���~���pww��o_���G���tkkk\���_0w����B��rh�����N
�}�T�����9~���V�w����Fhh��y�(�3f���OZZ�/��r����������LxT��!�#�A�{��fZ� �EX,����S�Z&&V�9S��7���>w�X�� +��O�,jL�6���f��
{��Q�p�����~�<@���4���Vg-�$�%��MB�m5d���!FRi��s�11�11����AAV#G�����Uy��6&�x�L�����Wv�dW�B�����W�����<P����]^:z��s���U�'�����-Q�ZZ�7�N�.�������'[N�lil�T�������RFF��w�������K$6���r�6VVV
4/<�@���g�o8 ��-��pm���[�&��dd��;W|�Z)!���M�|�RG�����f`��g�O{���o�y���v�999�����
�
�����S/��u�>L��}���ah��������MI�^�<��w��r!D����ZV�H��k�=��Y����qwwo�8h� �L&���z@/�*dy.�l-�P���5~�5����H�������N����P(H]]���'���9:��������e2�R�lhh�+
����<�
��5�+�{�S��@3�����F���U���UU-���Z5\]]�|�kTT������nx�x�<���[e��Com�=�5�����x��v�����i��&N�-��t�f�>�`���E������\�v-:::44/0�tfW���:�cMl�a�&g??c���y�����5]�?��
��&l���T@�%�/���(���]�����(�w�0�+�4%������@�DT����c�]~�/?�2�x=���>��y�s����dK9t(�u�aZZ��i���;t��
�*5�BaXXXTT��EOOo��5�W�����ecc����3D1�}�<t(�����B�����u����O�Y��o�z|����/^���������XXX������`\�uR��6���l���5���t���s��G{�}[BHTT^7�3�_jB��������o_�%�v)�q�����
�m����f�nx�}[e��
iax@]��4�
.Fw�>�������n����Pj�z�����g=f-���HU���n��ltJ
�N��Dw��Z|����w�vwt�G�����\����R��� ����{�����@J�+W�����t:����������Kx��]�������95��{{�@CJ�/��"))���NKK�F��d2�TJ�R�T�T*e0�V�
G'�+b���s�se����Q�W_����:@���1��C������w�	�X,
���{''������o����|��O?��N�W��w+�
.OVzR_�g��u�j�����-Z���*_2����744�Z�������
+--�����������z����'�\����~n��#"�8&�@-)��f�?~�����GFFF��o��eoo��B�P���y���Fh��oD��l'�#BV���t/aaa<h��GU{5������6o�����o����P�\[[kl�J�;p��9�@��1'''...??���u������V{������Mo�jdddkk�	�~t}^�<��(�_��T�SYo�U���.z�{�I@��a�B�h��/��r�����������������_?v���]��,YB��}{zz��#G:q�V�Xq��[[������������xyy988$&&�8qb��1m�w�*��rKK�^�zEDD`
t�V���:W�=�_�����K��p77�s���X|�������*5�r��;v��U^^N����8q��)Sf���\ 77�F������6���'3f����2d�r���=##���-::�q?���KO�<y���a���l/,,|��ag�RXX����R������z�N���������G��p75E����F��O�
��|Z�\.�p8�>r���={<�Xj|��III��B�X<x����{���}�vg��g����xu����P������������n����
(5�s�����7n�x<�X�����?���V(���kY����5ma��c���x����S[�'$$��u��VIHHhO���f�_�_j =�2~���ff:;w42B��!T�			����R�����U�4m��������z���������j{AAA'�RPP����'�lp���wk��$++��;����Qj8p (((22R[�u�yQ[[[SScjj��]Y��l755�����ULMMo�����*X\�p��X�Qur��������v�����L=��8{�lbb���3!:::
��B��,���b����(??�SV��L&n��8?s�|�����1o&�������;w��X4���Qucr�����_
ii���UTT4k������������V)//����D�L
b����w}�g�t��O���g����:���������}�[��r333�5*[Zm�p8��
���D�,����6���.�=>8��O3�w�pc�p�������uuu���S�N��o]��s��������olIOOOKK�1cF����J@@&
@���h>`v���'����n]�����m�����F^^�������)J��u�V��?�F�M�2%77W&���s��w�������j���7���U�x�
L�����v[���dg~�y���~}�����_��E���T����wo7�J6�}����s������r�1k�������`�����`��<�R���j�j�E��/{���i��N��s4[�������eee��O���.\
���6{���������q��f��j��sWQ
wh��#�������l��=|�qX�+�NE�����1��1c����|������SNNN����v�=
J
�V��_�_l)N��~!���/�G�4��3�=��h~%66V�PhiiB���PR�s1�����[�td"[���fN[������Y����o�?f���u�4����K��O����@�3
�C����6��{��_����]��g��'�T��������w��;�������w@�T=��tJ�����-��������N�3zz�QXX���x��AMM�����T���uttp�'����O�*��|��o�j�#�MPjL�6m��i�����V�Z����l)((X�hQpp0:���_����}�m����u@��jW�'�|�z���:�bkk�m�������8h���s�O�A��R�uiii�n����XWWW\\�����5��|y_�(5Z��r��������J�����;����T},^l�^��T��|���!!!�������
����~����'W�Z�<Kz,F	�����%K"W��\��R�O�F.\hffv����k�������L�4i���3g�D��X�9��c��2������wJFLo���}�r�-��N@�����x&�9k��Y�f������R�TB��+W&M����Y�8��s�c�jXo^�x�c�O�����c�����������=�+0G���h����[>�J�2����j�@A�R���s�qX�Y�}j���_6�L��P(HzZ������z��P(���P����:�Pj�)==].�������?������9���	h"��+�6:e:UU�2*�V
���\���t�WZ*����=�3b�����x<�L��_�a�g���RC[[��K]]��+WZYY�����4�6_��W�^�����*FUm*�v�nhP$�U\�����g4���l�*GC���_��bEjYY������3�9�F�f���p��g�������4�����cq��P��IO�g�Y�=~\{����O�Bi�~z��;�g�f���'������6o�gb��z�#��`�������������0�l6��4�T�'-�ST�vI�������7�>xP�f����x�M�~����##��]n�Iho��w��������������������[5� &�&��9�b����J����~vi��7������mF�����F|||~~~�K*�jff�j��e����4��bk�=�e�3k��d���t)�t������R���n`�������.�
^I�q��qt�F���z��ec�S�����,z�����s�)_V���Q��u�'��*5@�h?������w+Z�|t�M��3��~Y&6����|y��-Y,:
^U�������=w���g��������c~���H���#f�Gn>|^�f�'L��������^�W^jl��)..��&ag����&�L�}����J�LF�5���2����u�Fll����@M���-�1�A�A�E]������/W�������q�'��R�F�1Z�Gm��_1���0��V�m}.��O�V�Q��V����R�^W����c2�M+++KJJ����)]�e����)�
��-�?g�9�{�������>>8��W����o�^�z��M�������<y���Vt*t�OZ[����ig�����I�x�Co����MK���<WW�f����|>�����0�����?��l�"9�������$��#��p��q�Q]]-�H�5��|�\.�Pj�v�f?X�J5{F���NY�{�e�����T�jPj8;;�������o�x��a�emm��xm�����q������/�B���q�f�t����N���U�����P(|||����{����k���^��J�����1�CJ�����[=i�9�����+5�BaXXXTT��EOOo��5�W�F�t���,�hxR_S�*=���	}�Q�p��C��������������%%%vvv::8�u�rr��it�I���D�����9�vS��h����R������<O,7m�����w�(����)S�4����[ZZ�g�c��Y�OG|�l�|���4���9��YHI@����`0���	�FBB�����T���%��x�������>}�����x�����srr������]]]��oaa���c���Py�kv������1#���Z+m�j:����U���������o�[FF<i���*$2B0������_����'���>��������~�!
8P,s���;;;��1���		���rppHLL,..>q���1c�j��*�aii��W���LAh�Y����!��=���*]]��U�6�tD���e����/>e�a����!�W�qu�{%�o���rt t+b������?']�^��g�&&&��:�">���E�5k���X�tittt``��e���3g�<�|�����>|�Utu�H*x�:��9$x����d!��	���w����������|
�}E�n'3*�[��U1��P�y��T}����dddt��
��������XB�l�R]]=t��V�>��U0K�%�t�.|�o����J�����������U�B�=��X�~}xxxnn���D�B�V�?u���-l6{�����V�:�
f	�Rgs>���'H�!)kk�`�}�D�������:GGG�f�!����lkjj�r��[���=���������H��x�z�j���N���t`�xI���N�Xkvi�^W��=n�F^^�������)J��u�&
B�������?���������������555�����W^,��������G/�
���,���eII���s#"rT,�E/��C�j����;l���MCC���?�:k���k�FEE-Y��B�4��!DyMM�v�Xldd����B�0�L�h?�\��a�����d~f��*����]���b{{��zR�������M�>�N�_�pA(�������t+������m�����YEEE���|>!�e{yy9��)..~�U���0K��x<���������*�P1R���l�����f���W�J����iS\\��744		�����F3��b	.������]eK�����%%%/�
f	�1���U�
�YI��4#�}�v"3-)MV%L��	dL���F�J
B���U(��YYY*n��?~�F�9;;��3g����OfIOOOKK�1cFLLL����������*�%���A~�^����w��sDC�6�,�/t�8�,�dT�X&�/(|�@�.�v�V��1J�,&�JMM������?j���������[�`��;�B�����������w����9s�������������v`�vn'���?�MN�''�s���[����g�u����QeL�`����J�!����z���-������S���o�s�����Z�)^)�����?����'�SS�O?����>"�����w���;���E.�3�Y�f}��������=��g������N|�1��{�c�,�MF�	��n�w�/�W������c�������I[��d�-[��������^^^]��%%%�����[����q�\
����������N��&�{�23I���d�=I�����.�I2�����J����W��{5���RSS����0`��m^^^vvvM�VTT���l��I"�$%%��?�D�m�J[[������[4�\N��&'�3�Td�GJmWi
��[�	M��
��E�U�U55(/:��i�zzzQQQ�N�Z�hQyy���������kaaajjjqq���Vhh����]�
�^�������w�:)zC�8P��(z���T�&����W������RCi��io��vNNNrrrJJJff&���[P IDAT����O<==�����SPSACJ
?#�Z�k@��4���bMW�$�R�;��.,5!T*����������C7���H������Br��Pb1^1l������4T
��	P^��R@})$7�&�v��/]N��H��e��N�b��g.U������(5^DYY}��j�5�l����A��N�"mI�~�W�p���/���S1��z!D.'W�>qrbs�L��Ld+R�Vm�4�Nm�U�q��������N��t	��������B�(5��9$xH��f������L�x�&��D7H5t��\����(r��?�Q�5(/Pj�]@�2���.������Z�G��N�5Z���+��C������@��O������`�E�B�8�����{:���~����P�R����J����*_�D������������+���������e�E�����T�I�b�������o����Z�`A��}���������7<J���Q��3Z���l}��X>=�/(�n��j�Fvv���+�?���O�8���#W�\	[�d	����A����iI}]6����W�4}�o]c8BnbNX�e��z��d�=�@K��:99Bjjjbccttt������*++[}�*�pB��������*OF��W�2y��R��%���Fi��x,a�:��(����*��HX�,���m}����@K
gg�G�
<���#�
���$�(
�B�g��p2���S1�'.)�++�(k�������$�.�a15#L�����^������h�
�:�����TB�H	i0h@���Rc����-�6m���{w���l���033c������V��QQZZ���k
�b�E%������GL��1�P	!R������Z��������Nd#��K������������JH����77�
	q�������q��z�j��x��3f(�O���{���dG�x{�9:�w�g�������R���(-�+/����mj���#�5�b�L13�3�)A���mj�}�m�D��:n��R�2w�h,)��Y���I�1��24��066�7o������MMM==='O�<y��)S���^����b���2���OW�d��2^��~�X�tGEii]i���	������:�h�����`���BdZ�+�����[^����:n��%}u���jc�4��������4h����{�������/���������������3���;�!�	!��y����UT�kG��_
_��9Sg�����3����H�AW�J����+���p*E\��#�����8��K�5k���5�������g����/6l��|����
����J�o�-���	��������0������L����NjlN�����9M^gU'���e%���W��Pj��{��:t�Y���+w��UTT��r�}]I&S�_����BA���@5��5�e�D5�72�0��	!DL$zG$���s�E\��F$���X��J
ggg�7p�������T*���}��*+%*�}W�����!�E�QH���"���������[[����d,��Q�������{���FFF��X���_.[��F���������X��[�[���Dy����+������999\.���_WWW.�_�t���j���s��U.����z�j��k��'�jVj\�|��`�7N,��bB��#
EQQ����r���t"t��1b�T*��ys����==���*����J�u7��>�O��
����w]L���3@���WC(J$B�H$��d�o���+W���������Z�����H[���{��%�,[�L#�.**���������G�����_������v��8h� �����o��y���#F�����b��u����}||4rr�Z�������_#�;~��T*
�H�1m��������ZS�P(aaa(,Zjhh�4v�CNN�����5xG]^^^FF�_]]]rr��o����I$����Q�Fidt2�,99Y��p�������������?~���{5"##!����2�l����o���p8SSS��R���Y���K��R�����M�i�F���t��B�'vv]]]�D��������������={F111������2�LS�VUU544���ij�I����
��_&������l&��y�)��O�����X,���O�<a�X���]EE������x1999qqq������������h������N�700���B������������$J�������///�������'N�3=�a(5��������lY�t���'uuu�?(5^�|��������"
-,,����`��@��A������:uj�6�=v����tJ�����z�����������sPj����������:�����@���ttt(�X,n�.�5�Zv�]JKK�����3l����qc�����fff6k����p8��/k��9111����-���iii���}5�!
�o����{'''��3�w��������^�����w���r]]��������;w=�2�W�9�D����\.�BAo����P���@�(5�J
@�(5^��@��;wN jRP
�����			O�<y������4u�bbb&L��IA����W�fff�9���[WWW��M.���y�������:t����E��������}�����H������a����k��g�4�D"Y�h!���V�����011��h^^^������}�������<}�4!d����X,~��������,Y����b���]���&
'O��f�'O�����^q��������d������iF>w��1
[�F_�]��(54�����rvv

��R���������=S����rvv^�b���`ii�������&�{���P(����-�������D�L7??�������k�n����`\�rEc�P��c�:U�W�d%���@r�������$11���A���p8�����AAAG������#D����&M3f�&�]^^����]]]���d2���4/�233/^�x��A[[[��O?�t��m����0a�f���1R�4lk6����J���������L&���b���<��_T���?s������E�����@ �����={>|�o�>
;??����4=
.�J}}}5/�:���1l�0�Ht���������M��������/�;���a;�S=F�*
;���q��(+q�A]|���:::���KMM���:q�DUU�s�
W�(�Ni��y�������L�_��P(�O������?:�T�b�
33�O>���?����9q��F���)S/^L�R�\.!�����_U��V�^M��w�\.��i���T�Q�J�W���8^]��(5�CCC�����y�^tE�(5:�B�������G�����������~�z���_j����d����B�R	!��oW��K�����S[[;000--M�P�����7�F�����]t���4h�N���Z�paaaa�M����1�y�f�I�W���l�� +q�k���������y����������2D�TMMMpp����������]���dn��A#�.''����366>{������c����s��+++
K���r]]�o����bB���y�����w+O�S��<H���KJJ�9s&����k��M�N���11b���C�8
�����xuMV���^����^���E9�JJJ!C�Et�����L�"�cbb(�Dw���C����hiii��m������������-
�r��������n&&&���W�Sb0���NFF��F�`0F��|��={�l���N���4���Z#�%
�=��i�5_�-�����&����=u��F@�@tB����������Hc�n+O�
���I$���}��E#F���t�2e���w�&L�0v�X5�N)&&F��v�L����1R�����a'~���.�J\�����X,Vll,�#�����5���������hLtE�������WTTt��Yu��N��;w�i�L&�|���������������-�����]���P���=�����r�g�Pt��!d����'
;�������������y����%�J����Tj��}�tzZZ���.��P���7RRRV�\�1�egg�3FGG��~��������G���::������O������7����c'�����k&�i```ll�e�������7vvv�n8|���K����������WWUU;v����E����{����++���������o������{��E��	�����[YYu�4�����k�B�tEV���]��f�������������5��E�e��Vgrjj����BM.v}��bbb���-,,�t������+���55�x<��o����M��h>>>����]zz����-,,�Yfaa��W_�d����/4v�����/?U�Wd%E��)��JJJF�}��S��
���O---{��I����"kkk��35�N�n���;�Q���+�J��
��P@�(5PjJ
@��R^��\������g����W}��s��%%%
8��Q�
������������&W>BHLLLII����������;w��	�Tjmm�
��Y'��N�:u�����bSSS5�.##����w[���$��E>w��;F]�������v�!D��%��E�!jqK���322LLLh4���������v��}���4i��N�>M���\��b�[o��`0F��d���v�Z��8�P8y�d6�=y���� ww���`����_�oL&SWW7--M]�P��=w��2
Qj���P�|�\.o���^^^���AAA���X��S`aa���G�={�|YUU����b�
�>���Ru,5T�g�
������MILL�����������Q��>!�6n��`0�\��Fi�::�c��i�s5�������!�Biuw������Ibb����S�9N@@@�����9R__��5�F
�4i��1c4l���������2k�,&�����a���y����������T���O�m�v���	&�Q��N�u}��P?w��qrrz��i�h���g�����Gtj`K666�@.�kLt{��y����}�4o�������<x��@�T*������;t������a�D"��;w����7�BBB"##/^�8v�X�JC����OC��G$������4m����-�]���������2��.##c��u���l6[�����g��e#G����O�]�6w�����#G�jX����:99-Y�D__��w���������S��1�BCC#""�n�����vi�":c�Z��@7u������Y�F3���!!!#G������h�&M��e�������+V��1C�"-))���
����999S�L�9sfjj���RYY9h���>����z��EEEE���m���[�^Kj�~������_:;;�=���kjj���G�=g���n���L&s��
�:|999���������g����;������3g�XYYiR�������~����k,���y�����w+��U��<H���KJJ�9s&����kW�I�G���1b���]��(5������W�6m��_��]�X__?e��X��u,�+W�:t(%%���g����q�F##���/+_.Z�(00���FGG���jR�����?����;��������1�Q�F-_�|��=[�n����$
;]�c�b��B���G_C��Q����Rxx�F�WC��kO�B�����������F���4l�0
P"�0����f��-1b��e��)S����-0a���c��itJ111����j��*�ku�TT�:
q�@7RZZ:j�(>�occ�1qEDD�������_QQ���g5 @:�n``p�����2��������6K����<y��RYYy��5������zzz\.W���1"���9���!�t�]��������}�]UUUUU�����jkk�uhFFFFFF�vYB4������>������O�4�������G�),,������:w��-[���1c�����������?��V�s'N�<9n����G?x���_=���M�����i��GFF6�}�,
Qjt�����O
ww�������Z�


���?{���;*++{���������a������}��><x�D"��h�G�NHHP��7l6;77������RXXXl��}�����o��4�vc�S��r'�@ @t�u��r9���	C���������z�������6��7��a������b���{����'H�P����/3dZZZvvv�������n2F(5�Fuu���G=zt����s�":�� "Btj��(5�iz)�f������P����!@D��zrt�����WPjJ
��RPj����R�����[�(
����#�Pj`���|�pPjJ
��RPj���]Z�M��E#�gt
J
���y11�BC�������'���wq��A��4j8)�3���y��Gd2y�^������'�o�����]�j@sW����F�drBHYY��������-�R:�<�4)Z&��X�Jee�#Gz��������6v�m�����KJ���&Q.oQ�R)��:�64�m��L�h��a��E&����������g���NNNTjgV�qqq"���7�|���d�������}||��T���S\.w��������.����r���r��w���V�$D� �����u�������g1�D��� �P(D ���H����?��1��>1b�����q��_��`���������������:u��fw,����'O���W�B����+++k�Vu�����_~9y��|�6��Y ���/?v/��/N*!����TuF�`0��Uj 48�n&��=Be��v.���7���mW�~�������}��q����d����������� 66V"���|�����,�[m�KF��yE����B>h�b�O�|�M@:hp:t�LV+8W�e��	��^�[gB.\�0u����zzz��x744����>����B: 
���m�6^Ym��vJ�Ay��w~���V�,
������B�o�������}��azzz\.7((���\�����===�L�������y<^����C�<==322Z��P(�n�jooohh������;z��+W�Bd2����������������chh8s���W�6�GEE���1�L���7+M�8q��w�urr�������7o^����~����������q�F�q={�������k�����NNNAAA����{6lPF�j�FEE=x����n��g���w����$���xn/-[�l��]��<x0��2dHll����z�����i��qo+��3}�)�tP�th�����g2J
���F^^U�����������-��J�2�L��&�J�R�����
v���;v���:��������;w������7�]��j�*&���w���z���������_~9{�����i��%%%	�B��nNN����[��wCFF�����N�gii������K���X�n��
7m�t����[�B*++===�����������+6lP~��S�BBB&M�����t���+W��UnbbB�R������������O+7�����������Q������V7;::��n'����WTT4n�\.������yn/GFF:th����n�rrr������S��>v��}6lhk�����D�=B���w�n��tP�th�����gr7�
�����*$$�N�{{{_�t����n�"����+
�Tjmm�;�(IDATl�����h�-[������7����7o����B�X�vm�^�233��33������BN�<�P(!��k�U�������������}��g����m333k�	��������������6����J�P6l��r��+Wzyy)��q#���{��B��y�&�FS��E�@�����}��
�b��aM7FY\�x���4u�T==������������Wuo4�������j8�<U��<^f2w�LV�_���]���`��]YYY���o�����;�Z������d���M����8v�XYY�����r���O�:u��MWW�V�y��iyy������������===G�������x���>�?f�������,//��!D$9rd����/�y����O��^�|��V���8������7))I,B���}||�O�q���!C���:���v����e���7�$oo��/*n�7����W=����t@:�L:��d��3P�;rpp8u����;�m��������"B���C�6:4==�e{S			�}����~kkk��2�������b�h���ci��{�0�X\ZZJ�9s�����5�����������p8�|�
��1bD��������x<��A��~Z[���G��P(������O�>2d���#!�����;��mu���e����v,�
R�&h�7�)))Q=����t@:�L:��d�LF�����������/�����!���{M�����o_����>���{���3g�T�������y�u�����:��������/V����cPP�����[���������:{[[[�����|>������q�F||��!Cttt����ryBB�����lv�nW�S��������??W�����OOOBH[�����S�������&3f2J
�����
����?:SGG��<m�O�>l6���rB�X,�u��;���b��_h��u�BBB|||����V�P('N�~�zc���G�g���t�������b�8q�Y{mmmAA����/e2��#G��C*u��IM����@�gk�&N���n9r$!d���O�>�����'Ut,�������bff��K����������q���N�<YEo4�Xe�w���s����y�����(5�[X�titttII�H$�t����~:v��>}�4.`ccC�P�IH�R��Y��O?}���uuuYYY,`0k��	

=~����;���$IJJJ����k�����7.%%���y����?�z����8p ""��{B7��VO�o�y�������R��.((HMMe�Xc������Ajj�|���������O�<��w�I$����%K�(��Z���{������+�g���yxxl��e��q���nO ���k��m��+W�$�2$111%%E&��8q�����������������Axx���w'O���7�~���NHH��c�T�;���������y��Y=v&�\��+P���'�O�njjJ�Pz���d�����f����Y[[���ikkWVV~��l6[__�J�<X�|CCCxx8��������677o���S��v��eddt���V�'22���������7++KKK���3-O�o�UUUU����h��qk��Q^5�m�6###]]]KKK##����+����k��!t:���p��U'O����P�+����[����f�������	&�z������J�����}�!Dy�z�(�B9rd��-��Z�v�B����,Yblllhh���q��====�(�{i���������k���Y,�����#G�o���fK�R?�����{�p�-��Z�C�W��:�_�LV�_�(5Pjt�\^RR�z������������[.V\\����N����oB�W��g�^����x-�ry��K$����W7�	DE����������S�+��\Eo���Nw���EQ=��x&�W��'�j�z�_7�u��M�6=y��s�Jo�������?���@:`2��o{�f��R��C�@��M����_����|>�O�>��������Wd��
zzz�W�FW 0�{���(5������RPj4�|�����#�IEND�B`�
Attachment: ppc-TAS_SPIN-20111228.diff.gz (application/octet-stream)
#47Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Tatsuo Ishii (#46)
Re: spinlocks on HP-UX

On 28.12.2011 14:03, Tatsuo Ishii wrote:

With help from IBM Japan Ltd. we did some tests on a larger IBM
machine than Tom Lane has used for his
test(http://archives.postgresql.org/message-id/8292.1314641721@sss.pgh.pa.us).
In his case it was IBM 8406-71Y, which has 8 physical cores and
4-way SMT (32 threads). Ours is an IBM Power 750 Express, which has 32
physical cores and 4-way SMT (128 threads), with 256GB RAM.

The test method was same as the one in the article above. The
differences are OS(RHEL 6.1), gcc version (4.4.5) and shared buffer
size(8GB).

We tested 3 methods to mitigate spin lock contention:

1) Add the "hint" parameter to the lwarx op, which is usable on POWER6
or later architectures.

2) Add non-locked test in TAS()

3) #1 + #2

We saw small performance enhancement with #1, larger one with #2 and
even better with #1+#2.

Hmm, so you added the non-locked test in TAS()? Did you try adding it
just to TAS_SPIN()? On Itanium, I found that it was slightly better
to do it only in TAS_SPIN() - i.e. in the contended case.

Here is a new patch using TAS_SPIN(), created by Manabu Ori from IBM
Japan. This patch also deals with older Power architectures which do
not have the "hint" argument of the lwarx opcode.

According to him, the patch resulted in much better performance than
stock git head.

Impressive results.

config/c-compiler.m4 doesn't seem like the right place for the configure
test. Would there be any harm in setting the lwarx hint always; what
would happen on older ppc processors that don't support it?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#48Tatsuo Ishii
ishii@postgresql.org
In reply to: Heikki Linnakangas (#47)
Re: spinlocks on HP-UX

On 28.12.2011 14:03, Tatsuo Ishii wrote:

With help from IBM Japan Ltd. we did some tests on a larger IBM
machine than Tom Lane has used for his
test(http://archives.postgresql.org/message-id/8292.1314641721@sss.pgh.pa.us).
In his case it was IBM 8406-71Y, which has 8 physical cores and
4-way SMT (32 threads). Ours is an IBM Power 750 Express, which has 32
physical cores and 4-way SMT (128 threads), with 256GB RAM.

The test method was same as the one in the article above. The
differences are OS(RHEL 6.1), gcc version (4.4.5) and shared buffer
size(8GB).

We tested 3 methods to mitigate spin lock contention:

1) Add the "hint" parameter to the lwarx op, which is usable on POWER6
or later architectures.

2) Add non-locked test in TAS()

3) #1 + #2

We saw small performance enhancement with #1, larger one with #2 and
even better with #1+#2.

Hmm, so you added the non-locked test in TAS()? Did you try adding it
just to TAS_SPIN()? On Itanium, I found that it was slightly better
to do it only in TAS_SPIN() - i.e. in the contended case.

Here is a new patch using TAS_SPIN(), created by Manabu Ori from IBM
Japan. This patch also deals with older Power architectures which do
not have the "hint" argument of the lwarx opcode.

According to him, the patch resulted in much better performance than
stock git head.

Impressive results.

config/c-compiler.m4 doesn't seem like the right place for the
configure test. Would there be any harm in setting the lwarx hint
always; what would happen on older ppc processors that don't support
it?

I think the load module just fails to run in this case, but I'd like
to confirm. Ori-san?
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

#49Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#47)
Re: spinlocks on HP-UX

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:

config/c-compiler.m4 doesn't seem like the right place for the configure
test. Would there be any harm in setting the lwarx hint always; what
would happen on older ppc processors that don't support it?

More to the point, a configure test only proves whether the
build machine can deal with the flag, not whether the machine
the executables will ultimately run on knows what the flag means.
We cannot assume that the build and execution boxes are the same.
(In general, AC_TRY_RUN tests are best avoided because of this.)
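To illustrate the distinction Tom is drawing, a cross-compile-safe probe only asks whether the *toolchain* accepts the hinted instruction, without executing anything on the build machine. The macro and cache-variable names below are illustrative, not necessarily what any final patch used:

```shell
# Compile-only probe: does the assembler accept the hinted lwarx form?
# Because nothing is executed, the answer stays meaningful when
# cross-compiling -- it answers "can the toolchain emit this?", not
# "can the build CPU run it?" (which is what AC_TRY_RUN would test).
AC_CACHE_CHECK([whether assembler supports lwarx hint bit],
               [pgac_cv_have_ppc_mutex_hint],
[AC_COMPILE_IFELSE([AC_LANG_PROGRAM([],
  [[int lock;
    __asm__ __volatile__ (" lwarx 1,0,%0,1 \n" : : "r"(&lock));]])],
  [pgac_cv_have_ppc_mutex_hint=yes],
  [pgac_cv_have_ppc_mutex_hint=no])])
if test x"$pgac_cv_have_ppc_mutex_hint" = xyes ; then
  AC_DEFINE([HAVE_PPC_LWARX_MUTEX_HINT], 1,
            [Define if the assembler accepts the lwarx hint bit.])
fi
```

Whether the *target* CPU honors the hint still has to be decided some other way, e.g. a run-time check or an explicit configure switch.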

regards, tom lane

#50Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Tom Lane (#49)
Re: spinlocks on HP-UX

Tom Lane wrote:

a configure test only proves whether the build machine can deal
with the flag, not whether the machine the executables will
ultimately run on knows what the flag means. We cannot assume that
the build and execution boxes are the same. (In general,
AC_TRY_RUN tests are best avoided because of this.)

I understand why that is important in general, but as a shop which
builds from source, and is fine with a separate build for each
hardware model / OS version combination, it would be great if any
optimizations which are only available if you *do* assume that the
build machine and the run machine are the same (or at least
identical) could be enabled with some configure switch. Maybe
something like --enable-platform-specific-optimizations.

I don't know if any such possible optimizations currently exist, I'm
just saying that if any are identified, it would be nice to have the
option of using them.

-Kevin

#51Tatsuo Ishii
ishii@postgresql.org
In reply to: Kevin Grittner (#50)
Re: spinlocks on HP-UX

OT:

Please use mail address "manabu.ori@gmail.com", not
"manabu.ori@gmailc.com" when following this thread. I accidentally made
a mistake when I posted the first mail in this thread.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp


Tom Lane wrote:

a configure test only proves whether the build machine can deal
with the flag, not whether the machine the executables will
ultimately run on knows what the flag means. We cannot assume that
the build and execution boxes are the same. (In general,
AC_TRY_RUN tests are best avoided because of this.)

I understand why that is important in general, but as a shop which
builds from source, and is fine with a separate build for each
hardware model / OS version combination, it would be great if any
optimizations which are only available if you *do* assume that the
build machine and the run machine are the same (or at least
identical) could be enabled with some configure switch. Maybe
something like --enable-platform-specific-optimizations.

I don't know if any such possible optimizations currently exist, I'm
just saying that if any are identified, it would be nice to have the
option of using them.

-Kevin

#52Manabu Ori
manabu.ori@gmail.com
In reply to: Tatsuo Ishii (#48)
Re: spinlocks on HP-UX

2011/12/29 Tatsuo Ishii <ishii@postgresql.org>

Impressive results.

config/c-compiler.m4 doesn't seem like the right place for the
configure test. Would there be any harm in setting the lwarx hint
always; what would happen on older ppc processors that don't support
it?

I think the load module just fails to run in this case, but I'd like
to confirm. Ori-san?

I don't know which config/*.m4 file is the right place for this kind
of configure test. Do you have any idea?

I believe lwarx hint would be no harm for recent PowerPC processors.
What I tested are:

(1) Built postgres on POWER6 + RHEL5, which got the lwarx hint
included. Then copied the src tree to POWER5 + RHEL4 and
ran "make test", which finished successfully.

(2) The lwarx test in configure failed on POWER5 + RHEL4.

Note that POWER6 understands lwarx hint and POWER5 doesn't.
RHEL5 binutils supports lwarx hint and RHEL4 binutils doesn't.

The only concern is very old PowerPC processors.
Referring to the Power Instruction Set Architecture manual(*1), on
some processors that precede PowerISA v2.00, executing lwarx with
the hint will cause an illegal instruction error.

The lwarx test in configure should fail on these kinds of processors,
judging from my test (2).

(*1) p.689 of
https://www.power.org/resources/downloads/PowerISA_V2.06B_V2_PUBLIC.pdf

Regards,
Manabu Ori

#53Manabu Ori
manabu.ori@gmail.com
In reply to: Kevin Grittner (#50)
Re: spinlocks on HP-UX

a configure test only proves whether the build machine can deal
with the flag, not whether the machine the executables will
ultimately run on knows what the flag means. We cannot assume that
the build and execution boxes are the same. (In general,
AC_TRY_RUN tests are best avoided because of this.)

I understand why that is important in general, but as a shop which
builds from source, and is fine with a separate build for each
hardware model / OS version combination, it would be great if any
optimizations which are only available if you *do* assume that the
build machine and the run machine are the same (or at least
identical) could be enabled with some configure switch. Maybe
something like --enable-platform-specific-optimizations.

I don't know if any such possible optimizations currently exist, I'm
just saying that if any are identified, it would be nice to have the
option of using them.

I can't say which is the right way to go for now, but I'd like
binary packages to benefit from my patch as far as possible, which
is why I made the lwarx hint test run at configure time.
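
Such a configure-time probe might look roughly like the following. This is
a hypothetical sketch of a test program that configure could try to compile
and run, not the actual test from the posted patch; it is only meaningful
on PowerPC hardware, where an assembler or CPU that rejects the EH hint
will make the probe fail:

```c
/* Hypothetical configure probe (sketch): compile and run this program.
 * If the toolchain or CPU rejects lwarx with the EH hint operand, the
 * build or execution fails and configure leaves the hint disabled. */
int
main(void)
{
    unsigned int lock = 0;
    unsigned int tmp;

    __asm__ __volatile__(
        "   lwarx   %0,0,%1,1   \n"     /* load-reserve with EH hint = 1 */
        "   stwcx.  %0,0,%1     \n"     /* paired store-conditional */
        : "=&r"(tmp)
        : "r"(&lock)
        : "cr0", "memory");
    return 0;
}
```

As Tom notes upthread, a run test like this only proves what the build
machine accepts, which is exactly the cross-compilation caveat under
discussion.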

Regards,
Manabu Ori

#54Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Manabu Ori (#52)
Re: spinlocks on HP-UX

On 29.12.2011 04:36, Manabu Ori wrote:

I believe lwarx hint would be no harm for recent PowerPC processors.
What I tested are:

(1) Built postgres on POWER6 + RHEL5, which got lwarx hint
included. Then copy these src tree to POWER5 + RHEL4 and
run "make test", finished successfully.

(2) Lwarx test in configure failed on POWER5 + RHEL4.

Note that POWER6 understands lwarx hint and POWER5 doesn't.
RHEL5 binutils supports lwarx hint and RHEL4 binutils doesn't.

The only concern is for very old PowerPC.
Referring to Power Instruction Set Architecture manual(*1), on
some processors that precede PowerISA v2.00, executing lwarx with
hint will cause the illegal instruction error.

Lwarx test in configure should fail on these kind of processors,
guessing from my test(2).

The Linux kernel does this (arch/powerpc/include/asm/ppc-opcode.h):

/*
 * Only use the larx hint bit on 64bit CPUs. e500v1/v2 based CPUs will treat a
 * larx with EH set as an illegal instruction.
 */
#ifdef CONFIG_PPC64
#define __PPC_EH(eh) (((eh) & 0x1) << 0)
#else
#define __PPC_EH(eh) 0
#endif

We can't copy-paste code from Linux directly, and I'm not sure I like
that particular phrasing of the macro, but perhaps we should steal the
idea and only use the hint on 64-bit PowerPC processors? I presume all
the processors that support the hint are 64-bit, so the question is, are
there any 64-bit PowerPC processors that would get upset about it? It's
quite arbitrary to tie it to the word length, but if it works as a
dividing line in practice, I'm fine with it.
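
Borrowing the kernel's heuristic without copying its code, a
PostgreSQL-side guard could be sketched like this. The macro name
PPC_LWARX_MUTEX_HINT is hypothetical (not from any posted patch); the
idea is that the string gets pasted into the lwarx operand list inside
the spinlock's inline assembly:

```c
/* Sketch only: enable the lwarx EH hint bit on 64-bit PowerPC, where
 * the kernel's comment suggests it is safe, and omit it elsewhere.
 * The expansion is appended to the lwarx operands in the TAS asm. */
#if defined(__powerpc64__)
#define PPC_LWARX_MUTEX_HINT ",1"       /* lwarx %0,0,%1,1 -- EH = 1 */
#else
#define PPC_LWARX_MUTEX_HINT ""         /* plain lwarx %0,0,%1 */
#endif
```

Tying the decision to __powerpc64__ at compile time sidesteps the
configure run-test problem entirely, at the cost of leaving the hint
unused on any hypothetical 32-bit processor that could have handled it.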

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#55Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#54)
Re: spinlocks on HP-UX

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:

The Linux kernel does this (arch/powerpc/include/asm/ppc-opcode.h):

Yeah, I was looking at that too.

We can't copy-paste code from Linux directly, and I'm not sure I like
that particular phrasing of the macro, but perhaps we should steal the
idea and only use the hint on 64-bit PowerPC processors?

The info that I've found says that the hint exists beginning in POWER6,
and there were certainly 64-bit Power machines before that. However,
it might be that the only machines that actually spit up on the hint bit
(rather than ignore it) were 32-bit, in which case this would be a
usable heuristic. Not sure how we can research that ... do we want to
just assume the kernel guys know what they're doing?

regards, tom lane

#56Bruce Momjian
bruce@momjian.us
In reply to: Manabu Ori (#53)
Re: spinlocks on HP-UX

On Thu, Dec 29, 2011 at 11:37:22AM +0900, Manabu Ori wrote:

a configure test only proves whether the build machine can deal
with the flag, not whether the machine the executables will
ultimately run on knows what the flag means. We cannot assume that
the build and execution boxes are the same. (In general,
AC_TRY_RUN tests are best avoided because of this.)

I understand why that is important in general, but as a shop which
builds from source, and is fine with a separate build for each
hardware model / OS version combination, it would be great if any
optimizations which are only available if you *do* assume that the
build machine and the run machine are the same (or at least
identical) could be enabled with some configure switch. Maybe
something like --enable-platform-specific-optimizations.

I don't know if any such possible optimizations currently exist, I'm
just saying that if any are identified, it would be nice to have the
option of using them.

I can't say which is the right way to go for now, but I'd like
binary packages to benefit from my patch as far as possible, which
is why I made the lwarx hint test run at configure time.

Was there any conclusion to this discussion?

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +

#57Robert Haas
robertmhaas@gmail.com
In reply to: Bruce Momjian (#56)
Re: spinlocks on HP-UX

On Sun, Aug 26, 2012 at 9:45 AM, Bruce Momjian <bruce@momjian.us> wrote:

On Thu, Dec 29, 2011 at 11:37:22AM +0900, Manabu Ori wrote:

a configure test only proves whether the build machine can deal
with the flag, not whether the machine the executables will
ultimately run on knows what the flag means. We cannot assume that
the build and execution boxes are the same. (In general,
AC_TRY_RUN tests are best avoided because of this.)

I understand why that is important in general, but as a shop which
builds from source, and is fine with a separate build for each
hardware model / OS version combination, it would be great if any
optimizations which are only available if you *do* assume that the
build machine and the run machine are the same (or at least
identical) could be enabled with some configure switch. Maybe
something like --enable-platform-specific-optimizations.

I don't know if any such possible optimizations currently exist, I'm
just saying that if any are identified, it would be nice to have the
option of using them.

I can't say which is the right way to go for now, but I'd like
binary packages to benefit from my patch as far as possible, which
is why I made the lwarx hint test run at configure time.

Was there any conclusion to this discussion?

I don't think so. Broadly, I'm skeptical of hints. The CPU already
does prefetching, so there is value in hints only if we can be smarter
than the CPU. That's kind of an iffy proposition, particularly since
different CPUs have very different behavior. What may be a good idea
on one platform can stink on another platform. Also, something that
works well in low-contention situations can blow up in high-contention
situations, and vice versa. Consider this typical sequence:

- prefetch something
- do a little bit of other work
- use the thing we prefetched

If the system isn't under heavy contention, this can be a big win.
The work is overlapped with the prefetch, so that by the time we get to
step 3, we don't have to wait. But if the system now becomes
contended, the prefetch may grab the cache line that we need, and then
some other process may steal it, and then in step 3 we have to steal
it back. So now our huge win becomes a huge loss, because we're
fetching the same cache lines twice per backend instead of once. Now
you can (maybe) fix this by jiggering the amount of other work that
you do in between the prefetch and the usage, but that's likely to be
different on every architecture, so the whole thing feels like a
fairly impossible problem.
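
The pattern Robert describes can be seen in miniature with GCC's
__builtin_prefetch, which is a portable way to issue such hints. This is
an illustrative sketch only; the work function and the prefetch distance
are invented for the example, and the distance is exactly the
per-architecture tuning knob the paragraph above warns about:

```c
#include <stddef.h>

/* Sketch of the prefetch / do-other-work / use sequence from the text.
 * PREFETCH_DISTANCE is a made-up tuning value: how many elements ahead
 * to prefetch.  The right value differs per CPU and per contention
 * level, which is the portability problem discussed above. */
#define PREFETCH_DISTANCE 8

long
sum_with_prefetch(const long *data, size_t n)
{
    long sum = 0;

    for (size_t i = 0; i < n; i++)
    {
        /* step 1: prefetch a future element (read-only, low locality) */
        if (i + PREFETCH_DISTANCE < n)
            __builtin_prefetch(&data[i + PREFETCH_DISTANCE], 0, 1);

        /* steps 2-3: other work, then use of the earlier prefetch */
        sum += data[i];
    }
    return sum;
}
```

Under contention, another CPU can steal the prefetched cache line before
the "use" step, turning the single fetch into two, which is the loss
scenario described above.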

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company