Better LWLocks with compare-and-swap (9.4)

Started by Heikki Linnakangas in May 2013 · 21 messages · pgsql-hackers
#1 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com

I've been working on-and-off on the WAL-insert scaling patch. It's in
pretty good shape now, and I'll post it shortly, but one thing I noticed
is that it benefits a lot from using an atomic compare-and-swap
instruction for the contention-critical part.

I realized that we could also use compare-and-swap to make LWLocks scale
better. The LWLock struct is too large to compare-and-swap atomically,
but we can still use CAS to increment/decrement the shared/exclusive
counters, when there's no need to manipulate the wait queue. That would
help with workloads where you have a lot of CPUs, and a lot of backends
need to acquire the same lwlock in shared mode, but there's no real
contention (i.e. few exclusive lockers).

pgbench -S is such a workload. With 9.3beta1, I'm seeing this profile,
when I run "pgbench -S -c64 -j64 -T60 -M prepared" on a 32-core Linux
machine:

-  64.09%  postgres  postgres           [.] tas
    - tas
       - 99.83% s_lock
          - 53.22% LWLockAcquire
             + 99.87% GetSnapshotData
          - 46.78% LWLockRelease
               GetSnapshotData
             + GetTransactionSnapshot
+   2.97%  postgres  postgres           [.] tas
+   1.53%  postgres  libc-2.13.so       [.] 0x119873
+   1.44%  postgres  postgres           [.] GetSnapshotData
+   1.29%  postgres  [kernel.kallsyms]  [k] arch_local_irq_enable
+   1.18%  postgres  postgres           [.] AllocSetAlloc
...

So, on this test, a lot of time is wasted spinning on the mutex of
ProcArrayLock. If you plot a graph of TPS vs. # of clients, there is a
surprisingly steep drop in performance once you go beyond 29 clients
(attached, pgbench-lwlock-cas-local-clients-sets.png, red line). My
theory is that after that point all the cores are busy, and processes
start to be sometimes context switched while holding the spinlock, which
kills performance. Has anyone else seen that pattern? Curiously, I don't
see that when connecting pgbench via TCP over localhost, only when
connecting via unix domain sockets. Overall performance is higher over
unix domain sockets, so I guess the TCP layer adds some overhead,
hurting performance, and also affects scheduling somehow, making the
steep drop go away.

Using a compare-and-swap instruction in place of spinning solves the
problem (green line in attached graph). This needs some more testing
with different workloads to make sure it doesn't hurt other scenarios,
but I don't think it will. I'm also waiting for a colleague to test this
on a different machine, where he saw a similar steep drop in performance.

The attached patch is still work-in-progress. There needs to be a
configure test and fallback to spinlock if a CAS instruction is not
available. I used the gcc __sync_val_compare_and_swap() builtin
directly, that needs to be abstracted. Also, in the case that the wait
queue needs to be manipulated, the code spins on the CAS instruction,
but there is no delay mechanism like there is on a regular spinlock;
that needs to be added in somehow.

- Heikki

Attachments:

pgbench-lwlock-cas-local-clients-sets.png (image/png)
lwlock-cas-2.patch (text/x-diff, +165 -45)
#2 Merlin Moncure
mmoncure@gmail.com
In reply to: Heikki Linnakangas (#1)
Re: Better LWLocks with compare-and-swap (9.4)

On Mon, May 13, 2013 at 7:50 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

The attached patch is still work-in-progress. There needs to be a configure
test and fallback to spinlock if a CAS instruction is not available. I used
the gcc __sync_val_compare_and_swap() builtin directly, that needs to be
abstracted. Also, in the case that the wait queue needs to be manipulated,
the code spins on the CAS instruction, but there is no delay mechanism like
there is on a regular spinlock; that needs to be added in somehow.

These are really interesting results. Why is the CAS method so much
faster than TAS? Did you see any contention?

merlin

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3 Daniel Farina
daniel@heroku.com
In reply to: Heikki Linnakangas (#1)
Re: Better LWLocks with compare-and-swap (9.4)

On Mon, May 13, 2013 at 5:50 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

pgbench -S is such a workload. With 9.3beta1, I'm seeing this profile, when
I run "pgbench -S -c64 -j64 -T60 -M prepared" on a 32-core Linux machine:

-  64.09%  postgres  postgres           [.] tas
- tas
- 99.83% s_lock
- 53.22% LWLockAcquire
+ 99.87% GetSnapshotData
- 46.78% LWLockRelease
GetSnapshotData
+ GetTransactionSnapshot
+   2.97%  postgres  postgres           [.] tas
+   1.53%  postgres  libc-2.13.so       [.] 0x119873
+   1.44%  postgres  postgres           [.] GetSnapshotData
+   1.29%  postgres  [kernel.kallsyms]  [k] arch_local_irq_enable
+   1.18%  postgres  postgres           [.] AllocSetAlloc
...

So, on this test, a lot of time is wasted spinning on the mutex of
ProcArrayLock. If you plot a graph of TPS vs. # of clients, there is a
surprisingly steep drop in performance once you go beyond 29 clients
(attached, pgbench-lwlock-cas-local-clients-sets.png, red line). My theory
is that after that point all the cores are busy, and processes start to be
sometimes context switched while holding the spinlock, which kills
performance.

I have. I also used Linux perf to come to this conclusion, and my
determination was similar: a system was undergoing increasingly heavy
load, in this case with processes >> number of processors. It was
also a phase-change type of event: at one moment everything would be
going great, but once a critical threshold was hit, s_lock would
consume an enormous amount of CPU time. I figured preemption while in
the spinlock was to blame at the time, given the extreme nature.


#4 Daniel Farina
daniel@heroku.com
In reply to: Daniel Farina (#3)
Re: Better LWLocks with compare-and-swap (9.4)

On Wed, May 15, 2013 at 3:08 PM, Daniel Farina <daniel@heroku.com> wrote:

On Mon, May 13, 2013 at 5:50 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

pgbench -S is such a workload. With 9.3beta1, I'm seeing this profile, when
I run "pgbench -S -c64 -j64 -T60 -M prepared" on a 32-core Linux machine:

-  64.09%  postgres  postgres           [.] tas
- tas
- 99.83% s_lock
- 53.22% LWLockAcquire
+ 99.87% GetSnapshotData
- 46.78% LWLockRelease
GetSnapshotData
+ GetTransactionSnapshot
+   2.97%  postgres  postgres           [.] tas
+   1.53%  postgres  libc-2.13.so       [.] 0x119873
+   1.44%  postgres  postgres           [.] GetSnapshotData
+   1.29%  postgres  [kernel.kallsyms]  [k] arch_local_irq_enable
+   1.18%  postgres  postgres           [.] AllocSetAlloc
...

So, on this test, a lot of time is wasted spinning on the mutex of
ProcArrayLock. If you plot a graph of TPS vs. # of clients, there is a
surprisingly steep drop in performance once you go beyond 29 clients
(attached, pgbench-lwlock-cas-local-clients-sets.png, red line). My theory
is that after that point all the cores are busy, and processes start to be
sometimes context switched while holding the spinlock, which kills
performance.

I accidentally dropped some important last words from Heikki's mail,
which makes my correspondence pretty bizarre to understand at the
outset. Apologies. He wrote:

[...] Has anyone else seen that pattern?

I have. I also used Linux perf to come to this conclusion, and my
determination was similar: a system was undergoing increasingly heavy
load, in this case with processes >> number of processors. It was
also a phase-change type of event: at one moment everything would be
going great, but once a critical threshold was hit, s_lock would
consume an enormous amount of CPU time. I figured preemption while in
the spinlock was to blame at the time, given the extreme nature.


#5 Stephen Frost
sfrost@snowman.net
In reply to: Heikki Linnakangas (#1)
Re: Better LWLocks with compare-and-swap (9.4)

* Heikki Linnakangas (hlinnakangas@vmware.com) wrote:

My theory is that after that point all the cores are busy,
and processes start to be sometimes context switched while holding
the spinlock, which kills performance. Has anyone else seen that
pattern?

Isn't this the same issue which has prompted multiple people to propose
(sometimes with code, as I recall) to rip out our internal spinlock
system and replace it with kernel-backed calls which do it better,
specifically by dealing with issues like the above? Have you seen those
threads in the past? Any thoughts about moving in that direction?

Curiously, I don't see that when connecting pgbench via TCP
over localhost, only when connecting via unix domain sockets.
Overall performance is higher over unix domain sockets, so I guess
the TCP layer adds some overhead, hurting performance, and also
affects scheduling somehow, making the steep drop go away.

I wonder if the kernel locks around unix domain sockets are helping us
out here, while it's not able to take advantage of such knowledge about
the process that's waiting when it's a TCP connection? Just a hunch.

Thanks,

Stephen

#6 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Stephen Frost (#5)
Re: Better LWLocks with compare-and-swap (9.4)

Stephen Frost <sfrost@snowman.net> writes:

Isn't this the same issue which has prompted multiple people to propose
(sometimes with code, as I recall) to rip out our internal spinlock
system and replace it with kernel-backed calls which do it better,
specifically by dealing with issues like the above? Have you seen those
threads in the past? Any thoughts about moving in that direction?

All of the proposals of that sort that I've seen had a flavor of
"my OS is the only one that matters". While I don't object to
platform-dependent implementations of spinlocks as such, they're not
much of a cure for a generic performance issue.

regards, tom lane


#7 Dickson S. Guedes
listas@guedesoft.net
In reply to: Heikki Linnakangas (#1)
Re: Better LWLocks with compare-and-swap (9.4)


On 13-05-2013 09:50, Heikki Linnakangas wrote:

I've been working on-and-off on the WAL-insert scaling patch. It's
in pretty good shape now, and I'll post it shortly, but one thing I
noticed is that it benefits a lot from using an atomic
compare-and-swap instruction for the contention-critical part.

I realized that we could also use compare-and-swap to make LWLocks
scale better. The LWLock struct is too large to compare-and-swap
atomically, but we can still use CAS to increment/decrement the
shared/exclusive counters, when there's no need to manipulate the
wait queue. That would help with workloads where you have a lot of
CPUs, and a lot of backends need to acquire the same lwlock in
shared mode, but there's no real contention (ie. few exclusive
lockers).

pgbench -S is such a workload. With 9.3beta1, I'm seeing this
profile, when I run "pgbench -S -c64 -j64 -T60 -M prepared" on a
32-core Linux machine:

-  64.09%  postgres  postgres           [.] tas
    - tas
       - 99.83% s_lock
          - 53.22% LWLockAcquire
             + 99.87% GetSnapshotData
          - 46.78% LWLockRelease
               GetSnapshotData
             + GetTransactionSnapshot
+   2.97%  postgres  postgres           [.] tas
+   1.53%  postgres  libc-2.13.so       [.] 0x119873
+   1.44%  postgres  postgres           [.] GetSnapshotData
+   1.29%  postgres  [kernel.kallsyms]  [k] arch_local_irq_enable
+   1.18%  postgres  postgres           [.] AllocSetAlloc
...

I'd like to test this here, but I couldn't reproduce that perf output
on 64-core or 24-core machines. Could you post the changes to
postgresql.conf and the perf arguments that you used?

Thanks!

[]s
- --
Dickson S. Guedes
mail/xmpp: guedes@guedesoft.net - skype: guediz
http://guedesoft.net - http://www.postgresql.org.br
http://github.net/guedes - twitter: @guediz


#8 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Dickson S. Guedes (#7)
Re: Better LWLocks with compare-and-swap (9.4)

On 18.05.2013 03:52, Dickson S. Guedes wrote:

pgbench -S is such a workload. With 9.3beta1, I'm seeing this
profile, when I run "pgbench -S -c64 -j64 -T60 -M prepared" on a
32-core Linux machine:

-  64.09%  postgres  postgres           [.] tas
    - tas
       - 99.83% s_lock
          - 53.22% LWLockAcquire
             + 99.87% GetSnapshotData
          - 46.78% LWLockRelease
               GetSnapshotData
             + GetTransactionSnapshot
+   2.97%  postgres  postgres           [.] tas
+   1.53%  postgres  libc-2.13.so       [.] 0x119873
+   1.44%  postgres  postgres           [.] GetSnapshotData
+   1.29%  postgres  [kernel.kallsyms]  [k] arch_local_irq_enable
+   1.18%  postgres  postgres           [.] AllocSetAlloc
...

I'd like to test this here, but I couldn't reproduce that perf output
on 64-core or 24-core machines. Could you post the changes to
postgresql.conf and the perf arguments that you used?

Sure, here are the non-default postgresql.conf settings:

name | current_setting
----------------------------+-----------------------------------------
application_name | psql
autovacuum | off
checkpoint_segments | 70
config_file | /home/hlinnakangas/data/postgresql.conf
data_directory | /home/hlinnakangas/data
default_text_search_config | pg_catalog.english
hba_file | /home/hlinnakangas/data/pg_hba.conf
ident_file | /home/hlinnakangas/data/pg_ident.conf
lc_collate | en_US.UTF-8
lc_ctype | en_US.UTF-8
log_timezone | US/Pacific
logging_collector | on
max_connections | 100
max_stack_depth | 2MB
server_encoding | UTF8
shared_buffers | 2GB
synchronous_commit | off
TimeZone | US/Pacific
transaction_deferrable | off
transaction_isolation | read committed
transaction_read_only | off
wal_buffers | 16MB

While pgbench was running, I ran this:

perf record -p 6050 -g -e cpu-clock

to connect to one of the backends. (I used cpu-clock, because the
default cpu-cycles event didn't work on the box)

- Heikki


#9 Dickson S. Guedes
listas@guedesoft.net
In reply to: Heikki Linnakangas (#8)
Re: Better LWLocks with compare-and-swap (9.4)

On Sun, 2013-05-19 at 09:29 +0300, Heikki Linnakangas wrote:

On 18.05.2013 03:52, Dickson S. Guedes wrote:

pgbench -S is such a workload. With 9.3beta1, I'm seeing this
profile, when I run "pgbench -S -c64 -j64 -T60 -M prepared" on a
32-core Linux machine:

-  64.09%  postgres  postgres           [.] tas
    - tas
       - 99.83% s_lock
          - 53.22% LWLockAcquire
             + 99.87% GetSnapshotData
          - 46.78% LWLockRelease
               GetSnapshotData
             + GetTransactionSnapshot
+   2.97%  postgres  postgres           [.] tas
+   1.53%  postgres  libc-2.13.so       [.] 0x119873
+   1.44%  postgres  postgres           [.] GetSnapshotData
+   1.29%  postgres  [kernel.kallsyms]  [k] arch_local_irq_enable
+   1.18%  postgres  postgres           [.] AllocSetAlloc
...

I'd like to test this here, but I couldn't reproduce that perf output
on 64-core or 24-core machines. Could you post the changes to
postgresql.conf and the perf arguments that you used?

Sure, here are the non-default postgresql.conf settings:

Thank you for your information.

While pgbench was running, I ran this:

perf record -p 6050 -g -e cpu-clock

to connect to one of the backends. (I used cpu-clock, because the
default cpu-cycles event didn't work on the box)

Hum, I was supposing that I was doing something wrong, but I'm getting
the same result as before even using your test case, and my results are
still different from yours:

+ 71,27% postgres postgres         [.] AtEOXact_Buffers
+  7,67% postgres postgres         [.] AtEOXact_CatCache
+  6,30% postgres postgres         [.] AllocSetCheck
+  5,34% postgres libc-2.12.so     [.] __mcount_internal
+  2,14% postgres [kernel.kallsyms][k] activate_page

It's a 64-core machine with PGDATA on a SAN.

vendor_id : GenuineIntel
cpu family : 6
model : 47
model name : Intel(R) Xeon(R) CPU E7- 4830 @ 2.13GHz
stepping : 2
cpu MHz : 1064.000
cache size : 24576 KB
physical id : 3
siblings : 16
core id : 24
cpu cores : 8
apicid : 241
initial apicid : 241
fpu : yes
fpu_exception : yes
cpuid level : 11
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca
cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall
nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good
xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx
smx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 x2apic popcnt aes
lahf_lm ida arat epb dts tpr_shadow vnmi flexpriority ept vpid
bogomips : 4255.87
clflush size : 64
cache_alignment : 64
address sizes : 44 bits physical, 48 bits virtual
power management:

Would you like me to test some other configuration to try to simulate
that expected workload?

[]s
--
Dickson S. Guedes
mail/xmpp: guedes@guedesoft.net - skype: guediz
http://guedesoft.net - http://www.postgresql.org.br
http://www.rnp.br/keyserver/pks/lookup?search=0x8F3E3C06D428D10A

#10 Andres Freund
andres@anarazel.de
In reply to: Dickson S. Guedes (#9)
Re: Better LWLocks with compare-and-swap (9.4)

On 2013-05-20 09:31:15 -0300, Dickson S. Guedes wrote:

Hum, I was supposing that I was doing something wrong, but I'm getting
the same result as before even using your test case, and my results are
still different from yours:

+ 71,27% postgres postgres         [.] AtEOXact_Buffers
+  7,67% postgres postgres         [.] AtEOXact_CatCache
+  6,30% postgres postgres         [.] AllocSetCheck
+  5,34% postgres libc-2.12.so     [.] __mcount_internal
+  2,14% postgres [kernel.kallsyms][k] activate_page

That looks like you have configured with --enable-cassert and probably
also --enable-profiling? The former will give completely distorted
performance results...

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#11 Dickson S. Guedes
listas@guedesoft.net
In reply to: Andres Freund (#10)
Re: Better LWLocks with compare-and-swap (9.4)

On Mon, 2013-05-20 at 14:35 +0200, Andres Freund wrote:

On 2013-05-20 09:31:15 -0300, Dickson S. Guedes wrote:

Hum, I was supposing that I was doing something wrong, but I'm getting
the same result as before even using your test case, and my results are
still different from yours:

+ 71,27% postgres postgres         [.] AtEOXact_Buffers
+  7,67% postgres postgres         [.] AtEOXact_CatCache
+  6,30% postgres postgres         [.] AllocSetCheck
+  5,34% postgres libc-2.12.so     [.] __mcount_internal
+  2,14% postgres [kernel.kallsyms][k] activate_page

That looks like you have configured with --enable-cassert and probably
also --enable-profiling? The former will give completely distorted
performance results...

Ah! Wrong PATH, so wrong binaries. Thanks Andres.

--
Dickson S. Guedes
mail/xmpp: guedes@guedesoft.net - skype: guediz
http://guedesoft.net - http://www.postgresql.org.br
http://www.rnp.br/keyserver/pks/lookup?search=0x8F3E3C06D428D10A

#12 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Merlin Moncure (#2)
Re: Better LWLocks with compare-and-swap (9.4)

On 13.05.2013 17:21, Merlin Moncure wrote:

On Mon, May 13, 2013 at 7:50 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

The attached patch is still work-in-progress. There needs to be a configure
test and fallback to spinlock if a CAS instruction is not available. I used
the gcc __sync_val_compare_and_swap() builtin directly, that needs to be
abstracted. Also, in the case that the wait queue needs to be manipulated,
the code spins on the CAS instruction, but there is no delay mechanism like
there is on a regular spinlock; that needs to be added in somehow.

These are really interesting results. Why is the CAS method so much
faster than TAS?

If my theory is right, the steep fall is caused by backends being
sometimes context switched while holding the spinlock. The CAS method
alleviates that, because the lock is never "held" by anyone. With a
spinlock, if a process gets context switched while holding the lock,
everyone else will have to wait until the process gets context switched
back, and releases the spinlock. With the CAS method, if a process gets
switched in the CAS-loop, it doesn't hurt others.

With the patch, the CAS-loop still spins, just like a spinlock, when the
wait queue has to be manipulated. So when that happens, it will behave
more or less like the spinlock implementation. I'm not too concerned
about optimizing that further; sleeping and signaling sleeping
backends are heavy-weight operations anyway.

Did you see any contention?

With the current spinlock implementation? Yeah, per the profile I
posted, a lot of CPU time was spent spinning, which is a sign of
contention. I didn't see contention with the CAS implementation.

Attached is a new version of the patch. I cleaned it up a lot, added a
configure test, and added a fallback to the good old spinlock
implementation if compare-and-swap is not available. Also, if the CAS
loop needs to spin like a spinlock, it now sleeps after a while and
updates "spins_per_delay" like a regular spinlock.

- Heikki

Attachments:

lwlock-cas-3.patch (text/x-diff, +636 -240)
#13 Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#6)
Re: Better LWLocks with compare-and-swap (9.4)

On Thu, May 16, 2013 at 12:08:40PM -0400, Tom Lane wrote:

Stephen Frost <sfrost@snowman.net> writes:

Isn't this the same issue which has prompted multiple people to propose
(sometimes with code, as I recall) to rip out our internal spinlock
system and replace it with kernel-backed calls which do it better,
specifically by dealing with issues like the above? Have you seen those
threads in the past? Any thoughts about moving in that direction?

All of the proposals of that sort that I've seen had a flavor of
"my OS is the only one that matters". While I don't object to
platform-dependent implementations of spinlocks as such, they're not
much of a cure for a generic performance issue.

Uh, is this an x86-64-only optimization? Seems so.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +


#14 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Heikki Linnakangas (#12)
Re: Better LWLocks with compare-and-swap (9.4)
diff --git a/configure.in b/configure.in
index 4ea5699..ff8470e 100644
--- a/configure.in
+++ b/configure.in
@@ -1445,17 +1445,6 @@ fi
AC_CHECK_FUNCS([strtoll strtoq], [break])
AC_CHECK_FUNCS([strtoull strtouq], [break])

-AC_CACHE_CHECK([for builtin locking functions], pgac_cv_gcc_int_atomics,
-[AC_TRY_LINK([],
- [int lock = 0;
- __sync_lock_test_and_set(&lock, 1);
- __sync_lock_release(&lock);],
- [pgac_cv_gcc_int_atomics="yes"],
- [pgac_cv_gcc_int_atomics="no"])])
-if test x"$pgac_cv_gcc_int_atomics" = x"yes"; then
- AC_DEFINE(HAVE_GCC_INT_ATOMICS, 1, [Define to 1 if you have __sync_lock_test_and_set(int *) and friends.])
-fi
-

Careful here --- s_lock.h has some code conditional on
HAVE_GCC_INT_ATOMICS which your patch is not touching, yet it is
removing the definition, unless I'm misreading.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#15 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Bruce Momjian (#13)
Re: Better LWLocks with compare-and-swap (9.4)

On 20.05.2013 23:01, Bruce Momjian wrote:

On Thu, May 16, 2013 at 12:08:40PM -0400, Tom Lane wrote:

Stephen Frost<sfrost@snowman.net> writes:

Isn't this the same issue which has prompted multiple people to propose
(sometimes with code, as I recall) to rip out our internal spinlock
system and replace it with kernel-backed calls which do it better,
specifically by dealing with issues like the above? Have you seen those
threads in the past? Any thoughts about moving in that direction?

All of the proposals of that sort that I've seen had a flavor of
"my OS is the only one that matters". While I don't object to
platform-dependent implementations of spinlocks as such, they're not
much of a cure for a generic performance issue.

Uh, is this an x86-64-only optimization? Seems so.

All modern architectures have an atomic compare-and-swap instruction (or
something more powerful that can be used to implement it). That includes
x86, x86-64, ARM, PowerPC, among others.

There are some differences in how wide values can be swapped with it;
386 only supported 32-bit, until Pentium, which added a 64-bit variant.
I used the 64-bit variant in the patch, but for lwlocks, we could
actually get away with the 32-bit variant if we packed the booleans and
the shared lock counter more tightly.

- Heikki


#16 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alvaro Herrera (#14)
Re: Better LWLocks with compare-and-swap (9.4)

On 20.05.2013 23:11, Alvaro Herrera wrote:

diff --git a/configure.in b/configure.in
index 4ea5699..ff8470e 100644
--- a/configure.in
+++ b/configure.in
@@ -1445,17 +1445,6 @@ fi
AC_CHECK_FUNCS([strtoll strtoq], [break])
AC_CHECK_FUNCS([strtoull strtouq], [break])

-AC_CACHE_CHECK([for builtin locking functions], pgac_cv_gcc_int_atomics,
-[AC_TRY_LINK([],
- [int lock = 0;
- __sync_lock_test_and_set(&lock, 1);
- __sync_lock_release(&lock);],
- [pgac_cv_gcc_int_atomics="yes"],
- [pgac_cv_gcc_int_atomics="no"])])
-if test x"$pgac_cv_gcc_int_atomics" = x"yes"; then
- AC_DEFINE(HAVE_GCC_INT_ATOMICS, 1, [Define to 1 if you have __sync_lock_test_and_set(int *) and friends.])
-fi
-

Careful here --- s_lock.h has some code conditional on
HAVE_GCC_INT_ATOMICS which your patch is not touching, yet it is
removing the definition, unless I'm misreading.

Thanks, good catch. I renamed HAVE_GCC_INT_ATOMICS to
HAVE_GCC_INT_TEST_AND_SET because "atomics" seems too generic when we
also test for __sync_val_compare_and_swap(p, oldval, newval).

- Heikki


#17 Bruce Momjian
bruce@momjian.us
In reply to: Heikki Linnakangas (#15)
Re: Better LWLocks with compare-and-swap (9.4)

On Mon, May 20, 2013 at 11:16:41PM +0300, Heikki Linnakangas wrote:

On 20.05.2013 23:01, Bruce Momjian wrote:

On Thu, May 16, 2013 at 12:08:40PM -0400, Tom Lane wrote:

Stephen Frost<sfrost@snowman.net> writes:

Isn't this the same issue which has prompted multiple people to propose
(sometimes with code, as I recall) to rip out our internal spinlock
system and replace it with kernel-backed calls which do it better,
specifically by dealing with issues like the above? Have you seen those
threads in the past? Any thoughts about moving in that direction?

All of the proposals of that sort that I've seen had a flavor of
"my OS is the only one that matters". While I don't object to
platform-dependent implementations of spinlocks as such, they're not
much of a cure for a generic performance issue.

Uh, is this an x86-64-only optimization? Seems so.

All modern architectures have an atomic compare-and-swap instruction
(or something more powerful that can be used to implement it). That
includes x86, x86-64, ARM, PowerPC, among others.

There are some differences in how wide values can be swapped with
it; 386 only supported 32-bit, until Pentium, which added a 64-bit
variant. I used the 64-bit variant in the patch, but for lwlocks, we
could actually get away with the 32-bit variant if we packed the
booleans and the shared lock counter more tightly.

So we are going to need to add this kind of assembly language
optimization for every CPU type?

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +


#18 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Daniel Farina (#3)
Re: Better LWLocks with compare-and-swap (9.4)

On 16.05.2013 01:08, Daniel Farina wrote:

On Mon, May 13, 2013 at 5:50 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

pgbench -S is such a workload. With 9.3beta1, I'm seeing this profile, when
I run "pgbench -S -c64 -j64 -T60 -M prepared" on a 32-core Linux machine:

-  64.09%  postgres  postgres           [.] tas
- tas
- 99.83% s_lock
- 53.22% LWLockAcquire
+ 99.87% GetSnapshotData
- 46.78% LWLockRelease
GetSnapshotData
+ GetTransactionSnapshot
+   2.97%  postgres  postgres           [.] tas
+   1.53%  postgres  libc-2.13.so       [.] 0x119873
+   1.44%  postgres  postgres           [.] GetSnapshotData
+   1.29%  postgres  [kernel.kallsyms]  [k] arch_local_irq_enable
+   1.18%  postgres  postgres           [.] AllocSetAlloc
...

So, on this test, a lot of time is wasted spinning on the mutex of
ProcArrayLock. If you plot a graph of TPS vs. # of clients, there is a
surprisingly steep drop in performance once you go beyond 29 clients
(attached, pgbench-lwlock-cas-local-clients-sets.png, red line). My theory
is that after that point all the cores are busy, and processes start to be
sometimes context switched while holding the spinlock, which kills
performance.

I have. I also used Linux perf to come to this conclusion, and my
determination was similar: a system was undergoing increasingly heavy
load, in this case with processes >> number of processors. It was
also a phase-change type of event: at one moment everything would be
going great, but once a critical threshold was hit, s_lock would
consume an enormous amount of CPU time. I figured preemption while in
the spinlock was to blame at the time, given the extreme nature.

Stop the press! I'm getting the same speedup on that 32-core box I got
with the compare-and-swap patch, from this one-liner:

--- a/src/include/storage/s_lock.h
+++ b/src/include/storage/s_lock.h
@@ -200,6 +200,8 @@ typedef unsigned char slock_t;

#define TAS(lock) tas(lock)

+#define TAS_SPIN(lock)	(*(lock) ? 1 : TAS(lock))
+
  static __inline__ int
  tas(volatile slock_t *lock)
  {

So, on this system, doing a non-locked test before the locked xchg
instruction while spinning is a very good idea. That contradicts the
testing that was done earlier when the x86-64 implementation was added,
as we have this comment in the tas() implementation:

/*
* On Opteron, using a non-locking test before the locking instruction
* is a huge loss. On EM64T, it appears to be a wash or small loss,
* so we needn't bother to try to distinguish the sub-architectures.
*/

On my test system, the non-locking test is a big win. I tested this
because I was reading this article from Intel:

http://software.intel.com/en-us/articles/implementing-scalable-atomic-locks-for-multi-core-intel-em64t-and-ia32-architectures/.
It says very explicitly that the non-locking test is a good idea:

Spinning on volatile read vs. spin on lock attempt

One common mistake made by developers developing their own spin-wait loops is attempting to spin on an atomic instruction instead of spinning on a volatile read. Spinning on a dirty read instead of attempting to acquire a lock consumes less time and resources. This allows an application to only attempt to acquire a lock when it is free.

Now, I'm not sure what to do about this. If we put the non-locking test
in there, according to the previous testing that would be a huge loss on
Opterons.

Perhaps we should just sleep earlier, ie. lower MAX_SPINS_PER_DELAY.
That way, even if each TAS_SPIN test is very expensive, we don't spend
too much time spinning if it's really busy, or held by a process that is
sleeping.

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#19Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Heikki Linnakangas (#18)
Spinlock implementation on x86_64 (was Re: Better LWLocks with compare-and-swap (9.4))

On 21.05.2013 00:20, Heikki Linnakangas wrote:

On 16.05.2013 01:08, Daniel Farina wrote:

On Mon, May 13, 2013 at 5:50 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

pgbench -S is such a workload. With 9.3beta1, I'm seeing this
profile, when
I run "pgbench -S -c64 -j64 -T60 -M prepared" on a 32-core Linux
machine:

-  64.09%  postgres  postgres           [.] tas
   - tas
      - 99.83% s_lock
         - 53.22% LWLockAcquire
            + 99.87% GetSnapshotData
         - 46.78% LWLockRelease
              GetSnapshotData
            + GetTransactionSnapshot
+ 2.97% postgres postgres [.] tas
+ 1.53% postgres libc-2.13.so [.] 0x119873
+ 1.44% postgres postgres [.] GetSnapshotData
+ 1.29% postgres [kernel.kallsyms] [k] arch_local_irq_enable
+ 1.18% postgres postgres [.] AllocSetAlloc
...

So, on this test, a lot of time is wasted spinning on the mutex of
ProcArrayLock. If you plot a graph of TPS vs. # of clients, there is a
surprisingly steep drop in performance once you go beyond 29 clients
(attached, pgbench-lwlock-cas-local-clients-sets.png, red line). My
theory
is that after that point all the cores are busy, and processes start
to be
sometimes context switched while holding the spinlock, which kills
performance.

I have, I also used linux perf to come to this conclusion, and my
determination was similar: a system was undergoing increasingly heavy
load, in this case with processes >> the number of processors. It was
also a phase-change type of event: at one moment everything would be
going great, but once a critical threshold was hit, s_lock would
consume an enormous amount of CPU time. I figured preemption while in
the spinlock was to blame at the time, given the extreme nature

Stop the press! I'm getting the same speedup on that 32-core box I got
with the compare-and-swap patch, from this one-liner:

--- a/src/include/storage/s_lock.h
+++ b/src/include/storage/s_lock.h
@@ -200,6 +200,8 @@ typedef unsigned char slock_t;

#define TAS(lock) tas(lock)

+#define TAS_SPIN(lock) (*(lock) ? 1 : TAS(lock))
+
static __inline__ int
tas(volatile slock_t *lock)
{

So, on this system, doing a non-locked test before the locked xchg
instruction while spinning is a very good idea. That contradicts the
testing that was done earlier when the x86-64 implementation was added,
as we have this comment in the tas() implementation:

/*
* On Opteron, using a non-locking test before the locking instruction
* is a huge loss. On EM64T, it appears to be a wash or small loss,
* so we needn't bother to try to distinguish the sub-architectures.
*/

On my test system, the non-locking test is a big win. I tested this
because I was reading this article from Intel:

http://software.intel.com/en-us/articles/implementing-scalable-atomic-locks-for-multi-core-intel-em64t-and-ia32-architectures/.
It says very explicitly that the non-locking test is a good idea:

Spinning on volatile read vs. spin on lock attempt

One common mistake made by developers developing their own spin-wait
loops is attempting to spin on an atomic instruction instead of
spinning on a volatile read. Spinning on a dirty read instead of
attempting to acquire a lock consumes less time and resources. This
allows an application to only attempt to acquire a lock when it
is free.

Now, I'm not sure what to do about this. If we put the non-locking test
in there, according to the previous testing that would be a huge loss on
Opterons.

I think we should apply the attached patch to put a non-locked test in
TAS_SPIN() on x86_64.

Tom tested this back in 2011 on a 32-core Opteron [1] and found little
difference. And on an 80-core Xeon (160 with hyperthreading) system, he
saw a quite significant gain from it [2]. Reading the thread, I don't
understand why it wasn't applied to x86_64 back then. Maybe it was for
the fear of having a negative impact on performance on old Opterons, but
I strongly suspect that the performance would be OK even on old Opterons
if you only do the non-locked test in TAS_SPIN() and not in TAS(). Back
when the comment about poor performance with a non-locking test on
Opteron was written, we used the same TAS() macro for the first attempt
and while spinning.

I tried to find some sort of an authoritative guide from AMD that would
say how spinlocks should be implemented, but didn't find any. The Intel
guide I linked above says we should apply the patch. Linux uses a ticket
spinlock thingie nowadays, which is slightly different, but if I'm
reading the older pre-ticket version correctly, they used to do the
same. Ie. non-locking test while spinning, but not before the first
attempt to grab the lock. Same in FreeBSD.

So, my plan is to apply the attached non-locked-tas-spin-x86_64.patch to
master. But I would love to get feedback from people running different
x86_64 hardware.

[1]: /messages/by-id/8292.1314641721@sss.pgh.pa.us
[2]: /messages/by-id/25989.1314734752@sss.pgh.pa.us

- Heikki

Attachments:

non-locked-tas-spin-x86_64.patch (text/x-diff, +9 -0)
#20Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#19)
Re: Spinlock implementation on x86_64 (was Re: Better LWLocks with compare-and-swap (9.4))

Heikki Linnakangas <hlinnakangas@vmware.com> writes:

So, my plan is to apply the attached non-locked-tas-spin-x86_64.patch to
master. But I would love to get feedback from people running different
x86_64 hardware.

Surely this patch should update the existing comment at line 209? Or at
least point out that a non-locked test in TAS_SPIN is not the same as a
non-locked test in tas() itself.

Other than the commenting, I have no objection to this. I think you're
probably right that the old tests in which this looked like a bad idea
were adding the unlocked test to tas() itself, not only to the spin case.

regards, tom lane


#21Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Tom Lane (#20)