lazy vxid locks, v1

Started by Robert Haas over 14 years ago, 26 messages
#1Robert Haas
robertmhaas@gmail.com
1 attachment(s)

Here is a patch that applies over the "reducing the overhead of
frequent table locks" (fastlock-v3) patch and allows heavyweight VXID
locks to spring into existence only when someone wants to wait on
them. I believe there is a large benefit to be had from this
optimization, because the combination of these two patches virtually
eliminates lock manager traffic on "pgbench -S" workloads. However,
there are several flies in the ointment.

1. It's a bit of a kludge. I leave it to readers of the patch to
determine exactly what about this patch they think is kludgey, but
it's likely not the empty set. I suspect that MyProc->fpLWLock needs
to be renamed to something a bit more generic if we're going to use it
like this, but I don't immediately know what to call it. Also, the
mechanism whereby we take SInvalWriteLock to work out the mapping from
BackendId to PGPROC * is not exactly awesome. I don't think it
matters from a performance point of view, because operations that need
VXID locks are sufficiently rare that the additional lwlock traffic
won't matter a bit. However, we could avoid this altogether if we
rejiggered the mechanism for allocating PGPROCs and backend IDs.
Right now, we allocate PGPROCs off of linked lists, except for
auxiliary procs which allocate them by scanning a three-element array
for an empty slot. Then, when the PGPROC subscribes to sinval, the
sinval mechanism allocates a backend ID by scanning for the lowest
unused backend ID in the ProcState array. If we changed the logic for
allocating PGPROCs to mimic what the sinval queue currently does, then
the backend ID could be defined as the offset into the PGPROC array.
Translating between a backend ID and a PGPROC * now becomes a matter
of pointer arithmetic. Not sure if this is worth doing.
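
For illustration, here is a minimal sketch (not actual PostgreSQL code; the
array and variable names are placeholders) of how the BackendId -> PGPROC *
translation would reduce to pointer arithmetic under that scheme:

#include <stddef.h>

/* Simplified stand-in for the real shared-memory PGPROC structure. */
typedef struct PGPROC { int pid; /* ... */ } PGPROC;

static PGPROC *proc_array;      /* hypothetical: the single shared PGPROC array */
static int     max_backends;

/*
 * If backend IDs were 1-based offsets into proc_array, no lock (and no
 * sinval lookup) would be needed to find the PGPROC for a given backend ID.
 */
static PGPROC *
backend_id_to_proc(int backend_id)
{
    if (backend_id < 1 || backend_id > max_backends)
        return NULL;
    return &proc_array[backend_id - 1];
}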

2. Bad things happen with large numbers of connections. This patch
increases peak performance, but as you increase the number of
concurrent connections beyond the number of CPU cores, performance
drops off faster with the patch than without it. For example, on the
32-core loaner from Nate Boley, using 80 pgbench -S clients, unpatched
HEAD runs at ~36K TPS; with fastlock, it jumps up to about ~99K TPS;
with this patch also applied, it drops down to about ~64K TPS, despite
the fact that nearly all the contention on the lock manager locks has
been eliminated. On Stefan Kaltenbrunner's 40-core box, he was
actually able to see performance drop down below unpatched HEAD with
this applied! This is immensely counterintuitive. What is going on?

Profiling reveals that the system spends enormous amounts of CPU time
in s_lock. LWLOCK_STATS reveals that the only lwlock with significant
amounts of blocking is the BufFreelistLock; but that doesn't explain
the high CPU utilization. In fact, it appears that the problem is
with the LWLocks that are frequently acquired in *shared* mode. There
is no actual lock conflict, but each LWLock is protected by a spinlock
which must be acquired and released to bump the shared locker counts.
In HEAD, everything bottlenecks on the lock manager locks and so it's
not really possible for enough traffic to build up on any single
spinlock to have a serious impact on performance. The locks being
sought there are exclusive, so when they are contended, processes just
get descheduled. But with the exclusive locks out of the way,
everyone very quickly lines up to acquire shared buffer manager locks,
buffer content locks, etc., and large pile-ups ensue, leading to
massive cache line contention and tons of CPU usage. My initial
thought was that this was contention over the root block of the index
on the pgbench_accounts table and the buf mapping lock protecting it,
but instrumentation showed otherwise. I hacked up the system to
report how often each lwlock spinlock exceeded spins_per_delay. The
following is the end of a report showing the locks with the greatest
amounts of excess spinning:

lwlock 0: shacq 0 exacq 191032 blk 42554 spin 272
lwlock 41: shacq 5982347 exacq 11937 blk 1825 spin 4217
lwlock 38: shacq 6443278 exacq 11960 blk 1726 spin 4440
lwlock 47: shacq 6106601 exacq 12096 blk 1555 spin 4497
lwlock 34: shacq 6423317 exacq 11896 blk 1863 spin 4776
lwlock 45: shacq 6455173 exacq 12052 blk 1825 spin 4926
lwlock 39: shacq 6867446 exacq 12067 blk 1899 spin 5071
lwlock 44: shacq 6824502 exacq 12040 blk 1655 spin 5153
lwlock 37: shacq 6727304 exacq 11935 blk 2077 spin 5252
lwlock 46: shacq 6862206 exacq 12017 blk 2046 spin 5352
lwlock 36: shacq 6854326 exacq 11920 blk 1914 spin 5441
lwlock 43: shacq 7184761 exacq 11874 blk 1863 spin 5625
lwlock 48: shacq 7612458 exacq 12109 blk 2029 spin 5780
lwlock 35: shacq 7150616 exacq 11916 blk 2026 spin 5782
lwlock 33: shacq 7536878 exacq 11985 blk 2105 spin 6273
lwlock 40: shacq 7199089 exacq 12068 blk 2305 spin 6290
lwlock 456: shacq 36258224 exacq 0 blk 0 spin 54264
lwlock 42: shacq 43012736 exacq 11851 blk 10675 spin 62017
lwlock 4: shacq 72516569 exacq 190 blk 196 spin 341914
lwlock 5: shacq 145042917 exacq 0 blk 0 spin 798891
grand total: shacq 544277977 exacq 181886079 blk 82371 spin 1338135
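
(For reference, a rough sketch of the kind of per-lock counter used to
produce the report above; the constants and primitives here are illustrative
stand-ins, not the actual s_lock.c code.)

#include <stdbool.h>

#define SPINS_PER_DELAY   100   /* illustrative; the real value is adaptive */
#define NUM_TRACKED_LOCKS 512   /* illustrative bound on instrumented lock ids */

typedef volatile int slock_t;

/* Stand-ins for the real TAS() and delay primitives (GCC builtins). */
static bool tas(slock_t *lock) { return __sync_lock_test_and_set(lock, 1) != 0; }
static void perform_spin_delay(void) { /* e.g. pg_usleep() with backoff */ }

/* One counter per lock, dumped at backend exit to build the report. */
static long excess_spins[NUM_TRACKED_LOCKS];

static void
s_lock_instrumented(slock_t *lock, int lock_id)
{
    int  spins = 0;
    bool counted = false;

    while (tas(lock))
    {
        if (++spins > SPINS_PER_DELAY)
        {
            if (!counted)
            {
                excess_spins[lock_id]++;   /* this acquisition blew its spin budget */
                counted = true;
            }
            perform_spin_delay();
            spins = 0;
        }
    }
}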

So, the majority (60%) of the excess spinning appears to be due to
SInvalReadLock. A good chunk (25%) is due to ProcArrayLock. And
everything else is peanuts by comparison, though I am guessing the
third and fourth places (5% and 4%, respectively) are in fact the
buffer mapping lock that covers the pgbench_accounts_pkey root index
block, and the content lock on that buffer.

What is to be done?

The SInvalReadLock acquisitions are all attributable, I believe, to
AcceptInvalidationMessages(), which is called in a number of places,
but in particular, after every heavyweight lock acquisition. I think
we need a quick way to short-circuit the lock acquisition there when
no work is to be done, which is to say, nearly always. Indeed, Noah
Misch just proposed something along these lines on another thread
("Make relation_openrv atomic wrt DDL"), though I think this data may
cast a new light on the details.
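
As a very rough illustration of the sort of short-circuit I have in mind
(ignoring the memory-ordering and signaling details a real implementation
would have to get right; the names here are made up):

#include <stdbool.h>

/* Hypothetical shared and backend-local state, for illustration only. */
typedef struct { volatile int maxMsgNum; } SInvalSharedState;

static SInvalSharedState *shared_seg;   /* the shared invalidation segment */
static int my_next_msg_num;             /* last message number this backend consumed */

/*
 * Cheap, unlocked pre-check: if nothing has been added to the queue since we
 * last caught up, skip taking SInvalReadLock entirely; only when this returns
 * true do we fall back to the existing locked read path.
 */
static bool
sinval_might_have_messages(void)
{
    return shared_seg->maxMsgNum != my_next_msg_num;
}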

I haven't tracked down where the ProcArrayLock acquisitions are coming
from. The realistic possibilities appear to be
TransactionIdIsInProgress(), TransactionIdIsActive(), GetOldestXmin(),
and GetSnapshotData(). Nor do I have a clear idea what to do about
this.

The remaining candidates are mild by comparison, so I won't analyze
them further here for the moment.

Another way to attack this problem would be to come up with some more
general mechanism to make shared-lwlock acquisition cheaper, such as
having 3 or 4 shared-locker counts per lwlock, each with a separate
spinlock. Then, at least in the case where there's no real lwlock
contention, the spin-waiters can spread out across all of them. But
I'm not sure it's really worth it, considering that we have only a
handful of cases where this problem appears to be severe. But we
probably need to see what happens when we fix some of the current
cases where this is happening. If throughput goes up, then we're
good. If it just shifts the spin lock pile-up to someplace where it's
not so easily eliminated, then we might need to either eliminate all
the problem cases one by one, or else come up with some more general
mechanism.
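
To make that idea concrete, here is a rough sketch of what a striped
shared-locker count might look like; this is an illustration rather than a
proposal-quality design, with spinlock primitives approximated by GCC
builtins:

#include <stdbool.h>

#define NUM_SHARED_STRIPES 4
#define CACHE_LINE_SIZE    64

typedef volatile int slock_t;

static void spin_lock(slock_t *l)   { while (__sync_lock_test_and_set(l, 1)) ; }
static void spin_unlock(slock_t *l) { __sync_lock_release(l); }

/* One shared-locker count per stripe, each padded to its own cache line. */
typedef struct
{
    slock_t mutex;
    int     shared_count;
    char    pad[CACHE_LINE_SIZE - sizeof(slock_t) - sizeof(int)];
} SharedCountStripe;

typedef struct
{
    slock_t           mutex;        /* still guards exclusive state and wait queue */
    bool              exclusive;
    SharedCountStripe shared[NUM_SHARED_STRIPES];
} StripedLWLock;

/*
 * An uncontended shared acquisition touches only one stripe (picked here by
 * backend id), so concurrent shared lockers spread across several spinlocks
 * instead of piling up on one.  A real implementation would still have to
 * check for an exclusive holder and queue behind it, and exclusive
 * acquisition (not shown) must examine all the stripes, making it costlier.
 */
static void
lwlock_acquire_shared(StripedLWLock *lock, int my_backend_id)
{
    SharedCountStripe *stripe = &lock->shared[my_backend_id % NUM_SHARED_STRIPES];

    spin_lock(&stripe->mutex);
    stripe->shared_count++;
    spin_unlock(&stripe->mutex);
}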

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

lazyvxid-v1.patch (application/octet-stream)
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 2ca1c14..7cc47bb 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1696,14 +1696,7 @@ StartTransaction(void)
 	/*
 	 * Lock the virtual transaction id before we announce it in the proc array
 	 */
-	VirtualXactLockTableInsert(vxid);
-
-	/*
-	 * Advertise it in the proc array.	We assume assignment of
-	 * LocalTransactionID is atomic, and the backendId should be set already.
-	 */
-	Assert(MyProc->backendId == vxid.backendId);
-	MyProc->lxid = vxid.localTransactionId;
+	VirtualXactLockInitialize(vxid);
 
 	TRACE_POSTGRESQL_TRANSACTION_START(vxid.localTransactionId);
 
@@ -1849,6 +1842,7 @@ CommitTransaction(void)
 	 * must be done _before_ releasing locks we hold and _after_
 	 * RecordTransactionCommit.
 	 */
+	VirtualXactLockCleanup();
 	ProcArrayEndTransaction(MyProc, latestXid);
 
 	/*
@@ -2111,6 +2105,7 @@ PrepareTransaction(void)
 	 * done *after* the prepared transaction has been marked valid, else
 	 * someone may think it is unlocked and recyclable.
 	 */
+	VirtualXactLockCleanup();
 	ProcArrayClearTransaction(MyProc);
 
 	/*
@@ -2277,6 +2272,7 @@ AbortTransaction(void)
 	 * must be done _before_ releasing locks we hold and _after_
 	 * RecordTransactionAbort.
 	 */
+	VirtualXactLockCleanup();
 	ProcArrayEndTransaction(MyProc, latestXid);
 
 	/*
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index b7c021d..a583399 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -482,7 +482,7 @@ DefineIndex(RangeVar *heapRelation,
 
 	while (VirtualTransactionIdIsValid(*old_lockholders))
 	{
-		VirtualXactLockTableWait(*old_lockholders);
+		VirtualXactLock(*old_lockholders, true);
 		old_lockholders++;
 	}
 
@@ -568,7 +568,7 @@ DefineIndex(RangeVar *heapRelation,
 
 	while (VirtualTransactionIdIsValid(*old_lockholders))
 	{
-		VirtualXactLockTableWait(*old_lockholders);
+		VirtualXactLock(*old_lockholders, true);
 		old_lockholders++;
 	}
 
@@ -665,7 +665,7 @@ DefineIndex(RangeVar *heapRelation,
 		}
 
 		if (VirtualTransactionIdIsValid(old_snapshots[i]))
-			VirtualXactLockTableWait(old_snapshots[i]);
+			VirtualXactLock(old_snapshots[i], true);
 	}
 
 	/*
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index fcc912f..9de1c9d 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -1991,7 +1991,7 @@ do_autovacuum(void)
 			backendID = GetTempNamespaceBackendId(classForm->relnamespace);
 
 			/* We just ignore it if the owning backend is still active */
-			if (backendID == MyBackendId || !BackendIdIsActive(backendID))
+			if (backendID == MyBackendId || BackendIdGetProc(backendID) == NULL)
 			{
 				/*
 				 * We found an orphan temp table (which was probably left
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index e7593fa..2174061 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -363,7 +363,6 @@ ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid)
 		LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
 
 		proc->xid = InvalidTransactionId;
-		proc->lxid = InvalidLocalTransactionId;
 		proc->xmin = InvalidTransactionId;
 		/* must be cleared with xid/xmin: */
 		proc->vacuumFlags &= ~PROC_VACUUM_STATE_MASK;
@@ -390,7 +389,6 @@ ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid)
 		 */
 		Assert(!TransactionIdIsValid(proc->xid));
 
-		proc->lxid = InvalidLocalTransactionId;
 		proc->xmin = InvalidTransactionId;
 		/* must be cleared with xid/xmin: */
 		proc->vacuumFlags &= ~PROC_VACUUM_STATE_MASK;
@@ -421,7 +419,6 @@ ProcArrayClearTransaction(PGPROC *proc)
 	 * ProcArray.
 	 */
 	proc->xid = InvalidTransactionId;
-	proc->lxid = InvalidLocalTransactionId;
 	proc->xmin = InvalidTransactionId;
 	proc->recoveryConflictPending = false;
 
diff --git a/src/backend/storage/ipc/sinvaladt.c b/src/backend/storage/ipc/sinvaladt.c
index 1df20c4..df6ffab 100644
--- a/src/backend/storage/ipc/sinvaladt.c
+++ b/src/backend/storage/ipc/sinvaladt.c
@@ -139,6 +139,7 @@ typedef struct ProcState
 {
 	/* procPid is zero in an inactive ProcState array entry. */
 	pid_t		procPid;		/* PID of backend, for signaling */
+	PGPROC	   *proc;			/* PGPROC of backend */
 	/* nextMsgNum is meaningless if procPid == 0 or resetState is true. */
 	int			nextMsgNum;		/* next message number to read */
 	bool		resetState;		/* backend needs to reset its state */
@@ -245,6 +246,7 @@ CreateSharedInvalidationState(void)
 	for (i = 0; i < shmInvalBuffer->maxBackends; i++)
 	{
 		shmInvalBuffer->procState[i].procPid = 0;		/* inactive */
+		shmInvalBuffer->procState[i].proc = NULL;
 		shmInvalBuffer->procState[i].nextMsgNum = 0;	/* meaningless */
 		shmInvalBuffer->procState[i].resetState = false;
 		shmInvalBuffer->procState[i].signaled = false;
@@ -313,6 +315,7 @@ SharedInvalBackendInit(bool sendOnly)
 
 	/* mark myself active, with all extant messages already read */
 	stateP->procPid = MyProcPid;
+	stateP->proc = MyProc;
 	stateP->nextMsgNum = segP->maxMsgNum;
 	stateP->resetState = false;
 	stateP->signaled = false;
@@ -352,6 +355,7 @@ CleanupInvalidationState(int status, Datum arg)
 
 	/* Mark myself inactive */
 	stateP->procPid = 0;
+	stateP->proc = NULL;
 	stateP->nextMsgNum = 0;
 	stateP->resetState = false;
 	stateP->signaled = false;
@@ -368,13 +372,16 @@ CleanupInvalidationState(int status, Datum arg)
 }
 
 /*
- * BackendIdIsActive
- *		Test if the given backend ID is currently assigned to a process.
+ * BackendIdGetProc
+ *		Get the PGPROC structure for a backend, given the backend ID.
+ *		The result may be out of date arbitrarily quickly, so the caller
+ *		must be careful about how this information is used.  NULL is
+ *		returned if the backend is not active.
  */
-bool
-BackendIdIsActive(int backendID)
+PGPROC *
+BackendIdGetProc(int backendID)
 {
-	bool		result;
+	PGPROC	   *result = NULL;
 	SISeg	   *segP = shmInvalBuffer;
 
 	/* Need to lock out additions/removals of backends */
@@ -384,10 +391,8 @@ BackendIdIsActive(int backendID)
 	{
 		ProcState  *stateP = &segP->procState[backendID - 1];
 
-		result = (stateP->procPid != 0);
+		result = stateP->proc;
 	}
-	else
-		result = false;
 
 	LWLockRelease(SInvalWriteLock);
 
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 75b5ab4..3456e4a 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -82,7 +82,7 @@ InitRecoveryTransactionEnvironment(void)
 	 */
 	vxid.backendId = MyBackendId;
 	vxid.localTransactionId = GetNextLocalTransactionId();
-	VirtualXactLockTableInsert(vxid);
+	VirtualXactLockInitialize(vxid);
 
 	standbyState = STANDBY_INITIALIZED;
 }
@@ -201,7 +201,7 @@ ResolveRecoveryConflictWithVirtualXIDs(VirtualTransactionId *waitlist,
 		standbyWait_us = STANDBY_INITIAL_WAIT_US;
 
 		/* wait until the virtual xid is gone */
-		while (!ConditionalVirtualXactLockTableWait(*waitlist))
+		while (!VirtualXactLock(*waitlist, false))
 		{
 			/*
 			 * Report via ps if we have been waiting for more than 500 msec
diff --git a/src/backend/storage/lmgr/lmgr.c b/src/backend/storage/lmgr/lmgr.c
index 859b385..9d0994e 100644
--- a/src/backend/storage/lmgr/lmgr.c
+++ b/src/backend/storage/lmgr/lmgr.c
@@ -514,70 +514,6 @@ ConditionalXactLockTableWait(TransactionId xid)
 	return true;
 }
 
-
-/*
- *		VirtualXactLockTableInsert
- *
- * Insert a lock showing that the given virtual transaction ID is running ---
- * this is done at main transaction start when its VXID is assigned.
- * The lock can then be used to wait for the transaction to finish.
- */
-void
-VirtualXactLockTableInsert(VirtualTransactionId vxid)
-{
-	LOCKTAG		tag;
-
-	Assert(VirtualTransactionIdIsValid(vxid));
-
-	SET_LOCKTAG_VIRTUALTRANSACTION(tag, vxid);
-
-	(void) LockAcquire(&tag, ExclusiveLock, false, false);
-}
-
-/*
- *		VirtualXactLockTableWait
- *
- * Waits until the lock on the given VXID is released, which shows that
- * the top-level transaction owning the VXID has ended.
- */
-void
-VirtualXactLockTableWait(VirtualTransactionId vxid)
-{
-	LOCKTAG		tag;
-
-	Assert(VirtualTransactionIdIsValid(vxid));
-
-	SET_LOCKTAG_VIRTUALTRANSACTION(tag, vxid);
-
-	(void) LockAcquire(&tag, ShareLock, false, false);
-
-	LockRelease(&tag, ShareLock, false);
-}
-
-/*
- *		ConditionalVirtualXactLockTableWait
- *
- * As above, but only lock if we can get the lock without blocking.
- * Returns TRUE if the lock was acquired.
- */
-bool
-ConditionalVirtualXactLockTableWait(VirtualTransactionId vxid)
-{
-	LOCKTAG		tag;
-
-	Assert(VirtualTransactionIdIsValid(vxid));
-
-	SET_LOCKTAG_VIRTUALTRANSACTION(tag, vxid);
-
-	if (LockAcquire(&tag, ShareLock, false, true) == LOCKACQUIRE_NOT_AVAIL)
-		return false;
-
-	LockRelease(&tag, ShareLock, false);
-
-	return true;
-}
-
-
 /*
  *		LockDatabaseObject
  *
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 01df472..d4a6a8f 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -38,6 +38,7 @@
 #include "miscadmin.h"
 #include "pg_trace.h"
 #include "pgstat.h"
+#include "storage/sinvaladt.h"
 #include "storage/standby.h"
 #include "utils/memutils.h"
 #include "utils/ps_status.h"
@@ -138,6 +139,9 @@ static int			FastPathLocalUseCount = 0;
 #define FAST_PATH_CHECK_LOCKMODE(proc, n, l) \
 	 ((proc)->fpLockBits & (UINT64CONST(1) << FAST_PATH_BIT_POSITION(n, l)))
 
+#define FAST_PATH_DEFER_VXID_LOCK		\
+	(UINT64CONST(1) << (FAST_PATH_BITS_PER_SLOT * FP_LOCK_SLOTS_PER_BACKEND))
+
 /*
  * The fast-path lock mechanism is concerned only with relation locks on
  * unshared relations by backends bound to a database.  The fast-path
@@ -3512,3 +3516,154 @@ lock_twophase_postabort(TransactionId xid, uint16 info,
 {
 	lock_twophase_postcommit(xid, info, recdata, len);
 }
+
+/*
+ *		VirtualXactLockInitialize
+ *
+ *		We set a flag in MyProc->fpLockState indicating that we have a
+ *		"deferred" VXID lock.  That is, we have an active VXID, but we
+ *		haven't actually taken an exclusive lock on it.  VXID locks are
+ *		rarely waited for, so it makes sense to defer the actual lock
+ *		acquisition to the point when it's needed.  Another backend wishing
+ *		to wait on the lock can acquire the lock on our behalf and then
+ *		wait on it.  We'll figure it all out in VirtualXactLockCleanup().
+ *
+ *		We set lxid while holding the lock to guarantee that anyone who
+ *		sees the lxid set and subsequently takes our fpLWLock will also
+ *		see the FAST_PATH_DEFER_VXID lock bit set.
+ */
+void
+VirtualXactLockInitialize(VirtualTransactionId vxid)
+{
+	Assert(VirtualTransactionIdIsValid(vxid));
+
+	LWLockAcquire(MyProc->fpLWLock, LW_EXCLUSIVE);
+
+	Assert(MyProc->backendId == vxid.backendId);
+	MyProc->lxid = vxid.localTransactionId;
+	MyProc->fpLockBits |= FAST_PATH_DEFER_VXID_LOCK;
+
+	LWLockRelease(MyProc->fpLWLock);
+}
+
+/*
+ *		VirtualXactLockCleanup
+ *
+ *		Check whether a VXID lock has been materialized; if so, release it,
+ *		unblocking waiters.
+ */
+void
+VirtualXactLockCleanup()
+{
+	VirtualTransactionId	vxid;
+	bool	cleanup = false;
+
+	Assert(MyProc->backendId != InvalidBackendId);
+	Assert(MyProc->lxid != InvalidLocalTransactionId);
+
+	GET_VXID_FROM_PGPROC(vxid, *MyProc);
+
+	LWLockAcquire(MyProc->fpLWLock, LW_EXCLUSIVE);
+
+	if ((MyProc->fpLockBits & FAST_PATH_DEFER_VXID_LOCK) != 0)
+		MyProc->fpLockBits &= ~FAST_PATH_DEFER_VXID_LOCK;
+	else
+		cleanup = true;
+	MyProc->lxid = InvalidLocalTransactionId;
+
+	LWLockRelease(MyProc->fpLWLock);
+
+	/* If someone materialized the lock on our behalf, we must release it. */
+	if (cleanup)
+	{
+		LOCKTAG	locktag;
+
+		SET_LOCKTAG_VIRTUALTRANSACTION(locktag, vxid);
+		LockRefindAndRelease(LockMethods[DEFAULT_LOCKMETHOD], MyProc,
+							 &locktag, ExclusiveLock, false);
+	}	
+}
+
+/*
+ *		VirtualXactLock
+ *
+ * If wait = true, wait until the given VXID has been released, and then
+ * return true.
+ *
+ * If wait = false, just check whether the VXID is still running, and return
+ * true or false.
+ */
+bool
+VirtualXactLock(VirtualTransactionId vxid, bool wait)
+{
+	LOCKTAG		tag;
+	PGPROC	   *proc;
+
+	Assert(VirtualTransactionIdIsValid(vxid));
+
+	SET_LOCKTAG_VIRTUALTRANSACTION(tag, vxid);
+
+	/*
+	 * If a lock table entry must be made, this is the PGPROC on whose behalf
+	 * it must be done.  Note that the transaction might end or the PGPROC
+	 * might be reassigned to a new backend before we get around to examining
+	 * it, but it doesn't matter.  If we find upon examination that the
+	 * relevant lxid is no longer running here, that's enough to prove that
+	 * it's no longer running anywhere.
+	 */
+	proc = BackendIdGetProc(vxid.backendId);
+
+	/*
+	 * We must acquire this lock before checking the backendId and lxid
+	 * against the ones we're waiting for.  The target backend will only
+	 * set or clear lxid while holding this lock.
+	 */
+	LWLockAcquire(proc->fpLWLock, LW_EXCLUSIVE);
+
+	/* If the transaction has ended, our work here is done. */
+	if (proc->backendId != vxid.backendId || proc->lxid != vxid.localTransactionId)
+	{
+		LWLockRelease(proc->fpLWLock);
+		return true;
+	}
+
+	/*
+	 * If we aren't asked to wait, there's no need to set up a lock table
+	 * entry.  The transaction is still in progress, so just return false.
+	 */
+	if (!wait)
+	{
+		LWLockRelease(proc->fpLWLock);
+		return false;
+	}
+
+	/*
+	 * OK, we're going to need to sleep on the VXID.  But first, we must set
+	 * up the primary lock table entry, if needed.
+	 */
+	if ((proc->fpLockBits & FAST_PATH_DEFER_VXID_LOCK) != 0)
+	{
+		PROCLOCK   *proclock;
+		uint32		hashcode;
+
+		hashcode = LockTagHashCode(&tag);
+		proclock = SetupLockInTable(LockMethods[DEFAULT_LOCKMETHOD], proc,
+									&tag, hashcode, ExclusiveLock);
+		if (!proclock)
+			ereport(ERROR,
+					(errcode(ERRCODE_OUT_OF_MEMORY),
+					 errmsg("out of shared memory"),
+		  errhint("You might need to increase max_locks_per_transaction.")));
+		GrantLock(proclock->tag.myLock, proclock, ExclusiveLock);
+		proc->fpLockBits &= ~FAST_PATH_DEFER_VXID_LOCK;
+	}
+
+	/* Done with proc->fpLockBits */
+	LWLockRelease(proc->fpLWLock);
+
+	/* Time to wait. */
+	(void) LockAcquire(&tag, ShareLock, false, false);
+
+	LockRelease(&tag, ShareLock, false);
+	return true;
+}
diff --git a/src/include/storage/lmgr.h b/src/include/storage/lmgr.h
index bd44d92..340f6a3 100644
--- a/src/include/storage/lmgr.h
+++ b/src/include/storage/lmgr.h
@@ -56,11 +56,6 @@ extern void XactLockTableDelete(TransactionId xid);
 extern void XactLockTableWait(TransactionId xid);
 extern bool ConditionalXactLockTableWait(TransactionId xid);
 
-/* Lock a VXID (used to wait for a transaction to finish) */
-extern void VirtualXactLockTableInsert(VirtualTransactionId vxid);
-extern void VirtualXactLockTableWait(VirtualTransactionId vxid);
-extern bool ConditionalVirtualXactLockTableWait(VirtualTransactionId vxid);
-
 /* Lock a general object (other than a relation) of the current database */
 extern void LockDatabaseObject(Oid classid, Oid objid, uint16 objsubid,
 				   LOCKMODE lockmode);
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index 6df878d..e72692c 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -537,4 +537,9 @@ extern void DumpLocks(PGPROC *proc);
 extern void DumpAllLocks(void);
 #endif
 
+/* Lock a VXID (used to wait for a transaction to finish) */
+extern void VirtualXactLockInitialize(VirtualTransactionId vxid);
+extern void VirtualXactLockCleanup(void);
+extern bool VirtualXactLock(VirtualTransactionId vxid, bool wait);
+
 #endif   /* LOCK_H */
diff --git a/src/include/storage/sinvaladt.h b/src/include/storage/sinvaladt.h
index c703558..a61d696 100644
--- a/src/include/storage/sinvaladt.h
+++ b/src/include/storage/sinvaladt.h
@@ -22,6 +22,7 @@
 #ifndef SINVALADT_H
 #define SINVALADT_H
 
+#include "storage/proc.h"
 #include "storage/sinval.h"
 
 /*
@@ -30,7 +31,7 @@
 extern Size SInvalShmemSize(void);
 extern void CreateSharedInvalidationState(void);
 extern void SharedInvalBackendInit(bool sendOnly);
-extern bool BackendIdIsActive(int backendID);
+extern PGPROC *BackendIdGetProc(int backendID);
 
 extern void SIInsertDataEntries(const SharedInvalidationMessage *data, int n);
 extern int	SIGetDataEntries(SharedInvalidationMessage *data, int datasize);
#2Greg Stark
stark@mit.edu
In reply to: Robert Haas (#1)
Re: lazy vxid locks, v1

On Sun, Jun 12, 2011 at 10:39 PM, Robert Haas <robertmhaas@gmail.com> wrote:

I hacked up the system to
report how often each lwlock spinlock exceeded spins_per_delay.

I don't doubt the rest of your analysis but one thing to note, number
of spins on a spinlock is not the same as the amount of time spent
waiting for it.

When there's contention on a spinlock the actual test-and-set
instruction ends up taking a long time while cache lines are copied
around. In theory you could have processes spending an inordinate
amount of time waiting on a spinlock even though they never actually
hit spins_per_delay or you could have processes that quickly exceed
spins_per_delay.

I think in practice the results are the same because the code the
spinlocks protect is always short so it's hard to get the second case
on a multi-core box without actually having contention anyways.

--
greg

#3Robert Haas
robertmhaas@gmail.com
In reply to: Greg Stark (#2)
Re: lazy vxid locks, v1

On Sun, Jun 12, 2011 at 5:58 PM, Greg Stark <stark@mit.edu> wrote:

On Sun, Jun 12, 2011 at 10:39 PM, Robert Haas <robertmhaas@gmail.com> wrote:

I hacked up the system to
report how often each lwlock spinlock exceeded spins_per_delay.

I don't doubt the rest of your analysis but one thing to note, number
of spins on a spinlock is not the same as the amount of time spent
waiting for it.

When there's contention on a spinlock the actual test-and-set
instruction ends up taking a long time while cache lines are copied
around. In theory you could have processes spending an inordinate
amount of time waiting on a spinlock even though they never actually
hit spins_per_delay or you could have processes that quickly exceed
spins_per_delay.

I think in practice the results are the same because the code the
spinlocks protect is always short so it's hard to get the second case
on a multi-core box without actually having contention anyways.

All good points. I don't immediately have a better way of measuring
what's going on. Maybe dtrace could do it, but I don't really know
how to use it and am not sure it's set up on any of the boxes I have
for testing. Throwing gettimeofday() calls into SpinLockAcquire()
seems likely to change the overall system behavior enough to make the
results utterly meaningless. It wouldn't be real difficult to count
the number of times that we TAS() rather than just counting the number
of times we TAS() more than spins_per_delay, but I'm not sure whether
that would really address your concern. Hopefully, further
experimentation will make things more clear.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#4Stefan Kaltenbrunner
stefan@kaltenbrunner.cc
In reply to: Robert Haas (#1)
Re: lazy vxid locks, v1

On 06/12/2011 11:39 PM, Robert Haas wrote:

Here is a patch that applies over the "reducing the overhead of
frequent table locks" (fastlock-v3) patch and allows heavyweight VXID
locks to spring into existence only when someone wants to wait on
them. I believe there is a large benefit to be had from this
optimization, because the combination of these two patches virtually
eliminates lock manager traffic on "pgbench -S" workloads. However,
there are several flies in the ointment.

1. It's a bit of a kludge. I leave it to readers of the patch to
determine exactly what about this patch they think is kludgey, but
it's likely not the empty set. I suspect that MyProc->fpLWLock needs
to be renamed to something a bit more generic if we're going to use it
like this, but I don't immediately know what to call it. Also, the
mechanism whereby we take SInvalWriteLock to work out the mapping from
BackendId to PGPROC * is not exactly awesome. I don't think it
matters from a performance point of view, because operations that need
VXID locks are sufficiently rare that the additional lwlock traffic
won't matter a bit. However, we could avoid this altogether if we
rejiggered the mechanism for allocating PGPROCs and backend IDs.
Right now, we allocate PGPROCs off of linked lists, except for
auxiliary procs which allocate them by scanning a three-element array
for an empty slot. Then, when the PGPROC subscribes to sinval, the
sinval mechanism allocates a backend ID by scanning for the lowest
unused backend ID in the ProcState array. If we changed the logic for
allocating PGPROCs to mimic what the sinval queue currently does, then
the backend ID could be defined as the offset into the PGPROC array.
Translating between a backend ID and a PGPROC * now becomes a matter
of pointer arithmetic. Not sure if this is worth doing.

2. Bad things happen with large numbers of connections. This patch
increases peak performance, but as you increase the number of
concurrent connections beyond the number of CPU cores, performance
drops off faster with the patch than without it. For example, on the
32-core loaner from Nate Boley, using 80 pgbench -S clients, unpatched
HEAD runs at ~36K TPS; with fastlock, it jumps up to about ~99K TPS;
with this patch also applied, it drops down to about ~64K TPS, despite
the fact that nearly all the contention on the lock manager locks has
been eliminated. On Stefan Kaltenbrunner's 40-core box, he was
actually able to see performance drop down below unpatched HEAD with
this applied! This is immensely counterintuitive. What is going on?

just to add actual new numbers to the discussion (pgbench -n -S -T 120 -c
X -j X) on that particular 40cores/80 threads box:

unpatched:

c1: tps = 7808.098053 (including connections establishing)
c4: tps = 29941.444359 (including connections establishing)
c8: tps = 58930.293850 (including connections establishing)
c16: tps = 106911.385826 (including connections establishing)
c24: tps = 117401.654430 (including connections establishing)
c32: tps = 110659.627803 (including connections establishing)
c40: tps = 107689.945323 (including connections establishing)
c64: tps = 104835.182183 (including connections establishing)
c80: tps = 101885.549081 (including connections establishing)
c160: tps = 92373.395791 (including connections establishing)
c200: tps = 90614.141246 (including connections establishing)

fast locks:

c1: tps = 7710.824723 (including connections establishing)
c4: tps = 29653.578364 (including connections establishing)
c8: tps = 58827.195578 (including connections establishing)
c16: tps = 112814.382204 (including connections establishing)
c24: tps = 154559.012960 (including connections establishing)
c32: tps = 189281.391250 (including connections establishing)
c40: tps = 215807.263233 (including connections establishing)
c64: tps = 180644.527322 (including connections establishing)
c80: tps = 118266.615543 (including connections establishing)
c160: tps = 68957.999922 (including connections establishing)
c200: tps = 68803.801091 (including connections establishing)

fast locks + lazy vxid:

c1: tps = 7828.644389 (including connections establishing)
c4: tps = 30520.558169 (including connections establishing)
c8: tps = 60207.396385 (including connections establishing)
c16: tps = 117923.775435 (including connections establishing)
c24: tps = 158775.317590 (including connections establishing)
c32: tps = 195768.530589 (including connections establishing)
c40: tps = 223308.779212 (including connections establishing)
c64: tps = 152848.742883 (including connections establishing)
c80: tps = 65738.046558 (including connections establishing)
c160: tps = 57075.304457 (including connections establishing)
c200: tps = 59107.675182 (including connections establishing)

so my reading of that is that we currently "only" scale well to ~12
physical cores; the fast locks patch gets us pretty nicely past that
point to a total scale of a bit better than 2x, but it degrades fairly
quickly after that point, and at a level of 2x the number of threads in
the box we are only able to get 2/3 of unpatched -HEAD(!).

with the lazy vxid patch on top the curve looks even more interesting:
we are scaling to an even higher peak, but we degrade even worse, and at
c80 (which equals the number of threads in the box) we are already only
able to get the amount of tps that unpatched -HEAD would give at ~10 cores.
Another thing worth noting is that with the patches we have MUCH less
idle time - which is good for the cases where we are getting a benefit
(as in higher throughput) - but the extreme case now is fast locks + lazy
vxid, which gets us to less than 8% idle @c160 BUT only 57000 tps, while
unpatched -HEAD is 75% idle and doing 92000 tps; or said otherwise - we
need almost 4x the computing resources to get only 2/3 of the performance
(so ~6x WORSE on a CPU/tps scale).

all those tests are done with pgbench running on the same box - which
has a noticeable impact on the results because pgbench is using ~1 core
per 8 cores of the backend tested in CPU resources - though I don't think
it causes any changes in the results that would show the performance
behaviour in a different light.

Stefan

#5Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Stefan Kaltenbrunner (#4)
Re: lazy vxid locks, v1

Stefan Kaltenbrunner wrote:

on that particular 40cores/80 threads box:

unpatched:

c40: tps = 107689.945323 (including connections establishing)
c80: tps = 101885.549081 (including connections establishing)

fast locks:

c40: tps = 215807.263233 (including connections establishing)
c80: tps = 118266.615543 (including connections establishing)

fast locks + lazy vxid:

c40: tps = 223308.779212 (including connections establishing)
c80: tps = 65738.046558 (including connections establishing)

Is there any way to disable the HT (or whatever technology attempts
to make each core look like 2)? In my benchmarking that has kept
performance from tanking as badly when a large number of processes
are contending for CPU.

-Kevin

#6Stefan Kaltenbrunner
stefan@kaltenbrunner.cc
In reply to: Kevin Grittner (#5)
Re: lazy vxid locks, v1

On 06/13/2011 02:29 PM, Kevin Grittner wrote:

Stefan Kaltenbrunner wrote:

on that particular 40cores/80 threads box:

unpatched:

c40: tps = 107689.945323 (including connections establishing)
c80: tps = 101885.549081 (including connections establishing)

fast locks:

c40: tps = 215807.263233 (including connections establishing)
c80: tps = 118266.615543 (including connections establishing)

fast locks + lazy vxid:

c40: tps = 223308.779212 (including connections establishing)
c80: tps = 65738.046558 (including connections establishing)

Is there any way to disable the HT (or whatever technology attempts
to make each core look like 2)? In my benchmarking that has kept
performance from tanking as badly when a large number of processes
are contending for CPU.

I can do that tomorrow, but I have now done a fair amount of
benchmarking on that box using various tests, and for CPU-intensive
workloads (various math stuff, parallel compiles of the Linux kernel,
some in-house stuff, and some other database) I usually get a 60-70x
speedup over just using a single core, and most recent CPUs (this one is
actually a brand new Westmere-EX) showed pretty good scaling with
HT/threading.
I'm actually pretty sure that at least in some benchmarks it was not HT
that was the real problem but rather our general inability to scale much
beyond 10-12 cores for reads and even worse for writes (due to WAL
contention).

Stefan

#7Stefan Kaltenbrunner
stefan@kaltenbrunner.cc
In reply to: Robert Haas (#1)
Re: lazy vxid locks, v1

On 06/12/2011 11:39 PM, Robert Haas wrote:

Here is a patch that applies over the "reducing the overhead of
frequent table locks" (fastlock-v3) patch and allows heavyweight VXID
locks to spring into existence only when someone wants to wait on
them. I believe there is a large benefit to be had from this
optimization, because the combination of these two patches virtually
eliminates lock manager traffic on "pgbench -S" workloads. However,
there are several flies in the ointment.

1. It's a bit of a kludge. I leave it to readers of the patch to
determine exactly what about this patch they think is kludgey, but
it's likely not the empty set. I suspect that MyProc->fpLWLock needs
to be renamed to something a bit more generic if we're going to use it
like this, but I don't immediately know what to call it. Also, the
mechanism whereby we take SInvalWriteLock to work out the mapping from
BackendId to PGPROC * is not exactly awesome. I don't think it
matters from a performance point of view, because operations that need
VXID locks are sufficiently rare that the additional lwlock traffic
won't matter a bit. However, we could avoid this altogether if we
rejiggered the mechanism for allocating PGPROCs and backend IDs.
Right now, we allocate PGPROCs off of linked lists, except for
auxiliary procs which allocate them by scanning a three-element array
for an empty slot. Then, when the PGPROC subscribes to sinval, the
sinval mechanism allocates a backend ID by scanning for the lowest
unused backend ID in the ProcState array. If we changed the logic for
allocating PGPROCs to mimic what the sinval queue currently does, then
the backend ID could be defined as the offset into the PGPROC array.
Translating between a backend ID and a PGPROC * now becomes a matter
of pointer arithmetic. Not sure if this is worth doing.

2. Bad things happen with large numbers of connections. This patch
increases peak performance, but as you increase the number of
concurrent connections beyond the number of CPU cores, performance
drops off faster with the patch than without it. For example, on the
32-core loaner from Nate Boley, using 80 pgbench -S clients, unpatched
HEAD runs at ~36K TPS; with fastlock, it jumps up to about ~99K TPS;
with this patch also applied, it drops down to about ~64K TPS, despite
the fact that nearly all the contention on the lock manager locks has
been eliminated. On Stefan Kaltenbrunner's 40-core box, he was
actually able to see performance drop down below unpatched HEAD with
this applied! This is immensely counterintuitive. What is going on?

Profiling reveals that the system spends enormous amounts of CPU time
in s_lock.

just to reiterate that with numbers - at 160 threads with both patches
applied the profile looks like:

samples % image name symbol name
828794 75.8662 postgres s_lock
51672 4.7300 postgres LWLockAcquire
51145 4.6817 postgres LWLockRelease
17636 1.6144 postgres GetSnapshotData
7521 0.6885 postgres hash_search_with_hash_value
6193 0.5669 postgres AllocSetAlloc
4527 0.4144 postgres SearchCatCache
4521 0.4138 postgres PinBuffer
3385 0.3099 postgres SIGetDataEntries
3160 0.2893 postgres PostgresMain
2706 0.2477 postgres _bt_compare
2687 0.2460 postgres fmgr_info_cxt_security
1963 0.1797 postgres UnpinBuffer
1846 0.1690 postgres LockAcquireExtended
1770 0.1620 postgres exec_bind_message
1730 0.1584 postgres hash_any
1644 0.1505 postgres ExecInitExpr

even at the peak performance spot of the combined patch-set (-c40) the
contention is noticable in the profile:

samples % image name symbol name
1497826 22.0231 postgres s_lock
592104 8.7059 postgres LWLockAcquire
512213 7.5313 postgres LWLockRelease
230050 3.3825 postgres GetSnapshotData
176252 2.5915 postgres AllocSetAlloc
155122 2.2808 postgres hash_search_with_hash_value
116235 1.7091 postgres SearchCatCache
110197 1.6203 postgres _bt_compare
94101 1.3836 postgres PinBuffer
80119 1.1780 postgres PostgresMain
65584 0.9643 postgres fmgr_info_cxt_security
55198 0.8116 postgres hash_any
52872 0.7774 postgres exec_bind_message
48438 0.7122 postgres LockReleaseAll
46631 0.6856 postgres MemoryContextAlloc
45909 0.6750 postgres ExecInitExpr
42293 0.6219 postgres AllocSetFree

Stefan

#8Stefan Kaltenbrunner
stefan@kaltenbrunner.cc
In reply to: Stefan Kaltenbrunner (#4)
pgbench cpu overhead (was Re: lazy vxid locks, v1)

On 06/13/2011 01:55 PM, Stefan Kaltenbrunner wrote:

[...]

all those tests are done with pgbench running on the same box - which
has a noticeable impact on the results because pgbench is using ~1 core
per 8 cores of the backend tested in CPU resources - though I don't think
it causes any changes in the results that would show the performance
behaviour in a different light.

actual testing against sysbench with the very same workload shows the
following performance behaviour:

with 40 threads(aka the peak performance point):

pgbench: 223308 tps
sysbench: 311584 tps

with 160 threads (backend contention dominated):

pgbench: 57075
sysbench: 43437

so it seems that sysbench actually has significantly less overhead than
pgbench, and the lower throughput at the higher concurrency seems to be
caused by sysbench being able to stress the backend even more than
pgbench can.

for those curious - the profile for pgbench looks like:

samples % symbol name
29378 41.9087 doCustom
17502 24.9672 threadRun
7629 10.8830 pg_strcasecmp
5871 8.3752 compareVariables
2568 3.6633 getVariable
2167 3.0913 putVariable
2065 2.9458 replaceVariable
1971 2.8117 parseVariable
534 0.7618 xstrdup
278 0.3966 xrealloc
137 0.1954 xmalloc

Stefan

#9Tom Lane
tgl@sss.pgh.pa.us
In reply to: Stefan Kaltenbrunner (#7)
Re: lazy vxid locks, v1

Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:

On 06/12/2011 11:39 PM, Robert Haas wrote:

Profiling reveals that the system spends enormous amounts of CPU time
in s_lock.

just to reiterate that with numbers - at 160 threads with both patches
applied the profile looks like:

samples % image name symbol name
828794 75.8662 postgres s_lock

Do you know exactly which spinlocks are being contended on here?
The next few entries

51672 4.7300 postgres LWLockAcquire
51145 4.6817 postgres LWLockRelease
17636 1.6144 postgres GetSnapshotData

suggest that it might be the ProcArrayLock as a result of a huge amount
of snapshot-fetching, but this is very weak evidence for that theory.

regards, tom lane

#10Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#9)
Re: lazy vxid locks, v1

On Mon, Jun 13, 2011 at 10:29 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:

On 06/12/2011 11:39 PM, Robert Haas wrote:

Profiling reveals that the system spends enormous amounts of CPU time
in s_lock.

just to reiterate that with numbers - at 160 threads with both patches
applied the profile looks like:

samples  %        image name               symbol name
828794   75.8662  postgres                 s_lock

Do you know exactly which spinlocks are being contended on here?
The next few entries

51672     4.7300  postgres                 LWLockAcquire
51145     4.6817  postgres                 LWLockRelease
17636     1.6144  postgres                 GetSnapshotData

suggest that it might be the ProcArrayLock as a result of a huge amount
of snapshot-fetching, but this is very weak evidence for that theory.

I don't know for sure what is happening on Stefan's system, but I did
post the results of some research on this exact topic in my original
post.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#11Jeff Janes
jeff.janes@gmail.com
In reply to: Robert Haas (#1)
Re: lazy vxid locks, v1

On Sun, Jun 12, 2011 at 2:39 PM, Robert Haas <robertmhaas@gmail.com> wrote:
...

Profiling reveals that the system spends enormous amounts of CPU time
in s_lock.  LWLOCK_STATS reveals that the only lwlock with significant
amounts of blocking is the BufFreelistLock;

This is curious. Clearly the entire working set fits in RAM, or you
wouldn't be getting numbers like this. But does the entire working set
fit in shared_buffers? If so, you shouldn't see any traffic on
BufFreelistLock once all the data is read in. I've only seen
contention here when all data fits in OS cache memory but not in
shared_buffers.

Cheers,

Jeff

#12Jeff Janes
jeff.janes@gmail.com
In reply to: Stefan Kaltenbrunner (#8)
Re: pgbench cpu overhead (was Re: lazy vxid locks, v1)

On Mon, Jun 13, 2011 at 7:03 AM, Stefan Kaltenbrunner
<stefan@kaltenbrunner.cc> wrote:
...

so it seems that sysbench actually has significantly less overhead than
pgbench, and the lower throughput at the higher concurrency seems to be
caused by sysbench being able to stress the backend even more than
pgbench can.

Hi Stefan,

pgbench sends each query (per connection) and waits for the reply
before sending another.

Do we know whether sysbench does that, or if it just stuffs the
kernel's IPC buffer full of queries without synchronously waiting for
individual replies?

I can't get sysbench to "make" for me, or I'd strace in single client
mode and see what kind of messages are going back and forth.

Cheers,

Jeff

#13Itagaki Takahiro
itagaki.takahiro@gmail.com
In reply to: Jeff Janes (#12)
Re: pgbench cpu overhead (was Re: lazy vxid locks, v1)

On Tue, Jun 14, 2011 at 09:27, Jeff Janes <jeff.janes@gmail.com> wrote:

pgbench sends each query (per connection) and waits for the reply
before sending another.

We can use the -j option to run pgbench in multiple threads to avoid
request starvation. What setting did you use, Stefan?

for those curious - the profile for pgbench looks like:
samples % symbol name
29378 41.9087 doCustom
17502 24.9672 threadRun
7629 10.8830 pg_strcasecmp

If the benchmark client is the bottleneck, it would be better to reduce
pg_strcasecmp calls by resolving meta command names to integer values
(sub-commands of META_COMMAND) instead of doing a string comparison on
each loop iteration.
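
Something along these lines, hypothetically (the command list and field
names are invented for illustration; pgbench's actual structures differ):

#include <strings.h>

/*
 * Resolve meta command words once, when the script is parsed, instead of
 * pg_strcasecmp'ing them on every iteration of the benchmark loop.
 */
typedef enum
{
    META_NONE,
    META_SET,
    META_SETRANDOM,
    META_SLEEP,
    META_SHELL
} MetaCommand;

static MetaCommand
resolve_meta_command(const char *word)
{
    if (strcasecmp(word, "set") == 0)       return META_SET;
    if (strcasecmp(word, "setrandom") == 0) return META_SETRANDOM;
    if (strcasecmp(word, "sleep") == 0)     return META_SLEEP;
    if (strcasecmp(word, "shell") == 0)     return META_SHELL;
    return META_NONE;
}

/* Stored alongside the parsed command; the hot loop then switches on meta. */
typedef struct ParsedCommand
{
    char      **argv;       /* raw words from the script line */
    MetaCommand meta;       /* resolved once by resolve_meta_command(argv[0]) */
} ParsedCommand;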

--
Itagaki Takahiro

#14Greg Smith
greg@2ndQuadrant.com
In reply to: Jeff Janes (#12)
1 attachment(s)
Re: pgbench cpu overhead (was Re: lazy vxid locks, v1)

On 06/13/2011 08:27 PM, Jeff Janes wrote:

pgbench sends each query (per connection) and waits for the reply
before sending another.

Do we know whether sysbench does that, or if it just stuffs the
kernel's IPC buffer full of queries without synchronously waiting for
individual replies?

sysbench creates a thread for each client and lets them go at things at
whatever speed they can handle. You have to set up pgbench with a worker
per core to get them even close to level footing. And even in that
case, sysbench has a significant advantage, because it's got the
commands it runs more or less hard-coded in the program. pgbench is
constantly parsing things in its internal command language and then
turning them into SQL requests. That's flexible and allows it to be
used for some neat custom things, but it uses a lot more resources to
drive the same number of clients.

I can't get sysbench to "make" for me, or I'd strace in single client
mode and see what kind of messages are going back and forth.

If you're using a sysbench tarball, no surprise. It doesn't build on
lots of platforms now. If you grab
http://projects.2ndquadrant.it/sites/default/files/bottom-up-benchmarking.pdf
it has my sysbench notes starting on page 34. I had to checkout the
latest version from their development repo to get it to compile on any
recent system. The attached wrapper script may be helpful to you as
well to help get over the learning curve for how to run the program; it
iterates sysbench over a number of database sizes and thread counts
running the complicated to setup OLTP test.

--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us

Attachments:

oltp-read (text/plain)
#15Greg Smith
greg@2ndQuadrant.com
In reply to: Stefan Kaltenbrunner (#4)
Re: lazy vxid locks, v1

On 06/13/2011 07:55 AM, Stefan Kaltenbrunner wrote:

all those tests are done with pgbench running on the same box - which
has a noticeable impact on the results because pgbench is using ~1 core
per 8 cores of the backend tested in CPU resources - though I don't think
it causes any changes in the results that would show the performance
behaviour in a different light.

Yeah, this used to make a much bigger difference, but nowadays it's not
so important. So long as you have enough cores that you can spare a
chunk of them to drive the test with, and you crank "-j" up to a lot,
there doesn't seem to be much of an advantage to moving the clients to a
remote system now. You end up trading off CPU time for everything going
through the network stack, which adds yet another source of uncertainty to
the whole thing anyway.

I'm glad to see so many people have jumped onto doing these SELECT-only
tests now. The performance farm idea I've been working on runs a test
just like what's proven useful here. I'd suggested that because it's
been really sensitive to changes in locking and buffer management for me.

--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us

#16Alvaro Herrera
alvherre@commandprompt.com
In reply to: Jeff Janes (#12)
Re: pgbench cpu overhead (was Re: lazy vxid locks, v1)

Excerpts from Jeff Janes's message of lun jun 13 20:27:15 -0400 2011:

On Mon, Jun 13, 2011 at 7:03 AM, Stefan Kaltenbrunner
<stefan@kaltenbrunner.cc> wrote:
...

so it seems that sysbench actually has significantly less overhead than
pgbench, and the lower throughput at the higher concurrency seems to be
caused by sysbench being able to stress the backend even more than
pgbench can.

Hi Stefan,

pgbench sends each query (per connection) and waits for the reply
before sending another.

I noticed that pgbench's doCustom (the function highest in the profile
posted) returns doing nothing if the connection is supposed to be
"sleeping"; seems an open door for busy waiting. I didn't check the
rest of the code to see if there's something avoiding that condition. I
also noticed that it seems to be very liberal about calling
INSTR_TIME_SET_CURRENT in the same function, which perhaps could be
optimized by calling it a single time at entry and reusing the value,
but I guess that would show up in the profile as a kernel call so it's
maybe not a problem.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#17Itagaki Takahiro
itagaki.takahiro@gmail.com
In reply to: Alvaro Herrera (#16)
Re: pgbench cpu overhead (was Re: lazy vxid locks, v1)

On Tue, Jun 14, 2011 at 13:09, Alvaro Herrera
<alvherre@commandprompt.com> wrote:

I noticed that pgbench's doCustom (the function highest in the profile
posted) returns doing nothing if the connection is supposed to be
"sleeping"; seems an open door for busy waiting.

pgbench uses select() with/without a timeout in those cases, no?

--
Itagaki Takahiro

#18Robert Haas
robertmhaas@gmail.com
In reply to: Jeff Janes (#11)
Re: lazy vxid locks, v1

On Mon, Jun 13, 2011 at 8:10 PM, Jeff Janes <jeff.janes@gmail.com> wrote:

On Sun, Jun 12, 2011 at 2:39 PM, Robert Haas <robertmhaas@gmail.com> wrote:
...

Profiling reveals that the system spends enormous amounts of CPU time
in s_lock.  LWLOCK_STATS reveals that the only lwlock with significant
amounts of blocking is the BufFreelistLock;

This is curious.  Clearly the entire working set fits in RAM, or you
wouldn't be getting number like this.  But does the entire working set
fit in shared_buffers?  If so, you shouldn't see any traffic on
BufFreelistLock once all the data is read in.  I've only seen
contention here when all data fits in OS cache memory but not in
shared_buffers.

Yeah, that does seem odd:

rhaas=# select pg_size_pretty(pg_database_size(current_database()));
pg_size_pretty
----------------
1501 MB
(1 row)

rhaas=# select pg_size_pretty(pg_table_size('pgbench_accounts'));
pg_size_pretty
----------------
1281 MB
(1 row)

rhaas=# select pg_size_pretty(pg_table_size('pgbench_accounts_pkey'));
pg_size_pretty
----------------
214 MB
(1 row)

rhaas=# show shared_buffers;
shared_buffers
----------------
8GB
(1 row)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#19Jeff Janes
jeff.janes@gmail.com
In reply to: Alvaro Herrera (#16)
Re: pgbench cpu overhead (was Re: lazy vxid locks, v1)

On Mon, Jun 13, 2011 at 9:09 PM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:

I noticed that pgbench's doCustom (the function highest in the profile
posted) returns doing nothing if the connection is supposed to be
"sleeping"; seems an open door for busy waiting.  I didn't check the
rest of the code to see if there's something avoiding that condition.

Yes, there is a "select" in threadRun that avoids that. Also, I don't
think anyone would put in a "sleep" in this particular type of pgbench
run.

 I
also noticed that it seems to be very liberal about calling
INSTR_TIME_SET_CURRENT in the same function which perhaps could be
optimizing by calling it a single time at entry and reusing the value,
but I guess that would show up in the profile as a kernel call so it's
maybe not a problem.

I think that only gets called when you specifically asked for
latencies or for logging, or when making a new connection (which should
be rare).

Cheers,

Jeff

#20Stefan Kaltenbrunner
stefan@kaltenbrunner.cc
In reply to: Jeff Janes (#12)
Re: pgbench cpu overhead (was Re: lazy vxid locks, v1)

On 06/14/2011 02:27 AM, Jeff Janes wrote:

On Mon, Jun 13, 2011 at 7:03 AM, Stefan Kaltenbrunner
<stefan@kaltenbrunner.cc> wrote:
...

so it seems that sysbench actually has significantly less overhead than
pgbench, and the lower throughput at the higher concurrency seems to be
caused by sysbench being able to stress the backend even more than
pgbench can.

Hi Stefan,

pgbench sends each query (per connection) and waits for the reply
before sending another.

Do we know whether sysbench does that, or if it just stuffs the
kernel's IPC buffer full of queries without synchronously waiting for
individual replies?

I can't get sysbench to "make" for me, or I'd strace in single client
mode and see what kind of messages are going back and forth.

yeah sysbench compiled from a release tarball needs some
autoconf/makefile hackery to get running on a modern system - but I can
provide you with the data you are interested in if you tell me exactly
what you are looking for...

Stefan

#21Florian Pflug
fgp@phlo.org
In reply to: Robert Haas (#1)
Re: lazy vxid locks, v1

On Jun12, 2011, at 23:39 , Robert Haas wrote:

So, the majority (60%) of the excess spinning appears to be due to
SInvalReadLock. A good chunk are due to ProcArrayLock (25%).

Hm, sizeof(LWLock) is 24 on X86-64, making sizeof(LWLockPadded) 32.
However, cache lines are 64 bytes on recent Intel CPUs AFAIK,
so I guess that two adjacent LWLocks currently share one cache line.

Currently, the ProcArrayLock has index 4 while SInvalReadLock has
index 5, so if I'm not mistaken exactly the two locks where you saw
the largest contention on are on the same cache line...

Might make sense to try and see if these numbers change if you
either make LWLockPadded 64 bytes or arrange the locks differently...
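
For example, something like the following (a sketch only, with a rough
stand-in for the real LWLock fields) would pad each lock out to a full
64-byte line:

#define CACHE_LINE_SIZE 64

/* Rough stand-in for the ~24 bytes of real LWLock state on x86-64. */
typedef struct LWLockBody
{
    volatile int mutex;
    char         exclusive;
    int          shared;
    void        *head;
    void        *tail;
} LWLockBody;

/*
 * Padding each element of the lock array out to a full cache line keeps
 * adjacent locks (such as ProcArrayLock at index 4 and SInvalReadLock at
 * index 5) from sharing a line and bouncing it between CPUs.
 */
typedef union LWLockPadded64
{
    LWLockBody lock;
    char       pad[CACHE_LINE_SIZE];
} LWLockPadded64;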

best regards,
Florian Pflug

#22Jeff Janes
jeff.janes@gmail.com
In reply to: Stefan Kaltenbrunner (#8)
Re: pgbench cpu overhead (was Re: lazy vxid locks, v1)

On Mon, Jun 13, 2011 at 7:03 AM, Stefan Kaltenbrunner
<stefan@kaltenbrunner.cc> wrote:

On 06/13/2011 01:55 PM, Stefan Kaltenbrunner wrote:

[...]

all those tests are done with pgbench running on the same box - which
has a noticeable impact on the results because pgbench is using ~1 core
per 8 cores of the backend tested in CPU resources - though I don't think
it causes any changes in the results that would show the performance
behaviour in a different light.

actual testing against sysbench with the very same workload shows the
following performance behaviour:

with 40 threads(aka the peak performance point):

pgbench:        223308 tps
sysbench:       311584 tps

with 160 threads (backend contention dominated):

pgbench:        57075
sysbench:       43437

so it seems that sysbench actually has significantly less overhead than
pgbench, and the lower throughput at the higher concurrency seems to be
caused by sysbench being able to stress the backend even more than
pgbench can.

for those curious - the profile for pgbench looks like:

samples  %        symbol name
29378    41.9087  doCustom
17502    24.9672  threadRun
7629     10.8830  pg_strcasecmp
5871      8.3752  compareVariables
2568      3.6633  getVariable
2167      3.0913  putVariable
2065      2.9458  replaceVariable
1971      2.8117  parseVariable
534       0.7618  xstrdup
278       0.3966  xrealloc
137       0.1954  xmalloc

Hi Stefan,

How was this profile generated? I get a similar profile using
--enable-profiling and gprof, but I find it not believable. The
complete absence of any calls to libpq is not credible. I don't know
about your profiler, but with gprof they should be listed in the call
graph even if they take a negligible amount of time. So I think
pgbench is linking to libpq libraries that do not themselves support
profiling (I have no idea how that could happen though). If the calls
graphs are not getting recorded correctly, surely the timing can't be
reliable either.

(I also tried profiling pgbench with "perf", but in that case I get
nothing other than kernel and libc calls showing up. I don't know
what that means)

To support this, I've dummied up doCustom so that it does all the work of
deciding what needs to happen, executing the metacommands,
interpolating the variables into the SQL string, but then simply
refrains from calling the PQ functions to send and receive the query
and response. (I had to make a few changes around the select loop in
threadRun to support this).
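
The change is conceptually along the lines of the standalone sketch
below - not the actual patched pgbench; SKIP_LIBPQ, interpolate(), and
the query text are invented for illustration - in which all of the
client-side string handling still runs, but in dummy mode nothing is
ever handed to libpq or the kernel:

/*
 * Illustrative stand-in, not pgbench: SKIP_LIBPQ, interpolate(), and
 * the query template are invented for this sketch.
 */
#include <stdio.h>
#include <libpq-fe.h>

#define SKIP_LIBPQ 1            /* 1 = "dummy" mode, 0 = real round trips */

/* rough stand-in for pgbench's variable substitution */
static void
interpolate(char *buf, size_t len, int aid)
{
    snprintf(buf, len,
             "SELECT abalance FROM pgbench_accounts WHERE aid = %d;", aid);
}

int
main(void)
{
    PGconn     *conn = NULL;
    char        sql[256];
    long        done = 0;

    if (!SKIP_LIBPQ)
    {
        conn = PQconnectdb("dbname=postgres");
        if (PQstatus(conn) != CONNECTION_OK)
            return 1;
    }

    for (int i = 0; i < 100000; i++)
    {
        interpolate(sql, sizeof(sql), i % 100000 + 1);

        if (!SKIP_LIBPQ)
        {
            PGresult   *res = PQexec(conn, sql);

            PQclear(res);
        }
        done++;                 /* count a "transaction" either way */
    }

    if (conn)
        PQfinish(conn);
    printf("completed %ld transactions\n", done);
    return 0;
}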

The result is that the dummy pgbench is charged with only 57% more CPU
time than the stock one, but it gets over 10 times as many
"transactions" done. I think this supports the notion that the CPU
bottleneck is not in pgbench.c, but somewhere in the libpq or the
kernel.

Cheers,

Jeff

#23Stefan Kaltenbrunner
stefan@kaltenbrunner.cc
In reply to: Jeff Janes (#22)
Re: pgbench cpu overhead (was Re: lazy vxid locks, v1)

On 07/24/2011 03:50 AM, Jeff Janes wrote:

On Mon, Jun 13, 2011 at 7:03 AM, Stefan Kaltenbrunner
<stefan@kaltenbrunner.cc> wrote:

On 06/13/2011 01:55 PM, Stefan Kaltenbrunner wrote:

[...]

all those tests are done with pgbench running on the same box - which
has a noticeable impact on the results because pgbench is using ~1 core
of cpu resources per 8 cores of the backend tested - though I don't think
it causes any changes in the results that would show the performance
behaviour in a different light.

actual testing against sysbench with the very same workload shows the
following performance behaviour:

with 40 threads (aka the peak performance point):

pgbench:        223308 tps
sysbench:       311584 tps

with 160 threads (backend contention dominated):

pgbench:        57075
sysbench:       43437

so it seems that sysbench actually has significantly less overhead than
pgbench, and the lower throughput at the higher concurrency seems to be
caused by sysbench being able to stress the backend even more than
pgbench can.

for those curious - the profile for pgbench looks like:

samples  %        symbol name
29378    41.9087  doCustom
17502    24.9672  threadRun
7629     10.8830  pg_strcasecmp
5871      8.3752  compareVariables
2568      3.6633  getVariable
2167      3.0913  putVariable
2065      2.9458  replaceVariable
1971      2.8117  parseVariable
534       0.7618  xstrdup
278       0.3966  xrealloc
137       0.1954  xmalloc

Hi Stefan,

How was this profile generated? I get a similar profile using
--enable-profiling and gprof, but I find it not believable. The
complete absence of any calls to libpq is not credible. I don't know
about your profiler, but with gprof they should be listed in the call
graph even if they take a negligible amount of time. So I think
pgbench is linking to libpq libraries that do not themselves support
profiling (I have no idea how that could happen though). If the call
graphs are not getting recorded correctly, surely the timing can't be
reliable either.

hmm - the profile was generated using oprofile, but now that you are
mentioning this aspect, I suppose that this was a profile run without
opcontrol --separate=lib...
I'm not currently in a position to retest that - but maybe you could do
a run?

(I also tried profiling pgbench with "perf", but in that case I get
nothing other than kernel and libc calls showing up. I don't know
what that means)

To support this, I've dummied up doCustom so that it does all the work of
deciding what needs to happen, executing the metacommands,
interpolating the variables into the SQL string, but then simply
refrains from calling the PQ functions to send and receive the query
and response. (I had to make a few changes around the select loop in
threadRun to support this).

The result is that the dummy pgbench is charged with only 57% more CPU
time than the stock one, but it gets over 10 times as many
"transactions" done. I think this supports the notion that the CPU
bottleneck is not in pgbench.c, but somewhere in the libpq or the
kernel.

interesting - iirc we actually had some reports about current libpq
behaviour causing scaling issues on some OSes - see
http://archives.postgresql.org/pgsql-hackers/2009-06/msg00748.php and
some related threads. Iirc the final patch for that was never applied
though and the original author lost interest; I think I was able to
measure some noticeable performance gains back then, but I don't
think I still have the numbers anywhere.

Stefan

#24Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jeff Janes (#22)
Re: pgbench cpu overhead (was Re: lazy vxid locks, v1)

Jeff Janes <jeff.janes@gmail.com> writes:

How was this profile generated? I get a similar profile using
--enable-profiling and gprof, but I find it not believable. The
complete absence of any calls to libpq is not credible. I don't know
about your profiler, but with gprof they should be listed in the call
graph even if they take a negligible amount of time. So I think
pgbench is linking to libpq libraries that do not themselves support
profiling (I have no idea how that could happen though). If the call
graphs are not getting recorded correctly, surely the timing can't be
reliable either.

Last I checked, gprof simply does not work for shared libraries on
Linux --- is that what you're testing on? If so, try oprofile or
some other Linux-specific solution.

regards, tom lane

#25Tom Lane
tgl@sss.pgh.pa.us
In reply to: Stefan Kaltenbrunner (#23)
Re: pgbench cpu overhead (was Re: lazy vxid locks, v1)

Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:

interesting - iirc we actually had some reports about current libpq
behaviour causing scaling issues on some OSes - see
http://archives.postgresql.org/pgsql-hackers/2009-06/msg00748.php and
some related threads. Iirc the final patch for that was never applied
though and the original author lost interest; I think I was able to
measure some noticeable performance gains back then, but I don't
think I still have the numbers anywhere.

Huh? That patch did get applied in some form or other -- at least,
libpq does contain references to both SO_NOSIGPIPE and MSG_NOSIGNAL
these days.

regards, tom lane
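
For context, those two mechanisms amount to something like the
illustrative wrapper below (not libpq's actual code): MSG_NOSIGNAL
suppresses SIGPIPE for an individual send() on Linux, while SO_NOSIGPIPE
is a per-socket option on BSD-derived systems, so neither requires
fiddling with signal handlers around every write:

#include <sys/types.h>
#include <sys/socket.h>

/* illustrative wrapper, not libpq code */
ssize_t
send_no_sigpipe(int sock, const void *buf, size_t len)
{
#if defined(MSG_NOSIGNAL)
    /* Linux: suppress SIGPIPE for this send() only */
    return send(sock, buf, len, MSG_NOSIGNAL);
#elif defined(SO_NOSIGPIPE)
    /*
     * BSD/macOS: per-socket option; normally set once right after
     * socket(), shown inline here only for brevity.
     */
    int         on = 1;

    (void) setsockopt(sock, SOL_SOCKET, SO_NOSIGPIPE, &on, sizeof(on));
    return send(sock, buf, len, 0);
#else
    /* fallback: plain send(); SIGPIPE must be handled some other way */
    return send(sock, buf, len, 0);
#endif
}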

#26Stefan Kaltenbrunner
stefan@kaltenbrunner.cc
In reply to: Tom Lane (#25)
Re: pgbench cpu overhead (was Re: lazy vxid locks, v1)

On 07/24/2011 05:55 PM, Tom Lane wrote:

Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:

interesting - iirc we actually had some reports about current libpq
behaviour causing scaling issues on some OSes - see
http://archives.postgresql.org/pgsql-hackers/2009-06/msg00748.php and
some related threads. Iirc the final patch for that was never applied
though and the original author lost interest; I think I was able to
measure some noticeable performance gains back then, but I don't
think I still have the numbers anywhere.

Huh? That patch did get applied in some form or other -- at least,
libpq does contain references to both SO_NOSIGPIPE and MSG_NOSIGNAL
these days.

hmm yeah - you are right, when I looked that up a few hours ago I
failed to find the right commit but it was indeed committed:

http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=cea80e726edd42a39bb0220290738f7825de8e57

I think I mentally mixed that up with the "compare word-at-a-time in
bcTruelen" patch that was also discussed as affecting query rates for
trivial queries.
I actually wonder if -HEAD would show that issue even more clearly now
that we have parts of Robert's performance work in the tree...

Stefan