testing ProcArrayLock patches

Started by Robert Haas about 14 years ago · 32 messages

#1 Robert Haas <robertmhaas@gmail.com>
1 attachment

We have three patches in the hopper that all have the same goal:
reduce ProcArrayLock contention. They are:

[1] Pavan's patch (subsequently revised by Heikki) to put the "hot"
members of the PGPROC structure into a separate array
http://archives.postgresql.org/message-id/4EB7C4C9.9070309@enterprisedb.com

[2] my FlexLocks patch
http://archives.postgresql.org/message-id/CA+Tgmoax_14rbx8Y6mmgvW64gCQL4ZviDzwEObXEMuiV=TwmxQ@mail.gmail.com

[3] my patch to eliminate some snapshot-taking (I think this is also
better semantics, but at any rate it also improves performance)
http://archives.postgresql.org/message-id/CA+TgmoYDe3dx7xuK_rCPLWy7P67hp96ozyGe_K6W87kfx3YCGw@mail.gmail.com

Interestingly, these all try to reduce ProcArrayLock contention in
different ways: [1] does it by making snapshot-taking scan fewer cache
lines, [2] does it by reducing contention for the spinlock protecting
ProcArrayLock, and [3] does it by taking fewer snapshots. So you
might think that the effects of these patches would add, at least to
some degree.

Now the first two patches are the ones that seem to show the most
performance improvement, so I tested both patches individually and
also a combination of the two patches (the combined patch for this is
attached, as there were numerous conflicts). I tested them on two
different machines with completely different architectures: Nate
Boley's AMD 6128 box (which has 32 cores) and an HP Integrity server
(also with 32 cores). On Integrity, I compiled using the aCC
compiler, adjusted the resulting binary with chatr +pi L +pd L, and
ran both pgbench and the server with rtsched -s SCHED_NOAGE -p 178,
which are settings that seem to be necessary for good performance on
that platform. pgbench was run locally on the AMD box but from
another server over a high-speed network interconnect on the Integrity
server. Both servers were configured with shared_buffers=8GB,
checkpoint_segments=300, wal_writer_delay=20ms, and
synchronous_commit=off. Some of the other settings were different; on
the Integrity server, I had effective_cache_size=340GB,
checkpoint_timeout=30min, and wal_buffers=16MB, while on the AMD box I
had checkpoint_completion_target=0.9 and maintenance_work_mem=1GB. I
doubt that these settings differences were material (except that they
probably made reinitializing the database between tests take longer on
the Integrity system, since I forgot to set maintenance_work_mem), but
I could double-check that if anyone is concerned about it.

The results are below. In a nutshell, either patch by itself is very,
very good; and both patches together are somewhat better. Which one
helps more individually is somewhat variable. Lines marked "m" are
unpatched master as of commit
ff4fd4bf53c5512427f8ecea08d6ca7777efa2c5. "p" is Pavan's PGPROC patch
(maybe I should have said ppp...) as revised by Heikki; "f" is the
latest version of my FlexLocks patch, and "b" is the combination patch
attached herewith. The number immediately following is the number of
clients used, each with its own pgbench thread (i.e. -c N -j N). As
usual, each number is the median of three five-minute runs at scale
factor 100.
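The per-cell aggregation described above can be sketched as follows; the
median-of-three rule and the run length come from the text, while the three
run values below are invented purely for illustration:

```python
import statistics

# Each reported tps number is the median of three five-minute pgbench
# runs at scale factor 100 (run values here are hypothetical).
runs = [11720.1, 11850.3, 11695.7]
median_tps = statistics.median(runs)
print(median_tps)  # -> 11720.1
```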

Since Pavan's patch has the advantage of being quite simple, I'm
thinking we should push that one through to completion first, and then
test all the other possible improvements in this area relative to that
new baseline.

== AMD Opteron 6128, 32 cores, Permanent Tables ==

m01 tps = 631.208073 (including connections establishing)
p01 tps = 631.182923 (including connections establishing)
f01 tps = 636.308562 (including connections establishing)
b01 tps = 629.295507 (including connections establishing)
m08 tps = 4516.479854 (including connections establishing)
p08 tps = 4614.772650 (including connections establishing)
f08 tps = 4652.454768 (including connections establishing)
b08 tps = 4679.363474 (including connections establishing)
m16 tps = 7788.615240 (including connections establishing)
p16 tps = 7824.025406 (including connections establishing)
f16 tps = 7841.876146 (including connections establishing)
b16 tps = 7859.334650 (including connections establishing)
m24 tps = 11720.145052 (including connections establishing)
p24 tps = 12782.696214 (including connections establishing)
f24 tps = 12559.765555 (including connections establishing)
b24 tps = 12891.945766 (including connections establishing)
m32 tps = 10223.015618 (including connections establishing)
p32 tps = 11585.902050 (including connections establishing)
f32 tps = 11626.542744 (including connections establishing)
b32 tps = 11866.969986 (including connections establishing)
m80 tps = 7540.482189 (including connections establishing)
p80 tps = 11598.446238 (including connections establishing)
f80 tps = 11529.752081 (including connections establishing)
b80 tps = 11714.364294 (including connections establishing)

== AMD Opteron 6128, 32 cores, Unlogged Tables ==

m01 tps = 680.398630 (including connections establishing)
p01 tps = 673.293390 (including connections establishing)
f01 tps = 679.993953 (including connections establishing)
b01 tps = 679.377600 (including connections establishing)
m08 tps = 4760.964292 (including connections establishing)
p08 tps = 4870.037842 (including connections establishing)
f08 tps = 5028.719509 (including connections establishing)
b08 tps = 4893.439824 (including connections establishing)
m16 tps = 7997.051705 (including connections establishing)
p16 tps = 8218.884377 (including connections establishing)
f16 tps = 8160.373682 (including connections establishing)
b16 tps = 8144.707958 (including connections establishing)
m24 tps = 13066.867858 (including connections establishing)
p24 tps = 14523.109116 (including connections establishing)
f24 tps = 14098.978673 (including connections establishing)
b24 tps = 14526.330294 (including connections establishing)
m32 tps = 10800.711985 (including connections establishing)
p32 tps = 19159.131614 (including connections establishing)
f32 tps = 22224.839905 (including connections establishing)
b32 tps = 23373.672552 (including connections establishing)
m80 tps = 7885.663468 (including connections establishing)
p80 tps = 17760.149440 (including connections establishing)
f80 tps = 19960.356205 (including connections establishing)
b80 tps = 18665.581069 (including connections establishing)

== HP Integrity, 32 cores, Permanent Tables ==

m01 tps = 883.732295 (including connections establishing)
p01 tps = 866.449154 (including connections establishing)
f01 tps = 924.364403 (including connections establishing)
b01 tps = 926.797302 (including connections establishing)
m08 tps = 6098.047731 (including connections establishing)
p08 tps = 6293.537855 (including connections establishing)
f08 tps = 6059.635731 (including connections establishing)
b08 tps = 6250.132288 (including connections establishing)
m16 tps = 9995.755003 (including connections establishing)
p16 tps = 10654.562946 (including connections establishing)
f16 tps = 10258.008496 (including connections establishing)
b16 tps = 10712.776806 (including connections establishing)
m24 tps = 11646.915026 (including connections establishing)
p24 tps = 13483.345338 (including connections establishing)
f24 tps = 12815.456128 (including connections establishing)
b24 tps = 13506.218109 (including connections establishing)
m32 tps = 10433.315312 (including connections establishing)
p32 tps = 14111.719739 (including connections establishing)
f32 tps = 13990.284158 (including connections establishing)
b32 tps = 14697.189751 (including connections establishing)
m80 tps = 8177.428209 (including connections establishing)
p80 tps = 11343.667289 (including connections establishing)
f80 tps = 11651.244256 (including connections establishing)
b80 tps = 12523.308466 (including connections establishing)

== HP Integrity, 32 cores, Unlogged Tables ==

m01 tps = 949.594327 (including connections establishing)
p01 tps = 958.753925 (including connections establishing)
f01 tps = 931.276655 (including connections establishing)
b01 tps = 943.836646 (including connections establishing)
m08 tps = 6211.621726 (including connections establishing)
p08 tps = 6412.267441 (including connections establishing)
f08 tps = 5843.870591 (including connections establishing)
b08 tps = 6428.415940 (including connections establishing)
m16 tps = 10341.538889 (including connections establishing)
p16 tps = 11161.425798 (including connections establishing)
f16 tps = 10545.954472 (including connections establishing)
b16 tps = 11235.441290 (including connections establishing)
m24 tps = 11859.831632 (including connections establishing)
p24 tps = 14380.766878 (including connections establishing)
f24 tps = 13489.351324 (including connections establishing)
b24 tps = 14579.649665 (including connections establishing)
m32 tps = 10716.208372 (including connections establishing)
p32 tps = 15497.819188 (including connections establishing)
f32 tps = 14590.406972 (including connections establishing)
b32 tps = 15991.920288 (including connections establishing)
m80 tps = 8465.159253 (including connections establishing)
p80 tps = 11945.494890 (including connections establishing)
f80 tps = 14676.324769 (including connections establishing)
b80 tps = 15623.109737 (including connections establishing)
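To put one slice of these results in perspective, here is a quick sketch of
the arithmetic for the 32-client, permanent-table case on the AMD box,
using the tps figures quoted above (the labels are just the row markers
from the tables):

```python
# Relative gain over unpatched master ("m32") at 32 clients, AMD box,
# permanent tables; tps figures are taken from the results above.
master = 10223.015618
patched = {
    "pgproc (p32)": 11585.902050,
    "flexlocks (f32)": 11626.542744,
    "combined (b32)": 11866.969986,
}
gains = {name: (tps / master - 1) * 100 for name, tps in patched.items()}
for name, pct in gains.items():
    print(f"{name}: +{pct:.1f}%")
# prints roughly +13.3%, +13.7%, and +16.1% respectively
```

This matches the summary above: either patch alone is a large win at this
client count, and the combination is somewhat better still.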

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

flexlock-minimal-pgproc-heikki.patch (application/octet-stream)
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 8dc3054..51b24d0 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -105,7 +105,7 @@ typedef struct pgssEntry
  */
 typedef struct pgssSharedState
 {
-	LWLockId	lock;			/* protects hashtable search/modification */
+	FlexLockId	lock;			/* protects hashtable search/modification */
 	int			query_size;		/* max query length in bytes */
 } pgssSharedState;
 
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index d1e628f..8517b36 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -6199,14 +6199,14 @@ LOG:  CleanUpLock: deleting: lock(0xb7acd844) id(24688,24696,0,0,0,1)
      </varlistentry>
 
      <varlistentry>
-      <term><varname>trace_lwlocks</varname> (<type>boolean</type>)</term>
+      <term><varname>trace_flexlocks</varname> (<type>boolean</type>)</term>
       <indexterm>
-       <primary><varname>trace_lwlocks</> configuration parameter</primary>
+       <primary><varname>trace_flexlocks</> configuration parameter</primary>
       </indexterm>
       <listitem>
        <para>
-        If on, emit information about lightweight lock usage.  Lightweight
-        locks are intended primarily to provide mutual exclusion of access
+        If on, emit information about FlexLock usage.  FlexLocks
+        are intended primarily to provide mutual exclusion of access
         to shared-memory data structures.
        </para>
        <para>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index b9dc1d2..98ed0d3 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1724,49 +1724,49 @@ SELECT pg_stat_get_backend_pid(s.backendid) AS procpid,
       or kilobytes of memory used for an internal sort.</entry>
     </row>
     <row>
-     <entry>lwlock-acquire</entry>
-     <entry>(LWLockId, LWLockMode)</entry>
-     <entry>Probe that fires when an LWLock has been acquired.
-      arg0 is the LWLock's ID.
-      arg1 is the requested lock mode, either exclusive or shared.</entry>
+     <entry>flexlock-acquire</entry>
+     <entry>(FlexLockId, FlexLockMode)</entry>
+     <entry>Probe that fires when an FlexLock has been acquired.
+      arg0 is the FlexLock's ID.
+      arg1 is the requested lock mode.</entry>
     </row>
     <row>
-     <entry>lwlock-release</entry>
-     <entry>(LWLockId)</entry>
-     <entry>Probe that fires when an LWLock has been released (but note
+     <entry>flexlock-release</entry>
+     <entry>(FlexLockId)</entry>
+     <entry>Probe that fires when a FlexLock has been released (but note
       that any released waiters have not yet been awakened).
-      arg0 is the LWLock's ID.</entry>
+      arg0 is the FlexLock's ID.</entry>
     </row>
     <row>
-     <entry>lwlock-wait-start</entry>
-     <entry>(LWLockId, LWLockMode)</entry>
-     <entry>Probe that fires when an LWLock was not immediately available and
+     <entry>flexlock-wait-start</entry>
+     <entry>(FlexLockId, FlexLockMode)</entry>
+     <entry>Probe that fires when an FlexLock was not immediately available and
       a server process has begun to wait for the lock to become available.
-      arg0 is the LWLock's ID.
+      arg0 is the FlexLock's ID.
       arg1 is the requested lock mode, either exclusive or shared.</entry>
     </row>
     <row>
-     <entry>lwlock-wait-done</entry>
-     <entry>(LWLockId, LWLockMode)</entry>
+     <entry>flexlock-wait-done</entry>
+     <entry>(FlexLockId, FlexLockMode)</entry>
      <entry>Probe that fires when a server process has been released from its
-      wait for an LWLock (it does not actually have the lock yet).
-      arg0 is the LWLock's ID.
+      wait for an FlexLock (it does not actually have the lock yet).
+      arg0 is the FlexLock's ID.
       arg1 is the requested lock mode, either exclusive or shared.</entry>
     </row>
     <row>
-     <entry>lwlock-condacquire</entry>
-     <entry>(LWLockId, LWLockMode)</entry>
-     <entry>Probe that fires when an LWLock was successfully acquired when the
-      caller specified no waiting.
-      arg0 is the LWLock's ID.
+     <entry>flexlock-condacquire</entry>
+     <entry>(FlexLockId, FlexLockMode)</entry>
+     <entry>Probe that fires when an FlexLock was successfully acquired when
+      the caller specified no waiting.
+      arg0 is the FlexLock's ID.
       arg1 is the requested lock mode, either exclusive or shared.</entry>
     </row>
     <row>
-     <entry>lwlock-condacquire-fail</entry>
-     <entry>(LWLockId, LWLockMode)</entry>
-     <entry>Probe that fires when an LWLock was not successfully acquired when
-      the caller specified no waiting.
-      arg0 is the LWLock's ID.
+     <entry>flexlock-condacquire-fail</entry>
+     <entry>(FlexLockId, FlexLockMode)</entry>
+     <entry>Probe that fires when an FlexLock was not successfully acquired
+      when the caller specified no waiting.
+      arg0 is the FlexLock's ID.
       arg1 is the requested lock mode, either exclusive or shared.</entry>
     </row>
     <row>
@@ -1813,11 +1813,11 @@ SELECT pg_stat_get_backend_pid(s.backendid) AS procpid,
      <entry>unsigned int</entry>
     </row>
     <row>
-     <entry>LWLockId</entry>
+     <entry>FlexLockId</entry>
      <entry>int</entry>
     </row>
     <row>
-     <entry>LWLockMode</entry>
+     <entry>FlexLockMode</entry>
      <entry>int</entry>
     </row>
     <row>
diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index f7caa34..09d5862 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -151,7 +151,7 @@ SimpleLruShmemSize(int nslots, int nlsns)
 	sz += MAXALIGN(nslots * sizeof(bool));		/* page_dirty[] */
 	sz += MAXALIGN(nslots * sizeof(int));		/* page_number[] */
 	sz += MAXALIGN(nslots * sizeof(int));		/* page_lru_count[] */
-	sz += MAXALIGN(nslots * sizeof(LWLockId));	/* buffer_locks[] */
+	sz += MAXALIGN(nslots * sizeof(FlexLockId));		/* buffer_locks[] */
 
 	if (nlsns > 0)
 		sz += MAXALIGN(nslots * nlsns * sizeof(XLogRecPtr));	/* group_lsn[] */
@@ -161,7 +161,7 @@ SimpleLruShmemSize(int nslots, int nlsns)
 
 void
 SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
-			  LWLockId ctllock, const char *subdir)
+			  FlexLockId ctllock, const char *subdir)
 {
 	SlruShared	shared;
 	bool		found;
@@ -202,8 +202,8 @@ SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
 		offset += MAXALIGN(nslots * sizeof(int));
 		shared->page_lru_count = (int *) (ptr + offset);
 		offset += MAXALIGN(nslots * sizeof(int));
-		shared->buffer_locks = (LWLockId *) (ptr + offset);
-		offset += MAXALIGN(nslots * sizeof(LWLockId));
+		shared->buffer_locks = (FlexLockId *) (ptr + offset);
+		offset += MAXALIGN(nslots * sizeof(FlexLockId));
 
 		if (nlsns > 0)
 		{
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 477982d..0805f9c 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -113,7 +113,8 @@ int			max_prepared_xacts = 0;
 
 typedef struct GlobalTransactionData
 {
-	PGPROC		proc;			/* dummy proc */
+	GlobalTransaction next;
+	int			pgprocno;		/* dummy proc */
 	BackendId	dummyBackendId; /* similar to backend id for backends */
 	TimestampTz prepared_at;	/* time of preparation */
 	XLogRecPtr	prepare_lsn;	/* XLOG offset of prepare record */
@@ -207,7 +208,8 @@ TwoPhaseShmemInit(void)
 					  sizeof(GlobalTransaction) * max_prepared_xacts));
 		for (i = 0; i < max_prepared_xacts; i++)
 		{
-			gxacts[i].proc.links.next = (SHM_QUEUE *) TwoPhaseState->freeGXacts;
+			gxacts[i].pgprocno = PreparedXactProcs[i].pgprocno;
+			gxacts[i].next = TwoPhaseState->freeGXacts;
 			TwoPhaseState->freeGXacts = &gxacts[i];
 
 			/*
@@ -243,6 +245,8 @@ MarkAsPreparing(TransactionId xid, const char *gid,
 				TimestampTz prepared_at, Oid owner, Oid databaseid)
 {
 	GlobalTransaction gxact;
+	PGPROC	   *proc;
+	PGPROC_MINIMAL *proc_minimal;
 	int			i;
 
 	if (strlen(gid) >= GIDSIZE)
@@ -274,7 +278,7 @@ MarkAsPreparing(TransactionId xid, const char *gid,
 			TwoPhaseState->numPrepXacts--;
 			TwoPhaseState->prepXacts[i] = TwoPhaseState->prepXacts[TwoPhaseState->numPrepXacts];
 			/* and put it back in the freelist */
-			gxact->proc.links.next = (SHM_QUEUE *) TwoPhaseState->freeGXacts;
+			gxact->next = TwoPhaseState->freeGXacts;
 			TwoPhaseState->freeGXacts = gxact;
 			/* Back up index count too, so we don't miss scanning one */
 			i--;
@@ -302,32 +306,36 @@ MarkAsPreparing(TransactionId xid, const char *gid,
 				 errhint("Increase max_prepared_transactions (currently %d).",
 						 max_prepared_xacts)));
 	gxact = TwoPhaseState->freeGXacts;
-	TwoPhaseState->freeGXacts = (GlobalTransaction) gxact->proc.links.next;
+	TwoPhaseState->freeGXacts = (GlobalTransaction) gxact->next;
 
-	/* Initialize it */
-	MemSet(&gxact->proc, 0, sizeof(PGPROC));
-	SHMQueueElemInit(&(gxact->proc.links));
-	gxact->proc.waitStatus = STATUS_OK;
+	proc = &ProcGlobal->allProcs[gxact->pgprocno];
+	proc_minimal = &ProcGlobal->allProcs_Minimal[gxact->pgprocno];
+
+	/* Initialize the PGPROC entry */
+	MemSet(proc, 0, sizeof(PGPROC));
+	proc->pgprocno = gxact->pgprocno;
+	SHMQueueElemInit(&(proc->links));
+	proc->waitStatus = STATUS_OK;
 	/* We set up the gxact's VXID as InvalidBackendId/XID */
-	gxact->proc.lxid = (LocalTransactionId) xid;
-	gxact->proc.xid = xid;
-	gxact->proc.xmin = InvalidTransactionId;
-	gxact->proc.pid = 0;
-	gxact->proc.backendId = InvalidBackendId;
-	gxact->proc.databaseId = databaseid;
-	gxact->proc.roleId = owner;
-	gxact->proc.inCommit = false;
-	gxact->proc.vacuumFlags = 0;
-	gxact->proc.lwWaiting = false;
-	gxact->proc.lwExclusive = false;
-	gxact->proc.lwWaitLink = NULL;
-	gxact->proc.waitLock = NULL;
-	gxact->proc.waitProcLock = NULL;
+	proc->lxid = (LocalTransactionId) xid;
+	proc_minimal->xid = xid;
+	proc_minimal->xmin = InvalidTransactionId;
+	proc_minimal->inCommit = false;
+	proc_minimal->vacuumFlags = 0;
+	proc->pid = 0;
+	proc->backendId = InvalidBackendId;
+	proc->databaseId = databaseid;
+	proc->roleId = owner;
+	proc->flWaitResult = false;
+	proc->flWaitMode = false;
+	proc->flWaitLink = NULL;
+	proc->waitLock = NULL;
+	proc->waitProcLock = NULL;
 	for (i = 0; i < NUM_LOCK_PARTITIONS; i++)
-		SHMQueueInit(&(gxact->proc.myProcLocks[i]));
+		SHMQueueInit(&(proc->myProcLocks[i]));
 	/* subxid data must be filled later by GXactLoadSubxactData */
-	gxact->proc.subxids.overflowed = false;
-	gxact->proc.subxids.nxids = 0;
+	proc_minimal->overflowed = false;
+	proc_minimal->nxids = 0;
 
 	gxact->prepared_at = prepared_at;
 	/* initialize LSN to 0 (start of WAL) */
@@ -358,17 +366,19 @@ static void
 GXactLoadSubxactData(GlobalTransaction gxact, int nsubxacts,
 					 TransactionId *children)
 {
+	PGPROC *proc = &ProcGlobal->allProcs[gxact->pgprocno];
+	PGPROC_MINIMAL *proc_minimal = &ProcGlobal->allProcs_Minimal[gxact->pgprocno];
 	/* We need no extra lock since the GXACT isn't valid yet */
 	if (nsubxacts > PGPROC_MAX_CACHED_SUBXIDS)
 	{
-		gxact->proc.subxids.overflowed = true;
+		proc_minimal->overflowed = true;
 		nsubxacts = PGPROC_MAX_CACHED_SUBXIDS;
 	}
 	if (nsubxacts > 0)
 	{
-		memcpy(gxact->proc.subxids.xids, children,
+		memcpy(proc->subxids.xids, children,
 			   nsubxacts * sizeof(TransactionId));
-		gxact->proc.subxids.nxids = nsubxacts;
+		proc_minimal->nxids = nsubxacts;
 	}
 }
 
@@ -389,7 +399,7 @@ MarkAsPrepared(GlobalTransaction gxact)
 	 * Put it into the global ProcArray so TransactionIdIsInProgress considers
 	 * the XID as still running.
 	 */
-	ProcArrayAdd(&gxact->proc);
+	ProcArrayAdd(&ProcGlobal->allProcs[gxact->pgprocno]);
 }
 
 /*
@@ -406,6 +416,7 @@ LockGXact(const char *gid, Oid user)
 	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
 	{
 		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+		PGPROC *proc = &ProcGlobal->allProcs[gxact->pgprocno];
 
 		/* Ignore not-yet-valid GIDs */
 		if (!gxact->valid)
@@ -436,7 +447,7 @@ LockGXact(const char *gid, Oid user)
 		 * there may be some other issues as well.	Hence disallow until
 		 * someone gets motivated to make it work.
 		 */
-		if (MyDatabaseId != gxact->proc.databaseId)
+		if (MyDatabaseId != proc->databaseId)
 			ereport(ERROR,
 					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				  errmsg("prepared transaction belongs to another database"),
@@ -483,7 +494,7 @@ RemoveGXact(GlobalTransaction gxact)
 			TwoPhaseState->prepXacts[i] = TwoPhaseState->prepXacts[TwoPhaseState->numPrepXacts];
 
 			/* and put it back in the freelist */
-			gxact->proc.links.next = (SHM_QUEUE *) TwoPhaseState->freeGXacts;
+			gxact->next = TwoPhaseState->freeGXacts;
 			TwoPhaseState->freeGXacts = gxact;
 
 			LWLockRelease(TwoPhaseStateLock);
@@ -518,8 +529,9 @@ TransactionIdIsPrepared(TransactionId xid)
 	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
 	{
 		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+		PGPROC_MINIMAL *proc_minimal = &ProcGlobal->allProcs_Minimal[gxact->pgprocno];
 
-		if (gxact->valid && gxact->proc.xid == xid)
+		if (gxact->valid && proc_minimal->xid == xid)
 		{
 			result = true;
 			break;
@@ -642,6 +654,8 @@ pg_prepared_xact(PG_FUNCTION_ARGS)
 	while (status->array != NULL && status->currIdx < status->ngxacts)
 	{
 		GlobalTransaction gxact = &status->array[status->currIdx++];
+		PGPROC *proc = &ProcGlobal->allProcs[gxact->pgprocno];
+		PGPROC_MINIMAL *proc_minimal = &ProcGlobal->allProcs_Minimal[gxact->pgprocno];
 		Datum		values[5];
 		bool		nulls[5];
 		HeapTuple	tuple;
@@ -656,11 +670,11 @@ pg_prepared_xact(PG_FUNCTION_ARGS)
 		MemSet(values, 0, sizeof(values));
 		MemSet(nulls, 0, sizeof(nulls));
 
-		values[0] = TransactionIdGetDatum(gxact->proc.xid);
+		values[0] = TransactionIdGetDatum(proc_minimal->xid);
 		values[1] = CStringGetTextDatum(gxact->gid);
 		values[2] = TimestampTzGetDatum(gxact->prepared_at);
 		values[3] = ObjectIdGetDatum(gxact->owner);
-		values[4] = ObjectIdGetDatum(gxact->proc.databaseId);
+		values[4] = ObjectIdGetDatum(proc->databaseId);
 
 		tuple = heap_form_tuple(funcctx->tuple_desc, values, nulls);
 		result = HeapTupleGetDatum(tuple);
@@ -711,10 +725,11 @@ TwoPhaseGetDummyProc(TransactionId xid)
 	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
 	{
 		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+		PGPROC_MINIMAL *proc_minimal = &ProcGlobal->allProcs_Minimal[gxact->pgprocno];
 
-		if (gxact->proc.xid == xid)
+		if (proc_minimal->xid == xid)
 		{
-			result = &gxact->proc;
+			result = &ProcGlobal->allProcs[gxact->pgprocno];
 			break;
 		}
 	}
@@ -841,7 +856,9 @@ save_state_data(const void *data, uint32 len)
 void
 StartPrepare(GlobalTransaction gxact)
 {
-	TransactionId xid = gxact->proc.xid;
+	PGPROC *proc = &ProcGlobal->allProcs[gxact->pgprocno];
+	PGPROC_MINIMAL *proc_minimal = &ProcGlobal->allProcs_Minimal[gxact->pgprocno];
+	TransactionId xid = proc_minimal->xid;
 	TwoPhaseFileHeader hdr;
 	TransactionId *children;
 	RelFileNode *commitrels;
@@ -865,7 +882,7 @@ StartPrepare(GlobalTransaction gxact)
 	hdr.magic = TWOPHASE_MAGIC;
 	hdr.total_len = 0;			/* EndPrepare will fill this in */
 	hdr.xid = xid;
-	hdr.database = gxact->proc.databaseId;
+	hdr.database = proc->databaseId;
 	hdr.prepared_at = gxact->prepared_at;
 	hdr.owner = gxact->owner;
 	hdr.nsubxacts = xactGetCommittedChildren(&children);
@@ -913,7 +930,8 @@ StartPrepare(GlobalTransaction gxact)
 void
 EndPrepare(GlobalTransaction gxact)
 {
-	TransactionId xid = gxact->proc.xid;
+	PGPROC_MINIMAL *proc_minimal = &ProcGlobal->allProcs_Minimal[gxact->pgprocno];
+	TransactionId xid = proc_minimal->xid;
 	TwoPhaseFileHeader *hdr;
 	char		path[MAXPGPATH];
 	XLogRecData *record;
@@ -1021,7 +1039,7 @@ EndPrepare(GlobalTransaction gxact)
 	 */
 	START_CRIT_SECTION();
 
-	MyProc->inCommit = true;
+	MyProcMinimal->inCommit = true;
 
 	gxact->prepare_lsn = XLogInsert(RM_XACT_ID, XLOG_XACT_PREPARE,
 									records.head);
@@ -1069,7 +1087,7 @@ EndPrepare(GlobalTransaction gxact)
 	 * checkpoint starting after this will certainly see the gxact as a
 	 * candidate for fsyncing.
 	 */
-	MyProc->inCommit = false;
+	MyProcMinimal->inCommit = false;
 
 	END_CRIT_SECTION();
 
@@ -1242,6 +1260,8 @@ void
 FinishPreparedTransaction(const char *gid, bool isCommit)
 {
 	GlobalTransaction gxact;
+	PGPROC	   *proc;
+	PGPROC_MINIMAL *proc_minimal;
 	TransactionId xid;
 	char	   *buf;
 	char	   *bufptr;
@@ -1260,7 +1280,9 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
 	 * try to commit the same GID at once.
 	 */
 	gxact = LockGXact(gid, GetUserId());
-	xid = gxact->proc.xid;
+	proc = &ProcGlobal->allProcs[gxact->pgprocno];
+	proc_minimal = &ProcGlobal->allProcs_Minimal[gxact->pgprocno];
+	xid = proc_minimal->xid;
 
 	/*
 	 * Read and validate the state file
@@ -1309,7 +1331,7 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
 									   hdr->nsubxacts, children,
 									   hdr->nabortrels, abortrels);
 
-	ProcArrayRemove(&gxact->proc, latestXid);
+	ProcArrayRemove(proc, latestXid);
 
 	/*
 	 * In case we fail while running the callbacks, mark the gxact invalid so
@@ -1540,10 +1562,11 @@ CheckPointTwoPhase(XLogRecPtr redo_horizon)
 	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
 	{
 		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+		PGPROC_MINIMAL *proc_minimal = &ProcGlobal->allProcs_Minimal[gxact->pgprocno];
 
 		if (gxact->valid &&
 			XLByteLE(gxact->prepare_lsn, redo_horizon))
-			xids[nxids++] = gxact->proc.xid;
+			xids[nxids++] = proc_minimal->xid;
 	}
 
 	LWLockRelease(TwoPhaseStateLock);
@@ -1972,7 +1995,7 @@ RecordTransactionCommitPrepared(TransactionId xid,
 	START_CRIT_SECTION();
 
 	/* See notes in RecordTransactionCommit */
-	MyProc->inCommit = true;
+	MyProcMinimal->inCommit = true;
 
 	/* Emit the XLOG commit record */
 	xlrec.xid = xid;
@@ -2037,7 +2060,7 @@ RecordTransactionCommitPrepared(TransactionId xid,
 	TransactionIdCommitTree(xid, nchildren, children);
 
 	/* Checkpoint can proceed now */
-	MyProc->inCommit = false;
+	MyProcMinimal->inCommit = false;
 
 	END_CRIT_SECTION();
 
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index 61dcfed..7c986aa 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -54,7 +54,7 @@ GetNewTransactionId(bool isSubXact)
 	if (IsBootstrapProcessingMode())
 	{
 		Assert(!isSubXact);
-		MyProc->xid = BootstrapTransactionId;
+		MyProcMinimal->xid = BootstrapTransactionId;
 		return BootstrapTransactionId;
 	}
 
@@ -208,20 +208,21 @@ GetNewTransactionId(bool isSubXact)
 		 * TransactionId and int fetch/store are atomic.
 		 */
 		volatile PGPROC *myproc = MyProc;
+		volatile PGPROC_MINIMAL *myprocminimal = MyProcMinimal;
 
 		if (!isSubXact)
-			myproc->xid = xid;
+			myprocminimal->xid = xid;
 		else
 		{
-			int			nxids = myproc->subxids.nxids;
+			int			nxids = myprocminimal->nxids;
 
 			if (nxids < PGPROC_MAX_CACHED_SUBXIDS)
 			{
 				myproc->subxids.xids[nxids] = xid;
-				myproc->subxids.nxids = nxids + 1;
+				myprocminimal->nxids = nxids + 1;
 			}
 			else
-				myproc->subxids.overflowed = true;
+				myprocminimal->overflowed = true;
 		}
 	}
 
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index c151d3b..21eb404 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -981,7 +981,7 @@ RecordTransactionCommit(void)
 		 * bit fuzzy, but it doesn't matter.
 		 */
 		START_CRIT_SECTION();
-		MyProc->inCommit = true;
+		MyProcMinimal->inCommit = true;
 
 		SetCurrentTransactionStopTimestamp();
 
@@ -1155,7 +1155,7 @@ RecordTransactionCommit(void)
 	 */
 	if (markXidCommitted)
 	{
-		MyProc->inCommit = false;
+		MyProcMinimal->inCommit = false;
 		END_CRIT_SECTION();
 	}
 
@@ -2248,7 +2248,7 @@ AbortTransaction(void)
 	 * Releasing LW locks is critical since we might try to grab them again
 	 * while cleaning up!
 	 */
-	LWLockReleaseAll();
+	FlexLockReleaseAll();
 
 	/* Clean up buffer I/O and buffer context locks, too */
 	AbortBufferIO();
@@ -4138,7 +4138,7 @@ AbortSubTransaction(void)
 	 * FIXME This may be incorrect --- Are there some locks we should keep?
 	 * Buffer locks, for example?  I don't think so but I'm not sure.
 	 */
-	LWLockReleaseAll();
+	FlexLockReleaseAll();
 
 	AbortBufferIO();
 	UnlockBuffers();
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 6bf2421..9ceee91 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -562,13 +562,13 @@ bootstrap_signals(void)
  * Begin shutdown of an auxiliary process.	This is approximately the equivalent
  * of ShutdownPostgres() in postinit.c.  We can't run transactions in an
  * auxiliary process, so most of the work of AbortTransaction() is not needed,
- * but we do need to make sure we've released any LWLocks we are holding.
+ * but we do need to make sure we've released any flex locks we are holding.
  * (This is only critical during an error exit.)
  */
 static void
 ShutdownAuxiliaryProcess(int code, Datum arg)
 {
-	LWLockReleaseAll();
+	FlexLockReleaseAll();
 }
 
 /* ----------------------------------------------------------------
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 32985a4..23556fa 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -40,6 +40,7 @@
 #include "storage/lmgr.h"
 #include "storage/proc.h"
 #include "storage/procarray.h"
+#include "storage/procarraylock.h"
 #include "utils/acl.h"
 #include "utils/attoptcache.h"
 #include "utils/datum.h"
@@ -222,9 +223,9 @@ analyze_rel(Oid relid, VacuumStmt *vacstmt, BufferAccessStrategy bstrategy)
 	/*
 	 * OK, let's do it.  First let other backends know I'm in ANALYZE.
 	 */
-	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-	MyProc->vacuumFlags |= PROC_IN_ANALYZE;
-	LWLockRelease(ProcArrayLock);
+	ProcArrayLockAcquire(PAL_EXCLUSIVE);
+	MyProcMinimal->vacuumFlags |= PROC_IN_ANALYZE;
+	ProcArrayLockRelease();
 
 	/*
 	 * Do the normal non-recursive ANALYZE.
@@ -249,9 +250,9 @@ analyze_rel(Oid relid, VacuumStmt *vacstmt, BufferAccessStrategy bstrategy)
 	 * Reset my PGPROC flag.  Note: we need this here, and not in vacuum_rel,
 	 * because the vacuum flag is cleared by the end-of-xact code.
 	 */
-	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-	MyProc->vacuumFlags &= ~PROC_IN_ANALYZE;
-	LWLockRelease(ProcArrayLock);
+	ProcArrayLockAcquire(PAL_EXCLUSIVE);
+	MyProcMinimal->vacuumFlags &= ~PROC_IN_ANALYZE;
+	ProcArrayLockRelease();
 }
 
 /*
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index f42504c..480bf82 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -39,6 +39,7 @@
 #include "storage/lmgr.h"
 #include "storage/proc.h"
 #include "storage/procarray.h"
+#include "storage/procarraylock.h"
 #include "utils/acl.h"
 #include "utils/fmgroids.h"
 #include "utils/guc.h"
@@ -892,11 +893,11 @@ vacuum_rel(Oid relid, VacuumStmt *vacstmt, bool do_toast, bool for_wraparound)
 		 * MyProc->xid/xmin, else OldestXmin might appear to go backwards,
 		 * which is probably Not Good.
 		 */
-		LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-		MyProc->vacuumFlags |= PROC_IN_VACUUM;
+		ProcArrayLockAcquire(PAL_EXCLUSIVE);
+		MyProcMinimal->vacuumFlags |= PROC_IN_VACUUM;
 		if (for_wraparound)
-			MyProc->vacuumFlags |= PROC_VACUUM_FOR_WRAPAROUND;
-		LWLockRelease(ProcArrayLock);
+			MyProcMinimal->vacuumFlags |= PROC_VACUUM_FOR_WRAPAROUND;
+		ProcArrayLockRelease();
 	}
 
 	/*
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index cacedab..f33f573 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -176,9 +176,10 @@ BackgroundWriterMain(void)
 		/*
 		 * These operations are really just a minimal subset of
 		 * AbortTransaction().	We don't have very many resources to worry
-		 * about in bgwriter, but we do have LWLocks, buffers, and temp files.
+		 * about in bgwriter, but we do have flex locks, buffers, and temp
+		 * files.
 		 */
-		LWLockReleaseAll();
+		FlexLockReleaseAll();
 		AbortBufferIO();
 		UnlockBuffers();
 		/* buffer pins are released here: */
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index e9ae1e8..49f07a7 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -281,9 +281,10 @@ CheckpointerMain(void)
 		/*
 		 * These operations are really just a minimal subset of
 		 * AbortTransaction().	We don't have very many resources to worry
-		 * about in checkpointer, but we do have LWLocks, buffers, and temp files.
+		 * about in checkpointer, but we do have flex locks, buffers, and temp
+		 * files.
 		 */
-		LWLockReleaseAll();
+		FlexLockReleaseAll();
 		AbortBufferIO();
 		UnlockBuffers();
 		/* buffer pins are released here: */
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 6758083..14b4368 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -109,6 +109,7 @@
 #include "postmaster/syslogger.h"
 #include "replication/walsender.h"
 #include "storage/fd.h"
+#include "storage/flexlock_internals.h"
 #include "storage/ipc.h"
 #include "storage/pg_shmem.h"
 #include "storage/pmsignal.h"
@@ -404,8 +405,6 @@ typedef struct
 typedef int InheritableSocket;
 #endif
 
-typedef struct LWLock LWLock;	/* ugly kluge */
-
 /*
  * Structure contains all variables passed to exec:ed backends
  */
@@ -426,7 +425,7 @@ typedef struct
 	slock_t    *ShmemLock;
 	VariableCache ShmemVariableCache;
 	Backend    *ShmemBackendArray;
-	LWLock	   *LWLockArray;
+	FlexLock   *FlexLockArray;
 	slock_t    *ProcStructLock;
 	PROC_HDR   *ProcGlobal;
 	PGPROC	   *AuxiliaryProcs;
@@ -4675,7 +4674,6 @@ MaxLivePostmasterChildren(void)
  * functions
  */
 extern slock_t *ShmemLock;
-extern LWLock *LWLockArray;
 extern slock_t *ProcStructLock;
 extern PGPROC *AuxiliaryProcs;
 extern PMSignalData *PMSignalState;
@@ -4720,7 +4718,7 @@ save_backend_variables(BackendParameters *param, Port *port,
 	param->ShmemVariableCache = ShmemVariableCache;
 	param->ShmemBackendArray = ShmemBackendArray;
 
-	param->LWLockArray = LWLockArray;
+	param->FlexLockArray = FlexLockArray;
 	param->ProcStructLock = ProcStructLock;
 	param->ProcGlobal = ProcGlobal;
 	param->AuxiliaryProcs = AuxiliaryProcs;
@@ -4943,7 +4941,7 @@ restore_backend_variables(BackendParameters *param, Port *port)
 	ShmemVariableCache = param->ShmemVariableCache;
 	ShmemBackendArray = param->ShmemBackendArray;
 
-	LWLockArray = param->LWLockArray;
+	FlexLockArray = param->FlexLockArray;
 	ProcStructLock = param->ProcStructLock;
 	ProcGlobal = param->ProcGlobal;
 	AuxiliaryProcs = param->AuxiliaryProcs;
diff --git a/src/backend/postmaster/walwriter.c b/src/backend/postmaster/walwriter.c
index 157728e..587443d 100644
--- a/src/backend/postmaster/walwriter.c
+++ b/src/backend/postmaster/walwriter.c
@@ -167,9 +167,9 @@ WalWriterMain(void)
 		/*
 		 * These operations are really just a minimal subset of
 		 * AbortTransaction().	We don't have very many resources to worry
-		 * about in walwriter, but we do have LWLocks, and perhaps buffers?
+		 * about in walwriter, but we do have flex locks, and perhaps buffers?
 		 */
-		LWLockReleaseAll();
+		FlexLockReleaseAll();
 		AbortBufferIO();
 		UnlockBuffers();
 		/* buffer pins are released here: */
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index dd2d6ee..dc93b42 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -702,7 +702,7 @@ ProcessStandbyHSFeedbackMessage(void)
 	 * safe, and if we're moving it backwards, well, the data is at risk
 	 * already since a VACUUM could have just finished calling GetOldestXmin.)
 	 */
-	MyProc->xmin = reply.xmin;
+	MyProcMinimal->xmin = reply.xmin;
 }
 
 /* Main loop of walsender process */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index e59af33..07356ec 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -141,7 +141,7 @@ PrefetchBuffer(Relation reln, ForkNumber forkNum, BlockNumber blockNum)
 	{
 		BufferTag	newTag;		/* identity of requested block */
 		uint32		newHash;	/* hash value for newTag */
-		LWLockId	newPartitionLock;	/* buffer partition lock for it */
+		FlexLockId	newPartitionLock;	/* buffer partition lock for it */
 		int			buf_id;
 
 		/* create a tag so we can lookup the buffer */
@@ -512,10 +512,10 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 {
 	BufferTag	newTag;			/* identity of requested block */
 	uint32		newHash;		/* hash value for newTag */
-	LWLockId	newPartitionLock;		/* buffer partition lock for it */
+	FlexLockId	newPartitionLock;		/* buffer partition lock for it */
 	BufferTag	oldTag;			/* previous identity of selected buffer */
 	uint32		oldHash;		/* hash value for oldTag */
-	LWLockId	oldPartitionLock;		/* buffer partition lock for it */
+	FlexLockId	oldPartitionLock;		/* buffer partition lock for it */
 	BufFlags	oldFlags;
 	int			buf_id;
 	volatile BufferDesc *buf;
@@ -855,7 +855,7 @@ InvalidateBuffer(volatile BufferDesc *buf)
 {
 	BufferTag	oldTag;
 	uint32		oldHash;		/* hash value for oldTag */
-	LWLockId	oldPartitionLock;		/* buffer partition lock for it */
+	FlexLockId	oldPartitionLock;		/* buffer partition lock for it */
 	BufFlags	oldFlags;
 
 	/* Save the original buffer tag before dropping the spinlock */
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 56c0bd8..a2c570a 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -113,7 +113,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
 		size = add_size(size, SUBTRANSShmemSize());
 		size = add_size(size, TwoPhaseShmemSize());
 		size = add_size(size, MultiXactShmemSize());
-		size = add_size(size, LWLockShmemSize());
+		size = add_size(size, FlexLockShmemSize());
 		size = add_size(size, ProcArrayShmemSize());
 		size = add_size(size, BackendStatusShmemSize());
 		size = add_size(size, SInvalShmemSize());
@@ -179,7 +179,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
 	 * needed for InitShmemIndex.
 	 */
 	if (!IsUnderPostmaster)
-		CreateLWLocks();
+		CreateFlexLocks();
 
 	/*
 	 * Set up shmem.c index hashtable
@@ -192,7 +192,6 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
 	XLOGShmemInit();
 	CLOGShmemInit();
 	SUBTRANSShmemInit();
-	TwoPhaseShmemInit();
 	MultiXactShmemInit();
 	InitBufferPool();
 
@@ -213,6 +212,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
 		InitProcGlobal();
 	CreateSharedProcArray();
 	CreateSharedBackendStatus();
+	TwoPhaseShmemInit();
 
 	/*
 	 * Set up shared-inval messaging
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 1a48485..8b6a9ef 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -52,6 +52,7 @@
 #include "access/twophase.h"
 #include "miscadmin.h"
 #include "storage/procarray.h"
+#include "storage/procarraylock.h"
 #include "storage/spin.h"
 #include "utils/builtins.h"
 #include "utils/snapmgr.h"
@@ -82,14 +83,17 @@ typedef struct ProcArrayStruct
 	TransactionId lastOverflowedXid;
 
 	/*
-	 * We declare procs[] as 1 entry because C wants a fixed-size array, but
+	 * We declare pgprocnos[] as 1 entry because C wants a fixed-size array, but
 	 * actually it is maxProcs entries long.
 	 */
-	PGPROC	   *procs[1];		/* VARIABLE LENGTH ARRAY */
+	int			pgprocnos[1];		/* VARIABLE LENGTH ARRAY */
 } ProcArrayStruct;
 
 static ProcArrayStruct *procArray;
 
+static PGPROC *allProcs;
+static PGPROC_MINIMAL *allProcs_Minimal;
+
 /*
  * Bookkeeping for tracking emulated transactions in recovery
  */
@@ -169,8 +173,8 @@ ProcArrayShmemSize(void)
 	/* Size of the ProcArray structure itself */
 #define PROCARRAY_MAXPROCS	(MaxBackends + max_prepared_xacts)
 
-	size = offsetof(ProcArrayStruct, procs);
-	size = add_size(size, mul_size(sizeof(PGPROC *), PROCARRAY_MAXPROCS));
+	size = offsetof(ProcArrayStruct, pgprocnos);
+	size = add_size(size, mul_size(sizeof(int), PROCARRAY_MAXPROCS));
 
 	/*
 	 * During Hot Standby processing we have a data structure called
@@ -211,8 +215,8 @@ CreateSharedProcArray(void)
 	/* Create or attach to the ProcArray shared structure */
 	procArray = (ProcArrayStruct *)
 		ShmemInitStruct("Proc Array",
-						add_size(offsetof(ProcArrayStruct, procs),
-								 mul_size(sizeof(PGPROC *),
+						add_size(offsetof(ProcArrayStruct, pgprocnos),
+								 mul_size(sizeof(int),
 										  PROCARRAY_MAXPROCS)),
 						&found);
 
@@ -231,6 +235,9 @@ CreateSharedProcArray(void)
 		procArray->lastOverflowedXid = InvalidTransactionId;
 	}
 
+	allProcs = ProcGlobal->allProcs;
+	allProcs_Minimal = ProcGlobal->allProcs_Minimal;
+
 	/* Create or attach to the KnownAssignedXids arrays too, if needed */
 	if (EnableHotStandby)
 	{
@@ -253,8 +260,9 @@ void
 ProcArrayAdd(PGPROC *proc)
 {
 	ProcArrayStruct *arrayP = procArray;
+	int index;
 
-	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+	ProcArrayLockAcquire(PAL_EXCLUSIVE);
 
 	if (arrayP->numProcs >= arrayP->maxProcs)
 	{
@@ -263,16 +271,37 @@ ProcArrayAdd(PGPROC *proc)
 		 * fixed supply of PGPROC structs too, and so we should have failed
 		 * earlier.)
 		 */
-		LWLockRelease(ProcArrayLock);
+		ProcArrayLockRelease();
 		ereport(FATAL,
 				(errcode(ERRCODE_TOO_MANY_CONNECTIONS),
 				 errmsg("sorry, too many clients already")));
 	}
 
-	arrayP->procs[arrayP->numProcs] = proc;
+	/*
+	 * Keep the pgprocnos array sorted by PGPROC slot number so that we can
+	 * exploit locality of reference: while traversing the ProcArray, there
+	 * is an increased likelihood of finding the next PGPROC structure
+	 * already in the cache.
+	 *
+	 * Since adding or removing a proc is much rarer than accessing the
+	 * ProcArray itself, the overhead should be marginal.
+	 */
+	for (index = 0; index < arrayP->numProcs; index++)
+	{
+		/*
+		 * Stop at the first empty slot or at the first entry whose
+		 * pgprocno is greater than ours; this is our insertion point.
+		 */
+		if ((arrayP->pgprocnos[index] == -1) || (arrayP->pgprocnos[index] > proc->pgprocno))
+			break;
+	}
+
+	memmove(&arrayP->pgprocnos[index + 1], &arrayP->pgprocnos[index],
+			(arrayP->numProcs - index) * sizeof (int));
+	arrayP->pgprocnos[index] = proc->pgprocno;
 	arrayP->numProcs++;
 
-	LWLockRelease(ProcArrayLock);
+	ProcArrayLockRelease();
 }
 
 /*
@@ -289,6 +318,7 @@ void
 ProcArrayRemove(PGPROC *proc, TransactionId latestXid)
 {
 	ProcArrayStruct *arrayP = procArray;
+	PGPROC_MINIMAL *proc_minimal = &allProcs_Minimal[proc->pgprocno];
 	int			index;
 
 #ifdef XIDCACHE_DEBUG
@@ -297,11 +327,11 @@ ProcArrayRemove(PGPROC *proc, TransactionId latestXid)
 		DisplayXidCache();
 #endif
 
-	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+	ProcArrayLockAcquire(PAL_EXCLUSIVE);
 
 	if (TransactionIdIsValid(latestXid))
 	{
-		Assert(TransactionIdIsValid(proc->xid));
+		Assert(TransactionIdIsValid(proc_minimal->xid));
 
 		/* Advance global latestCompletedXid while holding the lock */
 		if (TransactionIdPrecedes(ShmemVariableCache->latestCompletedXid,
@@ -311,23 +341,25 @@ ProcArrayRemove(PGPROC *proc, TransactionId latestXid)
 	else
 	{
 		/* Shouldn't be trying to remove a live transaction here */
-		Assert(!TransactionIdIsValid(proc->xid));
+		Assert(!TransactionIdIsValid(proc_minimal->xid));
 	}
 
 	for (index = 0; index < arrayP->numProcs; index++)
 	{
-		if (arrayP->procs[index] == proc)
+		if (arrayP->pgprocnos[index] == proc->pgprocno)
 		{
-			arrayP->procs[index] = arrayP->procs[arrayP->numProcs - 1];
-			arrayP->procs[arrayP->numProcs - 1] = NULL; /* for debugging */
+			/* Keep the pgprocnos array sorted.  See notes above. */
+			memmove(&arrayP->pgprocnos[index], &arrayP->pgprocnos[index + 1],
+					(arrayP->numProcs - index - 1) * sizeof (int));
+			arrayP->pgprocnos[arrayP->numProcs - 1] = -1; /* for debugging */
 			arrayP->numProcs--;
-			LWLockRelease(ProcArrayLock);
+			ProcArrayLockRelease();
 			return;
 		}
 	}
 
 	/* Ooops */
-	LWLockRelease(ProcArrayLock);
+	ProcArrayLockRelease();
 
 	elog(LOG, "failed to find proc %p in ProcArray", proc);
 }
@@ -349,56 +381,19 @@ ProcArrayRemove(PGPROC *proc, TransactionId latestXid)
 void
 ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid)
 {
+	PGPROC_MINIMAL *proc_minimal = &allProcs_Minimal[proc->pgprocno];
+
 	if (TransactionIdIsValid(latestXid))
 	{
-		/*
-		 * We must lock ProcArrayLock while clearing proc->xid, so that we do
-		 * not exit the set of "running" transactions while someone else is
-		 * taking a snapshot.  See discussion in
-		 * src/backend/access/transam/README.
-		 */
-		Assert(TransactionIdIsValid(proc->xid));
-
-		LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-
-		proc->xid = InvalidTransactionId;
-		proc->lxid = InvalidLocalTransactionId;
-		proc->xmin = InvalidTransactionId;
-		/* must be cleared with xid/xmin: */
-		proc->vacuumFlags &= ~PROC_VACUUM_STATE_MASK;
-		proc->inCommit = false; /* be sure this is cleared in abort */
-		proc->recoveryConflictPending = false;
-
-		/* Clear the subtransaction-XID cache too while holding the lock */
-		proc->subxids.nxids = 0;
-		proc->subxids.overflowed = false;
-
-		/* Also advance global latestCompletedXid while holding the lock */
-		if (TransactionIdPrecedes(ShmemVariableCache->latestCompletedXid,
-								  latestXid))
-			ShmemVariableCache->latestCompletedXid = latestXid;
-
-		LWLockRelease(ProcArrayLock);
+		Assert(proc == MyProc);
+		ProcArrayLockClearTransaction(latestXid);
 	}
 	else
-	{
-		/*
-		 * If we have no XID, we don't need to lock, since we won't affect
-		 * anyone else's calculation of a snapshot.  We might change their
-		 * estimate of global xmin, but that's OK.
-		 */
-		Assert(!TransactionIdIsValid(proc->xid));
+		proc_minimal->xmin = InvalidTransactionId;
 
-		proc->lxid = InvalidLocalTransactionId;
-		proc->xmin = InvalidTransactionId;
-		/* must be cleared with xid/xmin: */
-		proc->vacuumFlags &= ~PROC_VACUUM_STATE_MASK;
-		proc->inCommit = false; /* be sure this is cleared in abort */
-		proc->recoveryConflictPending = false;
-
-		Assert(proc->subxids.nxids == 0);
-		Assert(proc->subxids.overflowed == false);
-	}
+	proc->lxid = InvalidLocalTransactionId;
+	proc_minimal->inCommit = false; /* be sure this is cleared in abort */
+	proc->recoveryConflictPending = false;
 }
 
 
@@ -413,24 +408,26 @@ ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid)
 void
 ProcArrayClearTransaction(PGPROC *proc)
 {
+	PGPROC_MINIMAL *proc_minimal = &allProcs_Minimal[proc->pgprocno];
+
 	/*
 	 * We can skip locking ProcArrayLock here, because this action does not
 	 * actually change anyone's view of the set of running XIDs: our entry is
 	 * duplicate with the gxact that has already been inserted into the
 	 * ProcArray.
 	 */
-	proc->xid = InvalidTransactionId;
+	proc_minimal->xid = InvalidTransactionId;
 	proc->lxid = InvalidLocalTransactionId;
-	proc->xmin = InvalidTransactionId;
+	proc_minimal->xmin = InvalidTransactionId;
 	proc->recoveryConflictPending = false;
 
 	/* redundant, but just in case */
-	proc->vacuumFlags &= ~PROC_VACUUM_STATE_MASK;
-	proc->inCommit = false;
+	proc_minimal->vacuumFlags &= ~PROC_VACUUM_STATE_MASK;
+	proc_minimal->inCommit = false;
 
 	/* Clear the subtransaction-XID cache too */
-	proc->subxids.nxids = 0;
-	proc->subxids.overflowed = false;
+	proc_minimal->nxids = 0;
+	proc_minimal->overflowed = false;
 }
 
 /*
@@ -528,7 +525,7 @@ ProcArrayApplyRecoveryInfo(RunningTransactions running)
 	/*
 	 * Nobody else is running yet, but take locks anyhow
 	 */
-	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+	ProcArrayLockAcquire(PAL_EXCLUSIVE);
 
 	/*
 	 * KnownAssignedXids is sorted so we cannot just add the xids, we have to
@@ -635,7 +632,7 @@ ProcArrayApplyRecoveryInfo(RunningTransactions running)
 	Assert(TransactionIdIsNormal(ShmemVariableCache->latestCompletedXid));
 	Assert(TransactionIdIsValid(ShmemVariableCache->nextXid));
 
-	LWLockRelease(ProcArrayLock);
+	ProcArrayLockRelease();
 
 	KnownAssignedXidsDisplay(trace_recovery(DEBUG3));
 	if (standbyState == STANDBY_SNAPSHOT_READY)
@@ -690,7 +687,7 @@ ProcArrayApplyXidAssignment(TransactionId topxid,
 	/*
 	 * Uses same locking as transaction commit
 	 */
-	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+	ProcArrayLockAcquire(PAL_EXCLUSIVE);
 
 	/*
 	 * Remove subxids from known-assigned-xacts.
@@ -703,7 +700,7 @@ ProcArrayApplyXidAssignment(TransactionId topxid,
 	if (TransactionIdPrecedes(procArray->lastOverflowedXid, max_xid))
 		procArray->lastOverflowedXid = max_xid;
 
-	LWLockRelease(ProcArrayLock);
+	ProcArrayLockRelease();
 }
 
 /*
@@ -795,7 +792,7 @@ TransactionIdIsInProgress(TransactionId xid)
 					 errmsg("out of memory")));
 	}
 
-	LWLockAcquire(ProcArrayLock, LW_SHARED);
+	ProcArrayLockAcquire(PAL_SHARED);
 
 	/*
 	 * Now that we have the lock, we can check latestCompletedXid; if the
@@ -803,7 +800,7 @@ TransactionIdIsInProgress(TransactionId xid)
 	 */
 	if (TransactionIdPrecedes(ShmemVariableCache->latestCompletedXid, xid))
 	{
-		LWLockRelease(ProcArrayLock);
+		ProcArrayLockRelease();
 		xc_by_latest_xid_inc();
 		return true;
 	}
@@ -811,7 +808,9 @@ TransactionIdIsInProgress(TransactionId xid)
 	/* No shortcuts, gotta grovel through the array */
 	for (i = 0; i < arrayP->numProcs; i++)
 	{
-		volatile PGPROC *proc = arrayP->procs[i];
+		int pgprocno = arrayP->pgprocnos[i];
+		volatile PGPROC *proc = &allProcs[pgprocno];
+		volatile PGPROC_MINIMAL *proc_minimal = &allProcs_Minimal[pgprocno];
 		TransactionId pxid;
 
 		/* Ignore my own proc --- dealt with it above */
@@ -819,7 +818,7 @@ TransactionIdIsInProgress(TransactionId xid)
 			continue;
 
 		/* Fetch xid just once - see GetNewTransactionId */
-		pxid = proc->xid;
+		pxid = proc_minimal->xid;
 
 		if (!TransactionIdIsValid(pxid))
 			continue;
@@ -829,7 +828,7 @@ TransactionIdIsInProgress(TransactionId xid)
 		 */
 		if (TransactionIdEquals(pxid, xid))
 		{
-			LWLockRelease(ProcArrayLock);
+			ProcArrayLockRelease();
 			xc_by_main_xid_inc();
 			return true;
 		}
@@ -844,14 +843,14 @@ TransactionIdIsInProgress(TransactionId xid)
 		/*
 		 * Step 2: check the cached child-Xids arrays
 		 */
-		for (j = proc->subxids.nxids - 1; j >= 0; j--)
+		for (j = proc_minimal->nxids - 1; j >= 0; j--)
 		{
 			/* Fetch xid just once - see GetNewTransactionId */
 			TransactionId cxid = proc->subxids.xids[j];
 
 			if (TransactionIdEquals(cxid, xid))
 			{
-				LWLockRelease(ProcArrayLock);
+				ProcArrayLockRelease();
 				xc_by_child_xid_inc();
 				return true;
 			}
@@ -864,7 +863,7 @@ TransactionIdIsInProgress(TransactionId xid)
 		 * we hold ProcArrayLock.  So we can't miss an Xid that we need to
 		 * worry about.)
 		 */
-		if (proc->subxids.overflowed)
+		if (proc_minimal->overflowed)
 			xids[nxids++] = pxid;
 	}
 
@@ -879,7 +878,7 @@ TransactionIdIsInProgress(TransactionId xid)
 
 		if (KnownAssignedXidExists(xid))
 		{
-			LWLockRelease(ProcArrayLock);
+			ProcArrayLockRelease();
 			xc_by_known_assigned_inc();
 			return true;
 		}
@@ -895,7 +894,7 @@ TransactionIdIsInProgress(TransactionId xid)
 			nxids = KnownAssignedXidsGet(xids, xid);
 	}
 
-	LWLockRelease(ProcArrayLock);
+	ProcArrayLockRelease();
 
 	/*
 	 * If none of the relevant caches overflowed, we know the Xid is not
@@ -961,14 +960,17 @@ TransactionIdIsActive(TransactionId xid)
 	if (TransactionIdPrecedes(xid, RecentXmin))
 		return false;
 
-	LWLockAcquire(ProcArrayLock, LW_SHARED);
+	ProcArrayLockAcquire(PAL_SHARED);
 
 	for (i = 0; i < arrayP->numProcs; i++)
 	{
-		volatile PGPROC *proc = arrayP->procs[i];
+		int pgprocno = arrayP->pgprocnos[i];
+		volatile PGPROC *proc = &allProcs[pgprocno];
+		volatile PGPROC_MINIMAL *proc_minimal = &allProcs_Minimal[pgprocno];
+		TransactionId pxid;
 
 		/* Fetch xid just once - see GetNewTransactionId */
-		TransactionId pxid = proc->xid;
+		pxid = proc_minimal->xid;
 
 		if (!TransactionIdIsValid(pxid))
 			continue;
@@ -983,7 +985,7 @@ TransactionIdIsActive(TransactionId xid)
 		}
 	}
 
-	LWLockRelease(ProcArrayLock);
+	ProcArrayLockRelease();
 
 	return result;
 }
@@ -1046,7 +1048,7 @@ GetOldestXmin(bool allDbs, bool ignoreVacuum)
 	/* Cannot look for individual databases during recovery */
 	Assert(allDbs || !RecoveryInProgress());
 
-	LWLockAcquire(ProcArrayLock, LW_SHARED);
+	ProcArrayLockAcquire(PAL_SHARED);
 
 	/*
 	 * We initialize the MIN() calculation with latestCompletedXid + 1. This
@@ -1060,9 +1062,11 @@ GetOldestXmin(bool allDbs, bool ignoreVacuum)
 
 	for (index = 0; index < arrayP->numProcs; index++)
 	{
-		volatile PGPROC *proc = arrayP->procs[index];
+		int pgprocno = arrayP->pgprocnos[index];
+		volatile PGPROC *proc = &allProcs[pgprocno];
+		volatile PGPROC_MINIMAL *proc_minimal = &allProcs_Minimal[pgprocno];
 
-		if (ignoreVacuum && (proc->vacuumFlags & PROC_IN_VACUUM))
+		if (ignoreVacuum && (proc_minimal->vacuumFlags & PROC_IN_VACUUM))
 			continue;
 
 		if (allDbs ||
@@ -1070,7 +1074,7 @@ GetOldestXmin(bool allDbs, bool ignoreVacuum)
 			proc->databaseId == 0)		/* always include WalSender */
 		{
 			/* Fetch xid just once - see GetNewTransactionId */
-			TransactionId xid = proc->xid;
+			TransactionId xid = proc_minimal->xid;
 
 			/* First consider the transaction's own Xid, if any */
 			if (TransactionIdIsNormal(xid) &&
@@ -1084,7 +1088,7 @@ GetOldestXmin(bool allDbs, bool ignoreVacuum)
 			 * have an Xmin but not (yet) an Xid; conversely, if it has an
 			 * Xid, that could determine some not-yet-set Xmin.
 			 */
-			xid = proc->xmin;	/* Fetch just once */
+			xid = proc_minimal->xmin;	/* Fetch just once */
 			if (TransactionIdIsNormal(xid) &&
 				TransactionIdPrecedes(xid, result))
 				result = xid;
@@ -1099,7 +1103,7 @@ GetOldestXmin(bool allDbs, bool ignoreVacuum)
 		 */
 		TransactionId kaxmin = KnownAssignedXidsGetOldestXmin();
 
-		LWLockRelease(ProcArrayLock);
+		ProcArrayLockRelease();
 
 		if (TransactionIdIsNormal(kaxmin) &&
 			TransactionIdPrecedes(kaxmin, result))
@@ -1110,7 +1114,7 @@ GetOldestXmin(bool allDbs, bool ignoreVacuum)
 		/*
 		 * No other information needed, so release the lock immediately.
 		 */
-		LWLockRelease(ProcArrayLock);
+		ProcArrayLockRelease();
 
 		/*
 		 * Compute the cutoff XID by subtracting vacuum_defer_cleanup_age,
@@ -1200,6 +1204,8 @@ GetSnapshotData(Snapshot snapshot)
 	int			count = 0;
 	int			subcount = 0;
 	bool		suboverflowed = false;
+	static TransactionId *xmins = NULL;
+	int			numProcs;
 
 	Assert(snapshot != NULL);
 
@@ -1235,11 +1241,20 @@ GetSnapshotData(Snapshot snapshot)
 					 errmsg("out of memory")));
 	}
 
+	if (xmins == NULL)
+	{
+		xmins = malloc(procArray->maxProcs * sizeof(TransactionId));
+		if (xmins == NULL)
+			ereport(ERROR,
+					(errcode(ERRCODE_OUT_OF_MEMORY),
+					 errmsg("out of memory")));
+	}
+
 	/*
 	 * It is sufficient to get shared lock on ProcArrayLock, even if we are
 	 * going to set MyProc->xmin.
 	 */
-	LWLockAcquire(ProcArrayLock, LW_SHARED);
+	ProcArrayLockAcquire(PAL_SHARED);
 
 	/* xmax is always latestCompletedXid + 1 */
 	xmax = ShmemVariableCache->latestCompletedXid;
@@ -1261,6 +1276,8 @@ GetSnapshotData(Snapshot snapshot)
 
 	if (!snapshot->takenDuringRecovery)
 	{
+		int *pgprocnos = arrayP->pgprocnos;
+
 		/*
 		 * Spin over procArray checking xid, xmin, and subxids.  The goal is
 		 * to gather all active xids, find the lowest xmin, and try to record
@@ -1269,23 +1286,25 @@ GetSnapshotData(Snapshot snapshot)
 		 * prepared transaction xids are held in KnownAssignedXids, so these
 		 * will be seen without needing to loop through procs here.
 		 */
-		for (index = 0; index < arrayP->numProcs; index++)
+		numProcs = arrayP->numProcs;
+		for (index = 0; index < numProcs; index++)
 		{
-			volatile PGPROC *proc = arrayP->procs[index];
+			int pgprocno = pgprocnos[index];
+			volatile PGPROC_MINIMAL *proc_minimal = &allProcs_Minimal[pgprocno];
 			TransactionId xid;
 
 			/* Ignore procs running LAZY VACUUM */
-			if (proc->vacuumFlags & PROC_IN_VACUUM)
+			if (proc_minimal->vacuumFlags & PROC_IN_VACUUM)
+			{
+				xmins[index] = InvalidTransactionId;
 				continue;
+			}
 
 			/* Update globalxmin to be the smallest valid xmin */
-			xid = proc->xmin;	/* fetch just once */
-			if (TransactionIdIsNormal(xid) &&
-				TransactionIdPrecedes(xid, globalxmin))
-				globalxmin = xid;
+			xmins[index] = proc_minimal->xmin;	/* fetch just once */
 
 			/* Fetch xid just once - see GetNewTransactionId */
-			xid = proc->xid;
+			xid = proc_minimal->xid;
 
 			/*
 			 * If the transaction has been assigned an xid < xmax we add it to
@@ -1300,7 +1319,7 @@ GetSnapshotData(Snapshot snapshot)
 			{
 				if (TransactionIdFollowsOrEquals(xid, xmax))
 					continue;
-				if (proc != MyProc)
+				if (proc_minimal != MyProcMinimal)
 					snapshot->xip[count++] = xid;
 				if (TransactionIdPrecedes(xid, xmin))
 					xmin = xid;
@@ -1321,16 +1340,17 @@ GetSnapshotData(Snapshot snapshot)
 			 *
 			 * Again, our own XIDs are not included in the snapshot.
 			 */
-			if (!suboverflowed && proc != MyProc)
+			if (!suboverflowed && proc_minimal != MyProcMinimal)
 			{
-				if (proc->subxids.overflowed)
+				if (proc_minimal->overflowed)
 					suboverflowed = true;
 				else
 				{
-					int			nxids = proc->subxids.nxids;
+					int			nxids = proc_minimal->nxids;
 
 					if (nxids > 0)
 					{
+						volatile PGPROC *proc = &allProcs[pgprocno];
 						memcpy(snapshot->subxip + subcount,
 							   (void *) proc->subxids.xids,
 							   nxids * sizeof(TransactionId));
@@ -1342,6 +1362,7 @@ GetSnapshotData(Snapshot snapshot)
 	}
 	else
 	{
+		numProcs = 0;
 		/*
 		 * We're in hot standby, so get XIDs from KnownAssignedXids.
 		 *
@@ -1372,16 +1393,23 @@ GetSnapshotData(Snapshot snapshot)
 			suboverflowed = true;
 	}
 
-	if (!TransactionIdIsValid(MyProc->xmin))
-		MyProc->xmin = TransactionXmin = xmin;
-
-	LWLockRelease(ProcArrayLock);
+	if (!TransactionIdIsValid(MyProcMinimal->xmin))
+		MyProcMinimal->xmin = TransactionXmin = xmin;
+	ProcArrayLockRelease();
 
 	/*
 	 * Update globalxmin to include actual process xids.  This is a slightly
 	 * different way of computing it than GetOldestXmin uses, but should give
 	 * the same result.
 	 */
+	for (index = 0; index < numProcs; index++)
+	{
+		TransactionId xid = xmins[index];
+		if (TransactionIdIsNormal(xid) &&
+			TransactionIdPrecedes(xid, globalxmin))
+			globalxmin = xid;
+	}
+
 	if (TransactionIdPrecedes(xmin, globalxmin))
 		globalxmin = xmin;
 
@@ -1432,18 +1460,20 @@ ProcArrayInstallImportedXmin(TransactionId xmin, TransactionId sourcexid)
 		return false;
 
 	/* Get lock so source xact can't end while we're doing this */
-	LWLockAcquire(ProcArrayLock, LW_SHARED);
+	ProcArrayLockAcquire(PAL_SHARED);
 
 	for (index = 0; index < arrayP->numProcs; index++)
 	{
-		volatile PGPROC *proc = arrayP->procs[index];
+		int pgprocno = arrayP->pgprocnos[index];
+		volatile PGPROC *proc = &allProcs[pgprocno];
+		volatile PGPROC_MINIMAL *proc_minimal = &allProcs_Minimal[pgprocno];
 		TransactionId xid;
 
 		/* Ignore procs running LAZY VACUUM */
-		if (proc->vacuumFlags & PROC_IN_VACUUM)
+		if (proc_minimal->vacuumFlags & PROC_IN_VACUUM)
 			continue;
 
-		xid = proc->xid;	/* fetch just once */
+		xid = proc_minimal->xid;	/* fetch just once */
 		if (xid != sourcexid)
 			continue;
 
@@ -1459,7 +1489,7 @@ ProcArrayInstallImportedXmin(TransactionId xmin, TransactionId sourcexid)
 		/*
 		 * Likewise, let's just make real sure its xmin does cover us.
 		 */
-		xid = proc->xmin;	/* fetch just once */
+		xid = proc_minimal->xmin;	/* fetch just once */
 		if (!TransactionIdIsNormal(xid) ||
 			!TransactionIdPrecedesOrEquals(xid, xmin))
 			continue;
@@ -1470,13 +1500,13 @@ ProcArrayInstallImportedXmin(TransactionId xmin, TransactionId sourcexid)
 		 * GetSnapshotData first, we'll be overwriting a valid xmin here,
 		 * so we don't check that.)
 		 */
-		MyProc->xmin = TransactionXmin = xmin;
+		MyProcMinimal->xmin = TransactionXmin = xmin;
 
 		result = true;
 		break;
 	}
 
-	LWLockRelease(ProcArrayLock);
+	ProcArrayLockRelease();
 
 	return result;
 }
@@ -1550,7 +1580,7 @@ GetRunningTransactionData(void)
 	 * Ensure that no xids enter or leave the procarray while we obtain
 	 * snapshot.
 	 */
-	LWLockAcquire(ProcArrayLock, LW_SHARED);
+	ProcArrayLockAcquire(PAL_SHARED);
 	LWLockAcquire(XidGenLock, LW_SHARED);
 
 	latestCompletedXid = ShmemVariableCache->latestCompletedXid;
@@ -1562,12 +1592,14 @@ GetRunningTransactionData(void)
 	 */
 	for (index = 0; index < arrayP->numProcs; index++)
 	{
-		volatile PGPROC *proc = arrayP->procs[index];
+		int pgprocno = arrayP->pgprocnos[index];
+		volatile PGPROC *proc = &allProcs[pgprocno];
+		volatile PGPROC_MINIMAL *proc_minimal = &allProcs_Minimal[pgprocno];
 		TransactionId xid;
 		int			nxids;
 
 		/* Fetch xid just once - see GetNewTransactionId */
-		xid = proc->xid;
+		xid = proc_minimal->xid;
 
 		/*
 		 * We don't need to store transactions that don't have a TransactionId
@@ -1585,7 +1617,7 @@ GetRunningTransactionData(void)
 		 * Save subtransaction XIDs. Other backends can't add or remove
 		 * entries while we're holding XidGenLock.
 		 */
-		nxids = proc->subxids.nxids;
+		nxids = proc_minimal->nxids;
 		if (nxids > 0)
 		{
 			memcpy(&xids[count], (void *) proc->subxids.xids,
@@ -1593,7 +1625,7 @@ GetRunningTransactionData(void)
 			count += nxids;
 			subcount += nxids;
 
-			if (proc->subxids.overflowed)
+			if (proc_minimal->overflowed)
 				suboverflowed = true;
 
 			/*
@@ -1611,7 +1643,7 @@ GetRunningTransactionData(void)
 	CurrentRunningXacts->latestCompletedXid = latestCompletedXid;
 
 	/* We don't release XidGenLock here, the caller is responsible for that */
-	LWLockRelease(ProcArrayLock);
+	ProcArrayLockRelease();
 
 	Assert(TransactionIdIsValid(CurrentRunningXacts->nextXid));
 	Assert(TransactionIdIsValid(CurrentRunningXacts->oldestRunningXid));
@@ -1644,7 +1676,7 @@ GetOldestActiveTransactionId(void)
 
 	Assert(!RecoveryInProgress());
 
-	LWLockAcquire(ProcArrayLock, LW_SHARED);
+	ProcArrayLockAcquire(PAL_SHARED);
 
 	oldestRunningXid = ShmemVariableCache->nextXid;
 
@@ -1653,11 +1685,12 @@ GetOldestActiveTransactionId(void)
 	 */
 	for (index = 0; index < arrayP->numProcs; index++)
 	{
-		volatile PGPROC *proc = arrayP->procs[index];
+		int pgprocno = arrayP->pgprocnos[index];
+		volatile PGPROC_MINIMAL *proc_minimal = &allProcs_Minimal[pgprocno];
 		TransactionId xid;
 
 		/* Fetch xid just once - see GetNewTransactionId */
-		xid = proc->xid;
+		xid = proc_minimal->xid;
 
 		if (!TransactionIdIsNormal(xid))
 			continue;
@@ -1672,7 +1705,7 @@ GetOldestActiveTransactionId(void)
 		 */
 	}
 
-	LWLockRelease(ProcArrayLock);
+	ProcArrayLockRelease();
 
 	return oldestRunningXid;
 }
@@ -1705,20 +1738,22 @@ GetTransactionsInCommit(TransactionId **xids_p)
 	xids = (TransactionId *) palloc(arrayP->maxProcs * sizeof(TransactionId));
 	nxids = 0;
 
-	LWLockAcquire(ProcArrayLock, LW_SHARED);
+	ProcArrayLockAcquire(PAL_SHARED);
 
 	for (index = 0; index < arrayP->numProcs; index++)
 	{
-		volatile PGPROC *proc = arrayP->procs[index];
+		int pgprocno = arrayP->pgprocnos[index];
+		volatile PGPROC_MINIMAL *proc_minimal = &allProcs_Minimal[pgprocno];
+		TransactionId pxid;
 
 		/* Fetch xid just once - see GetNewTransactionId */
-		TransactionId pxid = proc->xid;
+		pxid = proc_minimal->xid;
 
-		if (proc->inCommit && TransactionIdIsValid(pxid))
+		if (proc_minimal->inCommit && TransactionIdIsValid(pxid))
 			xids[nxids++] = pxid;
 	}
 
-	LWLockRelease(ProcArrayLock);
+	ProcArrayLockRelease();
 
 	*xids_p = xids;
 	return nxids;
@@ -1740,16 +1775,18 @@ HaveTransactionsInCommit(TransactionId *xids, int nxids)
 	ProcArrayStruct *arrayP = procArray;
 	int			index;
 
-	LWLockAcquire(ProcArrayLock, LW_SHARED);
+	ProcArrayLockAcquire(PAL_SHARED);
 
 	for (index = 0; index < arrayP->numProcs; index++)
 	{
-		volatile PGPROC *proc = arrayP->procs[index];
+		int pgprocno = arrayP->pgprocnos[index];
+		volatile PGPROC_MINIMAL *proc_minimal = &allProcs_Minimal[pgprocno];
+		TransactionId pxid;
 
 		/* Fetch xid just once - see GetNewTransactionId */
-		TransactionId pxid = proc->xid;
+		pxid = proc_minimal->xid;
 
-		if (proc->inCommit && TransactionIdIsValid(pxid))
+		if (proc_minimal->inCommit && TransactionIdIsValid(pxid))
 		{
 			int			i;
 
@@ -1766,7 +1803,7 @@ HaveTransactionsInCommit(TransactionId *xids, int nxids)
 		}
 	}
 
-	LWLockRelease(ProcArrayLock);
+	ProcArrayLockRelease();
 
 	return result;
 }
@@ -1788,11 +1825,11 @@ BackendPidGetProc(int pid)
 	if (pid == 0)				/* never match dummy PGPROCs */
 		return NULL;
 
-	LWLockAcquire(ProcArrayLock, LW_SHARED);
+	ProcArrayLockAcquire(PAL_SHARED);
 
 	for (index = 0; index < arrayP->numProcs; index++)
 	{
-		PGPROC	   *proc = arrayP->procs[index];
+		PGPROC	   *proc = &allProcs[arrayP->pgprocnos[index]];
 
 		if (proc->pid == pid)
 		{
@@ -1801,7 +1838,7 @@ BackendPidGetProc(int pid)
 		}
 	}
 
-	LWLockRelease(ProcArrayLock);
+	ProcArrayLockRelease();
 
 	return result;
 }
@@ -1829,20 +1866,22 @@ BackendXidGetPid(TransactionId xid)
 	if (xid == InvalidTransactionId)	/* never match invalid xid */
 		return 0;
 
-	LWLockAcquire(ProcArrayLock, LW_SHARED);
+	ProcArrayLockAcquire(PAL_SHARED);
 
 	for (index = 0; index < arrayP->numProcs; index++)
 	{
-		volatile PGPROC *proc = arrayP->procs[index];
+		int pgprocno = arrayP->pgprocnos[index];
+		volatile PGPROC *proc = &allProcs[pgprocno];
+		volatile PGPROC_MINIMAL *proc_minimal = &allProcs_Minimal[pgprocno];
 
-		if (proc->xid == xid)
+		if (proc_minimal->xid == xid)
 		{
 			result = proc->pid;
 			break;
 		}
 	}
 
-	LWLockRelease(ProcArrayLock);
+	ProcArrayLockRelease();
 
 	return result;
 }
@@ -1897,22 +1936,24 @@ GetCurrentVirtualXIDs(TransactionId limitXmin, bool excludeXmin0,
 	vxids = (VirtualTransactionId *)
 		palloc(sizeof(VirtualTransactionId) * arrayP->maxProcs);
 
-	LWLockAcquire(ProcArrayLock, LW_SHARED);
+	ProcArrayLockAcquire(PAL_SHARED);
 
 	for (index = 0; index < arrayP->numProcs; index++)
 	{
-		volatile PGPROC *proc = arrayP->procs[index];
+		int pgprocno = arrayP->pgprocnos[index];
+		volatile PGPROC *proc = &allProcs[pgprocno];
+		volatile PGPROC_MINIMAL *proc_minimal = &allProcs_Minimal[pgprocno];
 
 		if (proc == MyProc)
 			continue;
 
-		if (excludeVacuum & proc->vacuumFlags)
+		if (excludeVacuum & proc_minimal->vacuumFlags)
 			continue;
 
 		if (allDbs || proc->databaseId == MyDatabaseId)
 		{
 			/* Fetch xmin just once - might change on us */
-			TransactionId pxmin = proc->xmin;
+			TransactionId pxmin = proc_minimal->xmin;
 
 			if (excludeXmin0 && !TransactionIdIsValid(pxmin))
 				continue;
@@ -1933,7 +1974,7 @@ GetCurrentVirtualXIDs(TransactionId limitXmin, bool excludeXmin0,
 		}
 	}
 
-	LWLockRelease(ProcArrayLock);
+	ProcArrayLockRelease();
 
 	*nvxids = count;
 	return vxids;
@@ -1992,11 +2033,13 @@ GetConflictingVirtualXIDs(TransactionId limitXmin, Oid dbOid)
 					 errmsg("out of memory")));
 	}
 
-	LWLockAcquire(ProcArrayLock, LW_SHARED);
+	ProcArrayLockAcquire(PAL_SHARED);
 
 	for (index = 0; index < arrayP->numProcs; index++)
 	{
-		volatile PGPROC *proc = arrayP->procs[index];
+		int pgprocno = arrayP->pgprocnos[index];
+		volatile PGPROC *proc = &allProcs[pgprocno];
+		volatile PGPROC_MINIMAL *proc_minimal = &allProcs_Minimal[pgprocno];
 
 		/* Exclude prepared transactions */
 		if (proc->pid == 0)
@@ -2006,7 +2049,7 @@ GetConflictingVirtualXIDs(TransactionId limitXmin, Oid dbOid)
 			proc->databaseId == dbOid)
 		{
 			/* Fetch xmin just once - can't change on us, but good coding */
-			TransactionId pxmin = proc->xmin;
+			TransactionId pxmin = proc_minimal->xmin;
 
 			/*
 			 * We ignore an invalid pxmin because this means that backend has
@@ -2025,7 +2068,7 @@ GetConflictingVirtualXIDs(TransactionId limitXmin, Oid dbOid)
 		}
 	}
 
-	LWLockRelease(ProcArrayLock);
+	ProcArrayLockRelease();
 
 	/* add the terminator */
 	vxids[count].backendId = InvalidBackendId;
@@ -2046,12 +2089,13 @@ CancelVirtualTransaction(VirtualTransactionId vxid, ProcSignalReason sigmode)
 	int			index;
 	pid_t		pid = 0;
 
-	LWLockAcquire(ProcArrayLock, LW_SHARED);
+	ProcArrayLockAcquire(PAL_SHARED);
 
 	for (index = 0; index < arrayP->numProcs; index++)
 	{
+		int pgprocno = arrayP->pgprocnos[index];
+		volatile PGPROC *proc = &allProcs[pgprocno];
 		VirtualTransactionId procvxid;
-		PGPROC	   *proc = arrayP->procs[index];
 
 		GET_VXID_FROM_PGPROC(procvxid, *proc);
 
@@ -2072,7 +2116,7 @@ CancelVirtualTransaction(VirtualTransactionId vxid, ProcSignalReason sigmode)
 		}
 	}
 
-	LWLockRelease(ProcArrayLock);
+	ProcArrayLockRelease();
 
 	return pid;
 }
@@ -2104,7 +2148,9 @@ MinimumActiveBackends(int min)
 	 */
 	for (index = 0; index < arrayP->numProcs; index++)
 	{
-		volatile PGPROC *proc = arrayP->procs[index];
+		int pgprocno = arrayP->pgprocnos[index];
+		volatile PGPROC *proc = &allProcs[pgprocno];
+		volatile PGPROC_MINIMAL *proc_minimal = &allProcs_Minimal[pgprocno];
 
 		/*
 		 * Since we're not holding a lock, need to check that the pointer is
@@ -2122,10 +2168,10 @@ MinimumActiveBackends(int min)
 
 		if (proc == MyProc)
 			continue;			/* do not count myself */
+		if (proc_minimal->xid == InvalidTransactionId)
+			continue;			/* do not count if no XID assigned */
 		if (proc->pid == 0)
 			continue;			/* do not count prepared xacts */
-		if (proc->xid == InvalidTransactionId)
-			continue;			/* do not count if no XID assigned */
 		if (proc->waitLock != NULL)
 			continue;			/* do not count if blocked on a lock */
 		count++;
@@ -2146,11 +2192,12 @@ CountDBBackends(Oid databaseid)
 	int			count = 0;
 	int			index;
 
-	LWLockAcquire(ProcArrayLock, LW_SHARED);
+	ProcArrayLockAcquire(PAL_SHARED);
 
 	for (index = 0; index < arrayP->numProcs; index++)
 	{
-		volatile PGPROC *proc = arrayP->procs[index];
+		int pgprocno = arrayP->pgprocnos[index];
+		volatile PGPROC *proc = &allProcs[pgprocno];
 
 		if (proc->pid == 0)
 			continue;			/* do not count prepared xacts */
@@ -2159,7 +2206,7 @@ CountDBBackends(Oid databaseid)
 			count++;
 	}
 
-	LWLockRelease(ProcArrayLock);
+	ProcArrayLockRelease();
 
 	return count;
 }
@@ -2175,11 +2222,12 @@ CancelDBBackends(Oid databaseid, ProcSignalReason sigmode, bool conflictPending)
 	pid_t		pid = 0;
 
 	/* tell all backends to die */
-	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+	ProcArrayLockAcquire(PAL_EXCLUSIVE);
 
 	for (index = 0; index < arrayP->numProcs; index++)
 	{
-		volatile PGPROC *proc = arrayP->procs[index];
+		int pgprocno = arrayP->pgprocnos[index];
+		volatile PGPROC *proc = &allProcs[pgprocno];
 
 		if (databaseid == InvalidOid || proc->databaseId == databaseid)
 		{
@@ -2200,7 +2248,7 @@ CancelDBBackends(Oid databaseid, ProcSignalReason sigmode, bool conflictPending)
 		}
 	}
 
-	LWLockRelease(ProcArrayLock);
+	ProcArrayLockRelease();
 }
 
 /*
@@ -2213,11 +2261,12 @@ CountUserBackends(Oid roleid)
 	int			count = 0;
 	int			index;
 
-	LWLockAcquire(ProcArrayLock, LW_SHARED);
+	ProcArrayLockAcquire(PAL_SHARED);
 
 	for (index = 0; index < arrayP->numProcs; index++)
 	{
-		volatile PGPROC *proc = arrayP->procs[index];
+		int pgprocno = arrayP->pgprocnos[index];
+		volatile PGPROC *proc = &allProcs[pgprocno];
 
 		if (proc->pid == 0)
 			continue;			/* do not count prepared xacts */
@@ -2225,7 +2274,7 @@ CountUserBackends(Oid roleid)
 			count++;
 	}
 
-	LWLockRelease(ProcArrayLock);
+	ProcArrayLockRelease();
 
 	return count;
 }
@@ -2273,11 +2322,13 @@ CountOtherDBBackends(Oid databaseId, int *nbackends, int *nprepared)
 
 		*nbackends = *nprepared = 0;
 
-		LWLockAcquire(ProcArrayLock, LW_SHARED);
+		ProcArrayLockAcquire(PAL_SHARED);
 
 		for (index = 0; index < arrayP->numProcs; index++)
 		{
-			volatile PGPROC *proc = arrayP->procs[index];
+			int pgprocno = arrayP->pgprocnos[index];
+			volatile PGPROC *proc = &allProcs[pgprocno];
+			volatile PGPROC_MINIMAL *proc_minimal = &allProcs_Minimal[pgprocno];
 
 			if (proc->databaseId != databaseId)
 				continue;
@@ -2291,13 +2342,13 @@ CountOtherDBBackends(Oid databaseId, int *nbackends, int *nprepared)
 			else
 			{
 				(*nbackends)++;
-				if ((proc->vacuumFlags & PROC_IS_AUTOVACUUM) &&
+				if ((proc_minimal->vacuumFlags & PROC_IS_AUTOVACUUM) &&
 					nautovacs < MAXAUTOVACPIDS)
 					autovac_pids[nautovacs++] = proc->pid;
 			}
 		}
 
-		LWLockRelease(ProcArrayLock);
+		ProcArrayLockRelease();
 
 		if (!found)
 			return false;		/* no conflicting backends, so done */
@@ -2321,8 +2372,8 @@ CountOtherDBBackends(Oid databaseId, int *nbackends, int *nprepared)
 
 #define XidCacheRemove(i) \
 	do { \
-		MyProc->subxids.xids[i] = MyProc->subxids.xids[MyProc->subxids.nxids - 1]; \
-		MyProc->subxids.nxids--; \
+		MyProc->subxids.xids[i] = MyProc->subxids.xids[MyProcMinimal->nxids - 1]; \
+		MyProcMinimal->nxids--; \
 	} while (0)
 
 /*
@@ -2350,7 +2401,7 @@ XidCacheRemoveRunningXids(TransactionId xid,
 	 * to abort subtransactions, but pending closer analysis we'd best be
 	 * conservative.
 	 */
-	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+	ProcArrayLockAcquire(PAL_EXCLUSIVE);
 
 	/*
 	 * Under normal circumstances xid and xids[] will be in increasing order,
@@ -2361,7 +2412,7 @@ XidCacheRemoveRunningXids(TransactionId xid,
 	{
 		TransactionId anxid = xids[i];
 
-		for (j = MyProc->subxids.nxids - 1; j >= 0; j--)
+		for (j = MyProcMinimal->nxids - 1; j >= 0; j--)
 		{
 			if (TransactionIdEquals(MyProc->subxids.xids[j], anxid))
 			{
@@ -2377,11 +2428,11 @@ XidCacheRemoveRunningXids(TransactionId xid,
 		 * error during AbortSubTransaction.  So instead of Assert, emit a
 		 * debug warning.
 		 */
-		if (j < 0 && !MyProc->subxids.overflowed)
+		if (j < 0 && !MyProcMinimal->overflowed)
 			elog(WARNING, "did not find subXID %u in MyProc", anxid);
 	}
 
-	for (j = MyProc->subxids.nxids - 1; j >= 0; j--)
+	for (j = MyProcMinimal->nxids - 1; j >= 0; j--)
 	{
 		if (TransactionIdEquals(MyProc->subxids.xids[j], xid))
 		{
@@ -2390,7 +2441,7 @@ XidCacheRemoveRunningXids(TransactionId xid,
 		}
 	}
 	/* Ordinarily we should have found it, unless the cache has overflowed */
-	if (j < 0 && !MyProc->subxids.overflowed)
+	if (j < 0 && !MyProcMinimal->overflowed)
 		elog(WARNING, "did not find subXID %u in MyProc", xid);
 
 	/* Also advance global latestCompletedXid while holding the lock */
@@ -2398,7 +2449,7 @@ XidCacheRemoveRunningXids(TransactionId xid,
 							  latestXid))
 		ShmemVariableCache->latestCompletedXid = latestXid;
 
-	LWLockRelease(ProcArrayLock);
+	ProcArrayLockRelease();
 }
 
 #ifdef XIDCACHE_DEBUG
@@ -2565,7 +2616,7 @@ ExpireTreeKnownAssignedTransactionIds(TransactionId xid, int nsubxids,
 	/*
 	 * Uses same locking as transaction commit
 	 */
-	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+	ProcArrayLockAcquire(PAL_EXCLUSIVE);
 
 	KnownAssignedXidsRemoveTree(xid, nsubxids, subxids);
 
@@ -2574,7 +2625,7 @@ ExpireTreeKnownAssignedTransactionIds(TransactionId xid, int nsubxids,
 							  max_xid))
 		ShmemVariableCache->latestCompletedXid = max_xid;
 
-	LWLockRelease(ProcArrayLock);
+	ProcArrayLockRelease();
 }
 
 /*
@@ -2584,9 +2635,9 @@ ExpireTreeKnownAssignedTransactionIds(TransactionId xid, int nsubxids,
 void
 ExpireAllKnownAssignedTransactionIds(void)
 {
-	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+	ProcArrayLockAcquire(PAL_EXCLUSIVE);
 	KnownAssignedXidsRemovePreceding(InvalidTransactionId);
-	LWLockRelease(ProcArrayLock);
+	ProcArrayLockRelease();
 }
 
 /*
@@ -2596,9 +2647,9 @@ ExpireAllKnownAssignedTransactionIds(void)
 void
 ExpireOldKnownAssignedTransactionIds(TransactionId xid)
 {
-	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+	ProcArrayLockAcquire(PAL_EXCLUSIVE);
 	KnownAssignedXidsRemovePreceding(xid);
-	LWLockRelease(ProcArrayLock);
+	ProcArrayLockRelease();
 }
 
 
@@ -2820,7 +2871,7 @@ KnownAssignedXidsAdd(TransactionId from_xid, TransactionId to_xid,
 	{
 		/* must hold lock to compress */
 		if (!exclusive_lock)
-			LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+			ProcArrayLockAcquire(PAL_EXCLUSIVE);
 
 		KnownAssignedXidsCompress(true);
 
@@ -2828,7 +2879,7 @@ KnownAssignedXidsAdd(TransactionId from_xid, TransactionId to_xid,
 		/* note: we no longer care about the tail pointer */
 
 		if (!exclusive_lock)
-			LWLockRelease(ProcArrayLock);
+			ProcArrayLockRelease();
 
 		/*
 		 * If it still won't fit then we're out of memory
diff --git a/src/backend/storage/lmgr/Makefile b/src/backend/storage/lmgr/Makefile
index e12a854..27eaa97 100644
--- a/src/backend/storage/lmgr/Makefile
+++ b/src/backend/storage/lmgr/Makefile
@@ -12,7 +12,8 @@ subdir = src/backend/storage/lmgr
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = lmgr.o lock.o proc.o deadlock.o lwlock.o spin.o s_lock.o predicate.o
+OBJS = flexlock.o lmgr.o lock.o proc.o deadlock.o lwlock.o spin.o s_lock.o \
+	procarraylock.o predicate.o
 
 include $(top_srcdir)/src/backend/common.mk
 
diff --git a/src/backend/storage/lmgr/deadlock.c b/src/backend/storage/lmgr/deadlock.c
index 7e7f6af..4fd7bd7 100644
--- a/src/backend/storage/lmgr/deadlock.c
+++ b/src/backend/storage/lmgr/deadlock.c
@@ -450,6 +450,7 @@ FindLockCycleRecurse(PGPROC *checkProc,
 					 int *nSoftEdges)	/* output argument */
 {
 	PGPROC	   *proc;
+	PGPROC_MINIMAL *proc_minimal;
 	LOCK	   *lock;
 	PROCLOCK   *proclock;
 	SHM_QUEUE  *procLocks;
@@ -516,6 +517,7 @@ FindLockCycleRecurse(PGPROC *checkProc,
 	while (proclock)
 	{
 		proc = proclock->tag.myProc;
+		proc_minimal = &ProcGlobal->allProcs_Minimal[proc->pgprocno];
 
 		/* A proc never blocks itself */
 		if (proc != checkProc)
@@ -541,7 +543,7 @@ FindLockCycleRecurse(PGPROC *checkProc,
 					 * vacuumFlag bit), but we don't do that here to avoid
 					 * grabbing ProcArrayLock.
 					 */
-					if (proc->vacuumFlags & PROC_IS_AUTOVACUUM)
+					if (proc_minimal->vacuumFlags & PROC_IS_AUTOVACUUM)
 						blocking_autovacuum_proc = proc;
 
 					/* This proc hard-blocks checkProc */
diff --git a/src/backend/storage/lmgr/flexlock.c b/src/backend/storage/lmgr/flexlock.c
new file mode 100644
index 0000000..c88bd24
--- /dev/null
+++ b/src/backend/storage/lmgr/flexlock.c
@@ -0,0 +1,366 @@
+/*-------------------------------------------------------------------------
+ *
+ * flexlock.c
+ *	  Low-level routines for managing flex locks.
+ *
+ * Flex locks are intended primarily to provide mutual exclusion of access
+ * to shared-memory data structures.  Most, but not all, flex locks are
+ * lightweight locks (LWLocks).  This file contains support routines that
+ * are used for all types of flex locks, including lwlocks.  User-level
+ * locking should be done with the full lock manager --- which depends on
+ * LWLocks to protect its shared state.
+ *
+ * Portions Copyright (c) 1996-2011, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/backend/storage/lmgr/flexlock.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "miscadmin.h"
+#include "access/clog.h"
+#include "access/multixact.h"
+#include "access/subtrans.h"
+#include "commands/async.h"
+#include "storage/flexlock_internals.h"
+#include "storage/lwlock.h"
+#include "storage/predicate.h"
+#include "storage/proc.h"
+#include "storage/procarraylock.h"
+#include "storage/spin.h"
+#include "utils/elog.h"
+
+/*
+ * We use this structure to keep track of flex locks held, for release
+ * during error recovery.  The maximum size could be determined at runtime
+ * if necessary, but it seems unlikely that more than a few locks could
+ * ever be held simultaneously.
+ */
+#define MAX_SIMUL_FLEXLOCKS	100
+
+int	num_held_flexlocks = 0;
+FlexLockId held_flexlocks[MAX_SIMUL_FLEXLOCKS];
+
+static int	lock_addin_request = 0;
+static bool lock_addin_request_allowed = true;
+
+#ifdef LOCK_DEBUG
+bool		Trace_flexlocks = false;
+#endif
+
+/*
+ * This points to the array of FlexLocks in shared memory.  Backends inherit
+ * the pointer by fork from the postmaster (except in the EXEC_BACKEND case,
+ * where we have special measures to pass it down).
+ */
+FlexLockPadded *FlexLockArray = NULL;
+
+/* We use the ShmemLock spinlock to protect FlexLockAssign */
+extern slock_t *ShmemLock;
+
+static void FlexLockInit(FlexLock *flex, char locktype);
+
+/*
+ * Compute number of FlexLocks to allocate.
+ */
+int
+NumFlexLocks(void)
+{
+	int			numLocks;
+
+	/*
+	 * Possibly this logic should be spread out among the affected modules,
+	 * the same way that shmem space estimation is done.  But for now, there
+	 * are few enough users of FlexLocks that we can get away with just keeping
+	 * the knowledge here.
+	 */
+
+	/* Predefined FlexLocks */
+	numLocks = (int) NumFixedFlexLocks;
+
+	/* bufmgr.c needs two for each shared buffer */
+	numLocks += 2 * NBuffers;
+
+	/* proc.c needs one for each backend or auxiliary process */
+	numLocks += MaxBackends + NUM_AUXILIARY_PROCS;
+
+	/* clog.c needs one per CLOG buffer */
+	numLocks += NUM_CLOG_BUFFERS;
+
+	/* subtrans.c needs one per SubTrans buffer */
+	numLocks += NUM_SUBTRANS_BUFFERS;
+
+	/* multixact.c needs two SLRU areas */
+	numLocks += NUM_MXACTOFFSET_BUFFERS + NUM_MXACTMEMBER_BUFFERS;
+
+	/* async.c needs one per Async buffer */
+	numLocks += NUM_ASYNC_BUFFERS;
+
+	/* predicate.c needs one per old serializable xid buffer */
+	numLocks += NUM_OLDSERXID_BUFFERS;
+
+	/*
+	 * Add any requested by loadable modules; for backwards-compatibility
+	 * reasons, allocate at least NUM_USER_DEFINED_FLEXLOCKS of them even if
+	 * there are no explicit requests.
+	 */
+	lock_addin_request_allowed = false;
+	numLocks += Max(lock_addin_request, NUM_USER_DEFINED_FLEXLOCKS);
+
+	return numLocks;
+}
+
+
+/*
+ * RequestAddinFlexLocks
+ *		Request that extra FlexLocks be allocated for use by
+ *		a loadable module.
+ *
+ * This is only useful if called from the _PG_init hook of a library that
+ * is loaded into the postmaster via shared_preload_libraries.	Once
+ * shared memory has been allocated, calls will be ignored.  (We could
+ * raise an error, but it seems better to make it a no-op, so that
+ * libraries containing such calls can be reloaded if needed.)
+ */
+void
+RequestAddinFlexLocks(int n)
+{
+	if (IsUnderPostmaster || !lock_addin_request_allowed)
+		return;					/* too late */
+	lock_addin_request += n;
+}
+
+
+/*
+ * Compute shmem space needed for FlexLocks.
+ */
+Size
+FlexLockShmemSize(void)
+{
+	Size		size;
+	int			numLocks = NumFlexLocks();
+
+	/* Space for the FlexLock array. */
+	size = mul_size(numLocks, FLEX_LOCK_BYTES);
+
+	/* Space for dynamic allocation counter, plus room for alignment. */
+	size = add_size(size, 2 * sizeof(int) + FLEX_LOCK_BYTES);
+
+	return size;
+}
+
+/*
+ * Allocate shmem space for FlexLocks and initialize the locks.
+ */
+void
+CreateFlexLocks(void)
+{
+	int			numLocks = NumFlexLocks();
+	Size		spaceLocks = FlexLockShmemSize();
+	FlexLockPadded *lock;
+	int		   *FlexLockCounter;
+	char	   *ptr;
+	int			id;
+
+	/* Allocate and zero space */
+	ptr = (char *) ShmemAlloc(spaceLocks);
+	memset(ptr, 0, spaceLocks);
+
+	/* Leave room for dynamic allocation counter */
+	ptr += 2 * sizeof(int);
+
+	/* Ensure desired alignment of FlexLock array */
+	ptr += FLEX_LOCK_BYTES - ((uintptr_t) ptr) % FLEX_LOCK_BYTES;
+
+	FlexLockArray = (FlexLockPadded *) ptr;
+
+	/* All of the "fixed" FlexLocks are LWLocks - except ProcArrayLock. */
+	for (id = 0, lock = FlexLockArray; id < NumFixedFlexLocks; id++, lock++)
+	{
+		if (id == ProcArrayLock)
+			FlexLockInit(&lock->flex, FLEXLOCK_TYPE_PROCARRAYLOCK);
+		else
+			FlexLockInit(&lock->flex, FLEXLOCK_TYPE_LWLOCK);
+	}
+
+	/*
+	 * Initialize the dynamic-allocation counter, which is stored just before
+	 * the first FlexLock.
+	 */
+	FlexLockCounter = (int *) ((char *) FlexLockArray - 2 * sizeof(int));
+	FlexLockCounter[0] = (int) NumFixedFlexLocks;
+	FlexLockCounter[1] = numLocks;
+}
+
+/*
+ * FlexLockAssign - assign a dynamically-allocated FlexLock number
+ *
+ * We interlock this using the same spinlock that is used to protect
+ * ShmemAlloc().  Interlocking is not really necessary during postmaster
+ * startup, but it is needed if any user-defined code tries to allocate
+ * LWLocks after startup.
+ */
+FlexLockId
+FlexLockAssign(char locktype)
+{
+	FlexLockId	result;
+
+	/* use volatile pointer to prevent code rearrangement */
+	volatile int *FlexLockCounter;
+
+	FlexLockCounter = (int *) ((char *) FlexLockArray - 2 * sizeof(int));
+	SpinLockAcquire(ShmemLock);
+	if (FlexLockCounter[0] >= FlexLockCounter[1])
+	{
+		SpinLockRelease(ShmemLock);
+		elog(ERROR, "no more FlexLockIds available");
+	}
+	result = (FlexLockId) (FlexLockCounter[0]++);
+	SpinLockRelease(ShmemLock);
+
+	FlexLockInit(&FlexLockArray[result].flex, locktype);
+
+	return result;
+}
+
+/*
+ * Initialize a FlexLock.
+ */
+static void
+FlexLockInit(FlexLock *flex, char locktype)
+{
+	SpinLockInit(&flex->mutex);
+	flex->releaseOK = true;
+	flex->locktype = locktype;
+	/*
+	 * We might need to think a little harder about what should happen here
+	 * if some future type of FlexLock requires more initialization than this.
+	 * For now, this will suffice.
+	 */
+}
+
+/*
+ * Add lock to the list of locks held by this backend, so that it can be
+ * released during error recovery.
+ */
+void
+FlexLockRemember(FlexLockId id)
+{
+	if (num_held_flexlocks >= MAX_SIMUL_FLEXLOCKS)
+		elog(PANIC, "too many FlexLocks taken");
+	held_flexlocks[num_held_flexlocks++] = id;
+}
+
+/*
+ * Remove lock from list of locks held.  Usually, but not always, it will
+ * be the latest-acquired lock; so search array backwards.
+ */
+void
+FlexLockForget(FlexLockId id)
+{
+	int			i;
+
+	for (i = num_held_flexlocks; --i >= 0;)
+	{
+		if (id == held_flexlocks[i])
+			break;
+	}
+	if (i < 0)
+		elog(ERROR, "lock %d is not held", (int) id);
+	num_held_flexlocks--;
+	for (; i < num_held_flexlocks; i++)
+		held_flexlocks[i] = held_flexlocks[i + 1];
+}
+
+/*
+ * FlexLockWait - wait until awakened
+ *
+ * Since we share the process wait semaphore with the regular lock manager
+ * and ProcWaitForSignal, and we may need to acquire a FlexLock while one of
+ * those is pending, it is possible that we get awakened for a reason other
+ * than being signaled by a FlexLock release.  If so, loop back and wait again.
+ *
+ * Returns the number of "extra" waits absorbed so that, once we've gotten the
+ * FlexLock, we can re-increment the sema by the number of additional signals
+ * received, so that the lock manager or signal manager will see the received
+ * signal when it next waits.
+ */
+int
+FlexLockWait(FlexLockId id, int mode)
+{
+	int		extraWaits = 0;
+
+	FlexLockDebug("LWLockAcquire", id, "waiting");
+	TRACE_POSTGRESQL_FLEXLOCK_WAIT_START(id, mode);
+
+	for (;;)
+	{
+		/* "false" means cannot accept cancel/die interrupt here. */
+		PGSemaphoreLock(&MyProc->sem, false);
+		/*
+		 * FLEXTODO: I think we should return this, instead of ignoring it.
+		 * Any non-zero value means "wake up".
+		 */
+		if (MyProc->flWaitResult)
+			break;
+		extraWaits++;
+	}
+
+	TRACE_POSTGRESQL_FLEXLOCK_WAIT_DONE(id, mode);
+	FlexLockDebug("LWLockAcquire", id, "awakened");
+
+	return extraWaits;
+}
+
+/*
+ * FlexLockReleaseAll - release all currently-held locks
+ *
+ * Used to clean up after ereport(ERROR). An important difference between this
+ * function and retail LWLockRelease calls is that InterruptHoldoffCount is
+ * unchanged by this operation.  This is necessary since InterruptHoldoffCount
+ * has been set to an appropriate level earlier in error recovery. We could
+ * decrement it below zero if we allow it to drop for each released lock!
+ */
+void
+FlexLockReleaseAll(void)
+{
+	while (num_held_flexlocks > 0)
+	{
+		FlexLockId	id;
+		FlexLock   *flex;
+
+		HOLD_INTERRUPTS();		/* match the upcoming RESUME_INTERRUPTS */
+
+		id = held_flexlocks[num_held_flexlocks - 1];
+		flex = &FlexLockArray[id].flex;
+		if (flex->locktype == FLEXLOCK_TYPE_LWLOCK)
+			LWLockRelease(id);
+		else
+		{
+			Assert(id == ProcArrayLock);
+			ProcArrayLockRelease();
+		}
+	}
+}
+
+/*
+ * FlexLockHeldByMe - test whether my process currently holds a lock
+ *
+ * This is meant as debug support only.  We do not consider the lock mode.
+ */
+bool
+FlexLockHeldByMe(FlexLockId id)
+{
+	int			i;
+
+	for (i = 0; i < num_held_flexlocks; i++)
+	{
+		if (held_flexlocks[i] == id)
+			return true;
+	}
+	return false;
+}
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 905502f..edaff09 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -591,7 +591,7 @@ LockAcquireExtended(const LOCKTAG *locktag,
 	bool		found;
 	ResourceOwner owner;
 	uint32		hashcode;
-	LWLockId	partitionLock;
+	FlexLockId	partitionLock;
 	int			status;
 	bool		log_lock = false;
 
@@ -1546,7 +1546,7 @@ LockRelease(const LOCKTAG *locktag, LOCKMODE lockmode, bool sessionLock)
 	LOCALLOCK  *locallock;
 	LOCK	   *lock;
 	PROCLOCK   *proclock;
-	LWLockId	partitionLock;
+	FlexLockId	partitionLock;
 	bool		wakeupNeeded;
 
 	if (lockmethodid <= 0 || lockmethodid >= lengthof(LockMethods))
@@ -1912,7 +1912,7 @@ LockReleaseAll(LOCKMETHODID lockmethodid, bool allLocks)
 	 */
 	for (partition = 0; partition < NUM_LOCK_PARTITIONS; partition++)
 	{
-		LWLockId	partitionLock = FirstLockMgrLock + partition;
+		FlexLockId	partitionLock = FirstLockMgrLock + partition;
 		SHM_QUEUE  *procLocks = &(MyProc->myProcLocks[partition]);
 
 		proclock = (PROCLOCK *) SHMQueueNext(procLocks, procLocks,
@@ -2197,7 +2197,7 @@ static bool
 FastPathTransferRelationLocks(LockMethod lockMethodTable, const LOCKTAG *locktag,
 					  uint32 hashcode)
 {
-	LWLockId		partitionLock = LockHashPartitionLock(hashcode);
+	FlexLockId		partitionLock = LockHashPartitionLock(hashcode);
 	Oid				relid = locktag->locktag_field2;
 	uint32			i;
 
@@ -2281,7 +2281,7 @@ FastPathGetRelationLockEntry(LOCALLOCK *locallock)
 	LockMethod		lockMethodTable = LockMethods[DEFAULT_LOCKMETHOD];
 	LOCKTAG		   *locktag = &locallock->tag.lock;
 	PROCLOCK	   *proclock = NULL;
-	LWLockId		partitionLock = LockHashPartitionLock(locallock->hashcode);
+	FlexLockId		partitionLock = LockHashPartitionLock(locallock->hashcode);
 	Oid				relid = locktag->locktag_field2;
 	uint32			f;
 
@@ -2382,7 +2382,7 @@ GetLockConflicts(const LOCKTAG *locktag, LOCKMODE lockmode)
 	SHM_QUEUE  *procLocks;
 	PROCLOCK   *proclock;
 	uint32		hashcode;
-	LWLockId	partitionLock;
+	FlexLockId	partitionLock;
 	int			count = 0;
 	int			fast_count = 0;
 
@@ -2593,7 +2593,7 @@ LockRefindAndRelease(LockMethod lockMethodTable, PGPROC *proc,
 	PROCLOCKTAG proclocktag;
 	uint32		hashcode;
 	uint32		proclock_hashcode;
-	LWLockId	partitionLock;
+	FlexLockId	partitionLock;
 	bool		wakeupNeeded;
 
 	hashcode = LockTagHashCode(locktag);
@@ -2827,7 +2827,7 @@ PostPrepare_Locks(TransactionId xid)
 	 */
 	for (partition = 0; partition < NUM_LOCK_PARTITIONS; partition++)
 	{
-		LWLockId	partitionLock = FirstLockMgrLock + partition;
+		FlexLockId	partitionLock = FirstLockMgrLock + partition;
 		SHM_QUEUE  *procLocks = &(MyProc->myProcLocks[partition]);
 
 		proclock = (PROCLOCK *) SHMQueueNext(procLocks, procLocks,
@@ -3188,9 +3188,10 @@ GetRunningTransactionLocks(int *nlocks)
 			proclock->tag.myLock->tag.locktag_type == LOCKTAG_RELATION)
 		{
 			PGPROC	   *proc = proclock->tag.myProc;
+			PGPROC_MINIMAL *proc_minimal = &ProcGlobal->allProcs_Minimal[proc->pgprocno];
 			LOCK	   *lock = proclock->tag.myLock;
 
-			accessExclusiveLocks[index].xid = proc->xid;
+			accessExclusiveLocks[index].xid = proc_minimal->xid;
 			accessExclusiveLocks[index].dbOid = lock->tag.locktag_field1;
 			accessExclusiveLocks[index].relOid = lock->tag.locktag_field2;
 
@@ -3342,7 +3343,7 @@ lock_twophase_recover(TransactionId xid, uint16 info,
 	uint32		hashcode;
 	uint32		proclock_hashcode;
 	int			partition;
-	LWLockId	partitionLock;
+	FlexLockId	partitionLock;
 	LockMethod	lockMethodTable;
 
 	Assert(len == sizeof(TwoPhaseLockRecord));
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 079eb29..ce6c931 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -21,74 +21,23 @@
  */
 #include "postgres.h"
 
-#include "access/clog.h"
-#include "access/multixact.h"
-#include "access/subtrans.h"
-#include "commands/async.h"
 #include "miscadmin.h"
 #include "pg_trace.h"
+#include "storage/flexlock_internals.h"
 #include "storage/ipc.h"
-#include "storage/predicate.h"
 #include "storage/proc.h"
 #include "storage/spin.h"
 
-
-/* We use the ShmemLock spinlock to protect LWLockAssign */
-extern slock_t *ShmemLock;
-
-
 typedef struct LWLock
 {
-	slock_t		mutex;			/* Protects LWLock and queue of PGPROCs */
-	bool		releaseOK;		/* T if ok to release waiters */
+	FlexLock	flex;			/* common FlexLock infrastructure */
 	char		exclusive;		/* # of exclusive holders (0 or 1) */
 	int			shared;			/* # of shared holders (0..MaxBackends) */
-	PGPROC	   *head;			/* head of list of waiting PGPROCs */
-	PGPROC	   *tail;			/* tail of list of waiting PGPROCs */
-	/* tail is undefined when head is NULL */
 } LWLock;
 
-/*
- * All the LWLock structs are allocated as an array in shared memory.
- * (LWLockIds are indexes into the array.)	We force the array stride to
- * be a power of 2, which saves a few cycles in indexing, but more
- * importantly also ensures that individual LWLocks don't cross cache line
- * boundaries.	This reduces cache contention problems, especially on AMD
- * Opterons.  (Of course, we have to also ensure that the array start
- * address is suitably aligned.)
- *
- * LWLock is between 16 and 32 bytes on all known platforms, so these two
- * cases are sufficient.
- */
-#define LWLOCK_PADDED_SIZE	(sizeof(LWLock) <= 16 ? 16 : 32)
-
-typedef union LWLockPadded
-{
-	LWLock		lock;
-	char		pad[LWLOCK_PADDED_SIZE];
-} LWLockPadded;
-
-/*
- * This points to the array of LWLocks in shared memory.  Backends inherit
- * the pointer by fork from the postmaster (except in the EXEC_BACKEND case,
- * where we have special measures to pass it down).
- */
-NON_EXEC_STATIC LWLockPadded *LWLockArray = NULL;
-
-
-/*
- * We use this structure to keep track of locked LWLocks for release
- * during error recovery.  The maximum size could be determined at runtime
- * if necessary, but it seems unlikely that more than a few locks could
- * ever be held simultaneously.
- */
-#define MAX_SIMUL_LWLOCKS	100
-
-static int	num_held_lwlocks = 0;
-static LWLockId held_lwlocks[MAX_SIMUL_LWLOCKS];
-
-static int	lock_addin_request = 0;
-static bool lock_addin_request_allowed = true;
+#define	LWLockPointer(lockid) \
+	(AssertMacro(FlexLockArray[lockid].flex.locktype == FLEXLOCK_TYPE_LWLOCK), \
+	 (volatile LWLock *) &FlexLockArray[lockid])
 
 #ifdef LWLOCK_STATS
 static int	counts_for_pid = 0;
@@ -98,27 +47,17 @@ static int *block_counts;
 #endif
 
 #ifdef LOCK_DEBUG
-bool		Trace_lwlocks = false;
-
 inline static void
-PRINT_LWDEBUG(const char *where, LWLockId lockid, const volatile LWLock *lock)
+PRINT_LWDEBUG(const char *where, FlexLockId lockid, const volatile LWLock *lock)
 {
-	if (Trace_lwlocks)
+	if (Trace_flexlocks)
 		elog(LOG, "%s(%d): excl %d shared %d head %p rOK %d",
 			 where, (int) lockid,
-			 (int) lock->exclusive, lock->shared, lock->head,
-			 (int) lock->releaseOK);
-}
-
-inline static void
-LOG_LWDEBUG(const char *where, LWLockId lockid, const char *msg)
-{
-	if (Trace_lwlocks)
-		elog(LOG, "%s(%d): %s", where, (int) lockid, msg);
+			 (int) lock->exclusive, lock->shared, lock->flex.head,
+			 (int) lock->flex.releaseOK);
 }
 #else							/* not LOCK_DEBUG */
 #define PRINT_LWDEBUG(a,b,c)
-#define LOG_LWDEBUG(a,b,c)
 #endif   /* LOCK_DEBUG */
 
 #ifdef LWLOCK_STATS
@@ -127,8 +66,8 @@ static void
 print_lwlock_stats(int code, Datum arg)
 {
 	int			i;
-	int		   *LWLockCounter = (int *) ((char *) LWLockArray - 2 * sizeof(int));
-	int			numLocks = LWLockCounter[1];
+	int		   *FlexLockCounter = (int *) ((char *) FlexLockArray - 2 * sizeof(int));
+	int			numLocks = FlexLockCounter[1];
 
 	/* Grab an LWLock to keep different backends from mixing reports */
 	LWLockAcquire(0, LW_EXCLUSIVE);
@@ -145,173 +84,15 @@ print_lwlock_stats(int code, Datum arg)
 }
 #endif   /* LWLOCK_STATS */
 
-
 /*
- * Compute number of LWLocks to allocate.
+ * LWLockAssign - initialize a new lwlock and return its ID
  */
-int
-NumLWLocks(void)
-{
-	int			numLocks;
-
-	/*
-	 * Possibly this logic should be spread out among the affected modules,
-	 * the same way that shmem space estimation is done.  But for now, there
-	 * are few enough users of LWLocks that we can get away with just keeping
-	 * the knowledge here.
-	 */
-
-	/* Predefined LWLocks */
-	numLocks = (int) NumFixedLWLocks;
-
-	/* bufmgr.c needs two for each shared buffer */
-	numLocks += 2 * NBuffers;
-
-	/* proc.c needs one for each backend or auxiliary process */
-	numLocks += MaxBackends + NUM_AUXILIARY_PROCS;
-
-	/* clog.c needs one per CLOG buffer */
-	numLocks += NUM_CLOG_BUFFERS;
-
-	/* subtrans.c needs one per SubTrans buffer */
-	numLocks += NUM_SUBTRANS_BUFFERS;
-
-	/* multixact.c needs two SLRU areas */
-	numLocks += NUM_MXACTOFFSET_BUFFERS + NUM_MXACTMEMBER_BUFFERS;
-
-	/* async.c needs one per Async buffer */
-	numLocks += NUM_ASYNC_BUFFERS;
-
-	/* predicate.c needs one per old serializable xid buffer */
-	numLocks += NUM_OLDSERXID_BUFFERS;
-
-	/*
-	 * Add any requested by loadable modules; for backwards-compatibility
-	 * reasons, allocate at least NUM_USER_DEFINED_LWLOCKS of them even if
-	 * there are no explicit requests.
-	 */
-	lock_addin_request_allowed = false;
-	numLocks += Max(lock_addin_request, NUM_USER_DEFINED_LWLOCKS);
-
-	return numLocks;
-}
-
-
-/*
- * RequestAddinLWLocks
- *		Request that extra LWLocks be allocated for use by
- *		a loadable module.
- *
- * This is only useful if called from the _PG_init hook of a library that
- * is loaded into the postmaster via shared_preload_libraries.	Once
- * shared memory has been allocated, calls will be ignored.  (We could
- * raise an error, but it seems better to make it a no-op, so that
- * libraries containing such calls can be reloaded if needed.)
- */
-void
-RequestAddinLWLocks(int n)
-{
-	if (IsUnderPostmaster || !lock_addin_request_allowed)
-		return;					/* too late */
-	lock_addin_request += n;
-}
-
-
-/*
- * Compute shmem space needed for LWLocks.
- */
-Size
-LWLockShmemSize(void)
-{
-	Size		size;
-	int			numLocks = NumLWLocks();
-
-	/* Space for the LWLock array. */
-	size = mul_size(numLocks, sizeof(LWLockPadded));
-
-	/* Space for dynamic allocation counter, plus room for alignment. */
-	size = add_size(size, 2 * sizeof(int) + LWLOCK_PADDED_SIZE);
-
-	return size;
-}
-
-
-/*
- * Allocate shmem space for LWLocks and initialize the locks.
- */
-void
-CreateLWLocks(void)
-{
-	int			numLocks = NumLWLocks();
-	Size		spaceLocks = LWLockShmemSize();
-	LWLockPadded *lock;
-	int		   *LWLockCounter;
-	char	   *ptr;
-	int			id;
-
-	/* Allocate space */
-	ptr = (char *) ShmemAlloc(spaceLocks);
-
-	/* Leave room for dynamic allocation counter */
-	ptr += 2 * sizeof(int);
-
-	/* Ensure desired alignment of LWLock array */
-	ptr += LWLOCK_PADDED_SIZE - ((uintptr_t) ptr) % LWLOCK_PADDED_SIZE;
-
-	LWLockArray = (LWLockPadded *) ptr;
-
-	/*
-	 * Initialize all LWLocks to "unlocked" state
-	 */
-	for (id = 0, lock = LWLockArray; id < numLocks; id++, lock++)
-	{
-		SpinLockInit(&lock->lock.mutex);
-		lock->lock.releaseOK = true;
-		lock->lock.exclusive = 0;
-		lock->lock.shared = 0;
-		lock->lock.head = NULL;
-		lock->lock.tail = NULL;
-	}
-
-	/*
-	 * Initialize the dynamic-allocation counter, which is stored just before
-	 * the first LWLock.
-	 */
-	LWLockCounter = (int *) ((char *) LWLockArray - 2 * sizeof(int));
-	LWLockCounter[0] = (int) NumFixedLWLocks;
-	LWLockCounter[1] = numLocks;
-}
-
-
-/*
- * LWLockAssign - assign a dynamically-allocated LWLock number
- *
- * We interlock this using the same spinlock that is used to protect
- * ShmemAlloc().  Interlocking is not really necessary during postmaster
- * startup, but it is needed if any user-defined code tries to allocate
- * LWLocks after startup.
- */
-LWLockId
+FlexLockId
 LWLockAssign(void)
 {
-	LWLockId	result;
-
-	/* use volatile pointer to prevent code rearrangement */
-	volatile int *LWLockCounter;
-
-	LWLockCounter = (int *) ((char *) LWLockArray - 2 * sizeof(int));
-	SpinLockAcquire(ShmemLock);
-	if (LWLockCounter[0] >= LWLockCounter[1])
-	{
-		SpinLockRelease(ShmemLock);
-		elog(ERROR, "no more LWLockIds available");
-	}
-	result = (LWLockId) (LWLockCounter[0]++);
-	SpinLockRelease(ShmemLock);
-	return result;
+	return FlexLockAssign(FLEXLOCK_TYPE_LWLOCK);
 }
 
-
 /*
  * LWLockAcquire - acquire a lightweight lock in the specified mode
  *
@@ -320,9 +101,9 @@ LWLockAssign(void)
  * Side effect: cancel/die interrupts are held off until lock release.
  */
 void
-LWLockAcquire(LWLockId lockid, LWLockMode mode)
+LWLockAcquire(FlexLockId lockid, LWLockMode mode)
 {
-	volatile LWLock *lock = &(LWLockArray[lockid].lock);
+	volatile LWLock *lock = LWLockPointer(lockid);
 	PGPROC	   *proc = MyProc;
 	bool		retry = false;
 	int			extraWaits = 0;
@@ -333,8 +114,8 @@ LWLockAcquire(LWLockId lockid, LWLockMode mode)
 	/* Set up local count state first time through in a given process */
 	if (counts_for_pid != MyProcPid)
 	{
-		int		   *LWLockCounter = (int *) ((char *) LWLockArray - 2 * sizeof(int));
-		int			numLocks = LWLockCounter[1];
+		int		   *FlexLockCounter = (int *) ((char *) FlexLockArray - 2 * sizeof(int));
+		int			numLocks = FlexLockCounter[1];
 
 		sh_acquire_counts = calloc(numLocks, sizeof(int));
 		ex_acquire_counts = calloc(numLocks, sizeof(int));
@@ -356,10 +137,6 @@ LWLockAcquire(LWLockId lockid, LWLockMode mode)
 	 */
 	Assert(!(proc == NULL && IsUnderPostmaster));
 
-	/* Ensure we will have room to remember the lock */
-	if (num_held_lwlocks >= MAX_SIMUL_LWLOCKS)
-		elog(ERROR, "too many LWLocks taken");
-
 	/*
 	 * Lock out cancel/die interrupts until we exit the code section protected
 	 * by the LWLock.  This ensures that interrupts will not interfere with
@@ -388,11 +165,11 @@ LWLockAcquire(LWLockId lockid, LWLockMode mode)
 		bool		mustwait;
 
 		/* Acquire mutex.  Time spent holding mutex should be short! */
-		SpinLockAcquire(&lock->mutex);
+		SpinLockAcquire(&lock->flex.mutex);
 
 		/* If retrying, allow LWLockRelease to release waiters again */
 		if (retry)
-			lock->releaseOK = true;
+			lock->flex.releaseOK = true;
 
 		/* If I can get the lock, do so quickly. */
 		if (mode == LW_EXCLUSIVE)
@@ -419,72 +196,30 @@ LWLockAcquire(LWLockId lockid, LWLockMode mode)
 		if (!mustwait)
 			break;				/* got the lock */
 
-		/*
-		 * Add myself to wait queue.
-		 *
-		 * If we don't have a PGPROC structure, there's no way to wait. This
-		 * should never occur, since MyProc should only be null during shared
-		 * memory initialization.
-		 */
-		if (proc == NULL)
-			elog(PANIC, "cannot wait without a PGPROC structure");
-
-		proc->lwWaiting = true;
-		proc->lwExclusive = (mode == LW_EXCLUSIVE);
-		proc->lwWaitLink = NULL;
-		if (lock->head == NULL)
-			lock->head = proc;
-		else
-			lock->tail->lwWaitLink = proc;
-		lock->tail = proc;
+		/* Add myself to wait queue. */
+		FlexLockJoinWaitQueue(lock, (int) mode);
 
 		/* Can release the mutex now */
-		SpinLockRelease(&lock->mutex);
-
-		/*
-		 * Wait until awakened.
-		 *
-		 * Since we share the process wait semaphore with the regular lock
-		 * manager and ProcWaitForSignal, and we may need to acquire an LWLock
-		 * while one of those is pending, it is possible that we get awakened
-		 * for a reason other than being signaled by LWLockRelease. If so,
-		 * loop back and wait again.  Once we've gotten the LWLock,
-		 * re-increment the sema by the number of additional signals received,
-		 * so that the lock manager or signal manager will see the received
-		 * signal when it next waits.
-		 */
-		LOG_LWDEBUG("LWLockAcquire", lockid, "waiting");
+		SpinLockRelease(&lock->flex.mutex);
+
+		/* Wait until awakened. */
+		extraWaits += FlexLockWait(lockid, mode);
 
 #ifdef LWLOCK_STATS
 		block_counts[lockid]++;
 #endif
 
-		TRACE_POSTGRESQL_LWLOCK_WAIT_START(lockid, mode);
-
-		for (;;)
-		{
-			/* "false" means cannot accept cancel/die interrupt here. */
-			PGSemaphoreLock(&proc->sem, false);
-			if (!proc->lwWaiting)
-				break;
-			extraWaits++;
-		}
-
-		TRACE_POSTGRESQL_LWLOCK_WAIT_DONE(lockid, mode);
-
-		LOG_LWDEBUG("LWLockAcquire", lockid, "awakened");
-
 		/* Now loop back and try to acquire lock again. */
 		retry = true;
 	}
 
 	/* We are done updating shared state of the lock itself. */
-	SpinLockRelease(&lock->mutex);
+	SpinLockRelease(&lock->flex.mutex);
 
-	TRACE_POSTGRESQL_LWLOCK_ACQUIRE(lockid, mode);
+	TRACE_POSTGRESQL_FLEXLOCK_ACQUIRE(lockid, mode);
 
 	/* Add lock to list of locks held by this backend */
-	held_lwlocks[num_held_lwlocks++] = lockid;
+	FlexLockRemember(lockid);
 
 	/*
 	 * Fix the process wait semaphore's count for any absorbed wakeups.
@@ -501,17 +236,13 @@ LWLockAcquire(LWLockId lockid, LWLockMode mode)
  * If successful, cancel/die interrupts are held off until lock release.
  */
 bool
-LWLockConditionalAcquire(LWLockId lockid, LWLockMode mode)
+LWLockConditionalAcquire(FlexLockId lockid, LWLockMode mode)
 {
-	volatile LWLock *lock = &(LWLockArray[lockid].lock);
+	volatile LWLock *lock = LWLockPointer(lockid);
 	bool		mustwait;
 
 	PRINT_LWDEBUG("LWLockConditionalAcquire", lockid, lock);
 
-	/* Ensure we will have room to remember the lock */
-	if (num_held_lwlocks >= MAX_SIMUL_LWLOCKS)
-		elog(ERROR, "too many LWLocks taken");
-
 	/*
 	 * Lock out cancel/die interrupts until we exit the code section protected
 	 * by the LWLock.  This ensures that interrupts will not interfere with
@@ -520,7 +251,7 @@ LWLockConditionalAcquire(LWLockId lockid, LWLockMode mode)
 	HOLD_INTERRUPTS();
 
 	/* Acquire mutex.  Time spent holding mutex should be short! */
-	SpinLockAcquire(&lock->mutex);
+	SpinLockAcquire(&lock->flex.mutex);
 
 	/* If I can get the lock, do so quickly. */
 	if (mode == LW_EXCLUSIVE)
@@ -545,20 +276,20 @@ LWLockConditionalAcquire(LWLockId lockid, LWLockMode mode)
 	}
 
 	/* We are done updating shared state of the lock itself. */
-	SpinLockRelease(&lock->mutex);
+	SpinLockRelease(&lock->flex.mutex);
 
 	if (mustwait)
 	{
 		/* Failed to get lock, so release interrupt holdoff */
 		RESUME_INTERRUPTS();
-		LOG_LWDEBUG("LWLockConditionalAcquire", lockid, "failed");
-		TRACE_POSTGRESQL_LWLOCK_CONDACQUIRE_FAIL(lockid, mode);
+		FlexLockDebug("LWLockConditionalAcquire", lockid, "failed");
+		TRACE_POSTGRESQL_FLEXLOCK_CONDACQUIRE_FAIL(lockid, mode);
 	}
 	else
 	{
 		/* Add lock to list of locks held by this backend */
-		held_lwlocks[num_held_lwlocks++] = lockid;
-		TRACE_POSTGRESQL_LWLOCK_CONDACQUIRE(lockid, mode);
+		FlexLockRemember(lockid);
+		TRACE_POSTGRESQL_FLEXLOCK_CONDACQUIRE(lockid, mode);
 	}
 
 	return !mustwait;
@@ -568,32 +299,18 @@ LWLockConditionalAcquire(LWLockId lockid, LWLockMode mode)
  * LWLockRelease - release a previously acquired lock
  */
 void
-LWLockRelease(LWLockId lockid)
+LWLockRelease(FlexLockId lockid)
 {
-	volatile LWLock *lock = &(LWLockArray[lockid].lock);
+	volatile LWLock *lock = LWLockPointer(lockid);
 	PGPROC	   *head;
 	PGPROC	   *proc;
-	int			i;
 
 	PRINT_LWDEBUG("LWLockRelease", lockid, lock);
 
-	/*
-	 * Remove lock from list of locks held.  Usually, but not always, it will
-	 * be the latest-acquired lock; so search array backwards.
-	 */
-	for (i = num_held_lwlocks; --i >= 0;)
-	{
-		if (lockid == held_lwlocks[i])
-			break;
-	}
-	if (i < 0)
-		elog(ERROR, "lock %d is not held", (int) lockid);
-	num_held_lwlocks--;
-	for (; i < num_held_lwlocks; i++)
-		held_lwlocks[i] = held_lwlocks[i + 1];
+	FlexLockForget(lockid);
 
 	/* Acquire mutex.  Time spent holding mutex should be short! */
-	SpinLockAcquire(&lock->mutex);
+	SpinLockAcquire(&lock->flex.mutex);
 
 	/* Release my hold on lock */
 	if (lock->exclusive > 0)
@@ -610,10 +327,10 @@ LWLockRelease(LWLockId lockid)
 	 * if someone has already awakened waiters that haven't yet acquired the
 	 * lock.
 	 */
-	head = lock->head;
+	head = lock->flex.head;
 	if (head != NULL)
 	{
-		if (lock->exclusive == 0 && lock->shared == 0 && lock->releaseOK)
+		if (lock->exclusive == 0 && lock->shared == 0 && lock->flex.releaseOK)
 		{
 			/*
 			 * Remove the to-be-awakened PGPROCs from the queue.  If the front
@@ -621,17 +338,17 @@ LWLockRelease(LWLockId lockid)
 			 * as many waiters as want shared access.
 			 */
 			proc = head;
-			if (!proc->lwExclusive)
+			if (proc->flWaitMode != LW_EXCLUSIVE)
 			{
-				while (proc->lwWaitLink != NULL &&
-					   !proc->lwWaitLink->lwExclusive)
-					proc = proc->lwWaitLink;
+				while (proc->flWaitLink != NULL &&
+					   proc->flWaitLink->flWaitMode != LW_EXCLUSIVE)
+					proc = proc->flWaitLink;
 			}
 			/* proc is now the last PGPROC to be released */
-			lock->head = proc->lwWaitLink;
-			proc->lwWaitLink = NULL;
+			lock->flex.head = proc->flWaitLink;
+			proc->flWaitLink = NULL;
 			/* prevent additional wakeups until retryer gets to run */
-			lock->releaseOK = false;
+			lock->flex.releaseOK = false;
 		}
 		else
 		{
@@ -641,20 +358,20 @@ LWLockRelease(LWLockId lockid)
 	}
 
 	/* We are done updating shared state of the lock itself. */
-	SpinLockRelease(&lock->mutex);
+	SpinLockRelease(&lock->flex.mutex);
 
-	TRACE_POSTGRESQL_LWLOCK_RELEASE(lockid);
+	TRACE_POSTGRESQL_FLEXLOCK_RELEASE(lockid);
 
 	/*
 	 * Awaken any waiters I removed from the queue.
 	 */
 	while (head != NULL)
 	{
-		LOG_LWDEBUG("LWLockRelease", lockid, "release waiter");
+		FlexLockDebug("LWLockRelease", lockid, "release waiter");
 		proc = head;
-		head = proc->lwWaitLink;
-		proc->lwWaitLink = NULL;
-		proc->lwWaiting = false;
+		head = proc->flWaitLink;
+		proc->flWaitLink = NULL;
+		proc->flWaitResult = 1;		/* any non-zero value will do */
 		PGSemaphoreUnlock(&proc->sem);
 	}
 
@@ -664,43 +381,17 @@ LWLockRelease(LWLockId lockid)
 	RESUME_INTERRUPTS();
 }
 
-
-/*
- * LWLockReleaseAll - release all currently-held locks
- *
- * Used to clean up after ereport(ERROR). An important difference between this
- * function and retail LWLockRelease calls is that InterruptHoldoffCount is
- * unchanged by this operation.  This is necessary since InterruptHoldoffCount
- * has been set to an appropriate level earlier in error recovery. We could
- * decrement it below zero if we allow it to drop for each released lock!
- */
-void
-LWLockReleaseAll(void)
-{
-	while (num_held_lwlocks > 0)
-	{
-		HOLD_INTERRUPTS();		/* match the upcoming RESUME_INTERRUPTS */
-
-		LWLockRelease(held_lwlocks[num_held_lwlocks - 1]);
-	}
-}
-
-
 /*
  * LWLockHeldByMe - test whether my process currently holds a lock
  *
- * This is meant as debug support only.  We do not distinguish whether the
- * lock is held shared or exclusive.
+ * The following convenience routine might not be worthwhile but for the fact
+ * that we've had a function by this name since long before FlexLocks existed.
+ * Callers who want to check whether an arbitrary FlexLock (that may or may not
+ * be an LWLock) is held can use FlexLockHeldByMe directly.
  */
 bool
-LWLockHeldByMe(LWLockId lockid)
+LWLockHeldByMe(FlexLockId lockid)
 {
-	int			i;
-
-	for (i = 0; i < num_held_lwlocks; i++)
-	{
-		if (held_lwlocks[i] == lockid)
-			return true;
-	}
-	return false;
+	AssertMacro(FlexLockArray[lockid].flex.locktype == FLEXLOCK_TYPE_LWLOCK);
+	return FlexLockHeldByMe(lockid);
 }
diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index 345f6f5..15978a4 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -239,7 +239,7 @@
 #define PredicateLockHashPartition(hashcode) \
 	((hashcode) % NUM_PREDICATELOCK_PARTITIONS)
 #define PredicateLockHashPartitionLock(hashcode) \
-	((LWLockId) (FirstPredicateLockMgrLock + PredicateLockHashPartition(hashcode)))
+	((FlexLockId) (FirstPredicateLockMgrLock + PredicateLockHashPartition(hashcode)))
 
 #define NPREDICATELOCKTARGETENTS() \
 	mul_size(max_predicate_locks_per_xact, add_size(MaxBackends, max_prepared_xacts))
@@ -1840,7 +1840,7 @@ PageIsPredicateLocked(Relation relation, BlockNumber blkno)
 {
 	PREDICATELOCKTARGETTAG targettag;
 	uint32		targettaghash;
-	LWLockId	partitionLock;
+	FlexLockId	partitionLock;
 	PREDICATELOCKTARGET *target;
 
 	SET_PREDICATELOCKTARGETTAG_PAGE(targettag,
@@ -2073,7 +2073,7 @@ DeleteChildTargetLocks(const PREDICATELOCKTARGETTAG *newtargettag)
 		if (TargetTagIsCoveredBy(oldtargettag, *newtargettag))
 		{
 			uint32		oldtargettaghash;
-			LWLockId	partitionLock;
+			FlexLockId	partitionLock;
 			PREDICATELOCK *rmpredlock;
 
 			oldtargettaghash = PredicateLockTargetTagHashCode(&oldtargettag);
@@ -2285,7 +2285,7 @@ CreatePredicateLock(const PREDICATELOCKTARGETTAG *targettag,
 	PREDICATELOCKTARGET *target;
 	PREDICATELOCKTAG locktag;
 	PREDICATELOCK *lock;
-	LWLockId	partitionLock;
+	FlexLockId	partitionLock;
 	bool		found;
 
 	partitionLock = PredicateLockHashPartitionLock(targettaghash);
@@ -2586,10 +2586,10 @@ TransferPredicateLocksToNewTarget(PREDICATELOCKTARGETTAG oldtargettag,
 								  bool removeOld)
 {
 	uint32		oldtargettaghash;
-	LWLockId	oldpartitionLock;
+	FlexLockId	oldpartitionLock;
 	PREDICATELOCKTARGET *oldtarget;
 	uint32		newtargettaghash;
-	LWLockId	newpartitionLock;
+	FlexLockId	newpartitionLock;
 	bool		found;
 	bool		outOfShmem = false;
 
@@ -3578,7 +3578,7 @@ ClearOldPredicateLocks(void)
 			PREDICATELOCKTARGET *target;
 			PREDICATELOCKTARGETTAG targettag;
 			uint32		targettaghash;
-			LWLockId	partitionLock;
+			FlexLockId	partitionLock;
 
 			tag = predlock->tag;
 			target = tag.myTarget;
@@ -3656,7 +3656,7 @@ ReleaseOneSerializableXact(SERIALIZABLEXACT *sxact, bool partial,
 		PREDICATELOCKTARGET *target;
 		PREDICATELOCKTARGETTAG targettag;
 		uint32		targettaghash;
-		LWLockId	partitionLock;
+		FlexLockId	partitionLock;
 
 		nextpredlock = (PREDICATELOCK *)
 			SHMQueueNext(&(sxact->predicateLocks),
@@ -4034,7 +4034,7 @@ static void
 CheckTargetForConflictsIn(PREDICATELOCKTARGETTAG *targettag)
 {
 	uint32		targettaghash;
-	LWLockId	partitionLock;
+	FlexLockId	partitionLock;
 	PREDICATELOCKTARGET *target;
 	PREDICATELOCK *predlock;
 	PREDICATELOCK *mypredlock = NULL;
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index eda3a98..edb225a 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -36,6 +36,7 @@
 #include <sys/time.h>
 
 #include "access/transam.h"
+#include "access/twophase.h"
 #include "access/xact.h"
 #include "miscadmin.h"
 #include "postmaster/autovacuum.h"
@@ -45,6 +46,7 @@
 #include "storage/pmsignal.h"
 #include "storage/proc.h"
 #include "storage/procarray.h"
+#include "storage/procarraylock.h"
 #include "storage/procsignal.h"
 #include "storage/spin.h"
 #include "utils/timestamp.h"
@@ -57,6 +59,7 @@ bool		log_lock_waits = false;
 
 /* Pointer to this process's PGPROC struct, if any */
 PGPROC	   *MyProc = NULL;
+PGPROC_MINIMAL	   *MyProcMinimal = NULL;
 
 /*
  * This spinlock protects the freelist of recycled PGPROC structures.
@@ -70,6 +73,7 @@ NON_EXEC_STATIC slock_t *ProcStructLock = NULL;
 /* Pointers to shared-memory structures */
 PROC_HDR *ProcGlobal = NULL;
 NON_EXEC_STATIC PGPROC *AuxiliaryProcs = NULL;
+PGPROC *PreparedXactProcs = NULL;
 
 /* If we are waiting for a lock, this points to the associated LOCALLOCK */
 static LOCALLOCK *lockAwaited = NULL;
@@ -106,13 +110,19 @@ ProcGlobalShmemSize(void)
 
 	/* ProcGlobal */
 	size = add_size(size, sizeof(PROC_HDR));
-	/* AuxiliaryProcs */
-	size = add_size(size, mul_size(NUM_AUXILIARY_PROCS, sizeof(PGPROC)));
 	/* MyProcs, including autovacuum workers and launcher */
 	size = add_size(size, mul_size(MaxBackends, sizeof(PGPROC)));
+	/* AuxiliaryProcs */
+	size = add_size(size, mul_size(NUM_AUXILIARY_PROCS, sizeof(PGPROC)));
+	/* Prepared xacts */
+	size = add_size(size, mul_size(max_prepared_xacts, sizeof(PGPROC)));
 	/* ProcStructLock */
 	size = add_size(size, sizeof(slock_t));
 
+	size = add_size(size, mul_size(MaxBackends, sizeof(PGPROC_MINIMAL)));
+	size = add_size(size, mul_size(NUM_AUXILIARY_PROCS, sizeof(PGPROC_MINIMAL)));
+	size = add_size(size, mul_size(max_prepared_xacts, sizeof(PGPROC_MINIMAL)));
+
 	return size;
 }
 
@@ -157,10 +167,11 @@ void
 InitProcGlobal(void)
 {
 	PGPROC	   *procs;
+	PGPROC_MINIMAL *procs_minimal;
 	int			i,
 				j;
 	bool		found;
-	uint32		TotalProcs = MaxBackends + NUM_AUXILIARY_PROCS;
+	uint32		TotalProcs = MaxBackends + NUM_AUXILIARY_PROCS + max_prepared_xacts;
 
 	/* Create the ProcGlobal shared structure */
 	ProcGlobal = (PROC_HDR *)
@@ -195,14 +206,38 @@ InitProcGlobal(void)
 				(errcode(ERRCODE_OUT_OF_MEMORY),
 				 errmsg("out of shared memory")));
 	MemSet(procs, 0, TotalProcs * sizeof(PGPROC));
+
+	/*
+	 * Also allocate a separate array of PGPROC_MINIMAL structures.  We keep
+	 * this out of band of the main PGPROC array to ensure that the most
+	 * heavily accessed members of the PGPROC structure are stored
+	 * contiguously in memory.  This provides a significant performance
+	 * benefit, especially on multiprocessor systems, by improving the cache
+	 * hit ratio.
+	 *
+	 * Note: We separate the members needed by GetSnapshotData since that's the
+	 * most frequently accessed code path.  There is one PGPROC_MINIMAL
+	 * structure for every PGPROC structure.
+	 */
+	procs_minimal = (PGPROC_MINIMAL *) ShmemAlloc(TotalProcs * sizeof(PGPROC_MINIMAL));
+	MemSet(procs_minimal, 0, TotalProcs * sizeof(PGPROC_MINIMAL));
+	ProcGlobal->allProcs_Minimal = procs_minimal;
+
 	for (i = 0; i < TotalProcs; i++)
 	{
 		/* Common initialization for all PGPROCs, regardless of type. */
 
-		/* Set up per-PGPROC semaphore, latch, and backendLock */
-		PGSemaphoreCreate(&(procs[i].sem));
-		InitSharedLatch(&(procs[i].procLatch));
-		procs[i].backendLock = LWLockAssign();
+		/*
+		 * Set up per-PGPROC semaphore, latch, and backendLock.  Dummy
+		 * PGPROCs for prepared xacts don't need these, since they are
+		 * never associated with a real process.
+		 */
+		if (i < MaxBackends + NUM_AUXILIARY_PROCS)
+		{
+			PGSemaphoreCreate(&(procs[i].sem));
+			InitSharedLatch(&(procs[i].procLatch));
+			procs[i].backendLock = LWLockAssign();
+		}
+		procs[i].pgprocno = i;
 
 		/*
 		 * Newly created PGPROCs for normal backends or for autovacuum must
@@ -234,6 +269,7 @@ InitProcGlobal(void)
 	 * auxiliary proceses.
 	 */
 	AuxiliaryProcs = &procs[MaxBackends];
+	PreparedXactProcs = &procs[MaxBackends + NUM_AUXILIARY_PROCS];
 
 	/* Create ProcStructLock spinlock, too */
 	ProcStructLock = (slock_t *) ShmemAlloc(sizeof(slock_t));
@@ -296,6 +332,7 @@ InitProcess(void)
 				(errcode(ERRCODE_TOO_MANY_CONNECTIONS),
 				 errmsg("sorry, too many clients already")));
 	}
+	MyProcMinimal = &ProcGlobal->allProcs_Minimal[MyProc->pgprocno];
 
 	/*
 	 * Now that we have a PGPROC, mark ourselves as an active postmaster
@@ -313,21 +350,21 @@ InitProcess(void)
 	SHMQueueElemInit(&(MyProc->links));
 	MyProc->waitStatus = STATUS_OK;
 	MyProc->lxid = InvalidLocalTransactionId;
-	MyProc->xid = InvalidTransactionId;
-	MyProc->xmin = InvalidTransactionId;
+	MyProcMinimal->xid = InvalidTransactionId;
+	MyProcMinimal->xmin = InvalidTransactionId;
 	MyProc->pid = MyProcPid;
 	/* backendId, databaseId and roleId will be filled in later */
 	MyProc->backendId = InvalidBackendId;
 	MyProc->databaseId = InvalidOid;
 	MyProc->roleId = InvalidOid;
-	MyProc->inCommit = false;
-	MyProc->vacuumFlags = 0;
+	MyProcMinimal->inCommit = false;
+	MyProcMinimal->vacuumFlags = 0;
 	/* NB -- autovac launcher intentionally does not set IS_AUTOVACUUM */
 	if (IsAutoVacuumWorkerProcess())
-		MyProc->vacuumFlags |= PROC_IS_AUTOVACUUM;
-	MyProc->lwWaiting = false;
-	MyProc->lwExclusive = false;
-	MyProc->lwWaitLink = NULL;
+		MyProcMinimal->vacuumFlags |= PROC_IS_AUTOVACUUM;
+	MyProc->flWaitResult = 0;
+	MyProc->flWaitMode = 0;
+	MyProc->flWaitLink = NULL;
 	MyProc->waitLock = NULL;
 	MyProc->waitProcLock = NULL;
 #ifdef USE_ASSERT_CHECKING
@@ -462,6 +499,7 @@ InitAuxiliaryProcess(void)
 	((volatile PGPROC *) auxproc)->pid = MyProcPid;
 
 	MyProc = auxproc;
+	MyProcMinimal = &ProcGlobal->allProcs_Minimal[auxproc->pgprocno];
 
 	SpinLockRelease(ProcStructLock);
 
@@ -472,16 +510,16 @@ InitAuxiliaryProcess(void)
 	SHMQueueElemInit(&(MyProc->links));
 	MyProc->waitStatus = STATUS_OK;
 	MyProc->lxid = InvalidLocalTransactionId;
-	MyProc->xid = InvalidTransactionId;
-	MyProc->xmin = InvalidTransactionId;
+	MyProcMinimal->xid = InvalidTransactionId;
+	MyProcMinimal->xmin = InvalidTransactionId;
 	MyProc->backendId = InvalidBackendId;
 	MyProc->databaseId = InvalidOid;
 	MyProc->roleId = InvalidOid;
-	MyProc->inCommit = false;
-	MyProc->vacuumFlags = 0;
-	MyProc->lwWaiting = false;
-	MyProc->lwExclusive = false;
-	MyProc->lwWaitLink = NULL;
+	MyProcMinimal->inCommit = false;
+	MyProcMinimal->vacuumFlags = 0;
+	MyProc->flWaitMode = 0;
+	MyProc->flWaitResult = 0;
+	MyProc->flWaitLink = NULL;
 	MyProc->waitLock = NULL;
 	MyProc->waitProcLock = NULL;
 #ifdef USE_ASSERT_CHECKING
@@ -607,7 +645,7 @@ IsWaitingForLock(void)
 void
 LockWaitCancel(void)
 {
-	LWLockId	partitionLock;
+	FlexLockId	partitionLock;
 
 	/* Nothing to do if we weren't waiting for a lock */
 	if (lockAwaited == NULL)
@@ -718,11 +756,11 @@ ProcKill(int code, Datum arg)
 #endif
 
 	/*
-	 * Release any LW locks I am holding.  There really shouldn't be any, but
-	 * it's cheap to check again before we cut the knees off the LWLock
+	 * Release any flex locks I am holding.  There really shouldn't be any, but
+	 * it's cheap to check again before we cut the knees off the flex lock
 	 * facility by releasing our PGPROC ...
 	 */
-	LWLockReleaseAll();
+	FlexLockReleaseAll();
 
 	/* Release ownership of the process's latch, too */
 	DisownLatch(&MyProc->procLatch);
@@ -779,8 +817,8 @@ AuxiliaryProcKill(int code, Datum arg)
 
 	Assert(MyProc == auxproc);
 
-	/* Release any LW locks I am holding (see notes above) */
-	LWLockReleaseAll();
+	/* Release any flex locks I am holding (see notes above) */
+	FlexLockReleaseAll();
 
 	/* Release ownership of the process's latch, too */
 	DisownLatch(&MyProc->procLatch);
@@ -865,7 +903,7 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable)
 	LOCK	   *lock = locallock->lock;
 	PROCLOCK   *proclock = locallock->proclock;
 	uint32		hashcode = locallock->hashcode;
-	LWLockId	partitionLock = LockHashPartitionLock(hashcode);
+	FlexLockId	partitionLock = LockHashPartitionLock(hashcode);
 	PROC_QUEUE *waitQueue = &(lock->waitProcs);
 	LOCKMASK	myHeldLocks = MyProc->heldLocks;
 	bool		early_deadlock = false;
@@ -1045,16 +1083,17 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable)
 		if (deadlock_state == DS_BLOCKED_BY_AUTOVACUUM && allow_autovacuum_cancel)
 		{
 			PGPROC	   *autovac = GetBlockingAutoVacuumPgproc();
+			PGPROC_MINIMAL *autovac_minimal = &ProcGlobal->allProcs_Minimal[autovac->pgprocno];
 
-			LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+			ProcArrayLockAcquire(PAL_EXCLUSIVE);
 
 			/*
 			 * Only do it if the worker is not working to protect against Xid
 			 * wraparound.
 			 */
 			if ((autovac != NULL) &&
-				(autovac->vacuumFlags & PROC_IS_AUTOVACUUM) &&
-				!(autovac->vacuumFlags & PROC_VACUUM_FOR_WRAPAROUND))
+				(autovac_minimal->vacuumFlags & PROC_IS_AUTOVACUUM) &&
+				!(autovac_minimal->vacuumFlags & PROC_VACUUM_FOR_WRAPAROUND))
 			{
 				int			pid = autovac->pid;
 
@@ -1062,7 +1101,7 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable)
 					 pid);
 
 				/* don't hold the lock across the kill() syscall */
-				LWLockRelease(ProcArrayLock);
+				ProcArrayLockRelease();
 
 				/* send the autovacuum worker Back to Old Kent Road */
 				if (kill(pid, SIGINT) < 0)
@@ -1074,7 +1113,7 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable)
 				}
 			}
 			else
-				LWLockRelease(ProcArrayLock);
+				ProcArrayLockRelease();
 
 			/* prevent signal from being resent more than once */
 			allow_autovacuum_cancel = false;
diff --git a/src/backend/storage/lmgr/procarraylock.c b/src/backend/storage/lmgr/procarraylock.c
new file mode 100644
index 0000000..6aa51f2
--- /dev/null
+++ b/src/backend/storage/lmgr/procarraylock.c
@@ -0,0 +1,343 @@
+/*-------------------------------------------------------------------------
+ *
+ * procarraylock.c
+ *	  Lock management for the ProcArray
+ *
+ * Because the ProcArray data structure is highly trafficked, it is
+ * critical that mutual exclusion for ProcArray operations be as efficient
+ * as possible.  A particular problem is transaction end (commit or abort),
+ * which cannot be done in parallel with snapshot acquisition.  We
+ * therefore include some special hacks to deal with this case efficiently.
+ *
+ * Portions Copyright (c) 1996-2011, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/backend/storage/lmgr/procarraylock.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "miscadmin.h"
+#include "pg_trace.h"
+#include "access/transam.h"
+#include "storage/flexlock_internals.h"
+#include "storage/ipc.h"
+#include "storage/procarraylock.h"
+#include "storage/proc.h"
+#include "storage/spin.h"
+
+typedef struct ProcArrayLockStruct
+{
+	FlexLock	flex;			/* common FlexLock infrastructure */
+	char		exclusive;		/* # of exclusive holders (0 or 1) */
+	int			shared;			/* # of shared holders (0..MaxBackends) */
+	PGPROC	   *ending;			/* transactions wishing to clear state */
+	TransactionId	latest_ending_xid;	/* latest ending XID */
+} ProcArrayLockStruct;
+
+/* There is only one ProcArrayLock. */
+#define	ProcArrayLockPointer() \
+	(AssertMacro(FlexLockArray[ProcArrayLock].flex.locktype == \
+		FLEXLOCK_TYPE_PROCARRAYLOCK), \
+	 (volatile ProcArrayLockStruct *) &FlexLockArray[ProcArrayLock])
+
+/*
+ * ProcArrayLockAcquire - acquire a lightweight lock in the specified mode
+ *
+ * If the lock is not available, sleep until it is.
+ *
+ * Side effect: cancel/die interrupts are held off until lock release.
+ */
+void
+ProcArrayLockAcquire(ProcArrayLockMode mode)
+{
+	volatile ProcArrayLockStruct *lock = ProcArrayLockPointer();
+	PGPROC	   *proc = MyProc;
+	bool		retry = false;
+	int			extraWaits = 0;
+
+	/*
+	 * We can't wait if we haven't got a PGPROC.  This should only occur
+	 * during bootstrap or shared memory initialization.  Put an Assert here
+	 * to catch unsafe coding practices.
+	 */
+	Assert(!(proc == NULL && IsUnderPostmaster));
+
+	/*
+	 * Lock out cancel/die interrupts until we exit the code section protected
+	 * by the ProcArrayLock.  This ensures that interrupts will not interfere
+	 * with manipulations of data structures in shared memory.
+	 */
+	HOLD_INTERRUPTS();
+
+	/*
+	 * Loop here to try to acquire lock after each time we are signaled by
+	 * ProcArrayLockRelease.  See comments in LWLockAcquire for an explanation
+	 * of why we do not attempt to hand off the lock directly.
+	 */
+	for (;;)
+	{
+		bool		mustwait;
+
+		/* Acquire mutex.  Time spent holding mutex should be short! */
+		SpinLockAcquire(&lock->flex.mutex);
+
+		/* If retrying, allow LWLockRelease to release waiters again */
+		if (retry)
+			lock->flex.releaseOK = true;
+
+		/* If I can get the lock, do so quickly. */
+		if (mode == PAL_EXCLUSIVE)
+		{
+			if (lock->exclusive == 0 && lock->shared == 0)
+			{
+				lock->exclusive++;
+				mustwait = false;
+			}
+			else
+				mustwait = true;
+		}
+		else
+		{
+			if (lock->exclusive == 0)
+			{
+				lock->shared++;
+				mustwait = false;
+			}
+			else
+				mustwait = true;
+		}
+
+		if (!mustwait)
+			break;				/* got the lock */
+
+		/* Add myself to wait queue. */
+		FlexLockJoinWaitQueue(lock, (int) mode);
+
+		/* Can release the mutex now */
+		SpinLockRelease(&lock->flex.mutex);
+
+		/* Wait until awakened. */
+		extraWaits += FlexLockWait(ProcArrayLock, mode);
+
+		/* Now loop back and try to acquire lock again. */
+		retry = true;
+	}
+
+	/* We are done updating shared state of the lock itself. */
+	SpinLockRelease(&lock->flex.mutex);
+
+	TRACE_POSTGRESQL_FLEXLOCK_ACQUIRE(ProcArrayLock, mode);
+
+	/* Add lock to list of locks held by this backend */
+	FlexLockRemember(ProcArrayLock);
+
+	/*
+	 * Fix the process wait semaphore's count for any absorbed wakeups.
+	 */
+	while (extraWaits-- > 0)
+		PGSemaphoreUnlock(&proc->sem);
+}
+
+/*
+ * ProcArrayLockClearTransaction - safely clear transaction details
+ *
+ * This can't be done while ProcArrayLock is held, but it's so fast that
+ * we can afford to do it while holding the spinlock, rather than acquiring
+ * and releasing the lock.
+ */
+void
+ProcArrayLockClearTransaction(TransactionId latestXid)
+{
+	volatile ProcArrayLockStruct *lock = ProcArrayLockPointer();
+	PGPROC	   *proc = MyProc;
+	int			extraWaits = 0;
+	bool		mustwait;
+
+	HOLD_INTERRUPTS();
+
+	/* Acquire mutex.  Time spent holding mutex should be short! */
+	SpinLockAcquire(&lock->flex.mutex);
+
+	if (lock->exclusive == 0 && lock->shared == 0)
+	{
+		{
+			volatile PGPROC_MINIMAL *vproc_minimal = &ProcGlobal->allProcs_Minimal[proc->pgprocno];
+			/* If there are no lockers, clear the critical PGPROC fields. */
+			vproc_minimal->xid = InvalidTransactionId;
+			vproc_minimal->xmin = InvalidTransactionId;
+			/* must be cleared with xid/xmin: */
+			vproc_minimal->vacuumFlags &= ~PROC_VACUUM_STATE_MASK;
+			vproc_minimal->nxids = 0;
+			vproc_minimal->overflowed = false;
+		}
+		mustwait = false;
+
+		/* Also advance global latestCompletedXid while holding the lock */
+		if (TransactionIdPrecedes(ShmemVariableCache->latestCompletedXid,
+								  latestXid))
+			ShmemVariableCache->latestCompletedXid = latestXid;
+	}
+	else
+	{
+		/* Rats, must wait. */
+		proc->flWaitLink = lock->ending;
+		lock->ending = proc;
+		if (!TransactionIdIsValid(lock->latest_ending_xid) ||
+				TransactionIdPrecedes(lock->latest_ending_xid, latestXid))
+			lock->latest_ending_xid = latestXid;
+		mustwait = true;
+	}
+
+	/* Can release the mutex now */
+	SpinLockRelease(&lock->flex.mutex);
+
+	/*
+	 * If we were not able to perform the operation immediately, we must wait.
+	 * But we need not retry after being awoken, because the last lock holder
+	 * to release the lock will do the work first, on our behalf.
+	 */
+	if (mustwait)
+	{
+		extraWaits += FlexLockWait(ProcArrayLock, 2);
+		while (extraWaits-- > 0)
+			PGSemaphoreUnlock(&proc->sem);
+	}
+
+	RESUME_INTERRUPTS();
+}
+
+/*
+ * ProcArrayLockRelease - release a previously acquired lock
+ */
+void
+ProcArrayLockRelease(void)
+{
+	volatile ProcArrayLockStruct *lock = ProcArrayLockPointer();
+	PGPROC	   *head;
+	PGPROC	   *ending = NULL;
+	PGPROC	   *proc;
+
+	FlexLockForget(ProcArrayLock);
+
+	/* Acquire mutex.  Time spent holding mutex should be short! */
+	SpinLockAcquire(&lock->flex.mutex);
+
+	/* Release my hold on lock */
+	if (lock->exclusive > 0)
+		lock->exclusive--;
+	else
+	{
+		Assert(lock->shared > 0);
+		lock->shared--;
+	}
+
+	/*
+	 * If the lock is now free, but there are some transactions trying to
+	 * end, we must clear the critical PGPROC fields for them, and save a
+	 * list of them so we can wake them up.
+	 */
+	if (lock->exclusive == 0 && lock->shared == 0 && lock->ending != NULL)
+	{
+		volatile PGPROC *vproc;
+
+		ending = lock->ending;
+		vproc = ending;
+
+		while (vproc != NULL)
+		{
+			volatile PGPROC_MINIMAL *vproc_minimal = &ProcGlobal->allProcs_Minimal[vproc->pgprocno];
+			/* If there are no lockers, clear the critical PGPROC fields. */
+			vproc_minimal->xid = InvalidTransactionId;
+			vproc_minimal->xmin = InvalidTransactionId;
+			/* must be cleared with xid/xmin: */
+			vproc_minimal->vacuumFlags &= ~PROC_VACUUM_STATE_MASK;
+			vproc_minimal->nxids = 0;
+			vproc_minimal->overflowed = false;
+			vproc = vproc->flWaitLink;
+		}
+
+		/* Also advance global latestCompletedXid */
+		if (TransactionIdPrecedes(ShmemVariableCache->latestCompletedXid,
+								  lock->latest_ending_xid))
+			ShmemVariableCache->latestCompletedXid = lock->latest_ending_xid;
+
+		/* Reset lock state. */
+		lock->ending = NULL;
+		lock->latest_ending_xid = InvalidTransactionId;
+	}
+
+	/*
+	 * See if I need to awaken any waiters.  If I released a non-last shared
+	 * hold, there cannot be anything to do.  Also, do not awaken any waiters
+	 * if someone has already awakened waiters that haven't yet acquired the
+	 * lock.
+	 */
+	head = lock->flex.head;
+	if (head != NULL)
+	{
+		if (lock->exclusive == 0 && lock->shared == 0 && lock->flex.releaseOK)
+		{
+			/*
+			 * Remove the to-be-awakened PGPROCs from the queue.  If the front
+			 * waiter wants exclusive lock, awaken him only. Otherwise awaken
+			 * as many waiters as want shared access.
+			 */
+			proc = head;
+			if (proc->flWaitMode != LW_EXCLUSIVE)
+			{
+				while (proc->flWaitLink != NULL &&
+					   proc->flWaitLink->flWaitMode != LW_EXCLUSIVE)
+					proc = proc->flWaitLink;
+			}
+			/* proc is now the last PGPROC to be released */
+			lock->flex.head = proc->flWaitLink;
+			proc->flWaitLink = NULL;
+			/* prevent additional wakeups until retryer gets to run */
+			lock->flex.releaseOK = false;
+		}
+		else
+		{
+			/* lock is still held, can't awaken anything */
+			head = NULL;
+		}
+	}
+
+	/* We are done updating shared state of the lock itself. */
+	SpinLockRelease(&lock->flex.mutex);
+
+	TRACE_POSTGRESQL_FLEXLOCK_RELEASE(ProcArrayLock);
+
+	/*
+	 * Awaken any waiters I removed from the queue.
+	 */
+	while (head != NULL)
+	{
+		FlexLockDebug("ProcArrayLockRelease", ProcArrayLock, "release waiter");
+		proc = head;
+		head = proc->flWaitLink;
+		proc->flWaitLink = NULL;
+		proc->flWaitResult = 1;		/* any non-zero value will do */
+		PGSemaphoreUnlock(&proc->sem);
+	}
+
+	/*
+	 * Also awaken any processes whose critical PGPROC fields I cleared
+	 */
+	while (ending != NULL)
+	{
+		FlexLockDebug("ProcArrayLockRelease", ProcArrayLock, "release ending");
+		proc = ending;
+		ending = proc->flWaitLink;
+		proc->flWaitLink = NULL;
+		proc->flWaitResult = 1;		/* any non-zero value will do */
+		PGSemaphoreUnlock(&proc->sem);
+	}
+
+	/*
+	 * Now okay to allow cancel/die interrupts.
+	 */
+	RESUME_INTERRUPTS();
+}
diff --git a/src/backend/utils/misc/check_guc b/src/backend/utils/misc/check_guc
index 293fb03..1a19e36 100755
--- a/src/backend/utils/misc/check_guc
+++ b/src/backend/utils/misc/check_guc
@@ -19,7 +19,7 @@
 INTENTIONALLY_NOT_INCLUDED="autocommit debug_deadlocks \
 is_superuser lc_collate lc_ctype lc_messages lc_monetary lc_numeric lc_time \
 pre_auth_delay role seed server_encoding server_version server_version_int \
-session_authorization trace_lock_oidmin trace_lock_table trace_locks trace_lwlocks \
+session_authorization trace_lock_oidmin trace_lock_table trace_locks trace_flexlocks \
 trace_notify trace_userlocks transaction_isolation transaction_read_only \
 zero_damaged_pages"
 
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index da7b6d4..52de233 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -59,6 +59,7 @@
 #include "replication/walreceiver.h"
 #include "replication/walsender.h"
 #include "storage/bufmgr.h"
+#include "storage/flexlock_internals.h"
 #include "storage/standby.h"
 #include "storage/fd.h"
 #include "storage/predicate.h"
@@ -1071,12 +1072,12 @@ static struct config_bool ConfigureNamesBool[] =
 		NULL, NULL, NULL
 	},
 	{
-		{"trace_lwlocks", PGC_SUSET, DEVELOPER_OPTIONS,
+		{"trace_flexlocks", PGC_SUSET, DEVELOPER_OPTIONS,
 			gettext_noop("No description available."),
 			NULL,
 			GUC_NOT_IN_SAMPLE
 		},
-		&Trace_lwlocks,
+		&Trace_flexlocks,
 		false,
 		NULL, NULL, NULL
 	},
diff --git a/src/backend/utils/probes.d b/src/backend/utils/probes.d
index 71c5ab0..5b9cfe6 100644
--- a/src/backend/utils/probes.d
+++ b/src/backend/utils/probes.d
@@ -15,8 +15,8 @@
  * in probe definitions, as they cause compilation errors on Mac OS X 10.5.
  */
 #define LocalTransactionId unsigned int
-#define LWLockId int
-#define LWLockMode int
+#define FlexLockId int
+#define FlexLockMode int
 #define LOCKMODE int
 #define BlockNumber unsigned int
 #define Oid unsigned int
@@ -29,12 +29,12 @@ provider postgresql {
 	probe transaction__commit(LocalTransactionId);
 	probe transaction__abort(LocalTransactionId);
 
-	probe lwlock__acquire(LWLockId, LWLockMode);
-	probe lwlock__release(LWLockId);
-	probe lwlock__wait__start(LWLockId, LWLockMode);
-	probe lwlock__wait__done(LWLockId, LWLockMode);
-	probe lwlock__condacquire(LWLockId, LWLockMode);
-	probe lwlock__condacquire__fail(LWLockId, LWLockMode);
+	probe flexlock__acquire(FlexLockId, FlexLockMode);
+	probe flexlock__release(FlexLockId);
+	probe flexlock__wait__start(FlexLockId, FlexLockMode);
+	probe flexlock__wait__done(FlexLockId, FlexLockMode);
+	probe flexlock__condacquire(FlexLockId, FlexLockMode);
+	probe flexlock__condacquire__fail(FlexLockId, FlexLockMode);
 
 	probe lock__wait__start(unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, LOCKMODE);
 	probe lock__wait__done(unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, LOCKMODE);
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 50fb780..1f4f5b4 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -577,7 +577,7 @@ static void
 SnapshotResetXmin(void)
 {
 	if (RegisteredSnapshots == 0 && ActiveSnapshot == NULL)
-		MyProc->xmin = InvalidTransactionId;
+		MyProcMinimal->xmin = InvalidTransactionId;
 }
 
 /*
diff --git a/src/include/access/slru.h b/src/include/access/slru.h
index e48743f..680a87f 100644
--- a/src/include/access/slru.h
+++ b/src/include/access/slru.h
@@ -55,7 +55,7 @@ typedef enum
  */
 typedef struct SlruSharedData
 {
-	LWLockId	ControlLock;
+	FlexLockId	ControlLock;
 
 	/* Number of buffers managed by this SLRU structure */
 	int			num_slots;
@@ -69,7 +69,7 @@ typedef struct SlruSharedData
 	bool	   *page_dirty;
 	int		   *page_number;
 	int		   *page_lru_count;
-	LWLockId   *buffer_locks;
+	FlexLockId *buffer_locks;
 
 	/*
 	 * Optional array of WAL flush LSNs associated with entries in the SLRU
@@ -136,7 +136,7 @@ typedef SlruCtlData *SlruCtl;
 
 extern Size SimpleLruShmemSize(int nslots, int nlsns);
 extern void SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
-			  LWLockId ctllock, const char *subdir);
+			  FlexLockId ctllock, const char *subdir);
 extern int	SimpleLruZeroPage(SlruCtl ctl, int pageno);
 extern int SimpleLruReadPage(SlruCtl ctl, int pageno, bool write_ok,
 				  TransactionId xid);
diff --git a/src/include/pg_config_manual.h b/src/include/pg_config_manual.h
index 6c8e312..d3b74db 100644
--- a/src/include/pg_config_manual.h
+++ b/src/include/pg_config_manual.h
@@ -49,9 +49,9 @@
 #define SEQ_MINVALUE	(-SEQ_MAXVALUE)
 
 /*
- * Number of spare LWLocks to allocate for user-defined add-on code.
+ * Number of spare FlexLocks to allocate for user-defined add-on code.
  */
-#define NUM_USER_DEFINED_LWLOCKS	4
+#define NUM_USER_DEFINED_FLEXLOCKS	4
 
 /*
  * Define this if you want to allow the lo_import and lo_export SQL
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index b7d4ea5..ac7f665 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -103,7 +103,7 @@ typedef struct buftag
 #define BufTableHashPartition(hashcode) \
 	((hashcode) % NUM_BUFFER_PARTITIONS)
 #define BufMappingPartitionLock(hashcode) \
-	((LWLockId) (FirstBufMappingLock + BufTableHashPartition(hashcode)))
+	((FlexLockId) (FirstBufMappingLock + BufTableHashPartition(hashcode)))
 
 /*
  *	BufferDesc -- shared descriptor/state data for a single shared buffer.
@@ -143,8 +143,8 @@ typedef struct sbufdesc
 	int			buf_id;			/* buffer's index number (from 0) */
 	int			freeNext;		/* link in freelist chain */
 
-	LWLockId	io_in_progress_lock;	/* to wait for I/O to complete */
-	LWLockId	content_lock;	/* to lock access to buffer contents */
+	FlexLockId	io_in_progress_lock;	/* to wait for I/O to complete */
+	FlexLockId	content_lock;	/* to lock access to buffer contents */
 } BufferDesc;
 
 #define BufferDescriptorGetBuffer(bdesc) ((bdesc)->buf_id + 1)
diff --git a/src/include/storage/flexlock.h b/src/include/storage/flexlock.h
new file mode 100644
index 0000000..612c21a
--- /dev/null
+++ b/src/include/storage/flexlock.h
@@ -0,0 +1,102 @@
+/*-------------------------------------------------------------------------
+ *
+ * flexlock.h
+ *	  Flex lock manager
+ *
+ * Portions Copyright (c) 1996-2011, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/flexlock.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef FLEXLOCK_H
+#define FLEXLOCK_H
+
+/*
+ * It's a bit odd to declare NUM_BUFFER_PARTITIONS and NUM_LOCK_PARTITIONS
+ * here, but we need them to set up enum FlexLockId correctly, and having
+ * this file include lock.h or bufmgr.h would be backwards.
+ */
+
+/* Number of partitions of the shared buffer mapping hashtable */
+#define NUM_BUFFER_PARTITIONS  16
+
+/* Number of partitions the shared lock tables are divided into */
+#define LOG2_NUM_LOCK_PARTITIONS  4
+#define NUM_LOCK_PARTITIONS  (1 << LOG2_NUM_LOCK_PARTITIONS)
+
+/* Number of partitions the shared predicate lock tables are divided into */
+#define LOG2_NUM_PREDICATELOCK_PARTITIONS  4
+#define NUM_PREDICATELOCK_PARTITIONS  (1 << LOG2_NUM_PREDICATELOCK_PARTITIONS)
+
+/*
+ * We have a number of predefined FlexLocks, plus a bunch of locks that are
+ * dynamically assigned (e.g., for shared buffers).  The FlexLock structures
+ * live in shared memory (since they contain shared data) and are identified
+ * by values of this enumerated type.  We abuse the notion of an enum somewhat
+ * by allowing values not listed in the enum declaration to be assigned.
+ * The extra value MaxDynamicFlexLock is there to keep the compiler from
+ * deciding that the enum can be represented as char or short ...
+ *
+ * If you remove a lock, please replace it with a placeholder. This retains
+ * the lock numbering, which is helpful for DTrace and other external
+ * debugging scripts.
+ */
+typedef enum FlexLockId
+{
+	BufFreelistLock,
+	ShmemIndexLock,
+	OidGenLock,
+	XidGenLock,
+	ProcArrayLock,
+	SInvalReadLock,
+	SInvalWriteLock,
+	WALInsertLock,
+	WALWriteLock,
+	ControlFileLock,
+	CheckpointLock,
+	CLogControlLock,
+	SubtransControlLock,
+	MultiXactGenLock,
+	MultiXactOffsetControlLock,
+	MultiXactMemberControlLock,
+	RelCacheInitLock,
+	BgWriterCommLock,
+	TwoPhaseStateLock,
+	TablespaceCreateLock,
+	BtreeVacuumLock,
+	AddinShmemInitLock,
+	AutovacuumLock,
+	AutovacuumScheduleLock,
+	SyncScanLock,
+	RelationMappingLock,
+	AsyncCtlLock,
+	AsyncQueueLock,
+	SerializableXactHashLock,
+	SerializableFinishedListLock,
+	SerializablePredicateLockListLock,
+	OldSerXidLock,
+	SyncRepLock,
+	/* Individual lock IDs end here */
+	FirstBufMappingLock,
+	FirstLockMgrLock = FirstBufMappingLock + NUM_BUFFER_PARTITIONS,
+	FirstPredicateLockMgrLock = FirstLockMgrLock + NUM_LOCK_PARTITIONS,
+
+	/* must be last except for MaxDynamicFlexLock: */
+	NumFixedFlexLocks = FirstPredicateLockMgrLock + NUM_PREDICATELOCK_PARTITIONS,
+
+	MaxDynamicFlexLock = 1000000000
+} FlexLockId;
+
+/* Shared memory setup. */
+extern int	NumFlexLocks(void);
+extern Size FlexLockShmemSize(void);
+extern void RequestAddinFlexLocks(int n);
+extern void CreateFlexLocks(void);
+
+/* Error recovery and debugging support functions. */
+extern void FlexLockReleaseAll(void);
+extern bool FlexLockHeldByMe(FlexLockId id);
+
+#endif   /* FLEXLOCK_H */
diff --git a/src/include/storage/flexlock_internals.h b/src/include/storage/flexlock_internals.h
new file mode 100644
index 0000000..d1bca45
--- /dev/null
+++ b/src/include/storage/flexlock_internals.h
@@ -0,0 +1,89 @@
+/*-------------------------------------------------------------------------
+ *
+ * flexlock_internals.h
+ *	  Flex lock internals.  Only files which implement a FlexLock
+ *    type should need to include this.  Merging this with flexlock.h
+ *    creates a circular header dependency, but even if it didn't, this
+ *    is cleaner.
+ *
+ * Portions Copyright (c) 1996-2011, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/flexlock_internals.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef FLEXLOCK_INTERNALS_H
+#define FLEXLOCK_INTERNALS_H
+
+#include "pg_trace.h"
+#include "storage/flexlock.h"
+#include "storage/proc.h"
+#include "storage/s_lock.h"
+
+/*
+ * Individual FlexLock implementations each get this many bytes to store
+ * its state; of course, a given implementation could also allocate additional
+ * shmem elsewhere, but we provide this many bytes within the array.  The
+ * header fields common to all FlexLock types are included in this number.
+ * A power of two should probably be chosen, to avoid alignment issues and
+ * cache line splitting.  It might be useful to increase this on systems where
+ * a cache line is more than 64 bytes in size.
+ */
+#define FLEX_LOCK_BYTES		64
+
+typedef struct FlexLock
+{
+	char		locktype;		/* see FLEXLOCK_TYPE_* constants */
+	slock_t		mutex;			/* Protects FlexLock state and wait queues */
+	bool		releaseOK;		/* T if ok to release waiters */
+	PGPROC	   *head;			/* head of list of waiting PGPROCs */
+	PGPROC	   *tail;			/* tail of list of waiting PGPROCs */
+	/* tail is undefined when head is NULL */
+} FlexLock;
+
+#define FLEXLOCK_TYPE_LWLOCK			'l'
+#define FLEXLOCK_TYPE_PROCARRAYLOCK		'p'
+
+typedef union FlexLockPadded
+{
+	FlexLock	flex;
+	char		pad[FLEX_LOCK_BYTES];
+} FlexLockPadded;
+
+extern FlexLockPadded *FlexLockArray;
+
+extern FlexLockId FlexLockAssign(char locktype);
+extern void FlexLockRemember(FlexLockId id);
+extern void FlexLockForget(FlexLockId id);
+extern int FlexLockWait(FlexLockId id, int mode);
+
+/*
+ * We must join the wait queue while holding the spinlock, so we define this
+ * as a macro, for speed.
+ */
+#define FlexLockJoinWaitQueue(lock, mode) \
+	do { \
+		Assert(MyProc != NULL); \
+		MyProc->flWaitResult = 0; \
+		MyProc->flWaitMode = mode; \
+		MyProc->flWaitLink = NULL; \
+		if (lock->flex.head == NULL) \
+			lock->flex.head = MyProc; \
+		else \
+			lock->flex.tail->flWaitLink = MyProc; \
+		lock->flex.tail = MyProc; \
+	} while (0)
+
+#ifdef LOCK_DEBUG
+extern bool	Trace_flexlocks;
+#define FlexLockDebug(where, id, msg) \
+	do { \
+		if (Trace_flexlocks) \
+			elog(LOG, "%s(%d): %s", where, (int) id, msg); \
+	} while (0)
+#else
+#define FlexLockDebug(where, id, msg)
+#endif
+
+#endif   /* FLEXLOCK_INTERNALS_H */
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index e106ad5..ba87db2 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -471,7 +471,7 @@ typedef enum
 #define LockHashPartition(hashcode) \
 	((hashcode) % NUM_LOCK_PARTITIONS)
 #define LockHashPartitionLock(hashcode) \
-	((LWLockId) (FirstLockMgrLock + LockHashPartition(hashcode)))
+	((FlexLockId) (FirstLockMgrLock + LockHashPartition(hashcode)))
 
 
 /*
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 438a48d..f68cddc 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -14,82 +14,7 @@
 #ifndef LWLOCK_H
 #define LWLOCK_H
 
-/*
- * It's a bit odd to declare NUM_BUFFER_PARTITIONS and NUM_LOCK_PARTITIONS
- * here, but we need them to set up enum LWLockId correctly, and having
- * this file include lock.h or bufmgr.h would be backwards.
- */
-
-/* Number of partitions of the shared buffer mapping hashtable */
-#define NUM_BUFFER_PARTITIONS  16
-
-/* Number of partitions the shared lock tables are divided into */
-#define LOG2_NUM_LOCK_PARTITIONS  4
-#define NUM_LOCK_PARTITIONS  (1 << LOG2_NUM_LOCK_PARTITIONS)
-
-/* Number of partitions the shared predicate lock tables are divided into */
-#define LOG2_NUM_PREDICATELOCK_PARTITIONS  4
-#define NUM_PREDICATELOCK_PARTITIONS  (1 << LOG2_NUM_PREDICATELOCK_PARTITIONS)
-
-/*
- * We have a number of predefined LWLocks, plus a bunch of LWLocks that are
- * dynamically assigned (e.g., for shared buffers).  The LWLock structures
- * live in shared memory (since they contain shared data) and are identified
- * by values of this enumerated type.  We abuse the notion of an enum somewhat
- * by allowing values not listed in the enum declaration to be assigned.
- * The extra value MaxDynamicLWLock is there to keep the compiler from
- * deciding that the enum can be represented as char or short ...
- *
- * If you remove a lock, please replace it with a placeholder. This retains
- * the lock numbering, which is helpful for DTrace and other external
- * debugging scripts.
- */
-typedef enum LWLockId
-{
-	BufFreelistLock,
-	ShmemIndexLock,
-	OidGenLock,
-	XidGenLock,
-	ProcArrayLock,
-	SInvalReadLock,
-	SInvalWriteLock,
-	WALInsertLock,
-	WALWriteLock,
-	ControlFileLock,
-	CheckpointLock,
-	CLogControlLock,
-	SubtransControlLock,
-	MultiXactGenLock,
-	MultiXactOffsetControlLock,
-	MultiXactMemberControlLock,
-	RelCacheInitLock,
-	BgWriterCommLock,
-	TwoPhaseStateLock,
-	TablespaceCreateLock,
-	BtreeVacuumLock,
-	AddinShmemInitLock,
-	AutovacuumLock,
-	AutovacuumScheduleLock,
-	SyncScanLock,
-	RelationMappingLock,
-	AsyncCtlLock,
-	AsyncQueueLock,
-	SerializableXactHashLock,
-	SerializableFinishedListLock,
-	SerializablePredicateLockListLock,
-	OldSerXidLock,
-	SyncRepLock,
-	/* Individual lock IDs end here */
-	FirstBufMappingLock,
-	FirstLockMgrLock = FirstBufMappingLock + NUM_BUFFER_PARTITIONS,
-	FirstPredicateLockMgrLock = FirstLockMgrLock + NUM_LOCK_PARTITIONS,
-
-	/* must be last except for MaxDynamicLWLock: */
-	NumFixedLWLocks = FirstPredicateLockMgrLock + NUM_PREDICATELOCK_PARTITIONS,
-
-	MaxDynamicLWLock = 1000000000
-} LWLockId;
-
+#include "storage/flexlock.h"
 
 typedef enum LWLockMode
 {
@@ -97,22 +22,10 @@ typedef enum LWLockMode
 	LW_SHARED
 } LWLockMode;
 
-
-#ifdef LOCK_DEBUG
-extern bool Trace_lwlocks;
-#endif
-
-extern LWLockId LWLockAssign(void);
-extern void LWLockAcquire(LWLockId lockid, LWLockMode mode);
-extern bool LWLockConditionalAcquire(LWLockId lockid, LWLockMode mode);
-extern void LWLockRelease(LWLockId lockid);
-extern void LWLockReleaseAll(void);
-extern bool LWLockHeldByMe(LWLockId lockid);
-
-extern int	NumLWLocks(void);
-extern Size LWLockShmemSize(void);
-extern void CreateLWLocks(void);
-
-extern void RequestAddinLWLocks(int n);
+extern FlexLockId LWLockAssign(void);
+extern void LWLockAcquire(FlexLockId lockid, LWLockMode mode);
+extern bool LWLockConditionalAcquire(FlexLockId lockid, LWLockMode mode);
+extern void LWLockRelease(FlexLockId lockid);
+extern bool LWLockHeldByMe(FlexLockId lockid);
 
 #endif   /* LWLOCK_H */
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 6e798b1..9f377a8 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -35,8 +35,6 @@
 
 struct XidCache
 {
-	bool		overflowed;
-	int			nxids;
 	TransactionId xids[PGPROC_MAX_CACHED_SUBXIDS];
 };
 
@@ -86,27 +84,14 @@ struct PGPROC
 	LocalTransactionId lxid;	/* local id of top-level transaction currently
 								 * being executed by this proc, if running;
 								 * else InvalidLocalTransactionId */
-
-	TransactionId xid;			/* id of top-level transaction currently being
-								 * executed by this proc, if running and XID
-								 * is assigned; else InvalidTransactionId */
-
-	TransactionId xmin;			/* minimal running XID as it was when we were
-								 * starting our xact, excluding LAZY VACUUM:
-								 * vacuum must not remove tuples deleted by
-								 * xid >= xmin ! */
-
 	int			pid;			/* Backend's process ID; 0 if prepared xact */
+	int			pgprocno;
 
 	/* These fields are zero while a backend is still starting up: */
 	BackendId	backendId;		/* This backend's backend ID (if assigned) */
 	Oid			databaseId;		/* OID of database this backend is using */
 	Oid			roleId;			/* OID of role using this backend */
 
-	bool		inCommit;		/* true if within commit critical section */
-
-	uint8		vacuumFlags;	/* vacuum-related flags, see above */
-
 	/*
 	 * While in hot standby mode, shows that a conflict signal has been sent
 	 * for the current transaction. Set/cleared while holding ProcArrayLock,
@@ -114,10 +99,10 @@ struct PGPROC
 	 */
 	bool		recoveryConflictPending;
 
-	/* Info about LWLock the process is currently waiting for, if any. */
-	bool		lwWaiting;		/* true if waiting for an LW lock */
-	bool		lwExclusive;	/* true if waiting for exclusive access */
-	struct PGPROC *lwWaitLink;	/* next waiter for same LW lock */
+	/* Info about FlexLock the process is currently waiting for, if any. */
+	int			flWaitResult;	/* result of wait, or 0 if still waiting */
+	int			flWaitMode;		/* lock mode sought */
+	struct PGPROC *flWaitLink;	/* next waiter for same FlexLock */
 
 	/* Info about lock the process is currently waiting for, if any. */
 	/* waitLock and waitProcLock are NULL if not currently waiting. */
@@ -147,7 +132,7 @@ struct PGPROC
 	struct XidCache subxids;	/* cache for subtransaction XIDs */
 
 	/* Per-backend LWLock.  Protects fields below. */
-	LWLockId	backendLock;	/* protects the fields below */
+	FlexLockId	backendLock;	/* protects the fields below */
 
 	/* Lock manager data, recording fast-path locks taken by this backend. */
 	uint64		fpLockBits;		/* lock modes held for each fast-path slot */
@@ -160,7 +145,35 @@ struct PGPROC
 
 
 extern PGDLLIMPORT PGPROC *MyProc;
+extern PGDLLIMPORT struct PGPROC_MINIMAL *MyProcMinimal;
+
+/*
+ * A minimal part of the PGPROC. We store these members outside the main
+ * PGPROC structure since they are very heavily accessed, usually in a loop
+ * over all active PGPROCs. Storing them in a separate array ensures that
+ * they can be accessed very efficiently, with a minimum of cache misses.
+ * On a large multiprocessor system, this can show a significant performance
+ * improvement.
+ */
+struct PGPROC_MINIMAL
+{
+	TransactionId xid;			/* id of top-level transaction currently being
+								 * executed by this proc, if running and XID
+								 * is assigned; else InvalidTransactionId */
 
+	TransactionId xmin;			/* minimal running XID as it was when we were
+								 * starting our xact, excluding LAZY VACUUM:
+								 * vacuum must not remove tuples deleted by
+								 * xid >= xmin ! */
+
+	uint8		vacuumFlags;	/* vacuum-related flags, see above */
+	bool		overflowed;
+	bool		inCommit;		/* true if within commit critical section */
+
+	uint8		nxids;
+};
+
+typedef struct PGPROC_MINIMAL PGPROC_MINIMAL;
 
 /*
  * There is one ProcGlobal struct for the whole database cluster.
@@ -169,6 +182,8 @@ typedef struct PROC_HDR
 {
 	/* Array of PGPROC structures (not including dummies for prepared txns) */
 	PGPROC	   *allProcs;
+	/* Array of PGPROC_MINIMAL structures (not including dummies for prepared txns) */
+	PGPROC_MINIMAL	*allProcs_Minimal;
 	/* Length of allProcs array */
 	uint32		allProcCount;
 	/* Head of list of free PGPROC structures */
@@ -186,6 +201,8 @@ typedef struct PROC_HDR
 
 extern PROC_HDR *ProcGlobal;
 
+extern PGPROC *PreparedXactProcs;
+
 /*
  * We set aside some extra PGPROC structures for auxiliary processes,
  * ie things that aren't full-fledged backends but need shmem access.
diff --git a/src/include/storage/procarraylock.h b/src/include/storage/procarraylock.h
new file mode 100644
index 0000000..678ca6f
--- /dev/null
+++ b/src/include/storage/procarraylock.h
@@ -0,0 +1,28 @@
+/*-------------------------------------------------------------------------
+ *
+ * procarraylock.h
+ *	  Lock management for the ProcArray
+ *
+ * Portions Copyright (c) 1996-2011, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/procarraylock.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef PROCARRAYLOCK_H
+#define PROCARRAYLOCK_H
+
+#include "storage/flexlock.h"
+
+typedef enum ProcArrayLockMode
+{
+	PAL_EXCLUSIVE,
+	PAL_SHARED
+} ProcArrayLockMode;
+
+extern void ProcArrayLockAcquire(ProcArrayLockMode mode);
+extern void ProcArrayLockClearTransaction(TransactionId latestXid);
+extern void ProcArrayLockRelease(void);
+
+#endif   /* PROCARRAYLOCK_H */
#2Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Robert Haas (#1)
Re: testing ProcArrayLock patches

Robert Haas <robertmhaas@gmail.com> wrote:

Nate Boley's AMD 6128 box (which has 32 cores) and an HP Integrity
server (also with 32 cores).

[clear improvement with flexlock patch]

Hmm. We have a 32-core Intel box (4 x X7560 @ 2.27GHz) with 256 GB
RAM. It's about a week from going into production, at which point
it will be extremely hard to schedule such tests, but for a few days
more I've got shots at it. The flexlock patch doesn't appear to be
such a clear win here.

I started from Robert's tests, but used these settings so that I
could go to higher client counts and better test serializable
transactions. Everything is fully cached.

max_connections = 200
max_pred_locks_per_transaction = 256
shared_buffers = 8GB
maintenance_work_mem = 1GB
checkpoint_segments = 30
checkpoint_timeout = 15min
checkpoint_completion_target = 0.9
seq_page_cost = 0.1
random_page_cost = 0.1
cpu_tuple_cost = 0.05
effective_cache_size = 40GB
default_transaction_isolation = '$iso'

Serializable results not shown here -- that's to gather information
for trying to improve SSI locking.

m1 tps = 7847.834544 (including connections establishing)
f1 tps = 7917.225382 (including connections establishing)
m2 tps = 18672.145526 (including connections establishing)
f2 tps = 17486.435322 (including connections establishing)
m4 tps = 34371.278253 (including connections establishing)
f4 tps = 34465.898173 (including connections establishing)
m8 tps = 68228.261694 (including connections establishing)
f8 tps = 68505.285830 (including connections establishing)
m16 tps = 127449.815100 (including connections establishing)
f16 tps = 127208.939670 (including connections establishing)
m32 tps = 201738.209348 (including connections establishing)
f32 tps = 201637.237903 (including connections establishing)
m64 tps = 380326.800557 (including connections establishing)
f64 tps = 380628.429408 (including connections establishing)
m80 tps = 366628.197546 (including connections establishing)
f80 tps = 162594.012051 (including connections establishing)
m96 tps = 360922.948775 (including connections establishing)
f96 tps = 366728.987041 (including connections establishing)
m128 tps = 352159.631878 (including connections establishing)
f128 tps = 355475.129448 (including connections establishing)

I did five runs each and took the median. In most cases, the values
were pretty close to one another in a group, so confidence is pretty
high that this is meaningful. There were a few anomalies where
performance for one or more samples was horrid. This seems
consistent with the theory of pathological pileups on the LW locks
(or also flexlocks?).

The problem groups:

m64 tps = 380407.768906 (including connections establishing)
m64 tps = 79197.470389 (including connections establishing)
m64 tps = 381112.194105 (including connections establishing)
m64 tps = 378579.036542 (including connections establishing)
m64 tps = 380326.800557 (including connections establishing)

m96 tps = 360582.945291 (including connections establishing)
m96 tps = 363021.805138 (including connections establishing)
m96 tps = 362468.870516 (including connections establishing)
m96 tps = 59614.322351 (including connections establishing)
m96 tps = 360922.948775 (including connections establishing)

f80 tps = 158905.149822 (including connections establishing)
f80 tps = 157192.460599 (including connections establishing)
f80 tps = 370757.790443 (including connections establishing)
f80 tps = 162594.012051 (including connections establishing)
f80 tps = 372170.638516 (including connections establishing)

f96 tps = 366804.733788 (including connections establishing)
f96 tps = 366728.987041 (including connections establishing)
f96 tps = 365490.380848 (including connections establishing)
f96 tps = 366770.193305 (including connections establishing)
f96 tps = 125225.371140 (including connections establishing)

So the lows don't seem to be as low when they happen with the
flexlock patch, but they still happen -- possibly more often?

-Kevin

#3Robert Haas
robertmhaas@gmail.com
In reply to: Kevin Grittner (#2)
Re: testing ProcArrayLock patches

On Fri, Nov 18, 2011 at 11:26 AM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:

Robert Haas <robertmhaas@gmail.com> wrote:

Nate Boley's AMD 6128 box (which has 32 cores) and an HP Integrity
server (also with 32 cores).

[clear improvement with flexlock patch]

Hmm.  We have a 32-core Intel box (4 x X7560 @ 2.27GHz) with 256 GB
RAM.  It's about a week from going into production, at which point
it will be extremely hard to schedule such tests, but for a few days
more I've got shots at it.  The flexlock patch doesn't appear to be
such a clear win here.

I started from Robert's tests, but used these settings so that I
could go to higher client counts and better test serializable
transactions.  Everything is fully cached.

max_connections = 200
max_pred_locks_per_transaction = 256
shared_buffers = 8GB
maintenance_work_mem = 1GB
checkpoint_segments = 30
checkpoint_timeout = 15min
checkpoint_completion_target = 0.9
seq_page_cost = 0.1
random_page_cost = 0.1
cpu_tuple_cost = 0.05
effective_cache_size = 40GB
default_transaction_isolation = '$iso'

I had a dismaying benchmarking experience recently that involved
settings very similar to the ones you've got there - in particular, I
also had checkpoint_segments set to 30. When I raised it to 300,
performance improved dramatically at 8 clients and above.

Then again, is this a regular pgbench test or is this SELECT-only?
Because the absolute numbers you're posting are vastly higher than
anything I've ever seen on a write test.

Can you by any chance check top or vmstat during the 32-client test
and see what percentage you have of user time/system time/idle time?

What OS are you running?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#4Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Robert Haas (#3)
Re: testing ProcArrayLock patches

Robert Haas <robertmhaas@gmail.com> wrote:

Then again, is this a regular pgbench test or is this SELECT-only?

SELECT-only

Can you by any chance check top or vmstat during the 32-client
test and see what percentage you have of user time/system
time/idle time?

You didn't say whether you wanted master or flexlock, but it turned
out that any difference was way too far into the noise to show.
They both looked like this:

procs -----------memory---------- ---swap-- -----io---- ---system---- -----cpu------
 r  b  swpd    free      buff    cache   si   so    bi    bo     in      cs us sy id wa st
38  0   352 1157400 207177020 52360472    0    0     0    16  13345 1190230 40  7 53  0  0
37  0   352 1157480 207177020 52360472    0    0     0     0  12953 1263310 40  8 52  0  0
36  0   352 1157484 207177020 52360472    0    0     0     0  13411 1233365 38  7 54  0  0
37  0   352 1157476 207177020 52360472    0    0     0     0  12780 1193575 41  7 51  0  0

Keep in mind that while there are really 32 cores, the cpu
percentages seem to be based on the "threads" from hyperthreading.
Top showed pgbench (running on the same machine) as eating a pretty
steady 5.2 of the cores, leaving 26.8 cores to actually drive the 32
postgres processes.

What OS are you running?

Linux new-CIR 2.6.32.43-0.4-default #1 SMP 2011-07-14 14:47:44 +0200
x86_64 x86_64 x86_64 GNU/Linux

SUSE Linux Enterprise Server 11 (x86_64)
VERSION = 11
PATCHLEVEL = 1

-Kevin

#5Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Kevin Grittner (#2)
Re: testing ProcArrayLock patches

"Kevin Grittner" <Kevin.Grittner@wicourts.gov> wrote:

We have a 32-core Intel box (4 x X7560 @ 2.27GHz) with 256 GB
RAM.

In case anyone cares, this is the same box for which I posted STREAM
test results a while back. The PostgreSQL tests seem to peak on
this 32-core box at 64 clients, while the STREAM test of raw RAM
speed kept increasing up to 128 clients. Overall, though, it's
impressive how close PostgreSQL is now coming to the raw RAM access
speed curve.

http://archives.postgresql.org/pgsql-hackers/2011-08/msg01306.php

-Kevin

#6Robert Haas
robertmhaas@gmail.com
In reply to: Kevin Grittner (#4)
Re: testing ProcArrayLock patches

On Fri, Nov 18, 2011 at 12:03 PM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:

Then again, is this a regular pgbench test or is this SELECT-only?

SELECT-only

Ah, OK. I would not expect flexlocks to help with that; Pavan's patch
might, though.

Can you by any chance check top or vmstat during the 32-client
test and see what percentage you have of user time/system
time/idle time?

You didn't say whether you wanted master or flexlock, but it turned
out that any difference was way too far into the noise to show.
They both looked like this:

procs -----------memory---------- ---swap-- -----io---- ---system---- -----cpu------
 r  b  swpd    free      buff    cache   si   so    bi    bo     in      cs us sy id wa st
38  0   352 1157400 207177020 52360472    0    0     0    16  13345 1190230 40  7 53  0  0
37  0   352 1157480 207177020 52360472    0    0     0     0  12953 1263310 40  8 52  0  0
36  0   352 1157484 207177020 52360472    0    0     0     0  13411 1233365 38  7 54  0  0
37  0   352 1157476 207177020 52360472    0    0     0     0  12780 1193575 41  7 51  0  0

Keep in mind that while there are really 32 cores, the cpu
percentages seem to be based on the "threads" from hyperthreading.
Top showed pgbench (running on the same machine) as eating a pretty
steady 5.2 of the cores, leaving 26.8 cores to actually drive the 32
postgres processes.

It doesn't make any sense for PostgreSQL master to be using only 50%
of the CPU and leaving the rest idle on a lots-of-clients SELECT-only
test. That could easily happen on 9.1, but my lock manager changes
eliminated the only place where anything gets put to sleep in that
path (except for the emergency sleeps done by s_lock, when a spinlock
is really badly contended). So I'm confused by these results. Are we
sure that the processes are being scheduled across all 32 physical
cores?

At any rate, I do think it's likely that you're being bitten by
spinlock contention, but we'd need to do some legwork to verify that
and work out the details. Any chance you can run oprofile (on either
branch, don't really care) against the 32 client test and post the
results? If it turns out s_lock is at the top of the heap, I can put
together a patch to help figure out which spinlock is the culprit.

Anyway, this is probably a digression as it relates to FlexLocks:
those are not optimizing for a read-only workload.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#7Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Robert Haas (#6)
Re: testing ProcArrayLock patches

Robert Haas <robertmhaas@gmail.com> wrote:

Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote:

Then again, is this a regular pgbench test or is this
SELECT-only?

SELECT-only

Ah, OK. I would not expect flexlocks to help with that; Pavan's
patch might, though.

OK. Sorry for misunderstanding that. I haven't gotten around to a
deep reading of the patch yet. :-( I based this on the test script
you posted here (with slight modifications for my preferred
directory structures):

http://archives.postgresql.org/pgsql-hackers/2011-10/msg00605.php

If I just drop the -S switch will I have a good test, or are there
other adjustments I should make (besides increasing checkpoint
segments)? (Well, for the SELECT-only test I didn't bother putting
pg_xlog on a separate RAID 10 on its own BBU controller as we
normally would for this machine, I'll cover that, too.)

It doesn't make any sense for PostgreSQL master to be using only
50% of the CPU and leaving the rest idle on a lots-of-clients
SELECT-only test. That could easily happen on 9.1, but my lock
manager changes eliminated the only place where anything gets put
to sleep in that path (except for the emergency sleeps done by
s_lock, when a spinlock is really badly contended). So I'm
confused by these results. Are we sure that the processes are
being scheduled across all 32 physical cores?

I think so. My take was that it was showing 32 of 64 *threads*
active -- the hyperthreading funkiness. Is there something in
particular you'd like me to check?

At any rate, I do think it's likely that you're being bitten by
spinlock contention, but we'd need to do some legwork to verify
that and work out the details. Any chance you can run oprofile
(on either branch, don't really care) against the 32 client test
and post the results? If it turns out s_lock is at the top of the
heap, I can put together a patch to help figure out which spinlock
is the culprit.

oprofile isn't installed on this machine. I'll take care of that
and post results when I can.

-Kevin

#8Robert Haas
robertmhaas@gmail.com
In reply to: Kevin Grittner (#7)
Re: testing ProcArrayLock patches

On Fri, Nov 18, 2011 at 12:45 PM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:

OK.  Sorry for misunderstanding that.  I haven't gotten around to a
deep reading of the patch yet.  :-(  I based this on the test script
you posted here (with slight modifications for my preferred
directory structures):

http://archives.postgresql.org/pgsql-hackers/2011-10/msg00605.php

If I just drop the -S switch will I have a good test, or are there
other adjustments I should make (besides increasing checkpoint
segments)?  (Well, for the SELECT-only test I didn't bother putting
pg_xlog on a separate RAID 10 on its own BBU controller as we
normally would for this machine, I'll cover that, too.)

Yeah, I'd just drop -S. Make sure to use -c N -j N with pgbench, or
you'll probably not be able to saturate it. I've also had good luck
with wal_writer_delay=20ms, although if you have synchronous_commit=on
that might not matter, and it's much less important since Simon's
recent patch in that area went in.

What scale factor are you testing at?

It doesn't make any sense for PostgreSQL master to be using only
50% of the CPU and leaving the rest idle on a lots-of-clients
SELECT-only test.  That could easily happen on 9.1, but my lock
manager changes eliminated the only place where anything gets put
to sleep in that path (except for the emergency sleeps done by
s_lock, when a spinlock is really badly contended).  So I'm
confused by these results. Are we sure that the processes are
being scheduled across all 32 physical cores?

I think so.  My take was that it was showing 32 of 64 *threads*
active -- the hyperthreading funkiness.  Is there something in
particular you'd like me to check?

Not really, just don't understand the number.

At any rate, I do think it's likely that you're being bitten by
spinlock contention, but we'd need to do some legwork to verify
that and work out the details.  Any chance you can run oprofile
(on either branch, don't really care) against the 32 client test
and post the results?  If it turns out s_lock is at the top of the
heap, I can put together a patch to help figure out which spinlock
is the culprit.

oprofile isn't installed on this machine.  I'll take care of that
and post results when I can.

OK.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#9Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Robert Haas (#8)
Re: testing ProcArrayLock patches

Robert Haas <robertmhaas@gmail.com> wrote:

Yeah, I'd just drop -S.

Easily done.

Make sure to use -c N -j N with pgbench, or you'll probably not be
able to saturate it.

Yeah, that's part of the script I copied from you.

I've also had good luck with wal_writer_delay=20ms, although if
you have synchronous_commit=on that might not matter, and it's
much less important since Simon's recent patch in that area went
in.

What the heck; will do.

What scale factor are you testing at?

100. Perhaps I should boost that since I'm going as far as 128
clients?

-Kevin

#10Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Robert Haas (#6)
Re: testing ProcArrayLock patches

Robert Haas <robertmhaas@gmail.com> wrote:

Any chance you can run oprofile (on either branch, don't really
care) against the 32 client test and post the results?

Besides the other changes we discussed, I boosted scale to 150 and
ran at READ COMMITTED isolation level (because all threads promptly
crashed and burned at REPEATABLE READ -- we desperately need a
pgbench option to retry a transaction on serialization failure).
The oprofile hot spots at half a percent or higher:

CPU: Intel Core/i7, speed 2262 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with
a unit mask of 0x00 (No unit mask) count 100000
samples % image name symbol name
933394 4.9651 postgres AllocSetAlloc
848476 4.5134 postgres base_yyparse
719515 3.8274 postgres SearchCatCache
461275 2.4537 postgres hash_search_with_hash_value
426411 2.2682 postgres GetSnapshotData
322938 1.7178 postgres LWLockAcquire
322236 1.7141 postgres core_yylex
305471 1.6249 postgres MemoryContextAllocZeroAligned
281543 1.4976 postgres expression_tree_walker
270241 1.4375 postgres XLogInsert
234899 1.2495 postgres MemoryContextAlloc
210137 1.1178 postgres ScanKeywordLookup
184857 0.9833 postgres heap_page_prune
173608 0.9235 postgres hash_any
153011 0.8139 postgres _bt_compare
144538 0.7689 postgres nocachegetattr
131466 0.6993 postgres fmgr_info_cxt_security
131001 0.6968 postgres grouping_planner
130808 0.6958 postgres LWLockRelease
124112 0.6602 postgres PinBuffer
120745 0.6423 postgres LockAcquireExtended
112992 0.6010 postgres ExecInitExpr
112830 0.6002 postgres lappend
112311 0.5974 postgres new_list
110368 0.5871 postgres check_stack_depth
106036 0.5640 postgres AllocSetFree
102565 0.5456 postgres MemoryContextAllocZero
94689 0.5037 postgres SearchSysCache

Do you want line numbers or lower percentages?

Two runs:

tps = 21946.961196 (including connections establishing)
tps = 22911.873227 (including connections establishing)

For write transactions, that seems pretty respectable.

-Kevin

#11anarazel@anarazel.de
andres@anarazel.de
In reply to: Kevin Grittner (#10)
Re: testing ProcArrayLock patches

Kevin Grittner <Kevin.Grittner@wicourts.gov> schrieb:

Robert Haas <robertmhaas@gmail.com> wrote:

Any chance you can run oprofile (on either branch, don't really
care) against the 32 client test and post the results?

Besides the other changes we discussed, I boosted scale to 150 and
ran at READ COMMITTED isolation level (because all threads promptly
crashed and burned at REPEATABLE READ -- we desperately need a
pgbench option to retry a transaction on serialization failure).
The oprofile hot spots at half a percent or higher:

CPU: Intel Core/i7, speed 2262 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with
a unit mask of 0x00 (No unit mask) count 100000
samples % image name symbol name
933394 4.9651 postgres AllocSetAlloc
848476 4.5134 postgres base_yyparse
719515 3.8274 postgres SearchCatCache
461275 2.4537 postgres hash_search_with_hash_value
426411 2.2682 postgres GetSnapshotData
322938 1.7178 postgres LWLockAcquire
322236 1.7141 postgres core_yylex
305471 1.6249 postgres MemoryContextAllocZeroAligned
281543 1.4976 postgres expression_tree_walker
270241 1.4375 postgres XLogInsert
234899 1.2495 postgres MemoryContextAlloc
210137 1.1178 postgres ScanKeywordLookup
184857 0.9833 postgres heap_page_prune
173608 0.9235 postgres hash_any
153011 0.8139 postgres _bt_compare
144538 0.7689 postgres nocachegetattr
131466 0.6993 postgres fmgr_info_cxt_security
131001 0.6968 postgres grouping_planner
130808 0.6958 postgres LWLockRelease
124112 0.6602 postgres PinBuffer
120745 0.6423 postgres LockAcquireExtended
112992 0.6010 postgres ExecInitExpr
112830 0.6002 postgres lappend
112311 0.5974 postgres new_list
110368 0.5871 postgres check_stack_depth
106036 0.5640 postgres AllocSetFree
102565 0.5456 postgres MemoryContextAllocZero
94689 0.5037 postgres SearchSysCache

That profile looks like you ran pgbench with -M simple. How does it look with prepared instead?

Andres


#12Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: anarazel@anarazel.de (#11)
Re: testing ProcArrayLock patches

"anarazel@anarazel.de" <andres@anarazel.de> wrote:

Kevin Grittner <Kevin.Grittner@wicourts.gov> schrieb:

samples % image name symbol name
933394 4.9651 postgres AllocSetAlloc
848476 4.5134 postgres base_yyparse
719515 3.8274 postgres SearchCatCache

That profile looks like you ran pgbench with -m simple. How does
it look with prepared instead?

samples % image name symbol name
495463 3.6718 postgres hash_search_with_hash_value
490971 3.6385 postgres GetSnapshotData
443965 3.2902 postgres LWLockAcquire
443566 3.2872 postgres AllocSetAlloc
302388 2.2409 postgres XLogInsert
286889 2.1261 postgres SearchCatCache
246417 1.8262 postgres PostgresMain
235018 1.7417 postgres heap_page_prune
198442 1.4706 postgres _bt_compare
181446 1.3447 postgres hash_any
177131 1.3127 postgres ExecInitExpr
175775 1.3026 postgres LWLockRelease
152324 1.1288 postgres PinBuffer
150285 1.1137 postgres exec_bind_message
145214 1.0762 postgres fmgr_info_cxt_security
140493 1.0412 postgres s_lock
124162 0.9201 postgres LockAcquireExtended
120429 0.8925 postgres MemoryContextAlloc
117076 0.8676 postgres pfree
116493 0.8633 postgres AllocSetFree
105027 0.7783 postgres pgstat_report_activity
101407 0.7515 postgres ProcArrayLockAcquire
100797 0.7470 postgres MemoryContextAllocZeroAligned
98360 0.7289 postgres ProcArrayLockRelease
86938 0.6443 postgres heap_hot_search_buffer
82635 0.6124 postgres hash_search
79902 0.5921 postgres errstart
79465 0.5889 postgres HeapTupleSatisfiesVacuum
78709 0.5833 postgres ResourceOwnerReleaseInternal
76068 0.5637 postgres ExecModifyTable
73043 0.5413 postgres heap_update
72175 0.5349 postgres strlcpy
71253 0.5280 postgres MemoryContextAllocZero

tps = 27392.219364 (including connections establishing)

-Kevin

#13Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Kevin Grittner (#12)
Re: testing ProcArrayLock patches

"Kevin Grittner" <Kevin.Grittner@wicourts.gov> wrote:

samples % image name symbol name
495463 3.6718 postgres hash_search_with_hash_value

When lines like these show up in the annotated version, I'm
impressed that we're still finding gains as big as we are:

44613 0.3306 : if (segp == NULL)
: hash_corrupted(hashp);

101910 0.7552 : keysize = hashp->keysize; /* ditto */

There goes over 1% of my server run time, right there!

Of course, these make no sense unless there is cache line
contention, which is why that area is bearing fruit.

-Kevin

#14Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Robert Haas (#8)
Re: testing ProcArrayLock patches

Robert Haas <robertmhaas@gmail.com> wrote:

I think so. My take was that it was showing 32 of 64 *threads*
active -- the hyperthreading funkiness. Is there something in
particular you'd like me to check?

Not really, just don't understand the number.

I'm having trouble resolving the vmstat numbers I got during the
32-client pgbench runs which modified data.

-M simple:

procs -----------memory---------- ---swap-- -----io----- ----system---- -----cpu------
 r  b  swpd   free      buff    cache   si   so    bi     bo      in      cs us sy id wa st
30  1  4464 513492 205564572 54472124    0    0     0  78170  621724 1246300 30  8 61  1  0
27  1  4464 509288 205564572 54474600    0    0     0 125620  599403 1192046 29  8 63  1  0
35  1  4464 508368 205564572 54476996    0    0     0  89801  595939 1186496 29  8 63  0  0
25  0  4464 506088 205564572 54478668    0    0     0  90121  594800 1189649 28  8 63  0  0

-M prepared:

procs -----------memory----------- ---swap-- -----io----- ----system---- -----cpu------
 r  b  swpd    free      buff    cache   si   so    bi     bo      in      cs us sy id wa st
28  0  5612 1204404 205107344 54230536    0    0     0  93212  527284 1456417 22  9 69  0  0
 8  1  5612 1202044 205107344 54233336    0    0     0  93217  512819 1417457 21  9 70  1  0
17  1  5612 1201892 205107344 54236048    0    0     0 132699  502333 1412878 21  9 70  0  0
19  1  5612 1199208 205107344 54238936    0    0     0  93612  519113 1484386 21  9 69  0  0

So 60% or 70% idle without any I/O wait time. I don't know how to
explain that.

-Kevin

#15Andres Freund
andres@anarazel.de
In reply to: Kevin Grittner (#13)
Re: testing ProcArrayLock patches

On Friday, November 18, 2011 08:36:59 PM Kevin Grittner wrote:

"Kevin Grittner" <Kevin.Grittner@wicourts.gov> wrote:

samples % image name symbol name
495463 3.6718 postgres hash_search_with_hash_value

When lines like these show up in the annotated version, I'm
impressed that we're still finding gains as big as we are:

44613 0.3306 : if (segp == NULL)

: hash_corrupted(hashp);

101910 0.7552 : keysize = hashp->keysize; /* ditto */

When doing line-level profiles I would suggest looking at the instructions.
Quite often the line shown doesn't have much to do with what's actually executed,
as the compiler tries to schedule instructions cleverly.
Also, in many situations the shown cost doesn't actually lie in the instruction
shown but in some previous one: the shown instruction e.g. has to wait for the
result of the earlier instructions. Pipelining makes that hard to observe
correctly.

A simplified example would be something like:

bool func(int a, int b, int c){
    int res = a / b;
    if(res == c){
        return true;
    }
    return false;
}

Likely the instruction showing up in the profile would be the comparison, which
obviously is not the really expensive part...

There goes over 1% of my server run time, right there!

Of course, these make no sense unless there is cache line
contention, which is why that area is bearing fruit.

I don't think cache line contention is the most likely candidate here. Simple
cache-misses seem far more likely. In combination with pipeline stalls...

Newer cpus (nehalem+) can measure stalled cycles which can be really useful
when analyzing performance. I don't remember how to do that with oprofile right
now though as I use perf these days (its -e stalled-cycles{frontend|backend}
there}).

Andres

#16Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Andres Freund (#15)
Re: testing ProcArrayLock patches

Andres Freund <andres@anarazel.de> wrote:

When doing line-level profiles I would suggest looking at the
instructions.

What's the best way to do that?

I don't think cache line contention is the most likely candidate
here. Simple cache-misses seem far more likely. In combination
with pipeline stalls...

Newer cpus (nehalem+) can measure stalled cycles which can be
really useful when analyzing performance. I don't remember how to
do that with oprofile right now though as I use perf these days
(its -e stalled-cycles{frontend|backend} there}).

When I run oprofile, I still always go back to this post by Tom:

http://archives.postgresql.org/pgsql-performance/2009-06/msg00154.php

Can anyone provide such a "cheat sheet" for perf? I could give that
a try if I knew how.

-Kevin

#17Andres Freund
andres@anarazel.de
In reply to: Kevin Grittner (#16)
Re: testing ProcArrayLock patches

On Friday, November 18, 2011 09:16:01 PM Kevin Grittner wrote:

Andres Freund <andres@anarazel.de> wrote:

When doing line-level profiles I would suggest looking at the
instructions.

What's the best way to do that?

I think opannotate -a -s produces output with instructions/code intermingled.

I don't think cache line contention is the most likely candidate
here. Simple cache-misses seem far more likely. In combination
with pipeline stalls...

Newer cpus (nehalem+) can measure stalled cycles which can be
really useful when analyzing performance. I don't remember how to
do that with oprofile right now though as I use perf these days
(its -e stalled-cycles{frontend|backend} there}).

When I run oprofile, I still always go back to this post by Tom:
http://archives.postgresql.org/pgsql-performance/2009-06/msg00154.php

Hrm. I am on the train and for unknown reasons the only sensible working
protocols are smtp + pop.... Waiting.... Waiting....
Sorry, too slow/high latency atm. I wrote everything below and another mail
and the page still hasn't loaded.

oprofile can produce graphs as well (--callgraph). For both tools you need
-fno-omit-frame-pointer to get usable graphs.

Can anyone provide such a "cheat sheet" for perf? I could give that
a try if I knew how.

Unfortunately for sensible results the kernel needs to be rather new.
I would say > 2.6.28 or so (just guessed).

# to record activity
perf record [-g|--call-graph] program|-p pid

# to view a summation
perf report

graph:
# Overhead   Command      Shared Object                                     Symbol
# ........  ........  .................  .........................................
#
     4.09%  postgres  postgres           [.] slab_alloc_dyn
            |
            --- slab_alloc_dyn
               |          
               |--18.52%-- new_list
               |          |          
               |          |--63.79%-- lappend
               |          |          |          
               |          |          |--13.40%-- find_usable_indexes
               |          |          |          create_index_paths
               |          |          |          set_rel_pathlist
               |          |          |          make_one_rel

flat:

# Overhead   Command      Shared Object                                     Symbol
# ........  ........  .................  .........................................
#
     5.10%  postgres  [vdso]             [.] 0x7ffff3d8d770
     4.26%  postgres  postgres           [.] base_yyparse
     3.88%  postgres  postgres           [.] slab_alloc_dyn
     2.82%  postgres  postgres           [.] core_yylex
     2.37%  postgres  postgres           [.] SearchCatCache
     1.85%  postgres  libc-2.13.so       [.] __memcpy_ssse3
     1.66%  postgres  libc-2.13.so       [.] __GI___strcmp_ssse3
     1.23%  postgres  postgres           [.] MemoryContextAlloc

# to view a line/source/instruction level view
perf annotate -l symbol

...
:
: /*
: * one-time startup overhead for each cache
: */
: if (cache->cc_tupdesc == NULL)
0.35 : 6e81fd: 48 83 7f 28 00 cmpq $0x0,0x28(%rdi)
/home/andres/src/postgresql/build/optimize/../../src/backend/utils/cache/catcache.c:1070
4.15 : 6e8202: 0f 84 54 04 00 00 je 6e865c <SearchCatCache+0x47c>
: #endif
:
: /*
: * initialize the search key information
: */
: memcpy(cur_skey, cache->cc_skey, sizeof(cur_skey));
0.00 : 6e8208: 48 8d bd a0 fe ff ff lea -0x160(%rbp),%rdi
0.17 : 6e820f: 49 8d 77 70 lea 0x70(%r15),%rsi
0.00 : 6e8213: b9 24 00 00 00 mov $0x24,%ecx
/home/andres/src/postgresql/build/optimize/../../src/backend/utils/cache/catcache.c:1080
33.22 : 6e8218: f3 48 a5 rep movsq %ds:(%rsi),%es:(%rdi)
: cur_skey[0].sk_argument = v1;
/home/andres/src/postgresql/build/optimize/../../src/backend/utils/cache/catcache.c:1081
1.56 : 6e821b: 48 89 9d e0 fe ff ff mov %rbx,-0x120(%rbp)
...

# get heaps of stats from something
perf stat -ddd someprogram|-p pid

        1242.409965 task-clock                #    0.824 CPUs utilized          [100.00%]
             14,572 context-switches          #    0.012 M/sec                  [100.00%]
                264 CPU-migrations            #    0.000 M/sec                  [100.00%]
                  0 page-faults               #    0.000 M/sec
      2,854,775,135 cycles                    #    2.298 GHz                    [26.28%]
    <not supported> stalled-cycles-frontend
    <not supported> stalled-cycles-backend
      2,024,997,785 instructions              #    0.71  insns per cycle        [25.25%]
        387,240,903 branches                  #  311.685 M/sec                  [26.51%]
         21,756,886 branch-misses             #    5.62% of all branches        [26.26%]
        753,912,137 L1-dcache-loads           #  606.814 M/sec                  [13.24%]
         52,733,007 L1-dcache-load-misses     #    6.99% of all L1-dcache hits  [14.72%]
         35,006,406 LLC-loads                 #   28.176 M/sec                  [15.46%]
             26,673 LLC-load-misses           #    0.08% of all LL-cache hits   [13.38%]
      1,855,654,347 L1-icache-loads           # 1493.593 M/sec                  [12.63%]
         52,169,033 L1-icache-load-misses     #    2.81% of all L1-icache hits  [12.88%]
        761,475,250 dTLB-loads                #  612.902 M/sec                  [13.37%]
          4,457,558 dTLB-load-misses          #    0.59% of all dTLB cache hits [13.12%]
      2,049,753,137 iTLB-loads                # 1649.820 M/sec                  [20.09%]
          4,139,394 iTLB-load-misses          #    0.20% of all iTLB cache hits [19.31%]
          3,705,429 L1-dcache-prefetches      #    2.982 M/sec                  [19.64%]
    <not supported> L1-dcache-prefetch-misses

        1.507855345 seconds time elapsed

-r can repeat a command and gives you the standard deviation...

# show what the system is executing overall
perf top -az

# get help
perf help (record|report|annotate|stat|...)

In new versions many commands (those that produce pageable text) take --stdio
and --tui to select between two interfaces. I personally find --tui unusable.

I am not really sure how good the results are compared to oprofile; I
just prefer the UI by far. Also the overhead seems to be measurably
smaller, and it's usable by every user, not just root...

Hope that suffices? I have no problem answering further questions, so ...

Andres

#18Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#17)
Re: testing ProcArrayLock patches

On Friday, November 18, 2011 11:12:02 PM Andres Freund wrote:

On Friday, November 18, 2011 09:16:01 PM Kevin Grittner wrote:

Andres Freund <andres@anarazel.de> wrote:

When doing line-level profiles I would suggest looking at the
instructions.

What's the best way to do that?

I think opannotate -a -s produces output with instructions/code
intermingled.

I don't think cache line contention is the most likely candidate
here. Simple cache-misses seem far more likely. In combination
with pipeline stalls...

Newer cpus (nehalem+) can measure stalled cycles which can be
really useful when analyzing performance. I don't remember how to
do that with oprofile right now though as I use perf these days
(it's -e stalled-cycles-{frontend|backend} there).

When I run oprofile, I still always go back to this post by Tom:
http://archives.postgresql.org/pgsql-performance/2009-06/msg00154.php

Hrm. I am on the train and for unknown reasons the only sensible working
protocols are smtp + pop.... Waiting.... Waiting....
Sorry, too slow/high latency atm. I wrote everything below and another mail
and the page still hasn't loaded.

oprofile can produce graphs as well (--callgraph). For both tools you
need -fno-omit-frame-pointer to get usable graphs.

Can anyone provide such a "cheat sheet" for perf? I could give that
a try if I knew how.

Unfortunately for sensible results the kernel needs to be rather new.
I would say > 2.6.28 or so (just guessed).

# to record activity
perf record [-g|--call-graph] program|-p pid

# to view a summation
perf report

# get heaps of stats from something
perf stat -ddd someprogram|-p pid

# show what the system is executing overall
perf top -az

# get help
perf help (record|report|annotate|stat|...)

...
I forgot that there is also

# get a list of event types
perf list

# measure something for a specific event
perf (record|stat|top) -e some_event_type

Andres

#19Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Andres Freund (#17)
1 attachment(s)
Re: testing ProcArrayLock patches

Andres Freund <andres@anarazel.de> wrote:

I think opannotate -a -s produces output with instructions/code
intermingled.

Thanks. I'll check out perf later (thanks for the tips!), but for
now, here's the function which was at the top of my oprofile
results, annotated with those options. I'm afraid it's a bit
intimidating to me -- the last time I did much with X86 assembly
language was in the mid-80s, on an 80286. :-/ Hopefully, since
this is at the top of the oprofile results when running with
prepared statements, it will be of use to somebody.

The instructions which are shown as having that 1% still seem odd to
me, but as you say, they were probably actually waiting for some
previous operation to finish:

43329 0.3211 : 70b56a: test %rbp,%rbp

99903 0.7404 : 70b58a: mov %rax,0x18(%rsp)

If anyone wants any other detail from what I captured, let me know.

-Kevin

Attachments:

opannotate-hash_search_with_hash_value.txt (text/plain)
#20Robert Haas
robertmhaas@gmail.com
In reply to: Kevin Grittner (#10)
Re: testing ProcArrayLock patches

On Fri, Nov 18, 2011 at 2:05 PM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:

Robert Haas <robertmhaas@gmail.com> wrote:

Any chance you can run oprofile (on either branch, don't really
care) against the 32 client test and post the results?

[ oprofile results ]

Hmm. That looks a lot like a profile with no lock contention at all.
Since I see XLogInsert in there, I assume this must be a pgbench write
test on unlogged tables? How close am I?

I was actually thinking it would be interesting to oprofile the
read-only test; see if we can figure out where those slowdowns are
coming from.

Two runs:

tps = 21946.961196 (including connections establishing)
tps = 22911.873227 (including connections establishing)

For write transactions, that seems pretty respectable.

Very. What do you get without the patch?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#21Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Robert Haas (#20)
Re: testing ProcArrayLock patches

Robert Haas <robertmhaas@gmail.com> wrote:

Hmm. That looks a lot like a profile with no lock contention at
all. Since I see XLogInsert in there, I assume this must be a
pgbench write test on unlogged tables? How close am I?

Not unless pgbench on HEAD does that by default. Here are the
relevant statements:

$prefix/bin/pgbench -i -s 150
$prefix/bin/pgbench -T $time -c $clients -j $clients >>$resultfile

Perhaps the Intel cores implement the relevant primitives better?
Maybe I didn't run the profile or the reports the right way?

I was actually thinking it would be interesting to oprofile the
read-only test; see if we can figure out where those slowdowns are
coming from.

I'll plan on doing that this weekend.

tps = 21946.961196 (including connections establishing)
tps = 22911.873227 (including connections establishing)

For write transactions, that seems pretty respectable.

Very. What do you get without the patch?

[quick runs a couple tests that way]

Single run with -M simple:

tps = 23018.314292 (including connections establishing)

Single run with -M prepared:

tps = 27910.621044 (including connections establishing)

So, the patch appears to hinder performance in this environment,
although certainty is quite low with so few samples. I'll schedule
a spectrum of runs before I leave this evening (very soon).

-Kevin

#22Robert Haas
robertmhaas@gmail.com
In reply to: Kevin Grittner (#21)
Re: testing ProcArrayLock patches

On Fri, Nov 18, 2011 at 6:46 PM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:

tps = 21946.961196 (including connections establishing)
tps = 22911.873227 (including connections establishing)

For write transactions, that seems pretty respectable.

Very.  What do you get without the patch?

[quick runs a couple tests that way]

Single run with -M simple:

tps = 23018.314292 (including connections establishing)

Single run with -M prepared:

tps = 27910.621044 (including connections establishing)

So, the patch appears to hinder performance in this environment,
although certainty is quite low with so few samples.  I'll schedule
a spectrum of runs before I leave this evening (very soon).

Hmm. There's obviously something that's different in your environment
or configuration from what I tested, but I don't know what it is. The
fact that your scale factor is larger than shared_buffers might
matter; or Intel vs. AMD. Or maybe you're running with
synchronous_commit=on?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#23Andres Freund
andres@anarazel.de
In reply to: Kevin Grittner (#19)
Re: testing ProcArrayLock patches

On Saturday, November 19, 2011 12:18:07 AM Kevin Grittner wrote:

Andres Freund <andres@anarazel.de> wrote:

I think opannotate -a -s produces output with instructions/code
intermingled.

Thanks. I'll check out perf later (thanks for the tips!), but for
now, here's the function which was at the top of my oprofile
results, annotated with those options. I'm afraid it's a bit
intimidating to me -- the last time I did much with X86 assembly
language was in the mid-80s, on an 80286. :-/

While my assembly knowledge surely isn't from the 80s, be assured that I find it
intimidating as well ;)

Hopefully, since
this is at the top of the oprofile results when running with
prepared statements, it will be of use to somebody.

I think hash_search_with_hash_value is rather noticeable in the profiles
in quite a few situations. Even without concurrency...

Looking at your annotation output the code seems to be almost entirely stalled
waiting for memory.
The first stall is after the first memory read, which is likely to be
uncached (the first cacheline of the HTAB is accessed before, but that will be
in the cache). The interesting thing is that I would have expected a higher
likelihood of this staying in the cache.
2225 0.0165 : 70b543: mov (%rdi),%r15
:static inline uint32
:calc_bucket(HASHHDR *hctl, uint32 hash_val)
:{
: uint32 bucket;
:
: bucket = hash_val & hctl->high_mask;
4544 0.0337 : 70b546: and 0x2c(%r15),%ebx
: if (bucket > hctl->max_bucket)
53409 0.3958 : 70b54a: cmp 0x28(%r15),%ebx
: 70b54e: jbe 70b554 <hash_search_with_hash_value+0x34>

So a stall here is not that surprising.

Here we fetch data from memory which is unlikely to be prefetchable and then
require the result from that fetch. Note how segp = hashp->dir[segment_num];
is distributed over line 52, 64, 83.

: segp = hashp->dir[segment_num];
2062 0.0153 : 70b562: shr %cl,%eax
309 0.0023 : 70b564: mov %eax,%eax
643 0.0048 : 70b566: mov (%rdx,%rax,8),%rbp
:
: if (segp == NULL)
43329 0.3211 : 70b56a: test %rbp,%rbp

The next cacheline is referenced here. Again, a fetch from memory whose result
is needed soon after to continue.
Unless I misunderstood the code flow, this disproves my theory that we might
have many collisions, as that test seems to be outside the loop:
: prevBucketPtr = &segp[segment_ndx];
: currBucket = *prevBucketPtr;
122 9.0e-04 : 70b586: mov 0x0(%rbp),%rbx
:
: /*
: * Follow collision chain looking for matching key
: */
: match = hashp->match; /* save one fetch in inner
loop */
: keysize = hashp->keysize; /* ditto */
99903 0.7404 : 70b58a: mov %rax,0x18(%rsp)
:
: while (currBucket != NULL)
1066 0.0079 : 70b58f: test %rbx,%rbx

Line 136 is the first time the contents of the current bucket are needed. That's
why the test is so noticeable.
: currBucket = *prevBucketPtr;
655 0.0049 : 70b5a3: mov (%rbx),%rbx
: * Follow collision chain looking for matching key
: */
: match = hashp->match; /* save one fetch in inner
loop */
: keysize = hashp->keysize; /* ditto */
:
: while (currBucket != NULL)
608 0.0045 : 70b5a6: test %rbx,%rbx
: 70b5a9: je 70b5d0 <hash_search_with_hash_value+0xb0>
: {
: if (currBucket->hashvalue == hashvalue &&
3504 0.0260 : 70b5ab: cmp %r12d,0x8(%rbx)
98486 0.7299 : 70b5af: nop
1233 0.0091 : 70b5b0: jne 70b5a0 <hash_search_with_hash_value+0x80>

That covers all the slow points in the function. And unless I am missing
something those are all the fetched cachelines of that function... For
HASH_FIND that is.

So I think that reinforces my belief that ordinary cache misses are the culprit
here. Which is to be expected in a hashtable...
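A minimal Python sketch of the lookup path being described may make the dependent-load chain easier to see. Names follow the annotated C code above; the low_mask fallback and the segment arithmetic are assumptions about dynahash's layout, and the structures are simplified stand-ins. Each commented load depends on the result of the previous one, which is why the profile shows the stalls at the compares rather than at the loads themselves:

```python
# Simplified sketch of the HASH_FIND path through a dynahash-style table.
# Hypothetical structures; field names mirror the annotated C code above.
from dataclasses import dataclass, field

@dataclass
class Entry:
    hashvalue: int
    key: int
    next: "Entry | None" = None

@dataclass
class HTab:
    high_mask: int            # hctl->high_mask
    max_bucket: int           # hctl->max_bucket
    low_mask: int             # assumed fallback mask
    sshift: int               # log2(entries per segment), assumed
    ssize: int                # entries per segment, assumed
    dir: list = field(default_factory=list)

def hash_find(hashp: HTab, hashvalue: int, key: int):
    bucket = hashvalue & hashp.high_mask      # load #1: header cache line
    if bucket > hashp.max_bucket:
        bucket &= hashp.low_mask
    segment_num = bucket >> hashp.sshift
    segment_ndx = bucket & (hashp.ssize - 1)
    segp = hashp.dir[segment_num]             # load #2: directory entry
    curr = segp[segment_ndx]                  # load #3: bucket head
    while curr is not None:                   # each hop: another dependent load
        if curr.hashvalue == hashvalue and curr.key == key:
            return curr
        curr = curr.next
    return None

# Tiny demo: one segment of 4 buckets holding a single entry.
e = Entry(hashvalue=5, key=42)
tab = HTab(high_mask=3, max_bucket=3, low_mask=1, sshift=2, ssize=4,
           dir=[[None, e, None, None]])
```

Three pointer-dependent fetches before the first key compare, plus one per collision-chain hop, matches the cache-miss stalls seen in the annotation.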

Andres

PS: No idea whether that rambling made sense to anyone... But I looked at that
function for the first time ;)

#24Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Robert Haas (#22)
Re: testing ProcArrayLock patches

Robert Haas <robertmhaas@gmail.com> wrote:

Hmm. There's obviously something that's different in your
environment or configuration from what I tested, but I don't know
what it is. The fact that your scale factor is larger than
shared_buffers might matter; or Intel vs. AMD. Or maybe you're
running with synchronous_commit=on?

Yes, I had synchronous_commit = on for these runs. Here are the
settings:

cat >> $PGDATA/postgresql.conf <<EOM;
max_connections = 200
max_pred_locks_per_transaction = 256
shared_buffers = 10GB
maintenance_work_mem = 1GB
checkpoint_segments = 300
checkpoint_timeout = 15min
checkpoint_completion_target = 0.9
wal_writer_delay = 20ms
seq_page_cost = 0.1
random_page_cost = 0.1
cpu_tuple_cost = 0.05
effective_cache_size = 40GB
default_transaction_isolation = '$iso'
EOM

Is there any chance that having pg_xlog on a separate RAID 10 set of
drives with its own BBU controller would explain anything? I mean,
I always knew that was a good idea for a big, heavily-loaded box,
but I remember being surprised at how *big* a difference that made
when a box accidentally went into production without moving the
pg_xlog directory there.

There is one other thing which might matter: I didn't use the -n
pgbench option, and in the sample you showed, you were using it.

Here is the median of five from the latest runs. On these
read/write tests there was very little spread within each set of
five samples, with no extreme outliers like I had on the SELECT-only
tests. In the first position s means simple protocol and p means
prepared protocol. In the second position m means master, f means
with the flexlock patch.

sm1 tps = 1092.269228 (including connections establishing)
sf1 tps = 1090.511552 (including connections establishing)
sm2 tps = 2171.867100 (including connections establishing)
sf2 tps = 2158.609189 (including connections establishing)
sm4 tps = 4278.541453 (including connections establishing)
sf4 tps = 4269.921594 (including connections establishing)
sm8 tps = 8472.257182 (including connections establishing)
sf8 tps = 8476.150588 (including connections establishing)
sm16 tps = 15905.074160 (including connections establishing)
sf16 tps = 15937.372689 (including connections establishing)
sm32 tps = 22331.817413 (including connections establishing)
sf32 tps = 22861.258757 (including connections establishing)
sm64 tps = 26388.391614 (including connections establishing)
sf64 tps = 26529.152361 (including connections establishing)
sm80 tps = 25617.651194 (including connections establishing)
sf80 tps = 26560.541237 (including connections establishing)
sm96 tps = 24105.455175 (including connections establishing)
sf96 tps = 26569.244384 (including connections establishing)
sm128 tps = 21467.530210 (including connections establishing)
sf128 tps = 25883.023093 (including connections establishing)

pm1 tps = 1629.265970 (including connections establishing)
pf1 tps = 1619.024905 (including connections establishing)
pm2 tps = 3164.061963 (including connections establishing)
pf2 tps = 3137.469377 (including connections establishing)
pm4 tps = 6114.787505 (including connections establishing)
pf4 tps = 6061.750200 (including connections establishing)
pm8 tps = 11884.534375 (including connections establishing)
pf8 tps = 11870.670086 (including connections establishing)
pm16 tps = 20575.737107 (including connections establishing)
pf16 tps = 20437.648809 (including connections establishing)
pm32 tps = 27664.381103 (including connections establishing)
pf32 tps = 28046.846479 (including connections establishing)
pm64 tps = 26764.294547 (including connections establishing)
pf64 tps = 26631.589294 (including connections establishing)
pm80 tps = 27716.198263 (including connections establishing)
pf80 tps = 28393.642871 (including connections establishing)
pm96 tps = 26616.076293 (including connections establishing)
pf96 tps = 28055.921427 (including connections establishing)
pm128 tps = 23282.912620 (including connections establishing)
pf128 tps = 23072.766829 (including connections establishing)

Note that on this 32 core box, performance on the read/write pgbench
is peaking at 64 clients, but without a lot of variance between 32
and 96 clients. And with the patch, performance still hasn't fallen
off too badly at 128 clients. This is good news in terms of not
having to sweat connection pool sizing quite as much as earlier
releases.

Next I will get the profile for the SELECT-only runs. It seems to
make sense to profile at the peak performance level, which was 64
clients.

I can run one more set of tests tonight before I have to give it
back to the guy who's putting it into production. It sounds like a
set like the above except with synchronous_commit = off might be
desirable?

-Kevin

#25Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Robert Haas (#20)
Re: testing ProcArrayLock patches

Robert Haas <robertmhaas@gmail.com> wrote:

I was actually thinking it would be interesting to oprofile the
read-only test; see if we can figure out where those slowdowns are
coming from.

CPU: Intel Core/i7, speed 2262 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with
a unit mask of 0x00 (No unit mask) count 100000
samples % image name symbol name
3124242 5.7137 postgres s_lock
2555554 4.6737 postgres AllocSetAlloc
2403412 4.3954 postgres GetSnapshotData
1967132 3.5975 postgres SearchCatCache
1872176 3.4239 postgres base_yyparse
1327256 2.4273 postgres hash_search_with_hash_value
1040131 1.9022 postgres _bt_compare
1038976 1.9001 postgres LWLockAcquire
817122 1.4944 postgres MemoryContextAllocZeroAligned
738321 1.3503 postgres core_yylex
622613 1.1386 postgres MemoryContextAlloc
597054 1.0919 postgres PinBuffer
556138 1.0171 postgres ScanKeywordLookup
552318 1.0101 postgres expression_tree_walker
494279 0.9039 postgres LWLockRelease
488628 0.8936 postgres hash_any
472906 0.8649 postgres nocachegetattr
396482 0.7251 postgres grouping_planner
382974 0.7004 postgres LockAcquireExtended
375186 0.6861 postgres AllocSetFree
375072 0.6859 postgres ProcArrayLockRelease
373668 0.6834 postgres new_list
365917 0.6692 postgres fmgr_info_cxt_security
301398 0.5512 postgres ProcArrayLockAcquire
300647 0.5498 postgres LockReleaseAll
292073 0.5341 postgres DirectFunctionCall1Coll
285745 0.5226 postgres MemoryContextAllocZero
284684 0.5206 postgres FunctionCall2Coll
282701 0.5170 postgres SearchSysCache
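As a rough gauge of how much of this read-only profile is lock traffic, the clearly lock-related symbols can simply be summed (percentages copied from the listing above; which symbols count as "lock-related" is a judgment call):

```python
# Sample percentages copied from the oprofile listing above.
lock_symbols = {
    "s_lock": 5.7137,
    "LWLockAcquire": 1.9001,
    "LWLockRelease": 0.9039,
    "ProcArrayLockRelease": 0.6859,
    "ProcArrayLockAcquire": 0.5512,
}
lock_pct = sum(lock_symbols.values())
print(f"lock-related samples: {lock_pct:.2f}%")  # about 9.75% of all samples
```

So roughly a tenth of the samples land in locking primitives even in this "no visible contention" profile, with s_lock alone accounting for more than half of that.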

max_connections = 100
max_pred_locks_per_transaction = 64
shared_buffers = 8GB
maintenance_work_mem = 1GB
checkpoint_segments = 300
checkpoint_timeout = 15min
checkpoint_completion_target = 0.9
wal_writer_delay = 20ms
seq_page_cost = 0.1
random_page_cost = 0.1
cpu_tuple_cost = 0.05
effective_cache_size = 40GB
default_transaction_isolation = '$iso'

pgbench -i -s 100
pgbench -S -M simple -T 300 -c 80 -j 80

transaction type: SELECT only
scaling factor: 100
query mode: simple
number of clients: 80
number of threads: 80
duration: 300 s
number of transactions actually processed: 104391011
tps = 347964.636256 (including connections establishing)
tps = 347976.389034 (excluding connections establishing)

vmstat 1 showed different behavior this time -- no clue why.

procs --------------memory------------- ---swap-- -----io---- ---system---- -----cpu------
 r  b swpd    free      buff     cache   si so bi bo    in      cs us sy id wa st
91  0 8196 4189436 203925700 52314492    0  0  0  0 32255 1522807 85 13  1  0  0
92  0 8196 4189404 203925700 52314492    0  0  0  0 32796 1525463 85 14  1  0  0
67  0 8196 4189404 203925700 52314488    0  0  0  0 32343 1527988 85 13  1  0  0
93  0 8196 4189404 203925700 52314488    0  0  0  0 32701 1535827 85 13  1  0  0

-Kevin

#26Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Kevin Grittner (#24)
Re: testing ProcArrayLock patches

"Kevin Grittner" <Kevin.Grittner@wicourts.gov> wrote:

I can run one more set of tests tonight before I have to give it
back to the guy who's putting it into production. It sounds like
a set like the above except with synchronous_commit = off might be
desirable?

OK, that's what I did. This gave me my best numbers yet for an
updating run of pgbench: tps = 38039.724212 for prepared statements
using the flexlock patch. This patch is a clear win when you get to
16 clients or more.

sm1 tps = 1312.501168 (including connections establishing)
sf1 tps = 1376.678293 (including connections establishing)
sm2 tps = 2705.571856 (including connections establishing)
sf2 tps = 2689.577938 (including connections establishing)
sm4 tps = 5461.403557 (including connections establishing)
sf4 tps = 5447.363103 (including connections establishing)
sm8 tps = 10524.695338 (including connections establishing)
sf8 tps = 10448.012069 (including connections establishing)
sm16 tps = 18952.968472 (including connections establishing)
sf16 tps = 18969.505631 (including connections establishing)
sm32 tps = 27392.393850 (including connections establishing)
sf32 tps = 29225.974112 (including connections establishing)
sm64 tps = 28947.675549 (including connections establishing)
sf64 tps = 31417.536816 (including connections establishing)
sm80 tps = 28053.684182 (including connections establishing)
sf80 tps = 29970.555401 (including connections establishing)
sm96 tps = 25885.679957 (including connections establishing)
sf96 tps = 28581.271436 (including connections establishing)
sm128 tps = 22261.902571 (including connections establishing)
sf128 tps = 24537.566960 (including connections establishing)

pm1 tps = 2082.958841 (including connections establishing)
pf1 tps = 2052.328339 (including connections establishing)
pm2 tps = 4287.257860 (including connections establishing)
pf2 tps = 4228.770795 (including connections establishing)
pm4 tps = 8653.196863 (including connections establishing)
pf4 tps = 8592.091631 (including connections establishing)
pm8 tps = 16071.432101 (including connections establishing)
pf8 tps = 16196.992207 (including connections establishing)
pm16 tps = 27146.441216 (including connections establishing)
pf16 tps = 27441.966562 (including connections establishing)
pm32 tps = 34983.352396 (including connections establishing)
pf32 tps = 38039.724212 (including connections establishing)
pm64 tps = 33182.643501 (including connections establishing)
pf64 tps = 34193.732669 (including connections establishing)
pm80 tps = 30686.712607 (including connections establishing)
pf80 tps = 33336.011769 (including connections establishing)
pm96 tps = 24692.015615 (including connections establishing)
pf96 tps = 32907.472665 (including connections establishing)
pm128 tps = 24164.441954 (including connections establishing)
pf128 tps = 25742.670928 (including connections establishing)

At lower client numbers the tps values within each set of five
samples were very tightly grouped. With either protocol, and
whether or not the patch was applied, the higher concurrency groups
tended to be bifurcated within a set of five samples between "good"
and "bad" numbers. The patch seemed to increase the number of
clients which could be handled without collapse into the bad
numbers. It really looks like there's some sort of performance
"collapse" at higher concurrency which may or may not happen in any
particular five minute run. Just as one example, running the simple
protocol with the flexlock patch:

tps = 24491.653873 (including connections establishing)
tps = 24537.566960 (including connections establishing)
tps = 28462.276323 (including connections establishing)
tps = 24403.373002 (including connections establishing)
tps = 28458.902549 (including connections establishing)
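The "median of five" reported in the tables can be reproduced from samples like these; the middle value of this set is exactly the sf128 figure in the earlier table, while the split between the low and high clusters shows the bifurcation being described:

```python
import statistics

# The five sf128 samples quoted above (simple protocol, flexlock patch).
samples = [24491.653873, 24537.566960, 28462.276323,
           24403.373002, 28458.902549]

median_tps = statistics.median(samples)
print(median_tps)  # middle value: 24537.56696, the sf128 entry above
```

The median sits with the three "bad" runs near 24.5k tps, well below the two "good" runs near 28.5k, which is why a median is a more honest summary here than a mean.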

-Kevin

#27Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Kevin Grittner (#26)
Re: testing ProcArrayLock patches

On Mon, Nov 21, 2011 at 10:44 PM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:

"Kevin Grittner" <Kevin.Grittner@wicourts.gov> wrote:

I can run one more set of tests tonight before I have to give it
back to the guy who's putting it into production.  It sounds like
a set like the above except with synchronous_commit = off might be
desirable?

OK, that's what I did.  This gave me my best numbers yet for an
updating run of pgbench: tps = 38039.724212 for prepared statements
using the flexlock patch.  This patch is a clear win when you get to
16 clients or more.

It will be a great help if you could spare a few minutes to also test
the patch to take out the frequently accessed PGPROC members to a
different array. We are seeing good improvements on HPUX IA platform
and the AMD Opteron and it will be interesting to know what happens on
the Intel platform too.

http://archives.postgresql.org/message-id/4EB7C4C9.9070309@enterprisedb.com

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com

#28Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Pavan Deolasee (#27)
Re: testing ProcArrayLock patches

Pavan Deolasee <pavan.deolasee@gmail.com> wrote:

It will be a great help if you could spare a few minutes to also
test the patch to take out the frequently accessed PGPROC members
to a different array. We are seeing good improvements on HPUX IA
platform and the AMD Opteron and it will be interesting to know
what happens on the Intel platform too.

http://archives.postgresql.org/message-id/4EB7C4C9.9070309@enterprisedb.com

It's going to be hard to arrange more of the 20-hours runs I've been
doing, but I can work in some more abbreviated tests. What would be
the best test for this? (I would hate to try and find out I didn't
exercise the right code path.)

-Kevin

#29Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Kevin Grittner (#28)
Re: testing ProcArrayLock patches

On Mon, Nov 21, 2011 at 11:01 PM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:

Pavan Deolasee <pavan.deolasee@gmail.com> wrote:

It will be a great help if you could spare a few minutes to also
test the patch to take out the frequently accessed PGPROC members
to a different array. We are seeing good improvements on HPUX IA
platform and the AMD Opteron and it will be interesting to know
what happens on the Intel platform too.

http://archives.postgresql.org/message-id/4EB7C4C9.9070309@enterprisedb.com

It's going to be hard to arrange more of the 20-hours runs I've been
doing, but I can work in some more abbreviated tests.  What would be
the best test for this?  (I would hate to try and find out I didn't
exercise the right code path.)

I think 2-3 runs with 32 and 128 clients each with prepared statements
should suffice to quickly compare with the other numbers you posted
for the master.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com

#30Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Pavan Deolasee (#27)
Re: testing ProcArrayLock patches

Pavan Deolasee <pavan.deolasee@gmail.com> wrote:

It will be a great help if you could spare a few minutes to also
test the patch to take out the frequently accessed PGPROC members
to a different array. We are seeing good improvements on HPUX IA
platform and the AMD Opteron and it will be interesting to know
what happens on the Intel platform too.

For a read only comparison (which was run using the simple
protocol), using identical settings to the previous master run, but
with the PGPROC split patch:

m32 tps = 201738.209348 (including connections establishing)
p32 tps = 201620.966988 (including connections establishing)

m128 tps = 352159.631878 (including connections establishing)
p128 tps = 363998.703900 (including connections establishing)

Clearly a win at 128 clients; not at 32.

For updates:

sm32 tps = 27392.393850 (including connections establishing)
sp32 tps = 27995.784333 (including connections establishing)

sm128 tps = 22261.902571 (including connections establishing)
sp128 tps = 23690.408272 (including connections establishing)

pm32 tps = 34983.352396 (including connections establishing)
pp32 tps = 36076.373389 (including connections establishing)

pm128 tps = 24164.441954 (including connections establishing)
pp128 tps = 27070.824588 (including connections establishing)

That's a pretty decisive win all around.
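To put numbers on "decisive win", the 128-client gains implied by the figures just quoted can be computed directly; this is nothing more than a ratio of the patched and master tps values:

```python
def gain_pct(patched, master):
    """Percentage improvement of the patched build over master."""
    return (patched / master - 1.0) * 100.0

# 128-client figures copied from the results above.
read_only   = gain_pct(363998.703900, 352159.631878)  # read-only, simple
simple_rw   = gain_pct(23690.408272, 22261.902571)    # read/write, simple
prepared_rw = gain_pct(27070.824588, 24164.441954)    # read/write, prepared
```

That works out to roughly a 3% gain on the read-only test and 6-12% on the read/write tests at 128 clients.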

-Kevin

#31Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Kevin Grittner (#30)
Re: testing ProcArrayLock patches

On Tue, Nov 22, 2011 at 4:40 AM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:

Pavan Deolasee <pavan.deolasee@gmail.com> wrote:

It will be a great help if you could spare a few minutes to also
test the patch to take out the frequently accessed PGPROC members
to a different array. We are seeing good improvements on HPUX IA
platform and the AMD Opteron and it will be interesting to know
what happens on the Intel platform too.

For a read only comparison (which was run using the simple
protocol), using identical settings to the previous master run, but
with the PGPROC split patch:

m32 tps = 201738.209348 (including connections establishing)
p32 tps = 201620.966988 (including connections establishing)

m128 tps = 352159.631878 (including connections establishing)
p128 tps = 363998.703900 (including connections establishing)

Clearly a win at 128 clients; not at 32.

For updates:

sm32 tps = 27392.393850 (including connections establishing)
sp32 tps = 27995.784333 (including connections establishing)

sm128 tps = 22261.902571 (including connections establishing)
sp128 tps = 23690.408272 (including connections establishing)

pm32 tps = 34983.352396 (including connections establishing)
pp32 tps = 36076.373389 (including connections establishing)

pm128 tps = 24164.441954 (including connections establishing)
pp128 tps = 27070.824588 (including connections establishing)

That's a pretty decisive win all around.

Thanks for running those tests. The numbers are not that bad, but
definitely not as good as we saw on some other platforms. But it's
possible that they may improve in percentage terms with even more
clients on this box. And given that we are seeing big gains on other
platforms, hopefully this will give us confidence to proceed with the
patch.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com

#32Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Pavan Deolasee (#31)
Re: testing ProcArrayLock patches

Pavan Deolasee <pavan.deolasee@gmail.com> wrote:

The numbers are not that bad, but definitely not as good as we saw
on some other platforms.

Well, this machine is definitely designed to hold up under high
concurrency. As I understand it, each core is the memory manager
for two 4GB DIMMs, with two channels to them, each with two buffers.
The way the cores are connected, a core never needs to go through
more than one other core to get to memory not directly managed, and
that uses "snoop" technology which hands the cached data right over
from one core to the other when possible, rather than making the
core which now owns the cache line pull it from RAM. It seems the
2.6.32 kernel is able to manage that technology in a reasonable
fashion.

At first I was surprised to see performance top out on the update
tests between 80 and 96 clients. But then, that lands almost
exactly where my old reliable ((2 * core count) + effective spindle
count) would predict. The SELECT only tests peaked at 64 clients,
but those were fully cached, so effective spindle count was zero,
again fitting the formula. So these optimizations seem to me to
break down the barriers which had previously capped the number of
clients which could be handled, letting them peak at their "natural"
levels.
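Kevin's rule of thumb is easy to state as a function. The 32-core count is from the thread; treating a fully cached workload as having zero effective spindles is his own reading, and the 24-spindle figure below is purely an illustrative assumption:

```python
def predicted_peak_clients(core_count, effective_spindle_count):
    # Kevin's rule of thumb: ((2 * core count) + effective spindle count)
    return 2 * core_count + effective_spindle_count

# 32 cores, fully cached SELECT-only workload: zero effective spindles.
predicted_peak_clients(32, 0)   # 64, where the SELECT-only tests peaked

# With some hypothetical number of effective spindles (e.g. 24), the
# prediction lands in the 80-96 range where the update tests peaked.
predicted_peak_clients(32, 24)
```

The formula's value here is as a sanity check: once the lock-contention barriers are removed, throughput peaks about where the hardware model says it should.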

But it's possible that they may improve in percentage terms with
even more clients on this box.

I think so; this box is just so scalable that at 128 clients we were
only barely getting past the "knee" in the performance graphs to
where these patches help most.

-Kevin