dynamic shared memory and locks
One of the things that you might want to do with dynamic shared memory
is store a lock in it. In fact, my bet is that almost everything that
uses dynamic shared memory will want to do precisely that, because, of
course, it's dynamic *shared* memory, which means that it is
concurrently accessed by multiple processes, which tends to require
locking. Typically, what you're going to want are either spinlocks
(for very short critical sections) or lwlocks (for longer ones). It
doesn't really make sense to talk about storing heavyweight locks in
dynamic shared memory, because we're talking about storing locks with
the data structures that they protect, and heavyweight locks are used
to protect database or shared objects, not shared memory structures.
Of course, someone might think of trying to provide a mechanism for
the heavyweight lock manager to overflow to dynamic shared memory, but
that's a different thing altogether and not what I'm talking about
here.
Right now, storing spinlocks in dynamic shared memory *almost* works,
but there are problems with --disable-spinlocks. In that
configuration, we use semaphores to simulate spinlocks. Every time
someone calls SpinLockInit(), it's going to allocate a new semaphore
which will never be returned to the operating system, so you're pretty
quickly going to run out. There are a couple of things we could do
about this:
1. Decide we don't care. If you compile with --disable-spinlocks, and
then you try to use dynamic shared memory, it's going to leak
semaphores until none remain, and then start failing from there until
the postmaster is restarted. If you don't like that, provide a
working spinlock implementation for your platform.
2. Forbid the use of dynamic shared memory when compiling with
--disable-spinlocks. This is a more polite version of #1. It seems
likely to me that nearly every piece of code that uses dynamic shared
memory will require locking. Instead of letting people allocate
dynamic shared memory segments anyway and then having them start
failing shortly after postmaster startup, we could just head the
problem off at the pass by denying the request for dynamic shared
memory in the first place. Dynamic shared memory allocation can
always fail (e.g. because we're out of memory) and also has an
explicit off switch that will make all requests fail
(dynamic_shared_memory_type=none), so any code that uses dynamic
shared memory has to be prepared for a failure at that point, whereas
a failure in SpinLockInit() might be more surprising.
3. Provide an inverse for SpinLockInit, say SpinLockDestroy, and
require all code written for dynamic shared memory to invoke this
function on every spinlock before the shared memory segment is
destroyed. I initially thought that this could be done using the
on_dsm_detach infrastructure, but it turns out that doesn't really
work. The on_dsm_detach infrastructure is designed to make sure that
you *release* all of your locks when detaching - i.e. those hooks get
invoked for each process that detaches. For this, you'd need an
on_dsm_final_detach callback that gets called only for the very last
detach (and after prohibiting any other processes from attaching). I
can certainly engineer all that, but it's a decent amount of extra
work for everyone who wants to use dynamic shared memory to write the
appropriate callback, and because few people actually use
--disable-spinlocks, I think those callbacks will tend to be rather
lightly tested and thus a breeding ground for marginal bugs that
nobody's terribly excited about fixing.
4. Drop support for --disable-spinlocks.
For what it's worth, my vote is currently for #2. I can't think of
many interesting things to do with dynamic shared memory without having at
least spinlocks, so I don't think we'd be losing much. #1 seems
needlessly unfriendly, #3 seems like a lot of work for not much, and
#4 seems excessive at least as a solution to this particular problem,
though there may be other arguments for it. What do others think?
I think we're also going to want to be able to create LWLocks in
dynamic shared memory: some critical sections won't be short enough or
safe enough to be protected by spinlocks. At some level this seems
easy: change LWLockAcquire and friends to accept an LWLock * rather
than an LWLockId, and similarly change held_lwlocks[] to hold LWLock
pointers rather than LWLockIds. One tricky point is that you'd better
not detach a shared memory segment while you're holding lwlocks inside
that segment; I think just making that a coding rule shouldn't cause
any great problem. Conversely, you'd better release all lwlocks in the
segment before detaching it, and this seems mostly OK: throwing an
error will call LWLockReleaseAll before doing the resource manager
cleanups that unmap the dynamic shared memory segment. There may be
corner cases I haven't thought about, though. A bigger problem is that I think we
want to avoid having a large amount of notational churn. The obvious
way to do that is to get rid of the LWLockId array and instead declare
each fixed LWLock separately as e.g. LWLock *ProcArrayLock. However,
creating a large number of new globals that will need to be
initialized in every new EXEC_BACKEND process seems irritating. So
maybe a better idea is to do something like this:
#define BufFreelistLock (&fixedlwlocks[0])
#define ShmemIndexLock (&fixedlwlocks[1])
...
#define AutoFileLock (&fixedlwlocks[36])
#define NUM_FIXED_LWLOCKS 37
Comments, suggestions?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2014-01-05 12:56:05 -0500, Robert Haas wrote:
Right now, storing spinlocks in dynamic shared memory *almost* works,
but there are problems with --disable-spinlocks. In that
configuration, we use semaphores to simulate spinlocks. Every time
someone calls SpinLockInit(), it's going to allocate a new semaphore
which will never be returned to the operating system, so you're pretty
quickly going to run out. There are a couple of things we could do
about this:
4. Drop support for --disable-spinlocks.
I very strongly vote for 4). I think we're going to hit this more and
more often, and it's a facility that benefits almost nobody. Just about
every new platform is or will be on gcc or clang, and for those you can
just duplicate the compiler-provided generic implementation we have for
arm. The atomics implementation makes this an automatic fallback if
there's no compiler-specific variant around.
I think we're also going to want to be able to create LWLocks in
dynamic shared memory: some critical sections won't be short enough or
safe enough to be protected by spinlocks.
Agreed.
At some level this seems easy: change LWLockAcquire and friends to
accept an LWLock * rather than an LWLockId, and similarly change
held_lwlocks[] to hold LWLock pointers rather than LWLockIds.
My primary reason isn't dsm TBH, but wanting to embed the buffer lwlocks
in the bufferdesc, on the same cacheline as the buffer header's
spinlock. All the embedded ones can be allocated without padding, while
the relatively low number of non-embedded ones can be padded to the full
cacheline size.
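In other words, something roughly like this - just a sketch of the
layout I mean, not the real struct, and the name is made up:

/* Sketch: the content lock is embedded right next to the buffer header
 * spinlock so both land on the same cache line; the io_in_progress lock
 * can stay in a separately padded array since it's much less hot. */
typedef struct BufferDescSketch
{
	BufferTag	tag;			/* ID of page contained in buffer */
	BufFlags	flags;			/* flag bits */
	slock_t		buf_hdr_lock;	/* protects the header fields */
	LWLock		content_lock;	/* embedded, unpadded */
	LWLock	   *io_in_progress_lock;	/* allocated elsewhere, padded */
	int			buf_id;			/* buffer's index number (from 0) */
} BufferDescSketch;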
A bigger problem is that I think we
want to avoid having a large amount of notational churn. The obvious
way to do that is to get rid of the LWLockId array and instead declare
each fixed LWLock separately as e.g. LWLock *ProcArrayLock. However,
creating a large number of new globals that will need to be
initialized in every new EXEC_BACKEND process seems irritating. So
maybe a better idea is to do something like this:
#define BufFreelistLock (&fixedlwlocks[0])
#define ShmemIndexLock (&fixedlwlocks[1])
...
#define AutoFileLock (&fixedlwlocks[36])
#define NUM_FIXED_LWLOCKS 37
Comments, suggestions?
My idea here was to just have two APIs, a legacy one that works like the
current one, and a new one that locks the lwlocks passed in by a
pointer. After all, LWLockAssign(), which you use in extensions,
currently returns an LWLockId. Seems ugly to turn that into a pointer.
But perhaps your idea is better anyway, no matter the hackery of turning
LWLockId into a pointer.
Greetings,
Andres Freund
Robert Haas <robertmhaas@gmail.com> writes:
For what it's worth, my vote is currently for #2. I can't think of
many interesting things to do with dynamic shared memory without having at
least spinlocks, so I don't think we'd be losing much. #1 seems
needlessly unfriendly, #3 seems like a lot of work for not much, and
#4 seems excessive at least as a solution to this particular problem,
though there may be other arguments for it. What do others think?
I agree with this position. There may be some good reason to drop
--disable-spinlocks altogether in future, but DSM isn't a sufficient
excuse.
I think we're also going to want to be able to create LWLocks in
dynamic shared memory: some critical sections won't be short enough or
safe enough to be protected by spinlocks. At some level this seems
easy: change LWLockAcquire and friends to accept an LWLock * rather
than an LWLockId, and similarly change held_lwlocks[] to hold LWLock
pointers rather than LWLockIds.
I seem to recall that there was some good reason for keeping all the
LWLocks in an array, back when the facility was first designed.
I'm too lazy to research the point right now, but you might want to
go back and look at the archives around when lwlock.c was written.
creating a large number of new globals that will need to be
initialized in every new EXEC_BACKEND process seems irritating.
This might've been the good reason, but not sure --- I think LWLocks
predate our Windows support.
In the end, though, that decision was taken before we were concerned
about being able to add new LWLocks on the fly, which is what this is
really about (whether they're stored in DSM or not is a secondary point).
The pressure for that has gotten strong enough that it may be time to
change the tradeoff.
regards, tom lane
On 2014-01-05 14:06:52 -0500, Tom Lane wrote:
Robert Haas <robertmhaas@gmail.com> writes:
For what it's worth, my vote is currently for #2. I can't think of
many interesting things to do with dynamic shared memory without having at
least spinlocks, so I don't think we'd be losing much. #1 seems
needlessly unfriendly, #3 seems like a lot of work for not much, and
#4 seems excessive at least as a solution to this particular problem,
though there may be other arguments for it. What do others think?
I agree with this position. There may be some good reason to drop
--disable-spinlocks altogether in future, but DSM isn't a sufficient
excuse.
Agreed that DSM isn't sufficient cause. The reasons I see for removing
it in the future are:
* It's not tested at all, and it has been partially broken for
significant periods of time. Afair Heikki just noticed the breakage
accidentally when adding enough new spinlocks recently.
* It's shown up as problematic in a couple of patches while adding not
much value (at least dsm, atomic ops, afair some others).
* It's slow enough that it's not of practical value.
* Implementing simple support for spinlocks on realistic platforms isn't
hard these days due to compiler intrinsics.
* The platforms that don't have a barrier implementation will rely on
spinlocks, and for correctness those spinlocks should employ
barriers. That might be more of an argument for getting rid of that
fallback, though.
I think we're also going to want to be able to create LWLocks in
dynamic shared memory: some critical sections won't be short enough or
safe enough to be protected by spinlocks. At some level this seems
easy: change LWLockAcquire and friends to accept an LWLock * rather
than an LWLockId, and similarly change held_lwlocks[] to hold LWLock
pointers rather than LWLockIds.
I seem to recall that there was some good reason for keeping all the
LWLocks in an array, back when the facility was first designed.
I'm too lazy to research the point right now, but you might want to
go back and look at the archives around when lwlock.c was written.
Your proposal is at
/messages/by-id/1054.1001520600@sss.pgh.pa.us -
afaics there hasn't been much discussion about implementation details at
that level.
In the end, though, that decision was taken before we were concerned
about being able to add new LWLocks on the fly, which is what this is
really about (whether they're stored in DSM or not is a secondary point).
The pressure for that has gotten strong enough that it may be time to
change the tradeoff.
I personally find the sharing of a cacheline between a buffer header's
spinlock and the lwlock much more interesting than DSM...
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Sun, Jan 5, 2014 at 2:06 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I seem to recall that there was some good reason for keeping all the
LWLocks in an array, back when the facility was first designed.
I'm too lazy to research the point right now, but you might want to
go back and look at the archives around when lwlock.c was written.
To some extent it's an orthogonal question. It's true that there
isn't much point in using LWLock * rather than LWLockId to refer to
LWLocks unless we wish to be able to store them outside the shared
memory segment, but the reverse is not true: just because we have the
ability to move things outside the main array in the general case
doesn't make it a good idea in any particular case. Andres's email
seems to indicate that he sees performance advantages in moving buffer
locks elsewhere, and I have a sneaking suspicion that we'll find that
it's more convenient to move some other things around as well, but
that's policy, not mechanism. Very little of the core LWLock
machinery actually cares how the locks are stored, so the attached
patch to make it use LWLock * rather than LWLockId as a handle is
pretty straightforward.
The only real problem I see here is that we occasionally *print out*
LWLockIds as a way of identifying lwlocks. This is not a great
system, but printing out pointers is almost certainly much worse (e.g.
because of ASLR). The cases where this is currently an issue are:
- You try to release a lwlock you haven't acquired. We elog(ERROR) the ID.
- You define LWLOCK_STATS. The resulting reports print the lock ID.
- You define LOCK_DEBUG and set trace_lwlocks=true. We print the lock
ID in the trace messages.
- You compile with --enable-dtrace. We pass the lock IDs to the dtrace probes.
In the attached patch I handled the first case by printing the pointer
(which I don't feel too bad about) and the remaining three cases by
leaving them broken. I wouldn't shed a tear about ripping out
trace_lwlocks, but LWLOCK_STATS is useful and I bet somebody is using
--enable-dtrace, so we probably need to fix those cases at least. I
suppose one option is to make LWLOCK_STATS and the dtrace probes only
look at locks in the main array and just ignore everything else. But
that's kind of crappy, especially if we're going to soon move buffer
locks out of the main array.
Another idea is to include some identifying information in the lwlock.
For example, each lwlock could have a char *name in it, and we could
print the name. In theory, this could be a big step forward in terms
of usability, because it'd spare us all needing to remember that 4 ==
ProcArrayLock. But it's awkward for buffer locks, of which there
might be a great many, and we surely don't want to allocate a
dynamically-generated string in shared memory for each one. You could
do a bit better by making the identifying information a string plus an
integer, because then all the buffer locks could set the string to a
static constant like "buffer content lock" and the integer to the
buffer number, and similarly for lock manager partition locks and so
on. This is appealing, but would increase the size of LWLockPadded
from 16 bytes to 32 on 64-bit platforms where slock_t is four bytes or
less, which I'm not that excited about even though it would reduce
cache line contention in some cases.
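Concretely, that would mean something like this (sketch only; the two
added field names are invented):

typedef struct LWLock
{
	slock_t		mutex;			/* Protects LWLock and queue of PGPROCs */
	bool		releaseOK;		/* T if ok to release waiters */
	char		exclusive;		/* # of exclusive holders (0 or 1) */
	int			shared;			/* # of shared holders (0..MaxBackends) */
	struct PGPROC *head;		/* head of list of waiting PGPROCs */
	struct PGPROC *tail;		/* tail of list of waiting PGPROCs */
	const char *id_str;			/* e.g. "ProcArrayLock", "buffer content lock" */
	int			id_num;			/* e.g. buffer number, or -1 if meaningless */
} LWLock;

The elog/trace/dtrace paths would then report lock->id_str and
lock->id_num instead of an array index.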
Yet a third idea is to try to reverse-engineer a name for a given
lwlock from the pointer address. If it's an offset into the main
array, this is easy enough to do, and even if we ended up with several
arrays (like one for bufmgr locks) it wouldn't be too hard to write
code to figure out which array contains it and emit the appropriate
string. The only real problem that I see with this is that it might
cause a performance hit. A performance hit when running with
trace_lwlocks or LWLOCK_STATS is not really a problem, but people
won't like it if --enable-dtrace slows things down.
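For the main array that reverse-engineering is trivial; an untested
sketch (function name invented):

/* Sketch: recover a printable identity for an LWLock from its address. */
static const char *
LWLockArrayName(LWLock *lock, int *index)
{
	/* the main array's total size is stashed just before it; see lwlock.c */
	int		   *LWLockCounter = (int *) ((char *) MainLWLockArray - 2 * sizeof(int));
	int			numLocks = LWLockCounter[1];

	if (lock >= &MainLWLockArray[0].lock &&
		lock < &MainLWLockArray[numLocks].lock)
	{
		*index = (int) (((char *) lock - (char *) MainLWLockArray) / LWLOCK_PADDED_SIZE);
		return "main";
	}
	/* a future bufmgr array, etc., would get its own range check here */
	*index = -1;
	return "other";
}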
Preferences, other ideas?
None of these ideas are a complete solution for LWLOCK_STATS. In the
other three cases noted above, we only need an identifier for the lock
"instantaneously", so that we can pass it off to the logger or dtrace
or whatever. But LWLOCK_STATS wants to hold on to data about the
locks that were visited until the end of the session, and it does that
using an array that is *indexed* by lwlockid. I guess we could
replace that with a hash table. Ugh. Any suggestions?
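If we did go that way, the table itself would be simple enough: an
untested sketch using dynahash keyed by the lock pointer, replacing the
calloc'd arrays (entry and counter names invented):

/* LWLOCK_STATS sketch: per-backend hash table keyed by lock address. */
typedef struct lwlock_stats
{
	LWLock	   *lock;			/* hash key; must be first field */
	int			sh_acquire_count;
	int			ex_acquire_count;
	int			block_count;
} lwlock_stats;

static HTAB *lwlock_stats_htab = NULL;

static void
init_lwlock_stats(void)
{
	HASHCTL		ctl;

	MemSet(&ctl, 0, sizeof(ctl));
	ctl.keysize = sizeof(LWLock *);
	ctl.entrysize = sizeof(lwlock_stats);
	ctl.hash = tag_hash;
	lwlock_stats_htab = hash_create("lwlock stats", 256, &ctl,
									HASH_ELEM | HASH_FUNCTION);
}

LWLockAcquire would then hash_search() for the lock pointer with
HASH_ENTER and bump the right counter, and print_lwlock_stats would
walk the table with hash_seq_search() instead of indexing an array.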
(Incidentally, while developing this patch I found a bug in the
current code: lock.c iterates over all PGPROCs from 0 to
ProcGlobal->allProcCount and takes the backendLock for each, but if
max_prepared_transactions>0 then the PGPROCs for prepared transactions
do not have a backendLock and so we take and release BufFreelistLock -
i.e. 0 - a number of times equal to max_prepared_transactions. That's
not cool.)
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment: lwlock-pointers.patch (text/x-patch; charset=US-ASCII)
diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index af24902..69ae046 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -151,7 +151,7 @@ SimpleLruShmemSize(int nslots, int nlsns)
sz += MAXALIGN(nslots * sizeof(bool)); /* page_dirty[] */
sz += MAXALIGN(nslots * sizeof(int)); /* page_number[] */
sz += MAXALIGN(nslots * sizeof(int)); /* page_lru_count[] */
- sz += MAXALIGN(nslots * sizeof(LWLockId)); /* buffer_locks[] */
+ sz += MAXALIGN(nslots * sizeof(LWLock *)); /* buffer_locks[] */
if (nlsns > 0)
sz += MAXALIGN(nslots * nlsns * sizeof(XLogRecPtr)); /* group_lsn[] */
@@ -161,7 +161,7 @@ SimpleLruShmemSize(int nslots, int nlsns)
void
SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
- LWLockId ctllock, const char *subdir)
+ LWLock *ctllock, const char *subdir)
{
SlruShared shared;
bool found;
@@ -202,8 +202,8 @@ SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
offset += MAXALIGN(nslots * sizeof(int));
shared->page_lru_count = (int *) (ptr + offset);
offset += MAXALIGN(nslots * sizeof(int));
- shared->buffer_locks = (LWLockId *) (ptr + offset);
- offset += MAXALIGN(nslots * sizeof(LWLockId));
+ shared->buffer_locks = (LWLock **) (ptr + offset);
+ offset += MAXALIGN(nslots * sizeof(LWLock *));
if (nlsns > 0)
{
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 048a189..1692dd2 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -449,8 +449,6 @@ typedef struct
typedef int InheritableSocket;
#endif
-typedef struct LWLock LWLock; /* ugly kluge */
-
/*
* Structure contains all variables passed to exec:ed backends
*/
@@ -471,7 +469,7 @@ typedef struct
slock_t *ShmemLock;
VariableCache ShmemVariableCache;
Backend *ShmemBackendArray;
- LWLock *LWLockArray;
+ LWLock *MainLWLockArray;
slock_t *ProcStructLock;
PROC_HDR *ProcGlobal;
PGPROC *AuxiliaryProcs;
@@ -5580,7 +5578,6 @@ PostmasterMarkPIDForWorkerNotify(int pid)
* functions. They are marked NON_EXEC_STATIC in their home modules.
*/
extern slock_t *ShmemLock;
-extern LWLock *LWLockArray;
extern slock_t *ProcStructLock;
extern PGPROC *AuxiliaryProcs;
extern PMSignalData *PMSignalState;
@@ -5626,7 +5623,7 @@ save_backend_variables(BackendParameters *param, Port *port,
param->ShmemVariableCache = ShmemVariableCache;
param->ShmemBackendArray = ShmemBackendArray;
- param->LWLockArray = LWLockArray;
+ param->MainLWLockArray = MainLWLockArray;
param->ProcStructLock = ProcStructLock;
param->ProcGlobal = ProcGlobal;
param->AuxiliaryProcs = AuxiliaryProcs;
@@ -5854,7 +5851,7 @@ restore_backend_variables(BackendParameters *param, Port *port)
ShmemVariableCache = param->ShmemVariableCache;
ShmemBackendArray = param->ShmemBackendArray;
- LWLockArray = param->LWLockArray;
+ MainLWLockArray = param->MainLWLockArray;
ProcStructLock = param->ProcStructLock;
ProcGlobal = param->ProcGlobal;
AuxiliaryProcs = param->AuxiliaryProcs;
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 081165f..af923ec 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -536,10 +536,10 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
{
BufferTag newTag; /* identity of requested block */
uint32 newHash; /* hash value for newTag */
- LWLockId newPartitionLock; /* buffer partition lock for it */
+ LWLock *newPartitionLock; /* buffer partition lock for it */
BufferTag oldTag; /* previous identity of selected buffer */
uint32 oldHash; /* hash value for oldTag */
- LWLockId oldPartitionLock; /* buffer partition lock for it */
+ LWLock *oldPartitionLock; /* buffer partition lock for it */
BufFlags oldFlags;
int buf_id;
volatile BufferDesc *buf;
@@ -888,7 +888,7 @@ InvalidateBuffer(volatile BufferDesc *buf)
{
BufferTag oldTag;
uint32 oldHash; /* hash value for oldTag */
- LWLockId oldPartitionLock; /* buffer partition lock for it */
+ LWLock *oldPartitionLock; /* buffer partition lock for it */
BufFlags oldFlags;
/* Save the original buffer tag before dropping the spinlock */
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 536be44..38776e9 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -565,7 +565,7 @@ LockHasWaiters(const LOCKTAG *locktag, LOCKMODE lockmode, bool sessionLock)
LOCALLOCK *locallock;
LOCK *lock;
PROCLOCK *proclock;
- LWLockId partitionLock;
+ LWLock *partitionLock;
bool hasWaiters = false;
if (lockmethodid <= 0 || lockmethodid >= lengthof(LockMethods))
@@ -702,7 +702,7 @@ LockAcquireExtended(const LOCKTAG *locktag,
bool found;
ResourceOwner owner;
uint32 hashcode;
- LWLockId partitionLock;
+ LWLock *partitionLock;
int status;
bool log_lock = false;
@@ -1744,7 +1744,7 @@ LockRelease(const LOCKTAG *locktag, LOCKMODE lockmode, bool sessionLock)
LOCALLOCK *locallock;
LOCK *lock;
PROCLOCK *proclock;
- LWLockId partitionLock;
+ LWLock *partitionLock;
bool wakeupNeeded;
if (lockmethodid <= 0 || lockmethodid >= lengthof(LockMethods))
@@ -2096,10 +2096,13 @@ LockReleaseAll(LOCKMETHODID lockmethodid, bool allLocks)
*/
for (partition = 0; partition < NUM_LOCK_PARTITIONS; partition++)
{
- LWLockId partitionLock = FirstLockMgrLock + partition;
+ LWLock *partitionLock;
SHM_QUEUE *procLocks = &(MyProc->myProcLocks[partition]);
PROCLOCK *nextplock;
+ partitionLock =
+ &MainLWLockArray[LOCK_MANAGER_LWLOCK_OFFSET + partition].lock;
+
/*
* If the proclock list for this partition is empty, we can skip
* acquiring the partition lock. This optimization is trickier than
@@ -2475,7 +2478,7 @@ static bool
FastPathTransferRelationLocks(LockMethod lockMethodTable, const LOCKTAG *locktag,
uint32 hashcode)
{
- LWLockId partitionLock = LockHashPartitionLock(hashcode);
+ LWLock *partitionLock = LockHashPartitionLock(hashcode);
Oid relid = locktag->locktag_field2;
uint32 i;
@@ -2565,7 +2568,7 @@ FastPathGetRelationLockEntry(LOCALLOCK *locallock)
LockMethod lockMethodTable = LockMethods[DEFAULT_LOCKMETHOD];
LOCKTAG *locktag = &locallock->tag.lock;
PROCLOCK *proclock = NULL;
- LWLockId partitionLock = LockHashPartitionLock(locallock->hashcode);
+ LWLock *partitionLock = LockHashPartitionLock(locallock->hashcode);
Oid relid = locktag->locktag_field2;
uint32 f;
@@ -2671,7 +2674,7 @@ GetLockConflicts(const LOCKTAG *locktag, LOCKMODE lockmode)
SHM_QUEUE *procLocks;
PROCLOCK *proclock;
uint32 hashcode;
- LWLockId partitionLock;
+ LWLock *partitionLock;
int count = 0;
int fast_count = 0;
@@ -2883,7 +2886,7 @@ LockRefindAndRelease(LockMethod lockMethodTable, PGPROC *proc,
PROCLOCKTAG proclocktag;
uint32 hashcode;
uint32 proclock_hashcode;
- LWLockId partitionLock;
+ LWLock *partitionLock;
bool wakeupNeeded;
hashcode = LockTagHashCode(locktag);
@@ -3159,10 +3162,13 @@ PostPrepare_Locks(TransactionId xid)
*/
for (partition = 0; partition < NUM_LOCK_PARTITIONS; partition++)
{
- LWLockId partitionLock = FirstLockMgrLock + partition;
+ LWLock *partitionLock;
SHM_QUEUE *procLocks = &(MyProc->myProcLocks[partition]);
PROCLOCK *nextplock;
+ partitionLock =
+ &MainLWLockArray[LOCK_MANAGER_LWLOCK_OFFSET + partition].lock;
+
/*
* If the proclock list for this partition is empty, we can skip
* acquiring the partition lock. This optimization is safer than the
@@ -3400,7 +3406,12 @@ GetLockStatusData(void)
* Must grab LWLocks in partition-number order to avoid LWLock deadlock.
*/
for (i = 0; i < NUM_LOCK_PARTITIONS; i++)
- LWLockAcquire(FirstLockMgrLock + i, LW_SHARED);
+ {
+ LWLock *partitionLock;
+
+ partitionLock = &MainLWLockArray[LOCK_MANAGER_LWLOCK_OFFSET + i].lock;
+ LWLockAcquire(partitionLock, LW_SHARED);
+ }
/* Now we can safely count the number of proclocks */
data->nelements = el + hash_get_num_entries(LockMethodProcLockHash);
@@ -3442,7 +3453,12 @@ GetLockStatusData(void)
* behavior inside LWLockRelease.
*/
for (i = NUM_LOCK_PARTITIONS; --i >= 0;)
- LWLockRelease(FirstLockMgrLock + i);
+ {
+ LWLock *partitionLock;
+
+ partitionLock = &MainLWLockArray[LOCK_MANAGER_LWLOCK_OFFSET + i].lock;
+ LWLockRelease(partitionLock);
+ }
Assert(el == data->nelements);
@@ -3477,7 +3493,12 @@ GetRunningTransactionLocks(int *nlocks)
* Must grab LWLocks in partition-number order to avoid LWLock deadlock.
*/
for (i = 0; i < NUM_LOCK_PARTITIONS; i++)
- LWLockAcquire(FirstLockMgrLock + i, LW_SHARED);
+ {
+ LWLock *partitionLock;
+
+ partitionLock = &MainLWLockArray[LOCK_MANAGER_LWLOCK_OFFSET + i].lock;
+ LWLockAcquire(partitionLock, LW_SHARED);
+ }
/* Now we can safely count the number of proclocks */
els = hash_get_num_entries(LockMethodProcLockHash);
@@ -3537,7 +3558,12 @@ GetRunningTransactionLocks(int *nlocks)
* behavior inside LWLockRelease.
*/
for (i = NUM_LOCK_PARTITIONS; --i >= 0;)
- LWLockRelease(FirstLockMgrLock + i);
+ {
+ LWLock *partitionLock;
+
+ partitionLock = &MainLWLockArray[LOCK_MANAGER_LWLOCK_OFFSET + i].lock;
+ LWLockRelease(partitionLock);
+ }
*nlocks = index;
return accessExclusiveLocks;
@@ -3673,7 +3699,7 @@ lock_twophase_recover(TransactionId xid, uint16 info,
uint32 hashcode;
uint32 proclock_hashcode;
int partition;
- LWLockId partitionLock;
+ LWLock *partitionLock;
LockMethod lockMethodTable;
Assert(len == sizeof(TwoPhaseLockRecord));
@@ -4044,7 +4070,7 @@ VirtualXactLock(VirtualTransactionId vxid, bool wait)
{
PROCLOCK *proclock;
uint32 hashcode;
- LWLockId partitionLock;
+ LWLock *partitionLock;
hashcode = LockTagHashCode(&tag);
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 4f88d3f..11fff91 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -37,43 +37,12 @@
extern slock_t *ShmemLock;
-typedef struct LWLock
-{
- slock_t mutex; /* Protects LWLock and queue of PGPROCs */
- bool releaseOK; /* T if ok to release waiters */
- char exclusive; /* # of exclusive holders (0 or 1) */
- int shared; /* # of shared holders (0..MaxBackends) */
- PGPROC *head; /* head of list of waiting PGPROCs */
- PGPROC *tail; /* tail of list of waiting PGPROCs */
- /* tail is undefined when head is NULL */
-} LWLock;
-
-/*
- * All the LWLock structs are allocated as an array in shared memory.
- * (LWLockIds are indexes into the array.) We force the array stride to
- * be a power of 2, which saves a few cycles in indexing, but more
- * importantly also ensures that individual LWLocks don't cross cache line
- * boundaries. This reduces cache contention problems, especially on AMD
- * Opterons. (Of course, we have to also ensure that the array start
- * address is suitably aligned.)
- *
- * LWLock is between 16 and 32 bytes on all known platforms, so these two
- * cases are sufficient.
- */
-#define LWLOCK_PADDED_SIZE (sizeof(LWLock) <= 16 ? 16 : 32)
-
-typedef union LWLockPadded
-{
- LWLock lock;
- char pad[LWLOCK_PADDED_SIZE];
-} LWLockPadded;
-
/*
* This points to the array of LWLocks in shared memory. Backends inherit
* the pointer by fork from the postmaster (except in the EXEC_BACKEND case,
* where we have special measures to pass it down).
*/
-NON_EXEC_STATIC LWLockPadded *LWLockArray = NULL;
+LWLockPadded *MainLWLockArray = NULL;
/*
@@ -85,7 +54,7 @@ NON_EXEC_STATIC LWLockPadded *LWLockArray = NULL;
#define MAX_SIMUL_LWLOCKS 100
static int num_held_lwlocks = 0;
-static LWLockId held_lwlocks[MAX_SIMUL_LWLOCKS];
+static LWLock *held_lwlocks[MAX_SIMUL_LWLOCKS];
static int lock_addin_request = 0;
static bool lock_addin_request_allowed = true;
@@ -130,7 +99,7 @@ static void print_lwlock_stats(int code, Datum arg);
static void
init_lwlock_stats(void)
{
- int *LWLockCounter = (int *) ((char *) LWLockArray - 2 * sizeof(int));
+ int *LWLockCounter = (int *) ((char *) MainLWLockArray - 2 * sizeof(int));
int numLocks = LWLockCounter[1];
sh_acquire_counts = calloc(numLocks, sizeof(int));
@@ -145,7 +114,7 @@ static void
print_lwlock_stats(int code, Datum arg)
{
int i;
- int *LWLockCounter = (int *) ((char *) LWLockArray - 2 * sizeof(int));
+ int *LWLockCounter = (int *) ((char *) MainLWLockArray - 2 * sizeof(int));
int numLocks = LWLockCounter[1];
/* Grab an LWLock to keep different backends from mixing reports */
@@ -180,7 +149,7 @@ NumLWLocks(void)
*/
/* Predefined LWLocks */
- numLocks = (int) NumFixedLWLocks;
+ numLocks = NUM_FIXED_LWLOCKS;
/* bufmgr.c needs two for each shared buffer */
numLocks += 2 * NBuffers;
@@ -276,12 +245,12 @@ CreateLWLocks(void)
/* Ensure desired alignment of LWLock array */
ptr += LWLOCK_PADDED_SIZE - ((uintptr_t) ptr) % LWLOCK_PADDED_SIZE;
- LWLockArray = (LWLockPadded *) ptr;
+ MainLWLockArray = (LWLockPadded *) ptr;
/*
- * Initialize all LWLocks to "unlocked" state
+ * Initialize all LWLocks in main array to "unlocked" state
*/
- for (id = 0, lock = LWLockArray; id < numLocks; id++, lock++)
+ for (id = 0, lock = MainLWLockArray; id < numLocks; id++, lock++)
{
SpinLockInit(&lock->lock.mutex);
lock->lock.releaseOK = true;
@@ -295,8 +264,8 @@ CreateLWLocks(void)
* Initialize the dynamic-allocation counter, which is stored just before
* the first LWLock.
*/
- LWLockCounter = (int *) ((char *) LWLockArray - 2 * sizeof(int));
- LWLockCounter[0] = (int) NumFixedLWLocks;
+ LWLockCounter = (int *) ((char *) MainLWLockArray - 2 * sizeof(int));
+ LWLockCounter[0] = NUM_FIXED_LWLOCKS;
LWLockCounter[1] = numLocks;
}
@@ -309,22 +278,22 @@ CreateLWLocks(void)
* startup, but it is needed if any user-defined code tries to allocate
* LWLocks after startup.
*/
-LWLockId
+LWLock *
LWLockAssign(void)
{
- LWLockId result;
+ LWLock *result;
/* use volatile pointer to prevent code rearrangement */
volatile int *LWLockCounter;
- LWLockCounter = (int *) ((char *) LWLockArray - 2 * sizeof(int));
+ LWLockCounter = (int *) ((char *) MainLWLockArray - 2 * sizeof(int));
SpinLockAcquire(ShmemLock);
if (LWLockCounter[0] >= LWLockCounter[1])
{
SpinLockRelease(ShmemLock);
- elog(ERROR, "no more LWLockIds available");
+ elog(ERROR, "no more LWLocks available");
}
- result = (LWLockId) (LWLockCounter[0]++);
+ result = &MainLWLockArray[LWLockCounter[0]++].lock;
SpinLockRelease(ShmemLock);
return result;
}
@@ -338,9 +307,9 @@ LWLockAssign(void)
* Side effect: cancel/die interrupts are held off until lock release.
*/
void
-LWLockAcquire(LWLockId lockid, LWLockMode mode)
+LWLockAcquire(LWLock *l, LWLockMode mode)
{
- volatile LWLock *lock = &(LWLockArray[lockid].lock);
+ volatile LWLock *lock = l;
PGPROC *proc = MyProc;
bool retry = false;
int extraWaits = 0;
@@ -497,7 +466,7 @@ LWLockAcquire(LWLockId lockid, LWLockMode mode)
TRACE_POSTGRESQL_LWLOCK_ACQUIRE(lockid, mode);
/* Add lock to list of locks held by this backend */
- held_lwlocks[num_held_lwlocks++] = lockid;
+ held_lwlocks[num_held_lwlocks++] = l;
/*
* Fix the process wait semaphore's count for any absorbed wakeups.
@@ -514,9 +483,9 @@ LWLockAcquire(LWLockId lockid, LWLockMode mode)
* If successful, cancel/die interrupts are held off until lock release.
*/
bool
-LWLockConditionalAcquire(LWLockId lockid, LWLockMode mode)
+LWLockConditionalAcquire(LWLock *l, LWLockMode mode)
{
- volatile LWLock *lock = &(LWLockArray[lockid].lock);
+ volatile LWLock *lock = l;
bool mustwait;
PRINT_LWDEBUG("LWLockConditionalAcquire", lockid, lock);
@@ -570,7 +539,7 @@ LWLockConditionalAcquire(LWLockId lockid, LWLockMode mode)
else
{
/* Add lock to list of locks held by this backend */
- held_lwlocks[num_held_lwlocks++] = lockid;
+ held_lwlocks[num_held_lwlocks++] = l;
TRACE_POSTGRESQL_LWLOCK_CONDACQUIRE(lockid, mode);
}
@@ -592,9 +561,9 @@ LWLockConditionalAcquire(LWLockId lockid, LWLockMode mode)
* wake up, observe that their records have already been flushed, and return.
*/
bool
-LWLockAcquireOrWait(LWLockId lockid, LWLockMode mode)
+LWLockAcquireOrWait(LWLock *l, LWLockMode mode)
{
- volatile LWLock *lock = &(LWLockArray[lockid].lock);
+ volatile LWLock *lock = l;
PGPROC *proc = MyProc;
bool mustwait;
int extraWaits = 0;
@@ -714,7 +683,7 @@ LWLockAcquireOrWait(LWLockId lockid, LWLockMode mode)
else
{
/* Add lock to list of locks held by this backend */
- held_lwlocks[num_held_lwlocks++] = lockid;
+ held_lwlocks[num_held_lwlocks++] = l;
TRACE_POSTGRESQL_LWLOCK_WAIT_UNTIL_FREE(lockid, mode);
}
@@ -725,9 +694,9 @@ LWLockAcquireOrWait(LWLockId lockid, LWLockMode mode)
* LWLockRelease - release a previously acquired lock
*/
void
-LWLockRelease(LWLockId lockid)
+LWLockRelease(LWLock *l)
{
- volatile LWLock *lock = &(LWLockArray[lockid].lock);
+ volatile LWLock *lock = l;
PGPROC *head;
PGPROC *proc;
int i;
@@ -740,11 +709,11 @@ LWLockRelease(LWLockId lockid)
*/
for (i = num_held_lwlocks; --i >= 0;)
{
- if (lockid == held_lwlocks[i])
+ if (l == held_lwlocks[i])
break;
}
if (i < 0)
- elog(ERROR, "lock %d is not held", (int) lockid);
+ elog(ERROR, "lock %p is not held", l);
num_held_lwlocks--;
for (; i < num_held_lwlocks; i++)
held_lwlocks[i] = held_lwlocks[i + 1];
@@ -874,13 +843,13 @@ LWLockReleaseAll(void)
* lock is held shared or exclusive.
*/
bool
-LWLockHeldByMe(LWLockId lockid)
+LWLockHeldByMe(LWLock *l)
{
int i;
for (i = 0; i < num_held_lwlocks; i++)
{
- if (held_lwlocks[i] == lockid)
+ if (held_lwlocks[i] == l)
return true;
}
return false;
diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index a8a0e98..0987596 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -241,7 +241,8 @@
#define PredicateLockHashPartition(hashcode) \
((hashcode) % NUM_PREDICATELOCK_PARTITIONS)
#define PredicateLockHashPartitionLock(hashcode) \
- ((LWLockId) (FirstPredicateLockMgrLock + PredicateLockHashPartition(hashcode)))
+ (&MainLWLockArray[PREDICATELOCK_MANAGER_LWLOCK_OFFSET + \
+ PredicateLockHashPartition(hashcode)].lock)
#define NPREDICATELOCKTARGETENTS() \
mul_size(max_predicate_locks_per_xact, add_size(MaxBackends, max_prepared_xacts))
@@ -383,7 +384,7 @@ static SHM_QUEUE *FinishedSerializableTransactions;
*/
static const PREDICATELOCKTARGETTAG ScratchTargetTag = {0, 0, 0, 0};
static uint32 ScratchTargetTagHash;
-static int ScratchPartitionLock;
+static LWLock *ScratchPartitionLock;
/*
* The local hash table used to determine when to combine multiple fine-
@@ -1398,7 +1399,13 @@ GetPredicateLockStatusData(void)
* in ascending order, then SerializableXactHashLock.
*/
for (i = 0; i < NUM_PREDICATELOCK_PARTITIONS; i++)
- LWLockAcquire(FirstPredicateLockMgrLock + i, LW_SHARED);
+ {
+ LWLock *partitionLock;
+
+ partitionLock =
+ &MainLWLockArray[PREDICATELOCK_MANAGER_LWLOCK_OFFSET + i].lock;
+ LWLockAcquire(partitionLock, LW_SHARED);
+ }
LWLockAcquire(SerializableXactHashLock, LW_SHARED);
/* Get number of locks and allocate appropriately-sized arrays. */
@@ -1427,7 +1434,13 @@ GetPredicateLockStatusData(void)
/* Release locks in reverse order */
LWLockRelease(SerializableXactHashLock);
for (i = NUM_PREDICATELOCK_PARTITIONS - 1; i >= 0; i--)
- LWLockRelease(FirstPredicateLockMgrLock + i);
+ {
+ LWLock *partitionLock;
+
+ partitionLock =
+ &MainLWLockArray[PREDICATELOCK_MANAGER_LWLOCK_OFFSET + i].lock;
+ LWLockRelease(partitionLock);
+ }
return data;
}
@@ -1856,7 +1869,7 @@ PageIsPredicateLocked(Relation relation, BlockNumber blkno)
{
PREDICATELOCKTARGETTAG targettag;
uint32 targettaghash;
- LWLockId partitionLock;
+ LWLock *partitionLock;
PREDICATELOCKTARGET *target;
SET_PREDICATELOCKTARGETTAG_PAGE(targettag,
@@ -2089,7 +2102,7 @@ DeleteChildTargetLocks(const PREDICATELOCKTARGETTAG *newtargettag)
if (TargetTagIsCoveredBy(oldtargettag, *newtargettag))
{
uint32 oldtargettaghash;
- LWLockId partitionLock;
+ LWLock *partitionLock;
PREDICATELOCK *rmpredlock PG_USED_FOR_ASSERTS_ONLY;
oldtargettaghash = PredicateLockTargetTagHashCode(&oldtargettag);
@@ -2301,7 +2314,7 @@ CreatePredicateLock(const PREDICATELOCKTARGETTAG *targettag,
PREDICATELOCKTARGET *target;
PREDICATELOCKTAG locktag;
PREDICATELOCK *lock;
- LWLockId partitionLock;
+ LWLock *partitionLock;
bool found;
partitionLock = PredicateLockHashPartitionLock(targettaghash);
@@ -2599,10 +2612,10 @@ TransferPredicateLocksToNewTarget(PREDICATELOCKTARGETTAG oldtargettag,
bool removeOld)
{
uint32 oldtargettaghash;
- LWLockId oldpartitionLock;
+ LWLock *oldpartitionLock;
PREDICATELOCKTARGET *oldtarget;
uint32 newtargettaghash;
- LWLockId newpartitionLock;
+ LWLock *newpartitionLock;
bool found;
bool outOfShmem = false;
@@ -2858,7 +2871,13 @@ DropAllPredicateLocksFromTable(Relation relation, bool transfer)
/* Acquire locks on all lock partitions */
LWLockAcquire(SerializablePredicateLockListLock, LW_EXCLUSIVE);
for (i = 0; i < NUM_PREDICATELOCK_PARTITIONS; i++)
- LWLockAcquire(FirstPredicateLockMgrLock + i, LW_EXCLUSIVE);
+ {
+ LWLock *partitionLock;
+
+ partitionLock =
+ &MainLWLockArray[PREDICATELOCK_MANAGER_LWLOCK_OFFSET + i].lock;
+ LWLockAcquire(partitionLock, LW_EXCLUSIVE);
+ }
LWLockAcquire(SerializableXactHashLock, LW_EXCLUSIVE);
/*
@@ -2996,7 +3015,13 @@ DropAllPredicateLocksFromTable(Relation relation, bool transfer)
/* Release locks in reverse order */
LWLockRelease(SerializableXactHashLock);
for (i = NUM_PREDICATELOCK_PARTITIONS - 1; i >= 0; i--)
- LWLockRelease(FirstPredicateLockMgrLock + i);
+ {
+ LWLock *partitionLock;
+
+ partitionLock =
+ &MainLWLockArray[PREDICATELOCK_MANAGER_LWLOCK_OFFSET + i].lock;
+ LWLockRelease(partitionLock);
+ }
LWLockRelease(SerializablePredicateLockListLock);
}
@@ -3611,7 +3636,7 @@ ClearOldPredicateLocks(void)
PREDICATELOCKTARGET *target;
PREDICATELOCKTARGETTAG targettag;
uint32 targettaghash;
- LWLockId partitionLock;
+ LWLock *partitionLock;
tag = predlock->tag;
target = tag.myTarget;
@@ -3690,7 +3715,7 @@ ReleaseOneSerializableXact(SERIALIZABLEXACT *sxact, bool partial,
PREDICATELOCKTARGET *target;
PREDICATELOCKTARGETTAG targettag;
uint32 targettaghash;
- LWLockId partitionLock;
+ LWLock *partitionLock;
nextpredlock = (PREDICATELOCK *)
SHMQueueNext(&(sxact->predicateLocks),
@@ -4068,7 +4093,7 @@ static void
CheckTargetForConflictsIn(PREDICATELOCKTARGETTAG *targettag)
{
uint32 targettaghash;
- LWLockId partitionLock;
+ LWLock *partitionLock;
PREDICATELOCKTARGET *target;
PREDICATELOCK *predlock;
PREDICATELOCK *mypredlock = NULL;
@@ -4360,7 +4385,13 @@ CheckTableForSerializableConflictIn(Relation relation)
LWLockAcquire(SerializablePredicateLockListLock, LW_EXCLUSIVE);
for (i = 0; i < NUM_PREDICATELOCK_PARTITIONS; i++)
- LWLockAcquire(FirstPredicateLockMgrLock + i, LW_SHARED);
+ {
+ LWLock *partitionLock;
+
+ partitionLock =
+ &MainLWLockArray[PREDICATELOCK_MANAGER_LWLOCK_OFFSET + i].lock;
+ LWLockAcquire(partitionLock, LW_SHARED);
+ }
LWLockAcquire(SerializableXactHashLock, LW_SHARED);
/* Scan through target list */
@@ -4407,7 +4438,13 @@ CheckTableForSerializableConflictIn(Relation relation)
/* Release locks in reverse order */
LWLockRelease(SerializableXactHashLock);
for (i = NUM_PREDICATELOCK_PARTITIONS - 1; i >= 0; i--)
- LWLockRelease(FirstPredicateLockMgrLock + i);
+ {
+ LWLock *partitionLock;
+
+ partitionLock =
+ &MainLWLockArray[PREDICATELOCK_MANAGER_LWLOCK_OFFSET + i].lock;
+ LWLockRelease(partitionLock);
+ }
LWLockRelease(SerializablePredicateLockListLock);
}
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 122afb2..a991092 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -189,7 +189,8 @@ InitProcGlobal(void)
*/
procs = (PGPROC *) ShmemAlloc(TotalProcs * sizeof(PGPROC));
ProcGlobal->allProcs = procs;
- ProcGlobal->allProcCount = TotalProcs;
+ /* XXX allProcCount isn't really all of them; it excludes prepared xacts */
+ ProcGlobal->allProcCount = MaxBackends + NUM_AUXILIARY_PROCS;
if (!procs)
ereport(FATAL,
(errcode(ERRCODE_OUT_OF_MEMORY),
@@ -663,7 +664,7 @@ IsWaitingForLock(void)
void
LockErrorCleanup(void)
{
- LWLockId partitionLock;
+ LWLock *partitionLock;
DisableTimeoutParams timeouts[2];
AbortStrongLockAcquire();
@@ -942,7 +943,7 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable)
LOCK *lock = locallock->lock;
PROCLOCK *proclock = locallock->proclock;
uint32 hashcode = locallock->hashcode;
- LWLockId partitionLock = LockHashPartitionLock(hashcode);
+ LWLock *partitionLock = LockHashPartitionLock(hashcode);
PROC_QUEUE *waitQueue = &(lock->waitProcs);
LOCKMASK myHeldLocks = MyProc->heldLocks;
bool early_deadlock = false;
@@ -1440,7 +1441,12 @@ CheckDeadLock(void)
* interrupts.
*/
for (i = 0; i < NUM_LOCK_PARTITIONS; i++)
- LWLockAcquire(FirstLockMgrLock + i, LW_EXCLUSIVE);
+ {
+ LWLock *partitionLock;
+
+ partitionLock = &MainLWLockArray[LOCK_MANAGER_LWLOCK_OFFSET + i].lock;
+ LWLockAcquire(partitionLock, LW_EXCLUSIVE);
+ }
/*
* Check to see if we've been awoken by anyone in the interim.
@@ -1522,7 +1528,12 @@ CheckDeadLock(void)
*/
check_done:
for (i = NUM_LOCK_PARTITIONS; --i >= 0;)
- LWLockRelease(FirstLockMgrLock + i);
+ {
+ LWLock *partitionLock;
+
+ partitionLock = &MainLWLockArray[LOCK_MANAGER_LWLOCK_OFFSET + i].lock;
+ LWLockRelease(partitionLock);
+ }
}
diff --git a/src/include/access/slru.h b/src/include/access/slru.h
index fc2c503..3c668f5 100644
--- a/src/include/access/slru.h
+++ b/src/include/access/slru.h
@@ -55,7 +55,7 @@ typedef enum
*/
typedef struct SlruSharedData
{
- LWLockId ControlLock;
+ LWLock *ControlLock;
/* Number of buffers managed by this SLRU structure */
int num_slots;
@@ -69,7 +69,7 @@ typedef struct SlruSharedData
bool *page_dirty;
int *page_number;
int *page_lru_count;
- LWLockId *buffer_locks;
+ LWLock **buffer_locks;
/*
* Optional array of WAL flush LSNs associated with entries in the SLRU
@@ -136,7 +136,7 @@ typedef SlruCtlData *SlruCtl;
extern Size SimpleLruShmemSize(int nslots, int nlsns);
extern void SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
- LWLockId ctllock, const char *subdir);
+ LWLock *ctllock, const char *subdir);
extern int SimpleLruZeroPage(SlruCtl ctl, int pageno);
extern int SimpleLruReadPage(SlruCtl ctl, int pageno, bool write_ok,
TransactionId xid);
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index f0e5144..bc6353e 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -104,7 +104,8 @@ typedef struct buftag
#define BufTableHashPartition(hashcode) \
((hashcode) % NUM_BUFFER_PARTITIONS)
#define BufMappingPartitionLock(hashcode) \
- ((LWLockId) (FirstBufMappingLock + BufTableHashPartition(hashcode)))
+ (&MainLWLockArray[BUFFER_MAPPING_LWLOCK_OFFSET + \
+ BufTableHashPartition(hashcode)].lock)
/*
* BufferDesc -- shared descriptor/state data for a single shared buffer.
@@ -144,8 +145,8 @@ typedef struct sbufdesc
int buf_id; /* buffer's index number (from 0) */
int freeNext; /* link in freelist chain */
- LWLockId io_in_progress_lock; /* to wait for I/O to complete */
- LWLockId content_lock; /* to lock access to buffer contents */
+ LWLock *io_in_progress_lock; /* to wait for I/O to complete */
+ LWLock *content_lock; /* to lock access to buffer contents */
} BufferDesc;
#define BufferDescriptorGetBuffer(bdesc) ((bdesc)->buf_id + 1)
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index 99bd945..f337af0 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -483,8 +483,8 @@ typedef enum
#define LockHashPartition(hashcode) \
((hashcode) % NUM_LOCK_PARTITIONS)
#define LockHashPartitionLock(hashcode) \
- ((LWLockId) (FirstLockMgrLock + LockHashPartition(hashcode)))
-
+ (&MainLWLockArray[LOCK_MANAGER_LWLOCK_OFFSET + \
+ LockHashPartition(hashcode)].lock)
/*
* function prototypes
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 3e42f6a..c6ffc5c 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -14,10 +14,94 @@
#ifndef LWLOCK_H
#define LWLOCK_H
+#include "storage/s_lock.h"
+
+struct PGPROC;
+
+typedef struct LWLock
+{
+ slock_t mutex; /* Protects LWLock and queue of PGPROCs */
+ bool releaseOK; /* T if ok to release waiters */
+ char exclusive; /* # of exclusive holders (0 or 1) */
+ int shared; /* # of shared holders (0..MaxBackends) */
+ struct PGPROC *head; /* head of list of waiting PGPROCs */
+ struct PGPROC *tail; /* tail of list of waiting PGPROCs */
+ /* tail is undefined when head is NULL */
+} LWLock;
+
+/*
+ * Prior to PostgreSQL 9.4, every lightweight lock in the system was stored
+ * in a single array. For convenience and for compatibility with past
+ * releases, we still have a main array, but it's now also permissible to
+ * store LWLocks elsewhere in the main shared memory segment or in a dynamic
+ * shared memory segment. In the main array, we force the array stride to
+ * be a power of 2, which saves a few cycles in indexing, but more importantly
+ * also ensures that individual LWLocks don't cross cache line boundaries.
+ * This reduces cache contention problems, especially on AMD Opterons.
+ * (Of course, we have to also ensure that the array start address is suitably
+ * aligned.)
+ *
+ * LWLock is between 16 and 32 bytes on all known platforms, so these two
+ * cases are sufficient.
+ */
+#define LWLOCK_PADDED_SIZE (sizeof(LWLock) <= 16 ? 16 : 32)
+
+typedef union LWLockPadded
+{
+ LWLock lock;
+ char pad[LWLOCK_PADDED_SIZE];
+} LWLockPadded;
+extern LWLockPadded *MainLWLockArray;
+
+/*
+ * Some commonly-used locks have predefined positions within MainLWLockArray;
+ * defining macros here makes it much easier to keep track of these. If you
+ * add a lock, add it to the end to avoid renumbering the existing locks;
+ * if you remove a lock, consider leaving a gap in the numbering sequence for
+ * the benefit of DTrace and other external debugging scripts.
+ */
+#define BufFreelistLock (&MainLWLockArray[0].lock)
+#define ShmemIndexLock (&MainLWLockArray[1].lock)
+#define OidGenLock (&MainLWLockArray[2].lock)
+#define XidGenLock (&MainLWLockArray[3].lock)
+#define ProcArrayLock (&MainLWLockArray[4].lock)
+#define SInvalReadLock (&MainLWLockArray[5].lock)
+#define SInvalWriteLock (&MainLWLockArray[6].lock)
+#define WALBufMappingLock (&MainLWLockArray[7].lock)
+#define WALWriteLock (&MainLWLockArray[8].lock)
+#define ControlFileLock (&MainLWLockArray[9].lock)
+#define CheckpointLock (&MainLWLockArray[10].lock)
+#define CLogControlLock (&MainLWLockArray[11].lock)
+#define SubtransControlLock (&MainLWLockArray[12].lock)
+#define MultiXactGenLock (&MainLWLockArray[13].lock)
+#define MultiXactOffsetControlLock (&MainLWLockArray[14].lock)
+#define MultiXactMemberControlLock (&MainLWLockArray[15].lock)
+#define RelCacheInitLock (&MainLWLockArray[16].lock)
+#define CheckpointerCommLock (&MainLWLockArray[17].lock)
+#define TwoPhaseStateLock (&MainLWLockArray[18].lock)
+#define TablespaceCreateLock (&MainLWLockArray[19].lock)
+#define BtreeVacuumLock (&MainLWLockArray[20].lock)
+#define AddinShmemInitLock (&MainLWLockArray[21].lock)
+#define AutovacuumLock (&MainLWLockArray[22].lock)
+#define AutovacuumScheduleLock (&MainLWLockArray[23].lock)
+#define SyncScanLock (&MainLWLockArray[24].lock)
+#define RelationMappingLock (&MainLWLockArray[25].lock)
+#define AsyncCtlLock (&MainLWLockArray[26].lock)
+#define AsyncQueueLock (&MainLWLockArray[27].lock)
+#define SerializableXactHashLock (&MainLWLockArray[28].lock)
+#define SerializableFinishedListLock (&MainLWLockArray[29].lock)
+#define SerializablePredicateLockListLock (&MainLWLockArray[30].lock)
+#define OldSerXidLock (&MainLWLockArray[31].lock)
+#define SyncRepLock (&MainLWLockArray[32].lock)
+#define BackgroundWorkerLock (&MainLWLockArray[33].lock)
+#define DynamicSharedMemoryControlLock (&MainLWLockArray[34].lock)
+#define AutoFileLock (&MainLWLockArray[35].lock)
+#define NUM_INDIVIDUAL_LWLOCKS 36
+
/*
* It's a bit odd to declare NUM_BUFFER_PARTITIONS and NUM_LOCK_PARTITIONS
- * here, but we need them to set up enum LWLockId correctly, and having
- * this file include lock.h or bufmgr.h would be backwards.
+ * here, but we need them to figure out offsets within MainLWLockArray, and
+ * having this file include lock.h or bufmgr.h would be backwards.
*/
/* Number of partitions of the shared buffer mapping hashtable */
@@ -31,68 +115,14 @@
#define LOG2_NUM_PREDICATELOCK_PARTITIONS 4
#define NUM_PREDICATELOCK_PARTITIONS (1 << LOG2_NUM_PREDICATELOCK_PARTITIONS)
-/*
- * We have a number of predefined LWLocks, plus a bunch of LWLocks that are
- * dynamically assigned (e.g., for shared buffers). The LWLock structures
- * live in shared memory (since they contain shared data) and are identified
- * by values of this enumerated type. We abuse the notion of an enum somewhat
- * by allowing values not listed in the enum declaration to be assigned.
- * The extra value MaxDynamicLWLock is there to keep the compiler from
- * deciding that the enum can be represented as char or short ...
- *
- * If you remove a lock, please replace it with a placeholder. This retains
- * the lock numbering, which is helpful for DTrace and other external
- * debugging scripts.
- */
-typedef enum LWLockId
-{
- BufFreelistLock,
- ShmemIndexLock,
- OidGenLock,
- XidGenLock,
- ProcArrayLock,
- SInvalReadLock,
- SInvalWriteLock,
- WALBufMappingLock,
- WALWriteLock,
- ControlFileLock,
- CheckpointLock,
- CLogControlLock,
- SubtransControlLock,
- MultiXactGenLock,
- MultiXactOffsetControlLock,
- MultiXactMemberControlLock,
- RelCacheInitLock,
- CheckpointerCommLock,
- TwoPhaseStateLock,
- TablespaceCreateLock,
- BtreeVacuumLock,
- AddinShmemInitLock,
- AutovacuumLock,
- AutovacuumScheduleLock,
- SyncScanLock,
- RelationMappingLock,
- AsyncCtlLock,
- AsyncQueueLock,
- SerializableXactHashLock,
- SerializableFinishedListLock,
- SerializablePredicateLockListLock,
- OldSerXidLock,
- SyncRepLock,
- BackgroundWorkerLock,
- DynamicSharedMemoryControlLock,
- AutoFileLock,
- /* Individual lock IDs end here */
- FirstBufMappingLock,
- FirstLockMgrLock = FirstBufMappingLock + NUM_BUFFER_PARTITIONS,
- FirstPredicateLockMgrLock = FirstLockMgrLock + NUM_LOCK_PARTITIONS,
-
- /* must be last except for MaxDynamicLWLock: */
- NumFixedLWLocks = FirstPredicateLockMgrLock + NUM_PREDICATELOCK_PARTITIONS,
-
- MaxDynamicLWLock = 1000000000
-} LWLockId;
-
+/* Offsets for various chunks of preallocated lwlocks. */
+#define BUFFER_MAPPING_LWLOCK_OFFSET NUM_INDIVIDUAL_LWLOCKS
+#define LOCK_MANAGER_LWLOCK_OFFSET \
+ (BUFFER_MAPPING_LWLOCK_OFFSET + NUM_BUFFER_PARTITIONS)
+#define PREDICATELOCK_MANAGER_LWLOCK_OFFSET \
+ (LOCK_MANAGER_LWLOCK_OFFSET + NUM_LOCK_PARTITIONS)
+#define NUM_FIXED_LWLOCKS \
+ (PREDICATELOCK_MANAGER_LWLOCK_OFFSET + NUM_PREDICATELOCK_PARTITIONS)
typedef enum LWLockMode
{
@@ -108,13 +138,14 @@ typedef enum LWLockMode
extern bool Trace_lwlocks;
#endif
-extern LWLockId LWLockAssign(void);
-extern void LWLockAcquire(LWLockId lockid, LWLockMode mode);
-extern bool LWLockConditionalAcquire(LWLockId lockid, LWLockMode mode);
-extern bool LWLockAcquireOrWait(LWLockId lockid, LWLockMode mode);
-extern void LWLockRelease(LWLockId lockid);
+extern LWLock *LWLockAssign(void);
+extern void LWLockInitialize(LWLock *);
+extern void LWLockAcquire(LWLock *lock, LWLockMode mode);
+extern bool LWLockConditionalAcquire(LWLock *lock, LWLockMode mode);
+extern bool LWLockAcquireOrWait(LWLock *lock, LWLockMode mode);
+extern void LWLockRelease(LWLock *lock);
extern void LWLockReleaseAll(void);
-extern bool LWLockHeldByMe(LWLockId lockid);
+extern bool LWLockHeldByMe(LWLock *lock);
extern int NumLWLocks(void);
extern Size LWLockShmemSize(void);
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 3b04d3c..fb00e79 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -131,7 +131,7 @@ struct PGPROC
struct XidCache subxids; /* cache for subtransaction XIDs */
/* Per-backend LWLock. Protects fields below. */
- LWLockId backendLock; /* protects the fields below */
+ LWLock *backendLock; /* protects the fields below */
/* Lock manager data, recording fast-path locks taken by this backend. */
uint64 fpLockBits; /* lock modes held for each fast-path slot */
On 01/05/2014 07:56 PM, Robert Haas wrote:
Right now, storing spinlocks in dynamic shared memory *almost* works,
but there are problems with --disable-spinlocks. In that
configuration, we use semaphores to simulate spinlocks. Every time
someone calls SpinLockInit(), it's going to allocate a new semaphore
which will never be returned to the operating system, so you're pretty
quickly going to run out. There are a couple of things we could do
about this:
5. Allocate a fixed number of semaphores, and multiplex them to emulate
any number of spinlocks. For example, allocate 100 semaphores, and use
the address of the spinlock modulo 100 to calculate which semaphore to
use to emulate that spinlock.
That assumes that you never hold more than one spinlock at a time,
otherwise you can get deadlocks. I think that assumption holds
currently, because acquiring two spinlocks at a time would be bad on
performance grounds anyway.
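In code form the scheme might look roughly like this, for the
--disable-spinlocks emulation only (untested sketch; the array and
constant names are made up, and it assumes slock_t shrinks to a dummy
type since the per-lock semaphore goes away):

#define NUM_EMULATION_SEMAPHORES 100

extern PGSemaphoreData *SpinlockSemaArray;	/* allocated once at startup */

static PGSemaphore
SpinlockSemaFor(volatile slock_t *lock)
{
	/* only the spinlock's address is used to pick a semaphore */
	return &SpinlockSemaArray[(uintptr_t) lock % NUM_EMULATION_SEMAPHORES];
}

void
s_init_lock_sema(volatile slock_t *lock)
{
	/* nothing to allocate per lock any more */
	(void) lock;
}

void
s_unlock_sema(volatile slock_t *lock)
{
	PGSemaphoreUnlock(SpinlockSemaFor(lock));
}

int
tas_sema(volatile slock_t *lock)
{
	/* TAS macros return 0 if successful */
	return !PGSemaphoreTryLock(SpinlockSemaFor(lock));
}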
- Heikki
On 2014-01-06 10:35:59 +0200, Heikki Linnakangas wrote:
On 01/05/2014 07:56 PM, Robert Haas wrote:
Right now, storing spinlocks in dynamic shared memory *almost* works,
but there are problems with --disable-spinlocks. In that
configuration, we use semaphores to simulate spinlocks. Every time
someone calls SpinLockInit(), it's going to allocate a new semaphore
which will never be returned to the operating system, so you're pretty
quickly going to run out. There are a couple of things we could do
about this:
5. Allocate a fixed number of semaphores, and multiplex them to emulate any
number of spinlocks. For example, allocate 100 semaphores, and use the
address of the spinlock modulo 100 to calculate which semaphore to use to
emulate that spinlock.
That assumes that you never hold more than one spinlock at a time, otherwise
you can get deadlocks. I think that assumption holds currently, because
acquiring two spinlocks at a time would be bad on performance grounds
anyway.
I think that's a pretty dangerous assumption - and one which could
break without just about anybody noticing, because nobody tests
--disable-spinlocks builds regularly. We could improve on that by adding
cassert checks against nested spinlocks, but that'd be far too high a
price for little benefit imo.
I am not convinced by the performance argument - there are lots of
spinlocks that are rarely if ever contended. If you lock two such locks
in a consistent nested fashion there won't be a performance problem.
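The kind of cassert check I mean would be no more than something like
this (names invented, assertion builds only):

#ifdef USE_ASSERT_CHECKING
extern int	nspins_held;		/* per-backend count of held spinlocks */

#define SpinLockAcquireChecked(lock) \
	do { \
		Assert(nspins_held == 0); \
		SpinLockAcquire(lock); \
		nspins_held++; \
	} while (0)

#define SpinLockReleaseChecked(lock) \
	do { \
		nspins_held--; \
		SpinLockRelease(lock); \
	} while (0)
#endif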
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
* Robert Haas (robertmhaas@gmail.com) wrote:
Another idea is to include some identifying information in the lwlock.
That was my immediate reaction to this issue...
For example, each lwlock could have a char *name in it, and we could
print the name. In theory, this could be a big step forward in terms
of usability, because it'd spare us all needing to remember that 4 ==
ProcArrayLock. But it's awkward for buffer locks, of which there
might be a great many, and we surely don't want to allocate a
dynamically-generated string in shared memory for each one. You could
do a bit better by making the identifying information a string plus an
integer, because then all the buffer locks could set the string to a
static constant like "buffer content lock" and the integer to the
buffer number, and similarly for lock manager partition locks and so
on. This is appealing, but would increase the size of LWLockPadded
from 16 bytes to 32 on 64-bit platforms where slock_t is four bytes or
less, which I'm not that excited about even though it would reduce
cache line contention in some cases.
I'm not thrilled with this either but at the same time, it may well be
worth it.
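To make the size trade-off concrete, a hypothetical "string plus integer"
layout (none of these field names come from an actual patch) might look
like:

/* Hypothetical layout only; not the committed LWLock structure. */
typedef struct LWLock
{
    slock_t     mutex;          /* protects LWLock and queue of PGPROCs */
    bool        releaseOK;      /* T if ok to release waiters */
    char        exclusive;      /* # of exclusive holders (0 or 1) */
    int         shared;         /* # of shared holders (0..MaxBackends) */
    const char *name;           /* e.g. "ProcArrayLock", "buffer content lock" */
    int         instance;       /* e.g. buffer number, or -1 if none */
    /* ... wait-queue links ... */
} LWLock;

Buffer locks could then all point name at a single static string and set
instance to the buffer number, which is what keeps the per-lock cost to one
pointer plus one int.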
Yet a third idea is to try to reverse-engineer a name for a given
lwlock from the pointer address. If it's an offset into the main
array, this is easy enough to do, and even if we ended up with several
arrays (like one for bufmgr locks) it wouldn't be too hard to write
code to figure out which array contains it and emit the appropriate
string. The only real problem that I see with this is that it might
cause a performance hit. A performance hit when running with
trace_lwlocks or LWLOCK_STATS is not really a problem, but people
won't like it if --enable-dtrace slows things down.
This seems like an interesting idea- but this and my other thought
(having a defined array of strings) seem like they might fall over in
the face of DSMs.
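A sketch of that reverse-engineering approach, assuming the fixed locks
stay together in a main array (MainLWLockArray appears in the patch later
in this thread; the helper and the NumFixedLWLocks bound are assumptions):

/* Illustrative only: map an LWLock pointer back to a printable identity. */
static const char *
identify_lwlock(LWLock *lock, int *index)
{
    if (lock >= &MainLWLockArray[0].lock &&
        lock < &MainLWLockArray[NumFixedLWLocks].lock)
    {
        *index = (int) ((LWLockPadded *) lock - MainLWLockArray);
        return "main lwlock array";
    }

    /* ... repeat for any other known arrays, e.g. buffer headers ... */
    *index = -1;
    return "unknown";
}

As noted above, this only works for arrays the code knows about, which is
exactly where locks stored in dynamic shared memory segments become a
problem.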
Preferences, other ideas?
My preference would generally be "use more memory rather than CPU time";
so I'd vote for adding identifying info rather than having the code hunt
through arrays to try and find it.
None of these ideas are a complete solution for LWLOCK_STATS. In the
other three cases noted above, we only need an identifier for the lock
"instantaneously", so that we can pass it off to the logger or dtrace
or whatever. But LWLOCK_STATS wants to hold on to data about the
locks that were visited until the end of the session, and it does that
using an array that is *indexed* by lwlockid. I guess we could
replace that with a hash table. Ugh. Any suggestions?
Yeah, that's not fun. No good suggestions here offhand.
Thanks,
Stephen
Andres Freund <andres@2ndquadrant.com> writes:
On 2014-01-06 10:35:59 +0200, Heikki Linnakangas wrote:
That assumes that you never hold more than one spinlock at a time, otherwise
you can get deadlocks. I think that assumption holds currently, because
acquiring two spinlocks at a time would be bad on performance grounds
anyway.
I think that's a pretty dangerous assumption
I think it's a fine assumption. Code that could possibly do that should
never get within hailing distance of being committed, because you are only
supposed to have short straight-line bits of code under a spinlock.
Taking another spinlock breaks both the "straight line code" and the "no
loops" aspects of that coding rule.
regards, tom lane
On 2014-01-06 09:59:49 -0500, Tom Lane wrote:
Andres Freund <andres@2ndquadrant.com> writes:
On 2014-01-06 10:35:59 +0200, Heikki Linnakangas wrote:
That assumes that you never hold more than one spinlock at a time, otherwise
you can get deadlocks. I think that assumption holds currently, because
acquiring two spinlocks at a time would be bad on performance grounds
anyway.
I think that's a pretty dangerous assumption
I think it's a fine assumption. Code that could possibly do that should
never get within hailing distance of being committed, because you are only
supposed to have short straight-line bits of code under a spinlock.
Taking another spinlock breaks both the "straight line code" and the "no
loops" aspects of that coding rule.
I think it's fair to say that no such code should be accepted - but I
personally don't trust the review process to prevent such cases and not
all spinlock usages are obvious (e.g. LockBufferHdr). And having rules
that only cause breakage in configurations that nobody ever uses doesn't
inspire much trust in that configuration.
If we go that way, we should add an Assert preventing recursive spinlock
acquisition...
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Jan 6, 2014 at 9:59 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Andres Freund <andres@2ndquadrant.com> writes:
On 2014-01-06 10:35:59 +0200, Heikki Linnakangas wrote:
That assumes that you never hold more than one spinlock at a time, otherwise
you can get deadlocks. I think that assumption holds currently, because
acquiring two spinlocks at a time would be bad on performance grounds
anyway.
I think that's a pretty dangerous assumption
I think it's a fine assumption. Code that could possibly do that should
never get within hailing distance of being committed, because you are only
supposed to have short straight-line bits of code under a spinlock.
Taking another spinlock breaks both the "straight line code" and the "no
loops" aspects of that coding rule.
OK, so we could do this. But should we? If we're going to get rid of
--disable-spinlocks for other reasons, then there's not much point in
spending the time to make it work with dynamic shared memory segments
that contain locks.
In fact, even if we *aren't* going to get rid of it, it's not 100%
clear to me that there's much point in supporting that intersection: a
big part of the point of dsm is to enable parallelism, and the point
of parallelism is to be fast, and if you want things to be fast, you
should probably have working spinlocks. Of course, there are some
other applications for dynamic shared memory as well, and in a quick
test on my laptop the performance of --disable-spinlocks wasn't
horrible, so this argument isn't watertight.
I guess the question boils down to: why are we keeping
--disable-spinlocks around? If we're expecting that people might
really use it for serious work, then it needs to remain and it needs
to work with dynamic shared memory. If we're expecting that people
might use it while porting PostgreSQL to a new platform but only as an
aid to get things up and running, then it needs to remain, but it's
probably OK for dynamic shared memory not to work when this option is
in use. If nobody uses it at all, then we can just forget the whole
thing.
I guess my bottom line here is that if --disable-spinlocks is
important to somebody, then I'm willing to expend some effort on
keeping it working. But if it's just inertia, then I'd rather not
spend a lot of time on it.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
I guess the question boils down to: why are we keeping
--disable-spinlocks around? If we're expecting that people might
really use it for serious work, then it needs to remain and it needs
to work with dynamic shared memory. If we're expecting that people
might use it while porting PostgreSQL to a new platform but only as an
aid to get things up and running, then it needs to remain, but it's
probably OK for dynamic shared memory not to work when this option is
in use. If nobody uses it at all, then we can just forget the whole
thing.
I think we can eliminate the first of those. Semaphores for spinlocks
were a performance disaster fifteen years ago, and the situation has
surely only gotten worse since then. I do, however, believe that
--disable-spinlocks has some use while porting to a new platform.
(And I don't believe the argument advanced elsewhere in this thread
that everybody uses gcc; much less that gcc's atomic intrinsics are
of uniformly high quality even on oddball architectures.)
Heikki's idea has some attraction for porting work whether or not
you believe that DSM needs to work during initial porting. By
default, PG will try to create upwards of ten thousand spinlocks
just for buffer headers. It's likely that that will fail unless
you whack around the kernel's SysV parameters. Being able to
constrain the number of semaphores to something sane would probably
be useful.
Having said all that, I'm not personally going to take the time to
implement it, and I don't think the DSM patch should be required to
either. Somebody who actually needs it can go make it work.
But I think Heikki's idea points the way forward, so I'm now against
having the DSM code reject --disable-spinlocks. I think benign
neglect is the way for now.
regards, tom lane
On Mon, Jan 6, 2014 at 11:22 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I think we can eliminate the first of those. Semaphores for spinlocks
were a performance disaster fifteen years ago, and the situation has
surely only gotten worse since then. I do, however, believe that
--disable-spinlocks has some use while porting to a new platform.
(And I don't believe the argument advanced elsewhere in this thread
that everybody uses gcc; much less that gcc's atomic intrinsics are
of uniformly high quality even on oddball architectures.)
Heikki's idea has some attraction for porting work whether or not
you believe that DSM needs to work during initial porting. By
default, PG will try to create upwards of ten thousand spinlocks
just for buffer headers. It's likely that that will fail unless
you whack around the kernel's SysV parameters. Being able to
constrain the number of semaphores to something sane would probably
be useful.
Having said all that, I'm not personally going to take the time to
implement it, and I don't think the DSM patch should be required to
either. Somebody who actually needs it can go make it work.
Well, I took a look at this and it turns out not to be very hard, so
here's a patch. Currently, we allocate 3 semaphores per shared buffer
and a bunch of others, but the 3 per shared buffer dominates, so you
end up with ~49k spinlocks for the default of 128MB shared_buffers. I
chose to peg the number of semaphores at 1024, which is quite small
compared to the current allocation, but the number of spinlock
allocations that can be in progress at any given time is limited by
the number of running backends. Even allowing for the birthday
paradox, that should be enough to run at least a few dozen backends
without suffering serious problems due to the multiplexing -
especially because in real workloads, contention is usually
concentrated around a small number of spinlocks that are unlikely to
all be mapped to the same underlying semaphore.
I'm happy enough with this way forward. Objections?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
spin-multiplex.patch (text/x-patch)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 048a189..7eae2a1 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -471,6 +471,9 @@ typedef struct
slock_t *ShmemLock;
VariableCache ShmemVariableCache;
Backend *ShmemBackendArray;
+#ifndef HAVE_SPINLOCKS
+ PGSemaphore SpinlockSemaArray;
+#endif
LWLock *LWLockArray;
slock_t *ProcStructLock;
PROC_HDR *ProcGlobal;
@@ -5626,6 +5629,9 @@ save_backend_variables(BackendParameters *param, Port *port,
param->ShmemVariableCache = ShmemVariableCache;
param->ShmemBackendArray = ShmemBackendArray;
+#ifndef HAVE_SPINLOCKS
+ param->SpinlockSemaArray = SpinlockSemaArray;
+#endif
param->LWLockArray = LWLockArray;
param->ProcStructLock = ProcStructLock;
param->ProcGlobal = ProcGlobal;
@@ -5854,6 +5860,9 @@ restore_backend_variables(BackendParameters *param, Port *port)
ShmemVariableCache = param->ShmemVariableCache;
ShmemBackendArray = param->ShmemBackendArray;
+#ifndef HAVE_SPINLOCKS
+ SpinlockSemaArray = param->SpinlockSemaArray;
+#endif
LWLockArray = param->LWLockArray;
ProcStructLock = param->ProcStructLock;
ProcGlobal = param->ProcGlobal;
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 040c7aa..ad663ec 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -105,6 +105,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
* need to be so careful during the actual allocation phase.
*/
size = 100000;
+ size = add_size(size, SpinlockSemaSize());
size = add_size(size, hash_estimate_size(SHMEM_INDEX_SIZE,
sizeof(ShmemIndexEnt)));
size = add_size(size, BufferShmemSize());
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 18ba426..4efd0fb 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -116,9 +116,24 @@ InitShmemAllocation(void)
Assert(shmhdr != NULL);
/*
- * Initialize the spinlock used by ShmemAlloc. We have to do the space
- * allocation the hard way, since obviously ShmemAlloc can't be called
- * yet.
+ * If spinlocks are disabled, initialize emulation layer. We have to do
+ * the space allocation the hard way, since obviously ShmemAlloc can't be
+ * called yet.
+ */
+#ifndef HAVE_SPINLOCKS
+ {
+ PGSemaphore spinsemas;
+
+ spinsemas = (PGSemaphore) (((char *) shmhdr) + shmhdr->freeoffset);
+ shmhdr->freeoffset += MAXALIGN(SpinlockSemaSize());
+ SpinlockSemaInit(spinsemas);
+ Assert(shmhdr->freeoffset <= shmhdr->totalsize);
+ }
+#endif
+
+ /*
+ * Initialize the spinlock used by ShmemAlloc; we have to do this the hard
+ * way, too, for the same reasons as above.
*/
ShmemLock = (slock_t *) (((char *) shmhdr) + shmhdr->freeoffset);
shmhdr->freeoffset += MAXALIGN(sizeof(slock_t));
diff --git a/src/backend/storage/lmgr/spin.c b/src/backend/storage/lmgr/spin.c
index f054be8..e108a37 100644
--- a/src/backend/storage/lmgr/spin.c
+++ b/src/backend/storage/lmgr/spin.c
@@ -28,6 +28,21 @@
#include "storage/lwlock.h"
#include "storage/spin.h"
+#ifdef USE_ASSERT_CHECKING
+bool any_spinlock_held;
+#endif
+
+PGSemaphore SpinlockSemaArray;
+
+/*
+ * Report the amount of shared memory needed to store semaphores for spinlock
+ * support.
+ */
+Size
+SpinlockSemaSize(void)
+{
+ return SpinlockSemas() * sizeof(PGSemaphoreData);
+}
#ifdef HAVE_SPINLOCKS
@@ -43,8 +58,17 @@ SpinlockSemas(void)
/*
* No TAS, so spinlocks are implemented as PGSemaphores.
+ *
+ * In order to avoid requesting too many semaphores from the operating system,
+ * we multiplex however many spinlocks exist across a smaller number of OS
+ * semaphores. This works reasonably well as long as the number of concurrent
+ * processes is small enough to make it unlikely that two processes will try
+ * to grab two different spinlocks that map to the same semaphore at the same
+ * time. If this is not the case in your environment, you can raise this
+ * constant or, better yet, supply a real spinlock implementation for your
+ * platform.
*/
-
+#define NUM_SPINLOCK_SEMAPHORES 1024
/*
* Report number of semaphores needed to support spinlocks.
@@ -52,22 +76,20 @@ SpinlockSemas(void)
int
SpinlockSemas(void)
{
- int nsemas;
-
- /*
- * It would be cleaner to distribute this logic into the affected modules,
- * similar to the way shmem space estimation is handled.
- *
- * For now, though, there are few enough users of spinlocks that we just
- * keep the knowledge here.
- */
- nsemas = NumLWLocks(); /* one for each lwlock */
- nsemas += NBuffers; /* one for each buffer header */
- nsemas += max_wal_senders; /* one for each wal sender process */
- nsemas += num_xloginsert_slots; /* one for each WAL insertion slot */
- nsemas += 30; /* plus a bunch for other small-scale use */
-
- return nsemas;
+ return NUM_SPINLOCK_SEMAPHORES;
+}
+
+/*
+ * Initialize semaphores.
+ */
+extern void
+SpinlockSemaInit(PGSemaphore spinsemas)
+{
+ int i;
+
+ for (i = 0; i < NUM_SPINLOCK_SEMAPHORES; ++i)
+ PGSemaphoreCreate(&spinsemas[i]);
+ SpinlockSemaArray = spinsemas;
}
/*
@@ -77,13 +99,15 @@ SpinlockSemas(void)
void
s_init_lock_sema(volatile slock_t *lock)
{
- PGSemaphoreCreate((PGSemaphore) lock);
+ static int counter = 0;
+
+ *lock = (++counter) % NUM_SPINLOCK_SEMAPHORES;
}
void
s_unlock_sema(volatile slock_t *lock)
{
- PGSemaphoreUnlock((PGSemaphore) lock);
+ PGSemaphoreUnlock(&SpinlockSemaArray[*lock]);
}
bool
@@ -98,7 +122,7 @@ int
tas_sema(volatile slock_t *lock)
{
/* Note that TAS macros return 0 if *success* */
- return !PGSemaphoreTryLock((PGSemaphore) lock);
+ return !PGSemaphoreTryLock(&SpinlockSemaArray[*lock]);
}
#endif /* !HAVE_SPINLOCKS */
diff --git a/src/include/storage/s_lock.h b/src/include/storage/s_lock.h
index 7dcd5d9..6adecbb 100644
--- a/src/include/storage/s_lock.h
+++ b/src/include/storage/s_lock.h
@@ -915,7 +915,7 @@ spin_delay(void)
* to fall foul of kernel limits on number of semaphores, so don't use this
* unless you must! The subroutines appear in spin.c.
*/
-typedef PGSemaphoreData slock_t;
+typedef int slock_t;
extern bool s_lock_free_sema(volatile slock_t *lock);
extern void s_unlock_sema(volatile slock_t *lock);
diff --git a/src/include/storage/spin.h b/src/include/storage/spin.h
index f459b90..6193a8c 100644
--- a/src/include/storage/spin.h
+++ b/src/include/storage/spin.h
@@ -60,14 +60,35 @@
#define SpinLockInit(lock) S_INIT_LOCK(lock)
+#define SpinLockFree(lock) S_LOCK_FREE(lock)
+/*
+ * Spinlocks should only be acquired in straight-line code that does not
+ * loop, so no code should ever acquire a second spinlock while already
+ * holding one. This is particularly important when --disable-spinlocks is
+ * in use, because in that case we map all of the spinlocks in the system
+ * onto a smaller number of semaphores allocated from the OS; attempting
+ * to acquire more than one at a time could self-deadlock.
+ */
+#ifdef USE_ASSERT_CHECKING
+extern bool any_spinlock_held;
+#define SpinLockAcquire(lock) \
+ (AssertMacro(!any_spinlock_held), any_spinlock_held = true, \
+ S_LOCK(lock))
+#define SpinLockRelease(lock) \
+ (AssertMacro(any_spinlock_held), any_spinlock_held = false, \
+ S_UNLOCK(lock))
+#else
#define SpinLockAcquire(lock) S_LOCK(lock)
-
#define SpinLockRelease(lock) S_UNLOCK(lock)
-
-#define SpinLockFree(lock) S_LOCK_FREE(lock)
-
+#endif
extern int SpinlockSemas(void);
+extern Size SpinlockSemaSize(void);
+
+#ifndef HAVE_SPINLOCKS
+extern void SpinlockSemaInit(PGSemaphore);
+extern PGSemaphore SpinlockSemaArray;
+#endif
#endif /* SPIN_H */
Andres Freund <andres@2ndquadrant.com> writes:
On 2014-01-05 14:06:52 -0500, Tom Lane wrote:
I seem to recall that there was some good reason for keeping all the
LWLocks in an array, back when the facility was first designed.
I'm too lazy to research the point right now, but you might want to
go back and look at the archives around when lwlock.c was written.
Your proposal is at
/messages/by-id/1054.1001520600@sss.pgh.pa.us -
afaics there hasn't been much discussion about implementation details at
that level.
Yeah, there wasn't :-(. On reflection I believe that my reason for
building it like this was partially to reduce the number of pointer
variables needed, and partially to avoid exposing the contents of the
LWLock struct type to the rest of the backend.
Robert's idea of continuing to keep the "fixed" LWLocks in an array
would solve the first problem, but I don't think there's any way to
avoid exposing the struct type if we want to embed the things into
other structs, or even just rely on array indexing.
OTOH, the LWLock mechanism has been stable for long enough now that
we can probably suppose this struct is no more subject to churn than
any other widely-known one, so maybe that consideration is no longer
significant.
Another nice benefit of switching to separate allocations is that we could
get rid of the horrid kluge that lwlock.h is the place that defines
NUM_BUFFER_PARTITIONS and some other constants that don't belong there.
So on the whole I'm in favor of this, as long as we can avoid any
notational churn at most call sites.
Should we get rid of RequestAddinLWLocks() and LWLockAssign(), in
favor of telling extensions to just allocate some space and do
LWLockInit() on it?
BTW, NumLWLocks() also becomes a big problem if lwlock creation
gets decentralized. This is only used by SpinlockSemas() ...
so maybe we need to use Heikki's modulo idea to get rid of the
need for that to know the number of LWLocks, independently of DSM.
regards, tom lane
Robert Haas <robertmhaas@gmail.com> writes:
Well, I took a look at this and it turns out not to be very hard, so
here's a patch. Currently, we allocate 3 semaphores per shared buffer
and a bunch of others, but the 3 per shared buffer dominates, so you
end up with ~49k spinlocks for the default of 128MB shared_buffers. I
chose to peg the number of semaphores at 1024, which is quite small
compared to the current allocation, but the number of spinlock
allocations that can be in progress at any given time is limited by
the number of running backends. Even allowing for the birthday
paradox, that should be enough to run at least a few dozen backends
without suffering serious problems due to the multiplexing -
especially because in real workloads, contention is usually
concentrated around a small number of spinlocks that are unlikely to
all be mapped to the same underlying semaphore.
I'm happy enough with this way forward. Objections?
-1 for the any_spinlock_held business (useless overhead IMO, as it doesn't
have anything whatsoever to do with enforcing the actual coding rule).
And I'd suggest defining NUM_SPINLOCK_SEMAPHORES in pg_config_manual.h,
and maybe dropping SpinlockSemas() altogether in favor of just referencing
the constant. Otherwise this seems reasonable.
regards, tom lane
On Mon, Jan 6, 2014 at 2:48 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
Well, I took a look at this and it turns out not to be very hard, so
here's a patch. Currently, we allocate 3 semaphores per shared buffer
and a bunch of others, but the 3 per shared buffer dominates, so you
end up with ~49k spinlocks for the default of 128MB shared_buffers. I
chose to peg the number of semaphores at 1024, which is quite small
compared to the current allocation, but the number of spinlock
allocations that can be in progress at any given time is limited by
the number of running backends. Even allowing for the birthday
paradox, that should be enough to run at least a few dozen backends
without suffering serious problems due to the multiplexing -
especially because in real workloads, contention is usually
concentrated around a small number of spinlocks that are unlikely to
all be mapped to the same underlying semaphore.
I'm happy enough with this way forward. Objections?
-1 for the any_spinlock_held business (useless overhead IMO, as it doesn't
have anything whatsoever to do with enforcing the actual coding rule).
Hmm. I thought that was a pretty well-aimed bullet myself; why do you
think that it isn't? I don't particularly mind ripping it out, but it
seemed like a good automated test to me.
And I'd suggest defining NUM_SPINLOCK_SEMAPHORES in pg_config_manual.h,
and maybe dropping SpinlockSemas() altogether in favor of just referencing
the constant. Otherwise this seems reasonable.
As far as pg_config_manual.h is concerned, is this the sort of thing
you have in mind?
#ifndef HAVE_SPINLOCKS
#define NUM_SPINLOCK_SEMAPHORES 1024
#endif
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Jan 6, 2014 at 1:55 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
OTOH, the LWLock mechanism has been stable for long enough now that
we can probably suppose this struct is no more subject to churn than
any other widely-known one, so maybe that consideration is no longer
significant.
On the whole, I'd say it's been more stable than most. But even if we
do decide to change it, I'm not sure that really matters very much.
We whack widely-used data structures around fairly regularly, and I
haven't observed that to cause many problems. Just as one concrete
example, PGPROC changes pretty regularly, and I'm not aware of much
code that's been broken by that.
Should we get rid of RequestAddinLWLocks() and LWLockAssign(), in
favor of telling extensions to just allocate some space and do
LWLockInit() on it?
I don't really see any reason to deprecate that mechanism - it'll just
make it harder for people who want to write code that works on
multiple PostgreSQL releases to do so easily, and it's my belief that
there's a decent amount of third-party code that uses those APIs.
EnterpriseDB has some, for example. Broadly, I don't see any problem
with presuming that most code that uses lwlocks will be happy to keep
those lwlocks in the main array while allowing them to be stored
elsewhere in those cases where that's not convenient. Perhaps in the
long run that won't be true, but then again perhaps it will. Either
way, I don't see a real reason to rush into a change in policy.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
On Mon, Jan 6, 2014 at 2:48 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
-1 for the any_spinlock_held business (useless overhead IMO, as it doesn't
have anything whatsoever to do with enforcing the actual coding rule).
Hmm. I thought that was a pretty well-aimed bullet myself; why do you
think that it isn't? I don't particularly mind ripping it out, but it
seemed like a good automated test to me.
The coding rule is "only short straight-line code while holding a
spinlock". That could be violated in any number of nasty ways without
trying to take another spinlock. And since the whole point of the rule is
that spinlock-holding code segments should be quick, adding more overhead
to them doesn't seem very nice, even if it is only done in Assert builds.
I agree it'd be nicer if we had some better way than mere manual
inspection to enforce proper use of spinlocks; but this change doesn't
seem to me to move the ball downfield by any meaningful distance.
And I'd suggest defining NUM_SPINLOCK_SEMAPHORES in pg_config_manual.h,
and maybe dropping SpinlockSemas() altogether in favor of just referencing
the constant. Otherwise this seems reasonable.
As far as pg_config_manual.h is concerned, is this the sort of thing
you have in mind?
#ifndef HAVE_SPINLOCKS
#define NUM_SPINLOCK_SEMAPHORES 1024
#endif
I think we can just define it unconditionally, don't you? It shouldn't
get referenced in HAVE_SPINLOCK builds. Or is the point that you want
to enforce that?
regards, tom lane
Robert Haas <robertmhaas@gmail.com> writes:
On Mon, Jan 6, 2014 at 1:55 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
OTOH, the LWLock mechanism has been stable for long enough now that
we can probably suppose this struct is no more subject to churn than
any other widely-known one, so maybe that consideration is no longer
significant.
On the whole, I'd say it's been more stable than most. But even if we
do decide to change it, I'm not sure that really matters very much.
Actually, the real value of a module-local struct definition is that you
can be pretty darn sure that nothing except the code in that file is
manipulating the struct contents. I would've preferred that we expose
only an abstract struct definition, but don't quite see how to do that
if we're going to embed the things in buffer headers.
regards, tom lane
On Mon, Jan 6, 2014 at 3:32 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
On Mon, Jan 6, 2014 at 2:48 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
-1 for the any_spinlock_held business (useless overhead IMO, as it doesn't
have anything whatsoever to do with enforcing the actual coding rule).
Hmm. I thought that was a pretty well-aimed bullet myself; why do you
think that it isn't? I don't particularly mind ripping it out, but it
seemed like a good automated test to me.
The coding rule is "only short straight-line code while holding a
spinlock". That could be violated in any number of nasty ways without
trying to take another spinlock. And since the whole point of the rule is
that spinlock-holding code segments should be quick, adding more overhead
to them doesn't seem very nice, even if it is only done in Assert builds.
I agree it'd be nicer if we had some better way than mere manual
inspection to enforce proper use of spinlocks; but this change doesn't
seem to me to move the ball downfield by any meaningful distance.
Well, my thought was mainly that, while it may be a bad idea to take
another spinlock while holding a spinlock under any circumstances,
somebody might do it and it might appear to work just fine. The most
likely sequences seems to me to be something like SpinLockAcquire(...)
followed by LWLockConditionalAcquire(), thinking that things are OK
because the lock acquisition is conditional - but in fact the
conditional acquire still takes the spinlock unconditionally. Even
so, without --disable-spinlocks, this will probably work just fine.
Most likely there won't even be a performance problem, because a
lwlock has to be *very* hot for the spinlock acquisitions to become a
problem. But with --disable-spinlocks, it will unpredictably
self-deadlock. And since few developers are likely to test with
--disable-spinlocks, the problem most likely won't be noticed. When
someone does then try to use --disable-spinlocks to port to a new
platform and starts hitting the deadlocks, they may not know enough to
attribute them to the real cause. All in all it doesn't seem like
unduly expensive insurance, but I can remove it if the consensus is
against it.
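To spell out the pattern being described (the struct and lock names here
are hypothetical, purely to illustrate the mistake):

static void
broken_update(SomeSharedState *state)
{
    SpinLockAcquire(&state->mutex);

    /*
     * Looks safe because it's "conditional", but the conditional lwlock
     * acquire still takes the lwlock's internal spinlock unconditionally.
     * Under --disable-spinlocks that spinlock may share a semaphore with
     * state->mutex, so this can self-deadlock.
     */
    if (LWLockConditionalAcquire(SomeLock, LW_EXCLUSIVE))
        LWLockRelease(SomeLock);

    SpinLockRelease(&state->mutex);
}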
And I'd suggest defining NUM_SPINLOCK_SEMAPHORES in pg_config_manual.h,
and maybe dropping SpinlockSemas() altogether in favor of just referencing
the constant. Otherwise this seems reasonable.
As far as pg_config_manual.h is concerned, is this the sort of thing
you have in mind?
#ifndef HAVE_SPINLOCKS
#define NUM_SPINLOCK_SEMAPHORES 1024
#endif
I think we can just define it unconditionally, don't you? It shouldn't
get referenced in HAVE_SPINLOCK builds. Or is the point that you want
to enforce that?
I don't think it really matters much; seems like a style question to
me. That's why I asked.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Jan 6, 2014 at 3:40 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
On Mon, Jan 6, 2014 at 1:55 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
OTOH, the LWLock mechanism has been stable for long enough now that
we can probably suppose this struct is no more subject to churn than
any other widely-known one, so maybe that consideration is no longer
significant.
On the whole, I'd say it's been more stable than most. But even if we
do decide to change it, I'm not sure that really matters very much.
Actually, the real value of a module-local struct definition is that you
can be pretty darn sure that nothing except the code in that file is
manipulating the struct contents. I would've preferred that we expose
only an abstract struct definition, but don't quite see how to do that
if we're going to embed the things in buffer headers.
Agreed. I think it's good to keep struct definitions private as much
as possible, and I do. But I don't think this is going to cause a big
problem either; lwlocks have been around for a long time and the
conventions for using them are pretty well understood, I think.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
On Mon, Jan 6, 2014 at 3:32 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I agree it'd be nicer if we had some better way than mere manual
inspection to enforce proper use of spinlocks; but this change doesn't
seem to me to move the ball downfield by any meaningful distance.
Well, my thought was mainly that, while it may be a bad idea to take
another spinlock while holding a spinlock under any circumstances,
somebody might do it and it might appear to work just fine. The most
likely sequences seems to me to be something like SpinLockAcquire(...)
followed by LWLockConditionalAcquire(), thinking that things are OK
because the lock acquisition is conditional - but in fact the
conditional acquire still takes the spinlock unconditionally.
The point I'm making is that no such code should get past review,
whether it's got an obvious performance problem or not.
There's a huge amount of stuff that in principle we could add overhead
to Assert-enabled builds for, but we prefer not to --- an example not
too far afield from this issue is that there's no mechanism for detecting
deadlocks among multiple LWLock holders. If we go too far down the path
of adding useless checks to Assert builds, people will stop using such
builds for routine development, which would surely be a large net negative
reliability-wise.
I'd be okay with adding overhead to SpinLockAcquire/Release if it had some
meaningful probability of catching real bugs, but I don't see that that
claim can be made here.
regards, tom lane
On Mon, Jan 6, 2014 at 3:57 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
On Mon, Jan 6, 2014 at 3:32 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I agree it'd be nicer if we had some better way than mere manual
inspection to enforce proper use of spinlocks; but this change doesn't
seem to me to move the ball downfield by any meaningful distance.
Well, my thought was mainly that, while it may be a bad idea to take
another spinlock while holding a spinlock under any circumstances,
somebody might do it and it might appear to work just fine. The most
likely sequences seems to me to be something like SpinLockAcquire(...)
followed by LWLockConditionalAcquire(), thinking that things are OK
because the lock acquisition is conditional - but in fact the
conditional acquire still takes the spinlock unconditionally.
The point I'm making is that no such code should get past review,
whether it's got an obvious performance problem or not.
Sure, I agree, but we all make mistakes. It's just a judgement call
as to how likely you think it is that someone might make this
particular mistake, a topic upon which opinions may vary.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Jan 6, 2014 at 9:48 AM, Stephen Frost <sfrost@snowman.net> wrote:
None of these ideas are a complete solution for LWLOCK_STATS. In the
other three cases noted above, we only need an identifier for the lock
"instantaneously", so that we can pass it off to the logger or dtrace
or whatever. But LWLOCK_STATS wants to hold on to data about the
locks that were visited until the end of the session, and it does that
using an array that is *indexed* by lwlockid. I guess we could
replace that with a hash table. Ugh. Any suggestions?
Yeah, that's not fun. No good suggestions here offhand.
Replacing it with a hashtable turns out not to be too bad, either in
terms of code complexity or performance, so I think that's the way to
go. I did some test runs with pgbench -S, scale factor 300, 32
clients, shared_buffers=8GB, five minute runs and got these results:
resultsr.lwlock-stats.32.300.300:tps = 195493.037962 (including
connections establishing)
resultsr.lwlock-stats.32.300.300:tps = 189985.964658 (including
connections establishing)
resultsr.lwlock-stats.32.300.300:tps = 197641.293892 (including
connections establishing)
resultsr.lwlock-stats-htab.32.300.300:tps = 193286.066063 (including
connections establishing)
resultsr.lwlock-stats-htab.32.300.300:tps = 191832.100877 (including
connections establishing)
resultsr.lwlock-stats-htab.32.300.300:tps = 191899.834211 (including
connections establishing)
resultsr.master.32.300.300:tps = 197939.111998 (including connections
establishing)
resultsr.master.32.300.300:tps = 198641.337720 (including connections
establishing)
resultsr.master.32.300.300:tps = 198675.404349 (including connections
establishing)
"master" is the master branch, commit
10a82cda67731941c18256e009edad4a784a2994. "lwlock-stats" is the same,
but with LWLOCK_STATS defined. "lwlock-stats-htab" is the same, with
the attached patch and LWLOCK_STATS defined. The runs were
interleaved, but the results are shown here grouped by test run. If
we assume that the 189k result is an outlier, then there's probably
some regression associated with the lwlock-stats-htab patch, but not a
lot. Considering that this code isn't even compiled unless you have
LWLOCK_STATS defined, I think that's OK.
This is only part of the solution, of course: a complete solution will
involve making the hash table key something other than the lock ID.
What I'm thinking we can do is make the lock ID consist of two
unsigned 32-bit integers. One of these will be stored in the lwlock
itself, which if my calculations are correct won't increase the size
of LWLockPadded on any common platforms (a 64-bit integer would).
Let's call this the "tranch id". The other will be derived from the
LWLock address. Let's call this the "instance ID". We'll keep a
table of tranch IDs, which will be assigned consecutively starting
with 0. We'll keep an array of metadata for tranches, indexed by
tranch ID, and each entry will have three associated pieces of
information: an array base, a stride length, and a printable name.
When we need to identify an lwlock in the log or to dtrace, we'll
fetch the tranch ID from the lwlock itself and use that to index into
the tranch metadata array. We'll then take the address of the lwlock,
subtract the array base address for the tranch, and divide by the
stride length; the result is the instance ID. When reporting to the
user, we can report either the tranch ID directly or the associated
name for that tranch; in either case, we'll also report the instance
ID.
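A rough sketch of that bookkeeping (the type and field names here are
guesses at what such an API might look like, assuming the lwlock stores a
small tranche id; this is not the posted patch):

typedef struct LWLockTranche
{
    const char *name;           /* e.g. "buffer content lock" */
    void       *array_base;     /* address of first element containing a lock */
    Size        array_stride;   /* distance between successive elements */
} LWLockTranche;

static LWLockTranche *LWLockTrancheArray;   /* indexed by tranche id */

/* Turn an LWLock pointer into (name, instance id) for logging or dtrace. */
static void
identify_lock(LWLock *lock, const char **name, int *instance)
{
    LWLockTranche *t = &LWLockTrancheArray[lock->tranche];  /* assumed field */

    *name = t->name;
    *instance = (int) (((char *) lock - (char *) t->array_base) / t->array_stride);
}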
So initially we'll probably just have tranch 0: the main LWLock array.
If we move buffer content and I/O locks to the buffer headers, we'll
define tranch 1 and tranch 2 with the same base address: the start of
the buffer descriptor array, and the same stride length, the size of a
buffer descriptor. One will have the associated name "buffer content
lock" and the other "buffer I/O lock". If we want, we can
split the main LWLock array into several tranches so that we can more
easily identify lock manager locks, predicate lock manager locks, and
buffer mapping locks.
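Under the same assumed structures, registering those tranches might look
roughly like this fragment (BufferDescriptors and BufferDesc are the
existing buffer-header array and type; the registration calls themselves
are hypothetical and would run during shared-memory initialization):

    LWLockTrancheArray[0] = (LWLockTranche) {
        "main lwlock array", MainLWLockArray, sizeof(LWLockPadded)
    };
    LWLockTrancheArray[1] = (LWLockTranche) {
        "buffer content lock", BufferDescriptors, sizeof(BufferDesc)
    };
    LWLockTrancheArray[2] = (LWLockTranche) {
        "buffer I/O lock", BufferDescriptors, sizeof(BufferDesc)
    };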
I like this system because it's very cheap - we only need a small
array of metadata and a couple of memory accesses to name a lock - but
it still lets us report data in a way that's actually *more*
human-readable than what we have now.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
lwlock-stats-htab.patch (text/x-patch)
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 4f88d3f..c1da2a8 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -32,6 +32,10 @@
#include "storage/proc.h"
#include "storage/spin.h"
+#ifdef LWLOCK_STATS
+#include "utils/hsearch.h"
+#endif
+
/* We use the ShmemLock spinlock to protect LWLockAssign */
extern slock_t *ShmemLock;
@@ -91,11 +95,17 @@ static int lock_addin_request = 0;
static bool lock_addin_request_allowed = true;
#ifdef LWLOCK_STATS
+typedef struct lwlock_stats
+{
+ LWLockId lockid;
+ int sh_acquire_count;
+ int ex_acquire_count;
+ int block_count;
+ int spin_delay_count;
+} lwlock_stats;
+
static int counts_for_pid = 0;
-static int *sh_acquire_counts;
-static int *ex_acquire_counts;
-static int *block_counts;
-static int *spin_delay_counts;
+static HTAB *lwlock_stats_htab;
#endif
#ifdef LOCK_DEBUG
@@ -126,17 +136,25 @@ LOG_LWDEBUG(const char *where, LWLockId lockid, const char *msg)
static void init_lwlock_stats(void);
static void print_lwlock_stats(int code, Datum arg);
+static lwlock_stats *get_lwlock_stats_entry(LWLockId lockid);
static void
init_lwlock_stats(void)
{
- int *LWLockCounter = (int *) ((char *) LWLockArray - 2 * sizeof(int));
- int numLocks = LWLockCounter[1];
+ HASHCTL ctl;
+
+ if (lwlock_stats_htab != NULL)
+ {
+ hash_destroy(lwlock_stats_htab);
+ lwlock_stats_htab = NULL;
+ }
- sh_acquire_counts = calloc(numLocks, sizeof(int));
- ex_acquire_counts = calloc(numLocks, sizeof(int));
- spin_delay_counts = calloc(numLocks, sizeof(int));
- block_counts = calloc(numLocks, sizeof(int));
+ MemSet(&ctl, 0, sizeof(ctl));
+ ctl.keysize = sizeof(LWLockId);
+ ctl.entrysize = sizeof(lwlock_stats);
+ ctl.hash = tag_hash;
+ lwlock_stats_htab = hash_create("lwlock stats", 16384, &ctl,
+ HASH_ELEM | HASH_FUNCTION);
counts_for_pid = MyProcPid;
on_shmem_exit(print_lwlock_stats, 0);
}
@@ -144,23 +162,47 @@ init_lwlock_stats(void)
static void
print_lwlock_stats(int code, Datum arg)
{
- int i;
- int *LWLockCounter = (int *) ((char *) LWLockArray - 2 * sizeof(int));
- int numLocks = LWLockCounter[1];
+ HASH_SEQ_STATUS scan;
+ lwlock_stats *lwstats;
+
+ hash_seq_init(&scan, lwlock_stats_htab);
/* Grab an LWLock to keep different backends from mixing reports */
LWLockAcquire(0, LW_EXCLUSIVE);
- for (i = 0; i < numLocks; i++)
+ while ((lwstats = (lwlock_stats *) hash_seq_search(&scan)) != NULL)
{
- if (sh_acquire_counts[i] || ex_acquire_counts[i] || block_counts[i] || spin_delay_counts[i])
- fprintf(stderr, "PID %d lwlock %d: shacq %u exacq %u blk %u spindelay %u\n",
- MyProcPid, i, sh_acquire_counts[i], ex_acquire_counts[i],
- block_counts[i], spin_delay_counts[i]);
+ fprintf(stderr,
+ "PID %d lwlock %d: shacq %u exacq %u blk %u spindelay %u\n",
+ MyProcPid, lwstats->lockid, lwstats->sh_acquire_count,
+ lwstats->ex_acquire_count, lwstats->block_count,
+ lwstats->spin_delay_count);
}
LWLockRelease(0);
}
+
+static lwlock_stats *
+get_lwlock_stats_entry(LWLockId lockid)
+{
+ lwlock_stats *lwstats;
+ bool found;
+
+ /* Set up local count state first time through in a given process */
+ if (counts_for_pid != MyProcPid)
+ init_lwlock_stats();
+
+ /* Fetch or create the entry. */
+ lwstats = hash_search(lwlock_stats_htab, &lockid, HASH_ENTER, &found);
+ if (!found)
+ {
+ lwstats->sh_acquire_count = 0;
+ lwstats->ex_acquire_count = 0;
+ lwstats->block_count = 0;
+ lwstats->spin_delay_count = 0;
+ }
+ return lwstats;
+}
#endif /* LWLOCK_STATS */
@@ -344,18 +386,20 @@ LWLockAcquire(LWLockId lockid, LWLockMode mode)
PGPROC *proc = MyProc;
bool retry = false;
int extraWaits = 0;
+#ifdef LWLOCK_STATS
+ lwlock_stats *lwstats;
+#endif
PRINT_LWDEBUG("LWLockAcquire", lockid, lock);
#ifdef LWLOCK_STATS
- /* Set up local count state first time through in a given process */
- if (counts_for_pid != MyProcPid)
- init_lwlock_stats();
+ lwstats = get_lwlock_stats_entry(lockid);
+
/* Count lock acquisition attempts */
if (mode == LW_EXCLUSIVE)
- ex_acquire_counts[lockid]++;
+ lwstats->ex_acquire_count++;
else
- sh_acquire_counts[lockid]++;
+ lwstats->sh_acquire_count++;
#endif /* LWLOCK_STATS */
/*
@@ -398,7 +442,7 @@ LWLockAcquire(LWLockId lockid, LWLockMode mode)
/* Acquire mutex. Time spent holding mutex should be short! */
#ifdef LWLOCK_STATS
- spin_delay_counts[lockid] += SpinLockAcquire(&lock->mutex);
+ lwstats->spin_delay_count += SpinLockAcquire(&lock->mutex);
#else
SpinLockAcquire(&lock->mutex);
#endif
@@ -469,7 +513,7 @@ LWLockAcquire(LWLockId lockid, LWLockMode mode)
LOG_LWDEBUG("LWLockAcquire", lockid, "waiting");
#ifdef LWLOCK_STATS
- block_counts[lockid]++;
+ lwstats->block_count++;
#endif
TRACE_POSTGRESQL_LWLOCK_WAIT_START(lockid, mode);
@@ -598,13 +642,14 @@ LWLockAcquireOrWait(LWLockId lockid, LWLockMode mode)
PGPROC *proc = MyProc;
bool mustwait;
int extraWaits = 0;
+#ifdef LWLOCK_STATS
+ lwlock_stats *lwstats;
+#endif
PRINT_LWDEBUG("LWLockAcquireOrWait", lockid, lock);
#ifdef LWLOCK_STATS
- /* Set up local count state first time through in a given process */
- if (counts_for_pid != MyProcPid)
- init_lwlock_stats();
+ lwstats = get_lwlock_stats_entry(lockid);
#endif
/* Ensure we will have room to remember the lock */
@@ -674,7 +719,7 @@ LWLockAcquireOrWait(LWLockId lockid, LWLockMode mode)
LOG_LWDEBUG("LWLockAcquireOrWait", lockid, "waiting");
#ifdef LWLOCK_STATS
- block_counts[lockid]++;
+ lwstats->block_count++;
#endif
TRACE_POSTGRESQL_LWLOCK_WAIT_START(lockid, mode);
On 1/6/14, 2:59 PM, Robert Haas wrote:
On Mon, Jan 6, 2014 at 3:57 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
On Mon, Jan 6, 2014 at 3:32 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I agree it'd be nicer if we had some better way than mere manual
inspection to enforce proper use of spinlocks; but this change doesn't
seem to me to move the ball downfield by any meaningful distance.
Well, my thought was mainly that, while it may be a bad idea to take
another spinlock while holding a spinlock under any circumstances,
somebody might do it and it might appear to work just fine. The most
likely sequence seems to me to be something like SpinLockAcquire(...)
followed by LWLockConditionalAcquire(), thinking that things are OK
because the lock acquisition is conditional - but in fact the
conditional acquire still takes the spinlock unconditionally.
The point I'm making is that no such code should get past review,
whether it's got an obvious performance problem or not.
Sure, I agree, but we all make mistakes. It's just a judgement call
as to how likely you think it is that someone might make this
particular mistake, a topic upon which opinions may vary.
There have been discussions in the past about the overhead added by testing, and about having more than one level of test capability so that facilities like the buildfarm can run more expensive tests without burdening developers with them. ISTM this is another case of that (presumably Tom's concern is the performance hit of adding an assert in a critical path).
If we had a more aggressive form of assert (assert_anal? :)) we could use that here and let the buildfarm bear the brunt of that cost.
--
Jim C. Nasby, Data Architect jim@nasby.net
512.569.9461 (cell) http://jim.nasby.net
Jim Nasby wrote:
On 1/6/14, 2:59 PM, Robert Haas wrote:
On Mon, Jan 6, 2014 at 3:57 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
The point I'm making is that no such code should get past review,
whether it's got an obvious performance problem or not.
Sure, I agree, but we all make mistakes. It's just a judgement call
as to how likely you think it is that someone might make this
particular mistake, a topic upon which opinions may vary.
I agree that excessive optimism might cause problems in the future.
Maybe it makes sense to have such a check #ifdef'ed out on most builds
to avoid extra overhead, but not having any check at all just because we
trust the review process too much doesn't strike me as the best of
ideas.
There have been discussions in the past about the overhead added by testing, and about having more than one level of test capability so that facilities like the buildfarm can run more expensive tests without burdening developers with them. ISTM this is another case of that (presumably Tom's concern is the performance hit of adding an assert in a critical path).
If we had a more aggressive form of assert (assert_anal? :)) we could use that here and let the buildfarm bear the brunt of that cost.
Sounds good, but let's not enable it blindly on all machines. There are
some that are under very high pressure to build and test all the
branches, so if we add something too costly on them it might cause
trouble. So whatever we do, it should at least have the option to
opt-out, or perhaps even make it opt-in. (I'd think opt-out is better,
because otherwise very few machines are going to have it enabled at
all.)
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2014-01-06 21:35:22 -0300, Alvaro Herrera wrote:
Jim Nasby wrote:
On 1/6/14, 2:59 PM, Robert Haas wrote:
On Mon, Jan 6, 2014 at 3:57 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
The point I'm making is that no such code should get past review,
whether it's got an obvious performance problem or not.
Sure, I agree, but we all make mistakes. It's just a judgement call
as to how likely you think it is that someone might make this
particular mistake, a topic upon which opinions may vary.
I don't think it's that unlikely, as the previous implementation's rules,
when viewed while squinting, allowed nesting spinlocks. And it's a pretty
simple check.
Maybe it makes sense to have such a check #ifdef'ed out on most builds
to avoid extra overhead, but not having any check at all just because we
trust the review process too much doesn't strike me as the best of
ideas.
I don't think that check would have a relevantly high performance impact
in comparison to the rest of --enable-cassert - it's a single
process-local variable which is regularly accessed. It will just stay in
L1 or even in registers.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Jan 7, 2014 at 6:54 AM, Andres Freund <andres@2ndquadrant.com> wrote:
Maybe it makes sense to have such a check #ifdef'ed out on most builds
to avoid extra overhead, but not having any check at all just because we
trust the review process too much doesn't strike me as the best of
ideas.
I don't think that check would have a relevantly high performance impact
in comparison to the rest of --enable-cassert - it's a single
process-local variable which is regularly accessed. It will just stay in
L1 or even in registers.
I tried it out on a 16-core, 64-hwthread community box, PPC. pgbench
-S, 5-minute runs at scale factor 300 with shared_buffers=8GB:
resultsr.cassert.32.300.300:tps = 11341.627815 (including connections
establishing)
resultsr.cassert.32.300.300:tps = 11339.407976 (including connections
establishing)
resultsr.cassert.32.300.300:tps = 11321.339971 (including connections
establishing)
resultsr.cassert-spin-multiplex.32.300.300:tps = 11492.927372
(including connections establishing)
resultsr.cassert-spin-multiplex.32.300.300:tps = 11471.810937
(including connections establishing)
resultsr.cassert-spin-multiplex.32.300.300:tps = 11516.471053
(including connections establishing)
By comparison:
resultsr.master.32.300.300:tps = 197939.111998 (including connections
establishing)
resultsr.master.32.300.300:tps = 198641.337720 (including connections
establishing)
resultsr.master.32.300.300:tps = 198675.404349 (including connections
establishing)
So yeah, the overhead is negligible, if not negative. None of the
other changes in that patch were even compiled in this case, since all
that code is protected by --disable-spinlocks. Maybe somebody can
find another workload where the overhead is visible, but here it's
not. On the other hand, I did discover another bit of ugliness - the
macros actually have to be defined this way:
+#define SpinLockAcquire(lock) \
+ (AssertMacro(!any_spinlock_held), any_spinlock_held = true, \
+ S_LOCK(lock))
+#define SpinLockRelease(lock) \
+ do { Assert(any_spinlock_held); any_spinlock_held = false; \
+ S_UNLOCK(lock); } while (0)
SpinLockAcquire has to be a comma-separated expression rather than a
do {} while (0) block because S_LOCK returns a value (which is used
when compiling with LWLOCK_STATS); SpinLockRelease, however, has to be
done the other way because S_UNLOCK() is defined as a do {} while (0)
block already on PPC among other architectures.
Overall I don't really care enough about this to argue strenuously for
it. I'm not nearly so confident as Tom that no errors of this type
will creep into future code, but on the other hand, keeping
--disable-spinlocks working reliably isn't significant enough for me
to want to spend any political capital on it. It's probably got a dim
future regardless of what we do here.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Jan 6, 2014 at 5:50 PM, Robert Haas <robertmhaas@gmail.com> wrote:
This is only part of the solution, of course: a complete solution will
involve making the hash table key something other than the lock ID.
What I'm thinking we can do is make the lock ID consist of two
unsigned 32-bit integers. One of these will be stored in the lwlock
itself, which if my calculations are correct won't increase the size
of LWLockPadded on any common platforms (a 64-bit integer would).
Let's call this the "tranch id". The other will be derived from the
LWLock address. Let's call this the "instance ID". We'll keep a
table of tranch IDs, which will be assigned consecutively starting
with 0. We'll keep an array of metadata for tranches, indexed by
tranch ID, and each entry will have three associated pieces of
information: an array base, a stride length, and a printable name.
When we need to identify an lwlock in the log or to dtrace, we'll
fetch the tranch ID from the lwlock itself and use that to index into
the tranch metadata array. We'll then take the address of the lwlock,
subtract the array base address for the tranch, and divide by the
stride length; the result is the instance ID. When reporting to the
user, we can report either the tranch ID directly or the associated
name for that tranch; in either case, we'll also report the instance
ID.
So initially we'll probably just have tranch 0: the main LWLock array.
If we move buffer content and I/O locks to the buffer headers, we'll
define tranch 1 and tranch 2 with the same base address: the start of
the buffer descriptor array, and the same stride length, the size of a
buffer descriptor. One will have the associated name "buffer content
lock" and the other "buffer I/O lock". If we want, we can
split the main LWLock array into several tranches so that we can more
easily identify lock manager locks, predicate lock manager locks, and
buffer mapping locks.
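In code terms, that lookup boils down to pointer arithmetic along
these lines (a minimal standalone sketch of the scheme; the struct and
function names here are illustrative - the real definitions are the
LWLockTranche struct and the T_NAME/T_ID macros in the attached
patch):

#include <stddef.h>

/* Per-tranche metadata, indexed by tranche ID. */
typedef struct
{
	const char *name;			/* printable name, e.g. "main" */
	void	   *array_base;		/* start of the array containing the locks */
	size_t		array_stride;	/* size of one array element */
} TrancheMeta;

/*
 * Instance ID = (lock address - array base) / stride, i.e. the index of
 * the array element containing this lock.
 */
static int
lock_instance_id(const void *lock, const TrancheMeta *t)
{
	return (int) (((const char *) lock - (const char *) t->array_base) /
				  t->array_stride);
}

/* A stand-in for a buffer descriptor with an embedded lock. */
typedef struct
{
	int			lock;			/* embedded "lwlock" */
	char		payload[28];
} FakeBufferDesc;

int
main(void)
{
	FakeBufferDesc descs[8];
	TrancheMeta t = {"buffer content lock", descs, sizeof(FakeBufferDesc)};

	/* The lock embedded in descs[5] reports instance ID 5. */
	return lock_instance_id(&descs[5].lock, &t) == 5 ? 0 : 1;
}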
OK, I've implemented this: here's what I believe to be a complete
patch, based on the previous lwlock-pointers.patch but now handling
LOCK_DEBUG and TRACE_LWLOCKS and dtrace and a bunch of other loose
ends. I think this should be adequate for allowing lwlocks to be
stored elsewhere in the main shared memory segment as well as in
dynamic shared memory segments.
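To give a flavor of how this would be used from an extension that
keeps its state in a dynamic shared memory segment, something along
these lines should work under this patch (a sketch only; the MyDsmData
layout, the "my_extension" tranche name, and the helper function names
are made up for illustration):

#include "postgres.h"

#include "storage/dsm.h"
#include "storage/lwlock.h"

/* Shared state stored at the start of the DSM segment. */
typedef struct
{
	int			tranche_id;		/* stored here so attachers can find it */
	LWLock		lock;
	int			counter;		/* example data protected by the lock */
} MyDsmData;

/* Must live for the backend's lifetime; a static variable is fine. */
static LWLockTranche my_tranche;

/* Every backend, creator and attachers alike, registers the tranche. */
static void
register_my_tranche(MyDsmData *data)
{
	my_tranche.name = "my_extension";
	my_tranche.array_base = &data->lock;
	my_tranche.array_stride = sizeof(LWLock);
	LWLockRegisterTranche(data->tranche_id, &my_tranche);
}

/* Creating backend: run once after setting up the segment. */
static MyDsmData *
create_shared_state(dsm_segment *seg)
{
	MyDsmData  *data = dsm_segment_address(seg);

	data->tranche_id = LWLockNewTrancheId();
	LWLockInitialize(&data->lock, data->tranche_id);
	data->counter = 0;
	register_my_tranche(data);
	return data;
}

/* Any backend, once attached and registered, uses the lock normally. */
static void
bump_counter(MyDsmData *data)
{
	LWLockAcquire(&data->lock, LW_EXCLUSIVE);
	data->counter++;
	LWLockRelease(&data->lock);
}

An attaching backend would just dsm_attach the segment, read
tranche_id out of it, and call register_my_tranche before touching the
lock.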
Thoughts?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
lwlock-pointers-v2.patch (text/x-patch; charset=US-ASCII)
diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
index dbf8030..eeeede3 100644
--- a/contrib/pg_buffercache/pg_buffercache_pages.c
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -116,7 +116,12 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
* possible deadlocks.
*/
for (i = 0; i < NUM_BUFFER_PARTITIONS; i++)
- LWLockAcquire(FirstBufMappingLock + i, LW_SHARED);
+ {
+ LWLock *lock;
+
+ lock = &MainLWLockArray[BUFFER_MAPPING_LWLOCK_OFFSET + i].lock;
+ LWLockAcquire(lock, LW_SHARED);
+ }
/*
* Scan though all the buffers, saving the relevant fields in the
@@ -157,7 +162,12 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
* avoids O(N^2) behavior inside LWLockRelease.
*/
for (i = NUM_BUFFER_PARTITIONS; --i >= 0;)
- LWLockRelease(FirstBufMappingLock + i);
+ {
+ LWLock *lock;
+
+ lock = &MainLWLockArray[BUFFER_MAPPING_LWLOCK_OFFSET + i].lock;
+ LWLockRelease(lock);
+ }
}
funcctx = SRF_PERCALL_SETUP();
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 2f069b7..858cce3 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -150,7 +150,7 @@ typedef struct pgssEntry
*/
typedef struct pgssSharedState
{
- LWLockId lock; /* protects hashtable search/modification */
+ LWLock *lock; /* protects hashtable search/modification */
int query_size; /* max query length in bytes */
double cur_median_usage; /* current median usage in hashtable */
} pgssSharedState;
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 4ec6981..3d278a6 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -2212,49 +2212,55 @@ SELECT pg_stat_get_backend_pid(s.backendid) AS pid,
</row>
<row>
<entry>lwlock-acquire</entry>
- <entry>(LWLockId, LWLockMode)</entry>
+ <entry>(char *, int, LWLockMode)</entry>
<entry>Probe that fires when an LWLock has been acquired.
- arg0 is the LWLock's ID.
- arg1 is the requested lock mode, either exclusive or shared.</entry>
+ arg0 is the LWLock's tranche.
+ arg1 is the LWLock's offset within its tranche.
+ arg2 is the requested lock mode, either exclusive or shared.</entry>
</row>
<row>
<entry>lwlock-release</entry>
- <entry>(LWLockId)</entry>
+ <entry>(char *, int)</entry>
<entry>Probe that fires when an LWLock has been released (but note
that any released waiters have not yet been awakened).
- arg0 is the LWLock's ID.</entry>
+ arg0 is the LWLock's tranche.
+ arg1 is the LWLock's offset within its tranche.</entry>
</row>
<row>
<entry>lwlock-wait-start</entry>
- <entry>(LWLockId, LWLockMode)</entry>
+ <entry>(char *, int, LWLockMode)</entry>
<entry>Probe that fires when an LWLock was not immediately available and
a server process has begun to wait for the lock to become available.
- arg0 is the LWLock's ID.
- arg1 is the requested lock mode, either exclusive or shared.</entry>
+ arg0 is the LWLock's tranche.
+ arg1 is the LWLock's offset within its tranche.
+ arg2 is the requested lock mode, either exclusive or shared.</entry>
</row>
<row>
<entry>lwlock-wait-done</entry>
- <entry>(LWLockId, LWLockMode)</entry>
+ <entry>(char *, int, LWLockMode)</entry>
<entry>Probe that fires when a server process has been released from its
wait for an LWLock (it does not actually have the lock yet).
- arg0 is the LWLock's ID.
- arg1 is the requested lock mode, either exclusive or shared.</entry>
+ arg0 is the LWLock's tranche.
+ arg1 is the LWLock's offset within its tranche.
+ arg2 is the requested lock mode, either exclusive or shared.</entry>
</row>
<row>
<entry>lwlock-condacquire</entry>
- <entry>(LWLockId, LWLockMode)</entry>
+ <entry>(char *, int, LWLockMode)</entry>
<entry>Probe that fires when an LWLock was successfully acquired when the
caller specified no waiting.
- arg0 is the LWLock's ID.
- arg1 is the requested lock mode, either exclusive or shared.</entry>
+ arg0 is the LWLock's tranche.
+ arg1 is the LWLock's offset within its tranche.
+ arg2 is the requested lock mode, either exclusive or shared.</entry>
</row>
<row>
<entry>lwlock-condacquire-fail</entry>
- <entry>(LWLockId, LWLockMode)</entry>
+ <entry>(char *, int, LWLockMode)</entry>
<entry>Probe that fires when an LWLock was not successfully acquired when
the caller specified no waiting.
- arg0 is the LWLock's ID.
- arg1 is the requested lock mode, either exclusive or shared.</entry>
+ arg0 is the LWLock's tranche.
+ arg1 is the LWLock's offset within its tranche.
+ arg2 is the requested lock mode, either exclusive or shared.</entry>
</row>
<row>
<entry>lock-wait-start</entry>
diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index f604aa9..b90db9a 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -151,7 +151,7 @@ SimpleLruShmemSize(int nslots, int nlsns)
sz += MAXALIGN(nslots * sizeof(bool)); /* page_dirty[] */
sz += MAXALIGN(nslots * sizeof(int)); /* page_number[] */
sz += MAXALIGN(nslots * sizeof(int)); /* page_lru_count[] */
- sz += MAXALIGN(nslots * sizeof(LWLockId)); /* buffer_locks[] */
+ sz += MAXALIGN(nslots * sizeof(LWLock *)); /* buffer_locks[] */
if (nlsns > 0)
sz += MAXALIGN(nslots * nlsns * sizeof(XLogRecPtr)); /* group_lsn[] */
@@ -161,7 +161,7 @@ SimpleLruShmemSize(int nslots, int nlsns)
void
SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
- LWLockId ctllock, const char *subdir)
+ LWLock *ctllock, const char *subdir)
{
SlruShared shared;
bool found;
@@ -202,8 +202,8 @@ SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
offset += MAXALIGN(nslots * sizeof(int));
shared->page_lru_count = (int *) (ptr + offset);
offset += MAXALIGN(nslots * sizeof(int));
- shared->buffer_locks = (LWLockId *) (ptr + offset);
- offset += MAXALIGN(nslots * sizeof(LWLockId));
+ shared->buffer_locks = (LWLock **) (ptr + offset);
+ offset += MAXALIGN(nslots * sizeof(LWLock *));
if (nlsns > 0)
{
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index a7e40cc..59c2941 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -449,8 +449,6 @@ typedef struct
typedef int InheritableSocket;
#endif
-typedef struct LWLock LWLock; /* ugly kluge */
-
/*
* Structure contains all variables passed to exec:ed backends
*/
@@ -474,7 +472,7 @@ typedef struct
#ifndef HAVE_SPINLOCKS
PGSemaphore SpinlockSemaArray;
#endif
- LWLock *LWLockArray;
+ LWLock *MainLWLockArray;
slock_t *ProcStructLock;
PROC_HDR *ProcGlobal;
PGPROC *AuxiliaryProcs;
@@ -5583,7 +5581,6 @@ PostmasterMarkPIDForWorkerNotify(int pid)
* functions. They are marked NON_EXEC_STATIC in their home modules.
*/
extern slock_t *ShmemLock;
-extern LWLock *LWLockArray;
extern slock_t *ProcStructLock;
extern PGPROC *AuxiliaryProcs;
extern PMSignalData *PMSignalState;
@@ -5633,6 +5630,7 @@ save_backend_variables(BackendParameters *param, Port *port,
param->SpinlockSemaArray = SpinlockSemaArray;
#endif
param->LWLockArray = LWLockArray;
+ param->MainLWLockArray = MainLWLockArray;
param->ProcStructLock = ProcStructLock;
param->ProcGlobal = ProcGlobal;
param->AuxiliaryProcs = AuxiliaryProcs;
@@ -5863,7 +5861,7 @@ restore_backend_variables(BackendParameters *param, Port *port)
#ifndef HAVE_SPINLOCKS
SpinlockSemaArray = param->SpinlockSemaArray;
#endif
- LWLockArray = param->LWLockArray;
+ MainLWLockArray = param->MainLWLockArray;
ProcStructLock = param->ProcStructLock;
ProcGlobal = param->ProcGlobal;
AuxiliaryProcs = param->AuxiliaryProcs;
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index c6bae12..11ef8d0 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -146,7 +146,7 @@ PrefetchBuffer(Relation reln, ForkNumber forkNum, BlockNumber blockNum)
{
BufferTag newTag; /* identity of requested block */
uint32 newHash; /* hash value for newTag */
- LWLockId newPartitionLock; /* buffer partition lock for it */
+ LWLock *newPartitionLock; /* buffer partition lock for it */
int buf_id;
/* create a tag so we can lookup the buffer */
@@ -536,10 +536,10 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
{
BufferTag newTag; /* identity of requested block */
uint32 newHash; /* hash value for newTag */
- LWLockId newPartitionLock; /* buffer partition lock for it */
+ LWLock *newPartitionLock; /* buffer partition lock for it */
BufferTag oldTag; /* previous identity of selected buffer */
uint32 oldHash; /* hash value for oldTag */
- LWLockId oldPartitionLock; /* buffer partition lock for it */
+ LWLock *oldPartitionLock; /* buffer partition lock for it */
BufFlags oldFlags;
int buf_id;
volatile BufferDesc *buf;
@@ -888,7 +888,7 @@ InvalidateBuffer(volatile BufferDesc *buf)
{
BufferTag oldTag;
uint32 oldHash; /* hash value for oldTag */
- LWLockId oldPartitionLock; /* buffer partition lock for it */
+ LWLock *oldPartitionLock; /* buffer partition lock for it */
BufFlags oldFlags;
/* Save the original buffer tag before dropping the spinlock */
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 3c04fc3..9334af9 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -183,8 +183,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
* Now initialize LWLocks, which do shared memory allocation and are
* needed for InitShmemIndex.
*/
- if (!IsUnderPostmaster)
- CreateLWLocks();
+ CreateLWLocks();
/*
* Set up shmem.c index hashtable
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 5c8b4b0..a5f4138 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -565,7 +565,7 @@ LockHasWaiters(const LOCKTAG *locktag, LOCKMODE lockmode, bool sessionLock)
LOCALLOCK *locallock;
LOCK *lock;
PROCLOCK *proclock;
- LWLockId partitionLock;
+ LWLock *partitionLock;
bool hasWaiters = false;
if (lockmethodid <= 0 || lockmethodid >= lengthof(LockMethods))
@@ -702,7 +702,7 @@ LockAcquireExtended(const LOCKTAG *locktag,
bool found;
ResourceOwner owner;
uint32 hashcode;
- LWLockId partitionLock;
+ LWLock *partitionLock;
int status;
bool log_lock = false;
@@ -1744,7 +1744,7 @@ LockRelease(const LOCKTAG *locktag, LOCKMODE lockmode, bool sessionLock)
LOCALLOCK *locallock;
LOCK *lock;
PROCLOCK *proclock;
- LWLockId partitionLock;
+ LWLock *partitionLock;
bool wakeupNeeded;
if (lockmethodid <= 0 || lockmethodid >= lengthof(LockMethods))
@@ -2096,10 +2096,13 @@ LockReleaseAll(LOCKMETHODID lockmethodid, bool allLocks)
*/
for (partition = 0; partition < NUM_LOCK_PARTITIONS; partition++)
{
- LWLockId partitionLock = FirstLockMgrLock + partition;
+ LWLock *partitionLock;
SHM_QUEUE *procLocks = &(MyProc->myProcLocks[partition]);
PROCLOCK *nextplock;
+ partitionLock =
+ &MainLWLockArray[LOCK_MANAGER_LWLOCK_OFFSET + partition].lock;
+
/*
* If the proclock list for this partition is empty, we can skip
* acquiring the partition lock. This optimization is trickier than
@@ -2475,7 +2478,7 @@ static bool
FastPathTransferRelationLocks(LockMethod lockMethodTable, const LOCKTAG *locktag,
uint32 hashcode)
{
- LWLockId partitionLock = LockHashPartitionLock(hashcode);
+ LWLock *partitionLock = LockHashPartitionLock(hashcode);
Oid relid = locktag->locktag_field2;
uint32 i;
@@ -2565,7 +2568,7 @@ FastPathGetRelationLockEntry(LOCALLOCK *locallock)
LockMethod lockMethodTable = LockMethods[DEFAULT_LOCKMETHOD];
LOCKTAG *locktag = &locallock->tag.lock;
PROCLOCK *proclock = NULL;
- LWLockId partitionLock = LockHashPartitionLock(locallock->hashcode);
+ LWLock *partitionLock = LockHashPartitionLock(locallock->hashcode);
Oid relid = locktag->locktag_field2;
uint32 f;
@@ -2671,7 +2674,7 @@ GetLockConflicts(const LOCKTAG *locktag, LOCKMODE lockmode)
SHM_QUEUE *procLocks;
PROCLOCK *proclock;
uint32 hashcode;
- LWLockId partitionLock;
+ LWLock *partitionLock;
int count = 0;
int fast_count = 0;
@@ -2883,7 +2886,7 @@ LockRefindAndRelease(LockMethod lockMethodTable, PGPROC *proc,
PROCLOCKTAG proclocktag;
uint32 hashcode;
uint32 proclock_hashcode;
- LWLockId partitionLock;
+ LWLock *partitionLock;
bool wakeupNeeded;
hashcode = LockTagHashCode(locktag);
@@ -3159,10 +3162,13 @@ PostPrepare_Locks(TransactionId xid)
*/
for (partition = 0; partition < NUM_LOCK_PARTITIONS; partition++)
{
- LWLockId partitionLock = FirstLockMgrLock + partition;
+ LWLock *partitionLock;
SHM_QUEUE *procLocks = &(MyProc->myProcLocks[partition]);
PROCLOCK *nextplock;
+ partitionLock =
+ &MainLWLockArray[LOCK_MANAGER_LWLOCK_OFFSET + partition].lock;
+
/*
* If the proclock list for this partition is empty, we can skip
* acquiring the partition lock. This optimization is safer than the
@@ -3400,7 +3406,12 @@ GetLockStatusData(void)
* Must grab LWLocks in partition-number order to avoid LWLock deadlock.
*/
for (i = 0; i < NUM_LOCK_PARTITIONS; i++)
- LWLockAcquire(FirstLockMgrLock + i, LW_SHARED);
+ {
+ LWLock *partitionLock;
+
+ partitionLock = &MainLWLockArray[LOCK_MANAGER_LWLOCK_OFFSET + i].lock;
+ LWLockAcquire(partitionLock, LW_SHARED);
+ }
/* Now we can safely count the number of proclocks */
data->nelements = el + hash_get_num_entries(LockMethodProcLockHash);
@@ -3442,7 +3453,12 @@ GetLockStatusData(void)
* behavior inside LWLockRelease.
*/
for (i = NUM_LOCK_PARTITIONS; --i >= 0;)
- LWLockRelease(FirstLockMgrLock + i);
+ {
+ LWLock *partitionLock;
+
+ partitionLock = &MainLWLockArray[LOCK_MANAGER_LWLOCK_OFFSET + i].lock;
+ LWLockRelease(partitionLock);
+ }
Assert(el == data->nelements);
@@ -3477,7 +3493,12 @@ GetRunningTransactionLocks(int *nlocks)
* Must grab LWLocks in partition-number order to avoid LWLock deadlock.
*/
for (i = 0; i < NUM_LOCK_PARTITIONS; i++)
- LWLockAcquire(FirstLockMgrLock + i, LW_SHARED);
+ {
+ LWLock *partitionLock;
+
+ partitionLock = &MainLWLockArray[LOCK_MANAGER_LWLOCK_OFFSET + i].lock;
+ LWLockAcquire(partitionLock, LW_SHARED);
+ }
/* Now we can safely count the number of proclocks */
els = hash_get_num_entries(LockMethodProcLockHash);
@@ -3537,7 +3558,12 @@ GetRunningTransactionLocks(int *nlocks)
* behavior inside LWLockRelease.
*/
for (i = NUM_LOCK_PARTITIONS; --i >= 0;)
- LWLockRelease(FirstLockMgrLock + i);
+ {
+ LWLock *partitionLock;
+
+ partitionLock = &MainLWLockArray[LOCK_MANAGER_LWLOCK_OFFSET + i].lock;
+ LWLockRelease(partitionLock);
+ }
*nlocks = index;
return accessExclusiveLocks;
@@ -3673,7 +3699,7 @@ lock_twophase_recover(TransactionId xid, uint16 info,
uint32 hashcode;
uint32 proclock_hashcode;
int partition;
- LWLockId partitionLock;
+ LWLock *partitionLock;
LockMethod lockMethodTable;
Assert(len == sizeof(TwoPhaseLockRecord));
@@ -4044,7 +4070,7 @@ VirtualXactLock(VirtualTransactionId vxid, bool wait)
{
PROCLOCK *proclock;
uint32 hashcode;
- LWLockId partitionLock;
+ LWLock *partitionLock;
hashcode = LockTagHashCode(&tag);
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 0e319a7..55d9d78 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -31,50 +31,37 @@
#include "storage/predicate.h"
#include "storage/proc.h"
#include "storage/spin.h"
+#include "utils/memutils.h"
+
+#ifdef LWLOCK_STATS
+#include "utils/hsearch.h"
+#endif
/* We use the ShmemLock spinlock to protect LWLockAssign */
extern slock_t *ShmemLock;
-
-typedef struct LWLock
-{
- slock_t mutex; /* Protects LWLock and queue of PGPROCs */
- bool releaseOK; /* T if ok to release waiters */
- char exclusive; /* # of exclusive holders (0 or 1) */
- int shared; /* # of shared holders (0..MaxBackends) */
- PGPROC *head; /* head of list of waiting PGPROCs */
- PGPROC *tail; /* tail of list of waiting PGPROCs */
- /* tail is undefined when head is NULL */
-} LWLock;
-
/*
- * All the LWLock structs are allocated as an array in shared memory.
- * (LWLockIds are indexes into the array.) We force the array stride to
- * be a power of 2, which saves a few cycles in indexing, but more
- * importantly also ensures that individual LWLocks don't cross cache line
- * boundaries. This reduces cache contention problems, especially on AMD
- * Opterons. (Of course, we have to also ensure that the array start
- * address is suitably aligned.)
- *
- * LWLock is between 16 and 32 bytes on all known platforms, so these two
- * cases are sufficient.
+ * This is indexed by tranche ID and stores metadata for all tranches known
+ * to the current backend.
*/
-#define LWLOCK_PADDED_SIZE (sizeof(LWLock) <= 16 ? 16 : 32)
+static LWLockTranche **LWLockTrancheArray = NULL;
+static int LWLockTranchesAllocated = 0;
-typedef union LWLockPadded
-{
- LWLock lock;
- char pad[LWLOCK_PADDED_SIZE];
-} LWLockPadded;
+#define T_NAME(lock) \
+ (LWLockTrancheArray[(lock)->tranche]->name)
+#define T_ID(lock) \
+ ((int) ((((char *) lock) - \
+ ((char *) LWLockTrancheArray[(lock)->tranche]->array_base)) / \
+ LWLockTrancheArray[(lock)->tranche]->array_stride))
/*
- * This points to the array of LWLocks in shared memory. Backends inherit
+ * This points to the main array of LWLocks in shared memory. Backends inherit
* the pointer by fork from the postmaster (except in the EXEC_BACKEND case,
* where we have special measures to pass it down).
*/
-NON_EXEC_STATIC LWLockPadded *LWLockArray = NULL;
-
+LWLockPadded *MainLWLockArray = NULL;
+static LWLockTranche MainLWLockTranche;
/*
* We use this structure to keep track of locked LWLocks for release
@@ -85,58 +72,78 @@ NON_EXEC_STATIC LWLockPadded *LWLockArray = NULL;
#define MAX_SIMUL_LWLOCKS 100
static int num_held_lwlocks = 0;
-static LWLockId held_lwlocks[MAX_SIMUL_LWLOCKS];
+static LWLock *held_lwlocks[MAX_SIMUL_LWLOCKS];
static int lock_addin_request = 0;
static bool lock_addin_request_allowed = true;
#ifdef LWLOCK_STATS
+typedef struct lwlock_stats_key
+{
+ int tranche;
+ int instance;
+} lwlock_stats_key;
+
+typedef struct lwlock_stats
+{
+ lwlock_stats_key key;
+ int sh_acquire_count;
+ int ex_acquire_count;
+ int block_count;
+ int spin_delay_count;
+} lwlock_stats;
+
static int counts_for_pid = 0;
-static int *sh_acquire_counts;
-static int *ex_acquire_counts;
-static int *block_counts;
-static int *spin_delay_counts;
+static HTAB *lwlock_stats_htab;
#endif
#ifdef LOCK_DEBUG
bool Trace_lwlocks = false;
inline static void
-PRINT_LWDEBUG(const char *where, LWLockId lockid, const volatile LWLock *lock)
+PRINT_LWDEBUG(const char *where, const volatile LWLock *lock)
{
if (Trace_lwlocks)
- elog(LOG, "%s(%d): excl %d shared %d head %p rOK %d",
- where, (int) lockid,
+ elog(LOG, "%s(%s %d): excl %d shared %d head %p rOK %d",
+ where, T_NAME(lock), T_ID(lock),
(int) lock->exclusive, lock->shared, lock->head,
(int) lock->releaseOK);
}
inline static void
-LOG_LWDEBUG(const char *where, LWLockId lockid, const char *msg)
+LOG_LWDEBUG(const char *where, const char *name, int index, const char *msg)
{
if (Trace_lwlocks)
- elog(LOG, "%s(%d): %s", where, (int) lockid, msg);
+ elog(LOG, "%s(%s %d): %s", where, name, index, msg);
}
#else /* not LOCK_DEBUG */
-#define PRINT_LWDEBUG(a,b,c)
-#define LOG_LWDEBUG(a,b,c)
+#define PRINT_LWDEBUG(a,b)
+#define LOG_LWDEBUG(a,b,c,d)
#endif /* LOCK_DEBUG */
#ifdef LWLOCK_STATS
static void init_lwlock_stats(void);
static void print_lwlock_stats(int code, Datum arg);
+static lwlock_stats *get_lwlock_stats_entry(LWLock *lockid);
static void
init_lwlock_stats(void)
{
- int *LWLockCounter = (int *) ((char *) LWLockArray - 2 * sizeof(int));
- int numLocks = LWLockCounter[1];
+ HASHCTL ctl;
+
+ if (lwlock_stats_htab != NULL)
+ {
+ hash_destroy(lwlock_stats_htab);
+ lwlock_stats_htab = NULL;
+ }
- sh_acquire_counts = calloc(numLocks, sizeof(int));
- ex_acquire_counts = calloc(numLocks, sizeof(int));
- spin_delay_counts = calloc(numLocks, sizeof(int));
- block_counts = calloc(numLocks, sizeof(int));
+ MemSet(&ctl, 0, sizeof(ctl));
+ ctl.keysize = sizeof(lwlock_stats_key);
+ ctl.entrysize = sizeof(lwlock_stats);
+ ctl.hash = tag_hash;
+ lwlock_stats_htab = hash_create("lwlock stats", 16384, &ctl,
+ HASH_ELEM | HASH_FUNCTION);
counts_for_pid = MyProcPid;
on_shmem_exit(print_lwlock_stats, 0);
}
@@ -144,30 +151,58 @@ init_lwlock_stats(void)
static void
print_lwlock_stats(int code, Datum arg)
{
- int i;
- int *LWLockCounter = (int *) ((char *) LWLockArray - 2 * sizeof(int));
- int numLocks = LWLockCounter[1];
+ HASH_SEQ_STATUS scan;
+ lwlock_stats *lwstats;
+
+ hash_seq_init(&scan, lwlock_stats_htab);
/* Grab an LWLock to keep different backends from mixing reports */
- LWLockAcquire(0, LW_EXCLUSIVE);
+ LWLockAcquire(&MainLWLockArray[0].lock, LW_EXCLUSIVE);
- for (i = 0; i < numLocks; i++)
+ while ((lwstats = (lwlock_stats *) hash_seq_search(&scan)) != NULL)
{
- if (sh_acquire_counts[i] || ex_acquire_counts[i] || block_counts[i] || spin_delay_counts[i])
- fprintf(stderr, "PID %d lwlock %d: shacq %u exacq %u blk %u spindelay %u\n",
- MyProcPid, i, sh_acquire_counts[i], ex_acquire_counts[i],
- block_counts[i], spin_delay_counts[i]);
+ fprintf(stderr,
+ "PID %d lwlock %s %d: shacq %u exacq %u blk %u spindelay %u\n",
+ MyProcPid, LWLockTrancheArray[lwstats->key.tranche]->name,
+ lwstats->key.instance, lwstats->sh_acquire_count,
+ lwstats->ex_acquire_count, lwstats->block_count,
+ lwstats->spin_delay_count);
}
- LWLockRelease(0);
+ LWLockRelease(&MainLWLockArray[0].lock);
+}
+
+static lwlock_stats *
+get_lwlock_stats_entry(LWLock *lock)
+{
+ lwlock_stats_key key;
+ lwlock_stats *lwstats;
+ bool found;
+
+ /* Set up local count state first time through in a given process */
+ if (counts_for_pid != MyProcPid)
+ init_lwlock_stats();
+
+ /* Fetch or create the entry. */
+ key.tranche = lock->tranche;
+ key.instance = T_ID(lock);
+ lwstats = hash_search(lwlock_stats_htab, &key, HASH_ENTER, &found);
+ if (!found)
+ {
+ lwstats->sh_acquire_count = 0;
+ lwstats->ex_acquire_count = 0;
+ lwstats->block_count = 0;
+ lwstats->spin_delay_count = 0;
+ }
+ return lwstats;
}
#endif /* LWLOCK_STATS */
/*
- * Compute number of LWLocks to allocate.
+ * Compute number of LWLocks to allocate in the main array.
*/
-int
+static int
NumLWLocks(void)
{
int numLocks;
@@ -180,7 +215,7 @@ NumLWLocks(void)
*/
/* Predefined LWLocks */
- numLocks = (int) NumFixedLWLocks;
+ numLocks = NUM_FIXED_LWLOCKS;
/* bufmgr.c needs two for each shared buffer */
numLocks += 2 * NBuffers;
@@ -248,56 +283,67 @@ LWLockShmemSize(void)
size = mul_size(numLocks, sizeof(LWLockPadded));
/* Space for dynamic allocation counter, plus room for alignment. */
- size = add_size(size, 2 * sizeof(int) + LWLOCK_PADDED_SIZE);
+ size = add_size(size, 3 * sizeof(int) + LWLOCK_PADDED_SIZE);
return size;
}
/*
- * Allocate shmem space for LWLocks and initialize the locks.
+ * Allocate shmem space for the main LWLock array and initialize it. We also
+ * register the main tranch here.
*/
void
CreateLWLocks(void)
{
- int numLocks = NumLWLocks();
- Size spaceLocks = LWLockShmemSize();
- LWLockPadded *lock;
- int *LWLockCounter;
- char *ptr;
- int id;
+ if (!IsUnderPostmaster)
+ {
+ int numLocks = NumLWLocks();
+ Size spaceLocks = LWLockShmemSize();
+ LWLockPadded *lock;
+ int *LWLockCounter;
+ char *ptr;
+ int id;
- /* Allocate space */
- ptr = (char *) ShmemAlloc(spaceLocks);
+ /* Allocate space */
+ ptr = (char *) ShmemAlloc(spaceLocks);
- /* Leave room for dynamic allocation counter */
- ptr += 2 * sizeof(int);
+ /* Leave room for dynamic allocation of locks and tranches */
+ ptr += 3 * sizeof(int);
- /* Ensure desired alignment of LWLock array */
- ptr += LWLOCK_PADDED_SIZE - ((uintptr_t) ptr) % LWLOCK_PADDED_SIZE;
+ /* Ensure desired alignment of LWLock array */
+ ptr += LWLOCK_PADDED_SIZE - ((uintptr_t) ptr) % LWLOCK_PADDED_SIZE;
- LWLockArray = (LWLockPadded *) ptr;
+ MainLWLockArray = (LWLockPadded *) ptr;
- /*
- * Initialize all LWLocks to "unlocked" state
- */
- for (id = 0, lock = LWLockArray; id < numLocks; id++, lock++)
+ /* Initialize all LWLocks in main array */
+ for (id = 0, lock = MainLWLockArray; id < numLocks; id++, lock++)
+ LWLockInitialize(&lock->lock, 0);
+
+ /*
+ * Initialize the dynamic-allocation counters, which are stored just
+ * before the first LWLock. LWLockCounter[0] is the allocation
+ * counter for lwlocks, LWLockCounter[1] is the maximum number that
+ * can be allocated from the main array, and LWLockCounter[2] is the
+ * allocation counter for tranches.
+ */
+ LWLockCounter = (int *) ((char *) MainLWLockArray - 3 * sizeof(int));
+ LWLockCounter[0] = NUM_FIXED_LWLOCKS;
+ LWLockCounter[1] = numLocks;
+ LWLockCounter[2] = 1; /* 0 is the main array */
+ }
+
+ if (LWLockTrancheArray == NULL)
{
- SpinLockInit(&lock->lock.mutex);
- lock->lock.releaseOK = true;
- lock->lock.exclusive = 0;
- lock->lock.shared = 0;
- lock->lock.head = NULL;
- lock->lock.tail = NULL;
+ LWLockTranchesAllocated = 16;
+ LWLockTrancheArray = MemoryContextAlloc(TopMemoryContext,
+ LWLockTranchesAllocated * sizeof(LWLockTranche *));
}
- /*
- * Initialize the dynamic-allocation counter, which is stored just before
- * the first LWLock.
- */
- LWLockCounter = (int *) ((char *) LWLockArray - 2 * sizeof(int));
- LWLockCounter[0] = (int) NumFixedLWLocks;
- LWLockCounter[1] = numLocks;
+ MainLWLockTranche.name = "main";
+ MainLWLockTranche.array_base = MainLWLockArray;
+ MainLWLockTranche.array_stride = sizeof(LWLockPadded);
+ LWLockRegisterTranche(0, &MainLWLockTranche);
}
@@ -309,26 +355,86 @@ CreateLWLocks(void)
* startup, but it is needed if any user-defined code tries to allocate
* LWLocks after startup.
*/
-LWLockId
+LWLock *
LWLockAssign(void)
{
- LWLockId result;
+ LWLock *result;
/* use volatile pointer to prevent code rearrangement */
volatile int *LWLockCounter;
- LWLockCounter = (int *) ((char *) LWLockArray - 2 * sizeof(int));
+ LWLockCounter = (int *) ((char *) MainLWLockArray - 3 * sizeof(int));
SpinLockAcquire(ShmemLock);
if (LWLockCounter[0] >= LWLockCounter[1])
{
SpinLockRelease(ShmemLock);
- elog(ERROR, "no more LWLockIds available");
+ elog(ERROR, "no more LWLocks available");
}
- result = (LWLockId) (LWLockCounter[0]++);
+ result = &MainLWLockArray[LWLockCounter[0]++].lock;
SpinLockRelease(ShmemLock);
return result;
}
+/*
+ * Allocate a new tranche ID.
+ */
+int
+LWLockNewTrancheId(void)
+{
+ int result;
+
+ /* use volatile pointer to prevent code rearrangement */
+ volatile int *LWLockCounter;
+
+ LWLockCounter = (int *) ((char *) MainLWLockArray - 3 * sizeof(int));
+ SpinLockAcquire(ShmemLock);
+ result = LWLockCounter[2]++;
+ SpinLockRelease(ShmemLock);
+
+ return result;
+}
+
+/*
+ * Register a tranche ID in the lookup table for the current process. This
+ * routine will save a pointer to the tranche object passed as an argument,
+ * so that object should be allocated in a backend-lifetime context
+ * (TopMemoryContext, static variable, or similar).
+ */
+void
+LWLockRegisterTranche(int tranche_id, LWLockTranche *tranche)
+{
+ Assert(LWLockTrancheArray != NULL);
+
+ if (tranche_id >= LWLockTranchesAllocated)
+ {
+ int i = LWLockTranchesAllocated;
+
+ while (i < tranche_id)
+ i *= 2;
+
+ LWLockTrancheArray = repalloc(LWLockTrancheArray,
+ i * sizeof(LWLockTranche *));
+ LWLockTranchesAllocated = i;
+ }
+
+ LWLockTrancheArray[tranche_id] = tranche;
+}
+
+/*
+ * LWLockInitialize - initialize a new lwlock; it's initially unlocked
+ */
+void
+LWLockInitialize(LWLock *lock, int tranche_id)
+{
+ SpinLockInit(&lock->mutex);
+ lock->releaseOK = true;
+ lock->exclusive = 0;
+ lock->shared = 0;
+ lock->tranche = tranche_id;
+ lock->head = NULL;
+ lock->tail = NULL;
+}
+
/*
* LWLockAcquire - acquire a lightweight lock in the specified mode
@@ -338,24 +444,26 @@ LWLockAssign(void)
* Side effect: cancel/die interrupts are held off until lock release.
*/
void
-LWLockAcquire(LWLockId lockid, LWLockMode mode)
+LWLockAcquire(LWLock *l, LWLockMode mode)
{
- volatile LWLock *lock = &(LWLockArray[lockid].lock);
+ volatile LWLock *lock = l;
PGPROC *proc = MyProc;
bool retry = false;
int extraWaits = 0;
+#ifdef LWLOCK_STATS
+ lwlock_stats *lwstats;
+#endif
- PRINT_LWDEBUG("LWLockAcquire", lockid, lock);
+ PRINT_LWDEBUG("LWLockAcquire", lock);
#ifdef LWLOCK_STATS
- /* Set up local count state first time through in a given process */
- if (counts_for_pid != MyProcPid)
- init_lwlock_stats();
+ lwstats = get_lwlock_stats_entry(l);
+
/* Count lock acquisition attempts */
if (mode == LW_EXCLUSIVE)
- ex_acquire_counts[lockid]++;
+ lwstats->ex_acquire_count++;
else
- sh_acquire_counts[lockid]++;
+ lwstats->sh_acquire_count++;
#endif /* LWLOCK_STATS */
/*
@@ -398,7 +506,7 @@ LWLockAcquire(LWLockId lockid, LWLockMode mode)
/* Acquire mutex. Time spent holding mutex should be short! */
#ifdef LWLOCK_STATS
- spin_delay_counts[lockid] += SpinLockAcquire(&lock->mutex);
+ lwstats->spin_delay_count += SpinLockAcquire(&lock->mutex);
#else
SpinLockAcquire(&lock->mutex);
#endif
@@ -466,13 +574,13 @@ LWLockAcquire(LWLockId lockid, LWLockMode mode)
* so that the lock manager or signal manager will see the received
* signal when it next waits.
*/
- LOG_LWDEBUG("LWLockAcquire", lockid, "waiting");
+ LOG_LWDEBUG("LWLockAcquire", T_NAME(l), T_ID(l), "waiting");
#ifdef LWLOCK_STATS
- block_counts[lockid]++;
+ lwstats->block_count++;
#endif
- TRACE_POSTGRESQL_LWLOCK_WAIT_START(lockid, mode);
+ TRACE_POSTGRESQL_LWLOCK_WAIT_START(T_NAME(l), T_ID(l), mode);
for (;;)
{
@@ -483,9 +591,9 @@ LWLockAcquire(LWLockId lockid, LWLockMode mode)
extraWaits++;
}
- TRACE_POSTGRESQL_LWLOCK_WAIT_DONE(lockid, mode);
+ TRACE_POSTGRESQL_LWLOCK_WAIT_DONE(T_NAME(l), T_ID(l), mode);
- LOG_LWDEBUG("LWLockAcquire", lockid, "awakened");
+ LOG_LWDEBUG("LWLockAcquire", T_NAME(l), T_ID(l), "awakened");
/* Now loop back and try to acquire lock again. */
retry = true;
@@ -494,10 +602,10 @@ LWLockAcquire(LWLockId lockid, LWLockMode mode)
/* We are done updating shared state of the lock itself. */
SpinLockRelease(&lock->mutex);
- TRACE_POSTGRESQL_LWLOCK_ACQUIRE(lockid, mode);
+ TRACE_POSTGRESQL_LWLOCK_ACQUIRE(T_NAME(l), T_ID(l), mode);
/* Add lock to list of locks held by this backend */
- held_lwlocks[num_held_lwlocks++] = lockid;
+ held_lwlocks[num_held_lwlocks++] = l;
/*
* Fix the process wait semaphore's count for any absorbed wakeups.
@@ -514,12 +622,12 @@ LWLockAcquire(LWLockId lockid, LWLockMode mode)
* If successful, cancel/die interrupts are held off until lock release.
*/
bool
-LWLockConditionalAcquire(LWLockId lockid, LWLockMode mode)
+LWLockConditionalAcquire(LWLock *l, LWLockMode mode)
{
- volatile LWLock *lock = &(LWLockArray[lockid].lock);
+ volatile LWLock *lock = l;
bool mustwait;
- PRINT_LWDEBUG("LWLockConditionalAcquire", lockid, lock);
+ PRINT_LWDEBUG("LWLockConditionalAcquire", lock);
/* Ensure we will have room to remember the lock */
if (num_held_lwlocks >= MAX_SIMUL_LWLOCKS)
@@ -564,14 +672,14 @@ LWLockConditionalAcquire(LWLockId lockid, LWLockMode mode)
{
/* Failed to get lock, so release interrupt holdoff */
RESUME_INTERRUPTS();
- LOG_LWDEBUG("LWLockConditionalAcquire", lockid, "failed");
- TRACE_POSTGRESQL_LWLOCK_CONDACQUIRE_FAIL(lockid, mode);
+ LOG_LWDEBUG("LWLockConditionalAcquire", T_NAME(l), T_ID(l), "failed");
+ TRACE_POSTGRESQL_LWLOCK_CONDACQUIRE_FAIL(T_NAME(l), T_ID(l), mode);
}
else
{
/* Add lock to list of locks held by this backend */
- held_lwlocks[num_held_lwlocks++] = lockid;
- TRACE_POSTGRESQL_LWLOCK_CONDACQUIRE(lockid, mode);
+ held_lwlocks[num_held_lwlocks++] = l;
+ TRACE_POSTGRESQL_LWLOCK_CONDACQUIRE(T_NAME(l), T_ID(l), mode);
}
return !mustwait;
@@ -592,19 +700,20 @@ LWLockConditionalAcquire(LWLockId lockid, LWLockMode mode)
* wake up, observe that their records have already been flushed, and return.
*/
bool
-LWLockAcquireOrWait(LWLockId lockid, LWLockMode mode)
+LWLockAcquireOrWait(LWLock *l, LWLockMode mode)
{
- volatile LWLock *lock = &(LWLockArray[lockid].lock);
+ volatile LWLock *lock = l;
PGPROC *proc = MyProc;
bool mustwait;
int extraWaits = 0;
+#ifdef LWLOCK_STATS
+ lwlock_stats *lwstats;
+#endif
- PRINT_LWDEBUG("LWLockAcquireOrWait", lockid, lock);
+ PRINT_LWDEBUG("LWLockAcquireOrWait", lock);
#ifdef LWLOCK_STATS
- /* Set up local count state first time through in a given process */
- if (counts_for_pid != MyProcPid)
- init_lwlock_stats();
+ lwstats = get_lwlock_stats_entry(l);
#endif
/* Ensure we will have room to remember the lock */
@@ -671,13 +780,13 @@ LWLockAcquireOrWait(LWLockId lockid, LWLockMode mode)
* Wait until awakened. Like in LWLockAcquire, be prepared for bogus
* wakups, because we share the semaphore with ProcWaitForSignal.
*/
- LOG_LWDEBUG("LWLockAcquireOrWait", lockid, "waiting");
+ LOG_LWDEBUG("LWLockAcquireOrWait", T_NAME(l), T_ID(l), "waiting");
#ifdef LWLOCK_STATS
- block_counts[lockid]++;
+ lwstats->block_count++;
#endif
- TRACE_POSTGRESQL_LWLOCK_WAIT_START(lockid, mode);
+ TRACE_POSTGRESQL_LWLOCK_WAIT_START(T_NAME(l), T_ID(l), mode);
for (;;)
{
@@ -688,9 +797,9 @@ LWLockAcquireOrWait(LWLockId lockid, LWLockMode mode)
extraWaits++;
}
- TRACE_POSTGRESQL_LWLOCK_WAIT_DONE(lockid, mode);
+ TRACE_POSTGRESQL_LWLOCK_WAIT_DONE(T_NAME(l), T_ID(l), mode);
- LOG_LWDEBUG("LWLockAcquireOrWait", lockid, "awakened");
+ LOG_LWDEBUG("LWLockAcquireOrWait", T_NAME(l), T_ID(l), "awakened");
}
else
{
@@ -708,14 +817,14 @@ LWLockAcquireOrWait(LWLockId lockid, LWLockMode mode)
{
/* Failed to get lock, so release interrupt holdoff */
RESUME_INTERRUPTS();
- LOG_LWDEBUG("LWLockAcquireOrWait", lockid, "failed");
- TRACE_POSTGRESQL_LWLOCK_WAIT_UNTIL_FREE_FAIL(lockid, mode);
+ LOG_LWDEBUG("LWLockAcquireOrWait", T_NAME(l), T_ID(l), "failed");
+ TRACE_POSTGRESQL_LWLOCK_WAIT_UNTIL_FREE_FAIL(T_NAME(l), T_ID(l), mode);
}
else
{
/* Add lock to list of locks held by this backend */
- held_lwlocks[num_held_lwlocks++] = lockid;
- TRACE_POSTGRESQL_LWLOCK_WAIT_UNTIL_FREE(lockid, mode);
+ held_lwlocks[num_held_lwlocks++] = l;
+ TRACE_POSTGRESQL_LWLOCK_WAIT_UNTIL_FREE(T_NAME(l), T_ID(l), mode);
}
return !mustwait;
@@ -725,14 +834,14 @@ LWLockAcquireOrWait(LWLockId lockid, LWLockMode mode)
* LWLockRelease - release a previously acquired lock
*/
void
-LWLockRelease(LWLockId lockid)
+LWLockRelease(LWLock *l)
{
- volatile LWLock *lock = &(LWLockArray[lockid].lock);
+ volatile LWLock *lock = l;
PGPROC *head;
PGPROC *proc;
int i;
- PRINT_LWDEBUG("LWLockRelease", lockid, lock);
+ PRINT_LWDEBUG("LWLockRelease", lock);
/*
* Remove lock from list of locks held. Usually, but not always, it will
@@ -740,11 +849,11 @@ LWLockRelease(LWLockId lockid)
*/
for (i = num_held_lwlocks; --i >= 0;)
{
- if (lockid == held_lwlocks[i])
+ if (l == held_lwlocks[i])
break;
}
if (i < 0)
- elog(ERROR, "lock %d is not held", (int) lockid);
+ elog(ERROR, "lock %s %d is not held", T_NAME(l), T_ID(l));
num_held_lwlocks--;
for (; i < num_held_lwlocks; i++)
held_lwlocks[i] = held_lwlocks[i + 1];
@@ -824,14 +933,14 @@ LWLockRelease(LWLockId lockid)
/* We are done updating shared state of the lock itself. */
SpinLockRelease(&lock->mutex);
- TRACE_POSTGRESQL_LWLOCK_RELEASE(lockid);
+ TRACE_POSTGRESQL_LWLOCK_RELEASE(T_NAME(l), T_ID(l));
/*
* Awaken any waiters I removed from the queue.
*/
while (head != NULL)
{
- LOG_LWDEBUG("LWLockRelease", lockid, "release waiter");
+ LOG_LWDEBUG("LWLockRelease", T_NAME(l), T_ID(l), "release waiter");
proc = head;
head = proc->lwWaitLink;
proc->lwWaitLink = NULL;
@@ -874,13 +983,13 @@ LWLockReleaseAll(void)
* lock is held shared or exclusive.
*/
bool
-LWLockHeldByMe(LWLockId lockid)
+LWLockHeldByMe(LWLock *l)
{
int i;
for (i = 0; i < num_held_lwlocks; i++)
{
- if (held_lwlocks[i] == lockid)
+ if (held_lwlocks[i] == l)
return true;
}
return false;
diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index 633b37d..9488292 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -241,7 +241,8 @@
#define PredicateLockHashPartition(hashcode) \
((hashcode) % NUM_PREDICATELOCK_PARTITIONS)
#define PredicateLockHashPartitionLock(hashcode) \
- ((LWLockId) (FirstPredicateLockMgrLock + PredicateLockHashPartition(hashcode)))
+ (&MainLWLockArray[PREDICATELOCK_MANAGER_LWLOCK_OFFSET + \
+ PredicateLockHashPartition(hashcode)].lock)
#define NPREDICATELOCKTARGETENTS() \
mul_size(max_predicate_locks_per_xact, add_size(MaxBackends, max_prepared_xacts))
@@ -383,7 +384,7 @@ static SHM_QUEUE *FinishedSerializableTransactions;
*/
static const PREDICATELOCKTARGETTAG ScratchTargetTag = {0, 0, 0, 0};
static uint32 ScratchTargetTagHash;
-static int ScratchPartitionLock;
+static LWLock *ScratchPartitionLock;
/*
* The local hash table used to determine when to combine multiple fine-
@@ -1398,7 +1399,13 @@ GetPredicateLockStatusData(void)
* in ascending order, then SerializableXactHashLock.
*/
for (i = 0; i < NUM_PREDICATELOCK_PARTITIONS; i++)
- LWLockAcquire(FirstPredicateLockMgrLock + i, LW_SHARED);
+ {
+ LWLock *partitionLock;
+
+ partitionLock =
+ &MainLWLockArray[PREDICATELOCK_MANAGER_LWLOCK_OFFSET + i].lock;
+ LWLockAcquire(partitionLock, LW_SHARED);
+ }
LWLockAcquire(SerializableXactHashLock, LW_SHARED);
/* Get number of locks and allocate appropriately-sized arrays. */
@@ -1427,7 +1434,13 @@ GetPredicateLockStatusData(void)
/* Release locks in reverse order */
LWLockRelease(SerializableXactHashLock);
for (i = NUM_PREDICATELOCK_PARTITIONS - 1; i >= 0; i--)
- LWLockRelease(FirstPredicateLockMgrLock + i);
+ {
+ LWLock *partitionLock;
+
+ partitionLock =
+ &MainLWLockArray[PREDICATELOCK_MANAGER_LWLOCK_OFFSET + i].lock;
+ LWLockRelease(partitionLock);
+ }
return data;
}
@@ -1856,7 +1869,7 @@ PageIsPredicateLocked(Relation relation, BlockNumber blkno)
{
PREDICATELOCKTARGETTAG targettag;
uint32 targettaghash;
- LWLockId partitionLock;
+ LWLock *partitionLock;
PREDICATELOCKTARGET *target;
SET_PREDICATELOCKTARGETTAG_PAGE(targettag,
@@ -2089,7 +2102,7 @@ DeleteChildTargetLocks(const PREDICATELOCKTARGETTAG *newtargettag)
if (TargetTagIsCoveredBy(oldtargettag, *newtargettag))
{
uint32 oldtargettaghash;
- LWLockId partitionLock;
+ LWLock *partitionLock;
PREDICATELOCK *rmpredlock PG_USED_FOR_ASSERTS_ONLY;
oldtargettaghash = PredicateLockTargetTagHashCode(&oldtargettag);
@@ -2301,7 +2314,7 @@ CreatePredicateLock(const PREDICATELOCKTARGETTAG *targettag,
PREDICATELOCKTARGET *target;
PREDICATELOCKTAG locktag;
PREDICATELOCK *lock;
- LWLockId partitionLock;
+ LWLock *partitionLock;
bool found;
partitionLock = PredicateLockHashPartitionLock(targettaghash);
@@ -2599,10 +2612,10 @@ TransferPredicateLocksToNewTarget(PREDICATELOCKTARGETTAG oldtargettag,
bool removeOld)
{
uint32 oldtargettaghash;
- LWLockId oldpartitionLock;
+ LWLock *oldpartitionLock;
PREDICATELOCKTARGET *oldtarget;
uint32 newtargettaghash;
- LWLockId newpartitionLock;
+ LWLock *newpartitionLock;
bool found;
bool outOfShmem = false;
@@ -2858,7 +2871,13 @@ DropAllPredicateLocksFromTable(Relation relation, bool transfer)
/* Acquire locks on all lock partitions */
LWLockAcquire(SerializablePredicateLockListLock, LW_EXCLUSIVE);
for (i = 0; i < NUM_PREDICATELOCK_PARTITIONS; i++)
- LWLockAcquire(FirstPredicateLockMgrLock + i, LW_EXCLUSIVE);
+ {
+ LWLock *partitionLock;
+
+ partitionLock =
+ &MainLWLockArray[PREDICATELOCK_MANAGER_LWLOCK_OFFSET + i].lock;
+ LWLockAcquire(partitionLock, LW_EXCLUSIVE);
+ }
LWLockAcquire(SerializableXactHashLock, LW_EXCLUSIVE);
/*
@@ -2996,7 +3015,13 @@ DropAllPredicateLocksFromTable(Relation relation, bool transfer)
/* Release locks in reverse order */
LWLockRelease(SerializableXactHashLock);
for (i = NUM_PREDICATELOCK_PARTITIONS - 1; i >= 0; i--)
- LWLockRelease(FirstPredicateLockMgrLock + i);
+ {
+ LWLock *partitionLock;
+
+ partitionLock =
+ &MainLWLockArray[PREDICATELOCK_MANAGER_LWLOCK_OFFSET + i].lock;
+ LWLockRelease(partitionLock);
+ }
LWLockRelease(SerializablePredicateLockListLock);
}
@@ -3611,7 +3636,7 @@ ClearOldPredicateLocks(void)
PREDICATELOCKTARGET *target;
PREDICATELOCKTARGETTAG targettag;
uint32 targettaghash;
- LWLockId partitionLock;
+ LWLock *partitionLock;
tag = predlock->tag;
target = tag.myTarget;
@@ -3690,7 +3715,7 @@ ReleaseOneSerializableXact(SERIALIZABLEXACT *sxact, bool partial,
PREDICATELOCKTARGET *target;
PREDICATELOCKTARGETTAG targettag;
uint32 targettaghash;
- LWLockId partitionLock;
+ LWLock *partitionLock;
nextpredlock = (PREDICATELOCK *)
SHMQueueNext(&(sxact->predicateLocks),
@@ -4068,7 +4093,7 @@ static void
CheckTargetForConflictsIn(PREDICATELOCKTARGETTAG *targettag)
{
uint32 targettaghash;
- LWLockId partitionLock;
+ LWLock *partitionLock;
PREDICATELOCKTARGET *target;
PREDICATELOCK *predlock;
PREDICATELOCK *mypredlock = NULL;
@@ -4360,7 +4385,13 @@ CheckTableForSerializableConflictIn(Relation relation)
LWLockAcquire(SerializablePredicateLockListLock, LW_EXCLUSIVE);
for (i = 0; i < NUM_PREDICATELOCK_PARTITIONS; i++)
- LWLockAcquire(FirstPredicateLockMgrLock + i, LW_SHARED);
+ {
+ LWLock *partitionLock;
+
+ partitionLock =
+ &MainLWLockArray[PREDICATELOCK_MANAGER_LWLOCK_OFFSET + i].lock;
+ LWLockAcquire(partitionLock, LW_SHARED);
+ }
LWLockAcquire(SerializableXactHashLock, LW_SHARED);
/* Scan through target list */
@@ -4407,7 +4438,13 @@ CheckTableForSerializableConflictIn(Relation relation)
/* Release locks in reverse order */
LWLockRelease(SerializableXactHashLock);
for (i = NUM_PREDICATELOCK_PARTITIONS - 1; i >= 0; i--)
- LWLockRelease(FirstPredicateLockMgrLock + i);
+ {
+ LWLock *partitionLock;
+
+ partitionLock =
+ &MainLWLockArray[PREDICATELOCK_MANAGER_LWLOCK_OFFSET + i].lock;
+ LWLockRelease(partitionLock);
+ }
LWLockRelease(SerializablePredicateLockListLock);
}
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index ee6c24c..a301734 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -189,7 +189,8 @@ InitProcGlobal(void)
*/
procs = (PGPROC *) ShmemAlloc(TotalProcs * sizeof(PGPROC));
ProcGlobal->allProcs = procs;
- ProcGlobal->allProcCount = TotalProcs;
+ /* XXX allProcCount isn't really all of them; it excludes prepared xacts */
+ ProcGlobal->allProcCount = MaxBackends + NUM_AUXILIARY_PROCS;
if (!procs)
ereport(FATAL,
(errcode(ERRCODE_OUT_OF_MEMORY),
@@ -663,7 +664,7 @@ IsWaitingForLock(void)
void
LockErrorCleanup(void)
{
- LWLockId partitionLock;
+ LWLock *partitionLock;
DisableTimeoutParams timeouts[2];
AbortStrongLockAcquire();
@@ -942,7 +943,7 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable)
LOCK *lock = locallock->lock;
PROCLOCK *proclock = locallock->proclock;
uint32 hashcode = locallock->hashcode;
- LWLockId partitionLock = LockHashPartitionLock(hashcode);
+ LWLock *partitionLock = LockHashPartitionLock(hashcode);
PROC_QUEUE *waitQueue = &(lock->waitProcs);
LOCKMASK myHeldLocks = MyProc->heldLocks;
bool early_deadlock = false;
@@ -1440,7 +1441,12 @@ CheckDeadLock(void)
* interrupts.
*/
for (i = 0; i < NUM_LOCK_PARTITIONS; i++)
- LWLockAcquire(FirstLockMgrLock + i, LW_EXCLUSIVE);
+ {
+ LWLock *partitionLock;
+
+ partitionLock = &MainLWLockArray[LOCK_MANAGER_LWLOCK_OFFSET + i].lock;
+ LWLockAcquire(partitionLock, LW_EXCLUSIVE);
+ }
/*
* Check to see if we've been awoken by anyone in the interim.
@@ -1522,7 +1528,12 @@ CheckDeadLock(void)
*/
check_done:
for (i = NUM_LOCK_PARTITIONS; --i >= 0;)
- LWLockRelease(FirstLockMgrLock + i);
+ {
+ LWLock *partitionLock;
+
+ partitionLock = &MainLWLockArray[LOCK_MANAGER_LWLOCK_OFFSET + i].lock;
+ LWLockRelease(partitionLock);
+ }
}
diff --git a/src/backend/utils/probes.d b/src/backend/utils/probes.d
index 17c8e15..804ba6a 100644
--- a/src/backend/utils/probes.d
+++ b/src/backend/utils/probes.d
@@ -15,7 +15,6 @@
* in probe definitions, as they cause compilation errors on Mac OS X 10.5.
*/
#define LocalTransactionId unsigned int
-#define LWLockId int
#define LWLockMode int
#define LOCKMODE int
#define BlockNumber unsigned int
@@ -29,14 +28,14 @@ provider postgresql {
probe transaction__commit(LocalTransactionId);
probe transaction__abort(LocalTransactionId);
- probe lwlock__acquire(LWLockId, LWLockMode);
- probe lwlock__release(LWLockId);
- probe lwlock__wait__start(LWLockId, LWLockMode);
- probe lwlock__wait__done(LWLockId, LWLockMode);
- probe lwlock__condacquire(LWLockId, LWLockMode);
- probe lwlock__condacquire__fail(LWLockId, LWLockMode);
- probe lwlock__wait__until__free(LWLockId, LWLockMode);
- probe lwlock__wait__until__free__fail(LWLockId, LWLockMode);
+ probe lwlock__acquire(const char *, int, LWLockMode);
+ probe lwlock__release(const char *, int);
+ probe lwlock__wait__start(const char *, int, LWLockMode);
+ probe lwlock__wait__done(const char *, int, LWLockMode);
+ probe lwlock__condacquire(const char *, int, LWLockMode);
+ probe lwlock__condacquire__fail(const char *, int, LWLockMode);
+ probe lwlock__wait__until__free(const char *, int, LWLockMode);
+ probe lwlock__wait__until__free__fail(const char *, int, LWLockMode);
probe lock__wait__start(unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, LOCKMODE);
probe lock__wait__done(unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, LOCKMODE);
diff --git a/src/include/access/slru.h b/src/include/access/slru.h
index 4ec11b1..c7b4186 100644
--- a/src/include/access/slru.h
+++ b/src/include/access/slru.h
@@ -55,7 +55,7 @@ typedef enum
*/
typedef struct SlruSharedData
{
- LWLockId ControlLock;
+ LWLock *ControlLock;
/* Number of buffers managed by this SLRU structure */
int num_slots;
@@ -69,7 +69,7 @@ typedef struct SlruSharedData
bool *page_dirty;
int *page_number;
int *page_lru_count;
- LWLockId *buffer_locks;
+ LWLock **buffer_locks;
/*
* Optional array of WAL flush LSNs associated with entries in the SLRU
@@ -136,7 +136,7 @@ typedef SlruCtlData *SlruCtl;
extern Size SimpleLruShmemSize(int nslots, int nlsns);
extern void SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
- LWLockId ctllock, const char *subdir);
+ LWLock *ctllock, const char *subdir);
extern int SimpleLruZeroPage(SlruCtl ctl, int pageno);
extern int SimpleLruReadPage(SlruCtl ctl, int pageno, bool write_ok,
TransactionId xid);
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 457390f..222ba0a 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -104,7 +104,8 @@ typedef struct buftag
#define BufTableHashPartition(hashcode) \
((hashcode) % NUM_BUFFER_PARTITIONS)
#define BufMappingPartitionLock(hashcode) \
- ((LWLockId) (FirstBufMappingLock + BufTableHashPartition(hashcode)))
+ (&MainLWLockArray[BUFFER_MAPPING_LWLOCK_OFFSET + \
+ BufTableHashPartition(hashcode)].lock)
/*
* BufferDesc -- shared descriptor/state data for a single shared buffer.
@@ -144,8 +145,8 @@ typedef struct sbufdesc
int buf_id; /* buffer's index number (from 0) */
int freeNext; /* link in freelist chain */
- LWLockId io_in_progress_lock; /* to wait for I/O to complete */
- LWLockId content_lock; /* to lock access to buffer contents */
+ LWLock *io_in_progress_lock; /* to wait for I/O to complete */
+ LWLock *content_lock; /* to lock access to buffer contents */
} BufferDesc;
#define BufferDescriptorGetBuffer(bdesc) ((bdesc)->buf_id + 1)
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index f6a2029..f83397a 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -483,8 +483,8 @@ typedef enum
#define LockHashPartition(hashcode) \
((hashcode) % NUM_LOCK_PARTITIONS)
#define LockHashPartitionLock(hashcode) \
- ((LWLockId) (FirstLockMgrLock + LockHashPartition(hashcode)))
-
+ (&MainLWLockArray[LOCK_MANAGER_LWLOCK_OFFSET + \
+ LockHashPartition(hashcode)].lock)
/*
* function prototypes
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index efdb8b5..e977d55 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -14,10 +14,120 @@
#ifndef LWLOCK_H
#define LWLOCK_H
+#include "storage/s_lock.h"
+
+struct PGPROC;
+
+/*
+ * It's occasionally necessary to identify a particular LWLock "by name"; e.g.
+ * because we wish to report the lock to dtrace. We could store a name or
+ * other identifying information in the lock itself, but since it's common
+ * to have many nearly-identical locks (e.g. one per buffer) this would end
+ * up wasting significant amounts of memory. Instead, each lwlock stores a
+ * tranche ID which tells us which array it's part of. Based on that, we can
+ * figure out where the lwlock lies within the array using the data structure
+ * shown below; the lock is then identified based on the tranche name and
+ * computed array index. We need the array stride because the array might not
+ * be an array of lwlocks, but rather some larger data structure that includes
+ * one or more lwlocks per element.
+ */
+typedef struct LWLockTranche
+{
+ const char *name;
+ void *array_base;
+ Size array_stride;
+} LWLockTranche;
+
+/*
+ * Code outside of lwlock.c should not manipulate the contents of this
+ * structure directly, but we have to declare it here to allow LWLocks to be
+ * incorporated into other data structures.
+ */
+typedef struct LWLock
+{
+ slock_t mutex; /* Protects LWLock and queue of PGPROCs */
+ bool releaseOK; /* T if ok to release waiters */
+ char exclusive; /* # of exclusive holders (0 or 1) */
+ int shared; /* # of shared holders (0..MaxBackends) */
+ int tranche; /* tranche ID */
+ struct PGPROC *head; /* head of list of waiting PGPROCs */
+ struct PGPROC *tail; /* tail of list of waiting PGPROCs */
+ /* tail is undefined when head is NULL */
+} LWLock;
+
+/*
+ * Prior to PostgreSQL 9.4, every lightweight lock in the system was stored
+ * in a single array. For convenience and for compatibility with past
+ * releases, we still have a main array, but it's now also permissible to
+ * store LWLocks elsewhere in the main shared memory segment or in a dynamic
+ * shared memory segment. In the main array, we force the array stride to
+ * be a power of 2, which saves a few cycles in indexing, but more importantly
+ * also ensures that individual LWLocks don't cross cache line boundaries.
+ * This reduces cache contention problems, especially on AMD Opterons.
+ * (Of course, we have to also ensure that the array start address is suitably
+ * aligned.)
+ *
+ * LWLock is between 16 and 32 bytes on all known platforms, so these two
+ * cases are sufficient.
+ */
+#define LWLOCK_PADDED_SIZE (sizeof(LWLock) <= 16 ? 16 : 32)
+
+typedef union LWLockPadded
+{
+ LWLock lock;
+ char pad[LWLOCK_PADDED_SIZE];
+} LWLockPadded;
+extern LWLockPadded *MainLWLockArray;
+
+/*
+ * Some commonly-used locks have predefined positions within MainLWLockArray;
+ * defining macros here makes it much easier to keep track of these. If you
+ * add a lock, add it to the end to avoid renumbering the existing locks;
+ * if you remove a lock, consider leaving a gap in the numbering sequence for
+ * the benefit of DTrace and other external debugging scripts.
+ */
+#define BufFreelistLock (&MainLWLockArray[0].lock)
+#define ShmemIndexLock (&MainLWLockArray[1].lock)
+#define OidGenLock (&MainLWLockArray[2].lock)
+#define XidGenLock (&MainLWLockArray[3].lock)
+#define ProcArrayLock (&MainLWLockArray[4].lock)
+#define SInvalReadLock (&MainLWLockArray[5].lock)
+#define SInvalWriteLock (&MainLWLockArray[6].lock)
+#define WALBufMappingLock (&MainLWLockArray[7].lock)
+#define WALWriteLock (&MainLWLockArray[8].lock)
+#define ControlFileLock (&MainLWLockArray[9].lock)
+#define CheckpointLock (&MainLWLockArray[10].lock)
+#define CLogControlLock (&MainLWLockArray[11].lock)
+#define SubtransControlLock (&MainLWLockArray[12].lock)
+#define MultiXactGenLock (&MainLWLockArray[13].lock)
+#define MultiXactOffsetControlLock (&MainLWLockArray[14].lock)
+#define MultiXactMemberControlLock (&MainLWLockArray[15].lock)
+#define RelCacheInitLock (&MainLWLockArray[16].lock)
+#define CheckpointerCommLock (&MainLWLockArray[17].lock)
+#define TwoPhaseStateLock (&MainLWLockArray[18].lock)
+#define TablespaceCreateLock (&MainLWLockArray[19].lock)
+#define BtreeVacuumLock (&MainLWLockArray[20].lock)
+#define AddinShmemInitLock (&MainLWLockArray[21].lock)
+#define AutovacuumLock (&MainLWLockArray[22].lock)
+#define AutovacuumScheduleLock (&MainLWLockArray[23].lock)
+#define SyncScanLock (&MainLWLockArray[24].lock)
+#define RelationMappingLock (&MainLWLockArray[25].lock)
+#define AsyncCtlLock (&MainLWLockArray[26].lock)
+#define AsyncQueueLock (&MainLWLockArray[27].lock)
+#define SerializableXactHashLock (&MainLWLockArray[28].lock)
+#define SerializableFinishedListLock (&MainLWLockArray[29].lock)
+#define SerializablePredicateLockListLock (&MainLWLockArray[30].lock)
+#define OldSerXidLock (&MainLWLockArray[31].lock)
+#define SyncRepLock (&MainLWLockArray[32].lock)
+#define BackgroundWorkerLock (&MainLWLockArray[33].lock)
+#define DynamicSharedMemoryControlLock (&MainLWLockArray[34].lock)
+#define AutoFileLock (&MainLWLockArray[35].lock)
+#define NUM_INDIVIDUAL_LWLOCKS 36
+
/*
* It's a bit odd to declare NUM_BUFFER_PARTITIONS and NUM_LOCK_PARTITIONS
- * here, but we need them to set up enum LWLockId correctly, and having
- * this file include lock.h or bufmgr.h would be backwards.
+ * here, but we need them to figure out offsets within MainLWLockArray, and
+ * having this file include lock.h or bufmgr.h would be backwards.
*/
/* Number of partitions of the shared buffer mapping hashtable */
@@ -31,68 +141,14 @@
#define LOG2_NUM_PREDICATELOCK_PARTITIONS 4
#define NUM_PREDICATELOCK_PARTITIONS (1 << LOG2_NUM_PREDICATELOCK_PARTITIONS)
-/*
- * We have a number of predefined LWLocks, plus a bunch of LWLocks that are
- * dynamically assigned (e.g., for shared buffers). The LWLock structures
- * live in shared memory (since they contain shared data) and are identified
- * by values of this enumerated type. We abuse the notion of an enum somewhat
- * by allowing values not listed in the enum declaration to be assigned.
- * The extra value MaxDynamicLWLock is there to keep the compiler from
- * deciding that the enum can be represented as char or short ...
- *
- * If you remove a lock, please replace it with a placeholder. This retains
- * the lock numbering, which is helpful for DTrace and other external
- * debugging scripts.
- */
-typedef enum LWLockId
-{
- BufFreelistLock,
- ShmemIndexLock,
- OidGenLock,
- XidGenLock,
- ProcArrayLock,
- SInvalReadLock,
- SInvalWriteLock,
- WALBufMappingLock,
- WALWriteLock,
- ControlFileLock,
- CheckpointLock,
- CLogControlLock,
- SubtransControlLock,
- MultiXactGenLock,
- MultiXactOffsetControlLock,
- MultiXactMemberControlLock,
- RelCacheInitLock,
- CheckpointerCommLock,
- TwoPhaseStateLock,
- TablespaceCreateLock,
- BtreeVacuumLock,
- AddinShmemInitLock,
- AutovacuumLock,
- AutovacuumScheduleLock,
- SyncScanLock,
- RelationMappingLock,
- AsyncCtlLock,
- AsyncQueueLock,
- SerializableXactHashLock,
- SerializableFinishedListLock,
- SerializablePredicateLockListLock,
- OldSerXidLock,
- SyncRepLock,
- BackgroundWorkerLock,
- DynamicSharedMemoryControlLock,
- AutoFileLock,
- /* Individual lock IDs end here */
- FirstBufMappingLock,
- FirstLockMgrLock = FirstBufMappingLock + NUM_BUFFER_PARTITIONS,
- FirstPredicateLockMgrLock = FirstLockMgrLock + NUM_LOCK_PARTITIONS,
-
- /* must be last except for MaxDynamicLWLock: */
- NumFixedLWLocks = FirstPredicateLockMgrLock + NUM_PREDICATELOCK_PARTITIONS,
-
- MaxDynamicLWLock = 1000000000
-} LWLockId;
-
+/* Offsets for various chunks of preallocated lwlocks. */
+#define BUFFER_MAPPING_LWLOCK_OFFSET NUM_INDIVIDUAL_LWLOCKS
+#define LOCK_MANAGER_LWLOCK_OFFSET \
+ (BUFFER_MAPPING_LWLOCK_OFFSET + NUM_BUFFER_PARTITIONS)
+#define PREDICATELOCK_MANAGER_LWLOCK_OFFSET \
+ (NUM_INDIVIDUAL_LWLOCKS + NUM_LOCK_PARTITIONS)
+#define NUM_FIXED_LWLOCKS \
+ (PREDICATELOCK_MANAGER_LWLOCK_OFFSET + NUM_PREDICATELOCK_PARTITIONS)
typedef enum LWLockMode
{
@@ -108,18 +164,40 @@ typedef enum LWLockMode
extern bool Trace_lwlocks;
#endif
-extern LWLockId LWLockAssign(void);
-extern void LWLockAcquire(LWLockId lockid, LWLockMode mode);
-extern bool LWLockConditionalAcquire(LWLockId lockid, LWLockMode mode);
-extern bool LWLockAcquireOrWait(LWLockId lockid, LWLockMode mode);
-extern void LWLockRelease(LWLockId lockid);
+extern void LWLockAcquire(LWLock *lock, LWLockMode mode);
+extern bool LWLockConditionalAcquire(LWLock *lock, LWLockMode mode);
+extern bool LWLockAcquireOrWait(LWLock *lock, LWLockMode mode);
+extern void LWLockRelease(LWLock *lock);
extern void LWLockReleaseAll(void);
-extern bool LWLockHeldByMe(LWLockId lockid);
+extern bool LWLockHeldByMe(LWLock *lock);
-extern int NumLWLocks(void);
extern Size LWLockShmemSize(void);
extern void CreateLWLocks(void);
+/*
+ * The traditional method for obtaining an lwlock for use by an extension is
+ * to call RequestAddinLWLocks() during postmaster startup; this will reserve
+ * space for the indicated number of locks in MainLWLockArray. Subsequently,
+ * a lock can be allocated using LWLockAssign.
+ */
extern void RequestAddinLWLocks(int n);
+extern LWLock *LWLockAssign(void);
+
+/*
+ * There is another, more flexible method of obtaining lwlocks. First, call
+ * LWLockNewTrancheId just once to obtain a tranche ID; this allocates from
+ * a shared counter. Next, each individual process using the tranche should
+ * call LWLockRegisterTranche() to associate that tranche ID with appropriate
+ * metadata. Finally, LWLockInitialize should be called just once per lwlock,
+ * passing the tranche ID as an argument.
+ *
+ * It may seem strange that each process using the tranche must register it
+ * separately, but dynamic shared memory segments aren't guaranteed to be
+ * mapped at the same address in all coordinating backends, so storing the
+ * registration in the main shared memory segment wouldn't work for that case.
+ */
+extern int LWLockNewTrancheId(void);
+extern void LWLockRegisterTranche(int, LWLockTranche *);
+extern void LWLockInitialize(LWLock *, int tranche_id);
#endif /* LWLOCK_H */
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index acdc678..a3cadd9 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -131,7 +131,7 @@ struct PGPROC
struct XidCache subxids; /* cache for subtransaction XIDs */
/* Per-backend LWLock. Protects fields below. */
- LWLockId backendLock; /* protects the fields below */
+ LWLock *backendLock; /* protects the fields below */
/* Lock manager data, recording fast-path locks taken by this backend. */
uint64 fpLockBits; /* lock modes held for each fast-path slot */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index e3058be..c868e8e 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -895,7 +895,6 @@ LPWSTR
LSEG
LVRelStats
LWLock
-LWLockId
LWLockMode
LWLockPadded
LabelProvider
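Taken together, the API declared above is meant to be used from an extension
roughly as follows. This is only a sketch: the lock count, the tranche name,
and the function names are invented for illustration, and error handling is
omitted.

#include "postgres.h"
#include "storage/lwlock.h"
#include "storage/shm_toc.h"

#define MY_NLOCKS 4

static LWLockTranche my_tranche;

/* In the backend that creates the DSM segment: carve out and initialize. */
static LWLockPadded *
my_locks_create(shm_toc *toc, int tranche_id)
{
    LWLockPadded *locks;
    int         i;

    locks = shm_toc_allocate(toc, sizeof(LWLockPadded) * MY_NLOCKS);
    for (i = 0; i < MY_NLOCKS; i++)
        LWLockInitialize(&locks[i].lock, tranche_id);
    return locks;
}

/* In every backend that maps the segment, including the creator. */
static void
my_locks_attach(LWLockPadded *locks, int tranche_id)
{
    my_tranche.name = "my_extension";
    my_tranche.array_base = locks;
    my_tranche.array_stride = sizeof(LWLockPadded);
    LWLockRegisterTranche(tranche_id, &my_tranche);

    /* After registration, these behave like any other LWLock. */
    LWLockAcquire(&locks[0].lock, LW_EXCLUSIVE);
    LWLockRelease(&locks[0].lock);
}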
(2014/01/11 3:11), Robert Haas wrote:
On Mon, Jan 6, 2014 at 5:50 PM, Robert Haas <robertmhaas@gmail.com> wrote:
This is only part of the solution, of course: a complete solution will
involve making the hash table key something other than the lock ID.
What I'm thinking we can do is make the lock ID consist of two
unsigned 32-bit integers. One of these will be stored in the lwlock
itself, which if my calculations are correct won't increase the size
of LWLockPadded on any common platforms (a 64-bit integer would).
Let's call this the "tranche ID". The other will be derived from the
LWLock address. Let's call this the "instance ID". We'll keep a
table of tranche IDs, which will be assigned consecutively starting
with 0. We'll keep an array of metadata for tranches, indexed by
tranche ID, and each entry will have three associated pieces of
information: an array base, a stride length, and a printable name.
When we need to identify an lwlock in the log or to dtrace, we'll
fetch the tranche ID from the lwlock itself and use that to index into
the tranche metadata array. We'll then take the address of the lwlock,
subtract the array base address for the tranche, and divide by the
stride length; the result is the instance ID. When reporting to the
user, we can report either the tranche ID directly or the associated
name for that tranche; in either case, we'll also report the instance
ID.
So initially we'll probably just have tranche 0: the main LWLock array.
If we move buffer content and I/O locks to the buffer headers, we'll
define tranche 1 and tranche 2 with the same base address: the start of
the buffer descriptor array, and the same stride length, the size of a
buffer descriptor. One will have the associated name "buffer content
lock" and the other "buffer I/O lock". If we want, we can split the
main LWLock array into several tranches so that we can more easily
identify lock manager locks, predicate lock manager locks, and buffer
mapping locks.
OK, I've implemented this: here's what I believe to be a complete
patch, based on the previous lwlock-pointers.patch but now handling
LOCK_DEBUG and TRACE_LWLOCKS and dtrace and a bunch of other loose
ends. I think this should be adequate for allowing lwlocks to be
stored elsewhere in the main shared memory segment as well as in
dynamic shared memory segments.
Thoughts?
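In code, the identification scheme described above comes down to roughly the
following. This is a sketch based on the description, not the patch itself;
the struct layout mirrors the fields used later in this thread (name,
array_base, array_stride), and the field and function names are illustrative.

/* Per-tranche metadata, as described above (illustrative definition). */
typedef struct LWLockTranche
{
    const char *name;           /* printable name */
    void       *array_base;     /* start of the array containing the locks */
    Size        array_stride;   /* distance between consecutive locks */
} LWLockTranche;

/* Assumes each LWLock stores its tranche ID in a field called "tranche". */
static int
lwlock_instance_id(LWLock *lock, LWLockTranche **tranches)
{
    LWLockTranche *t = tranches[lock->tranche];

    /* subtract the array base, divide by the stride */
    return (int) (((char *) lock - (char *) t->array_base) / t->array_stride);
}

static const char *
lwlock_tranche_name(LWLock *lock, LWLockTranche **tranches)
{
    return tranches[lock->tranche]->name;
}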
I briefly checked the patch. Most of the lines are mechanical
replacements of LWLockId with LWLock *, and the compiler didn't
complain about anything under -Wall -Werror.
My concern is about the LWLockTranche mechanism. Isn't it more
complicated than the purpose requires?
For example, it might make sense to add "char lockname[NAMEDATALEN];"
at the tail of the LWLock structure when LOCK_DEBUG is enabled, and
probably also to add an argument to LWLockAssign() that gives the
human-readable name. Is an additional 64 bytes (= NAMEDATALEN) per
lock really too large for recent hardware?
Below are some minor comments.
It seems to me this newer form increases the number of direct
references to MainLWLockArray. Is that really graceful?
My recommendation is to define a macro that wraps the reference to
MainLWLockArray.
For example:
#define PartitionLock(i) \
(&MainLWLockArray[LOCK_MANAGER_LWLOCK_OFFSET + (i)].lock)
instead of open-coding it as in the following hunk:
@@ -3400,7 +3406,12 @@ GetLockStatusData(void)
* Must grab LWLocks in partition-number order to avoid LWLock deadlock.
*/
for (i = 0; i < NUM_LOCK_PARTITIONS; i++)
- LWLockAcquire(FirstLockMgrLock + i, LW_SHARED);
+ {
+ LWLock *partitionLock;
+
+ partitionLock = &MainLWLockArray[LOCK_MANAGER_LWLOCK_OFFSET + i].lock;
+ LWLockAcquire(partitionLock, LW_SHARED);
+ }
The following chunk is enclosed by #ifdef EXEC_BACKEND; the legacy
LWLockArray assignment was not removed:
@@ -5633,6 +5630,7 @@ save_backend_variables(BackendParameters *param, Port *port,
param->SpinlockSemaArray = SpinlockSemaArray;
#endif
param->LWLockArray = LWLockArray;
+ param->MainLWLockArray = MainLWLockArray;
param->ProcStructLock = ProcStructLock;
param->ProcGlobal = ProcGlobal;
param->AuxiliaryProcs = AuxiliaryProcs;
Thanks,
--
OSS Promotion Center / The PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>
On Mon, Jan 20, 2014 at 11:23 PM, KaiGai Kohei <kaigai@ak.jp.nec.com> wrote:
I briefly checked the patch. Most of the lines are mechanical
replacements of LWLockId with LWLock *, and the compiler didn't
complain about anything under -Wall -Werror.
My concern is about the LWLockTranche mechanism. Isn't it more
complicated than the purpose requires?
For example, it might make sense to add "char lockname[NAMEDATALEN];"
at the tail of the LWLock structure when LOCK_DEBUG is enabled, and
probably also to add an argument to LWLockAssign() that gives the
human-readable name. Is an additional 64 bytes (= NAMEDATALEN) per
lock really too large for recent hardware?
Well, we'd need it when either LOCK_DEBUG was defined or when
LWLOCK_STATS was defined or when --enable-dtrace was used, and while
the first two are probably rarely enough used that that would be OK, I
think the third case is probably fairly common, and I don't think we
want to have such a potentially performance-critical difference
between builds with and without dtrace.
Also... yeah, it's a lot of memory. If we add an additional 64 bytes
to the structure, then we're looking at 96 bytes per lwlock instead of
32, after padding out to a 32-byte boundary to avoid crossing cache
lines. We need 2 lwlocks per buffer, so that's an additional 128
bytes per 8kB buffer. For shared_buffers = 8GB, that's an extra 128MB
for storing lwlocks. I'm not willing to assume nobody cares about
that. And while I agree that this is a bit complex, I don't think
it's really as bad as all that. We've gotten by for a long time
without people being able to put lwlocks in other parts of memory, and
most extension authors have gotten by with LWLockAssign() just fine
and can continue doing things that way; only users with particularly
sophisticated needs should bother to worry about the tranche stuff.
One idea I just had is to improve the dsm_toc module so that it can
optionally set up a tranche of lwlocks for you, and provide some
analogues of RequestAddinLWLocks and LWLockAssign for that case. That
would probably make this quite a bit simpler to use, at least for
people using it with dynamic shared memory. But I think that's a
separate patch.
Below are some minor comments.
It seems to me this newer form increases the number of direct
references to MainLWLockArray. Is that really graceful?
My recommendation is to define a macro that wraps the reference to
MainLWLockArray. For example:
#define PartitionLock(i) \
(&MainLWLockArray[LOCK_MANAGER_LWLOCK_OFFSET + (i)].lock)
Yeah, that's probably a good idea.
That chunk is enclosed by #ifdef EXEC_BACKEND; the legacy LWLockArray
assignment was not removed.
Oops, will fix.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
(2014/01/22 1:37), Robert Haas wrote:
On Mon, Jan 20, 2014 at 11:23 PM, KaiGai Kohei <kaigai@ak.jp.nec.com> wrote:
I briefly checked the patch. Most of lines are mechanical replacement
from LWLockId to LWLock *, and compiler didn't claim anything with
-Wall -Werror option.
My concern is around LWLockTranche mechanism. Isn't it too complicated
towards the purpose?
For example, it may make sense to add "char lockname[NAMEDATALEN];" at
the tail of LWLock structure if LOCK_DEBUG is enabled. Probably, it
also adds an argument of LWLockAssign() to gives the human readable
name. Is the additional 64bytes (= NAMEDATALEN) per lock too large
for recent hardware?
Well, we'd need it when either LOCK_DEBUG was defined or when
LWLOCK_STATS was defined or when --enable-dtrace was used, and while
the first two are probably rarely enough used that that would be OK, I
think the third case is probably fairly common, and I don't think we
want to have such a potentially performance-critical difference
between builds with and without dtrace.
Also... yeah, it's a lot of memory. If we add an additional 64 bytes
to the structure, then we're looking at 96 bytes per lwlock instead of
32, after padding out to a 32-byte boundary to avoid crossing cache
lines. We need 2 lwlocks per buffer, so that's an additional 128
bytes per 8kB buffer. For shared_buffers = 8GB, that's an extra 128MB
for storing lwlocks. I'm not willing to assume nobody cares about
that. And while I agree that this is a bit complex, I don't think
it's really as bad as all that. We've gotten by for a long time
without people being able to put lwlocks in other parts of memory, and
most extension authors have gotten by with LWLockAssign() just fine
and can continue doing things that way; only users with particularly
sophisticated needs should bother to worry about the tranche stuff.
Hmm... 1/64 of main memory (on a system with large shared buffers) may
not be a negligible amount of memory.
One idea I just had is to improve the dsm_toc module so that it can
optionally set up a tranche of lwlocks for you, and provide some
analogues of RequestAddinLWLocks and LWLockAssign for that case. That
would probably make this quite a bit simpler to use, at least for
people using it with dynamic shared memory. But I think that's a
separate patch.
I agree with this idea. It seems quite natural to me to keep the
properties of objects that live in shared memory (LWLocks) in shared
memory as well. Also, an LWLock, once assigned, is never released, so
I think dsm_toc can provide well-suited storage for them.
Thanks,
--
OSS Promotion Center / The PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>
On 2014-01-10 13:11:32 -0500, Robert Haas wrote:
OK, I've implemented this: here's what I believe to be a complete
patch, based on the previous lwlock-pointers.patch but now handling
LOCK_DEBUG and TRACE_LWLOCKS and dtrace and a bunch of other loose
ends. I think this should be adequate for allowing lwlocks to be
stored elsewhere in the main shared memory segment as well as in
dynamic shared memory segments.
Thoughts?
Shouldn't we introduce a typedef LWLock* LWLockid; or something to avoid
breaking external code using lwlocks? Requiring code that has worked for
several versions to sprinkle version-specific ifdefs seems unnecessary
here.
I see the comments still claim
+ * LWLock is between 16 and 32 bytes on all known platforms, so these two
+ * cases are sufficient.
+ */
+#define LWLOCK_PADDED_SIZE (sizeof(LWLock) <= 16 ? 16 : 32)
I don't think that's true anymore with the addition of the tranche
id. The new minimum size seems to be 20 on 32-bit platforms. That could
be fixed by making the tranche id a uint8, but I don't think that's
necessary, so just removing the support for < 32 seems sufficient.
Greetings,
Andres Freund
Andres Freund <andres@2ndquadrant.com> writes:
Shouldn't we introduce a typedef LWLock* LWLockid; or something to avoid
breaking external code using lwlocks?
+1, in fact there's probably no reason to touch most *internal* code using
that type name either.
regards, tom lane
On Wed, Jan 22, 2014 at 12:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Andres Freund <andres@2ndquadrant.com> writes:
Shouldn't we introduce a typedef LWLock* LWLockid; or something to avoid
breaking external code using lwlocks?
+1, in fact there's probably no reason to touch most *internal* code using
that type name either.
I thought about this but figured it was too much of a misnomer to
refer to a pointer as an ID. But, if we're sure we want to go that
route, I can go revise the patch along those lines.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2014-01-22 12:40:34 -0500, Robert Haas wrote:
On Wed, Jan 22, 2014 at 12:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Andres Freund <andres@2ndquadrant.com> writes:
Shouldn't we introduce a typedef LWLock* LWLockid; or something to avoid
breaking external code using lwlocks?
+1, in fact there's probably no reason to touch most *internal* code using
that type name either.
I thought about this but figured it was too much of a misnomer to
refer to a pointer as an ID. But, if we're sure we want to go that
route, I can go revise the patch along those lines.
I personally don't care either way for internal code as long as external
code continues to work. There's the argument of making the commit more
readable, with less noise and less divergence between the branches, and
there's your argument that it's less clear.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Isn't it necessary to have an interface to initialize an LWLock structure
allocated on a dynamic shared memory segment?
Even though the LWLock structure is exposed in lwlock.h, we have no common
way to initialize it.
How about having a function like the following?
void
InitLWLock(LWLock *lock)
{
SpinLockInit(&lock->mutex);
lock->releaseOK = true;
lock->exclusive = 0;
lock->shared = 0;
lock->head = NULL;
lock->tail = NULL;
}
Thanks,
--
KaiGai Kohei <kaigai@kaigai.gr.jp>
On 2014-01-23 23:03:40 +0900, Kohei KaiGai wrote:
Isn't it necessary to have an interface to initialize an LWLock structure
allocated on a dynamic shared memory segment?
Even though the LWLock structure is exposed in lwlock.h, we have no common
way to initialize it.
There's LWLockInitialize() or something similar in the patch, AFAIR.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
2014/1/23 Andres Freund <andres@2ndquadrant.com>:
On 2014-01-23 23:03:40 +0900, Kohei KaiGai wrote:
Isn't it necessary to have an interface to initialize an LWLock structure
allocated on a dynamic shared memory segment?
Even though the LWLock structure is exposed in lwlock.h, we have no common
way to initialize it.
There's LWLockInitialize() or something similar in the patch, AFAIR.
Ahh, I overlooked the code around the tranche ID. Sorry for the noise.
Thanks,
--
KaiGai Kohei <kaigai@kaigai.gr.jp>
On Wed, Jan 22, 2014 at 12:42 PM, Andres Freund <andres@2ndquadrant.com> wrote:
On 2014-01-22 12:40:34 -0500, Robert Haas wrote:
On Wed, Jan 22, 2014 at 12:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Andres Freund <andres@2ndquadrant.com> writes:
Shouldn't we introduce a typedef LWLock* LWLockid; or something to avoid
breaking external code using lwlocks?
+1, in fact there's probably no reason to touch most *internal* code using
that type name either.
I thought about this but figured it was too much of a misnomer to
refer to a pointer as an ID. But, if we're sure we want to go that
route, I can go revise the patch along those lines.
I personally don't care either way for internal code as long as external
code continues to work. There's the argument of making the commit better
readable by having less noise and less divergence in the branches and
there's your argument of that being less clear.
OK, well then, if no one objects violently, I'll stick with my current
approach of getting rid of all core mentions of LWLockId in favor of
LWLock *, but also add typedef LWLock *LWLockId with a comment that
this is to minimize breakage of third-party code.
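Concretely, the compatibility shim described here amounts to a single line in
lwlock.h, something like:

/* Kept only so that third-party code referring to LWLockId keeps compiling. */
typedef LWLock *LWLockId;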
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Jan 23, 2014 at 11:10 AM, Robert Haas <robertmhaas@gmail.com> wrote:
OK, well then, if no one objects violently, I'll stick with my current
approach of getting rid of all core mentions of LWLockId in favor of
approach of getting rid of all core mentions of LWLockId in favor of
LWLock *, but also add typedef LWLock *LWLockId with a comment that
this is to minimize breakage of third-party code.
Hearing no objections, violent or otherwise, I've done that, made the
other adjustments suggested by Andres and KaiGai, and committed this.
Let's see what the buildfarm thinks...
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Jan 21, 2014 at 11:37 AM, Robert Haas <robertmhaas@gmail.com> wrote:
One idea I just had is to improve the dsm_toc module so that it can
optionally set up a tranche of lwlocks for you, and provide some
analogues of RequestAddinLWLocks and LWLockAssign for that case. That
would probably make this quite a bit simpler to use, at least for
people using it with dynamic shared memory. But I think that's a
separate patch.
I played with this a bit today and it doesn't actually seem to
simplify things very much. The backend that creates the DSM needs to
do this:
lwlocks = shm_toc_allocate(toc, sizeof(LWLockPadded) * nlwlocks);
for (i = 0; i < nlwlocks; ++i)
LWLockInitialize(&lwlocks[i].lock, tranche_id);
Since that's all of three lines, encapsulating it doesn't look all
that helpful. Then each backend needs to do something like this:
static LWLockTranche mytranche;
mytranche.name = "some descriptive module name";
mytranche.array_base = lwlocks;
mytranche.array_stride = sizeof(LWLockPadded);
LWLockRegisterTranche(tranche_id, &mytranche);
That's not a lot of code either, and there's no obvious way to reduce
it much by hooking into shm_toc.
I thought maybe we needed some cleanup when the dynamic shared memory
segment is unmapped, but looks like we really don't. One can do
something like LWLockRegisterTranche(tranche_id, NULL) for debugging
clarity, but it isn't strictly needed; one can release all lwlocks
from the tranche or assert that none are held, but that should really
only be a problem if the user does something like
LWLockAcquire(lock_in_the_segment); dsm_detach(seg), because error
recovery starts by releasing *all* lwlocks. And if the user does it
explicitly, I'm kinda OK with that just seg faulting. After all, the
user could equally well have done dsm_detach(seg);
LWLockAcquire(lock_in_the_segment) and there's no way at all to
cross-check for that sort of mistake.
I do see one thing about the status quo that does look reasonably
annoying: the code expects that tranche IDs are going to stay
relatively small. For example, consider a module foobar that uses DSM
+ LWLocks. It won't do at all for each backend, on first use of the
foobar module, to do LWLockNewTrancheId() and then reuse that
tranche_id repeatedly for each new DSM - because over a system
lifetime of months, tranche IDs could grow into the millions, causing
LWLockTrancheArray to get really big (and eventually repalloc will
fail). Rather, the programmer needs to ensure that
LWLockNewTrancheId() gets called *once per postmaster lifetime*,
perhaps by allocating a chunk of permanent shared memory and using
that to store the tranche_id that should be used each time an
individual backend fires up a DSM. Considering that one of the goals
of dynamic shared memory is to allow modules to be loaded after
postmaster startup and still be able to do useful stuff, that's kind
of obnoxious. I don't have a good idea what to do about it, though;
making LWLockTrancheArray anything other than a flat array looks
likely to slow down --enable-dtrace builds unacceptably.
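As a sketch of that once-per-postmaster-lifetime pattern, a hypothetical
extension ("foobar", with invented names throughout) could stash the tranche
ID in a small chunk of the main shared memory segment from its shmem startup
hook:

#include "postgres.h"
#include "storage/lwlock.h"
#include "storage/shmem.h"

typedef struct FoobarSharedState
{
    int         lwlock_tranche_id;  /* allocated once, reused forever */
} FoobarSharedState;

static FoobarSharedState *foobar_state;

static void
foobar_shmem_startup(void)
{
    bool        found;

    LWLockAcquire(AddinShmemInitLock, LW_EXCLUSIVE);
    foobar_state = ShmemInitStruct("foobar shared state",
                                   sizeof(FoobarSharedState),
                                   &found);
    if (!found)
    {
        /* First time through: grab a tranche ID for this postmaster lifetime. */
        foobar_state->lwlock_tranche_id = LWLockNewTrancheId();
    }
    LWLockRelease(AddinShmemInitLock);
}

/*
 * Each backend then passes foobar_state->lwlock_tranche_id to
 * LWLockRegisterTranche() and LWLockInitialize() for the locks in each new
 * DSM segment, without ever calling LWLockNewTrancheId() again.
 */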
Bottom line, I guess, is that I don't currently have any real idea how
to make this any better than it already is. Maybe that will become
more clear as this facility (hopefully) acquires some users.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
2014-02-08 4:52 GMT+09:00 Robert Haas <robertmhaas@gmail.com>:
I thought maybe we needed some cleanup when the dynamic shared memory
segment is unmapped, but looks like we really don't. One can do
something like LWLockRegisterTranche(tranche_id, NULL) for debugging
clarity, but it isn't strictly needed; one can release all lwlocks
from the tranche or assert that none are held, but that should really
only be a problem if the user does something like
LWLockAcquire(lock_in_the_segment); dsm_detach(seg), because error
recovery starts by releasing *all* lwlocks. And if the user does it
explicitly, I'm kinda OK with that just seg faulting. After all, the
user could equally well have done dsm_detach(seg);
LWLockAcquire(lock_in_the_segment) and there's no way at all to
cross-check for that sort of mistake.
Would it cause any other problem if dsm_detach() also released the
lwlocks allocated within the segment being detached?
The lwlocks currently held are tracked in the held_lwlocks[] array, whose
length is usually 100. When dsm_detach() is called for a segment spanning
addr ~ (addr + length), it seems an easy job to walk this array and find
the lwlocks that fall within that range.
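A sketch of what that walk could look like, assuming a helper added inside
lwlock.c where the held_lwlocks[] and num_held_lwlocks bookkeeping is visible
(as the reply below notes, this was not adopted):

/* Hypothetical addition to lwlock.c. */
static void
LWLockReleaseAllInRange(char *base, Size length)
{
    int         i;

    /*
     * Walk the held-locks array from the end; LWLockRelease() removes
     * entries from held_lwlocks[], so going backwards keeps the indexes
     * not yet visited stable.
     */
    for (i = num_held_lwlocks - 1; i >= 0; i--)
    {
        char       *addr = (char *) held_lwlocks[i];

        if (addr >= base && addr < base + length)
            LWLockRelease(held_lwlocks[i]);
    }
}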
I do see one thing about the status quo that does look reasonably
annoying: the code expects that tranche IDs are going to stay
relatively small. For example, consider a module foobar that uses DSM
+ LWLocks. It won't do at all for each backend, on first use of the
foobar module, to do LWLockNewTrancheId() and then reuse that
tranche_id repeatedly for each new DSM - because over a system
lifetime of months, tranche IDs could grow into the millions, causing
LWLockTrancheArray to get really big (and eventually repalloc will
fail). Rather, the programmer needs to ensure that
LWLockNewTrancheId() gets called *once per postmaster lifetime*,
perhaps by allocating a chunk of permanent shared memory and using
that to store the tranche_id that should be used each time an
individual backend fires up a DSM. Considering that one of the goals
of dynamic shared memory is to allow modules to be loaded after
postmaster startup and still be able to do useful stuff, that's kind
of obnoxious. I don't have a good idea what to do about it, though;
making LWLockTrancheArray anything other than a flat array looks
likely to slow down --enable-dtrace builds unacceptably.
Just a rough idea: doesn't it make sense to store an offset from the
head of the DSM segment that contains the lwlock, instead of an
identifier, to point at a cstring datum? That doesn't prevent allocating
it later, after postmaster startup, and there is no problem if the number
of dynamic lwlocks grows into the millions or more.
If an lwlock carries a "tranche_offset" rather than a "tranche_id", all
we need to do is:
1. Find the DSM segment, or the regular shared memory segment, that
contains the supplied lwlock.
2. Calculate mapped_address + tranche_offset; that is the local address
where the cstring form is stored.
Probably this needs a common utility routine in dsm.c to find the
particular DSM segment that contains a supplied address (the inverse of
dsm_segment_address).
How about these ideas?
Thanks,
--
KaiGai Kohei <kaigai@kaigai.gr.jp>
On Mon, Feb 10, 2014 at 7:17 PM, Kohei KaiGai <kaigai@kaigai.gr.jp> wrote:
Would it cause any other problem if dsm_detach() also released the
lwlocks allocated within the segment being detached?
The lwlocks currently held are tracked in the held_lwlocks[] array, whose
length is usually 100. When dsm_detach() is called for a segment spanning
addr ~ (addr + length), it seems an easy job to walk this array and find
the lwlocks that fall within that range.
Oh, sure, it could be done. I just don't see the point. If you're
explicitly detaching a shared memory segment while still holding
lwlocks within that segment, your code is seriously buggy. Making
dsm_detach() silently release those locks would probably just be
hiding a bug that you'd really rather find out about.
Just a rough idea: doesn't it make sense to store an offset from the
head of the DSM segment that contains the lwlock, instead of an
identifier, to point at a cstring datum? That doesn't prevent allocating
it later, after postmaster startup, and there is no problem if the number
of dynamic lwlocks grows into the millions or more.
If an lwlock carries a "tranche_offset" rather than a "tranche_id", all
we need to do is:
1. Find the DSM segment, or the regular shared memory segment, that
contains the supplied lwlock.
2. Calculate mapped_address + tranche_offset; that is the local address
where the cstring form is stored.
Well, if that offset is 8 bytes, it's going to make the LWLock
structure grow past 32 bytes on common platforms, which I do not want
to do. If it's 4 bytes, sure, that'll work, but now you've gotta make
sure that string gets placed within the first 4GB of the relevant shared memory
segment, which sounds like an annoying requirement. How would you
satisfy it with respect to the main shared memory segment, for
example?
I mean, one way to handle this problem is to put a hash table in the
main shared memory segment with tranche ID -> name mappings. Backends
can suck mappings out of there and cache them locally as needed. But
that's a lot of mechanism for a facility with no known users.
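To make the amount of mechanism concrete, such a table might look roughly like
this; the entry type, table name, and sizes are invented, and nothing like it
was implemented:

#include "postgres.h"
#include "storage/shmem.h"
#include "utils/hsearch.h"

typedef struct TrancheNameEntry
{
    int         tranche_id;                  /* hash key */
    char        tranche_name[NAMEDATALEN];   /* printable name */
} TrancheNameEntry;

static HTAB *TrancheNameHash;

static void
tranche_name_table_init(void)
{
    HASHCTL     info;

    memset(&info, 0, sizeof(info));
    info.keysize = sizeof(int);
    info.entrysize = sizeof(TrancheNameEntry);
    info.hash = tag_hash;

    /* Lives in the main shared memory segment; backends cache entries locally. */
    TrancheNameHash = ShmemInitHash("LWLock tranche names",
                                    32,     /* initial size */
                                    1024,   /* max expected tranches */
                                    &info,
                                    HASH_ELEM | HASH_FUNCTION);
}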
Probably this needs a common utility routine in dsm.c to find the
particular DSM segment that contains a supplied address (the inverse of
dsm_segment_address).
Yeah, I've thought of that. It may be useful enough to be worth
doing, whether we use it for this or not.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company